I can’t believe schools are still teaching kids about the null hypothesis. I remember reading a big study that conclusively disproved it years ago.

***

To most scientists, the endpoint of a research study is achieving the mystical ‘p < 0.05’, but what does this mean? At its core, it means that one can reject the null hypothesis (H_{o}). Let me use as an example one of the more common studies, a comparison of one treatment (e.g., a breakthrough drug) with a standard (e.g., placebo), with the hope of improving (increasing) the average benefit. The null hypothesis is typically of the form H_{o}: μ_{1} = μ_{2}. The alternative hypothesis is typically that they are not the same, H_{A}: μ_{1} ≠ μ_{2}. Let me do a trivial bit of algebra on the H_{o}: μ_{1} – μ_{2} = 0. That is, the difference is zero.

Let me go quickly over the ‘number line’. When we talk about the population mean improvement seen for a Drug A, it will have a reasonable upper and lower limit. There are LDL values, heights, and hemoglobin levels beyond which life is impossible. You can’t have a human height of 1,000 feet. But any value consistent with human life IS possible. Any value! An LDL mean population value for Drug A of 96.0 is a possibility, so is 96.1, and so is 96.148900848924104274…, etc. The same would be true for the comparative treatment, e.g., placebo. The difference between Drug A and placebo is likewise a number of potentially infinite precision.

The null hypothesis doesn’t test whether the difference is near zero (e.g., Mean_{1} – Mean_{2} < 0.01), nor very near zero (e.g., Mean_{1} – Mean_{2} < 0.00001), nor even the limit as it approaches zero (e.g., Mean_{1} – Mean_{2} < 0.0000 … [a trillion zeros later] … 0001). What is zero? Well, zero is zero. Mathematically, the probability that a quantity which can take infinitely many values lands on any single one of them (i.e., that the difference is EXACTLY zero) is zero. So, is there any treatment which any sapient individual believes is completely and utterly the same as a different treatment? With the possible exception of the field of ESP research, the answer is no. I cannot imagine any comparison of different treatments which would produce no difference whatsoever, no matter how minuscule. So, **mathematically the null hypothesis is meaningless.**

This is mirrored by reality in that researchers always do everything in their power to find treatments which are maximally different from the standard. For example, trials typically use the maximum dose that can safely be given, or engineers will have worked for years on developing the device they want to test. In sum, my best guess is that **no scientist has ever believed that their treatment effect is zero.**

You might be thinking that statistics is different in that it is much more practical and deals with real-world data and issues. A difference of only a tiny amount (e.g., Mean_{1} – Mean_{2} = 0.00001) wouldn’t be statistically significant. As a proud statistician, you have a point. Statistics is certainly a real-world, practical way to view data. However, even a tiny difference can become statistically significant. The root of this conundrum is hidden in the denominator of all statistical tests. Take the simple t-test comparing two sample means, with N patients per group: t = (Mean_{1} – Mean_{2})/(s·√(2/N)). We are dividing the standardized mean difference, (Mean_{1} – Mean_{2})/s, by √(2/N). After a bit of algebra, t = [(Mean_{1} – Mean_{2})/s]·√(N/2); in other words, the standardized difference is being multiplied by a constant times the square root of N. So, as the study sample size increases, given any non-zero difference, t will increase. As mentioned above, all test statistics are of this form, with the sample size multiplying the test statistic. This applies to non-parametric testing, to Bayesian statistics, to comparisons of correlations, variances, skewness, survival analyses, all test statistics.
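To make the square-root-of-N effect concrete, here is a minimal sketch (in Python, my choice, not the blog's), assuming the tenth-of-a-standard-deviation difference used as the running example later in this post:

```python
import math

# Fixed standardized difference (Mean1 - Mean2)/s: one tenth of a
# standard deviation, as in the example later in this post.
d = 0.10

# With N patients per group, t = d * sqrt(N/2): t grows with the
# square root of N, whatever the (non-zero) difference is.
for n in [100, 400, 1600, 6400]:
    t = d * math.sqrt(n / 2)
    print(f"N per group = {n:5d}  ->  t = {t:.2f}")
```

Quadrupling N doubles t, so with enough patients t eventually exceeds any critical value.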

Let me put it another way: can you imagine any comparison which fails to reject the null hypothesis if the sample size were 100,000 or 10,000,000 or 1,000,000,000? I can’t. The converse is also true: can you imagine a successful trial (rejecting the null hypothesis) when the sample size per group were 2? That is, the ability to reject the null hypothesis is largely a function of N. Even a poorly run study would be significant if you threw enough subjects into it.

At the great risk of boring you to tears and making you say ‘enough already’, I need to say this again: p-values are a function of N, the sample size, whenever any difference exists. As I said above, the likelihood that any difference is EXACTLY zero is infinitely small. Let me assume that we are dealing with the mean difference of two samples, as in comparing a control to an experimental group (the dependence of the test statistic on N holds no matter the statistic: Fisher’s exact test, logistic regression, correlations). Let me further assume that the mean difference is quite small, a tenth of a standard deviation. I shall also assume the typical 2-sided test. By manipulating the number of patients (N), I can get almost any p-value. The following table presents a variety of sample sizes, from ‘non-significant’ to very ‘highly significant’.

| N | p-value |
| --- | --- |
| 4 | 0.90 |
| 14 | 0.80 |
| 31 | 0.70 |
| 56 | 0.60 |
| 92 | 0.50 |
| 143 | 0.40 |
| 216 | 0.30 |
| 330 | 0.20 |
| 543 | 0.10 |
| 771 | 0.05 |
| 1331 | 0.01 |
| 2172 | 0.001 |
| 3036 | 0.0001 |
| 3913 | 0.00001 |

To repeat myself a last time, p-values are a function of sample size. They reach ‘significance’ faster (i.e., with smaller sample sizes) when the true difference is larger, but they can always reach any level of ‘statistical significance’ as long as the difference is not exactly zero. **Statistically, with a large enough N, the null hypothesis will be rejected.** [In fact, one main job of a statistician is to determine the N which will give you a statistically significant result.]
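The pattern in the table above can be roughly reproduced in a few lines of Python (my own sketch; it uses a standard-library normal approximation rather than the exact t distribution, so the small-N rows differ slightly, and it assumes N is the per-group sample size):

```python
import math

d = 0.10  # assumed true difference: one tenth of a standard deviation

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(z / math.sqrt(2))

# With the difference held fixed, p shrinks toward zero as N grows.
for n in [4, 92, 771, 3913]:
    z = d * math.sqrt(n / 2)  # test statistic for N patients per group
    print(f"N = {n:5d}  p = {two_sided_p(z):.5f}")
```

At N = 771 per group the p-value lands almost exactly on 0.05, matching the table.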

This brings me to a second theoretical issue with the null hypothesis, one heard in all Statistics 101 classes. Given the issues above, one can NEVER accept the null hypothesis. One can only fail to reject it. Sorry about the double negatives. The reason for this is that with a better-run study (decreasing the internal variability and/or increasing the sample size), one should eventually reject the null hypothesis. To put it another way, a study which fails to reject the null hypothesis is, in essence, a failed study. The scientists who ran it did not appreciate the magnitude of the relative treatment difference and either failed to control the noise of the study or ran it with an inadequate sample size. If a study fails to reject the null hypothesis, one cannot say that the null hypothesis is true; rather, the scientists who designed the study failed.

Another issue is that the null hypothesis is only one of many assumptions of a statistical test. For example, the Student t-test comparing two sample means assumes normality, independence of observations, identically distributed observations, equality of variances, etc. If we reject the null hypothesis, it could be for other, non-null-hypothesis reasons, for example non-normality (like outliers). I’ll return to this issue in a future blog, ‘Parametric or non-parametric analysis – assumptions we can live with (or not)’. **Statistically, rejecting the null hypothesis might be a failure of the mathematical test’s assumptions.**
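As an illustration (my own toy simulation, not from this blog), here is how violating just one assumption, equality of variances, can make a pooled t-test reject a true null far more often than 5%. The group sizes and standard deviations are arbitrary choices; the key is that the smaller group has the larger variance:

```python
import math
import random

# Both groups have the SAME true mean, so the null hypothesis is true.
# But the equal-variance assumption of the pooled t-test is violated:
# the small group (n=10, SD=3) is noisier than the big one (n=100, SD=1).
random.seed(0)

def pooled_t(x, y):
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    sp2 = (ssx + ssy) / (nx + ny - 2)  # pooled variance estimate
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

rejections = 0
trials = 2000
for _ in range(trials):
    x = [random.gauss(0, 3) for _ in range(10)]   # small group, big SD
    y = [random.gauss(0, 1) for _ in range(100)]  # big group, small SD
    if abs(pooled_t(x, y)) > 1.98:  # ~5% two-sided critical value, df = 108
        rejections += 1

print(f"False rejection rate: {rejections / trials:.3f}")  # well above 0.05
```

Every rejection here is a "significant" result caused purely by a broken assumption, not by any treatment difference.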

Finally, let me stress that the near-sacred p-value (i.e., p < 0.05) indicates only our ability to reject the null hypothesis. As the null hypothesis is theoretically false, believed by all to be false, and practically false, all statisticians I’ve ever talked to believe that the p-value is a near meaningless concept. It is the statistician’s job to enable the scientists to reject the null hypothesis (p < 0.05), chiefly by determining the necessary sample size. Fortunately, those calculations are very quick (i.e., cheap) and very easy to do. Please see a future blog – ‘8. What is a Power Analysis?’

I mentioned above ‘all statisticians … believe that the p-value is a near meaningless concept’. This ‘Dirty Little Secret’ isn’t new. Everyone who has taken Stat 101 has heard of the Student t-test. ‘Student’, aka William Gosset, said “Statistical significance is easily mistaken for evidence of a causal or important effect, when there is none”, according to an article in Significance (published by the ASA), September 2011. ‘Student’ also said “Similarly, a lack of statistical significance – statistical insignificance – is easily though often mistakenly said to show a lack of cause and effect when in fact there is one.”

To forestall any ambiguity, let me mention that every statistical analysis I’ve ever given to clients has always included p-values, among other statistics. However, I will discuss why I always include p-values in the next blog.

## Bio-Statistical Blog

I have retired from consulting. However, I will post blogs on statistics when I see interesting material to comment on.

Allen I. Fleishman, PhD, PStat®

“Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners.” – G. O. Ashley

***

These blogs were written with the non-statistician in mind, although statisticians could benefit from my thirty-plus years of experience consulting for the pharmaceutical/biologic/device industry. They are for those people who have taken at least a single statistics class and use statistics for clinical research in the pharmaceutical/device/biotech industry. Although simple equations will be presented to make points, the math will be kept to the level of the first week of high school algebra. Nor will I present proofs.

I will be making postings on important issues for the users of statistics and insights I’ve made from my many years of experience. I’ll also include ‘tricks’ for running a smaller study. Please start at the bottom of this blog and read upward (starting with 1. Statistic’s dirty little secret).

Feel free to post your thoughts, agreeing or disagreeing (include why you disagree, please). I will post questions or statistically related agreements/disagreements. Interesting comments (either positive or negative) might be the lead-in for a full post. I will attempt to answer all comments within a day. [Note: I use a spam filter, so if your comment is ignored send it to me at my e-mail address allen-fleishman (at) comcast.net.] Feel free to ask me a question through a comment. However, I am no Dr. Phil. I will almost never say your approach was the correct one, especially with a typical 4-sentence description of your trial. Even given a well-written protocol, I could never guess all possible data perturbations. So I will point out potential issues, most of which can be anticipated if you read the entire set of blogs.

Blogs I have written are (although I reserve the right to change the blogs and comments after they were initially published):

1. Statistic’s dirty little secret – Published 30Sept2011

1.A. Another View on Testing by Peter Flom, PhD – Published 12July2012

1.B. Am I a nattering nabob of negativism? – Published 23April2017

2. Why do we compute p-values? – Published 5Oct2011

3. Meaningful ways to determine the adequacy of a treatment effect when you have an intuitive knowledge of the d.v. – Published 12Oct2011

4. Meaningful ways to determine the adequacy of a treatment effect when you lack an intuitive knowledge of the d.v. – Published 19Oct2011

5. Accepting the null hypothesis – Published 30Oct2011

5.A. Accepting the null hypothesis by Randy Gallisted, PhD of Rutgers – Published 23Apr2012

6. ‘Lies, Damned Lies and Statistics’ part 1, and Analysis Plan (an essential tool) – Published 5Nov2011

7. Assumptions of Statistical Tests – Published 11Nov2011

7a. Assumptions of Statistical Test: Ordinal Data – Published 2Aug2012

8. What is a Power Analysis? – Published 28Nov2011

9. Dichotomization as a devil’s tool – Published 10Dec2011

10. Parametric or non-parametric analysis – Why one is almost useless – Published 26Dec2011

11. p-values by the pound – Published 5Jan2012

12. Significant p-values in small samples – Published 25Jan2012

13. Multiple observations and Statistical ‘Cheapies’ – Published 12Mar2012

14. Great and Not so Great Designs – Published 22Mar2012

15. Variance, and t-tests, and ANOVA, oh my! – Published 9Apr2012

16. Comparing many means – Analysis of VARIANCE? – Published 7May2012

17. Statistical Freebies/Cheapies – Multiple Comparisons and Adaptive Trials without self-immolation – Published 21May2012

18. Percentage Change from Baseline – Great or Poor? – Published 4Jun2012

19. A Reconsideration of my Biases – Published 25Jun2012

20. Graphs I: A Picture is Worth a Thousand Words – Published 17Aug2012

21. Graphs II: The Worst and Better Graphs – Published 18Sept2012

22. A question on QoL, Percentage Change from Baseline, and Compassionate-Usage Protocols – Published 20Apr2013

23. Small N study, to Publish or not – Published 12May2014

24. Simple, but Simple Minded – Published 8Aug2014

25. Psychology I: A Science – Published 20Mar2015

26. Psychology II: A Totally Different Paradigm -Published 25Mar2015

27. Number of Events from Week x to Week y – Published 7Apr2015

18.1 Percentage Change – A Right Way and a Wrong Way – Published 28Aug 2015

28. Failure to Reject the Null Hypothesis – Published 7Nov 2015

29. Should you publish a non-significant result? – Published 22Nov2015

18.2 Percentage Change Revisited – Published 9March2016

30. ‘Natural’ Herbs and Alternative Medicine – Published 25July2016

31. Case History of a Trial – To be Done