1. Statistics’ Dirty Little Secret

I can’t believe schools are still teaching kids about the null hypothesis.  I remember reading a big study that conclusively disproved it years ago.

***

To most scientists, the endpoint of a research study is achieving the mystical ‘p < 0.05’, but what does this mean?  At its core, it means that one can reject the null hypothesis (Ho).  Let me use as an example one of the more common study designs, a comparison of one treatment (e.g., a breakthrough drug) with a standard (e.g., placebo), with the hope of improving (increasing) the average benefit.  The null hypothesis is typically of the form Ho: μ1 = μ2.  The alternative hypothesis is typically that they are not the same, HA: μ1 ≠ μ2.  Let me do a trivial bit of algebra on the Ho: μ1 – μ2 = 0.  That is, the difference is zero.
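For readers who like to see the mechanics, here is a minimal sketch of the test this null hypothesis implies, using Python’s scipy; the group sizes and the numbers fed into it are invented purely for illustration.

```python
# Minimal sketch of the two-sample test implied by Ho: mu1 = mu2.
# The data below are simulated solely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
drug_a  = rng.normal(loc=96.0, scale=10.0, size=50)   # hypothetical LDL values on Drug A
placebo = rng.normal(loc=100.0, scale=10.0, size=50)  # hypothetical LDL values on placebo

t_stat, p_value = stats.ttest_ind(drug_a, placebo)    # tests Ho: mu1 - mu2 = 0
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.4f}")
```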

Let me go quickly over the ‘number line’.  When we talk about the population mean improvement seen for a Drug A, it will have a reasonable upper and lower limit.  There are LDL values, heights, and hemoglobin levels beyond which life is not possible.  You can’t have a human height of 1,000 feet.  But any value consistent with human life IS possible.  Any value!  An LDL mean population value for Drug A of 96.0 is a possibility, so is 96.1 and 96.148900848924104274…, etc.  The same would be true for the comparative treatment, e.g., placebo.  The difference between Drug A and placebo is likewise a real number that can take any of an infinite number of values.

The null hypothesis doesn’t test whether the difference is near zero (e.g., Mean1 – Mean2 < 0.01).  Nor very near zero (e.g., Mean1 – Mean2 < 0.00001), nor even the limit as it approaches zero (e.g., Mean1 – Mean2 < 0.0000 … [a trillion zeros later] … 0001).  What is zero?  Well, zero is zero.  Mathematically, the probability that a quantity which can take infinitely many values equals any one single value (i.e., that the null hypothesis difference is EXACTLY zero) is zero.  So, is there any treatment which any sapient individual believes is completely and utterly the same as a different treatment?  With the possible exception of the field of ESP research, the answer is no.  I cannot imagine any comparison of different treatments which would produce no difference whatsoever, no matter how minuscule.  So, mathematically the null hypothesis is meaningless.

This is mirrored in reality by the fact that researchers always do everything in their power to find treatments which are maximally different from the standard.  For example, trials typically use the maximum dose that can safely be given, or engineers will have spent years refining the device they want to test.  In sum, my best guess is that no scientist has ever believed that their treatment effect is zero.

You might be thinking that statistics is different in that it is much more practical and deals with real-world data and issues.  A difference of only a small amount (e.g., Mean1 – Mean2 = 0.00001) shouldn’t be statistically significant.  As a proud statistician, I grant you have a point.  Statistics is certainly a real-world, practical way to view data.  However, even a tiny difference can become statistically significant.  The root of this conundrum is hidden in the denominator of all statistical tests.  Let me take the simple t-test comparing two sample means: t = (Mean1 – Mean2)/(s√(2/N)), where N is the sample size per group.  We are dividing the standardized mean difference (Mean1 – Mean2)/s by √(2/N).  After a bit of algebra, the standardized difference is being multiplied by √(N/2), that is, by a constant times the square root of N.  In other words, as the study sample size increases, given any non-zero difference, t will increase.  All test statistics are of this form, with the sample size scaling the test statistic.  This applies to non-parametric testing, Bayesian statistics, comparisons of correlations, variances, skewness, survival analyses, all test statistics.
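To make the algebra concrete, here is a small sketch of my own (not from the original argument) that plugs a fixed, trivially small standardized difference into t = (Mean1 – Mean2)/(s√(2/N)) and watches t grow with √N; the chosen difference of 0.00001 SD is an assumption for illustration.

```python
# How the t statistic scales with N for a fixed standardized difference d = (Mean1 - Mean2)/s.
# Since t = (Mean1 - Mean2)/(s*sqrt(2/N)) = d * sqrt(N/2), t grows like the square root of N.
import math
from scipy import stats

d = 0.00001  # a "trivially small" standardized difference
for n in (10**6, 10**8, 10**10, 10**11, 10**12):  # N per group
    t = d * math.sqrt(n / 2)
    p = 2 * stats.t.sf(abs(t), df=2 * n - 2)      # two-sided p-value
    print(f"N per group = {n:>15,d}: t = {t:7.3f}, p = {p:.3g}")
```

Even this absurdly small difference crosses ‘p < 0.05’ once N per group reaches the hundreds of billions.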

Let me put it another way: can you imagine any comparison which fails to reject the null hypothesis if the sample size were 100,000 or 10,000,000 or 1,000,000,000?  I can’t.  The converse is also true: can you imagine a successful trial (rejecting the null hypothesis) when the sample size per group is 2?  That is, the ability to reject the null hypothesis is largely a function of N.  Even a poorly run study would be significant if you threw enough subjects into it.

At the great risk of boring you to tears and making you say ‘enough already’, I need to say this again: p-values are a function of N, the sample size, whenever any difference exists.  As I said above, the likelihood that any difference is EXACTLY zero is infinitely small.  Let me assume that we are dealing with the mean difference of two different samples – as in comparing a control to an experimental group (the dependence of the test statistic on N holds no matter the statistic: Fisher’s exact test, logistic regression, correlations).  Let me further assume that the mean difference is quite small, a tenth of a standard deviation.  I shall also assume the typical 2-sided test.  By manipulating the number of patients per group (N) I can get almost any p-value.  The following table presents a variety of sample sizes, from ‘non-significant’ to very ‘highly significant’; a short sketch reproducing the table follows it.

N (per group)   p-value
4               0.90
14              0.80
31              0.70
56              0.60
92              0.50
143             0.40
216             0.30
330             0.20
543             0.10
771             0.05
1331            0.01
2172            0.001
3036            0.0001
3913            0.00001
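If you would like to check these numbers yourself, the following sketch is my own reconstruction; it assumes N is the per-group sample size, a 0.1 SD standardized difference, and the ordinary two-sample t-distribution, and it reproduces the table to within rounding.

```python
# Two-sided p-value for a standardized mean difference of 0.1 SD at the tabled per-group sample sizes.
from math import sqrt
from scipy import stats

d = 0.1  # assumed standardized difference: one tenth of a standard deviation
for n in (4, 14, 31, 56, 92, 143, 216, 330, 543, 771, 1331, 2172, 3036, 3913):
    t = d * sqrt(n / 2)                  # t = (Mean1 - Mean2) / (s * sqrt(2/N))
    p = 2 * stats.t.sf(t, df=2 * n - 2)  # two-sided p-value
    print(f"N = {n:>5}  p = {p:.5f}")
```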

To repeat myself one last time: p-values are a function of sample size.  They reach ‘significance’ faster (i.e., with smaller sample sizes) when the true difference is larger, but they can always reach any level of ‘statistical significance’ as long as the difference is not exactly zero.  Statistically, with a large enough N, the null hypothesis will be rejected.  [In fact, one main job of a statistician is to determine the N which will give you a statistically significant result.]

This brings me to a second theoretical issue with the null hypothesis, one heard in all Statistics 101 classes.  Given the issues above, one can NEVER accept the null hypothesis.  One can only fail to reject it.  Sorry about the double negatives.  The reason for this is that with a better run study (decreasing the internal variability and/or increasing the sample size), one should eventually reject the null hypothesis.  To put things another way, a study which fails to reject the null hypothesis is, in essence, a failed study.  The scientists who ran it did not appreciate the magnitude of the relative treatment difference and either failed to control the noise of the study or ran it with an inadequate sample size.  If a study failed to reject the null hypothesis, one cannot say that the null hypothesis is true; rather, the scientists who designed the study failed.

Another issue is that the null hypothesis is only one of many assumptions of the statistical test.  For example, the Student t-test comparing two sample means also assumes normality, independence of observations, that each observation comes from a similar distribution, equality of variances, etc.  If we reject the null hypothesis it could be for other, non-null-hypothesis reasons, for example, non-normality (like outliers).  I’ll return to this issue in a future blog, ‘Parametric or non-parametric analysis – assumptions we can live with (or not)’.  Statistically, rejecting the null hypothesis might reflect a failure of the mathematical test’s assumptions.
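To illustrate (this simulation is my own addition, with made-up group sizes and variances), the equal-variance Student t-test can reject far more often than the nominal 5% when the equal-variance assumption fails and the group sizes are unequal, even though both population means are identical:

```python
# Simulate the equal-variance Student t-test when the null hypothesis is TRUE
# (both population means are 0) but the equal-variance assumption is violated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, alpha = 20_000, 0.05
rejections = 0
for _ in range(n_sims):
    small_noisy = rng.normal(0, 5.0, size=10)   # small group, large variance
    large_quiet = rng.normal(0, 1.0, size=100)  # large group, small variance
    _, p = stats.ttest_ind(small_noisy, large_quiet, equal_var=True)
    rejections += (p < alpha)
print(f"Rejection rate under a true null: {rejections / n_sims:.3f} (nominal {alpha})")
```

Rejections well above 5% here have nothing to do with a treatment effect; they are purely a failure of the test’s variance assumption.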

Finally, let me stress that the near-sacred p-value (i.e., p < 0.05) indicates only our ability to reject the null hypothesis.  As the null hypothesis is theoretically false, believed by all to be false, and practically false, all statisticians I’ve ever talked to believe that the p-value is a near meaningless concept.  It is the statistician’s job to enable the scientists to reject the null hypothesis (p < 0.05), chiefly by choosing an adequate N through a power analysis.  Fortunately, power analyses are very quick (i.e., cheap) and very easy to do.  Please see a future blog – ‘8. What is a Power Analysis?’
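As a preview of that post, here is a minimal sketch (assuming the statsmodels library and a hypothetical 0.1 SD effect size) of the kind of power analysis a statistician would run to pick N:

```python
# Sketch of a power analysis: what per-group N gives 80% power to detect
# an assumed standardized difference of 0.1 SD at alpha = 0.05 (two-sided)?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.1, alpha=0.05, power=0.80,
                                   ratio=1.0, alternative='two-sided')
print(f"Required N per group: {n_per_group:.0f}")  # roughly 1,570 per group
```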

I mentioned above ‘all statisticians … believe that the p-value is a near meaningless concept’.  This ‘Dirty Little Secret’ isn’t new.  Everyone who has taken Stat 101 has heard of the Student t-test.  ‘Student’, aka William Gosset, said “Statistical significance is easily mistaken for evidence of a causal or important effect, when there is none”, according to an article in Significance (published by the ASA), September 2011.  ‘Student’ also said “Similarly, a lack of statistical significance – statistical insignificance – is easily though often mistakenly said to show a lack of cause and effect when in fact there is one.”

To forestall any ambiguity, let me mention that every statistical analysis I’ve ever given to clients has always included p-values, among other statistics.  However, I will discuss why I always include p-values in the next blog.
