In my previous blog I said that the p-value, which tests the null hypothesis, is a near meaningless concept. This was based on:
- In nature, the likelihood that the difference between two different treatments will be exactly any number (e.g., zero) is zero. [Actually mathematicians would say ‘approaches zero’ (1/∞), which in normal English translates to ‘is zero.’] When the theoretical difference is different from zero (even infinitesimally different) the Ho is not true. That is, theoretically the Ho cannot be true.
- Scientists do everything in their power to make sure that the difference will never be zero. That is, they never believed in the Ho. Scientifically, the Ho should not be true.
- With any true difference, a large enough sample size will reject the Ho. Practically, the Ho will not be true.
- We can never believe (accept) the Ho; we can only reject it. Philosophically, the Ho is not allowed to be true.
- The Ho is only one of many assumptions which affect p-values; others include independence of observations, similarity of distributions in subgroups (e.g., equal variances), distributional assumptions, etc. We have trouble knowing if it is the Ho which isn’t true.
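To make the sample-size point concrete, here is a minimal sketch in Python. It assumes a normal (z) approximation to the two-group t-test with equal group sizes, and it assumes the observed difference lands exactly on a hypothetical true difference of 0.01 standard deviations. The function name `two_sided_p` and all the numbers are mine, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value when the observed standardized difference equals effect_size.

    Assumption: normal (z) approximation to the two-sample t-test,
    with n_per_group subjects in each of the two groups.
    """
    z = effect_size / sqrt(2.0 / n_per_group)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


tiny_effect = 0.01  # hypothetical: one hundredth of a standard deviation
for n in (100, 10_000, 100_000, 1_000_000):
    print(f"N per group = {n:>9,}: p = {two_sided_p(tiny_effect, n):.4g}")
```

The effect never changes in this sketch; only N does. The p-value drifts from clearly non-significant to tiny simply because the sample grew.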
So, why do we compute the p-value of such an unbelievable theory (Ho)? Three reasons: tradition, a mostly false belief that p-values indicate the importance of an effect, and the fact that they show ‘something’ happened in the study.
Journals, our colleagues, and regulatory agencies require that we present p-values. For example, the FDA wants to know that our drug treatment is better than doing nothing (e.g., placebo or standard of care). Any difference, no matter how small, would grant approval. Let me say that again: no matter how small! As long as the difference is greater than zero, the drug is judged efficacious.
In the first blog I did a bit of legal sleight of hand. I made the null hypothesis of the form Ho: μ1 = μ2. This is the usual two-sided test. The FDA requires it, and so do many journals. The alternative hypothesis is HA: μ1 ≠ μ2, which can be restated as either μ1 < μ2 or μ1 > μ2. It is possible to have a directional or one-sided null hypothesis (e.g., Ho: μ1 ≤ μ2), but this is quite uncommon. The ‘scientific gatekeepers’ use the two-sided hypothesis for two reasons. The first is that it gives a level playing field: all will test the superiority hypothesis (μ1 > μ2) at an alpha of 0.025 (with another 0.025 going to an inferiority test), rather than letting some researchers use an alpha of 0.05 and others use 0.025. The second reason is that if the treatment were harmful (inferiority test), they want to know about it. If it were harmful and a one-sided test were used, one could not say it was harmful, only that you weren’t able to say, with your inadequately run study, that the drug was useful.
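As a small illustration of the alpha split, the sketch below compares the bar a result must clear under each scheme. It uses the standard normal distribution, which is an assumption on my part (a real analysis would typically use the t distribution), and the variable names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
alpha = 0.05

# Two-sided test at alpha 0.05: 0.025 in each tail, so the
# superiority side is effectively tested at 0.025.
z_two_sided = std_normal.inv_cdf(1 - alpha / 2)

# One-sided test that keeps the whole 0.05 in a single tail.
z_one_sided = std_normal.inv_cdf(1 - alpha)

print(f"two-sided, alpha 0.05: need |z| > {z_two_sided:.3f}")
print(f"one-sided, alpha 0.05: need  z  > {z_one_sided:.3f}")
```

The one-sided researcher gets a noticeably lower bar, which is the uneven playing field the gatekeepers want to avoid.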
For tradition, I have always included the p-value in all statistical reports I have done.
One of the biggest blunders I see made by non-statisticians is the mistaken belief that if p is < 0.05 then the results are significant or meaningful. By that reasoning, if a p-value of < 0.05 is (practically or clinically) significant, then a p-value of < 0.001 is even more significant. They also often make the even worse error of thinking that if the p is > 0.05 the treatment wasn’t useful. These blunders are compounded by the use of the term ‘statistically significant’. Statistical significance only means that it is highly likely that the difference is non-zero, the semi-meaningless notion (see ‘1. Statistic’s Dirty Little Secret’).
Statistical significance has very little to do with clinical significance. Very little? Let me qualify: statistical significance is, to use a term from logic, a necessary but not sufficient condition for demonstrating clinical significance. An effect which is statistically significant might also be clinically significant. An effect which was unable to achieve statistical significance will need more information (i.e., a cleaner or larger study) to demonstrate clinical significance, although the magnitude of the effect might have the potential to be quite clinically meaningful. A non-significant effect is unable to rule out that the effect could be zero or even negative (worse than the alternative). However, it is very possible that when the N is small and the results are not statistically significant, the effect size may be very large (see my next blog ‘Meaningful ways to determine the adequacy of a treatment effect when you have an intuitive knowledge of the d.v.’).
The reason statisticians still feel justified in providing their clients with p-values is that, at a minimum, they know that if the p-value is sufficiently low (e.g., < 0.05), they can be certain, with some degree of probability, that the difference favors the treatment. What I’m referring to is the confidence interval. While I’m not a Bayesian statistician, I still tell my clients that “with 95% certainty, the true difference excludes zero”. Therefore we know that the treatment is ‘better’ than the standard. Is it better by a millimeter or a mile? The p-value cannot answer that question, but the p-value indicates that the experimental treatment is better.
[A classically trained Frequentist statistician would say something like ‘if the study were replicated an infinite number of times, 95% of the confidence intervals computed this way would contain the true difference’. It goes without saying that a client would, and should, fire me on the spot if I included the latter in a report. Otherwise, I don’t tend to use Bayesian statistics.]
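The link between the p-value and the confidence interval can be sketched in a few lines. This is only an illustration under a normal approximation with a known standard error. The function `ci_and_p` and the numbers fed to it are hypothetical, not taken from any study.

```python
from statistics import NormalDist


def ci_and_p(mean_diff, std_err, level=0.95):
    """Return (lower, upper, two-sided p) for an observed mean difference.

    Assumption: the estimate is approximately normal with known std_err.
    """
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = mean_diff - z_crit * std_err
    upper = mean_diff + z_crit * std_err
    p = 2 * (1 - NormalDist().cdf(abs(mean_diff) / std_err))
    return lower, upper, p


# Hypothetical results: same standard error, different observed differences.
for diff in (0.5, 1.9, 2.1, 4.0):
    lower, upper, p = ci_and_p(diff, std_err=1.0)
    excludes_zero = lower > 0 or upper < 0
    print(f"diff = {diff}: 95% CI = ({lower:.2f}, {upper:.2f}), "
          f"p = {p:.4f}, CI excludes zero: {excludes_zero}")
```

Under these assumptions the 95% interval excludes zero exactly when p < 0.05. The interval also reports how far from zero the difference plausibly sits, which the p-value alone does not.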
I mentioned the article in Significance in my last blog. The author of that article states “Ziliak and McCloskey show that 8 or 9 out of every 10 articles published in the leading journals of science commit the significance mistake – equating significance with a real and important practical effect while at the same time equating insignificance with chance, randomness, no association or causal effect at all.”
I implied above that p-values might indirectly measure clinical importance. Let us assume that we are doing an analysis in exactly the same way for a variety of dependent variables in a single study. Let me further assume that the Ns are identical for all the dependent variables (e.g., no missing data) and we are dealing with parametric (interval-level) or non-parametric non-tied ordinal data. Then parameters which are statistically significant have larger (relative) mean differences than the non-statistically significant parameters. To illustrate this, let us use one version of the t-test comparing two means: t = (Mean1 - Mean2)/(s√(2/N)). We can ignore the √(2/N) term, a constant, as we assumed the Ns were identical for all parameters. If one t-test were significant and another not, it would mean that the (Mean1 - Mean2)/s term was larger. This term is the mean difference relative to its variability (actually standard deviation). In other words, how many standard deviations apart are the two treatments? This is also called by many statisticians the ‘effect size’. Let me rephrase this: within a study, if one dependent variable has a larger (e.g., statistically significant) t-test relative to another parameter, then its effect size is larger.
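Here is a minimal sketch of that relationship, reading N as the number of subjects per group (my assumption) and s as the pooled standard deviation. The two dependent variables and their summary statistics are invented purely for illustration.

```python
from math import sqrt


def effect_size(mean1, mean2, s):
    """Mean difference in units of the standard deviation s."""
    return (mean1 - mean2) / s


def t_statistic(mean1, mean2, s, n_per_group):
    """t = (Mean1 - Mean2) / (s * sqrt(2/N)) for two groups of N each."""
    return (mean1 - mean2) / (s * sqrt(2.0 / n_per_group))


n = 50  # hypothetical N per group, identical for every dependent variable
# Hypothetical summary statistics: (name, mean1, mean2, s)
variables = [
    ("variable A", 130.0, 124.0, 12.0),
    ("variable B", 72.0, 71.0, 8.0),
]
for name, m1, m2, s in variables:
    d = effect_size(m1, m2, s)
    t = t_statistic(m1, m2, s, n)
    print(f"{name}: effect size = {d:.3f}, t = {t:.3f}, t / d = {t / d:.3f}")
```

With N fixed, t is just the effect size multiplied by the same constant, √(N/2), for every variable, so ranking variables by t ranks them by effect size.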
A corollary of this is that if, within this single study, one p-value was smaller than another (e.g., one is 0.04 and another 0.003), then the smaller (‘more statistically significant’) p-value implies a larger effect size (greater relative mean difference). In other words, a smaller p-value implies a greater clinical effect. This only applies for parameters with identical Ns. If one study of 100,000 patients had a 0.003 p-value and a second study of 10 patients had the 0.04 p-value, then it can be demonstrated that the ten-patient study indicated a LARGER average effect size. If I were investing in one of the two companies, I’d invest in the one with the ten-patient 0.04 p-value, not the one-hundred-thousand-patient company.
I will delve into how to measure clinical meaningfulness in a future blog (‘3. Meaningful ways to determine the adequacy of a treatment effect when you have an intuitive knowledge of the d.v.’). But to give a taste of it, it has to do with confidence intervals, the one-to-one alternative to p-values. In that blog I shall elaborate on the case where the scientists and literature KNOW and completely understand their metric (in my experience a relatively rare event). A later blog covers the case where they either don’t understand it or don’t understand it in the current setting (e.g., population of patients, treatment regimen) (see ‘4. Meaningful ways to determine the adequacy of a treatment effect when you lack an intuitive knowledge of the d.v.’). In this second (partial ignorance) case, one can still have a (situationally) simple statistic to discuss treatment effects across completely different metrics.
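The two-study comparison can be roughed out by running the t formula backwards. The sketch below assumes each study has two equal arms (so 50,000 and 5 patients per group) and uses a normal (z) approximation in place of the t distribution. The function `implied_effect_size` is a hypothetical helper, not a standard routine.

```python
from math import sqrt
from statistics import NormalDist


def implied_effect_size(p_two_sided, n_total):
    """Standardized effect size implied by a two-sided p-value.

    Assumptions: two equal arms of n_total / 2 patients each, and a
    normal (z) approximation instead of the t distribution.
    """
    n_per_group = n_total / 2
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return z * sqrt(2.0 / n_per_group)


big_study = implied_effect_size(0.003, 100_000)
small_study = implied_effect_size(0.04, 10)
print(f"100,000 patients, p = 0.003: effect size of about {big_study:.3f} SD")
print(f"     10 patients, p = 0.04:  effect size of about {small_study:.3f} SD")
```

For so small a study the z approximation, if anything, understates the implied effect size, because t critical values are larger than z critical values at low degrees of freedom. The gap between the two studies would only widen with an exact t calculation.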
In addition to p-values, I always (try to) include a measure of treatment differences and their confidence intervals in all my statistical analyses.
In my next blog I shall discuss the best method of describing results when you understand your dependent variable.