In previous blogs I discussed how little relevance the p-value actually has, then explained in a second blog why we still compute it. In my last blog I gave a 1-to-1 alternative to the p-value, and explained why we need to rely not on one value, the lower end of the confidence interval, but also on the center (the average or mean effect) and the upper end of the interval, the maximum expected effect. The upper end was promoted as a way to determine clinical significance and, more importantly, a way to demonstrate clinical insignificance (i.e., 'prove' the null hypothesis). However, in my last blog I observed that confidence intervals are ONLY useful when the reader has a deep intuitive understanding of the dependent variable (e.g., weight loss from diets, for most Westerners).

What do we do when the patient population differs, or we don't have a strong and deep intuitive experience with the dependent variable? For example, a 15 pound weight loss might be clinically meaningful for a typical adult, but would it be meaningful for a neonate? Or for a morbidly obese patient? For patients with hypertriglyceridemia, is a 10 mg/dL reduction a clinically meaningful improvement? 100 mg/dL? 1,000 mg/dL? Well, the less informed (i.e., those who haven't been reading my blog) would just report the p-value and rely on 'statistical significance'.

What we need is a unit-free and easily understood way of describing the results. Fortunately, we have already seen it described in my blog 'Why do we compute p-values'. I'm referring to the effect size, [(Mean_{1} – Mean_{2})/s], which is embedded within every t-test formula and power calculation. It is different from the t-test: the effect size itself is not affected by N (the t and F statistics are directly related to N), although the variability (confidence interval) of the effect size is affected by N. [Actually, all confidence intervals are affected by N.]

Effect size is very easily calculated by dividing the mean difference of the key comparison by the standard deviation (i.e., the square root of the residual error variance, the MS_{error} of the ANOVA). In English, the effect size answers the question 'how many standard deviations apart are the two means?' Any comparison among means could use this as the final study metric.
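As a minimal sketch of that computation (the weight-loss numbers below are invented purely for illustration):

```python
from statistics import mean
import math

def pooled_sd(a, b):
    """Square root of the pooled error variance (the MS_error of a two-group ANOVA)."""
    ss = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    return math.sqrt(ss / (len(a) + len(b) - 2))

def effect_size(a, b):
    """How many pooled standard deviations apart the two means are."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)

# Hypothetical weight-loss data (pounds), made up for this sketch:
diet = [8.0, 12.0, 10.0, 14.0]
control = [5.0, 7.0, 6.0, 8.0]
print(round(effect_size(diet, control), 2))
```

The function names and data are mine, not from any real study; the arithmetic is just the mean difference over the pooled standard deviation.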

Hmm, what about studies with more than two treatment groups, for example, a placebo, low-dose, and high-dose trial? In my experience, the key comparison is ALWAYS a single (one degree of freedom) comparison; in statistical jargon, a single linear contrast. In this case, the key hypothesis might be a simple comparison of placebo against the high dose. Alternatively, it could be placebo against the average of the active doses. Either way, this gets us back to the effect size.
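A contrast-based effect size can be sketched as follows (the group means, weights, and SD below are invented for illustration):

```python
def contrast_effect_size(group_means, weights, pooled_sd):
    """Single-degree-of-freedom linear contrast, scaled by the pooled SD."""
    if abs(sum(weights)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")
    contrast = sum(w * m for w, m in zip(weights, group_means))
    return contrast / pooled_sd

# Hypothetical trial: placebo, low dose, high dose (made-up means, SD = 10)
means = [0.0, 4.0, 8.0]
# Placebo vs. high dose only:
d_high = contrast_effect_size(means, [-1.0, 0.0, 1.0], 10.0)   # 0.8
# Placebo vs. the average of the two active doses:
d_avg = contrast_effect_size(means, [-1.0, 0.5, 0.5], 10.0)    # 0.6
```

Each choice of weights is one linear contrast; dividing by the pooled SD puts the contrast back on the effect-size scale.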

Effect size is also a necessary component of power calculations for means. In fact, statisticians only need an α (e.g., 0.05 two-tailed), the 1 – β error, a.k.a. the power of the study (e.g., 80% power), and the effect size. The first two are traditionally set by the 'scientific gatekeepers' and by 'upper management', so all you actually need is the effect size, and the necessary sample size 'pops out'. I will discuss power analyses in a future blog.
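As a rough sketch of how the sample size 'pops out', here is the standard normal-approximation formula (which slightly understates the exact t-based answer; the function name is mine):

```python
from statistics import NormalDist
import math

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group N for a two-group comparison of means:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 / effect_size^2 (normal approximation)."""
    z = NormalDist()
    z_a = z.inv_cdf(1.0 - alpha / 2.0)
    z_b = z.inv_cdf(power)
    return math.ceil(2.0 * (z_a + z_b) ** 2 / effect_size ** 2)

print(n_per_group(1.0))    # about 16 per group (an exact t-test gives 17)
print(n_per_group(0.25))   # about 252 per group, i.e., over 500 in total
```

Note how the required N scales with the inverse square of the effect size: halving the effect size quadruples the sample size.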

Although it might not have been obvious, effect size is a unit-free measure. For example, in my diet study example, if the mean difference was 5 pounds and the standard deviation was 20 pounds, then after division the effect size is 0.25, unit free.

Effect size allows you to compare apples and oranges. Combining effect size's independence from a study's sample size with its unit-free property, one can compare two studies and see whether parameter X in Study A is more 'sensitive' than parameter Y in Study B. For one disease, I observed that a stair climb test was a better parameter to use in a future study than the six-minute walk distance, despite the fact that the stair climb came from one published study and the walking test from a second, each with widely different sample sizes.

One criticism you might raise is that many scientists have only a weak understanding of what a clinically meaningful effect size is. That is, is a 0.5 s.d. mean difference large or small? In truth, this is a reasonable objection, as the answer is domain specific. Indomethacin is a tremendous treatment for acute gout attacks; a new acute gout medicine would need to be at least as efficacious as that treatment, e.g., a treatment effect size of at least 2.0. Unfortunately, for most diseases the treatment effect is not that dramatic. An effect size of 0.25 might be much more realistic.

One statistician who worked in this field, Jacob Cohen, called effect sizes of 0.2 to 0.3 small, around 0.5 medium, and greater than 0.8 large. As implied above, even this rule of thumb is domain specific. Actually, Cohen called what I refer to as the effect size a 'signal-to-noise ratio'; different people call it different things. In any case, I STRONGLY recommend computing the effect sizes in your literature.

For most gold-standard pharmaceutical trials (double-blind, placebo-controlled), I would unhesitatingly use Cohen's original rule of thumb. I might add that an effect size of 1.5 is huge. I hesitantly add that anything larger than 'huge' might be considered suspect (e.g., too large and/or trivial).

One statistician I knew, who worked at a local hospital's IRB, routinely used an effect size of 1.0 for all of his clients. This allowed him to accept studies of approximately 34 patients (17 per group) as adequately powered. If my experience is correct, and a 0.25 effect size is more typical of medical treatments, then a study with a total sample size of less than 500 is likely to be a complete waste of time.

Another useful unit-free measure of the magnitude of the treatment effect is the 'correlation ratio'. It is more useful when there are many levels of treatment. Simply put, it is the proportion of variance explained by the treatment in the study. Mathematically (in the population) it is: η^{2} = σ^{2}_{treat}/σ^{2}_{total}. The numerator, σ^{2}_{treat}, is the variability explained by the treatment; the denominator, σ^{2}_{total}, is the total variability in the study. In a sample ANOVA, η^{2} is estimated by the ratio of the treatment (model) sum of squares to the total sum of squares. The correlation ratio ranges from 0 to 1.0, like any squared correlation. A correlation ratio of 0.25 says that the treatment accounts for 25% of the variability seen in the study's population; conversely, 1 – 0.25, or 75%, of the variability is unrelated to the treatment. Again, meaningful effects are quite domain specific. Engineers might expect their studies to account for 99% of the variance. Psychologists might be overjoyed to explain 10% (a correlation of about 0.3 ≈ the square root of 0.10).
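A minimal sketch of the sample computation, using ANOVA sums of squares (the numbers are invented):

```python
def eta_squared(ss_treatment, ss_total):
    """Correlation ratio: proportion of total variability explained by treatment."""
    return ss_treatment / ss_total

# Hypothetical one-way ANOVA with SS_treatment = 25 and SS_total = 100:
print(eta_squared(25.0, 100.0))   # 0.25 -> treatment explains 25% of the variance
```

The two sums of squares would come straight off an ANOVA table; everything else here is arithmetic.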

I have talked as if the effect size and the correlation ratio are two unique, non-comparable statistics. They aren't: knowing one tells you the other. Let δ be the effect size and η^{2} the correlation ratio; then

η^{2} = δ^{2}/(1.0 + δ^{2})

δ^{2} = η^{2}/(1.0 – η^{2})
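In code, the two conversions (and a round trip) look like this:

```python
import math

def eta_sq_from_delta(delta):
    """eta^2 = delta^2 / (1 + delta^2)."""
    return delta ** 2 / (1.0 + delta ** 2)

def delta_from_eta_sq(eta_sq):
    """delta = sqrt(eta^2 / (1 - eta^2)), the inverse conversion."""
    return math.sqrt(eta_sq / (1.0 - eta_sq))

print(eta_sq_from_delta(0.5))   # 0.2
print(delta_from_eta_sq(0.2))   # 0.5
```

So a 'medium' effect size of 0.5 corresponds to a treatment explaining 20% of the variance.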

If one only had a t-test result, could you estimate the effect size? Yes:

E(δ) ≈ 2·sqrt[ (t^{2}(υ – 2)/υ – 1)/(υ + 2) ] (assuming two equal-sized groups), where t is the t statistic and υ is its degrees of freedom.

Things get a bit more complicated when dealing with more complex ANOVAs (e.g., three-way ANOVAs, models with random effects). It's also possible to generate confidence intervals for the effect size.

Having repeatedly said that modern statistics does not allow you to accept the null hypothesis, I will, in my next blog, discuss how you can demonstrate that your treatment is not worse than an active treatment, or that the difference is near zero.