*“All models are wrong. … But some are useful.” George Box*

*****

In my first blog I stated a truism that was hopefully taught in your first statistics class:

You can’t accept the null hypothesis. You can ONLY reject the null hypothesis.

It is not possible to prove that a difference is exactly zero (H₀: μ₁ – μ₂ = 0). However, you might be able to prove that it is near zero (μ₁ – μ₂ ≈ 0).

From bio-equivalence trials, statisticians were able to come up with a simple way to reformulate the problem into something which can be tested statistically. However, to do that statisticians needed a value below which the difference was clinically insignificant. In the bio-equivalence area, a new formulation had to have a drug potency between 80% and 125% of the original formulation to be considered bio-equivalent. The 80% and 125% boundaries came from the FDA. [Note: You might be thinking that the interval isn’t symmetrical: 80% – 100% = –20%, while 125% – 100% = +25%. Actually, we’re dealing with ratios. 4/5 (80%) is the reciprocal of 5/4 (125%). In the log world, the interval is symmetrical.]
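To see the log symmetry concretely, here is a minimal Python sketch (nothing here is from the FDA; it just checks the arithmetic of the reciprocal bounds):

```python
import math

# The 80%-125% bounds look asymmetric around 100%, but they are
# reciprocals (4/5 and 5/4), so on the log scale they are symmetric.
lower, upper = 0.80, 1.25

log_lower = math.log(lower)   # about -0.223
log_upper = math.log(upper)   # about +0.223

# The two log-scale bounds are mirror images: they sum to (essentially) zero.
print(round(log_lower, 3), round(log_upper, 3))
print(abs(log_lower + log_upper) < 1e-12)
```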

The simple (at least to us statisticians) solution was to demonstrate that the ratio of potency (e.g., of the areas under the curve [AUC]) had a confidence interval (CI) which was above 80% but below 125%. Let me clarify: the confidence interval must fall completely within the range. They didn’t look at the mean (center), but at the 95% CI, to determine equivalence. For example, if the ratio of AUCs was 1.00 but had a CI of 0.79 to 1.26, it would fail. So would CIs of (0.78 to 1.10) or (0.98 to 1.30). However, a mean ratio of 1.10 with a CI of (1.01 to 1.19) would succeed in proving bio-equivalence, despite demonstrating that the active treatment was statistically significantly larger than the standard.
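The decision rule boils down to one line of logic: the whole interval, not just the center, must sit inside the bounds. A small Python sketch (the function name is mine, not a standard API):

```python
def is_bioequivalent(ci_lower, ci_upper, bounds=(0.80, 1.25)):
    """Equivalence is shown only if the ENTIRE CI sits inside the bounds."""
    return bounds[0] < ci_lower and ci_upper < bounds[1]

# The four CIs from the text:
print(is_bioequivalent(0.79, 1.26))  # False: spills out both ends
print(is_bioequivalent(0.78, 1.10))  # False: lower end below 0.80
print(is_bioequivalent(0.98, 1.30))  # False: upper end above 1.25
print(is_bioequivalent(1.01, 1.19))  # True: entirely inside 0.80-1.25
```

Note that the mean of 1.00 in the first case is irrelevant; only the interval's endpoints decide.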

For most studies, we don’t need to demonstrate equivalence, but non-inferiority. Let me clarify. If a new drug were shown to be better than a standard drug, that would be good. The drug manufacturer would ask the Agency for permission to include that on its label. On the other hand, if it were shown not to be worse than the standard drug, that might suffice. As with the null hypothesis, one would look at the lower end of the confidence interval and determine whether it stays above the clinically important ‘not worse than’ amount. For example, in a Phase 3 mortality trial that I designed, there was a treatment and a standard. The clinical team believed that the treatment would have 3% better mortality than the standard treatment. However, the sample size to run a trial to demonstrate superiority was prohibitive. The sponsor got permission to run the trial with a ‘not worse than 3%’ hypothesis. Therefore the key hypothesis was to test if the improvement in mortality was greater than –3%. That is, if the treatment had a confidence interval which might have included zero, but, say, only included a possibility of being 2% worse, then the drug would be approved. For example, if the mean improvement were 1% with a CI of (–2% to 4%), the study would have succeeded. We would have proved that the experimental treatment was not 3% worse. That is, the entire CI was above –3%.
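The non-inferiority rule is even simpler than the equivalence rule, since only the lower limit matters. A sketch, using the mortality-trial numbers above (the function name and the decimal encoding of percentages are my own illustration):

```python
def is_non_inferior(ci_lower, margin=-0.03):
    """Non-inferiority is shown if the whole CI lies above the margin,
    i.e., the lower confidence limit exceeds -3 percentage points."""
    return ci_lower > margin

# The trial example: mean improvement +1%, CI of (-2% to +4%).
print(is_non_inferior(-0.02))   # True: -2% is above the -3% margin
print(is_non_inferior(-0.035))  # False: CI admits being >3% worse
```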

Now what would have happened if the lower end of the CI were above zero (e.g., +0.3% to 4.1%)? We would have demonstrated not only that the treatment was not 3% worse than the standard, but that it was actually superior to the standard (i.e., we reject the traditional null hypothesis). Hmm, one confidence interval but two tests/hypotheses. Will we need to take a hit on the alpha (i.e., test each at a 0.025 two-sided alpha level)? Nope. We are able to test both of them with a single 0.05 test. [Note: As we typically use a two-sided CI, and since each side gets 0.025 of the alpha, we’re actually testing with a single 0.025 alpha level. But that’s still better than testing each with a 0.0125 alpha level.]
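The ‘one interval, two conclusions’ point can be made mechanical: both the superiority and non-inferiority readings come from the same lower confidence limit, with no further multiplicity adjustment. A hypothetical sketch (my own labels, not regulatory terminology):

```python
def classify(ci_lower, margin=-0.03):
    """Read two conclusions off one lower confidence limit:
    above 0 -> superior; above the margin -> non-inferior; else neither."""
    if ci_lower > 0:
        return "superior"        # also rejects the traditional null
    if ci_lower > margin:
        return "non-inferior"
    return "not demonstrated"

print(classify(0.003))   # superior      (the CI of +0.3% to 4.1%)
print(classify(-0.02))   # non-inferior  (the CI of -2% to 4%)
print(classify(-0.05))   # not demonstrated
```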

I am explicitly stating that the non-inferiority test is FREE, FREE, FREE, FREE. (See my future blog ‘Statistical freebies – multiple hypotheses/variables and still control alpha’). Since it is easier to prove that the lower end of a CI is above a negative, rather than a zero, value, not only is it FREE but it is easier to achieve. Not only that, but since we (should) do a confidence interval anyway and ALL statistical programs include confidence intervals, then it is CHEAP, FREE, and EASY (pick any three!), and still is completely respectable. What more would a scientist want??

If a test of superiority was not statistically significant, then you cannot say that the study demonstrated that no difference existed, unless, and only unless, you included the ‘non-inferiority’ hypothesis in the protocol. However, don’t be too concerned if you previously implied that a non-significant result indicated no difference. Ziliak and McCloskey (see my blog ‘Why do we compute p-values?’) said this is a very common error. You now have the ability to avoid this common error (and get it past the Agency).

In sum, we can accept a form of the null hypothesis, but we need to include a statement in the protocol about a cut-off for clinical unimportance and demonstrate that our results (the CI) do not go past that level. What happens if we didn’t include the clinically important boundary in the protocol and didn’t include the non-inferiority hypothesis? For example, if we only included a test of superiority? I can only say SOL (which could stand for Sorry, Out of Luck).

***

Let me return to the topic of this blog, ‘Accepting the null hypothesis’. The more astute reader will notice that the approach I recommended is to see if the product was not worse than the standard active product. In the bio-equivalence testing arena, we’re not only testing whether the product had less potency than the standard (i.e., < 80%) but also whether it had greater potency (i.e., > 125%). If you’re making different lots of the same compound, you don’t want the different lots to be either too weak or too strong. In most drug testing, we would love our product to be superior to an approved product.

What if we’re actually interested in proving equivalence? Like the 80% – 125% rule, we would need to ensure that our confidence interval, both the upper and lower ends, fell within the interval. Most times, we’d like to have symmetric intervals. One easy way to do that is to ignore the sign of the difference. Fortunately or unfortunately, statisticians seldom transform by ignoring the sign. However, one trick we statisticians often do is to square the statistic. This always gives us a positive (or zero) number. I mentioned in my previous blog that one statistical metric which was easily transformed into the effect size was the correlation ratio. It is a squared (hence never negative) number, and is identical to how much of the total study’s variance the experimental treatment explains. It is a number from 0.00 to 1.00. Zero would mean the treatment explains nothing. One (one hundred percent) would mean that the treatment explains all the differences we see among the numbers.

We could test the null hypothesis, that the treatment effect is near zero, by looking at the confidence interval on the correlation ratio. If the lower end of the CI were above zero, that would be equivalent to saying that the treatment effect is above zero. For example, if the lower end of the correlation ratio’s CI was 0.0001, that would mean that the treatment explained 0.01% of the study’s variance: not zero, but small. No, that isn’t what would interest us. What we’re actually interested in, for proving the null hypothesis, is the upper end of the interval. For example, we could say that if the treatment explained 5% or less of the variance, then the treatment effect was near zero. [Note: as implied in a previous blog, we could have an interval from 0.0001 to 0.048, and say that the study statistically proved the treatment had an effect, but clinically it wasn’t a useful one (i.e., η² < 0.05).]
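For readers who have not met the correlation ratio, here is a minimal sketch of how it is computed for a two-group study: the between-group sum of squares divided by the total sum of squares. The toy data are invented for illustration.

```python
def eta_squared(group_a, group_b):
    """Correlation ratio (eta-squared) for two groups:
    between-group SS / total SS = variance explained by treatment."""
    all_obs = group_a + group_b
    grand_mean = sum(all_obs) / len(all_obs)
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    ss_between = (len(group_a) * (mean_a - grand_mean) ** 2
                  + len(group_b) * (mean_b - grand_mean) ** 2)
    ss_total = sum((x - grand_mean) ** 2 for x in all_obs)
    return ss_between / ss_total

# Toy data: the treatment explains 60% of this little study's variance.
print(eta_squared([1, 2, 3], [3, 4, 5]))  # 0.6
```

In a real equivalence analysis you would put a confidence interval around this quantity and, per the text, look at its upper end against the 0.05 cut-off.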

We don’t see much of this type of analysis, however. As I will shortly illustrate, it is very, very costly to run such a study, and anyway, who wants to spend a very large chunk of research cash on proving that the treatment didn’t work?

How expensive? I’m going to provide various correlation ratios (η²), their square root (η) – in a sample this would be the simple correlation (r) – the effect size (δ), and the sample size for each group (double it for a two-group t-test comparison), assuming 80% power and a two-sided alpha of 0.05. For example, in a two-group study which wanted to prove that the difference between the two treatment means was less than 10% of a standard deviation (alternatively, a study in which the treatment effect accounted for no more than 1% of the study’s variance), one would need 1,571 per group or 3,142 total patients. As implied above, this is a pretty expensive way to prove your product isn’t really different from your competitors’.
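The table below can be approximately reproduced with a short sketch, assuming its convention δ = η / √(1 − η²) and the usual normal-approximation sample-size formula. The table itself appears to use the t distribution, so its N/group values run a patient or two higher than this approximation.

```python
import math

# Standard normal quantiles (hardcoded to stay dependency-free):
Z_ALPHA = 1.959964  # two-sided alpha = 0.05
Z_BETA = 0.841621   # 80% power

def delta_from_eta2(eta2):
    """Effect size from the correlation ratio: delta = eta / sqrt(1 - eta^2)."""
    return math.sqrt(eta2) / math.sqrt(1.0 - eta2)

def n_per_group(delta):
    """Two-group sample size, normal approximation:
    n = 2 * (z_alpha/2 + z_beta)^2 / delta^2 per group."""
    return math.ceil(2 * (Z_ALPHA + Z_BETA) ** 2 / delta ** 2)

print(round(delta_from_eta2(0.02), 3))  # 0.143, matching that table row
print(n_per_group(0.100))               # 1570 (table's t-based value: 1,571)
```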

| η² | η | δ | N/group |
| --- | --- | --- | --- |
| 0.100 | 0.316 | 0.333 | 143 |
| 0.075 | 0.274 | 0.285 | 195 |
| 0.050 | 0.224 | 0.229 | 301 |
| 0.040 | 0.200 | 0.204 | 379 |
| 0.030 | 0.173 | 0.176 | 508 |
| 0.020 | 0.141 | 0.143 | 769 |
| 0.010 | 0.100 | 0.100 | 1,571 |
| 0.005 | 0.071 | 0.071 | 3,115 |
| 0.000 | 0.000 | 0.000 | ∞ |

My next blog will discuss why statisticians cannot lie and why the analysis plan is essential.