If you shoot enough arrows, everyone can hit a bull’s-eye

Free! Free! Free!

BOGO – Buy One Get One Free

***

Multiple Comparison Problem: If you shoot at a target once and hit the bull’s-eye, then you’ve clearly hit your mark. If you shoot at ten targets and hit one, then it isn’t clear that you’ve succeeded. Examples of this issue are doing the test at different times (e.g., Week 1, 2, 4, 8, 12), on different dependent variables, in different sub-samples (older or younger children), in different populations (ITT, per protocol), or for different active treatments (High v Placebo, Medium v Placebo, Low v Placebo). One way to circumvent this issue of multiple comparisons is to define one comparison as the key comparison in the protocol. The other comparisons could be secondary (or tertiary). Sometimes you can’t decide.

Simple Bonferroni Adjustment: Perhaps the first place to start is the impact of having multiple active treatments. Perhaps we have a high and a low dose for a Phase II trial, each to be compared to a control. In this case, we can ‘win’ if either the high dose or the low dose is statistically significantly better than control. Now in probability anything with an ‘or’ is additive, anything with an ‘and’ is multiplicative. Therefore, if each of the two comparisons is tested at a critical p-value of 0.05, then the probability that either comparison A **or** B achieves a spurious statistical significance is about 0.10 (0.05 + 0.05). If we had used three comparisons it would be about 0.15. I’m not going to expound on the meaning of ‘about’, as we are double counting some small probabilities: the actual experiment-wise probabilities are 0.098 and 0.143, respectively (I’m not going to quibble over 0.098 v 0.100 or 0.143 v 0.150). So, if we wanted to have a 0.05 OVERALL (aka experiment-wise) probability level we’d need to divide the overall alpha by the number of comparisons, giving 0.05/2 = 0.025 for a two comparison study and 0.05/3 = 0.0167 for a three comparison study. This is known as the Bonferroni correction.
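
The arithmetic behind ‘about 0.10’ and ‘about 0.15’ can be checked with a few lines of Python. This is only a sketch: it assumes the comparisons are independent (comparisons that share a control group are not quite), and the function names are mine, not from any library.

```python
def familywise_error(alpha: float, k: int) -> float:
    """Chance of at least one spurious 'win' among k tests.

    Assumes the k tests are independent, each run at level alpha.
    """
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(overall_alpha: float, k: int) -> float:
    """Per-comparison critical p-value under the simple Bonferroni rule."""
    return overall_alpha / k


for k in (1, 2, 3):
    per_test = bonferroni_alpha(0.05, k)
    print(
        f"{k} comparison(s): "
        f"additive guess = {k * 0.05:.3f}, "
        f"actual = {familywise_error(0.05, k):.3f}, "
        f"Bonferroni level = {per_test:.4f}, "
        f"overall after correction = {familywise_error(per_test, k):.4f}"
    )
```

The last column shows that the corrected overall level lands just under 0.05, which is why Bonferroni is described as slightly conservative.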

Naively one might think that by halving the p-value one would need to double the number of subjects. Nope. Let me take the case where the N for a single comparison study is rather large. The critical t would be near 2 (actually 1.96). What would we need for half the p-value? No, not about 4 (actually 3.92); you would need only 2.241. That’s only 14.4% larger (a ratio of 1.144) than the original 1.96. Unfortunately the new N is not linear in the critical t-value, but related to its square, t^{2}. So the N would need to be about 1.144^{2} = 1.308 times the original, or 30.8% larger, but clearly not 100% larger. [BTW, things get slightly worse when the starting N is not large, but things level off quite quickly. When the N_{group} is about 30 then the t is only slightly larger than 1.96, i.e., 2.04.]

In summary, for the simple Bonferroni adjustment for multiple comparisons one would test each comparison at a critical p-value of 0.05/(number of comparisons). This will increase the N_{group} by a relatively small degree.

Improved Bonferroni Adjustment: When one makes two comparisons, one difference of means will be larger than the other. This is just like that archery contest, one arrow will be closer than the others. For the purposes of this blog, let me assume that the High dose vs. Control is a bigger difference than the Low dose vs. Control. With the above simple Bonferroni the critical p-value for the High dose vs. Control is 0.025. That will still apply. However, if and only if that largest comparison was statistically significant, the weaker comparison, Low vs. Control, could be tested at the 0.05 level. Buy one, get one free. For three active comparisons, the largest comparison would be tested at 0.05/3 (=0.0167). If the first were significant, the second can be tested at 0.05/2 (=0.025). If the first two were statistically significant at their respective p-values, the smallest mean difference can be tested at 0.05/1 (=0.05). Therefore, the smallest difference could be tested at a less severe condition. One still needs to plan for a study with the larger N, as above for the best mean comparison. However, the Improved Bonferroni makes the other tests easier to achieve. I’m simplifying things by only focusing on a step-down approach, but I can’t explain everything in this simple blog.
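
The step-down rule described above is commonly known as Holm’s procedure. A minimal sketch follows, assuming the comparisons are ordered by p-value (smallest first) rather than by raw mean difference; the function name and the example p-values are made up for illustration.

```python
def holm_step_down(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Step-down testing: the best comparison is tested at alpha/m, the next
    at alpha/(m-1), and so on, stopping at the first failure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    significant = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        threshold = alpha / (m - rank)  # 0.05/m, then 0.05/(m-1), ..., 0.05/1
        if p > threshold:
            break  # once one comparison fails, all weaker ones fail too
        significant[name] = True
    return significant


# Hypothetical p-values for a two-dose trial
print(holm_step_down({"High vs Control": 0.020, "Low vs Control": 0.040}))
print(holm_step_down({"High vs Control": 0.030, "Low vs Control": 0.040}))
```

In the first example the Low dose wins at 0.040 only because the High dose first cleared 0.025; under the simple Bonferroni it would have failed. In the second example the High dose misses 0.025, so nothing is declared significant.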

In summary, for multiple comparisons, the ‘penalty’ for multiple tests 1) isn’t prohibitive and 2) can be reduced to nothing for the second best comparisons.

Interim Analysis: Let me talk about another specialized type of multiple comparison. This is the case where we look at the data while the study is ongoing and see if the results are significant, an interim analysis. If so, then we might stop the trial. Perhaps we analyze the data when the trial is half completed. By the simple Bonferroni approach we might test both the interim and the final at 0.05/2 = 0.025. However, statisticians realized that in the final analysis we’re analyzing the data of the first half twice (i.e., in the final analysis the only ‘new’ data is the results from the second half). Pocock worked this out and realized that the two critical p-values should be 0.029, which is slightly better than 0.025. If there were three comparisons (two interim and the final) we could test them equally at 0.0221 (not 0.0167). Other statisticians (e.g., Peto) asked the question, why treat the comparisons equally? Why not ‘spend’ the alpha trivially at the first interim, for example, 0.001? Then the remainder, about 0.049, could be spent at the final. This evolved into a class of spending functions, which describe how one spends alpha. One could make the first comparison as important as the final, like the Pocock boundary, or have an accelerating spending function, saving one’s alpha for the final analysis. The latter has been popularized by O’Brien and Fleming (O-F). For example, for two interim analyses and a final analysis (three analyses in all), equally spaced, the Pocock alpha levels would be 0.022, 0.022, and 0.022. The O-F alpha levels would be 0.0005, 0.014, and 0.045. Personally I favor the O-F approach. The analyses when the N is low (i.e., low power, when you only have 1/3 of the data) could be statistically significant, but you’d need overwhelming proof to stop the trial. At the end of the trial, the critical p-value is near 0.05.
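
For the curious, the two-look Pocock value can be checked numerically. The sketch below assumes one interim look at exactly half the data, a two-sided test, and large-sample normal statistics, so that under the null the final z-statistic equals (interim z + an independent standard normal)/sqrt(2). The function name and the simple trapezoid integration are my own choices, not a standard routine.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def two_look_type1_error(nominal_p: float, steps: int = 4000) -> float:
    """Overall two-sided type I error when the same nominal p-value is used
    at a half-way interim look and again at the final analysis."""
    c = Z.inv_cdf(1 - nominal_p / 2)  # critical z used at both looks
    h = 2 * c / steps
    no_rejection = 0.0
    # Trapezoid rule over interim z-values that did NOT cross the boundary
    for i in range(steps + 1):
        z1 = -c + i * h
        # Chance the final analysis also stays inside (-c, c), given z1
        inner = Z.cdf(c * sqrt(2) - z1) - Z.cdf(-c * sqrt(2) - z1)
        weight = 0.5 if i in (0, steps) else 1.0
        no_rejection += weight * Z.pdf(z1) * inner
    return 1 - no_rejection * h


for p in (0.05, 0.025, 0.0294):
    print(f"nominal p = {p:.4f} at both looks -> overall alpha = {two_look_type1_error(p):.4f}")
```

Testing both looks at an unadjusted 0.05 inflates the overall error well above 0.05, the Bonferroni 0.025 overshoots on the safe side, and a nominal level of roughly 0.029 lands close to the intended 0.05.
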

But what happens if you don’t want to do the interim analyses at equal points (e.g., in a three hundred patient study, when 100, 200 and all 300 patients complete)? Well, Lan and DeMets worked out a way to come up with p-values dependent on the actual number of patients who completed the trial. One only needs to specify the spending function and the approximate time of the interim analyses. Therefore, one could do an interim analysis (e.g., Pocock, or Peto, or O’Brien and Fleming) when 122, 171, and 300 patients complete.
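
To give a feel for a spending function, here is a sketch of the two textbook Lan-DeMets forms, an O’Brien-Fleming-type and a Pocock-type, evaluated at the unequal looks mentioned above. The function names are hypothetical. Note the limitation: these return the cumulative alpha spent by each look, not the nominal critical p-value at each look. Only at the first look are the two the same; later looks need numerical integration, which is best left to dedicated group-sequential software.

```python
from math import e, log, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def obf_type_spent(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: cumulative two-sided alpha spent
    by information fraction t (0 < t <= 1)."""
    return 2 * (1 - Z.cdf(Z.inv_cdf(1 - alpha / 2) / sqrt(t)))


def pocock_type_spent(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: cumulative alpha spent by information fraction t."""
    return alpha * log(1 + (e - 1) * t)


total_n = 300
for completed in (122, 171, 300):
    t = completed / total_n  # fraction of the planned information
    print(
        f"{completed:3d} patients (t = {t:.2f}): "
        f"O-F type spent = {obf_type_spent(t):.4f}, "
        f"Pocock type spent = {pocock_type_spent(t):.4f}"
    )
```

Both functions reach the full 0.05 at the end of the trial; the O-F type spends almost nothing early, while the Pocock type spends more than half by the first look.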

BTW, one can reverse the type of analysis to determine if the results were so bad that there is no way to achieve statistical significance. This follows from the above, but is called a futility analysis.

In sum, when you analyze an ongoing trial with multiple looks, you don’t spend your alpha as badly as with a Bonferroni approach. In fact you can flexibly spend it so most is saved for the end of the trial, when you are most likely to get statistical significance, but still have the ability to stop the trial and proclaim a win if you get extraordinary results at an early time point. This was referred to by an FDAer as a ‘Front Page of the New York Times’ result.

If you are going to do an interim analysis, a few STRONG suggestions. First, the people who do it must not tell anyone involved in the trial the results. If the CMO talks to the people who are running the trial, then the CMO must not know the interim results. The best way to handle this is to create a Data and Safety Monitoring Board (DSMB) who are tasked with looking at the results and are empowered to halt the trial. They, and only they, are empowered to review the unblinded data. Second, expect to do additional analyses for your NDA on the results of your key parameter at each interim analysis point to determine if the results change. [Hint: Always expect the results to change – Murphy’s Law. I’ve often noticed that the patient populations change over time.] Hence, you may need to explain away the interim differences.

An Adaptive Trial: I’m going to describe a type of adaptive trial that is free of FDA questions. An adaptive trial is one where at one point or other, the actual design of the trial changes. For example, if one looked at the results early (breaking the blind) and noticed that the Low dose vs. Control had a minuscule effect size, one could drop the Low dose group and only collect data for the High dose and Control. This type of trial has too many problems to be considered a Phase III trial. Foremost is that the conduct of the trial changes with the ‘best’ treatment continuing. First, one treatment will always look better than a second, and that could be spurious. Experimentally, this has been called ‘regression to the mean’. Second, there is a real possibility that investigators might realize that the trial is changing and change the way they collect data, biasing the results. One could do an adaptive trial in Phase II, but I do not recommend it for a Phase III (or Phase II/III) trial.

No, I’m going to describe an adaptive trial which is 100% FDA safe. A completely FREE insurance policy! We discussed in the power analysis blog (8. What is a Power Analysis?) that we need to know something about the results, for example, the standard deviation or the proportion of successes in the control group. One might have an estimate of the sd from the literature. However, your study might not be identical to that trial: the patient population might differ (e.g., inclusion criteria, baseline severity restrictions), the design, or the investigators might be different (e.g., you are including Italian sites). What I’m recommending is to look at the blinded results (i.e., not know the treatment group membership) and from this compute the sd. One could then re-estimate the needed sample size.
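
A minimal sketch of such a blinded re-estimation for a continuous endpoint follows. It assumes two equal-sized groups, a two-sided 0.05 test, 80% power, and the usual large-sample formula; the function names, the rule of never going below the planned N, and the cap `max_n` are illustrative assumptions, not a regulatory recipe. One known quirk: an sd computed from pooled, blinded data runs slightly high if the treatments truly differ, which errs on the safe side.

```python
from math import ceil
from statistics import NormalDist, stdev

Z = NormalDist()  # standard normal distribution


def n_per_group(sd: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Large-sample N per group to detect a mean difference delta."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def blinded_reestimate(pooled_values: list[float], delta: float,
                       planned_n: int, max_n: int) -> tuple[float, int]:
    """Recompute N per group from the sd of all interim data pooled together,
    with no knowledge of treatment assignment."""
    blinded_sd = stdev(pooled_values)
    needed = n_per_group(blinded_sd, delta)
    # Hypothetical rule: never shrink the trial, never exceed the agreed cap
    return blinded_sd, min(max(needed, planned_n), max_n)


# Planning assumed sd = 10 for a clinically meaningful difference of 5
print("planned N per group:", n_per_group(sd=10, delta=5))
# If the blinded sd turned out to be 12, the needed N would grow
print("N per group if sd is 12:", n_per_group(sd=12, delta=5))
```

The whole rule (when to look, the formula, and the bounds) should be written into the protocol before the look is taken.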

Let me describe one such analysis I did. The trial investigated mortality for a new and a standard treatment. Mortality was estimated to be 17% for the control; the active was assumed to be 2 percentage points better (i.e., 15%). However, the 17% was an educated guess. If the control rate were 25%, many more patients would be needed; if it were 6%, far fewer would be needed. The FDA liked our N of about 750 for safety reasons. So we wrote the adaptive analysis so that if the overall rate was less than 16% (the average of 17% and 15%) we wouldn’t change the trial. If the overall mortality rate were larger, we would increase the N (within some bounds). We ran the BLINDED analysis when a third of the data was in, saw a lower overall mortality rate (14%) and let the trial continue with its original sample size. The FDA had many questions about the trial, but never a question about the adaptive component.

Free insurance!

Administrative Look: Sometimes one wants to examine the data to plan for the next trial, i.e., determine the mean difference and s.d. (aka the effect size). The comments above still apply. The next-to-best way to handle this planning is to have the DSMB compute this and keep the results from anyone involved in the current trial (CMO?). I would say the best way to do this is to do it in a blinded manner. This would obviate the need to do the secondary analyses at each interim point and to hire a DSMB. Also, as the data is always available to the sponsor for monitoring and error corrections, I would say that if one did a peek at the blinded data to assess the s.d. (or overall proportions), then one might not even need to report in the protocol or SAP that the blinded administrative look was done. The weakness is that one doesn’t have an estimate of the mean (proportion) treatment difference, aka the effect size. Well, the unblinded review has a strong cost, but the blinded look is free.

Hello Allan,

Thank you for the insightful articles on the blog. I completely agree that “In summary, for multiple comparisons, the ‘penalty’ for multiple tests 1) isn’t prohibitive and 2) can be reduced to nothing.”

However, I was surprised that you didn’t mention the Hochberg-Benjamini-Yekutieli method called False Discovery Rate correction. I’ve been fascinated with it since it really suits most practical applications much better than the Bonferroni FWER method and all of its improvements (it’s a more powerful test).

Are you familiar with it? Any reason for not mentioning it?

Thanks,

Georgi

The great thing about statistics, like any ‘science’, is that improvements (and purported improvements) are always seen. For the purposes of my blog, I attempted to present a simpler approach, but the Hochberg-Benjamini-Yekutieli method should have been mentioned.

You have to remember my guiding principle, as stated in the first two sentences of my blog: “These blogs were written with the non-statistician in mind, although statisticians could benefit from my thirty plus years of experience consulting for the pharmaceutical/biologic/device industry. It is for those people who have taken at least a single statistics class and use statistics for clinical research in the pharmaceutical/device/biotech industry.”

Thank you for your very valuable observation. The False Discovery Rate should have been mentioned.