People who haven’t the time to do things ‘perfectly’, always have time to do them over.
Measure once, cut twice. Measure twice, cut once.
‘My god, you’ve conclusively proven it. Time equals Money.’
Cost for Failure
Every CEO I’ve ever heard knows that they only make money when the product is on the market. In the pharmaceutical/biologic/device industry, you need to demonstrate the effectiveness of a product. This means you need to reject the null hypothesis. This means you must design the trial with an excellent shot at rejecting the null hypothesis (i.e., p < 0.05). What are the consequences of failure? Time. Let me elaborate.

It takes time for upper management to decide that they want to do a trial (e.g., 2 weeks), unless it’s around the holidays, vacations, or conferences. Upper management will then ask a medical director to design a trial. This will initially produce a shell of a design (e.g., another 2 weeks), then another month might be spent producing a first version of the protocol, then another couple of months getting investigators (redesign of the protocol?), case report forms, databases, edit checks, drug, etc. Finally, patient recruitment (three months?). Investigators ALWAYS overestimate their subject pool. New investigators are enrolled (another month). The last patient enrolled needs to complete the trial (e.g., 6 months). Data are collected from the field (1-2 months), cleaned (a month), queried (another month), re-queried (weeks), the blind is broken, analyses are done, everything is QCed, and a top-line report is issued (another 2 months). From soup to nuts, a very modest trial would take a year and a half.

Costs? Internal – one and a half years of full-time equivalents of medical, statistical, programming, CRA, and CDM staff. External – payments to investigators (institutions) with their staffs, medical supplies, testing equipment, and patients. Cost for failure? Redo everything, and the product is off the market for all that time. If the trial was not successful, then the planners are completely and solely responsible for the loss, frequently in the millions.
I will assume that the trial has a clearly stated, operational objective. [You might be surprised at the number of times I’ve seen a protocol without a clear goal (e.g., ‘to study the relationship of Drug A and efficacy.’ A two patient trial would meet that objective!)]
Input for Power Analysis
The first and most important thing one needs is an excellent literature review. This is not something that a statistician can do, although a statistician can extract information and help guide the final conclusions. The scientists need to go to the literature and get the best studies which have been done in the past.
They need to review the literature with regard to:
- the proper patient population,
- who would qualify and who shouldn’t (inclusion/exclusion criteria),
- patient subsets,
- how success is measured, especially what the Agency has previously considered valid measures of success (e.g., Agency Guidances),
- important variables to control or stratify the randomization – e.g., gender, age, ethnicity, baseline severity, center/region/continent,
- observed drop-out rate,
- standard of care (placebo?), and relevant active treatment groups, and
- the most important for the power analysis: summary statistics for the key parameter/time point.
A statistician might be asked to review the set of key studies which were published. One point I’ve made many times in these blogs is that a non-statistically-significant result for parameter A can be more useful than a statistically significant result for parameter B. For parametric data (variables for which one computes means), I would review the effect sizes and decide which one was larger. For example, in one literature review for a type of multiple sclerosis I observed that the most frequent ‘primary’ parameter, six minute walking time, had an effect size of 0.3 while a supportive parameter, the stair climb test, had an effect size of 0.7. I unequivocally recommended that the stair climb be used as the key parameter and the six minute walking time test as a secondary parameter. Why? Sample size is inversely proportional to the square of the effect size. Therefore, the stair climb parameter would need about one-fifth of the number of patients. For example, if the stair climb parameter needed 50 patients, the six minute walking test would need 270. Do you need to be a statistician to do this? Nope. Effect size is simply the mean difference divided by the standard deviation, something anyone can do with the cheapest calculator.
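Translated into arithmetic, the comparison above is a few lines (a minimal sketch in Python; the 0.3/0.7 effect sizes and N of 50 come from the example):

```python
# How effect size drives sample size: the stair-climb vs.
# six-minute-walk comparison from the multiple sclerosis example.
d_stair = 0.7   # effect size for the stair climb test
d_walk = 0.3    # effect size for the six minute walking time

# Sample size is inversely proportional to the squared effect size,
# so the weaker parameter needs (d_stair / d_walk)**2 times more patients.
ratio = (d_stair / d_walk) ** 2       # about 5.4

n_stair = 50                          # patients needed with the stair climb
n_walk = round(n_stair * ratio)       # about 270 with the walking test

print(f"ratio = {ratio:.1f}, N for walking test = {n_walk}")
```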
One difficulty in computing effect size is the standard deviation (sd). Sometimes the authors report only the standard error of the mean, sometimes the standard error of the difference in means. Sometimes they only present error bars (95%?) graphically, in which case you could enlarge the graph and use a ruler to estimate their size. Fortunately, to convert from the standard error of the mean to the sd, one multiplies the standard error of the mean by the square root of N. To convert from the standard error of the difference between two means to the sd, one multiplies that standard error by the square root of N/2. For 95% error bars of a mean, one would first divide the half-width by approximately 2 (i.e., 1.96) to recover the standard error, then multiply by the square root of N.
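The three conversions can be wrapped in one-line helpers (a sketch, assuming equal group sizes and error bars built from 1.96 standard errors):

```python
from math import sqrt

def sd_from_sem(sem, n):
    """SD from the standard error of a single mean: sd = SEM * sqrt(N)."""
    return sem * sqrt(n)

def sd_from_se_diff(se_diff, n_per_group):
    """SD from the standard error of the difference between two means
    (equal N per group): sd = SE_diff * sqrt(N/2)."""
    return se_diff * sqrt(n_per_group / 2)

def sd_from_95_error_bar(half_width, n):
    """SD from a 95% error bar on a mean: divide the half-width by ~2
    (1.96, to get back to a SEM), then multiply by sqrt(N)."""
    return (half_width / 1.96) * sqrt(n)
```

For example, a reported SEM of 2.0 with N = 25 implies an sd of 10.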
Another difficulty is what do you do with multiple time points. ‘A man with one watch knows what time it is, a man with two is never quite sure.’ Two simple solutions, use the effect size at the key time point or take a simple average of the many effect sizes. Or do both – ignore the period for which no reasonable treatment effect would be expected, then average the useful ones. A third option is to compute power for each. Power calculations are relatively fast.
Power Analysis Program Input
For a parametric analysis there are only a few things I would need: the alpha level (one- or two-tailed), the power of the trial, and the means and standard deviation (these could be replaced by the effect size). I’ll expand on each.
Alpha (α) – This is usually fixed by the scientific gatekeepers (e.g., publication, Agency). I almost always use 0.05 with a two-sided alpha. I would deviate from 0.05 only when there is not one but multiple ‘key’ comparisons. For example, if there were two ways the trial could be a success, then (using the Bonferroni correction) I would use alpha/2 or 0.025 (two-sided), but more about that in a later blog. In my power analyses, once I select the alpha I don’t bother considering alternative values.
Power of the trial (1 – β) – β is the Type II error rate (how often the trial will fail when the effect size is not zero but δ – see below); 1 minus that is called the power. Power is the likelihood that the study will be a success, expressed as a proportion or percent. I frequently use 0.8, an 80% chance that the study will succeed. 0.7 or less is inadequate for planning a trial. Large pharma will often use a 0.90 or 0.95 likelihood of success, especially for a pivotal Phase III trial. One could examine the results at multiple levels of power to determine costs.
One thing about power: if it is above 0.50, then there is some overage in the sample size. In actuality, one just needs the p-value to be < 0.05; a p-value of 0.049 would be a success. When power is much greater than 0.50 (e.g., 0.95), the results would not only be less than 0.05 but frequently much less (e.g., p < 0.001). I occasionally run the power program with a power of 0.50 to determine what sample size would just barely be significant with my assumed effect size. Of course, in any study the observed effect size could be larger than expected or smaller. With 0.50 power you would be statistically significant – and fail – half the time. Ya pays ya money and ya takes ya chances.
Like I said above, I often use a power of 0.80. Studies of that size are successful four times out of five.
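To see the overage concretely, here is a sketch comparing the N/group needed at 50% vs. 80% power (normal approximation to the two-sample t-test; validated software would add a patient or two; the 0.5 effect size is a hypothetical):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, power, alpha=0.05):
    """Approximate N per group for a two-sided, two-sample comparison of
    means: n = 2 * ((z_{1-alpha/2} + z_{power}) / effect_size)**2."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)

# Coin-flip power vs. the 4-out-of-5 power I usually use:
print(n_per_group(0.5, power=0.50))   # just barely significant half the time
print(n_per_group(0.5, power=0.80))
```

At an effect size of 0.5, the 80% trial needs roughly twice the patients of the break-even 50% trial.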
Effect size (δ) – Or mean for group A minus mean for group B divided by the standard deviation. By now, after reading my previous blogs, you should be quite comfortable with effect size. [Note: most power programs allow either the effect sizes or the two means and standard deviation.]
Let me first state the wrong way of doing things.
- One should NEVER decide on the maximum N you could afford, then reverse-estimate the effect size. It’s trivially easy to do that, but it’s still wrong. If the proper effect size is 0.25 and all you can afford is to run the trial assuming an effect size of 0.50, then the trial is 1/4 the size it should be. It will most likely fail, and you will have wasted your time (a year and a half?) and money (ten million dollars?).
- A second incredible blunder I’ve come across was to select the largest effect size ever seen in the literature and expect to see this in your study. For example, one client saw a series of 0.3 to 0.4 effect sizes, and a 0.8 effect size (at a non-key time point). They used the 0.8 and the study failed. D’oh. The larger effect size will invariably ‘shrink’ – a phenomenon called ‘regression to the mean’, which was first identified by Galton over a hundred years ago. When you have multiple effect sizes use the median. [Note: One possible exception to this is that the large effect size was due to a MAJOR change in methodology. I’d still recommend a shrunken version of that large effect size.]
- A corollary of this is never settle for a standard large effect size, like 1.0. This was once recommended to me by a young IRB statistician. WRONG, very wrong, at least for all concerned but the statistician. If you over-estimate your effect size and the trial is inadequate you’ve just wasted time and resources and are keeping your product from the market. On the other hand, for an IRB statistician, it would allow foolish investigators to run trials, run many statistical analyses to see if anything was significant, and keep that IRB statistician employed. A ‘standard’ effect size for a specific scientific domain can only be derived based on experience in the field.
- Another blunder is to say that previous research found a moderate effect size (e.g., 0.40), but your engineers/formulation people think that their new, better than all others, novel, exciting, groundbreaking, … approach will do better than this (e.g., a 0.80 effect size). Alternatively, by controlling for something (e.g., disease severity/gender) or modifying the methodology you can reduce the noise (halve the sd, so a 0.40 effect size becomes 0.80). Unfortunately, the competitors’ scientists (with their observed 0.40 effect sizes) were also dealing with novel, exciting, groundbreaking products. There are many, many reasons these hoped-for effects get mitigated; the largest is that clinical research must be done on people. I remember one woman with a diabetic foot ulcer who insisted on dancing at her youngest daughter’s wedding. The wound got worse, much worse, but she had a great time. Real data on human patients has noise in it, uncontrollable noise, degrading any effect. Real data is also collected by human investigators. They often do things wrong, degrading any effect. Murphy’s law is THE LAW.
- (Addendum 15Oct2015): I was asked by an investigator, after the study was over and analyzed: “The abstract committee asks us to include a ‘power analysis estimate.’ Is there a succinct way to address this given the data and your analysis?” I told him this was wrong, totally wrong. (a) A power analysis ONLY applies to planning a (future) trial, and should be written into the protocol prior to the enrollment of the first investigator. (b) It has no utility after the enrollment is completed. At that time, it is too late. (c) The most onerous statement is to say “We didn’t get statistical significance; if we had x more patients it would have been statistically significant.” Failing to reject the null hypothesis will always mean that the investigators were unable to prove that the treatment was effective. As stated in Blog #1, any treatment is different from a second treatment, always! To be snarky, the report should say, “We never thought to plan the trial with a realistic goal, so the failure to get positive results is because we were incompetent. We jeopardized patients, and wasted our, the patients’, and the researchers’ time and money because we couldn’t plan.”
Therefore, I strongly recommend you let the literature suggest the means and standard deviations. One caveat is that the literature might itself overestimate effect sizes. Although it shouldn’t be the case, many journals don’t publish failed (i.e., non-statistically-significant) results, many investigators don’t publish failures, and many industry-sponsored trials that were improperly run or inadequately sized (i.e., failures) never get published. So you might even want to design the trial with an effect size lower than that seen in the (published) literature.
Three outputs and two inputs
As I mentioned above, one typically sets alpha based on the study design/gatekeepers; it doesn’t vary. Beyond that there are three real pieces of information – N/group, power, and effect size. Input any two and the program gives you the third:
- N/group – One inputs one or more powers and one or more effect sizes [or fix the standard of care (control) mean and the standard deviation and have a set of treatment means]. For example, one could have 0.70, 0.80, 0.90 and 0.95 power and effect sizes of 0.40, 0.50, and 0.70 for 12 estimates of N/group.
- Power – One inputs the (set of) N/groups and a (set of) effect sizes. For example, compute power with 50, 100, 150 patients per group and effect sizes of 0.40, 0.50, and 0.70 for 9 estimates of the likelihoods of the study succeeding.
- Effect size – One inputs the set of N/groups, the set of powers for the trial, and the assumed standard deviation to get the detectable differences.
When I’m already doing a power analysis, to generate a second one takes less than a minute. The difference in computing one result or 9 is trivial in terms of cost. I highly recommend having a set of N/groups, power and effect sizes (treatment mean differences).
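The 12-estimate grid mentioned above takes only a few lines. A sketch using the normal approximation (validated packages like nQuery or SAS PROC POWER will give marginally larger Ns):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, power, alpha=0.05):
    """Approximate N/group for a two-sided, two-sample comparison of means
    (normal approximation to the t-test)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)

# 4 powers x 3 effect sizes = 12 estimates of N/group:
for power in (0.70, 0.80, 0.90, 0.95):
    for d in (0.40, 0.50, 0.70):
        print(f"power={power:.2f}  effect size={d:.2f}  "
              f"N/group={n_per_group(d, power)}")
```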
Multiple Treatment Groups
I have a simple way of looking at things. When I have multiple treatment groups, I reduce the problem to its simplest form. The key comparison becomes a simple, single two group comparison. Let’s say we have two active groups (high and low dose) and a placebo group, three groups in all. There can be three comparisons: 1) high dose compared to placebo, 2) low dose compared to placebo, and 3) high compared to low dose. For efficacy, the FDA wants to see a comparison against placebo – deprioritize the third comparison. One typically would expect the high dose to be at least as effective if not more effective than the low dose, so the largest effect size would be a single comparison of high dose vs. placebo. High vs placebo would be the key (or primary) comparison. Done.
Alternatively, if the chief medical officer says the low dose vs. placebo is important, then the simple approach is to use the smaller expected difference (effect size) in the power analysis and to divide the alpha by 2 for the two “equally important” comparisons (high vs. placebo and low vs. placebo). The alpha for a traditional 2-sided comparison is typically 0.025 for either tail. Dividing the alpha by two (again) would make it 0.0125. You could pass the new sample size (cost of the trial) on to your CMO and see if they still want to pay for this larger trial. If they don’t want to pay to see the low dose vs. placebo difference, my next question is: why include the low dose at all – drop it and speed up the trial by 50%!!! To me, time to market is critical, and one can always go back and do a phase IV with the lower dose.
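The cost of the Bonferroni split is easy to show the CMO in numbers. A sketch (normal approximation; the 0.5 effect size and 80% power are hypothetical placeholders):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate N/group, two-sided two-sample comparison of means."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2)

n_single = n_per_group(0.5, alpha=0.05)   # one key comparison: high vs. placebo
n_bonf = n_per_group(0.5, alpha=0.025)    # two "equally important" comparisons

# Total enrollment: 2 arms at the single-comparison N
# vs. 3 arms at the Bonferroni-adjusted N.
print(2 * n_single, "vs.", 3 * n_bonf)
```

Dropping the low dose takes total enrollment from 3 larger arms down to 2 smaller ones, roughly halving the trial.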
Eventually all my clients expect to see differences among all treatments, so I power the trial to detect the biggest/most important difference. In a repeated measurement design, I would still use one literature based comparison to power the trial. I might throw into the final N an overage for each additional effect I’m computing (e.g., each degree of freedom in the model). So, if you’re comparing 3 treatment groups measured at 5 time points, you would need (3*5 – 1) or 14 extra subjects enrolled.
Overage for drop-outs
Speaking of overage, once you have your estimated evaluable N per group you need to increase it by a fudge factor based on expected drop-outs. For example, if you have seen a 15% drop-out rate, you would divide the evaluable N/group by (1 − 0.15) – i.e., multiply by roughly 1.18 – to come up with the to-be-enrolled N/group.
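In code this is a one-liner (dividing by 1 − rate, since the drop-outs come out of the enrolled total; the popular shortcut of multiplying by 1.15 slightly under-enrolls):

```python
from math import ceil

def enrolled_n(evaluable_n_per_group, dropout_rate):
    """Inflate the evaluable N/group so that, after the expected drop-outs,
    the evaluable target is still met: enrolled = evaluable / (1 - rate)."""
    return ceil(evaluable_n_per_group / (1 - dropout_rate))

print(enrolled_n(100, 0.15))   # enroll 118/group to keep ~100 evaluable
```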
Non-Inferiority and/or Non-Parametric Power Analyses
Up to now, I’ve been focusing on a parametric superiority trial.
A non-inferiority (not-worse-than) trial would have all of the above plus the equivalence-limit difference (aka the minimally meaningful treatment difference). As one is typically running a non-inferiority trial against an active treatment, the true treatment difference should be around zero. What you want is as large a negative margin as you can justify. This is something you will often negotiate with the Agency. For example, in one trial against the standard of care, we initially suggested that the active treatment be not 10% worse. The Agency wanted not 1% worse, and we finally compromised on not 3% worse.
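That margin negotiation drives the budget dramatically. A sketch for a dichotomous endpoint (normal approximation, one-sided alpha of 0.025, 80% power; the 70% success rate is a hypothetical stand-in for the standard of care):

```python
from math import ceil
from statistics import NormalDist

def ni_n_per_group(p_control, margin, power=0.80, alpha=0.025):
    """Approximate N/group for a non-inferiority comparison of two
    proportions, assuming the true treatment difference is zero."""
    z = NormalDist().inv_cdf
    var = 2 * p_control * (1 - p_control)  # both arms assumed at the control rate
    return ceil(var * ((z(1 - alpha) + z(power)) / margin) ** 2)

for margin in (0.10, 0.03, 0.01):          # the 10% / 3% / 1% margins from the text
    print(f"margin={margin:.0%}  N/group={ni_n_per_group(0.70, margin)}")
```

Tightening the margin from 10% to 1% multiplies the N/group a hundred-fold, which is why the compromise matters.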
Power analyses for non-parametric analysis of ordinal data are always a problem. One does non-parametric analyses because the data have outliers (extreme values), have non-normal distributions (e.g., many laboratory tests have a log-normal distribution), or have limits (e.g., if a patient died he would be given a score below all other scores; many laboratory tests have limits of detectability). What I’m referring to is doing a statistical test like the Mann-Whitney or Wilcoxon test. Crudely put, one ranks all the data and does a t-test or ANOVA on the ranks. It isn’t exactly that, but that is the underlying approach. If I need to compute power for such cases I do my best to convert the problem into a parametric power analysis. I might replace the means with medians. Alternatively, for the power analysis I might transform the data to make it more normal (e.g., a log transformation).
Dichotomous data would replace the effect size above with the two proportions (the treatment and control success rates). One doesn’t need the sd, as the standard deviation of a proportion is known and determined solely by the proportion of successes. The two success rates have a known relationship with N/group: the largest study is needed when the average rate is around 50%, and the closer the two proportions are to 0% (or 100%), the smaller the study needed. Of course, the bigger the difference in proportions, the smaller the N/group.
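A sketch of the proportion case (normal approximation, 80% power, two-sided 0.05; the rates are hypothetical), showing the same 20-point difference needing fewer patients as the rates move away from 50%:

```python
from math import ceil
from statistics import NormalDist

def n_per_group_props(p1, p2, power=0.80, alpha=0.05):
    """Approximate N/group for comparing two success rates; no sd is
    needed, since a proportion's variance is p * (1 - p)."""
    z = NormalDist().inv_cdf
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(var * ((z(1 - alpha / 2) + z(power)) / (p1 - p2)) ** 2)

print(n_per_group_props(0.50, 0.70))   # rates straddling 50%: most patients
print(n_per_group_props(0.70, 0.90))   # rates toward 100%: fewer patients
```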
Other approaches (e.g., power for survival analysis) are beyond the scope of this blog.
How do you actually do a power analysis? Can anyone do one?
I used to write my own programs (in FORTRAN or C). Currently I use validated programs like SAS (www.sas.com) or nQuery Advisor (www.statistical-solutions-software.com). There are many others (especially for specialty methodologies [e.g., interim analyses or testing for non-inferiority]). nQuery Advisor is very simple to use and handles many types of statistical power analysis, so an inexperienced user could easily use it. However, as it currently costs about $1,300 to own nQuery, and many times more than that for SAS, it might be a lot cheaper to have a statistician run the power analysis. I could knock off a power analysis with 10 variations and a simple report in two hours or less.
In sum, a literature review and a power analysis are the essential first and second steps for any study. Together they can easily save millions of dollars. The inputs are either routine (alpha level, power of the trial) or readily available from the literature (expected treatment difference, standard deviation). I strongly recommend examining multiple views of the power (cost) of a trial, as it is very easily done. Finally, I recommend that the effect size selected be on the small end of the possible effects.
In my next blog ‘Dichotomization as a devil’s tool’, I will suggest that one shouldn’t take a continuous parameter and dichotomize it to generate a success failure (e.g., a weight loss of ≥ 10 pounds is a ‘success’ while < 10 pounds a ‘failure’).