‘Failure is always an option’ – MythBusters

‘The Statistician is in. 5¢ per p-value.’

‘Your statistical report is being delivered by three UPS trucks.’

‘It takes three weeks to prepare a good ad-lib speech.’ – Mark Twain

***

Let me go on a tangent.

Steve Jobs recently died. I was thinking of the things I do in life, and the things I am incapable of doing. I could never do the things he did. There are two types of people. (<*Cough*> Actually I personally think of characteristics along a continuum. <*Cough*>) The two types are those who BELIEVE and those who think that truth is relative and only partially known. To return to Steve Jobs, he BELIEVED in his goals. He was maniacally dedicated to accomplishing his vision, creating his product, and getting it out the door. Great CEOs are like that.

Don’t get me wrong. All members of the medical team are fully dedicated to producing the best product in the shortest time possible. But great statisticians have worked on hundreds of projects. Some projects succeed and some fail. We aren’t personally responsible for the failures. If a drug doesn’t work, or if it’s unsafe, that is the probabilistic nature of our research. ‘Failure is always an option.’ The FDA rejected a blood product I worked on because it wasn’t as safe as giving saline. A ‘jiggling’ device a woman stood on to increase her bone density didn’t work. An oxygenated liquid meant to float the diseased lung tissue of patients dying of upper respiratory failure did more harm than good. While we statisticians might look at subsets of the data or alternative analyses to ensure we did the analysis correctly, we don’t take the drug’s failure personally. If a CEO makes a mistake, they take it personally. They drive their people to prove that the treatment worked. They make the decisions. They will frequently look at data subsets and alternative analyses to demonstrate efficacy.

The FDA knows this. Publication editors know that it is by the number of publications that a person gets tenure and higher pay. They are intrinsically distrustful. If I were an external scientist, I would also be distrustful. To put this another way: if you were reviewing a report from a competitor, what would you like to see? Would you blindly trust their findings?

As an external scientist, what would I demand to see? The first and second things I would request are the protocol and the statistical analysis plan (SAP). If the protocol/SAP didn’t state what the key hypothesis was (the key parameter and key time point) and how they planned to test it, I’d be suspicious, deeply suspicious. If they switched the key parameter, I’d go back to the report and see whether they also presented the originally planned analysis.

Let me tell you about a recent analysis, what it included, and the rationale for each component. All of the below were in the appendix (i.e., not in the main body of the report).

- Data listing – The FDA requires this. Actually, they require an electronic version in the CDISC standard formats. If any second party wanted to do a reanalysis, they would have the raw data to replicate my results.
- Summary statistics – Many analyses present covariate-adjusted means or compute standard errors based on the pooled data. Don’t get me wrong, such model-driven results are excellent. But for deeper looks into the assumptions of the model, nothing beats simple summary statistics. These typically present the N, arithmetic mean, median, and standard deviation for each treatment within each subset (e.g., by visit). One thing an external scientist could do is examine the standard deviations across subsets and see if any were suspiciously large. We know that the p-value wouldn’t be badly affected by non-normality or outliers (see ‘7. Assumptions of Statistical Tests’), but if there were any trend (e.g., standard deviations are large when the means are large), then the data might be better analyzed after a data transformation. Otherwise, if there were an idiosyncratically large sd, then an outlier (error in the data???) might be present, throwing off the results.
- Analysis of pre-treatment (e.g., baseline) data. This shouldn’t be necessary if the data were correctly randomized/blinded. Nevertheless, if one group (e.g., placebo) were more asymptomatic at baseline, it would be a major red flag. As an external scientist, I would want to see that this was not the case. For this crossover trial, I had two pre-treatment analyses; the second examined the period baselines.
- Final analysis. This was a statistical model including any other relevant effects; here, a treatment-by-time analysis. I included all output from the model, including how I handled the repeated measurements (see ‘7. Assumptions of Statistical Tests’). I also tested for all treatment differences within each time point.
- Supportive analyses. These include all the secondary parameters and all the secondary time points. For every key (primary) parameter, there may be a dozen secondary parameters. For every key (primary) time point, there may be five intermediate time points. An external scientist would be interested in seeing the ‘pattern’ of results. If the key parameter/time point favored the treatment but the others showed no such tendency (not necessarily statistically significant, but at least in the correct direction), then the external scientist would be suspicious. Hmm, a dozen plus one parameters and five plus one time points would mean 78 sets of analyses. The devil is in such details.
- Other supportive analyses. This might include non-parametric analyses. As I mentioned in the last blog, the ordinal non-parametric analysis should have similar power to the parametric analysis. Of course, if it weren’t statistically significant but the final analysis were, I would need to see whether outliers, subgroups, etc., were causing the difference. I would expect the non-parametric analyses to support, and to be close in value to, the findings of the parametric analyses.
- Exploratory analyses. In order to have a final analysis, we might need to start with a more complicated initial model. For example, if we ran a final analysis without interactions, then we would need to demonstrate that the interactions were not important. What do I mean by ‘not important’? Well, I frequently use a less conservative p-value of 0.10. If an interaction with the treatment effect were statistically significant, then we cannot talk about a simple treatment difference. For example, let us say we ran an analysis of treatment by investigator site. If the investigator-by-treatment interaction were statistically significant, then we cannot talk about how the new treatment differed from placebo. We would need to talk first about whether some investigator(s) had positive results while other investigators didn’t, or <shudder> some had negative results. My first look at the interaction would be at the simple mean treatment differences. I would hope that all were in the correct direction, at least numerically. Why not include p-values? Back to my first blog: no p-value will be statistically significant if the N were too small. In any case, if the interaction were relatively large (e.g., p < 0.10), then I would suggest presenting, at least in an appendix, the treatment-by-interacting-effect means. In a recent analysis, the treatment-by-time interaction was significant. Although some treatment differences were smaller than others, at all time points the treatment differences were statistically significant and favored the experimental treatment. Concerned scientists would want to see the interaction tests and the consistency of results within strata.
- Sensitivity analyses – Population. There are a few versions of this. One is where we depart from the analysis including all patients (i.e., the intent-to-treat or ITT analysis); this is frequently called the per-protocol or PP analysis. Another version is where we handle missing data in different manners. For each population, all analyses are repeated.
- Sensitivity or robustness analyses – Missing data and model assumptions. Another type of sensitivity analysis is where the FDA requests a special analysis for the missing data. For example, in one trial, they said that for their sensitivity analysis, missing data in the treated group were to be considered failures (i.e., died), but missing data in the control group were to be considered successes (i.e., alive). This was an extreme test of the robustness of our results to missing data. The sponsor elected to go to extraordinary lengths (i.e., hiring private detectives) to track down missing patients, minimizing the effect of this sensitivity analysis. This sensitivity analysis is typically done only for the primary (secondary?) parameter(s).
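To make the worst-case missing-data idea concrete, here is a minimal sketch with made-up binary outcomes (the data and function name are my own invention, not from any actual trial; a real sensitivity analysis would also recompute the formal test, not just the proportions):

```python
# Hypothetical binary outcomes (1 = success/alive, 0 = failure/died).
# None marks a patient lost to follow-up.
treated = [1, 1, None, 1, 0, 1, 1, None, 1, 1]
control = [1, 0, 1, None, 0, 1, 0, 1, 1, 0]

def impute_worst_case(treated, control):
    """Worst case against the drug: missing treated patients become
    failures, missing control patients become successes."""
    t = [0 if x is None else x for x in treated]
    c = [1 if x is None else x for x in control]
    return t, c

def observed_rate(group):
    """Success rate among patients with observed outcomes only."""
    seen = [x for x in group if x is not None]
    return sum(seen) / len(seen)

t_imp, c_imp = impute_worst_case(treated, control)

observed_diff = observed_rate(treated) - observed_rate(control)
worst_diff = sum(t_imp) / len(t_imp) - sum(c_imp) / len(c_imp)

# The gap between the two differences shows how much the
# conclusion leans on the patients we never observed.
print(observed_diff, worst_diff)
```

If the treatment effect survives even this deliberately hostile imputation, the missing data cannot be driving the result.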

This last week, on a statistical consultant’s blog, I noticed a colleague who re-examined a barely statistically significant result from twenty patients. He elected to examine the robustness of this finding by eliminating each patient separately and computing the p-value for each re-analysis. Other statisticians recommend avoiding the parametric assumptions by bootstrapping: repeatedly sampling N patients with replacement from the observed data (for example, 30,000 replications), computing the p-value of each resample, then summarizing the mean p-value.

What is the net result of all these analyses? For this study of 45 rats, the statistical appendix was five hundred pages long, some pages with thirty p-values. I should mention that I usually throw in the complete output of the analysis. If this were up to NDA standards, I might extract only the key results into a camera-ready, concise summary table. However, it takes considerable time and effort to create a simple, brief report, or as Mark Twain observed, “It takes three weeks to prepare a good ad-lib speech.” When you look at all the key, secondary, and tertiary analyses; the supportive, exploratory, and sensitivity analyses; and the tests of the assumptions, there may be thousands of p-values, most of which are only incidental to the single key hypothesis of the study. But all are important to a skeptical scientist/reviewer.

In my next blog I will discuss multiple observations – multiple observations and crossover trials as the devil’s tool (and challenge the devil to a fiddle contest).