Everything should be made as simple as possible, but no simpler.
– Attributed to Albert Einstein
In my third and fourth blogs I addressed useful ways to present the results of an analysis. Of course, p-values weren’t among them. I favored differences in means and, especially, their confidence intervals, when one understands the dependent variable (d.v.). For those cases where one doesn’t understand the d.v., I recommended dividing the mean difference by its s.d. (i.e., the effect size). This tells you how many standard deviations apart the means are.
In “9. Dichotomization as the Devils Tool”, I said that transforming the data by creating a dichotomy of ‘winners’ or ‘losers’ (or ‘successes’/‘failures’ or ‘responders’/‘non-responders’ [e.g., from RECIST for oncological studies]) was a poor way of analyzing data, primarily because it throws away a lot of information and is statistically inefficient. That is, you need to pump up the sample size: under the best case you’d only have to increase the N by 60%; under other, more realistic cases, you’d have to increase the N fourfold.
Percentage change is another very easy-to-understand transformation. In this blog I’ll be discussing a paper by Andrew J. Vickers, “The Use of Percentage Change from Baseline as an Outcome in a Controlled Trial is Statistically Inefficient: A Simulation Study” in BMC Medical Research Methodology (2001) 1:6. He states “a percentage change from baseline gives the results of a randomized trial in clinically relevant terms immediately accessible to patients and clinicians alike.” I mean, what could be clearer than hearing that patients improve 40% relative to their baseline? Like dichotomies, percentage change has a clear and intuitive intrinsic meaning.
[Note added on 20Apr2013: I forgot to mention one KEY assumption of the percentage change from Baseline: the scale MUST have an unassailable zero point. Zero must unequivocally be zero. A zero must be the complete absence of the attribute (e.g., zero pain or free of illness). One MUST NOT compute anything dividing by a variable (e.g., baseline), unless that variable is measured on a ratio level scale, where zero is zero. Also see Blog 22.]
I’m not going to go too much into the methodology he used. He basically used computer generated random numbers to simulate a study with 100 observations, half treated by an ‘active’ treatment and half by a ‘control’. He assumed that the ‘active’ treatment was a half-standard deviation better than the ‘control’ (i.e., the effect size = 0.50). He ‘ran’ 1,000 simulated studies and recorded how often various methods were able to reject the untrue null hypothesis. Such simulations are often used in statistics. In fact, my master’s and doctoral theses were similar simulations. The great thing about such simulations is that answers can be obtained rapidly and cheaply, and no humans are harmed in the course of the simulation. His simulation allowed the correlation between the baseline and post score to vary from 0.20 to 0.80.
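To make the idea concrete, here is a minimal sketch of a simulation of this kind, using only the Python standard library. It is not Vickers' code. The baseline mean (50) and s.d. (10), the way the correlation is induced, the fixed critical value of 1.985, and all function names are my own assumptions for illustration, and lower scores are taken to be better.

```python
import math
import random
import statistics

T_CRIT = 1.985  # assumed: approximate two-sided 5% critical t for ~97-98 d.f.


def t_two_sample(a, b):
    """Pooled-variance two-sample t statistic for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled * (1 / na + 1 / nb))


def t_ancova(xa, ya, xb, yb):
    """t statistic for the group effect in y = a + b*x + c*group."""
    na, nb = len(xa), len(xb)
    mxa, mya, mxb, myb = (statistics.mean(v) for v in (xa, ya, xb, yb))
    # Pooled within-group sums of squares and cross-products.
    sxx = sum((x - mxa) ** 2 for x in xa) + sum((x - mxb) ** 2 for x in xb)
    syy = sum((y - mya) ** 2 for y in ya) + sum((y - myb) ** 2 for y in yb)
    sxy = (sum((x - mxa) * (y - mya) for x, y in zip(xa, ya))
           + sum((x - mxb) * (y - myb) for x, y in zip(xb, yb)))
    slope = sxy / sxx
    mse = (syy - slope * sxy) / (na + nb - 3)
    adj_diff = (mya - myb) - slope * (mxa - mxb)
    se = math.sqrt(mse * (1 / na + 1 / nb + (mxa - mxb) ** 2 / sxx))
    return adj_diff / se


def simulate_power(rho, n_per_group=50, n_sims=500, effect=0.5,
                   mean=50.0, sd=10.0, seed=1):
    """Share of simulated trials in which each method rejects the null."""
    rng = random.Random(seed)
    hits = dict.fromkeys(("POST", "CHANGE", "FRACTION", "ANCOVA"), 0)
    noise_sd = sd * math.sqrt(1 - rho ** 2)
    for _ in range(n_sims):
        arms = []
        for benefit in (effect * sd, 0.0):  # active arm first, then control
            base = [rng.gauss(mean, sd) for _ in range(n_per_group)]
            post = [mean + rho * (b - mean) + rng.gauss(0, noise_sd) - benefit
                    for b in base]
            arms.append((base, post))
        (xa, ya), (xb, yb) = arms
        t_stats = {
            "POST": t_two_sample(ya, yb),
            "CHANGE": t_two_sample([b - p for b, p in zip(xa, ya)],
                                   [b - p for b, p in zip(xb, yb)]),
            "FRACTION": t_two_sample([(b - p) / b for b, p in zip(xa, ya)],
                                     [(b - p) / b for b, p in zip(xb, yb)]),
            "ANCOVA": t_ancova(xa, ya, xb, yb),
        }
        for name, t in t_stats.items():
            if abs(t) > T_CRIT:
                hits[name] += 1
    return {name: count / n_sims for name, count in hits.items()}
```

Calling simulate_power(0.2) returns the proportion of significant results for each of the four methods at a baseline/post correlation of 0.20. The exact proportions depend on the seed and on the assumed baseline distribution, so they should not be expected to match the published figures.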
In all cases, Analysis of Covariance (ANCOVA) with baseline as the covariate was the most efficient statistical methodology. Analyzing the change from baseline “has acceptable power when correlations between baseline and post-treatment scores are high; when correlations are low, POST [i.e., analyzing only the post-score and ignoring baseline – AIF] has reasonable power. FRACTION [i.e., percentage change from baseline – AIF] has the poorest statistical efficiency at all correlations.”
[Note: In ANCOVA, one can analyze either the change from baseline or the post treatment scores as the d.v. ‘Change’ or ‘Post’ will give IDENTICAL p-values when baseline is a covariate in ANCOVA.]
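The note above can be checked numerically. The following is a small sketch on made-up data (the sample size, coefficients, and the function name ancova_t are hypothetical). It computes the t statistic for the treatment term twice, once with the post score and once with the change score as the d.v., each time with baseline as the covariate.

```python
import math
import random


def ancova_t(baseline, outcome, group):
    """t statistic for the group term in outcome = a + b*baseline + c*group."""
    n = len(outcome)
    idx = {g: [i for i in range(n) if group[i] == g] for g in (0, 1)}
    mx = {g: sum(baseline[i] for i in idx[g]) / len(idx[g]) for g in idx}
    my = {g: sum(outcome[i] for i in idx[g]) / len(idx[g]) for g in idx}
    # Pooled within-group sums of squares and cross-products.
    sxx = sum((baseline[i] - mx[group[i]]) ** 2 for i in range(n))
    syy = sum((outcome[i] - my[group[i]]) ** 2 for i in range(n))
    sxy = sum((baseline[i] - mx[group[i]]) * (outcome[i] - my[group[i]])
              for i in range(n))
    slope = sxy / sxx
    mse = (syy - slope * sxy) / (n - 3)
    diff = (my[1] - my[0]) - slope * (mx[1] - mx[0])
    se = math.sqrt(mse * (1 / len(idx[1]) + 1 / len(idx[0])
                          + (mx[1] - mx[0]) ** 2 / sxx))
    return diff / se


# Hypothetical data: 40 patients, alternating control (0) and active (1).
rng = random.Random(0)
group = [i % 2 for i in range(40)]
baseline = [rng.gauss(50, 10) for _ in group]
post = [25 + 0.5 * b + rng.gauss(0, 8) - 5 * g for b, g in zip(baseline, group)]
change = [p - b for p, b in zip(post, baseline)]

t_post = ancova_t(baseline, post, group)
t_change = ancova_t(baseline, change, group)
print(t_post, t_change)  # the two values agree up to floating-point rounding
```

The two t statistics agree, and both have N - 3 degrees of freedom, so the p-values are the same. Subtracting baseline from the d.v. only lowers the baseline slope by one; the adjusted treatment difference and the residuals are unchanged.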
As an example of his results, when the correlation between baseline and post was low (i.e., 0.20), the percentage change reached statistical significance only 45% of the time. Next worst was change from baseline, with 51% significant results. Near the top was analyzing only the post score, at 70% significant results. The best was ANCOVA, with 72% significant results.
Furthermore, percentage change from baseline is sensitive to “the characteristics of the baseline distribution.” When the baseline has relatively large variability, he observed that “power falls.”
He also makes two other theoretical observations:
First, one would think that with baseline in both the numerator and denominator, percentage change would be extraordinarily powerful in controlling for treatment group differences at baseline. Instead, Vickers observed that the percentage change from baseline “will create a bias towards the group with poorer baseline scores.” That is, if you’re unlucky (remember that buttered bread tends to fall butter side down, especially on expensive rugs), and the control group had a lower baseline, percentage change will be better for the control group.
Second, because percentage change is a ratio of two normally distributed variables, (post – baseline) divided by baseline, one would expect it to be non-normally distributed. That is, percentage change is often heavily skewed with outliers, especially when low baselines (e.g., near zero) are observed.
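A rough way to see the outlier problem is to simulate it. The sketch below is illustrative only: the baseline distribution (mean 50, s.d. 20), the entry criterion of baseline >= 1, and the noise s.d. of 10 are all assumptions of mine. Because the noise here is symmetric, the sketch shows the heavy tails (outliers) rather than the skew, and it uses kurtosis as the yardstick (about 3 for a normal distribution).

```python
import random
import statistics


def kurtosis(values):
    """Sample kurtosis: fourth central moment over squared variance."""
    m = statistics.fmean(values)
    var = statistics.fmean([(v - m) ** 2 for v in values])
    return statistics.fmean([(v - m) ** 4 for v in values]) / var ** 2


def simulate_changes(n=10_000, seed=7):
    """Return (changes, percent changes) for n simulated patients."""
    rng = random.Random(seed)
    changes, pct_changes = [], []
    while len(changes) < n:
        baseline = rng.gauss(50, 20)
        if baseline < 1:  # hypothetical entry criterion: baseline >= 1
            continue
        post = baseline + rng.gauss(0, 10)  # no true effect, only noise
        changes.append(post - baseline)
        pct_changes.append(100 * (post - baseline) / baseline)
    return changes, pct_changes


changes, pct_changes = simulate_changes()
print("kurtosis of change:        ", kurtosis(changes))
print("kurtosis of percent change:", kurtosis(pct_changes))
```

The raw change stays close to normal, while the percent change picks up extreme values from the handful of patients whose baseline happens to be small.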
I have often observed a third issue with percent change. One often sees unequal variances at different levels of the baseline. Let me briefly illustrate this. Let us say we have a scale from 0 to 4 (0. asymptomatic, 1. mild, 2. moderate, 3. severe, 4. life threatening). At baseline, the lowest we might let enter into a trial is 1. mild. How much can they improve? Obviously they could go from their 1. mild to 0. asymptomatic, or 100% improvement; they could remain the same at mild (or 0% improvement); or they could get worse (2. moderate, or -100%, etc.). What about the 3. severe patients? If the drug works they could go to 2. moderate (i.e., 33% improvement), 1. mild (i.e., 67% improvement) or 0. asymptomatic (i.e., 100% improvement), or get worse: 4. life threatening (i.e., -33%). If you start out near zero (e.g., mild), a 1 point change is a 100% change, so you get a large s.d. If you start high (e.g., severe), the same 1 point change is far smaller, a 33% change. That is, percent change breaks another assumption of the analysis, equal variances (homoscedasticity); the result is heteroscedasticity.
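The arithmetic in this illustration can be written out in a few lines. In this sketch, percent improvement is taken as (baseline - post) / baseline x 100, so positive means better. The assumption that patients at every baseline level are equally likely to improve by one point, stay the same, or worsen by one point is mine, chosen only to hold the absolute change constant.

```python
import statistics

SCALE = {0: "asymptomatic", 1: "mild", 2: "moderate",
         3: "severe", 4: "life threatening"}


def pct_improvement(baseline, post):
    """Percent improvement from baseline; positive = better (lower score)."""
    return 100.0 * (baseline - post) / baseline


def spread_by_baseline(levels=(1, 2, 3)):
    """S.d. of percent improvement when the absolute change is -1, 0 or +1."""
    spread = {}
    for baseline in levels:
        pcts = [pct_improvement(baseline, baseline + delta)
                for delta in (-1, 0, 1)]
        spread[baseline] = statistics.pstdev(pcts)
    return spread


for level, sd in spread_by_baseline().items():
    print(f"baseline {level} ({SCALE[level]}): s.d. of % improvement = {sd:.1f}")
```

The same one-point movements produce a spread for mild patients that is three times the spread for severe patients, which is the heteroscedasticity described above.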
Theoretically, with percentage change one would expect: 1) an over-adjustment for baseline differences, 2) non-normality, marked by outliers, and 3) heteroscedasticity.
To get percent change, Vickers recommends “ANCOVA [on change from baseline – AIF] to test significance and calculate confidence intervals. They should then convert to percentage change by using mean baseline and post-treatment scores.” I have a very large hesitation about computing ratios of means. In arithmetic it is a truism that a mean of ratios (e.g., mean percent change) is not the same as a ratio of means (e.g., mean change from baseline divided by mean baseline). Personally, I would have suggested computing the percentage change for each observation, descriptively reporting the median, and not reporting any inferential statistics for percent change.
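A tiny made-up example shows how far apart the two summaries can be. The three patients below are hypothetical; each improves by exactly one point, but from very different baselines.

```python
import statistics

# Hypothetical patients: each improves by one point (lower = better).
baseline = [2.0, 5.0, 10.0]
post = [1.0, 4.0, 9.0]

# Percent change computed per patient, then summarized.
pct_change = [100 * (b - p) / b for b, p in zip(baseline, post)]
mean_of_ratios = statistics.mean(pct_change)
median_pct_change = statistics.median(pct_change)

# Percent change computed from the group means.
mean_base = statistics.mean(baseline)
mean_post = statistics.mean(post)
ratio_of_means = 100 * (mean_base - mean_post) / mean_base

print(f"per-patient % change: {pct_change}")
print(f"mean of ratios:       {mean_of_ratios:.1f}%")
print(f"median % change:      {median_pct_change:.1f}%")
print(f"ratio of means:       {ratio_of_means:.1f}%")
```

Here the mean of the individual percent changes is about 26.7%, the median is 20%, and the percent change computed from the means is about 17.6%. All three are legitimate descriptions of the same data, which is why it matters to say which one is being reported.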
In sum, Vickers recommends using ANCOVA and never using percentage change to do the inferential (i.e., p-value) analysis. I further recommend reporting percentage change from baseline only as a descriptive statistic.