If you hear hoof beats, the first thing you should NOT look for is unicorns.
I’d like to thank Rob Musterer, President of ER Squared, for posting a reference to a 2009 paper by Ling Zhang and Han Kun, ‘How to Analyze Change from Baseline: Absolute or Percentage Change’. The Zhang and Kun paper was in opposition to a paper by Vickers, referenced in my Blog 18. Percentage Change from Baseline – Great or Poor? The Zhang and Kun paper was written as one of the ‘D-level Essays in Statistics’ at Dalarna University in Sweden. It was co-signed by 3 faculty members. [Post-publish note: I received a kind reply from one of the faculty at the Dalarna University in Sweden. He explained, “A D-level essay is an independent student work to be done to obtain a degree named “magister”. It is between Bachelor and Master, yet on an undergraduate level.”] The paper was very well written, with both theoretical and empirical results. It used a simulation which appears to have been well executed. Unfortunately, I disagree with a number of the assumptions underlying their theory and simulation, and therefore with the generalizability of their conclusions.
The authors make a number of points.
1. The non-normality and heteroscedasticity seen when analyzing percentage change have little influence. I totally agree (see my Blog 7. Assumptions of Statistical Tests).
2. They state a ‘rule of thumb’ that the correlation between the baseline and change from baseline should be 0.75. They gave a handful of examples, including a set of 5 patients who were given captopril, with measurements made immediately before and after the administration of captopril. It should be noted that the actual data set included 15 patients. Nevertheless, using only the 5 patients presented, I observed a pre-, post-score correlation of 0.91. The correlation between baseline and change from baseline was 0.19, and with percentage change it was 0.08. In most (all?) parts of their essay, they did simulations with r=0.75. It is my observation that a 0.75 pre-, post-score correlation is unusually (pathologically?) high. In the captopril trial it appears that the post-treatment measurements were taken within hours (perhaps less?) of the initial measurement. Personally, I’m more used to the final score being measured six months to two years after the baseline. Unless the baseline were a very stable medical characteristic, I would expect a more reasonable correlation, like 0.3 to 0.4. The authors point out that the larger the correlation, the smaller the standard deviation (variance) of the change score (e.g., when r=1.00, the s.d. of change will be zero). The 0.91 correlation explained 83% of the post-score variability. This is extraordinarily high!
As a side note, in dealing with repeated measurements (e.g., a pre-score and a set of post-scores), a model I’ve used frequently is the AR(1), in which a coefficient (like a correlation) is raised to a power equal to the number of steps separating the two measurements. For example, if the Week 1 v 2 (and 2 v 3, and 3 v 4, etc.) measurements have a similar (auto)correlation of 0.5, then the AR(1) model would estimate that the correlation between measurements taken 2 periods apart (e.g., 1 v 3 or 2 v 4) would be 0.5*0.5 or 0.25. The correlation between measurements 3 weeks apart (e.g., 1 v 4) would be 0.5*0.5*0.5 or 0.125. Therefore, this frequently used model would postulate that measurements taken immediately after one another would be maximally correlated, but measurements taken months or years apart would have very low pre- and post-score correlations.
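The AR(1) arithmetic above can be sketched in a few lines of Python. This is only an illustration of the "raise the lag-1 correlation to the power of the lag" rule; the function name `ar1_correlation` and the 26-step example are my own assumptions, not anything taken from the Zhang and Kun essay.

```python
def ar1_correlation(rho: float, lag: int) -> float:
    """Correlation implied by an AR(1) structure between two
    measurements that are `lag` equally spaced steps apart."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return rho ** lag


# Lag-1 correlation of 0.5, as in the example in the text.
# The 26-step lag is a hypothetical "half a year of weekly visits".
for lag in (1, 2, 3, 26):
    print(f"lag {lag:>2}: implied correlation = {ar1_correlation(0.5, lag):.3g}")
```

Under this model the implied correlation drops off geometrically, which is why a pre-, post-score correlation as high as 0.75 looks suspicious when the two measurements are many steps apart.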
In sum, the assumed correlation of 0.75 between the pre- and the post-scores might be pathological in long- (or medium-) term clinical trials. However, this is an empirical question. If a statistician liked this model, they could suggest that “percent change be the primary metric in the statistical models, if the pre- and post-score correlation was 0.75 or greater. If the correlation was lower, then absolute change would be used.”
3. They deduced “From equation (5), we know that, in order to simulate a dataset such that R <1 [percent change has greater power than absolute change – AIF], we should let the percentage change have a large mean and small standard deviation (page 6).” In a simulated example of a ‘Case that percentage change has higher statistical power’, they started with a normally distributed baseline with a mean of 200 and an s.d. of 20. Based on this, there will not be any baselines near zero. In fact, about 95% of the baselines would fall between 160 and 240 (the mean ± 2 s.d.), so even the low end is much, much higher than zero. They set the pre-, post- correlation to 0.75 [Note: it might have been the pre- and percentage change correlation]. They also forced the percentage change to have a 50% ‘improvement’ with an s.d. of 1%. They simulated using the percentage change and back-calculated absolute change. That is, the data are intrinsically log-normal. They concluded, “We see that, the value of the test statistic R increases as the standard deviation of P [percentage change – AIF] increases. Although R [R is the ratio of change relative to percent change t-test statistics; hence values < 1 indicate percentage change is superior – AIF] increases, it is still less than 1. In this case, we prefer percentage change to absolute change.” In their simulations, they varied the s.d. for the percentage change from 1% to 20%. In the ‘worst case’ situation, the percentage change had a mean of 50% and a 95% range of about 10% to 90%. Even in that case, it appeared that the results always favored analyzing percentage change.
They were trying to find cases where percentage change would be best. They found them. One case had a percentage change of 50% with a 95% range of about 48% to 52%. It was this pathologically small s.d. of their percentage change which enabled them to find this example. Again, such cases may exist, and you might even see one in your own data set. But I feel it is like hearing hoof beats and declaring that unicorns exist. I have never seen any data like this, with EVERY patient having almost exactly 50% (±2%) improvement – EVER!
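To make the setup in point 3 concrete, here is a rough Python sketch of that kind of simulation. It is not the authors' code. It assumes one-sample t statistics, draws the percentage change independently of the baseline (so it ignores their 0.75 correlation), and back-calculates absolute change as baseline times percentage change. The function names, sample size, and seed are all hypothetical choices of mine.

```python
import math
import random
import statistics


def one_sample_t(values):
    """One-sample t statistic against a null mean of zero."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))


def t_ratio(sd_pct, n=50_000, seed=0):
    """Ratio R = t(absolute change) / t(percentage change).

    Assumptions (mine, for illustration only):
      * baseline ~ Normal(200, 20), as described in the text
      * percentage change ~ Normal(0.50, sd_pct), independent of baseline
      * absolute change is back-calculated as baseline * percentage change
    R < 1 means the percentage-change t statistic is the larger one.
    """
    rng = random.Random(seed)
    baseline = [rng.gauss(200.0, 20.0) for _ in range(n)]
    pct = [rng.gauss(0.50, sd_pct) for _ in range(n)]
    absolute = [b * p for b, p in zip(baseline, pct)]
    return one_sample_t(absolute) / one_sample_t(pct)


# The two ends of the s.d. range mentioned in the text: 1% and 20%.
for sd_pct in (0.01, 0.20):
    print(f"s.d. of percentage change = {sd_pct:.0%}: R = {t_ratio(sd_pct):.3f}")
```

Under these assumptions the result is nearly built in: multiplying a tightly controlled percentage change by a variable baseline can only add noise to the absolute change, so R should come out near 0.2 at an s.d. of 1% and just under 1 (about 0.97) at an s.d. of 20%. The sketch says nothing about data that were not generated this way.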
They then changed their focus from percentage change to absolute change. They chose a change of 100 (a 50% change from the baseline of 200) with an s.d. of 5 to 40 and a correlation of 0.75. In all their examples they observed a superiority of absolute change relative to percentage change! Let me repeat that: in their example, absolute change was superior to percentage change. I would guess, from their Figure 5, that when the s.d. was 5 or 10, 95% of the cases had a superior power for absolute change relative to percentage change (i.e., in only 5% was percentage change superior). The larger the s.d., the greater the superiority of absolute change. For example, for an s.d. of 40, 100% of the samples had greater power for absolute change relative to percentage change. They do point out that the amount of relative improvement was small. That is, the distributions of their test statistics were within 3.5% of one another.
4. They did some simulation work demonstrating that the s.d. of the change score is affected by the size of the correlation. The smaller the correlation (e.g., 0.3 relative to 0.75), the larger the s.d. for the change score. They also selected ten datasets and observed a median correlation of 0.71.
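The link between the correlation and the s.d. of the change score follows from the usual variance-of-a-difference formula, Var(post − pre) = Var(pre) + Var(post) − 2·r·s.d.(pre)·s.d.(post). A minimal sketch follows; the function name is mine, and the pre- and post-score s.d.s of 20 are an assumption borrowed from the baseline s.d. in their simulation, not values reported for the change-score work.

```python
import math


def sd_of_change(sd_pre: float, sd_post: float, r: float) -> float:
    """S.d. of (post - pre), given each score's s.d. and their correlation r."""
    variance = sd_pre**2 + sd_post**2 - 2.0 * r * sd_pre * sd_post
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))


# Assumed s.d. of 20 for both the pre- and the post-score (illustrative only).
for r in (0.3, 0.4, 0.75, 0.91, 1.0):
    print(f"r = {r:.2f}: s.d. of change = {sd_of_change(20.0, 20.0, r):.2f}")
```

With equal pre- and post-score s.d.s, the s.d. of change shrinks to zero as r approaches 1.00; if the two s.d.s differ, it shrinks to their difference instead.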
Yes, it is possible to find cases where percentage change is a more powerful parameter (relative to absolute change or post-score). Based on their findings, one would need a) a baseline score with a very high mean and a very small standard deviation, b) a very high correlation between the pre- and post-scores (at least 0.75), and c) almost no variability in the percentage change parameter (e.g., 50% +/-2%). When they based their simulation on an r=0.75 and a 50% change from baseline, but varied the post-score variability, there was only a small difference between analyzing percentage change and absolute change. Nevertheless, absolute change was consistently superior (in > 95% of samples in their best case and 100% in the others).
The chance that a randomized clinical trial would gain anything from analyzing percentage change is very, very remote. You must have 1) a huge pre-, post-score correlation (e.g., the post-score measured before the pre-score can change), which is never seen in long-term trials; 2) very high baselines with little chance of low scores (e.g., never a 5-point rating scale); 3) parameters which are log-normal, so that ratios are a natural parameterization; and 4) it would help if the s.d. of the post-score were quite low. This is like seeing real unicorns loose in Central Park. The simulations used might be applicable someplace, but not to pharmaceutical or device clinical trials.