*Everything should be made as simple as possible, but no simpler.*

*- Attributed to Albert Einstein*

***

In my third and fourth blogs I addressed useful ways to present the results of an analysis. Of course, p-values weren’t among them. I favored differences in means and, especially, their confidence intervals, when one understands the dependent variable (d.v.). For those cases where one doesn’t understand the d.v., I recommended dividing the mean difference by its s.d. (i.e., the effect size). This tells you how many standard deviations apart the means are.
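As a minimal sketch of that effect size calculation: the scores below are made up for illustration, and using the pooled within-group s.d. as the divisor is an assumption, since other choices of s.d. are possible.

```python
from math import sqrt
from statistics import mean, stdev

def effect_size(treated, control):
    """Mean difference divided by the pooled within-group s.d. (an assumed choice)."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * stdev(treated) ** 2
                  + (n2 - 1) * stdev(control) ** 2) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)

# Hypothetical scores, made up purely for illustration.
treated = [12, 15, 11, 14, 13]
control = [10, 12, 9, 11, 13]

d = effect_size(treated, control)
print(f"The means are {d:.2f} standard deviations apart")
```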

In “9. Dichotomization as the Devils Tool”, I said that transforming the data by creating a dichotomy of ‘winners’ or ‘losers’ (or ‘successes’/‘failures’ or ‘responders’/‘non-responders’ [e.g., from RECIST for oncology studies]) was a poor way of analyzing data, primarily because it throws away a lot of information and is statistically inefficient. That is, you need to pump up the sample size: under the best case you’d *only* have to increase the N by 60%; under other, more realistic cases you’d have to increase the N fourfold.

Percentage change is another very easy to understand transformation. In this blog I’ll be discussing a paper by Andrew J. Vickers, “The Use of Percentage Change from Baseline as an Outcome in a Controlled Trial is Statistically Inefficient: A Simulation Study”, in BMC Medical Research Methodology (2001) 1:6. He states “a percentage change from baseline gives the results of a randomized trial in clinically relevant terms immediately accessible to patients and clinicians alike.” I mean, what could be clearer than hearing that patients improve 40% relative to their baseline? Like dichotomies, percentage change has a clear and intuitive intrinsic meaning.

[Note added on 20Apr2013: I forgot to mention one KEY assumption of the percentage change from baseline: the scale MUST have an unassailable zero point. Zero must unequivocally be zero. A zero must be the complete absence of the attribute (e.g., zero pain or free of illness). One MUST NOT compute anything dividing by a variable (e.g., baseline), unless that variable is measured on a ratio level scale, where zero is zero. Also see Blog 22.]

I’m not going to go too much into the methodology he used. He basically used computer-generated random numbers to simulate a study with 100 observations, half treated by an ‘active’ and half by a ‘control’. He assumed that the ‘active’ treatment was a half standard deviation better than the ‘control’ (i.e., the effect size = 0.50). He ‘ran’ 1,000 simulated studies and recorded how often various methods were able to reject the untrue null hypothesis. Such simulations are often used in statistics. In fact, my master’s and doctoral theses were similar simulations. The great thing about such simulations is that answers can be obtained rapidly and cheaply, and no humans are harmed in the course of such a simulation. His simulation allowed the correlations between the baseline and post score to vary from 0.20 to 0.80.
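For readers who want to try this kind of simulation, here is a rough sketch. It is not Vickers’ code. The baseline mean of 50 and s.d. of 10, the direction of the treatment effect, the hand-rolled t statistics, and the fixed critical value of 1.985 (an approximation for roughly 97 to 98 degrees of freedom) are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean

T_CRIT = 1.985  # assumed: approx. two-sided 5% critical t for ~97-98 d.f.

def t_two_sample(x1, x0):
    """Equal-variance two-sample t statistic."""
    n1, n0 = len(x1), len(x0)
    m1, m0 = mean(x1), mean(x0)
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m0) ** 2 for v in x0)
    return (m1 - m0) / sqrt(ss / (n1 + n0 - 2) * (1 / n1 + 1 / n0))

def t_ancova(y1, b1, y0, b0):
    """t statistic for the group effect in: post = a + slope*baseline + g*group."""
    n1, n0 = len(y1), len(y0)
    my1, mb1, my0, mb0 = mean(y1), mean(b1), mean(y0), mean(b0)
    # Pooled within-group sums of squares and cross-products.
    sxx = sum((b - mb1) ** 2 for b in b1) + sum((b - mb0) ** 2 for b in b0)
    sxy = (sum((b - mb1) * (y - my1) for b, y in zip(b1, y1))
           + sum((b - mb0) * (y - my0) for b, y in zip(b0, y0)))
    syy = sum((y - my1) ** 2 for y in y1) + sum((y - my0) ** 2 for y in y0)
    slope = sxy / sxx
    mse = (syy - slope * sxy) / (n1 + n0 - 3)
    adj_diff = (my1 - my0) - slope * (mb1 - mb0)
    return adj_diff / sqrt(mse * (1 / n1 + 1 / n0 + (mb1 - mb0) ** 2 / sxx))

def simulate_power(rho, n_sims=1000, n_per_group=50, effect=0.5, seed=1):
    """Share of simulated trials in which each method rejects the null."""
    rng = random.Random(seed)
    hits = {"POST": 0, "CHANGE": 0, "FRACTION": 0, "ANCOVA": 0}
    for _ in range(n_sims):
        groups = []
        for shift in (effect * 10, 0.0):  # active group first, then control
            base = [rng.gauss(50, 10) for _ in range(n_per_group)]
            post = [50 + shift + rho * (b - 50)
                    + sqrt(1 - rho ** 2) * rng.gauss(0, 10) for b in base]
            groups.append((base, post))
        (b1, y1), (b0, y0) = groups
        chg1 = [y - b for y, b in zip(y1, b1)]
        chg0 = [y - b for y, b in zip(y0, b0)]
        stats = {
            "POST": t_two_sample(y1, y0),
            "CHANGE": t_two_sample(chg1, chg0),
            "FRACTION": t_two_sample([c / b for c, b in zip(chg1, b1)],
                                     [c / b for c, b in zip(chg0, b0)]),
            "ANCOVA": t_ancova(y1, b1, y0, b0),
        }
        for name, t in stats.items():
            hits[name] += abs(t) > T_CRIT
    return {name: count / n_sims for name, count in hits.items()}

power = simulate_power(rho=0.2)
for name, share in power.items():
    print(f"{name:9s} {share:.0%} significant")
```

Calling `simulate_power` with other values of `rho` (say 0.5 or 0.8) shows how the ranking of the methods shifts with the baseline/post correlation. The exact percentages depend on the assumed setup and seed, so they should not be expected to reproduce the published figures.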

In all cases, Analysis of Covariance (ANCOVA) with baseline as the covariate was the most efficient statistical methodology. Analyzing the change from baseline “has acceptable power when correlations between baseline and post-treatment scores are high; when correlations are low, POST [i.e., analyzing only the post-score and ignoring baseline - AIF] has reasonable power. FRACTION [i.e., percentage change from baseline - AIF] has the poorest statistical efficiency at all correlations.”

[Note: In ANCOVA, one can analyze either the change from baseline or the post treatment scores as the d.v. 'Change' or 'Post' will give IDENTICAL p-values when baseline is a covariate in ANCOVA.]
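The reason is a one-line rearrangement. As a sketch, write Y for the post score, B for the baseline and T for the treatment indicator (the symbols α, β, γ and ε are notation introduced here, not taken from the paper):

```latex
\begin{aligned}
Y     &= \alpha + \beta B + \gamma T + \varepsilon \\
Y - B &= \alpha + (\beta - 1) B + \gamma T + \varepsilon
\end{aligned}
```

Only the baseline slope changes (from β to β − 1). The treatment effect γ and the residuals ε are the same in both models, so the standard error, confidence interval and p-value for treatment are the same too.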

As an example of his results, when the correlation between baseline and post was low (i.e., 0.20), the percentage change was statistically significant only 45% of the time. Next worst was change from baseline, with 51% significant results. Near the top was analyzing only the post score, at 70% significant results. The best was ANCOVA, with 72% significant results.

Furthermore, percentage change from baseline “is sensitive in the characteristics of the baseline distribution.” When the baseline has relatively large variability, he observed that “power falls.”

He also makes two other theoretical observations:

First, one would think that with baseline in both the numerator and denominator, it would be extraordinarily powerful in controlling for treatment group differences at baseline. Vickers observed that the percentage change from baseline “will create a bias towards the group with poorer baseline scores.” That is, if you’re unlucky (remember that buttered bread tends to fall butter side down, especially on expensive rugs), and the control group had a lower baseline, percentage change will be better for the control group.

Second, because percentage change is a ratio of two normally distributed variables, (post – baseline) divided by baseline, one would expect it to be non-normally distributed. That is, percentage change is often heavily skewed with outliers, especially when low baselines (e.g., near zero) are observed.
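A quick way to see the outlier problem is to compare the tail heaviness of the plain change score with that of the percentage change on the same simulated data. This is only a sketch: the baseline distribution (mean 20, s.d. 10, keeping only baselines above 1), the post-score model and the use of excess kurtosis as the yardstick are all assumptions chosen for illustration.

```python
import random
from statistics import fmean

def excess_kurtosis(xs):
    """Excess kurtosis: about 0 for a normal distribution, large for heavy tails."""
    m = fmean(xs)
    m2 = fmean((x - m) ** 2 for x in xs)
    m4 = fmean((x - m) ** 4 for x in xs)
    return m4 / m2 ** 2 - 3

rng = random.Random(0)

# Assumed setup: many baselines sit close to zero, but all are above 1.
baseline = [b for b in (rng.gauss(20, 10) for _ in range(10_000)) if b > 1]
post = [10 + 0.5 * b + rng.gauss(0, 8) for b in baseline]

change = [y - b for y, b in zip(post, baseline)]
pct_change = [100 * c / b for c, b in zip(change, baseline)]

kurt_change = excess_kurtosis(change)
kurt_pct = excess_kurtosis(pct_change)
print(f"excess kurtosis, change from baseline: {kurt_change:.2f}")
print(f"excess kurtosis, percentage change:    {kurt_pct:.2f}")
```

The change score stays close to normal, while dividing by baselines near zero produces a handful of extreme percentages that dominate the tails.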

I have often observed a third issue with percent change: unequal variances at different levels of the baseline. Let me briefly illustrate this. Let us say we have a scale from 0 to 4 (0. asymptomatic, 1. mild, 2. moderate, 3. severe, 4. life threatening). At baseline, the lowest we might let enter into a trial is 1. mild. How much can they improve? Obviously they could go from their 1. mild to 0. asymptomatic, or 100% improvement; they could remain the same at mild (or 0% improvement); or they could get worse (2. moderate or -100%, etc.). What about the 3. severe patients? If the drug works they could go to 2. moderate (i.e., 33% improvement), 1. mild (i.e., 67% improvement) or 0. asymptomatic (i.e., 100% improvement), or get worse: 4. life threatening (i.e., 33% worse). If you start out near zero (e.g., mild), a 1 point change is 100% and you get a large s.d. If you start high, a 1 point change would be far smaller, a 33% change. That is, percent change breaks another assumption of the analysis, equal variances (homoscedasticity); it produces heteroscedasticity.
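The arithmetic for the 0 to 4 scale can be tabulated in a few lines. This is just an illustration of the example above; the function name `pct_improvement` is made up here.

```python
def pct_improvement(baseline, post):
    """Percentage improvement from baseline on a scale where lower is better."""
    return 100 * (baseline - post) / baseline

# Every possible post score (0-4) for each baseline severity allowed into the trial.
for baseline in (1, 2, 3):
    row = {post: round(pct_improvement(baseline, post)) for post in range(5)}
    print(f"baseline {baseline}: {row}")
```

The same 1 point move is worth 100 percentage points for a mild patient but only about 33 for a severe one, so the spread of percent change shrinks as the baseline rises.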

Theoretically one would expect with percentage change: 1) an over adjustment of baseline differences, 2) non-normality, marked with outliers, and 3) heteroscedasticity.

To get percent change, Vickers recommends “ANCOVA [on change from baseline - AIF] to test significance and calculate confidence intervals. They should then convert to percentage change by using mean baseline and post-treatment scores.” I am very hesitant to compute ratios of means. In arithmetic it is a truism that means of ratios (e.g., mean percent change) are not the same as ratios of means (e.g., mean change from baseline divided by mean baseline). Personally, I would have suggested computing the percentage change for each observation and descriptively reporting the median and not reporting any inferential statistics for percent change.
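A tiny example shows how far apart the two summaries can be. The three (baseline, post) pairs are hypothetical, chosen so that every patient drops by exactly 1 point.

```python
from statistics import mean, median

# Hypothetical (baseline, post) pairs, made up purely for illustration.
pairs = [(2, 1), (4, 3), (10, 9)]
baselines = [b for b, _ in pairs]
posts = [p for _, p in pairs]

# Mean of ratios: percent change for each observation, then summarize.
pct_changes = [100 * (p - b) / b for b, p in pairs]
mean_of_ratios = mean(pct_changes)
median_pct_change = median(pct_changes)

# Ratio of means: mean change from baseline divided by mean baseline.
ratio_of_means = 100 * (mean(posts) - mean(baselines)) / mean(baselines)

print(f"mean of ratios: {mean_of_ratios:.1f}%")
print(f"median:         {median_pct_change:.1f}%")
print(f"ratio of means: {ratio_of_means:.1f}%")
```

Here the mean of the individual percent changes is roughly -28%, while the ratio of the means is roughly -19%, even though every patient changed by the same amount.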

In sum, Vickers recommends using ANCOVA and never using percentage change to do the inferential (i.e., p-value) analysis. I further recommend reporting percentage change from baseline only as a descriptive statistic.

I have a question: if we use percentage change from baseline as the endpoint variable, use the baseline as the covariate, and then run the ANCOVA, is that OK?

Another question: what is the “least square mean percent change from baseline”? I am not sure about this; please give me a suggestion.

To reply to Zaixiang’s questions:

Your first question is to use as a dependent variable the percentage change from baseline or (100*(Y-B)/B) and B as a covariate, where Y is the endpoint score and B is the baseline. According to Dr. Vickers and my blog, the ANCOVA on percentage change would have poorer power and all the other issues mentioned: 1) an over adjustment of baseline differences, 2) non-normality, marked with outliers, and 3) heteroscedasticity.

I don’t think I mentioned the “least square mean percent change from baseline”. As a general answer, a ‘least square mean’ of anything can be obtained by standard methods. It would be the simple mean if no covariate(s) were used or the estimated mean from the linear model when covariates were used. So, if you used ANCOVA on the percent change from baseline with baseline as the covariate, the analysis program would yield a least square estimate. But you’d still face the above three issues and poorer power.

Nevertheless, I would still suggest using ANCOVA on (Y-B) or Y, and reporting the C.I. and p-values from that analysis. Then I’d compute descriptive statistics on the percent change from baseline (e.g., N, mean, median, s.d., min and max) for each treatment group and perhaps on the difference.

Dear Allen. To set the scene, I am not a stat or a biostat. We are treating patients with secondary progressive multiple sclerosis on a “compassionate basis” with an experimental drug – something that is allowed in our country (NZ). The number of patients is very small, about 15. Each patient has their own unique set of symptoms. We are using an MS-specific QoL patient-reported questionnaire (the MSQLI) to obtain baseline and then 3-monthly data as one means of gauging treatment effect in the absence of biomarkers – one of the challenges of treating this indication. We have been looking at the effect in each patient by using PCFB. In some components of the MSQLI a reduction in score is improvement and in other components, an increase in score is improvement. As a lay person an immediate issue arises. A baseline score of 1 (bad) versus a 3 mth score of 7 (much improved) equals a PCFB of 600%. For a different component a baseline of 7 (bad) versus a 3 mth score of 1 (much improved) equals a PCFB of -85%. This seems wrong! Subsequent ‘googling’ on the issue reveals the apparent minefield of PCFB!! Fundamentally we are interested in how treatment is impacting each patient as opposed to an overall effect in a larger population. Can you suggest an appropriate approach? Sincere thanks.

I elevated your question to a full blog. See Blog 22.

Hi,

I have read your blog, and it ties in with what I was doing, but I am unsure about the interpretation of % change. I have calculated it as suggested in the Vickers (2001) article as (BASELINE-POST)/BASELINE * 100, but have done it for each case as you suggest and then established a mean percentage change. However, the values seem to be inverted, i.e., when the average difference between group is positive (the post test value is therefore greater than the baseline measure) the percentage change is coming out as negative and vice versa. Would it be acceptable to then multiply these by -1 so that the directions are the same?

Thanks in advance, this blog was very useful so far.

It sounds like an increasing score indicates a worsening prognosis and a lower score indicates an improvement. It is ALWAYS reasonable to compute percentage change as 100*(Baseline – Post)/Baseline OR 100*(Post – Baseline)/Baseline. Switching from one to the other is mathematically IDENTICAL to multiplying by -1. I’ll leave the algebraic proof to you or your 15-year-old daughter. In general, for interval level data, one can ALWAYS linearly transform ANY parameter X to X’: X’ = aX + b, where a is not zero. In this case: a = -1 and b = 0. The same applies to change from baseline: post – baseline (e.g., weight gain for premature infants’ growth) or baseline – post (e.g., weight loss for adult diet efficacy). See my blog 22 where I made similar statements. Don’t forget to comment: “The scales were reflected so a positive number indicates improvement.”

[P.S. I changed your positive to negative per your errata comment, which I deleted.]