18. Percentage Change from Baseline – Great or Poor?

Everything should be made as simple as possible, but no simpler.

- Attributed to Albert Einstein


In my third and fourth blogs I addressed useful ways to present the results of an analysis.  Of course, p-values weren’t among them.  I favored differences in means and, especially, their confidence intervals, when one understands the dependent variable (d.v.).  For those cases where one doesn’t understand the d.v., I recommended dividing the mean difference by its s.d. (i.e., the effect size).  This expresses how many standard deviations apart the means are.

In “9. Dichotomization as the Devils Tool”, I said that transforming the data by creating a dichotomy of ‘winners’ or ‘losers’ (or ‘successes’/’failures’ or ‘responders’/’non-responders’ [e.g., from RECIST for oncological studies]) was a poor way of analyzing data, primarily because it throws away a lot of information and is statistically inefficient.  That is, you need to pump up the sample size: under the best case you’d only have to increase the N by 60%; under other, more realistic cases, you’d have to increase the N fourfold.

Percentage change is another very easy to understand transformation.  In this blog I’ll be discussing a paper by Andrew J. Vickers, “The Use of Percentage Change from Baseline as an Outcome in a Controlled Trial is Statistically Inefficient: A Simulation Study”, in BMC Medical Research Methodology (2001) 1:6.  He states “a percentage change from baseline gives the results of a randomized trial in clinically relevant terms immediately accessible to patients and clinicians alike.”  I mean, what could be clearer than hearing that patients improve 40% relative to their baseline?  Like dichotomies, percentage change has a clear and intuitive intrinsic meaning.

[Note added on 20Apr2013:  I forgot to mention one KEY assumption of the percentage change from baseline: the scale MUST have an unassailable zero point.  Zero must unequivocally be zero.  A zero must be the complete absence of the attribute (e.g., zero pain or freedom from illness).  One MUST not compute anything dividing by a variable (e.g., baseline), unless that variable is measured on a ratio level scale, where zero is zero.  Also see Blog 22.]

I’m not going to go too much into the methodology he used.  He basically used computer generated random numbers to simulate a study with 100 observations, half treated by an ‘active’ and half by a ‘control’.  He assumed that the ‘active’ treatment was a half-standard deviation better than the ‘control’ (i.e., the effect size = 0.50).  He ‘ran’ 1,000 simulated studies and recorded how often various methods were able to reject the untrue null hypothesis.  Such simulations are often used in statistics.  In fact, my master’s and doctoral theses were similar simulations.  The great thing about such simulations is that answers can be obtained rapidly and cheaply, and no humans are harmed in the course of such a simulation.  His simulation allowed the correlations between the baseline and post score to vary from 0.20 to 0.80.

In all cases, Analysis of Covariance (ANCOVA) with baseline as the covariate was the most efficient statistical methodology.  Analyzing the change from baseline “has acceptable power when correlations between baseline and post-treatment scores are high; when correlations are low, POST [i.e., analyzing only the post-score and ignoring baseline - AIF] has reasonable power.  FRACTION [i.e., percentage change from baseline - AIF] has the poorest statistical efficiency at all correlations.”

[Note: In ANCOVA, one can analyze either the change from baseline or the post treatment scores as the d.v.  'Change' or 'Post' will give IDENTICAL p-values when baseline is a covariate in ANCOVA.]
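The algebra behind this note is short.  As a sketch, write Y for the post score, B for the baseline, and G for a 0/1 treatment indicator (the Greek letters are labels chosen here for illustration, not notation from Vickers):

```latex
\begin{align*}
Y_i       &= \alpha + \beta B_i + \gamma G_i + \varepsilon_i \\
Y_i - B_i &= \alpha + (\beta - 1) B_i + \gamma G_i + \varepsilon_i
\end{align*}
```

Subtracting B from both sides only shifts the baseline slope by 1.  The treatment coefficient, the residuals, and hence the standard error and p-value for treatment are unchanged.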

As an example of his results, when the correlation between baseline and post was low (i.e., 0.20), the percentage change was statistically significant only 45% of the time.  Next worst was change from baseline, with 51% significant results.  Near the top was analyzing only the post score, at 70% significant results.  The best was ANCOVA, with 72% significant results.
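A simulation of this kind can be sketched in a few lines of Python.  This is not Vickers’ code.  The baseline mean of 50, the s.d. of 10, the random seed, the function names, and the fixed critical value of 1.985 (roughly the two-sided 5% t value for 97 to 98 degrees of freedom) are all my assumptions, so the percentages it prints will be in the neighborhood of his but will not match them exactly.

```python
import math
import random

# Approximate two-sided 5% critical value of t for about 97-98 df (an assumption
# used here to avoid needing a statistics library for exact p-values).
T_CRIT = 1.985


def mean(xs):
    return sum(xs) / len(xs)


def sum_sq(xs):
    """Sum of squared deviations from the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance."""
    df = len(x) + len(y) - 2
    pooled_var = (sum_sq(x) + sum_sq(y)) / df
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / len(x) + 1 / len(y)))


def residuals(v, b):
    """Residuals of v after a simple linear regression on b (with intercept)."""
    mv, mb = mean(v), mean(b)
    slope = sum((bi - mb) * (vi - mv) for bi, vi in zip(b, v)) / sum_sq(b)
    return [(vi - mv) - slope * (bi - mb) for bi, vi in zip(b, v)]


def ancova_t(y, b, g):
    """t statistic for the group effect in y = a + c*b + d*g + error."""
    ry, rg = residuals(y, b), residuals(g, b)
    sgg = sum(r * r for r in rg)
    d = sum(a * r for a, r in zip(ry, rg)) / sgg
    rss = sum((a - d * r) ** 2 for a, r in zip(ry, rg))
    return d / math.sqrt(rss / (len(y) - 3) / sgg)


def simulate_power(rho, n_per_group=50, effect=0.5, n_sims=1000, seed=1):
    """Share of simulated trials in which each method reaches |t| > T_CRIT."""
    rng = random.Random(seed)
    hits = {"POST": 0, "CHANGE": 0, "FRACTION": 0, "ANCOVA": 0}
    for _ in range(n_sims):
        base, post, group = [], [], []
        for g in (0, 1):  # 0 = control, 1 = active
            for _subject in range(n_per_group):
                zb = rng.gauss(0, 1)
                zy = rho * zb + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
                base.append(50 + 10 * zb)
                post.append(50 + 10 * (zy - effect * g))  # active scores lower
                group.append(g)
        outcomes = {
            "POST": post,
            "CHANGE": [y - b for y, b in zip(post, base)],
            "FRACTION": [(y - b) / b for y, b in zip(post, base)],
        }
        for name, values in outcomes.items():
            t = pooled_t(values[n_per_group:], values[:n_per_group])
            hits[name] += abs(t) > T_CRIT
        hits["ANCOVA"] += abs(ancova_t(post, base, group)) > T_CRIT
    return {name: count / n_sims for name, count in hits.items()}


if __name__ == "__main__":
    for rho in (0.2, 0.5, 0.8):
        print(rho, simulate_power(rho))
```

The ANCOVA t statistic is obtained by first removing the baseline from both the post score and the group indicator, then regressing one set of residuals on the other.  That is just a way to get the group effect from the two-predictor model without a matrix library.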

Furthermore, percentage change from baseline is sensitive to “the characteristics of the baseline distribution.”  When the baseline has relatively large variability, he observed that “power falls.”

He also makes two other theoretical observations:

First, one would think that with baseline in both the numerator and denominator, it would be extraordinarily powerful in controlling for treatment group differences at baseline.  Vickers observed that the percentage change from baseline “will create a bias towards the group with poorer baseline scores.”  That is, if you’re unlucky (remember that buttered bread tends to fall butter side down, especially on expensive rugs), and the control group had a lower baseline, percentage change will be better for the control group.

Second, due to creating a ratio of two normally distributed variables (post – baseline) divided by baseline one would expect the percentage change to be non-normally distributed.  That is, percentage change is often heavily skewed with outliers, especially when low baselines (e.g., near zero) are observed.

I have often observed a third issue with percent change.  One often sees unequal variances at different levels of the baseline.  Let me briefly illustrate this.  Let us say we have a scale from 0 to 4 (0. asymptomatic, 1. mild, 2. moderate, 3. severe, 4. life threatening).  At baseline, the lowest we might let enter into a trial is 1. mild.  How much can they improve?  Obviously they could go from their 1. mild to 0. asymptomatic or 100% improvement; they could remain the same at mild (or 0% improvement); or they could get worse (2. moderate or -100%, etc.).  What about the 3. severe patients?  If the drug works they could go to 2. moderate (i.e., 33% improvement), 1. mild (i.e., 67% improvement) or 0. asymptomatic (i.e., 100% improvement) or get worse, to 4. life threatening (i.e., -33%).  If you start out near zero (e.g., mild), a 1 point change is a 100% change, so you get a large s.d.  If you start high (e.g., severe), a 1 point change is far smaller, a 33% change.  That is, percent change breaks another assumption of the analysis, equal variances; the result is heteroscedasticity.
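To put numbers on this, here is a small sketch.  It assumes, purely for illustration, that at every baseline level a patient is equally likely to worsen by 1 point, stay the same, or improve by 1 point; the function name and that distribution are hypothetical, not data from any trial.

```python
import statistics

# Hypothetical assumption: the same spread of point changes at every baseline.
POINT_CHANGES = [-1, 0, 1]  # worsen by 1, no change, improve by 1


def pct_change_sd(baseline, point_changes=POINT_CHANGES):
    """S.d. of percent change when the point changes are divided by baseline."""
    pct = [100 * change / baseline for change in point_changes]
    return statistics.pstdev(pct)


for baseline in (1, 2, 3):
    print(f"baseline {baseline}: s.d. of percent change = {pct_change_sd(baseline):.1f}")
```

The point changes are identical in every row, yet the s.d. of the percent change at a baseline of 1 is three times the s.d. at a baseline of 3.  The unequal variance comes entirely from dividing by the baseline.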

Theoretically one would expect with percentage change: 1) an over adjustment of baseline differences, 2) non-normality, marked with outliers, and 3) heteroscedasticity.

To get percent change, Vickers recommends “ANCOVA [on change from baseline - AIF] to test significance and calculate confidence intervals.  They should then convert to percentage change by using mean baseline and post-treatment scores.”  I have a very large hesitation in computing ratios of means.  In arithmetic it is a truism that a mean of ratios (e.g., mean percent change) is not the same as a ratio of means (e.g., mean change from baseline divided by mean baseline).  Personally, I would have suggested computing the percentage change for each observation and descriptively reporting the median and not reporting any inferential statistics for percent change.
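A tiny made-up example shows how far apart the two can be.  The three patients below are hypothetical; each drops by exactly 1 point, but from very different baselines.

```python
import statistics

# Hypothetical data: three patients who each drop by exactly 1 point.
baseline = [2, 4, 10]
post = [1, 3, 9]

# Percent change computed per patient, then summarized.
pct = [100 * (y - b) / b for b, y in zip(baseline, post)]
mean_of_ratios = statistics.mean(pct)
median_pct = statistics.median(pct)

# Percent change computed from the group means.
mean_base = statistics.mean(baseline)
ratio_of_means = 100 * (statistics.mean(post) - mean_base) / mean_base

print(pct)
print(mean_of_ratios, median_pct, ratio_of_means)
```

The per-patient changes are -50%, -25% and -10%, so the mean percent change is about -28.3% and the median is -25%, while the change in means divided by the mean baseline is -18.75%.  Same data, three different ‘percent changes’.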

In sum, Vickers recommends using ANCOVA and never using percentage change to do the inferential (i.e., p-value) analysis.  I further recommend reporting percentage change from baseline only as a descriptive statistic.

This entry was posted in Analysis of Covariance, assumptions, Effect Size, heteroscedasticity, non-normality, percentage change from baseline.

10 Responses to 18. Percentage Change from Baseline – Great or Poor?

  1. Zaixiang Tang says:

    I have a question: if we use Percentage Change from Baseline as the endpoint variable, and use the baseline as the covariate, then run the ANCOVA, is that ok?
    Another question: what is the “least square mean percent change from baseline”? I am not sure about this; please give me a suggestion.

    • To reply to Zaixiang’s questions:

      Your first question is to use as a dependent variable the percentage change from baseline or (100*(Y-B)/B) and B as a covariate, where Y is the endpoint score and B is the baseline. According to Dr. Vickers and my blog, the ANCOVA on percentage change would have poorer power and all the other issues mentioned: 1) an over adjustment of baseline differences, 2) non-normality, marked with outliers, and 3) heteroscedasticity.

      I don’t think I mentioned the “least square mean percent change from baseline”. As a general answer, a ‘least square mean’ of anything can be obtained by standard methods. It would be the simple mean if no covariate(s) were used or the estimated mean from the linear model when covariates were used. So, if you used ANCOVA on the percent change from baseline with baseline as the covariate, the analysis program would yield a least square estimate. But you’d still face the above three issues and poorer power.

      Nevertheless, I would still suggest using ANCOVA on (Y-B) or Y, and reporting the C.I. and p-values from that analysis. Then I’d compute descriptive statistics on the percent change from baseline (e.g., N, mean, median, s.d., min and max) for each treatment group and perhaps on the difference.

  2. Simon Wilkinson says:

    Dear Allen. To set the scene, I am not a stat or a biostat. We are treating patients with secondary progressive multiple sclerosis on a “compassionate basis” with an experimental drug – something that is allowed in our country (NZ). The number of patients is very small, about 15. Each patient has their own unique set of symptoms. We are using an MS specific QoL patient reported questionnaire (the MSQLI) to obtain baseline and then 3 monthly data as one means of gauging treatment effect in the absence of biomarkers – one of the challenges of treating this indication. We have been looking at the effect in each patient by using PCFB. In some components of the MSQLI a reduction in score is improvement and in other components, an increase in score is improvement. As a lay person an immediate issue arises. A baseline score of 1 (bad) versus a 3 mth score of 7 (much improved) equals a PCFB of 600%. For a different component a baseline of 7 (bad) versus a 3 mth score of 1 (much improved) equals a PCFB of -85%. This seems wrong! Subsequent ‘googling’ on the issue reveals the apparent minefield of PCFB!! Fundamentally we are interested in how treatment is impacting each patient as opposed to an overall effect in a larger population. Can you suggest an appropriate approach? Sincere thanks.

  3. Fergus Guppy says:


    I have read your blog, and it ties in with what I was doing, but I am unsure about the interpretation of % change. I have calculated it as suggested in the Vickers (2001) article as (BASELINE - POST)/BASELINE * 100, but have done it for each case as you suggest and then established a mean percentage change. However, the values seem to be inverted, i.e., when the average difference between groups is positive (the post test value is therefore greater than the baseline measure) the percentage change is coming out as negative and vice versa. Would it be acceptable to then multiply these by -1 so that the directions are the same?

    Thanks in advance, this blog was very useful so far.

    • It sounds like an increasing score indicates a worsening prognosis and a lower score indicates an improvement. It is ALWAYS reasonable to compute percentage change as 100*(Baseline – Post)/Baseline OR 100*(Post – Baseline)/Baseline. Either one is mathematically IDENTICAL to the other multiplied by -1. I’ll leave the algebraic proof to you or your 15 year old daughter. In general, for interval level data, one can ALWAYS linearly transform ANY parameter X to X’: X’ = aX + b, where a is not zero. In this case: a = -1 and b = 0. The same applies to change from baseline: post – baseline (e.g., weight gain for premature infants’ growth) or baseline – post (e.g., weight loss for adult diet efficacy). See my blog 22 where I made similar statements. Don’t forget to comment: “The scales were reflected so a positive number indicates improvement.”

      [P.S. I changed your positive to negative per your errata comment, which I deleted.]

  4. René dePont Christensen says:

    Dear Allen,

    Thank you for an interesting blog. The issue with percentage change from baseline is rearing its ugly head ever so often. Being a statistician, I am repeatedly confronted, by non-statisticians, with the statement that using this outcome is easier to understand and that the controversy is based on statistical “religion”. It is interesting how non-stats have a (mis-)conception of statisticians relying on religious belief, when in fact it’s all down to the absolute truth of mathematics. But alas absolute truth is a concept only appreciated by mathematicians, SIC…
    My preferred agnostic way of handling the issue is to log-transform outcomes and baselines, and do an analysis of covariance. However, this is only appropriate when the effect is expected to be multiplicative. The latter point being a clinical rather than statistical pre-analysis consideration which is often ignored. If the effect is assumed to be additive, then the percentage change may be easy to understand but completely irrelevant, and the appropriate analysis would be analysis of covariance of the raw outcome with baseline as a covariate. Unfortunately the lack of a strong positive correlation between “easy to understand” and “valid” is not often appreciated.

    • René, thank you for your insightful comment and for taking the time to comment. I am curious as to the “statistical ‘religion’”. I am not sure about the absolute truth of mathematics, especially with dirty data, but I am sure of what I see. When we run simulations, we see that percentage change has poorer empirical power relative to ANCOVA or a simple pre-post comparison. Not religion, just empirical observations. Scientists, of all people, should appreciate data. OTOH, if you want to cite statistical dogma, then 1) over-adjustment of baseline differences when the baseline score was low, 2) marked non-normality, with outliers, and 3) heteroscedasticity (unequal group variances) should suffice. As I suggested in my blog, I’d still give the client their percent change, as a descriptive statistic, but use the more powerful statistics (e.g., your suggested ANCOVA) as the inferential metric.

      I noticed you suggested a log-transformation of your d.v. That is often a very useful way to analyze data. However, I believe a key issue from your comment might be whether we are analyzing the data in the metric which the client uses. If the client only wants to report after a dichotomous transformation, do we want to do the key analysis on a log-transform of the original metric? If they insist on focusing their report on percentage change and you had done your analysis on pre-post change with a baseline covariate, is their focus correct? If they only report arithmetic means, rather than geometric means, is a log-transform appropriate? To be honest, I don’t know the correct answer. If I were a purist, if a client only wanted to report percent change, then I would only do the percent change in my analysis. In America we often say, ‘the customer is always right’.

      If I were a fish monger and a tourist customer wanted to buy a fresh fish, but I noticed that they were going to be on the road for twelve more hours, I might suggest they buy some ice. If they refuse, then if they have stinky fish, it is their problem, not mine. If the customer offered me a meal of that smelly fish (i.e., offered co-authorship or a citation), I’d thank them but politely refuse.

      Perhaps it might be best to consult with your client before you analyze the data. Ask them how much they expect their pre- and post-scores to correlate, and then ask the client if they would want their power to be 45% or 72%. That is, use the empirical power calculations from Vickers (2001). I completely agree that the issue of “’easy to understand’ and ‘valid’ is not often appreciated.” However, it is your client’s data and ultimately their report. We can only educate our clients so much. They have to report their conclusions to their ‘clients’, who often don’t understand the metric which the scientist used. That’s why I personally like reporting the ‘effect size’, the non-centrality estimate, but that’s another blog.

  5. Rocco says:


    What about the case where you have just a single group of subjects measured at two time-points (say, baseline and follow-up)? This often occurs in medical studies, as you know, but suppose you have the following model at each time, for each subject:

    Observation = (true mean value) + (some error term)

    Then, is percent change still not useful for each subject?

    • If I were to make this into a blog, I’d title it: Simple, but Simple Minded.

      With regard to this specific design and percentage change, you can compute the mean difference between baseline and post. Alternatively you could compute the difference between baseline and post divided by baseline (assuming you have a ratio level scale) and present the median percent change. [Note: for reasons mentioned in my blog, the individual percentage changes tend to have highly positively skewed distributions, so I would use medians rather than means.]

      But what does it mean?

      A single group measured at two time points with an intervention in between is a very poor design. Many effects could cause a true mean change from baseline to post. Unfortunately, it is typically not only the (medical) treatment. For example, natural changes in people (e.g., spontaneous remission, natural healing, regression to the mean), season, selection of subjects (sick patients come to doctors because they are ill – at any later time point they aren’t as ill), subjects saying the nice doctor helped them, etc. (and there is a very long list of potential other reasons to explain the difference). Most of these alternative (non-treatment) factors bias the results to make the second observation appear better. The single group pre-post is a truly horrible design. This is not a true experimental design. Campbell and Stanley referred to this design as a Pre-Experimental Design or a quasi-experimental design (see page 7 of their book).

      The very first study I professionally analyzed was a 4 week drug intervention in depression. Yes, the patients treated with our medication changed 13 points. It was ONLY because we also ran randomized and blinded patients treated with placebo who had a mean change of 7, that we could deduce that our drug had a 6 point drug-treatment effect. Without the placebo group, the 13 points could have been solely a placebo (or similar) effect.

      Unfortunately, most experimentalists can make NO credible interpretation as to why the one-group pretest-posttest design [percentage] difference is what it is. One could say, ‘we saw a 13 point difference … or a 31% median percentage improvement relative to baseline’. But there is a major leap from saying ‘there was a change’ to saying that ‘we saw a change due to the treatment’. They typically put two disjoint effects together in the sentence and make such an implication, such as ‘the 31 patients, WHO RECEIVED TREATMENT X, had a 13 point difference … .’

      Unfortunately, for many in the medical device industry this is their favorite type of design. But then again, they would seldom hire a ‘real’ statistician to review their study. They typically use students who have only had a single Stat course to analyze their data. They don’t like to be told their head of clinical operations is incompetent or they are too cheap to run a real study.

      Again, the One-Group Pretest-Posttest study is not a real experiment.
