*Science is organized common sense where many a beautiful theory was killed by an ugly fact. – Thomas Huxley*

*It really is a nice theory. The only defect I think it has is probably common to all philosophical theories. It's wrong. – Saul Kripke*

***

I received the following question (abbreviated slightly) regarding blog 18, *Percentage Change from Baseline – Great or Poor?*

What about the case where you have just a single group of subjects measured at two time-points (say, baseline and follow-up)? This often occurs in medical studies, as you know.

Then, is percent change still not useful for each subject?

With regard to this specific design and percentage change, you can compute the mean difference between baseline and post. Alternatively, you could compute the difference between baseline and post divided by baseline (assuming you have a ratio-level scale) and present the median percent change. [Note: for reasons mentioned in my blog, the individual percentage changes tend to have a highly positively skewed distribution, so I would use medians rather than means.]
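As a minimal sketch, the two summaries might be computed like this (the data and every variable name are invented for illustration only):

```python
import statistics

# Made-up baseline and follow-up values for a single-arm study.
baseline = [42.0, 55.0, 38.0, 61.0, 47.0, 50.0, 44.0]
post     = [35.0, 40.0, 36.0, 30.0, 45.0, 28.0, 41.0]

# Mean raw change (post minus baseline).
changes = [p - b for p, b in zip(post, baseline)]
mean_change = statistics.mean(changes)

# Per-subject percent change from baseline; report the MEDIAN, since
# individual percentage changes tend to be highly positively skewed.
pct_changes = [100.0 * (p - b) / b for p, b in zip(post, baseline)]
median_pct_change = statistics.median(pct_changes)
```

Note that the median of the per-subject percent changes is generally not the same number as the mean change re-expressed as a percent of the mean baseline.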

But what does it mean?

A single group measured at two time points with an intervention in between is an internally flawed (horrible) design. Many effects could cause a true mean change from baseline to post; unfortunately, the (medical) treatment is typically not the only one. For example: natural changes in people (e.g., spontaneous remission, natural healing, regression to the mean), season, selection of subjects (sick patients come to doctors because they are ill – at any later time point they aren't as ill), subjects saying the nice doctor helped them, etc. (and there is a very long list of other potential reasons to explain the difference). Most of these alternative (non-treatment) factors bias the results to make the second observation appear better. The single-group pre-post is a truly horrible design. It is not a true experimental design; Campbell and Stanley referred to it as a Pre-Experimental Design (see page 7 of their book).

The very first study I professionally analyzed was a 4-week drug intervention in depression. Yes, the patients treated with our medication (Amoxapine) changed 13 points. Fortunately, the study was randomized and blinded, with patients treated with drug or placebo. It was only because we included placebo patients, who had a mean change of 7, that we could deduce that our drug had a 6-point treatment effect. Without the placebo group, the 13 points could have been solely a placebo effect or any number of similar effects.

Unfortunately, most experimentalists can make NO credible interpretation as to why the one-group pretest-posttest [percentage] difference is what it is. One could say, 'we saw a 13 point difference … or a 31% median percentage improvement relative to baseline'. But there is a major leap from saying 'there was a change' to saying 'we saw a change due to the treatment'. They typically put two disjoint facts together in one sentence and imply a causal link, such as 'the 31 patients, *WHO RECEIVED TREATMENT X,* had a 13 point difference … .'

Unfortunately, as the commenter noted, this is a frequently used design, especially in the medical device industry. For such a design to work, the scientists MUST believe that patients are static and unchanging – a patently and demonstrably false assumption. But then again, such companies would seldom hire a 'real' statistician to review their study. They typically use students who have had only a single Stat course to analyze their data. They don't like being told that their head of clinical operations is incompetent or that they are too cheap to run a real study.

Again, the One-Group Pretest-Posttest study is NOT a real experiment; it is little more than a set of testimonials (a One-Group (informal Pretest-) Posttest 'study' with much missing data). You could compute the change and percentage change, but they cannot be interpreted, hence any conclusion – and the data analysis itself – is meaningless. The ONLY good that can come of such a trial is the promise of doing a real trial.

Dear Dr Fleishman,

As a non-statistician PI (read the joke at the start of the blog – very true, unfortunately) I am frequently challenged by stat-related assessments, and here is one example:

A certain test can indicate that a patient progresses on treatment, is stable, or does not progress. The only way to evaluate the result is to compare the pre- and post-treatment results. Literature says that a 50% decline is good, while a 50% increase is bad; no clinical significance in between. The assay itself has a 10% analytical CV and ~8% inter-individual CV (still trying to understand the latter, as ~20% of healthy controls have a positive test result). Apparently, mathematical coupling of pre and post makes % change incorrect, although it is widely used. I am trying to understand (a) how to correctly evaluate a 50% change in each patient; and (b) what the allowable error for such a measurement should be. Is there a way to do this?

Many thanks,

Victor Levenson

O K , s i n c e y o u a r e a P I , I w i l l t y p e m y r e p l y s l o w l y. 😉

Unfortunately, your question compares apples to lug wrenches. I will assume that the assay has a universally agreed-upon zero point (otherwise a ratio is not possible to compute) for either the frequency percent or the ratio percentage.

Your key clinically significant parameter is a pre-post ratio (relative to pre) with a trichotomy: a 50% increase is bad; a 50% decrease is good; and anything between -50% and +50% has no clinical significance. Let me refer you to my blog on percents (dichotomies), 9. Dichotomization as the Devil's Tool (http://allenfleishmanbiostatistics.com/Articles/2011/12/9-dichotomization-as-the-devils-tool/). To summarize what I said there (more details on that blog):

My objections to such an easy to understand statistic? Let me make a list:

1. Power – you need to enroll more patients into your trial.

2. We throw away interval-level information (hence means) and ordinal-level information (hence medians).

3. The statistical approaches often assume large Ns.

4. The statistical approaches limit the types of analyses.
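For concreteness, the literature's trichotomy (a 50% decline is good, a 50% increase is bad, no clinical significance in between) could be coded as below; the function name, labels, and the choice to include the boundaries are my own assumptions:

```python
def classify(baseline, post):
    """Trichotomize percent change per the literature's rule (assumed):
    a decline of 50% or more -> 'responder'; an increase of 50% or
    more -> 'progression'; anything in between -> 'indeterminate'."""
    pct = 100.0 * (post - baseline) / baseline
    if pct <= -50.0:
        return "responder"
    if pct >= 50.0:
        return "progression"
    return "indeterminate"
```

Note how much information the rule discards: a 49% decline and a 49% increase land in the same bin.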

On the other hand, you are citing the CV, which I believe is the Coefficient of Variation, also called the Relative Standard Deviation (RSD). The CV is the ratio s.d./mean. The CV is great in that it is unit-free. That is, since the s.d. is measured on a certain scale (e.g., the height of a femur might be in centimeters) and the mean is measured in the same unit (cm), the ratio [xxx cm/yyy cm] is no longer in the original unit of measurement (i.e., cm); the CV is often expressed as a percentage (note that this percentage is different from the dichotomy frequency percent).

Let me say this again. The clinically significant frequency percent you described initially is a dichotomy – a nominal level of measurement. The CV is a unit-free continuous measurement – a ratio level of measurement. Apples vs. lug wrenches. To make matters slightly more complicated, the CV is typically computed on single measurements, whereas, at best, you are talking about the difference between the pre and post measurements. To keep the equation simple, I will assume that the pre and post s.d.s are equal. The s.d. of the difference is equal to:

s.d. of difference = sqrt(2*s.d.**2 – 2*r[pre, post]*s.d.**2) = s.d.*sqrt(2*(1 – r[pre, post]))

If we are talking about highly correlated pre and post measurements (e.g., height of femur), then the s.d. of the difference approaches zero. If the correlation approaches zero, then the s.d. of the difference approaches 1.41*s.d. – 41% larger than for a single measurement. With simple algebra, if the pre-post correlation is better than 0.50, then the difference of correlated measurements has a smaller standard deviation than a single measurement. Unfortunately, I have professionally observed that the correlation is typically lower, especially when the measurements are not temporally close (e.g., 6 months apart, as in many trials). Therefore, the s.d. of a difference is typically larger (up to 41% larger) than that of a single measure. Hence the CV of a difference is larger than the CV of a single measurement.
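The formula above is easy to check numerically. A small sketch (function names are mine):

```python
from math import sqrt

def cv(sd, mean):
    """Coefficient of variation: unit-free, often quoted as a percent."""
    return sd / mean

def sd_of_difference(sd, r):
    """s.d. of (post - pre) when pre and post share the same s.d. and
    correlate at r: sqrt(2*sd**2 - 2*r*sd**2) = sd*sqrt(2*(1 - r))."""
    return sd * sqrt(2.0 * (1.0 - r))

# r = 1.0 -> s.d. of the difference is 0 (perfectly repeatable measure)
# r = 0.5 -> same s.d. as a single measurement (the break-even point)
# r = 0.0 -> 1.41x the s.d. of a single measurement
```

At r = 0.5 the two s.d.s coincide, which is exactly the break-even correlation mentioned above.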

You will need to check what the 10% analytical CV, ~8% inter-individual CV, and 20% healthy-controls figures are based on: a raw single measurement, the difference of two measurements, or the frequency of a clinically different percent dichotomy.

Dear Allen,

Thanks much for a “PI-level” response.

The issue here is that the change (50%) is calculated for a single patient; browsing the literature, I found numerous papers where a 50% decline/increase for an individual patient is considered significant enough to establish whether the disease has progressed or not. Usually the comparison groups are pitifully small (15-20 at the most), so the whole premise appears to be in the area of feeling rather than science. In addition, the zero is not hard; about 20% of healthy folks are test-positive; on top of that, a 7x difference in results for different methods of measurement has been reported. Also, as you mentioned, for patients there are usually two measurements, 6-12 mo apart. From the PI point of view, I was planning to recommend multiple measurements (e.g., every month) to improve the result; using the same lab and same test methodology every time; and keeping a portion of the previously tested sample and re-testing it with each incoming one (contemporaneous testing, as you mentioned). Is there anything else that can be done? (Sorry for trying to pick your brain…)

Best regards,

Victor

PS. Read your "Dichotomization…" entry; we used that for a continuous variable, but introduced a gap between the tail values (our statistician's suggestion, which was confirmed experimentally – re-testing of samples with "in the gap" values produced flip-flops around the dichotomization threshold). VL

I did not understand "the groups of comparison are pitifully small (15-20 at the most)". Is that the N seen in the literature's studies? If so, that is pitifully small. If it is the total N, then the N/group is between 7 and 10. I will assume the greater N, 10. If this were the case, you would need a true 66% difference (i.e., between 83% and 17%) to have 80% power to detect a statistically significant difference with a two-sided p-value. 66% is a HUMONGOUS difference, especially with measurements as fluky as yours. [Note: with 10 subjects per group, 66% is the minimum, the minimum!, difference that could be expected to be detected. If the average ( [17+83]/2 ) were different from 50%, you would need a LARGER difference.]
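Power for tiny two-group frequency comparisons can be computed by brute force. The sketch below (all function names are mine) enumerates every possible pair of group outcomes, scores each with a two-sided Fisher's exact test, and sums the probabilities of the significant ones; it illustrates the method rather than re-deriving the 66% figure, whose exact value depends on the test and assumptions used:

```python
from math import comb

def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]:
    the sum of probabilities of all tables with the same margins whose
    probability does not exceed that of the observed table."""
    n1, n2, m = a + b, c + d, a + c
    total = comb(n1 + n2, m)
    prob = lambda x: comb(n1, x) * comb(n2, m - x) / total
    p_obs = prob(a)
    lo, hi = max(0, m - n2), min(m, n1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))

def exact_power(p1, p2, n=10, alpha=0.05):
    """Exact power: weight every possible pair of group outcomes by its
    binomial probability and sum those reaching significance."""
    binom = lambda k, p: comb(n, k) * p**k * (1 - p)**(n - k)
    return sum(
        binom(a, p1) * binom(c, p2)
        for a in range(n + 1)
        for c in range(n + 1)
        if fisher_p(a, n - a, c, n - c) <= alpha
    )
```

With n = 10 per group the power climbs steeply only for enormous differences in rates, which is the point of the note above.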

Second, you are not allowed to form ratios when the scale's zero is not a true zero. Try taking the ratio of today's vs. yesterday's temperature in Fahrenheit, then in Celsius. You will get different numbers (the ratio would be OK only on a scale with a real zero, e.g., Kelvin).
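A two-line demonstration (the temperatures are made up):

```python
# Ratios are meaningless on scales with an arbitrary zero (Fahrenheit,
# Celsius) but meaningful on a true ratio scale (Kelvin).
def f_to_c(f):
    return (f - 32.0) * 5.0 / 9.0

def c_to_k(c):
    return c + 273.15

today_f, yesterday_f = 86.0, 50.0        # i.e., 30 C vs. 10 C

ratio_f = today_f / yesterday_f                                  # 1.72
ratio_c = f_to_c(today_f) / f_to_c(yesterday_f)                  # 3.00
ratio_k = c_to_k(f_to_c(today_f)) / c_to_k(f_to_c(yesterday_f))  # ~1.07
```

Same two days, three different "percent changes" – only the Kelvin ratio is physically meaningful.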

Third, a 7x difference between measurement methods? That is troublesome. It sounds like you have routine outliers.

Fourth, if you did believe the zero is a true zero, albeit measured roughly (see my final comment for a way to mitigate that), you could still analyze the data as continuous (ratio-level) data. To control the influence of outliers, you could transform the data into logarithms. The ratio (division) then becomes a difference (subtraction) in the log metric. You could take the final statistic (e.g., mean difference) and exponentiate it to return to percentages. Logging the data, computing means, and exponentiating the results provides the geometric mean rather than the arithmetic mean. I would then report the CI for the Treated and Control groups, as well as for their difference. For example, which is better: reporting a non-significant 37% of patients who had a greater than 50% improvement in their post measurements, with a CI of -12% to 86%; or reporting a statistically significant geometric mean percentage of 47% with a 95% CI of 18% to 72%? [Note: See objection #1 for frequency percents – 'Power – you need to enroll more patients into your trial.'] And yes, you are still computing the percentage change on single patients. Doing the primary statistics on the continuous data does not prohibit you from also reporting (but not doing the statistical analysis on) your frequency percents. Alternatively, you could report your Chi-square/Fisher's Exact test as a secondary or supportive analysis. You can report the literature's feeling, but base your analyses on the more powerful science!
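Here is a sketch of the log-transform route, on made-up post/baseline ratios (the 1.96 normal critical value is an approximation; a t critical value would give a slightly wider interval at this small N):

```python
import math
import statistics

# Invented post/baseline ratios for treated patients, for illustration.
ratios = [0.45, 0.62, 0.38, 0.55, 0.71, 0.40, 0.50, 0.66]

logs = [math.log(x) for x in ratios]
mean_log = statistics.mean(logs)
geo_mean = math.exp(mean_log)            # geometric mean of the ratios

# Normal-approximation 95% CI on the log scale, exponentiated back.
se = statistics.stdev(logs) / math.sqrt(len(logs))
ci_low = math.exp(mean_log - 1.96 * se)
ci_high = math.exp(mean_log + 1.96 * se)

# Express as percent change from baseline; negative means improvement.
pct_change = (geo_mean - 1.0) * 100.0
```

The subtraction on the log scale turns back into a ratio after exponentiation, which is why the summary comes out as a geometric rather than an arithmetic mean.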

Fifth, it sounds like retesting the middle values would bring you back to a dichotomy, rather than trichotomy. BTW, if you retested all the data, even the >50% changes, how often would a clinically significant increase become a clinically significant decrease in symptoms?

Finally, I like the notion of multiple measurements. However, what do you plan on doing with them? You could enter them all into a regression (slope-over-time) analysis, but that is quite difficult with frequency percents [though routine with continuous measurements – see objection #4 ('The statistical approaches limit the types of analyses')]. You could average the post-baseline results, but you would get (on average) the Month 6.5 percent. Instead, to get more stable results, I would take multiple measurements at the key times (e.g., 3 measurements each at baseline, Month 6, and Month 12), then analyze the middle one (the median). Think of BP or even carpentry (measure once, cut twice; measure twice, cut once). Compared with monthly measurements, you would take fewer measurements (9 vs. 13) and save on the number of visits for you, your staff, and your patients.
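A sketch of the measure-thrice-take-the-median idea (all numbers are invented):

```python
import statistics

# Made-up triplicate readings at each key visit.
visits = {
    "baseline": [48.0, 52.0, 47.0],
    "month6":   [33.0, 90.0, 35.0],   # 90.0 is the kind of fluke a median ignores
    "month12":  [30.0, 28.0, 31.0],
}

# Analyze the middle (median) reading per visit: 9 measurements total
# instead of 13 monthly visits, and robust to a single bad assay run.
stable = {visit: statistics.median(vals) for visit, vals in visits.items()}
# stable -> {"baseline": 48.0, "month6": 35.0, "month12": 30.0}
```

Note how the month-6 outlier (90.0) has no effect at all on the reported value, whereas a mean of the triplicate would have been dragged to 52.7.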