“A theory has only the alternative of being wrong. A model has a third possibility – it might be right but irrelevant.” Manfred Eigen

In my last stats course I was amazed to hear my teacher announce that if we did not like our results, all we needed to do was change our level of confidence. In short, fib. This time to ourselves.

***

In my last blogs I pointed out the very striking limitations of the p-value. I also noted a correspondence between the p-value and the confidence interval. The correspondence is one-to-one: if the p-value is < 0.05, then the 95% confidence interval must exclude zero, and if the p-value is > 0.05, then the 95% confidence interval must include zero. [Note: One could test whether the statistic equals any arbitrary number (e.g., H_{o}: μ_{difference} = 1). However, that hypothesis can be rewritten as H_{o}: μ_{difference} – 1 = 0, returning us to a comparison against zero.]
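To make this one-to-one correspondence concrete, here is a small Python sketch (standard library only). The ten weight-loss differences and the df = 9 critical t value of 2.262 are my own hypothetical illustration, not data from any study. The t-test decision and the "CI excludes zero" decision always come out the same:

```python
import math
import statistics

def ci_and_test(diffs, t_crit):
    """95% CI for a mean difference, plus the equivalent t-test decision.

    t_crit is the two-sided 97.5th-percentile t value for df = n - 1
    (hardcoded here to stay stdlib-only; scipy.stats.t.ppf could compute it).
    """
    n = len(diffs)
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)   # standard error of the mean
    lower, upper = mean - t_crit * se, mean + t_crit * se
    t_stat = mean / se                            # test of H0: mu_difference = 0
    significant = abs(t_stat) > t_crit            # same decision as p < 0.05
    excludes_zero = lower > 0 or upper < 0
    return (lower, upper), significant, excludes_zero

# Hypothetical weight-loss differences (pounds) for 10 subjects; df = 9,
# so the two-sided 5% critical t value is 2.262.
diffs = [12, 18, 9, 22, 15, 11, 20, 14, 17, 13]
(lo, hi), sig, excl = ci_and_test(diffs, t_crit=2.262)
print(f"95% CI: ({lo:.1f}, {hi:.1f})  significant={sig}  excludes zero={excl}")
```

The two booleans agree for any data set: p < 0.05 exactly when the 95% CI excludes zero.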

Let us assume that we can rescale the parameter so that a positive difference indicates improvement. This is something we can always do. For example, if we were looking at diet effects on weight after ten months, we could subtract each subject’s final weight from their initial weight and call the difference ‘Weight Loss’. In contrast, if we were interested in weight increases, we would subtract each subject’s initial weight from their final weight and call the difference ‘Weight Gain’.
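That rescaling amounts to nothing more than a sign flip. A trivial sketch, with a hypothetical 180-to-165 pound subject:

```python
# Rescale so that a positive difference means improvement.
# The subject's weights are hypothetical.
initial_weight, final_weight = 180.0, 165.0     # pounds

weight_loss = initial_weight - final_weight     # positive = improvement in a diet study
weight_gain = final_weight - initial_weight     # positive = improvement in a weight-gain study

print(weight_loss)  # 15.0
print(weight_gain)  # -15.0
```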

The p-value only states that the difference excludes zero. The confidence interval on the difference tells the scientific community how much it differs. Keeping with our weight loss example, a significant p-value would only say that the difference is non-zero. Is it 1 ounce? 5 pounds? 30 pounds? The p-value will not tell you. The real information comes from the confidence interval and the mean. Let us say that the difference between the diet and a sham diet had a mean and 95% CI of 15 pounds (2 to 28 pounds). Then we know, with 95% confidence, that the diet effect must be a weight loss of at least 2 pounds; our best guess is about 15 pounds; and it could be as much as 28 pounds.

I used pounds, rather than kilograms, in the above example because most Americans have an intuitive understanding of pounds. Most of us are concerned with our weight. A two pound reduction is pretty trivial, but it’s better than nothing. A 15 pound difference sounds pretty good. A 28 pound weight loss might sound terrific. It is this intuitive understanding of the parameter (pounds) which makes us capable of appreciating the importance of this gedanken study. You should also note that both the lower and upper ends of the confidence interval are important. The lower end indicates the minimum reasonable value of the treatment effect. The upper end indicates the maximum reasonable value, the maximum reasonable clinical effect. In this case, the upper end (28 pounds) could indicate a really powerful effect.

Let us say instead that the mean and 95% CI were 2 pounds (0.25 to 3.75 pounds). That is, we ran a very large and well-controlled study. The lower end certainly excludes zero, but is it meaningful? No! The average effect was 2 pounds, again a rather small improvement, but perhaps better than nothing. The upper end of the CI was 3.75 pounds. Again, pretty meager. I would conclude that while the results were better than nothing, the diet intervention was rather ineffective. To put it another way, it was statistically significant, but not clinically significant. For a ten month weight loss intervention, I personally would want to see the diet offering the possibility of at least a 5 pound weight loss. Hmm, after ten months, perhaps at least 10 pounds. A maximal effect of three and three-quarter pounds would make me want to pass this ten month diet up. [Note: One TV ad touted a ‘clinically proven’ “average of 3.86 lbs of fat loss over an 8-week university study.”]

Let me now take one last mean and CI: 28 pounds (-2 to 58 pounds). This could come from a small-sample, exploratory study. Perhaps this was the first Phase IIa trial by a small company. Examining the CI, we see that the difference could include zero. Oh my god, the results were not statistically significant. Doom? Chalk the treatment up as ineffective and try something new? Well, the treatment could be ineffective; the effect might be zero or even a slight weight gain (the lower end of the CI was negative, a gain of two pounds). But our best guess is that the treatment effect was 28 pounds, a very nice effect. The upper end was 58 pounds, a very, very large potential effect. What would I conclude? While the study was not statistically significant, the treatment might have a very large effect. I would strongly suggest that the ‘scientists’ hire someone to adequately plan the next trial; their first attempt was inadequate. They failed to run an adequately sized trial. [Yes, I would blame them for delaying the product’s eventual acceptance. They wasted time (e.g., a year) and resources, when they should have gotten conclusively positive results had they run, perhaps, another half-dozen patients. Any competent statistician could ‘knock off’ a power analysis within 30 minutes. So the issue is never cost, but their incompetence. See upcoming blog – ‘What is a Power Analysis?’] They might have a very promising diet. To hark back to previous blogs, a non-statistically significant result does not mean a non-clinically important result. If, and only if, the upper end of the confidence interval is below a clinically important value can we conclude a true lack of effect.
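The reasoning behind all three examples can be summarized in a few lines. A sketch, where the 5 pound threshold stands in for a minimal clinically important difference (my own choice, from the discussion above) and the three intervals are the examples from this blog:

```python
# Compare a 95% CI for the treatment effect against a minimal
# clinically important difference (MCID). The 5-pound MCID and the
# three example intervals come from the weight-loss discussion above.

def interpret(lower, upper, mcid):
    """Classify a 95% CI (lower, upper) against a clinical threshold."""
    statistically_sig = lower > 0            # CI excludes zero
    if upper < mcid:
        # Even the most optimistic end of the CI is clinically trivial.
        return "ineffective: best case is below the MCID"
    if statistically_sig:
        return "statistically significant, possibly clinically important"
    # CI includes zero but also large effects: the study was underpowered.
    return "inconclusive: rerun with an adequately sized (powered) trial"

for ci in [(2, 28), (0.25, 3.75), (-2, 58)]:
    print(ci, "->", interpret(*ci, mcid=5))
```

Note that the second interval is statistically significant yet still lands in the "ineffective" bucket, while the third is non-significant yet lands in "inconclusive", not "ineffective": exactly the distinction the three examples make.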

In sum, the lower end of the confidence interval is useful for saying that the effect differs from zero (i.e., no effect at all) and what the minimum effect is. The mean is our best guess of the effect. The upper end indicates what the maximum effect could be. I will say more about this in my future blog “Accepting the null hypothesis”.

One reason the weight loss example was useful is that most (75%) adults in the West are concerned with their weight. They understand what a 5 or 10 pound weight loss means.

However, in most clinical research, even experts have a less than intuitive grasp of what a minimally important clinical effect would be. What do we do? I suggest you read my next blog ‘Meaningful ways to determine the adequacy of a treatment effect when you lack an intuitive knowledge of the d.v.’