Allen I. Fleishman, PhD, PStat®
Did you hear the joke about statisticians? … Probably
“Professor, do I look like a therapist? Go solve your own problems!”, anonymous student
“If I only had a day to live, I’d spend it in a statistics class. That way it would seem longer”, anonymous student.
A guy is flying in a hot air balloon and he’s lost. So he lowers himself over a field and shouts to a guy on the ground:
“Can you tell me where I am, and which way I’m headed?”
After fifteen minutes the guy on the grounds says, “Sure! You’re at 43 degrees, 12 minutes, 21.2 seconds north; 123 degrees, 8 minutes, 12.8 seconds west. You’re at 212 meters above sea level. Right now, you’re hovering, but on your way in here you were at a speed of 1.83 meters per second at 1.929 radians”
“Thanks! By the way, are you a statistician?”
“I am! But how did you know?”
“You took a long time to answer; everything you’ve told me is completely accurate; you gave me more detail than I needed, and you told me in such a way that it’s no use to me at all!”
“Dang! By the way, are you a principal investigator?”
“Geeze! How’d you know that????”
“You don’t know where you are, you don’t know where you’re going. You got where you are by blowing hot air, you start asking questions after you get into trouble, and you’re in exactly the same spot you were a few minutes ago, but now, somehow, it’s my fault!”
“Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from nonpractitioners.” G. O. Ashley
Numbers are like people; torture them enough and they’ll tell you anything.
50% of all citizens of this country have a below average understanding of statistics
“Statistics: the mathematical theory of ignorance” Morris Kline
Statistical Analysis: Mysterious, sometimes bizarre, manipulations performed upon the collected data of an experiment in order to obscure the fact that the results have no generalizable meaning for humanity. Commonly, computers are used, lending an additional aura of unreality to the proceedings.
THE TOP TEN REASONS TO BECOME A STATISTICIAN
Deviation is considered normal.
We feel complete and sufficient. We are “mean” lovers.
Statisticians do it discretely and continuously.
We are right 95% of the time.
We can legally comment on someone’s posterior distribution.
We may not be normal but we are transformable.
We never have to say we are certain.
We are honestly significantly different.
No one wants our jobs.
A statistics professor was completing what he thought was a very inspiring lecture on the importance of significance testing in today’s world. A young nursing student in the front row sheepishly raised her hand and said, “But sir, why do nurses have to take statistics courses?”
The professor thought for a few seconds and replied, “Young lady, statistics save lives!”
The nursing student was utterly surprised and after a short pause retorted, “But sir, please tell us how statistics saves lives!”
“Well,” the professor’s voice grew loud and somewhat angry, “STATISTICS KEEPS ALL THE IDIOTS OUT OF THE NURSING PROFESSION!!!”
Proofs? We don’t need no stinking proofs.
A colleague once told me of being confronted by a doctor at 4 pm on a Friday with “Could you just ‘t and p’ this data by Monday?” David Spiegelhalter, president of the Royal Statistical Society
***
These blogs were written with the nonstatistician in mind, although statisticians could benefit from my thirty plus years of experience consulting for the pharmaceutical/biologic/device industry. It is for those people who have taken at least a single statistics class and use statistics for clinical research in the pharmaceutical/device/biotech industry. Although simple equations will be presented, to make points, math will be kept to the level of a first week in high school algebra. Nor will I present proofs.
I will be making postings on important issues for the users of statistics and insights I’ve made from my many years of experience. I’ll also include ‘tricks’ for running a smaller study. Please start at the bottom of this blog and read upward (starting with 1. Statistic’s dirty little secret).
Feel free to post your thoughts, agreeing or disagreeing (include why you disagree, please). I will post questions or statistically related agreements/disagreements. Interesting (either positively or negatively) comments might be the lead-in for a full post. However, I will attempt to answer all comments within a day. [Note: I use a spam filter, so if your comment is ignored send it to me at my email address allenfleishman (at) comcast.net.] Feel free to ask me a question through a comment. However, I am no Dr. Phil. I will almost never say your approach was the correct one, especially with a typical 4 sentence description of your trial. Even given a well written protocol, I could never guess all possible data perturbations. So I will point out potential issues; most can be anticipated if you read the entire set of blogs.
Blogs I have written are (although I reserve the right to change the blogs and comments after they were initially published):
1. Statistic’s dirty little secret – Published 30Sept2011
1.A. Another View on Testing by Peter Flom, PhD – Published 12July2012
1.B. Am I a nattering nabob of negativism? – Published 23April2017
2. Why do we compute pvalues? – Published 5Oct2011
3. Meaningful ways to determine the adequacy of a treatment effect when you have an intuitive knowledge of the d.v. – Published 12Oct2011
4. Meaningful ways to determine the adequacy of a treatment effect when you lack an intuitive knowledge of the d.v. – Published 19Oct2011
5. Accepting the null hypothesis – Published 30Oct2011
5.A. Accepting the null hypothesis by Randy Gallistel, PhD of Rutgers – Published 23Apr2012
6. ‘Lies, Damned Lies and Statistics’ part 1, and Analysis Plan (an essential tool) – Published 5Nov2011
7. Assumptions of Statistical Tests – Published 11Nov2011
7a. Assumptions of Statistical Test: Ordinal Data – Published 2Aug2012
8. What is a Power Analysis? – Published 28Nov2011
9. Dichotomization as a devil’s tool – Published 10Dec2011
10. Parametric or nonparametric analysis – Why one is almost useless – Published 26Dec2011
11. pvalues by the pound – Published 5Jan2012
12. Significant pvalues in small samples – Published 25Jan2012
13. Multiple observations and Statistical ‘Cheapies’ – Published 12Mar2012
14. Great and Not so Great Designs – Published 22Mar2012
15. Variance, and ttests, and ANOVA, oh my! – Published 9Apr2012
16. Comparing many means – Analysis of VARIANCE? – Published 7May2012
17. Statistical Freebies/Cheapies – Multiple Comparisons and Adaptive Trials without selfimmolation – Published 21May2012
18. Percentage Change from Baseline – Great or Poor? – Published 4Jun2012
19. A Reconsideration of my Biases – Published 25Jun2012
20. Graphs I: A Picture is Worth a Thousand Words – Published 17Aug2012
21. Graphs II: The Worst and Better Graphs – Published 18Sept2012
22. A question on QoL, Percentage Change from Baseline, and CompassionateUsage Protocols – Published 20Apr2013
23. Small N study, to Publish or not – Published 12May2014
24. Simple, but Simple Minded – Published 8Aug2014
25. Psychology I: A Science – Published 20Mar2015
26. Psychology II: A Totally Different Paradigm – Published 25Mar2015
27. Number of Events from Week x to Week y – Published 7Apr2015
18.1 Percentage Change – A Right Way and a Wrong Way – Published 28Aug2015
28. Failure to Reject the Null Hypothesis – Published 7Nov2015
29. Should you publish a nonsignificant result? – Published 22Nov2015
18.2 Percentage Change Revisited – Published 9March2016
30. ‘Natural’ Herbs and Alternative Medicine – Published 25July2016
31. Case History of a Trial – To be Done
I can’t believe schools are still teaching kids about the null hypothesis. I remember reading a big study that conclusively disproved it years ago.
***
To most scientists, the endpoint of a research study is achieving the mystical ‘p < 0.05’, but what does this mean? At the core, it means that one can reject the null hypothesis (H_{o}). Let me use as an example one of the more common studies, a comparison of one treatment (e.g., breakthrough drug) with a standard (e.g., placebo), with the hope of improving (increasing) the average benefit. The null hypothesis is typically of the form H_{o}: μ_{1} = μ_{2}. The alternative hypothesis is typically that they are not the same, H_{A}: μ_{1} ≠ μ_{2}. Let me do a trivial bit of algebra on the H_{o}: μ_{1} – μ_{2} = 0. That is, the difference is zero.
Let me go quickly over the ‘number line’. When we talk about the population mean improvement seen for a Drug A, it will have a reasonable upper and lower limit. There are LDL values, heights, and hemoglobin levels beyond which human life is impossible. You can’t have a human height of 1,000 feet. But any value consistent with human life IS possible. Any value! An LDL mean population value for Drug A of 96.0 is a possibility, so is 96.1 and 96.148900848924104274…, etc. The same would be true for the comparative treatment, e.g., placebo. The difference between Drug A and Placebo is a number of infinite length.
The null hypothesis doesn’t test if the difference is near zero (e.g., Mean_{1} – Mean_{2} < 0.01). Nor does it test if the difference is very near zero (e.g., Mean_{1} – Mean_{2} < 0.00001), nor even the limit as it approaches zero (e.g., Mean_{1} – Mean_{2} < 0.0000 … [a trillion zeros later] … 0001). What is zero? Well zero is zero. Mathematically, the probability that a quantity which can take an infinity of values equals any single value (i.e., the null hypothesis difference is EXACTLY zero) approaches zero. So, is there any treatment which any sapient individual believes is completely and utterly the same as a different treatment? With the possible exception of the field of ESP research, the answer is no. I cannot imagine any comparison of different treatments which might produce no difference, no matter how minuscule. So, mathematically the null hypothesis is meaningless.
This is mirrored by reality in that researchers always do everything in their power to find treatments which are maximally different from standard. For example, the treatments typically use the maximum dose they can safely use or engineers have been working for years on developing the device they want to test. In sum, my best guess is that no scientist has ever believed that their treatment effect is zero.
You might be thinking that statistics is different in that it is much more practical and deals with real world data and issues. A difference of only a small amount (e.g., Mean_{1} – Mean_{2} = 0.00001) wouldn’t be statistically significant. As a proud statistician, I grant that you have a point. Statistics is certainly a real world, practical way to view data. However, a small difference can become statistically significant. The root of this conundrum is hidden in the denominator of all statistical tests. Let me take the simple ttest comparing two sample means: t = (Mean_{1} – Mean_{2})/(s√(2/N)). We are dividing the standardized mean difference, (Mean_{1} – Mean_{2})/s, by √(2/N), a reciprocal function of N. After a bit of algebra, t = [(Mean_{1} – Mean_{2})/s] × √(N/2); that is, the standardized difference is being multiplied by a constant times the square root of N. In other words, as the study sample size increases, given any nonzero difference, the t will increase. As mentioned above, all test statistics are of this form, with a function of the sample size multiplying the observed difference. This applies to nonparametric testing, to Bayesian statistics, comparisons of correlations, variances, skewness, survival analyses, all test statistics.
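The algebra above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: two groups of equal size N, a standardized difference fixed at the 0.00001 used as the example above, and a helper name (t_statistic) that is my own and not from any package.

```python
import math

def t_statistic(effect_size, n_per_group):
    """t for two equal-sized groups.

    (Mean1 - Mean2) / (s * sqrt(2/N)) rearranges to d * sqrt(N/2),
    where d = (Mean1 - Mean2) / s is the standardized difference.
    """
    return effect_size * math.sqrt(n_per_group / 2.0)

# Hypothetical, deliberately tiny standardized difference.
tiny_difference = 0.00001

for n in (100, 10_000, 1_000_000, 100_000_000_000):
    t = t_statistic(tiny_difference, n)
    print(f"N per group = {n:>15,}   t = {t:.4f}")
```

Nothing about the difference changes from row to row; only N does. With these assumed numbers the t passes the familiar 1.96 threshold only at the last, absurdly large, sample size, which is the point: any nonzero difference gets there eventually.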
Let me put it another way, can you imagine any comparison which fails to reject the null hypothesis if the sample size were 100,000 or 10,000,000 or 1,000,000,000? I can’t. The converse is also true, can you imagine a successful trial (rejecting the null hypothesis) when the sample size per group were 2? That is, the ability to reject the null hypothesis is a pure function of N. Even a poorly run study would be significant if you threw enough subjects into it.
At the great risk of boring you to tears and making you say ‘enough already’, I need to say this again: pvalues are a function of N, the sample size, when any difference exists. As I said above, the likelihood that any difference is EXACTLY zero is infinitely small. Let me assume that we are dealing with mean differences of two different samples – as in comparing a control to an experimental group (the dependence of any test statistic [Fisher’s exact test, logistic regression, correlations] on N is still true no matter the statistic). Let me further assume that the mean difference is quite small, a tenth of a standard deviation difference. I shall also assume the typical 2sided test. By manipulating the number of patients (N) I can get almost any pvalue. The following table presents a variety of sample sizes, from ‘nonsignificant’ to very ‘highly significant’.
N  pvalue 
4  0.90 
14  0.80 
31  0.70 
56  0.60 
92  0.50 
143  0.40 
216  0.30 
330  0.20 
543  0.10 
771  0.05 
1331  0.01 
2172  0.001 
3036  0.0001 
3913  0.00001 
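The table can be approximately reproduced with a short Python sketch. The assumptions here are mine, not stated in the post: N is taken as the number of patients per group, the observed difference is taken to be exactly 0.1 standard deviations, and the normal distribution is used in place of the t distribution. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group_for_p(p_value, effect_size):
    """Per-group N at which a two-sided test of an observed standardized
    difference `effect_size` reaches `p_value` (normal approximation).

    From t = d * sqrt(N/2), solving for N gives N = 2 * (z / d)**2.
    """
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return math.ceil(2.0 * (z / effect_size) ** 2)

for p in (0.90, 0.50, 0.10, 0.05, 0.01, 0.001, 0.0001, 0.00001):
    print(f"p = {p:<8g} N per group is roughly {n_per_group_for_p(p, 0.1)}")
```

Because of the normal approximation the results land a few patients below the tabled values rather than matching them exactly, but the pattern is the same: the difference never changes, only N does.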
To repeat myself a last time, pvalues are a function of sample size. They reach ‘significance’ faster (i.e., with smaller sample sizes) when the true difference is larger, but they can always become any level of ‘statistical significance’ as long as the difference is not exactly zero. Statistically, with a large enough N, the null hypothesis will be rejected. [In fact, one main job of a statistician is to determine the N which will give you a statistically significant result.]
This brings me to a second theoretical issue with the null hypothesis heard in all Statistics 101 classes. Given the issues above, one can NEVER accept the null hypothesis. One can only fail to reject it. Sorry about the double negatives. The reason for this is that with a better run study (decreasing the internal variability and/or increasing the sample size), one should eventually reject the null hypothesis. To put things another way, a study which fails to reject the null hypothesis is, in essence, a failed study. The scientists who ran it did not appreciate the magnitude of the relative treatment difference and either failed to control the noise of the study or ran it with an inadequate sample size. If a study failed to reject the null hypothesis, one cannot say that the null hypothesis is true; one can only say that the scientists who designed the study failed.
Another issue is that the null hypothesis is one of many assumptions of the statistical test. There are many other assumptions. For example, the Student ttest comparing two sample means assumes normality, independence of observations, that each observation comes from a similar distribution, equality of variances, etc. If we reject the null hypothesis it could be for reasons other than a false null hypothesis, for example, nonnormality (like outliers). I’ll return to this issue in a future Blog ‘Parametric or nonparametric analysis – assumptions we can live with (or not)’. Statistically, rejecting the null hypothesis might be a failure of the mathematical test’s assumptions.
Finally, let me stress that the near sacred pvalue (i.e., p < 0.05) indicates our ability to reject the null hypothesis. As the null hypothesis is theoretically false, believed by all to be false, and practically false, all statisticians I’ve ever talked to believe that the pvalue is a near meaningless concept. It is the statistician’s job to enable the scientists to reject the null hypothesis (p < 0.05). Fortunately, the power analyses which do this are very quick (i.e., cheap) and very easy to do. Please see a future blog – ‘8. What is a Power Analysis?’
I mentioned above ‘all statisticians … believe that the pvalue is a near meaningless concept’. This ‘Dirty Little Secret’ isn’t new. Everyone who has taken Stat 101 has heard of the Student ttest. ‘Student’, aka William Gosset, said “Statistical significance is easily mistaken for evidence of a causal or important effect, when there is none”, according to an article in Significance (published by the ASA), September 2011. ‘Student’ also said “Similarly, a lack of statistical significance – statistical insignificance – is easily though often mistakenly said to show a lack of cause and effect when in fact there is one.”
To forestall any ambiguity, let me mention that every statistical analysis I’ve ever given to clients has always included pvalues, among other statistics. However, I will discuss why I always include pvalues in the next blog.
Today, I’ll look at how to make and evaluate a good statistical argument. I’m going to base this on the absolutely wonderful book: Statistics as Principled Argument by Robert Abelson. It’s an easy read, and I urge those interested in this stuff to go buy a copy.
The book makes the point of the title: Statistics should be presented as part of a principled argument. You are trying to make a case, and your argument will be better if it meets certain criteria; but which criteria are the right ones?
In Statistics as Principled Argument, Abelson lists five criteria by which to judge a statistical argument. He calls them the MAGIC criteria:
1. Magnitude How big is the effect?
2. Articulation How precisely stated is it?
3. Generality How widely does it apply?
4. Interesting How interesting is it?
5. Credibility How believable is it?
We can tell how big an effect is through various measures of effect size. I can get into some of these in a later article, but some of the common ones are correlation coefficients, the difference between two means, and regression coefficients. Big effects are impressive. Small effects are not. How big is big depends on context, and on what we already know. If we find, for example, that a new diet plan lets people lose (on average) 10 pounds in a month, that’s pretty big. 10 ounces in a month is pretty small. But if it was a diet tested on rats, 10 ounces might be a lot.
Articulation is measured in what Abelson calls Ticks and Buts. A ‘tick’ is a statement, and a ‘but’ is an exception. The more ticks the better, the fewer buts the better. There are also blobs, which are masses of undifferentiated results. Blobs are, as you might have guessed, bad.
Generality refers to how general an effect is. Does it apply to all humans everywhere? That would be very general. Or does it apply only to left handed people who have posted 50 or more articles on AC? That would be pretty specific. Usually, more general effects are of greater value than more specific ones, but you should be sure that the study states how general it is.
Interestingness is very hard to measure precisely, but one way is to say how different the reported effect size is from what we thought it would be. For example, I once read a study that showed that Black people, on average, earn less than Whites. Upsetting, but not interesting. I knew that already, and the size of the difference was large (which I thought it would be) but not huge (which I also knew, because, after all, even the average White person doesn’t earn all that much). But then it went on to say that, while Black men earned a lot less than White men (more than I thought the difference would be), Black women and White women earned almost the same (that’s really interesting! I would have thought that Black women earned much less than Whites!)
Finally, credibility. The harder a result is to believe, the more stringent you have to be about the evidence supporting it. Extraordinary claims require extraordinary evidence.
I wrote my first blog, ‘1. Statistic’s dirty little secret’, in September of 2011, a few years ago. You, my gentle reader, might feel I was a bit of a reactionary or, to quote former Vice President Spiro Agnew about the press, a member of the ‘nattering nabobs of negativism’, one of the ‘hopeless, hysterical hypochondriacs of history’. As this had been my first blog, you might have felt I am a disgruntled statistician, unrepresentative of my field, alone, an extreme maverick, a heretical hermit spouting doomsday, a fool. It is now April of 2017, over five years later. Have I changed my mind? Not in the slightest. Am I alone? Ahh, NO!
I just put down the latest copy (April 2017) of the American Statistical Association’s (ASA) and the Royal Statistical Society’s (RSS) joint journal, Significance. A major article was by Robert Matthews entitled ‘The ASA’s pvalue statement, one year on’. Matthews writes that he noted a major problem with ‘statistically significant’ results in that there is a replication crisis. Statistically significant results are frequently unable to be replicated. Scientists who repeat a study are unable to get significant results. “In nutritional studies and epidemiology in particular, the flipflopping of findings was striking. … [T]he same flipflopping began appearing in large randomised controlled trials of lifesaving drugs.”
He reported that over a year ago “the American Statistical Association (ASA) took the unprecedented step of issuing a public warning about a statistical method … the pvalue.” He went on to say it “was damaging science, harming people – and even causing avoidable deaths.” The article was published in 2016, ‘The ASA’s statement on pvalues: Context, process, and purpose’, American Statistician, 70 (2), 129–133. Matthews quotes “the ASA’s thenpresident Jessica Utts pointed out what all statisticians know: that calls for action over the misuse of pvalues have been made many times before. As she put it, ‘statisticians and other scientists have been writing on the topic for decades’.”
Matthews laments that there has been little to no change in the use of pvalues. “Claims are backed by the sine qua non of statistical significance ‘p < 0.05’, plus a smattering of the usual symptoms of statistical cluelessness like ‘p = 0.00315’ and ‘p < 0.02’.”
There were two comments on Matthews’ article.
The first was by Ron Wasserstein, the executive director of the ASA. He begins: “We concede. There is no single, perfect way to turn data into insight! The only surprise is that anyone believes there is!” “Thus, the leadership of the ASA was keen to join in the battle that Robert Matthews describes and that he and many, many others have long fought … because it is a battle that must be won.” “Matthews was right about a lack of consensus among statisticians about how best to navigate in the post p < 0.05 era.”
The second commentator was David Spiegelhalter, the president of the RSS. He begins: “I have a confession to make. I like pvalues.” Dr. Spiegelhalter emphasizes that the fault lies primarily on bad science, not statistics. “[M]any point out that the problem lies not so much with pvalues in themselves as with the willingness of researchers to lurch casually from descriptions of data taken from poorly designed studies, to confident generalisable inferences.” He adds, “pvalues are just too familiar and useful to ditch (even if it were possible).” I made the same point in my second blog ‘2. Why do we compute pvalues?‘
He then goes on to suggest three things: 1a) When dealing with data descriptions (e.g., exploratory results or secondary and tertiary results) “it may be fine to litter a results section with exploratory pvalues, but these should not appear in the conclusions or abstract unless clearly labeled as such”. 1b) “I believe that drawing unjustified conclusions based on selected exploratory pvalues should be considered as scientific misconduct and lead to retraction or correction of papers.” 2) “A pvalue should only be considered part of a confirmatory analysis, … if the analysis has been prespecified, all results reported, and pvalues adjusted for multiple comparisons, and so on.”
I agree presenting pvalues is unavoidable. I always gave my clients pvalues in my reports. I agree with his dichotomization of exploratory and confirmatory analyses. Exploratory ‘significant’ results should never be included in conclusion/abstract sections. However, it is the job of statisticians to focus clients on better approaches. The emphasis should never be p<0.05, but on a mathematical statement of what the results are, primarily the confidence intervals using metrics understandable by the client and his clients (see blogs 3 and 4).
Dr. Spiegelhalter gave as an example of such misconduct a case where a colleague was “confronted by a doctor at 4 PM on a Friday with ‘Could you just ‘t and p’ this data by Monday?'” He lamented on the rise of automated statistical programming, bypassing trained statisticians. He concluded “We must do our best to help them.”
In sum, my conclusion that the pvalue is frequently an incorrect statistic to emphasize is supported by many, many statisticians and the major statistical associations.
So, why do we compute the pvalue of such an unbelievable theory (H_{o})? Three reasons: tradition, a primarily false belief that pvalues indicate the importance of an effect, and the knowledge that ‘something’ happened in the study.
Tradition
Journals, our colleagues, and regulatory agencies require we present pvalues. For example, the FDA wants to know that our drug treatment is better than doing nothing (e.g., placebo or standard of care). Any difference, no matter how small, would grant approval. Let me say that again, no matter how small! As long as the difference is greater than zero, the drug is judged efficacious.
In the first blog I did a bit of legal sleight of hand. I made the null hypothesis of the form, H_{o}: μ_{1} = μ_{2}. This is the usual twosided test. The FDA requires it, so do many journals. The alternative hypothesis is H_{A}: μ_{1} ≠ μ_{2}, which can be restated as either μ_{1} < μ_{2} or μ_{1} > μ_{2}. It is possible to have a directional or onesided null hypothesis (e.g., H_{o}: μ_{1} < μ_{2}), but this is quite uncommon. The ‘scientific gatekeepers’ use the two sided hypothesis for two reasons. First, it gives a level playing field. All will test the superiority hypothesis (μ_{1} > μ_{2}) by an alpha of 0.025 (with another 0.025 going to an inferiority test), rather than letting some researchers use an alpha of 0.05 and others use 0.025. The second reason is that if the treatment were harmful (inferiority test), they want to know about it. If it were harmful and a onesided test were used, one could not say it was harmful, only that you weren’t able to say, with your inadequately run study, that the drug was useful.
For tradition, I have always included the pvalue in all statistical reports I have done.
Importance
One of the biggest blunders I see made by nonstatisticians is the mistaken belief that if p is < 0.05 then the results are significant or meaningful. They reason that if a pvalue of < 0.05 is (practically or clinically) significant, then a pvalue of < 0.001 is even more significant. They also often make the even worse error of thinking that if the p is > 0.05 the treatment wasn’t useful. These blunders are compounded by the use of the term ‘statistically significant’. Statistical significance only means that it is highly likely that the difference is nonzero, the semimeaningless notion (see ‘1. Statistic’s Dirty Little Secret’).
Statistical significance has very little to do with clinical significance. Very little? Let me qualify: statistical significance, to use a term in logic, is a necessary, but not sufficient, quality for demonstrating clinical significance. An effect which is statistically significant might also be clinically significant. An effect which was unable to achieve statistical significance will need more information (i.e., a cleaner or larger study) to demonstrate clinical significance, although the magnitude of the effect might have the potential to be quite clinically meaningful. A nonsignificant result cannot rule out that the effect is zero or even negative (worse than the alternative). However, it is very possible that when the N is small and the results are not statistically significant, the effect size is nevertheless very, very large (see my next blog ‘Meaningful ways to determine the adequacy of a treatment effect when you have an intuitive knowledge of the d.v.’).
Something happened
The reason statisticians still feel justified in providing their clients with pvalues is that at a minimum they know that if the pvalue is sufficiently low (e.g., < 0.05), they can be certain, with some degree of probability, the difference favors the treatment. What I’m referring to is the confidence interval. While I’m not a Bayesian statistician, I still tell my clients that “with 95% certainty, the true difference excludes zero”. Therefore we know that the treatment is ‘better’ than the standard. Is it better by a millimeter, a mile? The pvalue cannot answer that question, but the pvalue indicates that the experimental treatment is better.
[A classically trained Frequentist statistician would say something like ‘if the study were replicated an infinite number of times, 95% of the confidence intervals so computed would include the true difference’. It goes without saying that a client would, and should, fire me on the spot if I included the latter in a report. Otherwise, I don’t tend to use Bayesian stat.]
I mentioned the article in Significance in my last blog. The author of that article states “Ziliak and McCloskey show that 8 or 9 out of every 10 articles published in the leading journals of science commit the significance mistake – equating significance with a real and important practical effect while at the same time equating insignificance with chance, randomness, no association or causal effect at all.”
I implied above that pvalues might indirectly measure clinical importance. Let us assume that we are doing an analysis in the same exact way for a variety of dependent variables in a single study. Let me further assume that the N’s are identical for all the dependent variables (e.g., no missing data) and we are dealing with parametric (intervallevel) or nonparametric nontied ordinal data. Then parameters which are statistically significant have larger (relative) mean differences in comparison to the nonstatistically significant parameters. To illustrate this, let us use one version of the ttest comparing two means: t = (Mean_{1} – Mean_{2})/(s√(2/N)). We can ignore the √(2/N) term, a constant, as we assumed the Ns were identical for all parameters. If one ttest were significant and another not, it would mean that the (Mean_{1} – Mean_{2})/s term was larger. This term is the mean difference relative to its variability (actually standard deviation). In other words, how many standard deviations different are the two treatments. This is also called by many statisticians the ‘effect size’. Let me rephrase this: if, within a study, one dependent variable has a larger (e.g., statistically significant) ttest relative to another parameter, then its effect size is larger.
A corollary of this is that if within this single study, a pvalue was smaller than another (e.g., one is 0.04 and another 0.003), then the smaller (‘more statistically significant’) pvalue implies a larger effect size (greater relative mean difference). In other words, a smaller pvalue implies a greater clinical effect. This only applies for parameters with identical Ns. If one study of 100,000 patients had a 0.003 pvalue and a second study of 10 patients had the 0.04 pvalue, then it can be demonstrated that the ten patient study indicated a LARGER average effect size. If I were investing in one of the two companies, I’d invest in the one who had the ten patient 0.04 pvalue, not the one hundred thousand patient company.
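The ‘it can be demonstrated’ can be sketched by running the ttest formula backwards. The following Python is a rough illustration only. It assumes the quoted patient counts are per group (the text does not say), it uses the normal distribution in place of the t distribution (which understates the effect for the ten patient study), and the function name is my own.

```python
import math
from statistics import NormalDist

def implied_effect_size(p_value, n_per_group):
    """Standardized difference implied by a two-sided pvalue.

    Inverts t = d * sqrt(N/2) to d = z * sqrt(2/N), with z taken
    from the normal distribution as an approximation to t.
    """
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return z * math.sqrt(2.0 / n_per_group)

large_study = implied_effect_size(0.003, 100_000)  # 'more significant' pvalue
small_study = implied_effect_size(0.04, 10)        # 'less significant' pvalue

print(f"100,000 patients, p = 0.003: effect size of about {large_study:.3f} SD")
print(f"     10 patients, p = 0.04 : effect size of about {small_study:.3f} SD")
```

Under these assumptions the small study implies a difference of roughly nine tenths of a standard deviation, while the huge study implies little more than one hundredth of one.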
I will delve into how to measure clinical meaningfulness in a future blog (‘3. Meaningful ways to determine the adequacy of a treatment effect when you have an intuitive knowledge of the d.v.’). But to give a taste of it, it has to do with confidence intervals – the onetoone alternative to pvalues. In that blog I shall elaborate on the case where the scientists and literature KNOW and completely understand their metric (in my experience a relatively rare event), and when they either don’t understand it or don’t understand it in the current setting (e.g., population of patients, treatment regimen) (see ‘4. Meaningful ways to determine the adequacy of a treatment effect when you lack an intuitive knowledge of the d.v.’). In this second (partial ignorance) case, one can still have a (situationally) simple statistic to discuss treatment effects across completely different metrics.
In addition to pvalues, I always (try to) include a measure of treatment differences and their confidence intervals in all my statistical analyses.
In my next blog I shall discuss the best method of describing results when you understand your dependent variable.
“A theory has only the alternative of being wrong. A model has a third possibility – it might be right but irrelevant.” Manfred Eigen
In my last stats course I was amazed to hear my teacher announce that if we did not like our results, all we needed to do was change our levels of confidence. In short, fib. This time to ourselves.
***
In my last blogs I pointed out the very striking limitations of p-values. I also stated a correspondence between the p-value and the confidence interval. The correspondence is one to one. If a p-value is < 0.05 then the 95% confidence interval must exclude zero, and if the p-value is > 0.05 then the 95% confidence interval must include zero. [Note: One could test whether the statistic was any arbitrary number (e.g., H_{o}: μ_{difference} = 1). However, it could then be rewritten as H_{o}: μ_{difference} – 1 = 0, returning to a comparison against zero.]
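The one-to-one correspondence can be checked with a small sketch. It assumes a simple z-test on a mean difference with a known standard error; the function name and the three (difference, standard error) pairs are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist


def p_value_and_ci(mean_diff, std_error, level=0.95):
    """Two-sided z-test p-value (H0: difference = 0) and the matching CI."""
    norm = NormalDist()
    z = mean_diff / std_error
    p = 2 * (1 - norm.cdf(abs(z)))
    half_width = norm.inv_cdf(1 - (1 - level) / 2) * std_error
    return p, (mean_diff - half_width, mean_diff + half_width)


# Hypothetical differences and standard errors (assumed, not from any study).
for diff, se in [(15.0, 6.6), (2.0, 0.9), (28.0, 15.3)]:
    p, (lower, upper) = p_value_and_ci(diff, se)
    excludes_zero = lower > 0 or upper < 0
    print(f"diff={diff}: p={p:.3f}, 95% CI=({lower:.1f}, {upper:.1f}), "
          f"excludes zero: {excludes_zero}")
    # The p-value is below 0.05 exactly when the 95% CI excludes zero.
    assert (p < 0.05) == excludes_zero
```

Both quantities are built from the same difference and the same standard error, which is why one can always be read off the other.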
Let us assume that we can rescale the parameter so a positive difference indicates improvement. This is something we always can do. For example, if we were looking at diet effects on weight loss after ten months, we can take the difference from the subject’s initial weight minus their last weight and call it ‘Weight Loss’. In contrast, if we were interested in looking at weight increases, we would take the difference from the subject’s last weight minus their initial weight and call it ‘Weight Gain’.
The p-value only states that the difference excludes zero. The confidence interval on the difference tells the scientific community how much it differs. Keeping with our weight loss example, a significant p-value would only say that the difference is nonzero. Is it 1 ounce? 5 pounds? 30 pounds? The p-value would not tell you that. The real information would come from the confidence interval and the mean. Let us say that the difference between the diet and a sham diet had a mean and 95% CI of 15 pounds (2 to 28 pounds). Then we know, with 95% confidence, that the diet effect is a weight loss of at least 2 pounds, our best guess is about 15 pounds, and it could be as much as 28 pounds.
I used pounds, rather than kilograms, in the above example as most Americans have an intuitive understanding of pounds. Most of us are concerned with our weight. A two pound reduction is pretty trivial, but it’s better than nothing. A 15 pound difference sounds pretty good. A 28 pound weight loss might sound terrific. It is this intuitive understanding of the parameter (pounds) which makes us capable of understanding the importance of this gedanken study. You should also note that both the lower and upper values of the confidence interval were important. The lower end indicates the minimum reasonable value of the treatment effect. The upper end indicates the maximum reasonable value of the treatment effect, the maximum reasonable clinical effect. In this case, the upper end (28 pounds) could indicate a really powerful effect.
Let us say that the mean and CI were 2 pounds (0.25 to 3.75 pounds). That is, we used a very large and well-controlled study. The lower end certainly excludes zero, but is it meaningful? No! The average effect was 2 pounds, again a rather small improvement, but perhaps better than nothing. The upper end of the CI was 3.75 pounds. Again pretty meager. I would conclude that while the results were better than nothing, the diet intervention was rather ineffective. To put this another way, it was statistically significant, but not clinically significant. For a ten-month weight loss intervention, I personally would want to see the diet having a possibility of at least a 5 pound weight loss. Hmm, after ten months, perhaps at least 10 pounds. A maximal effect of three and three quarters of a pound would make me want to pass this ten-month diet up. [Note: One TV ad touted a ‘clinically proven’ “average of 3.86 lbs of fat loss over an 8-week university study.”]
Let me now take one last mean and CI: 28 pounds (–2 to 58 pounds). This could be from a small-sample, exploratory study. Perhaps this was the first Phase IIa trial by a small company. Examining the CI, we see that the difference could include zero. Oh my god, the results were not statistically significant. Doom? Chalk the treatment up as ineffective and try something new? Well, the treatment could be ineffective; the effect might include zero or even some weight gain (the lower end of the CI was negative, a gain of two pounds). Our best guess is that the treatment effect was 28 pounds, a very nice effect. The upper end was 58 pounds, a very, very large potential effect. What would I conclude? While the study was not statistically significant, the treatment might have a very large effect. I would strongly suggest that the ‘scientists’ hire someone to adequately plan the next trial; their first attempt was ineffective. They failed to run an adequately sized trial. [Yes, I would blame them for delaying the product’s eventual acceptance. They wasted time (e.g., a year) and resources which should have given conclusive, positive results if they had run, perhaps, another half-dozen patients. Any competent statistician could ‘knock off’ a power analysis within 30 minutes. So the issue is never cost, but their incompetence. See upcoming blog – ‘What is a Power Analysis?’] They might have a very promising diet. To hark back to previous blogs, a non-statistically significant result doesn’t mean a non-clinically important result. If, and only if, the upper end of the confidence interval is below a clinically important value can we determine the true lack of an effect.
In sum, the lower end of the confidence interval is useful in saying that the effect is different from zero (i.e., not ‘no effect at all’) and what the minimum effect is. The mean is our best guess of the effect. The upper end indicates what the maximum effect could be. I will say more about this in my future blog “Accepting the null hypothesis”.
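The reading of the three diet intervals can be summarized in a short sketch. It assumes that positive differences mean improvement and takes 5 pounds as the minimum clinically important weight loss (the personal threshold mentioned above); the function name and the wording of the categories are mine, and the third interval is entered with its lower end at a two pound gain (-2).

```python
def interpret_ci(lower, upper, min_important):
    """Describe a CI on a difference where positive values mean improvement."""
    # Statistical significance: the interval excludes zero.
    significant = lower > 0 or upper < 0
    stat = "statistically significant" if significant else "not statistically significant"

    # Clinical reading: compare both ends with the minimum important effect.
    if upper < min_important:
        clinical = "too small to matter clinically"
    elif lower >= min_important:
        clinical = "clinically important"
    else:
        clinical = "possibly clinically important"
    return f"{stat}, {clinical}"


print(interpret_ci(2, 28, 5))       # 15 pounds (2 to 28)
print(interpret_ci(0.25, 3.75, 5))  # 2 pounds (0.25 to 3.75)
print(interpret_ci(-2, 58, 5))      # 28 pounds (-2 to 58)
```

The second interval is the "statistically significant, but not clinically significant" case, and the third is the non-significant but promising one.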
One reason the weight loss example was useful is that most (75%) adults in the West are concerned with their weight. They understand what a 5 or 10 pound weight loss means.
However, in most clinical research, even experts have a less than intuitive grasp of what a minimally important clinical effect would be. What do we do? I suggest you read my next blog ‘Meaningful ways to determine the adequacy of a treatment effect when you lack an intuitive knowledge of the d.v.’
What do we do when the patient population differs, or we don’t have a strong and deep intuitive experience with the dependent variable? For example, a 15 pound weight loss might be clinically meaningful, but would it be meaningful for a neonate? Or a morbidly obese patient? For patients with hypertriglyceridemia, is 10 mg/dL a clinically meaningful improvement? 100 mg/dL? 1000 mg/dL? Well, the less informed (i.e., those who haven’t been reading my blog) would just report the p-value and rely on ‘statistical significance’.
What we need is a unit-free and easily understood way of describing the results. Fortunately, we have already seen this described in my blog ‘Why do we compute p-values’. I’m referring to effect size = [(Mean_{1} – Mean_{2})/s], which is embedded within every t-test formula and power calculation. It is different from the t-test. While the effect size is not affected by N (the t-test and F test are directly related to N), the variability (confidence interval) of the effect size is affected by N. [Actually all CIs are affected by N.]
Effect size is very easily calculated by dividing the mean difference of the key comparison by the standard deviation (or square root of the residual error variance [MS_{error} of the ANOVA]). In English, the effect size can be described as ‘how many standard deviations apart are the two means’? All comparisons among means could use this final study metric.
Hmm, what about studies with more than two treatment groups, for example, a placebo, low and high dose trial. In my experience, the key comparison ALWAYS is a single (or one degree of freedom) comparison. In statistical jargon, a single linear contrast. In this case, the key hypothesis might be a simple comparison of vehicle against high dosage. Alternatively, it could be placebo against the average of the active doses. This would get us back to the effect size.
Effect size is also a necessary component of power calculations for means. In fact, statisticians only need an α (e.g., 0.05 two-tailed), 1 – β, aka the power of the study (e.g., 80% power), and the effect size. The first two are traditionally set by the ‘scientific gatekeepers’ and by ‘upper management’, so all you actually need is the effect size and the necessary sample size would ‘pop out’. I will discuss power analyses in a future blog.
Although it might not have been obvious, effect size is a unit-free measure. For example, for my diet study example, if the mean difference was 5 pounds and the standard deviation was 20 pounds, after division, the effect size is 0.25, unit free.
Effect size allows you to compare apples and oranges. When you combine the independence of effect size from a study’s sample size and its unit-free property, one could compare two studies and see if a parameter X in Study A indicates a more ‘sensitive’ parameter than parameter Y from Study B. I observed for one disease that a stair climb test was a better parameter to use in a future study than a six minute walk distance, despite the fact that the stair climb was used in one published study and the walking test was from a second study, each with widely different sample sizes.
One criticism you might raise is many scientists have only a weak understanding of what a clinically meaningful effect size is. That is, is a 0.5 s.d. mean difference large or small? In truth, this is a reasonable objection, as it is domain specific. Indomethacin is a tremendous treatment for acute gout attacks. A new acute gout attack medicine would need to be at least as efficacious as that treatment, e.g., a treatment effect of at least 2.0. Unfortunately, for most diseases the treatment is not that dramatic. A 0.25 might be much more realistic.
One statistician, Jacob Cohen, who worked in this field, called effect sizes of 0.2 to 0.3 small, around 0.5 medium, and greater than 0.8 large. As implied above, this is domain specific. Actually Cohen called what I refer to as the effect size a ‘signal to noise ratio’. Different people have called this different things. In any case, I STRONGLY recommend computing the effect sizes in your literature.
For most gold-standard pharmaceutical trials (double-blind, placebo-controlled studies), I would unhesitatingly use Cohen’s original rule of thumb. I might add an effect size of 1.5 as huge. I hesitantly add that anything larger than ‘huge’ might be considered suspect (e.g., too large and/or trivial).
One statistician I knew, who worked at a local hospital’s IRB, routinely used an effect size of 1.0 for all of his clients. This allowed him to accept studies of approximately 34 patients as acceptable. If my experience is correct, and a 0.25 effect size is more typical of medical treatments, then a study with a sample size less than 500 is likely to be a complete waste of time.
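To see how the sample size ‘pops out’ of α, power, and the effect size, here is a minimal sketch. It assumes a two-group comparison of means with equal group sizes and uses the common normal-approximation formula n per group = 2(z_{1-α/2} + z_{power})²/δ², which is not necessarily the exact t-based method behind the figures quoted above; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for comparing two means."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_power = norm.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# The diet example: 5 pound difference, 20 pound standard deviation.
diet_effect_size = 5 / 20

for delta in (1.0, 0.5, diet_effect_size):
    n = n_per_group(delta)
    print(f"effect size {delta}: {n} per group, {2 * n} in total")
```

With this approximation an effect size of 1.0 needs 16 patients per group and an effect size of 0.25 needs 252 per group (504 in total). An exact calculation based on the t distribution adds a patient or so per group, which is in line with the ‘approximately 34’ and ‘500’ figures above.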
Another useful unit-free measure of the magnitude of the treatment effect is the ‘correlation ratio’. This is more useful when there are many levels of treatment. Simply put, this is the proportion of variance explained by the treatment in the study. Mathematically (in the population) it is: η^{2} = σ^{2}_{treat}/σ^{2}_{total}. The numerator, σ^{2}_{treat}, is the variability which is explained by the treatment (in an ANOVA, the Sum of Squares for the treatment or model); the denominator, σ^{2}_{total}, is the total variability in the study (the Sum of Squares total in the ANOVA). The correlation ratio ranges from 0 to 1.0, like any correlation which is squared. A correlation ratio of 0.25 would say that the treatment accounts for 25% of the variability seen in the study’s population. Or conversely, 1 – 0.25 or 75% of the variability is unrelated to the treatment. Again, meaningful effects are quite domain specific. Engineers might expect their studies to account for 99% of the variance. Psychologists might be overjoyed to explain 10% (correlations of 0.3 ~ square root of 0.10).
I talked as if the effect size and correlation ratio are two unique and noncomparable statistics. They aren’t. Knowing one tells you the other. Let δ be the effect size and η^{2} the correlation ratio, then
η^{2} = (1.0 + (δ^{2})^{-1})^{-1}
δ^{2} = η^{2}/( 1.0 – η^{2})
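A small sketch of the two conversions, using δ² = η²/(1 – η²) and its inverse η² = δ²/(1 + δ²) (the same thing as 1/(1 + 1/δ²), written so that δ = 0 causes no division by zero). The function names are mine.

```python
from math import sqrt


def eta2_from_delta(delta):
    """Correlation ratio (eta squared) from the effect size delta."""
    return delta ** 2 / (1.0 + delta ** 2)


def delta_from_eta2(eta2):
    """Effect size delta from the correlation ratio (requires eta2 < 1)."""
    return sqrt(eta2 / (1.0 - eta2))


for eta2 in (0.100, 0.050, 0.010):
    delta = delta_from_eta2(eta2)
    round_trip = eta2_from_delta(delta)
    print(f"eta^2 = {eta2:.3f}  ->  delta = {delta:.3f}  ->  eta^2 = {round_trip:.3f}")
```

Converting in one direction and then back returns the starting value, which is all that ‘knowing one tells you the other’ amounts to.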
If one only had a t-test result, could you estimate the effect size? Yes:
E(δ) = sqrt [ (t^{2}(υ – 2)/υ – 1)/( υ + 2) ], where t is the t-test statistic and υ is the degrees of freedom.
Things get a bit more complicated when dealing with more complex ANOVAs (e.g., 3 way ANOVA, models with random effects). It’s also possible to generate confidence intervals for the effect size.
Having repeatedly said that modern statistics does not allow you to accept the null hypothesis, my next blog will discuss how you can demonstrate that your treatment is not worse than an active treatment or the difference is near zero.
***
In my first blog I stated a truism, that was hopefully taught in your first statistics class:
You can’t accept the null hypothesis. You can ONLY reject the null hypothesis.
It is not possible to prove that a difference is exactly zero (H_{o}: μ_{1} – μ_{2} = 0). However, you might be able to prove that it is near zero (μ_{1} – μ_{2} ~ 0).
From bioequivalence trials, statisticians were able to come up with a simple way to reformulate the problem into something which can be tested statistically. However, to do that statisticians needed a value below which the difference was clinically insignificant. In the bioequivalence area, a new formulation had to have a drug potency between 80% and 125% of the original formulation to be considered bioequivalent. The 80% and 125% boundaries came from the FDA. [Note: You might be thinking that the interval isn’t symmetrical: 100% – 80% = 20%, while 125% – 100% = 25%. Actually we’re dealing with ratios. 4/5 (80%) is the reciprocal of 5/4 (125%). In the log world, the interval is symmetrical.]
The simple (at least to us statisticians) solution was to demonstrate that the ratio of potency (e.g., area under the curves [AUC]) had a confidence interval (CI) which was above 80% but was below 125%. Let me clarify, the confidence interval must fall completely within the range. They didn’t look at the mean (center), but the 95% CI to determine equivalence. For example, if the ratio of AUCs was 1.00 but had a CI of 0.79 to 1.26 it would fail. So would CIs of (0.78 to 1.10) or (0.98 to 1.30). However, a mean ratio of 1.10 with a CI of (1.01 to 1.19) would succeed in proving bioequivalence, despite demonstrating that the active treatment was statistically significantly larger than the standard.
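The rule is easy to express as a check on the two ends of the interval. This is a minimal sketch: the function name is hypothetical, and treating the 0.80 and 1.25 boundaries as inclusive is my assumption.

```python
from math import isclose, log


def is_bioequivalent(ci_lower, ci_upper, low=0.80, high=1.25):
    """True only if the whole CI for the potency ratio lies within [low, high]."""
    return low <= ci_lower and ci_upper <= high


# The confidence intervals used as examples above.
for lower, upper in [(0.79, 1.26), (0.78, 1.10), (0.98, 1.30), (1.01, 1.19)]:
    verdict = "bioequivalent" if is_bioequivalent(lower, upper) else "fails"
    print(f"CI ({lower:.2f} to {upper:.2f}): {verdict}")

# On the log scale the two boundaries are symmetric around zero.
print(isclose(log(0.80), -log(1.25)))
```

Only the last interval passes, even though it excludes a ratio of 1.00, because the check never looks at the center of the interval.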
For most studies, we don’t need to demonstrate equivalence, but rather non-inferiority. Let me clarify. If a new drug were shown to be better than a standard drug, that would be good. The drug manufacturer would ask the Agency for permission to include that on its label. On the other hand, if it was shown not to be worse than a standard drug, that might suffice. As with the null hypothesis, one would look at the lower end of the confidence interval and determine if it includes the clinically important, ‘not worse than’ amount. For example, in a Phase 3 mortality trial that I designed, there was a treatment and a standard. The clinical team believed that the treatment would have 3% better mortality than the standard treatment. However, the sample size needed to run a trial to demonstrate superiority was prohibitive. The sponsor got permission to run the trial with a ‘not worse than’ 3% hypothesis. Therefore the key hypothesis was to test if the improvement in mortality was greater than –3%. That is, if the treatment had a confidence interval which might have included zero, but, say, only included a possibility of being 2% worse, then the drug would be approved. For example, if the mean improvement were 1% and had a CI of (–2% to 4%), the study would have succeeded. We would have proved that the experimental treatment was not 3% worse. That is, the entire CI was above –3%.
Now what would have happened if the lower end of the CI were a bit higher than –2% (e.g., +0.3% to 4.1%)? We would have demonstrated that not only was the treatment not 3% worse than the standard, but that it was actually superior to the standard (i.e., reject the traditional null hypothesis). Hmm, one confidence interval but two tests/hypotheses. Will we need to take a hit on the alpha (i.e., test each at 0.025 two-sided alpha levels)? Nope. We are able to test both of them with a single 0.05 test. [Note: As we typically use a two-sided CI, and since each side is 0.025 of the alpha, we’re actually testing with a single 0.025 alpha level. But that’s still better than testing each with a 0.0125 alpha level.]
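Both hypotheses can be read off the lower end of one confidence interval, as this sketch shows. It assumes the difference is expressed as improvement of the treatment over the standard, in percentage points, with a ‘not worse than’ margin of -3; the function name and the labels are hypothetical.

```python
def classify_trial(ci_lower, margin=-3.0):
    """Read non-inferiority and superiority off the lower end of one CI.

    ci_lower: lower confidence limit for (treatment - standard) improvement.
    margin:   the pre-specified 'not worse than' boundary (negative).
    """
    if ci_lower > 0:
        return "superior"       # whole CI above zero (and so above the margin)
    if ci_lower > margin:
        return "non-inferior"   # whole CI above the margin, but includes zero
    return "inconclusive"       # CI reaches the margin or below


print(classify_trial(-2.0))  # e.g. a CI of (-2% to 4%)
print(classify_trial(0.3))   # e.g. a CI of (+0.3% to 4.1%)
print(classify_trial(-4.0))  # e.g. a CI whose lower end is 4% worse
```

Because the same interval is compared with two nested boundaries, no second test is run, which is the sense in which the non-inferiority claim comes for free.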
I am explicitly stating that the noninferiority test is FREE, FREE, FREE, FREE. (See my future blog ‘Statistical freebies – multiple hypotheses/variables and still control alpha’). Since it is easier to prove that the lower end of a CI is above a negative, rather than a zero, value, not only is it FREE but it is easier to achieve. Not only that, but since we (should) do a confidence interval anyway and ALL statistical programs include confidence intervals, then it is CHEAP, FREE, and EASY (pick any three!), and still is completely respectable. What more would a scientist want??
If a test of superiority was not statistically significant, then you can’t say that the study demonstrated that no difference existed, unless, and only unless, you included the ‘non-inferiority’ hypothesis in the protocol. However, don’t be too concerned if you previously implied that a non-statistically significant result indicated no difference. Ziliak and McCloskey (see my blog ‘Why do we compute p-values?’) said this is a very common error. You now have the ability to not make this common error (and get it past the Agency).
In sum, we can accept a form of the null hypothesis, but we need to include a statement in the protocol about a cutoff for clinical unimportance and demonstrate that our results (CI) do not go past that level. What happens if we didn’t include the clinical un/important boundary in the protocol and didn’t include the non-inferiority hypothesis? For example, if we only included a test of superiority? I can only say SOL (which could stand for Sorry Out of Luck).
***
Let me return to the topic of this blog ‘Accepting the null hypothesis’. The more astute reader will notice that the approach I recommended is to see if the product was not worse than the standard active product. In the bioequivalence testing arena, we’re not only seeing if the product had less potency than the standard (i.e., < 80%) but also if it had greater potency than the standard (i.e., > 125%). If you’re making different lots of the same compound you don’t want the different lots to be either too weak or too strong. In most drug testing we would love our product to be superior to an approved product.
What if we’re actually interested in proving equivalence? As with the 80% – 125% rule, we would need to ensure that both the upper and lower ends of our confidence interval were within the interval. Most times, we’d like to have symmetric intervals. One easy way to do that is to ignore the sign of the difference. Fortunately or unfortunately, statisticians seldom transform by ignoring the sign. However, one trick we statisticians often use is to square the statistic. This always gives us a positive (or zero) number. I mentioned in my previous blog that one statistical metric which is easily transformed into the effect size is the correlation ratio. It is a squared (hence always positive) number, and is identical to how much of the total study’s variance the experimental treatment explains. It is a number from 0.00 to 1.00. Zero would mean the treatment explains nothing. One (one hundred percent) would mean that the treatment explains all the differences we see among the numbers. We could test the null hypothesis, that the treatment effect is near zero, by looking at the confidence interval on the correlation ratio. If the lower end of the CI was above zero, that would be equivalent to saying that the treatment effect is above zero. For example, if the lower end of the CI on the correlation ratio was 0.0001, that would mean that the treatment explained 0.01% of the study’s variance: not zero, but small. No, that isn’t what would interest us. What we’re actually interested in, for proving the null hypothesis, is the upper end of the interval. For example, we could say that if the treatment explained 5% or less of the variance, then the treatment effect was near zero. [Note: as implied in a previous blog, we could have an interval from 0.0001 to 0.048, and say that the study statistically proved the treatment had an effect, but clinically it wasn’t useful (i.e., < 0.05, or 5% of the variance).]
We don’t see much of this type of analysis however. As I will shortly illustrate, it is very, very costly to do this study, and anyway, who wants to spend a very large chunk of research cash into proving that the treatment didn’t work?
How expensive? I’m going to provide various correlation ratios (η²), their square roots (η) – in a sample this would be the simple correlation (r) – the effect size (δ), and the sample size for each group – double it for a two-group t-test comparison (assuming 80% power and an alpha of 0.05 two-sided). For example, in a two-group study which wanted to prove that the treatment difference between two means was less than 10% of a standard deviation (alternatively, a study in which the treatment effect accounted for no more than 1% of the study’s variance), one would need 1,571 per group or 3,142 total patients. As implied above, this is a pretty expensive way to prove your product isn’t really different from your competitors’.
η²       η        δ        N/group
0.100    0.316    0.333    143
0.075    0.274    0.285    195
0.050    0.224    0.229    301
0.040    0.200    0.204    379
0.030    0.173    0.176    508
0.020    0.141    0.143    769
0.010    0.100    0.100    1,571
0.005    0.071    0.071    3,115
0.000    0.000    0.000    ∞

My next blog will discuss why statisticians cannot lie and why the analysis plan is essential.
Let me start this blog with one of my pet peeves. I abhor the quote ‘Lies, Damn Lies, and Statistics’. For me, a statistician, it has as much truth as saying that ‘the earth is flat’.
I T I S N O T T R U E ! ! !
The misquote is mistakenly attributed to Mark Twain, who himself mistakenly attributed it to Disraeli. Actually the quote might be from Leonard Henry Courtney, who gave a speech on proportional representation, ‘To My Fellow-Disciples at Saratoga Springs’, New York, in August 1895, in which this sentence appeared: ‘After all, facts are facts, and although we may quote one to another with a chuckle the words of the Wise Statesman, “Lies – damn lies – and statistics,” still there are some easy figures the simplest must understand, and the astutest cannot wriggle out of.’
If you do quote it (and please don’t quote it to me), then please don’t take it out of context. Make sure that you also say that ‘facts are facts’ and ‘there are some easy figures the simplest must understand, and the astutest cannot wriggle out of.’
For those statisticians, like me, who work in medical clinical trials, the most fiendish, diabolical, dishonest statistician CANNOT lie with statistics. That evil villain must do the analyses as specified in the protocol. MUST!!! I wrote two science fiction novels in which the hero was a cyborg, a human-computer hybrid. Everything the hero said or did was electronically stored. He, like the statistician, must tell the truth. Statisticians must follow the details of the analysis plan. They must stress the key parameters/time periods/comparisons. Tertiary parameters cannot become primary. All of these were specified before the study was unblinded. We must be honest. We cannot lie with statistics. The worst we could do is present the secondary (or tertiary) parameters using the same analyses as the primary. Others might disregard the key comparison and give extensive reasons why failed key analyses should be ‘set aside’ for other analyses or to look at the ‘pattern’ of results, but we statisticians must include the key comparisons.
So, the first reason why the analysis plan (also known as the statistical analysis plan or SAP) is necessary, is that it keeps the industry honest. The Agency can always look at the timestamp on any analysis plan and determine when it was written/finalized.
Contract
About twenty-five years ago, I created my first analysis plan. I had never called it an analysis plan. I had never even heard of an analysis plan. I’m sure one was independently invented elsewhere, before mine. But I invented the statistical analysis plan. Why? I was working alone on a BLA for HFlu vaccine. I told the medical monitor that I could get the analyses done in an incredibly short time, but I must have buy-in from her and her team. I wrote up my statistical methodology (insertable into the submission), details of how I would proceed, how I would handle problematic data (e.g., values below the limit of detection), and example tables with all the statistics I would produce. The team and I spent the better part of a month on the details. Finally the data came in on a Friday, the data manager came in on Saturday to check the last-minute data irregularities, the study was unblinded on Monday, and the medical writer received the full report complete with all tables on Tuesday. The total submission was completed within a week of data availability. This was the first agency-approved submission in nine years at Lederle Laboratories and I was the sole statistician on the project.
What did I learn from this?
If we want to get the analysis on time and on budget, the SAP is essential.
“All models are incorrect. Some are useful.” George Box
***
When you do a statistical test, you are, in essence, testing if the assumptions are valid. We are typically only interested in one, the null hypothesis. That is, the assumption that the difference is zero (actually it could test if the difference were any amount). We discussed the merits of the null hypothesis in previous blogs. But the null hypothesis is only one of many assumptions.
Let me focus on the lowly t-test and also a simple two-way ANOVA (comparing two groups and time [repeated measurements]).
A second assumption is that the data are normally distributed. One unusual thing about the ‘real’ world is that data are often normally distributed. Height, IQ and many, many other parameters are normal (or Gaussian). [Note: William Sealy Gosset, aka ‘Student’, was the inventor of the t-test. Interesting side note, he worked for Guinness and developed the t-distribution to monitor stout production. However, Guinness didn’t approve of any publications, hence the pseudonym, ‘Student’. This will not be on any test. Unless you become a statistician. In which case it is mandatory you know it.] In general, if a variable is affected by many, many different factors, it will be normally distributed. [I’ll explain the reason for this shortly.] We even have tests to determine if the data are normal. Unfortunately, almost all variables have a slight departure from normality. As implied in my previous blogs, if we have a large enough sample, then any statistical test will reject the null hypothesis (e.g., a test of normality will always reject normality if the sample size is large enough).
So, how bad is the effect of non-normal data? The answer is simply: NON-NORMALITY has almost NO EFFECT ON P-VALUES when we compare means, especially when the sample sizes are moderate. There is a theorem in statistics (the central limit theorem) which says that as the sample size increases, the distribution of means approaches the normal distribution.
Let me back up a bit. Let us imagine we are interested in seeing if our mean isn’t zero. In the plots below the Greek letter mu (μ) could be any number; in this case we might make it zero. Arbitrarily, let us also assume that the extremes of the plots below are –2 and +2. [Actually it’s trivially easy for the mean to be zero and the range to be –2 to +2: we just need to look at a new variable with the mean subtracted from the old and the result divided by its standard deviation. Let me call this new variable t = (X – μ)/σ, where X is a mean and sigma is the s.d. of means.]
We compute a mean from our sample’s data. The plots below are an artist’s representation of what a bunch of means around mu would look like. The artist’s representation is actually what a mathematician would call a shitload of means given the above true mean and its variability. Actually they would say an infinite number of means, but no one ever has seen an infinity, but we humans have seen loads of the other stuff. If we know the true variability of the original data (σ), then the variability, hence the width of the above pictures is mathematically known (σ/√N). I should also note that the widths below would be narrower as N increases, but the pictures were widened so you could see the shape of the curves.
However let me get to the nub of the problem. The shape of the curve is NOT the same as the shape of the original raw data. So if the original data were skewed, a set of means is less skewed. Let us assume that you have a sample of 2. When you see an extreme point (e.g., a +2), it would be quite rare to see a second point as extreme (i.e., two +2s). So the possibility of getting a mean of +2 would be unusual. If the true mean were zero and you had a sample of say 10, the likelihood of seeing all ten even positive would be quite remote (one time in a thousand).
Let me illustrate this with the following four figures.
The dotted line represents a true normal distribution, which is what we hope to eventually see. The top left-hand plot is the original distribution. As can be seen, we have what we statisticians would call a negatively skewed distribution (the long tail points to the negative side of the number line). The top right-hand plot shows means from that original negatively skewed distribution when N is 2. As can be seen, even when the sample size is 2, the distribution (solid line) of means is less skewed than the first. The bottom plots have sample sizes of 4 and 10, respectively. When the sample size is ten, the distribution of means is virtually identical to the normal curve. So if you analyzed data from this originally skewed distribution but had a sample of ten observations, the statistical test on means would be based on results which are virtually indistinguishable from the normal curve.
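The same behavior can be reproduced with a small simulation. This is only a sketch under my own assumptions: the raw data are drawn from a negated exponential distribution (a strongly negatively skewed shape, and more extreme than a typical clinical variable), skewness is measured with the usual moment coefficient, and the function names, seed, and number of replicates are arbitrary choices.

```python
import random
from statistics import fmean


def skewness(values):
    """Moment coefficient of skewness: third central moment / sd cubed."""
    m = fmean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def skewness_of_means(n, replicates=20_000, seed=0):
    """Skewness of the distribution of means of n negatively skewed values."""
    rng = random.Random(seed)
    means = [
        fmean([-rng.expovariate(1.0) for _ in range(n)])
        for _ in range(replicates)
    ]
    return skewness(means)


for n in (1, 2, 4, 10):
    print(f"N = {n:2d}: skewness of means = {skewness_of_means(n):.2f}")
```

For independent observations the skewness of a mean shrinks in proportion to 1/√N, so each increase in N pulls the distribution of means closer to the symmetric normal shape.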
What is hypothesis testing? Imagine having a theoretical distribution with a mean of zero and an expected range of –2 to +2. Then you get an actual mean of +10. What would you conclude? You’d say “Uh uh uh uh. Wrong. It can’t be. The theory is wrong.” Statistics just codifies such reasonable conclusions.
The above logic with means is also the reason why, when you have a number of small effects culminating in a parameter, that it tends to be normally distributed. For example, IQ is produced by the additive effect of many genes and many environmental factors. The net result is that this characteristic is like the mean of a number of subeffects. It will tend to be normally distributed.
One cause of nonnormality is outliers, or extreme values. The best way to see them is to plot the data. Two quick approaches are the stem and leaf and the boxplot. Outliers can change the means, they also very strongly influence variability and correlation. Many times outliers are transcription errors or bad assays and can be ignored/corrected. I also noticed that sometimes units get confused (e.g., one investigator using grams and the rest use micrograms). Other times, they can’t be ignored as they are valid extreme disease states. Transformations (see below) might be the best way to handle them.
So, I would say that if the data are expected to be pathologically nonnormal, then the ttest would not be affected by nonnormality when you have at least twenty observations. Do we ever do a pivotal trial with that small an N? Never! I will also mention below a second way around this nonnormality issue – data transformation.
Any marginally competent statistician will look at a frequency distribution (e.g., stem and leaf) to see if the data have any marked non-normality and/or contain outliers. If I’m very lazy, I might just look at the skewness and kurtosis (the third and fourth moments of the data).
If the data are skewed, one reason might be that the measuring instrument was not constructed to differentiate on one or both ends of the scale. This is called the floor or ceiling effect. If you detect this in an early study (e.g., Phase 2a), then you might get better differentiation and sensitivity by getting or developing a different method to get your data (i.e., a better scale).
You might be confronted by an ‘expert’ who scoffs at using a t-test when the raw data is so obviously non-normal, and who tells you that only the non-distributional, non-parametric test is correct. I would suggest you tell them: “Yes, I can clearly see that the raw data isn’t normal. You’re right. But I’m analyzing and comparing means from x subjects. Do you know what shape a distribution of means would look like? Indistinguishable from normal. Please run a simulation of [differences of] means when N is x and compare that to a normal distribution. The shape of the original distribution is irrelevant. The sampling distribution of means will be normal. Feel free to bootstrap from my original distribution with a million replications to compute the mean [difference]. No, I’m not saying ‘Damn the torpedoes, full steam ahead.’ I’m saying ‘Damn the torpedoes, I’m in a freaking plane.’ ” I’ll be talking about the usefulness of non-parametric statistics in blog 9.
A third assumption of the t-test is that the variances for the two treatments are equal. This has a fancy five syllable name – homoscedastic (pronounced ‘hoemoeskidasttick’). When the two variances are not equal, it is called heteroscedastic. You could drop these terms to impress your friends and neighbors. [Post publication note: In the November 2017 Significance there was a poll on “What is your favorite statistics word?” Heteroscedasticity won, hands down.] On second thought, forget it, unless you want your friends to avoid you and your neighbors to ask you to move. [Statisticians are by nature lonely people (and humorless).]
How badly is the alpha level affected when the two groups have different variances? It depends on the sample size for the two groups. If we have equal N’s in the two groups, the effect is zilch. When you have equal N, even if the ratio of the variances were zero or infinite, the 0.05 alpha level is actually 0.05; as I said, zilch. If the two sample sizes differ and the larger variance group has the larger N, then the test is actually conservative. For example, if one group had a variance twice as large as the other and also had twice the number of subjects, then the 0.05 nominal alpha level would actually be 0.029. On the other hand, if the group with the variance half the size of the other had twice the number of subjects, then the 0.05 nominal alpha level would be 0.080. At the extremes (although it is not possible to have either zero or infinite variability): when the group with twice the sample size had zero variability, the nominal 0.05 alpha level would actually be 0.17; and when the group with twice the sample size had infinite variability, the nominal 0.05 alpha level would actually be 0.006. So, again, I would recommend keeping the N’s around the same or as close to the same as you can. This is one of the reasons why we use a 1:1 treatment allocation. [Note: the second is that the power to reject the null hypothesis is maximized.]
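For those who want to check these figures, here is a rough sketch based on a large-sample (normal) approximation of the pooled t-test. The function name and its arguments are hypothetical, and the approximation is my assumption, not necessarily how the tabled values were originally derived; it does land close to the numbers quoted above.

```python
from math import sqrt
from statistics import NormalDist

def actual_alpha(n1, var1, n2, var2, nominal=0.05):
    """Approximate true two-sided alpha of the pooled t-test (large samples)."""
    z_crit = NormalDist().inv_cdf(1 - nominal / 2)
    # Real variance of the difference between the two means.
    true_var = var1 / n1 + var2 / n2
    # What the pooled variance estimate converges to, and the variance
    # the pooled test therefore believes the mean difference has.
    pooled = (n1 * var1 + n2 * var2) / (n1 + n2)
    assumed_var = pooled * (1 / n1 + 1 / n2)
    inflation = sqrt(true_var / assumed_var)
    return 2 * (1 - NormalDist().cdf(z_crit / inflation))

# Only the ratios of the Ns and of the variances matter here.
cases = {
    "equal N, variances 1 vs 1000": actual_alpha(1, 1, 1, 1000),
    "2x N has 2x variance": actual_alpha(2, 2, 1, 1),
    "2x N has half the variance": actual_alpha(2, 1, 1, 2),
    "2x N has zero variance": actual_alpha(2, 0, 1, 1),
    "2x N has 'infinite' variance": actual_alpha(2, 1, 1, 0),
}
for label, alpha in cases.items():
    print(f"{label:30s} actual alpha is about {alpha:.3f}")
```

The 'infinite' case is represented by giving the other group zero variance, since only the ratio enters the calculation.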
Sometimes we are asked to have unequal allocation. In general, I would seldom recommend using more than a 2:1 treatment allocation.
Well, is there any workaround? Actually yes, a pretty neat one. One doesn’t need to analyze the data on its original scale. Huh??? What I mean is that one can do some type of transformation. For example, many years ago I worked on a wound healing salve (rhPDGFG). We needed to measure the surface area of the wound. Most of the wounds were small (e.g., most with area less than 0.03). Unfortunately, we also saw some gaping wounds (e.g., area of 2.00). We used a square root transformation on the data. Some people might object as this might have little intrinsic meaning. One possible rationale for the square root is the simple geometry formula: Area = πr² or r = sqrt(Area/π). The square root of the area is proportional to the radius of the wound. Wounds tend to heal from the outside moving in. That is, the radius changes. In any case, God never said that the natural number line was any better at describing the data than the square root number line. Many variables benefit from a log transformation. We mentioned in a previous blog that, in analyzing drug potency, we often transform area under the curve by a logarithmic transformation. The same goes for many laboratory variables (e.g., triglycerides). It is beyond the scope of this blog to discuss all the possible transformations one could apply to normalize the distributions or variances. In fact, it is possible to select the best transformation, using any power, aka the Box-Cox power transformation. Data transformations often do work surprisingly well. You might want to include in the protocol a CYA (i.e., cover your assets): ‘If it is observed that the data is non-normally distributed, a transformation (e.g., logarithmic transformation) will be used.’
A side benefit from applying the right transformation is seen in complicated analyses. When the data is correctly transformed, many interactions often disappear, making the data more interpretable. I’ll return to the issue of interactions in a later blog.
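To make the wound example concrete, here is a small sketch. The areas below are made-up numbers in the spirit of the description (mostly tiny wounds plus a couple of gaping ones); they are not the trial’s data, and skewness is just one convenient yardstick I chose.

```python
from math import log, pi, sqrt
from statistics import fmean

def skewness(values):
    """Standardized third moment; 0 for a symmetric distribution."""
    m = fmean(values)
    m2 = fmean((v - m) ** 2 for v in values)
    m3 = fmean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

# Hypothetical wound areas: most are small, two are large.
areas = [0.01, 0.01, 0.02, 0.02, 0.02, 0.03, 0.03, 0.05, 0.10, 0.50, 2.00]

radii = [sqrt(a / pi) for a in areas]   # r = sqrt(Area / pi)
logs = [log(a) for a in areas]          # logarithmic alternative

skews = {
    "raw area": skewness(areas),
    "radius (square root)": skewness(radii),
    "log area": skewness(logs),
}
for label, s in skews.items():
    print(f"{label:22s} skewness = {s:.2f}")
```

With these invented numbers the square root pulls the gaping wounds in somewhat and the log pulls them in further; which transformation is best for real data is an empirical question.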
One final assumption is the question of correlated errors. I’m referring to a MAJOR problem one often encounters when dealing with repeated measures. For example, one runs a study and collects data weekly for 8 weeks, then one wants to see if the two treatments differ. Let me first say that I have observed that the best predictor of a subject’s datum today is their value of yesterday. Depending on the parameter and the time difference, the correlation of any two consecutive values is of the order of 0.30 to 0.90. In his book Analysis of Variance, Henry Scheffé says that when the correlation is around 0.40 the 0.05 alpha level is actually 0.25. That is, analysis of repeated measurements destroys the alpha level. I have to admit, this blew me away when I first heard it. Let me tell you of my reaction in another way: I stopped doing any and all repeated measurement analyses as a professional pharmaceutical statistician for about 15 years. Originally, the only approach that was available was to assume that the correlation between weeks 1 and 2 and the correlation between weeks 1 and 8 were identical. This is the compound symmetry approach: all correlations are assumed to be the same.
In contrast to assuming they are all the same, I noticed that they typically are approximately the same between any two consecutive weeks (e.g., between 1 and 2, 2 and 3, … , 7 and 8). The correlations separated by two weeks (e.g., between 1 and 3, 2 and 4, … , 6 and 8) are typically lower. Correlations with weeks maximally different (e.g., between 1 and 8) will be the lowest. What I did in those days was to do a t-test looking at the pre-post differences (i.e., change from baseline) between the two groups, ignoring intermediary time points. What changed? Well, statisticians (I’m thinking primarily of Box and Jenkins) figured out a way to analyze data with multiple observations over time, like the stock market. In time, computer programs implemented ways to handle these ‘correlated errors’. The approach I tend to use is a single parameter approach – it’s called the autoregressive error structure of lag one or AR(1) for short. One client I used it for had parameter estimates for the correlated errors from 0.70 to 0.94. There are times when the compound symmetry approach might work (e.g., many raters measuring the same subjects), but for repeated measurements it is not valid. Let me say it again. For repeated measurements (data over time), compound symmetry is not valid. Any time I review an analysis of repeated measurements and the report does not explicitly say AR(1) or a similar approach was used, I will tell the client that it is very likely that the analysis was TOTALLY USELESS and INVALID. I feel that strongly about it. Most cheap statistical programs are not written to handle correlated errors. The better cheap programs will at least warn you about this limitation. For example, GraphPad, which only allows for compound symmetry (aka circularity), says, “Repeated-measures ANOVA is quite sensitive to violations of the assumption of circularity. If the assumption is violated, the P value will be too low.” The GraphPad manual then suggests “wait long enough between treatments so the subject is essentially the same as before the treatment.” As if that was a real option! No, the only solution for repeated measurements is to ignore them (e.g., by looking at one score, like the change from baseline), or to use a high powered stat program, like SAS’ proc mixed.
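To see the difference between the two assumptions, here is a minimal sketch that builds both correlation patterns for 8 weekly visits. The value 0.8 is just an illustrative correlation I picked, and the function names are my own.

```python
def compound_symmetry(n_times, rho):
    """Every pair of visits shares the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(n_times)]
            for i in range(n_times)]

def ar1(n_times, rho):
    """AR(1): correlation decays as rho ** (number of visits apart)."""
    return [[rho ** abs(i - j) for j in range(n_times)]
            for i in range(n_times)]

weeks = 8
cs = compound_symmetry(weeks, 0.8)
ar = ar1(weeks, 0.8)

# Correlation of week 1 with weeks 1 through 8 under each structure.
print("Compound symmetry:", [round(r, 2) for r in cs[0]])
print("AR(1):            ", [round(r, 2) for r in ar[0]])
```

Under compound symmetry weeks 1 and 8 are as tightly related as weeks 1 and 2; under AR(1) the correlation falls off with every additional week of separation, which is the pattern described above.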
In sum, the assumption of normality in any reasonably sized study is not important. Unequal variances, when you have approximately equal Ns, are not important. Data transformation can often help. Always look at the data to see if outliers are present (and either change/delete them, or try to transform the data to lessen their effect). However, repeated measurements should only be analyzed by appropriate high-powered programs or avoided completely.
Post 7a will discuss an assumption not presented here, ordinal data.
Most of the time, we can’t assume that the differences between adjacent points on a scale are necessarily equal. For example, is the difference between no adverse event and mild the same as between severe and life threatening? Is the difference between a pain scale score of 1 (the minimum on the five point scale) and 2 the same as between 4 and 5 (the maximum on the scale)? While one can’t have outliers, one can’t have large differences either. Are we justified in computing the average on this rank data? Can one ignore the lack of continuous data and the lack of equal intervals and still have a valid test of the null hypothesis? Many statisticians think not, some extraordinarily fervently.
This blog will present a summary of two statistical articles. One was by Timothy Heeren and Ralph D’Agostino, “Robustness of the two independent samples t-test when applied to ordinal scaled data”, which appeared in Statistics in Medicine, 6 (1987), pp. 79-90. The second was by LM Sullivan and RB D’Agostino Sr., which appeared in Stat. Med. April 30, 2003, 22(8), pp. 1317-1334, entitled “Robustness and power of analysis of covariance applied to ordinal scaled data as arising in randomized controlled trials.”
What Heeren and D’Agostino did was to test small sample sizes, N1 = N2 from 5 to 15, and some cases of unequal Ns. They investigated the cases of a 3, 4, or 5 level ordinal scale and multiple alpha levels. They restricted their investigation so that effectively the probability in any of the cells was > 5%; otherwise they wouldn’t have a 3, 4, or 5 level scale. For example, if there were only three levels of the scale, and one level had no observations, then effectively they would only have a two-level ordinal scale, a dichotomy. Given that they had a limited number of observations (for example, N/group = 5), they could test every possible pattern and compute the observed p-value for a t-test.
Time for their summary: “Our investigation demonstrates the robustness of the two independent samples t-test on three, four or five point scaled variables when sample sizes are small. The probability of rejecting a correct null hypothesis in this situation will not greatly exceed the stated nominal level of significance.” “Greatly exceed” was defined by them as exceeding the nominal alpha level by more than 10% (e.g., for an alpha of 0.05 the actual level stayed < 0.055, that is, pretty close to 0.05). In other words, if you have an ordinal scale with only a few levels, then examining and testing the p-value by a t-test is valid. A t-test is a valid test when you have small Ns and an ordinal scale with only a few levels.
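The ‘test every possible pattern’ idea can be sketched in a few lines. This is my own toy version, not the authors’ program: it assumes a three-point scale scored 1, 2, 3, equal probabilities for each level in both groups (so the null hypothesis is true), N = 5 per group, and a hard-coded two-sided 5% critical value for t with 8 degrees of freedom.

```python
from itertools import product
from math import comb, sqrt

T_CRIT_DF8 = 2.306  # two-sided 5% critical value, t with 8 df (assumed n = 5/group)

def compositions(n, k):
    """Yield every way to spread n subjects over k ordered scale levels."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_prob(counts, probs):
    """Probability of observing these level counts in one group."""
    remaining, p = sum(counts), 1.0
    for c, q in zip(counts, probs):
        p *= comb(remaining, c) * q ** c
        remaining -= c
    return p

def mean_and_ss(counts):
    """Mean and sum of squared deviations with levels scored 1, 2, 3, ..."""
    n = sum(counts)
    mean = sum(level * c for level, c in enumerate(counts, start=1)) / n
    ss = sum(c * (level - mean) ** 2 for level, c in enumerate(counts, start=1))
    return mean, ss

def exact_type1_error(n, probs, t_crit):
    """Sum the probability of every pair of patterns where the t-test rejects."""
    patterns = [(multinomial_prob(c, probs), *mean_and_ss(c))
                for c in compositions(n, len(probs))]
    alpha = 0.0
    for (p1, m1, ss1), (p2, m2, ss2) in product(patterns, repeat=2):
        pooled_var = (ss1 + ss2) / (2 * n - 2)
        diff = abs(m1 - m2)
        if pooled_var == 0:
            reject = diff > 0          # no spread at all: any difference rejects
        else:
            reject = diff / sqrt(pooled_var * 2 / n) > t_crit
        if reject:
            alpha += p1 * p2
    return alpha

probs = (1 / 3, 1 / 3, 1 / 3)
alpha = exact_type1_error(5, probs, T_CRIT_DF8)
print(f"Exact type I error for a nominal 0.05 test: {alpha:.4f}")
```

Because every pattern is enumerated, there is no simulation noise; changing `probs` or the number of levels lets you explore other configurations, although the critical value must be changed if N changes.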
The Sullivan and D’Agostino summary was:
Abstract: In clinical trials comparing two treatments, ordinal scales of three, four or five points are often used to assess severity, both prior to and after treatment. Analysis of covariance is an attractive technique, however, the data clearly violate the normality assumption and in the presence of small samples, and large sample theory may not apply. The robustness and power of various versions of parametric analysis of covariance applied to small samples of ordinal scaled data are investigated through computer simulation. Subjects are randomized to one of two competing treatments and the pretreatment, or baseline, assessment is used as the covariate. We compare two parametric analysis of covariance tests that vary according to the treatment of the homogeneity of regressions slopes and the two independent samples ttest on difference scores. Under the null hypothesis of no difference in adjusted treatment means, we estimated actual significance levels by comparing observed test statistics to appropriate critical values from the F and tdistributions for nominal significance levels of 0.10, 0.05, 0.02 and 0.01. We estimated power by similar comparisons under various alternative hypotheses. The model which assumes homogeneous slopes and the ttest on difference scores were robust in the presence of three, four and five point ordinal scales. The hierarchical approach which first tests for homogeneity of regression slopes and then fits separate slopes if there is significant nonhomogeneity produced significance levels that exceeded the nominal levels especially when the sample sizes were small. The model which assumes homogeneous regression slopes produced the highest power among competing tests for all of the configurations investigated. The ttest on difference scores also produced good power in the presence of small samples.
Up to now, we were only discussing the type I error, that is, the test of the null hypothesis when the difference is truly zero. A more appropriate question is: what about our ability to reject the null hypothesis when the difference is not zero? Readers of my first four blogs should realize that the null hypothesis is never true. When the means are not identical, what is the effect of a limited item ordinal scale?
To quote Sullivan and D’Agostino, “The magnitude of the effect size, or the magnitude of the difference in adjusted treatment means, is reduced by ordinal scaling due to the discreteness of the data. In particular, continuous data with an effect size of 0.8 which is scaled into threepoint ordinal scaled data reduces the effect size by approximately 75 per cent. Continuous data with an effect size of 0.8 which is scaled into fivepoint ordinal scaled data reduces the effect size by approximately 37 per cent.”
I had made a very similar comment in my blog ‘9. Dichotomization as the Devils Tool’. If you take a continuous scale and cut it into two parts (e.g., success or failure), you reduce the effect size tremendously. When you reduce the effect size tremendously, to compensate, you need to increase your Ns tremendously. In that blog, I pointed out that under OPTIMAL conditions when dealing with a twolevel (ordinal or dichotomous) scale the Ns would need to increase by 60% to compensate. Under more realistic conditions, you would need to increase the Ns by a FACTOR of four.
FREEBIE SUGGESTION: In designing a trial, rather than asking the physician or patients to rate things as good or bad (two levels) or on a five-point scale (normal, mild, moderate, severe, or extremely ill), you should consider a much wider framework. For example, rate things on a scale of 0 to 100, or ask them to make a mark on a 10 cm line. Such a change in the CRF is trivial (and free), and will give you much greater ability to prove your treatment’s effectiveness.
In sum, if one has ordinal data (e.g., a three-point rating scale), using ANOVA or ANCOVA and computing means is a pretty good approach with good control of your alpha level. In fact, for certain analyses (e.g., repeated measurements) it may be the ONLY way to analyze your data. However, one should strive to have quasi-continuous data (e.g., a 100 level scale) to give you greater ability to reject the (false) null hypothesis.
People who haven’t the time to do things ‘perfectly’, always have time to do them over.
Measure once, cut twice. Measure twice, cut once.
‘My god, you’ve conclusively proven it. Time equals Money.’
***
Cost for Failure
Every CEO I’ve ever heard knows that they only make money when the product is on the market. In the pharmaceutical/biologic/device industry, you need to demonstrate the effectiveness of a product. This means you need to reject the null hypothesis. This means you must design the trial with an excellent shot at rejecting the null hypothesis (i.e., p < 0.05). What are the consequences of failure? Time. Let me elaborate. It takes time for upper management to decide that they want to do a trial (e.g., 2 weeks), unless it’s around the holidays, vacations, conferences. The upper management will then ask a medical director to design a trial. This will initially produce a shell of a design (e.g., another 2 weeks), then another month might be spent in producing a first version of the protocol, then another couple of months in getting investigators, (redesign of protocol?), case report forms, data bases, edit checks, drug, etc. Finally patient recruitment (three months?). Investigators ALWAYS overestimate their subject pool. New investigators are enrolled (another month). The last patient enrolled needs to complete the trial (e.g., 6 months). Data is collected from the field (1-2 months), cleaned (a month), queried (another month), requeried (weeks), blind is broken, analyses are done, everything is QCed, topline report is issued (another 2 months). From soup to nuts a very modest trial would take a year and a half. Costs? Internal – one and a half years of full time equivalents of medical, statistical, programming, CRA, CDM staff. External – payment to investigators (institutions) with their staffs, medical supplies, testing equipment, patients. Cost for failure? Redo everything and the product is off the market for all that time. If the trial was not successful then the planners are completely and solely responsible for the loss, frequently in the millions.
I will assume that the trial has a clearly stated, operational objective. [You might be surprised at the number of times I’ve seen a protocol without a clear goal (e.g., ‘to study the relationship of Drug A and efficacy.’ A two patient trial would meet that objective!)]
Input for Power Analysis
The first and most important thing one needs is an excellent literature review. This is not something that a statistician can do, although a statistician can extract information and help guide the final conclusions. The scientists need to go to the literature and get the best studies which have been done in the past.
They need to review the literature with regard to:
A statistician might be asked to review the set of key studies which were published. One point I’ve made many times in these blogs is that there are many times where a non-statistically significant result with a parameter A might be more useful than a statistically significant result for a parameter B. For parametric data (variables for which one computes means), I would review the effect size and decide which one was larger. For example, in one literature review for a type of multiple sclerosis I observed that the most frequent ‘primary’ parameter of six minute walking time had an effect size of 0.3 while a supportive parameter, stair climb test, had an effect size of 0.7. I unequivocally recommended that the stair climb be used as the key parameter and the six minute walking time test be a secondary parameter. Why? Sample size is inversely proportional to the square of the effect size. Therefore, the stair climb parameter would need roughly one-fifth of the number of patients. For example, if the stair climb parameter would need 50 patients, the six minute walking test would need about 270 patients. Do you need to be a statistician to do this? Nope. Effect size is simply the mean difference divided by the standard deviation, something anyone can do with the cheapest calculator. In dealing with dichotomous data, it is like the difference between the proportions.
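The arithmetic behind that recommendation is short enough to sketch. The function names are my own; the only rule used is the one stated above, that the required N goes as one over the effect size squared.

```python
def effect_size(mean_a, mean_b, sd):
    """Standardized effect size: mean difference divided by the standard deviation."""
    return (mean_a - mean_b) / sd

def relative_n(effect_reference, effect_other):
    """How many times more subjects 'other' needs than 'reference' (N ~ 1 / effect**2)."""
    return (effect_reference / effect_other) ** 2

ratio = relative_n(0.7, 0.3)   # stair climb (0.7) versus six minute walk (0.3)
n_walk = 50 * ratio            # if the stair climb needs 50 patients

print(f"Six minute walk needs {ratio:.2f} times the patients, about {n_walk:.0f}")
```

That gives a ratio a little over 5.4, or about 272 patients, in line with the ‘one-fifth’ and ‘270’ quoted above.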
One difficulty in computing effect size is the standard deviation (sd). Sometimes the authors report only the standard error of the means, sometimes the standard error of the difference in means. Sometimes they only present error bars (95%?) graphically, in which case you could enlarge the graph and use a ruler to estimate their size. Fortunately, to convert from the standard error of the mean to the sd one would multiply the standard error of the mean by the square root of N. To convert from the standard error of the difference between two means to the sd, one would multiply the standard error by the square root of N/2 (N being the number per group). If they were 95% error bars of the mean, each arm of the bar (from the mean to the end of the bar) is approximately 2 standard errors, so one would divide that length by approximately 2 and then multiply by the square root of N.
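Here is a minimal sketch of those conversions. The function names are hypothetical, N is taken to be the number of subjects per group, and for the error bars I am assuming they are 95% confidence limits of the mean, so that the distance from the mean to the end of the bar is about 1.96 standard errors.

```python
from math import sqrt

def sd_from_sem(sem, n):
    """SD from the standard error of a single mean."""
    return sem * sqrt(n)

def sd_from_se_diff(se_diff, n_per_group):
    """SD from the standard error of a difference of two means (equal Ns)."""
    return se_diff * sqrt(n_per_group / 2)

def sd_from_ci95_half_width(half_width, n):
    """SD from the mean-to-end length of a 95% error bar (assumed 1.96 SEs)."""
    return half_width / 1.96 * sqrt(n)

# A made-up check: a true SD of 10 with 25 subjects per group.
n = 25
print(sd_from_sem(2.0, n))                    # SEM of 2.0
print(sd_from_se_diff(10 * sqrt(2 / n), n))   # SE of the difference
print(sd_from_ci95_half_width(3.92, n))       # half-width of 3.92
```

All three routes should recover an SD of about 10 in this invented example. If you measured the whole bar, top to bottom, halve it first.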
Another difficulty is what do you do with multiple time points. ‘A man with one watch knows what time it is, a man with two is never quite sure.’ Two simple solutions, use the effect size at the key time point or take a simple average of the many effect sizes. Or do both – ignore the period for which no reasonable treatment effect would be expected, then average the useful ones. A third option is to compute power for each. Power calculations are relatively fast.
Power Analysis Program Input
For a parametric analysis there are only a few things I would need: alpha level (with one or two tails), the power of the trial, and the mean and standard deviation (these two could be replaced by the effect size). I’ll expand on each.
Alpha (α) – This is usually fixed by the scientific gate keepers (e.g., publication, Agency). I almost always use 0.05 with a two-sided alpha. I would deviate from 0.05 only when there is not one but multiple ‘key’ comparisons. For example, if there were two ways the trial could be a success (using the Bonferroni test) I would use an alpha/2 or 0.025 (two-sided), but more about that in a later blog. In my power analyses, once I select the alpha I don’t bother considering alternative values.
Power of the trial (1 – β) – β is the type two error rate (how often the trial will fail when the effect size is not zero but δ – see below), 1 minus that is called the power. Power is the likelihood that the study will be a success. This will often be a proportion or percent. I frequently use 0.8 or an 80% chance that the study will succeed. 0.7 or less is inadequate for planning a trial. Often large pharma will use a 0.90 or 0.95 likelihood for success, especially for a pivotal Phase III trial. One could examine the results with multiple levels of power to determine costs.
One thing about power is that if it is above 0.50 then there is some overage in the sample size. In actuality one just needs the p-value to be < 0.05. A p-value of 0.049 would be a success. When power is much greater than 0.50 (e.g., 0.95) then the results would not only be less than 0.05 but frequently much less (e.g., p < 0.001). I occasionally run the power program with a power of 0.50 to determine what sample size would just barely be significant with my assumed effect size. Of course in any study, the observed effect size could be larger than expected or smaller. A 0.50 power would just be statistically significant half the time. If you used 50% power, you would fail half the time. Ya pays ya money and ya takes ya chances.
Like I said above, I often use a power of 0.80. Studies of that size are successful four times out of five. If you used a power of 0.95 then you would succeed 19 times out of 20.
Effect size (δ) – Or mean for group A minus mean for group B divided by the standard deviation. By now, after reading my previous blogs, you should be quite comfortable with effect size. [Note: most power programs allow either the effect sizes or the two means and standard deviation.]
Let me first state the wrong way of doing things.
Therefore, I strongly recommend you let the literature suggest the means and standard deviations. One caveat is that the literature might itself be overestimating the effect sizes. Although it shouldn’t, many journals don’t publish failed (e.g., nonstatistically significant) results, many investigators don’t publish failures, many industry sponsored trials don’t publish improperly run, inadequately sized studies (i.e., failures). So, you might even want to design the trial with an effect size even lower than that seen in the (published) literature.
(Addendum 25-April-2018): For dichotomous data, effect size is related to the difference in proportions (or rates). Effect sizes for proportions are also affected by how close the two proportions are to 50%. It is easier to get statistically significant results when the proportions are near 0% (or 100%) than near 50%. For example, a 5% difference between 10% and 15% would require N=725, while a 5% difference between 50% and 55% would require N=1604.
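The settings behind those two Ns are not stated, so the following is only a sketch under my own assumptions: N is per group, alpha is 0.05 two-sided, power is 80%, and the usual normal-approximation formula for two proportions is used with a continuity correction. Under those assumptions the results come out very close to the figures quoted.

```python
from math import sqrt
from statistics import NormalDist

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate N per group to compare two proportions (continuity corrected)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    d = abs(p1 - p2)
    # Uncorrected normal-approximation sample size.
    n = (z_a * sqrt(2 * p_bar * (1 - p_bar))
         + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / d ** 2
    # Continuity correction applied to the uncorrected N.
    return n / 4 * (1 + sqrt(1 + 4 / (n * d))) ** 2

for p1, p2 in [(0.10, 0.15), (0.50, 0.55)]:
    n = n_per_group_proportions(p1, p2)
    print(f"{p1:.0%} vs {p2:.0%}: about {n:.0f} per group")
```

The same 5% difference costs more than twice as many subjects near 50% as it does near 10%, because the variance of a proportion is largest at 50%.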
Three outputs and two inputs
As I mentioned above, one typically sets alpha due to the study design/gatekeepers. It doesn’t vary. There are actually three real pieces of information, of which you input two. Knowing two will tell you the third: the N per group, the power, and the effect size (treatment mean difference).
When I’m already doing a power analysis, to generate a second one takes less than a minute. The difference in computing one result or 9 is trivial in terms of cost. I highly recommend having a set of N/groups, power and effect sizes (treatment mean differences).
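As a rough illustration of how cheap these variations are, here is a sketch using the normal approximation for a two group comparison of means. The function names are my own, and validated programs (which work with the t distribution) will typically ask for a subject or two more per group than this gives.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def n_per_group(effect_size, power=0.80, alpha=0.05):
    """N per group from effect size and power (two-sided, normal approximation)."""
    z = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return 2 * (z / effect_size) ** 2

def power_for(n, effect_size, alpha=0.05):
    """Power from N per group and effect size."""
    return Z.cdf(effect_size * sqrt(n / 2) - Z.inv_cdf(1 - alpha / 2))

def detectable_effect(n, power=0.80, alpha=0.05):
    """Smallest effect size detectable with N per group at the given power."""
    z = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return z * sqrt(2 / n)

# A small grid: N per group for several effect sizes and power levels.
print("effect  power=0.50  power=0.80  power=0.95")
for es in (0.3, 0.5, 0.7):
    row = [ceil(n_per_group(es, power=p)) for p in (0.50, 0.80, 0.95)]
    print(f"{es:6.1f}  {row[0]:10d}  {row[1]:10d}  {row[2]:10d}")
```

The power = 0.50 column is the ‘just barely significant’ sample size mentioned earlier; the other columns show what the extra insurance costs.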
Multiple Treatment Groups
I have a simple way of looking at things. When I have multiple treatment groups, I reduce the problem to its simplest form. The key comparison becomes a simple, single two group comparison. Let’s say we have two active groups (high and low dose) and a placebo group, three groups in all. There can be three comparisons: 1) high dose compared to placebo, 2) low dose compared to placebo, and 3) high compared to low dose. For efficacy, the FDA wants to see a comparison against placebo – deprioritize the third comparison. One typically would expect the high dose to be at least as effective if not more effective than the low dose, so the largest effect size would be a single comparison of high dose vs. placebo. High vs placebo would be the key (or primary) comparison. Done.
Alternatively, if the chief medical officer says the low dose vs. placebo is important, then the simple approach is to use the smaller expected difference (effect size) in the power analysis and to divide the alpha by 2 for the two “equally important” comparisons (high vs. placebo and low vs. placebo). The alpha for a traditional two-sided comparison is typically 0.025 for either tail. Dividing the alpha by two (again) would make it 0.0125. You could pass the new sample size (cost of the trial) on to your CMO and see if they still want to pay for this larger trial. If they don’t want to pay to see the low dose vs. placebo difference, my next question is: why include the low dose at all – drop it and speed up the trial by 50%!!! To me, time to market is critical, and one can always go back and do a phase IV with the lower dose.
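To give the CMO a feel for what splitting the alpha costs on its own, here is a small sketch (normal approximation, 80% power, effect size held fixed; the function name is my own). Powering on the smaller low dose effect would add to this.

```python
from statistics import NormalDist

def n_inflation(alpha_split, alpha=0.05, power=0.80):
    """Ratio of required N at the split alpha to required N at the original alpha."""
    z = NormalDist().inv_cdf
    base = z(1 - alpha / 2) + z(power)
    split = z(1 - alpha_split / 2) + z(power)
    return (split / base) ** 2

inflation = n_inflation(0.025)   # two equally important comparisons
print(f"N per group grows by about {100 * (inflation - 1):.0f}%")
```

Under these assumptions the split alone adds roughly a fifth to the N per group.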
Eventually all my clients expect to see differences among all treatments, so I power the trial to detect the biggest/most important difference. In a repeated measurement design, I would still use one literature based comparison to power the trial. I might throw into the final N an overage for each additional effect I’m computing (e.g., each degree of freedom in the model). So, if you’re comparing 3 treatment groups measured at 5 time points, you would need (3*5 – 1) or 14 extra subjects enrolled.
Overage for dropouts
Speaking of overage, once you get your estimated evaluable N per group you would need to increase it by a fudge factor based on expected dropouts. For example, if you have seen a 15% dropout rate you would multiply the final N/group by 1.15 (or, to be exact, divide it by 0.85) to come up with the to-be-enrolled N/group.
Non-Superiority and/or Non-Parametric Power Analyses
Up to now, I’ve been focusing on a parametric superiority trial.
A non-inferiority (not-worse-than) trial would have all of the above plus the equivalence limit difference (aka the minimally meaningful treatment difference). As one is typically running a non-inferiority trial against an active treatment, the treatment difference should be around zero. What you want is as large a negative value as possible. This is something that you will often negotiate with the Agency. For example, in one trial against the standard of care, it was initially suggested that the active treatment be not more than 10% worse. The Agency wanted it not more than 1% worse, and we finally compromised on not more than 3% worse.
Ordinal Data: Power analyses for non-parametric analysis of ordinal data are always a problem. One does non-parametric analyses because the data have outliers (extreme values), or have a non-normal distribution (e.g., many laboratory tests have a log-normal distribution), or have limits (e.g., if a patient died he would be given a score below all other scores; or many laboratory tests have limits of detectability). What I’m referring to is doing a statistical test like the Mann-Whitney or Wilcoxon test. Crudely put, one ranks all the data and does a t-test or ANOVA analysis on the ranks. It isn’t exactly that, but that is the underlying approach. If I need to compute power for such cases I would do my best to convert the power analysis into a parametric power analysis. I might replace the means with medians. Alternatively, for the power analysis I might transform the data to make it more normal (e.g., log transformation). In any case, the sample size for ordinal analyses and parametric analyses would be almost identical.
Dichotomous data would replace the effect size above with the two proportions (treatment and the control success rates). One doesn’t need the sd, as the standard deviation of a proportion is known and solely related to the proportion of successes. The two success rates have a known relationship with N/group. A larger study is needed when the average is around 50%. The more the two proportions approach 0% (or 100%), the smaller the study needed. Of course, the bigger the difference in proportions the smaller the N/group. Analyses using dichotomous data require larger Ns than parametric analyses, see the next blog.
Other approaches (e.g., power for survival analysis) are beyond the scope of this blog.
How do you actually do a power analysis? Can anyone do one?
I used to write my own programs (in FORTRAN or C). Currently I use validated programs like SAS (www.sas.com) or nQuery Advisor (www.statisticalsolutionssoftware.com). There are many others (especially for specialty methodologies [e.g., interim analyses or testing for inferiority]). nQuery Advisor is very simple to use and it handles many types of statistical power analysis. So an inexperienced user could easily use it. However as it currently costs about $1,300 to own nQuery and many times more than that for SAS, it might be a lot cheaper to have a statistician run the power analysis. I could knock off a power analysis with 10 variations and a simple report in two hours or less.
In sum, a literature review and a power analysis are the essential first and second steps for any study. They can easily save millions of dollars. The inputs into the power analysis are either routine (alpha level, power of the trial) or readily available (expected treatment difference, standard deviation). It is strongly recommended that multiple views of the power (cost) of a trial be undertaken, as it is very easily done. Finally, I recommend that the effect size selected be on the small end of possible effects.
In my next blog ‘Dichotomization as a devil’s tool’, I will suggest that one shouldn’t take a continuous parameter and dichotomize it to generate a success failure (e.g., a weight loss of ≥ 10 pounds is a ‘success’ while < 10 pounds a ‘failure’).
There are two types of people, those who classify people into two types of people and those who don’t.
Never trust anyone over thirty.
As Mason said to Dixon, ‘you gotta draw the line somewhere’.
***
Don’t get me wrong, there are many places you need to draw a dichotomy. For example, you need to exclude patients. Obesity (a body mass index [BMI] >30) might be an exclusion criterion. Similarly you might want to exclude asymptomatic patients (patients whose key parameter at baseline is less than x). What I’ll be talking about is dichotomizing the dependent variable.
Nor am I talking about a natural dichotomy. Examples of real dichotomies are alive or dead; male or female. However, even real dichotomies are often fraught with major difficulties. For example, in one mortality trial we dealt with mortality within 30 days of initial treatment for patients with life-threatening trauma. Unfortunately, some of those patients were alive only by dint of extraordinary medical intervention (i.e., their loved ones refused to allow the brain dead patients to pass on).
I shall be discussing the effects of drawing a dichotomy. I will assume that there is a continuous parameter which is separated into two parts. Patients who are above a value x and patients who are at or below x. As an example, take this quote “That study, first reported in 2008, found that 40 percent of patients who consumed the drink improved in a test of verbal memory, while 24 percent of patients who received the control drink improved their performance.” They are obviously not talking about a 40% increase from baseline, but the percentage of patients with any improvement, a dichotomy. Also note that the split into improved/not improved wasn’t at 50%.
The best way to dichotomize is to use the published literature and some expert’s dichotomy. Unfortunately, when you delve deeper into why they selected the cutoff, it is often quite arbitrary and not at the optimal 50/50 split. If you’re dichotomizing the data yourself, for reasons given below, try to dichotomize at the median, so you have a 50/50 split between the two groups. Why do I suggest using an ‘expert’ cutoff? It is possible to data dredge and find a split which presents your data to the maximum benefit. The FDA might ‘red flag’ any analysis with an arbitrary cutoff. At a minimum, if you plan on dichotomizing, then state how the cutoff was/will be derived in the protocol (or analysis plan). If you create the cutoff criteria post hoc, it will be completely noncredible.
Let me first discuss why people like to dichotomize a parameter. It’s dirt-easy to understand. Everyone feels they understand proportions. If you dichotomize then you classify the world into winners or losers (successes or failures). It is easy to think in terms of black and white. As I implied in another blog, people might not understand what the average on an esoteric parameter (e.g., verbal memory test) represents. But a difference of 18% is understood by all.
My objections to such an easy to understand statistic? Let me make a list:
Power
In a previous blog, I pointed out that effect size and correlation are different versions of the same thing (see the equation between effect size and correlation in ‘4. Meaningful ways to determine the adequacy of a treatment effect when you lack an intuitive knowledge of the dependent variable’). If you have a correlation and dichotomize a variable, how does that affect the size of the new correlation (with the dichotomized data)? Well, it is the relationship between the biserial and point-biserial correlation. The undichotomized relationship (biserial correlation) is ALWAYS larger than the dichotomized relationship (point-biserial correlation). How much larger depends on the proportion of patients in the two halves of the dichotomy and the ordinate of the normal curve at the cut point. How much larger? If one forms the dichotomy at the ideal 50% then the continuous data would have a 25% larger correlation. If the dichotomy is not in the middle, then the gap will be even larger. For example, a 90%/10% split would make the undichotomized correlation about 70% larger than the dichotomized one.
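The 25% and 70% figures can be checked with a few lines of code. This is a sketch that assumes the underlying variable is normally distributed and uses the textbook relationship point-biserial = biserial × ordinate / sqrt(p·q); the function name is invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def dichotomization_factor(p):
    """Ratio of point-biserial to biserial correlation for a cut with
    proportion p on one side (assumes an underlying normal variable)."""
    q = 1 - p
    z = NormalDist().inv_cdf(p)      # cut point on the standard normal scale
    ordinate = NormalDist().pdf(z)   # height of the normal curve at the cut
    return ordinate / sqrt(p * q)


for p in (0.5, 0.9):
    factor = dichotomization_factor(p)
    # 1 / factor is how much larger the undichotomized correlation is.
    print(f"split {p:.0%}: factor {factor:.3f}, continuous r is "
          f"{(1 / factor - 1):.0%} larger")
```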
How does the minimum 25% lower correlation affect N or power? Obviously, the number of patients you must enroll will have to be bigger. But how much bigger? Let me assume that we ran a power analysis and came up with a few sample sizes. As we discussed in the Power Analysis blog (‘8. What is a Power Analysis?’), if we know alpha (I will use 0.05, two-sided below) and power (using 80%), we can deduce the effect size given a number of patients enrolled per group. So, I can input various Ns (first column below) and get the detectable effect size (δ) in the second column below. I can then see the effect of the optimal (50%) and more extreme (90%) dichotomy and then determine its effect on the effect size (δ) in the third and sixth column, on the new number of patients needed (fourth and seventh column) and how these increased Ns relate to the original number of patients in the continuous data (fifth and eighth column).
                        Dichotomize at 50%                  Dichotomize at 90%
N/group   Continuous δ  δ       N/group   Increase (%)      δ       N/group   Increase (%)
25        0.809         0.580   48        92                0.396   102       308
50        0.566         0.427   88        76                0.301   175       250
75        0.460         0.354   127       69                0.252   249       232
100       0.398         0.309   166       66                0.222   320       220
200       0.281         0.221   323       62                0.160   615       208
300       0.229         0.181   481       60                0.132   902       201
400       0.198         0.157   638       60                0.114   1,209     202
500       0.177         0.140   802       60                0.102   1,510     202
750       0.145         0.115   1,188     58                0.084   2,226     197
1,000     0.125         0.099   1,603     60                0.073   2,947     195
2,000     0.089         0.071   3,115     56                0.052   5,807     190
4,000     0.063         0.050   6,281     57                0.037   11,468    187
8,000     0.044         0.035   12,816    60                0.026   23,223    190
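As a rough cross-check of the ‘Continuous δ’ column, the detectable effect size for a given N/group can be approximated with normal quantiles. This is only a sketch: it uses the normal rather than the t distribution, so it comes out slightly smaller than the tabled values when N is small, and the function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def detectable_delta(n_per_group, alpha=0.05, power=0.80):
    """Approximate standardized effect size detectable with n_per_group
    patients in each of two groups (two-sided test, normal approximation)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sqrt(2 / n_per_group)


for n in (25, 100, 1000, 8000):
    print(n, round(detectable_delta(n), 3))
```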
What do we get? When we dichotomize the effect size decreases. It always decreases. At the best, we’d have to increase the sample size by 60% to compensate. However smaller studies would need to almost double in size. Again that is the best. If the dichotomy weren’t at 50%/50%, say 90%/10% then we’d need to increase the sample size by 190% or almost a threefold increase in N. For example, in a very small trial (N/group = 25) we would need to go from 25 patients per group to 48 when we have the optimal 50/50 split. When the split was 90/10, we would go from 25 to 102, over a four fold increase in N.
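The large-sample ends of those ranges (roughly 60% and 190%) can be approximated directly. The sketch below assumes that, for small effects, the detectable effect size shrinks by the same factor as the correlation and that N is proportional to 1/δ²; both are simplifications, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def sample_size_inflation(p):
    """Approximate fold increase in N after dichotomizing with proportion p
    on one side (large-N, small-effect approximation)."""
    z = NormalDist().inv_cdf(p)
    factor = NormalDist().pdf(z) / sqrt(p * (1 - p))  # shrinkage of r
    return 1 / factor ** 2                            # N scales as 1 / delta^2


for p in (0.5, 0.9):
    fold = sample_size_inflation(p)
    print(f"split {p:.0%}: {fold:.2f}-fold, i.e. {fold - 1:.0%} more patients")
```

The approximation ignores the extra penalty seen in the small trials at the top of the table.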
Why?? Well, we’re throwing away a lot of information. Say we consider a 10 pound weight loss as a success. A 10.1 pound weight loss is being treated the same as a 40 pound weight loss. A weight loss of 9.9, despite being only a minuscule 0.2 pounds from 10.1, is a failure. That 9.9 is being considered the same as a weight gain of 10 pounds. That is, all weight gains and all weight losses of less than 10 pounds are treated identically. All weight losses of 10 pounds or greater are also treated identically. We ignore all that information. Why the difference between a 50/50 split and a 90/10 split? With the latter, most of the data are identical. There is nothing to differentiate the lowest 1% from the 89th percentile. 90% of the data are identical. The only new information is seen at the top (or bottom) 10%. With the optimal 50/50 split, half the data is different from the other half.
Interval, Ordinal and Nominal Data
Let me say it again. What is the difference between a 10.1 pound weight gain and 40.5 (please shut your 3rd grader up, it’s not 30.4 pounds)? It’s zero. What about the difference between a 20 pound weight gain and a 9 pound weight loss? Your fifth grader would be wrong about 29 pounds, it would again be zero. We give up all ‘interval’ level information. In such a dichotomous analysis, we shouldn’t present the average (we are not allowed to add the weights together). Nor do we consider even the order of the numbers. A forty pound weight loss is not more than a thirty pound weight loss, which is not more than a ten pound weight loss. A nine pound weight loss is actually less than a ten pound weight loss, but it is not more than a one pound weight loss or even a twenty pound weight gain. We are ignoring almost all ordered information (‘ordinal’). We should not present median weight loss. We are not allowed to order the weights to find the middle value. We only have nominal information. We are throwing away a lot of information.
With our throwing away all this information, it is staggering that we need to increase our sample size by only 60% (the optimal value, although it could be a four fold increase under nonoptimal conditions).
Large N Assumption
When you have two groups and a success/failure dependent variable, the analysis of choice is the 2×2 contingency table (e.g., a Chi-square test). The Chi-square test is not appropriate when any expected cell count is less than 5. SAS prints out the following warning: “x% of the cells have expected counts less than 5. Chi-Square may not be a valid test.” That means, at best, we need a sample size of at least 10 patients per group or 20 total for a two group study (assuming that success/failure is 50%/50%). If failure (e.g., many AEs or mortality) is rare (e.g., 10%), then we’d need 50 patients per group at a minimum.
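The arithmetic behind the 10 and 50 patients per group is simply ‘expected count in the rarest cell ≥ 5’. A minimal sketch, assuming two equal-sized groups and a common event rate (the function name is invented for illustration):

```python
from math import ceil


def min_n_per_group(rarest_rate, min_expected=5):
    """Smallest N/group so that the rarest outcome has an expected count of
    at least min_expected in each group (equal group sizes assumed)."""
    # round() guards against floating-point noise before taking the ceiling
    return ceil(round(min_expected / rarest_rate, 9))


print(min_n_per_group(0.50))  # 50%/50% success/failure
print(min_n_per_group(0.10))  # rare failures, e.g. 10%
```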
The analogue for the analysis of variance test is the logistic regression test. One of its key assumptions is something called ‘asymptotic normality’. What that means is that it assumes that the Ns need to be quite large. Logistic regression routinely uses hundreds of observations. ‘Nuff said.
Dichotomization (or nonparametric statistics, in general) is NOT the viable alternative when you’re concerned with small samples. See my blog ’19. A Reconsideration of my Biases’ when I consider small samples for ordinal data.
Types of Problems
Most clients do more than classify the patients as in treatment group A or B. Patients are assigned to different centers. We have male and female patients. We look at the patients weekly (or monthly). We have slightly different etiologies for the patients. They differ by age. That means factorial designs with treatment, gender, time, etiology, and age, along with all the interactions. Let me take a simple case: we have two treatments and look at the patients at weeks 1, 2, 4, and 8. In the continuous world, we would do a two-way ANOVA, with factors of treatment and time. Key in this type of analysis is what is called the interaction. We would expect that the treatment difference would increase over time. For example, at week 1, with little time for a response, the treatment difference is expected to be very small. As time increases, the difference would increase (although not in a consistent manner). Perhaps the largest difference would be seen at the study’s endpoint (8 weeks). This interaction is the key result in such a trial. I’m not going to elaborate here, but interactions are a horror to analyze in logistic regression and even worse to interpret. Without interactions, there is almost no point in doing a multifactor study. And please don’t even get me started on how to handle correlated data (e.g., repeated measurements over time) – the only assumption which blows away the significance test (see ‘7. Assumptions of Statistical Tests’) and which you can’t realistically design around.
Horror Story: In the CHMP submission for Vyndaqel or tafamidis meglumine, the key co-primary d.v. was a dichotomized worsening from baseline score (NIS-LL) >2 or study dropout. The bottom line: the results were not statistically significant (p = 0.0682). The p-value for an ‘ancillary analysis’ of the continuous change from baseline at study endpoint was p = 0.0271. The FDA rejected the submission for efficacy reasons. However, the CHMP clearly bent over backwards to provisionally accept the application for this orphan drug as the sole available licensed treatment for TTR-FAP. The dichotomized d.v. was not significant, but the continuous d.v. achieved statistical significance.
Conclusion: I agree that dichotomizing data into success and failure makes interpretation much easier. However, to plan a trial for a dichotomy would necessitate at a minimum a 60% increase in patients. A small study would, at best, need to be doubled in size. If the split into the two groups is not the ideal 50/50, then the increase would need to be much larger. A statistical analysis of a dichotomy also requires a large N. It also makes factorial designs almost impossible to analyze or interpret.
Recommendation: If simplicity of interpretation is desired, then analyze the data as a continuum, but present (descriptive [no p-values or CI]) summary tables with the dichotomy. I personally often relegate variables into three categories: primary, secondary and tertiary. Tertiary variables (like dichotomies) would be presented only descriptively – no (inferential) statistical analyses.
Dichotomization is the most restrictive form of nonparametric analysis. I’ll say more about the ordinal form of nonparametric analyses in ’10. Parametric or nonparametric analysis – Why one is almost useless’.
… ‘If you lost your watch in that dark alley, why are we looking here?’ ‘Well, <hic> there’s light here.’ (old chestnut)
***
In my last blog, I stated that we should avoid dichotomizing as it throws away a lot of information in the data. Specifically it ignored the intervals in the data, precluding computing/presenting the mean, and it ignored the order of the data, precluding medians. The surprising thing is that we can compensate for this massive loss of information by increasing the sample size by only 60% (or up to four or more fold depending on assumptions).
I also stated that many nonparametric tests are basically the old t-tests or ANOVAs but replacing the observed data by ranks and analyzing the ranks. So, if you had data like 0.01, 0.03, 0.07, 1.0, and 498.0 ng/ml, you would be analyzing the numbers 1, 2, 3, 4, and 5. The difference between the traditional parametric t-test on the ranks and the Mann-Whitney nonparametric test is that the nonparametric test knows the variability, while the t-test has to estimate it. That is, any 5 unique values of the data would have ranks of 1, 2, 3, 4, and 5. Hence, the variability of ranks is known mathematically and doesn’t need to be estimated. [Note: Ties can also be compensated for.]
Statistical approaches which only assume ordering of the numbers have far less restrictive assumptions than those which assume that differences between any two adjacent numbers are the same. The Mann-Whitney nonparametric test doesn’t need to assume (or test) normality, nor does it need to assume equal variances (i.e., heteroscedasticity is not a concern). So, if a test were almost as good, but it relied on a less restrictive set of assumptions, then we should use the approach which had less restrictive assumptions. Shouldn’t we?
As suggested in my last blog the real question might be its effect on power. Actually for rank or ordinal data the power doesn’t meaningfully go down even when the parametric analysis would have been appropriate. My first nonparametric text book, Nonparametric Statistics by Sidney Siegel, observed that the efficiency of the Mann-Whitney when compared to the t-test is 95%, even in small sample sizes. So, when a t-test could have been run with 100 patients, but the Mann-Whitney was used, it would need only 5 more patients. Trivial!
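The ‘5 more patients’ follows from dividing the t-test sample size by the relative efficiency. The sketch below uses 3/π (about 0.955), the asymptotic relative efficiency of the Mann-Whitney test versus the t-test when the data really are normal; treating that asymptotic value as if it held at every N is an assumption made for illustration.

```python
from math import ceil, pi

# Asymptotic relative efficiency of Mann-Whitney vs. the t-test under normality.
ARE_NORMAL = 3 / pi


def mann_whitney_n(n_ttest, efficiency=ARE_NORMAL):
    """Approximate N needed by the Mann-Whitney test to match a t-test
    that would have needed n_ttest patients."""
    return ceil(n_ttest / efficiency)


print(round(ARE_NORMAL, 3))
print(mann_whitney_n(100))
```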
One loses a lot of power by dichotomizing the data. But loses almost nothing when you can order it. [Note: BTW, all methodologies will suffer when there is a marked number of ties (e.g., zeros). Of course, with dichotomization, all data will be ties!]
Yet, throughout these blogs I have been promoting parametric testing. That is, presenting t-tests and ANOVAs. Why? The power is almost the same. The assumptions are less rigorous. Why?
The answer is easy. If you feel that the beginning and end of an analysis is a p-value, then nonparametric testing is best. Stop. Don’t do anything more. Many, many publications only present the nonparametric tests. By this time, dear reader, I hope you do know better. If not, start at my blog #1 and read the first four blogs again. I’ll wait.
[No, I mean it, reread them! This blog isn’t going anywhere.]
From the beginning of these blogs, I pointed out that p-values were not the beginning and end of the analysis. The most important thing you can get from a statistical analysis is not whether the two treatments are different (of course they’re different!), but how much they differ.
Let me return to the Mann-Whitney (or Wilcoxon) test. As I mentioned above, we are computing the average rank for the two treatment groups and comparing them. In a recent analysis for three groups, the Wilcoxon test gave the mean ranks (each group had 10 animals) as 18.0, 15.7 and 12.8. Is the mean rank ever useful? Sorry, no. I have NEVER, EVER included the mean rank in any statistical report. Never. Would it be reasonable to present mean differences and confidence intervals? Nope, that is not appropriate when we believe the data is ordinal not interval. Can we report the medians? Yes, yes we can. Unfortunately, this statistical test does not use the medians, it does not analyze medians, it does not compare medians.
Another reason I strongly favor parametric testing is that one can more easily test more complicated analyses. For example, looking at treatment differences over time, or by controlling for irrelevant or confounding factors (age, gender, baseline severity). Nonparametric testing has little provision for handling analyses with time in the model or interactions among confounding factors.
Nor can you use any results from a nonparametric analysis to plan for future studies. I’m not aware of any way to compute the power (sample size) for a future study based on data from the nonparametric test.
A final reason I strongly favor parametric testing is that the assumptions for parametric data can be trivially designed away (e.g., use N/group of at least twenty, use almost equal number of patients in either treatment group, and/or transform the data). See blog ‘7. Assumptions of Statistical Tests’.
Post-Initial Publication Note: I recently learned that the statistical program I use, SAS, recently implemented a methodology to compute confidence intervals of the differences. They are called Hodges-Lehmann estimators. Therefore, it is possible to get 95% confidence intervals and the interval midpoint. I plan on including these Hodges-Lehmann estimators whenever I do nonparametric testing. However, this does not mitigate my objections: 1) the statistic being tested (the average rank) is not the statistic being reported, 2) one cannot do complicated analyses (e.g., covariates, two-way designs), 2a) even the Hodges-Lehmann estimators can only be done when there are only two treatments, 3) one cannot compute power, and 4) one can still design away most objections which might make us want to avoid parametric statistics.
So, is there any place for nonparametric testing? Definitely, but I’ll say more about that in my next blog – 11. p-values by the pound.
‘Failure is always an option’ – Myth Busters
‘The Statistician is in. 5¢ per p-value.’
‘Your statistical report is being delivered by three UPS trucks.’
‘It takes three weeks to prepare a good ad-lib speech.’ – Mark Twain
***
Let me go on a tangent.
Steve Jobs recently died. I was thinking of the things I do in life. And things I am incapable of doing. I could never do the things he did. There are two types of people. (<*Cough*> Actually I personally think of characteristics along a continuum. <*Cough*>) The two types of people are those people who BELIEVE and those who think that truth is relative and only partially known. To return to Steve Jobs, he BELIEVED in his goals. He was maniacally dedicated to accomplish his vision, create his product, and get it out of the door. Great CEOs are like that. Don’t get me wrong. All members of the medical team are fully dedicated to produce the best product and to do it in the shortest time possible. But great statisticians have worked on hundreds of projects. Some projects succeed and some fail. We aren’t personally responsible for the failures. If a drug doesn’t work or if it’s unsafe, that is the probabilistic nature of our research. ‘Failure is always an option.’ The FDA rejected a blood product I worked on because it wasn’t as safe as giving saline. A ‘jiggling’ device a woman stood on to increase her bone density didn’t work. An oxygenated liquid which floated diseased lung tissue of patients dying of upper respiratory failure did more harm than good. While we statisticians might look at subsets of the data or alternative analyses, to ensure we did the analysis correctly, we don’t take the drug’s failure personally. If a CEO makes a mistake, they take it personally. They drive their people to prove that the treatment worked. They make the decisions. They will frequently look at data subsets and alternative analyses to demonstrate efficacy.
The FDA knows this. Publication editors know that it is by the number of publications that a person gets tenure and higher pay. They are intrinsically distrustful. If I am an external scientist, I would also be distrustful. To put this another way, if you were reviewing a report from a competitor, what would you like to see? Would you blindly trust their findings?
As an external scientist, what would I demand to see? The first and second things I would request are the protocol and analysis plan. If the protocol/SAP didn’t state the key hypothesis (key parameter and key time point) and how they planned on testing it, I’d be suspicious, deeply suspicious. If they switched the key parameter, I’d go back to the report and see if they presented the analysis using the originally planned parameter.
Let me tell you of a recent analysis and what it included along with the rationale of each component. All the below were in the appendix (i.e., not the main body of the report).
This last week I noticed on a statistical consultant’s blog a colleague who reexamined a barely statistically significant result with twenty patients. He elected to examine the robustness of this finding by eliminating each patient separately and computing the p-value for each reanalysis. Other statisticians recommend avoiding the parametric assumptions by repeatedly resampling N patients with replacement from the observed data (for example, 30,000 replications), computing the p-value of each resample, and then summarizing the mean p-value.
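The first of those robustness checks, deleting one patient at a time, can be sketched as follows. This is not the colleague’s actual code: the data are invented, the pooled two-sample t statistic is my choice, and instead of computing each p-value the sketch compares |t| with the tabled two-sided 0.05 critical value for 17 df (19 patients remain after each deletion).

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def leave_one_out_t(a, b):
    """t statistic after deleting each observation in turn."""
    stats = [pooled_t(a[:i] + a[i + 1:], b) for i in range(len(a))]
    stats += [pooled_t(a, b[:j] + b[j + 1:]) for j in range(len(b))]
    return stats


# Hypothetical data, 10 patients per group, invented purely for illustration.
treated = [12.1, 9.8, 11.4, 13.0, 10.2, 12.7, 11.9, 10.8, 13.4, 11.1]
control = [10.0, 9.1, 10.9, 11.2, 9.5, 10.4, 11.8, 9.9, 10.6, 10.1]

T_CRIT_DF17 = 2.110  # tabled two-sided 0.05 critical value of t with 17 df

loo = leave_one_out_t(treated, control)
fragile = sum(abs(t) < T_CRIT_DF17 for t in loo)
print(f"full-data t = {pooled_t(treated, control):.2f}")
print(f"{fragile} of {len(loo)} single-patient deletions lose significance")
```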
What is the net result of all these analyses? For this study of 45 rats, the statistical appendix was five hundred pages long, some pages with thirty p-values. I should mention that I usually throw in the complete output of the analysis. If this were up to NDA standards I might extract only the key results for a photo-ready concise summary table. However, it takes considerable time and effort to create a simple brief report, or as Mark Twain observed, “It takes three weeks to prepare a good ad-lib speech.” When you look at all the key, secondary and tertiary analyses, the supportive, exploratory, sensitivity analyses; when you look at the tests of the assumptions, there may be thousands of p-values. Most of which are only incidental to the single key hypothesis of the study. But all are important to a skeptical scientist/reviewer.
In my next blog I will discuss multiple observations – Multiple observations and crossover trials as a devil’s tool (and challenge the devil to a fiddle contest).
‘Take two, they’re small’
***
Are the results from small, but statistically significant, studies credible?
One of the American Statistical Association’s subsections is for Statistical Consultants. A short time ago, there were over fifty comments on the topic of ‘Does small sample size really affect interpretation of p-values?’ The motivation came from a statistician who went to a conference where “During the discussion period a well-known statistician suggested that p-values are not as trustworthy when the sample size is small and presumably the study is underpowered. Do you agree? The way I see it a p-value of 0.01 for a sample size of 20 is the same as a p-value of 0.01 for a sample of size 500. … I would like to hear other points of view.”
More often than not, having a small sample size would preclude achieving significance (see my blogs on ‘1. Statistics Dirty Little Secret’ and ‘8. What is a Power Analysis?’). When N is small, only very large effects could be statistically significant. In this case, it was assumed that the p-value achieved ‘statistical significance’, p < 0.01. There was considerable discussion.
Many statisticians felt that the small sample size (e.g., 20) would not be large enough to test various statistical assumptions. For example, to test for normality typically takes hundreds of observations. A sample size of 20 lacks power to test normality, even when the distribution is quite skewed. So, even though the p-value was ‘significant’, tests of the assumptions are not possible, hence the p-value is less credible. Or worse, if the data were actually nonnormal and the sample size is small, the t-test is not appropriate. Hence the p-value is not appropriate.
Other statisticians observed that when N is small, the deletion of a single patient’s data can often reverse the ‘statistically significant’ conclusion. Thus the result, when N is small, is quite ‘fragile’, not very generalizable.
There was a discussion of supplementing the parametric test with nonparametric testing to avoid the many parametric assumptions (see my blog ’11. p-values by the pound’). They suggested we include what are often called ‘Exact’ tests. That statistician observed that even the ‘Exact’ tests are sensitive to small changes in the data set. One dilemma here is what if the other tests were not statistically significant. ‘A man with one watch knows what time it is. A man with two watches is never quite sure.’ So if the t-test was statistically significant, but the Wilcoxon test was not statistically significant, what would we conclude?
BTW, the statistician who initially asked the question changed his mind and is now leaning toward being a bit more hesitant in believing the ‘statistically significant’ p-value when the N is small.
What do I conclude? Same as I’ve been saying all along.
My next blog will discuss multiple observations and statistical ‘cheapies’.
‘Measure once cut twice, measure twice cut once’
‘A man with one watch knows what time it is, a man with two is never sure’
The three words which get everyone’s attention: Free, Free, Free
***
Multiple observations occur in a variety of ways.
Multiple Items (replications)
Think of a Quality of Life (QoL) scale. The best ones ask the patient/caregiver/physician a variety of questions. They then add up the individual questions to get a subscale or total score. Why is this useful? Well, statistical theory can demonstrate that the ability of a scale to measure anything is related to the number of ‘parallel’ items which comprise its score. A two item total is always better than using one item to measure something. How much better? Let us first define terms. When I talked about the ‘ability of a scale to measure anything’, the crudest thing a scale can measure is itself. How well a scale measures itself is a limit on how well it can measure anything else. What do we mean by measure itself? Say we measure the characteristic and then do it again. We could correlate the two measures. This is called the scale’s reliability (often abbreviated r_{x,x’}). There exists an equation (the Spearman-Brown prophecy formula) to determine the scale’s reliability when you increase the number of items by a factor of n:
r_{n,n’} = n r_{1,1’} / (1 + (n − 1) r_{1,1’}), where r_{1,1’} is the reliability of a single item, n is the number of items in the scale and r_{n,n’} is the reliability of the n-item scale. See below for a broader interpretation of ‘1’ and ‘n’.
Let me illustrate this by doubling the number of items (e.g., going from one item to two). If the original reliability of the single item was 0.50, then by adding a second item and using both, the new subscale would have a reliability of 0.667. The better the reliability the better the ability of the scale to correlate to anything. How much better? Well, if you square a correlation, it gives the amount of variance which could be predicted. For example, with the one item scale, 0.50^{2} indicates that 25% of the variance of a second measurement can be predicted by the first. On the other hand, if one used a two item scale, 0.667^{2} indicates that 44.5% can be predicted. A huge improvement over 25%! Or to put it another way, by increasing the reliability, you directly increase the power of a scale to show a treatment effect. Hence, increasing the reliability, will directly reduce the error variance in a study and therefore, increase the effect size. What may not be obvious from the above equation is that the reliability increase is not linear but has diminishing returns. As mentioned above, if a single item had a reliability of 0.5, 2 items would increase to 0.667, three items a reliability of 0.75, four items – 0.8, five items – 0.833, ten items – 0.909, one hundred items – 0.990.
I initially used the example of a QoL scale, but the above formula is true for anything. For example, having 3 raters (and using their average) is better than using only 1. Rating the total size of all knuckles is a better measure of arthritic severity than measuring only one joint. It is better to measure blood pressure three times than only once.
BTW, I stated that r_{1,1’} is the reliability of a one item test; strictly speaking, that need not be true. It could actually be the reliability of a test of any length, with n being how much larger (or smaller) the new scale would be. Nor does n have to be an integer. So if you are using the average of two raters and want to see the effect of using five raters, then n would be 2.5.
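The Spearman-Brown formula is a one-liner, so the reliabilities quoted above are easy to reproduce. In this sketch the function name and the starting reliability of 0.60 used in the two-to-five-rater example are assumptions for illustration only.

```python
def spearman_brown(r, n):
    """Reliability of a scale n times as long as one with reliability r.
    n may be any positive number, not just an integer."""
    return n * r / (1 + (n - 1) * r)


# Single-item reliability of 0.5, lengthened to 2, 3, 4, 5, 10 and 100 items.
for n in (2, 3, 4, 5, 10, 100):
    print(n, round(spearman_brown(0.5, n), 3))

# Going from the average of 2 raters to 5 raters: n = 5 / 2 = 2.5.
# The starting reliability of 0.60 is a made-up value.
print(round(spearman_brown(0.60, 2.5), 3))
```

The diminishing returns show up immediately: most of the gain comes from the first few added items.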
All this is predicated on the different items/raters measuring the same thing. One recent study I was involved in examined the inter-rater correlations of three raters. Unfortunately one of the raters correlated almost zero with the other two. The suggested takeaway for that study is the need to train (standardize) the ‘judges’.
Summary: Having a total score of n items (or replicate measures) is always better than using a single item (one observation, one rater, or one item scales). Better is defined as smaller error variance, hence larger effect size, hence smaller study size. As patient recruitment is typically hard and it is easy to get patients or physicians to fill out a short questionnaire (or have the physician rate them on multiple related attributes), the study will directly benefit by decreasing noise (errors). All measurements contain error. It is the scientist’s job to reduce that error.
Multiple Dependent Variables
It is seldom the case where a medical discipline has focused on one and only one dependent variable. In discussions with my clients I always attempt to identify one key parameter which the study can rest on. I call this the key parameter, for obvious reasons. I usually do a complete and full analysis of this parameter. Of course, I’ll do inferential tests (p-values), but I’ll also do assumption testing (e.g., normality, interactions), test different populations, supportive testing, backup alternative testing (e.g., nonparametrics), graphical presentations, etc. Following the key parameter are the secondary and tertiary parameters. The secondary parameters also merit inferential tests, but may lack alternative populations, tests of assumptions, etc. Finally, the tertiary parameters I tend to present only descriptively. I should note that the key parameter is defined by the key time point. Typically the key parameter at a secondary time point is relegated to secondary status.
But what if my client has two or three parameters for which they can’t make up their mind about. Well we can call them the key parameters. Is there any cost for this? Yes. One simple way of handling this is to take the experiment’s alpha level (typically 0.05) and split it equally among the different primary parameters. Se we can test two key parameters but not at 0.05 but 0.025 each. This is the Bonferroni approach. Does it make the study half as powerful? Do you need to double the N? Nope! For a simple comparison, when N is moderate (e.g., > 30), one would typically need a critical ttest of 2.042 for a 0.05 twosided test to be statistically significant. For a 0.025 twosided test, that is, for 2 key parameters, one would need 2.360. That is a 15.6% larger ttest. To compensate, one would need to increase the sample size by 33.6%. If the initial sample size were larger the limit on the increase in sample size would be 30.7% larger. In other words, the cost of doubling the number of key parameters is not 100% increase in N, but roughly thirtyish percent for a small to moderate sized trial.
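The percentages above can be checked with a few lines. This is only a sketch under a simplifying assumption: that the required N scales with the square of the critical value, ignoring the power term of a full sample size calculation. The two t values are the ones quoted above; the large-sample limit uses the normal distribution from Python's standard library.

```python
from statistics import NormalDist

# Critical values quoted in the text for a moderate-sized trial.
t_05, t_025 = 2.042, 2.360

ratio = t_025 / t_05            # how much larger the critical t must be
n_increase = ratio ** 2 - 1     # N grows with the square of that ratio

# Large-sample limit: replace the t critical values with normal ones.
z = NormalDist()
z_05 = z.inv_cdf(1 - 0.05 / 2)
z_025 = z.inv_cdf(1 - 0.025 / 2)
n_increase_large = (z_025 / z_05) ** 2 - 1

print(f"critical value is {ratio - 1:.1%} larger")
print(f"sample size increase (moderate N): {n_increase:.1%}")
print(f"sample size increase (large N):    {n_increase_large:.1%}")
```

Under this approximation the large-sample figure lands at roughly 31%, in line with the ‘thirtyish percent’ conclusion.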
There are even ways to reduce this! One can use something called an improved Bonferroni and test the larger difference by an alpha level of 0.025, like before, but the second (of two) parameters at 0.05. This isn’t just a cheapy, but a freebe. Nevertheless, I’d still power the trial using an alpha level of 0.025, not 0.05, when dealing with two key parameters.
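One common procedure matching this description is Holm's step-down method; I am assuming that is what ‘improved Bonferroni’ refers to here, and other variants exist. A minimal sketch (the function name `holm` and the example p-values are illustrative only):

```python
def holm(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans, True where the hypothesis is rejected.
    The smallest p-value is tested at alpha/m, the next at alpha/(m-1),
    and so on; testing stops at the first non-significant result.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # everything with a larger p-value is also retained
    return reject


# Two key parameters: the first is tested at 0.025, the second at 0.05.
print(holm([0.02, 0.04]))
print(holm([0.03, 0.04]))
```

With p-values of 0.02 and 0.04, plain Bonferroni would reject only the first; the step-down version rejects both. With 0.03 and 0.04, neither is rejected, because the larger difference must first clear 0.025.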
Multiple Time Points
In a previous blog (Assumptions of Statistical Tests), I pointed out that analyses of repeated measurements HAVE TO BE HANDLED IN A VERY SPECIAL WAY or AVOIDED AT ALL COSTS. I pointed out that Scheffé demonstrated that with a modest (and, if anything, underestimated) 0.30 autocorrelation, the nominal 0.05 alpha level actually is 0.25. I said that autocorrelation DESTROYS the alpha level. I’ve seen autocorrelations of 0.90 and higher. Let me briefly demonstrate why. Let me look at a patient’s right foot and get their shoe size. Is it useful to look at their left shoe? Of course not. Knowing one will tell us the other. Measuring both the left and right shoes is redundant. One doesn’t have two unique pieces of information, one has one. That would be the case of a correlation between two variables (left and right shoe sizes) of 1.00. [Note: The first section of this blog on multiple items is never a case of correlations of 1.00. Correlations of 0.2 and 0.3 predominate.] If, and only if, the correlation were much lower would one be justified in measuring both. Statistical theory mandates that errors must be uncorrelated. When one talks about multiple time points, the correlation is seldom zero. As I mentioned in that previous blog, unfortunately, the only viable solution is to use statistical programs which handle correlated error structures or to avoid having multiple time points in the analysis. As an example of the former, I suggested using SAS’s PROC MIXED with an autoregressive structure with one term (AR(1)) or, with unequally spaced intervals, a structure called spatial power, although other structures are often used. As an example of using only one time point, I would suggest taking the baseline and key time point measurements (often the last observation), taking their difference (e.g., call it improvement), and analyzing improvement at the key time point.
Multiple Active Treatments
I am not going to say much of that here, I’ll leave that to a separate blog. However, I will point back to the improved Bonferroni above.
My next blog will discuss Great and Not so Great Designs.
Contralateral Design
This is a very sweet type of study design, but it can seldom be used in pharma or biotech. Devices are another story. With only one group (rather than two) and the statistical properties of the standard error of the difference between means, right off the bat the contralateral trial is one quarter the size of the parallel groups trial. Say we’re doing an acne study. One could treat part of a person (e.g., left side of their face) with one treatment and another part (e.g., right side) with the second treatment. Of course, the treatments would be randomized for side. The main dependent variable would be the difference between the two treatments and one can analyze it using a simple paired t-test. One could be a bit fancier and do it with an ANOVA modeling treatment, side, and patient. Why is this a good design? Well, let’s assume that the location (e.g., left or right side) has a minuscule effect. What would go into a person’s treatment ‘A’ measurement at the study’s endpoint? First off, there would be the individual’s characteristics, for example, age, gender, race, in other words their demography. There would also be their predisposition, e.g., their baseline severity and skin oiliness. But when you’re comparing sides A and B, those would be the same, so all the patient’s generic uniqueness drops out when one takes the difference. Other factors could be the flare or remission of the acne, and also any environmental effects (e.g., for acne: season is a large effect, so are diet, hormones, stress, and other temporal factors). But since those are also the same for both sides, the difference in time and environmental effects is also nullified. Simply put, a contralateral design controls for a tremendous amount of between and within patient error variance or noise. The denominator of the effect size, the standard deviation, would be smaller than in a parallel group study.
[Note: I’ll get to the mechanics in a future blog on ANOVA, but I estimated, from a recent contralateral design, that the s.d. from a parallel group trial was a third larger, hence the effect size would be a third smaller. That is an additional increase in the sample size by a factor of 9/4, or 2.25; a total of 8.75 times fewer patients in a contralateral compared to the parallel group trial.] Of course, a contralateral trial is not possible when a treatment has a ‘central’ component, e.g., when the treatment is absorbed within the body.
Crossover Design
Let me talk about a very widely used design for analysis of data – the Crossover Design. The simplest case, the two period crossover, has half the patients receiving one treatment, say Drug A, for a period of time. This period of time is called Period One. Then those patients are allowed to wash out (i.e., not receive any treatment) for a period of time. Finally, in a parallel length of time, Period Two, they receive the second treatment, say Drug B. The other half of the patients receive Drug B in Period One and Drug A in Period Two. As each patient receives both treatments, one could compare the two treatments using the patient as their own control. Of course, since this trial uses repeated measurements, everything mentioned in the previous blog to control for correlated errors (e.g., AR(1)) must be done for crossover trials.
In theory, this will drastically reduce error variance, making the design quite powerful. In theory!
This is a quite dangerous statistical design. The analysis assumes that the patients who receive the treatment in the second period are comparable to themselves during the first period. This is far too often not the case. Patients change over time (disease progression or remission), diseases wax and wane over time, measurements/ratings change over time, the patient’s life changes over time, etc. Let me illustrate the issue with a simple example. Say we were investigating diabetic foot ulcers. The first period is two months long (Months One and Two), during which the patient receives treatment, followed by a no-treatment washout period of one month (Month Three), then a second treatment period (Period Two) from Months Four to Five. Are the wounds the same at Month One as at Month Four? Obviously they wouldn’t be. The amount the patient could improve during Period One is far greater than their improvement during Period Two. Furthermore, as some treatments are more effective than others (which is hopefully the purpose of the treatments), the status of the patients during Period Two will depend on the treatment received during Period One. If, and only if, the patient’s wounds are the same at the starts of Period One and Two (Months One and Four) can the Crossover Design be used.
When one is doing a study of efficacy, patients often do not return to their Period One baseline severity levels at Period Two.
The typical solution to this typical problem? Ignore all post-Period One data and analyze the data as a simple two treatment independent groups analysis. If one had powered the trial for an efficient crossover analysis, then the study would be drastically underpowered for a two-group t-test type of analysis. Hence the study is likely to fail.
When is a crossover design appropriate? One very frequent application is in pharmacokinetic (PK) trials. This is a trial which measures the amount of a drug (and/or its metabolite) in the blood over time and sees whether a different formulation has a different PK profile. One typically will measure the half-life of a drug, maximum concentration, time at maximum concentration and other PK parameters. The washout is typically at least a week (e.g., at least 10 half-lives of the drug in the body) for all the previous drug to wash out of the body. This can be empirically verified by examining the pre-drug measurement at Period Two to verify it is zero. That is, no drug or metabolite should be present in each and every patient.
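To see why 10 half-lives is a comfortable washout, here is a tiny sketch. It assumes simple first-order elimination, where the amount of drug halves with each half-life; real PK profiles can be messier, and the function name is just illustrative.

```python
def fraction_remaining(n_half_lives):
    """Fraction of the original drug left after n half-lives,
    assuming simple first-order (exponential) elimination."""
    return 0.5 ** n_half_lives


for n in (1, 5, 10):
    print(f"after {n:2d} half-lives: {fraction_remaining(n):.4%} remains")
```

After 10 half-lives less than a tenth of a percent of the original amount is left, which is why the pre-drug measurement at Period Two should read essentially zero.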
Let me view the crossover trial relative to the contralateral study. The crossover trial controls for patient differences (e.g., demography and baseline severity), which is good. However, the crossover trial does not control for changes over time.
The following is a very common design to also control for patient differences, with none of the problems associated with the patients returning to their baseline, as assumed by the crossover trial.
Parallel Groups Design
Due to the difficulties in running contralateral and crossover trials, the most typical design in biotech/pharma is the between-patient Parallel Groups Trial. In this trial, patients are randomized into as many groups as there are treatments (e.g., two for an active v control trial). But fear not, there are still ways to control for patient ‘noise’.
Historical Control
This type of study, to use a technical term, is HORRIBLE. Here one takes data from an older study and attempts to compare it with results in the current study. Why is it horrible? One must ask: ‘what are the reasons why the two studies could differ?’ Well, there can be subtle measurement differences between the doctors/raters now and then. The populations could differ. Anything could account for the differences, anything. I remember one attempt to use some Swedish no-treatment control data to compare with US active treatment data for an orphan drug. The historical control patients had the disease for a much shorter period of time. They also weren’t as severe. There was also a strong gender difference. Standard of care also changed over time. Any one alone would undermine the utility of the data. We also tried to select a subset of the no-treatment control data – too few cases could be extracted. If, and only if, all other factors can be found to have negligible effects (please don’t confuse clinical with statistical significance), can one use a historical control. As the list of potential factors is very long (and frequently not measured) and by chance some will be different, there will almost always be alternative explanations to make Historical Controls of very little use.
Analysis of Covariance
As mentioned above, a study in which one adjusts for a patient’s own variability is a very powerful analysis. Are there other ways to do this? Yes. Yes, indeed.
One very simple way is to look at the change at the key time point relative to their baseline severity (i.e., improvement). Improvement takes into account the patient’s own pretreatment severity and also controls for the patient’s other unique qualities.
A second, somewhat more elegant technique is something called analysis of covariance (ANCOVA), using the patient’s baseline severity as the covariate. Simply put, instead of analyzing the patient’s key time point scores, one takes them and analyzes the part of the key time point which is unrelated to the covariate, their baseline severity. One can even have more covariates. Covariates are typically selected from the patient’s pretreatment parameters. That is, they typically include only the demographics (e.g., age, race, gender) and/or baseline characteristics (e.g., baseline acne, skin oiliness) of the patient.
Unfortunately one assumption of ANCOVA is that the covariate doesn’t relate to the treatment. This means that the treatment difference doesn’t depend on the covariate. One can actually test for this by including a treatment by covariate (e.g., treatment difference by baseline severity) interaction. Unfortunately this assumption is often not met. Let me illustrate this with (a) a five point rating (0 = no disease, 1 = mild, 2 = moderate, 3 = severe, and 4 = life threatening), (b) a completely inert placebo and (c) a perfect active drug. There would be no point in enrolling asymptomatic patients (those with a baseline score of 0). At all four symptomatic baseline severity levels, the placebo patients have no improvement. At each corresponding baseline severity level, the active treatment patients would have a 1 point improvement for the baseline mild patients, 2 points for the moderate patients, 3 points for severe patients, and 4 points of improvement for the life threatening patients. Hence the treatment difference depends on the baseline severity. There is a complete interaction of baseline severity and treatment difference. The bad news is that there cannot be a single, simple treatment difference to report. The great news is that although the treatment effect depends on the baseline severity, this makes perfect sense! One will need to say that the most severe patients have the greatest improvement, and even the mild patients have improvements. At each level of the baseline severity, one can (and should) compare the two treatment groups.
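The inert placebo versus perfect drug illustration can be laid out as a small table. This is just the hypothetical example from the paragraph above written as code; the function names are mine, not part of any statistical package.

```python
SEVERITY = {1: "mild", 2: "moderate", 3: "severe", 4: "life threatening"}


def placebo_improvement(baseline):
    """A completely inert placebo: nobody improves."""
    return 0


def active_improvement(baseline):
    """A perfect drug: every patient returns to 0 (no disease)."""
    return baseline


def treatment_difference(baseline):
    """Active minus placebo improvement at one baseline severity level."""
    return active_improvement(baseline) - placebo_improvement(baseline)


for baseline, label in SEVERITY.items():
    print(f"{label:>16}: treatment difference = {treatment_difference(baseline)}")
```

The difference is not one number but four, rising with baseline severity, which is exactly what a treatment by baseline interaction means.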
I, and most statisticians, use ANCOVA very, very frequently to control for patient variability. There are many, many statistical procedures which allow the analysis (officially called the statistical model by us) to do covariate adjusted analyses.
In my next blog I finally get around to explaining what a t-test and analysis of variance are, as well as the core statistic everyone must know about – 15. Variance, and t-tests, and ANOVA, oh my!
Mr. McGuire: I just want to say one word to you. Just one word.
Benjamin: Yes, sir.
Mr. McGuire: Are you listening?
Benjamin: Yes, I am.
Mr. McGuire: Plastics. Variance.
(Almost) from The Graduate
It’s always darkest before the dawn. I see the light! I see a light at the end of the tunnel. Hey, I’m not in a tunnel. Whoops, I didn’t see a light. I just thought I did.
***
I had completed my BS, entered graduate school, and was assigned a teaching assistantship for the lab section of an introductory statistics class. The professor of the class asked me, in private, what the most important statistic was. I said the mean, he sagely shook his head no, the correct answer for understanding statistics is the variance. He was right, oh so right.
Variance:
First let me show you the standard formula for the variance.
s^{2} = Σ (X_{i} – M)^{2}/(N – 1),
where X_{i} is some individual’s data, M is the mean of all the data, and N is the number of observations.
If you squint your eyes, this looks quite close to a simple average, Σ X_{i}/N. Except that the denominator is not N but N – 1. [Actually some statisticians recommend using N, as it is a maximum likelihood estimator of the variance. I’ve even heard a great argument for using N + 1 (a minimum mean square error estimator of the variance). But N – 1 gives us something called an unbiased estimator, and N – 1 is so traditional that it is almost always used as the denominator.] The ‘– 1’ is used because the variance uses one parameter, the mean, to estimate itself.
The important thing with the variance is that it’s a type of average, the average of the differences from the mean (X_{i} – M). Now if we took everyone’s score, subtracted the mean and averaged that, the sum would have to be zero. All the negative and positive changes from the mean would cancel out. What we could do is ignore the sign and average that. This is called the absolute deviation. Unfortunately (or fortunately) statisticians prefer to square things, it has some very, very useful properties, especially for normally distributed data, which we often see. So squaring is something we often do in statistics. Be forewarned, this will be on the test.
One problem with the variance is the units. If you are measuring in inches, the term inside the numerator’s parentheses is in inches; when you square it, it becomes inches squared. Not useful, so we take the square root to get back to the original unit (e.g., inches again). This, of course, is the standard deviation. The standard deviation (often abbreviated sd or s.d.) is, loosely speaking, the average difference from the mean.
Returning to the variance, we can see that the variance is a measure of how people differ from the average. Let’s consider this. If you were asked the height of people who visited your favorite drug store, you wouldn’t guess 3 inches, nor 8 feet, unless you were being silly. You’d probably use a number like the mean height; M in the above equation. Just how good is your guess? Well, that’s exactly what the standard deviation is telling you. It’s the average of how much you were off (X_{i} – M), the average exception to your rule. Let me give another way to calculate the numerator. I’ll spare you the algebraic proof, but if you took every person in your sample and took the squared difference between them and every other person, then divided by the proper ‘fudge factor’, you’d get identical results. That is, you took (X_{1} – X_{2})^{2}, (X_{1} – X_{3})^{2}, … and (X_{N – 1} – X_{N})^{2}, added them up, and divided by the appropriate N. The variance/standard deviation is a measure of how much people (the scores) differ from everyone else. We’ll return to this alternative viewpoint later when discussing the analysis of variance.
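For the skeptics, here is a small sketch checking that the two viewpoints agree. The heights are made-up numbers, and the ‘fudge factor’ used for the pairwise version, N(N – 1) when each pair is counted once, is my reading of ‘the appropriate N’.

```python
from itertools import combinations
from statistics import mean, variance

heights = [64.0, 66.5, 68.0, 70.0, 73.5]  # hypothetical heights in inches


def variance_from_mean(xs):
    """Usual formula: squared deviations from the mean over N - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def variance_from_pairs(xs):
    """Alternative: squared differences between every pair of people,
    each pair counted once, divided by N * (N - 1)."""
    n = len(xs)
    total = sum((a - b) ** 2 for a, b in combinations(xs, 2))
    return total / (n * (n - 1))


print(variance_from_mean(heights))
print(variance_from_pairs(heights))
print(variance(heights))  # the standard library's N - 1 version

# Signed deviations from the mean cancel out, which is why we square them.
print(sum(x - mean(heights) for x in heights))
```

All three variance lines print the same value (up to floating-point rounding), and the last line prints zero.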
By the first viewpoint, the standard deviation is simply a measure of the average error in using the mean as a way to summarize all people’s scores. It can be thought of as a measure of noise.
t-test:
I’ll ignore the equation of a t-test. I put you through enough already with the variance. What is a t-test? A t-test is simply a ratio of a signal, typically the difference between means, and ‘noise’. Another way to consider the signal is to think of it as the amount we know following a model, a ‘rule’. If the signal is meaningfully larger than the noise, we say something might be there. Imagine yourself in a totally dark room (or tunnel) and someone may or may not have turned on a very weak light. Did you see this brief flash or did you imagine it? Was there a signal (light) or was it just eye-noise? This is a signal to noise ratio.
What is meaningfully larger? For a two-tailed t-test with a 0.05 alpha level, any ratio larger than about 2. A value of 2 makes intuitive sense to me. If the ratio of signal to noise was around 1, then the signal isn’t really larger than the average noise level. So with a ratio less than 1, how could you be really sure it’s a signal and not noise? Well, you can’t be sure! Mathematicians have actually worked things out and if the ratio of signal to noise (i.e., the t-test) was between +1 and –1 (and we’re using a normal distribution), then ratios of that size would be seen approximately two-thirds of the time. (It’s actually 68.27%, but why quibble.)
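If you do want to quibble, the two-thirds figure is easy to check with the normal distribution in Python's standard library. This sketch uses the normal curve rather than a t-distribution, so the ‘about 2’ cutoff is only approximate for small samples.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Chance that pure noise gives a ratio between -1 and +1.
within_one = z.cdf(1) - z.cdf(-1)

# Chance that pure noise gives a ratio beyond 2 in either direction.
beyond_two = 2 * (1 - z.cdf(2))

print(f"between -1 and +1: {within_one:.2%}")
print(f"beyond +/-2:       {beyond_two:.2%}")
```

The second number comes out a little under 5%, which is why a ratio of about 2 is the usual threshold for a two-tailed 0.05 test.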
One thing I glossed over was ‘noise’. It is very closely related to the standard deviation, but the standard deviation is how well we can guess an individual’s score. With the typical t-test, we’re looking at differences in means. A mean is a more stable estimate than a single person’s score would be. How much better is usually a function of the number of observations one has. A mean of a million observations would likely be dead on. A mean of two observations wouldn’t be expected to be very accurate. I’ll also gloss over how to get it (it’s typically a function of the square root of N), but the accuracy of the mean is measured by something called the standard error of the mean. You can’t say that we statisticians are very creative in naming things. At least it makes it easy for us to remember. Not like biologists who name things like Ulna or Saccule.
In any case, the t-test is simply the ratio of the difference between means (the signal) and the standard error of (the difference between two) means (the noise). This has a well known distribution – what you’d expect to see. Again, we uncreative statisticians called it a t-distribution.
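Although I skipped the equation, the signal to noise ratio is short enough to sketch. This assumes the classic two independent groups t-test with a pooled variance (equal variances in both groups); the scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(a, b):
    """Pooled-variance t statistic for two independent groups."""
    na, nb = len(a), len(b)
    # Noise: pool the two within-group variances (each uses N - 1).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    # Standard error of the difference between the two means.
    se_diff = sqrt(pooled_var * (1 / na + 1 / nb))
    # Signal divided by noise.
    return (mean(a) - mean(b)) / se_diff


active = [5.1, 6.3, 4.8, 7.0, 5.9]   # hypothetical scores
control = [4.2, 5.0, 3.9, 4.7, 5.2]  # hypothetical scores
print(two_sample_t(active, control))
```

Note how the sample sizes enter only through the standard error: more patients shrink the noise, not the signal.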
The ratio of the signal, or amount explained by the model or ‘rule’, divided by the noise, or amount unexplained or exceptions, is the basic method to validate a model. Hence my quip at the beginning of this blog: Statistics – a specialty of mathematics whose basic tenet is ‘Exceptions prove the rule’.
I’ve been discussing the t-test in terms of means. Yes, we can compare two observed means. We can compare a treatment mean with a hypothetical mean (e.g., is the difference equal to 0.0 or 1.0). We can do both (e.g., the difference between the means is equal to 2). We could also replace the mean with other things, like correlations.
Test time: How would a statistician transform the t-test?
Please take out your blue books and fill in the answer.
Put your pencils down: we’d square it. Don’t say I didn’t warn you that we love to square things. If we squared the t-test, we’d get the
Analysis of Variance (ANOVA):
The numerator of the t-test for two means is M_{1} – M_{2}, with the subscript indicating the two means (e.g., active and control). So what do we do if we have more than two groups? Well, we’d like to take the pairwise difference between each mean and every other mean. Yes, we saw something like that before – the alternative viewpoint of the variance. Going back to the variance we could do something almost identical: Σ (M_{i} – M.)^{2}/(N_{g} – 1). Instead of each individual’s score, we use each treatment group’s mean, M_{i}. M. is the mean of all the means; some people call it the ‘grand’ mean. N_{g} – 1 is like N – 1, but with N_{g} as the number of means, the number of treatment groups. This is the numerator for the analysis of variance, how the means differ from one another, the signal.
The denominator is still the ‘noise’. In this case, within each group our best guess is that group’s mean, so we look at the errors in using that mean within each group. For example, in group 1 we would compute the squared deviations of each of group 1’s scores from the group 1 mean. We then do the same for each of the other groups and add them all up. Finally we divide that by something like N – 1, actually N – N_{g}. It is ‘N_{g}’ because, like the ‘1’, we are using N_{g} means to compute the errors or noise. Finally, like the t-test, we divide the signal by the noise and come up with a ratio, the F-test.
As I stated before, the t-test squared with its squared t-distribution is identical to an ANOVA’s F-test with its distribution when we’re comparing two groups. However, the ANOVA can also handle more than two groups. In that lies its power and adaptability, as well as its weakness. More about that in the next blog.
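Here is a minimal sketch showing that, for two groups, the F ratio equals the squared t. One detail the prose formula glosses over is that the standard between-group mean square weights each squared deviation by its group's size; the code includes that weight. The data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def f_statistic(groups):
    """One-way ANOVA F ratio: between-group over within-group mean square."""
    all_scores = [x for g in groups for x in g]
    n_total, n_groups = len(all_scores), len(groups)
    grand_mean = mean(all_scores)
    # Signal: how the group means differ from the grand mean.
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (n_groups - 1)
    # Noise: how scores differ from their own group's mean.
    within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n_total - n_groups)
    return between / within


def pooled_t(a, b):
    """Pooled-variance t statistic for two independent groups."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


active = [5.1, 6.3, 4.8, 7.0, 5.9]   # hypothetical scores
control = [4.2, 5.0, 3.9, 4.7, 5.2]  # hypothetical scores
print(pooled_t(active, control) ** 2)
print(f_statistic([active, control]))
```

The two printed values match up to rounding. Passing three or more groups to `f_statistic` is where the ANOVA goes beyond the t-test.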
The Bayesian would say that the truism that one cannot prove the null is a consequence of a misformulation of the inference problem. If we agree that hypothesis-testing statistics is the mathematics of probabilistic inference and if we resort to probabilistic inference only when we are faced with some uncertainty as to which conclusion to draw, then the NHST formulation of the problem is ruled out because, given that formulation, we have no uncertainty: Only one possible conclusion is to be tested against the data, the null conclusion, and we are a priori certain that it cannot be true. Thus, there is no inference problem.
One might object that this is not so; the alternative is that there is “some” (positive!) effect. But until we specify what we understand by “some”, this is not a well-formulated alternative. For example, in a typical pharmacological clinical trial, “some” effect could mean that the drug had an effect anywhere between 0 and complete cure in every patient (maximum possible effect). If that is what we understand “some” effect to mean, then for most drugs, the null conclusion (no positive effect) has a greater likelihood than the “some” (positive) effect conclusion.
The Bayesian computation tells us how well each possible conclusion (aka hypothesis) predicts the data that we have gathered. The possible hypotheses are represented by prior distributions. These prior distributions may be thought of as bets made by each hypothesis before the data are examined. Each hypothesis has a unit mass of prior probability with which to bet. The null conclusion bets it all on 0. The unlimited “some” hypothesis spreads its unit mass of prior probability out over all possible effect sizes. The question then becomes which of these prior probability distributions does a better job of predicting the likelihood function.
Likelihood is sometimes called the reverse probability. In forward probability, we assume that we know the distribution (that is, we know its form and the values of its parameters) and we use this knowledge to predict how probable different outcomes are. In reverse probability (likelihood), we assume we know the data and we use the data to compute how likely those data would be for various assumptions about the distribution from which they came (assumptions about the form and about the values of the parameters of the distribution from which the data may have come). The likelihood function tells us the likelihood for all different values of the parameters of an assumed distribution. The highly likely values are the ones that predict what we have observed; the highly unlikely ones are the ones that predict that we should not have observed what we have in fact observed.
The possibilities for which probabilities are defined in a probability distribution are mutually exclusive and exhaustive, so their probabilities must sum (integrate) to one. Reverse probabilities (likelihoods), by contrast, are neither mutually exclusive nor exhaustive. It is possible to have two hypotheses that are distinct but overlapping and they may both either predict the data we have with absolute certainty (in which case, they both have a likelihood of 1) or not at all (in which case, they both have a likelihood of 0). Generally, however, one hypothesis does a better job of predicting our data than the other, in which case that hypothesis is more likely than the alternative. The Bayes Factor is the ratio of the likelihoods, in other words, the likelihood of the one hypothesis relative to the other.
Suppose the data suggest only a weak positive effect. That means that we COULD have got those data with reasonable probability even if there is in fact no effect (the null hypothesis), whereas, we could not have got those data if the effect of the drug were so great as to completely cure every patient, which is one of the states of the world encompassed by the unbridled version of the “some” hypothesis. The marginal likelihood of an hypothesis is its average likelihood over each possible value of (say) its mean that is compassed by the associated prior probability function. Because weakly positive results are inconsistent with all the stronger forms of “some”, the marginal likelihood of the unbridled “some” hypothesis is low. The null places all its chips on a single value 0, so the “average” for this hypothesis is simply the likelihood at that value, and, as already noted, if the data are weak, then the likelihood that the true effect is 0 is substantial.
Thus, the Bayesian would argue that when we formulate the inference problem in such a way that there actually is some uncertainty–hence, something to be inferred–the data may very well favor the null hypothesis. The frequentist objects that when we frame the inference problem this way, our inference will depend on the upper limit that we put on what we understand by ‘some,’ and that is true. But there is no reason not to compute the Bayes Factor (the ratio of the marginal likelihoods) as a function of this upper limit. If the Bayes Factor in favor of the null approaches 1 from above as the upper limit goes to 0, that is, as “some effect” becomes indistinguishable from “no effect”, then we can conclude that the inference to the null is to be preferred over ANY (positive!) alternative to it.
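The Bayes Factor as a function of the upper limit can be sketched in a few lines. The setup here is an assumption made for illustration, not something specified above: the observed effect is treated as normally distributed around the true effect with a known standard error, and the “some” hypothesis spreads its prior uniformly between 0 and the upper limit. All names and numbers are hypothetical.

```python
from statistics import NormalDist


def bayes_factor_null(observed, se, upper):
    """Bayes Factor in favor of the null (true effect = 0) against
    'some positive effect', modeled as a uniform prior on (0, upper)."""
    # The null bets everything on 0: its likelihood is the density of
    # the observed effect when the true effect is exactly 0.
    likelihood_null = NormalDist(0.0, se).pdf(observed)
    # Marginal likelihood of 'some': the average likelihood over all
    # true effects between 0 and the upper limit.
    like = NormalDist(observed, se)
    marginal_some = (like.cdf(upper) - like.cdf(0.0)) / upper
    return likelihood_null / marginal_some


# A weak positive result: observed effect of 0.5 with standard error 1.
for upper in (0.01, 1.0, 5.0, 10.0):
    print(upper, round(bayes_factor_null(0.5, 1.0, upper), 3))
```

With these made-up numbers the factor sits near 1 for a tiny upper limit and climbs well above 1 as the upper limit grows, because a wide “some” hypothesis wastes most of its bet on large effects the data rule out.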
When the null is actually true, the data will yield such a function 50% of the time. And, when the pharmacological effect is actually slightly or strongly negative (deleterious), the data will yield such a function even more often. Moreover, this will be true no matter how small the N. Thus, we have a rational basis for favoring one conclusion over the other no matter how little data we have.
When, by chance, the function relating the odds in favor of the null to the upper limit on “some positive effect” dips slightly below 1 for some nonzero assumption about the upper limit, it will not go very far below 1, that is, the “some effect” hypothesis cannot attain high relative likelihood when there is in fact no effect or when the effect is weak and we have little data. Therefore, if we insist that we want some reasonable odds (say 10:1) in favor of “some” (positive) effect before we put the drug on the market, we will more often than not conclude that there is no effect or none worth considering. And that is what we should conclude. Not because it is necessarily true–nothing is certain but death and taxes–but because that is what is consistent with the data we have and the principle that a drug should not be marketed unless the data we have make us reasonably confident that it will do good.
What do you call a numbers cruncher who is creative? An accountant.
What do you call a numbers cruncher who is uncreative? A statistician.
***
In the last blog, I explored the meaning of variance. I said that variance is basically how the scores differed from each other. I observed that if we took every person’s score and compared it with every other person’s score (and squaring it and dividing it by the appropriate N to get an average) it would be a measure of how people differ from one another. Or we could say that the simple variance is a measure of errors in using a mean to describe everyone.
We next observed that we could do something identical using not differences among each person, but differences between each mean. That is, take every mean and compare it with every other mean and divide it by the appropriate N. This would be a second variance, but this one measures if the means differ from one another, the mathematical model.
We also said that we could divide the mathematical model by the noise to get a ratio, a form of the signal to noise ratio. If the treatment differences (mathematical model) were suitably larger than the noise, we’d say that the model has some usefulness, beyond chance.
We are, in essence, comparing two variances. We uncreative statisticians called such a ratio of two variances (model/noise), an Analysis of Variance. Alas, too bad we weren’t logicians, we could have called it Analysis of Means.
So how do we test the ratio of two variances? Sorry, it wasn’t called the ANOVA distribution. Statisticians in those days liked the simple, single alphabet letter (e.g., z-distribution or t-distribution). This ratio was compared to an F-distribution, named after R. A. Fisher.
Let’s imagine we have two treatments (active and control, ‘A’ and ‘C’, for short) and look at the patients at Weeks 1, 2, and 3 (the end of the trial). Let me follow my uncreative lead and call the weeks 1, 2, and 3. Let’s also imagine that it takes 3 weeks for the treatment to work, with Week 3 as the key time point. Hmm, three weeks and two treatments, there would be six means. ANOVA will compare each and every mean with one another.
A1    A2    A3    C1    C2    C3
One client actually did something like that. But ANOVA has some issues and some GREAT abilities.
First the issues. As readers of my blogs will note, I’ve said in the past that differences among means over time deserve some very careful handling. One can’t ignore this. Blindly comparing the means is a very, very bad no-no. My client ignored this issue.
Secondly, things can get unwieldy with so many means. One great ability of ANOVA is its ability to handle things in a compartmentalized fashion. We can treat the three times as columns of a 2 by 3 box, and the 2 treatments by rows.
Treatment    Time
             1     2     3
Active       A1    A2    A3
Control      C1    C2    C3
For the treatment effect, we simply have one real comparison, Active vs Control. In general, we have as many unique (unrelated) comparisons as groups minus one. In statistics we say it has 1 degree of freedom (1 df). Hence, for time there would be two comparisons (2 df). Hmm, with 6 means, there should be 5 df. But we see 1 df for treatment and 2 df for time. That’s 3 df. What happened to the remaining two degrees of freedom? The remaining two degrees of freedom go into deviations from a simple treatment and a simple time effect, that is, the treatment by time interaction. More about interactions below.
In general, we’d love to just say that the Active treatment is better than the control. We could just take the three Active means (A1, A2, and A3) and compare their average with the three Control means (C1, C2, and C3). More about how we do that later. ANOVA gives us a very easy way to compare the average Active and average Control means, with a single degree of freedom comparison, like a super-duper t-test. Simply put, it would be the ‘main effect’ of Treatment.
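The degree-of-freedom bookkeeping can be written down in a few lines. This is only a sketch of the arithmetic in the paragraph above, with a hypothetical function name, for any rows by columns layout.

```python
def two_way_df(n_rows, n_cols):
    """Split the (cells - 1) degrees of freedom of a two-way layout."""
    df_rows = n_rows - 1                  # e.g., treatment: 2 groups -> 1 df
    df_cols = n_cols - 1                  # e.g., time: 3 weeks -> 2 df
    df_interaction = df_rows * df_cols    # what is left over
    df_cells = n_rows * n_cols - 1        # e.g., 6 means -> 5 df
    return {"rows": df_rows, "cols": df_cols,
            "interaction": df_interaction, "cells": df_cells}

print(two_way_df(2, 3))
```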
When can one blithely just report the treatment main effect? Well, as we discussed in other blogs the comparison of means assumes:
Interactions
If the differences between the active and control were the same at Week 1 as seen at both Week 2 and 3, then we could typically stop there. That is, A1 – C1 = A2 – C2 = A3 – C3. In that case the interaction would be zero (and the p-value would not be statistically significant at 0.10).
But we stated that it takes 3 weeks for the treatment to work. So we expected the treatment difference at Week 1 and at Week 2 to be slight and to see a real difference at Week 3. In other words, the drug effect ‘kinda’ levels off at Week 3. The treatment difference at Week 1 is different from the treatment difference at Week 2, which is different than Week 3. We often don’t see an equal treatment difference happening at all three study weeks.
If the interaction was statistically significant, we CANNOT blithely report the overall treatment difference, and we CANNOT report the overall Active and Control means. Because it depends on which week we’re talking about, an average doesn’t make any sense. Our report must focus on the treatment difference AT EACH WEEK. In the best of all worlds, let us assume that the treatment difference at Week 1 was a n.s. +0.3 (I’m supposing that a positive difference means the active is better), +0.8 at Week 2, and a statistically significant difference of +1.1 at Week 3 (the key time point). Game over, we succeeded, we publish or submit to the Agency. When the treatment differences are always in the right direction (i.e., positive, in this example) and differ only in size, we have what is known as a quantitative interaction. We could also justify a slight negative mean difference (e.g., -0.03) at Week 1, saying that the treatment needs more than a week to kick in, and prior to that no difference is expected, hence, half the time the difference would be negative.
BTW, virtually all statistical packages report the overall treatment or interaction means, their standard errors, CI, differences, and pairwise p-values.
Up to now, we’ve been talking about time as the second factor. Other common factors in ANOVAs are investigators (sites and/or regions), demographic and blocking factors (e.g., gender or age), baseline medical conditions (e.g., baseline severity, genetic factors). In fact, many statisticians believe in the dictum: ‘As Randomized, Analyzed’. If you block on a factor, you should include it in the ANOVA. We can include all factors simultaneously (e.g., treatment, site, gender, and baseline severity in a four-way ANOVA). One major problem is that with such a four-way ANOVA, you would have 6 two-way interactions (e.g., treatment by site, …, gender by severity), 4 three-way interactions (e.g., treatment by site by gender, site by gender by severity), and 1 four-way interaction (i.e., treatment by site by gender by severity). With 11 interactions you should count on at least one being statistically significant (p < 0.10) by chance alone. Murphy is the LAW, which we who analyze data, must OBEY.
One last point about interactions, sometimes they go away following a data transformation. Let us say that the three active means were 4, 9, and 16; and the three control means were 1, 4 and 9. If we looked at the Week 1 difference, it would be 3 (4 – 1), at Week 2, it would be 5, at Week 3, it would be 7. Since they are not the same (3 ≠ 5 ≠ 7) it would mean an interaction. However, if we applied a square root transformation, the interaction would be zero (the differences in square root units were all 1.0). Obviously this is ‘cooked’ data. Data transformation can greatly help in reducing nonnormality, heteroscedasticity, and spurious interactions.
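A quick sketch of where the 6, 4, and 1 come from, and how likely at least one spurious 'hit' is. The chance calculation assumes the 11 tests are independent, which is my simplifying assumption and will not hold exactly in a real ANOVA; the function names are hypothetical.

```python
from math import comb

def interaction_counts(n_factors):
    """Number of interactions of each order (2-way, 3-way, ...) among n factors."""
    return {order: comb(n_factors, order) for order in range(2, n_factors + 1)}

def chance_of_any_false_positive(n_tests, alpha):
    """P(at least one test significant by chance), assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests

counts = interaction_counts(4)
n_interactions = sum(counts.values())
any_hit = chance_of_any_false_positive(n_interactions, 0.10)
print(counts, n_interactions, round(any_hit, 3))
```

Under that independence assumption the chance of at least one significant interaction at p < 0.10 comes out at roughly two in three.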
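The ‘cooked’ example can be checked in a few lines of Python, using exactly the six means given above:

```python
from math import sqrt

active = [4, 9, 16]   # Week 1, 2, 3 means from the text
control = [1, 4, 9]

raw_diffs = [a - c for a, c in zip(active, control)]
sqrt_diffs = [sqrt(a) - sqrt(c) for a, c in zip(active, control)]

print(raw_diffs)   # unequal differences: an interaction on the raw scale
print(sqrt_diffs)  # equal differences: no interaction on the square root scale
```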
Weighted vs Unweighted Means
When I talked about getting the overall Active treatment mean there are actually two methods of doing it. First let me switch from Study Week to Investigator Site (also abbreviated as Site 1, 2, and 3) in the boxes above. If, and only if, the Ns are identical in the 6 cells (N_{A1} = N_{A2} = … = N_{C2} = N_{C3}), will the weighted and unweighted means be identical. Let us assume that there was no statistically significant treatment by site interaction, hence the overall Active (and Control) means are meaningful.
Unweighted Means: One way is to do a simple average ((M_{A1} + M_{A2} + M_{A3})/3) and something similar for the control group. This is called the Unweighted Mean. Each investigator (site) is treated equally, hence each investigator is equally ‘important’. However, for the Unweighted means, some patients are counted more heavily than others. Huh? Let me ignore the third site right now. Let me assume that site 1 went gangbusters and enrolled 60 patients, with 30 patients treated with Active. Site 2 had ‘problems’, enrolling only one patient into Active. Without going into mathematical proofs, the single patient in Site 2 is weighted as 30 times more important than each patient in Site 1. So, the unweighted means will weight each site equally, but weight each patient unequally. [For SAS users, this is similar to the type III sum of squares.]
Weighted Means: A second way to compute the simple average is to weight each mean by its N, then divide by the total N. Mathematically the active mean would look like this: (N_{A1}M_{A1} + N_{A2}M_{A2} + N_{A3}M_{A3})/(N_{A1} + N_{A2} + N_{A3}), which is identical to adding up all the active patients and dividing by the number of active patients. Weighted means will make better enrolling sites more important, but treats each patient equally. [For SAS users, this is similar to the type II sum of squares.]
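Here is a small sketch of the two averages side by side. The site Ns (30, 1, and 15) and the site means are hypothetical numbers I made up to echo the 30-versus-1 enrollment story; only the two formulas come from the text.

```python
def unweighted_mean(site_means):
    """Every site counts equally, whatever its enrollment."""
    return sum(site_means) / len(site_means)

def weighted_mean(site_means, site_ns):
    """Every patient counts equally: weight each site mean by its N."""
    return sum(n * m for n, m in zip(site_ns, site_means)) / sum(site_ns)

active_means = [10.0, 4.0, 9.0]  # hypothetical Active means at Sites 1, 2, 3
active_ns = [30, 1, 15]          # hypothetical Active Ns at Sites 1, 2, 3

print(unweighted_mean(active_means))
print(weighted_mean(active_means, active_ns))
# With equal Ns the two methods agree:
print(weighted_mean(active_means, [10, 10, 10]))
```

In this made-up case the lone Site 2 patient drags the unweighted mean well below the weighted one.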
Many years ago, the Unweighted means (type III) was THE FDA approved methodology. The problem is that when you see poorly enrolling sites, it tends to make the overall treatment difference have greater noise, hence poorer ability to reject the null hypothesis. I remember doing a type of meta-analysis where I combined all the data from every published study for a certain drug, in other words an Integrated Summary of Efficacy (ISE). The best study had 120 patients, the poorest two studies each had 4; fortunately the studies randomized approximately equally to Active vs Comparator. The total N was about 2,400. I saw a significant nonparametric test result, but the overall (unweighted means – each study was equally important) treatment effect was nonsignificant. Again, at this time the FDA wanted to see the unweighted means approach, and I dutifully used only that. I then looked at a plot of the mean difference by study size. The data fell into a triangular pattern. When the Ns were large, the treatment difference converged on one number; when the Ns were small, sometimes the means were above this, sometimes well below, just as one would expect. When N is large, the treatment difference should be close to the true treatment difference; when N is small, the treatment difference would be quite variable. I switched from using Unweighted means to Weighted means and the results were statistically significant. I gave a talk to the FDA on my findings. I don’t know if it was because of me, but they stopped relying exclusively on unweighted mean analyses.
In any case, as I’ve said in my blog on the analysis plan (6. ‘Lies, Damned Lies, and Statistics’ part 1, and Analysis Plans, an essential tool), how you’re going to do the analysis needs to be stated in the protocol and analysis plan, including the type of ANOVA, how you will be testing for interactions, critical p-values for main effects and interactions, data transformations, weighted/unweighted means, etc.
If you shoot enough arrows, everyone can hit a bull’s-eye
Free! Free! Free!
BOGO – Buy One Get One Free
***
Multiple Comparison Problem: If you shoot at a target once and hit the bull’s-eye, then you’ve clearly hit your mark. If you shot at ten targets and hit one, then it isn’t clear if you’ve succeeded. Examples of this issue are doing the test at different times (e.g., Week 1, 2, 4, 8, 12), different dependent variables, different subsamples (older or younger children), different populations (ITT, per protocol), or different active treatments (High v Placebo; Medium v Placebo, Low v Placebo). One way to circumvent this issue of multiple comparisons is to define one comparison as the key comparison in the protocol. The other comparisons could be secondary (or tertiary). Sometimes you can’t decide.
Simple Bonferroni Adjustment: Perhaps the first place to start is the impact of having multiple active treatments. Perhaps we have a high and a low dose for a Phase II trial, each to be compared to a control. In this case, we can ‘win’ if either the high dose or the low dose is statistically significantly better than control. Now in probability anything with an ‘or’ is additive, anything with an ‘and’ is multiplicative. Therefore, if the critical p-value of both the first and second comparison is 0.05, then the probability of either comparison A or B achieving a spurious statistical significance is about 0.10 (0.05 + 0.05). If we had used three comparisons it would be about 0.15. I’m not going to expound on the meaning of ‘about’, as we are double counting some small probabilities (the actual experimentwise p-values are 0.098 and 0.143, respectively; I’m not going to quibble over 0.098 v 0.100 or 0.143 v 0.150). So, if we wanted to have a 0.05 OVERALL (aka experimentwise) probability level we’d need to divide the overall p-value by the number of comparisons, like 0.05/2 = 0.025 for a two comparison study and 0.05/3 = 0.0167 for a three comparison study. This is known as the Bonferroni correction.
Naively one might think that by halving the p-value one would need to double the number of subjects. Nope. Let me take the case where the N for a single comparison study is rather large. The critical t would be near 2 (actually 1.96). What would we need for half the p-value? No, not about 4 (actually 3.92), but you would need only 2.242. That’s only 14.4% larger (a ratio of 1.144) than the original 1.96. Unfortunately the new N is not linearly related to the critical t-value, but to its square (t^{2}). So the N would need to be 1.144^{2} or 1.308 times the original, i.e., 30.8% larger, but clearly not 100% larger. [BTW, things get slightly worse when the starting N is not large, but things level off quite quickly. When the N_{group} is about 30 then the t is only slightly larger than 1.96, i.e., 2.04.]
In summary, for the simple Bonferroni adjustment for multiple comparisons one would need to test each comparison at a critical p-value (alpha) of 0.05/Number of comparisons. This will increase the N_{group} by a relatively small degree.
Improved Bonferroni Adjustment: When one makes two comparisons, one difference of means will be larger than the other. This is just like that archery contest, one arrow will be closer than the others. For the purposes of this blog, let me assume that the High dose vs. Control is a bigger difference than the Low dose vs. Control. With the above simple Bonferroni the critical p-value for the High dose vs. Control is 0.025. That will still apply. However, if and only if that largest comparison was statistically significant, the weaker comparison, Low vs. Control, could be tested at the 0.05 level. Buy one, get one free. For three active comparisons, the largest comparison would be tested at 0.05/3 (=0.0167). If the first were significant, the second can be tested at 0.05/2 (=0.025). If the first two were statistically significant at their respective p-values, the smallest mean difference can be tested at 0.05/1 (=0.05). Therefore, the smallest difference could be tested at a less severe condition. One still needs to plan for a study with the larger N, as above for the best mean comparison. However, the Improved Bonferroni makes the other tests easier to achieve. I’m simplifying things by only focusing on a step-down approach, but I can’t explain everything in this simple blog.
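For readers who want to see the ‘about’ worked out, here is a minimal sketch. It assumes the comparisons are independent, which is my simplification (comparisons sharing one control group are correlated), and the function name is hypothetical.

```python
def experimentwise_alpha(per_test_alpha, n_tests):
    """P(at least one spurious 'win') across independent tests."""
    return 1 - (1 - per_test_alpha) ** n_tests

for k in (2, 3):
    unadjusted = experimentwise_alpha(0.05, k)       # each test run at 0.05
    bonferroni = experimentwise_alpha(0.05 / k, k)   # each test run at 0.05/k
    print(k, round(unadjusted, 4), round(bonferroni, 4))
```

With the Bonferroni divisor the experimentwise level lands just under 0.05 in both cases.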
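The critical values can be approximately reproduced with the normal distribution (the large-N case), using only the Python standard library. The first inflation figure squares the ratio of critical values, as in the paragraph above. The second is an extra I am adding: the conventional normal-approximation sample size formula also carries a power term, and with an assumed 80% power the inflation comes out smaller still. Treat both as sketches, not a substitute for a real power analysis.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # quantile function of the standard normal

def critical_z(two_sided_alpha):
    return z(1 - two_sided_alpha / 2)

z_single = critical_z(0.05)        # one comparison
z_bonferroni = critical_z(0.025)   # two comparisons, Bonferroni

# Inflation from squaring the ratio of critical values (the text's argument).
simple_inflation = (z_bonferroni / z_single) ** 2

# Assumption: 80% power; N is then proportional to (z_alpha + z_power)^2.
z_power = z(0.80)
power_based_inflation = ((z_bonferroni + z_power) / (z_single + z_power)) ** 2

print(round(z_single, 3), round(z_bonferroni, 3))
print(round(simple_inflation, 3), round(power_based_inflation, 3))
```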
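The step-down rule just described (usually credited to Holm, as far as I know) is short enough to sketch in Python. The p-values in the example are hypothetical, and `holm_step_down` is my own name for the helper.

```python
def holm_step_down(p_values, alpha=0.05):
    """Test the smallest p at alpha/k, the next at alpha/(k-1), ...; stop at the first failure."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    significant = [False] * len(p_values)
    for rank, idx in enumerate(order):
        threshold = alpha / (len(p_values) - rank)
        if p_values[idx] <= threshold:
            significant[idx] = True
        else:
            break  # once one comparison fails, all weaker ones fail too
    return significant

# Hypothetical p-values for High vs. Control and Low vs. Control.
print(holm_step_down([0.010, 0.040]))  # High passes at 0.025, so Low is tested at 0.05
print(holm_step_down([0.030, 0.040]))  # High fails at 0.025, so nothing is significant
```

Under the simple Bonferroni, the Low dose p-value of 0.040 in the first example would have missed its 0.025 cutoff.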
In summary, for multiple comparisons, the ‘penalty’ for multiple tests 1) isn’t prohibitive and 2) can be reduced to nothing for the second best comparisons.
Interim Analysis: Let me talk about another specialized type of multiple comparison. This is the case where we look at the data while the study is ongoing and see if the results are significant, an interim analysis. If so, then we might stop the trial. Perhaps we analyze the data when the trial is half completed. By the simple Bonferroni approach we might test both the interim and the final at 0.05/2 or 0.025. However, statisticians realized that in the final analysis we’re analyzing the data of the first half twice (i.e., in the final analysis the only ‘new’ data is the results from the second half). Pocock worked this out and realized that the two critical p-values should be 0.029, which is slightly better than 0.025. If there were three comparisons (two interim and the final) we could test them equally at 0.0221 (not 0.0167). Other statisticians (e.g., Peto) asked the question, why treat the comparisons equally? Why not ‘spend’ the alpha trivially at the first interim, for example, 0.001. Then the remainder, 0.049, could be spent at the final. This evolved into a class of spending functions, how one spends alpha. One could make the first comparison as important as the final, like the Pocock boundary, or have an accelerating spending function, saving one’s alpha for the final analysis. The latter has been popularized by O’Brien and Fleming (OF). For example, for a two interim and final analysis (three analyses in all), each equally spaced, the Pocock alpha levels would be: 0.022, 0.022, and 0.022. The OF alpha levels would be 0.0005, 0.014, and 0.045. Personally I favor the OF approach. The analyses when the N is low (i.e., low power – when you only have 1/3 the data) could be statistically significant, but you’d need overwhelming proof to stop the trial. At the end of the trial, the critical p-value is near 0.05.
But what happens if you don’t want to do the interim analysis at equal points (e.g., in a three hundred patient study, when 100, 200 and all 300 patients complete)? Well, Lan and DeMets worked out a way to come up with p-values dependent on the actual number of patients who completed the trial. One only needs to specify the spending function and the approximate time of the interim analyses. Therefore, one could do an interim analysis (e.g., Pocock, or Peto, or O’Brien and Fleming) when 122, 171, and 300 patients complete.
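One can check the Pocock number by brute force. The sketch below simulates trials with no true treatment effect, looks at the accumulating z statistic at three equally spaced points, and counts how often any look crosses the nominal cutoff. The setup (a known-variance z test, 50,000 simulated trials, a fixed seed) is my own simplification for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def overall_false_positive_rate(nominal_alpha, n_looks=3, n_trials=50_000, seed=1):
    """Share of null trials declared significant at ANY of the equally spaced looks."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - nominal_alpha / 2)
    hits = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Each look adds one equally sized block of new data (no true effect).
            running_sum += rng.gauss(0.0, 1.0)
            z_stat = running_sum / sqrt(look)  # z based on all data so far
            if abs(z_stat) > crit:
                hits += 1
                break  # the trial would have stopped here
    return hits / n_trials

pocock_rate = overall_false_positive_rate(0.0221)
bonferroni_rate = overall_false_positive_rate(0.05 / 3)
unadjusted_rate = overall_false_positive_rate(0.05)
print(pocock_rate, bonferroni_rate, unadjusted_rate)
```

The Pocock cutoff should land near an overall 0.05, the Bonferroni cutoff somewhat below it (wasted alpha), and testing every look at 0.05 well above it.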
BTW, one can reverse the type of analysis to determine if the results were so bad that there is no way to achieve statistical significance. This follows from the above, but is called a futility analysis.
In sum, when you analyze an ongoing trial with multiple looks, you don’t spend your alpha as badly as with a Bonferroni approach. In fact you can flexibly spend it so most is saved for the end of the trial, when you are most likely to get statistical significance, but still have the ability to stop the trial and proclaim a win if you get extraordinary results at an early time point. This was referred to by an FDAer as a ‘Front Page of the New York Times’ result.
If you are going to do an interim analysis, a few STRONG suggestions. First, the people who do it must not tell anyone involved in the trial the results. If the CMO talks to the people who are running the trial, then the CMO must not know the interim results. The best way to handle this is to create a Data and Safety Monitoring Board (DSMB) who are tasked with looking at the results and are empowered to halt the trial. They, and only they, are empowered to review the unblinded data. Second, expect to do additional analyses for your NDA on the results of your key parameter at each interim analysis point to determine if the results change. [Hint: Always expect the results to change – Murphy’s Law. I’ve often noticed that the patient populations change over time.] Hence, you may need to explain away the interim differences.
An Adaptive Trial: I’m going to describe a type of adaptive trial that is free of FDA questions. An adaptive trial is one where at one point or other, the actual design of the trial changes. For example, if one looked at the results early (breaking the blind) and noticed that the Low dose vs. Control had a minuscule effect size, one could drop the Low dose group and only collect data for the High dose and Control. This type of trial has too many problems to be considered a Phase III trial. Foremost is that the conduct of the trial changes with the ‘best’ treatment continuing. First, one treatment is always better than a second, and that could be spurious. Experimentally, this has been called ‘regression to the mean’. Second, there is a real possibility that investigators might realize that the trial is changing and change the way they collect data, biasing the results. One could do an adaptive trial in Phase II, but I do not recommend it for a Phase III (or Phase II/III) trial.
No, I’m going to describe an adaptive trial which is 100% FDA safe. A completely FREE insurance policy! We discussed in the power analysis blog (8. What is a Power Analysis?) that we need to know something about the results, for example, the standard deviation or the proportion of successes in the control group. One might have an estimate of the sd from the literature. However, your study might not be identical to that trial: the patient population (e.g., inclusion criteria, baseline severity restrictions), the design, or the investigators might be different (e.g., you are including Italian sites). What I’m recommending is to look at the blinded results (i.e., not knowing the treatment group membership) and from this compute the sd. One could then re-estimate the needed sample size.
Let me describe one such analysis I did. The trial investigated mortality for a new and standard treatment. Mortality was estimated to be 17% for the control, the active was assumed to be 2% better (i.e., 15%). However the 17% was an educated guess. If the 17% were 25%, many more patients would be needed; if the control rate was 6%, many fewer would be needed. The FDA liked our N of about 750 for safety reasons. So we wrote the adaptive analysis so that if the overall rate was less than 16% (the average of 17% and 15%) we wouldn’t change the trial. If the overall mortality rate were larger, we would increase the N (within some bounds). We ran the BLINDED analysis when a third of the data was in, saw a lower overall mortality rate (14%) and let the trial continue with its original sample size. The FDA had many questions about the trial, but never a question about the adaptive component.
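A minimal sketch of a blinded sample size re-estimation for a continuous endpoint follows. Everything numeric here is hypothetical: the planning s.d. of 10, the target difference of 5, the twelve blinded scores, and the rule of only ever increasing N. The formula is the usual normal-approximation one for comparing two means. Note too that an s.d. computed from pooled, blinded data runs slightly high when the treatments truly differ, since it absorbs part of the treatment difference.

```python
from math import ceil
from statistics import NormalDist, stdev

def n_per_group(sd, delta, alpha=0.05, power=0.80):
    """Normal-approximation N per group to detect a mean difference of delta."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2)

planned_sd, delta = 10.0, 5.0              # hypothetical planning assumptions
planned_n = n_per_group(planned_sd, delta)

# Hypothetical blinded interim scores: both arms pooled, no treatment labels.
blinded_scores = [38, 52, 61, 45, 70, 33, 58, 49, 66, 41, 55, 74]
blinded_sd = stdev(blinded_scores)

# Assumed rule: keep the planned N unless the blinded s.d. calls for more.
revised_n = max(planned_n, n_per_group(blinded_sd, delta))
print(planned_n, round(blinded_sd, 2), revised_n)
```

In a real protocol the rule, including any upper bound on the increase, would be fixed in advance in the protocol and SAP.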
Free insurance!
Administrative Look: Sometimes one wants to examine the data to plan for the next trial, i.e., determine the mean difference and s.d. – aka effect size. The comments above still apply. The next-to-best way to handle this planning is to have the DSMB compute this and keep the results from anyone involved in the current trial (CMO?). I would say, the best way to do this is to do it in a blinded manner. This would obviate the need to do the secondary analyses at each interim point and to hire a DSMB. Also, as the data is always available to the sponsor for monitoring and error corrections, I would say that if one did a peek at the blinded data to assess the s.d. (or overall proportions), then one might not even need to report in the protocol or SAP that the blinded administrative look was done. The weakness is that one doesn’t have an estimate of the mean (proportion) treatment difference, aka the effect size. Well, the unblinded review has a strong cost, but the blinded look is free.
– Attributed to Albert Einstein
***
In my third and fourth blogs I addressed useful ways to present the results of an analysis. Of course, p-values weren’t it. I favored differences in means and, especially, their confidence interval, when one understands the dependent variable (d.v.). For those cases where one doesn’t understand the d.v., I recommended dividing the mean difference by its s.d. (i.e., the effect size). This would be how many standard deviations the means are apart.
In “9. Dichotomization as the Devils Tool”, I said that transforming the data by creating a dichotomy of ‘winners’ or ‘losers’ (or ‘successes’/‘failures’ or ‘responders’/‘nonresponders’ [e.g., from RECIST for oncological studies]) was a poor way of analyzing data, primarily because it throws away a lot of information and is statistically inefficient. That is, you need to pump up the sample size (e.g., under the best case you’d only have to increase the N by 60%. Under realistic other cases, you’d have to increase the N four fold).
Percentage change is another very easy to understand transformation. In this blog I’ll be discussing a paper by Andrew J. Vickers, “The Use of Percentage Change from Baseline as an Outcome in a Controlled Trial is Statistically Inefficient: A Simulation Study” in BMC Medical Research Methodology (2001) 1:6. He states “a percentage change from baseline gives the results of a randomized trial in clinically relevant terms immediately accessible to patients and clinicians alike.” I mean, what could be clearer than hearing that patients improve 40% relative to their baseline? Like dichotomies, percentage change has a clear and intuitive intrinsic meaning.
[Note added on 20Apr2013: I forgot to mention one KEY assumption of the percentage change from Baseline, the scale MUST have an unassailable zero point. Zero must unequivocally be zero. A zero must be the complete absence of the attribute (e.g., a zero pain or free of illness). One MUST not compute anything dividing by a variable (e.g., baseline), unless that variable is measured on a ratio level scale – zero is zero. Also see Blog 22.]
I’m not going to go too much into the methodology he used. He basically used computer generated random numbers to simulate a study with 100 observations, half treated by an ‘active’ and half by a ‘control’. He assumed that the ‘active’ treatment was a half-standard deviation better than the ‘control’ (i.e., the effect size = 0.50). He ‘ran’ 1,000 simulated studies and recorded how often various methods were able to reject the untrue null hypothesis. Such simulations are often used in statistics. In fact, my masters and doctoral theses were similar simulations. The great thing about such simulations is that answers can be obtained rapidly, cheaply, and no humans would be harmed in the course of such a simulation. His simulation allowed the correlations between the baseline and post score to vary from 0.20 to 0.80.
In all cases, Analysis of Covariance (ANCOVA) with baseline as the covariate was the most efficient statistical methodology. Analyzing the change from baseline “has acceptable power when correlations between baseline and posttreatment scores are high; when correlations are low, POST [i.e., analyzing only the postscore and ignoring baseline – AIF] has reasonable power. FRACTION [i.e., percentage change from baseline – AIF] has the poorest statistical efficiency at all correlations.”
[Note: In ANCOVA, one can analyze either the change from baseline or the post treatment scores as the d.v. ‘Change’ or ‘Post’ will give IDENTICAL p-values when baseline is a covariate in ANCOVA.]
As an example of his results, when the correlation between baseline and post was low (i.e., 0.20) the percentage change was able to be statistically significant only 45% of the time. Next worst was change from baseline with 51% significant results. Near the top was analyzing only the post score at 70% significant results. The best was ANCOVA with 72% significant results.
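For the curious, a simulation in the same spirit can be sketched with nothing but the Python standard library. This is NOT Vickers’s code. The baseline mean of 50 and s.d. of 10, the 5 point (half s.d.) benefit, the fixed critical value, and the seed are all my assumptions, so the percentages will only roughly resemble his, and the FRACTION result in particular depends on the assumed baseline distribution.

```python
import random
from math import sqrt
from statistics import fmean

CRIT = 1.985  # approximate two-sided 5% critical t for about 97-98 df (assumption)

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic."""
    ma, mb = fmean(a), fmean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (ma - mb) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))

def residualize(v, x):
    """Residuals of v after regressing it on an intercept and x."""
    mv, mx = fmean(v), fmean(x)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (vi - mv) for xi, vi in zip(x, v)) / sxx
    return [(vi - mv) - slope * (xi - mx) for xi, vi in zip(x, v)]

def ancova_t(post, base, group):
    """t statistic for the group effect in post ~ intercept + baseline + group."""
    ry, rg = residualize(post, base), residualize(group, base)
    sgg = sum(g * g for g in rg)
    effect = sum(g * y for g, y in zip(rg, ry)) / sgg
    rss = sum((y - effect * g) ** 2 for g, y in zip(rg, ry))
    return effect / sqrt(rss / (len(post) - 3) / sgg)

def simulate_power(rho, n_per_group=50, n_sims=1000, seed=2013):
    rng = random.Random(seed)
    k = n_per_group
    group = [1.0] * k + [0.0] * k  # 1 = active, 0 = control
    wins = {"POST": 0, "CHANGE": 0, "FRACTION": 0, "ANCOVA": 0}
    for _ in range(n_sims):
        base, post = [], []
        for g in group:
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            base.append(50 + 10 * z1)
            # Post correlates rho with baseline; active lowers it by half an s.d.
            post.append(50 + 10 * (rho * z1 + sqrt(1 - rho ** 2) * z2) - 5 * g)
        change = [p - b for p, b in zip(post, base)]
        fraction = [c / b for c, b in zip(change, base)]
        stats = {
            "POST": two_sample_t(post[:k], post[k:]),
            "CHANGE": two_sample_t(change[:k], change[k:]),
            "FRACTION": two_sample_t(fraction[:k], fraction[k:]),
            "ANCOVA": ancova_t(post, base, group),
        }
        for name, t in stats.items():
            wins[name] += abs(t) > CRIT
    return {name: count / n_sims for name, count in wins.items()}

power = simulate_power(rho=0.2)
print(power)
```

The ordering, rather than the exact percentages, is the point: at a low baseline-post correlation, ANCOVA and the post-only analysis should clearly beat change from baseline.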
Furthermore, percentage change from baseline “is sensitive in the characteristics of the baseline distribution.” When the baseline has relatively large variability, he observed that “power falls.”
He also makes two other theoretical observations:
First, one would think that with baseline in both the numerator and denominator, it would be extraordinarily powerful in controlling for treatment group differences at baseline. Vickers observed that the percentage change from baseline “will create a bias towards the group with poorer baseline scores.” That is, if you’re unlucky (remember that buttered bread tends to fall butter side down, especially on expensive rugs), and the control group had a lower baseline, percentage change will be better for the control group.
Second, due to creating a ratio of two normally distributed variables (post – baseline) divided by baseline one would expect the percentage change to be nonnormally distributed. That is, percentage change is often heavily skewed with outliers, especially when low baselines (e.g., near zero) are observed.
I have often observed a third issue with percent change. One often sees unequal variances at different levels of the baseline. Let me briefly illustrate this. Let us say we have a scale from 0 to 4 (0. asymptomatic, 1. mild, 2. moderate, 3. severe, 4. life threatening). At baseline, the lowest we might let enter into a trial is 1. mild. How much can they improve? Obviously they could go from their 1. mild to 0. asymptomatic or 100% improvement; they could remain the same at mild (or 0% improvement); or they could get worse (2. moderate severity or 100% worse, etc.). What about the 3. severe patients? If the drug works they could go to 2. moderate (i.e., 33% improvement), 1. mild (i.e., 67% improvement) or 0. asymptomatic (i.e., 100% improvement) or get worse – 4. life threatening (33% worse). If you start out near zero (e.g., Mild), then you get a large s.d. If you start high, a 1 point change would be far smaller, 33% change. That is, percent change breaks another assumption of the analysis, equal variances, giving heteroscedasticity.
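A few lines of Python make the point. Here I take the 0 to 4 scale above and, purely as an illustration, let each patient move down one point, stay put, or move up one point. I then look at the spread of the resulting percent changes at each baseline (a negative percent change is an improvement).

```python
from statistics import pstdev

def percent_change(baseline, post):
    return 100.0 * (post - baseline) / baseline

spread_by_baseline = {}
for baseline in (1, 2, 3):
    # The same raw movement (-1, 0, +1 point) at every baseline level.
    changes = [percent_change(baseline, baseline + step) for step in (-1, 0, 1)]
    spread_by_baseline[baseline] = pstdev(changes)
    print(baseline, [round(c, 1) for c in changes], round(spread_by_baseline[baseline], 1))
```

Identical one point movements give a spread at a mild baseline that is three times the spread at a severe baseline.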
Theoretically one would expect with percentage change: 1) an over adjustment of baseline differences, 2) nonnormality, marked with outliers, and 3) heteroscedasticity.
To get percent change, Vickers recommends “ANCOVA [on change from baseline – AIF] to test significance and calculate confidence intervals. They should then convert to percentage change by using mean baseline and posttreatment scores.” I have a very large hesitation in computing ratios of means. In arithmetic it is a truism that means of ratios (e.g., mean percent change) are not the same as ratios of means (e.g., mean change from baseline divided by mean baseline). Personally, I would have suggested computing the percentage change for each observation and descriptively reporting the median and not reporting any inferential statistics for percent change.
In sum, Vickers recommends using ANCOVA and never using percentage change to do the inferential (i.e., p-value) analysis. I further recommend reporting percentage change from baseline only as a descriptive statistic.
No virtual observations were harmed in the running of this study.
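To see that the two are not interchangeable, take three hypothetical patients who each improve by exactly one point from different baselines (the numbers are invented for illustration):

```python
from statistics import fmean

# Hypothetical (baseline, post) pairs: everyone drops by exactly 1 point.
patients = [(2.0, 1.0), (4.0, 3.0), (10.0, 9.0)]

# Mean of ratios: average each patient's own percent change.
mean_of_ratios = fmean(100.0 * (post - base) / base for base, post in patients)

# Ratio of means: mean change divided by mean baseline.
mean_change = fmean(post - base for base, post in patients)
mean_baseline = fmean(base for base, _ in patients)
ratio_of_means = 100.0 * mean_change / mean_baseline

print(round(mean_of_ratios, 2), round(ratio_of_means, 2))
```

Same patients, same data, yet the two ‘percent changes’ differ by about ten percentage points.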
A man with one watch knows what time it is. A man with two is not sure, at least for the Wilcoxon test.
And don’t try this at home, we’re what you would call ‘experts’.
I admit it, I have biases.
One bias, seen in my first blog, is “the near sacred p-value (i.e., p < 0.05) indicating our ability to reject the null hypothesis. As it is theoretically false, believed by all to be false, and practically false, all statisticians I’ve ever talked to believe that the p-value is a near meaningless concept.” I haven’t changed my mind about that one. See my second blog as to why I still (always) do it.
Another bias I haven’t changed my mind about is that the purpose of a study is to see which specific values of a treatment difference are credible. My third and fourth blogs dealt with computing the confidence interval (CI) of the raw data or the CI for the effect size. Zero on the low end is just a single value. We also need to know other low values and the upper end as well.
One of my strong biases is the avoidance of nonparametric (np) tests, except for supportive analyses. Yes, it was based on knowledge and good experimental design (e.g., nonnormality, heteroscedasticity, which are controllable by reasonable clinical design of N_{group} > 10 and equal Ns per group [see blog 7. Assumptions of Statistical Tests and below]). I avoided np tests. I observed that they only provided p-values, and only the median was reported with these p-values. As I pointed out previously, the median is NOT something used in computing the Wilcoxon/Mann-Whitney test. These np tests compare the mean rank, and NOBODY reports mean ranks.
Recently a colleague, Catherine Beal, introduced me to an analogue of the 95% CI used in parametric testing, a Hodges-Lehmann (HL) estimator. Statistics had moved on! HL estimators were only recently made available in SAS, the statistical analysis language used in the industry, and only for the Wilcoxon test. The HL estimator provides the CI and the midpoint of this interval. This satisfied some of my theoretical objections to np testing.
On the other hand, how well do the HL estimators compare to the 95% CI on the means? Has anyone examined the relative efficiency of them? In a recent blog I mentioned that statisticians can do comparisons of different approaches by simulating data. One such approach is the Monte Carlo study. In it, one can take a large number of observations from a random sample and see how well two or more approaches compare.
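For readers who have never seen one, here is a rough Python sketch of a two-sample HL estimate. The point estimate is the median of all pairwise active-minus-control differences. The interval uses a large-sample normal approximation to pick which ordered differences form the limits. That approximation is my choice for illustration and is not necessarily what SAS computes. The data are made up.

```python
from math import floor, sqrt
from statistics import NormalDist, median

def hodges_lehmann(active, control, confidence=0.95):
    """HL shift estimate with an approximate (large-sample) confidence interval."""
    diffs = sorted(a - c for a in active for c in control)
    estimate = median(diffs)
    n, m = len(active), len(control)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Approximate rank of the lower limit among the n*m ordered differences.
    k = floor(n * m / 2 - z * sqrt(n * m * (n + m + 1) / 12))
    k = max(k, 1)  # guard for very small samples
    return estimate, diffs[k - 1], diffs[n * m - k]

active = [12, 15, 11, 18, 14, 16, 13, 17]   # hypothetical scores
control = [10, 9, 13, 8, 11, 12, 7, 10]
estimate, lower, upper = hodges_lehmann(active, control)
print(estimate, lower, upper)
```

Note that nothing here uses the group medians; the estimate is built entirely from the pairwise differences.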
I did such a study. You are learning about it here first! I generated 10,000 virtual samples for 4 distributions: a normal distribution, a rectangular distribution, a ‘typical’ nonnormal distribution, and a population with 2% outliers (98% of the sample had a s.d. of 0.93 and 2% had a s.d. 3 times larger of 2.78). The ‘typical’ nonnormal distribution was based on a suggestion by Pearson and Please. They suggested not using an extreme population, but with only a moderate skew and kurtosis (two measures of nonnormality). I used a skew of 0.75 and a kurtosis of 0.50.
I ran this simulation assuming the N_{group} was 6 (i.e., small), 51 (i.e., moderate), and 501 (i.e., large), in other words, with 10, 100, and 1,000 degrees of freedom. With the 3 N_{group} and the 4 distributions, 120,000 sample means for an ‘active’ and ‘control’ groups were drawn. In other words, more than 500 million ‘subjects’ were generated for this study. And don’t try this at home, we’re what you would call ‘experts’.
First, let me report one major disappointment. Sometimes the HL CI said the results were not statistically significant when the Wilcoxon test said they were significant. For example, when N_{group} was small (6), I noticed that a good number of times (e.g., 8.16% for the normal distribution), when the Wilcoxon was just barely (p = 0.045) statistically significant, the HL estimator had a confidence interval which included zero. The same occurred in the moderate (N_{group} = 51) cases, but less frequently (e.g., 0.21% for the normal distribution), again when the Wilcoxon was just barely (p = 0.04988) statistically significant. In other words, the p-value indicated statistical significance, but the HL estimator said it wasn’t significant. Of course, the t-test p-value and its CI were consistent with one another in all 10,000 samples of the 4 distributions and 3 levels of N_{group}. The SAS consultant confirmed my observations and told me of a 2011 talk by Riji Yao et al., who concluded that “the results from the three statistics [p-value, HL estimator and medians – AF] are not entirely consistent.”
Second, let me present the empirical power of both tests. In all cases it should be 80%, as I ran the study with different effect sizes for the different N_{group}.
Power of the study (%), by N_{group}

| Distribution | 6: t-test | 6: Wilcoxon | 51: t-test | 51: Wilcoxon | 501: t-test | 501: Wilcoxon |
|---|---|---|---|---|---|---|
| Normal | 79.30 | 74.06 | 80.60 | 78.69 | 80.82 | 78.91 |
| Outlier | 81.13 | 76.26 | 79.99 | 82.02 | 80.04 | 81.99 |
| Rectangular | 79.83 | 70.23 | 79.89 | 74.97 | 79.07 | 77.44 |
| ‘Typical’ | 79.96 | 74.43 | 79.99 | 81.77 | 79.60 | 81.79 |
Two major observations can be made about the power. First, when N_{group} is small, the t-test, which had approximately the nominal 80% power, has greater power than the Wilcoxon for all distributions. That is, when the data were normal, the Wilcoxon had 5.9% lower power than the nominal 80%. For the outlier, rectangular and ‘typical’ distributions, it was underpowered by 3.7%, 9.8%, and 5.6%, respectively. Second, when N_{group} is moderate or large, if the data truly are normal, the Wilcoxon test has power almost as good (with 1.3% to 1.9% lower power) as the t-test. If the data were rectangularly distributed, even in larger sample sizes, the Wilcoxon power was also lower than that of the t-test. However, for the ‘typical’ nonnormal or the outlier distributions, for moderate and large sample sizes, the Wilcoxon had about 2% better power. In other words, for tail-heavy distributions [leptokurtic in statisticianese], a power benefit of about 2% would be gained by using the Wilcoxon test.
It should be pointed out that one NEVER powers a study assuming nonnormality. In fact, we can only power studies assuming normality and ‘adjust’ (increase) the N for np analyses. Siegel’s (1956) book on nonparametrics said the Wilcoxon test had 95% of the power of the t-test, a rather good estimate given the above results. Other books dedicated solely to np analyses (e.g., Sprent [1990] or Daniel [1990]) had poorer practical suggestions. So for small studies with an unknown distribution, I would recommend increasing power to 90%.
Third, all things considered, one would like ‘tight’ (or narrow) confidence intervals. This is the primary reason one uses a large N: it makes the CI narrow. An approach which produces a narrower CI is more efficient. I took the ratio of the width of the HL CI relative to the width of the t-test CI. A ratio of 1 indicates equality, while a ratio greater than 1 indicates that the t-test is more efficient and a ratio less than 1 indicates that the HL is more efficient.
The ratio of HL to t-test intervals is presented below:
HL CI range / t-test CI range, by N_{group}

| Distribution | 6 | 51 | 501 |
|---|---|---|---|
| Normal | 1.2101 | 1.0296 | 1.0235 |
| Outlier | 1.2137 | 0.9863 | 0.9727 |
| Rectangular | 1.2211 | 1.0614 | 1.0211 |
| ‘Typical’ | 1.2175 | 0.9853 | 0.9719 |
A similar set of observations can be made. First, when N_{group} is small, the t-test has over 21% better efficiency. This is similar to the above results. Second, when N_{group} is moderate or large, if the data truly are normal, the t-test has slightly better (3% and 2%) efficiency than the Wilcoxon. The rectangular distribution also had better efficiency with the t-test (6% and 2% for the moderate and large Ns, respectively). The heavy-tailed ‘typical’ nonnormal and outlier distributions had slightly better efficiency for the HL estimators given moderate and large Ns, about 1.5% and 2.7%, respectively.
Finally, one assumption of the t-test is that the distribution of means is normally distributed. However, with the central limit theorem, as the N_{group} increases, the distribution of means from an originally nonnormal population becomes much more normal. How normal was the difference between the means? Well, I examined the 10,000 simulated mean differences per N and distribution. We can examine their distributions and test if the means are nonnormal (I used the Anderson-Darling test p-value).
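Skew, Kurtosis, and test of normality p-value, by N_{group}

| Distribution | 6: Skew | 6: Kurtosis | 6: p-value | 51: Skew | 51: Kurtosis | 51: p-value | 501: Skew | 501: Kurtosis | 501: p-value |
|---|---|---|---|---|---|---|---|---|---|
| Normal | 0.02 | 0.00 | 0.07 | 0.02 | 0.01 | >0.25 | 0.00 | 0.06 | >0.25 |
| Outlier | 0.01 | 0.06 | >0.25 | 0.02 | 0.05 | >0.25 | 0.06 | 0.01 | 0.22 |
| Rectangular | 0.01 | 0.11 | >0.25 | 0.02 | 0.05 | >0.25 | 0.00 | 0.02 | >0.25 |
| ‘Typical’ | 0.01 | 0.02 | >0.25 | 0.16 | 0.04 | 0.13 | 0.00 | 0.06 | >0.25 |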
It can be seen that even when only 6 observations were seen per group (or N_{total} was 12), the skew and kurtosis were very close to zero for all distributions. For all distributions, despite having 10,000 observations, no normality p-value indicated that the means were anything but normally distributed. [Yes, if a trillion observations were used it would be statistically significant, but the skew and kurtosis of these distributions will still be ‘clinically’ nonsignificant.]
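The central limit theorem at work can be checked with a few lines of Python. The sketch below covers only the rectangular distribution with N_{group} = 6; the seed, the function names, and the use of simple moment-based skew and excess kurtosis are assumptions for illustration, and it does not reproduce the Anderson-Darling test.

```python
import random
from statistics import mean, pstdev

def skew_and_excess_kurtosis(values):
    """Moment-based skew and excess kurtosis (both are 0 for a normal)."""
    m = mean(values)
    s = pstdev(values, mu=m)
    z = [(v - m) / s for v in values]
    skew = mean(u ** 3 for u in z)
    kurt = mean(u ** 4 for u in z) - 3.0
    return skew, kurt

def simulated_mean_differences(n_group, n_sims, rng):
    """Differences between two group means drawn from a rectangular population."""
    diffs = []
    for _ in range(n_sims):
        active = sum(rng.random() for _ in range(n_group)) / n_group
        control = sum(rng.random() for _ in range(n_group)) / n_group
        diffs.append(active - control)
    return diffs

rng = random.Random(2014)   # arbitrary seed
diffs = simulated_mean_differences(n_group=6, n_sims=10_000, rng=rng)
skew, kurt = skew_and_excess_kurtosis(diffs)
print(f"skew = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

In theory, a single rectangular variable has an excess kurtosis of -1.2, and a mean difference built from 12 such independent values has -1.2/12 = -0.1, which is close in magnitude to the 0.11 reported for the rectangular distribution at N_{group} = 6.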
Summary: In this statistical study,
Conclusion: I will continue to suggest that the t-test (or ANOVA) should be the primary test to be used. This is especially true when the sample size is small, or there are more than two treatment groups, or a multifactor analysis is used, or there are covariates or stratification, or one wants to determine the sample size for the trial, or design future trials, or when one has the ability to design the trial using sound methodology. Whew, that was a lot of ‘or’s. I should note that I have never seen a moderate or large study that did not include multiple factors, strata, or covariates. Never.
Is nonparametric testing next to useless as I suggested in my tenth blog? Not anymore, as confidence intervals are now possible. However, np testing still focuses on trivially simple analyses (e.g., 2 groups with no other factors, strata or covariates) and lacks a methodology (power analysis) for designing a study around np analyses. Moreover, the nonnormality problem can be avoided either by N/group > 5 or by transforming or cleaning the data. Would I suggest np analyses for a key analysis? NO. For almost all cases, I would still strongly recommend the use of the more powerful and more bulletproof t-test (ANOVA). I would still suggest presenting nonparametric statistics as a supplemental analysis.
A picture is worth a thousand words
Everything should be made as simple as possible, but no simpler.
C’est la Bérézina – French phrase meaning ‘it’s a complete disaster’
***
We’ve all heard it, ‘A picture is worth a thousand words’. What a preposterous lie! Let’s analyze this phrase.
‘Picture’ can be any visual image, for this blog, I’m thinking of a statistical graph. What can I say? I’m a statistician.
‘Worth’, well beauty is in the eye of the beholder. This is no doubt true. With regard to statistical graphs, what is worth? I can’t judge ‘worth’ in terms of my eye, nor my client’s. Worth should ultimately be judged from my client’s target audience, be it the Agency or a practicing MD scanning a periodical. What is worth from my perspective? As a statistician, truth comes foremost to my mind, followed closely by elegance.
Truth? We all want to focus the reader’s attention on the treatment effect. Let’s say we see a final active treatment mean of 72 and a placebo treatment mean of 70, on a scale which goes from 1 to 100. It would be misleading to present a graphic which presents the results with an axis which goes from 71 to 73, on this 100 point scale. There are various ways to mislead. One can shorten the height of the graphic, minimizing differences, or exaggerate them by cutting off part of the scale. Any way around it? Yes, we can plot the difference between the active and placebo (e.g., +2.0), embellished by its 95% confidence interval on the mean. I love plotting that with a line indicating a zero difference (the null hypothesis). Even better would be the mean changes from baseline (‘improvement’) for active, placebo, and the difference in ‘improvement’, each with their 95% CI of the mean. Sometimes it helps to have two axes (on both the left and right sides), if the magnitudes of the scales differ. For example, put the difference on the right axis and scale the individual means by the left axis.
As a side note, you might wonder why I stated ‘on the mean’ twice in the last paragraph. In my first job, my boss’s boss (Lou Gura) told me that graphs should be entirely self contained. The reader shouldn’t have to read the text of the paper to figure out what is presented. There are many confidence intervals, like the 95% CI of the raw data. One should, at a glance be able to deduce what each tick on a graph is. One expert suggested including a text box summarizing the conclusion the reader should reach.
Elegance? As a statistician, I favor simplicity. “Everything should be made as simple as possible, but no simpler.” “There are some easy figures the simplest must understand, and the astutest cannot wriggle out of” (for the full quote, see 6. ‘Lies, Damned Lies, and Statistics’ part 1, and Analysis Plans, an essential tool).
Combining truth and elegance, I want to present a graphic which conveys the information clearly and completely. More on this in Graphics II.
‘Thousand words’ is quantifiable. I pulled up a work of fiction and timed how long it took me to leisurely read 1,000 words. It took almost 5 minutes to read these two pages. Obviously, skimming would be faster and reading technical works takes longer. How long do you look at any picture? When I go to a museum, I seldom spend 5 minutes on any picture. Most pictures I spend less than 10 seconds on (< 33 words?). When was the last time you read a medical journal and spent 5 minutes on a statistical graph? Nevertheless, our objective with a statistical graph is to entice the reader to linger, yet to understand the graph immediately, especially what the graph’s originator is trying to say.
Let me give you a counterexample, clearly worth many times 1,000 words. I took a graphics course by Edward Tufte. Along with three of his books, he gave each attendee the following graph by Charles Joseph Minard. It presents Napoleon’s March to Moscow – The War of 1812. Tufte claimed, and I agree, that it “may well be the best statistical graphic ever drawn”. I’ve spent many hours staring at this graphic. It is on my office wall above my monitor, for inspiration.
The graph presents a map from the Polish-Russian border to Moscow; it presents the size of the Army going to (gold) and returning from (black) Moscow, including various troop diversions; and the temperatures experienced by the returning army at various dates. The Russians successfully used the scorched earth policy to devastate the invading army. It is the most expressive antiwar picture I’ve ever seen. One can’t fail to see the astonishing loss of Napoleon’s troops. One can’t fail to see that 422,000 soldiers entered Russia, with the band narrowing by one millimeter for every 10,000 men lost. Only 10,000 returned (6,000 of whom returned from the North). The army lost about 98% of its soldiers invading Moscow. One can also see the staggering loss at the Berezina River (“C’est la Bérézina”) and from cold spells. Two years later Napoleon fell.
With all the audiences a graph represents, with all the various elements, let me quip – A picture is worth a thousand dollars. By this I mean, in contrast to computing a set of means or tests, a statistical graph takes many, many iterations (time) to get just right. The elements of a graph include the title and subtitle, font and letter sizes, style of graph, the left (and right) and bottom axes, legends, embedded notes, colors, type of ticks, etc. All require discussion and many, many iterations. Multiply this by each dependent variable and costs rise. In my early days, I never gave my clients graphs – too much arguing about cost. Then I realized that most of my clients were visualizers, whose primary way of assimilating information is not conceptual nor verbal, but visual.
A great graph can, on its own, grant an Agency’s approval. We should replace the summary of a submission with a simple graph; the rest of the submission would be ‘filler’ and ‘boilerplate’.
More about how to achieve this in Graph II.
At times, I strive to be an educated Simpleton.
***
OK, we know that a great graph can Sing!!! But do we know anything about what conveys information the best? Yes. Many psychology studies have been done to tell us what is good and what is ineffective. The information I’ll be conveying comes from Naomi B. Robbins’ book Creating More Effective Graphs, from Wiley (2005), and from a seminar she gave to the Deming Conference in December 2005. The book is written in an exceptionally clear manner and goes over the worst and best of graphs. It is both educational and entertaining. She expects to reprint it this winter. I highly, highly recommend it. Her company’s website is http://www.nbrgraphs.com/. I’ll attempt to summarize some of her suggestions and her summary of the literature.
First, her goals: “For our purposes, Graph B is considered more effective than Graph A if the quantitative information contained in Graph B can be decoded more quickly or more easily by most observers than of Graph A (from the 2005 seminar).”
Are there any graph types which are ineffective? Yes, probably the most frequently used type of graph ever – the pie chart. Why? Information in a pie chart is based on the angles of each wedge. 1) People can’t judge angle differences easily. 2) Acute angles are underestimated and obtuse angles overestimated. 3) Angles on the horizontal are overestimated relative to angles on the vertical. One expert, Edward Tufte, said, “Given their low data-density and failure to order numbers along a visual dimension, pie charts should never be used.”
Let me give an example from Dr. Robbins’ book:
What do you conclude from Figure 2.3 (sorry, about my first graph being 2.3, but that was her number)?
According to a graph expert, William Cleveland, the following is the order from the best to worst (graph types on the same line [e.g., angle and slope] are approximately equal):
If I said pie charts are poor, what can make them poorer? Bling! One no longer includes 3D 32-pointed stars indicating direction (e.g., North) on maps; nor sea dragons or Neptune; nor curlicues. Yet 3D block charts, such as 2.3 above, are common. Worse than 3D block charts are 3D block charts at an unusual angle and with the bar not up against the scale. The following is an illustration from Dr. Robbins’ book:
What is the height of bar 1? Is it about 1.3 (from the front top)? Is it 1.7 (from the back top)? Is it 1.5 from the middle? No! No! No! If you pushed the bar back to where the scale begins, its height would be 2.0. The eye was fooled by the downward-looking perspective of the above graph. Pseudo 3D graphs are tasteless bling!
As mentioned above the best way to convey information is using a common scale. Let me return to the percentages from piechart, Figure 2.3, and represent it using bars.
Did you conclude from pie chart, Figure 2.3, that the left most wedge was 40%? That the other three wedges were all equal and all were 20%? OK, enough said, pie charts are only useful for throwing into the faces of comic statisticians (an oxymoron, if I ever heard of one).
Going upward in Cleveland’s hierarchy is length. Let me illustrate that with a stacked bar graph, in Figure 2.11. What can you conclude from the topmost stacked bar, from ‘All Other OECD’?
Most people can’t judge the height of a ‘floating’ bar very easily. So one simple solution is to pull out the information from ‘All Other OECD’ into its own figure, Figure 2.12. Now, can you see a downward trend in the data? Moral: “it is very difficult to judge lengths that do not have a common baseline (Naomi Robbins, page 31).”
Dr. Robbins also presented each of the subgroups side by side, so for year 77, she presented the US bar, the Japan bar, the West Germany bar, and the All Other OECD bar, followed by the four for 78, etc. However, due to the clutter, it was more difficult to see any patterns.
I think that the above 2.12 is quite clean and understandable. So if you have subgroups which you want to present, it makes sense to ‘waste’ paper and present the subgroup alone with a common scale and without any other cluttering information. Simple is often the best.
OK, simple is the best. One last comment for this blog, the simple presentation of the key information is the best. When dealing with the goldstandard active vs placebo study, the key information is not a plot of the active and placebo means. No. The key piece of information is the difference in the means. People are frequently unable to discern visual differences. Take the following top plot. What can you conclude about the difference in balance of trade?
Yes, we can easily see the large difference between 1720 and 1740. But did you pick up the spike after 1760 presented in the bottom plot? “We miss it because our eyes look at the closest point, rather than the vertical distance (Naomi Robbins, p 37).”
Simple truth: If we are interested in the difference between active and placebo, then we need to plot the difference between active and placebo.
If this was the difference between active and placebo over time, then the only tweak I might make is to include the 95% CI on the difference, with a horizontal line at 0.0 to see if the lower CI bar ever is above zero (i.e., is statistically significant).
Primary lesson: Keep it simple, keep it to the point, stress the key objectives of the research.
Yesterday was 1 degree Fahrenheit and today is 10. I’m ten times warmer!!
Compassion? We statisticians have evolved beyond such petty human affects.
I received the following question from Simon Wilkinson from New Zealand:
Dear Allen. To set the scene, I am not a stat or a biostat. We are treating patients with secondary progressive multiple sclerosis on a “compassionate basis” with an experimental drug – something that is allowed in our country (NZ). The number for patients is very small, about 15. Each patient is their own unique set of symptoms. We are using a MS specific QoL patient reported questionnaire (the MSQLI) to obtain baseline and then 3 monthly data as one means of gauging treatment effect in the absence of biomarkers – one of the challenges of treating this indication. We have been looking at the effect in each patient by using PCFB. In some components of the MSQLI a reduction in score is improvement and in other components, an increase in score is improvement. As a lay person an immediate issue arises. A baseline score of 1 (bad) verses a 3 mth score of 7 (much improved) equals a PCFB of 600%. For a different component a baseline of 7 (bad) verse a 3 mth score of 1 (much improved) equals a PCFB of 85%. This seems wrong! Subsequent ‘googling’ on the issue reveals the apparent minefield of PCFB!! Fundamentally we are interested in how treatment is impacting each patient as opposed to an overall effect in a larger population. Can you suggest an appropriate approach. Sincere thanks.
First off, a ratio is only meaningful when zero means zero. I had forgotten to include this assumption when I wrote Blog 18. For a percent change from baseline [100*(X{Month 3} – X{Baseline})/X{Baseline}], the baseline must be on a scale where zero is zero, a complete absence of the attribute (e.g., no debilitating effects of the disease). If your scale can be transformed so a 0 is ‘free of illness’, then you can compute a ratio. Therefore a PCFB can be done for Heart rate, Weight, and Height. But a PCFB cannot be done for temperature on a Celsius or Fahrenheit scale, but could be done for a Kelvin or Rankine temperature scale. So if, and only if, a 7 of much improved means free of illness then you could do x’ = 7 – x, where x’ is a new score and x is the old score. However that is NOT possible with your QoL scales. This is the root of the discrepancy of your 600% and 85%.
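The asymmetry in the question is easy to reproduce. Here is a minimal Python sketch using the two examples from the question (the function names are invented for illustration); note that the second percent change is really -85.7%, a decrease.

```python
def pcfb(baseline, follow_up):
    """Percent change from baseline: 100 * (follow_up - baseline) / baseline."""
    return 100.0 * (follow_up - baseline) / baseline

def change(baseline, follow_up):
    """Simple change from baseline."""
    return follow_up - baseline

# The same 6-point move in opposite directions on a 1-to-7 scale.
print(pcfb(1, 7), pcfb(7, 1))      # 600.0 and roughly -85.7
print(change(1, 7), change(7, 1))  # 6 and -6
```

The simple change is symmetric (+6 and -6), while the percent change depends entirely on which end of the scale the patient started from.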
I would recommend using a simple change from baseline and forgetting the percentage part; see my blog for other reasons to avoid it. In reporting your results, I recommend you present the mean difference between the Month 3 and Baseline results, and present the CI of the difference. If the CI excludes zero and is positive, then you can say it was a ‘statistically significant improvement’. Alternatively you could do a paired t-test and get identical p-value results.
To make your presentations easier, one can always do a linear transformation of the form X{Transformed} = a + bX{Original} on any data for which you report means or medians. Such linear transformations have NO EFFECT on p-values. Means, medians, and Confidence Intervals are unaffected except that they will be transformed using the identical transformation. For example, you can always transform a proportion (a mean of 0, 1 data) into a percentage by multiplying it by 100. Correlations are totally unaffected (apart from a change of sign when b is negative). S.D. or standard errors would be affected by SD{transformed} = |b|*SD{Original}; variances would be multiplied by ‘b’ squared. As at least one of your components is a reflection of the other, I would do the transformation: X’ = 8 – X. You would then say “The scales were reflected so a positive number indicated improvement.”
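A short Python sketch can confirm that reflecting a 1-to-7 scale changes only the sign of the paired t statistic, so the two-sided p-value is unchanged. The patient scores below are hypothetical and the helper names are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic for the change (after - before)."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def reflect(scores, a=8, b=-1):
    """Linear transformation X' = a + b*X; the defaults give X' = 8 - X."""
    return [a + b * x for x in scores]

# Hypothetical scores on a 1-to-7 scale where LOWER is better.
baseline = [6, 7, 5, 6, 4]
month_3 = [3, 5, 4, 2, 3]

t_raw = paired_t(baseline, month_3)
t_reflected = paired_t(reflect(baseline), reflect(month_3))

print(t_raw, t_reflected)                          # same size, opposite sign
print(stdev(baseline), stdev(reflect(baseline)))   # S.D. unchanged since |b| = 1
print(mean(baseline), mean(reflect(baseline)))     # mean becomes 8 - mean
```

After reflection, a positive change means improvement on every component, which is the only thing the transformation was meant to achieve.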
To put it in context to Class 1 of the Statistics 101 class: For Ratio level data (e.g., inches to centimeters), you can always do transformations of X’ = bX. For Interval level data (e.g., Fahrenheit into Celsius), you can always do transformations of X’ = a + bX. For ordinal data, you can always do a monotone transformation, such that increasing X will produce an increased X’. For nominal data, any transformation is possible.
Finally, I LOVE the idea of providing experimental medications to patients on a compassionate basis. Providing patients with an opportunity to be treated with a novel compound, which may be of unique benefit to them, is fantastic. This is even more important for patients who are nonresponders to the more traditional treatments. I’ve been involved with a large number of such compassionate protocols (e.g., the use of a blood substitute to treat Jehovah’s Witness patients, who might die facing a major surgery otherwise). I was proud to be assisting that company. So congratulations on doing such compassionate treatment.
However, as a statistician (and this is a biostatistics blog), the evaluation of efficacy is frankly a waste of time. I’ve observed that the data are typically much messier than for ‘official Phase x’ studies, no offense to your noble treatment of these very ill patients. Such studies often have greater proportions of missing data, sloppier visit windows, small sample sizes, and poorly written protocols, especially with respect to objectives and how the data will be analyzed. Big pharma often reports only the raw data and does not present any summaries of efficacy for compassionate protocols. If they do provide analyses, they tend not to be p-values, only descriptive statistics. Even if your protocol was the best written and executed protocol there was, the primary fault of compassionate-use protocols is that they are nonrandomized, noncomparative, open-label study designs. If there are comparative groups, the groups often differ on a very large host of baseline demographics/medical conditions. There are a host of reasons/biases which make the meaning of “patients experienced a statistically significant increase in their mean change from baseline” uninterpretable. These include: spontaneous remission, wanting to please the helpful doctor and staff, cognitive dissonance, progressive nature of the disease, time of year, scale interpretation changing over time, … In other words, such studies are little more than testimonials. The Integrated Summary of Efficacy will possibly link to their report, but exclude them from any summary analysis. That is, the ISE will ignore them. The analysis of safety is typically a subpopulation in the Integrated Summary of Safety, an ignored sidenote in the ISS. You’re lucky that the experimental drug manufacturer is allowing your compassionate-use protocol. Most pharma executives consider it a waste of their resources (money), although they might benefit from the PR.
At least that’s my opinion/observation.
Submitted on 2014/05/12 at 8:23 am
Dr. Fleishman,
I am so happy I found your site. I have been trying to decide how to best analyze the results of a pilot study I conducted (ABA design: N = 4) examining an intervention targeted for people who stutter . During the first phase (AB) a dependent t test was performed. I am pasting the results of this preliminary analyses below:
Results of the SSI4 indicated that participants reduced their percent syllables stuttered from M= 7.1 (sd=1.7) to M= 5.1 (sd=1.7). Although this did not reach statistical significance, there was a large effect size (d =1.18). Findings from the Burns Anxiety Inventory revealed a large decrease in the number of items checked off as causing anxious feelings, anxious thoughts, and physical symptoms of anxiety. Anxious thoughts were significantly higher before yoga treatment (M=.70; sd=.45) than after yoga (M=.27; sd=.32) as indicated by a significant paired ttest, t(3) = 5.58, p = .011; d = 1.10 (large effect). On the OASES, participants indicated positive changes in the general perception of their impairment with improved reactions to their stuttering, reduced difficulties about speaking in daily situations, and improved satisfaction with their quality of life related to communication. Overall perceptions of speakers’ experiences of stuttering were significantly more negative before yoga treatment (M= 2.70; sd=.22) than after yoga (M= 2.32; sd=.17) as indicated by a significant paired ttest, t(3) = 8.01, p = .004., d = 1.93 (large effect).
I am now analyzing the results including the followup measures. After reading the previous posts, I realize that I should be focusing on reporting descriptives, rather than trying the figure out the appropriate statistical test to use. I just read the results of a study similar in design to my study with a sample size of 3. In that study the authors used a technique called “Split Middle Method of Trend Estimation”. I have never heard of this technique. Could you explain how to perform such an analysis? Would you recommend using this type of analysis?
I am trying to get the results of this small pilot study published, but am worried about just reporting descriptives. . Is there a bias toward not publishing studies that are purely descriptive in nature?
Heather
I believe you are asking two questions. 1) Information about the Split-Middle method and 2) Publication of descriptive studies.
1) I had never heard of the split-middle test. I did a quick search on the web and located “http://physther.org/content/62/4/445.full.pdf”. They give a complete description of how to do the calculation (pages 448-449). It is an approach which examines the trend of the data by dichotomizing the data into two halves, based on date. Then it computes the median of each half and plots a line between the two medians. This approach uses the ordinal information of the data. Ordinal information is not bad. It is certainly far better than nominal, but slightly less powerful than interval. When you have small N, however, you need all the information you can get. One issue with small N is that a single outlier could vastly affect your conclusions. Did you see any? If not, then nonparametrics aren’t necessary. With regard to your study, it does not appear that you are looking at time (date). So I’m not sure if this approach is applicable to your study.
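As a rough illustration of the description above (split the series into two halves by date, take the median of each half, and join the two medians with a line), here is a Python sketch. It follows only that short description; the published procedure may include further adjustments that are not reproduced here. The function name, the rule of dropping the middle point when the count is odd, and the weekly data are all assumptions.

```python
from statistics import median

def split_middle_line(times, values):
    """Return (slope, intercept) of the line through the two half-medians."""
    ordered = sorted(zip(times, values))
    half = len(ordered) // 2
    if half == 0:
        raise ValueError("need at least two observations")
    # Assumption: with an odd count, the middle observation is left out.
    first, second = ordered[:half], ordered[-half:]
    x1, y1 = median(t for t, _ in first), median(v for _, v in first)
    x2, y2 = median(t for t, _ in second), median(v for _, v in second)
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

# Hypothetical weekly percent-syllables-stuttered scores for one subject.
weeks = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [10, 9, 9, 8, 6, 6, 5, 4]

slope, intercept = split_middle_line(weeks, scores)
print(slope, intercept)
```

A negative slope indicates a downward trend over time. Because medians are used, a single wild observation in either half has little influence on the line.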
Reply by Heather: I did not see any outliers as per examination of the boxplots. I think there is a common misconception that exists regarding sample size and parametric analyses, for which I have fell prey to: automatically run nonparametric analyses on studies with small sample sizes.
2) Your study. As I’ve written in my blog, I’m a very strong advocate of effect sizes. Yes, with N=4, I would focus on descriptive results. Are these the only three key parameters? Or were these the best three out of ten or twenty or one hundred parameters?
Reply by Heather: Yes, these were the three best results. We utilized 3 measures but within those three measures we looked at individual section scores in addition to total score.
Reply by Allen: Then the alpha level has much less meaning. Your reply didn’t indicate the number of subscales (p-values) you tested. If there were 10 (subscales and total) per scale, then there were 3*10=30 p-values. With 30 p-values, the likelihood of finding at least one fortuitously statistically significant is (1.0 – (1.0 – 0.05)^30 =) 79%. That is, with completely random data, with 30 tests, you would find statistical significance 4 times out of five. The same would apply to the detectable difference. Out of all the scales and subscales, one must be the biggest difference.
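The arithmetic behind that 79% can be sketched in a few lines of Python. The formula assumes the tests are independent, which subscales of the same questionnaire will not strictly be, so treat the result as a rough guide; the function name is invented for illustration.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

for k in (1, 3, 10, 30):
    print(k, round(familywise_error(k), 3))
```

With 30 tests the result is about 0.785, the 79% quoted above; even with only 3 tests it is already about 14%.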
This is THE reason statisticians rigorously adhere to the protocol’s key parameter/hypothesis. By stating a priori what we are looking at we can meaningfully see if we hit the target.
Fortunately, with your three parameters, you are blessed with two which are significant. I also strongly believe in confidence intervals. Looking at your results, you found for the weakest (??) first parameter (Percent of Syllables Stuttered) a mean difference of 2.0 (sd = 1.7), for an effect size of 1.18. A 95% CI (I believe) is -0.70 to +4.70. As you noted, since the lower end is minus 0.7 you cannot discount that the intervention’s magnitude could be zero, or even deleterious (negative 0.7 points or an effect size of -0.4). However, your best estimate of the true effect, the mean, is that the intervention was worth 2.0 points (effect size of 1.2). Finally, the intervention could be as large as an improvement of 4.7 points (effect size of 2.8). The effect size of -0.4 is a moderate negative effect size. However, the 1.2 is quite large and the potential benefit of d = 2.8 is huge. My conclusion from these observations is that the intervention is indeed potentially useful. What you did was likely quite positive, especially for these 4 subjects.
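The interval quoted above can be reproduced with a short Python sketch. It assumes that the 1.7 is the standard deviation of the paired differences (the comment does not say so explicitly) and takes the two-sided 95% critical t for 3 degrees of freedom from a table.

```python
from math import sqrt

T_CRIT_DF3 = 3.182   # two-sided 95% critical t for 3 degrees of freedom (table value)

mean_diff = 2.0      # reduction in percent syllables stuttered
sd_diff = 1.7        # ASSUMED to be the s.d. of the paired differences
n = 4

half_width = T_CRIT_DF3 * sd_diff / sqrt(n)
lower, upper = mean_diff - half_width, mean_diff + half_width

# The same three numbers expressed as effect sizes (divided by the s.d.).
d_lower, d_mean, d_upper = (x / sd_diff for x in (lower, mean_diff, upper))

print(f"95% CI: {lower:.2f} to {upper:.2f}")
print(f"effect sizes: {d_lower:.1f}, {d_mean:.1f}, {d_upper:.1f}")
```

With only 4 subjects the critical t is large (3.182 rather than the familiar 1.96), which is why the interval is so wide.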
I am very concerned that the effect size of d=1.18 was not statistically significant, but the effect size for anxious thoughts (d=1.10) was statistically significant. If both used N=4 and a paired t-test, then there is a MAJOR ERROR somewhere. You CANNOT have a smaller effect size (1.1) significant, but a larger effect size (1.2) not significant in the same parallel analysis.
Reply by Heather: I recalculated the effect size and got .63 (not 1.18) for the nonsignificant test. and 2.79 (not 1.10) for the test that reached statistical significance.
Reply by Allen: Although it wasn’t significant, the 0.63 effect size, is still a large treatment effect. That is, the intervention induced the responses to shift almost 2/3 of a standard deviation, a very, very respectable change.
That assumes you calculated the effect size correctly (and were not reporting the t-test value), for example, that you used the standard deviation and not the standard error of the difference in means. My only comment here is that an effect size of 2.79 is very, very huge. The two distributions do not overlap. I work primarily in drug studies. Effect sizes of 0.3 are common. An effect size of 1.0 is very, very rarely seen, at least in double-blind and randomized trials. My gut says that an effect size of 2.79 is not a realistic treatment effect. Perhaps, for example, you measured anxiety levels minutes after your yoga treatment, rather than the next day or week.
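The distinction between an effect size and a t value can be made concrete. If d is taken to be the mean difference divided by the standard deviation of the paired differences (an assumption; other definitions of d exist), then the paired t statistic is simply d times the square root of N. A minimal Python sketch, with invented function names:

```python
from math import sqrt

T_CRIT_DF3 = 3.182   # two-sided 5% critical t for 3 degrees of freedom (table value)

def d_from_paired_t(t_value, n):
    """Effect size implied by a paired t, assuming d = mean diff / s.d. of diffs."""
    return t_value / sqrt(n)

def paired_t_from_d(d, n):
    """Paired t implied by an effect size under the same assumption."""
    return d * sqrt(n)

n = 4
print(d_from_paired_t(5.58, n))                # effect size implied by t(3) = 5.58
print(paired_t_from_d(1.18, n) > T_CRIT_DF3)   # would d = 1.18 be significant?
print(paired_t_from_d(1.10, n) > T_CRIT_DF3)   # would d = 1.10 be significant?
print(T_CRIT_DF3 / sqrt(n))                    # smallest |d| that reaches p < 0.05
```

Under this definition, the reported t(3) = 5.58 corresponds to d = 2.79, which agrees with the recalculated value, and with N = 4 neither 1.18 nor 1.10 would reach significance, since |d| must exceed about 1.59.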
One problem with this study is that there may be alternative explanations for the results, not just p>0.05. Were the subjects or raters blinded, etc.? For example, the ‘Goodbye’ effect. This could easily invalidate all of your conclusions. One of my favorite books was a 71 page gem by Campbell and Stanley, titled Experimental and Quasi-Experimental Designs for Research (1963). See their page 8, ‘2. One-Group Pretest-Posttest Design’. They would say your design is delinquent in that it is potentially invalidated by a) History, b) Maturation, c) Testing, d) Instrumentation, e) Selection by Maturation (etc.) interactions, f) Testing by Intervention interaction, g) Selection by Intervention Interaction, and other potential flaws. When my wife was in graduate school and doing her master’s thesis, she did an intervention study. All it may have proved is that her subjects liked her and wanted to help (give her positive results). It sufficed in that it gained her a master’s degree.
Reply by Heather: I agree with you. This study was definitely not a randomized control double blind study.
However, let me just focus on the results, not the design. You have three parameters which apparently give strong parallel results on the favorability of your intervention. To finally answer your last question, yes, publications hate to publish results which do not have that magic ‘p < 0.05’. As noted above, the first parameter indicates that the intervention could have a moderate negative effect (-0.4). [They would actually say (focus on) that the results could be zero – ignorant fools.] Where does that leave you?
My best suggestion is to allow the pilot study to suggest a larger and potentially better study and not to treat a pilot study like the final study. Of course, you could do both. Attempt to publish, while completing the full study.
Comment by Heather: I still, however, would like to get it published as this intervention shows promise and warrants a larger, more controlled study.
Comment by Allen: Best of luck to you.
It really is a nice theory. The only defect I can think it has is probably common to all philosophical theories. It’s wrong – Saul Kripke
***
I received the following question (abbreviated slightly) regarding blog 18 Percentage Change from Baseline – Great or Poor?
What about the case where you have just a single group of subjects measured at two timepoints (say, baseline and followup)? This often occurs in medical studies, as you know.
Then, is percent change still not useful for each subject?
With regard to this specific design and percentage change, you can compute the mean difference between baseline and post. Alternatively you could compute the difference between baseline and post divided by baseline (assuming you have a ratio level scale) and present the median percent change. [Note: for reasons mentioned in my blog, the individual percentage changes tend to have a highly positively skewed distribution, so I would use medians rather than means.]
But what does it mean?
A single group measured at two time points with an intervention in between is an internally flawed (horrible) design. Many effects could cause a true mean change from baseline to post. Unfortunately, it is typically not only the (medical) treatment. For example, natural changes in people (e.g., spontaneous remission, natural healing, regression to the mean), season, selection of subjects (sick patients come to doctors because they are ill – at any later time point they aren’t as ill), subjects saying the nice doctor helped them, etc. (and there is a very long list of potential other reasons to explain the difference). Most of these alternative (non-treatment) factors bias the results to make the second observation appear better. The single group pre-post is a truly horrible design. This is not a true experimental design. Campbell and Stanley referred to this design as a Pre-Experimental Design or a quasi-experimental design (see page 7 of their book).
The very first study I professionally analyzed was a 4 week drug intervention in depression. Yes, the patients treated with our medication (Amoxapine) changed 13 points. Fortunately, the study was a randomized and blinded study, with patients treated with drug or placebo. It was only because we included placebo patients, who had a mean change of 7, that we could deduce that our drug had a 6 point drug-treatment effect. Without the placebo group, the 13 points could have been solely a placebo or any number of similar effects.
Unfortunately, most experimentalists can make NO credible interpretation as to why the one-group pretest-posttest design [percentage] difference is what it is. One could say, ‘we saw a 13 point difference … or a 31% median percentage improvement relative to baseline’. But there is a major leap from saying ‘there was a change’ to saying that ‘we saw a change due to the treatment’. They typically put two disjoint effects together in the sentence and make such an implication, such as ‘the 31 patients, WHO RECEIVED TREATMENT X, had a 13 point difference … .’
Unfortunately, as the commenter noted, this is a frequently used design, especially in the medical device industry. For such a design to work, the scientists MUST believe that patients are static and unchanging – a patently and demonstrably false assumption. But then again, they would seldom hire a ‘real’ statistician to review their study. They typically use students who have had only a single Stat course to analyze their data. They don’t like to be told their head of clinical operations is incompetent or they are too cheap to run a real study.
Again, the One-Group Pretest-Posttest study is NOT a real experiment; it is little more than a set of testimonials (One-Group (informal Pretest) Posttest ‘study’ with much missing data). You could compute the change and percentage change, but it cannot be interpreted, hence any conclusion, any data analysis, is meaningless. The ONLY good that can come of such a trial is the promise of doing a real trial.
“Psychology is a crock”, Bob Newhart show, Season 4 Episode 8
***
Let me start out with a detail I don’t publicize much. My PhD was in Quantitative Psychology or Psychometrics. Like Biometrics, it is the mathematical end of the discipline, in this case psychology. For those who haven’t taken a Psych 101 class, Psychology is divided into two parts: clinical and experimental. Their relationship is like Painting (Art) and Physics, respectively. In other words, there is almost no overlap. I’ll be talking primarily about the ‘scientific’ part – experimental psychology.
My first two years in graduate school at CUNY were in Cognitive Psychology, an experimental branch of Psych. I completely discarded this education, and went to the University of Illinois, for reasons discussed below. I realized that in order to study the mind, I needed tools far beyond comparing two means – the predominant experimental approach. I chose the U of I as it had a large number of former presidents of the Psychometric Society among its faculty. My U of I education was almost exclusively in quantitative methods. I took only two non-quantitative courses, one of which was a required course on paradigms of psychology. My U of I interests focused on a) Monte Carlo studies, which you might have observed in previous blogs; b) Multivariate statistics, particularly factor analysis; and c) Time series, which I picked up from an econometric course.
My masters and doctoral theses were both Monte Carlo studies. Neither was psychologically oriented. At the U of I, I ran one and only one human study. It used one and only one subject – my wife. I humorously called it, “Asking my Wife What She Thinks”. It was an N of 1 study, using factor analysis, with a very structured set of data, but the parameters were mostly unique (idiosyncratic) to my wife. But more of that in my next blog.
My opinion of scientific psychology after getting a PhD over thirty years ago? It still is wrong. 100% wrong.
Psychology: The Study of Herd Behavior
My original graduate school experimental design focused on the following research paradigm: Get a group of subjects, randomize them into groups, treat the groups differently, and compare the means. This is almost identical to the gold standard used in biomedical research. What can be wrong? For medical research or sociology, nothing. Except for the people studying genomics, medical research tacitly assumes that all subjects are equivalent and interchangeable (prior to treatment). One often would test for interactions among subgroups (e.g., gender, site). Almost every statistician and medical director hope they can ignore the subgroups. In statistics, our tests assume that the data (per treatment group) come from a distribution with a single mean, a single variance, and the data are identically and independently distributed. In other words, except for noise (error), the subjects are all identical to one another. In biology, one lab rat or patient is the same as another. I’m cool with that.
Sociology is supposed to look at group behavior.
However, Psychology is supposed to look at people, individuals. While most psych studies are run on a shoestring budget, the ideal is to run a large study. What is the impact of a single subject? As N approaches infinity, the impact of a single subject on the mean approaches zero. The typical analysis of means of a group of subjects, would ideally ignore the individual.
The methodology used in experimental psychology ignores the person, the individual, and only studies the group. I remember one study which asked ‘which x would you think most people would like the most’. This could be quite different from ‘which x would you like the most’. I think it was pictures of women. I chose the sleepy, vapid, well-endowed, blue-eyed blond. The picture looked nothing like my wife, except for the sleepy part <grin>.
Psychology: Explains almost nothing
Having performed or reviewed the analyses of many, many psych studies, I’ve looked at the effect sizes. The majority have an effect size of 0.3, or less. Let me convert it to 1 – R², or the proportion of unexplained variance (see blog 4 and the correlation ratio). The typical amount of unexplained behavior is over 90% [i.e., 100%*(1.0 – 0.3²) = 91%]. With ninety percent of behavior being error or noise, I have a hard time accepting any psychological study as describing a meaningful amount of behavior. Prediction of any single person is very close to pure guesswork – noise. This lack of ability to explain anything is typically obfuscated by focusing on ‘statistically significant’ findings. Hopefully, if you have read Blogs 1 to 4, you will realize that ‘statistically significant’ and ‘effect size’ are two completely independent concepts. Statistically significant is not clinically significant! One might as well do a blind tea leaf or tarot or entrail reading. From what I’ve heard, a ‘blind reading’ should be capable of explaining more than 10% of an individual’s thoughts.
Perhaps I’m being overly harsh. One could say that study x found a factor, which might be useful in the future. Unfortunately no one attempts to come up with a ‘General Unified Theory of People’ by assimilating all the ‘factors’.
It is only by prediction, that any science could be judged. I saw a TV program where the hero left a psychologist’s office. The clinical psychologist intuited that his father left him (something he had not mentioned) after a five minute interview. All clinical psychologists would laugh at that line. It was a nonrealistic sitcom, a silly fantasy, which few realize is a fantasy.
Are such predictions possible? Yes, all the time. I often fill in words for my wife. I know ahead of time what her reaction will be given most situations. But that has nothing to do with any theory of psychology, despite my PhD. The negative side is also true. My kids know ‘which buttons to push’. They learned them at a young age, and they never read Freud, Adler or watched Dr. Phil. One criticism of strict Freudian Psychoanalysis is that it requires a decade (or decades) of one hour sessions five times a week for insights to be made. Perhaps they were right.
So, prediction is definitely possible, but not by the science of psychology.
Psychology: Whose theory is it?
Different people and cultures assign different importance to different concepts. I can only think of all the different names for snow among the Inuit people. The Sami language of Norway, Sweden and Finland has around 180 snow and ice-related words and as many as 1000 different words for reindeer. Snow is of great importance to them but just something to be shoveled by most of us.
I have read a number of books on psychology. Let me just mention Freud’s Oedipal complex. Simplistically it is the traumatic event resolving the desire for a boy to ‘have’ his mother and replace his father. Is it true? Perhaps for Sigmund Freud, perhaps it was true for “Little Hans”, a patient of Freud. Is it universally true? The Freudians say so; I doubt it.
Are the three independent factors of Osgood (evaluation, potency, and activity) true? Yes for Osgood. For all people? I doubt it.
In sum, any theory or factor exposed by a psychologist is likely true for them or someone they know well, but is unlikely to be universally true or equally important for all people. Or if applicable to person x, then it is likely to take a unique twist.
Any theory derived from a Bushman, or an 18th century nobleman, or my wife, is unlikely to be true or be as useful to everyone else. Perhaps useful to some, but not as useful.
In sum, scientific Psychology:
You probably see where I’m leading, but more about it in 26. Psychology II: A Totally Different Paradigm. I will have a partial solution to how to make Psychology a science.
Simple Truths:
In my last Blog, I made four basic points. Psychology:
I realized as a neophyte graduate student that in order to study individuals one needed an advanced methodology. One cannot make advances in any field until the methodology is available. For example, surgery couldn’t advance if the only tools available were flint-knapped rocks. Nor could you have any hope for success until you understood sterilization and anesthesia. In Psychology, comparing two means (or a two way ANOVA) does NOT allow any adequate models of individuals, but, at best, groups.
If you want a real science of individuals, you need to study an N-of-One, or expand it to many N-of-One studies. Is this heretical? Nah. Binet, the author of the first IQ test, based it on his observations of his daughter, an N-of-One. The original studies in Psychophysics used ‘observers’, often N-of-One. Of course, the original clinicians of psychopathology, like Freud, devised their theories based on N-of-One observations. B. F. Skinner used N-of-One. The list can go on.
Mathematical Methodological Tools:
Analyses across individuals might tell you something common across all of them, but ignore any and all individual differences. Means and correlations using groups of people will not work. Empirically, predictions based on studying groups have had little predictive value for individuals. More to the point, p-values (inferential statistics) only apply to groups, not to N-of-One studies. This is because the basic assumption of independence can never be met, as a person knows all of their prior behaviors. Independence of observations is the cornerstone of inferential statistics (p-values). One cannot use p-values in an analysis of a single individual.
For N-of-One research, one can still use means, medians, modes, and categorical descriptive analyses. One can still correlate one parameter with another, and use all the correlational tools (e.g., factor analysis, cluster analysis, multidimensional scaling, regression, discriminant analysis). One could still do time-series analyses.
Concrete Example:
Before I get too esoteric let me describe my one and only human study done at the University of Illinois about 40 years ago, an N-of-One study – Asking My Wife What She Thinks. This used some of the methodology of Seymour Rosenberg of Rutgers. I will not talk about the details too much. I first asked my ‘subject’ (my wife) to name all the people she knew well enough to describe. This included different aspects of herself, friends, relatives, classmates, fictional people, me, etc. There were 110 ‘individuals’ she thought about. I asked her to list adjectives for a random subset of 25 ‘individuals’. Next she identified the polar opposite of each adjective (e.g., imaginative vs. dull, angry vs. happy). We also expanded the list of adjective pairs. Then she reduced the set of adjective/opposites to get rid of duplicates/synonyms and adjectives unique to a specific individual. There were 73 adjective/opposite pairs. Finally she rated each ‘individual’ on each adjective/opposite. That was the hard part, as there were over 8,000 ratings (110 ‘individuals’ by 73 pairs). She spent about 60 hours making the ratings. (*What can I say, she loved me.*) I then did a (correlated or oblique rotation) factor analysis on the data.
It needs to be stressed that what I learned is (potentially) completely unique to her and perhaps unique to her at the age of 27 – an intelligent, Educational Psychology graduate student, focusing on counseling from the Bronx. She is beyond WEIRD (Western, Educated, Industrialized, Rich, and Democratic). She is my wife. Her ‘theory of psychology’ may only be unique to her and may be even more unique to her at that stage of her development, 38 years ago. Furthermore, in the last 38 years, she has thrice grown past her ‘professional career’, her parents and those of that generation are gone, our two children were born and have successfully ‘left the nest’, she has mellowed, and grown wiser. In other words, the way she views people now is very likely different from then.
What did I learn and how does it relate to Theories of Psychology?
Let me again stress, that the ‘individuals’, the adjectives, the factor ‘descriptions’ and the entire framework, were all unique to her and constructed by her, although I did the analyses and was guided by Ledyard Tucker – the world’s leading authority on factor analysis at the time.
Relevance:
Post1976 thoughts:
Can alternative data generation be done? Unforeseen in 1976, Google Glass and similar continuous data collection methods are currently available. In essence we can now record everything we see throughout the day. Continuous recordings (e.g., Google Glass) can also hear everything we hear. Computers are learning how to convert pictures into digital descriptions. Computers can transcribe spoken words into digital data. Our smart phones automatically know our locations, so it is possible to continuously monitor our location. Perhaps the 2015 version might be called ‘Watching and Listening to my Wife to Determine What She Thinks‘.
Factor analysis may not have been the best statistical approach. Perhaps another integrative approach might be better, especially approaches dealing with more ‘granular’ data and unipolar scales. Perhaps completely novel statistical approaches are needed. Repeated measurements on the same subject would be fascinating (and hell to analyze – although Tucker & Messick Three Mode Factor Analysis might be an initial start, if it were integrated with time series analyses).
The data I had my wife collect was ideally suited for factor analysis. In 2015 other data collection techniques (e.g., Google Glass) would imply other or new statistical techniques.
It is my strongly held belief that “Psychology is a Crock”, until 1) every psych graduate student is completely proficient in time series analysis, three-mode factor analysis, cluster analysis and newer statistical multivariate methodologies; 2) every psych PhD student has done at least one ‘N of 1’ research project; and 3) full professors would be expected to have integrated many ‘N of 1’ studies to demonstrate a theory. On the other hand, like the proliferation of individual DNA genome databases, the posting of individual theories of personality would make such integrations easier.
Any psychologist who confirms a theory by comparing two averages (across many people) or computes a correlation (across many people) at a single or a couple of time points should be laughed at, or pitied! Given that maturation takes decades, I can forgive ignoring time/situations, but never ignoring people – individuals. You cannot study people (Psychology) by computing averages. Those pseudopsychologists who are unable to make the transition, should be moved over to Sociology or Biology, where group amalgams are appropriate.
Am I being harsh? Would you trust an airplane built by an engineer who didn’t understand algebra or use computers/calculators? Would you trust a cardiologist who didn’t understand how to read an EKG or measure blood pressure? Would you trust a psychologist who never formally understood a single individual and ONLY used ‘intuition’ to make their insights? Yes, there may be an occasional old-school psychologist who made a cunning insight. But certainly not many insights and certainly not most psychologists.
What is your right shoe size? What is your left shoe size?
How many horses? Simple, you count the number of hooves and divide by four.
A man with one watch knows what time it is. A man with two watches is never sure.
***
A colleague asked me to review a Statistical Analysis Plan (SAP) he inherited. I will simplify it and obfuscate the details to protect the guilty.
In it there was a primary and two secondary parameters. They were “Number of Headaches from Week A to Week B”, where A & B were a) 2 & 7 [the primary parameter], b) 1 & 7 [first secondary parameter], and c) 2 & 8 [second secondary parameter].
I can imagine the origin to be something like this:
Stat Consultant: You want to look at number of headaches. From when to when? We need to put that into the SAP.
Medical Monitor: Uh, you need to know that? Can’t we just add that in later?
Stat Consultant: For the key parameter? No. You should have had that in the protocol
Medical Monitor: OK, let’s make it from the beginning to the end of the study.
Stat Consultant: {Sigh} You’re giving the drug at Day 1, right? The key benefit, the main metabolite, should be barely measurable by Day 4? And it says that the half-life is about a week and a half?
Medical Monitor: Yeah
Stat Consultant: OK, it should have a peak at Week 2, although by Week 1 a good part should be there. By Week 4 more than half should be gone, by Week 5 it should be down to a quarter, by Week 7 about an eighth. Et cetera. Does that help?
Medical Monitor: More than an eighth sounds a bit low, so why don’t we go from Week 2 to 7. Wait a second, someone will ask about Week 1 too. Hmm, we should have some coverage for later too.
Stat Consultant: You gotta pick one.
Medical Monitor: I don’t know. What if I pick wrong?
Stat Consultant: You gotta pick one.
Medical Monitor: Don’t rush me. 1 to 7. No, 2 to 7. Yeah, 2 to 7. Hey why don’t we pick more, just in case?
Stat Consultant: You could do that for the secondaries, but the primary should still be only one.
Medical Monitor: OK, 2 to 7 is the primary and 1 to 7 is a secondary and so is 2 to 8.
Stat Consultant: You’re the boss.
If one were to examine any two of the above parameters, one should expect a very, very high correlation between them. For example, take the relationship between the first (a) and second (b) parameter above. Let X be the number of headaches from Weeks 2 to 7 (Part) and Y be the number of headaches from Weeks 1 to 7 (Total). Obviously Y = X + the number of headaches from Week 1. The correlation of X and Y is the correlation of X to X plus the small component from Week 1. The correlation of a variable with itself (Part with Part) is 1.0. Therefore the correlation of X with Y must be quite high, since Y is mostly X plus a smidgen of something else. Statistically this is known as a part-whole correlation. Even if the correlation of the two unique components, number of headaches from Week 1 (let’s call the unique part ‘Q’) and Weeks 2 to 7 (let’s call the part ‘P’), was zero, the correlation of the part with the whole would still be high. When r_{P,Q} = 0, then r_{P,P+Q} or r_{P,T} would be equal to
r_{P,T} = σ_{P}/√(σ_{P}^{2} + σ_{Q}^{2}) = √(σ_{P}^{2}/(σ_{P}^{2} + σ_{Q}^{2}))
Where σ is the standard deviation, σ^{2} is the variance,
P is a part of the total (e.g., Weeks 2 to 7),
Q is the other part, the part of the total unique from P (e.g., Week 1), and
T is the total (e.g., Weeks 1 to 7). T = P + Q.
As the unique part (Q) is small relative to the remainder, this correlation must be much larger than 0.50. Even if we assume a zero correlation of week 1 to the remaining 6 weeks, and the variances proportional to time only, then the correlation between the first and second efficacy parameter is expected to be r_{P,T} = √(6σ^{2}/(6σ^{2} + 1σ^{2})) = √(6/7) = 0.93.
Furthermore, medically we would expect that the correlation of the part and the unique component (r_{P,Q}) to be positive, further increasing the r_{P,T} correlation beyond 0.93. They are measuring the same thing after all! Therefore, we should expect the correlations among these three ‘primary’ and ‘secondary’ efficacy endpoints to be quite high, e.g., > 0.90.
When you have such a high correlation you are getting almost identical information multiple times. They aren’t asking two or three things, they are giving you the same information three times, like asking about your left and right shoe sizes. I’ve seen this problem often, especially with Quality of Life questionnaires with their subscales and total scores. With the QoL scales and total, if the subscale was half the total, the correlation (assuming the subscales were uncorrelated) between the subscale and the total must be 0.7. If the subscales are measuring quality of life, hence correlated, it should be higher.
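The part-whole arithmetic above can be sketched in a few lines. This is only an illustration: the function name `part_whole_r` and the general formula allowing a nonzero correlation between the part and the unique component are my additions, derived from Cov(P, P+Q) and Var(P+Q), and the value 0.5 for that correlation is purely hypothetical.

```python
import math

def part_whole_r(var_p, var_q, r_pq=0.0):
    """Correlation of a part P with the total T = P + Q.

    var_p, var_q: variances of the part and of the unique remainder.
    r_pq: correlation between P and Q (0.0 reproduces the simple formula).
    """
    sd_p, sd_q = math.sqrt(var_p), math.sqrt(var_q)
    cov_pt = var_p + r_pq * sd_p * sd_q              # Cov(P, P + Q)
    var_t = var_p + var_q + 2 * r_pq * sd_p * sd_q   # Var(P + Q)
    return cov_pt / (sd_p * math.sqrt(var_t))

# Weeks 2 to 7 (six weeks) against Weeks 1 to 7, variances proportional to time.
print(round(part_whole_r(6, 1), 2))

# A QoL subscale that makes up half of the total score.
print(round(part_whole_r(1, 1), 2))

# Hypothetical positive correlation between Week 1 and Weeks 2 to 7.
print(round(part_whole_r(6, 1, r_pq=0.5), 2))
```

With r_pq left at zero this reduces to the square root of the part's share of the total variance, so the first two calls reproduce the 0.93 and 0.7 quoted above; any positive r_pq only pushes the result higher.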
I should point out that the means will differ, the standard deviations will differ, but the information will be the same. What do I mean? Say you asked how many horses were in a herd of stallions. Then asked how many hooves they had. The second will be four times the first (if you counted correctly). The mean should be four times greater, so would the standard deviation, but the correlation between the two should be 1.00. Any inferential analysis (p-values) or effect size of one would give you identical answers to the other, except that the means and s.d. would differ by a constant factor of 4.
Suggestion: When you have redundant parameters, make your life easier and eliminate the redundancies. It shouldn’t make any difference in your conclusions. It will save trees, focus your report, and save analysis costs. Remember, even if very similar parameters can easily be analyzed in an almost identical manner, every parameter needs to be independently QCed (size in inches of output is proportional to cost). How do you pick? Run a pilot trial and select the best parameter (highest effect size or largest statistical test [e.g., chi square or t or F] value). Otherwise, use the literature or your intuitive guess. Otherwise, pick the largest numerical parameter (e.g., Weeks 1 to 8; Total QoL score).
[old joke punchline] “No, I dropped them in that dark alley, but I’d never find them there. That’s why we’re looking under the light post.”
***
I came across a recent rant by a financial consultant (http://www.littlebear.us/wpcontent/uploads/ITCILittleBearJuly2015FINALWORDPDF.pdf) in which they stated a certain stock was a bad idea. The central concept in their ‘post’ was that a small pharmaceutical company should have reported percentage change, because everyone else does. And since they didn’t report percentage change they were hiding something. I don’t know if percentage change is the standard for antipsychotic drugs or if the pharmaceutical company was hiding something. Frankly, I don’t care. As I stated in Blog 18, if percentage change was the ‘industry standard’, I would recommend including percentage change only as a tertiary parameter (i.e., present the median and no p-values or confidence intervals). If they and the industry like a certain scale (PANSS), excellent. If the raw metric is interpretable, see Blog 3 for assessing effect size. If the scale isn’t intuitively interpretable or their study’s mean or sd is idiosyncratic, see Blog 4 for assessing effect size.
However, this investment firm imputed a percentage change by computing the average baseline and dividing it into the average change from baseline. Simply incorrect math.
Let me review the pre-algebra you learned in grammar school. You probably remember the commutative, associative, and distributive laws.
Commutative law: a+b = b+a or a*b = b*a
Associative law: a+(b+c) = (a+b)+c or a*(b*c) = (a*b)*c
Distributive law: a*(b+c) = a*b + a*c
Let me focus on the distributive law. It works with multiplication, but it DOES NOT work with division. a/(b+c) ≠ a/b + a/c
24/(4+8) = 24/12 = 2, but
24/4 + 24/8 = 6 + 3 = 9
Why is this relevant? Percentage change divides each individual’s change from baseline by their baseline (like 24/4 and 24/8). It is quite different from dividing by the average baseline (like 24/(4+8)).
Let me illustrate the fallacy with a brief example. Say we had a ten point scale and two patients. One patient, who was almost asymptomatic (1) at baseline, got slightly worse (-1, he went from 1 to 2); a second patient, who was severely ill (9) at baseline, improved moderately (+3, he went from 9 to 6).
-1/1 = -1.00 (or a percentage worsening of 100%)
3/9 = 0.33 (or a percentage improvement of 33%)
If we averaged the baselines, we would get an average baseline of 5. If we averaged the changes from baseline, we would get an average change from baseline of 1. Average change from baseline/Average baseline = 1/5 = 0.2, a pseudo percentage improvement of 20%.
The average percentage change from baseline is (-1.00 + 0.33)/2 = -0.333, a worsening of a third, or a percentage improvement of MINUS 33.3%.
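The two-patient example can be checked directly. A minimal sketch, using only the numbers given above; the variable names are mine, and improvement is taken as baseline minus follow-up so that worsening is negative.

```python
from statistics import mean

# Two patients on a ten point scale: baseline and follow-up scores.
baselines = [1, 9]
followups = [2, 6]

# Improvement = baseline minus follow-up, so getting worse is negative.
changes = [b - f for b, f in zip(baselines, followups)]

# Percentage change proper: each patient's change over that patient's baseline.
pct_changes = [c / b for c, b in zip(changes, baselines)]
mean_pct_change = mean(pct_changes)

# The shortcut criticized above: average change over average baseline.
ratio_of_means = mean(changes) / mean(baselines)

print(f"per-patient percentage changes: {[round(p, 2) for p in pct_changes]}")
print(f"mean of percentage changes:     {mean_pct_change:+.3f}")
print(f"average change / avg baseline:  {ratio_of_means:+.3f}")
```

The two summaries do not even agree in sign here, which is the whole point: the ratio of averages is not the average of ratios.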
In sum, it is mathematically incorrect to compute percentage change by dividing an average change by an average baseline. I don’t care if you have no other way to compute average percentage change, it was wrong. Just ask your 5th grade son. <rolling his eyes> “Oh, Dad!”
Failure to reject the null hypothesis is not the same as accepting it. One can ONLY reject the null hypothesis.
To many, failure to reject the null hypothesis is equivalent to saying that the difference is zero. This is absurd. It is wrong. As I’ve said previously, inability to reject the null hypothesis directly implies that the scientists had utterly failed to run the correct study, especially with regard to doing an adequate power analysis. To say it directly:
Failure to reject the null hypothesis means the scientists were INCOMPETENT. Failure to reject the null hypothesis does NOT MEAN THE DIFFERENCE WAS ZERO, only that the difference might be zero, along with an infinite number of nonzero values, some of which might be clinically important.
To repeat my conclusions about testing the null hypothesis from my second blog, I summarized:
In my previous blog I said that the p-value, which tests the null hypothesis, is a near meaningless concept. This was based on:
In nature, the likelihood that the difference between two different treatments will be exactly any number (e.g., zero) is zero. [Actually mathematicians would say ‘approaches zero’ (1/∞), which in normal English translates to ‘is zero.’] When the theoretical difference is different from zero (even infinitesimally different) the H_{o} is not true. That is, theoretically the H_{o} cannot be true.
Scientists do everything in their power to make sure that the difference will never be zero. That is, they never believed in the H_{o}. Scientifically, the H_{o} should not be true.
With any true difference, a large enough sample size will reject the H_{o}. Practically, the H_{o} will not be true.
We can never believe (accept) the H_{o}, we can only reject it. Philosophically, the H_{o} is not allowed to be true.
The H_{o} is only one of many assumptions which affect p-values; others include independence of observations, similarity of distributions in subgroups (e.g., equal variances), distributional assumptions, etc. We have trouble knowing if it is the H_{o} which isn’t true.
Why do I keep on ranting? I received an email which referred to a Lancet article (http://www.thelancet.com/journals/laneur/article/PIIS14744422%2810%29701071/fulltext). The authors of the Lancet article stated “There were no differences in intellectual outcome, subsequent seizure type, or mutation type between the two groups (all p values >0·3).”
I replied to the email questioner with the following
A few comments:
I agree with the commenter of the Lancet article who said that this study was incapable of differentiating the difference from zero, due to the authors’ inappropriate study design, especially the collection of insufficient data in the key vaccination-proximate group. Unfortunately, their other comment that patients “near vaccination have more severe cogitative issues” may also be premature, until a better trial is completed.
Let me be clear: for many tests, testing a p-value is mathematically equivalent to determining if a confidence interval (CI) includes zero. Just take the equation of the t-test, replace the t-value with a critical t and rearrange the values. You get a CI. Equivalent ≡ Identity. If you use a 5% error, this is the same as looking at the 95% CI. If it includes zero, then zero is a possibility. In that Lancet article, so was a value of +1% or +20% or -1% or -47%. That is why the CI is so far superior to a p-value. A p-value only examines one value (zero), while the CI examines the infinity of other credible values. So the result of the study could have been zero. It also could be -47% or +20%. The above quote “There were no differences in intellectual outcome …” makes the invalid assumption that one is only testing against zero. In truth, another point in the CI (mathematically equivalent to a p-value, remember) was -47%. Unless one can say ‘a difference of 47% or less is clinically meaningless’, a claim which no sane clinician would make, then one MUST conclude that ‘there may be huge differences in intellectual outcome’.
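To make the test/CI equivalence concrete, here is a minimal sketch. It assumes a normal (z) approximation in place of the t distribution so that it runs on the Python standard library alone, and the function names and the difference and standard error values are hypothetical, not taken from the Lancet article.

```python
from statistics import NormalDist

def z_test_and_ci(diff, se, alpha=0.05):
    """Two-sided z-test of H0: difference = 0, plus the matching (1 - alpha) CI.

    diff and se are an observed difference and its standard error
    (hypothetical inputs; a t-based version would use a critical t instead).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)      # about 1.96 when alpha = 0.05
    z_stat = diff / se
    p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))
    ci = (diff - z_crit * se, diff + z_crit * se)    # same equation, rearranged
    return p_value, ci

def ci_includes_zero(ci):
    low, high = ci
    return low <= 0 <= high

if __name__ == "__main__":
    # Hypothetical small study: observed difference of -10 points, SE of 20 points.
    p, ci = z_test_and_ci(-0.10, 0.20)
    print(f"p = {p:.3f}, 95% CI = ({ci[0]:+.2f}, {ci[1]:+.2f})")
    print("CI includes zero:", ci_includes_zero(ci))
```

Whenever the p-value is above alpha the interval contains zero, and vice versa, but the interval also shows every other value the data cannot rule out.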
My overall comment? Do not publish these inadequate studies as science! If you want to ‘Prove the Null Hypothesis’, one actually needs to prove that the difference is less than a clinically important difference (e.g., π_{1} – π_{2} < 0.10). See my blog 5. Accepting the null hypothesis (http://allenfleishmanbiostatistics.com/Articles/2011/10/acceptingthenullhypothesis/). Unfortunately, this requires a rather large N. For example, if a 10% difference is deemed clinically important and if one doesn’t know the true control group or experimental treatment group rates, then one would need a total of 822 patients (411 per group) to demonstrate that the difference is not clinically important. You used 12 patients? I laugh at this Lancet study.
My only recommendation to the Lancet editors is to demand a CI be presented, perhaps instead of p-values. If they had observed the very, very large CI, which included potentially huge differences, they would have quashed such opinion pieces masquerading as science.
]]>53. If the beautiful princess that I capture says “I’ll never marry you! Never, do you hear me, NEVER!!!”, I will say “Oh well” and kill her.
61. If my advisors ask “Why are you risking everything on such a mad scheme?”, I will not proceed until I have a response that satisfies them.
Peter’s The Top 100 Things I’d Do If I Ever Became An Evil Overlord
The above title is actually a trick question. Should you publish a nonsignificant result? You wouldn’t even try, nor would it be published. No journal would publish it. What to do? Either redo the trial with more subjects, improve the methodology (decrease the noise and/or increase the effect), or both, or drop that line of research in favor of a more useful line of inquiry. Hence my quotes above “‘Oh well’ and kill her” and “I will not proceed until I have a response that satisfies them.” There are always more beautiful princesses out there and so little time in life!
There is a very, very good reason why you may have overestimated the effect size.
Why mention this? In the October 2015 issue of Significance there was an article entitled “Psychology papers fail to replicate more than half the time”. It refers to a Science article (bit.ly/1LgoZb2) where 350 researchers attempted to replicate 100 papers. “[W]hile 97% of original studies had P values less than 0.05 – the standard cutoff for statistical significance – only 36% of replicated studies did so. Meanwhile, the mean effect size of the replicated studies was half that of the original findings. (Significance, page 2, Oct 2015)” First of all, this is endemic to all fields of research, not just psychology. If the key hypothesis fails to demonstrate its effect, then the research is either not published or the authors didn’t even submit it for publication. Medical journals give a very biased estimate of treatment effect (effect size).
Let us assume the Frequentist (a school of statistics) vantage point. Imagine an infinite number of studies on the effect of Treatment x. The Frequentists believe that there is a true (population value) treatment difference – δ, and an infinite number of replications. [Note: The other school of statistics, Bayesian, believes there is one replication, but the true value (δ) has infinite possibilities.] If the researchers undersized the trial, or overestimated the effect size, then the results would tend to be n.s. and would not tend to be published. On the other hand, if they were lucky, it met the p<0.05 criterion.
Let me attempt to explain why this is the case. Imagine that the true mean difference was a set amount (e.g., effect size is 0.3), now let me assume a fixed sample size of 30 per group (60 total) – the fixed N/group just makes things easier to understand. Across replications, the observed effect sizes will follow a normal probability bell curve with the mean centered at 0.3 (the individual observations having a sd of 1.0). Half the observed replications will be lower than 0.3, some far less. Half will be higher than 0.3, some far higher. With a total of 60 subjects, alpha of 0.05 (two-sided), and a true effect size of 0.3, the study would be statistically significant 20% of the time. If you were one of those 20% of lucky scientists and you saw a p < 0.05, then you wouldn’t have seen an effect size of 0.3, but an effect size of 0.515 or greater. Hence, any published study with an N/group of 30 would have an observed effect size of 0.515 or higher. The true effect size would be unknown unless one knew the number of unpublished/rejected studies.
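The arithmetic behind this example can be sketched as follows. This is an illustration under stated assumptions: it uses a normal approximation rather than the t distribution (so the cutoff should come out near 0.51 rather than exactly 0.515, and the power near 21% rather than exactly 20%), and the function name and dictionary keys are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def published_effect_summary(true_d=0.3, n_per_group=30, alpha=0.05):
    """Normal-approximation view of which observed effect sizes reach p < alpha."""
    z = NormalDist()
    se = sqrt(2.0 / n_per_group)                 # SE of a standardized mean difference
    critical = z.inv_cdf(1 - alpha / 2) * se     # smallest |observed effect| that is significant
    a = (critical - true_d) / se
    power_upper = 1 - z.cdf(a)                   # significant in the expected direction
    power_lower = z.cdf((-critical - true_d) / se)  # significant in the wrong direction (tiny)
    # Mean of the observed effect given it cleared the upper cutoff (truncated normal mean).
    mean_if_published = true_d + se * z.pdf(a) / power_upper
    return {
        "critical_effect": critical,
        "power": power_upper + power_lower,
        "mean_effect_if_significant": mean_if_published,
    }

if __name__ == "__main__":
    for name, value in published_effect_summary().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the average effect size among the significant (publishable) replications is roughly double the true 0.3, which is the bias described above.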
Let me say this again. When n.s. scientific papers are not submitted by the scientists or are rejected by the journal editors, the observed effect size will be drastically overestimated; that is, the true effect size is much, much lower than that seen in the literature. THE PUBLISHED EFFECT SIZE IS A VERY BIASED ESTIMATE FOR THE TRUE EFFECT SIZE.
One can estimate the true effect size, but this involves knowing the percent of unpublished papers. Probably, the best approach would be to get your hands on every single paper written by your competitors. Yeah, right!
Conclusion: When trying to determine the sample size for a new trial (power analysis), do not use published papers unless you know that all papers are published/accepted for publication. It is still an excellent idea to use a pilot (e.g., Phase II) study as long as you don’t cherry-pick the best result.
]]>If you hear hoof beats, the first thing you should NOT look for is unicorns.
I’d like to thank Rob Musterer, President of ER Squared, for posting a reference to a 2009 paper by Ling Zhang and Han Kun, ‘How to Analyze Change from Baseline: Absolute or Percentage Change’. The Zhang and Kun paper was in opposition to a paper by Vickers, referenced in my Blog 18. Percentage Change from Baseline – Great or Poor? The Zhang and Kun paper was written in ‘D-level Essays in Statistics’ at Dalarna University in Sweden. It was cosigned by 3 faculty members. [Post-publication note: I received a kind reply from one of the faculty at Dalarna University in Sweden. He explained, “A D-level essay is an independent student work to be done to obtain a degree named “magister”. It is between Bachelor and Master, yet on an undergraduate level.”] The paper was very well written, with both theoretical and empirical results. It used a simulation which appears to have been well executed. Unfortunately I disagree with a number of the assumptions underlying their theory and simulation (i.e., with the generalizability of their conclusions).
The authors make a number of points.
1. The non-normality and heteroscedasticity seen with analyzing percentage change has little influence. I totally agree (see my Blog 7. Assumptions of Statistical Tests).
2. They state a ‘rule of thumb’ that the correlation between the baseline and change from baseline, should be 0.75. They gave a handful of examples, including a set of 5 patients who were given captopril, with measurement made immediately before and after the administration of captopril. It should be noted that the actual data set included 15 patients. Nevertheless, using only the 5 patients presented, I observed a pre-post score correlation of 0.91. The correlation between baseline and change from baseline was 0.19, and with percentage change was 0.08. In most (all?) parts of their essay, they did simulations with r=0.75. It is my observation that a 0.75 pre-post score correlation is unusually (pathologically?) high. In the Captopril trial it appears that the measurements were done within hour(s??) of the initial measurement. Personally, I’m more used to the final score measured six months to two years after the baseline. Unless the baseline were a very stable medical characteristic, I would expect a more reasonable correlation, like 0.3 to 0.4. The authors point out that the larger the correlation, the smaller the standard deviation (and variance) of the change score (e.g., when the r=1.00, the s.d. of change will be zero). The 0.91 correlation explained 83% of the post-score variability. This is extraordinarily high!
As a side note, in dealing with repeated measurements (e.g., pre and a set of post-scores), a model I’ve used frequently is the AR(1), in which a coefficient (like a correlation) is raised to a power related to how many steps the two measurements are separated. For example, if the Week 1 v 2 (and 2 v 3, and 3 v 4, etc.) measurements have a similar (auto)correlation of 0.5 then the AR(1) model would estimate that the correlation between measurements taken 2 periods apart (e.g., 1 v 3 or 2 v 4) would be 0.5*0.5 or 0.25. The correlation between measurements 3 weeks apart (e.g., 1 v 4) would be 0.5*0.5*0.5 or 0.125. Therefore, this frequently used model would postulate that measurements taken immediately after one another would be maximally correlated, but measurements taken months or years apart would have very low pre and post-score correlations.
In sum, the assumed correlation of 0.75 between the pre and the post-scores might be pathological in long (or medium) term clinical trials. However, this is an empirical question. If a statistician liked this model, they could suggest that “percent change be the primary metric in the statistical models, if the pre and post-score correlation was 0.75 or greater. If the correlation was lower, then absolute change would be used.”
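The AR(1) pattern described in the side note can be written in a few lines. This is only a sketch of the correlation structure (the function names are hypothetical), not a fitted model.

```python
def ar1_correlation(rho, lag):
    """Correlation implied by an AR(1) structure between visits `lag` steps apart."""
    return rho ** lag

def ar1_matrix(rho, n_visits):
    """Full n_visits x n_visits AR(1) correlation matrix as a list of rows."""
    return [[ar1_correlation(rho, abs(i - j)) for j in range(n_visits)]
            for i in range(n_visits)]

if __name__ == "__main__":
    # Adjacent-week correlation of 0.5, as in the example above.
    for lag in (1, 2, 3):
        print(f"lag {lag}: r = {ar1_correlation(0.5, lag)}")
    for row in ar1_matrix(0.5, 4):
        print(row)
```

With an adjacent-visit correlation of 0.5, the implied correlation halves with every extra step of separation, so widely spaced pre and post measurements end up only weakly correlated.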
3. They deduced “From equation (5), we know that, in order to simulate a dataset such that R <1 [percent change has greater power than absolute change – AIF], we should let the percentage change have a large mean and small standard deviation (page 6).” In a simulated example of a ‘Case that percentage change has higher statistical power’, they started with a normally distributed baseline with mean of 200 and a s.d. of 20. Based on this, there will not be any baselines near zero. In fact, the lower end of the 95% CI would be 160, which is much, much higher than zero. They set the pre, post correlation to 0.75 [Note: it might have been the pre and percentage change correlation]. They also forced the percentage change to have a 50% ‘improvement’ with a s.d. of 1%. They simulated using the percentage change and backcalculated absolute change. That is, the data are intrinsically lognormal. They concluded, “We see that, the value of the test statistic R increases as the standard deviation of P [percentage change – AIF] increases. Although R [R is the ratio of change relative to percent change ttest statistics; hence values < 1 indicate percentage change is superior – AIF] increases, it is still less than 1. In this case, we prefer percentage change to absolute change.” In their simulations, they varied the s.d. for the percentage change from 1% to 20%. In the ‘worst case’ situation, the percentage change had a mean of 50% and a 95% CI of 10% to 90%. Even in that case, it appeared that the results always favored analyzing percentage change.
They were trying to find cases where percentage change would be best. They found it. One case included where percentage change was 50% with a CI of about 48% to 52%. It was this pathologically small s.d. of their percentage change which enabled them to find this example. Again, such cases may be seen, and it might be seen with your data set. But I feel it is like hearing hoof beats and declaring that unicorns exist. I have never seen any data like this with EVERY patient having almost exactly 50% (±2%) improvement – EVER!
They then changed their focus from percentage change to absolute change. They chose a change of 100 (a 50% change from the baseline of 200) with a s.d. of 5 to 40 and a correlation of 0.75. In all their examples they observed a superiority of absolute change relative to percentage change! Let me repeat that, in their example absolute change was superior to percentage change. I would guess, from their Figure 5, that when the s.d. was 5 or 10, 95% of the cases had a superior power for absolute change relative to percentage change (i.e., in only 5% was percentage change superior). The larger the s.d., the greater the superiority of absolute change. For example for s.d. of 40, 100% of the samples had greater power for absolute change relative to percentage change. They do point out that the amount of relative improvement was small. That is, the distribution of their test statistics was within 3.5% of one another.
4. They did some simulation work demonstrating that the s.d. of the change score is affected by the size of the correlation. The smaller the correlation (e.g., 0.3 relative to 0.75), the larger the s.d. for the change score. They also selected ten datasets and observed a median correlation of 0.71.
My conclusion.
Yes, it is possible to find cases where percentage change is a more powerful parameter (relative to absolute change or post-score). Based on their findings, one would need a) a baseline score which has a very high mean and very small standard deviation, b) a very high correlation of the pre and post-scores (at least 0.75), and c) almost no variability in the percentage change parameter (e.g., 50% ± 2%). When they based their simulation on an r=0.75 and a 50% change from baseline but a variable post-score variability, there was a small difference between analyzing percentage change and absolute change. Nevertheless, absolute change was consistently superior (in > 95% in their best case and 100% in the others).
The chance that a randomized clinical trial would gain by analyzing percentage change is very, very remote. You must have 1) a huge pre-post score correlation (e.g., post measured before the pre-score can change) – never seen in long term trials; 2) very high baselines with little chance of low scores – e.g., never use a 5 point rating scale; 3) parameters which are log normal, so that a natural parameterization is ratios; and 4) it would help if the s.d. of the post score is quite low. This is like seeing real unicorns loose in Central Park. The simulations used might be applicable someplace, but not to pharmaceutical or device clinical trials.
]]>Just b’cause it ain’t science, don’t mean it ‘taint so.
Phrenology, Four humors, You were cursed, Leeches, …
************************************
This blog is written for the general public, not for the pharma/device expert who is my typical target for my blog.
My wife’s cousin asked me to comment on an article he had come across. The article was Cetyl Myristoleate: A Unique Natural Compound, Valuable in Arthritis Conditions by Drs. Charles Cochran and Raymond Dent, see http://www.tldp.com/issue/168/168cetyl.html. This article resembles many others I’ve seen, so any lessons learned here are applicable to other articles.
My summary: 1) The natural chemical, cetyl myristoleate, may work and <cure> arthritis. [Note: I will use the ‘<‘ and ‘>’ to indicate air quotes. That is, open scorn and disbelief.] 2) The logic and science of this article is completely and totally lacking. There is NO EVIDENCE that cetyl myristoleate provides any greater efficacy than a sugar pill (placebo). 3) My best guess (90% historical accuracy) is that it is very likely a vapid hope and a false promise. But in truth, I don’t know.
The article is published in something called the ‘Townsend Letter for Doctors and Patients’, the Examiner of Alternative Medicines. What first caught my eye was a line between the title and the authors “A Sponsored Article”. Unless I’m wrong, the two doctors paid to publish this piece. While this doesn’t invalidate what they say, this is not a piece from the Journal of the American Medical Association (JAMA) or the New England Journal of Medicine.
They introduce arthritis as being a complex disease with 100 causes and many variants. Nevertheless, they indicate that cetyl myristoleate (I’ll abbreviate it as CM), “shows great promise of making a great contribution in noninfective types of arthritis.” They then spend about 3/4 of a page talking about the discoverer of CM, Mr. Harry Diehl, a chemist at the National Institutes of Health (NIH), in the division dealing with Arthritis. He couldn’t find anyone at the NIH interested in CM, nor could he find pharma support.
As a side note, in 1980, when I first entered pharma, one of the first projects our entire department worked on was an antiarthritis compound called Fenbufen. It failed to gain acceptance by the FDA. Speaking as a pharma professional, it takes a very large amount of money and effort to PROVE a drug works or, in this case, to demonstrate that it didn’t.
Harry Diehl’s proof? Mice don’t get arthritis. They also have natural CM. He ran a study in rats, which <demonstrated> that the rats given CM don’t get an artificial version of arthritis. Such animal testing is typically a necessary first part of any medical program. But you can NEVER stop there for the FDA, nor should you. Next is extensive toxicology – how poisonous and cancer/ulcer inducing the drug is. Actually, the industry is <requested> by the FDA to do many different animal toxicology tests (short, medium, long, and lifelong), a variety of reproductive tests (e.g., rat pups and the children of these rat pups), and studies of how readily the drug is absorbed, what it breaks down (is metabolized) into, and what organ (e.g., skin, liver) the drug is absorbed into.
This all precedes human testing. And this is where the big bucks come in. Why? You need to run studies, with known controls and very rigorous methodology with hundreds, thousands, or tens of thousands of patients. Each patient will have many, many hour-long visits at the doctor’s office. Why? Let me list some of the tests we did on the Fenbufen patients: grip strength with the blood pressure cuff, photographs and doctors’ assessments of the number of swollen joints, number of painful joints, range of motion sets of tests (e.g., how far can you bend your elbow, turn your head), and some subjective assessments (on a scale of 1 to 10 rate how painful …). To use a medical term, there was a shit bucket full of testing. This was done typically two times before and once every month or two for a year or two on treatment. Yes, tens of millions of dollars to run a trial or medical program. In those days, there was a slow learning curve on how to run a <correct> trial. For Fenbufen there were a dozen or two false starts before the <real> studies could begin. The FDA typically requires two <real> studies. In medicine these are called Phase IIIb trials.
The CM article’s proof? The authors listed 6 testimonials. Testimonials are the weakest form of any proof. So weak they are totally disallowed by any credible scientist or the FDA. Prior to the FDA, we had snake oil salesmen selling their elixirs based on ‘sworn testimonials’. The gold standard to prove that a pill or injection works is to run a double-blind, randomized, placebo-controlled trial.
A double-blind trial is where neither the pharmaceutical company, the doctors and their staff, nor the patients know which treatment they are receiving. One does this because it is trivial to get a HUGE bias where patients and the (paid) doctors will say they are getting better when they receive any <bona fide> treatment. There are many forms of this bias (e.g., many people get better over time; many people <fib> to help the friendly doctor; many people lie to themselves; doctors give clues that the treatment ‘should have helped’). Let me just say it is HUGE. Let me give a small example. I went to a doctor yesterday who asked how well a device has helped me. I asked what she meant, and she suggested “Like a 100% improvement or 75% improvement.” Most people would have given a largish number, like 50% or 75%. I’m a bit unusual; I said 5 to 10%. She later showed me some objective measurements which indicated I was still severe, but without any baseline we had no way to see improvement.
How huge the knowledge-of-the-treatment bias is depends on the measurement. The more it is controlled/interpreted by patient or doctor, the larger the bias is. An example of such a subjectively controlled assessment would be the first CM case study, Leona, who was able to play piano and has an increased range of motion. A more ‘objective’ measurement showed that her ‘nodular deformities have not changed noticeably’.
The randomized trial is easy. You can’t have the doctors select (consciously or unconsciously) which patient should receive the treatment. You don’t want a subgroup (e.g., Asian or female or sicker patients) to predominate in the active group. By randomizing, you completely avoid this selection bias.
Placebo controlled is typically the gold standard. Placebo means no active treatment. One can ALWAYS give both the active and control patients a standard treatment. One then will determine if the drug adds anything to what is the typical standard of care. After the trial is over, one examines if the active treatment has ANY credible (i.e., nonzero or statistically significant) improvement. Please see my blogs 1-4 on my opinion of such analyses. The reason one doesn’t compare your active treatment with your competitor’s active treatment is that this is shooting yourself in the foot – big time. If the other treatment has ANY benefit, then you need a much, much bigger study to prove your treatment works. Let us assume that placebo had a 5 point improvement over time, the other company’s treatment had a 10 point improvement, and your drug had a 15 point improvement. Your advantage over the active treatment (5 points) is then half your advantage over placebo (10 points), so a study against only the active treatment would need FOUR times more patients (and doctors and time) than a placebo-controlled study.
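The FOUR-times figure comes from the fact that the required sample size grows with one over the square of the difference you are trying to detect. A minimal sketch using the usual normal-approximation formula for comparing two means; the s.d. of 20 points, the 80% power, and the function name are assumptions for illustration only.

```python
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate patients per arm to detect a mean difference `delta` (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2

if __name__ == "__main__":
    sd = 20.0                                   # hypothetical s.d. of the improvement score
    vs_placebo = n_per_group(15 - 5, sd)        # your drug vs placebo: 10 point difference
    vs_active = n_per_group(15 - 10, sd)        # your drug vs competitor: 5 point difference
    print(f"vs placebo: about {vs_placebo:.0f} per arm")
    print(f"vs active:  about {vs_active:.0f} per arm")
    print(f"ratio: {vs_active / vs_placebo:.1f}")
```

Whatever s.d. and power are assumed, halving the detectable difference quadruples the number of patients.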
In sum, the article shows NO credible proof that the CM treatment works in humans. NONE. They have theoretical proof it might work, but one always has some type of biological or physiological reason to run expensive research programs. According to one article, 90% of drugs which enter human testing (i.e., proved safety and efficacy in animal testing) fail (e.g., prove toxic, prove to worsen, or don’t demonstrate any real improvement) and do not demonstrate enough proof for the FDA. And that is a very good thing.
]]>The eyes are the windows to the soul.
“If you cannot measure it, it does not exist.” Young psychometrician
Hmm, lost an eye? My professional opinion is to cover your good eye with gauze so you can only see light or dark. Nonparametric user
If you believe in Science (note the capital ‘S’), you require proof. Without any valid theoretical underpinning, I must admit scientific proof requires additional proof. Therefore, I guiltily admit, I would require more stringent proof to believe in any medicine whose basis is chakras, meridians or other alt approaches, which lack any physiological, chemical, microscopic or neurological evidence, than in medicines that use proteins which are known to affect the correct biological system and have some animal research which validated it. I also believe in the existence of bones. Don’t get me wrong, I still would require two adequate and well controlled studies for any drug, but for a chakra based treatment, I likely would require more.
For me, the converse is also true. I am above all else an empiricist. If something has weak theory, but has a proven value, I would use it. I still might have a healthy skepticism, since I am also very familiar with the placebo effect. More on this later.
Now there is a difference between proving a theory and finding that the theory has useful implications. Here is where I differ from the Agency. I require clinical, not just statistical, significance. I have known ESP researchers who have demonstrated (to my satisfaction) that ESP works (consistent p < 0.0001). On the other hand, from what I’ve seen, all demonstrations, even among ‘gifted’ telepaths, show its effect size is trivial. If it is demonstrated that an effect size is greater than zero, but less than 1%, I might smile and say thank you, but no thank you. For example, if a measurement can measure itself with 10% or less true prediction (i.e., 90% error variance or noise), I regard it as potentially useful, but inadmissible for individual prediction.
One gift from psychometrics is the concept of reliability. Let me define terms: reliability is basically how repeatable a measurement is, usually measured by a correlation coefficient. A zero indicates that the measurement is useless – totally useless, and a 1.00 indicates that it consistently gets the same measurement. I’ll return to this in more detail below. Now I should also point out that a reliable measurement may not be useful; that is the validity of a scale. For example, one can very consistently measure a man’s shoe size, but its utility as a measure of ‘manliness’ might be contested. However, it should be obvious that a reliable measurement is a necessary, but not sufficient, condition for a measurement to be valid (i.e., useful). Let me be more mathematical. If I had a perfect, gold-standard measurement of something (call it ‘T’ for TRUE) and a secondary measurement of it (call it ‘x’) and correlate them together, the maximum correlation r{T,x} is limited by r{x,x’} (in classical test theory it cannot exceed the square root of r{x,x’}), where r{x,x’} is something called the reliability of ‘x’.
Let me spill the beans on radiology’s dirty little secret. When you get an X-ray taken and a radiologist reads it, their readings suck! I worked at a diagnostics firm when I was told this dirty little secret. You might think I exaggerate, so let me give you the data. If you give two radiologists the same image, they should give the same reading. If you correlate the different radiologists’ readings they should correlate 1.0. Would you believe 0.3-0.4? If you read an early blog (4. Meaningful ways to determine the adequacy of a treatment effect when you lack an intuitive knowledge of the dependent variable), you would note that the amount of prediction from one measurement to the other (r) is r². So 0.3 has a prediction of 9% of the variance and 0.4 has a prediction of 16% of the variance. Or 0.3 is 91% NOISE and 0.4 is 84% NOISE. Statisticians who have worked with radiological readings have learned this dirty little secret the hard way. In fact, this is an excellent way to weed out new, inexperienced statisticians from their more seasoned brethren. Would I go to a hospital and get an X-ray if I had a totally broken bone? Yes. Is it worthwhile to go to a hospital to have them read a chip or a subtle crack? NO. If I tell them what to look for, they might find it (or claim to find it – same thing). I was recently asked by a colleague to review a research methodology whose primary endpoint was reading CT scans. I told him my concern. Given the unreliability and the subjective flaws I’ve experienced, I told him to ensure adequate training of these experienced radiologists, use a single central lab, and have the images blinded by study, region, site, patient, and date (especially pre vs post). Then test the radiologists for their inter-rater reliability. While there is excellent theoretical reason to believe that the key parameter should be bone appearance (it was actually muscle ossification), one might want a more reliable measurement, or ways to make it more reliable.
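The r² arithmetic used in the radiology example is simple enough to check directly. A minimal sketch (the function name is hypothetical):

```python
def explained_and_noise(r):
    """Split a correlation r into variance explained (r squared) and leftover noise."""
    explained = r ** 2
    return explained, 1 - explained

if __name__ == "__main__":
    for r in (0.3, 0.4, 0.91):
        explained, noise = explained_and_noise(r)
        print(f"r = {r:.2f}: explained = {explained:.0%}, noise = {noise:.0%}")
```

The same calculation is behind the 83% figure quoted earlier for the 0.91 pre-post correlation in the captopril data.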
Faced with this unreliable measurement, the clinical team suggested categorizing the data into a three point rating. I told them that was the wrong way to go (see 7a Assumptions of Statistical Tests: Ordinal Data). When you have poor data, the only thing to do is to acknowledge it and try to improve it, not throw away more information! The best analogy is telling a person who lost an eye that the best treatment is to cover the good eye with thick gauze so they could only see light or dark. Categorization with poorly measured data is completely counterproductive!
A week ago, my wife and I went to a health food store. She went for a probiotic which was claimed to contain 10,000,000,000 active cultures. As I said above, I’m an empiricist. There is some evidence about the microbiome in the scientific literature, and more importantly it makes my wife happy. So we bought some huge pills. On the cash register I noticed a flier for an iridology reading. I was curious so I Googled it.
Iridology is an alternative <science> (‘<‘ and ‘>’ indicate air quotes) which allegedly examines the iris of the eyes to <determine> sickness or potential weakness in a human’s biological systems. For example, the innermost part of the iris is said to reflect the health of the stomach. It was based on a <scientist> who made a single observation about an owl’s eye and then created a system to evaluate all bodily systems. If you Google iridology, one of the first entries is the confession of a former iridologist (Confession of a Former Iridologist) who observed that his <readings> were totally unreliable. The placement of the camera, the lights in the room, and even his own ratings varied with each measurement. More to the point, the <science> totally lacks any biological basis, e.g., any color seen in the iris is not from chemicals or metals seeping into the iris (by some magical means) but from melanin.
More to the point, scientific studies on the accuracy of Iridology have found it totally lacking. For example in a website Iridology is Nonsense, they observed a 1979 study where “one iridologist, for example, decided that 88% of the normal patients had kidney disease, while another judged that 74% of patients sick enough to need artificial kidney treatment were normal”.
This site concludes “If you encounter anyone practicing iridology, please complain to your state attorney general.” Perhaps the same warning should be made for clinical radiologists.
]]>The patients had abnormal parathyroid glands, with hypercalcemia. This was a Phase IV study, meaning that the drug (Calcenese) was approved by the agency and the results were exclusively oriented for marketing the compound. There was a four week screening/run-in period, followed by a randomization. Patients were required to have an abnormal PTH level and serum calcium > 11.5 mg/dL at baseline (following the run-in and before the randomization). Treatment commenced the following day. It was an open-label, randomized study with 2 dose levels: od (once a day) dose of 2 tablets of Calcenese and bid (twice a day) dose of 1 tablet of Calcenese. That is, both doses were the same number of tablets per day. At Week 8, based on the serum calcium level, the patients might receive double the above number of tablets. In other words, the doctors were allowed to titrate the treatment. The key analysis was the change from baseline calcium levels at 30 weeks within each randomized treatment regimen, although an earlier interim analysis (Week 15) was also planned. No treatment comparison of the od vs bid regimen was planned.
The question which was asked by the client was if they needed to ‘pay’ for an interim analysis alpha level (see Blog 17. Statistical Freebies/Cheapies – Multiple Comparisons and Adaptive Trials without selfimmolation). The interim analysis was the data at Week 15 (i.e., prior to the availability of the Week 30 data).
OpenLabel: When I first heard about the trial being openlabel (i.e., investigators and patients aware of the treatment, I initially thought it was hopelessly flawed. However, on secondary reflection, since serum calcium is a totally objective laboratory test, I was somewhat mollified. Yes, one might say it is unlikely that the investigators, (staff), and patients could directly influence the measurements. Nevertheless, there might be more subtle biases due to the openlabel nature of the trial. Some of these potential biases might include: differential patient selection, patients opting out prior to their first treatment, differential dropout rates, etc. Most of these might be discounted as the different treatment arms had identical dosages and all patients were treated.
Regression to the mean: A more subtle, and more likely, source of bias in the analysis of the change score is natural day-to-day variability in the patients together with the inaccuracy of the laboratory test. A patient’s observed score can be written as X = μ + e, where μ is their true baseline serum calcium level and e is the sum of the patient’s natural variation and the laboratory error in assessment. If the error (e) were unusually high at baseline, then a second baseline assessment would likely be lower. Similarly, if e were unusually low, then a second baseline assessment would likely be higher. In both cases, one would expect the replicate score (X) to be closer to the true value (μ). In statistics, this is referred to as regression to the mean. As the baseline serum calcium is required to be > 11.5 mg/dL, those included in the study were selected partly because their e happened to be high, so their true level (μ) is expected to be lower than their observed baseline. In other words, the change from baseline is expected to be biased in the favorable direction (an apparent drop in serum calcium), even if the treatment did nothing.
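To make the selection effect concrete, here is a minimal simulation sketch of X = μ + e with the > 11.5 mg/dL entry criterion. The distribution of true levels (mean 11.0, SD 0.5) and the error SD (0.4) are made-up numbers for illustration only, not values from the trial, and no treatment effect is simulated at all.

```python
import random
import statistics

def simulate_regression_to_mean(n_patients=100_000, cutoff=11.5, seed=0):
    """Screen untreated patients on a noisy baseline and re-measure them later."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n_patients):
        mu = rng.gauss(11.0, 0.5)             # true calcium level (assumed distribution)
        baseline = mu + rng.gauss(0.0, 0.4)   # X = mu + e at baseline (assumed error SD)
        if baseline <= cutoff:
            continue                          # fails the > 11.5 mg/dL entry criterion
        followup = mu + rng.gauss(0.0, 0.4)   # later visit, same mu, NO treatment effect
        changes.append(followup - baseline)
    return len(changes), statistics.mean(changes)

n_enrolled, mean_change = simulate_regression_to_mean()
print(f"enrolled: {n_enrolled}, mean change from baseline: {mean_change:.2f} mg/dL")
```

Under these assumptions the enrolled patients show an average drop in serum calcium although nothing was done to them, which is exactly the bias a within-group change from baseline cannot separate from a drug effect.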
Change from Baseline Bias (quasi-experimental design): A larger bias is due to a patient being actively treated. People are changed by that! They were aware that their serum calcium was high, then they were told that it was > 11.5 mg/dL, severe enough to be admitted into the trial. How would many people react? Perhaps diet to lose weight, perhaps a diet totally avoiding calcium rich foods, perhaps exercise, … The gold standard for clinical trials is the placebo controlled study, with the key comparison being the placebo v active difference. Why? Let me tell you of my first professionally analyzed study. It was a change from baseline analysis. The placebo had a 7 point (statistically significant from 0) change. Fortunately the active had a 14 point change, which was different from 7. Moral: A significant change from baseline is seldom a valid result. Hence the key analysis for this trial is frankly not credible without comparison to a credible reference group (there was none for this trial).
Interim Analysis at Week 15: Let me return to the question of the Week 15 interim analysis. Does one need to ‘pay’ for doing both the interim and final analysis? The simple answer is no. The Week 15 and Week 30 serum calcium are two different parameters. If the key parameter is Week 30, and Week 15 is secondary, then one need not ‘pay’ for doing the interim analysis.
Multiple Comparisons: However, the actual analysis was to analyze patients who were randomized to the od and bid regimens. There were two regimens. Hence there were two significance tests, not one. There were two ways in which a statistically significant change from baseline could be seen: once for the od regimen and once for the bid regimen. Therefore, using a Bonferroni adjustment, a 0.025 alpha level would be used for each test. In this case, the client was doing two sided confidence intervals, so each should have been a 97.5% CI, i.e., 0.0125 in each of the four tails of the two CIs.
Could they do two 95% CIs instead? Well, this is a Phase IV trial. They can try, and if the referee doesn’t comment about it … If the referee does comment, then they can pool the data of both regimens into a pooled Calcenese treatment for a single 95% CI. A Phase III trial for the FDA will have rigorous statistical review, but a journal reviewer is seldom a statistician and would prefer the more familiar 95% CI.
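The Bonferroni arithmetic is simple enough to sketch in a few lines. The function name and its arguments are my own illustration; the only inputs taken from the trial are the overall 0.05 level and the two regimens.

```python
def bonferroni(overall_alpha=0.05, n_tests=2):
    """Split an overall alpha equally across n_tests two-sided tests."""
    per_test_alpha = overall_alpha / n_tests   # alpha for each regimen's test
    ci_level = 1 - per_test_alpha              # matching two-sided CI coverage
    per_tail = per_test_alpha / 2              # alpha in each tail of each CI
    return per_test_alpha, ci_level, per_tail

per_test_alpha, ci_level, per_tail = bonferroni()
print(f"alpha per test: {per_test_alpha:.4f}")
print(f"CI level per regimen: {ci_level:.1%}")
print(f"alpha per tail: {per_tail:.4f}")
```

With the defaults this gives 0.025 per test, a 97.5% CI per regimen, and 0.0125 per tail, matching the figures quoted above.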
Drug Titration Study: Let me tell you a story of a dose titration study I did 35 years ago. Most investigators in this study didn’t do any titration. Only one investigator actively titrated. Fortunately he also enrolled the largest number of patients. I analyzed his patients alone. At each study week he either increased the dosage if the patient wasn’t doing well or decreased the dosage if the patient improved. For that investigator I observed that patients who had a high dosage had little improvement and those who had a low dosage had the most improvement. Think about it: this should have been the expected result! The dosage was chosen because of how the patient was doing, so the naive conclusion would be that treatment was harmful and little/no treatment was beneficial. From that analysis onward I did not allow my clients to do an active drug titration study ever again.
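The reversal is easy to reproduce in a toy simulation. The sketch below is not the original study: the titration rule, the hidden patient "prognosis", and every number in it are assumptions chosen for illustration. The drug is built to be genuinely beneficial (more tablets, more improvement), yet grouping by the final titrated dose makes the high dose look worst.

```python
import random
import statistics

TRUE_BENEFIT_PER_TABLET = 0.5   # assumed: the drug truly helps, and more so at higher doses

def simulate_titration(n_patients=2000, n_weeks=8, target=1.0, seed=1):
    """Titrate each patient weekly, then summarize improvement by final dose."""
    rng = random.Random(seed)
    improvement_by_dose = {}
    for _ in range(n_patients):
        prognosis = rng.gauss(0.0, 3.0)   # hidden patient-specific course, unrelated to the drug
        dose = 2                          # everyone starts on 2 tablets
        for week in range(n_weeks):
            improvement = prognosis + TRUE_BENEFIT_PER_TABLET * dose + rng.gauss(0.0, 1.0)
            if week == n_weeks - 1:
                # Final visit: record the outcome under the dose the patient ended up on.
                improvement_by_dose.setdefault(dose, []).append(improvement)
            elif improvement < target:
                dose = min(dose + 1, 4)   # doing poorly: titrate up
            else:
                dose = max(dose - 1, 1)   # doing well: titrate down
    return {d: statistics.mean(v) for d, v in sorted(improvement_by_dose.items())}

mean_improvement = simulate_titration()
for dose, value in mean_improvement.items():
    print(f"final dose {dose} tablets: mean improvement {value:.2f}")
```

In this sketch the patients who end on the lowest dose show the most improvement and those on the highest dose the least, purely because the dose was assigned in response to how each patient was already doing.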
The secondary analysis for the interim was for the patients within each dosage regimen. Remember, patients were to receive 2 tablets daily up to Week 8, then they could be titrated to 4 tablets daily in either the od or the bid regimen. Therefore, there could be 4 treatment groups (2 tablets od, 1 tablet bid, 4 tablets od, and 2 tablets bid). As those patients who were allowed to double their dosage are likely to have had higher serum calcium levels at Week 8, their change from baseline must differ systematically from that of those who didn’t double their dosage, regardless of any drug effect. Most statisticians very strongly avoid ALL grouping based on POST-BASELINE (e.g., Week 8) data.