‘Take two, they’re small’
Are the results from small, but statistically significant, studies credible?
One of the American Statistical Association’s sub-sections is for Statistical Consultants. A short time ago, there were over fifty comments on the topic of ‘Does small sample size really affect interpretation of p-values?’ The motivation came from a statistician who went to a conference where “During the discussion period a well-known statistician suggested that p-values are not as trustworthy when the sample size is small and presumably the study is underpowered. Do you agree? The way I see it a p-value of 0.01 for a sample size of 20 is the same as a p-value of 0.01 for a sample of size 500. … I would like to hear other points of view.”
More often than not, a small sample size precludes achieving significance (see my blogs on ‘1. Statistics Dirty Little Secret’ and ‘8. What is a Power Analysis?’). When N is small, only very large effects can be statistically significant. In this case, it was assumed that the p-value achieved ‘statistical significance’, p < 0.01. There was considerable discussion.
Many statisticians felt that the small sample size (e.g., 20) would not be large enough to test various statistical assumptions. For example, testing for normality typically takes hundreds of observations. A sample size of 20 lacks the power to detect non-normality, even when the distribution is quite skewed. So, even though the p-value was ‘significant’, the tests of assumptions are not possible, hence the p-value is less credible. Or worse, if the data were actually non-normal and the sample size is small, the t-test is not appropriate, and hence neither is the p-value.
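The power point is easy to illustrate with a quick simulation. This is a sketch of my own, not from the discussion; it assumes Python with numpy/scipy, uses the Shapiro-Wilk test as one common normality test, and uses a moderately skewed lognormal distribution as the ‘non-normal’ truth. The question is simply how often the test flags the skewness at N = 20 versus N = 200.

```python
import numpy as np
from scipy import stats

def normality_power(n, n_sims=500, alpha=0.05, seed=1):
    """Fraction of simulated skewed samples of size n in which the
    Shapiro-Wilk test detects non-normality at significance level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # moderately right-skewed data (lognormal, sigma = 0.5)
        x = rng.lognormal(mean=0.0, sigma=0.5, size=n)
        if stats.shapiro(x).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

print(normality_power(20))   # power to detect the skew with N = 20
print(normality_power(200))  # power with N = 200 (much higher)
```

The same skewness that a few hundred observations detect nearly every time slips past the test much of the time at N = 20.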
Other statisticians observed that when N is small, the deletion of a single patient’s data can often reverse the ‘statistically significant’ conclusion. Thus the result, when N is small, is quite ‘fragile’ and not very generalizable.
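This fragility can also be demonstrated by simulation (again a sketch of my own, assuming Python with numpy/scipy, a one-sample t-test against zero, and a moderate true effect): generate small samples until one is ‘statistically significant’, then recompute the test with each single observation deleted and see whether the conclusion survives.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
flip = None
for _ in range(500):
    x = rng.normal(loc=0.5, scale=1.0, size=20)  # N = 20, moderate true effect
    p_full = stats.ttest_1samp(x, popmean=0.0).pvalue
    if p_full < 0.05:  # a 'statistically significant' small study
        # re-run the t-test leaving out one observation at a time
        p_loo = [stats.ttest_1samp(np.delete(x, i), popmean=0.0).pvalue
                 for i in range(len(x))]
        if max(p_loo) > 0.05:  # deleting one patient reverses the conclusion
            flip = (p_full, max(p_loo))
            break

print(flip)  # (p-value with all 20 patients, p-value after deleting one)
```

It does not take many tries to find a sample where dropping one patient moves the p-value from below 0.05 to above it.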
There was a discussion of supplementing the parametric test with non-parametric testing to avoid the many parametric assumptions (see my blog ’11. p-values by the pound’), including what are often called ‘Exact’ tests. One statistician observed that even the ‘Exact’ tests are sensitive to small changes in the data set. One dilemma: what if the other tests were not statistically significant? ‘A man with one watch knows what time it is. A man with two watches is never quite sure.’ So if the t-test was statistically significant, but the Wilcoxon test was not statistically significant, what would we conclude?
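The two-watches problem is not hypothetical. In this sketch (my own illustration, assuming Python with numpy/scipy, skewed exponential data with a modest location shift, and the Mann-Whitney U as the non-parametric counterpart of the two-sample t-test), small skewed samples are drawn until the two tests disagree at the 0.05 level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
case = None
for _ in range(2000):
    # two small, skewed samples with a modest location shift
    a = rng.exponential(scale=1.0, size=15)
    b = rng.exponential(scale=1.0, size=15) + 0.6
    p_t = stats.ttest_ind(a, b).pvalue
    p_w = stats.mannwhitneyu(a, b, alternative='two-sided').pvalue
    # the 'two watches' dilemma: one test significant, the other not
    if (p_t < 0.05) != (p_w < 0.05):
        case = (p_t, p_w)
        break

print(case)  # (t-test p-value, Mann-Whitney p-value) for a discordant sample
```

With small skewed samples the two watches disagree often enough that one is forced to decide, preferably in advance, which one to believe.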
BTW, the statistician who initially asked the question changed his mind and now leans toward being a bit more hesitant to believe the ‘statistically significant’ p-value when the N is small.
What do I conclude? Same as I’ve been saying all along.
- If one planned the trial with an adequate, albeit small, N, then minor changes in the data set will not change our conclusions.
- If we designed the trial to have equal Ns per treatment group, then we can ignore the heteroscedasticity assumption.
- If we have at least ten observations per group, then the central limit theorem would allow us to ignore the normality assumption. We still need to examine the data, especially for outliers. An outlier could radically shift the mean.
- If the supportive non-parametric tests give similar results (in the sense that a p-value of 0.07 is similar to 0.04), then we can say the analyses agree, especially if the summary statistics (e.g., proportions, mean ranks) point in the same direction and show differences of relatively similar magnitude.
- With small Ns one cannot do all the tests one can do when the N is large (e.g., logistic regression or tests of interactions). OK, but that is to be expected with small N preliminary or exploratory studies.
- Should we publish when we have a small N? I can honestly see both sides of the argument.
- The treatment difference was statistically different from zero, hopefully using the test statistic pre-specified in the protocol/SAP. You played by the rules and got positive results: publish.
- We shouldn’t be focusing on p-values as much as estimates of the effect size with its confidence intervals.
- Working backward from the p-values and Ns, the effect size when N=20 was 0.85, a large effect size. When N=500, the effect size was quite small, 0.16. The small study observed a striking effect. The scientific community should learn of your potentially large treatment benefit.
- While the effect size is large, the confidence interval is also wide. The lower end of the interval excludes zero, but it probably cannot reject a quite small effect size. On the other hand, the upper end of the confidence interval is likely HUGE. The point estimate says the effect was large.
- The effect size estimate rests on some possibly questionable assumptions, which might need additional review.
- Although we might be able to reject the null hypothesis, we might not be able to reject small effect sizes (e.g., effect sizes of 0.1 or 0.2).
- Since the study was small, there may be sound reason to repeat the study to validate the findings. The FDA used to require at least two Phase III trials for NDA approval.
- If we always run small studies and publish only statistically significant results, then the literature gets a ‘distorted’ view of the treatment. That is, we would only publish small N research when the sample gave us a large treatment effect and never publish the scads of non-significant results. So if we ran hundreds of studies and the null hypothesis were actually true (no difference in treatments), then the one in twenty studies we’d publish would be 1) statistically significant and 2) show a very large treatment difference, even though the true difference is zero.
- Personally I lean toward ‘Yes’, publish, but would suggest the author include in the discussion a caveat emptor on the generalizability of the study results and a comment that ‘further research is needed’. Then run that study.
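Two of the points above can be checked with a few lines of code. This is a sketch of my own, assuming Python with numpy/scipy, a two-sided two-sample t-test, Cohen’s d as the effect size, and the quoted Ns taken as patients per group. The first part works backward from p = 0.01 to the implied effect size; the second simulates the ‘publish only when significant’ distortion under a true null.

```python
import numpy as np
from scipy import stats

def implied_d(p, n_per_group):
    """Cohen's d implied by a two-sided p-value from a two-sample t-test."""
    df = 2 * n_per_group - 2
    t = stats.t.ppf(1 - p / 2, df)      # t statistic matching that p-value
    return t * np.sqrt(2 / n_per_group)

print(round(implied_d(0.01, 20), 2))   # N = 20 per group: a large effect
print(round(implied_d(0.01, 500), 2))  # N = 500 per group: a small effect

# Publication bias under a true null: only 'significant' small studies get
# published, and every one of them shows a large spurious treatment effect.
rng = np.random.default_rng(0)
published = []
for _ in range(4000):
    a = rng.normal(size=20)             # no true treatment difference
    b = rng.normal(size=20)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published.append(abs(b.mean() - a.mean()) / pooled_sd)

print(len(published) / 4000)            # about 1 in 20 studies 'publishable'
print(min(published))                   # even the smallest published |d| is sizable
```

The back-calculated effect sizes come out close to the 0.85 and 0.16 quoted above, and in the null simulation every ‘publishable’ study reports a large effect even though the true treatment difference is exactly zero.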
My next blog will discuss multiple observations and statistical ‘cheapies’.