‘Take two, they’re small’
Are the results from small, but statistically significant, studies credible?
One of the American Statistical Association’s sub-sections is for Statistical Consultants. A short time ago, there were over fifty comments on the topic of ‘Does small sample size really affect interpretation of p-values?’ The motivation came from a statistician who went to a conference where “During the discussion period a well-known statistician suggested that p-values are not as trustworthy when the sample size is small and presumably the study is underpowered. Do you agree? The way I see it a p-value of 0.01 for a sample size of 20 is the same as a p-value of 0.01 for a sample of size 500. … I would like to hear other points of view.”
More often than not, a small sample size precludes achieving significance (see my blogs '1. Statistics Dirty Little Secret' and '8. What is a Power Analysis?'). When N is small, only very large effects can be statistically significant. In this case, it was assumed that the p-value achieved 'statistical significance', p < 0.01. There was considerable discussion.
Many statisticians felt that a small sample size (e.g., 20) would not be large enough to test the various statistical assumptions. For example, testing for normality typically takes hundreds of observations. A sample size of 20 lacks the power to detect non-normality, even when the distribution is quite skewed. So even though the p-value was 'significant', the tests of the assumptions are not possible, and hence the p-value is less credible. Or worse: if the data were actually non-normal and the sample size is small, the t-test is not appropriate, and hence neither is the p-value.
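A quick simulation (my own sketch, not from the ASA thread) makes the power point concrete: we draw samples from a clearly skewed distribution (lognormal, an assumption of mine) and count how often the Shapiro-Wilk test rejects normality at alpha = 0.05, at N = 20 versus N = 200.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, alpha = 2000, 0.05

def rejection_rate(n):
    """Share of simulated skewed samples of size n that Shapiro-Wilk flags."""
    rejections = 0
    for _ in range(n_sims):
        # Lognormal with sigma = 0.5 is noticeably right-skewed
        sample = rng.lognormal(mean=0.0, sigma=0.5, size=n)
        if stats.shapiro(sample).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

power_20 = rejection_rate(20)
power_200 = rejection_rate(200)
print(f"Power to detect non-normality: N=20: {power_20:.2f}, N=200: {power_200:.2f}")
```

With N = 20 the test misses the skewness much of the time; with a couple of hundred observations it catches it almost always.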
Other statisticians observed that when N is small, the deletion of a single patient's data can often reverse the 'statistically significant' conclusion. Thus the result, when N is small, is quite 'fragile' and not very generalizable.
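Here is a tiny illustration of that fragility, using hypothetical within-patient change scores of my own construction: the full sample of N = 10 is 'statistically significant', but dropping a single patient flips the conclusion.

```python
import numpy as np
from scipy import stats

# Hypothetical within-patient change scores; H0: mean change = 0
d = np.array([-0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4])

p_full = stats.ttest_1samp(d, popmean=0.0).pvalue
print(f"All 10 patients: p = {p_full:.3f}")  # below 0.05

# Leave each patient out in turn and re-run the test
loo_p = [stats.ttest_1samp(np.delete(d, i), 0.0).pvalue for i in range(len(d))]
print(f"Leave-one-out p-values range: {min(loo_p):.3f} to {max(loo_p):.3f}")
```

Dropping the patient with the largest improvement pushes p back above 0.05, so the 'significant' finding hinges on one person.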
There was a discussion of supplementing the parametric test with non-parametric tests, which avoid many of the parametric assumptions (see my blog '11. p-values by the pound'). Some suggested including what are often called 'Exact' tests, though one statistician observed that even the 'Exact' tests are sensitive to small changes in the data set. One dilemma: what if the other tests were not statistically significant? 'A man with one watch knows what time it is. A man with two watches is never quite sure.' If the t-test was statistically significant but the Wilcoxon test was not, what would we conclude?
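The two tests really can disagree on the same small data set, in either direction. In this constructed example (my own, not from the thread), one extreme value inflates the variance so badly that the t-test misses a difference the rank-based Wilcoxon test finds overwhelming. Which watch do you believe?

```python
import numpy as np
from scipy import stats

placebo = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
treated = np.array([9, 10, 11, 12, 13, 14, 15, 200], dtype=float)

# Welch t-test: the 200 inflates the treated group's variance
p_t = stats.ttest_ind(placebo, treated, equal_var=False).pvalue
# Wilcoxon rank-sum (Mann-Whitney): only ranks matter, and every
# treated patient out-ranks every placebo patient
p_w = stats.mannwhitneyu(placebo, treated, alternative="two-sided").pvalue

print(f"Welch t-test:      p = {p_t:.3f}")   # not significant
print(f"Wilcoxon rank-sum: p = {p_w:.5f}")   # highly significant
```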
BTW, the statistician who initially asked the question changed his mind and is now leaning toward being a bit more hesitant to believe a 'statistically significant' p-value when N is small.
What do I conclude? Same as I’ve been saying all along.
- If one planned the trial with adequate N, albeit a small N, then minor changes in the data set will not change our conclusions.
- If we designed the trial to have equal Ns per treatment group, then we can ignore the heteroscedasticity assumption.
- If we have at least ten observations per group, then the central limit theorem would allow us to ignore the normality assumption. We still need to examine the data, especially for outliers. An outlier could radically shift the mean.
- If the supportive non-parametric tests give similar results (in the sense that a p of 0.07 is similar to 0.04), then we can say the analyses agree, especially if the summary statistics (e.g., proportions, mean ranks) point in the same direction and the group differences are of similar magnitude.
- With small Ns one cannot do all the tests one can do when the N is large (e.g., logistic regression or tests of interactions). OK, but that is to be expected with small N preliminary or exploratory studies.
- Should we publish when we have a small N? I can honestly see both sides of the argument.
- The effect-size estimate rests on some possibly questionable assumptions, which might need additional review.
- Personally I lean toward 'Yes', publish, but would suggest the author include in the discussion a caveat emptor on the generalizability of the study results and a comment that 'further research is needed'. Then run that study.
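The central-limit-theorem point above can also be checked by simulation. In this sketch (distribution choice and N are my assumptions), both groups come from the same clearly skewed exponential distribution, so the null hypothesis is true; with ten observations per group, the two-sample t-test still holds its nominal 5% false-positive rate reasonably well.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n_per_group, alpha = 20000, 10, 0.05

false_positives = 0
for _ in range(n_sims):
    # Both groups from the same skewed distribution, so H0 is true
    a = rng.exponential(scale=1.0, size=n_per_group)
    b = rng.exponential(scale=1.0, size=n_per_group)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

rate = false_positives / n_sims
print(f"Observed type I error rate: {rate:.3f} (nominal {alpha})")
```

The observed rate lands close to 0.05 despite the skewness, though, as noted above, a single outlier can still shift the mean, so look at the data.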
My next blog will discuss multiple observations and statistical ‘cheapies’.