12. Significant p-values in small samples

‘Take two, they’re small’

Are the results from small, but statistically significant, studies credible?

One of the American Statistical Association’s sub-sections is for Statistical Consultants.  A short time ago, there were over fifty comments on the topic of ‘Does small sample size really affect interpretation of p-values?’  The motivation came from a statistician who went to a conference where “During the discussion period a well-known statistician suggested that p-values are not as trustworthy when the sample size is small and presumably the study is underpowered. Do you agree?  The way I see it a p-value of 0.01 for a sample size of 20 is the same as a p-value of 0.01 for a sample of size 500.  …  I would like to hear other points of view.”

More often than not, having small sample size would preclude achieving significance (see my blogs on ‘1. Statistics Dirty Little Secret’ and ‘8. What is a Power Analysis?’).  When N is small, only very large effects could be statistically significant.  In this case, it was assumed that the p-value achieved ‘statistical significance’, p < 0.01. There was considerable discussion.

Many statisticians felt that a small sample (e.g., 20) would not be large enough to test various statistical assumptions.  For example, testing for normality typically takes hundreds of observations.  A sample size of 20 lacks power to test normality, even when the distribution is quite skewed.  So, even though the p-value was ‘significant’, tests of the assumptions are not possible, hence the p-value is less credible. Or worse, if the data were actually non-normal and the sample size is small, the t-test is not appropriate.  Hence the p-value is not appropriate.

Other statisticians observed that when N is small, the deletion of a single patient’s data can often reverse the ‘statistically significant’ conclusion.  Thus the result, when N is small, is quite ‘fragile’, not very generalizable.

There was a discussion of supplementing the parametric test with non-parametric testing to avoid the many parametric assumptions (see my blog ’11. p-values by the pound’).  They suggested we include what are often called ‘Exact’ tests.  That statistician observed that even the ‘Exact’ tests are sensitive to small changes in the data set.  One dilemma here is what if the other tests were not statistically significant.  ‘A man with one watch knows what time it is.  A man with two watches is never quite sure.’  So if the t-test was statistically significant, but the Wilcoxon test was not statistically significant, what would we conclude?

BTW, the statistician who initially asked the question changed his mind and is now leaning on the side of being a bit more hesitant in believing the ‘statistically significant’ p-value when the N is small.

What do I conclude?  Same as I’ve been saying all along.

  • If one planned the trial with adequate N, albeit a small N, then minor changes in the data set will not change our conclusions.
  • If we designed the trial to have equal Ns per treatment group, then we can ignore the homoscedasticity (equal variance) assumption.
  • If we have at least ten observations per group, then the central limit theorem would allow us to ignore the normality assumption.  We still need to examine the data, especially for outliers.  An outlier could radically shift the mean.
  • If the supportive non-parametric tests gave similar p-values (in the sense that 0.07 is similar to 0.04), then we can treat them as supportive, especially if the summary statistics (e.g., proportions, mean ranks) point in the same direction as the means and show relatively similar differences.
  • With small Ns one cannot do all the tests one can do when the N is large (e.g., logistic regression or tests of interactions).  OK, but that is to be expected with small N preliminary or exploratory studies.
  • Should we publish when we have a small N?  I can honestly see both sides of the argument.
    • Yes
      • The treatment difference was statistically different from zero, hopefully, using the test statistic proposed in the protocol/SAP.  You played by the rules and got positive results, publish.
      • We shouldn’t be focusing on p-values as much as estimates of the effect size with its confidence intervals.
      • Working backward from the p-values and Ns, the effect size when N=20 was 0.85 – a large effect size.  When the N=500, the effect size was quite small 0.16.  The small study observed a striking effect.  The scientific community should learn of your potentially large treatment benefit.
      • While the effect size is large, the confidence interval is also large.  The lower end of the effect size excludes a value of zero, but it probably cannot reject a quite small effect size.  On the other hand, the upper end of the confidence interval is likely HUGE.  The middle value says it was large.
    • No
      • The effect size assumes some possibly questionable assumptions, which might need additional review.
      • Although we might be able to reject the null hypothesis, we might not be able to reject small effect sizes (e.g., effect sizes of 0.1 or 0.2).
      • Since the study was small, there may be sound reason to repeat the study to validate the findings.  The FDA used to require at least two Phase III trials for NDA approval.
      • If we always run small studies and just publish statistically significant results then the literature gets a ‘distorted’ view of the treatment.  That is, we would only publish small N research when the sample gave us a large treatment effect and never publish the scads of non-significant results.  So if we ran hundreds of studies and the null hypothesis were actually true (no difference in treatments), then the one in twenty studies we’d publish would be 1) statistically significant and 2) have a very large treatment difference, even though the true difference was zero.
    • Personally I lean to ‘Yes’ publish, but would suggest the author include in the discussion a caveat emptor on the generalizability of the study results and a comment that ‘further research is needed’.  Then run that study.
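The effect sizes quoted in the ‘Yes’ bullets can be roughly reproduced with a short calculation. The sketch below is not necessarily how the figures were originally derived. It assumes a two-sided p of exactly 0.01 from an equal-variance two-sample t-test, and it reads N as the number per group (an assumption, chosen because it lands close to the quoted values). The function names are made up for illustration, and the t quantile is found by plain numerical integration so that only the Python standard library is needed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) by Simpson's rule on [0, |x|], using symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(p, df):
    """Quantile for 0.5 < p < 1 by bisection; assumes the answer is below 50."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def implied_effect_size(p_two_sided, n_per_group):
    """Standardized mean difference that gives exactly this p-value.

    Assumes a two-sample t-test with equal group sizes, where
    t = d * sqrt(n_per_group / 2), so d = t * sqrt(2 / n_per_group).
    """
    df = 2 * n_per_group - 2
    t_crit = t_quantile(1 - p_two_sided / 2, df)
    return t_crit * math.sqrt(2 / n_per_group)


d_small = implied_effect_size(0.01, 20)
d_large = implied_effect_size(0.01, 500)
print(f"N = 20 per group:  d is about {d_small:.2f}")
print(f"N = 500 per group: d is about {d_large:.2f}")
```

Under these assumptions the two values come out near 0.86 and 0.16, in line with the large and small effect sizes described above. The same p-value therefore corresponds to very different observed effects depending on N.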

My next blog will discuss multiple observations and statistical ‘cheapies’.



85 Responses to 12. Significant p-values in small samples

  1. Tom says:

    Thanks for this post. It seems a little odd, in a post largely about power, to see such a statement as “A sample size of 20 would lack power to test normality, even when the distribution were quite skewed.” Unlike testing equality of means, where alternative hypotheses are easy to state, and where power is easy to compute, how do you specify the alternative to the test of normality in a way that allows power computation? And, even if you could do that, how much power would the test have? And how much do you need? It doesn’t seem quite enough just to say that the test “…lacks power.”

    And, if the t-test is statistically significant, while the Wilcoxon test is not, we conclude that the experiment rejected the hypothesis of equality of means at some alpha level, while the Wilcoxon test did not reject the hypothesis of (whatever hypothesis it tests) at that same level. Why do we not worry about the power of the Wilcoxon test?

    • Thank you very much for your comment. It was the first real statistics comment received.

      The alternative hypothesis is indeed harder to quantify. A long time ago, I developed an algorithm to generate non-normal data for Monte Carlo distributions based on a cubic transformation (a + bX + cX^2 + dX^3), where X is a normally distributed random deviate (Fleishman, Psychometrika, 43(4), Dec 1978, p521-532). I believed that many tests of effects of non-normality were so extreme as to make it irrelevant for practical decisions. For example, some early Monte Carlo simulations used single degree of freedom Chi Square distributions. In the real world, even after a reasonable transformation, distributions are seldom purely normal. They often have a bit of skew (and kurtosis – or heavy tailed) to them. Pearson and Please (1975) focused on ‘typical’ non-normality with skew less than 0.8 and kurtosis between -0.6 and +0.6 in testing robustness.
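As a rough illustration of the cubic transformation mentioned above, the sketch below pushes standard normal deviates through a + bX + cX^2 + dX^3 and checks that the result is skewed. The coefficients are hypothetical values picked only for illustration; they are not taken from the published tables and are not tuned to any target skew or kurtosis. Setting a = -c is used here because E[X^2] = 1, which keeps the mean at zero.

```python
import random
import statistics


def cubic_transform(x, b, c, d):
    """Apply a + b*x + c*x^2 + d*x^3 with a = -c so the mean stays at zero."""
    a = -c
    return a + b * x + c * x ** 2 + d * x ** 3


def sample_skewness(values):
    """Simple moment-based skewness (third central moment over sd cubed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum((v - mean) ** 3 for v in values) / (n * sd ** 3)


# Hypothetical coefficients, for illustration only.
B, C, D = 0.9, 0.2, 0.03

rng = random.Random(12345)  # fixed seed so the run is repeatable
sample = [cubic_transform(rng.gauss(0.0, 1.0), B, C, D) for _ in range(20000)]

sample_mean = statistics.fmean(sample)
sample_skew = sample_skewness(sample)
print(f"mean is about {sample_mean:.3f}, skewness is about {sample_skew:.2f}")
```

A positive c pulls the right tail out, so the transformed data are right-skewed even though the input is normal. In the actual method the coefficients are solved for so that the output has a chosen variance, skew and kurtosis; that step is not shown here.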

      But to answer your question directly, I have never seen any SAP or planned analysis which had the power for a normality test. I assume someone had planned one but I have to admit never having seen it. With that said, why do I mention power for normality?

      I tend to use SAS for all my work and their Proc Univariate procedure has normality tests available. As you may know, there are a number of normality tests. SAS has two sets of options, an order statistic, the Shapiro-Wilk Statistic, and a number of empirical distribution function tests (i.e., Kolmogorov-Smirnov, Anderson-Darling, and Cramer-von Mises tests). To return to the sample size issue, SAS says “If the sample size is less than or equal to 2000 and you specify the NORMAL option, PROC UNIVARIATE computes the Shapiro-Wilk statistic.”

      To me, if I have a sample size of 20 and a test requires 2,000, the test would not be appropriate. I have to admit, I never researched the power of the Shapiro-Wilk test.

      To say things another way, I wouldn’t actually rely on a significant p-value for normality. Hence, I wouldn’t do any power analysis for non-normality. In practice, I would examine the data for outliers. I have asked my CRAs to verify the most extreme scores for reasonableness (e.g., did the doctor comment that the value was real, is there corroborating information, like other parameters or measurements at other times, did they screw up the units of measurement). I would examine the frequency distribution. If things still looked non-normal, I would consider a transformation.

      SAS also said, “If you want to test the normality assumptions for analysis of variance methods, beware of using a statistical test for normality alone. A test’s ability to reject the null hypothesis (known as the power of the test) increases with the sample size. As the sample size becomes larger, increasingly smaller departures from normality can be detected. Because small deviations from normality do not severely affect the validity of analysis of variance tests, it is important to examine other statistics and plots to make a final assessment of normality.”

      To return to the case of results of the t-test and Wilcoxon test, I have to admit ignorance on the power of the Wilcoxon test. No Wilcoxon power analysis is available in SAS, StatXact, or nQuery, the three programs I use for power. I’ve always relied on the t-test power for such computations, although I might substitute the medians for the means. So I don’t worry about the Wilcoxon test’s power and estimate its power using a t-test’s power analyses. According to Sidney Siegel in his book Non-Parametric statistics, the Wilcoxon test is 95% as efficient with moderate sample sizes as the t-test. Let me remind the reader that both the t-test and the Wilcoxon compare means. In the former it is the means of the raw scores, in the latter, it is the means of the ranked scores. As long as the Wilcoxon test’s p-value is ‘near’ the t-test’s p-value I would consider the Wilcoxon supportive. I personally wouldn’t be crushed if the t-test were significant and the Wilcoxon were only ‘marginally significant’, although I might look a second time at the distributions.

      However, I would again like to stress that the key thing a statistician should be doing is not to say this test (e.g., t-test) is a winner (p < 0.05) and this test (e.g., Wilcoxon) is a loser (p > 0.05). We should be focusing on the confidence interval of the difference between treatments, the magnitude of the treatment effect. We should examine (as the p-value is testing) the lower end of the CI, not just to see if it excludes zero, but what other values it excludes. In addition, we should examine the overall mean difference and the upper end of the CI of the difference. It is the magnitude of the difference which relates to importance. p-values are only a sliver of the picture one gets from the CI of the difference.

  2. Tom says:

    Thanks for your thoughtful response to my post. Last night I got out my old copy of Siegel and read until my head started hurting (about 5 minutes). It’s hard to wrap the mind around the idea of power of a non-parametric test. Siegel talks of power-efficiency as the power relative to that of a t-test (for example) when the assumptions of the t-test are met. And of course the power is a point on a curve–we have different powers for different alternative hypotheses. I could draw the curve (or at one time could have) for the t-test, but I don’t even know what are the points on the x-axis for the alternative in the case of the non-parametric test.

    Plus, when it comes to the power of the non-parametric test, Siegel talks of power when the assumptions of the t-test are met. But how does failure of those assumptions affect power? Also, with respect to Wilcoxon specifically, he mentions the asymptotic efficiency ….. “near the null hypothesis.” I really don’t follow that. The power AT the null is alpha (by definition), so what does “near” mean?

    I have no argument with non-parametric procedures. And of course you laid bare their real weakness in post #10. My comment was directed toward ANY preliminary hypothesis test that we perform to help us decide how to analyze data. You happened to mention tests of normality, and of course there is the old issue of testing interaction in an ANOVA before deciding what to use as the error term for main effects. I’ve always believed that, if our choice of analysis is going to depend on the outcomes of such tests, we should give serious thought to their power curves. But that is hard to do.

    • Two quick notes.

      1. The Wilcoxon is a t-test, but on ranked data. Asymptotic efficiency is as N gets large (central limit theorem). Perhaps that could make it a bit easier to understand.

      2. I don’t think I recommended any preliminary testing, except checking the data. Perhaps I might question the validity of some outliers or transform the data to something more normally distributed. Transformations also lessen the likelihood of spurious ANOVA interactions. I’m a firm believer in following the SAP analyses. However, if we see something unanticipated in the data we must adapt.

      WRT ANOVA I was thinking of writing a blog on that topic.

  3. lisa davis says:

    I did a paired sample t-test and found significant difference (t=4.438 and p = .007) but my n is 6. The effect size is high (Cohen’s d=1.8119). However, I’m not sure if this significant t-test or the effect size has any value because of my small n. What should I do? Please help!

    • My first thought was congratulations. You ran a small trial and demonstrated that the effect was not zero. Looking at your Cohen’s d (another name for the effect size), you observed a very large effect. Your ‘pilot (??)’ study demonstrated a real effect.

      Now for a few hesitations.

      With 6 patients in your paired t-test, many assumptions can’t be tested. As mentioned in the blog, I would strongly suggest you plot your 6 observations and look at their distribution. Was one remarkably different from the others (i.e., an outlier)? You might want to try running non-parametric tests on your data, say a runs or a paired-ranks test. The former would be significant at the 0.016 level, if 6 out of 6 patients were in the correct direction. The signed rank test would give identical results. Hence, these two supportive non-parametric tests are rather underpowered. It should also be pointed out that the distribution-free Chebyshev’s Inequality would give the t of 4.44 a conservative upper bound on p of 0.051 (1/t^2), assuming a two-sided test.
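The two numbers in the paragraph above are easy to check by hand. The sketch below is only an arithmetic check under stated assumptions: it reads the 0.016 as a one-sided sign test with all 6 of 6 differences in the expected direction (an assumption, since that reading reproduces the figure), and it treats the t of 4.438 as a standardized deviation in Chebyshev’s inequality, which bounds the two-sided tail probability by 1/t^2. The function names are illustrative.

```python
from math import comb


def sign_test_one_sided_p(n_positive, n):
    """P(at least n_positive of n differences are positive) when each sign
    is a fair coin flip under the null hypothesis."""
    return sum(comb(n, k) for k in range(n_positive, n + 1)) / 2 ** n


def chebyshev_bound(t):
    """Distribution-free bound on P(|deviation| >= t standard deviations)."""
    return min(1.0, 1.0 / t ** 2)


p_sign = sign_test_one_sided_p(6, 6)   # all six differences in one direction
p_cheb = chebyshev_bound(4.438)        # t statistic reported in the question

print(f"one-sided sign test, 6 of 6: p is about {p_sign:.3f}")
print(f"Chebyshev bound for t = 4.438: p is at most about {p_cheb:.3f}")
```

With only six pairs the sign test cannot give a p-value smaller than 1/64, however large the differences are, which is one way of seeing how little power such a test has at this sample size.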

      However, I have an unusual feeling about the results of your study. It is not that you didn’t find a difference (it was ‘significant’), but your result was perhaps too large. Converting your result to a correlation ratio, the treatment effect explained 92% of the variability of your sample. As the distribution of correlation ratios is skewed the median was close to 95%. I believe the lower five percentile of the correlation ratio was around 74%. For biological (or other human) studies, that is very, very unusual. You didn’t mention the nature of your study, but I would suspect it akin to what one of my early professors said ‘You don’t need statistics to prove that a rock can break a window.’

  4. Peter Flom says:

    Very interesting discussion!

    Related to your comment “You don’t need statistics to prove a rock can break a window” is the favorite test of my favorite professor in grad school: The IOTT. That’s interocular trauma test. It hits you between the eyes!

    As for Lisa’s point of very small N with very large effect size, I would suggest a permutation test.

    • Great suggestion. Unfortunately, permutation tests have very poor power when N is small. I forget at what N/group you’d need in order to reject the Ho, but when you can, the two groups MUST not overlap to reject the Ho. With N=6, if and only if, all six differences are in the same direction would the asymptotic p-value be statistically significant (p = 0.031). The Exact Wilcoxon Signed Rank test would be n.s. with a p of 0.062. If only 5 of the six were in the right direction, the asymptotic p-value would be 0.104 and the Exact p-value would be 0.19. [Results from StatXact] In sum, with small samples one would seldom get statistically significant results with non-parametric tests, unless ‘every planet aligned’.

  5. Jeneifer says:

    I am overwhelmed by the statistical jargon. For an ordinary learner like me, whose interest is in language, I need to exert more effort to understand some of it. However, I must admit I have fun reading the exchange of technical ideas.

    I have one concern. Please help me. I have a friend who is doing his undergraduate thesis. He has only 10 respondents (deaf and mute, two groups of same year level but different mode of instruction), complete enumeration. He is trying to find the significant relationship between the respondents’ demographic profile (sex, parents’ educational attainment, and family income) and their reading comprehension performance. He is also interested to find out the significant difference between the mode of instruction and respondent’s reading comprehension performance.

    He is worried about the small N. He can describe/discuss the results of his study qualitatively, but is there a better way of doing it (although it’s really true statistics is NOT needed to prove that a rock can break a window)? He has Ho’s. What will he do?

    I am looking forward to hearing from you.


    • If I understand correctly, the study has an N of 10 with 5? in each group. Basically, with a sample that small, if his objective is to statistically significantly reject the Ho then the means would need to be >= 1.41 standard deviations apart for p<0.05. Yes, this is an effect size similar to demonstrating that a rock can break a window. In short, with realistic effect sizes, the study was doomed to failure if its goal was statistical significance.

      That said, it is unrealistic to hope to reject Ho or even to plan to make rejecting the Ho the goal of the study. I would classify the study as an exploratory study to see the feasibility of doing the research. In other words, a pilot study. Pilot studies, although they are unable to verify a scientific goal, have an underrated importance in science. They are the foundation on which all later studies depend. They can provide the groundwork for what worked and what didn't work (for the different modes of instruction). They can suggest the useful stratification factors (e.g., gender, parent's educational attainment, family income). They guide the selection of subjects. They can be the early indicators (i.e., mean differences or standardized mean differences) to see if the intervention might be useful. They guide the selection of the optimal dependent variable. Finally they provide the guidance for determining the number of subjects needed for a full study (power analysis).

      With an N=10, your friend's undergraduate thesis adviser should have realized the futility of having the goal to reject the Ho. I recommend speaking to the adviser and stressing all the benefits you got from this early exploratory study and where the research should go from here. I'd stress the things you learned from the study. I remember one study in which the authors, studying developmentally challenged young children, enumerated all the steps they undertook to get compliance (e.g., if the child didn't complete the task put them in 'time out' then repeat it until they did; remove the parents and clocks from the testing area). My wife's master's thesis was to see if high school children benefited from an intervention. The results were non-significant. Her doctoral dissertation took the scale, did a factor analysis and observed that the major scale had five underlying factors, some of which were NEGATIVELY related to one another. She then reanalyzed her master's thesis study and found some factors were actually statistically significant, some had effectively zero effect sizes, and some were significantly negatively affected (e.g., the kids realized that they didn't know as much as they thought they did [like they needed chemistry to be a pharmacist]). But that all made much sense! The experiences you get from pilot studies are vital for any future study.

      In essence, relate to the adviser all the things your friend learned from doing this pilot study and all the recommendations for future studies. I don't think that an undergraduate thesis is expected to have the same rigor as a doctoral dissertation. An N=10 is appropriate for an intervention study as an undergraduate. A doctoral dissertation might be expected to enroll N=150.

  6. Joe says:

    This is an interesting discussion. My take on this will be a little different I think. If you frame this question in terms of confidence intervals instead of pvalues, it brings some different intuition to the table. If we consider confidence intervals from 2 experiments from a normal distribution with 1) [-5, 5] and 2) [-0.001, 0.001] . I would not consider these equal evidences favoring the null (consequently, inherent in these samples there is a differing amount of non-zero evidence for values that are not the null).

    The disconnect here is that at some point I am happy to say the mean of the distribution is ‘essentially zero’. These experiments give the same p-value=1. However, in the real world where all of our experiments have finite resolution, they are different levels of evidence for/against the null. At some point, for large n, no matter the resolution of our experiment it will become virtually certain that the null is true.

    The problem here is the mathematical construct (continuous) we use to model real world problems (discrete). Thus the pvalues we calculate from our continuous model fail to make a distinction between 2 experiments. If the resolution of our experiment dictates that mu=0.1 is essentially zero, there are very different levels of evidence against the null here, but the pvalues under the continuous model are silent. The same problem will apply in the tails of the distribution and to all problems when data collected with finite accuracy is modeled with a continuous model.

  7. Sarah says:

    Thanks for this great and necessary discussion. It’s very hard to find clear information on small-N hypothesis testing!

    Here’s my situation, I’m looking at N = 5 trial data and trying to decide the appropriate test. Initially, I went with the non-parametric, paired Wilcoxon test with the exact distribution (using wilcoxsign_test from the coin package in R). I also graphed all the points out (for 20+ separate DVs). No significant p-values were returned from the tests, and based on my visual inspection of the graphs, I also do not detect notable differences between the majority of the paired groups.

    So what’s the problem? Unlike the case above with N = 6, where significance was achieved with the Wilcoxon test when (and only when) all differences were in the same direction, I noticed that with N = 5, no arrangement returns a significant p-value. In other words, when all 5 differences are in the same direction, p = .0625. This leads me to conclude that this particular test is invalid when N <= 5. A test is no test when it always returns the same grade.

    Therefore, I have opted for paired t-tests, which would seem to have more power because they do not throw out the content of the value like Wilcoxon's ranks. With the t-tests, a couple of the groups returned significance, while the large majority did not, as expected. Here I risk invalid findings though, as normality cannot be assumed. I am more comfortable with that risk, than with reporting the Wilcoxon test results, which for lack of a better word, to me, seems fraudulent.

    It is also worth noting that while I am a neutral 3rd party, the 'preferred' outcome of the hypothesis testing is actually the null. I say this in case you're concerned I'm fishing for any test to return significance – I'm not. No one would notice if I reported the Wilcoxon tests, but to me, that seems wrong. What I don't want to do is raise a big red flag if t-tests are flat out inappropriate though – they do largely support the visual inspection. I looked for comparable studies, with the same N, and they used the Wilcoxon tests – eek. PLEASE advise!

    • Yes, you are correct, when N is 5 the Wilcoxon will have as a minimum a p-value of 0.0625. Think of it like this: with a two-tailed test, if the first ‘subject’ favored treatment A then the ‘bet’ would be that the other scores must also favor treatment A. So with a Wilcoxon, to be significant in one direction, all the other 4 cases would need to favor A as well. It becomes a binomial, or 0.5^4 or 0.0625. Unfortunately, it is not the case that they should have run another inferential test (e.g., a t-test). When N is small you need a very, very large effect size to achieve statistical significance. Any test will often fail to achieve statistical significance even when the true effect is medium, large, or even very large. When N is very small (e.g., 5) you need a humongous effect size to achieve p<0.05. To get (not plan for, see below) significance, the two treatments would need to be 1.2 standard deviations apart. That means the distributions basically don't overlap = the treatment effect is night vs day = humongous.
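The 0.0625 floor can be confirmed by brute force. The sketch below is a plain enumeration written for illustration, not the StatXact or R `coin` implementation. It assumes no zero differences and no tied absolute differences, and the function name is made up. It lists every possible assignment of signs to the ranks and counts how many are at least as extreme as the observed one.

```python
from itertools import product


def signed_rank_exact_p(diffs):
    """Two-sided exact p-value for the Wilcoxon signed-rank test.

    Assumes no zero differences and no ties among the absolute differences.
    """
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank

    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)

    # Under the null, each rank is equally likely to carry a + or - sign.
    as_extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            as_extreme += 1
    return as_extreme / 2 ** n


# Five paired differences, all in the same direction (made-up values).
p_n5 = signed_rank_exact_p([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"smallest possible two-sided p with N = 5: {p_n5}")
```

Only 2 of the 32 sign patterns (all positive or all negative) are as extreme as a clean sweep, so 2/32 = 0.0625 is the smallest two-sided p-value the test can ever return at N = 5, matching the 0.5^4 argument above.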

      I also said that when N is small, the study is by nature exploratory. When you run a study with 20+ dependent variables and N=5, this is certainly in the realm of fishing. No, the key thing here is that the results, at best, are only exploratory. ONLY EXPLORATORY! Therefore, this study must NOT focus, dare I say, NOT REPORT the p-values. With 4 times as many tests as subjects, the true experiment-wise alpha level is not 0.05 but 0.64. In other words, the odds are about 2 out of 3 in favor of at least one significant result (ignoring the small N). You implied a couple of significant results. That would be expected when their power is so low. Their 4 to 1 ratio of dv to subjects is preposterous. Please look at my blogs 1, 2 and 3 for perspective on p-values. You, and they, shouldn't be focusing on p-values.
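The 0.64 figure follows from a one-line formula. The sketch below assumes 20 tests that are independent of one another, each run at alpha = 0.05 with the null hypothesis true. Independence is an assumption made for the arithmetic; labs and vitals measured on the same 5 subjects would in practice be correlated.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n_tests independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


fwer_20 = familywise_error_rate(0.05, 20)
print(f"experiment-wise alpha for 20 tests: {fwer_20:.2f}")  # prints 0.64
```

With a single test the function returns 0.05 as expected, and the rate climbs quickly as dependent variables are added.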

      As p-values are not the way to go, here is what I would have recommended: With a very small N and many d.v., they should have focused on purely descriptive statistics. No p-values, not even a CI. Well actually, perhaps a CI (see below). I get the feeling this is a paired design, so you could provide means for the 5 subjects on 'A', on 'B', and, most importantly, their difference. Descriptive statistics on the difference would be the mean, median and proportion favoring 'A', and measures of variability (s.d., and range). While the CI is not directly useful in testing the null hypothesis, it might be useful in seeing how large it might favor treatment A (the upper end of the CI) and how large it might favor treatment B (the lower end of the CI). With such a small study, their results would be demonstrably very, very humbling.

      If the 'preferred' outcome is finding no difference, then the only better way of 'guaranteeing' it would be if they botched up the randomization or ran the study while dead drunk. With an N of 5, any effect size of less than stellar would not enable one to reject the null hypothesis. Remember, when you fail to reject the null hypothesis, it is NOT the same as favoring the null. Failing to reject the Ho means you can't come to any conclusion that the groups were different. Didn't they take Stat 101? (See my blog 5 on accepting the null hypothesis - one would need a large N to demonstrate that the CI excludes a clinically meaningful difference.) No, lack of statistically significant results for this study directly implies a failure of the scientists, not the theory that the groups were similar (if in doubt look at the CIs).

      I also strongly recommend you look at my blog on power analysis (blog 8). With 80% power, the treatment effect they needed to get significance would be such that the scores on the two treatments basically do not overlap (i.e., with 80% power they would need an effect size of a whopping 1.7). For a moderate treatment effect (0.5), one would need 34 subjects. Their study was totally inadequate by a factor of 7 (and that assumes only 1 dv). They are the failure in fail to reject the null hypothesis. In the future, tell them that they could have gotten the same result two and a half times faster by testing N=2, with the same validity for their preferred null theory.
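The 34-subject figure can be approximated without specialized software. The sketch below is not how SAS or nQuery computes it; it uses a normal approximation for a paired (one-sample) t-test plus a commonly used small-sample correction term, with a two-sided alpha of 0.05 and 80% power assumed. The function name is illustrative. The approximation is reasonable for moderate N but should not be trusted for very small samples such as N = 5.

```python
import math
from statistics import NormalDist


def approx_n_paired(effect_size, alpha=0.05, power=0.80):
    """Approximate number of pairs for a two-sided paired t-test.

    Normal approximation ((z_alpha + z_power) / d)^2, plus z_alpha^2 / 2
    as a small-sample correction for using t instead of z.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    n = ((z_alpha + z_power) / effect_size) ** 2 + z_alpha ** 2 / 2
    return math.ceil(n)


n_needed = approx_n_paired(0.5)
print(f"pairs needed for d = 0.5: {n_needed}")
print(f"shortfall with N = 5: a factor of about {n_needed / 5:.1f}")
```

For a moderate effect of 0.5 this gives 34 pairs, so a study with 5 subjects is short by a factor of roughly 7.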

      The study you were asked to review is pure crap, not worth reviewing. Their null results directly follow their inability to run a correct trial. To paraphrase the words of a great English professor, "This is a fine and magnificent paper. Unfortunately some fool typed meaningless drivel all over it, ruining the paper."

  8. Sarah says:

    Dr. Fleishman, thank you so much for your response. I laughed… and cried, ha. Definitely look forward to reading through the rest of your blog, as posts 1, 2 and 8 were helpful (and amusing) as you recommended.

    A point of clarification, the data I looked at were not the main treatment measures of the study but instead labs and vitals collected once before and at two time points after the administration of a treatment. Despite the small N, 3 of the paired t-tests between time 1 and time 2 returned significant results of p~.01; you are correct that there was virtually no overlap between the two groups in those cases.

    I had already submitted the results of the t-tests before posting here, so, ya. Not sure what I can do now, but I do appreciate this opportunity to continue to learn. *Sigh* At the very least I will call out the erroneous Wilcoxon tests used in the previous work.

    And in case you’re interested in the backstory, my work is primarily with large digital data, not small-N biometric studies. I originally got in this mess after being asked to run ‘a quick one-way anova’ (it’s ok to laugh). At first I considered a generalized estimating equation model using difference scores as dependent measures, as used in the same tisk-tisk Wilcoxon paper, but that also seemed invalid, or at least beyond my comfort level considering my unfamiliarity with these types of data – so I opted for the multiple t-tests.

    Also, for the record, I did go to stats consulting and was advised to just ‘fudge it’ and report the Wilcoxon tests. I was shocked. Not to be too dramatic or anything, but this isn’t just about data integrity, it’s about personal integrity. Why live a lie? Honestly, the thought of reporting bogus results makes me feel sick, and even my subpar performance with this work is seriously regrettable.

    So, my advice is JUST SAY NO to super small N data!

    • To answer Sarah’s comment/question:

      Very quick anecdote. I was a beginning graduate student and was asked to do some Analyses of Variance (ANOVAs). As I hadn’t had much statistical experience at that point I didn’t ask the client (a clinical psychology graduate student) about the research (PhD dissertation). I was eventually able to force an ANOVA (which in those days assumed equal Ns per cell) into computing results. These Ns were anything but equal. I forced the computer to ignore system attempts to compute the square root of a negative number (e.g., the sums of squares were negative). I handed the client the stacks of output and he thanked/paid me. He told me he originally wanted to do a factor analysis, but was told he didn’t have a large enough sample size, so he did ANOVA instead. For those who don’t know about the difference, they are as similar as an amoeba is to a poem. We all get asked, without additional information, to do what in hindsight look like quite stupid things. Some of us learn not to do it twice. [Me, I’m still learning.]

      WRT the paired t-test, the problem with those pre-post studies is that anything (and everything) could cause changes over time. Even with N=3, it might be real (at a minimum it is highly suggestive). Unfortunately, I’ve also noticed that the reason for such dramatic changes is very clinically obvious and biologically trivial. Yes, statistics can prove that a heavy stone can break a window. Personally, I’m not so hard on small N studies, as long as the scientist knows the limitation of their results (e.g., very, very wide confidence intervals). If despite the small N, they could see something in the data – more power to them. If it is a non-trivial finding, then I’d really be impressed.

      With very small N studies and many d.v., at a minimum the scientists need to run a confirmatory study, as the small N study was purely an exploratory (suggestive) study. Whether they realize it or not.

      As to being advised by a statistical consultant to ‘fudge it’, well there are good and poor consultants everywhere. I’ve seen great ones, and poor ones. Myself, I can only try! And I try to not make the same mistake twice (certainly not a fourth time). At least you learned to avoid that consultant.

  9. Yoseph says:

    My question is: what are the reasons that statistics fail to show a significant association between two groups?
    Ex: lack of association between physical activity and diabetes (the prevalence was higher in those who were physically inactive).
    Maybe it is the sample size of the two groups? Or I need a better explanation.
    Hoping for your response.

    • To Answer Yoseph’s comment/question:

      Sorry, your question is a bit ambiguous. Physical activity and diabetes are not two groups. Did you mean variables? Failing to show significant association may mean lack of an adequate sample size. Or a weak relationship. Or it could be an outlier weakening the relationship. Or it could be a curvilinear (‘U’ shaped) relationship. Or a subgroup restricting the range. Or a poor scale (measuring activity or diabetes). Or a population of uncooperative subjects/poorly collected data/incorrectly computed statistics. Or …

  10. David Atkinson, M.D. says:

    Could we do an experiment to see whether or not, for a given p-value, “small n” samples and results are less predictive of real differences than “large n”?

    • To correct a misconception in your question, ‘for a given p-value’ is never appropriate. As I stated in Blog #1, the p-value is a function of N and the experimental effect size. Only the latter could be a fixed amount for a given hypothesis/population. YOU select the N. Hence the p-value cannot be a fixed number. Only the effect size/treatment difference can be reasonably thought of as fixed.

      We don’t need to do an experiment. ‘Less predictive’ has a one-to-one equivalence with the width of the confidence interval (CI). The width of the CI is directly a function of N and the experimental error (s.d.). [Note: I do not mean to imply that error is due to the experimenter. True, error may be due to sloppy technique, but to a statistician error has to do with all sources of variability. For example, the number of levels of the scale (dichotomous vs continuous), the patients (e.g., wedding dancing for a patient with a foot ulcer), the disease process, external effects (e.g., summer for acne), etc.]

      As I have been saying all along (Blogs #3 and #4), ‘predictive of real differences’ lies not with the p-value but the observed real difference (e.g., the observed mean difference or the observed effect size) and their confidence interval. When you have a small N, then the confidence interval will be much larger than when N is large. Simply put, the CI is a function of the square root of 1/N (i.e., typically the observed difference plus and minus ~2*standard deviation*square root of 2*square root of 1/N per group).

      Therefore, for a given effect size/treatment difference, the width of the CI will be narrow when N is large and wide when N is small. With a large enough N (e.g., 10,000) any non-zero difference (and they should all be non-zero, although some might be very close to zero and others could be negative) will have a CI which does not include zero = statistically significant. Conversely, with a small N (e.g., 2) almost nothing would be statistically significant. Please see my blog #1 (1. Statistic’s Dirty Little Secret), especially the table.
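      The relationship between N and the width of the CI can be sketched in a few lines. This is only an illustration of the rough rule quoted above (difference plus and minus about 2 * s.d. * sqrt(2) * sqrt(1/N per group)); the function name `ci_half_width` and the choice of s.d. = 1 are assumptions made for the example, not anything from the blog.

```python
import math

def ci_half_width(sd: float, n_per_group: int, z: float = 2.0) -> float:
    """Approximate half-width of the ~95% CI for a difference of two means.

    Uses the rough rule from the text: z * sd * sqrt(2) * sqrt(1 / N per group).
    """
    return z * sd * math.sqrt(2.0) * math.sqrt(1.0 / n_per_group)

# Illustrative sample sizes; s.d. is set to 1 so widths are in s.d. units.
for n in (2, 20, 500, 10_000):
    print(f"N per group = {n:>6}: difference +/- {ci_half_width(1.0, n):.3f} s.d.")
```

      With s.d. fixed, the half-width shrinks with the square root of N, so going from 20 to 500 per group narrows the interval five-fold.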

      The REAL issue is how large do you believe the treatment difference is. Then you should plan on getting a large enough N for your experiment to ‘just’ get it predictive enough (see Blog 8 – What is a Power Analysis?).

  11. shahd says:

    Hi Dr. Fleishman,
    I hope you have the time to answer my questions,
    I’m running a clinical study with (n=11), I have n=6 in treatment group vs n=5 in placebo.

    1-for this small sample size is it ok to have the significance level <1.0, and a trend between 1-1.5??
    2-I'm running correlations, is it ok to use 1-tailed tests? based on the results of previous studies?
    3-I find it hard to interpret the Mann-Whitney test output, what is the best way to report the results?


    • I am sorry, but I don’t understand all of your questions.

      1 – Significance level usually refers to p-values. Traditional statistical significance usually means p<0.05. So <1.0 or 1-1.5 is unclear to me. Did you mean testing at 0.10 (or 0.20) rather than 0.05?

      2 – Testing correlations or any other statistic can use a one-tailed test. Whether you use a one- or two-tailed result depends on your audience. If they are willing to accept a one-tailed test then you could use it. However, to level the playing field, editors typically require a consistent two-tailed test. You can't justify the use of a one-tailed test on an inadequately run study. However, your real problem is that the effect size needed to demonstrate a non-zero correlation would need to be very, very large. If you pool your two groups together (N=11), the correlation test would have 8 d.f. An observed r, needed to achieve statistical significance with 11 observations, is r=0.56. If you were testing within each treatment N=6 and 5, you would need to see correlations of 0.73 and 0.80, respectively. These are large and very large correlations. If your editor required a two-tailed test, the needed effect size (in this case correlation) would need to be even larger. Comparing the two correlations against one another would need a VERY, VERY large difference in correlations.

      3 – The Mann-Whitney (sometimes tested by the more general Wilcoxon) test would rank order all 11 subjects (getting ranks of 1 to 11). It computes the mean ranks for the two groups and tests if one mean of the ranks was higher than the other. Think of a t-test on the ranks. What do you report from the Mann-Whitney output: Only the p-value is typical. I would then report the observed means and medians, although they are not relevant to the Mann-Whitney test. If the results were significant, it would imply that the group with the higher mean rank tends to have the larger values. If you need help in interpreting the output from your statistical program, I recommend you consult the program’s help site or refer to a statistical text. However the larger question would be WHY would you be doing a non-parametric test? Did you expect the data to be badly skewed or non-normal? I highly recommend you read my blogs 7a. Assumptions of Statistical Tests: Ordinal Data, 10. Parametric or non-parametric analysis – Why one is almost useless, and 19. A Reconsideration of my Biases.
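      The correlation thresholds mentioned in point 2 can be checked with a small sketch. It assumes a one-tailed test at 0.05 and takes the degrees of freedom as an input; the helper names (`t_upper_tail`, `critical_r`) are made up for this illustration, and the t-distribution tail is obtained by simple numerical integration rather than from a statistics package.

```python
import math

def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """One-tailed p-value P(T > t) for t >= 0, via Simpson's rule on the t density."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)

    def pdf(x: float) -> float:
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

def critical_r(df: int, alpha: float = 0.05) -> float:
    """Smallest r whose one-tailed p-value reaches alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(40):
        r = (lo + hi) / 2
        t = r * math.sqrt(df) / math.sqrt(1 - r * r)
        if t_upper_tail(t, df) > alpha:
            lo = r  # not yet significant: need a larger r
        else:
            hi = r
    return (lo + hi) / 2

# d.f. as quoted in the reply: 8 for the pooled analysis, 4 and 3 for N=6 and N=5.
for df in (8, 4, 3):
    print(f"d.f. = {df}: one-tailed critical r is about {critical_r(df):.2f}")
```

      Under these assumptions the results land close to the values quoted in the reply; a two-tailed test would push every threshold higher.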

      My reactions are:

      a) You said you are ‘running a clinical trial’; and have already examined the output of the analyses. This implies that the protocol was already written, trial unblinded and analyzed. However, you are asking about significance levels, number of tails and the statistical test. What did the statistical section of the protocol say? The analysis MUST follow the protocol.

      b) You are planning a trial with a very small sample size. This sounds to me like a pilot/exploratory trial, especially if, as I suspect, there was no pre-specified statistical methodology. In such a case, I would not report the p-value. It is likely that reporting the p-value would only hurt you, as it probably will not even approach statistical significance. Therefore, one- or two-tailed tests are inapplicable. The focus should be on reporting summary statistics! I would report confidence intervals as well.

      c) It is not a good idea to throw away information when you have a small sample to begin with. You already have poor power. I suggest you focus on the parametric statistics (e.g., means and Pearson correlations). You could also present rank order statistics (e.g., medians, Spearman rank order correlations) to describe your results. There is little reason to use rank order statistics when N is small, except to demonstrate the robustness of the parametric analysis. This would be a supportive analysis. See blogs 7 and 7a.

  12. shahd says:

    Thank you so much for your response ! I really appreciate it

    What I meant by significance of 1.0, is the p-value. For example if I got a p value of <1.0, is it justified to present it as significant because its a preliminary/explanatory study?

    I read somewhere else to focus on summary statistics rather than significance with this small number of subjects. Could you be more specific by summary statistics? is it to report (means, medians, confidence intervals) for each outcome measure in each group? without testing the differences?

    Thanks again and I'll be reading up on your blog

    • P-values have values between 0.0 and 1.0. Therefore, ALL p-values are <= 1.0. Most journals require p < 0.05 for statistical significance. I think you may be thinking of using a p-value of 0.10 as stated in my last comment. Think of the p-value as meaning 'the probability that random numbers could explain your, or more extreme, results'. Therefore a p-value of 0.10 means that there is a ten-percent likelihood that random results could explain your findings. A p-value < 0.05 means it is even more unlikely (less than one time in twenty). A p-value of 1.0 means random numbers would certainly explain your findings. The goal of most research is to say your findings are NOT random. Yes, means, medians, correlations (perhaps CI) are the summary statistics I alluded to. You could report these statistics for the group differences as well. As a further reason to avoid reporting p-values, one can do a power analysis and see the likelihood that you will get statistically significant results. If God said the effect size (correlation) was moderate (i.e., 0.30), and your study used a sample size of 11, then the likelihood that your study would be statistically significant is 0.23 (i.e., your study would fail 77% of the time). In fact, an observed correlation of 0.30 would not be significant. If the N were 5 then the power would drop to 0.12 (i.e., your study would fail 88% of the time). I, like most statisticians, plan studies with at least an 80% success rate (power), not an 80% failure rate. See my blog '8. What is a Power Analysis?'. Unfortunately a power analysis only applies to planning a future study, not to discussions of a completed study. Reporting p-values for small studies is akin to the old adage of 'urinating into the wind'. Nevertheless, you can always compute the p-value and hope for the rare (23%) success. Who knows, you may be lucky and the effect size might be much larger. I personally wouldn't recommend you expect a positive result.
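      For readers who want to reproduce the power figures roughly, here is a minimal sketch. It assumes a one-tailed test at 0.05 and uses the Fisher z approximation, so it is not necessarily the method behind the numbers quoted above; the function name `correlation_power` is invented for the example.

```python
import math
from statistics import NormalDist

def correlation_power(rho: float, n: int, alpha: float = 0.05) -> float:
    """Approximate one-tailed power to detect a true correlation rho with n subjects.

    Fisher z approximation: atanh(r) is roughly normal with s.e. 1 / sqrt(n - 3).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    shift = math.atanh(rho) * math.sqrt(n - 3)
    return 1 - NormalDist().cdf(z_crit - shift)

# Sample sizes from the discussion, plus larger ones for contrast (illustrative).
for n in (5, 11, 50, 100):
    print(f"N = {n:>3}: power to detect rho = 0.30 is about {correlation_power(0.30, n):.2f}")
```

      With this approximation N=11 and N=5 give power of roughly 0.22 and 0.11, close to the 0.23 and 0.12 quoted above.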

  13. PeterN says:

    Hi Dr. Fleishman,

    Great blog! I’m not sure if this question is in the right place here, but it seems somewhat related. I’m a celiac disease patient, which means that since my diagnosis, 2 years ago, I follow a strict gluten-free diet on doctor’s advice. Because food testing is expensive and I still get symptoms occasionally from food products not showing any gluten-containing ingredients, I want to know if the occurrence of symptoms is a good measure for the presence of gluten in a food product. Therefore, I want to design an experiment to test if my body can accurately detect small amounts of gluten. I already thought about how to do it in a double-blind, randomized and placebo-controlled way, but am still a bit confused about how to get a good level of confidence/significance regarding any results.

    I’m thinking about taking a bowl of porridge a week, which will be sometimes gluten-free and sometimes gluten-containing. Neither I nor the person serving me the bowl will know which is which. At the end of the experiment, I will have to tell which ones contained gluten and which ones did not and we will open the envelope containing the right answers. It’s a bit like the Muriel Bristol / Lady tasting tea experiment, although in this case I would like to keep the number of gluten-spiked porridge to a minimum.

    Please correct me if I’m wrong, but as I understand it, if I test 21 bowls of porridge, with 1 containing gluten, and I pick it out correctly, the p-value of this result would be 1/21 = 0.048. Using a binomial coefficient calculator, I calculate that by testing 7 bowls of porridge, with 2 of them containing gluten, assuming I again pick out both correctly, the p-value of that result would be the same: 1/C(7,2) = 1/21 = 0.048.

    My main question is:
    – What is the difference between these two setups (1-out-of-21 vs. 2-out-of-7)? It feels like the second setup would be ‘better’, give a ‘stronger’ result, but why?

    Two other questions I got after reading some of your blog:
    – How would you feel about calling either of them a significant result anyway, given the small sample sizes?
    – You wrote in an earlier post: One of the biggest blunders I see made by non-statisticians is the mistaken belief that if p is < 0.05 then the results are significant or meaningful. However, we can all feel that picking 1-out-of-2 tells less than picking 1-out-of-21. So what would be a correct way to refer to the p-value in the context of an experiment like this?

    Thank you!

    • First off, I am an empiricist. I strongly believe we should test everything. Your approach of testing if you are sensitive to the gluten porridge is quite admirable.

      WRT 1 out of 21 vs 2 out of 7, you are correct that the likelihood of picking the one or two correctly would occur by chance with a probability of 0.048 (about one time in twenty). Therefore, (assuming you ‘nailed it’) it would not likely be a guess. Think of it like picking a card out of your own deck of cards and a friend (over the phone) correctly picking the color of the card. If right, you wouldn’t be too impressed – a 50-50 chance of being right (0.50) – No Biggie. If they guessed the suit, one chance in four (0.25), you’d be more impressed. If they guessed the number (e.g., deuce) then you’d think something was up, as the chance of a random guess is 1 out of 13 (0.08). Etc. In the scientific literature, we traditionally use 0.05, like being told the color and number of the card correctly, as something non-random. This is statistical significance.

      On the other hand, you are not limited to this arbitrary cut-off. You could do 1 out of 10 (0.10, by chance, assuming you correctly picked it).

      Now, if you’ve read my other blogs, you would have noticed that I make a big differentiation between statistical significance and clinical significance. If you had a very mild, but perceptible, detection of a difference, then is it relevant? I would have also collected the reaction you had (e.g., 0 – none, 1 – very mild stomach pain, … , 9 – hospitalization). You might even want a finer set of gradients (0 to 100). A mean for the gluten porridge of 2.0 vs 0.5 for the non-gluten porridge when a top score is 100 and 10 is very mild might be detectable, but irrelevant.

      On a practical note, a blinded trial is one in which other factors do not give the treatment assignment away. That is, is the gluten porridge a different color? different texture? different taste? Also on a practical note, you mentioned what is called false positives – cases where you had a reaction to a gluten-free product. There will also be false negatives – cases where you think a porridge is gluten free but it actually contained gluten. This will be much more problematic for you, the tester, when almost all (95%) of the porridges are gluten free. Don’t be surprised if you said all the porridges were gluten free. To make it easier on yourself, to choose, I would have asked you to choose 3 gluten free and 3 gluten containing porridges. You also need to take into account that one sample of gluten might not be problematic, but two in recent succession might be much more challenging to your body. I had a lactose deficiency. I didn’t have a problem with a single slice of pizza, but got sick with two. OTOH, if you had a strong reaction, then a rare gluten containing porridge (e.g., 1/21) might make more sense. That is, if your clinical reaction to the 20 gluten free porridges had a mean of 2.0 with a standard deviation of 1.0 and your single gluten containing porridge reaction was 25, well you don’t need to compute statistics!

      To answer your questions succinctly:
      1/21 and 2/7 both give the same p-value. Statistically, both are equally ‘strong’ = < 0.05. It wouldn't be called a random guess if you perfectly nailed it. Each would be equally statistically significant. You would still need to collect data to see if it was clinically meaningful. Clinically meaningful means to me that the reaction to gluten is meaningfully large.
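      The chance of a perfect pick by pure guessing is just one over the number of ways to choose the gluten bowls. A minimal sketch of that arithmetic follows; the function name `chance_of_perfect_pick` is only illustrative, and the 3-out-of-6 design is included because it was suggested above.

```python
from math import comb

def chance_of_perfect_pick(total_bowls: int, gluten_bowls: int) -> float:
    """Probability of naming exactly the right gluten bowls by guessing alone."""
    return 1 / comb(total_bowls, gluten_bowls)

# (total bowls, gluten bowls) for the designs discussed in this thread.
for total, gluten in ((21, 1), (7, 2), (6, 3)):
    p = chance_of_perfect_pick(total, gluten)
    print(f"{gluten}-out-of-{total}: 1/{comb(total, gluten)} = {p:.3f}")
```

      Note that 3-out-of-6 gives 1/20, which sits exactly at 0.05 rather than below it.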

      • PeterN says:

        Thanks for your quick reply!

        Good point about using a scale for measuring the type of reaction. I have read about some scales that might fit this purpose (Celiac Symptom Index, Gastrointestinal Symptom Rating Scale, Psychological General Well-Being Index). I will look into this.

        Regarding the difference between statistical and clinical significance, I have read some of your blog entries on this and I’m not sure how I should see this. I understand your argument that effect size should be an important measure, for example in trials of new medicines. However, I’m not doing this experiment to later decide on bringing gluten back in my diet. Regardless of the results of this experiment, I will need to keep following a strict gluten-free diet anyway. The main purpose of the experiment is to see if I can recognize the consumption of trace amounts of gluten through external symptoms. I’m testing the test: the sensitivity of my body as a binary classification test for gluten-intake. In this way I wonder if the effect size of the symptoms is really that important for me.

        On the other hand, finding means of 2 vs. 0.5 on a scale of 1-100 would thus not always be irrelevant to me. I will be testing for trace amounts of gluten and, considering literature on this subject, I’m not expecting the most severe reactions. Is this where the effect size, or the standard deviations come into play? For example, if I can blindly score my symptoms 0.4, 0.4, 0.5, 0.6, 0.6 on the gluten-free bowls and 1.9, 2.1 on the gluten-containing bowls, this would be impressive enough, right, even if the scale can go up to 100?

        Regarding your suggestion of using 3 gluten free and 3 gluten containing porridges, you are right that I need to take into account the effect more gluten-intake will have on me. In celiac disease, it is not uncommon to have intestinal damage without external symptoms. This makes me want to keep the gluten exposure as low as possible. However, I do notice that you also prefer to move to a more balanced division of the two types of porridges. Given similar p-values, you would prefer a division of 50%/50% (3-out-of-6) over 5%/95% (1-out-of-21). This would imply that you would also prefer 2-out-of-7 (29%/71% division, same p-value) over 1-out-of-21, as was also my gut feeling. However, I still do not really understand how this preference is explained. Is there an actual measure regarding false negatives that is improved by getting relatively more gluten-containing bowls in?

        Finally, about the blindness of the test: I’m thinking of making the gluten porridge by spiking the gluten free porridge with a small amount of wheat porridge. So before doing the main experiment, I will need to design another test using my gluten-tolerating friends to ensure that the difference between the porridges cannot be noticed. I guess the setup will be similar to the main experiment, except that I can feed my friends as much gluten porridge as I want. 🙂

        • PeterN says:

          Hi Dr. Fleishman,

          Just an update, because I seem to be close to an answer: You referred to the ‘power’ of the ‘trial’, and it seems that this is indeed where the two tests (1-out-of-21 and 2-out-of-7) differ. I ran some simulations of both tests, calculating the power for n=1 and different levels of sensory difference (known as d-prime or delta, this could be symptoms as in my case, but also taste or any other signal). They both start out at 0.048 in case of no detectable difference, but the 2-out-of-7 test starts to outperform the 1-out-of-21 test as the difference rises:

          delta;p(2-out-of-7 correct);p(1-out-of-21 correct)

          • If I understand you correctly, delta (d-prime) is a theoretic measure of your sensitivity. It looks like 2-out-of-7 is uniformly more powerful than 1-out-of-21. In statisticianese, it is a more powerful experimental design. Did you attempt a ‘trial’ where the number of true positives are more equal? I believe you have either 2 gluten v 5 non-gluten or 1 gluten v 20 non-gluten. In general, you get greater power in a study when the Ns per group (gluten v non-gluten) are equal.
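            A simulation of this kind might look like the sketch below. It is not PeterN's actual code, which is not shown in the thread. It assumes symptom scores are normally distributed with s.d. 1, that gluten bowls are shifted upward by delta, and that the taster names the bowls with the highest scores as the gluten ones; the function name `chance_all_correct` is hypothetical.

```python
import random

def chance_all_correct(total: int, gluten: int, delta: float,
                       trials: int = 20_000, seed: int = 1) -> float:
    """Monte Carlo estimate of the chance that every gluten bowl is identified.

    Assumed decision rule: the `gluten` bowls with the highest symptom scores
    are named as the gluten bowls, so success means the lowest gluten score
    beats the highest gluten-free score.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        gluten_scores = [rng.gauss(delta, 1.0) for _ in range(gluten)]
        free_scores = [rng.gauss(0.0, 1.0) for _ in range(total - gluten)]
        if min(gluten_scores) > max(free_scores):
            hits += 1
    return hits / trials

# Illustrative sensitivity levels; delta = 0 means no detectable difference.
for delta in (0.0, 1.0, 2.0, 3.0):
    p7 = chance_all_correct(7, 2, delta)
    p21 = chance_all_correct(21, 1, delta)
    print(f"delta = {delta:.1f}: 2-out-of-7 = {p7:.3f}, 1-out-of-21 = {p21:.3f}")
```

            At delta = 0 both designs should come out near 1/21, and the same function can be called with (6, 3) to try a balanced design.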

  14. Chloe says:

    Hi Dr. Fleishman,

    I am doing my research with the title “Effects of cement dust exposure on respiratory health among the workers in a cement mill”. I am confused about which test I should use to find the association between these two variables, since I have only 19 respondents. Should I run a normality test? Is it appropriate to use an ANOVA test? Looking forward to your reply! Thank you!

    • I didn’t get much information from you regarding what you collected. Let me guess that ‘cement dust exposure’ is a continuous parameter and is something like years working at the mill. Respiratory health is likely to be a dichotomous variable: disease free v. some respiratory disease. I’m guessing, so please excuse any blunders.

      I’m not sure how ANOVA would be used.

      A more appropriate statistic would be a survival analysis. But that would just give you a simple single curve. For example, it would tell you of the risk for developing respiratory illness after 5 years of exposure. A survival analysis does not assume normality. This brings up a second and more appropriate issue, you need a comparison group to see if it is cement dust which is the relevant issue. I’m not sure what a good comparison group would be, but you could see if cement dust is more respiratory illness inducive than working at a paper factory or at a printer.

  15. Liv says:

    Hi Allen Fleishman, I have been looking for a simple citation that states how the small sample is naturally not statistically significant for a report I am writing where the sample was small and evidently the results not statistically significant using t-test. I would like to know if you have published or know of a source where I can find this same information but in a published journal. All of the journals that I find are quite complex for the level of explanation that I require for this school assignment. Unfortunately, I cannot use your blog here as a source. So, if you know of a source that I can use or how to look for it, I would appreciate it!

    Thanks in advance!

    • A colleague suggested this: http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0020124


      There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

  16. Feliz says:

    Hi Dr.Fleishman, I am really glad that I stumbled upon your blog. I am still in the process of learning the art of stats ( definitely not a big fan of it) and would love to hear your opinion about one of the problems I’m going through right now.

    -If I obtain a p-value of 0.022 for both n=50 and n=150 using the same correlational analysis, what would be the relative size of obtained p-value for n=150 compared to the one for n=50?

    I thought with a larger sample size, the p-value will be smaller and significant at the same time? Why is the p-value same for both n=50 and n=150? and just another silly question, what do you mean by a relative size? and why would that be a relative size when it’s stated that the obtained p-value is same for both n=50,n=150?

    I hope my writing is not too confusing for you. Many thanks in advance for your time and effort. Hope to hear from you soon!

    • You are doing a correlation analysis and are only focusing on p=0.022? If you were looking at a new car (say a conventional gasoline engine) and asked about the gas mileage, would you be satisfied with an answer, ‘Yes, the car uses gas.’?? [In statisticianese, the amount of gas used is greater than zero.] No, you’d want the mpg, and perhaps some type of range for different conditions.

      I should not have said relative size, but effect size. Effect size for correlations is the correlation itself (or r squared).

      Simple truth: In correlation analysis, what you should focus on is correlation. If you ran two correlation studies, one in an N=50 and a second in N=150, and got a p-value of 0.022 in each, then the observed correlations in the two studies would be different. You never mentioned what the r was, so I estimated it from your N and p-value to be 0.19, when N is large (N=150); and r of 0.32, when N was small (N=50). Like I said elsewhere, statisticians love squaring things, so it might be interesting to do that with r and get r squared. In these studies, the large study r of 0.19, squared becomes r**2 of 0.0361, or 3.61%. For the smaller study, the r of 0.32 squared becomes 0.1024 or 10.24%. Why did I square it? Well, the squared correlation is basically how much (variance) is explainable by using one variable to predict the second. You didn’t say what you were correlating, so let me make things up. Say you were predicting potency of a batch given the temperature of your process. The higher the temperature the greater the potency. In the first experiment (N=50), you found that the temperature was able to predict roughly 10% (10.24%) of the potency variability. While not stellar, it is still useful. In the second experiment (N=150) you were only able to predict about 4% of the variance. That is less than half the predictive ability of the first study.

      The p-value only asks are you able to predict anything? With a p of 0.022, which is less than 0.05, your experiment certainly is doing better than pure chance, in both cases. So, in both studies, you demonstrated that temperature of the process definitely improves the potency of the batch. However, in the small N study the width of the CI for the correlation is pretty large. With N=50, and an observed r of 0.32, the 95% CI is (0.05 to 0.55). With N=150, and an observed r of 0.19, the 95% CI is (0.03 to 0.34). Yes, in both cases, the correlation is quite likely not zero (an r of at least 0.05 for the small sample and 0.03 for the large sample). However, for the best case, the correlation could be as high as 0.55 for the small N study and 0.34 for the large N study. To return to r squared, the small N study could see up to 30% of the variability of potency explained by heat. In the large N study, you could see up to 12% of the variability explained by heat.
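      The confidence intervals above can be reproduced approximately with the Fisher z transformation. This is a sketch under that assumption (the reply does not say which method was used), and the function name `correlation_ci` is invented for the example.

```python
import math
from statistics import NormalDist

def correlation_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Approximate CI for a Pearson correlation using the Fisher z transformation."""
    z = math.atanh(r)                  # transform r to the z scale
    se = 1 / math.sqrt(n - 3)          # approximate standard error on that scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# (r, N) pairs taken from the reply above.
for r, n in ((0.32, 50), (0.19, 150)):
    low, high = correlation_ci(r, n)
    print(f"N = {n:>3}, r = {r:.2f}: 95% CI about ({low:.2f} to {high:.2f}); "
          f"upper bound squared = {high * high:.0%}")
```

      The intervals are not symmetric around r because the back-transformation from the z scale compresses values as they approach 1.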

  17. Feliz says:

    Hi Dr.Fleishman

    Thank you for your quick reply. Your example for the gas mileage is definitely a good one. Are you trying to say that by using correlational analysis, I should be focusing more on the correlation value instead of the p-value? and I am sorry about the question, I read my question again and found out that I interpreted it the wrong way. Gonna restructure my question again.

    An obtained p value for a correlation is p=0.012 for a sample size of n=47. The same kind of correlational analysis is used again for sample of n=94. The correlation value for both is r=-0.23. What would be the relative size of obtained p-value for the sample n=94, compared to the one n=47? and why so?

    I think this is the right way to interpret the question . Sorry for the confusion before. From what I read, I think the p-value for the sample n=94 would be smaller compared to the first hypothesis test and I think it has something to do with the sample size. Is that right? or is there any other reason to explain this situation?

    Thanks heaps!

    • I am unequivocally saying that the correlation is the KEY statistic in an analysis of correlations.

      Summary: I believe you made an error in the N=47 case, check your results.

      I threw the r = -0.23 into a web-based program. I assumed a fairly traditional 2-tailed hypothesis. I got the following results:

      r = -0.23
              2-tailed     1-tailed
      N=47    p = 0.120    p = 0.060
      N=94    p = 0.026    p = 0.013

      As I’ve said elsewhere, the t-test (which produces the p-value) is a computation based on N (see below). The key for the t-test is the numerator, r*sqrt(N-2). The denominator will be near 1.0 for low correlations but will increase the ‘potency’ of r when it is large, so it doesn’t matter much in your example. In your study, N (your subjects) and the variables you examine are your experiment. The correlation is the result. The p-value is a mathematical computation based on your N and the resultant r. To repeat myself, the p-value is NOT an experimental relationship, like the r, but is a computation based on r and the square-root of N (actually N-2). It is IMPOSSIBLE for the two Ns, given the same r, to have the same p-value. Therefore for N=47, an error was made in either the computation, or the transcription of the results, or the correlation was much higher than 0.23. I suggest you look very carefully at your computer output at N, r, and p for the N=47 case.

      t = [r * sqrt(N-2)] / [sqrt(1-r*r)]
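
As a rough check of the table above, the t formula can be turned into a p-value with the standard library alone. This is a sketch under assumptions: the helper names are hypothetical, and the Student t tail area is obtained by numerically integrating the t density (Simpson's rule) rather than by whatever the web-based program used.

```python
import math

def t_from_r(r, n):
    """t statistic for a Pearson correlation r observed in n subjects."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
    return math.exp(log_norm) / math.sqrt(df * math.pi) * (1.0 + x * x / df) ** (-(df + 1) / 2.0)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p: 1 minus the area between -|t| and +|t| (Simpson's rule, even step count)."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    half_central_area = total * h / 3.0
    return max(0.0, 1.0 - 2.0 * half_central_area)

for n in (47, 94):
    t = t_from_r(-0.23, n)
    p2 = two_tailed_p(t, n - 2)
    print(f"N={n}: t={t:.2f}, 2-tailed p={p2:.3f}, 1-tailed p={p2 / 2:.3f}")
```

The same r gives a different t, and therefore a different p, as soon as N changes.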

      • Feliz says:

        Hi Dr.Fleishman,

        It’s me again. I have double checked my table for n=47 like what you have mentioned by using SPSS. For n=47, Pearson correlation is r=-.23, p=0.12.

        I guess what I am trying to say is, let’s say, we obtain a p-value (X) for a sample size of n=50 and a same correlational analysis is used again in the second sample n=180. The same correlation value is obtained magically! What would be the relative size of the p-value for the sample n=180, compared to the p-value we obtained when n=50? and why is it so? Can I say that the sample size plays a part?

        Thank you.:)

          • I’m glad you caught your blunder. It wasn’t 0.012 but 0.12. Zeroes, which mean nothing, can be quite pesky.

          In your ‘magical’ study, you have the same correlation with different sample sizes (50 and 180). The t-tests would differ by a fixed ratio. Looking at the equation for the t (see the equation in my last comment), they would differ by a factor of the square root of N-2, so you could take a ratio of sqrt(180-2)/sqrt(50-2) and the t for the larger study would be 1.93 times larger. So, a simple answer is that the t-test would be close to double. Unfortunately the p-value needs to be computed from the t-distribution. Even if we used a reasonable approximation, like a normal distribution, one cannot make a general statement on p-values as it is a ‘bell shaped distribution’. A 1.93 fold increase in a t-test of 0.2 would move the one-sided p from 0.42 to 0.35, while a 1.93 fold increase in a t-test of 2.0 would move the one-sided p from 0.023 to 0.00006.
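
The arithmetic in this reply can be checked in a few lines. This sketch uses the normal approximation the reply itself mentions; the function names are made up for illustration.

```python
import math
from statistics import NormalDist

def t_ratio(n_large, n_small):
    """Factor by which t grows when the same r is observed with a larger N."""
    return math.sqrt(n_large - 2) / math.sqrt(n_small - 2)

def one_sided_p_normal(t):
    """One-sided p-value using the normal approximation to the t-distribution."""
    return 1.0 - NormalDist().cdf(t)

ratio = t_ratio(180, 50)
print(f"ratio of t statistics: {ratio:.2f}")
for t in (0.2, 2.0):
    before = one_sided_p_normal(t)
    after = one_sided_p_normal(t * ratio)
    print(f"t={t}: one-sided p goes from {before:.5f} to {after:.5f}")
```

The ratio of the t statistics is fixed, but the change in p depends on where on the bell curve you start.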

          The answer to why is the t-test equation I gave before, especially the numerator.

          As I have said repeatedly in my blogs, sample size plays a major role in t-tests, hence p-values. Please examine the table in Blog 1, where I had a fixed effect size (think of your ‘magical’ single correlation) and varied N from n=4 to approximately N=4,000. The p-values varied from 0.90 to 0.00001. Given any observed experimental result (like r=-0.23), you can get almost any p-value based on the sample size. To return to r=0.23, if N=3 then p = 0.85, if N=300 then p < 0.0001. Yes, sample size plays a part. Given a fixed experimental effect (e.g., r or treatment difference), then the larger the N the smaller the p.

  18. Kevin Schulte says:

    Quick housekeeping question. When you use the mathematical symbol N for the sample size, what mathematical symbol do you use for the population?

    • N is traditionally the sample size. For the population? Most of the time the population size is symbolically represented by infinity. In Classical (I will not comment on Bayesian) statistics, one theoretically samples from an infinite population, e.g., an infinite population of patients or subjects. When we consider an effect random (e.g., investigators), they are also considered as sampled from an infinite population of investigators. Those effects treated as ‘fixed’ (e.g., treatment – active and placebo) have a set and known number (e.g., 2). For those areas of statistics where the population is finite (e.g., sampling from the NYC public schools, where the number of public schools is a finite number), we might use ‘n’ and ‘N’ for the sample and population size, respectively. Sometimes authors will use capital letters for populations and lowercase for samples, sometimes subscripts or superscripts, sometimes other letters, sometimes Greek for populations. The problem is that the capital letter for the Greek nu looks identical to our ‘N’. Others symbolically define sample size as whimsy strikes. When in doubt, most articles identify what the symbols mean (e.g., “where N is the sample size”). As long as you define it in your paper, feel free to use whatever symbol you’d like.

  19. Jane says:

    Hello Dr. Fleischman,

    I am an ecologist studying predator-prey interactions. Currently, I am running an analysis using a mixed effect model. My mixed effect is for an encounter ID (in which predators encounter prey) in which there are a varying number of observations.

    I am curious about the effect of my sample size (encounters=44, observations=103). Here, I found that the predictor variables were all significant (including an interaction between two of them) (p<0.05) and that their effect sizes were biologically significant and made sense. However, I am still concerned with my ability to justify this small sample size. I am also curious, based on the discussions about the decline effect, whether I found effects that may not be real?

    It is clearly possible to get false results based on small sample sizes, even if you found them to be statistically significant. But how do we know when we have a large enough sample size to deal with this problem? In essence, how do you go about empirically defending your sample size, thereby justifying that the effects you found are real?

    Power analyses seem to estimate the number of samples that you would need in order to increase the probability that you achieve statistical significance in your experiment, or your ability to find a difference when a real difference exists. But they do not deal with, at least as I understand it, the biological reality of the significance of those effects. For example, you may find a real difference statistically, but that effect may disappear as you continue to increase sample size correct?

    • One thing I keep coming back to is the confidence interval. If you are looking at a mean difference (not clear from your question) then a significant effect means that the difference is likely not zero. The CI will give you the lowest reasonable value (lower CI) for the difference as well as the largest reasonable value (upper CI). When the sample size is small, the CI will be relatively wide. If you still found the effect to be significant, it means that the size of the difference is large enough to compensate for the larger width (also, the largest plausible value for the mean difference is likely to be quite large; see the blog entry). If your general linear model is actually regression in nature then the predictor model weights will be similarly non-zero and could be quite large. 103 observations is not small in most areas. I can’t comment about your ecological domain. Nevertheless, the mean differences (or prediction model weights) all have non-zero utility.

      Yes, you can get false positive results with small samples. Or moderate samples. Or large samples. When there is truly no effect, they happen 5% of the time (at an alpha of 0.05), across the board.
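
A small simulation can illustrate the "across the board" point. This is only a sketch under simplifying assumptions that are not in the reply: the data are standard normal with a true mean of zero, the standard deviation is treated as known (so a z-test is used instead of a t-test), and the function name, number of simulations and seed are arbitrary.

```python
import math
import random

def false_positive_rate(n, sims=4000, seed=0, z_crit=1.96):
    """Share of simulated null studies of size n that reach two-sided p < 0.05."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        # One simulated study: n observations with a true mean of zero
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)      # known sigma of 1, so this is a z statistic
        if abs(z) > z_crit:
            rejections += 1
    return rejections / sims

for n in (5, 20, 100):
    print(f"N={n:>3}: false positive rate = {false_positive_rate(n):.3f}")
```

Each rate should land near 0.05 whatever the N, since alpha, not the sample size, sets the false positive rate when the null is true.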

      I cannot agree with your last question. If you found a statistically significant effect, [no one should ever, ever call a statistically significant effect a ‘real difference statistically’] then it isn’t likely to disappear if you replicate the study. At least, 95% of the time it will not disappear.

      As I have said before, don’t focus on the p-value or the sample size. Focus on the CI. Look at the lower limit, if it is greater than zero then it is statistically significant. Also see if it is meaningfully larger than zero. If you see that it is 0.001 (on most d.v. scales) even though it is non-zero, it might not be realistically greater than zero and you have an unimportant effect. If so, redo the study with a larger N. You will need to understand the nature of your dependent variable. Then look at the simple mean difference. Is it clinically (or ecologically) meaningful? Finally look at the upper end of the CI. Is it large enough to be exciting and clinically significant?

      One last comment. You mentioned that the interaction(s?) were also significant. You might consider different transformations of the data (e.g., log) and see if it disappears. God never said that the true scale S/HE uses is a natural counting one. S/HE might think in a logarithmic scale. Or that your effect is multiplicative not additive in nature.

  20. Fleming says:

    Good evening Dr Allen,
    I am a PhD student and a complete novice in statistics. I have a sample size of n=6 and several other variables. For example, I have a particular variable, the gravimetric water content (GWC), and several soil indices which I am hoping would predict the amount of water content present in soil, but my n=6…
    I do understand that I need to test for normality, but for n=6 I do not understand what my first step should be towards obtaining the best index as a predictor of soil water content using statistics.
    I have done the following:
    (1) Checking for outliers using a box plot
    (2) A QQ plot, but with n=6… I don’t think any of these are reliable.
    And would regression or correlation be the way to go about it? PLEASE HELP!!!!!!!!
    Thank you so much, I wait in anticipation for your reply.

    • First, you are doing the right thing by first looking at the data. If you see numbers like 0.1, 0.15, 0.18, 0.2 and 300.0, then the data have outliers which need to be addressed (e.g., you could attempt a log transform). I personally am not familiar with the gravimetric water content (GWC), so I don’t know what is routinely done (ask your advisor?). One problem with your question (and you are not the first one to do this): What questions of the data do you want to answer? It sounds like a simple and naive question, but often only gets asked when you are ready to write up your results – i.e., far too late. It sounds like you are attempting to predict something. I think you will be doing a regression analysis. This approach is based on correlations.

      You have GWC and several other variables with an N of 6. Well, the obvious first step would be to see if GWC is linearly related to each of them (and them to each other). In that case correlations are appropriate. However to compute a correlation (regression) the analysis would first need to compute the mean GWC and the mean of the (first of many) secondary parameter(s). But all the computing of these preliminary statistics ‘uses’ up the data. The real number of data you have for a correlation (called in statistics – degrees of freedom) is N-2 = 4. So you don’t have much data – 4 d.f. for a correlation. To give a very quick rationale for this, imagine the meaningfulness of trying to determine the adequacy of a straight line with 2 data points. You would always be able to get a perfect fit. So with 2 points a correlation is meaningless. In your case you only have 4 useful data points. To compound the issue, you mentioned ‘several other variables’. Let’s say you have GWC and three others, variables B, C and D. In total you will have n(n-1)/2 unique correlations, where n is the number of variables. In this example with 4 variables, you’d have 6 correlations. In other words, you have 6 correlations, each based on 4 d.f. Sorry, nowhere near enough.
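
The counting in this paragraph is easy to verify. A minimal sketch, using the four variable names from the example (B, C and D are the reply's own placeholders):

```python
from itertools import combinations

n_obs = 6                                   # observations per variable
variables = ["GWC", "B", "C", "D"]

# Every unordered pair of variables gives one unique correlation: n(n-1)/2
pairs = list(combinations(variables, 2))
df_per_correlation = n_obs - 2              # two means are estimated first

print(f"{len(pairs)} correlations: {pairs}")
print(f"degrees of freedom for each: {df_per_correlation}")
```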

      I’ve never been satisfied with the justification for this ‘rule of thumb’ but most statisticians recommend N for a regression analysis to be greater than 10*n to 20*n (e.g., with 4 predictor variables you would need 40 to 80 observations).

      To return to your question ‘I do understand that I need to test for normality but…’. With an N of 6, normality tests would be useless. Totally useless. These tests, often non-parametric themselves, have very, very low power. I would almost guarantee that your data would be n.s. In this case, n.s. would mean you lack adequate sample size to be capable of answering the question of whether the data is non-normally distributed. N.s. never means you could accept the null hypothesis (i.e., the data is normally distributed). That is especially true in this case. In the blog you read, I said that a significant p-value in small samples often means that something big may be seen. However the converse is: When you have a non-significant p-value (e.g., in testing normality), it means it could be normal, it could be non-normal – you just don’t know with the inadequate study. See my blog #8 on power analysis.

      Recommendations: 0) You learned how to collect 6 observations. Most would call this an exploratory study. You learned valuable techniques. You probably learned a lot of what not to do and some of how to do it correctly. That’s a valuable and necessary lesson for a PhD student. Now that you learned the technique, go back and collect real data. I cannot believe that data collection for water content and soil characteristics is anything more than tedious and time-consuming. Grunt work is part of the dues paid for a PhD. The data you currently have might hint at the data’s distribution. 1) If the data for a parameter is non-normal, try to normalize it. Ask your advisor or look at the literature. You should see what people routinely do. For example, if they talk about geometric means (a simple mean of log transformed data transformed back into the original metric), then they believed the data is log-normally distributed. At a minimum, the best correlations are only obtainable when all the variables have the same distributions (e.g., all are normal, or all have a skew of -1 and kurtosis of 2.0). 2) Look at the literature and see what effect sizes (e.g., correlations among the transformed parameters) you expect to see. 3) Do a power analysis and determine the correct sample size. 4) Collect an adequate sample. 5) Take four aspirin and call me in the morning. – Dr. Fleishman
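
The geometric mean mentioned in recommendation 1 is simply the mean of the logged data, transformed back. A minimal sketch; the five values are the illustrative outlier-prone numbers from earlier in this reply, not real GWC data.

```python
import math
import statistics

values = [0.1, 0.15, 0.18, 0.2, 300.0]      # illustrative numbers only

arithmetic_mean = statistics.mean(values)

# Geometric mean: average on the log scale, then back-transform
log_values = [math.log(v) for v in values]
geometric_mean = math.exp(statistics.mean(log_values))

print(f"arithmetic mean = {arithmetic_mean:.3f}")
print(f"geometric mean  = {geometric_mean:.3f}")
```

The single value of 300 dominates the arithmetic mean, while the geometric mean stays close to the bulk of the data.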

  21. Fleming says:

    Thank you very much, I am more than grateful for your detailed explanation. However, the aim of the experiment is to predict drought stress in plants. Unfortunately, the data I have now was acquired in summer 2013. The experiment was not put in place this summer, therefore I have to work with what I have got.
    For example:
    1. A particular plant was subjected to 6 different treatments. Each treatment gives a degree of variability in the amount of water accessible to the plant. Therefore n=6 for each variable on a single plant. Several indices were derived; for each index n=6. As you said, it is not adequate to run a regression analysis, for example GWC against Index 1, 2, 3, etc., or better still, GWC against Index 1… If this is applicable, can P<0.05 or P<0.1 be used as a criterion for choosing the most significant index which best predicts GWC?
    In most of the literature, regression and correlation are often used as the preferred statistical methods of analysis. In this case, the only difference is the sample size.
    2. Another example: a plot of all the 6 GWC values obtained for the 6 treatments gives the following linear relationship… y = -0.6438x + 67.9, R² = 0.1059
    Regression GWC vs Index 1 below…

    Regression Statistics
    Multiple R           0.738486117
    R Square             0.545361744
    Adjusted R Square    0.431702181
    Standard Error       2.789792161
    Observations         6

                 df    SS       MS       F       Sign F
    Regression   1     37.34    37.34    4.79    0.09
    Residual     4     31.13    7.78
    Total        5     68.47

                 Coeff    Stand Err    t Stat    P-value    Lower 95%    Upper 95%    Lower 95%
    Intercept    48.49    7.91         6.12      0.003      26.51        70.46        26.51
    index 1      97.97    44.72        2.19      0.09       -26.20       222.16       -26.23

    Please can you explain why Index 1 cannot be significant at p< 0.1 ?

    Thank you very much for your great help and assistance.

  22. Fleming says:

    Another example, a plot of all the 6 GWC values obtained for 6 treatments gives the following linear relationship… y = -0.6438x + 67.9, R² = 0.1059

    Please can you explain why Index 1 cannot be significant at p< 0.1 ?

    • [Note: Most organizations, like publications or dissertation committees, use 0.05 two-sided as the critical alpha level, not 0.10. Furthermore, when you have multiple predictors, unless you state that one (and only one) is primary in the protocol, you have the multiple testing issue. This means that if you had 2 (k) predictors, you are not dealing with a p of 0.10 but 0.20 (k*0.10) (approximately). See blog #13 Multiple Observations and Statistical ‘Cheapies’ – specifically the section on multiple dependent variables.

      I discuss regression and correlation often interchangeably as they are identical analyses statistically (i.e., a good correlation program will also output the regression equation and a good regression program will also output the correlation – the outputs for the two programs will be identical to each other). Furthermore, your analysis is likely an issue of multiple regression/multiple correlation. This multiple linear regression aspect will not be expounded here.]
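
The "approximately k*0.10" remark in the note above can be made concrete. A minimal sketch, assuming the k tests are independent (an assumption not stated in the note); the function names are made up for illustration.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive among k independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** k

def bonferroni_bound(alpha, k):
    """Simple upper bound k*alpha, which holds whether or not the tests are independent."""
    return min(1.0, k * alpha)

for k in (1, 2, 4):
    print(f"k={k}: exact (independent) = {familywise_error(0.10, k):.3f}, "
          f"k*alpha bound = {bonferroni_bound(0.10, k):.2f}")
```

With 2 predictors tested at 0.10 each, the familywise rate is 0.19, close to the 0.20 quoted.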

      SUMMARY: When N is small, the range of the effect size (in this case r) is very wide. Due to this width, a correlation of +0.33 could be anything from -0.66 to +0.90. A value of zero cannot be excluded. A sample size of 6 is inadequate for testing typical null hypothesis for anything but the largest effect sizes for regression (=correlation).

      You asked me to explain why your parameter cannot be significant at p less than 0.1. Why? N=6. When N is small, only the very, very largest relationship has a CI which does not include zero (i.e., is statistically significant). You have an R-squared of 0.1059 (equivalent to an r of 0.33). That is a moderate effect size. If you planned a trial with 80% power to be significant at the 0.05 two-sided alpha level, you would need 70 observations – you had 6. With an alpha level of 0.10, with the true effect size equal to what you happened to see in your sample, to barely get statistical significance (power of 50% – half the time your study would fail), you’d need 27 observations. With an N of 6, alpha of 0.10, and just barely getting statistical significance, the true effect size R-squared would need to go from 0.10 to 0.53 – over 5 times larger. Please read and re-read and re-read blogs 1 through 5A and 8. I strongly suggest you abandon any hope of proving statistical significance when N is very, very small, if you expect anything less than stellar results (huge effect sizes). In this case, use your study as an exploratory one. Just describe your findings, perhaps include point estimates and confidence intervals then plan on, yes plan on, not presenting any inferential statistics (p-values).
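
The sample sizes of 70 and 27 quoted above can be approximated with the usual Fisher z formula for a correlation. This is a sketch under that assumption; the reply does not say which power program was used, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N to detect a true correlation r (two-sided test, Fisher z method)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

print(n_for_correlation(0.33, alpha=0.05, power=0.80))  # 80% power, alpha 0.05
print(n_for_correlation(0.33, alpha=0.10, power=0.50))  # barely significant, alpha 0.10
```

Either way, the required N is far above the 6 observations available.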

      I feel bad you wasted your summer collecting useless data, but the first step should have been to write a protocol, and the second step to do a power analysis. Your advisor should have mandated that second step. They should have stopped you cold. If they were clueless, I humbly suggest you switch advisors. They were incompetent. If they strongly advised against your study or you never consulted with them, then chalk it up to youthful ignorance and a valuable lesson learned.

      Let me put it visually. If you just glanced (one-tenth of a second exposure) at a candle 20 miles away, can you unequivocally say you’d see it? That is the phenomenon of inadequate sample size. The candle is real. It gives off light. But with an inadequate data collection you’d never be capable of unequivocally saying you saw it. Now if it were an atomic bomb, you’d have no problem. A candle is like an R-squared of 0.10, an atomic bomb is like an R-squared of 0.90. OTOH, if you had a reasonably adequate telescope and enough time to locate it, you’d be able to see the candle. A glance of one-tenth of a second is like N=6 and a telescope with ample time is like an N of 1,000.

      Your quick glance at the plant hydrology thinks it might have seen a moderate effect (r squared of 0.1059). However, with this small N and the very small light, it could have been your eyes playing tricks on you. You can’t unequivocally say you saw something. The C.I. on the correlation is quite large. I went to http://www.how2stats.net/2011/09/confidence-intervals-for-correlations.html (I did not validate, nor do I advocate this website – it was just the first webpage to compute a CI) and computed the CI for r=0.33. The plausible values (95% CI) for your correlation range from -0.66 to +0.90. With your very low N, your relationship could be quite high (+0.90). OTOH a value of zero cannot be dismissed (i.e., ns), nor could a large negative value of -0.66.

      Looking at your other results below, I believe you had an R-squared of 0.54, not 0.1059. This is a very, very large effect. It was significant at 0.09 (i.e., p less than 0.10, but greater than 0.05). I’d still be hesitant about publishing the results, especially with a non-standard alpha level and multiple predictors. See my final summary for blog 12 and my summary of the comments of others for significant p-values in small sample sizes.

  23. Frances Orton says:

    Hi Alan,

    This is a fascinating blog. I also have a problem with small n numbers. In my case, I am testing differences between 4 groups, with each one containing n = 5-9. The p value I get with an ANOVA is 0.06 and the post hoc test (Fisher’s LSD test) finds p values < 0.05 between control and 2 of the treatment groups. Mean values in my treated groups are ~x2 higher than in the control group (data are normally distributed according to normality tests, though reading your blog above – maybe I cannot really test for normality with this size n number?). I have good reason for "believing" these differences are real from a biological perspective, due to other endpoints that I have measured which corroborate them. But, how should I report this? Or, is there an alternative test that would be appropriate for studies that are underpowered? Many Thanks, Frankie

    • With such small Ns and 3 active treatments, I would guess this is an exploratory study, early Phase II? From what you described, two of the active treatments were ‘significantly’ different from control. My conclusion: There is reason to believe there is some evidence of (an) active treatment difference(s). My suggestion is to report the two fold improvement of the active groups, and parenthetically remark that the results were significant. That is, emphasize the magnitude of the treatment difference. Was there much overlap in the individual data for the active and control groups? Report that too. I recently saw a simple plot of placebo on the top of a number line and active on the bottom, with every patient’s change from baseline (sorted) on the number line. The study was extraordinarily poor [unblinded and MASSIVE (and treatment related) dropout]. But the drug was approved. A simple plot like that could convince anyone. It helps you that there is supportive evidence for the effect (other parameters?). The post hoc tests would be completely justifiable if you pre-specified them in your protocol/SAP as key hypotheses. I am not sure if the LSD is strictly appropriate given the overall ANOVA did not achieve statistical significance (p > 0.05). In the future, I would have recommended stating in the protocol one comparison as the key. Alternatively, use the adjusted Bonferroni for the two actives v control as the key. Note: there is seldom a time where all three active treatments are equally important (one could be relegated as a secondary hypothesis along with the other supportive parameters).

      My personal bias is that small N, exploratory studies should emphasize descriptive statistics and only informally and parenthetically (i.e., gloss over) statistical significance. With such small Ns, I would have stated in the protocol that ‘statistical testing will be done, however due to lack of power, will not be emphasized’ or ‘will be reported as descriptive statistics’.

      As I said elsewhere, the issue of normality is not a pressing issue (Central Limit Theorem), although I would first eyeball the data (especially the control group) for outliers. A formal test for non-normality, with such small Ns, is grossly underpowered.

      “An alternative test that would be appropriate for studies that are underpowered”: Replicate the trial.

  24. JSmooth says:

    Dr. Fleishman,

    I know most of the talk on here has been about t-tests, but I’m wondering if you could help with a two-way ANOVA question I have?
    I just found your blog today looking for answers for a senior thesis I’m working on (social sciences). I had a large overall sample size (almost 150). I had three conditions, one control and two experimental. They still had decent sizes and I got some interesting (and significant) results. The problem is this: I had one dichotomous variable (an experience) on which all the men but one were positive. I wanted to do a two-way ANOVA to examine interaction effects between this variable and the original IV. I thought I should examine only the women since that is where all the variability was in this variable. When I did the two-way ANOVA, some of the groups (for example, those in condition 2 who had had the experience) were quite small–the smallest was 5. However, I found a very interesting interaction effect, and it was statistically significant (p < .05) and had a decent effect size. The interaction effect only happened on the items that I had expected it to happen on, which really validated my suspicion. My question is this: in a two-way ANOVA (I'm using SPSS, FYI), is a small group in one cell a big problem? Does the ANOVA measure the effects of each IV separately, or are my results as weak as my weakest subgroup?

    • I am not sure I understand your question and the analysis you have done.

      First, ANOVA analyzes means and assumes variability of the data (i.e., at least interval level data and variability in each cell of the design). Dichotomous data should not be analyzed by an ANOVA. When the dependent variable is dichotomous, and you have an experiment, then the data should be analyzed by a logistic regression. A true experiment has the scientist control the independent variables and observes a dependent variable.

      Next you mentioned that the males had almost (all but one) identical responses and you analyzed the females. Was gender a factor in the two-way model? If so, then drop gender and analyze the females alone using a model which can handle dichotomous and polychotomous data (e.g., Chi-Square or Fisher’s Exact test). And analyze the males alone.

      With regard to a small N (e.g., N/cell = 5), the Chi-Square is not the best test, I would recommend the Fisher’s Exact test. SPSS can easily analyze that.

      Unless I am wrong, the data comes down to 6 numbers for the females and 6 for the males. For the males, I believe all but 1 of them were positive on one row and only 1 was negative on the other row. That is, for the males, two ‘treatments’ had zero variability and one ‘treatment’ had almost no variability. For the females in one ‘experimental group’/experience, there was only N=5. In ANOVA, the independent variable is CONTROLLED by the experimenter. YOU assign the subjects to the experimental group. Why would you have only N=5 for the experimental group? I’d guess this wasn’t an experiment, but an observational study. In that case, correlation comes to mind.

      I’m afraid any additional answer I can give to you would be inappropriate without a much greater understanding of your ‘study’.

      Let me share a story of my first consulting job. I was a gifted undergraduate student and a much older, PhD student asked me to analyze many two-way ANOVAs. In those days the ANOVA had to have equal Ns per cell, or at least one had to assume that the Ns were approximately equal. They asked me to brute force the ANOVA to compute the answer. I told the computer to analyze the results and ignore errors, by telling the mainframe to allow ten thousand errors (rather than stop after one). I handed the PhD student the results. The Ns were not equal. The two uncorrelated ‘independent’ factors were highly correlated. When the ANOVA attempted to compute the answers, due to the high correlation, the ANOVA produced negative sum of squares and when the computer attempted to compute the square root of the negative number it ‘did not compute’ – try computing the square root of -4 on your calculator. Like I said, I handed the graduate student the output and got paid. He later told me that what he really wanted to do was a factor analysis, didn’t know how, so he did an ANOVA instead. Now I will insist on understanding what the purpose of a client’s study is, I will ask to see their protocol and their data, then I will give an answer. Forty years of experience allows me to not hand a client output which is dead wrong, an analysis of imaginary numbers [the square root of a negative number is called an imaginary number]. His p-values were not real numbers, in either the logical or mathematical sense of the word.

      You might be doing a factor analysis, but only know about ANOVA, and getting inappropriate results. I strongly suggest speaking to your senior adviser. Alternatively, most schools have a resident statistician available for consulting to the students/faculty. At the very least, speak to the person who taught you statistics. Make sure you can explain to them your study and show them the protocol. I wouldn’t take the output to them on the first meeting, although you should ask them if ANOVA is an appropriate statistical technique.

      BTW, N=150 is moderate. An N of 150,000 is large. OTOH, for a senior thesis, it might be considered large in comparison to other theses.

  25. JSmooth says:

    Thanks for the reply!

    I should clarify what I did. My original experiment had three groups to which we did do random assignment. We asked the participants several questions to see if they would answer differently depending on which group (treatment) they were in. We did ANOVA with that, as is appropriate (as I understand it).

    Now, there is another variable I want to explore. Let’s just say it’s marriage for simplicity’s sake. Since all but one of the males were married, I excluded the males from my next analysis (I didn’t want the non-married group being almost completely female while the married group included males). So I ended up with about 60 non-married females and 23 married. I wanted to see if being married had an effect on one of the treatment variables, but not the others (interaction effects). So I did a two-way ANOVA, in SPSS I did univariate, with the question as the DV, and the group (the original three) and marital status as the IV’s. Is two-way ANOVA not appropriate with these IV’s (one dichotomous and another with three conditions)? If not, how would I compare interaction effects with these two variables?

    While it was a planned experiment, we did a random sample and ended up with only 23 married women. Since these were further split into the three groups (and the 60 non-married ones were also split into the same three groups), the married group was sparse across groups (as I said, there were 5 married women in one of the three treatment conditions). This low number is why I was concerned about the analysis, I didn’t know if it precluded my results from being notable. My advisor was actually the one concerned about this, which is why I came here.

    • In a true experiment, you could have taken all of your females (ignoring males) and randomized the non-married into the three groups, so there would be 20 in each; then randomized the married into the three groups so there would be 7, 8, and 8 in the three groups. This is called stratified random sampling and, in the biostatistical world, would be totally analogous to having a separate randomization for each site (center) in a trial. We frequently have some sites with smaller Ns than others. But it appears that you took all people, ignored marital status, and assigned them to the three groups, with chance giving you one group with 5 married females. However, since you did have ‘random assignment’, there is no reason to believe that the slightly smaller N in that cell is anything more than chance.

      In sum, I see no issue in doing a two-way ANOVA within the females with factors of treatment condition and marital status.

      OTOH, (1) With three treatment groups, you need to tease out why you see the interaction. This is where post-hoc analyses come into play. In post hoc comparisons, you could examine the comparison of Group 1 vs Group 2 and how it interacts with married vs single. This would be a single comparison, a t-test. There could be a number of other comparisons. For example, a) Group 1 vs Group 3, b) Group 2 vs Group 3, c) Groups (1+2)/2 vs Group 3, d) Groups (1+3)/2 vs Group 2, e) Groups (2+3)/2 vs Group 1, f) ((Group 1)/3) + (2*(Group 2)/3) vs Group 3, etc. I can’t go into the logic here, but only comparison (c) is asking a question unrelated (called an orthogonal comparison) to the Group 1 vs 2 question. Such orthogonal comparisons allow you to take the interaction variance and separate it into the two unrelated questions. To put it another way, asking Group 1 vs 2 and Group 1 vs 3 are related, both are heavily influenced by Group 1. Nevertheless, you can still do all pairwise comparisons. To return to the first sentence of your original question: Yes, this would be a t-test within the two-way ANOVA. Yes, I focus on t-tests as that is a specific answer to a scientific question (e.g., Group 1 has a significantly higher mean than Group 2), whereas a significant treatment F-test says that among the three treatment groups, a difference (somewhere) exists. Most people want to know where the difference is – hence they do t-tests.

      (2) In your original question, you implied that one variable was dichotomous. It is perfectly commonplace for the independent variables (e.g., gender or treatment) to be nominal parameters or dichotomous. However, ANOVA is not appropriate when the dependent variable is dichotomous.

      (3) Finally, the blog was about ‘Significant p-values in small samples’. When you have statistical significance when the N is small, this implies two things. First, you have a large effect size for at least some question (see (1) above) and second, the variability of your conclusion (effect size) may be quite high, due to the small N. Since this is likely a post-hoc or exploratory finding, I’d conclude that the results (albeit statistically significant) are exploratory in nature and should be confirmed in future research.
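
      The orthogonality claim in point (1) can be checked with a few lines of arithmetic. This is a minimal sketch assuming equal group sizes, in which case two comparisons are orthogonal when the dot product of their weights is zero. The weight vectors are an illustrative encoding of the comparisons listed in (1), ordered as (Group 1, Group 2, Group 3).

```python
from fractions import Fraction as F

# Contrast weights over (Group 1, Group 2, Group 3).
# The encoding below is an illustrative assumption based on the list in (1).
reference = (F(1), F(-1), F(0))          # Group 1 vs Group 2
contrasts = {
    "a": (F(1), F(0), F(-1)),            # Group 1 vs Group 3
    "b": (F(0), F(1), F(-1)),            # Group 2 vs Group 3
    "c": (F(1, 2), F(1, 2), F(-1)),      # Groups (1+2)/2 vs Group 3
    "d": (F(1, 2), F(-1), F(1, 2)),      # Groups (1+3)/2 vs Group 2
    "e": (F(-1), F(1, 2), F(1, 2)),      # Groups (2+3)/2 vs Group 1
    "f": (F(1, 3), F(2, 3), F(-1)),      # (Group 1)/3 + 2*(Group 2)/3 vs Group 3
}


def dot(u, v):
    """Dot product of two weight vectors (exact, using fractions)."""
    return sum(x * y for x, y in zip(u, v))


# With equal n per group, a zero dot product means the two comparisons
# ask unrelated (orthogonal) questions.
orthogonal = [name for name, w in contrasts.items() if dot(reference, w) == 0]
print(orthogonal)  # ['c']
```

      With unequal cell sizes (as in the married/non-married data above), the dot product would have to be weighted by 1/n for each group, so this check is only approximate for that design.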

  26. Tim says:

    Dear Dr. Fleishman,

    Thank you for sharing your knowledge via this blog. It has helped me tremendously in getting more familiar with some key statistical concepts. Currently I am writing my master’s thesis, in which I am conducting a macro-economic analysis. The aim of the study is to define the relative importance (in comparison to other determinants) of policy stringency in determining eco-innovation performance. To achieve this, I conduct a multiple linear regression analysis.
    Based on extant literature I have established the following model:
    F(policy stringency and 8 other variables),
    with eco-innovation performance as dependent variable.
    Hence, in total I have 9 independent variables. Due to a small sample size (only 13 countries), the multiple linear regression results are unreliable (R-square = 0.59, p-value=0.55). The rule of thumb states that at least n=90 is required for 9 predictors. Unfortunately this is an unfeasible number for this type of macro-economic analysis (it is impossible to find data on these variables for 90 countries, 20 is already a challenge). This implies that this model (with 9 independent variables) will never generate reliable results, right? In my discussion, I recommend that future research simplify the model (reduce the number of predictors) in order to get reliable results. Although this forms a severe limitation for my study, it still provides new insights on the importance of the determinants (this type of macro-economic analysis has not been done before). Do you think this is a legitimate conclusion?

    Secondly, when I conduct a multiple linear regression with 3/9 independent variables, instead of using all 9 variables, I find some (more) reliable results (R-square is still poor=0.50 but p-value=0.008). The final coefficients are also significant (p-value <0.05). Does this mean that I can be confident about the results? Or is the linear regression model still volatile due to the small sample size? What can I do to validate the results?

    All the best, and greetings from the Netherlands.


    • You are right, you have a very, very small sample. I’ve heard repeatedly about your rule of thumb, 10 per parameter (90 needed in your study), although the multiplier can be 5 to 15. I haven’t heard the origin of the rule of thumb. To look at this another way, the number of residual degrees of freedom is (N – 1 – number of predictors) 13 – 1 – 9 = 3. That is the reason you could almost never get statistical significance with so many parameters and so few observations. So, your advisor should have told you to never consider such a small N and such a large number of parameters. Yes, I agree that the analysis was highly questionable (3 d.f.).

      Then you mentioned selecting 3 of the 9 parameters. If you selected the three BEFORE you saw the results, then your R2 of 0.50 is valid (and likely different from zero). I’m not familiar with your field, but an R squared of 0.50 is likely quite good. OTOH, if you selected the 3 as the best of the 9, then I personally wouldn’t consider the results as anything more than capitalization on chance.

      To validate the results, repeat the study, hopefully with a larger N.
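
      The ‘capitalization on chance’ concern can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: all 9 predictors are pure noise (no true relationship), N = 13, and a single ‘best’ predictor is picked by its squared correlation with the outcome, rather than the best 3 of 9. The function names and settings are hypothetical.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=13, n_predictors=9, reps=2000, seed=1):
    """Average r^2 for a pre-specified predictor vs. the best of several.

    Every predictor is independent noise, so any apparent fit is chance.
    """
    rng = random.Random(seed)
    fixed_total = 0.0
    best_total = 0.0
    for _ in range(reps):
        y = [rng.gauss(0, 1) for _ in range(n)]
        r2 = []
        for _ in range(n_predictors):
            x = [rng.gauss(0, 1) for _ in range(n)]
            r2.append(pearson_r(x, y) ** 2)
        fixed_total += r2[0]    # predictor chosen BEFORE seeing the data
        best_total += max(r2)   # predictor chosen because it looked best
    return fixed_total / reps, best_total / reps


prespecified_r2, best_of_nine_r2 = simulate()
print(f"pre-specified: {prespecified_r2:.3f}  best of nine: {best_of_nine_r2:.3f}")
```

      Under these assumptions the pre-specified predictor averages an r² near 1/(N – 1), while the hand-picked ‘best’ predictor looks several times better even though nothing real is there.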

  27. Tim says:

    Wow. Thank you for your swift reply, appreciated!

    You helped to eliminate my doubts.
    Concerning your point regarding the 3/9 variables. Yes, you are right. I found this model via the automatic linear modelling function in SPSS. It does not have any grounded theoretical support.

    Although it sucks that my results are unreliable, at least I have made a first step in analyzing these factors on macro scale. Extant empirical literature only examined the relation on micro and meso scale.

    Thanks again for your help

    • The automatic modeling procedures capitalize on chance. My stat mentors referred to them as the ‘devil’s tools’. I recommend you fall back to less automated approaches: a) Simpler analyses, b) more intuitive selection of fewer parameters, and/or collect more data.

      Another fall-back is to analyze these 10 variables as pairs (i.e., simple correlations) – simple data = simple solutions. I would also look at the scatterplots for outliers (transformations) and non-linearity. If pairs of the ‘predictors’ are highly correlated (the jargon term is collinear) then discard the redundant parameters (e.g., you don’t need both the left shoe size as well as the right shoe size). Although it is likely beyond you, you could put the predictors into a factor analysis to get at a fewer number of (latent or underlying) factors and use them. I’ve seen one Monte Carlo study which found a huge improvement in regression weight stability when N is small and the number of predictors is large with factor analysis/principal components.

  28. Sample size says:

    Professor, would you answer me this question?
    What is the rationale of using different sample size when conducting an experiment using the same procedure, tasks and stimuli?

    • Very simply, the difference of doing a study with different Ns is the relative precision of the answers you get back. A study with an N of 400 will have 10 times greater precision (one tenth the standard error of the mean) than a study with an N of 4. The bigger the N the greater the accuracy of the mean. Similarly, a large N is much more powerful in being able to reject the null hypothesis (ie, get statistically significant results). Is that your question?
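
      The ‘10 times greater precision’ figure follows from the standard error of the mean, sd/sqrt(N). A minimal sketch, assuming the same standard deviation in both studies (the value 1.0 is arbitrary):

```python
import math


def standard_error(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)


sd = 1.0  # arbitrary; the ratio below does not depend on it
ratio = standard_error(sd, 4) / standard_error(sd, 400)
print(ratio)  # the N=4 study has about 10 times the standard error
```

      Because the standard error shrinks with the square root of N, a 100-fold increase in N buys only a 10-fold gain in precision.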

  29. Validity says:

    Dear Dr. Fleishman,
    I’m working on my dissertation (toward the end) and have not been able to find a powerful rationale for the following issue: I administered a survey to about 600 students (using a Likert scale), measuring attitudes and behaviors, and the questions were taken from another study. Although those questions were asked in a different format (dichotomously) and within a much larger questionnaire, the author of that study validated the whole questionnaire. Furthermore, and unlike the aforementioned author, these same questions were used in another study with the same population characteristics as mine (but not using Likert scale) and within a larger questionnaire. I tested the reliability of my questionnaire and the scales have a good alpha value (>.8) and acceptable reliability (>.7). I did not conduct a pilot study, nor was the survey revised by experts in the field. How can I justify in my paper this validity issue?

    • You asked about using the same items, but using a Likert (e.g., 1-7 point) scoring scale rather than a dichotomous (agree/disagree or 0-1) scoring scale. In general, a quasi-continuous (Likert) scale is expected to have better relationships than a dichotomous scoring scale. So if the dichotomous scale has been validated, then the continuous one should have even a higher validity. See Blog 9. Dichotomization as the Devils Tool. That blog considered a single dependent variable, but a sum of many dichotomous variables (and their intercorrelations), would have similar characteristics. So the coefficient alpha (a measure of internal consistency/reliability) is expected to be higher with your Likert scale. Therefore, if the item analysis for the dichotomous scales indicated a single concept (e.g., factor or underlying structure), using the more powerful scoring scale should have a higher degree of internal correlations.

      I am confused by your reference to using a subset of the larger original questions from the original study. You then mentioned a second “study … within a larger questionnaire”. Were the items from your subscale validated by the original or second author, albeit using the dichotomous scaling? Or was the validation only on the total score of the larger scale. If for example, the original and second study used a 50 item scale and validated the total score of the 50 items, but you used 2 of those items, then I foresee problems, even if you had a Likert scoring. If they used 50 items, but validated your 2 item subscale, it might be ok.

      Unfortunately a reliability of 0.70 is not very impressive. Personally I would consider a 0.70 in the lower end of acceptability. That is, with a reliability of 0.70, having a person take the scale twice, then (1.0 – 0.70*0.70 = 0.51) or 51% of the prediction of the second score is noise – error variance.

      In summary, if in your branch of science the original dichotomous sub-scale is acceptable and if the two previous studies validated your subset of items, since you did not change the content of the items (only the scoring), then your scale should be acceptable.

  30. Validity says:

    Dear Dr. Fleishman,

    Thank you so much for your prompt and great response. Responding to your inquiry, “I am confused by your reference to using a subset of the larger original questions from the original study,” here are some specifics:

    The original study (where I took the questions from) used a 50-question/statement questionnaire. The students had to circle yes/no responses for some of them; for some others, true/not true/sometimes/not sure; and other questions were multiple choice. I took from that questionnaire of 50 statements and questions 17 statements, which I grouped according to the variable I wanted to measure (i.e. from Qs1-7 DV1; Qs 8-14 DV2; Qs 15-20 DV3).

    The author that constructed the original questionnaire used a pilot study (at a private school), and experts in the field revised the questionnaire. There is no mention in the study of any type of reliability test.

    The second study used also some questions from the original 50-question/statement questionnaire. In the research report the author does not specify any type of validity or reliability procedure for the questionnaire.
    Thank you very much.

  31. Irosha says:

    Dear sir,
    I’m performing research on an edible coating for fresh strawberries. I have measured the weight loss of coated and uncoated fruits during a storage period of 15 days. I collected data at three-day intervals for each sample, so I got 5 data points from each sample. Can I apply an independent t-test to analyze the data?

    • It sounds like you have two groups measured on six (not five) times – Days 0, 3, 6, 9, 12 and 15. Now, with your data you could do 5 t-tests, comparing the Day 0 with the other 5 period differences and testing the mean (coated vs uncoated) differences. Each, if applied uniquely, would be an appropriate statistical test.

      One problem with this approach is you have 5 t-tests. How are they related? Likely very highly. This is similar to the issue of asking your shoe sizes every 3 days. With shoe sizes (autocorrelations approaching 1.0), you are not getting 6 independent measurements but 1 measurement repeated 6 times. With fruit weights, with quite high autocorrelations, you likely have very, very highly related observations. The only ‘correct’ way of handling this is to use a program which handles such correlated errors (e.g., SAS proc Mixed) or Multivariate Analysis of Variance (MANOVA) and adjusting for such issues. It is INVALID to make any assumption of independence common in the stock ANOVA program. Assuming compound symmetry is typically completely wrong. Before the advent of such programs, I NEVER did repeated measurement ANOVAs as the assumption of independence is blown away. There are many assumptions one needs to make with data, (for example normality, equal variance, independence) but independence completely overwhelms everything. For example, an autocorrelation of 0.30 would make an assumed alpha level of 0.05 test actually 0.25. With weights, I would guess an autocorrelation of greater than 0.80. The differences between your 5 p-values would be meaningless. In essence, you are not getting 5 p-values but 1 p-value repeated 4+ times.

      One approach you could do is just report the Day 15 t-test and stop there. That is what I did as a professional statistician between 1980 and 1995 (e.g., analyze change from baseline at study endpoint and ignored any interim measurements).

      If it were me, the first thing I would do is to compute the correlations of Day 0 with 3, 6, 9, 12 and 15 (ignoring treatment group) as well as all the other times. If the correlations off the diagonal are non-zero, you have the issue I talked about above. If the correlations within a band (e.g., First Off-Diagonal: Day 0&3, 3&6, 6&9, 9&12, and 12&15) are different and higher than the next band (e.g., Second Off-Diagonal: 0&6, 3&9, 6&12, and 9&15) I’d use AR(1). Otherwise I’d use a different covariance structure. Assuming non-zero correlations, I would use a correlated error mixed model statistical program and analyze the 5 post baseline changes from baseline (weight loss) with effects of Day, Treatment and Day by Treatment interaction, using an ANCOVA with baseline as a covariate and an appropriate autoregressive covariance structure. My hypothesis would be that the treatment difference increases over time by examining the Day by Treatment interaction for a linear (or curvilinear effect). As the coating likely has a non-zero weight, you would need to ignore the Day 0 difference between treatment groups, especially at post Day 0.
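
      The ‘compare the off-diagonal bands’ step can be sketched as follows. The data here are simulated, not strawberry weights: 200 hypothetical units measured at 6 time points, generated from an AR(1) process with an assumed autocorrelation of 0.8. The helper names are illustrative only.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def band_means(columns):
    """Mean correlation within each off-diagonal band (lag 1, lag 2, ...).

    columns[i] holds every unit's measurement at time point i.
    """
    k = len(columns)
    means = {}
    for lag in range(1, k):
        rs = [pearson_r(columns[i], columns[i + lag]) for i in range(k - lag)]
        means[lag] = sum(rs) / len(rs)
    return means


# Simulated AR(1) data (an assumption, purely for illustration).
rng = random.Random(0)
rho, n_units, n_times = 0.8, 200, 6
rows = []
for _ in range(n_units):
    x = [rng.gauss(0, 1)]
    for _ in range(n_times - 1):
        x.append(rho * x[-1] + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1))
    rows.append(x)

columns = list(zip(*rows))  # one column per time point
bands = band_means(columns)
for lag, r in bands.items():
    print(f"lag {lag}: mean r = {r:.2f}")
```

      If the band means fall off steadily with lag, as they do for this simulated series, an AR(1) structure is plausible; if they stay roughly flat, a different covariance structure would be the better guess.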

  32. Irosha says:

    Dear professor,
    Thank you very much for your kind response. Actually I’m not good at statistics. But, since I have data for two groups (coated and uncoated) can’t I just apply independent t – test for the analysis of these two groups?
    I have read articles and some of them said that t – test cannot be applied for very small samples, but some said it can be applied. This also confuses me.

    • Simple answer: As I said in my reply, yes you can do a (singular) t-test. If you want to do more than one t-test get professional assistance (see my analysis suggestion in the last reply).

      Second question: Yes, you can do a test when you have small samples. The articles are wrong! You didn’t mention your N. Minimum N should be 6 (3 per group), but a t-test could be done with 4. See my Blog 12. Significant p-values in small samples. However, my main caveat was that you should focus on descriptive statistics and effect sizes. Unless you have a huge effect, the inferential statistics (p-values) would be non-significant (i.e., your study was run so you are unable to reject the hypothesis that there is no real difference between your treated and untreated fruit). I suggest focusing on both sides of the CI. The lower end is the minimum difference (e.g., does it include zero [statistical significance?], if positive this indicates significance and the minimum weight loss is x). The upper end might be more interesting, it suggests that the coating benefit could be as large as y.

      Think of inferential statistics with small sample sizes like standing on the Jersey shore looking at the New York City skyline with very thick fog. Yes, you see the fog ‘lit up’, but under the circumstances you cannot rightfully say that New York exists.

  33. Irosha says:

    Dear sir,
    I conducted a difference-from-control test for my sensory evaluation and asked the panelists to rank the difference from the control according to a scale, but I didn’t use a blind control. Can I analyze the data since I didn’t use a blind control? If so, how can I analyze the data?

  34. Deepashika says:

    Dear sir,
    I’m doing research on Lean manufacturing and I have selected a few performance indicators to check the effectiveness of implementing Lean manufacturing in a firm. Therefore I have to conduct a paired t test for each performance indicator to check the before and after condition of the firm. The selected performance indicators are waiting time, number of excess motions, number of defects, production (number of tea bags), effectiveness of the machine, and overall equipment efficiency of the machine.

    Sir, I want to know whether each of those indicators is OK for a paired t test and what their measurement scales are. I wait in anticipation for your reply.

    Thank you.

    • I am feeling like one of those TV psychiatrists who listen for 5 minutes and then make suggestions.

      Without knowing anything more, it ‘sounds’ like each of your performance indicators is on a continuum (i.e., may be interval level measurements), although many of your parameters might have a log normal or Poisson distribution. So something like a (paired) t-test might be appropriate. As you have “before and after condition” measurements you appear to be warranted in taking a difference, hence a paired t-test is likely correct. But if I’m correct about the log scale, a ratio, not a difference, might be more appropriate.

      However, many decades ago, a senior statistician told a junior statistician (me) that the true job of a statistician is to understand exactly the question which is being raised. The question which the client is asking is not always what they ask. Ten years before, my first ‘job’ as an undergraduate statistician was doing a large number of 2 way ANOVAs on correlated independent variables for a psychology graduate student’s dissertation. Under his direction, I artificially forced an orthogonal solution to the ANOVAs (ignoring multiple negative sum of squares per analysis). I told the IBM batch computer to ignore the first 400 error messages per ANOVA. A week after the job was done, he told me he actually wanted to do a factor analysis, but he didn’t know how. To be explicit, the ANOVAs, I was given the punch cards for, were pure garbage – PURE garbage.

      You may be completely right in doing paired t-tests, but my suggestion is to ask a local statistician for a consult.
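
      For concreteness, here is a minimal sketch of the paired t statistic on differences, and on log ratios for the case where a ratio is more appropriate. The before/after numbers are made up for illustration, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(before, after, log_scale=False):
    """Paired t statistic and its degrees of freedom.

    With log_scale=True the test is on log(after/before), i.e. on the
    ratio rather than the difference (requires strictly positive data).
    """
    if log_scale:
        diffs = [math.log(a) - math.log(b) for b, a in zip(before, after)]
    else:
        diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Made-up waiting times (minutes) before and after a change.
before = [10.0, 12.0, 9.0, 11.0]
after = [8.0, 9.0, 8.0, 9.0]

t_diff, df = paired_t(before, after)
t_ratio, _ = paired_t(before, after, log_scale=True)
print(f"difference scale: t = {t_diff:.2f} on {df} df")
print(f"log-ratio scale:  t = {t_ratio:.2f} on {df} df")
```

      The statistic is then compared with a t distribution on N – 1 degrees of freedom; which scale is appropriate is exactly the sort of question to take to a local statistician.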

  35. Ayman says:

    (The way I see it a p-value of 0.01 for a sample size of 20 is the same as a p-value of 0.01 for a sample of size 500): I don’t think it is true!
    The accuracy of the pvalue depends crucially on the variance of the estimated variance sigma_hat, as the pvalue is the surface at the right of the t_stat computed at the estimated coefficient, beta_hat, of the “TRUE” pdf curve of the t-stat . Naturally, we will have to estimate the variance of the t-stat distribution (under the null hypothesis, the mean is known). To do so, we use sigma_hat as an estimated of that variance. Yet, we all know that var(sigma_sq_hat) = 2σ^4/(n − p), where (n − p) is the degrees of freedom. So, with a small n, the variance of sigma_hat is going to be relatively much higher, right?
    So, the p-value will be less accurate as the difference between the approximated pdf curve of the t_stat and the true one increases.

    • Ayman says:

      * (To do so, we use sigma_hat as an estimated of that variance) = To do so, we use sigma_hat in the estimation of that variance.

      * The “TRUE” pdf curve of the t-stat is equivalent to the true distribution of beta_hat ~ N(beta, sigma^2 (X^T X)^{-1})
      sigma_hat^2 ~ sigma^2/(n-p) · chi^2_{n-p}
      --> var(sigma_hat^2) = 2 sigma^4/(n-p)

      --> So, if n decreases the var of sigma_hat increases, so the p_value is less accurate = underpowered test
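
      The variance formula above can be checked numerically. This is a rough sketch for the simplest case only: a single normal sample in which just the mean is estimated (p = 1), with sigma = 1. The function names are illustrative.

```python
import random


def var_of_sigma2_hat(sigma, n, p=1):
    """Theoretical variance of the estimated variance: 2*sigma^4/(n - p)."""
    return 2 * sigma ** 4 / (n - p)


def sample_variance(xs):
    """Unbiased sample variance (divides by n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulated_var(sigma, n, reps=20000, seed=0):
    """Monte Carlo estimate of the variance of the sample variance."""
    rng = random.Random(seed)
    s2 = [sample_variance([rng.gauss(0, sigma) for _ in range(n)])
          for _ in range(reps)]
    return sample_variance(s2)


theory_20 = var_of_sigma2_hat(1.0, 20)    # 2/19
theory_500 = var_of_sigma2_hat(1.0, 500)  # 2/499
sim_20 = simulated_var(1.0, 20)
print(f"n=20:  theory {theory_20:.4f}, simulated {sim_20:.4f}")
print(f"n=500: theory {theory_500:.4f}")
```

      Under these assumptions the estimated variance is roughly 26 times noisier at n = 20 than at n = 500, which is the point being made about small samples.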


      • I totally, and completely, agree with your observation. In essence, the confidence interval for a p-value is narrower when N is large and is wider when the N is small. However, when did you ever see a publication report a CI for a p-value? Only the point-estimate (50% of the p-value distribution) is reported, and 0.01 = 0.01.

        Nevertheless, if I observed a p=0.05000, half the time the result would be p>0.05 in either the large or small N studies. But you are correct, an observed p=0.01000 is much more likely to be p<0.05 in the large N case.

        I agree with your observation, but rather than the quite uninformative p-value (see Blog #1), I'd prefer to report on the CI for the non-centrality parameter (or effect size) (see Blog #4) or the CI on the difference between the means (see Blog #3). The p-value only asks is the difference zero, but the CI tells you how much; that is: (our most conservative [similar but more informative to testing if statistically significant]; our best guess; and our most liberal [useful to testing if the means are clinically significant]).

        When you look at the CI on the mean difference (mean1 - mean2) or the effect size (mean1 - mean2)/sd, both the large and small N say the most conservative lower end (statistical significance) is better than a zero difference - and actually quantify how much better than zero it is, but the best guess or upper end of the CI would imply a larger difference when N is small. So, personally, while I agree that I'd love to see a replication, I'd rather see a significant p-value when N is small than large for exploratory findings. It might foretell an important breakthrough effect, rather than it’s not zero.

        Related to your observation, is the publication bias effect (i.e., not submitting or not allowing to be published p>0.05 findings). Meta analyses have found that both the effects of drug therapy and talk therapy were over-estimated. So for confirmatory work, I’d prefer a large N and many replications. FDA has required two adequate and well controlled (statistically significant) studies.

  36. Haryanti says:

    Hello Sir,

    I am writing my master’s thesis about students’ decision making related to socioscientific issues. I am testing differences between 3 discussion groups (a male group, a female group, and a mixed group), with each group containing n = 6 students. Is it okay to conduct with N=6? I am also confused about which test I should use to find the differences between the 3 groups. Please help me.

    Thank you

    • First question: ‘is it okay to conduct with N=6?’ The correct way to make that decision is to do a power analysis. As stated elsewhere among my blogs, most comparisons can be focused to pairwise differences (perhaps each of the 3 pairwise comparisons). So the comparison comes down to 2 groups of 6 each (or 10 degrees of freedom). [Note: you probably can get a pooled variance estimate with all 18 subjects (or 15 df).] A study of this size likely has quite limited power, and only a very large difference could be statistically significant. As stated strongly in the first 4 blogs, and in this blog 12, I would strongly recommend you refocus your exploratory study to descriptive statistics with emphasis on point estimates and confidence intervals and state that ‘p-values are presented for descriptive purposes only’.

      Second question: what test should you use? You didn’t say, but given ‘students’ decision making’, I would guess you have some quasi-continuous parameters. ANOVA sounds like a first guess. Most stat packages naturally generate all pairwise comparisons for ANOVA. That should make your adviser happy. Again, the p-values would be for ‘entertainment purposes only’ – that’s what they say about astrology charts and exploratory studies. With your N, I wouldn’t take p-values seriously at all. If your N were an order of magnitude larger (e.g., n/group=60), I might pay more attention to the p-values and do an improved Bonferroni test [e.g., most significant (largest mean difference) tested at alpha/3; middle difference at alpha/2, and the least significant difference at alpha (e.g., 0.05) with an appropriate stopping rule].

      Nevertheless, with barely a 50 word description of your problem, I would say my advice must not be considered definitive. I would ask your adviser if there is a free stat consulting group at your school. For example, the statistician might question if you have n/group=6 or n=1, as each group might be considered a single group process. They might throw around terms like clusters and correlated errors.
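
      The ‘improved Bonferroni’ sequence described above (smallest p-value tested at alpha/3, the next at alpha/2, the last at alpha, stopping at the first failure) is what is usually called the Holm step-down procedure. A minimal sketch, with made-up p-values and a hypothetical function name:

```python
def holm(p_values, alpha=0.05):
    """Holm step-down test.

    Returns a list of booleans (True = reject) in the original order.
    The smallest p-value is tested at alpha/m, the next at alpha/(m-1),
    and so on; testing stops at the first non-significant result.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stopping rule: nothing less significant is tested
    return reject


# Made-up p-values for the three pairwise comparisons.
print(holm([0.01, 0.04, 0.03]))  # [True, False, False]
print(holm([0.01, 0.02, 0.04]))  # [True, True, True]
```

      In the first example 0.03 fails its alpha/2 = 0.025 threshold, so 0.04 is never tested even though it is below 0.05.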

  37. Mauro says:

    Dear author. I have finished a study. Beforehand, I calculated the sample size for a pre-specified effect size (clinical significance). In the results I found statistical significance, but it did not reach the clinical significance pre established. Why could this happen if I calculated the sample size beforehand?

    • You did ‘not reach the clinical significance pre established’. Think of a straight line, with 0 at one end and your clinical significance at the other. Now place your mean on that line. When you said it did not reach clinical significance I take it to mean that the mean is below the clinically significant end. NOW PLACE THE 95% CONFIDENCE INTERVAL ON THAT LINE AROUND THE MEAN. When you said it was statistically significant, that means the lower end of the interval was above zero. What other values did the lower end not include? Zero is an arbitrary number. You could likely exclude more meaningful effect sizes above zero.

      Now this is important! Look at the upper end of the CI. Does it include your clinically significant value? If so, you have demonstrated that your study treatment might include your clinically significant value. OTOH, your treatment effect might be less than you hoped. Doing a power analysis only applies to finding statistical significance, never clinical significance.

      There are many reasons to not attain clinical significance: you failed to control the experimental noise (e.g., your variability was larger than expected), you got an incorrect population, your treatment wasn’t as good as hoped, or you were unrealistic.

      Please read the blogs #1 through 4, inclusive. Also see the blog on proving the null hypothesis “#5 Proving the Null Hypothesis”. You may have noticed in the second paragraph I said “might include your clinically significant value”. With a traditional significance level approach, an upper CI limit that includes the clinically significant value might exclude it with a bigger N. You actually need a totally different approach to PROVE the null hypothesis – in this case to prove the alternative hypothesis. Read blog #5.

      Let me know if you have any confusions here.
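
      The ‘place the CI on the line’ reasoning can be written down as a small helper. This is a sketch only: it assumes larger effects are better, that the clinically significant value is positive, and the function name and example numbers are made up.

```python
def interpret_ci(lower, upper, clinical_value):
    """Summarize where a confidence interval sits relative to zero and
    to a pre-specified clinically significant value."""
    return {
        # Lower end above zero: statistically significant.
        "statistically_significant": lower > 0,
        # Interval covers the clinical value: it is still plausible.
        "clinical_value_plausible": lower <= clinical_value <= upper,
        # Whole interval below the clinical value: it is ruled out.
        "clinical_value_excluded": upper < clinical_value,
        # Whole interval above the clinical value.
        "exceeds_clinical_value": lower > clinical_value,
    }


# Made-up example: mean difference 3.5, 95% CI (1.0, 6.0), clinical value 5.0.
result = interpret_ci(1.0, 6.0, 5.0)
print(result)
```

      In this made-up case the result is statistically significant, the mean falls short of the clinical value, and yet the clinical value remains inside the interval, which is the situation described above.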

  38. Omar Farooque says:

    What is the minimum T-statistics for a P-value 0.001?

    • You asked about a p-value of 0.001 for a t-statistic. The only other info I would need is the degrees of freedom (df) and whether you were dealing with a one-sided or two-sided p-value.

      Since this is a blog on small N significant p-values, I assume you mean maximum for a critical t-test, which is when the df is smallest (i.e., df=1), then for a two-sided t you would need to exceed a critical t of 636.62; for a one-sided t, you would need to exceed a critical t of 318.31. These critical t values decrease by a factor of roughly 14 (one-sided) to 20 (two-sided) when you add a single observation (i.e., df=2).

      If you actually meant minimum for a critical t-test, which is when the df is largest (i.e., df=infinity), it would be a z-statistic. Or you could approximate the t by plugging into the t-distribution a huge df (like a billion). Then for a two-sided t you would need to exceed a critical t of 3.29; for a one-sided t, you would need to exceed a critical t of 3.09. The t-distribution also approaches these numbers somewhat rapidly (e.g., for a moderate df [N-1 or N-2] of 100 the two-sided critical t is 3.39 [rather than 3.29 for an infinite df], the one-sided is about 0.08 larger than for an infinite df). By the time you get to around 1900 observations, you’d get 3.09 to 2 decimal places.
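
      The extreme cases quoted above can be reproduced without statistical tables, because the t distribution has a closed-form quantile for df = 1 (it is the Cauchy distribution) and df = 2, and becomes the normal distribution as df goes to infinity. A minimal sketch using only the Python standard library; intermediate df values would need a statistics package, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def t_quantile(p, df):
    """Quantile of Student's t for the three cases with a closed form."""
    if df == 1:
        return math.tan(math.pi * (p - 0.5))            # Cauchy
    if df == 2:
        return (2 * p - 1) / math.sqrt(2 * p * (1 - p))
    if df == math.inf:
        return NormalDist().inv_cdf(p)                  # z-statistic
    raise ValueError("closed form only for df = 1, 2 or infinity")


def critical_t(alpha, df, two_sided=True):
    """Critical t that must be exceeded for significance at level alpha."""
    tail = alpha / 2 if two_sided else alpha
    return t_quantile(1 - tail, df)


for df in (1, 2, math.inf):
    two = critical_t(0.001, df, two_sided=True)
    one = critical_t(0.001, df, two_sided=False)
    print(f"df={df}: two-sided {two:.2f}, one-sided {one:.2f}")
```

      The df = 1 and df = infinity rows match the values quoted above (636.62 and 318.31; 3.29 and 3.09), and the df = 2 row shows how quickly the critical value collapses with one more observation.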

  39. I really enjoyed this article, and would like to point my students towards it as food for thought! However, there is one aspect of this that (as far as I can see) you don’t mention. If our smaller sample size gives us lower power, then the proportion of significant results that will be true positives will be lower (since we will detect fewer true positives, and just as many false positives), and the (Bayesian-ish) probability that a given significant result is a false positive will be inflated. I guess with very small samples, this could be quite severe, and might be further reason to question any given significant result from a small sample?

    • Let’s say you need 100 observations (e.g., patients) for a correctly powered study*, but you only run 10. Under those circumstances your power would be small, like 0.141, not the typical 0.80. The true positive rate is 14.1%. What is the false positive rate? Zero! I assumed in my ‘Classical’ calculation that the true Mean-2 was 0.56 standard deviations different from Mean-1 (see below*). So if you just ran a study with 1/10 the correct N, you’re quite likely to fail (85.9% of the time). This is called the false negative rate; there are no false positives when the Ho is false. Please review blog #1 and #2, as I said there, the Ho is theoretically/mathematically, scientifically, practically, philosophically very seldom true.

      However, with only ten observations, one could have run the trial ten times, rather than the original 1 time with 100 observations total. Then, one can ask what is the likelihood that at least one of these ten replications is significant = (1 – [(1 – 0.141)^10]) = 0.781. The vast majority of such studies will be n.s. Under those conditions, the best approach would be to run a meta-analysis and try to estimate the N=100 study.
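      The numbers in this reply can be approximately reproduced with a short sketch. It assumes a two-group comparison with equal group sizes and uses an upper-tail normal approximation to the power, which is an assumption about how the 0.141 was obtained (an exact noncentral-t calculation gives a somewhat different value at N=10). The function name approx_power is illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size: float, n_total: int, alpha: float = 0.05) -> float:
    """Upper-tail normal approximation to the power of a two-sided,
    two-group comparison with n_total / 2 observations per group."""
    z = NormalDist()
    n_per_group = n_total / 2
    # Expected z-statistic when the true difference is effect_size sd's.
    noncentrality = effect_size * sqrt(n_per_group / 2)
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(noncentrality - z_crit)

EFFECT = 0.56  # Mean-2 minus Mean-1 in sd units, as assumed in the reply

power_100 = approx_power(EFFECT, 100)  # the correctly powered study
power_10 = approx_power(EFFECT, 10)    # the study with 1/10 the N

# Chance that at least one of ten independent N=10 replications is significant.
at_least_one = 1 - (1 - power_10) ** 10

print(f"power at N=100: {power_100:.3f}")
print(f"power at N=10:  {power_10:.3f}")
print(f"P(at least one of ten significant): {at_least_one:.3f}")
```

      Under these assumptions the power comes out near 0.80 for N=100 and near 0.141 for N=10, and the chance of at least one significant result in ten small trials is about 0.78.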

      Classical statistics assumes the theoretical values (e.g., population values of the mean and variance) are fixed and true, and the study results are estimates (i.e., 1 among an infinite number of possible replications). You mentioned Bayesian statistics. Bayesians assume the data you analyze are real (true), but the theoretical values are uncertain and need to be estimated. A true Bayesian would be much less focused on p-values and more on estimating the true values of the population means.

      Personally, I think like a Bayesian, but calculate like a Classicalist. I tend to focus on the estimates of the mean difference and its C.I. When N is small the C.I. might include or exclude zero, but the C.I. range is quite large.

      I think you might be trying to envision a (Schrodinger’s cat like) universe where the Ho is true half the time and the Ha is true the other half of the time. In that universe the true negative rate is 95% (i.e., the false positive rate is 5%) and the false negative rate is ~86%, as above. These two rates, in the small N study case, are numerically similar.
      *I assumed Mean 1 = 0, Mean 2 = 0.56, sd = 1, alpha = 0.05 and 1-B = 0.80, then N[total] = 100. I then set N[total] = 10, and the post-hoc power (which should never be calculated) gives 1-B = 0.141.
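      The commenter’s question about the share of significant results that are true positives can be made concrete in the hypothetical half-and-half universe described above. This is only a sketch: the 50/50 split between Ho and Ha is an assumption taken from that thought experiment (the reply itself argues the Ho is very seldom true), and the function name is illustrative.

```python
def positive_predictive_value(power: float, alpha: float = 0.05,
                              prior_ha: float = 0.5) -> float:
    """Share of significant results that are true positives, given the
    power, the alpha level, and the assumed proportion of studies in
    which Ha is true."""
    true_positives = power * prior_ha
    false_positives = alpha * (1 - prior_ha)
    return true_positives / (true_positives + false_positives)

ppv_small_n = positive_predictive_value(0.141)  # power quoted for N=10
ppv_large_n = positive_predictive_value(0.80)   # power for N=100

print(f"N=10:  {ppv_small_n:.3f}")
print(f"N=100: {ppv_large_n:.3f}")
```

      With these inputs about 74% of significant results from the N=10 study would be true positives, versus about 94% for the N=100 study. As prior_ha approaches 1 (the Ho is almost never true), both proportions approach 100%.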

  40. Merm says:

    Hi Allen,

    I found this page really informative, thank you!

    I have one question though. I’m carrying out a small study (N=8), where the participants will each be tested on the same three conditions, it is an exploratory study. I’m using a repeated-measures one-way ANOVA. When I ran the test with the data I got, I got a significant result – can I trust this result? Also if not, do I write in my report that it is an exploratory study and the results are only indicative?

    Thank you for any help,

    • First question I have is: did you do the repeated-measures ANOVA correctly? The most frequently used covariance assumption, compound symmetry, is NOT APPLICABLE to your study. That is, ANOVA is robust to almost all violations of its assumptions EXCEPT correlated error, that is, the correlation among your three observations. Under frequently seen cases, a nominally significant 0.05 alpha level would become an n.s. 0.25. Please read Blog 7. Assumptions of Statistical Tests.

      Second question is: Three treatments implies three comparisons (pairwise differences) among your treatment groups. Did you make any multiple comparison adjustments? A significant 2df treatment difference is not what you ultimately report. The best case is where you stated in your protocol’s stat section that one comparison is key and the others are of secondary concern. Otherwise, with a Bonferroni adjustment, the critical p-value is not 0.05, but 0.05/3 = 0.0167. Actually half that (0.0083) in each tail with a typical two-sided test.

      Third question is: Did you have an analysis plan or a stat section in your study design? That is, did you state that the study’s p-values were to be descriptive or inferential? You implied the study was exploratory, so the p-values shouldn’t be considered real. However, many people say a study is exploratory as a CYA for when an inadequately designed study is expected to fail. If you ran the trial expecting a huge (gargantuan) effect size (which you apparently got!), and hence were empowered to run it with a small N, then you actually planned on looking at the p-values. OTOH, if you didn’t really expect the study to be statistically significant; ran it as a pilot trial to get the feel of things; and had no intention of believing (or getting blamed for) a n.s. result, then you felt ‘the results were only indicative’.

      Fourth question: You obviously ran a crossover trial. Please read blog 14. Great and Not so Great Designs. [Spoiler Alert: Crossovers aren’t so great.] a) Did you randomize your participants into the 6 different orders (i.e., A then B then C (in simpler notation – A:B:C); A:C:B; B:A:C; B:C:A; C:A:B; and C:B:A)? Even more importantly, did you treat everyone with the same treatment order (e.g., all patients received A:B:C in that order)? If so, then the ‘treatment’ difference can just as easily be a temporal difference; hence the results are not interpretable. b) Did you analyze for carryover effects? I have previously had very bad experiences with such trials (i.e., I saw crossover effects). Many professional statisticians recommend large enough trials so only the first period can be used and all the other data (e.g., your periods 2 and 3) can be ignored. This would make your study a parallel group study, so that the N’s per group would be 2 to 3, not 8. A crossover effect is where the patient’s pre-period treatment status differs and treatment effect sizes differ depending on their prior treatments. For example, imagine treating mending bones, and having an excellent bone treatment. If C were Control and A were Active, then the pre-treatment status at the second period for A:C would be near 0 severity; but the C:A second period baseline would be quite high. If C were actually a different Active treatment, then since the maximum remaining improvement is lower (e.g., low), the treatment effect must be lower in the second period relative to the first. Please see blog 14 for reasons to expect such temporal changes for the second and later periods.
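      The randomization into the six treatment orders in a) could be sketched as follows. This is only an illustration, not a validated randomization procedure: the participant labels P1 to P8, the seed, and the function name assign_orders are all hypothetical. It deals out complete shuffled sets of the six orders so that, with 8 participants, every order is used at least once and none more than twice.

```python
import random
from itertools import permutations

def assign_orders(participants, treatments=("A", "B", "C"), seed=0):
    """Randomly assign each participant one treatment order, keeping the
    orders as balanced as the number of participants allows."""
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    orders = [":".join(p) for p in permutations(treatments)]  # 6 orders
    pool = []
    # Deal out complete shuffled sets of all orders until everyone is covered.
    while len(pool) < len(participants):
        block = orders[:]
        rng.shuffle(block)
        pool.extend(block)
    return dict(zip(participants, pool[:len(participants)]))

participants = [f"P{i}" for i in range(1, 9)]  # hypothetical labels, N=8
allocation = assign_orders(participants)
for person, order in allocation.items():
    print(person, order)
```

      With N=8 the design cannot be perfectly balanced across six orders, which is one more reason the order and carryover questions above matter for so small a trial.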

      Let me assume 1) you did the ANOVA with a correct correlated error term, e.g., AR(1), 2) your study’s stat section identified one of the pairwise comparisons as the key comparison and the other two as of secondary importance, 3) you ran the study fully intending to get statistically significant results, and 4) you actually had all 6 possible treatment orders and there were no crossover effects. Then you should believe the p-value. Publish it – mazel tov! You did it! You did it! Brag to your mother.

      As I’ve stated many places in my blogs (see blogs 1-4 and this blog 12), I’d be most inclined to present the treatment effect with its CI and de-emphasize p-values, especially for small N studies. When N is small and the p-value is significant, the lower end of the CI will exclude zero and is the best estimate of the minimum treatment difference expected. A lower end just above zero (i.e., statistically significant, p<0.05) is only one of an infinite number of possible small effect sizes. The middle of the CI (the actual mean difference) is the best estimate of the treatment difference size. Finally, the upper end (of this wide CI) indicates a possible maximum benefit from this treatment. As the N of 8 participants is small, the CI will be very wide. Hence the treatment effect you saw (the middle of the CI, or actual mean difference) is likely to be very large, possibly huge. The upper end will likely indicate a gargantuan effect size. It would be of potentially monumental benefit.

      Yes, this could be a spurious result (happens one time in twenty). If you publish, I’m sure you’ll say something about replicating. Replicate it and hope to see similar results.

  41. Stanley Levinson says:

    A p-value of 0.01 for a sample size of 20 implies a large effect but there is a better chance the result is wrong. A p-value of 0.01 with a sample size of 5,000 implies a small effect but there is a better chance the result is correct.

    • Let me preface this comment by asking what was meant by ‘chance the result is wrong’? I think you are again falling into the trap of assuming that the purpose of a study is to determine if the difference is zero. Please see my first 4 blogs. A difference is NEVER zero. To stop our thinking at ONLY determining if it is not zero is extraordinarily limiting. As I said in Blog #1, NO CREDIBLE THEORETICAL DIFFERENCE IS EVER ZERO (with the possible exception of ESP research). Testing if it isn’t zero is just plain silly.

      Let’s remember that a p-value is just the probability of getting results at least this extreme by chance alone (i.e., if the Ho were true). For both N=20 and N=5,000, the chance the result is correct is IDENTICAL: one could get that result 1 time in 100. The studies differ in the size of the confidence interval. In both cases the CI excludes zero. For N=20 the CI is wide. The observed mean difference (our best guess for the true mean difference) is probably fairly large, and the upper end of the CI, the most extreme but credible guess for the treatment effect, is huge. For N=5,000 the CI is narrow. We are more certain of the location of the mean effect size, and the upper end is likely near the mean difference.
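      To put rough numbers on this contrast, the sketch below works backwards from a two-sided p-value of 0.01 to the mean difference and 95% CI that would produce it. It assumes two equal groups, a difference expressed in sd units, and a normal approximation (at N=20 a t-based interval would be somewhat wider). The function name ci_for_given_p is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def ci_for_given_p(n_total: int, p_value: float = 0.01,
                   conf_level: float = 0.95):
    """Return (lower, difference, upper) in sd units for a two-group
    study whose two-sided p-value is exactly p_value."""
    z = NormalDist()
    n_per_group = n_total / 2
    se = sqrt(2 / n_per_group)             # SE of the mean difference (sd = 1)
    z_observed = z.inv_cdf(1 - p_value / 2)
    difference = z_observed * se           # difference that yields p_value
    half_width = z.inv_cdf(1 - (1 - conf_level) / 2) * se
    return difference - half_width, difference, difference + half_width

small = ci_for_given_p(20)
large = ci_for_given_p(5000)
for label, (low, diff, high) in (("N=20", small), ("N=5,000", large)):
    print(f"{label}: difference {diff:.2f} sd, 95% CI {low:.2f} to {high:.2f}")
```

      Under these assumptions the N=20 study has a difference of roughly 1.15 sd with a CI of about 0.28 to 2.03, while the N=5,000 study has a difference of roughly 0.07 sd with a CI of about 0.02 to 0.13. Both exclude zero with the same p-value, but they describe very different treatment effects.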
