29. Should you publish a non-significant result?

53.  If the beautiful princess that I capture says “I’ll never marry you! Never, do you hear me, NEVER!!!”, I will say “Oh well” and kill her.

61.  If my advisors ask “Why are you risking everything on such a mad scheme?”, I will not proceed until I have a response that satisfies them.

Peter’s The Top 100 Things I’d Do If I Ever Became An Evil Overlord

The above title is actually a trick question.  Should you publish a non-significant result?  You wouldn’t even try, and no journal would publish it if you did.  What to do?  Either redo the trial with more subjects, improve the methodology (decrease the noise and/or increase the effect), do both, or drop that line of research in favor of a more useful line of inquiry.  Hence my quotes above: “‘Oh well’ and kill her” and “I will not proceed until I have a response that satisfies them.”  There are always more beautiful princesses out there, and so little time in life!

There is a very, very good reason why you may have over-estimated the effect size.

Why mention this?  In the October 2015 issue of Significance there was an article entitled “Psychology papers fail to replicate more than half the time”.  It refers to a Science article (bit.ly/1LgoZb2) in which 350 researchers attempted to replicate 100 papers.  “[W]hile 97% of original studies had P values less than 0.05 – the standard cut-off for statistical significance – only 36% of replicated studies did so.  Meanwhile, the mean effect size of the replicated studies was half that of the original findings” (Significance, October 2015, page 2).  First of all, this problem is endemic to all fields of research, not just psychology.  If the key hypothesis fails to demonstrate its effect, then the research is either not published or the authors never even submit it for publication.  As a result, medical journals give a very biased estimate of the treatment effect (effect size).

Let us assume the Frequentist (a school of statistics) vantage point.  Imagine an infinite number of studies on the effect of Treatment X.  Frequentists believe that there is a single true (population) treatment difference, δ, and an infinite number of possible replications.  [Note: the other school of statistics, the Bayesians, believe there is one replication, but that the true value δ has an infinite range of possibilities.]  If the researchers undersized the trial, or overestimated the effect size, then the results would be n.s. and would tend not to be published.  On the other hand, if they were lucky, the study met the p < 0.05 criterion.

Let me attempt to explain why this is the case.  Imagine that the true effect size is a set amount, say 0.3, and assume a fixed sample size of 30 per group (60 total); the fixed N/group just makes things easier to understand.  With an underlying standard deviation of 1.0, a true effect size of 0.3 is simply a true mean difference of 0.3, and across replications the observed effect size follows a bell curve centered at 0.3.  Half the observed replications will fall below 0.3, some far below; half will fall above 0.3, some far above.  With a total of 60 subjects, an alpha of 0.05 (two-sided), and a true effect size of 0.3, the study would be statistically significant only about 20% of the time.  If you were one of those 20% of lucky scientists and saw a p < 0.05, then you wouldn’t have seen an effect size of 0.3, but an effect size of 0.515 or greater.  Hence, any published study with an N/group of 30 would report an observed effect size of 0.515 or higher.  The true effect size would remain unknown unless one knew the number of unpublished/rejected studies.
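To make that arithmetic concrete, here is a minimal simulation sketch (Python with NumPy/SciPy).  The sample size, the effect size, and the “publish only if significant and in the expected direction” rule are my assumptions for illustration, chosen to match the numbers above.  It repeatedly runs the 30-per-group trial, keeps only the replications that reach p < 0.05, and reports how often that happens and which effect sizes survive the filter.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, true_d, alpha, n_reps = 30, 0.3, 0.05, 100_000

published = []  # observed effect sizes of the replications that "got published"
for _ in range(n_reps):
    treatment = rng.normal(true_d, 1.0, n_per_group)  # sd = 1.0, so the mean difference equals the effect size
    control = rng.normal(0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    if p < alpha and treatment.mean() > control.mean():  # keep only significant results in the expected direction
        pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
        published.append((treatment.mean() - control.mean()) / pooled_sd)

print(f"share of replications reaching p < 0.05: {len(published) / n_reps:.2f}")  # about 0.20
print(f"smallest 'published' effect size:        {min(published):.2f}")           # about 0.52
print(f"mean 'published' effect size:            {np.mean(published):.2f}")       # well above the true 0.30

In this sketch the mean of the “published” effect sizes comes out at roughly double the true 0.3, which matches the pattern the replication project reported.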

Let me say this again.  When n.s. scientific papers are not submitted by the scientists or are rejected by the journal editors, the observed effect size will drastically overestimate the truth; put another way, the true effect size is much, much lower than that seen in the literature.  THE PUBLISHED EFFECT SIZE IS A VERY BIASED ESTIMATE OF THE TRUE EFFECT SIZE.

One can estimate the true effect size, but this involves knowing the percentage of unpublished papers (a rough sketch of that arithmetic follows below).  Probably the best approach would be to get your hands on every single paper written by your competitors.  Yeah, r-i-g-h-t!
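As a hedged sketch of what that estimate would look like if you somehow did know it: the fraction of all attempted studies (published plus unpublished) that reached significance is just the power of the design, so you can invert the power function to back out the true effect size.  The 30-per-group design and the hypothetical 20% significance rate below are assumptions carried over from the example above.

import numpy as np
from scipy import stats, optimize

n_per_group, alpha = 30, 0.05
df = 2 * n_per_group - 2
t_crit = stats.t.ppf(1 - alpha / 2, df)

def power(d):
    # probability that a two-sample t-test reaches p < alpha when the true effect size is d
    ncp = d * np.sqrt(n_per_group / 2)
    return 1 - stats.nct.cdf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

significance_rate = 0.20  # hypothetical: the share of ALL attempted studies that reached p < 0.05
true_d = optimize.brentq(lambda d: power(d) - significance_rate, 1e-6, 2.0)
print(f"implied true effect size: {true_d:.2f}")  # about 0.3, far below the 0.515+ seen in print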

Conclusion:  When trying to determine the sample size for a new trial (i.e., doing a power analysis), do not take the effect size from published papers unless you know that all studies, significant or not, were published/accepted for publication.  It is still an excellent idea to use a pilot (e.g., Phase II) study, as long as you don’t cherry-pick the best result.
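To see why this matters for planning, here is a small sketch using the statsmodels power routines (my choice of library; the two candidate effect sizes are carried over from the example above).  It compares the sample size you would choose if you trusted the inflated published effect size against the one you would need for the plausible true effect.

import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.515, 0.30):  # the inflated "published" effect size vs. a plausible true one
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80, alternative='two-sided')
    print(f"effect size {d:.3f}: {math.ceil(n)} subjects per group for 80% power")

Sizing the trial from the published figure gives roughly a third of the subjects you actually need; if the true effect is nearer 0.3, the new study will run at well below the nominal 80% power, and the cycle described above repeats.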
