This blog was written on April 23, 2017, but was ‘published’ on October 2, 2011, so it appears after blog ‘1.A Another view on testing’ by Peter Flom, PhD.

I wrote my first blog, ‘1. Statistic’s dirty little secret’, in September of 2011, a few years ago. You, my gentle reader, might have felt I was a bit of a reactionary or, to quote former Vice President Spiro Agnew about the press, a member of the ‘nattering nabobs of negativism’, one of the ‘hopeless, hysterical hypochondriacs of history’. As this had been my first blog, you might have felt I was a disgruntled statistician, unrepresentative of my field, alone, an extreme maverick, a heretical hermit spouting doomsday, a fool. It is now April of 2017, over five years later. Have I changed my mind? Not in the slightest. Am I alone? Ahh, NO!

I just put down the latest copy (April 2017) of the American Statistical Association’s (ASA) and the Royal Statistical Society’s (RSS) joint journal, Significance. A major article was by Robert Matthews entitled ‘The ASA’s p-value statement, one year on’. Matthews writes that he noted a major problem with ‘statistically significant’ results: there is a replication crisis. Statistically significant results frequently cannot be replicated; scientists who repeat a study often fail to get significant results. “In nutritional studies and epidemiology in particular, the flip-flopping of findings was striking. … [T]he same flip-flopping began appearing in large randomised controlled trials of life-saving drugs.”

He reported that over a year ago “the American Statistical Association (ASA) took the unprecedented step of issuing a public warning about a statistical method … the p-value.” He went on to say it “was damaging science, harming people – and even causing avoidable deaths.” The ASA statement was published in 2016 as ‘The ASA’s statement on p-values: Context, process, and purpose’, American Statistician, 70 (2), 129-133. Matthews writes that “the ASA’s then-president Jessica Utts pointed out what all statisticians know: that calls for action over the misuse of p-values have been made many times before. As she put it, ‘statisticians and other scientists have been writing on the topic for decades’.”

Matthews laments that there has been little to no change in the use of p-values. “Claims are backed by the sine qua non of statistical significance ‘p < 0.05’, plus a smattering of the usual symptoms of statistical cluelessness like ‘p = 0.00315’ and ‘p < 0.02’.”

There were two comments on Matthews’ article.

The first was by Ron Wasserstein, the executive director of the ASA. He begins: “We concede. There is no single, perfect way to turn data into insight! The only surprise is that anyone believes there is!” “Thus, the leadership of the ASA was keen to join in the battle that Robert Matthews describes and that he and many, many others have long fought … because it is a battle that must be won.” “Matthews was right about a lack of consensus among statisticians about how best to navigate in the post p < 0.05 era.”

The second commentator was David Spiegelhalter, the president of the RSS. He begins: “I have a confession to make. I like p-values.” Dr. Spiegelhalter emphasizes that the fault lies primarily with bad science, not statistics. “[M]any point out that the problem lies not so much with p-values in themselves as with the willingness of researchers to lurch casually from descriptions of data taken from poorly designed studies, to confident generalisable inferences.” He adds, “p-values are just too familiar and useful to ditch (even if it were possible).” I made the same point in my second blog, ‘2. Why do we compute p-values?’

He then goes on to suggest three things: 1a) When dealing with data descriptions (e.g., exploratory results or secondary and tertiary results) “it may be fine to litter a results section with exploratory p-values, but these should not appear in the conclusions or abstract unless clearly labeled as such”. 1b) “I believe that drawing unjustified conclusions based on selected exploratory p-values should be considered as scientific misconduct and lead to retraction or correction of papers.” 2) “A p-value should only be considered part of a confirmatory analysis, … if the analysis has been pre-specified, all results reported, and p-values adjusted for multiple comparisons, and so on.”
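To make the phrase “adjusted for multiple comparisons” concrete, here is a minimal sketch of one common adjustment, the Holm step-down procedure. The choice of Holm is an assumption for illustration (Dr. Spiegelhalter does not name a method), and the four p-values are made up.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by
        # (m - k + 1) and capped at 1. The running maximum keeps the
        # adjusted values in the same order as the raw ones.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values from four exploratory comparisons.
raw = [0.010, 0.040, 0.030, 0.005]
adjusted = holm_adjust(raw)

for p_raw, p_adj in zip(raw, adjusted):
    print(f"raw p = {p_raw:.3f}   Holm-adjusted p = {p_adj:.3f}")
```

With these invented numbers all four raw p-values fall below 0.05, but only two remain below 0.05 after adjustment, which is the kind of difference that pre-specification and adjustment are meant to expose.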

I agree that presenting p-values is unavoidable. I always gave my clients p-values in my reports. I agree with his dichotomization of exploratory and confirmatory analyses. Exploratory ‘significant’ results should never be included in conclusion/abstract sections. However, it is the job of statisticians to focus clients on better approaches. The emphasis should never be on p < 0.05, but on a mathematical statement of what the results are, primarily the confidence intervals using metrics understandable by the client and his clients (see blogs 3 and 4).
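As a rough illustration of reporting an effect with its confidence interval rather than a bare p-value, the sketch below compares two group means from summary statistics. It assumes a large-sample normal approximation (a t-based interval would be usual for small samples), and the function name `diff_ci` and the blood pressure numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Normal-approximation CI and two-sided p-value for mean_a - mean_b."""
    diff = mean_a - mean_b
    # Standard error of the difference between two independent means.
    se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Hypothetical summary data: mean reduction in systolic blood pressure
# (mmHg), standard deviation, and sample size for treatment and control.
diff, (low, high), p = diff_ci(12.0, 10.0, 200, 9.0, 10.0, 200)
print(f"Treatment lowered pressure by {diff:.1f} mmHg more than control "
      f"(95% CI {low:.1f} to {high:.1f} mmHg); p = {p:.4f}")
```

The interval is stated in the client's own units (mmHg), so the reader can judge whether the plausible range of effects matters in practice, something the p-value alone cannot convey.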

Dr. Spiegelhalter gave as an example of such misconduct a case where a colleague was “confronted by a doctor at 4 PM on a Friday with ‘Could you just ‘t and p’ this data by Monday?’” He lamented the rise of automated statistical programming, bypassing trained statisticians. He concluded “We must do our best to help them.”

In sum, my conclusion that the p-value is frequently an incorrect statistic to emphasize is supported by many, many statisticians and the major statistical associations.