# 1.A Another view on testing by Peter Flom, PhD

The following was written by Peter Flom, PhD, on November 4, 2009, as a review of the book Statistics as Principled Argument by Robert Abelson. His website is http://www.statisticalanalysisconsulting.com/ and his blog is http://www.statisticalanalysisconsulting.com/blog/. I changed the publication date from 12July2012 to 1Oct2011 so that it appears after '1. Statistic's Dirty Little Secret'.

Today, I’ll look at how to make and evaluate a good statistical argument. I’m going to base this on the absolutely wonderful book: Statistics as Principled Argument by Robert Abelson.  It’s an easy read, and I urge those interested in this stuff to go buy a copy.

The book makes the point of the title: Statistics should be presented as part of a principled argument. You are trying to make a case, and your argument will be better if it meets certain criteria; but which criteria are the right ones?

In Statistics as Principled Argument, Abelson lists five criteria by which to judge a statistical argument. He calls them the MAGIC criteria:

1. Magnitude: How big is the effect?
2. Articulation: How precisely stated is it?
3. Generality: How widely does it apply?
4. Interestingness: How interesting is it?
5. Credibility: How believable is it?

We can tell how big an effect is through various measures of effect size. I may get into some of these in a later article, but some of the common ones are correlation coefficients, the difference between two means, and regression coefficients. Big effects are impressive; small effects are not. How big is big depends on context and on what we already know. If we find, for example, that a new diet plan lets people lose (on average) 10 pounds in a month, that's pretty big; 10 ounces in a month is pretty small. But if it were a diet tested on rats, 10 ounces might be a lot.
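To make the magnitude idea concrete, here is a minimal sketch of two of the effect-size measures mentioned above: the raw difference between two means, and a standardized version of it (Cohen's d). The diet-plan scenario and all the numbers are invented purely for illustration; they are not from Abelson or Flom.

```python
import statistics

# Hypothetical data: pounds lost in a month under a new diet plan vs. a control group.
diet = [12.1, 9.5, 11.0, 8.7, 10.4, 9.9]
control = [2.3, 1.1, 0.4, 1.8, 0.9, 1.5]

# Raw effect size: the difference between the two group means.
mean_diff = statistics.mean(diet) - statistics.mean(control)

# Standardized effect size (Cohen's d): the mean difference divided by the
# pooled standard deviation, so the answer is in standard-deviation units.
n1, n2 = len(diet), len(control)
s1, s2 = statistics.stdev(diet), statistics.stdev(control)
pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
cohens_d = mean_diff / pooled_sd

print(f"mean difference: {mean_diff:.1f} pounds")
print(f"Cohen's d: {cohens_d:.1f}")
```

The raw difference keeps the original units (pounds), which is why context matters so much; Cohen's d removes the units, which is one common way to compare "how big is big" across different studies.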

Articulation is measured in what Abelson calls Ticks and Buts. A ‘tick’ is a statement, and a ‘but’ is an exception. The more ticks the better, the fewer buts the better. There are also blobs, which are masses of undifferentiated results. Blobs are, as you might have guessed, bad.

Generality refers to how general an effect is. Does it apply to all humans everywhere? That would be very general. Or does it apply only to left handed people who have posted 50 or more articles on AC? That would be pretty specific. Usually, more general effects are of greater value than more specific ones, but you should be sure that the study states how general it is.

Interestingness is very hard to measure precisely, but one way is to ask how different the reported effect size is from what we thought it would be. For example, I once read a study showing that Black people, on average, earn less than Whites. Upsetting, but not interesting: I knew that already, and the size of the difference was large (as I expected) but not huge (which I also knew, because, after all, even the average White person doesn't earn all that much). But then the study went on to say that, while Black men earned a lot less than White men (a bigger difference than I expected), Black women and White women earned almost the same. That's really interesting! I would have thought that Black women earned much less than White women.

Finally, credibility. The harder a result is to believe, the more stringent you have to be about the evidence supporting it. Extraordinary claims require extraordinary evidence.
