No virtual observations were harmed in the running of this study.
A man with one watch knows what time it is. A man with two is not sure, at least for the Wilcoxon test.
And don’t try this at home, we’re what you would call ‘experts’.
I admit it, I have biases.
One bias, seen in my first blog, is “the near sacred p-value (i.e., p < 0.05) indicating our ability to reject the null hypothesis. As it is theoretically false, believed by all to be false, and practically false, all statisticians I’ve ever talked to believe that the p-value is a near meaningless concept.” I haven’t changed my mind about that one. See my second blog as to why I still (always) compute it.
Another bias I haven’t changed my mind about is that the purpose of a study is to see which specific values of a treatment difference are credible. My third and fourth blogs dealt with computing the confidence interval (CI) of the raw data or the CI for the effect size. Zero on the low end is just a single value; we also need to know the other low values, and the upper end as well.
One of my strong biases is the avoidance of nonparametric (np) tests, except for supportive analyses. Yes, this was based on knowledge and good experimental design: the problems np tests protect against (e.g., nonnormality, heteroscedasticity) are controllable by reasonable clinical design, namely N_group > 10 and equal Ns per group [see blog 7, Assumptions of Statistical Tests, and below]. So I avoided np tests. I also observed that they only provided p-values, and only the median was reported alongside these p-values. As I pointed out previously, the median is NOT something used in computing the Wilcoxon/Mann-Whitney test. These np tests compare the mean rank, and NOBODY reports mean ranks.
Recently a colleague, Catherine Beal, introduced me to an analogue of the 95% CI used in parametric testing: the Hodges-Lehmann (HL) estimator. Statistics had moved on! HL estimators were only recently made available in SAS, the statistical analysis language used in the industry, and only for the Wilcoxon test. The HL estimator provides the CI and the midpoint of this interval. This satisfied some of my theoretical objections to np testing.
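For readers who haven’t met it, the two-sample HL estimator is simply the median of all pairwise differences between the groups, with a distribution-free CI read off the ordered differences. Here is a minimal Python sketch (not the SAS PROC NPAR1WAY code the study used); the normal-approximation rank formula and the invented ‘active’/‘control’ data are my own illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def hodges_lehmann_shift(x, y, alpha=0.05):
    """HL point estimate and approximate (1 - alpha) CI for the shift x - y."""
    m, n = len(x), len(y)
    diffs = np.sort(np.subtract.outer(x, y).ravel())  # all m*n pairwise differences
    estimate = np.median(diffs)                       # HL point estimate
    # Rank of the lower endpoint, from the textbook normal approximation
    # to the Mann-Whitney null distribution.
    k = int(np.floor(m * n / 2 - norm.ppf(1 - alpha / 2)
                     * np.sqrt(m * n * (m + n + 1) / 12.0)))
    k = max(k, 0)
    return estimate, diffs[k], diffs[m * n - 1 - k]

rng = np.random.default_rng(1)
active = rng.normal(1.0, 1.0, 25)    # hypothetical 'active' group, true shift = 1
control = rng.normal(0.0, 1.0, 25)   # hypothetical 'control' group
est, lo, hi = hodges_lehmann_shift(active, control)
print(f"HL estimate {est:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Note that the estimate is the midpoint of the ordered differences, not the difference of group medians, which is why the medians usually reported with np tests can disagree with it.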
On the other hand, how well do the HL estimators compare to the 95% CI on the means? Has anyone examined their relative efficiency? In a recent blog I mentioned that statisticians can compare different approaches by simulating data. One such approach is the Monte Carlo study: one draws a large number of random samples and sees how well two or more approaches compare.
I did such a study. You are learning about it here first! I generated 10,000 virtual samples for each of 4 distributions: a normal distribution, a rectangular distribution, a ‘typical’ nonnormal distribution, and a population with 2% outliers (98% of the sample had a s.d. of 0.93 and 2% had a s.d. 3 times larger, 2.78). The ‘typical’ nonnormal distribution was based on a suggestion by Pearson and Please, who recommended not using an extreme population, but one with only moderate skew and kurtosis (two measures of nonnormality). I used a skew of 0.75 and a kurtosis of 0.50.
I ran this simulation assuming N_group was 6 (i.e., small), 51 (i.e., moderate), and 501 (i.e., large), in other words, with 10, 100, and 1,000 degrees of freedom. With the 3 N_group and the 4 distributions, 120,000 sample means each for the ‘active’ and ‘control’ groups were drawn. In other words, more than 500 million ‘subjects’ were generated for this study.
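To make the design concrete, here is a much-reduced sketch of one cell of such a Monte Carlo study in Python/SciPy (the study itself was run in SAS). The 2%-outlier mixture parameters follow the text; the number of replicates and the shift (chosen to give roughly 80% t-test power) are my own stand-ins, not the study’s values:

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(0)
reps, n, shift = 2000, 51, 0.56   # moderate N_group; shift ~ 80% t-test power

def outlier_sample(size):
    # 98% of observations have s.d. 0.93, 2% have s.d. 2.78 (per the text)
    sd = np.where(rng.random(size) < 0.98, 0.93, 2.78)
    return rng.normal(0.0, sd)

t_hits = w_hits = 0
for _ in range(reps):
    x = outlier_sample(n) + shift          # 'active' group
    y = outlier_sample(n)                  # 'control' group
    t_hits += ttest_ind(x, y).pvalue < 0.05
    w_hits += mannwhitneyu(x, y, alternative="two-sided").pvalue < 0.05

print(f"t-test power {t_hits/reps:.3f}, Wilcoxon power {w_hits/reps:.3f}")
```

Repeating this loop per distribution and per N_group, and also storing the two CIs from each replicate, yields the power and efficiency tables below.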
First, let me report one major disappointment. Sometimes the HL CI said the results were not statistically significant when the Wilcoxon test said they were. For example, when N_group was small (6), a good number of times (e.g., 8.16% for the normal distribution) the Wilcoxon was just barely (p = 0.045) statistically significant while the HL confidence interval included zero. The same occurred in the moderate (N_group = 51) cases, but less frequently (e.g., 0.21% for the normal distribution), again when the Wilcoxon was just barely (p = 0.04988) statistically significant. In other words, the p-value indicated statistical significance, but the HL estimator said the result wasn’t significant. Of course, the t-test p-value and its CI were consistent with one another in all 10,000 samples of the 4 distributions and 3 levels of N_group. The SAS consultant confirmed my observations and told me of a 2011 talk by Riji Yao et al., who concluded that “the results from the three statistics [p-value, HL estimator and medians – AF] are not entirely consistent.”
Second, let me present the empirical power of both tests. In all cases the nominal power should be 80%, as I ran the study with different effect sizes for the different N_group.
Power of the study (%), by N_group:

| Distribution | t-test (6) | Wilcoxon (6) | t-test (51) | Wilcoxon (51) | t-test (501) | Wilcoxon (501) |
|---|---|---|---|---|---|---|
| Normal | 79.30 | 74.06 | 80.60 | 78.69 | 80.82 | 78.91 |
| Outlier | 81.13 | 76.26 | 79.99 | 82.02 | 80.04 | 81.99 |
| Rectangular | 79.83 | 70.23 | 79.89 | 74.97 | 79.07 | 77.44 |
| ‘Typical’ | 79.96 | 74.43 | 79.99 | 81.77 | 79.60 | 81.79 |
Two major observations can be made about the power. First, when N_group is small, the t-test, which had approximately the nominal 80% power, has greater power than the Wilcoxon for all distributions. When the data were normal, the Wilcoxon had 5.9% lower power than the nominal 80%. For the outlier, rectangular, and ‘typical’ distributions, it was underpowered by 3.7%, 9.8%, and 5.6%, respectively. Second, when N_group is moderate or large, if the data truly are normal, the Wilcoxon test has power almost as good (with 1.3% to 1.9% lower power) as the t-test. If the data were rectangularly distributed, even at the larger sample sizes the Wilcoxon power was lower than the t-test’s. However, for the ‘typical’ nonnormal or the outlier distributions at moderate and large sample sizes, the Wilcoxon had about 2% better power. In other words, for tail-heavy distributions [leptokurtic, in statisticianese], a power benefit of less than about 2% would be gained by using the Wilcoxon test.
It should be pointed out that one NEVER powers a study assuming nonnormality. In practice, we can only power studies assuming normality and ‘adjust’ (increase) the N for np analyses. Siegel’s (1956) book on nonparametrics said the Wilcoxon test had 95% of the power of the t-test, a rather good estimate given the above results. Other books dedicated solely to np analyses (e.g., Sprent [1990] or Daniel [1990]) had poorer practical suggestions. So for small studies with an unknown distribution, I would recommend increasing the nominal power to 90%.
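The ‘adjust the N’ approach can be sketched in a few lines: size the trial for a t-test, then inflate the per-group N by Siegel’s ~95% relative-efficiency figure (3/π ≈ 0.955 is the asymptotic value under normality). The effect size here is a hypothetical choice, and I use the simple normal-approximation sample-size formula rather than any particular software’s routine:

```python
import math
from scipy.stats import norm

d = 0.5                   # hypothetical standardized effect size
alpha, power = 0.05, 0.80
# Normal-approximation N per group for a two-sided two-sample t-test:
# n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2
n_t = 2 * (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2 / d ** 2
# Inflate for the Wilcoxon using the ~0.955 relative efficiency
n_wilcoxon = math.ceil(n_t / 0.955)
print(f"t-test n/group ~ {math.ceil(n_t)}, Wilcoxon n/group ~ {n_wilcoxon}")
```

For small studies, where the tables above show the Wilcoxon losing far more than 5% power, this asymptotic inflation is too optimistic, which is why I recommend powering at 90% instead.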
Third, all things considered, one would like ‘tight’ (narrow) confidence intervals. This is the primary reason one uses a large N: it makes the CI narrow. An approach that produces narrower CIs is more efficient. I took the ratio of the width of the HL CI relative to the width of the t-test CI. A ratio of 1 indicates equality; a ratio greater than 1 indicates that the t-test is more efficient, and a ratio less than 1 indicates that the HL is more efficient.
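This width ratio is easy to compute for a single dataset; the table below reports its average over 10,000 samples. A self-contained Python sketch for one simulated normal dataset (my own data and seed, so a single draw will only be roughly similar to the tabled averages):

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(7)
x = rng.normal(0.5, 1.0, 51)   # hypothetical 'active' group, moderate N
y = rng.normal(0.0, 1.0, 51)   # hypothetical 'control' group
m, n = len(x), len(y)

# Width of the pooled-variance 95% t CI for the mean difference
sp2 = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)
t_width = 2 * t.ppf(0.975, m + n - 2) * np.sqrt(sp2 * (1 / m + 1 / n))

# Width of the 95% HL CI from the ordered pairwise differences
# (normal approximation to the Mann-Whitney null distribution)
diffs = np.sort(np.subtract.outer(x, y).ravel())
k = int(m * n / 2 - norm.ppf(0.975) * np.sqrt(m * n * (m + n + 1) / 12))
hl_width = diffs[m * n - 1 - k] - diffs[k]

ratio = hl_width / t_width   # > 1 means the t-test CI is tighter
print(f"width ratio HL/t = {ratio:.3f}")
```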
The ratio of HL CI width to t-test CI width is presented below, by N_group:

| Distribution | 6 | 51 | 501 |
|---|---|---|---|
| Normal | 1.2101 | 1.0296 | 1.0235 |
| Outlier | 1.2137 | 0.9863 | 0.9727 |
| Rectangular | 1.2211 | 1.0614 | 1.0211 |
| ‘Typical’ | 1.2175 | 0.9853 | 0.9719 |
A similar set of observations can be made. First, when N_group is small, the t-test has over 21% better efficiency. This is similar to the above results. Second, when N_group is moderate or large, if the data truly are normal, the t-test has slightly better (3% and 2%) efficiency than the Wilcoxon. The rectangular distribution also favored the t-test (6% and 2% for the moderate and large Ns, respectively). The heavy-tailed ‘typical’ nonnormal and outlier distributions had slightly better efficiency for the HL estimators given moderate and large Ns, about 1.5% and 2.7%, respectively.
Finally, one assumption of the t-test is that the distribution of means is normally distributed. By the central limit theorem, as N_group increases, the distribution of means from an originally nonnormal population becomes much more normal. How normal was the difference between the means? Well, I examined the 10,000 simulated mean differences per N and distribution. We can examine their distributions and test whether the mean differences are nonnormal (I used the Anderson-Darling test p-value).
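This check is easy to replicate. A reduced Python sketch for the rectangular (uniform) parent with only 6 observations per group (fewer replicates than the study’s 10,000; SciPy’s `anderson` reports the statistic and critical values rather than the SAS-style p-value):

```python
import numpy as np
from scipy.stats import skew, kurtosis, anderson

rng = np.random.default_rng(3)
reps, n = 5000, 6
half = np.sqrt(3.0)   # Uniform(-sqrt(3), sqrt(3)) has unit variance

# Simulated mean differences: mean of 6 'active' minus mean of 6 'control'
mean_diffs = (rng.uniform(-half, half, (reps, n)).mean(axis=1)
              - rng.uniform(-half, half, (reps, n)).mean(axis=1))

s, k = skew(mean_diffs), kurtosis(mean_diffs)   # k is excess kurtosis
ad = anderson(mean_diffs, dist="norm")          # Anderson-Darling normality test
print(f"skew {s:.3f}, excess kurtosis {k:.3f}, A-D statistic {ad.statistic:.3f}")
```

Even though each mean difference averages only 12 decidedly nonnormal observations, the skew and excess kurtosis come out near zero, which is the table’s point.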
Skew, kurtosis, and test-of-normality p-value, by N_group:

| Distribution | Skew (6) | Kurtosis (6) | p-value (6) | Skew (51) | Kurtosis (51) | p-value (51) | Skew (501) | Kurtosis (501) | p-value (501) |
|---|---|---|---|---|---|---|---|---|---|
| Normal | 0.02 | 0.00 | 0.07 | 0.02 | 0.01 | >0.25 | 0.00 | 0.06 | >0.25 |
| Outlier | 0.01 | 0.06 | >0.25 | 0.02 | 0.05 | >0.25 | 0.06 | 0.01 | 0.22 |
| Rectangular | 0.01 | 0.11 | >0.25 | 0.02 | 0.05 | >0.25 | 0.00 | 0.02 | >0.25 |
| ‘Typical’ | 0.01 | 0.02 | >0.25 | 0.16 | 0.04 | 0.13 | 0.00 | 0.06 | >0.25 |
It can be seen that even when only 6 observations were seen per group (an N_total of 12), the skew and kurtosis were very close to zero for all distributions. For all distributions, despite having 10,000 observations, no normality p-value indicated that the mean differences were anything but normally distributed. [Yes, if a trillion observations were used it would be statistically significant, but the skew and kurtosis of these distributions would still be ‘clinically’ nonsignificant.]
Summary: In this statistical study,
- The Hodges-Lehmann CI occasionally was nonsignificant when the p-value was significant. This most often occurred when N was small: the nonparametric test indicated statistical significance, but the CI indicated the results were not statistically significant. If that occurs with any study’s data, Yao et al. and the SAS consultant suggested using exact or bootstrap methods to see if that solves the problem. [Note: one wouldn’t know in advance whether an exact p-value and CI would both be nonsignificant.] Of course, this is beyond what could be included in any protocol or SAP. It is unclear how the Agency would respond to ignoring the results of a nonsignificant analysis and then reselecting the test in order to reject the null hypothesis. This is likely to ‘red flag’ any study.
- The power of the nonparametric test was lower than the t-test when N was small. If a user wanted to rely on np testing and its CI, I would recommend increasing the small-sample nominal power to 90% to ensure 80% power (or using a t-test). For a moderate or large sized study, there wasn’t much difference between them.
- The efficiency of the HL CI was about 21% worse than the mean’s CI when N was small. However, when N was moderate or large, much smaller differences were seen.
- The normality assumption of the t-test is unnecessary when sample sizes were ‘as large as’ 6 per group. A distribution of means is virtually normally distributed for most pilot studies, whatever the original distribution.
Conclusion: I will continue to suggest that the t-test (or ANOVA) should be the primary test used. This is especially true when the sample size is small, or there are more than two treatment groups, or a multifactor analysis is used, or covariates or stratification, or one wants to determine the sample size for the trial, or design future trials, or when one has the ability to design the trial using sound methodology. Whew, that was a lot of ‘or’s. I should note that I have never seen a moderate or large study that did not include multiple factors, strata, or covariates. Never.
Is nonparametric testing next to useless, as I suggested in my tenth blog? Not anymore, as confidence intervals are now possible. However, np testing still handles only trivially simple analyses (e.g., 2 groups with no other factors, strata, or covariates), it lacks a methodology (power analysis) for designing np analyses, and the nonnormality issue can be avoided anyway by N/group > 5 or by transforming/cleaning the data. Would I suggest np analyses for a key analysis? NO. For almost all cases, I would still strongly recommend the use of the more powerful and more bulletproof t-test (ANOVA). I would still suggest presenting nonparametric statistics as a supplemental analysis.
Hi Allen
I absolutely agree with you about p-values.
On nonparametric tests, I disagree somewhat, but based on our different substantive fields. I mostly work with nonexperimental data. Often, the assumptions of (e.g.) OLS regression are grossly violated. There are various remedies, one of which is to go with something nonparametric. On the other hand, this does sacrifice both some simplicity and some interpretability. Sometimes transforming a variable can be good. As Cox said, “there are no routine statistical questions, only questionable statistical routines.”
I think you might be talking about the equal-interval (interval-level data) assumption for regression. For that I might agree with you. For the nonstatisticians, the ordinal vs. interval question becomes: “Is the difference between ‘4. Severe’ and ‘3. Moderate’ the same as between ‘2. Mild’ and ‘3. Moderate’?” My main issue is that I believe you’d be sacrificing a great deal of simplicity and a great deal of interpretability by using ordinal regression. I once asked my first post-PhD mentor, “Why don’t we use multivariate statistical methods to deal with multiple dependent variables?” He sagely said that few of our ‘clients’ would understand it. The same would apply to ordinal regression. I might also quibble with the ‘simplicity’ of doing ordinal regression. Even the simplest regression analysis might take four times as much labor to do.
Luckily, I do very little nonexperimental regression analysis. I personally have fewer problems with the interval-level-data assumption when dealing with means and mean differences. In that case we could do the Wilcoxon and Hodges-Lehmann estimators. If we could rescale the ordinal parameter to a perfectly equal-interval scale, I would expect it to have only minor effects on the noise of the study (i.e., error variance). Hence its rarity in the applied world.
But I may be a bit too cavalier. Perhaps it would have a large effect. I would love to see a study examining the effect of nonequal intervals on means using real subjective scales. [Post-comment note: The issue of ordinal data was answered in my blog ‘7a: Assumptions of Statistical Tests: Ordinal Data’. In it I cited two statistical studies demonstrating that ANOVA (t-tests) still controls the alpha level.]