19. A Reconsideration of My Biases

No virtual observations were harmed in the running of this study.

A man with one watch knows what time it is.  A man with two is not sure, at least for the Wilcoxon test.

And don’t try this at home, we’re what you would call ‘experts’.

I admit it, I have biases.

One bias, seen in my first blog, is “the near sacred p-value (i.e., p < 0.05) indicating our ability to reject the null hypothesis.  As it is theoretically false, believed by all to be false, and practically false, all statisticians I’ve ever talked to believe that the p-value is a near meaningless concept.”  I haven’t changed my mind about that one.  See my second blog as to why I still (always) do it.

Another bias I haven’t changed my mind about is that the purpose of a study is to see which specific values of a treatment difference are credible.  My third and fourth blogs dealt with computing the confidence interval (CI) of the raw data or the CI for the effect size.  Zero on the low end is just a single value; we also need to know the other low values and the upper end as well.

One of my strong biases is the avoidance of non-parametric (np) tests, except for supportive analyses.  Yes, this bias was based on knowledge and good experimental design: the problems np tests address (e.g., non-normality, heteroscedasticity) are controllable by reasonable clinical design, namely Ngroup > 10 and equal Ns per group (see blog 7. Assumptions of Statistical Tests and below).  So I avoided np tests.  I also observed that they only provided p-values, and that only the median was reported alongside these p-values.  As I pointed out previously, the median is NOT something used in computing the Wilcoxon/Mann-Whitney test.  These np tests compare mean ranks, and NOBODY reports mean ranks.

Recently a colleague, Catherine Beal, introduced me to an analogue of the 95% CI used in parametric testing, a Hodges-Lehmann (HL) estimator.  Statistics had moved on!  HL estimators were only recently made available in SAS, the statistical analysis language used in the industry, and only for the Wilcoxon test.  The HL estimator provides the CI and the midpoint of this interval.  This satisfied some of my theoretical objections to np testing.
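For readers curious about what is under the hood, here is a minimal sketch in Python rather than the SAS the industry actually uses; the function name, the made-up data, and the normal-approximation endpoints are my own illustration, not the blog’s or SAS’s code.  The HL point estimate is the median of all pairwise differences between the two groups, and the distribution-free CI endpoints are order statistics of those same pairwise differences.  (In SAS these quantities come out of PROC NPAR1WAY’s Hodges-Lehmann estimation.)

```python
import numpy as np
from scipy import stats

def hodges_lehmann_ci(x, y, alpha=0.05):
    """HL shift estimate for (x - y) and an approximate distribution-free CI,
    using the normal approximation to the Mann-Whitney critical value."""
    x, y = np.asarray(x), np.asarray(y)
    m, n = len(x), len(y)
    diffs = np.sort((x[:, None] - y[None, :]).ravel())   # all m*n pairwise differences
    estimate = np.median(diffs)                          # HL point estimate
    z = stats.norm.ppf(1 - alpha / 2)
    k = int(np.floor(m * n / 2 - z * np.sqrt(m * n * (m + n + 1) / 12)))
    k = max(k, 0)                                        # guard for very small samples
    return estimate, (diffs[k], diffs[m * n - k - 1])

# Illustrative use with made-up data
rng = np.random.default_rng(1)
active = rng.normal(1.0, 1.0, 25)
control = rng.normal(0.0, 1.0, 25)
est, (lo, hi) = hodges_lehmann_ci(active, control)
print(f"HL estimate = {est:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```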

On the other hand, how well do the HL estimators compare to the 95% CI on the means?  Has anyone examined their relative efficiency?  In a recent blog I mentioned that statisticians can compare different approaches by simulating data.  One such approach is the Monte Carlo study.  In it, one draws a large number of random samples from a known population and sees how well two or more approaches compare.

I did such a study.  You are learning about it here first!  I generated 10,000 virtual samples for 4 distributions: a normal distribution, a rectangular distribution, a ‘typical’ non-normal distribution, and a population with 2% outliers (98% of the sample had a s.d. of 0.93 and 2% had a s.d. 3 times larger, 2.78).  The ‘typical’ non-normal distribution was based on a suggestion by Pearson and Please, who recommended not using an extreme population, but one with only a moderate skew and kurtosis (two measures of non-normality).  I used a skew of 0.75 and a kurtosis of 0.50.

I ran this simulation assuming the Ngroup was 6 (i.e., small), 51 (i.e., moderate), and 501 (i.e., large), in other words, with 10, 100, and 1,000 degrees of freedom.  With the 3 Ngroups and the 4 distributions, 120,000 pairs of sample means for the ‘active’ and ‘control’ groups were drawn.  In other words, more than 500 million ‘subjects’ were generated for this study.  And don’t try this at home, we’re what you would call ‘experts’.
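I cannot reproduce the SAS program here, but the logic of one cell of such a Monte Carlo run is easy to sketch.  Below is a hedged Python illustration (the shift of 0.56, the Ngroup of 51, and the seed are arbitrary example values, not necessarily the ones used in the study): it draws many virtual trials from the 2%-outlier population described above, analyzes each with both a t-test and a Wilcoxon (Mann-Whitney) test, and tallies the empirical power of each.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
n_sims, n_group, shift, alpha = 10_000, 51, 0.56, 0.05   # illustrative values only

def draw(n):
    """One group: a 98%/2% normal mixture mimicking the 'outlier' population."""
    sd = np.where(rng.random(n) < 0.98, 0.93, 2.78)
    return rng.normal(0.0, sd)

t_hits = w_hits = 0
for _ in range(n_sims):
    control = draw(n_group)
    active = draw(n_group) + shift               # true treatment effect added
    t_p = stats.ttest_ind(active, control).pvalue
    w_p = stats.mannwhitneyu(active, control, alternative="two-sided").pvalue
    t_hits += t_p < alpha
    w_hits += w_p < alpha

print(f"t-test power   : {t_hits / n_sims:.1%}")
print(f"Wilcoxon power : {w_hits / n_sims:.1%}")
```

Running the other distributions and Ngroups through the same kind of loop produces the sort of table shown further below.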

First, let me report one major disappointment.  Sometimes the HL CI said the results were not statistically significant when the Wilcoxon test said they were.  For example, when Ngroup was small (6), a good number of times (e.g., 8.16% for the normal distribution) when the Wilcoxon was just barely statistically significant (p = 0.045), the HL estimator’s confidence interval included zero.  The same occurred in the moderate (Ngroup = 51) cases, but less frequently (e.g., 0.21% for the normal distribution), again when the Wilcoxon was just barely statistically significant (p = 0.04988).  In other words, the p-value indicated statistical significance, but the HL estimator said the result wasn’t significant.  Of course, the t‑test p-value and its CI were consistent with one another in all 10,000 samples of the 4 distributions and 3 levels of Ngroup.  The SAS consultant confirmed my observations and told me of a 2011 talk by Riji Yao et al., who concluded that “the results from the three statistics [p-value, HL estimator and medians – AF] are not entirely consistent.”

Second, let me present the empirical power of both tests.  In all cases it should be about 80%, as I chose a different effect size for each Ngroup so that the nominal power was always 80%.

Power of the study (%)

|              | Ngroup = 6         | Ngroup = 51        | Ngroup = 501       |
| Distribution | t-test | Wilcoxon  | t-test | Wilcoxon  | t-test | Wilcoxon  |
| ------------ | ------ | --------- | ------ | --------- | ------ | --------- |
| Normal       | 79.30  | 74.06     | 80.60  | 78.69     | 80.82  | 78.91     |
| Outlier      | 81.13  | 76.26     | 79.99  | 82.02     | 80.04  | 81.99     |
| Rectangular  | 79.83  | 70.23     | 79.89  | 74.97     | 79.07  | 77.44     |
| ‘Typical’    | 79.96  | 74.43     | 79.99  | 81.77     | 79.60  | 81.79     |

Two major observations can be made about the power.  First, when Ngroup is small, the t-test, which had approximately the nominal 80% power, has greater power than the Wilcoxon for all distributions.  That is, when the data were normal, the Wilcoxon had 5.9% lower power than the nominal 80%.  For the outlier, rectangular, and ‘typical’ distributions, it was under-powered by 3.7%, 9.8%, and 5.6%, respectively.  Second, when Ngroup is moderate or large, if the data truly are normal, the Wilcoxon test has power almost as good as the t-test (1.3% to 1.9% lower).  If the data were rectangularly distributed, the Wilcoxon power was also lower than the t-test’s even at the larger sample sizes.  However, for the ‘typical’ non-normal or the outlier distributions at moderate and large sample sizes, the Wilcoxon had about 2% better power.  In other words, for heavy-tailed distributions [leptokurtic in statisticianese], a roughly 2% power benefit would be gained by using the Wilcoxon test.

It should be pointed out that one NEVER powers a study assuming non-normality.  In practice, we can only power studies assuming normality and then ‘adjust’ (increase) the N for np analyses.  Siegel’s (1956) book on non-parametrics said the Wilcoxon test had 95% of the power of the t-test, a rather good estimate given the above results.  Other books dedicated solely to np analyses (e.g., Sprent [1990] or Daniel [1990]) had poorer practical suggestions.  So for small studies with an unknown distribution, I would recommend increasing the nominal power to 90%.
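As a concrete illustration of “power assuming normality, then adjust upward,” here is a small sketch using statsmodels’ t-test power solver; the effect size of 0.5 is an assumed example value, and the 90% target is simply the recommendation above.

```python
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
effect_size = 0.5   # assumed standardized effect (Cohen's d), for illustration only

# N per group for the usual 80% power, assuming normality
n_80 = power_calc.solve_power(effect_size=effect_size, power=0.80, alpha=0.05)

# N per group if we 'over-power' to 90% to protect a planned non-parametric analysis
n_90 = power_calc.solve_power(effect_size=effect_size, power=0.90, alpha=0.05)

print(f"N/group at 80% power: {n_80:.0f}")   # ~64; round up in practice
print(f"N/group at 90% power: {n_90:.0f}")   # ~85; round up in practice
```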

Third, all things considered, one would like ‘tight’ (or narrow) confidence intervals.  This is the primary reason one uses a large N: it makes the CI narrow.  An approach which produces narrower CIs is more efficient.  I took the ratio of the width of the HL CI relative to the width of the t-test CI.  A ratio of 1 indicates equality, a ratio greater than 1 indicates that the t-test is more efficient, and a ratio less than 1 indicates that the HL is more efficient.  A sketch of this width-ratio computation is shown next.
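To make the ratio concrete, here is a minimal, self-contained sketch for a single pair of samples (made-up normal data; the HL interval uses the same pairwise-difference construction as the earlier snippet).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x, y = rng.normal(1.0, 1.0, 51), rng.normal(0.0, 1.0, 51)  # illustrative 'active' vs 'control'
m, n = len(x), len(y)

# Width of the usual 95% CI on the difference of means (pooled t)
sp = np.sqrt(((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2))
t_width = 2 * stats.t.ppf(0.975, m + n - 2) * sp * np.sqrt(1 / m + 1 / n)

# Width of the distribution-free HL 95% CI (normal-approximation endpoints)
diffs = np.sort((x[:, None] - y[None, :]).ravel())
k = int(np.floor(m * n / 2 - stats.norm.ppf(0.975) * np.sqrt(m * n * (m + n + 1) / 12)))
hl_width = diffs[m * n - k - 1] - diffs[k]

# For normal data this ratio averages around 1.03 over many replications (see table below)
print(f"HL width / t width = {hl_width / t_width:.3f}")
```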

The ratios of the HL to the t-test CI widths from the simulation are presented below:

HL CI width / t-test CI width

| Distribution | Ngroup = 6 | Ngroup = 51 | Ngroup = 501 |
| ------------ | ---------- | ----------- | ------------ |
| Normal       | 1.2101     | 1.0296      | 1.0235       |
| Outlier      | 1.2137     | 0.9863      | 0.9727       |
| Rectangular  | 1.2211     | 1.0614      | 1.0211       |
| ‘Typical’    | 1.2175     | 0.9853      | 0.9719       |

A similar set of observations can be made.  First, when Ngroup is small, the t-test has over 21% better efficiency.  This is similar to the power results above.  Second, when Ngroup is moderate or large, if the data truly are normal, the t-test has slightly better efficiency (3% and 2%) than the HL estimator.  The rectangular distribution also favored the t-test (6% and 2% for the moderate and large Ns, respectively).  The heavy-tailed ‘typical’ non-normal and outlier distributions showed slightly better efficiency for the HL estimators at moderate and large Ns, about 1.5% and 2.7%, respectively.

Finally, one assumption of the t-test is that the distribution of the means is normal.  However, by the central limit theorem, as Ngroup increases, the distribution of the means becomes much more normal even when the original data are not.  How normal was the distribution of the difference between the means?  Well, I examined the 10,000 simulated mean differences per Ngroup and distribution.  We can examine their skew and kurtosis and test whether the mean differences are non-normal (I used the Anderson-Darling test p-value).
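Here is a rough sketch of that check (Python rather than SAS; the skew-normal population with shape a = 5 is only a stand-in for the Pearson-and-Please population, and statsmodels’ normal_ad stands in for SAS’s Anderson-Darling p-value).

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import normal_ad

rng = np.random.default_rng(7)
n_sims, n_group = 10_000, 6          # small groups, as in the simulation

# Simulated mean differences when the raw data are clearly skewed
# (skew-normal, shape a=5, used purely as an illustrative non-normal population)
pop = stats.skewnorm(a=5)
diffs = np.array([
    pop.rvs(n_group, random_state=rng).mean() - pop.rvs(n_group, random_state=rng).mean()
    for _ in range(n_sims)
])

print(f"skew     = {stats.skew(diffs):.2f}")        # near 0 despite skewed raw data
print(f"kurtosis = {stats.kurtosis(diffs):.2f}")    # excess kurtosis, near 0
print(f"AD p     = {normal_ad(diffs)[1]:.2f}")      # Anderson-Darling normality p-value
```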

Skew, Kurtosis, and test of normality p-value

|              | Ngroup = 6                 | Ngroup = 51                | Ngroup = 501               |
| Distribution | Skew  | Kurtosis | p-value | Skew  | Kurtosis | p-value | Skew  | Kurtosis | p-value |
| ------------ | ----- | -------- | ------- | ----- | -------- | ------- | ----- | -------- | ------- |
| Normal       | -0.02 |  0.00    |  0.07   |  0.02 | -0.01    | >0.25   |  0.00 |  0.06    | >0.25   |
| Outlier      |  0.01 |  0.06    | >0.25   |  0.02 |  0.05    | >0.25   | -0.06 | -0.01    |  0.22   |
| Rectangular  |  0.01 | -0.11    | >0.25   | -0.02 | -0.05    | >0.25   |  0.00 |  0.02    | >0.25   |
| ‘Typical’    | -0.01 |  0.02    | >0.25   | -0.16 | -0.04    |  0.13   | -0.00 | -0.06    | >0.25   |

It can be seen that even when there were only 6 observations per group (i.e., Ntotal was 12), the skew and kurtosis were very close to zero for all distributions.  For all distributions, despite having 10,000 simulated mean differences, no normality p-value indicated that the means were anything but normally distributed.  [Yes, if a trillion observations were used it would become statistically significant, but the skew and kurtosis of these distributions would still be ‘clinically’ non-significant.]

Summary:  In this statistical study,

  1. The Hodges-Lehmann CI occasionally was non-significant when the p-value was significant.  This occurred most often when N was small: the non-parametric test indicated statistical significance, but the CI indicated the results were not statistically significant.  If that occurs with any study’s data, Yao et al. and the SAS consultant suggested using exact or bootstrap methods and seeing if that solves the problem.  [Note: one wouldn’t know in advance whether an exact p-value and CI would both turn out non-significant.]  Of course, this is beyond what could be included in any protocol or SAP.  It is unclear how the Agency would respond to ignoring the results of a non-significant analysis and then re-selecting the test in order to reject the null hypothesis.  This is likely to ‘red flag’ any study.
  2. The power of the non-parametric test was lower than that of the t-test when N was small.  If a user wanted to rely on np testing and its CI, I would recommend increasing the planned power of a small study to 90% to help ensure 80% actual power (or using a t-test).  For a moderate or large study, there wasn’t much difference between the two tests.
  3. The efficiency of the HL CI was about 21% worse than that of the mean’s CI when N was small.  However, when N was moderate or large, much smaller differences were seen.
  4. The normality assumption of the t-test is unnecessary when sample sizes are ‘as large as’ 6 per group.  The distribution of means is virtually normal for most pilot studies, regardless of the original distribution.

Conclusion: I will continue to suggest that the t-test (or ANOVA) should be the primary test used.  This is especially true when the sample size is small; when there are more than two treatment groups, a multi-factor analysis, covariates, or stratification; when one wants to determine the sample size for the trial or design future trials; or when one has the ability to design the trial using sound methodology.  Whew, that was a lot of ‘or’s.  I should note that I have never seen a moderate or large study that did not include multiple factors, strata, or covariates.  Never.

Is non-parametric testing next to useless, as I suggested in my tenth blog?  Not anymore, as confidence intervals are now possible.  However, np testing still focuses on trivially simple analyses (e.g., 2 groups with no other factors, strata, or covariates), still lacks a methodology (power analysis) for designing np studies, and addresses a non-normality problem that can be avoided by either N/group > 5 or by transforming/cleaning the data.  Would I suggest np analyses for a key analysis?  NO.  For almost all cases, I would still strongly recommend the more powerful and more bullet-proof t-test (ANOVA).  I would still suggest presenting non-parametric statistics as a supplemental analysis.


2 Responses to 19. A Reconsideration of My Biases

  1. Peter Flom says:

    Hi Allen

    I absolutely agree with you about p-values.

    On non-parametric tests, I disagree somewhat, but based on our different substantive fields. I mostly work with non-experimental data. Often, the assumptions of (e.g.) OLS regression are grossly violated. There are various remedies, one of which is to go with something non-parametric. On the other hand, this does sacrifice both some simplicity and some interpretability. Sometimes transforming a variable can be good. As Cox said, “there are no routine statistical questions, only questionable statistical routines”.

    • I think you might be talking about the equal interval (interval level data) assumption for regression. For that I might agree with you. For the non-statisticians, the ordinal vs. interval question becomes “Is the difference between a ‘4. Severe’ and ‘3. Moderate’ the same as between ‘2. Mild’ and ‘3. Moderate’?” My main issue is that I believe you’d be sacrificing a great deal of simplicity and a great deal of interpretability by using ordinal regression. I once asked my first Post-PhD mentor, “Why don’t we use multivariate statistical methods to deal with multiple dependent variables?” He sagely said that few of our ‘clients’ would understand it. The same would apply to ordinal regression. I might also quibble about the ‘simplicity’ of doing ordinal regression. Even the simplest regression analysis, done as an ordinal regression, might take four times as much labor.

      Luckily, I do very few non-experimental regression analyses. I personally have fewer problems with the interval-level-data assumption when dealing with means and mean differences. In that case we could use the Wilcoxon test and Hodges-Lehmann estimators. If we could rescale the ordinal parameter to a perfectly equal-interval scale, I would expect it to have only minor effects on the noise of the study (i.e., error variance). Hence the rarity of such rescaling in the applied world. But I may be a bit too cavalier. Perhaps it would have a large effect. I would love to see a study examining the effect of non-equal intervals on means using real subjective scales.

      [Post-comment note: The issue of ordinal data was answered in my blog ‘7a: Assumptions of Statistical Tests: Ordinal Data’. In it I cited two statistical studies, demonstrating that ANOVA (t-tests) still controls the alpha level.]
