… ‘If you lost your watch in that dark alley, why are we looking here?’ ‘Well, <hic> there’s light here.’ (old chestnut)
In my last blog, I stated that we should avoid dichotomizing because it throws away a great deal of the information in the data. Specifically, it ignores the intervals in the data, precluding computing/presenting means, and it ignores the order of the data, precluding medians. The surprising thing is that we can compensate for this massive loss of information by increasing the sample size by only 60% (or up to four-fold or more, depending on assumptions).
I also stated that many non-parametric tests are basically the old t-tests or ANOVAs, but with the observed data replaced by their ranks and the ranks analyzed. So, if you had data like 0.01, 0.03, 0.07, 1.0, and 498.0 ng/ml, you would be analyzing the numbers 1, 2, 3, 4, and 5. The difference between a traditional parametric t-test on the ranks and the Mann-Whitney non-parametric test is that the non-parametric test knows the variability, while the t-test has to estimate it. That is, any 5 unique values of the data would have ranks of 1, 2, 3, 4, and 5. Hence, the variability of the ranks is known mathematically and doesn't need to be estimated. [Note: Ties can also be handled.]
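To make the rank replacement concrete, here is a minimal pure-Python sketch (the `ranks` helper is my own illustration, not from any particular package) that converts the ng/ml values above into the ranks the test actually analyzes, giving tied observations the average of the ranks they span (midranks):

```python
def ranks(values):
    """Replace each observation by its rank (1 = smallest).
    Tied values get the average of the ranks they span (midranks)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    r = [0.0] * n
    i = 0
    while i < n:
        j = i
        # group tied values together
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = midrank
        i = j + 1
    return r

print(ranks([0.01, 0.03, 0.07, 1.0, 498.0]))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Note that the wildly unequal spacing of the original concentrations vanishes: any five unique values produce the same ranks 1 through 5.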
Statistical approaches which assume only an ordering of the numbers have far less restrictive assumptions than those which assume the intervals between numbers are meaningful. The Mann-Whitney non-parametric test doesn't need to assume (or test) normality, nor equal variances (i.e., it tolerates heteroscedasticity). So, if a test were almost as good but relied on a less restrictive set of assumptions, then we should use the approach with the less restrictive assumptions. Shouldn't we?
As suggested in my last blog, the real question might be the effect on power. Actually, for rank or ordinal data the power doesn't meaningfully go down even when a parametric test would have been appropriate. My first non-parametric textbook, Non-parametric Statistics by Sidney Siegel, observed that the efficiency of the Mann-Whitney relative to the t-test is 95%, even in small sample sizes. So, when a t-test could have been run with 100 patients but the Mann-Whitney was used instead, it would need only about 5 more patients. Trivial!
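That arithmetic is worth making explicit: relative efficiency converts directly into sample size, since a test with efficiency e needs roughly n/e patients to match the power of a test needing n. A quick sketch (the function name is mine, and the 95% figure is the one quoted above):

```python
def n_for_rank_test(n_parametric, efficiency=0.95):
    """Approximate sample size a rank test needs to match the power
    of a parametric test that needs n_parametric patients: n / e."""
    return n_parametric / efficiency

needed = n_for_rank_test(100)
print(round(needed))  # 105 -- about 5 extra patients over the t-test's 100
```

Compare that with dichotomization, where the same logic applied to the figures in my last blog demands a 60% (or greater) increase.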
One loses a lot of power by dichotomizing the data, but almost none when one can order it. [Note: BTW, all methodologies will suffer when there is a marked number of ties (e.g., zeros). Of course, with dichotomization, all data will be ties!]
Yet throughout these blogs I have been promoting parametric testing, that is, presenting t-tests and ANOVAs. Why? The power is almost the same. The assumptions are less rigorous. Why?
The answer is easy. If you feel that the beginning and end of an analysis is a p-value, then non-parametric testing is best. Stop. Don’t do anything more. Many, many publications only present the non-parametric tests. By this time, dear reader, I hope you do know better. If not, start at my blog #1 and read the first four blogs again. I’ll wait.
[No, I mean it, re-read them! This blog isn’t going anywhere.]
From the beginning of these blogs, I have pointed out that p-values are not the beginning and end of the analysis. The most important thing you can get from a statistical analysis is not whether the two treatments are different (of course they're different!), but how much they differ.
Let me return to the Mann-Whitney (or Wilcoxon) test. As I mentioned above, we are computing the average rank for each treatment group and comparing them. In a recent analysis of three groups (each with 10 animals), the Wilcoxon test gave mean ranks of 18.0, 15.7, and 12.8. Is the mean rank ever useful? Sorry, no. I have NEVER, EVER included the mean rank in any statistical report. Never. Would it be reasonable to present mean differences and confidence intervals? Nope, that is not appropriate when we believe the data are ordinal, not interval. Can we report the medians? Yes, yes we can. Unfortunately, this statistical test does not use the medians, it does not analyze medians, it does not compare medians.
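The mismatch between what the test averages and what we can honestly report can be seen in a few lines. Here is a sketch with made-up numbers (not the animal data above): the test compares mean ranks computed over the pooled data, while the report would show medians.

```python
from statistics import median

# Two hypothetical, tie-free groups for illustration only.
a = [0.2, 0.5, 0.9]
b = [1.4, 2.0, 3.1]

# Rank the pooled data jointly (1 = smallest); no ties here, so a
# simple enumeration suffices.
pooled = sorted(a + b)
rank = {v: i + 1 for i, v in enumerate(pooled)}

mean_rank_a = sum(rank[v] for v in a) / len(a)
mean_rank_b = sum(rank[v] for v in b) / len(b)
print(mean_rank_a, mean_rank_b)   # 2.0 5.0 -- what the test compares
print(median(a), median(b))       # 0.5 2.0 -- what the report presents
```

The test's p-value speaks to the mean ranks (2.0 vs. 5.0), numbers no reader of a report cares about; the medians we would actually present are never touched by the test.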
Another reason I strongly favor parametric testing is that one can more easily run more complicated analyses, for example, looking at treatment differences over time, or controlling for irrelevant or confounding factors (age, gender, baseline severity). Non-parametric testing has little provision for handling analyses with time in the model or interactions among confounding factors.
Nor can you use any results from a non-parametric analysis to plan for future studies. I’m not aware of any way to compute the power (sample size) for a future study based on data from the non-parametric test.
A final reason I strongly favor parametric testing is that the assumptions for parametric tests can be trivially designed away (e.g., use an N/group of at least twenty, use almost equal numbers of patients in the treatment groups, and/or transform the data). See blog ‘7. Assumptions of Statistical Tests’.
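As an illustration of the "transform the data" option: a log transform often tames skewed concentration data like the ng/ml values from earlier. A rough sketch, using a simple moment-based skewness measure of my own choosing as the symmetry check (this is an illustration, not a prescription):

```python
import math

def skewness(xs):
    """Moment-based sample skewness g1 = m3 / m2**1.5 (rough symmetry check)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

raw = [0.01, 0.03, 0.07, 1.0, 498.0]       # heavily right-skewed
logged = [math.log10(x) for x in raw]      # far more symmetric
print(skewness(raw), skewness(logged))     # skew shrinks after the transform
```

After the transform, a t-test on the logged values is far more defensible; the back-transformed difference is then a ratio of geometric means, which is often the natural summary for concentration data anyway.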
Post-Initial Publication Note: I recently learned that the statistical program I use, SAS, recently implemented a methodology to compute confidence intervals for the differences, called Hodges-Lehmann estimators. Therefore, it is possible to get 95% confidence intervals and the interval midpoint. I plan on including these Hodges-Lehmann estimators whenever I do non-parametric testing. However, this does not mitigate my objections: 1) the average rank being tested is not the statistic being reported, 2) one cannot do complicated analyses (e.g., covariates, two-way designs), 2a) even the Hodges-Lehmann estimators can only be computed when there are only two treatments, 3) one cannot compute power, and 4) one can still design away most of the objections which might steer us away from parametric statistics.
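For readers curious what the Hodges-Lehmann estimator actually is: the two-sample version is simply the median of all pairwise differences between the two groups. A minimal sketch of that textbook definition (not SAS's implementation, and omitting the confidence interval):

```python
from statistics import median

def hodges_lehmann(x, y):
    """Two-sample Hodges-Lehmann shift estimate: the median of all
    pairwise differences x_i - y_j between the two treatment groups."""
    return median(xi - yj for xi in x for yj in y)

print(hodges_lehmann([1, 2, 3], [0, 1, 2]))  # 1
```

Note that the pairwise construction is exactly why the estimator is limited to two treatments at a time, which is objection 2a above.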
So, is there any place for non-parametric testing? Definitely, but I’ll say more about that in my next blog – 11. p-values by the pound.