… ‘If you lost your watch in that dark alley, why are we looking here?’ ‘Well, <hic> there’s light here.’ (old chestnut)

***

In my last blog, I stated that we should avoid dichotomizing, as it throws away much of the information in the data. Specifically, it ignores the intervals in the data, precluding computing/presenting the mean, and it ignores the order of the data, precluding medians. The surprising thing is that we can compensate for this massive loss of information by increasing the sample size by only 60% (or up to four-fold or more, depending on assumptions).

I also stated that many non-parametric tests are basically the old t-tests or ANOVAs, but replacing the observed data with ranks and analyzing the ranks. So, if you had data like 0.01, 0.03, 0.07, 1.0, and 498.0 ng/ml, you would be analyzing the numbers 1, 2, 3, 4, and 5. The difference between a traditional parametric t-test on the ranks and the Mann-Whitney non-parametric test is that the non-parametric test knows the variability, while the t-test has to estimate it. That is, any 5 unique data values would have ranks of 1, 2, 3, 4, and 5, so the variability of the ranks is known mathematically and doesn’t need to be estimated. [Note: Ties can also be accommodated.]
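To make this concrete, here is a small pure-Python sketch (the rank function and toy data are my own illustration, not from any particular package): any five distinct values rank-transform to 1 through 5, and the population variance of the ranks 1..N is (N^2 - 1)/12, known in advance without estimation.

```python
def ranks(xs):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

data = [0.01, 0.03, 0.07, 1.0, 498.0]   # ng/ml, the example from the text
print(ranks(data))                       # [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(data)
print((n * n - 1) / 12)                  # population variance of ranks 1..n: 2.0
```

Note that the 498.0 outlier changes nothing: the ranks, and hence their variance, are identical for any five distinct values.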

Statistical approaches which only assume an ordering of the numbers have far less restrictive assumptions than those which assume that the differences between any two adjacent numbers are the same. The Mann-Whitney non-parametric test doesn’t need to assume (or test) normality, nor equal variances (homoscedasticity). So, if a test were almost as good but relied on a less restrictive set of assumptions, then we should use the approach with the less restrictive assumptions. Shouldn’t we?

As suggested in my last blog, the real question might be the effect on power. Actually, for rank or ordinal data the power doesn’t meaningfully go down even when the parametric analysis would have been appropriate. My first non-parametric textbook, Sidney Siegel’s Non-parametric Statistics, observed that the efficiency of the Mann-Whitney compared to the t-test is 95%, even in small sample sizes. So, when a t-test could have been run with 100 patients but the Mann-Whitney was used instead, it would need only about 5 more patients. Trivial!
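The 95% figure is the classic relative efficiency of the Mann-Whitney versus the t-test under normality (asymptotically 3/π ≈ 0.955). The arithmetic behind “only about 5 more patients” can be sketched as:

```python
import math

are = 3 / math.pi               # asymptotic relative efficiency, Mann-Whitney vs t, normal data
n_t = 100                       # patients needed for the t-test
n_mw = math.ceil(n_t / are)     # patients needed by Mann-Whitney for the same power
print(round(are, 3), n_mw)      # 0.955 105
```

Under heavier-tailed distributions the ratio can exceed 1, i.e., the Mann-Whitney can actually need fewer patients than the t-test.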

One loses a lot of power by dichotomizing the data, but almost nothing when one can order it. [Note: BTW, all methodologies will suffer when there is a marked number of ties (e.g., zeros). Of course, with dichotomization, all data will be ties!]

Yet, throughout these blogs I have been promoting parametric testing. That is, presenting t-tests and ANOVAs. Why? The non-parametric power is almost the same. Its assumptions are less rigorous. Why, then?

The answer is easy. If you feel that the beginning and end of an analysis is a p-value, then non-parametric testing is best. Stop. Don’t do anything more. Many, many publications present only the non-parametric tests. By this time, dear reader, I hope you know better. If not, go back to my blog #1 and read the first four blogs again. I’ll wait.

[No, I mean it, re-read them! This blog isn’t going anywhere.]

From the beginning of these blogs, I have pointed out that p-values are not the beginning and end of an analysis. The most important thing you can get from a statistical analysis is not whether the two treatments are different (of course they’re different!), but how much they differ.

Let me return to the Mann-Whitney (or Wilcoxon) test. As I mentioned above, we are computing the average rank for each treatment group and comparing them. In a recent analysis of three groups (10 animals each), the Wilcoxon test gave mean ranks of 18.0, 15.7 and 12.8. Is the mean rank ever useful? Sorry, no. I have NEVER, EVER included the mean rank in any statistical report. Never. Would it be reasonable to present mean differences and confidence intervals? Nope, that is not appropriate when we believe the data are ordinal, not interval. Can we report the medians? Yes, yes we can. Unfortunately, this statistical test does not use the medians, it does not analyze medians, it does not compare medians.
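A toy illustration of the mismatch (hypothetical data of my own, chosen with no ties): what the test actually works with are the pooled ranks and their group means, while what you would sensibly report descriptively are the medians.

```python
from statistics import median

a = [0.2, 0.5, 0.9, 1.1]     # hypothetical group A
b = [1.4, 2.0, 3.5, 9.9]     # hypothetical group B

# rank the pooled sample (no ties here, so a simple 1-based position works)
pooled = sorted(a + b)
rank = {v: i + 1 for i, v in enumerate(pooled)}

mean_rank_a = sum(rank[v] for v in a) / len(a)
mean_rank_b = sum(rank[v] for v in b) / len(b)
print(mean_rank_a, mean_rank_b)   # 2.5 6.5  <- what the test statistic uses
print(median(a), median(b))       # 0.7 2.75 <- what you would actually report
```

The mean ranks 2.5 and 6.5 drive the p-value, yet they have no clinical units; the medians do, but the test never touches them.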

Another reason I strongly favor parametric testing is that it more easily accommodates more complicated analyses: for example, examining treatment differences over time, or controlling for irrelevant or confounding factors (age, gender, baseline severity). Non-parametric testing has little provision for models with time or for interactions among confounding factors.

Nor can you use any results from a non-parametric analysis to plan for future studies. I’m not aware of any way to compute the power (sample size) for a future study based on data from the non-parametric test.

A final reason I strongly favor parametric testing is that the assumptions of parametric tests can be trivially designed away (e.g., use an N/group of at least twenty, use nearly equal numbers of patients in the treatment groups, and/or transform the data). See blog ‘7. Assumptions of Statistical Tests’.

Post-Initial Publication Note: I recently learned that the statistical program I use, SAS, has implemented a methodology to compute confidence intervals for the differences: the Hodges-Lehmann estimators. It is therefore possible to get the Hodges-Lehmann point estimate and its 95% confidence interval. I plan on including these estimators whenever I do non-parametric testing. However, this does not mitigate my objections: 1) the average rank being tested is not the statistic being reported, 2) one cannot do complicated analyses (e.g., covariates, two-way designs), 2a) even the Hodges-Lehmann estimators can only be computed when there are exactly two treatments, 3) one cannot compute power, and 4) most of the objections that might push us away from parametric statistics can be designed away.
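For two groups, the Hodges-Lehmann shift estimate is simply the median of all pairwise differences between the groups. A minimal pure-Python sketch with hypothetical data (the confidence-interval machinery, as in SAS, is more involved and omitted here):

```python
from itertools import product
from statistics import median

def hodges_lehmann(x, y):
    """Two-sample Hodges-Lehmann shift estimate: median of all pairwise differences y - x."""
    return median(b - a for a, b in product(x, y))

x = [1.0, 2.0, 3.0]   # hypothetical control values
y = [2.5, 3.5, 4.5]   # hypothetical treated values
print(hodges_lehmann(x, y))   # 1.5
```

Because it is a median of differences, the estimate is in the original units of the data and is robust to outliers in either group.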

So, is there any place for non-parametric testing? Definitely, but I’ll say more about that in my next blog – 11. p-values by the pound.

Great Blog!

I had a question about mixing and matching.

I have run a randomised trial which had to be stopped early comparing blood markers as outcomes, in a group of patients who received a particular drug compared to one which didn’t.

Unfortunately the recruitment was too slow so we stopped after 20 patients (10 in each group) although the target was 80 patients total.

The drug in question only has a short-term effect lasting a day or two. I was planning to use t-tests, but the groups are “small” (10 each), so I have been advised to go non-parametric.

But I have some other markers which I have checked in the whole cohort over a longer term, and I want to compare these markers serially at different time points as one larger group of 20, i.e., marker X at time points 0, 6 weeks and 6 months, and was going to use ANOVA.

So can I use a non-parametric independent-samples comparison for one set of the data from these patients (time points 0, 1 day and 3 days, in treatment and control groups, n=10 each), then use ANOVA for time points 0, 6 weeks and 6 months (n=20)?

or is this cheating?

Kolmogorov-Smirnov testing of normality shows that timepoint 1 data is normally distributed but subsequent timepoints are non-normal distribution.

And in case you were wondering, with such small numbers, the trial differences are non-significant and we are calling it a ‘failed’ trial rather than a negative result. (as per your Blog post 12)

First off, as you must realize, by running the trial with one-quarter the number of patients (assuming that N=80 had adequate power), you no longer have adequate power. Since the detectable difference is inversely proportional to the square root of N, the detectable difference is ‘only’ twice what you originally planned. That is, you were originally able to reject the Ho if the means were 0.63 s.d. apart; when N is 1/4 the size, you would need an effect size of about 1.33. To put it another way, the power to reject the Ho (assuming it was 80% at N=80) is only about 26% when N=20.
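The arithmetic can be sketched with the usual normal-approximation sample-size formula (the exact t-based values quoted above, 1.33 and 26%, differ slightly from the approximation below):

```python
from statistics import NormalDist
from math import sqrt

z = NormalDist()
za, zb = z.inv_cdf(0.975), z.inv_cdf(0.80)   # two-sided alpha = 0.05, power = 80%

def detectable_d(n_per_group):
    """Smallest effect size (in s.d. units) detectable at 80% power, normal approximation."""
    return (za + zb) * sqrt(2 / n_per_group)

def power(d, n_per_group):
    """Approximate power to detect effect size d with n patients per group."""
    return z.cdf(d / sqrt(2 / n_per_group) - za)

print(round(detectable_d(40), 2))             # 0.63 at N=80 total (40/group)
print(round(detectable_d(10), 2))             # ~1.25 at N=20 total (10/group)
print(round(power(detectable_d(40), 10), 2))  # ~0.29 power for the planned effect
```

Note the square-root relationship: quartering N doubles the detectable difference, since sqrt(4) = 2.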

Second, as demonstrated by my Monte Carlo study, when N/group=10 the t-test has better power than the non-parametric test and is robust to non-normality. Furthermore, I don’t know of any non-parametric approach to repeated measurements that handles the big problem of autocorrelation. Remember, the big sensitivity-of-assumptions problem is NOT non-normality, outliers, or unequal variances, but correlated errors: that is, the correlations among days 0, 1 and 3 (or among baseline, 6 weeks and 6 months). Therefore, I would still do the one-way ANOVA and use a correlated-errors approach (e.g., AR(1)). If you can see non-normality, you might want to transform the data (logs?) prior to analysis. One alternative is to reduce the analysis to two time points by analyzing the difference score (e.g., time 1 minus baseline); this could be done with t-tests or non-parametric statistics.
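A minimal sketch of the difference-score suggestion, with entirely hypothetical marker values: compute each patient’s change from baseline, then compare the changes between groups with a Welch t statistic. This sidesteps the autocorrelation problem because each patient contributes a single number.

```python
from statistics import mean, variance
from math import sqrt

# Hypothetical marker values: 10 treated and 10 control patients, baseline and day 1
base_trt = [5.1, 4.8, 5.5, 5.0, 4.9, 5.2, 5.3, 4.7, 5.0, 5.1]
day1_trt = [4.0, 3.9, 4.6, 4.1, 4.0, 4.3, 4.2, 3.8, 4.1, 4.0]
base_ctl = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.3, 5.1, 4.9, 5.0]
day1_ctl = [4.9, 5.1, 4.8, 5.0, 5.1, 4.7, 5.2, 5.0, 4.8, 4.9]

# one change score per patient: day 1 minus baseline
chg_trt = [d - b for b, d in zip(base_trt, day1_trt)]
chg_ctl = [d - b for b, d in zip(base_ctl, day1_ctl)]

# Welch two-sample t statistic on the change scores
se = sqrt(variance(chg_trt) / len(chg_trt) + variance(chg_ctl) / len(chg_ctl))
t = (mean(chg_trt) - mean(chg_ctl)) / se
print(round(mean(chg_trt), 2), round(mean(chg_ctl), 2), round(t, 2))
```

The same change scores could be fed to a Mann-Whitney test instead of the t statistic; either way, the repeated-measures correlation has been absorbed into the differencing.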

The advice to go non-parametric is simply without merit. Many statisticians think that non-parametric statistics are better when N is small, and that t-tests/ANOVAs are unduly sensitive to non-normality. The central limit theorem and many empirical studies have disproved this. I challenge your statistician to present any empirical study demonstrating it with your degree of non-normality.

Given that you were unable to reject the Ho, I would say that the trial was inadequately powered to reject the Ho. However, I would still report the differences you saw. Did you see any ‘trends’? Any patterns of results? Report the descriptive statistics!

I didn’t understand your two-sets-of-markers question or the ‘whole cohort’. Yes, you can analyze your key parameters at one set of times (days 0, 1 and 3) and a second set of parameters at their time periods (e.g., baseline, 6 weeks and 6 months). You can also use ANOVA for one and any other test for the other, if your protocol allowed for that or made some provision for exceptions. If you didn’t specify anything for your secondary parameters, then you can ‘fish’ all you want. However, as I said before, the main focus should NOT be significance testing (i.e., p < 0.05), but descriptive statistics for your underpowered study. This can only be an exploratory, not a confirmatory, study given your inadequate N.