In my ‘7. Assumptions of Statistical Tests’, there was a glaring omission (at least glaring to me): no discussion of ordinal data. I felt that I let my readers down by glossing over this issue. Time to repair my quite weak ego. This blog was actually completed on 8Aug2012, although for continuity of my blog (7a follows 7) it has a fake publication date of 12Nov2011.

Most of the time, we can’t assume that the difference between any two adjacent points on an ordinal scale is necessarily equal. For example, is the difference between no adverse event and a mild one the same as between a severe and a life-threatening one? Is the difference between a pain scale score of 1 (the minimum on the five-point scale) and 2 the same as between 4 and 5 (the maximum on the scale)? While one can’t have outliers on such a bounded scale, one can’t have large differences either. Are we justified in computing the average of this rank data? Can one ignore the lack of continuous data and the lack of equal intervals and still have a valid test of the null hypothesis? Many statisticians think not, some extraordinarily fervently.

This blog will present a summary of two statistical articles. The first is Heeren T and D’Agostino R, “Robustness of the two independent samples t-test when applied to ordinal scaled data,” Statistics in Medicine 1987; 6:79-90. The second is Sullivan LM and D’Agostino RB Sr, “Robustness and power of analysis of covariance applied to ordinal scaled data as arising in randomized controlled trials,” Statistics in Medicine 2003; 22(8):1317-1334.

What Heeren and D’Agostino did was to test small sample sizes, N1 = N2 from 5 to 15, along with some cases of unequal Ns. They investigated 3-, 4-, and 5-level ordinal scales and multiple alpha levels. They restricted their investigation so that the probability in any cell was > 5%; otherwise they wouldn’t truly have a 3-, 4-, or 5-level scale. For example, if there were only three levels on the scale and one level had no observations, then effectively they would have only a two-level ordinal scale, a dichotomy. Given such a limited number of observations (for example, N/group = 5), they could enumerate every possible pattern of the data and compute the exact p-value for a t-test.
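Their exhaustive enumeration is easy to approximate by brute-force simulation. A minimal sketch, assuming a three-level scale with cell probabilities I made up (0.3/0.4/0.3, not theirs) and N = 5 per group; it estimates how often a pooled t-test rejects a true null hypothesis at alpha = 0.05:

```python
import random
from statistics import mean, stdev

def pooled_t(x, y):
    # pooled-variance two-sample t statistic
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

random.seed(1)
levels = [1, 2, 3]        # three-point ordinal scale
probs = [0.3, 0.4, 0.3]   # assumed null distribution, identical in both groups
n, t_crit, reps = 5, 2.306, 20000   # t_crit: two-sided alpha = 0.05, df = 8

rejections = 0
for _ in range(reps):
    x = random.choices(levels, probs, k=n)
    y = random.choices(levels, probs, k=n)
    try:
        if abs(pooled_t(x, y)) > t_crit:
            rejections += 1
    except ZeroDivisionError:  # all ten observations landed on one level
        pass

rate = rejections / reps
print(f"empirical type I error: {rate:.3f}")
```

Heeren and D’Agostino’s point is that this empirical rejection rate stays close to the nominal 0.05 even for such tiny, coarse samples.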

Time for their summary: “Our investigation demonstrates the robustness of the two independent samples t-test on three, four or five point scaled variables when sample sizes are small. The probability of rejecting a correct null hypothesis in this situation will not greatly exceed the stated nominal level of significance.” They defined “greatly exceed” as more than 10% above the nominal alpha level (e.g., for an alpha of 0.05, the actual level stayed below 0.055, that is, pretty close to 0.05). In other words, a t-test is a valid test even when you have small Ns and an ordinal scale with only a few levels.

The Sullivan and D’Agostino summary was:

Abstract: In clinical trials comparing two treatments, ordinal scales of three, four or five points are often used to assess severity, both prior to and after treatment. Analysis of covariance is an attractive technique; however, the data clearly violate the normality assumption and, in the presence of small samples, large sample theory may not apply. The robustness and power of various versions of parametric analysis of covariance applied to small samples of ordinal scaled data are investigated through computer simulation. Subjects are randomized to one of two competing treatments and the pre-treatment, or baseline, assessment is used as the covariate. We compare two parametric analysis of covariance tests that vary according to the treatment of the homogeneity of regressions slopes and the two independent samples t-test on difference scores. Under the null hypothesis of no difference in adjusted treatment means, we estimated actual significance levels by comparing observed test statistics to appropriate critical values from the F- and t-distributions for nominal significance levels of 0.10, 0.05, 0.02 and 0.01. We estimated power by similar comparisons under various alternative hypotheses. The model which assumes homogeneous slopes and the t-test on difference scores were robust in the presence of three, four and five point ordinal scales. The hierarchical approach which first tests for homogeneity of regression slopes and then fits separate slopes if there is significant non-homogeneity produced significance levels that exceeded the nominal levels especially when the sample sizes were small. The model which assumes homogeneous regression slopes produced the highest power among competing tests for all of the configurations investigated. The t-test on difference scores also produced good power in the presence of small samples.

Up to now, we were only discussing the type I error: the test of the null hypothesis when the difference is truly zero. A more appropriate question is our ability to reject the null hypothesis when the difference is not zero, i.e., power. Readers of my first four blogs should realize that the null hypothesis is never true. When the means are not identical, what is the effect of a limited-item ordinal scale?

To quote Sullivan and D’Agostino, “The magnitude of the effect size, or the magnitude of the difference in adjusted treatment means, is reduced by ordinal scaling due to the discreteness of the data. In particular, continuous data with an effect size of 0.8 which is scaled into three-point ordinal scaled data reduces the effect size by approximately 75 per cent. Continuous data with an effect size of 0.8 which is scaled into five-point ordinal scaled data reduces the effect size by approximately 37 per cent.”
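This attenuation is easy to reproduce yourself. A rough sketch, with cut points I invented (symmetric ones, so the shrinkage here is milder than in the paper’s configurations): it discretizes normal data with a true effect size of 0.8 into three- and five-level scales and recomputes the observed standardized difference:

```python
import random
from statistics import mean, stdev

random.seed(2)
n, d_true = 50_000, 0.8   # true effect size on the underlying continuous scale

def to_levels(z, cuts):
    # assign ordinal level 1 .. len(cuts)+1 by counting thresholds crossed
    return sum(z > c for c in cuts) + 1

control = [random.gauss(0.0, 1.0) for _ in range(n)]
treated = [random.gauss(d_true, 1.0) for _ in range(n)]

results = {}
for cuts in ([-0.5, 0.5],                 # three-point scale (assumed cuts)
             [-1.0, -0.25, 0.5, 1.25]):   # five-point scale (assumed cuts)
    c = [to_levels(z, cuts) for z in control]
    t = [to_levels(z, cuts) for z in treated]
    sp = ((stdev(c) ** 2 + stdev(t) ** 2) / 2) ** 0.5  # pooled SD (equal n)
    results[len(cuts) + 1] = (mean(t) - mean(c)) / sp

for k, d_obs in sorted(results.items()):
    print(f"{k}-level scale: observed effect size = {d_obs:.2f}  (true 0.8)")
```

With these mild symmetric cuts the loss is modest; the scalings Sullivan and D’Agostino studied, which pile observations into few cells, lose far more of the effect size.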

I had made a very similar comment in my blog ‘9. Dichotomization as the Devils Tool’. If you take a continuous scale and cut it into two parts (e.g., success or failure), you reduce the effect size tremendously, and to compensate you need to increase your Ns tremendously. In that blog, I pointed out that under OPTIMAL conditions, when dealing with a two-level (ordinal or dichotomous) scale, the Ns would need to increase by 60% to compensate. Under more realistic conditions, you would need to increase the Ns by a FACTOR of four.
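The arithmetic behind those numbers: the sample size a two-sample comparison needs scales like 1/d², so shrinking the effect size from d₀ to d₁ inflates the required N by (d₀/d₁)². A median split of a normal variable multiplies d by roughly 0.8, which is where the 60% figure comes from; halving d gives the factor of four. A sketch:

```python
import math

def n_inflation(d0, d1):
    # required N per group scales like 1/d**2 for a two-sample comparison,
    # so the inflation factor from shrinking d0 to d1 is (d0/d1)**2
    return (d0 / d1) ** 2

# a median split of a normal variable multiplies d by sqrt(2/pi) ~ 0.798
shrink = math.sqrt(2 / math.pi)
print(f"optimal dichotomization: {n_inflation(1.0, shrink):.2f}x the N")  # ~1.57x
print(f"effect size halved:      {n_inflation(0.8, 0.4):.2f}x the N")     # 4.00x
```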

FREEBIE SUGGESTION: In designing a trial, rather than asking the physicians or patients to rate things as good or bad (two levels) or on a five-point scale (normal, mild, moderate, severe, or extremely ill), you should consider a much wider framework. For example, rate things on a scale of 0 to 100, or ask them to make a mark on a 10 cm line. Such a change in the CRF is trivial (and free), and will give you much greater ability to prove your treatment’s effectiveness.

In sum, if one has ordinal data (e.g., a three-point rating scale), using ANOVA or ANCOVA and computing means is a pretty good approach with good control of your alpha level. In fact, for certain analyses (e.g., repeated measurements) it may be the ONLY way to analyze your data. However, one should strive to have quasi-continuous data (e.g., a 100-level scale) to give yourself greater ability to reject the (false) null hypothesis.

Hello,

Thanks for this post. It is very interesting and explanatory.

I was wondering if you would have a reference for the statements you make in your last paragraph. I used your indications to analyse my data and write an article, and I have now been challenged by one of the reviewers. Maybe, going to the primary source could help me to give a grounded explanation.

Best wishes,

Jose

The references for this blog were given in its third paragraph:

This blog will present a summary of two statistical articles. The first is Heeren T and D’Agostino R, “Robustness of the two independent samples t-test when applied to ordinal scaled data,” Statistics in Medicine 1987; 6:79-90. The second is Sullivan LM and D’Agostino RB Sr, “Robustness and power of analysis of covariance applied to ordinal scaled data as arising in randomized controlled trials,” Statistics in Medicine 2003; 22(8):1317-1334.

If the question pertains to “for certain analyses (e.g., repeated measurements) it may be the ONLY way to analyze your data,” then check your favorite nonparametric statistics text and search the index for ‘interactions’, ‘multifactorial’, ‘planned comparisons’, and ‘repeated measurements’. The methods discussed, if available at all, are often so complex as to be a windfall of profit to your statistician, and utter confusion to you when the statistician attempts to explain the results. I still shudder when doing a logistic regression and examining the interaction term for even a 2×2 interaction, and I won’t even mention Generalized Estimating Equations (GEE) modeling. Another memory is a wonderful program I wrote as a graduate student to do an m×n×p×q (four-way) chi-square contingency table analysis. I used it once.

If your question pertained to “one should strive to have quasi-continuous data (e.g., a 100 level scale) to give you greater ability to reject the (false) null hypothesis,” then the quote from Sullivan and D’Agostino applies: “continuous data with an effect size of 0.8 which is scaled into three-point ordinal scaled data reduces the effect size by approximately 75 per cent. Continuous data with an effect size of 0.8 which is scaled into five-point ordinal scaled data reduces the effect size by approximately 37 per cent.” You could also reference my blog ‘9. Dichotomization as the Devils Tool’.

I hope that helps.