“All models are incorrect. Some are useful.” George Box
When you do a statistical test, you are, in essence, testing whether the assumptions are valid. We are typically only interested in one, the null hypothesis. That is, the assumption that the difference is zero (although one could test whether the difference is any specified amount). We discussed the merits of the null hypothesis in previous blogs. But the null hypothesis is only one of many assumptions.
Let me focus on the lowly t-test and also a simple two-way ANOVA (comparing two groups and time [repeated measurements]).
A second assumption is that the data are normally distributed. One unusual thing about the ‘real’ world is that data are often normally distributed. Height, IQ and many, many other parameters are normal (or Gaussian). [Note: William Gosset, aka ‘Student’, was the inventor of the t-test. Interesting side note, he worked for Guinness and developed the t-distribution to monitor stout production. However, Guinness didn’t approve of any publications, hence the pseudonym, ‘Student’. This will not be on any test. Unless you become a statistician. In which case it is mandatory you know it.] In general, if a variable is affected by many, many different factors, it will be normally distributed. [I’ll explain the reason for this shortly.] We even have tests to determine if the data are normal. Unfortunately, almost all variables have a slight departure from normality. As implied in my previous blogs, if we have a large enough sample, then any statistical test will reject the null hypothesis (e.g., a test of normality will eventually declare the data non-normal if the sample size is large enough).
So, how bad is the effect of non-normal data? The answer is simply: NON-NORMALITY has almost NO EFFECT ON P-VALUES when we compare means, especially when the sample sizes are moderate. There is a theorem in statistics (the central limit theorem) which says that as the sample size increases the distribution of means approaches the normal distribution.
Let me back up a bit. Let us imagine we are interested in seeing if our mean isn’t zero. In the plots below the Greek letter mu (μ) could be any number; in this case we might make it zero. Arbitrarily, let us also assume that the extremes of the plots below are -2 and +2. [Actually it’s trivially easy for the mean to be zero and the range to be -2 to +2: we just need to look at a new variable with the mean subtracted from the old one and the result divided by its standard deviation. Let me call this new variable t = (X - μ)/σ, where X is a mean and σ is the s.d. of means.]
We compute a mean from our sample’s data. The plots below are an artist’s representation of what a bunch of means around mu would look like. The artist’s representation is actually what a mathematician would call a shit-load of means given the above true mean and its variability. Actually they would say an infinite number of means, but no one has ever seen an infinity, whereas we humans have seen loads of the other stuff. If we know the true variability of the original data (σ), then the variability of the means, and hence the width of the curves in the plots, is mathematically known (σ/√N). I should also note that the curves would be narrower as N increases, but the pictures were widened so you could see the shape of the curves.
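If you would rather see the σ/√N rule than take my word for it, here is a minimal sketch using only the Python standard library. The function name, the seed, the number of replications and the choice of σ = 1 are all my own assumptions for illustration, not anything from a real study.

```python
import math
import random
import statistics

def sd_of_sample_means(n, sigma=1.0, reps=5000, seed=1):
    """Draw `reps` samples of size n from N(0, sigma) and return the s.d. of their means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.pstdev(means)

# Compare the simulated spread of the means with sigma / sqrt(N), using sigma = 1.
for n in (4, 16, 64):
    simulated = sd_of_sample_means(n)
    theoretical = 1.0 / math.sqrt(n)
    print(f"N={n:3d}  simulated s.d. of means={simulated:.3f}  sigma/sqrt(N)={theoretical:.3f}")
```

The two columns should agree closely, and the agreement gets better as the number of replications goes up.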
However let me get to the nub of the problem. The shape of the curve is NOT the same as the shape of the original raw data. If the original data were skewed, a set of means is less skewed. Let us assume that you have a sample of 2. When you see an extreme point (e.g., a +2), it would be quite rare to see a second point as extreme (i.e., two +2s). So the possibility of getting a mean of +2 would be unusual. If the true mean were zero and you had a sample of say 10, the likelihood of seeing all ten values positive, let alone extreme, would be quite remote (about one time in a thousand).
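The ‘one time in a thousand’ is just arithmetic. As a small sketch, assuming each observation is independent and equally likely to fall above or below zero (the function name is mine):

```python
def prob_all_positive(n, p_positive=0.5):
    """Probability that all n independent observations come out positive."""
    return p_positive ** n

print(prob_all_positive(2))   # 0.25
print(prob_all_positive(10))  # 0.0009765625, i.e. 1 in 1024
```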
Let me illustrate this with the following four figures.
The dotted line represents a true normal distribution, which is what we hope to eventually see. The top left-hand plot is the original distribution. As can be seen, we have what we statisticians would call a negatively skewed distribution (the long tail points to the negative side of the number line). The top right-hand plot is the distribution of means from that original negatively skewed distribution when N is 2. As can be seen, even when the sample size is 2, the distribution (solid line) of means is less skewed than the first. The bottom plots have sample sizes of 4 and 10, respectively. When the sample size is ten, the distribution of means is virtually identical to the normal curve. So if you analyzed data from this originally skewed distribution but had a sample of ten observations, the statistical test on means would be based on a distribution which is virtually indistinguishable from the normal curve.
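You can redo the four figures numerically. The sketch below is not the distribution used in the plots (I am assuming a mirrored exponential purely because it has a long negative tail); it draws many means for N = 1, 2, 4 and 10 and reports the skewness of each pile of means. The function names, seed and replication count are illustrative assumptions.

```python
import random
import statistics

def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    m2 = statistics.fmean((v - m) ** 2 for v in values)
    m3 = statistics.fmean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

def skewness_of_means(n, reps=20000, seed=7):
    """Skewness of `reps` sample means, each from n negatively skewed observations."""
    rng = random.Random(seed)
    means = [
        # -Exponential(1): long tail on the negative side, like the first plot
        statistics.fmean(-rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]
    return skewness(means)

skews = {n: skewness_of_means(n) for n in (1, 2, 4, 10)}
for n, s in skews.items():
    print(f"N={n:2d}  skewness of means={s:.2f}")
```

The skewness should shrink toward zero as N grows; in theory it falls off like one over the square root of N.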
What is hypothesis testing? Imagine having a theoretical distribution with a mean of zero and an expected range of -2 to +2. Then you get an actual mean of +10. What would you conclude? You’d say “Uh uh uh uh. Wrong. It can’t be. The theory is wrong.” Statistics just codifies such reasonable conclusions.
The above logic with means is also the reason why a parameter that results from a number of small effects tends to be normally distributed. For example, IQ is produced by the additive effect of many genes and many environmental factors. The net result is that this characteristic is like the mean of a number of sub-effects. It will tend to be normally distributed.
One cause of non-normality is outliers, or extreme values. The best way to see them is to plot the data. Two quick approaches are the stem and leaf and the box-plot. Outliers can change the means; they also very strongly influence variability and correlation. Many times outliers are transcription errors or bad assays and can be ignored/corrected. I have also noticed that sometimes units get confused (e.g., one investigator using grams and the rest using micrograms). Other times, they can’t be ignored as they are valid extreme disease states. Transformations (see below) might be the best way to handle them.
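If you don’t have a plotting program handy, the rule most box-plots use to flag points can be sketched in a few lines. This assumes the usual 1.5 × interquartile range fences; the function name and the data (with one value that looks like a slipped decimal point) are made up for illustration.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3 (box-plot fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical lab values; the last one is probably a transcription error
lab_values = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.4, 4.0, 41.0]
print(tukey_outliers(lab_values))  # [41.0]
```

A flagged value is only a prompt to go look at the case report form; whether it is an error or a genuinely sick patient is not something the arithmetic can tell you.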
So, I would say that even if the data are expected to be pathologically non-normal, the t-test would not be affected by non-normality when you have at least twenty observations. Do we ever do a pivotal trial with that small an N? Never! I will also mention below a second way around this non-normality issue: data transformation.
Any marginally competent statistician will look at a frequency distribution (e.g., stem and leaf) to see if the data have any marked non-normality and/or contain outliers. If I’m very lazy, I might just look at the skewness and kurtosis (the third and fourth moments of the data).
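For the very lazy, here is a minimal sketch of those two moments. I am assuming the simple population (divide by N) formulas and reporting kurtosis as ‘excess’ kurtosis, so that a normal distribution scores zero on both; statistical packages often apply small-sample corrections that this ignores. The function name is mine.

```python
import statistics

def skew_and_excess_kurtosis(values):
    """Standardized third moment, and standardized fourth moment minus 3."""
    m = statistics.fmean(values)
    m2 = statistics.fmean((v - m) ** 2 for v in values)
    m3 = statistics.fmean((v - m) ** 3 for v in values)
    m4 = statistics.fmean((v - m) ** 4 for v in values)
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# A symmetric toy sample: skewness is zero, and the flat shape gives negative excess kurtosis
print(skew_and_excess_kurtosis([1, 2, 3, 4, 5]))
# A toy sample with one large value: skewness is clearly positive
print(skew_and_excess_kurtosis([1, 1, 1, 2, 10]))
```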
If the data are skewed, one reason might be that the measuring instrument was not constructed to differentiate on one or both ends of the scale. This is called the floor or ceiling effect. If you detect this in an early study (e.g., Phase 2a), then you might get better differentiation and sensitivity by getting or developing a different method to get your data (i.e., a better scale).
You might be confronted by an ‘expert’ who scoffs at using a t-test when the raw data are so obviously non-normal, and who tells you that only the non-distributional non-parametric test is correct. I would suggest you tell them: “Yes, I can clearly see that the raw data isn’t normal. You’re right. But I’m analyzing and comparing means from x subjects. Do you know what shape a distribution of means would look like? Indistinguishable from normal. Please run a simulation of [differences of] means when N is x and compare that to a normal distribution. The shape of the original distribution is irrelevant. The sampling distribution of means will be normal. Feel free to bootstrap from my original distribution with a million replications to compute the mean [difference]. No, I’m not saying ‘Damn the torpedoes, full steam ahead.’ I’m saying ‘Damn the torpedoes, I’m in a freaking plane.’ ” I’ll be talking about the usefulness of non-parametric statistics in blog 9.
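In case your ‘expert’ takes you up on the bootstrap offer, here is a minimal sketch of what it might look like. Everything here is an assumption for illustration: the raw data are simulated log-normal values standing in for a skewed variable, there are 30 subjects per arm, and I use ten thousand replications rather than a million so it runs in a second or two.

```python
import random
import statistics

def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    m2 = statistics.fmean((v - m) ** 2 for v in values)
    m3 = statistics.fmean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

def bootstrap_mean_differences(group_a, group_b, reps=10000, seed=11):
    """Resample each group with replacement and collect the differences in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    return diffs

# Hypothetical skewed raw data, 30 subjects per arm
data_rng = random.Random(3)
group_a = [data_rng.lognormvariate(0.0, 1.0) for _ in range(30)]
group_b = [data_rng.lognormvariate(0.0, 1.0) for _ in range(30)]

diffs = bootstrap_mean_differences(group_a, group_b)
print(f"skewness of raw data (arm A):         {skewness(group_a):.2f}")
print(f"skewness of bootstrapped differences: {skewness(diffs):.2f}")
```

The point of the comparison is the second number: how far the distribution of mean differences is from symmetric, which is what the t-test actually cares about.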
A third assumption of the t-test is that the variances for the two treatments are equal. This has a fancy five syllable name, homoscedastic (pronounced ‘hoe-moe-ski-dast-tick’). When the two variances are not equal, it is called heteroscedastic. You could drop these terms to impress your friends and neighbors. [Post publication note: In the November 2017 Significance there was a poll on “What is your favorite statistics word?” Heteroscedasticity won, hands down.] On second thought, forget it, unless you want your friends to avoid you and your neighbors to ask you to move. [Statisticians are by nature lonely people (and humorless).]
How badly is the alpha level affected when the two groups have different variances? It depends on the sample size for the two groups. If we have equal N’s in the two groups, the effect is zilch. When you have equal N, even if the ratio of the variances were zero or infinite, the 0.05 alpha level is actually 0.05; as I said, zilch. If the two sample sizes differ and the larger variance group has the larger N, then the test is actually conservative. For example, if one group had a variance twice as large as the other and also had twice the number of subjects, then the 0.05 nominal alpha level would actually be 0.029. On the other hand, if the group with the variance half the size of the other had twice the number of subjects, then the 0.05 nominal alpha level would be 0.080. At the extremes (although it is not possible to have either zero or infinite variability): when the group with twice the sample size had zero variability, the nominal 0.05 alpha level would actually be 0.17; and when the group with twice the sample size had infinite variability, the nominal 0.05 alpha level would actually be 0.006. So, again, I would recommend keeping the N’s around the same or as close to the same as you can. This is one of the reasons why we use a 1:1 treatment allocation. [Note: the second is that the power to reject the null hypothesis is maximized.]
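For the curious, these numbers can be approximated with a few lines of arithmetic. The sketch below is my own large-sample approximation, not a quotation of any published table: it compares the variance the pooled-variance t-test assumes for the difference in means with the true variance of that difference, and uses the normal curve in place of the t-distribution. The function and argument names are illustrative.

```python
import math
from statistics import NormalDist

def actual_alpha(var_ratio, n_ratio, nominal=0.05):
    """Approximate true type I error of the pooled-variance t-test (large samples).

    var_ratio: variance of group 1 divided by variance of group 2
    n_ratio:   sample size of group 1 divided by sample size of group 2
    """
    # Work in units where group 2 has variance 1 and sample size 1.
    true_var = var_ratio / n_ratio + 1.0                      # real variance of the mean difference
    pooled = (n_ratio * var_ratio + 1.0) / (n_ratio + 1.0)    # what the pooled variance estimates
    assumed_var = pooled * (1.0 / n_ratio + 1.0)              # variance the t-test thinks it has
    z_crit = NormalDist().inv_cdf(1.0 - nominal / 2.0)
    # The test statistic is really normal with variance true_var / assumed_var.
    return 2.0 * (1.0 - NormalDist().cdf(z_crit * math.sqrt(assumed_var / true_var)))

cases = [
    ("equal N, variance ratio 5", 5.0, 1.0),
    ("2x variance with 2x N", 2.0, 2.0),
    ("half the variance with 2x N", 0.5, 2.0),
    ("zero variance with 2x N", 0.0, 2.0),
    ("huge variance with 2x N", 1e9, 2.0),
]
for label, var_ratio, n_ratio in cases:
    print(f"{label:30s} actual alpha ~ {actual_alpha(var_ratio, n_ratio):.3f}")
```

Under these assumptions the results land close to the figures quoted above (roughly 0.03, 0.08, 0.17 and 0.006 for the unequal-N cases), and with equal N the answer stays at 0.05 whatever the variance ratio.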
Sometimes we are asked to have unequal allocation. In general, I would seldom recommend using more than a 2:1 treatment allocation.
Well, is there any workaround? Actually yes, a pretty neat one. One doesn’t need to analyze the data in the form you collected them. Huh??? What I mean is that one can do some type of transformation. For example, many years ago I worked on a wound healing salve (rh-PDGF-G). We needed to measure the surface area of the wound. Most of the wounds were small (e.g., most with area less than 0.03). Unfortunately, we also saw some gaping wounds (e.g., area of 2.00). We used a square root transformation on the data. Some people might object as this might have little intrinsic meaning. One possible rationale for the square root is the simple geometry formula: Area = πr² or r = sqrt(Area/π). The square root of the area is proportional to the radius of the wound. Wounds tend to heal from the outside moving in. That is, the radius changes. In any case, God never said that the natural number line was any better at describing the data than the square root number line. Many variables benefit from a log transformation. We mentioned in a previous blog that in analyzing drug potency we often apply a logarithmic transformation to the area under the curve. The same goes for many laboratory variables (e.g., triglycerides). It is beyond the scope of this blog to discuss all the possible transformations one could apply to normalize the distributions or variances. In fact, it is possible to select the best transformation, using any power, aka the Box-Cox power transformation. Data transformations often do work surprisingly well. You might want to include in the protocol a CYA (i.e., cover your assets): ‘If it is observed that the data is non-normally distributed, a transformation (e.g., logarithmic transformation) will be used.’
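To make the transformation idea concrete, here is a minimal sketch of the Box-Cox family applied to some made-up, strictly positive, right-skewed data (simulated log-normal values, which are not the wound data). A power of 1 leaves the shape of the data alone, 0.5 is the square root and 0 is the log. A real Box-Cox analysis would pick the power by maximum likelihood; here I simply compare the skewness for three candidate powers, and all names and settings are illustrative assumptions.

```python
import math
import random
import statistics

def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    m2 = statistics.fmean((v - m) ** 2 for v in values)
    m3 = statistics.fmean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

def box_cox(x, lam):
    """Box-Cox power transformation for x > 0; lam = 0 is defined as the natural log."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

rng = random.Random(5)
# Hypothetical skewed measurements: mostly small values with a few very large ones
areas = [rng.lognormvariate(-3.0, 1.0) for _ in range(2000)]

for lam, label in ((1.0, "raw"), (0.5, "square root"), (0.0, "log")):
    transformed = [box_cox(a, lam) for a in areas]
    print(f"{label:12s} skewness = {skewness(transformed):.2f}")
```

For data like these the log scale should come out close to symmetric, with the square root somewhere in between.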
A side benefit from applying the right transformation is seen in complicated analyses. When the data is correctly transformed, many interactions often disappear, making the data more interpretable. I’ll return to the issue of interactions in a later blog.
One final assumption is the question of correlated errors. I’m referring to a MAJOR problem one often encounters when dealing with repeated measures. For example, one runs a study and collects data weekly for 8 weeks, then one wants to see if the two treatments differ. Let me first say that I have observed that the best predictor of a subject’s datum today is their value of yesterday. Depending on the parameter and the time difference, the correlation of any two consecutive values is of the order of 0.30 to 0.90. In his book The Analysis of Variance, Henry Scheffé says that when the correlation is around 0.40 the 0.05 alpha level is actually 0.25. That is, analysis of repeated measurements destroys the alpha level. I have to admit, this blew me away when I first heard it. Let me tell you of my reaction in another way: I stopped doing any and all repeated measurement analyses as a professional pharmaceutical statistician for about 15 years. Originally, the only approach that was available was to assume that the correlation between weeks 1 and 2 and the correlation between weeks 1 and 8 were identical. This is the compound symmetry approach: all correlations are assumed to be the same.
In contrast to assuming they are all the same, I noticed that they typically are approximately the same between any two consecutive weeks (e.g., between 1 and 2, 2 and 3, … , 7 and 8). The correlations separated by two weeks (e.g., between 1 and 3, 2 and 4, … , 6 and 8) are typically lower. Correlations between the weeks maximally apart (e.g., between 1 and 8) will be the lowest. What I did in those days was to do a paired t-test looking at the pre-post differences (i.e., change from baseline) between the two groups, ignoring intermediary time points.
What changed? Well, statisticians (I’m thinking primarily of Box and Jenkins) figured out a way to analyze data with multiple observations over time, like the stock market. In time, computer programs implemented ways to handle these ‘correlated errors’. The approach I tend to use is a single parameter approach: it’s called the autoregressive error structure of lag one, or AR(1) for short. One client I used it for had parameter estimates for the correlated errors from 0.70 to 0.94. There are times when the compound symmetry approach might work (e.g., many raters measuring the same subjects), but for repeated measurements it is not valid. Let me say it again. For repeated measurements (data over time), compound symmetry is not valid. Any time I review an analysis of repeated measurements and the report does not explicitly say AR(1) or a similar approach was used, I will tell the client that it is very likely that the analysis was TOTALLY USELESS and INVALID. I feel that strongly about it. Most cheap statistical programs are not written to handle correlated errors. The better cheap programs will tell you that their approach is not valid for such data. For example, GraphPad, which only allows for compound symmetry (aka circularity), says, “Repeated-measures ANOVA is quite sensitive to violations of the assumption of circularity. If the assumption is violated, the P value will be too low.” The GraphPad manual then suggests “wait long enough between treatments so the subject is essentially the same as before the treatment.” As if that was a real option! No, the only solution for repeated measurements is to ignore them (e.g., by looking at one score, like the change from baseline), or to use a high powered stat program, like SAS’ proc mixed.
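To see the difference between the two assumptions, it helps to write out the correlation matrices they imply. This is a minimal sketch in plain Python, not what any stat package does internally; the function names are mine, and the correlation of 0.8 over 8 weekly visits is a hypothetical value I picked for illustration.

```python
def compound_symmetry(n_times, rho):
    """Every pair of different time points has the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(n_times)] for i in range(n_times)]

def ar1(n_times, rho):
    """AR(1): the correlation between time points i and j is rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n_times)] for i in range(n_times)]

weeks, rho = 8, 0.8  # hypothetical: 8 weekly visits, adjacent-week correlation of 0.8

# Correlation of week 1 with weeks 1 through 8 under each structure
print("compound symmetry:", [round(r, 2) for r in compound_symmetry(weeks, rho)[0]])
print("AR(1):            ", [round(r, 2) for r in ar1(weeks, rho)[0]])
```

Under compound symmetry week 1 is as strongly correlated with week 8 as with week 2; under AR(1) the correlation decays with every extra week of separation, which matches what I described seeing in real data.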
In sum, the assumption of normality in any reasonably sized study is not important. Unequal variances, when you have approximately equal Ns, are not important. Data transformation can often help. Always look at the data to see if outliers are present (and either change/delete them, or try to transform the data to lessen their effect). However, repeated measurements should only be analyzed by appropriate high-powered programs or avoided completely.
Post 7a will discuss an assumption not presented here, ordinal data.