“All models are incorrect. Some are useful.” George Box

***

When you do a statistical test, you are, in essence, testing whether its assumptions are valid. We are typically interested in only one of them, the null hypothesis: the assumption that the difference is zero (although one could test whether the difference equals any specified amount). We discussed the merits of the null hypothesis in previous blogs. But the null hypothesis is only one of many assumptions.

Let me focus on the lowly t-test and also a simple two-way ANOVA (comparing two groups and time [repeated measurements]).

A second assumption is that the data are normally distributed. One unusual thing about the ‘real’ world is that data are often normally distributed. Height, IQ and many, many other parameters are normal (or Gaussian). [Note: William Gossett, aka ‘Student’, was the inventor of the t-test. Interesting side note, he worked for Guinness and developed the t-distribution to monitor stout production. However, Guinness didn’t approve of any publications, hence the pseudonym, ‘Student’. ] In general, if a variable is affected by many, many different factors, it will be normally distributed. [I’ll explain the reason for this shortly.] We even have tests to determine if the data are normal. Unfortunately, almost all variables have a slight departure from normality. As implied in my previous blogs, if we have a large enough sample, then any statistical test will reject the null hypothesis (e.g., the data will never be normally distributed if the sample size is large enough).

So, how bad is the effect of non-normal data? The answer is simply: NON-NORMALITY has almost NO EFFECT ON P-VALUES when we compare means, especially when the sample sizes are moderate. There is a theorem in statistics (the central limit theorem) which says that as the sample size increases, the distribution of means approaches the normal distribution. Let me illustrate this with the following four figures.

The dotted line represents a true normal distribution, which is what we hope to eventually see. The top left-hand plot is the original distribution. As can be seen, we have what we statisticians would call a negatively skewed distribution (the long tail points to the negative side of the number line). The top right-hand plot shows means of samples from that original negatively skewed distribution when N is 2. As can be seen, even when the sample size is 2, the distribution of means (solid line) is much less skewed than the original. The bottom plots have sample sizes of 4 and 10, respectively. When the sample size is ten, the distribution of means is virtually identical to the normal curve. So if you analyzed data from this originally skewed distribution but had a sample of ten observations, the statistical test on means would be based on a distribution which is virtually identical to the normal curve.
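For readers who like to try things empirically, here is a minimal sketch of the same idea. It is not the data behind the figures: the parent distribution (the negative of an exponential variable), the number of draws, and the seed are all assumptions chosen for illustration. That parent is far more skewed than most real data, so the skewness of the means shrinks steadily (roughly in proportion to one over the square root of N) rather than vanishing by N = 10.

```python
import random
import statistics


def skewness(values):
    """Skewness: the third standardized moment of the values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def skew_of_means(n, draws=20000, seed=1):
    """Skewness of the means of `draws` samples, each of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(draws):
        # Hypothetical parent: a negated exponential, so the long tail
        # points toward the negative side of the number line.
        sample = [-rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return skewness(means)


for n in (1, 2, 4, 10):
    print(n, round(skew_of_means(n), 2))
```

The printed skewness should move toward zero as N grows, which is the central limit theorem at work.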

This is also the reason why a parameter that is the culmination of a number of small effects tends to be normally distributed. For example, IQ is produced by the additive effect of many genes and many environmental factors. The net result is that this characteristic is like the mean of a number of sub-effects. It will tend to be normally distributed.

One cause of non-normality is outliers, or extreme values. The best way to see them is to plot the data. Two quick approaches are the stem and leaf and the box-plot. Outliers can change the means; they also very strongly influence variability and correlation. Many times outliers are transcription errors or bad assays and can be ignored/corrected. I have also noticed that sometimes units get confused (e.g., one investigator using grams and the rest using micrograms). Other times, they can’t be ignored as they are valid extreme disease states. Transformations (see below) might be the best way to handle them.

So, I would say that even if the data are expected to be pathologically non-normal, the t-test would not be affected by the non-normality when you have at least twenty observations. Do we ever do a pivotal trial with that small an N? Never! I will also mention below a second way around this non-normality issue – data transformation.

Any marginally competent statistician will look at a frequency distribution (e.g., stem and leaf) to see if the data have any marked non-normality and/or contain outliers. If I’m very lazy, I might just look at the skewness and kurtosis (the third and fourth moments of the data).

If the data are skewed, one reason might be that the measuring instrument was not constructed to differentiate on one or both ends of the scale. This is called the floor or ceiling effect. If you detect this in an early study (e.g., Phase 2a), then you might get better differentiation and sensitivity by getting or developing a different method.

A third assumption of the t-test is that the variances for the two treatments are equal. This has a fancy five syllable name – homoscedastic (pronounced ‘hoe-moe-**skee**-dast-tic’). When the two variances are not equal, it is called heteroscedastic. You could drop these terms to impress your friends and neighbors. On second thought, forget it, unless you want your friends to avoid you and your neighbors to ask you to move. [Statisticians are by nature lonely people (and humorless).]

How badly is the alpha level affected when the two groups have different variances? It depends on the sample sizes of the two groups. If we have equal N’s in the two groups, the effect is zilch. With equal N, even if the ratio of the variances were zero or infinite, the nominal 0.05 alpha level is actually 0.05; as I said, zilch. If the two sample sizes differ and the larger variance group has the larger N, then the test is actually conservative. For example, if one group had a variance twice as large as the other and also had twice the number of subjects, then the 0.05 nominal alpha level would actually be 0.029. On the other hand, if the group with the variance half the size of the other had twice the number of subjects, then the 0.05 nominal alpha level would actually be 0.080. At the extremes (although it is not possible to have either zero or infinite variability): when the group with twice the sample size had zero variability, the nominal 0.05 alpha level would actually be 0.17; and when the group with twice the sample size had infinite variability, it would actually be 0.006. So, again, I would recommend keeping the N’s the same or as close to the same as you can. This is one of the reasons why we use a 1:1 treatment allocation. [Note: the second reason is that the power to reject the null hypothesis is maximized.]
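The figures above can be roughly reproduced with a short calculation. This is a sketch under a stated assumption, not necessarily how the original numbers were derived: it uses the large-sample (normal) approximation to the pooled-variance t-test, and the function name and arguments are mine. The idea is that the test statistic is scaled by the variance the test assumes for the mean difference, while its real spread depends on the true variance.

```python
from statistics import NormalDist


def actual_alpha(var_ratio, n_ratio, nominal=0.05):
    """Large-sample true type I error of the pooled-variance t-test.

    var_ratio: variance of group 1 divided by variance of group 2
    n_ratio:   sample size of group 1 divided by sample size of group 2
    """
    v1, v2 = var_ratio, 1.0
    n1, n2 = n_ratio, 1.0
    # True variance of the difference between the two means.
    true_var = v1 / n1 + v2 / n2
    # Value the pooled variance estimate converges to in large samples.
    pooled = (n1 * v1 + n2 * v2) / (n1 + n2)
    # Variance the pooled test assumes for the difference between means.
    assumed_var = pooled * (1 / n1 + 1 / n2)
    z_crit = NormalDist().inv_cdf(1 - nominal / 2)
    # The statistic behaves like N(0, true_var / assumed_var), not N(0, 1).
    z_effective = z_crit * (assumed_var / true_var) ** 0.5
    return 2 * (1 - NormalDist().cdf(z_effective))


print(round(actual_alpha(2.0, 2.0), 3))  # larger variance with the larger N
print(round(actual_alpha(0.5, 2.0), 3))  # smaller variance with the larger N
print(round(actual_alpha(2.0, 1.0), 3))  # equal N: variance ratio is irrelevant
```

Under this approximation the first two cases land close to the 0.029 and 0.080 quoted above, and with equal N the nominal level is recovered whatever the variance ratio.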

Sometimes we are asked to have unequal allocation. In general, I would seldom recommend using more than a 2:1 treatment allocation.

Well, is there any workaround? Actually yes, a pretty neat one. One doesn’t need to analyze the data in their original form. Huh??? What I mean is that one can do some type of transformation. For example, many years ago I worked on a wound healing salve (rh-PDGF-G). We needed to measure the surface area of the wound. Most of the wounds were small (e.g., most with area less than 0.03). Unfortunately, we also saw some gaping wounds (e.g., area of 2.00). We used a square root transformation on the data. Some people might object as this might have little intrinsic meaning. One possible rationale for the square root is the simple geometry formula: Area = πr² or r = sqrt(Area/π). The square root of the area is proportional to the radius of the wound, and so can stand in for it. Wounds tend to heal from the outside moving in. That is, the radius changes. In any case, God never said that the natural number line was any better at describing the data than the square root number line. Many variables benefit from a log transformation. We mentioned in a previous blog that in analyzing drug potency we often apply a logarithmic transformation to the area under the curve. The same goes for many laboratory variables (e.g., triglycerides). It is beyond the scope of this blog to discuss all the possible transformations one could apply to normalize the distributions or variances. In fact, it is possible to select the best transformation, using any power, aka the Box-Cox power transformation. Data transformations often do work surprisingly well. You might want to include in the protocol a CYA (i.e., cover your assets): ‘If it is observed that the data is non-normally distributed, a transformation (e.g., logarithmic transformation) will be used.’
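To see what a transformation buys you, here is a small sketch. The data are invented: I assume a log-normal distribution for the hypothetical wound areas (with a median near 0.03 and an occasional very large value), which is my choice for illustration and not the distribution of the actual study data.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness: the third standardized moment of the values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(7)
# Hypothetical wound areas: mostly small, with a few gaping wounds.
areas = [rng.lognormvariate(-3.5, 1.0) for _ in range(2000)]

raw_skew = skewness(areas)
sqrt_skew = skewness([math.sqrt(a) for a in areas])
log_skew = skewness([math.log(a) for a in areas])

print("raw: ", round(raw_skew, 2))
print("sqrt:", round(sqrt_skew, 2))
print("log: ", round(log_skew, 2))
```

With data like these the square root pulls in the long tail considerably, and the log (which happens to be the ideal transformation for log-normal data) removes nearly all of it. Which power works best for real data is an empirical question, which is what the Box-Cox approach is for.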

A side benefit from applying the right transformation is seen in complicated analyses. When the data is correctly transformed, many interactions often disappear, making the data more interpretable. I’ll return to the issue of interactions in a later blog.

One final assumption is the question of correlated errors. I’m referring to a MAJOR problem one often encounters when dealing with repeated measures. For example, one runs a study and collects data weekly for 8 weeks, then one wants to see if the two treatments differ. Let me first say that I have observed that the best predictor of a subject’s datum today is their value of yesterday. Depending on the parameter and the time difference, the correlation of any two consecutive values is of the order of 0.30 to 0.90. In his book The Analysis of Variance, Henry Scheffé says that when the correlation is around 0.40 the 0.05 alpha level is actually 0.25. That is, analysis of repeated measurements **destroys** the alpha level. I have to admit, this blew me away when I first heard it. Let me tell you of my reaction in another way: I stopped doing any and all repeated measurement analyses as a professional pharmaceutical statistician for about 15 years. Originally, the only approach that was available was to assume that the correlation between weeks 1 and 2 and the correlation between weeks 1 and 8 were identical. This is the compound symmetry approach. In practice, the correlations are not all the same.

In contrast to assuming they are all the same, I noticed that they typically are approximately the same between any two consecutive weeks (e.g., between 1 and 2, 2 and 3, … , 7 and 8). The correlations separated by two weeks (e.g., between 1 and 3, 2 and 4, … , 6 and 8) are typically lower. Correlations between weeks maximally separated (e.g., between 1 and 8) will be the lowest. What I did in those days was to do a paired t-test looking at the pre-post differences (i.e., change from baseline) between the two groups, ignoring intermediary time points. What changed? Well, statisticians (I’m thinking primarily of Box and Jenkins) figured out a way to analyze data with multiple observations over time, like the stock market. In time, computer programs implemented ways to handle these ‘correlated errors’. The approach I tend to use is a single parameter approach – it’s called the autoregressive error structure of lag one, or AR(1) for short. For one client I used it for, the parameter estimates for the correlated errors ranged from 0.70 to 0.94. There are times when the compound symmetry approach might work (e.g., many raters measuring the same subjects), but for repeated measurements it is not valid. Let me say it again. For repeated measurements (data over time), compound symmetry is not valid. Any time I review an analysis of repeated measurements and the report does not explicitly say AR(1) or a similar approach was used, I will tell the client that it is very likely that the analysis was TOTALLY USELESS and INVALID. I feel that strongly about it. Most cheap statistical programs are not written to handle correlated errors. The better cheap programs will at least tell you that their approach is not valid for such data. For example, GraphPad, which only allows for compound symmetry (aka circularity), says, “Repeated-measures ANOVA is quite sensitive to violations of the assumption of circularity. If the assumption is violated, the P value will be too low.” The GraphPad manual then suggests “wait long enough between treatments so the subject is essentially the same as before the treatment.” As if that were a real option! No, the only solution for repeated measurements is to ignore them (e.g., by looking at one score, like the change from baseline), or to use a high powered stat program, like SAS’ proc mixed.
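To make the difference between the two error structures concrete, here is a small sketch that builds the correlation matrix each one assumes for 8 weekly visits. The function names and the value 0.7 are illustrative choices of mine; this only shows the shape of the assumption, it does not fit a model the way proc mixed would.

```python
def compound_symmetry(n_times, rho):
    """Every pair of time points shares the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(n_times)]
            for i in range(n_times)]


def ar1(n_times, rho):
    """Correlation decays as rho ** lag, where lag is the gap |i - j|."""
    return [[rho ** abs(i - j) for j in range(n_times)]
            for i in range(n_times)]


weeks = 8
rho = 0.7  # illustrative correlation between consecutive weeks
cs = compound_symmetry(weeks, rho)
ar = ar1(weeks, rho)

# Compare what each structure claims about near and distant weeks.
print("weeks 1 and 2:", cs[0][1], round(ar[0][1], 3))
print("weeks 1 and 3:", cs[0][2], round(ar[0][2], 3))
print("weeks 1 and 8:", cs[0][7], round(ar[0][7], 3))
```

Compound symmetry insists that weeks 1 and 8 are as strongly related as weeks 1 and 2. AR(1) uses the same single parameter but lets the correlation fall off with the gap between visits, which matches the banded pattern described above.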

In sum, the assumption of normality in any reasonably sized study is not important. Unequal variances, when you have approximately equal Ns, are not important. Data transformation can often help. Always look at the data to see if outliers are present (and either correct/delete them, or try to transform the data to lessen their effect). However, repeated measurements should only be analyzed by appropriate high-powered programs or avoided completely.

Post 7a will discuss an assumption not presented here, ordinal data.

This is a trivial point only, but I’m sure Student’s name was actually William Gosset. I have always assumed that the Gaussian distribution was named for the famous German mathematician Carl Friedrich Gauss.

To reply to Karyn’s observation:

You are completely correct, Gauss is for Carl Friedrich Gauss. The Gaussian distribution is the normal distribution. Student was the pseudonym of William Gosset, who discovered and parameterized the t-distribution. My parenthetical comment ‘[Note: William Gauss, aka ‘Student’, was the inventor of the t-test.]’ is wrong and should be ignored. Fortunately the originator of the t-test is totally unrelated to my conclusions on the effects of non-normality, heteroscedasticity, data transformations, and repeated measurements. Karyn’s observation on the originator of the t-test as Gosset is correct. The observations I made about the assumptions of the statistical tests still stand.

Great blog! Very helpful for a methodologist (aka a wannabe statistician). It is hard to realise that p-values (as surprise index) are not influenced by non-normality. So I sampled from a negative binomial distribution and compared simple lm and glm, only to find that it does not make a difference.

But comparing lm-coefficients of samples from a normal distribution, adjusting a copy to a negative binomial, calculating z-scores and performing a simple regression, reduces the correlation “significantly”.

I have colleagues who perform an ANOVA on a Likert-type scale from 1 to 5 and analyse mean differences of 1.2 and 1.6 – this doesn’t make sense to me but it is difficult to make a case when they all refer to the robustness of the analysis. Any thoughts on this matter would be very helpful.

Three points.

First, my comments on the robustness of analysis of means still hold for the validity of the p-value. This includes using it to analyze Likert scales. The Central Limit Theorem works amazingly well. Means from Ns greater than 10 appear almost identical to normal distributions, even if you used a negative binomial distribution or virtually any distribution.

Second, while the p-value (i.e., the type I error rate) is robust to departures from normality, the power (i.e., one minus the type II error rate) is strongly affected. Anything which limits the effect size will reduce the power of the study. This will directly affect your colleague’s use of a Likert scale, when there are means of 1.2. On a 1 to 5 scale, a mean of 1.2 implies that at least 80% of the respondents are identically scoring 1. I would strongly recommend he ABANDON the 5 point scale and use a much wider scale, especially expanding the range at the floor of the scale. For example, he could use a 100 point scale and reword the low anchor point to an extreme. For example, assume the question was ‘I would eat vanilla ice cream’ with 1 as agree and 5 disagree. He could change it to 1 to 100 with 1 being, ‘I crave it. I NEED it. You MUST give it to me. NOW. NOW I SAY. Do you hear me. NOW. I would gladly steal from my grandmother her life savings for a taste. Heck, take her, just give me the ice cream.’ Well, maybe not that extreme, but your colleague definitely needs to get that 80% off the bottom single score, if he wants any reasonable power. I would guess his effect size is >45% lower than it would be if his distribution were more normal. To put it another way, with almost all people answering identically, there is no discriminating information in that scale. So why include it?

Third, my comments apply only to analyzing means, not variances (or covariance). [Although an Analysis of VARIANCE on means is still appropriate.] Analyzing variances and covariances, e.g., correlations, will be directly affected by non-normality. More to the point, it is IMPOSSIBLE to get a correlation of 1.0, when the two variables have different distributions**. That is not to say that people should not do such correlational analysis. It is done all the time. Most of the time it is done without any awareness of the consequences. I cannot imagine two more differently shaped distributions than a normal curve and a dichotomous distribution, but we have in our toolkit a point-biserial correlation. For those who forget the biserial correlation, it is a ‘correction’* for an artificially dichotomized parameter to adjust and treat the dichotomy as if it were continuous and normal. Such a correction will always increase the correlation. The magnitude of the correction is dependent on the proportion in the larger dichotomy (see below). When the dichotomy is 50-50, then un-dichotomizing the parameter will increase the correlation by 25%. When you have an 80-20 split, then the correlation would be expected to increase by 45%. [Note: even if one does a Pearson product-moment correlation with 0/1 parameter, the correlation would still be called a point-biserial correlation. It is not the mechanics of the program, but the nature of the parameters, which affect the correlation’s name.]

*The ‘correction’ is a multiplication of the point-biserial value by [square root (p*(1-p))]/h, where p is the proportion in one category of the dichotomy, and h is the ordinate (height) of the normal distribution at that point. h is always smaller than square root (p*(1-p)), so the corrected correlation is always larger.
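A quick way to check the size of this correction is to compute the multiplier directly. This is a minimal sketch; the function name is mine, and it assumes the latent variable behind the dichotomy is standard normal, which is the usual assumption behind the biserial correlation.

```python
from statistics import NormalDist


def biserial_factor(p):
    """Multiplier that converts a point-biserial r into a biserial r.

    p is the proportion of cases in one category of the dichotomy.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(p)   # cut point on the latent normal scale
    h = std_normal.pdf(z)       # ordinate (height) of the normal curve there
    return (p * (1 - p)) ** 0.5 / h


for p in (0.5, 0.8):
    print(p, round(biserial_factor(p), 2))
```

By this calculation the multiplier is about 1.25 for a 50-50 split and a little over 1.4 for an 80-20 split, and it keeps growing as the split becomes more lopsided.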

**As you like to empirically try things, you could have two dichotomous (0/1) parameters. Correlate them. Try 50/50 splits, then any other split. The phi coefficient [phi is the name when you correlate two dichotomous parameters] could be 1.0 when both have a 50/50 split, but could never be 1.0 when the second parameter has an 80/20 split (the maximum phi should be about 0.50). Phi max has been extensively studied, with its value equal to square root [(p1/(1 – p1)) * ((1 – p2)/p2)], where p1 and p2 are the larger proportions for parameters 1 and 2, respectively, and p2 is at least as large as p1.
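Here is a sketch of that experiment done with arithmetic rather than data. It uses the standard form of phi max, square root of (p1/(1 - p1)) * ((1 - p2)/p2) with p1 no larger than p2, and checks it against phi computed from the most favorable 2x2 table, the one where everyone scoring 1 on the rarer variable also scores 1 on the other. The function names are mine.

```python
def phi_max(p1, p2):
    """Largest attainable phi for two 0/1 variables with P(1) = p1 and p2."""
    lo, hi = sorted((p1, p2))
    return ((lo / (1 - lo)) * ((1 - hi) / hi)) ** 0.5


def phi_from_joint(p1, p2, p11):
    """Phi from the two marginal proportions and the proportion scoring 1 on both."""
    return (p11 - p1 * p2) / (p1 * (1 - p1) * p2 * (1 - p2)) ** 0.5


# Best case: the overlap is as large as the marginals allow, min(p1, p2).
for p1, p2 in ((0.5, 0.5), (0.5, 0.8)):
    best = phi_from_joint(p1, p2, min(p1, p2))
    print(p1, p2, round(phi_max(p1, p2), 3), round(best, 3))
```

Two 50/50 variables can reach 1.0, while a 50/50 variable paired with an 80/20 variable tops out at 0.50, matching the value quoted in the footnote.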

Select and explain what will be your subsequent follow up, including statistical tests, if the key assumption(s) were found to be violated

Uhh, which key assumption?

Normality: If the N was greater than 10 and I was comparing means, then I wouldn’t do anything. If appropriate, I might attempt a data transformation. Alternatively, I would suggest changing the measurement scale for future studies (getting rid of floor or ceiling effects). If I was analyzing correlations, I’d be much more inclined to attempt a data transformation, or to accept that the correlations were underestimates. I would NEVER include a statistical test of the normality assumption. When N is small (like N=5, where I should be concerned about the effects of non-normality), the power of the normality test is minuscule. Normality tests require large Ns. If N is moderate (e.g., N=50), then the central limit theorem will be in force, and the power of the normality test is still quite low. When N is large (like 5,000, where the central limit theorem says I should never be concerned), small departures from normality are always detected.

Heteroscedasticity: If the Ns were about the same (i.e., the ratio of the Ns wasn’t greater than 2:1), then the effect would be limited, i.e., I wouldn’t worry about it. Alternatively, I might use a correction like the Satterthwaite approach. Satterthwaite ‘corrects’, i.e., decreases the N (actually the degrees of freedom), to adjust for the unequal variances. If I had many means and a two-way ANOVA (e.g., a Site by Treatment analysis), I would be inclined to use an approach which minimizes the effects of the small N cells (e.g., weighted means SAS type III analysis). A statistical test for heteroscedasticity? Hartley’s Fmax comes to mind. Simply, one looks at the ratio of the largest to smallest variance and checks it against his table. If one only has two variances, this becomes a simple F test.

Correlated error: I’d use a statistical program which can adjust for it (e.g., SAS proc mixed), or I wouldn’t analyze the different time points (e.g., only look at one ‘key’ time point or combine the different time measurements into one parameter, like an AUC or average across all post-baseline observations). Statistical test for the correlated error assumption? I’d first look at the correlations (e.g., time 1 with time 2, time 1 with time 8). Each could be tested against 0. Sphericity could be tested by checking if all correlations were approximately equal. The Fisher’s test comes to mind. I tend to visually analyze these correlations. If it looks like they are banded (i.e., time 1 with 2 is the same as 2 with 3 and 3 with 4 …, and is higher than 2 with 4 and 3 with 5 …) then I’d use an AR(1) autocorrelation, otherwise I’d use an unrestricted correction.

Let me put it this way.

I have computed the 95% confidence intervals for the revenues of the two samples below.

Question 1) How can I validate if the revenues of the two samples are different?

Question 2) How do I describe the key assumptions that I have made in constructing this confidence interval?

Question 3) How do I know if these assumptions are reasonable, and what will be my subsequent follow up, including statistical tests, if the key assumptions were found to be violated?

| Statistic | Sample A – Total Revenue | Sample B – Total Revenue |
| --- | --- | --- |
| Mean | 4700.473 | 3816.883 |
| Standard Error | 92.37343537 | 59.38452664 |
| Median | 4623.1 | 3720.25 |
| Mode | #N/A | 3903 |
| Standard Deviation | 923.7343537 | 593.8452664 |
| Sample Variance | 853285.1561 | 352652.2004 |
| Kurtosis | -0.539918152 | -0.699770209 |
| Skewness | 0.372166109 | 0.226499031 |
| Range | 3886.6 | 2697.4 |
| Minimum | 3083.5 | 2513 |
| Maximum | 6970.1 | 5210.4 |
| Sum | 470047.3 | 381688.3 |
| Count | 100 | 100 |
| Confidence Level (95.0%) | 183.2889363 | 117.8317844 |
| Upper Limit | 4883.761936 | 3934.714784 |
| Lower Limit | 4517.184064 | 3699.051216 |

Thanks

What you did was compute the CI for each sample separately. What you need to do is compute the CI of the difference between the two samples. This is identical to the operations for computing a t-test. I would assume that the variances might be different (the ratio of the two variances is 2.4 – that is, one is 2.4 times larger than the other, which is statistically significant, p < 0.0001). The input to the proper program would be the two means, the two s.d., and the Ns. The other statistics (e.g., kurtosis) are all irrelevant to the needed computation. Note: the skewness and kurtosis indicate a positively skewed, somewhat flattened distribution. This is completely countered by the N/sample = 100, so ignore any non-normality. On the other hand, you might want to examine the data distribution. I suspect it might be bimodal (two peaks, one on the high side).

As you might know, I always suggest looking at the magnitude of the difference rather than the silly question: is any difference seen (statistical significance). I would guess (actually it’s more than a guess) that 1) the difference was significant, and 2) the difference is quite large (a first pass says the two differ by over 1.1 standard deviations).
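For anyone who wants to see the arithmetic, here is a sketch of that computation from the summary statistics alone. It uses the unequal-variance (Satterthwaite) form. Two simplifications are mine: the normal quantile stands in for the t quantile, which is reasonable only because the degrees of freedom are large here, and the standardized difference uses the simple average of the two variances.

```python
from statistics import NormalDist

# Summary statistics taken from the two samples quoted in the question.
mean_a, sd_a, n_a = 4700.473, 923.7343537, 100
mean_b, sd_b, n_b = 3816.883, 593.8452664, 100

diff = mean_a - mean_b
var_mean_a = sd_a ** 2 / n_a   # squared standard error of mean A
var_mean_b = sd_b ** 2 / n_b   # squared standard error of mean B
se_diff = (var_mean_a + var_mean_b) ** 0.5

# Satterthwaite's approximate degrees of freedom for unequal variances.
df = (var_mean_a + var_mean_b) ** 2 / (
    var_mean_a ** 2 / (n_a - 1) + var_mean_b ** 2 / (n_b - 1)
)

# Assumption: with df this large, the normal quantile approximates the t quantile.
z = NormalDist().inv_cdf(0.975)
ci = (diff - z * se_diff, diff + z * se_diff)

# Standardized difference, using the average of the two variances.
effect_size = diff / ((sd_a ** 2 + sd_b ** 2) / 2) ** 0.5

print("difference:", round(diff, 2), "SE:", round(se_diff, 2), "df:", round(df, 1))
print("approx. 95% CI:", tuple(round(x, 1) for x in ci))
print("difference in SD units:", round(effect_size, 2))
```

The interval for the difference sits far from zero, and the standardized difference comes out a bit above 1.1, in line with the first-pass estimate above. A proper program would use the t quantile for the Satterthwaite degrees of freedom, which widens the interval only slightly at this sample size.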

To explicitly answer your three questions:

1) You would need to compute a t-test or CI on the difference between the two samples.

2) Assumptions: the parameter ‘revenue’ is an interval level measurement; the data is roughly normally distributed, or at least the sample is large enough for the central limit theorem to apply; and the variances may be different (heteroscedasticity), hence an adjustment will need to be made.

3) My main concern here is whether you are asking the right question. Is the ‘purpose’ of the ‘study’ to determine if a significant difference exists between these two samples?