What do you call a numbers cruncher who is creative? An accountant.
What do you call a numbers cruncher who is uncreative? A statistician.
***
In the last blog, I explored the meaning of variance. I said that variance is basically how the scores differ from each other. I observed that if we took every person’s score, compared it with every other person’s score, squared each difference, and divided by the appropriate N to get an average, we would have a measure of how people differ from one another. Or we could say that the simple variance is a measure of the errors made in using a mean to describe everyone.
We next observed that we could do the same thing using not the differences among people, but the differences among means. That is, take every mean, compare it with every other mean, square each difference, and divide by the appropriate N. This would be a second variance, but this one measures whether the means differ from one another: the mathematical model.
We also said that we could divide the mathematical model by the noise to get a ratio, a form of the signal to noise ratio. If the treatment differences (mathematical model) were suitably larger than the noise, we’d say that the model has some usefulness, beyond chance.
We are, in essence, comparing two variances. We uncreative statisticians called such a ratio of two variances (model/noise) an Analysis of Variance. Alas, too bad we weren’t logicians; we could have called it Analysis of Means.
So how do we test the ratio of two variances? Sorry, it wasn’t called the ANOVA distribution. Statisticians in those days liked a simple, single letter of the alphabet (e.g., z-distribution or t-distribution). This ratio is compared to an F-distribution, named after R. A. Fisher.
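To make the model/noise ratio concrete, here is a minimal sketch in Python for the simplest case of one factor with two groups. The function name f_ratio and the scores are made up for illustration; it only computes the ratio, and turning it into a p-value would still require an F-distribution table or a statistics package.

```python
from statistics import mean

def f_ratio(groups):
    """Return (F, df_model, df_noise) for a one-way layout (hypothetical helper)."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Model ("signal"): how far each group mean sits from the grand mean.
    ss_model = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Noise: how far each score sits from its own group mean.
    ss_noise = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_model = len(groups) - 1
    df_noise = len(all_scores) - len(groups)
    return (ss_model / df_model) / (ss_noise / df_noise), df_model, df_noise

# Made-up scores, for illustration only.
active = [5.0, 6.0, 7.0]
control = [2.0, 3.0, 4.0]
F, df1, df2 = f_ratio([active, control])
print(F, df1, df2)  # 13.5 1 4
```

With these toy numbers the model variance is 13.5 and the noise variance is 1.0, so the ratio is 13.5 on 1 and 4 degrees of freedom.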
Let’s imagine we have two treatments (active and control, ‘A’ and ‘C’, for short) and look at the patients at Weeks 1, 2, and 3 (the end of the trial). Let me follow my uncreative lead and call the weeks 1, 2, and 3. Let’s also imagine that it takes 3 weeks for the treatment to work, with Week 3 as the key time point. Hmm, three weeks and two treatments, there would be six means. ANOVA will compare each and every mean with one another.
A1   A2   A3   C1   C2   C3
One client actually did something like that. But ANOVA has some issues and some GREAT abilities.
First, the issues. As readers of my blogs will note, I’ve said in the past that differences among means over time deserve some very careful handling. One can’t ignore this. Blindly comparing the means is a very, very bad no-no. My client ignored this issue.
Secondly, things can get unwieldy with so many means. One great ability of ANOVA is its ability to handle things in a compartmentalized fashion. We can treat the three times as the columns of a 2 by 3 box, and the 2 treatments as the rows.
Treatment    Time
             1     2     3
Active       A1    A2    A3
Control      C1    C2    C3
For the treatment effect, we simply have one real comparison, Active vs Control. In general, we have as many unique (unrelated) comparisons as groups minus one. In statistics we say it has 1 degree of freedom (1 df). Hence, for time there would be two comparisons (2 df). Hmm, with 6 means, there should be 5 df. But we see 1 df for treatment and 2 df for time. That’s 3 df. What happened to the remaining two degrees of freedom? The remaining two degrees of freedom go into deviations from a simple treatment and a simple time effect, that is, the Treatment by Time interaction (1 df times 2 df = 2 df). More about interactions below.
In general, we’d love to just say that the Active treatment is better than the control. We could just take the three Active means (A1, A2, and A3) and compare their average with the average of the three Control means (C1, C2, and C3). More about how we do that later. ANOVA gives us a very easy way to compare the average Active and average Control means, with a single degree of freedom comparison, like a super-duper t-test. Simply put, it would be the ‘main effect’ of Treatment.
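The degrees-of-freedom bookkeeping for the 2 by 3 box can be written out as a small sketch. The helper name df_partition is hypothetical; it assumes a complete two-way layout with one mean per cell.

```python
def df_partition(n_rows, n_cols):
    """Split the df among the cell means of a two-way layout (hypothetical helper)."""
    df_rows = n_rows - 1
    df_cols = n_cols - 1
    return {
        "total": n_rows * n_cols - 1,       # 6 means -> 5 df
        "treatment": df_rows,               # 2 treatments -> 1 df
        "time": df_cols,                    # 3 weeks -> 2 df
        "interaction": df_rows * df_cols,   # 1 x 2 -> 2 df
    }

df = df_partition(2, 3)
print(df)  # {'total': 5, 'treatment': 1, 'time': 2, 'interaction': 2}
```

The three pieces (1 + 2 + 2) add back up to the 5 df among the six means.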
When can one blithely just report the treatment main effect? Well, as we discussed in other blogs the comparison of means assumes:
 Normality. Which we can ignore if we have about 10 measurements comprising the overall Active mean, the overall Control mean, etc.
 The variances of the different treatments are the same (remember heteroscedasticity?). Which we can also ignore if the Ns are about the same.
 The errors are independent. Which we CANNOT ignore on time-related data, but we can use a slightly fancier model to analyze the data (see previous blogs). If all the scores are measured at the same time (e.g., items on a quality of life scale), we can test it with something called compound symmetry (or sphericity).
 Negligible interaction. We usually rely on a significance test, of something called the interaction, in this case the Treatment by Time interaction. I typically use an alpha level of 0.10, not 0.05, for testing interactions.
Interactions
If the difference between the active and control were the same at Week 1 as at Weeks 2 and 3, then we could typically stop there. That is, A1 – C1 = A2 – C2 = A3 – C3. In that case the interaction would be zero (and the p-value would not be statistically significant at 0.10).
But we stated that it takes 3 weeks for the treatment to work. So we expected the treatment difference at Week 1 and at Week 2 to be slight and to see a real difference at Week 3. In other words, the drug effect ‘kinda’ levels off at Week 3. The treatment difference at Week 1 is different from the treatment difference at Week 2, which is different from the one at Week 3. We often don’t see an equal treatment difference happening at all three study weeks.
If the interaction is statistically significant, we CANNOT blithely report the overall treatment difference, and we CANNOT report the overall Active and Control means. Because the answer depends on which week we’re talking about, an average doesn’t make any sense. Our report must focus on the treatment difference AT EACH WEEK. In the best of all worlds, let us assume that the treatment difference at Week 1 was a n.s. +0.3 (I’m supposing that a positive difference means the active is better), +0.8 at Week 2, and a statistically significant +1.1 at Week 3 (the key time point). Game over, we succeeded, we publish or submit to the Agency. When the treatment differences are always in the same direction (i.e., positive, in this example) and differ only in size, we have what is known as a quantitative interaction; a qualitative interaction is one in which the direction of the difference flips. We could also justify a slight negative mean difference (e.g., -0.03) at Week 1, saying that the treatment needs more than a week to kick in, and prior to that no difference is expected; hence, half the time the difference would be negative.
BTW, virtually all statistical packages report the overall treatment or interaction means, their standard errors, CIs, differences, and pairwise p-values.
Up to now, we’ve been talking about time as the second factor. Other common factors in ANOVAs are investigators (sites and/or regions), demographic and blocking factors (e.g., gender or age), and baseline medical conditions (e.g., baseline severity, genetic factors). In fact, many statisticians believe in the dictum: ‘As Randomized, Analyzed’. If you block on a factor, you should include it in the ANOVA. We can include all factors simultaneously (e.g., treatment, site, gender, and baseline severity in a four-way ANOVA). One major problem is that with such a four-way ANOVA, you would have 6 two-way interactions (e.g., treatment by site, …, gender by severity), 4 three-way interactions (e.g., treatment by site by gender, site by gender by severity), and 1 four-way interaction (i.e., treatment by site by gender by severity). With 11 interactions you should count on at least one being statistically significant (p < 0.10) by chance alone. Murphy is the LAW, which we who analyze data must OBEY.
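The counting above, and the ‘by chance alone’ warning, can be checked with a short sketch. It rests on two simplifying assumptions that are mine, not the analysis itself: that no interaction truly exists, and that the 11 tests are independent of one another (in a real ANOVA they usually are not exactly independent).

```python
from math import comb

n_factors = 4  # treatment, site, gender, baseline severity
# Number of k-way interactions = ways to choose k factors out of 4.
counts = {k: comb(n_factors, k) for k in range(2, n_factors + 1)}
n_interactions = sum(counts.values())

alpha = 0.10  # significance level used for interaction tests
# Assumption: independent tests, all null hypotheses true.
expected_false_alarms = alpha * n_interactions
p_at_least_one = 1 - (1 - alpha) ** n_interactions

print(counts)                     # {2: 6, 3: 4, 4: 1}
print(n_interactions)             # 11
print(round(p_at_least_one, 3))   # 0.686
```

Under those assumptions we would expect about 1.1 spuriously significant interactions, with roughly a 69% chance of seeing at least one.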
One last point about interactions: sometimes they go away following a data transformation. Let us say that the three active means were 4, 9, and 16, and the three control means were 1, 4, and 9. If we looked at the Week 1 difference, it would be 3 (4 - 1); at Week 2, it would be 5; at Week 3, it would be 7. Since they are not the same (3 ≠ 5 ≠ 7), it would mean an interaction. However, if we applied a square root transformation, the interaction would be zero (the differences in square root units are all 1.0). Obviously this is ‘cooked’ data. Data transformation can greatly help in reducing non-normality, heteroscedasticity, and spurious interactions.
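The ‘cooked’ example can be verified in a few lines. This sketch works on the six means only, exactly as in the paragraph above; in a real analysis the transformation would be applied to each patient’s score before the means are computed. The helper name weekly_differences is made up.

```python
from math import sqrt

active = [4, 9, 16]   # Active means at Weeks 1, 2, 3
control = [1, 4, 9]   # Control means at Weeks 1, 2, 3

def weekly_differences(a, c):
    """Active minus Control at each week (hypothetical helper)."""
    return [x - y for x, y in zip(a, c)]

raw = weekly_differences(active, control)
rooted = weekly_differences([sqrt(x) for x in active],
                            [sqrt(x) for x in control])

print(raw)     # [3, 5, 7]        unequal differences: an interaction
print(rooted)  # [1.0, 1.0, 1.0]  equal differences: no interaction
```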
Weighted vs Unweighted Means
When I talked about getting the overall Active treatment mean, there are actually two methods of doing it. First let me switch from Study Week to Investigator Site (also abbreviated as Site 1, 2, and 3) in the boxes above. If, and only if, the Ns are identical in the 6 cells (N_{A1} = N_{A2} = … = N_{C2} = N_{C3}) will the weighted and unweighted means be identical. Let us assume that there was no statistically significant treatment by site interaction, hence the overall Active (and Control) means are meaningful.
Unweighted Means: One way is to do a simple average ((M_{A1} + M_{A2} + M_{A3})/3) and something similar for the control group. This is called the Unweighted Mean. Each investigator (site) is treated equally, hence each investigator is equally ‘important’. However, for the Unweighted means, some patients are counted more heavily than others. Huh? Let me ignore the third site right now. Let me assume that Site 1 went gangbusters and enrolled 60 patients, with 30 patients treated with Active. Site 2 had ‘problems’, enrolling only one patient into Active. Without going into mathematical proofs, the single Active patient in Site 2 is weighted as 30 times as important as each Active patient in Site 1. So, the unweighted means weight each site equally, but weight each patient unequally. [For SAS users, this is similar to the type III sum of squares.]
Weighted Means: A second way to compute the overall average is to weight each mean by its N, then divide by the total N. Mathematically the active mean would look like this: (N_{A1}M_{A1} + N_{A2}M_{A2} + N_{A3}M_{A3})/(N_{A1} + N_{A2} + N_{A3}), which is identical to adding up the scores of all the active patients and dividing by the number of active patients. Weighted means make the better enrolling sites more important, but treat each patient equally. [For SAS users, this is similar to the type II sum of squares.]
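Here is a small sketch of the two kinds of overall mean for the Active arm, using the two-site example (30 Active patients at Site 1, one at Site 2). The site means of 10.0 and 4.0 are invented purely for illustration, as are the function names.

```python
def unweighted_mean(means):
    """Simple average of the site means: every site counts equally."""
    return sum(means) / len(means)

def weighted_mean(means, ns):
    """Average weighted by site N: every patient counts equally."""
    return sum(n * m for n, m in zip(ns, means)) / sum(ns)

site_means = [10.0, 4.0]  # hypothetical Active means at Sites 1 and 2
site_ns = [30, 1]         # Active patients enrolled at each site

u = unweighted_mean(site_means)
w = weighted_mean(site_means, site_ns)

# In the unweighted mean each site gets 1/2 of the weight, shared among
# its patients, so the lone Site 2 patient carries 30 times the weight.
per_patient = [1 / len(site_ns) / n for n in site_ns]
ratio = per_patient[1] / per_patient[0]

print(u)             # 7.0
print(round(w, 2))   # 9.81
print(round(ratio))  # 30
```

With equal Ns the two methods agree; for example, weighted_mean([10.0, 4.0], [5, 5]) also gives 7.0.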
Many years ago, the Unweighted means (type III) was THE FDA approved methodology. The problem is that poorly enrolling sites tend to give the overall treatment difference greater noise, hence a poorer ability to reject the null hypothesis. I remember doing a type of meta-analysis where I combined all the data from every published study for a certain drug, in other words an Integrated Summary of Efficacy (ISE). The best study had 120 patients and the poorest two studies each had 4; fortunately the studies randomized approximately equally to Active vs Comparator. The total N was about 2,400. I saw a significant nonparametric test result, but the overall treatment effect (unweighted means, with each study equally important) was nonsignificant. Again, at this time the FDA wanted to see the unweighted means approach, and I dutifully used only that. I then looked at a plot of the mean difference by study size. The data fell into a triangular pattern. When the Ns were large, the treatment difference converged on one number; when the Ns were small, sometimes the means were above this, sometimes well below, just as one would expect. When N is large, the treatment difference should be close to the true treatment difference; when N is small, the treatment difference would be quite variable. I switched from using Unweighted means to Weighted means and the results were statistically significant. I gave a talk to the FDA on my findings. I don’t know if it was because of me, but they stopped relying exclusively on unweighted mean analyses.
In any case, as I’ve said in my blog on the analysis plan (6. ‘Lies, Damned Lies, and Statistics’ part 1, and Analysis Plans, an essential tool), how you’re going to do the analysis needs to be stated in the protocol and analysis plan, including the type of ANOVA, how you will be testing for interactions, critical p-values for main effects and interactions, data transformations, weighted/unweighted means, etc.