‘Measure once cut twice, measure twice cut once’

‘A man with one watch knows what time it is, a man with two is never sure’

The three words which get everyone’s attention: Free, Free, Free

***

Multiple observations occur in a variety of ways.

- Multiple items (replications)
- Multiple dependent variables
- Multiple time points
- Multiple active treatments

Multiple Items (replications)

Think of a Quality of Life (QoL) scale. The best ones ask the patient/caregiver/physician a variety of questions and then add up the individual answers to get a sub-scale or total score. Why is this useful? Statistical theory can demonstrate that the ability of a scale to measure anything is related to the number of ‘parallel’ items which comprise its score. A two item total is always better than using one item to measure something. How much better?

Let us first define terms. When I talked about the ‘ability of a scale to measure anything’, the crudest thing a scale can measure is itself. How well a scale measures itself puts a limit on how well it can measure anything else. What do we mean by measure itself? Say we measure the characteristic and then do it again. We could correlate the two measures. This correlation is called the scale’s reliability (often abbreviated r_{x,x’}). There exists an equation (the Spearman-Brown prophecy formula) to determine the scale’s reliability when you increase the number of items by a factor of n:

r_{n,n’} = n r_{1,1’} / (1 + (n-1) r_{1,1’}),

where r_{1,1’} is the reliability of a single item, n is the number of items in the scale and r_{n,n’} is the reliability of the n-item scale. See below for a broader interpretation of ‘1’ and ‘n’.

Let me illustrate this by doubling the number of items (e.g., going from one item to two). If the original reliability of the single item was 0.50, then by adding a second item and using both, the new sub-scale would have a reliability of 0.667. The better the reliability, the better the ability of the scale to correlate with anything. How much better? Well, if you square a correlation, it gives the proportion of variance which can be predicted. For example, with the one item scale, 0.50^{2} indicates that 25% of the variance of a second measurement can be predicted by the first. On the other hand, if one used a two item scale, 0.667^{2} indicates that 44.5% can be predicted. A huge improvement over 25%! Or to put it another way, by increasing the reliability, you directly increase the power of a scale to show a treatment effect. Increasing the reliability directly reduces the error variance in a study and therefore increases the effect size.

What may not be obvious from the above equation is that the reliability increase is not linear but has diminishing returns. As mentioned above, if a single item had a reliability of 0.5, two items would increase it to 0.667, three items to 0.75, four items to 0.8, five items to 0.833, ten items to 0.909 and one hundred items to 0.990.
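The diminishing returns are easy to reproduce. Here is a minimal sketch of the Spearman-Brown formula in Python; the function name `spearman_brown` and the variable names are chosen for illustration, and the starting reliability of 0.50 is the one used in the example above.

```python
def spearman_brown(r, n):
    """Reliability after changing scale length by a factor n, given current reliability r."""
    return n * r / (1 + (n - 1) * r)


single_item = 0.50  # single-item reliability from the example in the text

# Reliability for the scale lengths discussed in the text
table = {n: spearman_brown(single_item, n) for n in (1, 2, 3, 4, 5, 10, 100)}

for n, reliability in table.items():
    print(f"{n:>3} items: reliability = {reliability:.3f}")
```

Going from one item to two gains about 0.17 in reliability, while going from ten items to one hundred gains only about 0.08.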

I initially used the example of a QoL scale, but the above formula is true for anything. For example, having 3 raters (and using their average) is better than using only 1. Rating the total size of all knuckles is a better measure of arthritic severity than measuring only one joint. It is better to measure blood pressure three times than only once.

BTW, I stated that r_{1,1’} is the reliability of a one item test; that is not strictly true. It could actually be the reliability of a test of any length, and n is how much larger (or smaller) the new scale would be. Nor does n have to be an integer. So if you are using the average of two raters and want to see the effect of using five raters, then n would be 2.5.
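As a quick check on this broader reading, start from the two item (or two rater) reliability of 0.667 in the earlier example, which is 2/3 before rounding, and lengthen by n = 5/2 = 2.5:

r_{n,n’} = 2.5(2/3) / (1 + 1.5(2/3)) = (5/3) / 2 = 5/6 ≈ 0.833,

which is the same five item reliability obtained earlier by starting from a single item with reliability 0.50 and using n = 5.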

All this is predicated on the different items/raters measuring the same thing. One recent study I was involved in examined the inter-rater correlations of three raters. Unfortunately, one of the raters correlated almost zero with the other two. The suggested take-away for that study is the need to train (standardize) the ‘judges’.

Summary: Having a total score of n items (or replicate measures) is always better than using a single item (one observation, one rater, or a one item scale). Better is defined as smaller error variance, hence larger effect size, hence smaller study size. As patient recruitment is typically hard, while it is easy to get patients or physicians to fill out a short questionnaire (or have the physician rate them on multiple related attributes), the study will directly benefit by decreasing noise (errors). All measurements contain error. It is the scientist’s job to reduce that error.

Multiple Dependent Variables

It is seldom the case that a medical discipline has focused on one and only one dependent variable. In discussions with my clients I always attempt to identify one key parameter which the study can rest on. I call this the key parameter, for obvious reasons. I usually do a complete and full analysis of this parameter. Of course, I’ll do inferential tests (p-values), but I’ll also do assumption testing (e.g., normality, interactions), test different populations, supportive testing, back-up alternative testing (e.g., non-parametrics), graphical presentations, etc. Following the key parameter are the secondary and tertiary parameters. The secondary parameters also merit inferential tests, but may lack alternative populations, tests of assumptions, etc. Finally, the tertiary parameters I tend to present only descriptively. I should note that the key parameter is defined at the key time point. Typically the key parameter at a secondary time point is relegated to secondary status.

But what if my client has two or three parameters and can’t make up their mind which one is key? Well, we can call them the key parameter**s**. Is there any cost for this? Yes. One simple way of handling this is to take the experiment’s alpha level (typically 0.05) and split it equally among the different primary parameters. So we can test two key parameters, not at 0.05 but at 0.025 each. This is the Bonferroni approach. Does it make the study half as powerful? Do you need to double the N? Nope! For a simple comparison, when N is moderate (e.g., > 30), one would typically need a critical t value of 2.042 for a 0.05 two-sided test to be statistically significant. For a 0.025 two-sided test, that is, for 2 key parameters, one would need 2.360. That is a 15.6% larger t value. Because the t statistic grows with the square root of N, one would need to increase the sample size by 33.6% (1.156^{2} ≈ 1.336) to compensate. If the initial sample size were larger, the increase would shrink toward a limit of 30.7%. In other words, the cost of doubling the number of key parameters is not a 100% increase in N, but roughly thirty-ish percent for a small to moderate sized trial.
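The arithmetic behind these percentages can be retraced in a few lines. This is a sketch which assumes the test statistic grows with the square root of N, so the required N scales with the square of the critical value. The critical t values are the ones quoted above, and the helper name `sample_size_inflation` is made up for illustration. The large-sample limit is computed from unrounded normal quantiles and comes out a shade above the 30.7% quoted above, which appears to rest on critical values rounded to three decimals.

```python
from statistics import NormalDist


def sample_size_inflation(crit_single, crit_split):
    """Fractional increase in N needed to offset a larger critical value.

    Assumes the test statistic is proportional to sqrt(N).
    """
    return (crit_split / crit_single) ** 2 - 1.0


# Critical t values quoted in the text (two-sided 0.05 and 0.025)
t_05, t_025 = 2.042, 2.360

t_increase = t_025 / t_05 - 1.0
n_increase = sample_size_inflation(t_05, t_025)
print(f"critical value increase: {t_increase:.1%}")
print(f"sample size increase:    {n_increase:.1%}")

# Large-sample limit: replace t quantiles with standard normal quantiles
z = NormalDist()
z_05 = z.inv_cdf(1 - 0.05 / 2)
z_025 = z.inv_cdf(1 - 0.025 / 2)
n_increase_limit = sample_size_inflation(z_05, z_025)
print(f"large-sample limit:      {n_increase_limit:.1%}")
```

Note that this follows the simple comparison in the text and looks only at the critical value; it does not bring the target power of the study into the calculation.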

There are even ways to reduce this! One can use something called an improved Bonferroni and test the larger difference at an alpha level of 0.025, like before, but the second (of two) parameters at 0.05. This isn’t just a cheapie, but a freebie. Nevertheless, I’d still power the trial using an alpha level of 0.025, not 0.05, when dealing with two key parameters.
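One way to read the improved Bonferroni described above is as Holm’s step-down procedure. That identification is an assumption, as are the function name `holm_reject` and the p-values in the example, which are invented purely for illustration. The sketch tests the smallest p-value at alpha/m and, only if that is significant, the next one at alpha/(m-1), and so on.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down test; returns one True/False (reject?) per p-value, in input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values for two key parameters (illustration only)
both = holm_reject([0.040, 0.020])      # 0.020 <= 0.025, then 0.040 <= 0.05
neither = holm_reject([0.030, 0.040])   # 0.030 > 0.025, so testing stops
print(both, neither)
```

With plain Bonferroni the first example would declare only the 0.020 result significant, since 0.040 is above 0.025. The step-down version declares both, at no cost in the overall alpha level.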

Multiple Time Points

In a previous blog (Assumptions of Statistical Tests), I pointed out that analyses of repeated measurements HAVE TO BE HANDLED IN A VERY SPECIAL WAY or AVOIDED AT ALL COSTS. I pointed out that Scheffé demonstrated that with a modest (and, if anything, underestimated) 0.30 autocorrelation, the nominal 0.05 alpha level actually is 0.25. I said that autocorrelation DESTROYS the alpha level. I’ve seen autocorrelations of 0.90 and higher.

Let me briefly demonstrate why. Say I look at a patient’s right foot and get the shoe size. Is it useful to look at the left shoe as well? Of course not. Knowing one tells us the other. Measuring both the left and right shoes is redundant. One doesn’t have two unique pieces of information, one has one. That would be the case of a correlation between two variables (left and right shoe sizes) of 1.00. [Note: The first section of this blog on multiple items is never a case of correlations of 1.00. Correlations of 0.2 and 0.3 predominate.] Only if the correlation were much lower would one be justified in measuring both. Statistical theory mandates that errors must be uncorrelated. When one talks about multiple time points, the correlation is seldom zero.

As I mentioned in that previous blog, unfortunately, the only viable solutions are to use statistical programs which handle correlated error structures or to avoid having multiple time points in the analysis. As an example of the former, I suggested using SAS’s PROC MIXED with a first-order autoregressive structure (AR(1)) or, with unequally spaced intervals, a structure called spatial power, although other structures are often used. As an example of using only one time point, I would suggest taking the baseline and key time point measurements (often the last observation), taking their difference (e.g., call it improvement) and analyzing improvement at the key time point.
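The single time point approach can be sketched as follows. The scores are hypothetical numbers made up for illustration, not data from any study, and the helper names `improvement` and `pooled_t` are likewise invented. Each patient contributes exactly one number (key time point minus baseline), and the two groups are then compared with an ordinary pooled two-sample t statistic.

```python
from math import sqrt
from statistics import mean, variance


def improvement(baseline, key_time):
    """Per-patient change from baseline to the key time point."""
    return [k - b for b, k in zip(baseline, key_time)]


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))


# Hypothetical scores, four patients per arm (illustration only)
active = improvement(baseline=[10, 12, 11, 13], key_time=[14, 15, 13, 18])
placebo = improvement(baseline=[11, 12, 10, 13], key_time=[12, 13, 12, 14])

t_stat = pooled_t(active, placebo)
print(active, placebo, round(t_stat, 2))
```

Because every patient appears only once in the analysis, there are no within-patient correlated errors left for the test to mishandle.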

Multiple Active Treatments

I am not going to say much about that here; I’ll leave that to a separate blog. However, I will point back to the improved Bonferroni above.

My next blog will discuss Great and Not so Great Designs.