There are two types of people, those who classify people into two types of people and those who don’t.

Never trust anyone over thirty.

As Mason said to Dixon, ‘you gotta draw the line somewhere’.

***

Don’t get me wrong: there are many places where you need to draw a dichotomy. For example, you need to exclude patients. Obesity (a body mass index [BMI] >30) might be an exclusion criterion. Similarly, you might want to exclude asymptomatic patients (patients whose key parameter at baseline is less than *x*). What I’ll be talking about is dichotomizing the dependent variable.

Nor am I talking about a natural dichotomy. Examples of real dichotomies are alive or dead, male or female. However, even real dichotomies are often fraught with major difficulties. For example, in one mortality trial we dealt with mortality within 30 days of initial treatment for patients with life-threatening trauma. Unfortunately, some of those patients were alive only by dint of extraordinary medical intervention (i.e., their loved ones refused to allow the brain-dead patients to pass on).

I shall be discussing the effects of drawing a dichotomy. I will assume that there is a continuous parameter which is separated into two parts: patients who are above a value *x* and patients who are at or below *x*. As an example, take this quote: “That study, first reported in 2008, found that 40 percent of patients who consumed the drink improved in a test of verbal memory, while 24 percent of patients who received the control drink improved their performance.” They are obviously not talking about a 40% increase from baseline, but about the percentage of patients with any improvement, a dichotomy. Also note that the split into improved/not improved wasn’t at 50%.

The best way to dichotomize is to use the published literature and some expert’s dichotomy. Unfortunately, when you delve deeper into why they selected the cut-off, it is often quite arbitrary and not at the optimal 50/50 split. If you’re dichotomizing the data yourself, for reasons given below, try to dichotomize at the median, so you have a 50/50 split between the two groups. Why do I suggest using an ‘expert’ cutoff? Because it is possible to data dredge and find the split which presents your data to maximum benefit, and the FDA might ‘red flag’ any analysis with an arbitrary cut-off. At a minimum, if you plan on dichotomizing, state how the cut-off was/will be derived in the protocol (or analysis plan). If you create the cut-off criteria post hoc, they will be completely non-credible.

Let me first discuss why people like to dichotomize a parameter. It’s dirt-easy to understand. Everyone feels they understand proportions. If you dichotomize, then you classify the world into winners and losers (successes or failures). It is easy to think in terms of black and white. As I implied in another blog, people might not understand what the average on an esoteric parameter (e.g., a verbal memory test) represents. But a difference of 18% is understood by all.

My objections to such an easy to understand statistic? Let me make a list:

- Power – you need to enroll more patients into your trial.
- We throw away interval level information (hence means) and ordinal level information (hence medians).
- The statistical approaches often assume large Ns.
- The statistical approaches limit the types of analyses.

Power

In a previous blog, I pointed out that effect size and correlation are different versions of the same thing (see the equation relating effect size and correlation in ‘4. Meaningful ways to determine the adequacy of a treatment effect when you lack an intuitive knowledge of the dependent variable’). If you have a correlation and dichotomize one variable, how does that affect the size of the new correlation (with the dichotomized data)? Well, it is the relationship between the biserial and the point-biserial correlation. The undichotomized relationship (biserial correlation) is ALWAYS larger than the dichotomized relationship (point-biserial correlation). How much larger depends on the proportion of patients in the two dichotomies and the ordinate of the normal curve at the cut point. If one forms the dichotomy at the ideal 50%, then the continuous data would have a 25% larger correlation. If the dichotomy is not in the middle, the gap grows. For example, with a 90%/10% split the undichotomized correlation would be about 70% larger than the dichotomized one.
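The 25% and 70% figures above can be checked directly from the classical relationship between the two correlations (a sketch of my own, not code from the blog): the point-biserial correlation equals the biserial correlation times the ordinate of the normal curve at the cut point, divided by the standard deviation of the split proportion.

```python
# Attenuation of a correlation when one variable is dichotomized at proportion p:
#   point-biserial r = biserial r * phi(z_p) / sqrt(p * (1 - p))
# where phi is the standard normal density and z_p the p-th normal quantile.
from statistics import NormalDist

def attenuation(p: float) -> float:
    """Ratio of point-biserial to biserial correlation for a split at proportion p."""
    nd = NormalDist()
    z = nd.inv_cdf(p)                      # the cut point on the normal curve
    ordinate = nd.pdf(z)                   # height of the normal curve there
    return ordinate / (p * (1 - p)) ** 0.5

# A 50/50 split keeps ~80% of the correlation (the continuous r is ~25% larger);
# a 90/10 split keeps only ~58% (the continuous r is ~70% larger).
for p in (0.5, 0.9):
    factor = attenuation(p)
    print(f"split at {p:.0%}: keeps {factor:.3f} of r; "
          f"continuous r is {1 / factor - 1:.0%} larger")
```

Running this reproduces the two numbers quoted above: a 50/50 split attenuates the correlation by a factor of about 0.798, and a 90/10 split by about 0.585.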

How does this attenuated correlation affect N or power? Obviously, the number of patients you must enroll will have to be bigger. But how much bigger? Let me assume that we ran a power analysis and came up with a few sample sizes. As we discussed in the Power Analysis blog (‘8. What is a Power Analysis?’), if we know alpha (I will use 0.05, two-sided, below) and power (80%), we can deduce the detectable effect size for a given number of patients enrolled per group. So, I can input various Ns (first column below) and get the detectable effect size (δ) in the second column. I can then apply the optimal (50%) and a more extreme (90%) dichotomy and determine the effect on the effect size (δ, third and sixth columns), on the new number of patients needed (fourth and seventh columns), and how these increased Ns relate to the original number of patients with the continuous data (fifth and eighth columns).

| N/group | Continuous δ | δ (50% dichotomy) | N/group (50%) | Sample size increase (%) | δ (90% dichotomy) | N/group (90%) | Sample size increase (%) |
|--------:|-------------:|------------------:|--------------:|-------------------------:|------------------:|--------------:|-------------------------:|
| 25 | 0.809 | 0.580 | 48 | 92 | 0.396 | 102 | 308 |
| 50 | 0.566 | 0.427 | 88 | 76 | 0.301 | 175 | 250 |
| 75 | 0.460 | 0.354 | 127 | 69 | 0.252 | 249 | 232 |
| 100 | 0.398 | 0.309 | 166 | 66 | 0.222 | 320 | 220 |
| 200 | 0.281 | 0.221 | 323 | 62 | 0.160 | 615 | 208 |
| 300 | 0.229 | 0.181 | 481 | 60 | 0.132 | 902 | 201 |
| 400 | 0.198 | 0.157 | 638 | 60 | 0.114 | 1,209 | 202 |
| 500 | 0.177 | 0.140 | 802 | 60 | 0.102 | 1,510 | 202 |
| 750 | 0.145 | 0.115 | 1,188 | 58 | 0.084 | 2,226 | 197 |
| 1,000 | 0.125 | 0.099 | 1,603 | 60 | 0.073 | 2,947 | 195 |
| 2,000 | 0.089 | 0.071 | 3,115 | 56 | 0.052 | 5,807 | 190 |
| 4,000 | 0.063 | 0.050 | 6,281 | 57 | 0.037 | 11,468 | 187 |
| 8,000 | 0.044 | 0.035 | 12,816 | 60 | 0.026 | 23,223 | 190 |

What do we get? When we dichotomize, the effect size decreases. It always decreases. At best, we’d have to increase the sample size by 60% to compensate; smaller studies would need to almost double in size. Again, that is the best case. If the dichotomy weren’t at 50%/50% but at, say, 90%/10%, then we’d need to increase the sample size by roughly 190%, almost a three-fold increase in N. For example, in a very small trial (N/group = 25) we would need to go from 25 patients per group to 48 with the optimal 50/50 split. With a 90/10 split, we would go from 25 to 102, over a four-fold increase in N.
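The table’s logic can be sketched in a few lines. This is my own illustration using the large-sample normal approximation (alpha = 0.05 two-sided, power = 80%), so the numbers land close to, but not exactly on, the table’s values, which came from exact routines: deduce the detectable δ for a given N, attenuate the corresponding correlation, and convert back to a required N.

```python
# Sketch: sample-size inflation caused by dichotomizing the dependent variable.
# Normal-approximation power formulas; assumes alpha = 0.05 two-sided, power = 80%.
from statistics import NormalDist

nd = NormalDist()
Z = nd.inv_cdf(0.975) + nd.inv_cdf(0.80)          # ~2.80, the "power constant"

def detectable_delta(n_per_group: float) -> float:
    """Effect size detectable with n patients per group (two-sample z-test)."""
    return Z * (2.0 / n_per_group) ** 0.5

def attenuated_delta(delta: float, p: float) -> float:
    """Effect size remaining after dichotomizing one variable at proportion p."""
    r = delta / (delta ** 2 + 4) ** 0.5           # effect size -> correlation
    z = nd.inv_cdf(p)
    r_pb = r * nd.pdf(z) / (p * (1 - p)) ** 0.5   # biserial -> point-biserial
    return 2 * r_pb / (1 - r_pb ** 2) ** 0.5      # correlation -> effect size

def required_n(delta: float) -> float:
    """Patients per group needed to detect effect size delta."""
    return 2.0 * (Z / delta) ** 2

for n in (100, 1000):
    d = detectable_delta(n)
    for p in (0.5, 0.9):
        n_new = required_n(attenuated_delta(d, p))
        print(f"n={n}, split at {p:.0%}: need ~{n_new:.0f} per group "
              f"({n_new / n - 1:.0%} more)")
```

For N/group = 1,000, this approximation gives roughly 1,570 per group at a 50/50 split and roughly 2,930 at 90/10, in line with the table’s 1,603 and 2,947.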

Why? Well, we’re throwing away a lot of information. Say we consider a 10 pound weight loss a success. A 10.1 pound weight loss is treated the same as a 40 pound weight loss. A 9.9 pound weight loss, despite being a minuscule 0.2 pounds away from 10.1, is a failure; that 9.9 is considered the same as a 10 pound weight gain. That is, all weight gains and all weight losses of less than 10 pounds are treated identically. All weight losses of 10 pounds or greater are also treated identically. We ignore all that information. Why the difference between a 50/50 split and a 90/10 split? With the latter, most of the data are identical. There is nothing to differentiate the 1st percentile from the 89th; 90% of the data are identical. The only new information appears in the top (or bottom) 10%. With the optimal 50/50 split, half the data is different from the other half.
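The weight-loss example above amounts to one line of code (a toy illustration of my own, not from any trial): every change, however large or small, collapses to a single success/failure bit at the cutoff.

```python
# Dichotomizing weight change at a 10-pound-loss cutoff.
losses = [40.0, 10.1, 9.9, -10.0]        # pounds lost (negative = gain)
success = [loss >= 10 for loss in losses]
print(success)                           # [True, True, False, False]
# 40.0 and 10.1 become identical successes;
# a 9.9 pound loss and a 10 pound gain become identical failures.
```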

Interval, Ordinal and Nominal Data

Let me say it again. What is the difference between a 10.1 pound weight gain and a 40.5 pound gain (please shut your 3rd grader up, it’s not 30.4 pounds)? It’s zero. What about the difference between a 20 pound weight gain and a 9 pound weight loss? Your fifth grader would be wrong with 29 pounds; it is again zero. We give up all ‘interval’ level information. In such a dichotomous analysis, we shouldn’t present the average (we are not allowed to add the weights together). Nor do we consider even the order of the numbers. A forty pound weight loss is not more than a thirty pound weight loss, which is not more than a ten pound weight loss. A nine pound weight loss is actually less than a ten pound weight loss, but it is not more than a one pound weight loss or even a twenty pound weight gain. We are ignoring almost all ordered (‘ordinal’) information. We should not present the median weight loss; we are not allowed to order the weights to find the middle value. We only have nominal information. We are throwing away a lot of information.

Given that we are throwing away all this information, it is staggering that we need to increase our sample size by only 60% (the optimal case; it could be a four-fold increase under non-optimal conditions).

Large N Assumption

When you have two groups and a success/failure dependent variable, the analysis of choice is the 2×2 contingency table (e.g., a chi-square test). The chi-square test is not appropriate when any expected cell count is less than 5. SAS prints out the following warning: “*x*% of the cells have expected counts less than 5. Chi-Square may not be a valid test.” That means, at best, we need at least 10 patients per group, or 20 total for a two-group study (assuming that success/failure splits 50%/50%). If failure (e.g., many AEs or mortality) is rare (e.g., 10%), then we’d need 50 patients per group at a minimum.
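The expected counts behind that warning are easy to compute by hand (my own illustration, not SAS output): with 25 patients per group and a roughly 10% failure rate, the failure cells expect only 2.5 patients each.

```python
# Expected cell counts for a 2x2 table under independence:
#   expected = row total * column total / grand total
def expected_counts(table):
    """Expected cell counts for a contingency table under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# 25 patients per arm, ~10% failures: rows are arms, columns (success, failure)
table = [[23, 2], [22, 3]]
expected = expected_counts(table)
low = [e for row in expected for e in row if e < 5]
print(expected)                          # [[22.5, 2.5], [22.5, 2.5]]
print(f"{len(low)} of 4 cells below 5 -> chi-square may not be valid")
```

Both failure cells expect 2.5 patients, which is exactly the situation the SAS warning flags.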

The analogue of the analysis of variance is logistic regression. One of its key assumptions is ‘asymptotic normality’, which means the test is valid only when the Ns are quite large. Logistic regression routinely uses hundreds of observations. ‘Nuff said.

Dichotomization (or non-parametric statistics in general) is NOT a viable alternative when you’re concerned with small samples. See my blog ’19. A Reconsideration of my Biases’, where I consider small samples for ordinal data.

Types of Problems

Most clients do more than classify the patients as in treatment group A or B. Patients are assigned to different centers. We have male and female patients. We look at the patients weekly (or monthly). We have slightly different etiologies for the patients. They differ by age. That means factorial designs with treatment, gender, time, etiology, and age, along with all the interactions. Let me take a simple case: we have two treatments and look at the patients at weeks 1, 2, 4, and 8. In the continuous world, we would do a two-way ANOVA, with factors of treatment and time. Key in this type of analysis is what is called the interaction. We would expect the treatment difference to increase over time. For example, at week 1, with little time for a response, the treatment difference is expected to be very small. As time increases, the difference would increase (although not necessarily in a consistent manner). Perhaps the largest difference would be seen at the study’s endpoint (8 weeks). This interaction is the key result in such a trial. I’m not going to elaborate here, but interactions are a horror to analyze in logistic regression and even worse to interpret. Without interactions, there is almost no point in doing a multi-factor study. And please don’t even get me started on how to handle correlated data (e.g., repeated measurements over time) – the one assumption which blows away the significance test (see ‘7. Assumptions of Statistical Tests’) and which you can’t realistically design around.

Horror Story: In the CHMP submission for Vyndaqel (tafamidis meglumine), a key co-primary d.v. was a dichotomy: worsening from baseline score (NIS-LL >2) or study dropout. The bottom line: the results were not statistically significant (p = 0.0682), while an ‘ancillary analysis’ of the continuous change from baseline at study endpoint gave p = 0.0271. The FDA rejected the submission for efficacy reasons. The CHMP, however, clearly bent over backwards to provisionally accept the application for this orphan drug as the sole available licensed treatment for TTR-FAP. The dichotomized d.v. was not significant, but the continuous d.v. achieved statistical significance.

Conclusion: I agree that dichotomizing data into success and failure makes interpretation much easier. However, to plan a trial for a dichotomy would necessitate at a minimum a 60% increase in patients. A small study would, at best, need to be doubled in size. If the split into the two groups is not the ideal 50/50, then the increase would need to be much larger. A statistical analysis of a dichotomy also requires a large N. It also makes factorial designs almost impossible to analyze or interpret.

Recommendation: If simplicity of interpretation is desired, then analyze the data as a continuum, but present (descriptive [no p-values or CI]) summary tables with the dichotomy. I personally often relegate variables into three categories: primary, secondary and tertiary. Tertiary variables (like dichotomies) would be presented only descriptively – no (inferential) statistical analyses.

Dichotomization is the most restrictive form of non-parametric analysis. I’ll say more about the ordinal form of non-parametric analyses in ’10. Parametric or non-parametric analysis – Why one is almost useless’.