The following blog was written by Randy Gallistel, PhD of Rutgers. It presents a Bayesian approach to hypothesis testing. It was written on April 23, 2012, but will eventually appear to have an earlier date, to sort it immediately after my original (Frequentist) blog.

The Bayesian would say that the truism that one cannot prove the null is a consequence of a misformulation of the inference problem. If we agree that hypothesis-testing statistics is the mathematics of probabilistic inference and if we resort to probabilistic inference only when we are faced with some uncertainty as to which conclusion to draw, then the NHST formulation of the problem is ruled out because, given that formulation, we have no uncertainty: Only one possible conclusion is to be tested against the data, the null conclusion, and we are a priori certain that it cannot be true. Thus, there is no inference problem.

One might object that this is not so; the alternative is that there is “some” (positive!) effect. But until we specify what we understand by “some”, this is not a well-formulated alternative. For example, in a typical pharmacological clinical trial, “some” effect could mean that the drug had an effect anywhere between 0 and complete cure in every patient (maximum possible effect). If that is what we understand “some” effect to mean, then for most drugs, the null conclusion (no positive effect) has a greater likelihood than the “some” (positive) effect conclusion.

The Bayesian computation tells us how well each possible conclusion (aka hypothesis) predicts the data that we have gathered. The possible hypotheses are represented by prior distributions. These prior distributions may be thought of as bets made by each hypothesis before the data are examined. Each hypothesis has a unit mass of prior probability with which to bet. The null conclusion bets it all on 0. The unlimited “some” hypothesis spreads its unit mass of prior probability out over all possible effect sizes. The question then becomes which of these prior probability distributions does a better job of predicting the likelihood function.

Likelihood is sometimes called the reverse probability. In forward probability, we assume that we know the distribution (that is, we know its form and the values of its parameters) and we use this knowledge to predict how probable different outcomes are. In reverse probability (likelihood), we assume we know the data and we use the data to compute how likely those data would be for various assumptions about the distribution from which they came (assumptions about the form and about the values of the parameters of the distribution from which the data may have come). The likelihood function tells us the likelihood for all different values of the parameters of an assumed distribution. The highly likely values are the ones that predict what we have observed; the highly unlikely ones are the ones that predict that we should not have observed what we have in fact observed.
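
To make the idea of a likelihood function concrete, here is a minimal sketch. It assumes, purely for illustration, that the measurements are normally distributed with a known standard deviation of 1. The data values, the grid of candidate means, and the function names are all invented for this example.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def likelihood(mu, data, sd=1.0):
    """Likelihood of a candidate mean mu: the joint density of the observed
    data under a normal model with that mean and a known sd (an assumption)."""
    return math.prod(normal_pdf(x, mu, sd) for x in data)

# Hypothetical effect measurements (made-up numbers for illustration).
data = [0.3, -0.2, 0.5, 0.1, 0.4]

# Evaluate the likelihood function on a grid of candidate means from -1 to 2.
grid = [i / 10 for i in range(-10, 21)]
curve = {mu: likelihood(mu, data) for mu in grid}

# The candidate that best predicts what we actually observed.
best = max(curve, key=curve.get)
print(f"candidate mean with the highest likelihood on the grid: {best}")
```

The values stored in `curve` are not probabilities of the candidate means, and nothing forces them to sum to one. They only rank the candidates by how well each predicts the observed data.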

The possibilities for which probabilities are defined in a probability distribution are mutually exclusive and exhaustive, so their probabilities must sum (integrate) to one. Reverse probabilities (likelihoods), by contrast, are neither mutually exclusive nor exhaustive. It is possible to have two hypotheses that are distinct but overlapping and they may both either predict the data we have with absolute certainty (in which case, they both have a likelihood of 1) or not at all (in which case, they both have a likelihood of 0). Generally, however, one hypothesis does a better job of predicting our data than the other, in which case that hypothesis is more likely than the alternative. The Bayes Factor is the ratio of the likelihoods, in other words, the likelihood of the one hypothesis relative to the other.

Suppose the data suggest only a weak positive effect. That means that we COULD have got those data with reasonable probability even if there is in fact no effect (the null hypothesis), whereas we could not have got those data if the effect of the drug were so great as to completely cure every patient, which is one of the states of the world encompassed by the unbridled version of the “some” hypothesis. The marginal likelihood of a hypothesis is its average likelihood over all the possible values of (say) its mean that are encompassed by the associated prior probability function. Because weakly positive results are inconsistent with all the stronger forms of “some”, the marginal likelihood of the unbridled “some” hypothesis is low. The null places all its chips on a single value, 0, so the “average” for this hypothesis is simply the likelihood at that value, and, as already noted, if the data are weak, then the likelihood that the true effect is 0 is substantial.
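
The comparison in the preceding paragraph can be sketched numerically. The sketch below rests on assumptions that are not in the text: the data are summarized by an observed mean effect `xbar` with a normal sampling distribution of standard error `se`, effects are scaled so that 1.0 means complete cure in every patient, and the unbridled “some” hypothesis spreads its prior uniformly from 0 to 1. The numbers are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def marginal_likelihood_null(xbar, se):
    """The null bets everything on 0, so its average is the likelihood at 0."""
    return normal_pdf(xbar, 0.0, se)

def marginal_likelihood_some(xbar, se, upper):
    """Average likelihood under a uniform prior on effects in [0, upper].
    The integral of the normal likelihood over [0, upper] has a closed form
    in terms of the normal CDF; dividing by upper gives the average."""
    area = normal_cdf(xbar / se) - normal_cdf((xbar - upper) / se)
    return area / upper

# Hypothetical weak positive result: mean effect one standard error above 0.
xbar, se = 0.1, 0.1
null_ml = marginal_likelihood_null(xbar, se)
some_ml = marginal_likelihood_some(xbar, se, upper=1.0)
bayes_factor_null = null_ml / some_ml

print(f"marginal likelihood of the null:   {null_ml:.3f}")
print(f"marginal likelihood of 'some':     {some_ml:.3f}")
print(f"Bayes Factor in favor of the null: {bayes_factor_null:.2f}")
```

Under these assumptions the weak result favors the null, because most of the prior mass of the unbridled “some” hypothesis sits on large effects that the data all but rule out.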

Thus, the Bayesian would argue that when we formulate the inference problem in such a way that there actually is some uncertainty (and hence something to be inferred), the data may very well favor the null hypothesis. The frequentist objects that when we frame the inference problem this way, our inference will depend on the upper limit that we put on what we understand by ‘some,’ and that is true. But there is no reason not to compute the Bayes Factor (the ratio of the marginal likelihoods) as a function of this upper limit. If the Bayes Factor in favor of the null approaches 1 from above as the upper limit goes to 0, that is, as “some effect” becomes indistinguishable from “no effect”, then we can conclude that the inference to the null is to be preferred over ANY (positive!) alternative to it.
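
Computing the Bayes Factor as a function of the upper limit might look like the sketch below. It assumes a normal sampling distribution for the observed mean effect and a uniform prior on [0, upper] for the “some” hypothesis. The function name `bayes_factor_null` and the numbers are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def bayes_factor_null(xbar, se, upper):
    """Odds in favor of the null over 'some effect in [0, upper]'.
    Assumes a normal likelihood and a uniform prior on [0, upper]."""
    null_ml = math.exp(-0.5 * (xbar / se) ** 2) / (se * math.sqrt(2.0 * math.pi))
    some_ml = (normal_cdf(xbar / se) - normal_cdf((xbar - upper) / se)) / upper
    return null_ml / some_ml

# Hypothetical data: a weakly negative observed mean effect.
xbar, se = -0.05, 0.1

# Shrink the upper limit on "some effect" toward 0 and track the odds.
upper_limits = [1.0, 0.5, 0.2, 0.1, 0.05, 0.01, 0.001]
curve = [(u, bayes_factor_null(xbar, se, u)) for u in upper_limits]
for u, bf in curve:
    print(f"upper limit {u:>6}: odds in favor of the null = {bf:.3f}")
```

With these made-up numbers the odds stay above 1 at every upper limit on the grid and fall toward 1 as the limit shrinks, which is the "approaches 1 from above" pattern described in the text.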

When the null is actually true, the data will yield such a function 50% of the time. And, when the pharmacological effect is actually slightly or strongly negative (deleterious), the data will yield such a function even more often. Moreover, this will be true no matter how small the N. Thus, we have a rational basis for favoring one conclusion over the other no matter how little data we have.
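
The frequency claims in the preceding paragraph can be checked with a small simulation. This is a sketch under assumed conditions: normally distributed observations with a known standard deviation of 1, a sample of only N = 5, a uniform prior on [0, upper], and a finite grid of upper limits standing in for the full function. All names and settings are hypothetical.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def bayes_factor_null(xbar, se, upper):
    """Odds in favor of the null over 'some effect in [0, upper]'
    (normal likelihood, uniform prior; fine for the moderate values used here)."""
    null_ml = math.exp(-0.5 * (xbar / se) ** 2) / (se * math.sqrt(2.0 * math.pi))
    some_ml = (normal_cdf(xbar / se) - normal_cdf((xbar - upper) / se)) / upper
    return null_ml / some_ml

def fraction_favoring_null(true_effect, n, sims, rng, sd=1.0):
    """Fraction of simulated experiments in which the odds favor the null
    at every upper limit on the grid (0.001, 0.002, ..., 2.048)."""
    upper_limits = [0.001 * 2 ** k for k in range(12)]
    se = sd / math.sqrt(n)
    hits = 0
    for _ in range(sims):
        xbar = sum(rng.gauss(true_effect, sd) for _ in range(n)) / n
        if all(bayes_factor_null(xbar, se, u) > 1.0 for u in upper_limits):
            hits += 1
    return hits / sims

rng = random.Random(2012)  # fixed seed so the run is repeatable
frac_null = fraction_favoring_null(0.0, n=5, sims=2000, rng=rng)
frac_neg = fraction_favoring_null(-0.5, n=5, sims=2000, rng=rng)
print(f"true effect  0.0: odds favor the null throughout in {frac_null:.1%} of runs")
print(f"true effect -0.5: odds favor the null throughout in {frac_neg:.1%} of runs")
```

Under these assumptions, the odds favor the null at every upper limit essentially whenever the observed mean falls at or below 0. That happens about half the time when the true effect is 0, and more often when the true effect is negative.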

When, by chance, the function relating the odds in favor of the null to the upper limit on “some positive effect” dips slightly below 1 for some non-zero assumption about the upper limit, it will not go very far below 1, that is, the “some effect” hypothesis cannot attain high relative likelihood when there is in fact no effect or when the effect is weak and we have little data. Therefore, if we insist that we want some reasonable odds (say 10:1) in favor of “some” (positive) effect before we put the drug on the market, we will more often than not conclude that there is no effect or none worth considering. And that is what we should conclude. Not because it is necessarily true–nothing is certain but death and taxes–but because that is what is consistent with the data we have and the principle that a drug should not be marketed unless the data we have make us reasonably confident that it will do good.
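
How far the odds can swing toward “some effect” can also be sketched. The code below assumes a normal likelihood and a uniform prior on [0, upper], works in standard-error units (z is the observed mean divided by its standard error), and searches a grid of upper limits for the one most favorable to “some effect”. The function name and grid settings are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def best_odds_for_some(z, steps=2000, max_u=20.0):
    """Largest Bayes Factor in favor of 'some effect' over a grid of upper
    limits u (in standard-error units), for an observed mean z standard
    errors above 0. Assumes a normal likelihood and a uniform prior on [0, u]."""
    null_ml = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    best = 0.0
    for i in range(1, steps + 1):
        u = max_u * i / steps
        some_ml = (normal_cdf(z) - normal_cdf(z - u)) / u
        best = max(best, some_ml / null_ml)
    return best

# Best-case odds for "some effect" at increasingly strong observed results.
results = {z: best_odds_for_some(z) for z in (0.5, 1.0, 1.5, 2.0, 2.5)}
for z, odds in results.items():
    print(f"z = {z}: best-case odds in favor of 'some effect' = {odds:.2f}")
```

In this sketch, a result one standard error above 0 yields best-case odds below 2:1, and a result two standard errors above 0 yields roughly 5:1. This holds even when the upper limit is chosen after the fact to flatter the “some effect” hypothesis. Reaching 10:1 requires a stronger result, which chance alone rarely produces when there is no effect.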