How to Inflate your Bayes Factor with Nothing

or how a partition isn't actually necessary

In #religion

I was recently directed to some work by Lydia McGrew, as I was responding to her and Than Christopoulos's 9-part response to Paulogia's video featuring me, in which we were responding to a blog post by Than Christopoulos. (Catch all that? 😂)

One argument in particular had me puzzled for days.

McGrew's Argument

https://lydiaswebpage.blogspot.com/2025/12/the-resurrection-independence-and.html

It's surprising how many people don't know this: When considering evidence for some event, such as the resurrection or the Battle of Gettysburg or the moon landing or anything else, and comparing how well that event explains the evidence and how well its negation explains the evidence, you must not confine yourself to thinking only of alternatives that have some hope of explaining the evidence.

Rather, in order to think rightly about how strong the evidence is for the event, you need to compare the explanatory power of the salient hypothesis H (that is, the hypothesis toward which the individual items prima facie point) to the power of ~H overall. I can't stress this strongly enough. And it's pretty obvious in the case of the Battle of Gettysburg or the moon landing. If man never landed on the moon, the most probable thing that would happen is that we would have no attestations that man did land on the moon! The most probable outcome would be...nothing. No video, no astronaut interviews, no pictures of the American flag, etc. You won't get a good sense of how strong the evidence is for the moon landing by comparing "Man first landed on the moon on July 20, 1969" only to some gerrymandered conspiracy theory that man didn't land on the moon at all and that this, that, and the other pieces of evidence were falsely manufactured. Where M is the moon landing hypothesis, the gerrymandered conspiracy hypothesis is merely a subhypothesis of ~M. Of course that subhypothesis explains at least some of the evidence. That's what it was designed to do. But the evidence is still strong against ~M because the subhypothesis takes up only a very small bit of the probability space given ~M. Since the conspiracy is very improbable given ~M, it doesn't help much to improve the explanatory power of ~M.

You must use a partition--a mutually exclusive and jointly exhaustive set of options. M and ~M are a partition. M and "~M and a conspiracy to make it look like M" do not form a partition. See here for a video in which I point this out. It's astonishing that not only most skeptics but also many Christians don't realize this. If you don't use a partition, you're going to find it virtually impossible to get a true sense of the strength of the evidence.

Walking through the math here, I start with the Bayes Factor,

$$ \frac{P(\text{data}|H)}{P(\text{data}|\neg H)} $$

To try to make this as easy as possible, I will consider \(\neg H\) to be broken up into only two options and assume these two are mutually exclusive and exhaustive (i.e. they form a partition, as McGrew insists). So we have three models:

  • \(H\) - we landed on the moon
  • \(M\) - some conspiracy, where we filmed the landing in Hollywood and broadcast from there
  • \(N\) - nothing happened at all

Since \(M\) and \(N\) both entail \(\neg H\), and we are assuming they are exhaustive, we can expand the denominator of the Bayes Factor using the law of total probability,

$$ P(\text{data}|\neg H) = P(\text{data}|\neg H, M)P(M|\neg H)+P(\text{data}|\neg H, N)P(N|\neg H) $$

We should also have

$$ P(M|\neg H)+P(N|\neg H)=1 $$

We now make the following observations:

  • "nothing happening" is much more likely than a conspiracy: \(P(M|\neg H)\) will be pretty small compared to \(P(N|\neg H)\)
  • "nothing happening" doesn't even attempt to explain the data, so its predictive power is basically zero: \(P(\text{data}|\neg H, N) \approx 0\)
  • The original hypothesis, \(H\), is presumed to fit the data very well: \(P(\text{data}|H)\approx 1\)
  • The conspiracy also fits the data well: \(P(\text{data}|\neg H, M)\approx 1\)

This leads to a top-heavy Bayes Factor:

$$ \begin{align} \frac{P(\text{data}|H)}{P(\text{data}|\neg H)}&=\frac{P(\text{data}|H)}{P(\text{data}|\neg H, M)P(M|\neg H)+P(\text{data}|\neg H, N)P(N|\neg H)}\\ &=\frac{1}{1\cdot (\rm small)+0\cdot (\rm largish)} = \frac{1}{\rm small} \end{align} $$

So it would seem that if you have the hypothesis "\(H=\)I believe the claim in all its details", and the opposite, \(\neg H\), contains a large pool of hypotheses that don't explain anything, then you'll automatically get support for \(H\). We will see this is a bug, not a feature.
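To make this top-heavy Bayes Factor concrete, here's a minimal Python sketch. The function name and the example numbers are mine, chosen purely for illustration; the formula is just the partitioned denominator from above with the "nothing" term contributing zero.

```python
def bayes_factor(p_data_H, p_data_M, p_M_given_notH):
    """Bayes Factor when ~H splits into M (conspiracy) and N (nothing).

    The denominator is P(data|~H) = P(data|M) P(M|~H) + P(data|N) P(N|~H),
    with the "nothing" model assigning zero probability to the data.
    """
    p_data_notH = p_data_M * p_M_given_notH  # + 0 * P(N|~H)
    return p_data_H / p_data_notH

# Both H and the conspiracy fit the data perfectly, but the conspiracy
# only holds 5% of the probability within ~H:
print(bayes_factor(1.0, 1.0, 0.05))  # ≈ 20 -- 1/small, as above
```

Note that nothing about the data changed between the numerator and denominator here; the factor of 20 comes entirely from how small \(P(M|\neg H)\) is.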

A bit more generally, substituting \(P(M|\neg H)=\frac{P(M)}{P(M)+P(N)}\) and \(P(\text{data}|\neg H,M)=P(\text{data}|M)\), and dropping the \(N\) term because \(P(\text{data}|\neg H,N)\approx 0\),

$$ \begin{align} \frac{P(\text{data}|H)}{P(\text{data}|\neg H)}&=\frac{P(\text{data}|H)}{P(\text{data}|M)\cdot\frac{P(M)}{P(M)+P(N)}} \end{align} $$

The effect of the partition is to scale the denominator by \(\frac{P(M)}{P(M)+P(N)}\) -- the small fraction of the \(\neg H\) probability held by the \(M\) model compared to the "nothing" model \(N\). Compare this to McGrew's quote above: "But the evidence is still strong against ~M because the subhypothesis takes up only a very small bit of the probability space given ~M." This is the same observation.

Something seems wrong here, however. E.T. Jaynes argues that equivalent states of knowledge should yield equivalent probability assignments, yet this maneuver feels like adding "nothing" to our state of knowledge while changing the probabilities anyway.

Why I never saw this effect before

I don't like the Bayes Factor, or ratio-versions of Bayes Theorem in particular. I think it encourages binary thinking. In addition, there are always unconsidered models within \(H\) and \(\neg H\) that are hard to see. Many times \(H\) is not well defined, which doesn't help matters either. I prefer to write out Bayes Theorem from the start with all of the models under consideration, but not in the full numerator/denominator form -- that one gets messy quickly, and encourages restricting to a small number of models. Equivalently, one can write out the numerator-only version for each model, then as a second step add all of them up to find the denominator, and finally go back and divide each numerator by that denominator. This multi-step process lets you consider as many models as you want, and lets you compare priors more directly.

Using that approach we'd have:

$$ \begin{align} P(H|\text{data})&\propto P(\text{data}|H)\cdot P(H) \\ P(M|\text{data})&\propto P(\text{data}|M)\cdot P(M) \\ P(N|\text{data})&\propto P(\text{data}|N)\cdot P(N) \end{align} $$
$$ T=P(\text{data}|H)\cdot P(H)+P(\text{data}|M)\cdot P(M)+P(\text{data}|N)\cdot P(N) $$
$$ \begin{align} P(H|\text{data})&= P(\text{data}|H)\cdot P(H)/T \\ P(M|\text{data})&= P(\text{data}|M)\cdot P(M)/T \\ P(N|\text{data})&= P(\text{data}|N)\cdot P(N)/T \\ \end{align} $$

Since \(P(\text{data}|N)\) is near zero, that one will drop out of the sum,

$$ T=P(\text{data}|H)\cdot P(H)+P(\text{data}|M)\cdot P(M) $$
$$ \begin{align} P(H|\text{data})&= \frac{P(\text{data}|H)\cdot P(H)}{P(\text{data}|H)\cdot P(H)+P(\text{data}|M)\cdot P(M)} \\ P(M|\text{data})&= \frac{P(\text{data}|M)\cdot P(M)}{P(\text{data}|H)\cdot P(H)+P(\text{data}|M)\cdot P(M)} \\ P(N|\text{data})&= 0\\ \end{align} $$

Making the posterior ratio,

$$ \begin{align} \frac{P(H|\text{data})}{P(\neg H|\text{data})}&=\frac{P(\text{data}|H)\cdot P(H)}{P(\text{data}|M)\cdot P(M)} \end{align} $$

and since both \(H\) and \(M\) explain the data well, \(P(\text{data}|H)\approx 1\) and \(P(\text{data}|M)\approx 1\), the model comparison reduces to comparing priors \(P(M)\) vs \(P(H)\). Considering models like "nothing happened" doesn't seem to come into play, in contrast to the McGrew approach.
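The multi-step process above can be sketched in a few lines of Python. The dictionary layout is my own, and the prior values here are illustrative choices (the likelihoods follow the assumptions in the text: \(H\) and \(M\) both fit the data, \(N\) doesn't):

```python
# Step 1: unnormalized numerators, one per model under consideration.
likelihoods = {"H": 1.0, "M": 1.0, "N": 0.0}   # P(data | model), per the text
priors      = {"H": 0.01, "M": 0.05, "N": 0.94}  # illustrative priors

numerators = {m: likelihoods[m] * priors[m] for m in priors}

# Step 2: sum the numerators to get the normalization constant T.
T = sum(numerators.values())

# Step 3: divide each numerator by T to get posteriors.
posteriors = {m: numerators[m] / T for m in priors}

# With equal likelihoods for H and M, the posterior ratio is just
# the prior ratio -- the "nothing" model never enters:
print(posteriors["H"] / posteriors["M"])  # ≈ 0.01/0.05 = 0.2
```

Adding more models is just a matter of adding entries to the two dictionaries, which is exactly why I prefer this form.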

The answers have to be the same

Probability theory does not allow equivalent problems to yield different answers. So it would seem these two approaches are answering similar, but somehow different, questions -- or maybe they are equivalent in a way that isn't immediately obvious. Initially I was not sure which, but now I can show the connection.

We start with the posterior ratio, expanding the denominator,

$$ \begin{align} \frac{P(H|\text{data})}{P(\neg H|\text{data})}&=\frac{P(\text{data}|H)}{P(\text{data}|\neg H)}\times \frac{P(H)}{P(\neg H)} \\ &=\frac{P(\text{data}|H)}{P(\text{data}|\neg H,M)P(M|\neg H)+P(\text{data}|\neg H,N)P(N|\neg H)}\times \\ &\frac{P(H)}{P(\neg H|M)P(M)+P(\neg H|N)P(N)} \end{align} $$

Substitute

$$\begin{align} P(M|\neg H) &= \frac{P(M)}{P(M)+P(N)}\\ P(N|\neg H) &= \frac{P(N)}{P(M)+P(N)} \\ \end{align} $$

because \(M\) and \(N\) are mutually exclusive and exhaustive (by definition) and,

$$ \begin{align} P(\text{data}|\neg H,M) &= P(\text{data}|M) \\ P(\text{data}|\neg H,N) &= P(\text{data}|N) \end{align} $$

because knowing \(M\) or \(N\) guarantees that \(\neg H\) is also true, so their conjunction is redundant.

This leads to,

$$ \begin{align} \frac{P(H|\text{data})}{P(\neg H|\text{data})}&=\frac{P(\text{data}|H)}{P(\text{data}|\neg H,M)P(M|\neg H)+P(\text{data}|\neg H,N)P(N|\neg H)}\times\frac{P(H)}{P(M)+P(N)} \\ &=\underbrace{\left[\frac{P(\text{data}|H)}{P(\text{data}|M)\cdot \frac{P(M)}{P(M)+P(N)}+P(\text{data}|N) \cdot \frac{P(N)}{P(M)+P(N)}}\right]}_{\text{likelihood ratio (Bayes Factor)}}\times\underbrace{\left[\frac{P(H)}{P(M)+P(N)}\right]}_{\text{prior}} \end{align} $$

Recall that the "nothing" model doesn't predict the data, \(P(\text{data}|N)=0\), so we have,

$$ \begin{align} \frac{P(H|\text{data})}{P(\neg H|\text{data})}&=\underbrace{\left[\frac{P(\text{data}|H)}{P(\text{data}|M)\cdot \frac{P(M)}{P(M)+P(N)}}\right]}_{\text{likelihood ratio (Bayes Factor)}}\times\underbrace{\left[\frac{P(H)}{P(M)+P(N)}\right]}_{\text{prior}} \end{align} $$

which reduces to,

$$ \begin{align} \frac{P(H|\text{data})}{P(\neg H|\text{data})}&=\frac{P(\text{data}|H)\cdot P(H)}{P(\text{data}|M)\cdot P(M)} \end{align} $$

This last equation matches my solution, and the Bayes Factor in the previous equation matches McGrew's Bayes Factor. We can now see immediately what is happening.

By adding a "nothing" model (a model with \(P(\text{data}|N)=0\)), the denominator of the Bayes Factor gets scaled down by \(\frac{P(M)}{P(M)+P(N)}\), the fraction of the \(\neg H\) probability held by the alternative model \(M\). This makes the Bayes Factor top-heavy without really changing anything about our state of knowledge. This should make us suspicious (which is why I thought something was wrong earlier), because one of the tenets of probability theory is that equivalent states of knowledge should yield equivalent probability assignments. Notice, however, that the denominator of the prior ratio gets scaled up by the same factor, from \(P(M)\) to \(P(M)+P(N)\), which cancels the effect of the "nothing" model, leaving only the comparison between \(H\) and \(M\):

$$ \begin{align} \frac{P(H|\text{data})}{P(\neg H|\text{data})}&=\underbrace{\left[\frac{P(\text{data}|H)}{P(\text{data}|M)}\right]}_{\text{likelihood}}\times\underbrace{\left[\frac{P(H)}{P(M)}\right]}_{\text{prior}} \end{align} $$

So, models that don't explain the data at all will not contribute to our model comparison values at all. All the insistence McGrew gives us for having a partition, then, becomes moot.
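If you'd rather not take my algebra on faith, here's a quick numerical spot-check of the cancellation using pure Python. The variable names are mine, and the priors and likelihoods are drawn at random (so they need not be normalized -- only the ratios matter here); \(P(\text{data}|N)=0\) throughout:

```python
import math
import random

# Spot-check, over many random draws, that the partition's inflation of
# the Bayes Factor is exactly cancelled by the deflation of the prior.
random.seed(0)
for _ in range(1000):
    pH, pM, pN = random.random(), random.random(), random.random()
    dH, dM = random.random(), random.random()  # P(data|H), P(data|M)

    # McGrew-style: Bayes Factor over the partition {M, N}, times prior
    bf = dH / (dM * pM / (pM + pN))
    prior_ratio = pH / (pM + pN)

    # Direct comparison of H against M alone
    direct = (dH * pH) / (dM * pM)

    assert math.isclose(bf * prior_ratio, direct, rel_tol=1e-9)

print("cancellation holds for all 1000 random draws")
```

No matter how much probability the "nothing" model soaks up, the posterior ratio is untouched.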

It also clears up McGrew's observation from a different video that if you have \(H\) and an alternative designed to account for the data, \(M\), then the Bayes Factor, \(P(\text{data}|H)/P(\text{data}|M)\), would be, as McGrew states, "unhelpful and uninformative". Correct! If both models predict the same data equally well, then the argument will have to come down to priors. If you use a partition which includes irrelevant hypotheses, as McGrew suggests, then you'll artificially boost the Bayes Factor while at the same time reducing the prior by the same scale. Thus the truly interesting ratio -- not the Bayes Factor but the posterior ratio -- is unchanged. Ignoring the priors gives one a false sense of security for \(H\) in this case.

Exercise for the student

To make this concrete, do both the Bayes Factor and the Posterior ratio calculations for the following values:

  • \(P(N)=0.94\) - the "nothing model" is the most likely thing a-priori
  • \(P(M)=0.05\) - the \(M\) model is 5x more likely a-priori than \(H\)
  • \(P(H)=0.01\)

Note that:

  • \(P(\neg H)=P(M)+P(N)\)
  • \(P(M|\neg H)=P(M)/(P(M)+P(N))\)
  • \(P(N|\neg H)=P(N)/(P(M)+P(N))\)

And predicting the data, assume

  • \(P(\text{data}|H)=1\) - model \(H\) perfectly predicts the data
  • \(P(\text{data}|M)=0.5\) - model \(M\) predicts the data worse
  • \(P(\text{data}|N)=0\) - the "nothing" model doesn't predict the data at all.

If you run through this calculation, you'll find things like,

  • \(P(M|\neg H)=P(M)/(P(M)+P(N))\approx 0.051\)
  • The Bayes Factor, \(P(\text{data}|H)/P(\text{data}|\neg H)\approx 39\), -- really in favor of model \(H\)!
  • The Posterior ratio (which is the correct and interesting quantity to look at), \(P(H|\text{data})/P(\neg H|\text{data})=P(H|\text{data})/P(M|\text{data}) = \frac{1}{2.5}\) -- the combination of 1) \(H\) explaining the data twice as well as \(M\), but 2) \(H\) being 5x less likely a-priori than \(M\).
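If you'd rather let Python do the arithmetic, here's the whole exercise in a few lines (the variable names are mine):

```python
import math

# Values as given in the exercise
pN, pM, pH = 0.94, 0.05, 0.01     # priors; note pH + pM + pN = 1
d_H, d_M, d_N = 1.0, 0.5, 0.0     # likelihoods P(data|·)

p_notH = pM + pN                   # = 0.99
pM_given_notH = pM / p_notH        # ≈ 0.051
p_data_notH = d_M * pM_given_notH + d_N * (pN / p_notH)

bayes_factor = d_H / p_data_notH             # ≈ 39.6 -- "in favor of H"!
posterior_ratio = (d_H * pH) / (d_M * pM)    # = 1/2.5 -- favoring M

print(f"Bayes Factor:    {bayes_factor:.1f}")
print(f"Posterior ratio: {posterior_ratio:.2f}")
```

The two numbers point in opposite directions, which is the whole point of the exercise.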

It seems clear to me now that if you work only with Bayes Factors you are not doing Bayesian reasoning.