If we have a collection of models indexed by $m \in \mathcal{M}$, each being true with prior probability $\pi(m)$ and having a prior $\pi(\theta ~|~ m)$ and likelihood $p(y ~|~ \theta, m)$, then we can compute the joint posterior as
$\pi(m,\theta ~|~ y) \propto p(y ~|~ \theta, m)\cdot \pi(\theta ~|~ m)\cdot \pi(m),$
with normalisation constant $p(y):= \sum_{m\in \mathcal{M}}\int p(y~|~\theta ,m)\, \pi(\theta ~|~ m)~ d\theta$, the marginal likelihood.
One option is to perform model selection via MAP, i.e. $m^{\ast} := \underset{m}{\arg\max}~\pi(m ~|~ y)$; then, pretending $m^{\ast}$ is the true model, we perform the usual Bayesian inference of $\theta$ with posterior $\pi(\theta ~|~ m^{\ast},y) \propto\pi(\theta ~|~ m^{\ast})\cdot p(y ~|~ \theta,m^{\ast}).$
However, this **select-then-infer** procedure has several flaws:
- The selection is discrete, hence discontinuous: when two models perform similarly, a slight change in the data can lead us to ditch one model for the other.
- It fails to capture the inherent uncertainty in the model selection, since any inference based on $\pi(\theta ~|~ m^{\ast}, y)$ ignores the uncertainty of the selection step.
- Suppose we want to estimate $\mathbb{E}[h(\theta)]$: the select-then-infer estimator is $\mathbb{E}[h(\theta)~|~y,m^{\ast}]$, which is in general not equal to the posterior mean $\mathbb{E}[h(\theta) ~|~ y]$ (see the sketch below).
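To make this concrete, below is a minimal numerical sketch of the select-then-infer procedure on two hypothetical conjugate models for a normal mean: a point-null model fixing $\theta=0$ and a model with prior $\theta \sim N(0,\tau^{2})$, both with known noise level $\sigma$. Everything here (the data, $\sigma$, $\tau$, the uniform model prior) is an illustrative assumption, not part of the setup above.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 2.0                    # known noise sd and prior sd (illustrative)
y = rng.normal(loc=0.3, scale=sigma, size=20)
n = len(y)

# marginal likelihoods p(y | m): m=0 fixes theta = 0, m=1 puts theta ~ N(0, tau^2)
log_py = np.array([
    multivariate_normal.logpdf(y, mean=np.zeros(n), cov=sigma**2 * np.eye(n)),
    multivariate_normal.logpdf(y, mean=np.zeros(n),
                               cov=sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))),
])

# model posterior pi(m | y) with uniform prior pi(m) = 1/2
log_post = log_py + np.log(0.5)
post_m = np.exp(log_post - log_post.max())
post_m /= post_m.sum()
m_star = int(np.argmax(post_m))          # MAP model

# select-then-infer: posterior mean of theta computed under the chosen model only
prec = n / sigma**2 + 1 / tau**2         # conjugate posterior precision under m=1
e_theta_m1 = (y.sum() / sigma**2) / prec
e_theta_map = 0.0 if m_star == 0 else e_theta_m1

print(post_m, m_star, e_theta_map)
```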
Computing the actual posterior mean requires us to not discard the additional information:
> [!definition|*] Model Averaging
> The **model-averaged posterior** is given by $\pi(\theta ~|~ y):= \sum_{m \in \mathcal{M}}\pi(\theta, m ~|~ y),$ defined for $\theta \in \cup_{m}\Omega_{m}$, where $\Omega_{m}$ is the parameter space of model $m$; if $\theta \not\in \Omega_{m}$, then $\pi(\theta,m~|~y)$ is understood to be $0$.
- Then integrating against this model-averaged posterior gives $\int_{\cup_{m}\Omega_{m}}\pi(\theta ~|~ y)\, h(\theta)~ d\theta=\mathbb{E}[h(\theta) ~|~ y]$, the posterior mean, i.e. the Bayes rule under squared-error loss.
- Since we do not know the normalising constant $p(y)$ in the posterior $\pi(\theta,m~|~y)$, we cannot evaluate this density in general, but can still sample from it.
An alternative form is the weighted average of each model's posterior: $\pi(\theta ~|~ y)=\sum_{m}\underset{\substack{\text{per-model} \\ \text{posterior}}}{\vphantom{\frac{1}{2}}\pi(\theta ~|~ m ,y)}\cdot \underset{\substack{\text{weighted} \\ \text{by proba.}}}{\vphantom{\frac{1}{2}}\pi(m ~|~ y)}.$
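Continuing the two-model sketch above (and reusing its variables `y`, `post_m`, `e_theta_m1`, `e_theta_map`, `prec`), the model-averaged posterior mean is the $\pi(m~|~y)$-weighted combination of per-model posterior means, and sampling from the mixture only needs the weights, never $p(y)$ itself: draw $m \sim \pi(m~|~y)$, then $\theta \sim \pi(\theta ~|~ m, y)$.

```python
# model-averaged posterior mean: E[theta | y] = sum_m pi(m|y) * E[theta | y, m]
e_theta_avg = post_m[0] * 0.0 + post_m[1] * e_theta_m1
print("model-averaged:", e_theta_avg, "select-then-infer:", e_theta_map)

# sample theta | y from the mixture: first the model, then theta within that model
def sample_theta(size):
    ms = rng.choice(2, size=size, p=post_m)                 # m ~ pi(m | y)
    draws = rng.normal(e_theta_m1, prec**-0.5, size=size)   # theta | m=1, y
    return np.where(ms == 1, draws, 0.0)                    # under m=0, theta is fixed at 0

print(sample_theta(5))
```

The two-stage draw is exactly the weighted-average form above: each sample lands in model $m$ with probability $\pi(m~|~y)$ and is then drawn from that model's posterior.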
### Example: Spike-and-Slab Priors
Consider the regression $Y ~|~ \beta_{m}\sim N(X^{T}_{m}\beta_{m}, \sigma^{2}I),$ where each model is a variable selection $m \subseteq [p]$, with $\beta_{m}$, $X_{m}$ denoting the coefficients and design restricted to $m$. What is the model-averaged prior/posterior of the coefficients?
One difficulty is that $\beta_{m}$ varies in dimension across models, making it hard to parameterise and sample. Instead, define $\beta=z\ast \theta$ (elementwise product) with $z \in \{ 0,1 \}^{p}$ turning on/off the latent values $\theta \in \mathbb{R}^{p}$, so we have the equivalent model $Y ~|~ z,\theta \sim N(X^{T}\beta, \sigma^{2}I)$; for simplicity, we assume $z$ and $\theta$ are independent a priori (so $m$ only controls the inclusion of coefficients, not their values).
Since $z$ controls variable selection, it is in bijection with models via $z \longleftrightarrow m:= \{ j \in [p]~|~z_{j} \ne 0 \},$ i.e. $m$ indexes the subset of variables selected for regression, and we can convert between their densities via $\pi(m)=\pi(z)$ for corresponding $m$ and $z$.
Then we have the model-averaged prior $\begin{align*}
\mathbb{P}[\beta_{j} \le b]&= \sum_{m}\pi(m)\cdot \mathbb{P}[\beta_{j} \le b ~|~ m]\\
&= \sum_{m:\, j \in m} \pi(m) \cdot \mathbb{P}[\theta_{j} \le b ~|~ m] + \mathbf{1}\{ b \ge0 \}\cdot\sum_{m:\, j \not \in m} \pi(m)\\
&= \mathbb{P}[\theta_{j} \le b]\cdot \pi(z_{j}=1) + \mathbf{1}\{ b \ge 0\} \cdot \pi(z_{j}=0),
\end{align*}$ where the last step uses the prior independence of $\theta$ and $z$. This is equivalent to the prior density $\pi(\beta_{j})=p_{1}\cdot\pi(\theta_{j}=\beta_{j}) + p_{0}\cdot\delta_{0}(\beta_{j}),$ i.e. a mixture of the **spike** $\delta_{0}$ and the **slab** $\pi(\theta_{j})$ with weights $p_{i}:=\pi(z_{j}=i)$.
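As a quick sanity check, here is a simulation of this mixture for a single coefficient, assuming (purely for illustration) a standard normal slab $\pi(\theta_{j})=N(0,1)$ and inclusion probability $p_{1}=0.3$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p1 = 0.3  # illustrative inclusion probability pi(z_j = 1)

# draw beta_j = z_j * theta_j from the prior: N(0,1) slab with prob p1, spike at 0 otherwise
z = rng.random(100_000) < p1
theta = rng.normal(size=100_000)
beta = np.where(z, theta, 0.0)

# check the mixture CDF: P[beta_j <= b] = p1 * Phi(b) + (1 - p1) * 1{b >= 0}
b = 0.5
print(np.mean(beta <= b), p1 * norm.cdf(b) + (1 - p1) * float(b >= 0))
```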
Similarly, the model-averaged posterior of $\beta_{j}$ follows from the joint posterior of $(\theta,z)$: $\begin{align*}
\pi(\theta,z~|~ y) &\propto \pi(\theta)\cdot \pi(z)\cdot p(y ~|~ \theta,z),\\
\pi(\beta_{j}~|~y)&= \pi(\theta_{j}=\beta _{j}, z_{j}=1 ~|~y) + \pi(z_{j}=0~|~y)\cdot \delta_{0}(\beta_{j}).
\end{align*}$
We can sample $\beta_{j} ~|~ y$ by sampling $\theta,z ~|~ y$, then computing $\beta_{j}=\theta_{j}z_{j}$.
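As a sketch of this last step, here is a minimal Gibbs sampler for $\theta, z ~|~ y$ that jointly updates each pair $(z_{j}, \theta_{j})$, integrating $\theta_{j}$ out of the $z_{j}$ update so the chain does not get stuck when a coefficient is switched off. The concrete prior choices ($\theta_{j} \sim N(0,\tau^{2})$ i.i.d., $z_{j} \sim \mathrm{Bern}(p_{1})$ i.i.d.), the known $\sigma$, and the synthetic data are all illustrative assumptions; the design matrix is stored as $n \times p$, so `X @ beta` plays the role of $X^{T}\beta$ above.

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic data (illustrative): only the first two of p = 5 coefficients are active
n, p, sigma, tau, p1 = 100, 5, 1.0, 2.0, 0.5
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + sigma * rng.normal(size=n)

theta, z = np.zeros(p), np.ones(p, dtype=int)
samples = []
for it in range(2000):
    for j in range(p):
        r = y - X @ (z * theta) + X[:, j] * (z[j] * theta[j])  # residual without coefficient j
        a = X[:, j] @ X[:, j] / sigma**2 + 1 / tau**2           # posterior precision of theta_j
        b = X[:, j] @ r / sigma**2
        # odds of z_j = 1 vs z_j = 0, with theta_j marginalised out
        log_odds = np.log(p1 / (1 - p1)) - np.log(tau) - 0.5 * np.log(a) + b**2 / (2 * a)
        z[j] = int(rng.random() < 1 / (1 + np.exp(-log_odds)))
        # theta_j | rest: conjugate posterior if included, prior otherwise
        theta[j] = rng.normal(b / a, a**-0.5) if z[j] == 1 else rng.normal(0.0, tau)
    samples.append(z * theta)                                   # beta = z * theta

beta_draws = np.array(samples[500:])                            # discard burn-in
print(beta_draws.mean(axis=0))           # model-averaged posterior mean of beta
print((beta_draws != 0).mean(axis=0))    # posterior inclusion probabilities pi(z_j = 1 | y)
```

The column means of `beta_draws` approximate the model-averaged posterior mean of each $\beta_{j}$, and the inclusion frequencies approximate $\pi(z_{j}=1 ~|~ y)$.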