For classifying $Y \,|\, X$, conditional probabilistic models estimate the posterior $p_{k}(x):=\mathbb{P}[Y=k \,|\, X=x]$ and derive the Bayes rule from it -- under 0-1 loss, this simply means choosing the class $k$ that maximizes the posterior.
On one hand, methods like logistic regression model the posterior directly.
In contrast, **generative models** are an alternative approach that models the data as if it were generated by:
- (1) determining the class with the **priors** $\pi_{k}:=\mathbb{P}[Y=k]$, and
- (2) drawing the predictors from **likelihoods** $f_{k}(x)=f_{X|Y=k}(x)$.
It then computes the posterior using **Bayes' theorem**: $p_{k}(x)=\mathbb{P}[Y=k\,|\, X=x]=\frac{\pi_{k}\cdot f_{k}(x)}{\sum_{i}\pi_{i}f_{i}(x)}.$ Therefore, *generative models estimate the posterior indirectly by estimating $\pi_{k}$ and $f_{k}$.*
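For concreteness, here is a minimal sketch of this recipe in Python, assuming Gaussian class-conditional likelihoods (an illustrative choice, not required by the framework); the function names and interface are made up for this example:

```python
# Minimal generative classifier: priors from class frequencies,
# Gaussian class-conditional likelihoods, posteriors via Bayes' theorem.
import numpy as np
from scipy.stats import multivariate_normal

def fit_generative(X, y):
    """Estimate priors pi_k and Gaussian likelihoods f_k from labeled data."""
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}                  # pi_k
    likes = {k: multivariate_normal(X[y == k].mean(axis=0),         # f_k
                                    np.cov(X[y == k], rowvar=False))
             for k in classes}
    return priors, likes

def posterior(x, priors, likes):
    """p_k(x) = pi_k f_k(x) / sum_i pi_i f_i(x)."""
    joint = {k: priors[k] * likes[k].pdf(x) for k in priors}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

def predict(x, priors, likes):
    """Bayes rule under 0-1 loss: the class maximizing the posterior."""
    post = posterior(x, priors, likes)
    return max(post, key=post.get)
```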
Advantages of this approach include:
- If there is good reason to assume a particular likelihood family, that knowledge is easy to incorporate and can reduce both bias and variance.
- The approach generalizes naturally to more than two classes, whereas methods like logistic regression require extensions (e.g. multinomial regression).
Disadvantages include:
- They are more rigid, so bad choices of priors and/or likelihoods can compromise the fit.
### Missing Data
Because they model the full joint distribution of $(X, Y)$, generative models can deal with missing data through marginal distributions: if (WLOG) the first entry $x_{1}$ is missing from $\mathbf{x}$, the same computation goes through with the marginal density $f_{k,-1}$ of $\mathbf{x}_{-1}$ (i.e. the remaining entries): $p_{k}(\mathbf{x}_{-1})=\mathbb{P}[Y=k\,|\, X_{-1}=\mathbf{x}_{-1}]=\frac{\pi_{k}\cdot f_{k,-1}(\mathbf{x}_{-1})}{\sum_{c}\pi_{c}f_{c,-1}(\mathbf{x}_{-1})}.$
In practice, *the integrals defining these marginal distributions are difficult to compute in general*. However, in special cases like [[naive Bayes]], the computation is trivial due to the independence assumptions: the marginal is obtained by simply dropping the missing predictors' factors from the product.
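A minimal sketch of this for Gaussian [[naive Bayes]] (assuming per-class priors, means, and standard deviations have already been fitted; the names below are hypothetical): because $\hat{f}_{k}$ factorizes across predictors, marginalizing out the missing entries amounts to dropping their factors.

```python
# Posterior with missing predictors in Gaussian naive Bayes: entries of x that
# are np.nan are marginalized out by dropping their factors from the product.
import numpy as np
from scipy.stats import norm

def nb_posterior(x, priors, means, stds):
    """p_k given the observed entries of x; priors[k], means[k], stds[k] are fitted per class."""
    observed = ~np.isnan(x)
    joint = {}
    for k in priors:
        lik = np.prod(norm.pdf(x[observed], means[k][observed], stds[k][observed]))
        joint[k] = priors[k] * lik          # pi_k * f_{k,-missing}(x_observed)
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}
```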
### Imputing Predictors
Suppose an observation $\mathbf{x}=(x_{1},\dots,x_{p})$ with label $y=k$ is missing its $j$th entry $x_{j}$, and that we have already fitted the priors $\hat{\pmb{\pi}}$ and densities $\hat{f}_{k}$ on other data.
Marginalizing out the other $p-1$ variables gives the distribution of the $j$th: $g_{j\,|\, y = k}(z)=\int_{ \mathbb{R}^{p-1}} \hat{f}_{k}(x_{1},\dots,\underset{j\text{th}}{z},\dots,x_{p}) \, d\mathbf{x}_{-j},$ where the integration is over the $p-1$ predictors that are not missing. We can then impute the missing value with the mode of this estimated density: $x_{j}^{\mathrm{(imputed)}}:= \underset{z}{\arg\max}\, g_{j\,|\,y=k}(z).$
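As a sketch, if the likelihoods are assumed Gaussian, $\hat{f}_{k}=N(\hat{\mu}_{k},\hat{\Sigma}_{k})$, then the marginal $g_{j\,|\,y=k}$ is the univariate normal $N(\hat{\mu}_{k,j},\hat{\Sigma}_{k,jj})$, whose mode is its mean; the imputation is then just the class mean of the $j$th predictor (names below are hypothetical):

```python
# Impute a missing x_j with the mode of its class-conditional marginal g_{j|y=k}.
# Under Gaussian likelihoods this marginal is N(mu_k[j], Sigma_k[j, j]),
# so the arg max is simply mu_k[j].
import numpy as np

def impute_missing(j, k, mu):
    """x_j^(imputed) = arg max_z g_{j|y=k}(z) for Gaussian class-conditionals;
    the variance does not affect the arg max, so only the means are needed."""
    return mu[k][j]   # the mode of a univariate Gaussian is its mean

# toy fitted class means for two classes in p = 2 dimensions
mu = {0: np.array([0.0, 1.0]), 1: np.array([2.0, -1.0])}
print(impute_missing(j=1, k=0, mu=mu))   # -> 1.0
```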
### Prediction with Missing Data
If instead we are interested in classification with $y$ unknown too, then we can "integrate out" the missing $x_{j}$: $\hat{\mathbb{P}}[Y=k \,|\,X_{-j}= \mathbf{x}_{-j}]\propto\hat{\pi}_{k}\cdot \int_{\mathbb{R}} \hat{f}_{k}(x_{1},\dots,\underset{j\text{th}}{z},\dots, x_{p}) \, dz,$ and find the Bayes rule from that; e.g., 0-1 loss gives the posterior mode $\hat{Y}=\underset{k}{\arg\max}\, \hat{\mathbb{P}}[Y=k \,|\, X_{-j}=\mathbf{x}_{-j}].$ (The normalizing constant $\sum_{c}\hat{\pi}_{c}\int_{\mathbb{R}}\hat{f}_{c}\,dz$ is the same for every $k$, so it does not affect the $\arg\max$.)
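Continuing the Gaussian sketch (again, an illustrative assumption), integrating coordinate $j$ out of $N(\hat{\mu}_{k},\hat{\Sigma}_{k})$ leaves a normal distribution over the remaining coordinates with mean $\hat{\mu}_{k,-j}$ and covariance $\hat{\Sigma}_{k,-j,-j}$, so the rule has a closed form (names below are hypothetical):

```python
# Classify with x_j missing by integrating it out of each Gaussian likelihood:
# dropping coordinate j of N(mu_k, Sigma_k) leaves N(mu_k[-j], Sigma_k[-j, -j]).
import numpy as np
from scipy.stats import multivariate_normal

def predict_without_j(x_minus_j, j, priors, mu, Sigma):
    """MAP class given the observed predictors x_{-j}."""
    p = len(x_minus_j) + 1
    keep = [i for i in range(p) if i != j]     # indices of the observed predictors
    scores = {}
    for k in priors:
        marginal = multivariate_normal(mu[k][keep], Sigma[k][np.ix_(keep, keep)])
        scores[k] = priors[k] * marginal.pdf(x_minus_j)   # proportional to the posterior
    return max(scores, key=scores.get)                    # arg max over classes
```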