Naive Bayes is a [[Empirical Risk Minimization vs Plug In|conditional plug-in]] classifier, meaning that it models the conditional probabilities $\mathbb{P}[Y=k ~|~ X=x]$ and uses them to approximate the Bayes classifier.
In particular, it models the distribution of $X ~|~\{ Y=k \}$ as $g_{k}$ with prior $\pi_{k}=\mathbb{P}[Y=k]$, uses Bayes' rule to compute $\mathbb{P}[Y=k ~|~X=x]=\frac{\pi_{k}g_{k}(x)}{\sum_{c=1}^{K}\pi_{c}g_{c}(x)},$and then plugs this into the Bayes rule.
Usually, we use the 0-1 loss, so the classifier is $\hat{y}(x)=\underset{k}{\arg\max}~\mathbb{P}[Y=k ~|~ X=x].$Of course, since the denominator is the same for every $k$, this reduces to $\hat{y}(x)=\underset{k}{\arg\max}~ \pi_{k}g_{k}(x).$
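A minimal sketch of this plug-in rule, assuming the priors $\pi_{k}$ and per-class densities $g_{k}$ are already available (the 1D Gaussian densities below are purely illustrative, not part of the note):

```python
# Plug-in rule: argmax_k pi_k * g_k(x); the shared denominator cancels.
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])               # pi_k, assumed known here
densities = [norm(loc=0.0, scale=1.0).pdf,  # g_1 (illustrative)
             norm(loc=2.0, scale=1.5).pdf]  # g_2 (illustrative)

def classify(x):
    # Unnormalized posteriors pi_k * g_k(x).
    scores = np.array([p * g(x) for p, g in zip(priors, densities)])
    return int(np.argmax(scores))

print(classify(0.3), classify(2.5))         # prints 0 1
```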
### Naive Bayes Estimations
Estimating the prior probabilities $\pi_{k}$ is straightforward: usually we take their MLE $\hat{\pi}_{k}=N_{k} / N$, the proportion of training points in class $k$.
In order to model $g_{k}$, Naive Bayes makes the crucial assumption that *conditional on $\{ Y=k \}$, the features $X=(X_{1},\dots,X_{p})$ are independent*, so if each feature has distribution $X_{j}~|~\{ Y=k \} \sim g_{kj}$, the joint distribution factorizes into $g_{k}(x)=\prod_{j=1}^{p}g_{kj}(x_{j}).$
Now each marginal-conditional distribution $g_{kj}$ is a lot easier to model:
- For continuous features, we can use 1D parametrized models like Gaussians (essentially fitting a multivariate Gaussian but requiring its covariance matrix to be diagonal).
- For discrete features taking values $x_{j} \in \{ 1,\dots,C\}$, simply estimate it with the multinomial proportions $\mathbb{P}[X_{j}=c ~|~ Y=k]=\frac{1}{N_{k}} \sum_{i:y_{i}=k}\mathbf{1}({x_{ij}=c}),$i.e. the proportion of the class $Y=k$ that has $X_{j}=c$.
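A rough sketch of this estimation step under the independence assumption: priors via $N_{k}/N$, a 1D Gaussian per (class, continuous feature), and category proportions per (class, discrete feature). The function name, argument names, and 0-indexed category coding are my own choices, not part of the note.

```python
import numpy as np

def fit_naive_bayes(X_cont, X_disc, y, n_classes, n_categories):
    """X_cont: continuous features, X_disc: integer categories in {0,...,C-1}."""
    priors, gauss, disc = [], [], []
    for k in range(n_classes):
        rows = (y == k)
        priors.append(rows.mean())                          # pi_k = N_k / N
        # Gaussian g_kj for each continuous feature j: mean and variance.
        gauss.append((X_cont[rows].mean(axis=0), X_cont[rows].var(axis=0)))
        # Multinomial g_kj for each discrete feature j: category proportions.
        disc.append([np.bincount(X_disc[rows, j], minlength=n_categories) / rows.sum()
                     for j in range(X_disc.shape[1])])
    return np.array(priors), gauss, disc
```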
### Missing Training Data
Suppose there is a missing entry (WLOG the first one) in the training data $\mathbf{x}_{i}=(?,x_{i{2}},\dots,x_{ip})$. For some classifiers this makes the observation useless during training, but naive Bayes can still use $\mathbf{x}_{i}$ when fitting the distributions of $X_{2},\dots,X_{p} ~|~Y$.
That is, when fitting $g_{kj}$, we can simply use the sub-dataset $\{ i: y_{i}=k,\ x_{ij} \text{ not missing} \},$which is a lot more lenient than requiring $\{ i: y_{i}=k,\ x_{ij} \text{ not missing for all } j \}$ like other algorithms.
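A sketch of this NaN-tolerant fitting for Gaussian marginals, where each $g_{kj}$ uses only the rows of class $k$ whose $j$-th entry is observed (names are illustrative):

```python
import numpy as np

def fit_gaussian_marginals(X, y, n_classes):
    """Per-(class, feature) means and variances, with missing entries as NaN."""
    params = {}
    for k in range(n_classes):
        Xk = X[y == k]
        # nanmean/nanvar drop missing entries feature by feature, so a row with
        # one missing value still contributes to all of its observed features.
        params[k] = (np.nanmean(Xk, axis=0), np.nanvar(Xk, axis=0))
    return params
```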
### Missing Prediction Data
Another situation is when a new input $\mathbf{x}_{\mathrm{new}}=(?,x_{2},\dots,x_{p})$ has a missing (WLOG the first) entry. In this case, naive Bayes handles the situation by using the marginal density of $X_{-1}~|~Y=(X_{2},\dots,X_{p})~|~Y$, which is just $g_{k}^{(-1)}(\mathbf{x}_{-1})=\prod_{j=2}^{p}g_{kj}(x_{j}),$where $\mathbf{x}_{-1}$ is $\mathbf{x}_{\mathrm{new}}$ without the missing entry. That is, *if an entry is missing, we just classify using the non-missing entries and their densities*.
Any generative method can theoretically use this marginal density, but *only naive Bayes obtains it at no extra cost*: thanks to the factorization, integrating out the missing coordinate amounts to dropping the factor $g_{k1}$, whereas other generative models would have to compute the integral $\int g_{k}(x)\,dx_{1}$ explicitly.
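A sketch of prediction with missing entries, assuming Gaussian marginals fitted as in the earlier sketch (`params[k]` holding per-feature means and variances is my own convention): the product over features simply skips any $g_{kj}$ whose $x_{j}$ is NaN, which is exactly the marginal density above.

```python
import numpy as np
from scipy.stats import norm

def predict(x_new, priors, params):
    observed = ~np.isnan(x_new)                  # which coordinates are present
    scores = []
    for k, pi_k in enumerate(priors):
        mu, var = params[k]
        # Sum of log densities over observed coordinates only.
        log_g = norm.logpdf(x_new[observed], mu[observed], np.sqrt(var[observed])).sum()
        scores.append(np.log(pi_k) + log_g)
    return int(np.argmax(scores))
```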