> [!tldr]
> The EM algorithm approximates the maximum likelihood estimator of a model with latent variables. It iteratively produces an estimate $\hat{\theta}$ of the parameter of a distribution $X \sim \pi(\cdot~|~Z,\theta)$, where $X$ is observed data and $Z$ is latent.
### The Chicken-and-Egg Problem
Suppose we wish to estimate the parameter $\theta$ of a distribution $X \sim \pi(\cdot \,|\,Z,\theta)$, where $Z$ is a **latent variable**. That is, to sample $X$, we first need to randomly draw $Z$, which determines the distribution of $X$.
- For example, $X$ can follow a mixture of distributions $\pi_{1}(\theta),\dots,\pi_{k}(\theta)$, where $Z \sim \mathrm{Unif}\{ 1,\dots,k \}$ determines which component $X$ is drawn from.
The likelihood $L(X;\theta)$ is hard to compute when the latent distribution $Z\sim \tau(\cdot~|~\theta)$ also depends on $\theta$: $L(X;\theta)=\int \pi(X~|~Z, \theta) \cdot \tau(Z~|~\theta) ~ dZ$This creates the chicken-and-egg problem: maximizing $L$ to find $\theta$ requires knowing the distribution of $Z$, which itself depends on $\theta$.
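For instance, with the uniform mixture from the bullet above, the integral reduces to a finite sum over the $k$ components: $L(X;\theta)=\sum_{z=1}^{k}\tau(z~|~\theta)\cdot\pi_{z}(X~|~\theta)=\frac{1}{k}\sum_{z=1}^{k}\pi_{z}(X~|~\theta)$Even in this simple case the log-likelihood involves the log of a sum, which does not split across components; when the mixing distribution itself depends on $\theta$, the coupling with $Z$ makes direct maximization harder still.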
The EM algorithm circumvents the second part of the cycle by iteratively *replacing the unknown $\theta$ with a previous estimate $\hat{\theta}_{j-1}$ to produce a new estimate $\hat{\theta}_{j}$.*
### Expectation and Maximization Steps
At each **expectation (E) step**, the algorithm computes the expected complete-data log-likelihood as a function of the dummy parameter $t$, where the expectation is over the conditional distribution of the latent variable given the data and the previous estimate, $p(Z~|~X,\hat{\theta}_{j-1}) \propto \pi(X~|~Z,\hat{\theta}_{j-1})\cdot \tau(Z~|~\hat{\theta}_{j-1})$: $\begin{align*}
Q(t,\hat{\theta}_{j-1}):&= \mathbb{E}_{Z~|~X,\hat{\theta}_{j-1}}[l(X,Z;t)]\\
&= \int p(Z~|~X,\hat{\theta}_{j-1})\cdot l(X,Z;t) \, dZ
\end{align*}$that is, the average of the complete-data log-likelihoods $l(X,Z;t)$, weighted by how plausible each value of $Z$ is given the observed data under the previous estimate $\hat{\theta}_{j-1}$.
The **maximization (M) step** finds the new estimate $\hat{\theta}_{j}$ that maximizes the expected log-likelihood $Q(t,\hat{\theta}_{j-1})$ over $t$: $\hat{\theta}_{j}:= \underset{t}{\arg\max}~Q(t,\hat{\theta}_{j-1})$
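For the discrete-mixture example above, both steps take a concrete form (a standard specialization, sketched here for observations $X_{1},\dots,X_{n}$): the E step computes posterior class probabilities, or *responsibilities*, and the M step maximizes the responsibility-weighted log-likelihood: $\begin{align*}
r_{iz} :&= p(Z_{i}=z~|~X_{i},\hat{\theta}_{j-1}) = \frac{\tau(z~|~\hat{\theta}_{j-1})\cdot\pi_{z}(X_{i}~|~\hat{\theta}_{j-1})}{\sum_{m=1}^{k}\tau(m~|~\hat{\theta}_{j-1})\cdot\pi_{m}(X_{i}~|~\hat{\theta}_{j-1})}\\
Q(t,\hat{\theta}_{j-1})&= \sum_{i=1}^{n}\sum_{z=1}^{k} r_{iz}\cdot\big(\log\tau(z~|~t)+\log\pi_{z}(X_{i}~|~t)\big)
\end{align*}$where $\pi_{z}(X~|~\theta)$ denotes the density of component $z$.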
In summary, the algorithm is:
> [!algorithm] EM Algorithm
> $[0]$ Initialize with some guess $\hat{\theta}_{0}$.
>
> For $[j=1,\dots]$, repeating until convergence:
> - $[\mathrm{E}]$xpectation: compute the function $Q(t, \hat{\theta}_{j-1})$ as a function of $t$.
> - $[\mathrm{M}]$aximization: find $\hat{\theta}_{j}=\underset{t}{\arg\max}~Q(t,\hat{\theta}_{j-1})$.
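As a hedged Python sketch of this loop (`e_step` and `m_step` are hypothetical model-specific callables supplied by the caller, and the convergence test simply checks that the estimates have stabilized):

```python
import numpy as np

def em(e_step, m_step, theta0, n_iter=100, tol=1e-8):
    """Generic EM driver: e_step builds Q(., theta_prev) as a callable,
    m_step maximizes it. Both are model-specific, caller-supplied."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        Q = e_step(theta)      # [E] expected log-likelihood t -> Q(t, theta)
        new_theta = m_step(Q)  # [M] argmax_t Q(t, theta)
        if np.max(np.abs(new_theta - theta)) < tol:  # stop once estimates stabilize
            return new_theta
        theta = new_theta
    return theta
```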
### Application in Gaussian Mixture Models
Gaussian mixtures can be interpreted as containing latent variables $Z_{k}:= \mathbf{1}_{X \text{ belongs to class }k}$which allows the EM algorithm to iteratively fit the parameters of the component Gaussians: $\theta := (\mu_{1}, \sigma^{2}_{1},\dots,\mu_{K},\sigma_{K}^{2})$
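A minimal NumPy sketch of this fit, assuming 1-D data, a known number of components $K$, and uniform mixing weights as in the earlier example; the function name, initialization, and stopping rule are illustrative choices, not a reference implementation:

```python
import numpy as np

def em_gaussian_mixture(x, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a 1-D mixture of K Gaussians with uniform mixing weights."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)

    # [0] Initialize: K distinct data points as means, pooled variance.
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())

    prev_ll = -np.inf
    for _ in range(n_iter):
        # [E] step: responsibilities r[i, k] = P(Z = k | x[i], previous estimate);
        # with uniform weights this is the normalized component density.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)

        # [M] step: maximizing Q gives responsibility-weighted means and variances.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

        # Monitor the observed-data log-likelihood (under the pre-update
        # parameters) and stop once it stops improving.
        ll = np.log(dens.mean(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return mu, var
```

With data drawn from two well-separated Gaussians (e.g. `np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])` and `K=2`), the returned means should land near the true ones, though EM only converges to a local maximum, so the result depends on the initialization.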
### Reference Resources
- [YouTube video](https://www.youtube.com/watch?v=xy96ArOpntA) by ritvikmath: non-rigorous, intuitive introduction.
- [YouTube video](https://www.youtube.com/watch?v=iQoXFmbXRJA) by Victor Lavrenko: introduction in the context of Gaussian mixture models.