> [!tldr]
> The EM algorithm approximates the maximum likelihood estimator of a model with latent variables. It iteratively produces an estimate $\hat{\theta}$ of the parameter of a distribution $X \sim \pi(\cdot~|~Z,\theta)$, where $X$ is observed data and $Z$ is latent.
### The Chicken-and-Egg Problem
Suppose we wish to estimate the parameter $\theta$ of a distribution $X \sim \pi(\cdot \,|\,Z,\theta)$, where $Z$ is a **latent variable**. That is, to sample $X$, we first need to randomly draw $Z$, which determines the distribution of $X$.
- For example, $X$ can follow a mixture of distributions $\pi_{1}(\theta),\dots,\pi_{k}(\theta)$, where $Z \sim \mathrm{Unif}\{ 1,\dots,k \}$ determines which component $X$ is drawn from.
The likelihood $L(X;\theta)$ is hard to compute when the latent distribution $Z\sim \tau(\cdot~|~\theta)$ also depends on $\theta$: $L(X;\theta)=\int \pi(X~|~Z, \theta) \cdot \tau(Z~|~\theta) ~ dZ$This creates the chicken-and-egg problem: maximizing $L$ to find $\theta$ requires knowing the distribution of $Z$, which itself depends on $\theta$.
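For instance, with the uniform mixture from the bullet above, the integral reduces to a finite sum over the $k$ components: $L(X;\theta)=\sum_{z=1}^{k}\tau(z~|~\theta)\cdot\pi_{z}(X~|~\theta)=\frac{1}{k}\sum_{z=1}^{k}\pi_{z}(X~|~\theta)$Even in this simple case the log-likelihood involves the log of a sum, which does not split across components; when the mixing distribution itself depends on $\theta$, the coupling with $Z$ makes direct maximization harder still.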
The EM algorithm circumvents the second part of the cycle by iteratively *replacing the unknown $\theta$ with a previous estimate $\hat{\theta}_{j-1}$ to produce a new estimate $\hat{\theta}_{j}$.*
### Expectation and Maximization Steps
At each **expectation (E) step**, the algorithm computes the expected complete-data log-likelihood as a function of the dummy parameter $t$, where the expectation is over the conditional distribution of the latent variable given the data and the previous estimate, $p(Z~|~X,\hat{\theta}_{j-1}) \propto \pi(X~|~Z,\hat{\theta}_{j-1})\cdot \tau(Z~|~\hat{\theta}_{j-1})$: $\begin{align*}
Q(t,\hat{\theta}_{j-1}):&= \mathbb{E}_{Z~|~X,\hat{\theta}_{j-1}}[l(X,Z;t)]\\
&= \int p(Z~|~X,\hat{\theta}_{j-1})\cdot l(X,Z;t) \, dZ
\end{align*}$that is, the average of the complete-data log-likelihoods $l(X,Z;t)$, weighted by how plausible each value of $Z$ is given the observed data under the previous estimate $\hat{\theta}_{j-1}$.
The **maximization (M) step** finds the new estimate $\hat{\theta}_{j}$ that maximizes the expected log-likelihood $Q(t,\hat{\theta}_{j-1})$ over $t$: $\hat{\theta}_{j}:= \underset{t}{\arg\max}~Q(t,\hat{\theta}_{j-1})$
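For the discrete-mixture example above, both steps take a concrete form (a standard specialization, sketched here for observations $X_{1},\dots,X_{n}$): the E step computes posterior class probabilities, or *responsibilities*, and the M step maximizes the responsibility-weighted log-likelihood: $\begin{align*}
r_{iz} :&= p(Z_{i}=z~|~X_{i},\hat{\theta}_{j-1}) = \frac{\tau(z~|~\hat{\theta}_{j-1})\cdot\pi_{z}(X_{i}~|~\hat{\theta}_{j-1})}{\sum_{m=1}^{k}\tau(m~|~\hat{\theta}_{j-1})\cdot\pi_{m}(X_{i}~|~\hat{\theta}_{j-1})}\\
Q(t,\hat{\theta}_{j-1})&= \sum_{i=1}^{n}\sum_{z=1}^{k} r_{iz}\cdot\big(\log\tau(z~|~t)+\log\pi_{z}(X_{i}~|~t)\big)
\end{align*}$where $\pi_{z}(X~|~\theta)$ denotes the density of component $z$.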
In summary, the algorithm is:
> [!algorithm] EM Algorithm
> $[0]$ Initialize with some guess $\hat{\theta}_{0}$.
>
> For $[j=1,\dots]$, repeating until convergence:
> - $[\mathrm{E}]$xpectation: compute the function $Q(t, \hat{\theta}_{j-1})$ as a function of $t$.
> - $[\mathrm{M}]$aximization: find $\hat{\theta}_{j}=\underset{t}{\arg\max}~Q(t,\hat{\theta}_{j-1})$.
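As a hedged Python sketch of this loop (`e_step` and `m_step` are hypothetical model-specific callables supplied by the caller, and the convergence test simply checks that the estimates have stabilized):

```python
import numpy as np

def em(e_step, m_step, theta0, n_iter=100, tol=1e-8):
    """Generic EM driver: e_step builds Q(., theta_prev) as a callable,
    m_step maximizes it. Both are model-specific, caller-supplied."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        Q = e_step(theta)      # [E] expected log-likelihood t -> Q(t, theta)
        new_theta = m_step(Q)  # [M] argmax_t Q(t, theta)
        if np.max(np.abs(new_theta - theta)) < tol:  # stop once estimates stabilize
            return new_theta
        theta = new_theta
    return theta
```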
### Application in Gaussian Mixture Models
Gaussian mixtures can be interpreted as containing latent variables $Z_{k}:= \mathbf{1}_{X \text{ belongs to class }k}$which allows the EM algorithm to iteratively fit the parameters of the component Gaussians: $\theta := (\mu_{1}, \sigma^{2}_{1},\dots,\mu_{K},\sigma_{K}^{2})$
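A minimal NumPy sketch of this fit, assuming 1-D data, a known number of components $K$, and uniform mixing weights as in the earlier example; the function name, initialization, and stopping rule are illustrative choices, not a reference implementation:

```python
import numpy as np

def em_gaussian_mixture(x, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a 1-D mixture of K Gaussians with uniform mixing weights."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)

    # [0] Initialize: K distinct data points as means, pooled variance.
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())

    prev_ll = -np.inf
    for _ in range(n_iter):
        # [E] step: responsibilities r[i, k] = P(Z = k | x[i], previous estimate);
        # with uniform weights this is the normalized component density.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)

        # [M] step: maximizing Q gives responsibility-weighted means and variances.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

        # Monitor the observed-data log-likelihood (under the pre-update
        # parameters) and stop once it stops improving.
        ll = np.log(dens.mean(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return mu, var
```

With data drawn from two well-separated Gaussians (e.g. `np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])` and `K=2`), the returned means should land near the true ones, though EM only converges to a local maximum, so the result depends on the initialization.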
### Reference Resources
- [YouTube video](https://www.youtube.com/watch?v=xy96ArOpntA) by ritvikmath: non-rigorous, intuitive introduction.
- [YouTube video](https://www.youtube.com/watch?v=iQoXFmbXRJA) by Victor Lavrenko: introduction in the context of Gaussian mixture models.