> [!tldr] Kullback–Leibler Divergence
> The KL divergence measures the "difference" between two distributions $P,Q$ as $\begin{align*}
> \mathrm{KL}(P \|Q)&:= \mathbb{E}_{P}[\log P(X) / Q(X)]\\
> &= \mathbb{E}_{X \sim P}[\log P(X) - \log Q(X)]\\
> &= \int P(x)\log\left( \frac{P(x)}{Q(x)} \right) ~ dx,\\
> \end{align*}$ i.e. the expected extra "surprise" incurred by believing the distribution is $Q$ when in fact it is $P$.
>
> Equivalently, it is $\mathrm{KL}(P\|Q)= -\mathbb{E}_{P}[\log Q(X)] - (-\mathbb{E}_{P}[\log P(X)])=H(P,Q)-H(P),$
> where $H(P,Q)$ is the **cross entropy** and $H(P)$ is the **entropy**.
- Note that, by Jensen's inequality (since $-\log$ is convex), $\mathrm{KL}(P\|Q)= \mathbb{E}_{P}\left[ -\log \frac{Q}{P} \right]\ge -\log \mathbb{E}_{P}\left[ \frac{Q}{P} \right]= -\log 1 = 0$, so it is always non-negative; as a corollary, $H(P, Q) \ge H(P)$, with equality iff $P=Q$ (almost everywhere).
- Also, $\mathrm{KL}(P\|Q) \ne \mathrm{KL}(Q\|P)$ in general.
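The properties above are easy to check numerically. Below is a minimal sketch (assuming NumPy) using two hand-picked discrete distributions; the specific probabilities are illustrative only.

```python
# Minimal numerical check of KL(P||Q) >= 0, asymmetry, and KL = H(P,Q) - H(P)
# for two small hand-picked discrete distributions (illustrative values).
import numpy as np

def kl(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    """H(P) = -E_P[log P(X)]."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """H(P, Q) = -E_P[log Q(X)]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]

print(kl(P, Q))                          # non-negative
print(kl(Q, P))                          # differs from kl(P, Q): not symmetric
print(cross_entropy(P, Q) - entropy(P))  # equals kl(P, Q)
print(kl(P, P))                          # 0 when P == Q
```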
When $Q$ is a model fitted to approximate $P$, minimising $\mathrm{KL}(P\|Q)$ over $Q$ is equivalent to minimising the cross-entropy term, since $H(P)$ is a constant; so the cross entropy can serve as a [[Loss Functions|loss function]]. Suppose we have a dataset $\mathbf{y}$ sampled from the true distribution $P$. Approximating $P$ by the empirical distribution $\hat{P}:= \frac{1}{n}\sum_{i} \delta_{y_{i}}$, the loss of a model $Q$ is
$\mathbb{E}_{\hat{P}}[-\log Q(Y)]= -\frac{1}{n}\sum_{i} \log Q(y_{i}).$
- For prediction tasks that also have inputs $\mathbf{X}$, the above carries over to conditional distributions $P(Y ~|~ X)$, etc.; in this case, the minimiser is the [[Saturated and Null Models|saturated model]]. A small numerical sketch of the empirical loss follows.
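The sketch below scores two candidate models by the empirical cross entropy $-\frac{1}{n}\sum_{i} \log Q(y_{i})$; it assumes NumPy/SciPy, and the simulated data and Gaussian candidates are purely illustrative.

```python
# Approximate P by the empirical distribution of a sample y, then score a
# candidate model Q by -(1/n) * sum_i log Q(y_i).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=500)   # sample from the "true" P

def empirical_cross_entropy(y, q_logpdf):
    """E_{P_hat}[-log Q(Y)] = -(1/n) * sum_i log Q(y_i)."""
    return float(-np.mean(q_logpdf(y)))

# Two candidate models Q: the better-matched one attains the smaller loss.
loss_good = empirical_cross_entropy(y, lambda x: norm.logpdf(x, loc=1.0, scale=2.0))
loss_bad  = empirical_cross_entropy(y, lambda x: norm.logpdf(x, loc=0.0, scale=1.0))
print(loss_good, loss_bad)  # loss_good < loss_bad
```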
The special case where $y$ is binary $0 / 1$ gives the **binary cross-entropy loss**, i.e. the Bernoulli **negative log-likelihood**. For [[Generalized Linear Models#^84f6f6|logistic regression]] with linear predictor $\eta$ and log-odds link, this simplifies to
$\begin{align*}
L(\eta,y)&= -y\log \left( \frac{e^{\eta}}{1+e^{\eta}} \right)-(1-y)\log\left( \frac{1}{1+e^{\eta}} \right)\\
&= -y\eta + \log(1+e^{\eta}).
\end{align*}$
So for large $\eta$ (confident positives), $\log(1+e^{\eta}) \approx \eta$ and $L(\eta,y) \to (1-y)\eta$, a linear loss; similarly, for very negative $\eta$ (confident negatives), $\log(1+e^{\eta}) \approx 0$ and $L(\eta, y) \to -y\eta$.
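As a sketch (assuming NumPy; the function name is my own), the simplified loss can be evaluated stably via the softplus identity $\log(1+e^{\eta}) = \max(\eta, 0) + \log(1+e^{-|\eta|})$, and the linear asymptotes above show up directly:

```python
# L(eta, y) = -y*eta + log(1 + e^eta), with softplus written as
# max(eta, 0) + log1p(exp(-|eta|)) to avoid overflow for large |eta|.
import numpy as np

def logistic_loss(eta, y):
    eta, y = np.asarray(eta, dtype=float), np.asarray(y, dtype=float)
    softplus = np.maximum(eta, 0.0) + np.log1p(np.exp(-np.abs(eta)))
    return -y * eta + softplus

# Confident positives (eta >> 0): loss ~ (1 - y)*eta; confident negatives: loss ~ -y*eta.
print(logistic_loss(np.array([30.0, 30.0, -30.0, -30.0]),
                    np.array([1.0, 0.0, 1.0, 0.0])))  # ~[0, 30, 30, 0]
```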
Comparison of using the **forward** $\mathrm{KL}(P~\|~Q)$ and **reverse** $\mathrm{KL}(Q~\|~P)$ as loss functions when fitting $Q$ to a fixed target $P$:
![[ForwardReverseKL.png#invert]]
- Suppose $P$ has a peak/mode at $x$; then if $Q(x)$ is tiny, the forward KL incurs a huge penalty $\log(P(x) / Q(x))$ that is not diminished, since it is weighted by the large $P(x)$. So optimising the forward KL forces: $P$ has a peak at $x$ $\Rightarrow Q$ places some mass at $x$, leading to **mode-covering** behaviour. In the tails, $Q$ needs a tail at least as heavy as $P$'s so that $P(x)\log (P(x) / Q(x))= O(P(x))$; a thinner tail makes the log ratio grow and incurs a large divergence when integrated.
- When fitting with the reverse KL, it is fine for $Q(x)$ to be tiny where $P(x)$ has a peak, since the penalty is diminished by the weight $Q(x)$; on the other hand, where $Q(x)$ has a peak, $P$ should also have a peak, so that $\log(Q(x) / P(x))$ is not too big; so $Q$ peak $\Rightarrow P$ peak. Moreover, wherever $Q$ still places non-negligible mass (e.g. in its tails), $P(x)$ must not be much smaller, e.g. $P$ should have a tail at least as heavy as $Q$'s; otherwise $Q(x) \log (Q(x) / P(x))$ grows relative to $Q(x)$ and the divergence blows up when integrated. This means that $Q$ concentrates its density in a few peaks, and those peaks are also peaks of $P$, i.e. **mode-seeking** behaviour (see the sketch below).
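A rough numerical illustration of the two behaviours, assuming NumPy/SciPy; the bimodal target $P$, the single-Gaussian family for $Q$, and the crude grid search are all illustrative choices, not a canonical recipe.

```python
# Fit a single Gaussian Q to a bimodal mixture P by minimising either
# KL(P||Q) (forward) or KL(Q||P) (reverse), with both integrals approximated
# on a fixed grid and the (mu, sd) parameters chosen by grid search.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, loc=-3, scale=0.7) + 0.5 * norm.pdf(x, loc=3, scale=0.7)

def forward_kl(q):   # KL(P || Q) ~ sum p * log(p / q) * dx
    return np.sum(p * (np.log(p + 1e-300) - np.log(q + 1e-300))) * dx

def reverse_kl(q):   # KL(Q || P) ~ sum q * log(q / p) * dx
    return np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dx

candidates = [(mu, sd) for mu in np.linspace(-5, 5, 41)
                       for sd in np.linspace(0.3, 5, 48)]
best_fwd = min(candidates, key=lambda t: forward_kl(norm.pdf(x, *t)))
best_rev = min(candidates, key=lambda t: reverse_kl(norm.pdf(x, *t)))
print("forward KL fit (mode-covering):", best_fwd)  # wide Q straddling both modes
print("reverse KL fit (mode-seeking): ", best_rev)  # narrow Q locked onto one mode
```

The forward-KL fit comes out wide, roughly matching the overall mean and variance of $P$, while the reverse-KL fit locks onto one of the two modes with a narrow scale.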