> [!abstract]
> Given inputs $\mathbf{X} \in \mathbb{R}^{N\times P}$ and one-hot encoded responses $\mathbf{Y} \in \mathbb{R}^{N \times K}$, if a model predicts probabilities $\hat{\mathbf{Y}}$, its **cross entropy loss** is the negative log-likelihood
> $$l(\mathbf{Y}, \hat{\mathbf{Y}}) := -\sum_{i=1}^{N}\left(\mathbf{Y}_{i} \cdot \log \hat{\mathbf{Y}}_{i}\right),$$
> where $\mathbf{Y}_{i}, \hat{\mathbf{Y}}_{i}$ are the $i$th response and fitted probability vectors, and the logarithm is applied element-wise.
>
> It is the **cross entropy** from $\mathbf{Y}$ to $\hat{\mathbf{Y}}$: the "expected surprise" $-\log \hat{\mathbf{Y}}$ incurred by anticipating $\hat{\mathbf{Y}}$ when the true distribution is $\mathbf{Y}$. This expected surprise is minimized when $\hat{\mathbf{Y}}=\mathbf{Y}$; conversely, we remain highly surprised whenever our anticipation $\hat{\mathbf{Y}}$ is far from the truth. It can therefore be used as a [[Loss Functions|loss function]] in machine learning.

For binary (non-one-hot encoded) responses $\mathbf{y} \in \{ 0,1 \}^{N}$ and fitted probabilities $\hat{\mathbf{y}}$, the cross entropy can also be written as
$$l(\mathbf{y},\hat{\mathbf{y}})=-\sum_{i=1}^{N}\bigl[y_{i}\log\hat{y}_{i}+(1-y_{i})\log(1-\hat{y}_{i})\bigr].$$

For gradient descent methods, the [[Softmax]] activation function, which squishes raw predictions (logits) $\mathbf{o}_{i} \mapsto \exp(\mathbf{o}_{i}) / \| \exp(\mathbf{o}_{i}) \|_{1}$, gives the cross entropy
$$l(\mathbf{Y}, \hat{\mathbf{Y}})=-\sum_{i=1}^{N}\mathbf{Y}_{i}\cdot\log \frac{\exp\mathbf{o}_{i}}{\|\exp \mathbf{o}_{i} \|_{1}}=\sum_{i=1}^{N}\bigl(\log \| \exp \mathbf{o}_{i} \|_{1}-\mathbf{Y}_{i}\cdot \mathbf{o}_{i}\bigr),$$
where the second equality uses that each row $\mathbf{Y}_{i}$ sums to one. It has a particularly neat gradient:
$$\partial_{o_{ik}}l=\mathrm{softmax}(\mathbf{o}_{i})_{k}- Y_{ik}=\text{prediction error}_{ik}.$$
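
This gradient follows by differentiating the per-example term of the log-sum-exp form above (writing $o_{ik'}$ for the components of $\mathbf{o}_{i}$, and noting $\| \exp \mathbf{o}_{i} \|_{1}=\sum_{k'}\exp o_{ik'}$):
$$\partial_{o_{ik}}\left(\log \sum_{k'}\exp o_{ik'} - \sum_{k'} Y_{ik'}\, o_{ik'}\right) = \frac{\exp o_{ik}}{\sum_{k'}\exp o_{ik'}} - Y_{ik} = \mathrm{softmax}(\mathbf{o}_{i})_{k} - Y_{ik}.$$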
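
As a concrete sanity check (not part of the note itself), a minimal NumPy sketch of the two probability-based formulas; the function names are illustrative. It confirms that the one-hot and binary forms agree once binary labels are expanded into two columns.

```python
import numpy as np

def cross_entropy(Y, Y_hat):
    """-sum_i Y_i . log(Y_hat_i) for one-hot rows Y and predicted probability rows Y_hat."""
    return -np.sum(Y * np.log(Y_hat))

def binary_cross_entropy(y, y_hat):
    """Binary form for labels y in {0,1} and fitted probabilities y_hat = P(y=1)."""
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5)          # binary labels
p = rng.uniform(0.05, 0.95, size=5)     # fitted probabilities P(y=1)
Y = np.column_stack([1 - y, y])         # one-hot encoding of y
Y_hat = np.column_stack([1 - p, p])     # per-class probabilities

assert np.isclose(cross_entropy(Y, Y_hat), binary_cross_entropy(y, p))
```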
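
A similar sketch, assuming SciPy's `logsumexp` and `softmax`, evaluates the logit form stably and checks the stated gradient against central finite differences; again, names are only illustrative.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def cross_entropy_from_logits(Y, O):
    """sum_i ( log ||exp o_i||_1 - Y_i . o_i ), computed stably via log-sum-exp."""
    return np.sum(logsumexp(O, axis=1) - np.sum(Y * O, axis=1))

rng = np.random.default_rng(0)
N, K = 4, 3
O = rng.normal(size=(N, K))                   # raw predictions (logits)
Y = np.eye(K)[rng.integers(0, K, size=N)]     # one-hot responses

# Same value as applying softmax first and using the probability form.
P = softmax(O, axis=1)
assert np.isclose(cross_entropy_from_logits(Y, O), -np.sum(Y * np.log(P)))

# Gradient w.r.t. the logits is the prediction error softmax(o_i) - Y_i.
grad = P - Y
eps = 1e-6
num = np.zeros_like(O)
for idx in np.ndindex(O.shape):               # finite-difference check
    E = np.zeros_like(O)
    E[idx] = eps
    num[idx] = (cross_entropy_from_logits(Y, O + E)
                - cross_entropy_from_logits(Y, O - E)) / (2 * eps)
assert np.allclose(grad, num, atol=1e-5)
```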