For classifying $Y \,|\, X$, methods like logistic regression directly model the **posterior** $p_{k}(x):=\mathbb{P}[Y=k \,|\, X=x]$. In contrast, [[Generative Classification Models|generative models]] are an alternative approach that models the data as generated by (1) determining the class with the **priors** $\pi_{k}:=\mathbb{P}[Y=k]$ and (2) drawing the predictors from **likelihoods** $f_{k}(x)=f_{X|Y=k}(x)$. The posterior is then computed using **Bayes' theorem**: $p_{k}(x)=\mathbb{P}[Y=k\,|\, X=x]=\frac{\pi_{k}\cdot f_{k}(x)}{\sum_{i}\pi_{i}f_{i}(x)}.$ Therefore, *generative models estimate the posterior indirectly by estimating $\pi_{k}$ and $f_{k}$.*

## Linear Discriminant Analysis

**Linear discriminant analysis (LDA)** models the likelihoods $f_{k}$ with Gaussians: $f_k(\mathbf{x};\mu_{k},\Sigma)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left( -\frac{1}{2}(\mathbf{x}-\mu_{k})^{T}\Sigma^{-1}(\mathbf{x}-\mu_{k}) \right),$ hence reducing the problem to the *estimation of category-specific means $\mu_{k}$ and a common covariance matrix $\Sigma$.*

- The term $(\mathbf{x}-\mu_{k})^{T}\Sigma^{-1}(\mathbf{x}-\mu_{k})$ is the (squared) **Mahalanobis distance** of $x$ to the $k$th Gaussian component $N(\mu_{k},\mathbf{\Sigma})$.
- The class $k$ with the maximum likelihood $f_{k}(x)$ is just the one whose mean $\mu_{k}$ is closest to $x$ in Mahalanobis distance.

Under 0-1 loss, the Bayes (i.e. optimal) classifier selects the class $k$ with the largest posterior $p_{k}(\mathbf{x})$; this is equivalent to maximizing the class's **log-probability** $\log {\mathbb{P}[Y=k\,|\, X=x]}$, which equals $\log f_{k}(x) + \log \pi_{k}$ up to an additive constant common to all classes. Under the Gaussian assumption, this is further equivalent to maximizing the LDA **discriminant function** $\delta_{k}(x;\pi_{k},\mu_{k},\Sigma) = \left( x^{T} - \frac{1}{2}\mu_{k}^{T} \right)\Sigma^{-1}\mu_{k} +\log(\pi_{k}),$ *which is linear in $x$ and hence gives linear decision boundaries.*

- The quadratic term $x^{T}\Sigma^{-1}x$ is the same for all classes since the covariance matrix is common; thus there is no need to include it in the discriminant function.
- If $\mathbf{X}$ involves feature expansion, $\delta_{k}$ is of course not linear in the original variables: for example, if we only observe $(x_{1},x_{2})$ but feature-expand it to $(x_{1}, x_{2}, x_{1}^{2}, x_{2}^{2}, x_{1}x_{2})$, LDA can produce a quadratic-like decision boundary when plotted in the $x_{1}$-$x_{2}$ plane.

The estimated discriminant $\hat \delta_k$ is obtained by plugging in the parameter estimates $\begin{align*} \hat{\pi}_{k}&= N_{k} / N\\[0.4em] \hat \mu_{k}&= \mathrm{avg}\{ x_{i} \,|\,i: y_{i}=k \}\\[0.4em] \hat\Sigma&= \frac{1}{N-K} \sum_{k}\sum_{i: y_{i}=k}(x_{i}-\hat{\mu}_{k})(x_{i}-\hat{\mu}_{k})^{T} \end{align*}$ where $N_{k}$ is the number of observations in class $k$, and $N$ is the sample size ($\hat{\pi}_{k}$ and $\hat{\mu}_{k}$ are the MLEs; $\hat{\Sigma}$ is the pooled within-class covariance estimate). Therefore, classification with LDA is done by:

> [!algorithm] Linear Discriminant Analysis
> Training: compute the parameter estimates $\hat{\pi},\hat{\mu},\hat{\Sigma}$ as above.
>
> Prediction: given input $\mathbf{x}_{\mathrm{new}}$, the predicted class is $\hat{y}=\underset{k}{\arg\max}~\delta_{k}(\mathbf{x}_{\mathrm{new}}; \hat{\pi}_{k},\hat{\mu}_{k},\hat{\Sigma}).$
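As a concrete sketch of these two steps, here is a minimal NumPy implementation of the training and prediction rules above (the function names `lda_fit`/`lda_predict` and the vectorized layout are illustrative choices, not from the text):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate priors, class means, and the pooled covariance matrix."""
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    pi = np.array([np.mean(y == k) for k in classes])           # pi_k = N_k / N
    mu = np.array([X[y == k].mean(axis=0) for k in classes])    # class centroids
    # Pooled within-class covariance with the 1/(N - K) normalization.
    Sigma = np.zeros((p, p))
    for k, m in zip(classes, mu):
        R = X[y == k] - m
        Sigma += R.T @ R
    Sigma /= (N - K)
    return classes, pi, mu, Sigma

def lda_predict(X, classes, pi, mu, Sigma):
    """Assign each row of X to the class maximizing the linear discriminant."""
    Sigma_inv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)
    linear = np.asarray(X) @ Sigma_inv @ mu.T                    # shape (N, K)
    const = -0.5 * np.sum(mu @ Sigma_inv * mu, axis=1) + np.log(pi)
    return classes[np.argmax(linear + const, axis=1)]
```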
### Quadratic Discriminant Analysis

**Quadratic discriminant analysis (QDA)** is a straightforward generalization of LDA that allows each class to have its own covariance matrix $\Sigma_{k}$, giving the likelihood $f_k(\mathbf{x})=\frac{1}{(2\pi)^{p/2}|\Sigma_{k}|^{1/2}}\exp\left( -\frac{1}{2}(\mathbf{x}-\mu_{k})^{T}\Sigma_{k}^{-1}(\mathbf{x}-\mu_{k}) \right),$ so the discriminant function (again derived from the log-probability) is $\delta_{k}(\mathbf{x}) = -\frac{1}{2}\log|\Sigma_{k}| -\frac{1}{2}(\mathbf{x}-\mu_{k})^{T}\Sigma_{k}^{-1}(\mathbf{x}-\mu_{k}) +\log(\pi_{k}),$ which is quadratic in $\mathbf{x}$ and gives curved (quadratic) decision boundaries. Note that the $-\frac{1}{2}\log|\Sigma_{k}|$ term no longer cancels across classes, since the covariance matrices differ.

### Variance and Regularization

Compared to LDA, QDA is more flexible, accepting higher variance in exchange for lower bias.

- QDA contains $(K-1)\left( \frac{p(p+3)}{2} + 1 \right)$ parameters, whereas LDA only has $(K-1)(p+1)$ parameters.
- LDA with an expanded feature space (e.g. with quadratic terms) generally gives results similar to QDA.

A regularized covariance matrix gives a continuum between LDA and QDA: $\hat{\mathbf{\Sigma}}_{k}(\alpha) = \alpha \hat{\mathbf{\Sigma}}_{k} + (1-\alpha)\hat{\mathbf{\Sigma}},$ where $\hat{\mathbf{\Sigma}}_{k}$ is the QDA covariance matrix, and $\hat{\mathbf{\Sigma}}$ the LDA one. The LDA covariance matrix can in turn be continuously shrunk towards that of i.i.d. predictors: $\hat{\mathbf{\Sigma}}^ {\text{reg}}(\gamma)=\gamma\hat{\mathbf{\Sigma}}+ (1-\gamma)\hat{\sigma}^{2}\mathbf{I}_{p \times p}.$

- Besides regularization, this also improves the numerical stability of inverting $\hat{\Sigma}^\mathrm{reg}$, since adding (a multiple of) $\mathbf{I}_{p \times p}$ bounds the eigenvalues, and hence the determinant, away from zero.

## LDA in Sphered Space

> [!bigidea] Sphering in LDA
> By sphering/rescaling the data by pre-applying $\mathbf{\Sigma}^{-1/2}$, comparing Gaussian likelihoods $f_{k}(x)$ reduces to *comparing Euclidean distances between the transformed data $x^{\ast}$ and centroids $\mu_{k}^{\ast}$.*
>
> This allows application of dimension-reducing properties (orthogonality) and algorithms ([[Principal Component Analysis|PCA]]) in the transformed space.

The multivariate Gaussian densities in LDA compute the (squared) Mahalanobis distance $(x-\mu_{k})^{T}\Sigma^{-1}(x-\mu_{k})$, essentially **sphering** *(normalizing) the data with respect to the variances in each predictor.* Instead of doing this in each density $f_{k}$, the sphering can be done beforehand by mapping $\begin{align*} \mathbf{X}\mapsto \mathbf{X}^{*} :&= \mathbf{X}\mathbf{\Sigma}^{-1/2}\\ \mu_{k}\mapsto \mu_{k}^{*}:&= \mathbf{\Sigma}^{-1/2}\mu_{k}, \end{align*}$ where $\mathbf{\Sigma}$ is the (positive definite) covariance matrix. Then the Gaussian likelihoods become $\begin{align*} f_k(\mathbf{x};\mu_{k},\Sigma)&= \text{const.} \times\exp\left( -\frac{1}{2}(\mathbf{x}-\mu_{k})^{T}\Sigma^{-1}(\mathbf{x}-\mu_{k}) \right)\\ &= \mathrm{const.} \times \exp\left( -\frac{1}{2}\| \mathbf{x}^{*}-\mu_{k}^{*} \|_{2}^{2} \right), \end{align*}$ and *the Mahalanobis distances become the Euclidean distances $\| x^{\ast}-\mu_{k}^{\ast} \|_{2}^{2}$*.

> [!idea] LDA in the transformed space
> Hence in the transformed space, **maximizing the likelihood is just selecting the centroid** $\mu_{k}^{*}$ closest in Euclidean distance, and LDA only adds the effect of the priors $\pi_{k}$ on top of it.
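The identity above is easy to check numerically. The following sketch (with an arbitrarily chosen covariance matrix, centroid, and query point) verifies that the squared Mahalanobis distance equals the squared Euclidean distance after sphering with $\Sigma^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy covariance matrix, centroid, and query point (all illustrative).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu_k = np.array([1.0, -1.0])
x = rng.normal(size=2)

# Sphering matrix Sigma^{-1/2} via the eigendecomposition of Sigma.
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

# Squared Mahalanobis distance in the original space ...
d_mahalanobis = (x - mu_k) @ np.linalg.inv(Sigma) @ (x - mu_k)

# ... equals the squared Euclidean distance after sphering x and mu_k.
x_star, mu_star = Sigma_inv_sqrt @ x, Sigma_inv_sqrt @ mu_k
d_euclidean = np.sum((x_star - mu_star) ** 2)

print(np.isclose(d_mahalanobis, d_euclidean))  # True
```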
### Orthogonality and Reduced Rank LDA

> [!bigidea] Projection onto a subspace
> In LDA, only up to $K-1$ dimensions are relevant.

Consider the $(K-1)$-dimensional subspace $H^{\ast}_{K-1}$ containing the $K$ transformed centroids $\mu_{k}^{\ast}$. After sphering, *the component of $\mathbf{x}^{{\ast}}$ orthogonal to $H_{K-1}^{\ast}$ contributes the same amount to the Euclidean distance $\| \mathbf{x}^{\ast}-\mu_{k}^{\ast} \|_{2}^{2}$ regardless of $k$.* Therefore, that orthogonal component can be removed with no effect on the classification result by projecting the data $\mathbf{X}^{\ast}$ onto $H_{K-1}^{\ast}$. This gives *a dimension reduction of $p \to K-1$ with no loss of information* for classification purposes.

### PCA for Further Dimension Reduction

For further dimension reduction, we wish to *find projections that maximize the distance between centroids* -- the further apart they are, the better the classes are separated. This calls for classic tools like the SVD and [[Principal Component Analysis|PCA]]. However, this "distance" is again subject to the covariance $\Sigma$, so the classic tools, which were developed in the standard $(\mathbb{R}^{p}, l_{2})$ space, cannot be used directly. For a 2-dimensional example:

![[LDASeparateClasses.png#invert|w80|center]]

- Before the transformation, maximizing the Euclidean distance between centroids (left panel) does not separate the distributions well, *as separation is measured on the Mahalanobis scale*. Therefore, the optimization is that of the **between-class variance $\mathbf{B}$** relative to the **within-class variance $\Sigma$**: $\underset{a:\| a \| =1}{\arg\max}~ \frac{a^{T}\hat{\mathbf{B}}a}{a^{T}\hat{\Sigma}a}.$
- However, *after sphering, the separation can be measured in Euclidean distance* (right panel), so the new equidistant plane is the optimal decision boundary, and the optimization above reduces to $\underset{u^{\ast}:\| u^{\ast} \|=1 }{\arg\max}~ u^{{\ast}T}\hat{\mathbf{B}}^{\ast} u^{\ast},$ where $u^{\ast}\propto\hat{\Sigma}^{1 / 2}a$ is the corresponding projection direction in the transformed space.

Therefore, this reduces to the PCA problem in the transformed space, i.e. *selecting directions $u^{\ast}$ that maximize the distance/variance between classes*, measured by the (empirical) **between-class variance** $\begin{align*} \hat{\mathbf{B}} &:= \frac{1}{K}\sum_{k = 1}^{K}\hat{\pi}_{k}(\hat{\mu}_{k}-\hat{\mu})(\hat{\mu}_{k}-\hat{\mu})^{T},\\ \hat{\mathbf{B}}^{\ast} &:= \frac{1}{K}\sum_{k = 1}^{K}\hat{\pi}_{k}(\hat{\mu}^{\ast}_{k}-\hat{\mu}^{\ast})(\hat{\mu}^{\ast}_{k}-\hat{\mu}^{\ast})^{T} = \hat{\Sigma}^{-\frac{1}{2}} \hat{\mathbf{B}}\hat{\Sigma}^{-\frac{1}{2}}, \end{align*}$ where $\hat{\mathbf{B}}$ is in the original Mahalanobis scale, and $\hat{\mathbf{B}}^{\ast}$ is in the transformed space using Euclidean distances.

Doing the usual eigen-decomposition $\hat{\mathbf{B}}^{\ast}=\mathbf{U}^{\ast}\mathbf{D}\mathbf{U}^{{\ast}T}$ gives the principal components in the transformed space, also known as the **canonical/discriminant variables**.

- The projection directions are $u^{\ast}\propto \hat{\Sigma}^{1 / 2}a$, which are the columns of $\mathbf{U}^{\ast}$.
- The projections of the sphered data, $\mathbf{X}^{\ast} \mapsto \mathbf{Z}:= \mathbf{X}^{\ast}\mathbf{U}^{\ast}$, are called the **discriminant coordinates**.
- Note that since the centroids are contained in $H^{\ast}_{K-1}$, the diagonal matrix $\mathbf{D}$ contains at least $p-K+1$ zero eigenvalues.
- *Therefore, the orthogonal projection can be done directly through the PCA projection*, by keeping only the components corresponding to non-zero eigenvalues.

Let $\mathbf{Z}$ now denote the projection of $\mathbf{X}^{\ast}$ onto those non-zero components; finally, we may select the first $L < K$ projections (columns) for further dimension reduction, yielding $\mathbf{Z}_{1:L}$. Schematically, the PCA dimension reduction is: $\begin{align*} {\mathbf{X}} &\xrightarrow[\hspace{1.2in}]{\text{sphering projection}} \hat{\Sigma}^{-1 / 2} \mathbf{X}=:\mathbf{X}^{\ast} \\ & \xrightarrow[\hspace{1.2in}]{\text{PCA projection}}\mathbf{X}^{\ast}\mathbf{U}^{\ast}=: \mathbf{Z} \\ &\xrightarrow[\hspace{1.2in}]{\text{select subset}}\mathbf{Z}_{1:L}\\ & \xrightarrow[\hspace{1.2in}]{\text{transform back}} \hat{\Sigma}^{1/2}\mathbf{Z}_{1:L}. \end{align*}$
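A NumPy sketch of this reduced-rank pipeline, following the definitions above; the helper name `discriminant_coordinates` and the use of the prior-weighted centroid mean for $\hat{\mu}$ are my own choices:

```python
import numpy as np

def discriminant_coordinates(X, y, L=None):
    """Project the data onto the leading discriminant (canonical) coordinates.

    Follows the schematic above: sphere with Sigma^{-1/2}, then rotate with the
    eigenvectors of the between-class covariance B* of the sphered centroids.
    """
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    pi = np.array([np.mean(y == k) for k in classes])
    mu = np.array([X[y == k].mean(axis=0) for k in classes])

    # Pooled within-class covariance and its inverse square root.
    Sigma = sum((X[y == k] - m).T @ (X[y == k] - m)
                for k, m in zip(classes, mu)) / (N - K)
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Between-class covariance of the sphered centroids (B*).
    mu_star = mu @ Sigma_inv_sqrt
    mu_bar = pi @ mu_star
    B_star = (pi[:, None] * (mu_star - mu_bar)).T @ (mu_star - mu_bar) / K

    # Eigen-decomposition of B*; at most K - 1 eigenvalues are non-zero.
    evals, U_star = np.linalg.eigh(B_star)
    order = np.argsort(evals)[::-1]          # sort by decreasing eigenvalue
    L = (K - 1) if L is None else L
    U_star = U_star[:, order[:L]]

    # Discriminant coordinates Z = X* U*.
    return (X @ Sigma_inv_sqrt) @ U_star
```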
## Generalization on the LDA

LDA suffers from a few major issues:

- The one-Gaussian-per-class assumption can be far from correct, for example when the same class is bimodal (e.g. due to sex, when the variables include weight and height).
- The linear decision boundary can be too rigid in general -- basis expansions and QDA can tackle that issue.
- LDA suffers from the curse of dimensionality, since it does no variable selection and its parameter estimates degrade quickly as $p$ grows.

Recasting LDA as a linear regression problem and using more flexible, non-parametric regression methods (instead of the linear model) produces **flexible discriminant analysis (FDA)**. To tackle the curse of dimensionality, and/or to regularize basis-expansion methods like FDA, adding a penalty that encourages coefficient smoothness gives **penalized discriminant analysis (PDA)**. Allowing for multiple centroids in each class gives **mixture discriminant analysis (MDA)**.

### Flexible Discriminant Analysis

Doing LDA with its principal components $v_{l}$ has an equivalent formulation, where we try to find a "labelling function" $\theta:\{ 1,\dots,K \} \to \mathbb{R}$ with zero mean and unit variance that can be optimally modelled with some linear $\beta^{T}X$: $\min_{\beta,\theta}\sum_{i=1}^N (\theta(y_{i}) - \beta^{T}X_{i})^{2},$ and in general, we can repeat this for functions $\theta_{1},\dots,\theta_{L}$, assumed to be normalized and mutually orthogonal, by fitting $\beta_{1},\dots ,\beta_{L}$ that minimize the average squared residual $\text{ASR}:=\frac{1}{N}\sum_{i=1}^{N}\left[ \sum_{l=1}^{L} (\theta_{l}(y_{i}) - \beta_{l}^{T}X_{i})^{2}\right].$ Then *the set of coefficients $\beta_{1},\dots,\beta_{L}$ equals the principal components $v_{l}$ up to constants.* Moreover, the squared Mahalanobis distance between $x$ and class $k$ is given by $\hat{d}_{k}(x)=(x-\mu_{k})^{T}\mathbf{B}\mathbf{W}\mathbf{B}^{T}(x-\mu_{k}) + C(x),$ where $\mathbf{B}=(\beta_{1},\dots,\beta_{L}) \in \mathbb{R}^{p \times L}$ and $C(x)$ is constant across classes; $\mathbf{W}$ is the diagonal weight matrix $\mathbf{W}=\mathrm{diag}\left(r^{-2}_{l}(1-r_{l}^{2})^{-1}\right)_{l=1}^{L}$, with $r^{2}_{l}$ the mean squared residual of the $l$th fit.

*Hence LDA is equivalent to fitting labelling functions $\theta_{l}$ with linear predictors $\beta_{l}^{T}X$ and comparing the (weighted) magnitudes of $\mathbf{B}^{T}(x-\mu_{k})$.*
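For the two-class case, this equivalence is easy to verify numerically: regressing a centered, unit-variance class score $\theta(y)$ on $X$ recovers the LDA direction $\hat{\Sigma}^{-1}(\hat{\mu}_{2}-\hat{\mu}_{1})$ up to a scalar. A sketch with simulated data (the data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class data (illustrative only).
N1, N2, p = 120, 80, 3
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(N1, p)),
               rng.normal(loc=1.5, scale=1.0, size=(N2, p))])
y = np.array([0] * N1 + [1] * N2)

# Labelling function theta: zero mean, unit variance over the sample.
theta = np.where(y == 1, 1.0, 0.0)
theta = (theta - theta.mean()) / theta.std()

# Least-squares fit of theta(y) on X (intercept handled by centering X).
Xc = X - X.mean(axis=0)
beta = np.linalg.lstsq(Xc, theta, rcond=None)[0]

# LDA direction: pooled covariance inverse times the centroid difference.
mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
Sigma = ((X[y == 0] - mu0).T @ (X[y == 0] - mu0)
         + (X[y == 1] - mu1).T @ (X[y == 1] - mu1)) / (len(y) - 2)
lda_dir = np.linalg.solve(Sigma, mu1 - mu0)

# The two directions coincide up to a scalar (|cosine similarity| ~ 1).
cos = beta @ lda_dir / (np.linalg.norm(beta) * np.linalg.norm(lda_dir))
print(round(abs(cos), 6))  # ~1.0
```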
**Flexible discriminant analysis** then generalizes this process by replacing the linear fits $\beta_{l}^{T}X$ with non-linear fits $\eta_{l}(X)$ such as smoothing splines or MARS, minimizing the penalized loss $\min_{\{ \theta_{l},\eta_{l} \}_{l=1}^{L}} \frac{1}{N}\sum_{l=1}^{L}\Bigg[ \underbrace{\sum_{i=1}^{N}(\theta_{l}(y_{i}) - \eta_{l}(X_{i}))^{2}}_{\text{loss}}+\underbrace{\lambda J(\eta_{l})}_{\text{penalty}} \Bigg],$ where $J$ is an appropriate penalty (depending on the method used to fit $\eta_{l}$), and $\lambda>0$ is a tuning parameter.

- For basis-expansion methods like these, FDA is equivalent to conducting LDA on the expanded feature set, as illustrated in the sketch below. For example, using quadratic fits gives the same result as including squares and cross-products in the LDA.
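Illustrating the last point with scikit-learn (assuming it is available): LDA fitted on a degree-2 polynomial expansion plays the role of FDA with quadratic fits, producing a boundary that is quadratic in the original $x_{1}$-$x_{2}$ plane. The dataset and the accuracy comparison are purely illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two-class data with a curved class boundary (illustrative).
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

# Plain LDA: linear boundary in (x1, x2).
lda = LinearDiscriminantAnalysis().fit(X, y)

# "FDA with quadratic fits" realized as LDA on the expanded features
# (x1, x2, x1^2, x1*x2, x2^2): the boundary is quadratic in the original plane.
fda_quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearDiscriminantAnalysis(),
).fit(X, y)

print("LDA accuracy:          ", lda.score(X, y))
print("LDA on quadratic terms:", fda_quadratic.score(X, y))
```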