> [!info] This material corresponds to chapter 4 in the ESL.
Linear classification methods are methods whose decision boundary is linear in the inputs:
- The boundary can be modeled directly as a hyperplane (e.g. the perceptron, SVM).
- It can also come from a linear discriminant function $\delta_{k}(x)$, or from the posterior probability $\mathbb{P}[Y=k\,|\, X=x]$ itself (modeled through a monotone transformation that is linear in $x$, as in logistic regression).
## Glossary of Methods Covered Elsewhere
[[Linear Discriminant Analysis and Generalizations]] is a generative model (using Gaussians) that reduces to linear decision boundaries (linear either in the original inputs or in their quadratic terms).
**Logistic regression** is just a special case of [[Generalized Linear Models|GLMs]] modeling a binary (Bernoulli) response, i.e. modeling the log-odds with a linear predictor:
$\beta^{T}X=\eta =\log\left( \frac{\mathbb{P}[Y=1 ~|~ X]}{1-\mathbb{P}[Y=1 ~|~X]} \right).$Equivalently, solving for the probability gives $\mathbb{P}[Y=1~|~X]=\frac{e^{\eta}}{1+e^{\eta}}$, which is the linear predictor squashed by the **logistic (sigmoid)** function, the inverse of the logit link.
- It generalizes to the multi-class setting by modeling the log-odds of each category $k=1,\dots,K-1$ against a reference category $K$ (the resulting class probabilities are derived below): $\beta_{k}^{T}X=\eta_{k}=\log\left( \frac{\mathbb{P}[Y=k\,|\,X]}{\mathbb{P}[Y=K\,|\,X]} \right),\,k=1,\dots,K-1.$
It then results in the linear decision boundaries $\begin{align*}
&\{ \mathbf{x}~|~\beta^{T}\mathbf{x}=0 \} &\text{for binary responses}; \\
&\{ \mathbf{x} ~|~ (\beta_{k_{1}}-\beta_{k_{2}})^{T}\mathbf{x}=0 \} & \text{for multiple classes}.
\end{align*}$
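Solving the $K-1$ multi-class log-odds equations for the class probabilities (using that they sum to one) gives the softmax-with-reference-class form $\mathbb{P}[Y=k\,|\,X]=\frac{e^{\eta_{k}}}{1+\sum_{l=1}^{K-1}e^{\eta_{l}}}$ for $k=1,\dots,K-1$, and $\mathbb{P}[Y=K\,|\,X]=\frac{1}{1+\sum_{l=1}^{K-1}e^{\eta_{l}}}$, which reduces to the binary case above when $K=2$.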
- Logistic regression estimates the coefficients by maximum likelihood, computed with Newton–Raphson iterations (equivalently, iteratively reweighted least squares); a minimal numerical sketch follows.
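As an illustration of the Newton–Raphson fit for the binary case, here is a minimal numpy sketch (the toy data, tolerance, and iteration cap are made up for this example; in practice one would call a library routine):
```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Binary logistic regression MLE via Newton-Raphson.

    X: (N, p) predictor matrix (an intercept column is added here).
    y: (N,) array of 0/1 responses.
    """
    Xd = np.column_stack([np.ones(len(y)), X])    # prepend the intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xd @ beta)))    # current fitted probabilities
        W = p * (1.0 - p)                         # diagonal of the weight matrix W
        grad = Xd.T @ (y - p)                     # score: gradient of the log-likelihood
        hess = Xd.T @ (Xd * W[:, None])           # X^T W X (negative Hessian)
        step = np.linalg.solve(hess, grad)        # Newton direction
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# toy (non-separable) data, made up for the example
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(float)
print(fit_logistic_newton(X, y))                  # [beta_0, beta_1, beta_2]
```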
### Logistic Regression as a Linear Model
Consider a logistic regression on a binary response $y_{i}=0 \text{ or }1$, with the predicted probability $\hat{p}_{i}=\hat{\mathbb{P}}[Y=1\,|\,X=x_{i}]$.
The MLE coefficients $\hat{\beta}$ given by logistic regression are a fixed point of the weighted least squares problem with working response and weights $\begin{align*}
z_{i}&= \hat{\beta}^{T}x_{i}+\frac{y_{i}-\hat{p}_{i}}{\hat{p}_{i}(1-\hat{p}_{i})},\\
w_{i} &= \hat{p}_{i}(1-\hat{p}_{i}),
\end{align*}$i.e. regressing $z_{i}$ on $x_{i}$ with weights $w_{i}$ returns $\hat{\beta}$ itself (the Newton-step derivation at the end of this subsection makes the connection explicit).
- In this model, the weighted $\mathrm{RSS}$ is the Pearson chi-square statistic: $\mathrm{wRSS}=\sum_{i}w_{i}(z_{i}-\hat{z}_{i})^{2}=\sum_{i} \frac{(y_{i}-\hat{p}_{i})^{2}}{\hat{p}_{i}(1-\hat{p}_{i})}$
- The weighted least squares view also gives the asymptotic behavior of $\hat{\beta}$ (assuming the model is correct):
	- By consistency of the MLE, $\hat{\beta} \to \beta$ in probability.
	- By a central limit theorem, $\hat{\beta}$ is asymptotically distributed as $N(\beta,(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1})$, where $\mathbf{W}$ is the diagonal matrix $\mathrm{diag}(w_{1},\dots,w_{N})$.
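To see where the working response comes from, write a single Newton–Raphson step in matrix form (with $\mathbf{p}$ the vector of current fitted probabilities): $\begin{align*}
\beta^{\text{new}}&=\beta^{\text{old}}+(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}(\mathbf{y}-\mathbf{p})\\
&=(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\mathbf{z}, \quad \mathbf{z}=\mathbf{X}\beta^{\text{old}}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}),
\end{align*}$which is exactly a weighted least squares regression of $\mathbf{z}$ on $\mathbf{X}$ with weights $\mathbf{W}$; at convergence $\beta^{\text{new}}=\beta^{\text{old}}=\hat{\beta}$, giving the fixed-point characterization above.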
### Regularized Logistic Regression
- The regular logistic regression MLE maximizes the binary-outcome log-likelihood $l(\mathbf{X},\mathbf{y};\beta_{0},\beta)=\sum_{i=1}^{N}[y_{i}(\beta_{0}+\beta^{T}x_{i})-\log(1+\exp(\beta_{0}+\beta^{T}x_{i}))].$With $l_{1}$ regularization, the maximization problem becomes $\max_{\beta_{0},\beta}[l(\mathbf{X},\mathbf{y};\beta_{0},\beta) - \lambda \| \beta \|_{1} ],$where, just like in the lasso, the intercept $\beta_{0}$ is not penalized, and the predictors are assumed to be centered and standardized.
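As a practical sketch (not from the ESL text), scikit-learn fits this penalized problem; its `C` parameter plays the role of $1/\lambda$, and the predictors should be standardized first:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# toy data, made up for the example: only the first two of ten features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=300) > 0).astype(int)

Xs = StandardScaler().fit_transform(X)        # center and standardize the predictors

# l1-penalized logistic regression; smaller C means a stronger penalty (C ~ 1/lambda)
model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(Xs, y)

print(model.intercept_, model.coef_)          # many coefficients are driven exactly to 0
```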
## Separating Hyperplane Methods
- A class of methods classifies binary responses by finding hyperplanes in the predictor space that perfectly separate the two classes.
- Such a hyperplane does not necessarily exist (without basis expansions), which is the main limitation shared by these methods.
- A hyperplane can be represented using $f(x)=\beta_{0}+\beta^{T}x$ as the set $\{ x \in \mathbb{R}^{p}\,|\,f(x)=0 \}$, and with the encoding $Y\in \{ -1,1 \}$, a point satisfies $y_{i}\cdot f(x_{i})\begin{cases}
>0 & \text{if $(x_{i},y_{i})$ is classified correctly} \\
<0 &\text{if not}
\end{cases}$under the classification rule $\hat{y}(x)=\operatorname{sign}(f(x))$.
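A quick numerical check of this criterion, with a hypothetical hyperplane and points made up for the example (note that $f(x)/\lVert \beta \rVert$ is the signed distance of $x$ to the hyperplane):
```python
import numpy as np

beta0, beta = -1.0, np.array([2.0, 1.0])   # hypothetical hyperplane: 2*x1 + x2 - 1 = 0
X = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, -1.0]])
y = np.array([1, 1, 1])                    # labels encoded as +1 / -1 (all +1 here)

f = beta0 + X @ beta                       # f(x_i) for each point
print(f / np.linalg.norm(beta))            # signed distance of each point to the hyperplane
print(y * f > 0)                           # [True, False, True]: the second point is misclassified
```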
### Rosenblatt’s Perceptron
- The perceptron learning algorithm minimizes a quantity proportional to the total distance of misclassified points to the decision boundary: $\min_{f}\left[ -\sum_{i \in \mathcal{M}_{f}} y_{i} \cdot f(x_{i})\right]$where $\mathcal{M}_{f}$ indexes the points misclassified by $f$. It estimates the coefficients $(\beta_{0},\beta)$ by stochastic gradient descent, updating on one misclassified observation $i$ at a time: $\begin{pmatrix}\beta \\ \beta_{0}\end{pmatrix} \leftarrow \begin{pmatrix}\beta \\ \beta_{0}\end{pmatrix}+\rho \begin{pmatrix}
y_{i}x_{i} \\ y_{i}
\end{pmatrix}$where $\rho$ is the learning rate (a minimal implementation follows the list of issues below).
- Issues of the perceptron algorithm include:
- Separable data admits infinitely many separating hyperplanes, and the result of this algorithm depends on the initial values.
	- Convergence can be slow even for separable data, especially when the margin between the two classes is small.
	- If the data is not separable, the algorithm does not converge; cycles develop, and they can be long and hard to detect.
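A minimal numpy sketch of the update rule above (the two-cluster data, learning rate, and epoch cap are made up for this example; labels are assumed to be encoded as $\pm 1$):
```python
import numpy as np

def perceptron(X, y, rho=1.0, n_epochs=100, seed=0):
    """Rosenblatt's perceptron: sweep the data, update on each misclassified point."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for i in rng.permutation(len(y)):          # stochastic: visit points in random order
            if y[i] * (beta0 + X[i] @ beta) <= 0:  # misclassified (or on the boundary)
                beta = beta + rho * y[i] * X[i]    # beta   <- beta   + rho * y_i * x_i
                beta0 = beta0 + rho * y[i]         # beta_0 <- beta_0 + rho * y_i
                mistakes += 1
        if mistakes == 0:                          # found a separating hyperplane
            break
    return beta0, beta

# toy two-cluster separable data, made up for the example
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[3, 3], size=(50, 2)),
               rng.normal(loc=[-3, -3], size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
print(perceptron(X, y))
```
Rerunning with a different `seed` (a different visiting order) generally returns a different separating hyperplane, illustrating the first issue above.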
### Optimal Separating Hyperplanes
- To avoid the dependence on initial values, we can additionally require the margin to the closest training points of either class to be maximal: this defines the **optimal separating hyperplane**.
- The optimal separating hyperplane is solely determined by a few support points in the data, so moving other points without crossing the margin does not affect the output.
	- This gives robustness against faraway outliers, to which LDA is sensitive (its boundary depends on all observations through the class means and pooled covariance).
- Logistic regression is a compromise between the two: every observation enters the fit, but its weighted least squares formulation gives less weight ($w_{i}=\hat{p}_{i}(1-\hat{p}_{i})$) to points far from the decision boundary.
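As a sketch (not from the ESL text), a hard-margin fit on separable data can be approximated with scikit-learn's linear-kernel SVC and a very large `C`; the fitted `support_vectors_` are the few points that determine the hyperplane:
```python
import numpy as np
from sklearn.svm import SVC

# toy separable two-cluster data, made up for the example; labels in {-1, +1}
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=[3, 3], size=(50, 2)),
               rng.normal(loc=[-3, -3], size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

# a very large C approximates the hard-margin (optimal separating hyperplane) solution
svm = SVC(kernel="linear", C=1e6).fit(X, y)

print(svm.coef_, svm.intercept_)    # the hyperplane parameters (beta, beta_0)
print(svm.support_vectors_)         # only these support points determine the boundary
```
Moving any non-support point without crossing the margin leaves `coef_` and `intercept_` unchanged, matching the robustness property above.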