> [!info] This material corresponds to chapter 4 in the ESL.
Linear classification methods are methods whose decision boundary is linear in the inputs:
- The boundary can be modeled directly as a hyperplane (e.g. the perceptron, SVM).
- It can also come from a linear discriminant function $\delta_{k}(x)$, or from the posterior probability $\mathbb{P}[Y=k\,|\, X=x]$ itself (modeled through a monotone transformation that is linear in $x$, as in logistic regression).
## Glossary of Methods Covered Elsewhere
[[Linear Discriminant Analysis and Generalizations]] is a generative model (using Gaussians) that reduces to linear decision boundaries (linear either in the original inputs or in their quadratic terms).
**Logistic regression** is just a special case of [[Generalized Linear Models|GLMs]] modeling a binary (Bernoulli) response, i.e. modeling the log-odds with a linear predictor:
$\beta^{T}X=\eta =\log\left( \frac{\mathbb{P}[Y=1 ~|~ X]}{1-\mathbb{P}[Y=1 ~|~X]} \right).$Equivalently, solving for the probability gives $\mathbb{P}[Y=1~|~X]=\frac{e^{\eta}}{1+e^{\eta}}$, which is the linear predictor squashed by the **logistic (sigmoid)** function, the inverse of the logit link.
- It generalizes to the multi-class setting by modeling the log-odds of each category $k=1,\dots,K-1$ against a reference category $K$ (the resulting class probabilities are derived below): $\beta_{k}^{T}X=\eta_{k}=\log\left( \frac{\mathbb{P}[Y=k\,|\,X]}{\mathbb{P}[Y=K\,|\,X]} \right),\,k=1,\dots,K-1.$
It then results in the linear decision boundaries $\begin{align*}
&\{ \mathbf{x}~|~\beta^{T}\mathbf{x}=0 \} &\text{for binary responses}; \\
&\{ \mathbf{x} ~|~ (\beta_{k_{1}}-\beta_{k_{2}})^{T}\mathbf{x}=0 \} & \text{for multiple classes}.
\end{align*}$
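Solving the $K-1$ multi-class log-odds equations for the class probabilities (using that they sum to one) gives the softmax-with-reference-class form $\mathbb{P}[Y=k\,|\,X]=\frac{e^{\eta_{k}}}{1+\sum_{l=1}^{K-1}e^{\eta_{l}}}$ for $k=1,\dots,K-1$, and $\mathbb{P}[Y=K\,|\,X]=\frac{1}{1+\sum_{l=1}^{K-1}e^{\eta_{l}}}$, which reduces to the binary case above when $K=2$.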
- Logistic regression estimates the coefficients by maximum likelihood, computed with Newton–Raphson iterations (equivalently, iteratively reweighted least squares); a minimal numerical sketch follows.
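As an illustration of the Newton–Raphson fit for the binary case, here is a minimal numpy sketch (the toy data, tolerance, and iteration cap are made up for this example; in practice one would call a library routine):
```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Binary logistic regression MLE via Newton-Raphson.

    X: (N, p) predictor matrix (an intercept column is added here).
    y: (N,) array of 0/1 responses.
    """
    Xd = np.column_stack([np.ones(len(y)), X])    # prepend the intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xd @ beta)))    # current fitted probabilities
        W = p * (1.0 - p)                         # diagonal of the weight matrix W
        grad = Xd.T @ (y - p)                     # score: gradient of the log-likelihood
        hess = Xd.T @ (Xd * W[:, None])           # X^T W X (negative Hessian)
        step = np.linalg.solve(hess, grad)        # Newton direction
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# toy (non-separable) data, made up for the example
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(float)
print(fit_logistic_newton(X, y))                  # [beta_0, beta_1, beta_2]
```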
### Logistic Regression as a Linear Model
Consider a logistic regression on a binary response $y_{i}=0 \text{ or }1$, with the predicted probability $\hat{p}_{i}=\hat{\mathbb{P}}[Y=1\,|\,X=x_{i}]$.
The MLE coefficients $\hat{\beta}$ given by logistic regression are a fixed point of the weighted least squares problem with working response and weights $\begin{align*}
z_{i}&= \hat{\beta}^{T}x_{i}+\frac{y_{i}-\hat{p}_{i}}{\hat{p}_{i}(1-\hat{p}_{i})},\\
w_{i} &= \hat{p}_{i}(1-\hat{p}_{i}),
\end{align*}$i.e. regressing $z_{i}$ on $x_{i}$ with weights $w_{i}$ returns $\hat{\beta}$ itself (the Newton-step derivation at the end of this subsection makes the connection explicit).
- In this model, the weighted $\mathrm{RSS}$ is the Pearson chi-square statistic: $\mathrm{wRSS}=\sum_{i}w_{i}(z_{i}-\hat{z}_{i})^{2}=\sum_{i} \frac{(y_{i}-\hat{p}_{i})^{2}}{\hat{p}_{i}(1-\hat{p}_{i})}$
- The weighted least squares view also gives the asymptotic behavior of $\hat{\beta}$ (assuming the model is correct):
	- By consistency of the MLE, $\hat{\beta} \to \beta$ in probability.
	- By a central limit theorem, $\hat{\beta}$ is asymptotically distributed as $N(\beta,(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1})$, where $\mathbf{W}$ is the diagonal matrix $\mathrm{diag}(w_{1},\dots,w_{N})$.
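To see where the working response comes from, write a single Newton–Raphson step in matrix form (with $\mathbf{p}$ the vector of current fitted probabilities): $\begin{align*}
\beta^{\text{new}}&=\beta^{\text{old}}+(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}(\mathbf{y}-\mathbf{p})\\
&=(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\mathbf{z}, \quad \mathbf{z}=\mathbf{X}\beta^{\text{old}}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}),
\end{align*}$which is exactly a weighted least squares regression of $\mathbf{z}$ on $\mathbf{X}$ with weights $\mathbf{W}$; at convergence $\beta^{\text{new}}=\beta^{\text{old}}=\hat{\beta}$, giving the fixed-point characterization above.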
### Regularized Logistic Regression
- The regular logistic regression MLE maximizes the binary-outcome log-likelihood $l(\mathbf{X},\mathbf{y};\beta_{0},\beta)=\sum_{i=1}^{N}[y_{i}(\beta_{0}+\beta^{T}x_{i})-\log(1+\exp(\beta_{0}+\beta^{T}x_{i}))].$With $l_{1}$ regularization, the maximization problem becomes $\max_{\beta_{0},\beta}[l(\mathbf{X},\mathbf{y};\beta_{0},\beta) - \lambda \| \beta \|_{1} ],$where, just like in the lasso, the intercept $\beta_{0}$ is not penalized, and the predictors are assumed to be centered and standardized.
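As a practical sketch (not from the ESL text), scikit-learn fits this penalized problem; its `C` parameter plays the role of $1/\lambda$, and the predictors should be standardized first:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# toy data, made up for the example: only the first two of ten features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=300) > 0).astype(int)

Xs = StandardScaler().fit_transform(X)        # center and standardize the predictors

# l1-penalized logistic regression; smaller C means a stronger penalty (C ~ 1/lambda)
model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(Xs, y)

print(model.intercept_, model.coef_)          # many coefficients are driven exactly to 0
```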
## Separating Hyperplane Methods
- A class of methods classifies binary responses by finding hyperplanes in the predictor space that perfectly separate the two classes.
- Such a hyperplane does not necessarily exist (without basis expansions), which is the main limitation shared by these methods.
- A hyperplane can be represented using $f(x)=\beta_{0}+\beta^{T}x$ as the set $\{ x \in \mathbb{R}^{p}\,|\,f(x)=0 \}$, and with the encoding $Y\in \{ -1,1 \}$, a point satisfies $y_{i}\cdot f(x_{i})\begin{cases}
>0 & \text{if $(x_{i},y_{i})$ is classified correctly} \\
<0 &\text{if not}
\end{cases}$under the classification rule $\hat{y}(x)=\operatorname{sign}(f(x))$.
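A quick numerical check of this criterion, with a hypothetical hyperplane and points made up for the example (note that $f(x)/\lVert \beta \rVert$ is the signed distance of $x$ to the hyperplane):
```python
import numpy as np

beta0, beta = -1.0, np.array([2.0, 1.0])   # hypothetical hyperplane: 2*x1 + x2 - 1 = 0
X = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, -1.0]])
y = np.array([1, 1, 1])                    # labels encoded as +1 / -1 (all +1 here)

f = beta0 + X @ beta                       # f(x_i) for each point
print(f / np.linalg.norm(beta))            # signed distance of each point to the hyperplane
print(y * f > 0)                           # [True, False, True]: the second point is misclassified
```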
### Rosenblatt’s Perceptron
- The perceptron learning algorithm minimizes a quantity proportional to the total distance of misclassified points to the decision boundary: $\min_{f}\left[ -\sum_{i \in \mathcal{M}_{f}} y_{i} \cdot f(x_{i})\right]$where $\mathcal{M}_{f}$ indexes the points misclassified by $f$. It estimates the coefficients $(\beta_{0},\beta)$ by stochastic gradient descent, updating on one misclassified observation $i$ at a time: $\begin{pmatrix}\beta \\ \beta_{0}\end{pmatrix} \leftarrow \begin{pmatrix}\beta \\ \beta_{0}\end{pmatrix}+\rho \begin{pmatrix}
y_{i}x_{i} \\ y_{i}
\end{pmatrix}$where $\rho$ is the learning rate (a minimal implementation follows the list of issues below).
- Issues of the perceptron algorithm include:
- Separable data admits infinitely many separating hyperplanes, and the result of this algorithm depends on the initial values.
	- Convergence can be slow even for separable data, especially when the margin between the two classes is small.
	- If the data is not separable, the algorithm does not converge; cycles develop, and they can be long and hard to detect.
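A minimal numpy sketch of the update rule above (the two-cluster data, learning rate, and epoch cap are made up for this example; labels are assumed to be encoded as $\pm 1$):
```python
import numpy as np

def perceptron(X, y, rho=1.0, n_epochs=100, seed=0):
    """Rosenblatt's perceptron: sweep the data, update on each misclassified point."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(n_epochs):
        mistakes = 0
        for i in rng.permutation(len(y)):          # stochastic: visit points in random order
            if y[i] * (beta0 + X[i] @ beta) <= 0:  # misclassified (or on the boundary)
                beta = beta + rho * y[i] * X[i]    # beta   <- beta   + rho * y_i * x_i
                beta0 = beta0 + rho * y[i]         # beta_0 <- beta_0 + rho * y_i
                mistakes += 1
        if mistakes == 0:                          # found a separating hyperplane
            break
    return beta0, beta

# toy two-cluster separable data, made up for the example
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[3, 3], size=(50, 2)),
               rng.normal(loc=[-3, -3], size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
print(perceptron(X, y))
```
Rerunning with a different `seed` (a different visiting order) generally returns a different separating hyperplane, illustrating the first issue above.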
### Optimal Separating Hyperplanes
- To avoid the dependence on initial values, we can additionally require the margin to the closest training points of either class to be maximal: this defines the **optimal separating hyperplane**.
- The optimal separating hyperplane is solely determined by a few support points in the data, so moving other points without crossing the margin does not affect the output.
	- This gives robustness against faraway outliers, to which LDA is sensitive (its boundary depends on all observations through the class means and pooled covariance).
- Logistic regression is a compromise between the two: every observation enters the fit, but its weighted least squares formulation gives less weight ($w_{i}=\hat{p}_{i}(1-\hat{p}_{i})$) to points far from the decision boundary.
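As a sketch (not from the ESL text), a hard-margin fit on separable data can be approximated with scikit-learn's linear-kernel SVC and a very large `C`; the fitted `support_vectors_` are the few points that determine the hyperplane:
```python
import numpy as np
from sklearn.svm import SVC

# toy separable two-cluster data, made up for the example; labels in {-1, +1}
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=[3, 3], size=(50, 2)),
               rng.normal(loc=[-3, -3], size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

# a very large C approximates the hard-margin (optimal separating hyperplane) solution
svm = SVC(kernel="linear", C=1e6).fit(X, y)

print(svm.coef_, svm.intercept_)    # the hyperplane parameters (beta, beta_0)
print(svm.support_vectors_)         # only these support points determine the boundary
```
Moving any non-support point without crossing the margin leaves `coef_` and `intercept_` unchanged, matching the robustness property above.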