Consider a classification task with dataset $\mathcal{D}=\{ (x_{i},y_{i}) ~|~i=1,\dots,n\}$ and binary response $Y \in \{ -1,1 \}$. We want to build a classifier $f: \mathcal{X} \to \{ -1,1 \}$. [[Hyperplane Classification Methods]] solve for an (affine) hyperplane $\{ x: x^{T}\beta+\beta_{0} =0\}$ defined by some normal vector $\beta$ and offset $\beta_{0}$, and define the classifier $\hat{y}(x):=\mathrm{sign}(f_{\beta,\beta_{0}}(x)), ~\text{where }f_{\beta, \beta_{0}}(x):= x^{T}\beta+\beta_{0}.$

## Maximal Margin Classifier

The **maximal margin classifier** (MMC) or the **optimal separating hyperplane** fits the unique hyperplane with the largest **margin** $M$, i.e. the minimal distance between the data points and the hyperplane: $\max_{w,b} M, ~~\text{subject to }\forall i,~ y_{i}(w^{T}x_{i}+b) \ge M \| w \|.$ Reparameterising with $\tilde{M} := M\| w \|$, $w \leftarrow w / \tilde{M}$ and $b \leftarrow b /\tilde{M}$, this simplifies to $\max_{w,b} \frac{1}{\| w \| } ~~\text{subject to }\forall i,~ y_{i}\cdot(w^{T}x_{i}+b) \ge 1.~~~(\dagger)$

- One feature of the MMC is the **support vectors**: the hyperplane is determined by a few data points near the decision boundary, and completely ignores the rest. In other words, the solution is sparse in the training dataset.
- This means that the MMC is completely robust against faraway outliers, but is very sensitive to changes near the decision boundary.

![[SeparatingHyperplaneMargin.png#invert|w40|center]]

### Hinge Loss and Penalty Form

Replacing maximising $1/ \| w \|$ with minimising $\| w \|^{2} / 2$ in $(\dagger)$, forming its [[Lagrangian Dual|Lagrangian]] gives a problem that generalises the hard margin with a soft loss function: $\min_{w,b} \frac{\| w \|^{2} }{2}+ \sum_{i}c_{i}(1-y_{i}f(x_{i}))_{+},$ where $(1-y_{i}f(x_{i}))_{+}:= \max(0, 1-y_{i}f(x_{i}))$ is the **hinge loss** function, and $c=(c_{1},\dots,c_{n}) \succeq 0$ are the Lagrange multipliers. Replacing them with a common value $C$ gives the soft-margin problem $\min_{w, b} \left( \frac{1}{2}\| w \|^2+C \sum_{i} (1-y_{i}f(x_{i}))_{+} \right)= \min_{w,b} \Bigg( \underset{\text{penalty}}{\vphantom{\sum_{i}}\frac{\lambda}{2} \| w \|^{2} }+\underset{\text{risk}}{\sum_{i}(1-y_{i}f(x_{i}))_{+}} \Bigg),$ where we reparameterised with $C= \lambda^{-1}$. The maximal margin problem is the case $C=\infty$ (or a sufficiently large value), corresponding to $\lambda=0$, i.e. an unregularised problem.

## Support Vector Classifiers

Due to the risk-penalty formulation, the MMC is easily generalised to non-separable cases with a smaller choice of $C$. Substituting $\xi_{i}:= (1-y_{i}f(x_{i}))_{+}$ gives the **soft margin** formulation $\min_{w,b,\xi} \frac{\| w \|^{2} }{2}+C \sum_{i}\xi_{i}, ~~\text{subject to }\forall i, ~\xi_{i} \ge (1-y_{i} f(x_{i}))_{+};$ here, instead of requiring $y_{i}f(x_{i}) \ge 1$, we require $y_{i}f(x_{i}) \ge 1- \xi_{i}$, so $\xi_{i}$ is a **slack variable** that measures the amount of margin violation, and *the hinge loss penalises this violation linearly.*

![[SoftMarginSVM.png#invert|center|w60]]
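To make the penalty form concrete, here is a minimal sketch in plain NumPy (the function `fit_linear_svc`, the step size and the toy blobs are all made up for illustration) that minimises $\frac{1}{2}\| w \|^{2}+C\sum_{i}(1-y_{i}(w^{T}x_{i}+b))_{+}$ by subgradient descent. Library implementations instead solve the dual quadratic programme derived next, but the objective is the same.

```python
import numpy as np

def fit_linear_svc(X, y, C=1.0, lr=1e-3, n_iters=2000, seed=0):
    """Minimise (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
    by subgradient descent; y must take values in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = rng.normal(size=p), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)          # y_i * f(x_i)
        viol = margins < 1                 # points with positive hinge loss
        # subgradient: w - C * sum_{violators} y_i x_i  (and -C * sum y_i for b)
        w -= lr * (w - C * (y[viol] @ X[viol]))
        b -= lr * (-C * y[viol].sum())
    return w, b

# toy usage: two Gaussian blobs, one per class
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]
w, b = fit_linear_svc(X, y, C=1.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```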
### The Dual Problem

In the usual linear constraint form, we require $\xi_{i}\ge 0$ and $\xi_{i} \ge 1-y_{i}f(x_{i})$ separately. With Lagrange multipliers $\lambda,\alpha$ respectively, the [[Lagrangian Dual|Lagrangian]] is $L(w,b,\xi,\lambda,\alpha)=\underset{\text{primal objective}}{\left\{ \frac{\| w \|^{2} }{2}+C\sum_{i}\xi_{i} \right\}}+\underset{\text{constraint terms}}{\left\{ \sum_{i}\lambda_{i}(-\xi_{i})+\sum_{i}\alpha_{i}(1-\xi_{i}-y_{i}f(x_{i})) \right\}}.$ Since $\lambda,\alpha$ both correspond to inequality constraints, we require $\lambda, \alpha \succeq 0$. Solving for the dual function $g(\lambda, \alpha):= \inf_{w,b,\xi}L$ by differentiating w.r.t. $w,b,\xi$ produces the following:

$\begin{align*} \nabla_{\!w}L&= w-\sum_{i}\alpha_{i}y_{i}x_{i} \equiv 0 &&\Longrightarrow w=\sum_{i}\alpha_{i}y_{i}x_{i},\\ \nabla_{\!\xi}L&= C-\lambda-\alpha \equiv 0 &&\Longrightarrow \alpha = C-\lambda \preceq C,\\[0.4em] \nabla_{\!b}L&= -\sum_{i}\alpha_{i}y_{i} \equiv {0} &&\Longrightarrow \sum_{i}\alpha_{i}y_{i}=0. \end{align*}$

These conditions say:

- The normal vector is a weighted combination of the data $\{ x_{i} ~|~i=1,\dots ,n\}$. Equivalently, the fitted discriminant scores $x$ through $w^{T}x=\sum_{i}\alpha_{i}y_{i}\,x^{T}x_{i}$, i.e. its similarities with the training data measured by the Euclidean inner product.
- The weights $\alpha$ are bounded element-wise by $C$.
- The weights are not biased toward either class: $\sum_{i:y_{i}=1}\alpha_{i}=\sum_{i:y_{i}=-1}\alpha_{i}$.

Complementary slackness (CS) determines which data points become **support vectors** (SVs) of the SVC:

- If $y_{i}f(x_{i})> 1$, then $1-\xi_{i}-y_{i}f(x_{i})<0$, and CS gives $\alpha_{i}=0$, i.e. the point has no weight in the normal vector. Therefore, the SVC is completely robust against correctly classified outliers (unlike [[Linear Discriminant Analysis and Generalizations|LDA]]).
- A point is a **non-marginal SV** if $\xi_{i} > 0$, so the observation violates the margin; CS requires $\lambda_{i}=0$, hence $\alpha_{i}=C$, and the point has the maximum possible weight in the normal vector. If further $\xi_{i} > 1$, then the observation is misclassified. Since the weight is capped, severe, incorrectly classified outliers do not ruin the model.
- A point is a **marginal SV** if $\xi_{i}=0$ and it sits exactly on the margin, i.e. $y_{i}f(x_{i})=1$. Here CS pins down neither $\lambda_{i}$ nor $\alpha_{i}$, so the weight can be anything between $0$ and $C$.

Substituting $w=\sum_{i}\alpha_{i}y_{i}x_{i}$ and $\lambda=C-\alpha$ gives the dual function $g(\lambda, \alpha)=\sum_{i}\alpha_{i}-\frac{1}{2}\alpha^{T}\tilde{K}\alpha,$ where $\tilde{K}_{ij}:= y_{i}y_{j}\cdot x_{i}^{T}x_{j}$ is the Gram matrix, but with signs flipped whenever $y_{i}\ne y_{j}$. Here the $\xi_{i}$ all cancel out, so we do not need to solve for them; the dual problem is to maximise $g$ over $0 \preceq \alpha \preceq C$ with $\sum_{i}\alpha_{i}y_{i}=0$.
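As a sanity check on these conditions, the following sketch (assuming scikit-learn, whose `dual_coef_` attribute stores $y_{i}\alpha_{i}$ for the support vectors; the blob data is made up) fits a linear SVC and verifies $0 \preceq \alpha \preceq C$, $\sum_{i}\alpha_{i}y_{i}=0$ and $w=\sum_{i}\alpha_{i}y_{i}x_{i}$ numerically.

```python
import numpy as np
from sklearn.svm import SVC

# toy data: two overlapping Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(1, 1, (60, 2))])
y = np.r_[-np.ones(60), np.ones(60)]

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# scikit-learn stores y_i * alpha_i for each support vector in dual_coef_
alpha = y[clf.support_] * clf.dual_coef_[0]

print(np.all((alpha >= -1e-8) & (alpha <= C + 1e-8)))  # 0 <= alpha_i <= C
print(np.isclose(clf.dual_coef_[0].sum(), 0.0))         # sum_i alpha_i y_i = 0
w = clf.dual_coef_[0] @ clf.support_vectors_            # w = sum_i alpha_i y_i x_i
print(np.allclose(w, clf.coef_[0]))                     # matches the fitted normal vector
```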
The SVC produces a hyperplane in the predictor space, hence a linear decision boundary. This will fare poorly if the true relationship is not linear:

![[BadSVC.png#invert|w60|center]]

## Support Vector Machines

Support vector machines (SVMs) further generalise the SVC by expanding the feature space, allowing non-linear decision boundaries. However, instead of "manually" including the features, they do so through **kernels**, i.e. functions $\mathcal{X}^{2} \to \mathbb{R}$ whose Gram matrices are always positive (semi-)definite.

- One important feature of fitting the SVC is that the data only appear through the inner products $\left< x_{i},x_{j} \right>= x_{i}^{T}x_{j}$, called kernels in this context.
- The SVC takes the kernel to be the Euclidean inner product, but it is possible to use different functions $K(x_{i}, x_{j})$, as illustrated after this list.

Common kernels include:

- the **polynomial kernel** $K(x_{i},x_{j};d)=(1+\langle x_{i},x_{j}\rangle)^{d}$,
- the **radial basis kernel (RBF)** $K(x_{i},x_{j};\gamma)=\exp(-\gamma\|x_{i}-x_{j}\|^{2})$, where $\gamma$ is a positive parameter.

The SVM does not excel at variable selection in high dimensions, however: these kernels treat all predictors symmetrically (e.g. $\left< x_{i},x_{j} \right>=\sum_{p}x_{ip}x_{jp}$ weights every predictor equally), unlike more adaptive methods such as MARS.

### The Kernel Trick

Suppose we replace the hypothesis class of hyperplane discriminants $x\mapsto w^{T} x + b$ with an [[Reproducing Kernel Hilbert Space|RKHS]] $(\mathcal{H},\left< \cdot,\cdot \right>_{\mathcal{H}})$. The discriminants are now general functions $f\in \mathcal{H}$, penalised through $\| f \|_{\mathcal{H}}$, giving the problem $\min_{f\in \mathcal{H}} C\left( \sum_{i}(1-y_{i}f(x_{i}))_{+} \right)+\frac{\| f \| _{\mathcal{H}}^2}{2}.$ By the [[Reproducing Kernel Hilbert Space#Representer Theorem|representer theorem]], the optimiser has the form $f^{\ast}=\sum_{i}\beta ^{\ast}_{i}\kappa(\cdot,x_{i}),$ so the problem reduces to $\tilde{\mathcal{H}}:= \mathrm{span}\{ \kappa(\cdot,x_{i}) \}$, parametrised by $\beta$: $\min_{\beta\in \mathbb{R}^{n}}C\sum_{i}\left( 1-y_{i}f_{\beta}(x_{i}) \right)_{+}+\frac{1}{2}\beta^{T}K\beta,$ where $f_{\beta}(x):= \sum_{j}\beta_{j}\kappa(x,x_{j})$. In terms of slack variables, it is $\min_{\xi \succeq 0, \beta} C\sum_{i}\xi_{i}+\frac{1}{2}\beta^{T}K\beta, ~~ \text{subject to }\forall i,~ y_{i}\cdot f_{\beta}(x_{i}) \ge 1-\xi_{i}.$ Imitating the derivation for the SVC, the solution has the form $f^{\ast}(x)=\sum_{i}\alpha_{i}y_{i}\kappa(x,x_{i}),$ where $\alpha$ solves the quadratic dual problem $\max_{0 \preceq \alpha \preceq C} g(\alpha)=\max \left( \sum_{i}\alpha_{i}-\frac{1}{2}\alpha^{T}\tilde{K}\alpha \right),$ with $\tilde{K}_{ij}=y_{i}y_{j} \cdot \kappa(x_{i},x_{j})$ this time, instead of $y_{i}y_{j}\cdot x_{i}^{T}x_{j}$.
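For instance, a small sketch (assuming scikit-learn; the `make_circles` data and the value of $\gamma$ are arbitrary choices for illustration) where the RBF kernel recovers a non-linear boundary that the linear SVC cannot:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# concentric circles: not linearly separable in the original two predictors
X, y = make_circles(n_samples=400, factor=0.4, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X_tr, y_tr)

print("linear SVC accuracy:", linear.score(X_te, y_te))  # roughly chance level
print("RBF SVM accuracy:   ", rbf.score(X_te, y_te))      # close to 1
```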
### Implicit Feature Expansions

In general, feature expansion is dangerous for hyperplane methods: with enough features the training data become separable and the classifier overfits. So how does the SVM's kernel introduce non-linearity while circumventing this problem?

Consider the special case where the feature expansion comes from an eigen-decomposition of some positive definite kernel $\kappa$: $\kappa(x, x')=\sum_{j}h_{j}(x)h_{j}(x')\sigma_{j}\overset{\text{WLOG}}{=}\sum_{j}h_{j}(x)h_{j}(x')$ by rescaling $h_{j} \leftarrow \sqrt{ \sigma_{j} }\,h_{j}$. Then the objective can be rewritten as $\sum_{i}(1-y_{i}w^{T}\mathbf{h}(x_{i}))_{+} + \lambda \| w \|^{2}= \sum_{i} (1-y_{i}f(x_{i}))_{+}+\lambda\| f \|_{\mathcal{H}(\kappa)}^{2},$ where $\mathcal{H}(\kappa)$ is the [[Reproducing Kernel Hilbert Space]] of the kernel $\kappa$, and $\| \cdot \|_{\mathcal{H}(\kappa)}$ is its norm, derived from the inner product $\left< \cdot,\cdot \right>_{\mathcal{H}(\kappa)}$ that satisfies $\left< f,\kappa(x, \cdot) \right>_{\mathcal{H}(\kappa)}=f(x)$.

- The equality $\| w \|^{2}=\| f \|_{\mathcal{H}(\kappa)}^{2}$ follows from $f=w^{T}\mathbf{h}$, so $\left< f,f \right>_{\mathcal{H}}=w^{T}\begin{pmatrix} \left< h_{1},h_{1} \right>_{\mathcal{H}} & \left< h_{1},h_{2} \right>_{\mathcal{H}} & \dots \\ \left< h_{2},h_{1} \right>_{\mathcal{H}} & \left< h_{2},h_{2} \right>_{\mathcal{H}} & \dots \\ \vdots & \vdots & \ddots \end{pmatrix}w = w^{T}Iw=\| w \|^{2},$ as $\left< h_{i},h_{j} \right>_{\mathcal{H}}=\delta_{ij}$ (to prove it, expand $\left< f,\kappa(x,\cdot) \right>_{\mathcal{H}}$ and substitute $f=h_{i}$).

Therefore, the SVM becomes a functional optimisation over the space $\mathcal{H}(\kappa)$, with objective $\text{Empirical risk} +\lambda\| f \|_{\mathcal{H}(\kappa)}^{2}.$ The **representer theorem** guarantees that the solution has the form $f^{\ast}(x)=\sum_{i}\beta_{i}\kappa(x, x_{i}),$ i.e. a weighted combination of kernel functions centred at the training points.

> [!idea]
> Therefore, if the feature expansion defines a positive definite kernel $\kappa$, then the corresponding SVM simply replaces the dot product $\left< x_{i},x_{j} \right>=x_{i}^{T}x_{j}$ with the kernel $\kappa(x_{i},x_{j})$. Conversely, we can implicitly do a feature expansion for free by choosing a kernel $\kappa$ in the first place -- this is the whole schtick of kernel SVM.
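As a concrete check of this idea, the sketch below (assuming NumPy and scikit-learn; the labelling rule is made up, and `phi` is the standard explicit expansion of $(1+\left< x,x' \right>)^{2}$ in two dimensions) fits one SVC on the explicitly expanded features with the plain dot product and another on the raw features with the degree-2 polynomial kernel, and checks that they produce the same decision function.

```python
import numpy as np
from sklearn.svm import SVC

def phi(X):
    """Explicit degree-2 feature map with phi(x) . phi(z) = (1 + <x, z>)^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        np.ones(len(X)),
        np.sqrt(2) * x1, np.sqrt(2) * x2,
        x1 ** 2, x2 ** 2,
        np.sqrt(2) * x1 * x2,
    ])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.5)  # a non-linear labelling rule

# dot product on the expanded features ...
explicit = SVC(kernel="linear", C=1.0).fit(phi(X), y)
# ... versus the kernel (1 + <x, x'>)^2 on the raw features
implicit = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

# identical Gram matrices, hence (up to solver tolerance) the same classifier
print(np.allclose(explicit.decision_function(phi(X)),
                  implicit.decision_function(X), atol=1e-4))
```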