> [!info] This material corresponds to chapters 4 and 12 of the ESL.
> For support vector machines, refer to the [[Support Vector Machines|dedicated page]].

## Separating Hyperplanes

A class of methods classifies binary responses $Y \in \{ 1,-1 \}$ by finding **affine hyperplanes** in the predictor space, given by the set $\{ x \in \mathbb{R}^{p}\,|\,f(x):=\beta_{0} + \beta^{T}x=0 \}$. Whether an observation is classified correctly can then be read off the sign of $y_{i}\cdot f(x_{i})$:
$$y_{i}\cdot f(x_{i})\begin{cases} >0 & \text{if $(x_{i},y_{i})$ is classified correctly} \\ <0 &\text{if not.} \end{cases}$$
Naive methods look for hyperplanes that perfectly separate the two classes. Data for which such a hyperplane exists are called **separable**.

- Such a hyperplane does not necessarily exist without basis expansions, which is a common issue of these methods.
- Another issue is that if the data are separable, there are infinitely many separating hyperplanes, so the method needs extra constraints to pick one.

### Rosenblatt’s Perceptron Learning Algorithm

**Rosenblatt’s perceptron learning algorithm** looks for a separating hyperplane by minimizing the total distance of misclassified points to the decision boundary:
$$\min_{\beta_{0},\beta,\, \| \beta \|=1 }\left[ -\sum_{i \in \mathcal{M}_{f}} y_{i} \cdot f(x_{i})\right]= \min_{\beta_{0},\beta,\,\| \beta \|=1} \left[-\sum_{i} [y_{i} \cdot f(x_{i})]_{-}\right],$$
where $\mathcal{M}_{f}$ indexes the points misclassified by $f$. It estimates the coefficients $(\beta_{0},\beta)$ using stochastic gradient descent, iterating over the misclassified observations $(x_{i}, y_{i})$, $i \in \mathcal{M}_{f}$:
$$\begin{pmatrix}\beta \\ \beta_{0}\end{pmatrix} \leftarrow \begin{pmatrix}\beta \\ \beta_{0}\end{pmatrix} + \rho \begin{pmatrix} y_{i}x_{i} \\ y_{i} \end{pmatrix},$$
where $\rho$ is the learning rate. The algorithm is guaranteed to terminate in a finite number of steps if the data are separable.

Issues of the perceptron algorithm include:

- The estimated hyperplane depends on the initial values.
- Convergence can be slow for separable data, especially when the gap between the classes is small.
- If the data are not separable, the algorithm does not converge and gets stuck in cycles that can be long and hard to detect.
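
To make the update rule above concrete, here is a minimal sketch of the perceptron loop in Python/NumPy; the function name `perceptron`, the random initialization, and the `max_epochs` cutoff are illustrative assumptions rather than part of the algorithm as stated in ESL.

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000, seed=0):
    """Rosenblatt's perceptron via stochastic gradient descent.

    X: (n, p) array of predictors; y: (n,) array of labels in {-1, +1}.
    Returns (beta0, beta). Assumes the data are separable; otherwise it
    simply stops after `max_epochs` passes without converging.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = rng.normal(size=p)   # initial values influence the final hyperplane
    beta0 = 0.0
    for _ in range(max_epochs):
        misclassified = y * (X @ beta + beta0) <= 0
        if not misclassified.any():
            return beta0, beta  # every point is on the correct side: done
        # visit the currently misclassified points in random order
        for i in rng.permutation(np.flatnonzero(misclassified)):
            if y[i] * (X[i] @ beta + beta0) <= 0:   # still misclassified?
                beta += rho * y[i] * X[i]
                beta0 += rho * y[i]
    return beta0, beta  # may fail to separate non-separable data
```

Running this with different seeds generally yields different separating hyperplanes for the same separable data, which illustrates the dependence on initial values noted above.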