> [!info] This material corresponds to chapters 4 and 12 of ESL.
> For support vector machines, refer to its [[Support Vector Machines|dedicated page]].
## Separating Hyperplanes
A class of methods classifies binary responses $Y \in \{ 1,-1 \}$ by finding **affine hyperplanes** in the predictor space, given by the set $\{ x \in \mathbb{R}^{p}\,|\,f(x):=\beta_{0} + \beta^{T}x=0 \}$. Whether an observation is classified correctly can then be read off the sign of $y_{i}\cdot f(x_{i})$: $y_{i}\cdot f(x_{i})\begin{cases} >0 & \text{if $(x_{i},y_{i})$ is classified correctly} \\ <0 &\text{if not} \end{cases}.$
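A minimal sketch of this sign-based decision rule (NumPy-based; the function names are illustrative, not from ESL):

```python
import numpy as np

def f(x, beta0, beta):
    """Affine hyperplane function f(x) = beta_0 + beta^T x."""
    return beta0 + beta @ x

def correctly_classified(x, y, beta0, beta):
    """A point (x, y) with y in {-1, +1} is on the correct side iff y * f(x) > 0."""
    return y * f(x, beta0, beta) > 0
```

For example, with `beta0 = -1.0` and `beta = np.array([1.0, 1.0])`, the point `x = np.array([1.0, 1.0])` with label `y = +1` is classified correctly since $f(x) = 1 > 0$.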
Naive methods look for hyperplanes that perfectly separate the two classes. Data for which such a hyperplane exists are called **separable**.
- Such a hyperplane does not necessarily exist without basis expansions, which is a common issue with such methods.
- Another issue is that there are infinitely many separating hyperplanes when the data is separable, so the method needs extra constraints to pick one.
### Rosenblatt’s Perceptron Learning Algorithm
**Rosenblatt’s perceptron learning algorithm** looks for a separating hyperplane by minimizing the total distance of misclassified points to the decision boundary: $\min_{\beta_{0},\beta,\, \| \beta \|=1 }\left[ -\sum_{i \in \mathcal{M}_{f}} y_{i} \cdot f(x_{i})\right]= \min_{\beta_{0},\beta,\,\| \beta \|=1} \left[-\sum_{i} [y_{i} \cdot f(x_{i})]_{-}\right]$ where $\mathcal{M}_{f}$ indexes the points misclassified by $f$ and $[z]_{-}:=\min(z,0)$ denotes the negative part. It estimates the coefficients $(\beta_{0},\beta)$ using stochastic gradient descent, iterating over misclassified observations $(x_{i}, y_{i})$ with $i \in \mathcal{M}_{f}$: $\begin{pmatrix}\beta \\ \beta_{0}\end{pmatrix} \leftarrow\rho \begin{pmatrix} y_{i}x_{i} \\ y_{i} \end{pmatrix} + \begin{pmatrix}\beta \\ \beta_{0}\end{pmatrix}$ where $\rho$ is the learning rate. If the data is separable, the algorithm is guaranteed to converge to a separating hyperplane in a finite number of steps.
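A minimal sketch of this procedure in Python, assuming NumPy and a random choice of which misclassified point to visit at each step (the function signature and stopping rule are illustrative assumptions, not from ESL):

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_iter=1000, seed=0):
    """Fit a separating hyperplane f(x) = beta0 + beta @ x by the perceptron
    update: pick a misclassified point i and move (beta, beta0) by
    rho * (y_i * x_i, y_i)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(max_iter):
        # points with y_i * f(x_i) <= 0 are treated as misclassified
        margins = y * (X @ beta + beta0)
        misclassified = np.flatnonzero(margins <= 0)
        if misclassified.size == 0:
            break                      # all points separated: stop
        i = rng.choice(misclassified)  # stochastic step on one point
        beta += rho * y[i] * X[i]
        beta0 += rho * y[i]
    return beta0, beta
```

The sketch recomputes the misclassified set $\mathcal{M}_{f}$ at every step and stops as soon as it is empty, which is exactly the finite-step termination behaviour for separable data.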
Issues with the perceptron algorithm include:
- The solution found depends on the initial values, since separable data admits many separating hyperplanes.
- Convergence can be slow even for separable data: the smaller the gap between the classes, the longer it takes to find a separating hyperplane.
- If the data is not separable, the algorithm does not converge and gets stuck in cycles, which can be long and therefore hard to detect.