> [!info] This material corresponds to chapter 13 of the ESL.

## Prototypes

Given a data set $(X_{i}, g_{i})_{i=1}^{N}$ with categorical responses $g_{i} \in \{ 1,\dots,K \}$, prototype methods find **prototypes** $(m_{r})_{r=1}^{R}$, points in the feature space each carrying a class label. *They are meant to be "representatives" of their respective classes*.
- In general, the prototypes are distinct from the training data points; the exception is $1$-nearest-neighbor classification, where the training points themselves serve as the prototypes.

New data points are then classified to the class of the nearest prototype. "Nearness" can be measured with Euclidean distance if the features are standardized. This method is **model-free**, i.e. it makes no a priori assumption about the relationship between the features and the class labels.
- Benefits include reduced bias, allowing for very irregular decision boundaries.
- Difficulties come from choosing the appropriate number of prototypes and their locations.

### K-Means for Classification

Although a clustering algorithm, [[K-Means Clustering|K-means]] can be used to find prototypes for classification:

> [!algorithm] Classification with K-means
> $[0]$ Fix $R$, the number of prototypes for each class.
>
> For $[\text{class }k=1,\dots,K]$:
> - Apply K-means with $R$ clusters to the data points belonging to class $k$.
> - Record the centroids $m_{k1},\dots,m_{kR}$.
>
> A new data point $x$ is classified to the class of the nearest of the $K \times R$ centroids.

The issue with this approach is that *classes do not communicate*: the centroids $m_{k1},\dots,m_{kR}$ are selected with no input from the classes $j \ne k$. This means that if class $j$ is very dense in a region, K-means for class $k$ might still place a centroid there, misclassifying (roughly) half of that region.

### Learning Vector Quantization

To allow communication between classes, a centroid of class $k$ should not only stay close to data points of class $k$, but also steer away from regions densely populated by other classes. This motivates the **learning vector quantization** algorithm, which lets same-class points attract centroids and misclassified points repel them:

> [!algorithm] Learning Vector Quantization (LVQ1)
>
> $[0]$ Fix $R$, and initialize $K \times R$ centroids $\{ m_{kr} \,|\, k=1,\dots,K,\ r=1,\dots,R\}$ (e.g. the ones selected by K-means, or a random sample of $R$ points from each class). Initialize the learning rate $\epsilon > 0$.
>
> $[1]$ Repeat until convergence or until $\epsilon$ becomes too small:
> - Randomly sample $(x_{i},g_{i})$ from the data set (with replacement), and find its closest centroid $m_{kr}$.
> - If $(x_{i},g_{i})$ belongs to the same class as $m_{kr}$, nudge the centroid towards $x_{i}$: $m_{kr}\leftarrow m_{kr} + \epsilon(x_{i} - m_{kr})$. Otherwise, nudge it away from $x_{i}$: $m_{kr}\leftarrow m_{kr} - \epsilon(x_{i} - m_{kr})$.
> - Decrease $\epsilon$.

Implementation in Python: [GitHub link](https://github.com/aliquod/LearningVectorQuantization)

![[LVQImplementation.png#invert|w80|center]]

The main drawback of LVQ is that it is defined purely by an algorithm rather than by the optimization of some criterion, so its properties are difficult to analyze.
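As a complement to the linked implementation, a minimal NumPy sketch of LVQ1. The function names, the per-class random initialization, and the geometric decay of $\epsilon$ are illustrative choices here, not part of the algorithm's specification.

```python
import numpy as np

def lvq1(X, g, R=3, eps=0.05, decay=0.999, n_iter=10_000, seed=None):
    """Minimal LVQ1: returns centroids of shape (K*R, p) and their class labels."""
    rng = np.random.default_rng(seed)
    classes = np.unique(g)
    # [0] Initialize R prototypes per class by sampling points of that class
    #     (assumes each class has at least R observations).
    m = np.vstack([X[rng.choice(np.flatnonzero(g == c), R, replace=False)]
                   for c in classes]).astype(float)
    labels = np.repeat(classes, R)
    # [1] Stochastic updates: attract same-class centroids, repel others.
    for _ in range(n_iter):
        i = rng.integers(len(X))                      # sample (x_i, g_i) with replacement
        j = np.argmin(((m - X[i]) ** 2).sum(axis=1))  # closest centroid m_kr
        sign = 1.0 if labels[j] == g[i] else -1.0
        m[j] += sign * eps * (X[i] - m[j])
        eps *= decay                                  # decrease the learning rate
    return m, labels

def lvq_predict(x, m, labels):
    """Classify x to the class of the nearest centroid."""
    return labels[np.argmin(((m - x) ** 2).sum(axis=1))]
```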
## Nearest Neighbors Classifiers

**$k$-nearest neighbor classifiers** are memory-based methods that require no model to be fit. Given a new observation $x^{*}$, they simply query the $k$ training data points $(x_{(1)},\dots,x_{(k)})$ closest to it and predict $x^{*}$ to belong to the majority class among them.
- Closeness can be measured by Euclidean distance if the predictors are all standardized.
- *Low values of $k$ reduce bias but increase variance*, since the decision boundaries are heavily influenced by individual data points.

Under $0$-$1$ loss with $k=1$, if a new sample $x^{*}$ shares its predictor values with some training observation $x_{i}$, its error rate is no more than twice the Bayes rate:
$\begin{align*} \text{Bayes error}&= 1-p_{\max}\\ 1\text{-Nearest Neighbor error}&= \sum_{k=1}^{K}p_{k}(1-p_{k}) \end{align*}$
where $p_{k}=\mathbb{P}[Y=k \mid X=x^{*}]$ are the conditional class probabilities at $x^{*}$, and $p_{\max}=\max_{k}p_{k}$. Note that *in binary classification ($K=2$), the $1$-NN error rate is always less than twice the Bayes error rate*.
- The assumption that $x^{*}$ has the same predictor values as some $x_{i}$ holds asymptotically if the number of predictors stays constant as $n \to \infty$, so the training data become dense in the feature space.
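A minimal NumPy sketch of the procedure, assuming numeric predictors that are standardized inside the function; the helper name `knn_predict` is made up for illustration. In practice a library routine such as scikit-learn's `KNeighborsClassifier` would be preferred, since it provides efficient neighbor search.

```python
import numpy as np

def knn_predict(x_star, X, g, k=5):
    """Classify x_star by majority vote among its k nearest training points."""
    # Standardize the predictors so Euclidean distance weighs them comparably.
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Xs, xs = (X - mu) / sd, (x_star - mu) / sd
    dist = np.sqrt(((Xs - xs) ** 2).sum(axis=1))  # distances to all N training points
    nearest = np.argsort(dist)[:k]                # indices of the k closest points
    classes, counts = np.unique(g[nearest], return_counts=True)
    return classes[np.argmax(counts)]             # majority class (ties -> smallest label)
```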
### Invariance Metrics and Tangent Distances

*For certain applications, Euclidean distance is not suitable for measuring closeness* -- for example, in computer-vision tasks like digit classification, a slightly rotated image is still "close" to the original, but not in Euclidean distance:

![[Pasted image 20231210150339.png#invert|w80|center]]

The brute-force solution is to add rotated copies of each image to the training set (**data augmentation**). Alternatively, one can use an **invariance metric**, where an image's distance to "3" is the shortest distance between it and the manifold containing the transformations of "3" (green curve):

![[TransformationTangentDistance.png#invert|w50|center]]

This manifold is hard to compute in real applications, so it is approximated by the tangent hyperplane at the original image "3". The distance measured this way is the **tangent distance**. Schematically,

![[TangentDistanceSchematic.png#invert|w60|center]]

Even so, this kind of computation is slow for real-life applications, so faster methods (e.g. neural networks) are instead modified to build in the same invariances.

### Adaptive Nearest Neighbors

One major drawback of vanilla nearest neighbor methods is the [[Curse of Dimensionality|curse of dimensionality]].
- One extreme case is irrelevant predictors (pure noise): *the distance is inflated by the noise dimensions, and "closeness" in the relevant predictors is diluted*.
- Therefore, nearest neighbor methods can benefit from dimension reduction methods like PCA.

Alternatively, weighting the predictors when computing distances and *assigning a smaller weight to irrelevant predictors* also prevents them from inflating the distance:
- In this example with $p=2$ and one noise predictor (vertical axis), the vanilla neighborhood (white circle) puts equal weight on the noise direction and the relevant horizontal direction; the weighted (orange) neighborhood gives zero weight to the noise.

![[IrrelevantPredictorKNN.png#invert|w50|center]]

- More generally, in the space of predictors, we should *discount directions along which the local class probabilities stay roughly constant -- directions parallel to the class boundary*; along these directions the local class means coincide, so the between-class variance is near zero.

The **discriminant adaptive nearest neighbors (DANN)** procedure implements this re-weighting using locally estimated within-class and between-class covariances.

> [!connection] Connection: [[Linear Discriminant Analysis and Generalizations#Linear Discriminant Analysis|LDA]]
> Recall the (squared) **Mahalanobis distance** from LDA, measuring the distance from a point $x$ to a centroid $x_{0}$:
> $D(x,x_{0})=(x-x_{0})^{T}\boldsymbol{\Sigma}^{-1}(x-x_{0}),$
> where $\boldsymbol{\Sigma}$ is the covariance matrix of the class's distribution.

Differences in predictors with high variance are scaled down by $\boldsymbol{\Sigma}^{-1}$. Similarly, DANN uses a suitable $\boldsymbol{\Sigma}$ to down-scale distances in irrelevant directions. Consider
$\begin{align*} \boldsymbol{\Sigma}_{\mathrm{DANN}?}&= \mathbf{W}^{-1}\mathbf{B}\mathbf{W}^{-1}=\mathbf{W}^{-1/2}\mathbf{B}^{*}\mathbf{W}^{-1/2}\\[0.6em] D(x,x_{0};\boldsymbol{\Sigma}_{\mathrm{DANN}?})&= (x-x_{0})^{T}\boldsymbol{\Sigma}_{\mathrm{DANN}?}(x-x_{0})\\[0.4em] &= \| \mathbf{B}^{*1/2}\mathbf{W}^{-1/2}(x-x_{0}) \|_{2}^{2} \end{align*}$
where $\mathbf{W},\mathbf{B}$ are the within-class and between-class covariance matrices, and $\mathbf{B}^{*}=\mathbf{W}^{-1/2}\mathbf{BW}^{-1/2}$ is the between-class covariance after **sphering** the data with $\mathbf{W}^{-1/2}$. Hence this distance measures *the length of $(x-x_{0})$ after sphering and re-weighting by the between-class variances*.
- The matrices $\mathbf{W},\mathbf{B}$ are estimated locally, e.g. from the $50$ observations closest to $x_{0}$ in Euclidean distance.
- Note that the zero-eigenvectors of $\mathbf{B}^{*}$ are *the irrelevant directions, along which the sphered class means do not vary (zero between-class variance) -- they get a weight of $0$*.

In practice, the down-weighting is softened with a parameter $\epsilon$ (usually $1$), giving the DANN metric
$\begin{align*} \boldsymbol{\Sigma}_{\mathrm{DANN}}(\epsilon)&= \mathbf{W}^{-1/2}(\mathbf{B}^{*} + \epsilon \mathbf{I})\mathbf{W}^{-1/2}\\[0.6em] D(x,x_{0};\boldsymbol{\Sigma}_{\mathrm{DANN}}(\epsilon))&= (x-x_{0})^{T}\boldsymbol{\Sigma}_{\mathrm{DANN}}(\epsilon)(x-x_{0})\\[0.4em] &= \| (\mathbf{B}^{*} + \epsilon \mathbf{I})^{1/2}\mathbf{W}^{-1/2}(x-x_{0}) \|_{2}^{2} \end{align*}$
so the irrelevant directions still receive a weight of $\epsilon^{1/2}$ after sphering. As $\epsilon$ increases from $0$, the neighborhood shrinks from a region that is unbounded along the zero-weight directions to a finite hyperellipsoid.

![[DANN.png#invert|center|w60]]

The above are examples of neighborhoods found by DANN.
- *In homogeneous regions where the local $\mathbf{B}$ is near $0$, the neighborhoods are close to spheres*, i.e. only the sphering effect of $\mathbf{W}$ takes place, and no direction is otherwise preferred over the others.
- *Near class boundaries, $\mathbf{B}$ starts to dominate*, and the neighborhoods are stretched along the boundary.
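To make the local metric concrete, a rough NumPy sketch of a single DANN query. The choices below (estimating $\mathbf{W}$ and $\mathbf{B}$ from the $50$ Euclidean-nearest points with class-proportion weights, $\epsilon = 1$, a small eigenvalue floor when inverting $\mathbf{W}$, and a final majority vote over $k$ neighbors in the adapted metric) follow the description above but are only one reasonable implementation; `dann_predict` is a name made up for illustration, and predictors are assumed to be standardized beforehand.

```python
import numpy as np

def dann_predict(x0, X, g, k=5, n_local=50, eps=1.0):
    """One DANN query: adapt the metric locally around x0, then run k-NN in that metric."""
    # 1. Local neighborhood (Euclidean) used to estimate W and B.
    local = np.argsort(((X - x0) ** 2).sum(axis=1))[:n_local]
    Xl, gl = X[local].astype(float), g[local]
    p = X.shape[1]
    xbar = Xl.mean(axis=0)
    # 2. Within-class (W) and between-class (B) covariances, weighted by class proportions.
    W, B = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(gl):
        Xc = Xl[gl == c]
        pi_c = len(Xc) / len(Xl)
        W += pi_c * np.cov(Xc, rowvar=False, bias=True)
        d = (Xc.mean(axis=0) - xbar)[:, None]
        B += pi_c * d @ d.T
    # 3. DANN metric: Sigma = W^{-1/2} (B* + eps*I) W^{-1/2}, with B* = W^{-1/2} B W^{-1/2}.
    evals, evecs = np.linalg.eigh(W)
    W_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) @ evecs.T
    B_star = W_inv_sqrt @ B @ W_inv_sqrt
    Sigma = W_inv_sqrt @ (B_star + eps * np.eye(p)) @ W_inv_sqrt
    # 4. Majority vote among the k nearest points under (x - x0)^T Sigma (x - x0).
    diffs = X - x0
    dist = np.einsum("ij,jk,ik->i", diffs, Sigma, diffs)
    nearest = np.argsort(dist)[:k]
    classes, counts = np.unique(g[nearest], return_counts=True)
    return classes[np.argmax(counts)]
```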