> [!info] This material corresponds to chapter 6 of the ESL.

## 1D Local Kernels

Kernels in this chapter refer to "importance" functions $K(x;x_{0})$ that can introduce locality and weighting when regressing the value $f(x_{0})$.
- Locality: data that are too far away can be assigned an importance of $0$, effectively filtering them out.
- Weighting: more generally, we can assign a smaller importance to less relevant data.

The **Gaussian kernel** $\phi(x;x_{0},\sigma^{2})$ assigns lower importance to faraway data, where the rate of decay is determined by $\sigma^{2}$.
- The Gaussian kernel has non-compact support, hence does not perform strict localization.

### Kernels with Compact Support

For localization, kernels need to evaluate to $0$ for faraway data, so it is ideal to have a kernel that *has compact support and a parameter for its size*.

Compact supports are usually defined with a truncated distance function $D(r)$, where the size of the support is parametrized by $\lambda$ via a width function $h_{\lambda}(x;x_{0})$, usually in the form $K_{\lambda}(x;x_{0}):= D\left( \frac{|x-x_{0}|}{h_{\lambda}(x;x_{0})} \right)$ where $D(r)$ evaluates to $0$ when $r$ is large, so the larger $h_{\lambda}(x;x_{0})$, the larger the support.

The **Epanechnikov quadratic kernel** uses the distance $D(r):= \frac{3}{4}(1-r^{2})_{+}$ and $h_{\lambda}=\lambda$ constant over $x$, giving the kernel $K_{\lambda}(x;x_{0}):= \frac{3}{4}\left( 1-\frac{|x-x_{0}|^{2}}{\lambda^{2}} \right)_{+}$ so any data point $x_{i}$ farther than $\lambda$ units away from the regressed point $x_{0}$ is ignored.

Another choice of distance is the **tri-cube** function $D(r):= (1-|r|^{3})^{3}_{+}$, which has a flatter peak and is differentiable at the boundary of its support.

![[EpanechnikovKernel.png#invert]]
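A minimal NumPy sketch of these kernels as vectorized functions; the helper names (`epanechnikov`, `tricube`, `gaussian`) and signatures are ad hoc choices for illustration, not notation from the text.

```python
import numpy as np

def epanechnikov(x, x0, lam):
    """Epanechnikov quadratic kernel K_lambda(x; x0) with metric width lambda."""
    r = np.abs(x - x0) / lam
    return 0.75 * np.clip(1.0 - r**2, 0.0, None)   # (.)_+ truncation gives compact support

def tricube(x, x0, lam):
    """Tri-cube kernel: flatter top, differentiable at the boundary of its support."""
    r = np.abs(x - x0) / lam
    return np.clip(1.0 - r**3, 0.0, None) ** 3

def gaussian(x, x0, sigma):
    """Gaussian kernel: non-compact support, decay rate set by sigma.

    The normalizing constant is omitted; it cancels in kernel-weighted averages.
    """
    return np.exp(-0.5 * ((x - x0) / sigma) ** 2)
```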
**Metric window widths** are support sizes that are constant over $x$, e.g. that of the Epanechnikov quadratic kernel. They keep the bias roughly constant, but the variance is inversely proportional to the local data density.
- The $k$ nearest neighbor kernel is the reverse: it has constant variance and a bias that varies with the local data density.

The width of a kernel is usually determined by a parameter $\lambda$; examples include:
- $\lambda$ is the radius of the support of the Epanechnikov quadratic kernel and the tri-cube kernel.
- $\lambda=\sigma$ is the standard deviation of the Gaussian kernel.
- $\lambda=k / N$ is the proportion of data used in the $k$-nearest neighbors kernel.

*The choice of $\lambda$ is part of the [[Bias-Variance Tradeoff]]*:
- A smaller support samples only the few local data points, hence has higher variance but low bias.
- A larger support considers faraway data points, giving high bias but low variance.
- The optimal choice of $\lambda$ can be found with [[Cross Validation|CV]] or other model comparison methods.

## Regression with Kernels

Using the localizing capability of kernels, we can fit a **scatterplot smoother** $\hat{y}(\cdot): x \mapsto \hat{y}$ to model/smooth the noisy response $\mathbf{y}$. In particular, for each input $x$, we fit a function $\hat{f}_{x}$ that follows $\mathbf{y}$ locally around $x$, then evaluate it at $x$ to get the estimate: $\text{input }x \to \text{locally fit } \hat{f}_{x}(z) \to \text{return }\hat{y}(x):=\hat{f}_{x}(x)$

As a special case, if the local fits are all linear in the responses, i.e. $\forall i$ there exist weights $S_{i} \in \mathbb{R}^{1\times n}$ such that the local fit is the weighted average $\hat{f}_{x_{i}}(x_{i})=S_{i}\mathbf{y},$ then stacking them vertically into a matrix $\mathbf{S}$ gives the linear map $\hat{\mathbf{y}}=\mathbf{S}\mathbf{y}.$

### Kernel-Weighted Average

The simplest way of regressing with kernels is to *interpret the kernel as a weight*, then compute a weighted average. This gives the **Nadaraya–Watson kernel-weighted average**: $\hat{y}(x)=\frac{\sum_{i}K(x_{i};x)\cdot y_{i}}{\sum_{i}K(x_{i};x)}$ and in the case of the $k$ nearest neighbors kernel, this is just the unweighted mean of $y_{i}$ over those $k$ neighbors.
- Implicitly, the local fit is the constant function $\hat{f}_{x}(z)\equiv \frac{\sum_{i}K(x_{i};x)\cdot y_{i}}{\sum_{i}K(x_{i};x)}~(\forall z)$, and the linear map has entries $s_{ij} \propto K(x_{j};x_{i})$, normalized so each row sums to one.
- **$k$ nearest neighbor** regression can be formulated as a weighted average with the kernel $K(x_{i};x)=\mathbb{1}\{x_{i} \in N_{k}(x)\},$ where $N_{k}(x)$ is the set of the $k$ data points closest to $x$, so the kernel evaluates to $0$ for all non-neighbors.

*NW weighted averages are biased at boundaries, and in general in regions with unevenly distributed data*, where the kernel will be asymmetric:

![[BoundaryBiasNW.png#invert]]
- Here NW overestimates the value because the values to the right are higher due to the locally positive slope, but there are few values to the left to balance them out.
- To eliminate this bias, we need a higher-order regression -- in particular one that can capture local trends.

> [!idea]
> Locally weighted averages like Nadaraya–Watson are just a special case of **local polynomial regressions**: they use polynomials of order $0$ -- the intercept.
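A sketch of the Nadaraya–Watson average, reusing one of the hypothetical kernel helpers from the earlier block.

```python
import numpy as np

def nadaraya_watson(x0, x, y, kernel, lam):
    """Kernel-weighted average: hat{y}(x0) = sum_i K(x_i; x0) y_i / sum_i K(x_i; x0)."""
    w = kernel(x, x0, lam)            # importance of each observation
    return np.sum(w * y) / np.sum(w)  # assumes at least one x_i lies inside the support

# Example usage with the epanechnikov helper defined earlier:
# x = np.sort(np.random.rand(100)); y = np.sin(4 * x) + 0.3 * np.random.randn(100)
# yhat = np.array([nadaraya_watson(x0, x, y, epanechnikov, lam=0.2) for x0 in x])
```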
### Local Linear Regression

**Local linear regression** *fits a linear function $f_{x}(z):z \mapsto\theta_{0}(x)+\theta_{1}(x)(z-x)$ around $x$ to remove such bias to the first order*. The local fit is enforced by weighting the squared errors with the kernel function.

> [!definition|*] Local Linear Regression
> **Local linear regression** models the local pattern of $y(x)$ with a linear function $\hat{f}_{x}: z \mapsto \theta_{0} + \theta_{1}(z-x).$
>
> The coefficients $\theta(x)=(\theta_{0}(x), \theta_{1}(x))$ are solved with the **weighted OLS problem**
> $\theta(x) :=\underset{\theta=(\theta_{0}, \theta_{1})}{\arg\min}\sum_{i}\underset{\text{weight}}{K(x_{i};x)}\cdot \underset{\text{squared error}}{( \theta_{0}+\theta_{1} (x_{i}-x)-y_{i} )^{2}},$ where the regressed value $\hat{y}(x)$ is then $\hat{f}_{x}(x)=\theta_{0}(x)$.

> [!exposition] Solving the weighted least squares
> The weighted least squares problem has weights $W(x):= \mathrm{diag}(K(x_{i};x) ~|~ i=1,\dots,n),$ so the weighted OLS is equivalent to the unweighted OLS $\theta(x)= \underset{{\theta}}{\arg\min}~\| W^{1 / 2} \mathbf{X}\theta - W^{1 / 2} \mathbf{Y}\|_{2}^{2},$ i.e. fitting the response $W^{1 / 2}\mathbf{Y}$ with variables $W^{1 / 2}\mathbf{X}$. Here $\mathbf{X}$ is the **local design matrix** with rows $\mathbf{X}_{i, :}=[1, x_{i}-x]$.
>
> Using standard OLS results, the fitted coefficients $\theta$ are given by $\theta=(\mathbf{X}^{T}W \mathbf{X})^{-1}\mathbf{X}^{T}W\mathbf{Y}.$

The fitted value $\hat{y}(x)=\hat{f}_{x}(x)=\theta_{0}(x)$ is linear in the responses: writing $\mathbf{L}:= (\mathbf{X}^{T}W\mathbf{X})^{-1}\mathbf{X}^{T}W$, with $L(x)=(l_{1}(x),\dots,l_{n}(x))$ being its first row, $\hat{y}(x)=\theta_{0}(x)=L(x)\mathbf{Y}=\sum_{i}l_{i}(x)y_{i},$ where the weights $l_{i}(x) \in \mathbb{R}$ are independent of $\mathbf{Y}$. These weights define the **equivalent kernel** $L_{0}(\cdot~;x)$, making local linear regression a kernel smoothing method.
- To convert to matrix form, stack the rows $L(x_{1}),\dots,L(x_{n})$ into a matrix $\mathbf{S}$ and we have $\hat{\mathbf{Y}}=\mathbf{S}\mathbf{Y}$.

### Local Polynomial Regression

Local linear regression "trims the hills" and "fills the valleys", a result of its second-order (curvature) bias:

![[CurvatureBias.png#invert]]
- It fails to capture regions where the local trend changes, i.e. second-order changes.

To deal with this issue, we can expand the basis and solve the weighted least squares problem with order-$p$ local polynomials:

> [!definition|*] Local Polynomial Regression
> **Local polynomial regression** models the relationship between $x$ and $y$ with a locally fitted order-$p$ polynomial $\hat{f}_{x}:z \mapsto\theta_{0}+\theta_{1}(z-x)+\cdots + \theta_{p}(z-x)^{p},$ where the coefficients solve the weighted OLS
> $\theta(x) :=\underset{\theta=(\theta_{0},\dots,\theta_{p})^{T}}{\arg\min}\sum_{i}K(x_{i};x)\cdot ( \mathbf{X}_{i, :}\theta-y_{i} )^{2},$ where $\mathbf{X}_{i,:}=\left[1, (x_{i}-x), (x_{i}-x)^{2},\dots,(x_{i}-x)^{p}\right]$ is the $i$th row of the local **design matrix**.
>
> Again the fitted value is $\hat{y}(x)=\theta_{0}(x)$, though of course its value is in general different from that fitted by local linear regression.

By the same argument as in local linear regression, the fitted values $\hat{\mathbf{Y}}$ are weighted averages of the original $\mathbf{Y}$, given by $\hat{\mathbf{Y}}=\mathbf{S}\mathbf{Y}$.
- As a bias-variance tradeoff, higher-order local regressions have vastly larger variance near the boundaries.
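A sketch of this weighted least-squares solve, assuming the kernel helpers from the earlier blocks; `p=1` gives local linear regression and `p=0` recovers the Nadaraya–Watson average.

```python
import numpy as np

def local_polynomial_fit(x0, x, y, kernel, lam, p=1):
    """Order-p local polynomial regression evaluated at x0.

    Returns hat{y}(x0) = theta_0(x0) together with the equivalent-kernel weights
    l_i(x0), i.e. the first row of (X^T W X)^{-1} X^T W for the local design matrix X.
    Assumes at least p + 1 points carry nonzero kernel weight, else X^T W X is singular.
    """
    X = np.vander(x - x0, N=p + 1, increasing=True)  # rows [1, (x_i - x0), ..., (x_i - x0)^p]
    w = kernel(x, x0, lam)                           # kernel weights K(x_i; x0)
    XtW = X.T * w                                    # X^T W via broadcasting
    L = np.linalg.solve(XtW @ X, XtW)                # (X^T W X)^{-1} X^T W
    l = L[0]                                         # equivalent kernel: weights on y
    return l @ y, l
```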
## Evaluating Kernel Regressions

In general, the quality of fit of kernel regressions can be evaluated easily with [[Model Error Metrics#Cross Validation|cross validation]], where their linear relationship $\hat{\mathbf{y}}=\mathbf{Sy}$ allows efficient computation of the LOOCV error.

### $l_{2}$ Loss in Kernel Regressions

The usual $l_{2}$ losses are: $\begin{align*} \mathrm{RSS}&= \| \mathbf{y}-\hat{\mathbf{y}} \|^{2}=\| \mathbf{y}-\hat{m}(\mathbf{x}) \|^{2} \\[0.4em] \text{modeling }\mathrm{MSE}(x)&= \mathbb{E}[(m(x)-\hat{m}(x))^{2}]\\[0.4em] \text{predictive }\mathrm{MSE}(x)&= \mathbb{E}[(Y-\hat{m}(x))^{2} ~|~ X=x] \end{align*}$ where $Y=m(X)+\epsilon$ is a new sample independent from the one used to train $\hat{m}$. Note that here $\hat{m}$ is random.
- Predictive MSE is also known as the **out-of-sample** ($l_{2}$) error.

> [!exposition] Deriving RSS and modeling MSE
> As seen in [[Degree of Freedom#Special Form for Linear Estimators]], $\mathrm{RSS}=\| (I-\mathbf{S})\mathbf{y} \|^{2}=\| (I-\mathbf{S})m(\mathbf{x}) + (I-\mathbf{S})\pmb{\epsilon} \|^{2},$ where the cross term $\left< (I-\mathbf{S})m(\mathbf{x}), (I-\mathbf{S})\pmb{\epsilon} \right>$ has expectation $0$; hence $\mathbb{E}[\mathrm{RSS}]=\| (I-\mathbf{S})m(\mathbf{x}) \|^{2}+\sigma^{2}(\mathrm{tr}(\mathbf{S}^{T}\mathbf{S})-2\,\mathrm{tr}\,\mathbf{S}+n).$
>
> ---
>
> The (summed) modeling MSE is just $\begin{align*} \sum_{i}\mathrm{MSE}(x_{i})&= \mathbb{E}[\| m(\mathbf{x})-\hat{m}(\mathbf{x}) \|^{2} ]\\ &= \mathbb{E}[\| (I-\mathbf{S})m(\mathbf{x})-\mathbf{S}\pmb{\epsilon} \|^{2} ]\\[0.8em] &= \| (I-\mathbf{S})m(\mathbf{x}) \| ^{2}+\sigma^{2}\mathrm{tr}(\mathbf{S}^{T}\mathbf{S}). \end{align*}$

The predictive MSE is $\begin{align*} \mathbb{E}[(Y-\hat{m}(x))^{2}~|~ X=x]&= \mathbb{E}[(m(x)-\hat{m}(x)+\epsilon)^{2} ]\\ &= \mathbb{E}[(m(x)-\hat{m}(x))^{2}]+\sigma^{2} \end{align*}$ Now we can apply the [[Point Estimators#^cf7800|bias-variance decomposition]] on $\mathbb{E}[(m(x)-\hat{m}(x))^{2}]$ (we can't apply it in the first step, since both $Y$ and $\hat{m}$ are random): $ =\mathrm{bias}(m(x),\hat{m}(x))^{2}+\mathrm{Var}(\hat{m}(x))+\underset{\text{noise}}{\sigma^{2}}.$

> [!idea] Bias-Variance Tradeoff
> - A wiggly smoother has low bias but high variance;
> - An inflexible smoother has high bias but low variance;
> - Then there is the noise that we can do nothing about.
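A sketch that assembles the smoother matrix $\mathbf{S}$ from the equivalent-kernel rows (reusing the hypothetical `local_polynomial_fit` helper above; the weights do not depend on the response, so a placeholder $\mathbf{y}$ is passed) and evaluates the two expectations derived above for a known $m$, purely for illustration.

```python
import numpy as np

def smoother_matrix(x, kernel, lam, p=1):
    """Stack the equivalent-kernel rows L(x_1), ..., L(x_n) into S, so that hat{y} = S y."""
    dummy = np.zeros_like(x)  # the weights l_i(x) do not depend on the response
    return np.stack([local_polynomial_fit(x0, x, dummy, kernel, lam, p)[1] for x0 in x])

def expected_rss_and_mse(S, m_x, sigma2):
    """E[RSS] and the summed modeling MSE for a known mean vector m(x) (illustration only)."""
    n = S.shape[0]
    r = m_x - S @ m_x  # (I - S) m(x)
    e_rss = r @ r + sigma2 * (np.trace(S.T @ S) - 2 * np.trace(S) + n)
    e_mse = r @ r + sigma2 * np.trace(S.T @ S)
    return e_rss, e_mse
```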
### Bias of Local Polynomial Regressions

A simple weighted average like the NW estimate with a Gaussian kernel misses the local trend; local linear regression misses second-order changes; it seems that *increasing the complexity of the local model successively removes biases*. This is a consequence of *an order-$p$ local polynomial regression being unbiased up to order $p$.* More precisely:

> [!theorem|*] Order-$p$ bias of the local polynomial regression
> Let $m(x)=\mathbb{E}[Y ~|~X=x]$ be the true relationship between $X$ and $Y$, and $\hat{m}$ be that fitted by an order-$p$ local polynomial; then $\mathrm{bias}(\hat{m}(x), m(x))=\sum_{i} l_{i}(x) \frac{m^{(p+1)}(\xi_{i})}{(p+1)!}(x_{i}-x)^{p+1}$ for some $\xi_{i}$ between $x$ and $x_{i}$. Note that *the first $p$ orders of the bias are removed.*

> [!proof]- Proofs of the order of bias
> First note that for any $x$ and any local polynomial regression, the weights $L(x)$ applied to the responses sum to $1$.
> > [!proof]- Proof that the weights sum to 1
> > Keep $X$ as is, but consider $Y=(1,\dots,1)$. Since the weights do not depend on $Y$, the conclusion carries over to the original dataset.
> >
> > Since a flat line with coefficients $\theta=(1,0,\dots,0)$ perfectly fits the data, it must be the solution to the weighted OLS that defines the local polynomial regression.
> >
> > This means that for any point $x$, computing $\hat{f}_{x}$ will give the constant function $1$; at the same time, $\hat{f}_{x}(x)=\sum_{i}l_{i}(x)y_{i}=\sum_{i}l_{i}(x)$ since $\forall i,~y_{i}=1$ in this case. Hence $\sum_{i}l_{i}(x)=1$.
>
> Now the main result follows from the fact that the fitted weights $L(x)$ are orthogonal to any polynomial in $(z-x)$ of order $k$, $1 \le k \le p$, with no constant term, evaluated at $x_{1},\dots,x_{n}$.
>
> That is, fixing $x$, any polynomial of the form $m(z)=a_{1}(z-x)+\dots+a_{p}(z-x)^{p}$ satisfies $\sum_{i}l_i(x)m(x_{i})=0.$
> > [!proof]- Proof of orthogonality
> > Imitate the proof of the weights summing to $1$: consider the dataset $Y=(m(x_{1}),\dots,m(x_{n}))$.
> >
> > Since the order $k \le p$, the weighted OLS will yield the coefficients $\theta(x)=(0, a_{1},\dots, a_{p})$, which interpolate the data exactly.
> >
> > Then the fitted value is $\hat{f}_{x}(x)=\theta_{0}=0$. Hence $\hat{f}_{x}(x)=\sum_{i}l_{i}(x)y_{i}=\sum_{i}l_{i}(x)m(x_{i})=0.$
>
> Then any bias of order $k\le p$ is eliminated by an order-$p$ local polynomial regression. This is because the bias is of the form $\begin{align*} \mathbb{E}&[\hat{y}(x)-m(x)]\\[0.4em] &= \mathbb{E}\left[ \sum_{i}l_{i}(x)y_{i} \right]-m(x)\\ &= \left[ \sum_{i}l_{i}(x)\mathbb{E}[y_{i}] \right]-m(x)\\ &= \left[ \sum_{i}l_{i}(x)\cdot (m(x)+m'(x)(x_{i}-x)+\cdots) \right]-m(x), \end{align*}$ where the non-constant terms up to order $p$ (i.e. $m'(x)(x_{i}-x), \frac{m''(x)}{2}(x_{i}-x)^{2}$, etc.) are eliminated by the orthogonality of the weights $L(x)$. The constant terms $l_{i}(x)m(x)$ sum to $m(x)$, and cancel out with the $-m(x)$.

That being said, the leading term might still be small: for the NW weighted average, expanding $m$ to the second order, $\begin{align*} \mathrm{bias}&(\hat{m}(x), m(x))\\[0.4em] &= \frac{1}{\sum_{j}K(x_{j};x)}\sum_{i}K(x_{i};x)\cdot [m'(x)(x_{i}-x)+O((x_{i}-x)^{2})].\\ \end{align*}$ Therefore, *for symmetric kernels where $K(x-\Delta x; x)=K(x+\Delta x;x)$ applied to data symmetrically distributed about $x$, the linear bias (almost) cancels out*. As seen in the [[#Kernel-Weighted Average]] section:
- The linear bias is small when $x$ is an **interior point** of the dataset,
- But near the fringe of the dataset there aren't enough data points on one side to counterbalance the other, leading to the fit "flattening out" the local linear trend.

The same holds for an order-$p$ local polynomial regression:
- If $p$ is even, balanced data can diminish the $(p+1)$th-order bias.
- If $p$ is odd however (in particular for local linear regression), the $(p+1)$th order has terms $(x_{i}-x)^{p+1}$, which are equal (instead of negated) between the left and right of $x$.

### Estimating MSE of Local Polynomial Regressions

Although the predictive MSE has the form derived above, it is of little practical use since we don't know $m$. An alternative is [[Cross Validation#LOOCV for Linear Smoothers|LOOCV]]. That is, to measure the quality of a smoother (e.g. for a specific choice of the bandwidth $h$), we estimate its $\mathrm{MSE}_{h}$ with

> [!definition|*] LOOCV Error of Linear Smoother
> $\widehat{\mathrm{MSE}}_{h}=\mathrm{LOOCV}_{h}=\frac{1}{n}\sum_{i}(y_{i}-\hat{m}^{(-i)}_{h}(x_{i}))^{2},$ where $\hat{m}_{h}^{(-i)}$ is the smoother fitted on $(\mathbf{x}_{-i}, \mathbf{y}_{-i})$, i.e. the dataset without the $i$th observation.

- Of course LOOCV can also be used to compare other hyperparameters like the choice of the kernel $K$ or the degree $p$, or even to compare kernel regression with other regression methods.

Moreover, for many kernels the local regressions are linear smoothers of the response, so the LOOCV MSE has a simpler form that is cheap to compute (no need to fit one model per data point):

![[Cross Validation#^61f6c1]]

And to further simplify computations, we can use the **GCV** given by

![[Cross Validation#^e66897]]
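The transcluded blocks are not reproduced here; assuming they state the usual shortcut identities for linear smoothers, namely $\mathrm{LOOCV}_{h}=\frac{1}{n}\sum_{i}\left(\frac{y_{i}-\hat{m}_{h}(x_{i})}{1-S_{ii}}\right)^{2}$ with GCV replacing each $S_{ii}$ by $\mathrm{tr}(\mathbf{S})/n$, a minimal sketch of the computation is:

```python
import numpy as np

def loocv_gcv(S, y):
    """LOOCV and GCV errors of a linear smoother hat{y} = S y via the shortcut formulas."""
    yhat = S @ y
    n = len(y)
    loocv = np.mean(((y - yhat) / (1.0 - np.diag(S))) ** 2)
    gcv = np.mean(((y - yhat) / (1.0 - np.trace(S) / n)) ** 2)
    return loocv, gcv

# Typical use with the earlier hypothetical helpers: pick the bandwidth minimizing LOOCV, e.g.
# best_lam = min(lams, key=lambda lam: loocv_gcv(smoother_matrix(x, epanechnikov, lam), y)[0])
```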