Given a dataset $\mathbf{X},\mathbf{y}$, a **linear smoother** with hat matrix $\mathbf{S}$ gives predictions $\hat{\mathbf{y}} := \mathbf{S}\mathbf{y}.$ Here $\mathbf{S}$ is allowed to depend on $\mathbf{X}$ (and potentially other parameters) but not on $\mathbf{y}$. Examples include all [[Linear Regression Methods]], [[Localized Methods|local linear regression]], and [[Basis Expansion and Regularization|basis expansions]] used with a linear model.

### $l_{2}$ Loss in Kernel Regressions

The usual $l_{2}$ losses are: $\begin{align*} \mathrm{RSS}&= \| \mathbf{y}-\hat{\mathbf{y}} \|^{2}=\| \mathbf{y}-\hat{m}(\mathbf{x}) \|^{2} \\[0.4em] \text{modeling }\mathrm{MSE}(x)&= \mathbb{E}[(m(x)-\hat{m}(x))^{2}]\\[0.4em] \text{predictive }\mathrm{MSE}(x)&= \mathbb{E}[(Y-\hat{m}(x))^{2} ~|~ X=x] \end{align*}$ where $Y=m(X)+\epsilon$ is a new sample independent of the one used to train $\hat{m}$. Note that $\hat{m}$ itself is random, being fit on random training data.
- Predictive MSE is also known as the **out-of-sample** ($l_{2}$) error.

> [!exposition] Deriving RSS and modeling MSE
> As seen in [[Degree of Freedom#Special Form for Linear Estimators]], $\mathrm{RSS}=\| (I-\mathbf{S})\mathbf{y} \|^{2}=\| (I-\mathbf{S})m(\mathbf{x}) \|^{2}+2\left< (I-\mathbf{S})m(\mathbf{x}),\, (I-\mathbf{S})\pmb{\epsilon} \right>+\| (I-\mathbf{S})\pmb{\epsilon} \|^{2};$ the cross term has zero expectation since $\mathbb{E}[\pmb{\epsilon}]=\mathbf{0}$, so $\mathbb{E}[\mathrm{RSS}]=\| (I-\mathbf{S})m(\mathbf{x}) \|^{2}+\sigma^{2}(\mathrm{tr}(\mathbf{S}^{T}\mathbf{S})-2\,\mathrm{tr}\,\mathbf{S}+n).$
>
> ---
>
> Summing the modeling MSE over the training points, $\begin{align*} \sum_{i}\mathrm{MSE}(x_{i})&= \mathbb{E}[\| m(\mathbf{x})-\hat{m}(\mathbf{x}) \|^{2} ]\\ &= \mathbb{E}[\| (I-\mathbf{S})m(\mathbf{x})-\mathbf{S}\pmb{\epsilon} \|^{2} ]\\[0.8em] &= \| (I-\mathbf{S})m(\mathbf{x}) \| ^{2}+\sigma^{2}\mathrm{tr}(\mathbf{S}^{T}\mathbf{S}). \end{align*}$

The predictive MSE is $\begin{align*} \mathbb{E}[(Y-\hat{m}(x))^{2}~|~ X=x]&= \mathbb{E}[(m(x)-\hat{m}(x)+\epsilon)^{2} ]\\ &= \mathbb{E}[(m(x)-\hat{m}(x))^{2}]+\sigma^{2}, \end{align*}$ where the cross term vanishes because the new noise $\epsilon$ has mean zero and is independent of $\hat{m}$. Now we can apply the [[Point Estimators#^cf7800|bias-variance decomposition]] to $\mathbb{E}[(m(x)-\hat{m}(x))^{2}]$ (it could not be applied in the first step, since both $Y$ and $\hat{m}$ are random): $\mathbb{E}[(Y-\hat{m}(x))^{2}~|~ X=x]=\mathrm{bias}(m(x),\hat{m}(x))^{2}+\mathrm{Var}(\hat{m}(x))+\underset{\text{noise}}{\sigma^{2}}.$

> [!idea] Bias-Variance Tradeoff
> - A wiggly smoother has low bias but high variance;
> - an inflexible smoother has high bias but low variance;
> - and then there is the irreducible noise $\sigma^{2}$, about which we can do nothing.

### Examples of Linear Smoothers

[[Localized Methods]] offers a way of producing the weights $l_{j}(x)$ in $\hat{m}(x)=\sum_{j}l_{j}(x)\,y_{j}$ (so that $\mathbf{S}_{ij}=l_{j}(x_{i})$): fit a function $\hat{f}_{x}$ about $x$ (locality imposed by kernels), then evaluate $\hat{f}_{x}(x)=: \hat{m}(x)$.
- If the fitted value $\hat{f}_{x}(x)$ is linear in the responses, the smoother is linear. Of course this does not require the fitted function $\hat{f}_{x}: z\mapsto \hat{f}_{x}(z)$ itself to be linear in $z$. (A numerical sketch of such a weight matrix appears at the end of this section.)

Fitting [[Basis Expansion and Regularization|basis-expanded datasets]] generalizes simple OLS: examples include splines and wavelet smoothing.

Even regression [[Decision Trees]] can be cast as linear smoothers: say a tree $T(\Theta)$ has fitted regions and values $\Theta=\{ R_{k},\gamma_{k} \}_{k=1}^{K}$; then $\gamma_{k}$ is usually some linear combination of $\mathbf{y}$ -- most commonly the leaf average, $\gamma_{k}\propto\sum_{i}y_{i}\cdot\mathbf{1}\{ \mathbf{x}_{i} \in R_{k} \}$ -- so its predictions have the form $\hat{y}_{i}\propto\sum_{i'} y_{i'} \cdot\mathbf{1}\{ \mathbf{x}_{i},\mathbf{x}_{i'} \text{ in the same region} \},$ i.e. that of a linear smoother. ^182099
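
As a quick numerical check of the claim about trees, here is a minimal sketch (not from the source note: it assumes `numpy` and `scikit-learn` are available and uses simulated data). Given the fitted partition, an in-sample regression-tree prediction is the average of the responses in its leaf, i.e. $\hat{\mathbf{y}}=\mathbf{S}\mathbf{y}$ with $\mathbf{S}_{ii'}\propto \mathbf{1}\{ \mathbf{x}_{i},\mathbf{x}_{i'} \text{ in the same region} \}$:

```python
# Sketch: an in-sample regression-tree prediction equals the leaf average,
# so y_hat = S y with S[i, i'] = 1{same leaf} / (leaf size).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))                    # simulated data, for illustration
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

leaves = tree.apply(X)                                  # leaf id of each training point
same_leaf = leaves[:, None] == leaves[None, :]          # 1{x_i, x_i' in the same region}
S = same_leaf / same_leaf.sum(axis=1, keepdims=True)    # row-normalize -> leaf averages

assert np.allclose(S @ y, tree.predict(X))              # y_hat = S y
print(f"tr(S) = {np.trace(S):.1f} = number of leaves = {np.unique(leaves).size}")
```

One caveat worth keeping in mind: the partition itself is chosen using $\mathbf{y}$, so the tree is a linear smoother only conditionally on the fitted regions $\Theta$.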
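
To make the weight view of [[Localized Methods]] concrete, here is a minimal sketch of the hat matrix of a Nadaraya-Watson smoother (a `numpy`-only illustration; the Gaussian kernel, the bandwidths, and the simulated data are arbitrary choices, not from the source):

```python
# Sketch: Nadaraya-Watson weights l_j(x) are proportional to K((x - x_j)/h); stacking
# them over the training inputs gives the hat matrix S[i, j] = l_j(x_i), and y_hat = S y.
import numpy as np

def nw_hat_matrix(x, h):
    """Hat matrix of a Nadaraya-Watson smoother with a Gaussian kernel and bandwidth h."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)   # K((x_i - x_j)/h)
    return K / K.sum(axis=1, keepdims=True)                   # each row sums to 1

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(100)

S = nw_hat_matrix(x, h=0.05)     # S depends on x and h, but not on y
y_hat = S @ y                    # linear smoother: y_hat = S y

# tr(S) acts as effective degrees of freedom: smaller h -> wigglier fit -> larger tr(S).
for h in (0.02, 0.05, 0.2):
    print(f"h = {h}: tr(S) = {np.trace(nw_hat_matrix(x, h)):.1f}")
```

The wigglier (small-$h$) smoother has larger $\mathrm{tr}\,\mathbf{S}$, which is one way to quantify the flexibility side of the bias-variance tradeoff above.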
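
Finally, a Monte Carlo sanity check of the $\mathbb{E}[\mathrm{RSS}]$ and summed modeling-MSE identities derived above, again a sketch on simulated data with a Nadaraya-Watson hat matrix as the linear smoother (the design $\mathbf{x}$ is held fixed and only the noise $\pmb{\epsilon}$ is redrawn; all numbers here are illustrative):

```python
# Monte Carlo check, for a fixed design x and fixed smoother matrix S, of
#   E[RSS]         = ||(I - S) m(x)||^2 + sigma^2 (tr(S'S) - 2 tr(S) + n)
#   sum_i MSE(x_i) = ||(I - S) m(x)||^2 + sigma^2 tr(S'S)
import numpy as np

rng = np.random.default_rng(2)
n, h, sigma = 100, 0.1, 0.3
x = np.sort(rng.uniform(0, 1, n))
m = np.sin(2 * np.pi * x)                                   # true regression function m(x)

K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)     # Gaussian kernel weights
S = K / K.sum(axis=1, keepdims=True)                        # Nadaraya-Watson hat matrix

I = np.eye(n)
bias_term = np.sum(((I - S) @ m) ** 2)                      # ||(I - S) m(x)||^2
rss_theory = bias_term + sigma**2 * (np.trace(S.T @ S) - 2 * np.trace(S) + n)
mse_theory = bias_term + sigma**2 * np.trace(S.T @ S)

rss_mc, mse_mc = [], []
for _ in range(2000):                                       # redraw the noise, keep x fixed
    y = m + sigma * rng.standard_normal(n)
    y_hat = S @ y
    rss_mc.append(np.sum((y - y_hat) ** 2))
    mse_mc.append(np.sum((m - y_hat) ** 2))

print(f"E[RSS]         : theory {rss_theory:.2f}, Monte Carlo {np.mean(rss_mc):.2f}")
print(f"sum_i MSE(x_i) : theory {mse_theory:.2f}, Monte Carlo {np.mean(mse_mc):.2f}")
```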