> [!warning] Multiple Equivalent Definitions
> Obviously such a term has multiple definitions (mostly off by a constant factor like $n$ or $\sigma^{2}_{Y}$). Mathematicians really suck at this.

The quantity is straightforward for OLS: it's just $p$, the number of coefficients / predictors in the model. This is no longer the case for regularized models like [[Linear Regression Methods#Ridge Regression|ridge]] and [[Linear Regression Methods#The Lasso|lasso]] regression, where a model with $p$ predictors has fewer than $p$ degrees of freedom. This motivates the basic definition

> [!definition|*] Degree of Freedom
> $\mathrm{df}:= \frac{1}{\sigma^{2}}\mathrm{tr}(\mathrm{Cov}(\mathbf{y}, \hat{\mathbf{y}}))=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}(y_{i}, \hat{y}_{i}).$
> Here $\sigma^{2}=\mathrm{Var}(y_{i})$, and $\mathrm{Cov}(\mathbf{y}, \hat{\mathbf{y}})$ is the $n\times n$ matrix whose $i,j$ entry is $\mathrm{Cov}(y_{i}, \hat{y}_{j})$. Therefore *the degree of freedom measures the amount of influence an observation has in predicting itself*.

- Note that we do not need to scale by $1 / n$ (though doing so doesn't hurt), since $\mathrm{Cov}(y_{i}, \hat{y}_{i})$ in general decreases as $n$ increases. For example, in global fits like OLS each individual $y_{i}$ has less influence when there are more data points.

## Special Forms for Linear Estimators

For linear estimators $\mathbf{y \mapsto Hy}$ ($\mathbf{H}$ being the hat matrix), $\mathrm{Cov}(\mathbf{y}, \hat{\mathbf{y}})=\mathrm{Cov}(\mathbf{y}, \mathbf{Hy})=\sigma^{2}\mathbf{H}^{T}$, so this definition is just $\mathrm{df}= \mathrm{tr}\,\mathbf{H}$.

- This agrees with the OLS count, since $\mathbf{H}=\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$ is the projection onto the $p$-dimensional column space of $\mathbf{X}$ and hence has trace $p$.
- For ridge regression, the hat matrix is $\mathbf{X}(\mathbf{X}^{T}\mathbf{X} + \lambda I)^{-1}\mathbf{X}^{T}$, so $\begin{align} \mathrm{df}&=\mathrm{tr} (\mathbf{X}(\mathbf{X}^{T}\mathbf{X} + \lambda I)^{-1}\mathbf{X}^{T})\\ &=\mathrm{tr}(\mathbf{X}^{T}\mathbf{X}(\mathbf{X}^{T}\mathbf{X} + \lambda I)^{-1}) \\ &=\mathrm{tr}(I-\lambda(\mathbf{X}^{T}\mathbf{X} + \lambda I)^{-1}) \\ &= p-\lambda\sum_{j=1}^{p} \frac{1}{j\text{th eigenvalue of }(\mathbf{X}^{T}\mathbf{X}+\lambda I)}\\ &=p- \sum_{j=1}^{p} \frac{\lambda}{\sigma^{2}_{j}+\lambda} &({\dagger})\\ &=\sum_{j=1}^{p} \frac{\sigma_{j}^{2}}{\sigma_{j}^{2}+\lambda}, \end{align}$ where $\sigma_{j}$ are the singular values of $\mathbf{X}$ (so $\sigma_{j}^{2}$ are the eigenvalues of $\mathbf{X}^{T}\mathbf{X}$). Therefore in $({\dagger})$, we see that ridge regression reduces the $\mathrm{df}$ below the OLS value $p$ by the second term; see the numerical sketch after this list. ^918055
- [[Linear Regression Methods#Least Angle Regression|Least angle regression]] increases $\mathrm{df}$ by exactly $1$ for each step it takes (i.e. for each variable added to the active set).
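As a quick numerical check of the ridge formula above (a minimal sketch, not from the note; the toy data and variable names are illustrative), the trace of the ridge hat matrix matches the singular-value closed form and stays strictly below $p$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0          # illustrative sizes and penalty
X = rng.standard_normal((n, p))

# Effective df of ridge as the trace of the hat matrix
# H_lam = X (X^T X + lam I)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
df_trace = np.trace(H)

# Equivalent closed form via the singular values of X:
# df = sum_j sigma_j^2 / (sigma_j^2 + lam).
sigma = np.linalg.svd(X, compute_uv=False)
df_svd = np.sum(sigma**2 / (sigma**2 + lam))

print(df_trace, df_svd)  # equal up to rounding, and strictly below p = 5
```

As $\lambda \to 0$ the sum tends to $p$, recovering the OLS count; as $\lambda \to \infty$ it tends to $0$.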
More generally, the $\mathrm{df}$ does not necessarily have a closed form (e.g. for lasso), but it can be estimated by methods like the [[Bootstraps|bootstrap]]; see [[Computer Age Statistical Inference|CASI]] p. 224.

### Residual Degree of Freedom

Suppose the data is generated by the model $Y=m(X)+\epsilon$, with $\epsilon$ being i.i.d. noise with $\mathbb{E} [\epsilon]=0$ and $\mathrm{Var}(\epsilon)=\sigma^{2}$. The errors/residuals of the model's estimates are captured by $\mathrm{RSS}(\mathbf{Y}, \hat{\mathbf{Y}})= \| \mathbf{Y}-\hat{\mathbf{Y}} \| _{2}^{2}=\| (I-\mathbf{H})\mathbf{Y} \|_{2}^{2} .$

Slightly abusing the bias-variance decomposition, treat $-(I-\mathbf{H})\epsilon$ as a random variable approximating $\mathbf{e}:=(I-\mathbf{H})m(\mathbf{x})$; then $\begin{align*} \mathbb{E}[\mathrm{RSS}(\mathbf{Y}, \hat{\mathbf{Y}})]&= \mathbb{E}\| \mathbf{Y}-\mathbf{HY} \|^{2} \\[0.4em] &= \mathbb{E} \| (I-\mathbf{H})(m(\mathbf{x}) + \epsilon) \| ^{2}\\[0.4em] &= \mathbb{E}[\mathrm{RSS}(\mathbf{e}, -(I-\mathbf{H})\epsilon)]\\[0.4em] &= \mathrm{bias}(\mathbf{e},-(I-\mathbf{H})\epsilon)^{2}+\mathrm{Var}((I-\mathbf{H})\epsilon)\\[0.4em] &= \| (I-\mathbf{H})m(\mathbf{x}) \|^{2}+ \mathrm{tr}[(I-\mathbf{H})^{T}(I-\mathbf{H})]\cdot \sigma^{2}\\[0.4em] &= \| (I-\mathbf{H})m(\mathbf{x}) \|_{2}^{2} + (n-2\mathrm{tr}(\mathbf{H}) + \mathrm{tr}(\mathbf{H}^{T} \mathbf{H}))\cdot\sigma^{2}. \end{align*}$

So, defining the **residual degree of freedom** as *$\mathrm{df}^{\ast}:=n-2\mathrm{tr}(\mathbf{H})+\mathrm{tr}(\mathbf{H}^{T}\mathbf{H})$*, the expected RSS is $\mathbb{E}[\mathrm{RSS}(\mathbf{Y}, \hat{\mathbf{Y}})]= \underbrace{\| (I-\mathbf{H})m(\mathbf{x}) \|_{2}^{2}}_{\text{squared bias estimating }m} + \underbrace{\mathrm{df}^{\ast}\cdot\sigma^{2}}_{\text{wiggling}},$ and, whenever the bias term vanishes (e.g. a correctly specified OLS model), $\widehat{\sigma^{2}}:= \mathrm{RSS} / \mathrm{df}^{\ast}$ is an unbiased estimate of $\sigma^{2}$. In particular for idempotent estimators (where $\mathbf{H}^{T}\mathbf{H}=\mathbf{H}$) like OLS, $\mathrm{df}^{\ast}$ simplifies to $\mathrm{df}^{\ast}=n-\mathrm{tr}(\mathbf{H})=n-\mathrm{df}$, i.e. the $\mathrm{df}$ of the [[Inference in OLS#T-Statistic for OLS Coefficients|t-distribution used for OLS inference]].

### Generalization of Model Complexity Metrics

Many model-selection metrics (see [[Model Error Metrics|in-sample error]]) include a measure of model complexity, often via the number of parameters $d$. They generalize nicely with the effective number of parameters: replacing $d$ with $\mathrm{df}$, Mallows' $C_{p}$ can be generalized as $\begin{align*} C_{p}&= \overline{\mathrm{err}}+\frac{2}{|\mathcal{T}|}\,\mathrm{df}\cdot\hat{\sigma}^{2}_{\epsilon}\\[0.2em] &= \overline{\mathrm{err}}+\frac{2}{|\mathcal{T}|}\sum_{y_{i} \in \mathcal{T}}\mathrm{Cov}(y_{i},\hat{y}_{i}). \end{align*}$
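To make the last two sections concrete, here is a minimal sketch (assumptions: toy Gaussian data, a hypothetical penalty $\lambda$, and $|\mathcal{T}|=n$; none of this is from the note) that estimates $\sigma^{2}$ via $\mathrm{RSS}/\mathrm{df}^{\ast}$ from an OLS fit and plugs $\mathrm{df}=\mathrm{tr}\,\mathbf{H}_{\lambda}$ into the generalized $C_{p}$ for a ridge fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 8, 5.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)  # true sigma = 0.5

# OLS: H is a projection (H^T H = H), so df* = n - tr(H) = n - p
# and sigma_hat^2 = RSS / df* estimates the noise variance.
H_ols = X @ np.linalg.solve(X.T @ X, X.T)
rss_ols = np.sum((y - H_ols @ y) ** 2)
df_res = n - np.trace(H_ols)            # = n - p up to rounding
sigma2_hat = rss_ols / df_res

# Ridge fit: df = tr(H_lam) < p, plugged into the generalized C_p.
H_lam = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
err_bar = np.mean((y - H_lam @ y) ** 2)
cp = err_bar + (2.0 / n) * np.trace(H_lam) * sigma2_hat

print(df_res, sigma2_hat, np.trace(H_lam), cp)
```

For the OLS fit, $\mathrm{df}^{\ast}$ comes out to exactly $n-p$, matching the idempotent case above.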