> [!tldr] Pearson's R2
> In a regression setting where the response is $\mathbf{y}$ and predictions are $\hat{\mathbf{y}}$, the **Pearson's $R^{2}$** (or coefficient of determination) is defined to be $R^{2}:= \frac{\mathrm{ESS}}{\mathrm{TSS}}=1-\frac{\mathrm{RSS}}{\mathrm{TSS}}=\frac{\| \hat{\mathbf{y}}-\bar{y} \|^{2} }{\| \mathbf{y}-\bar{y} \|^{2} },$where $\mathrm{ESS}:= \| \hat{\mathbf{y}}-\bar{y} \|^{2}$ is the explained sum of squares.

- In OLS $\mathbf{y} \sim \beta_{0}+\mathbf{X}\beta$ (assuming $\mathbf{X}$ does not contain an intercept-column), this is closely related to the [[Inference in OLS#The F-Statistic|F-statistic]] for testing $H_{0}:\beta=\mathbf{0}$ against $H_{1}:\lnot H_{0}$.
- This is also related to the [[Likelihood Ratios|likelihood ratio]] of the above hypotheses under the Gauss-Markov Model.
- In the one-predictor model $Y\sim 1+X$, it is just $\hat{\rho}_{XY}^{2}$.

Alternatively, it measures the empirical correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$:

> [!theorem|*] $R^2$ as correlation coefficient
> Pearson's $R^{2}$ equals the square of the empirical correlation coefficient between $\mathbf{y}$ and $\hat{\mathbf{y}}$:
> $R^{2}=\Bigg( \underbrace{ \frac{\left< \mathbf{y}-\bar{y},\hat{\mathbf{y}}-\bar{y} \right> }{\| \mathbf{y}-\bar{y} \|\cdot \| \hat{\mathbf{y}}-\bar{y} \| } \vphantom{\frac{1}{\frac{2}{\frac{3}{4}}}}}_{=: \hat{\rho}_{y\hat{y}}}\Bigg) ^{2}.$
>
> > [!proof]-
> > The numerator of $\hat{\rho}_{y\hat{y}}$ can be written as $\begin{align*}
> > \left< \mathbf{y}-\bar{y} ,\hat{\mathbf{y}}-\bar{y}\right> &= -\frac{1}{2}\left[\| (\mathbf{y}-\bar{y})-(\hat{\mathbf{y}}-\bar{y}) \|^{2} -\| \mathbf{y}-\bar{y} \|^{2}-\| \hat{\mathbf{y}}-\bar{y} \|^{2} \right]\\
> > &= \frac{\mathrm{TSS}+\mathrm{ESS}-\mathrm{RSS}}{2}\\
> > &= \mathrm{ESS},
> > \end{align*}$where the last equality uses the decomposition $\mathrm{TSS}=\mathrm{ESS}+\mathrm{RSS}$ (valid when the model contains an intercept).
> >
> > Now $\mathrm{RHS}$ becomes $\hat{\rho}^{2}_{y\hat{y}}=\mathrm{ESS}^{2} / (\mathrm{TSS} \cdot \mathrm{ESS})=\mathrm{ESS} / \mathrm{TSS}$, which is by definition $\mathrm{LHS}=R^{2}$.

- Therefore, if the prediction is highly correlated with the response it's predicting, it will have a high $R^{2}$, indicating a good fit (at least within the training sample).

### Distribution of the $R^{2}$

> [!theorem|*] Beta Distribution of the $R^2$
> Under the Gauss-Markov Model and the hypothesis $H_{0}:\beta=\mathbf{0}$ (i.e. there is a non-zero intercept, but no other predictor is related to $Y$), *$R^{2}$ has the [[beta distribution]]* $R^{2}\sim \mathrm{Beta}\left( \frac{p-1}{2},\frac{n-p}{2} \right),$where $p$ is the total number of predictors (including the intercept, so $\mathbf{X}$ has $p-1$ columns).
>
> > [!proof]-
> > Rewrite $R^{2}=\mathrm{ESS} / (\mathrm{ESS}+\mathrm{RSS})$; then the [[Inference in OLS#^a5aa2f|independence of RSS increments]] guarantees that $\mathrm{ESS}$ and $\mathrm{RSS}$ are independent $\sigma^{2}\chi^{2}$-distributed variables with $\mathrm{df}=p-1$ and $n-p$ respectively.
> >
> > Therefore $R^{2}$ can be written as $R^{2}=\frac{\chi^{2}_{p-1}}{\chi^{2}_{p-1}+\chi^{2}_{n-p}},$the two $\chi^{2}$s being independent. Since $\chi^{2}_{\nu}=\mathrm{Gamma}(\nu / 2, 1/2)$, and the ratio $X / (X+Y)$ of two independent Gamma variables with a common rate is Beta-distributed with their shape parameters, we obtain the desired Beta distribution by definition.

### The Adjusted $R^{2}$

Since $R^{2}$ is monotonically increasing when we add more predictors to the OLS (even if they are unrelated to the response), it is not useful for model selection -- it always favors the largest model.
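This monotonicity is easy to check numerically. Below is a minimal sketch (a hypothetical illustration using `numpy`; the sample size, seed, and data are arbitrary) that regresses a pure-noise response on a growing set of pure-noise predictors and prints the resulting $R^{2}$, which never decreases; it also verifies the identity $R^{2}=\hat{\rho}^{2}_{y\hat{y}}$ from the theorem above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
y = rng.normal(size=n)  # response drawn independently of every predictor

def r_squared(X, y):
    """R^2 of the OLS fit of y on an intercept plus the columns of X."""
    Xd = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # least-squares coefficients
    y_hat = Xd @ beta
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    # correlation identity R^2 = rho_hat(y, y_hat)^2 from the theorem above
    assert np.isclose(r2, np.corrcoef(y, y_hat)[0, 1] ** 2)
    return r2

# R^2 never decreases as pure-noise predictors are appended one at a time
X = np.empty((n, 0))
for k in range(1, 11):
    X = np.column_stack([X, rng.normal(size=n)])    # add one junk column
    print(k, round(r_squared(X, y), 4))
```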
Instead, we can correct for the complexity of the model with

> [!definition|*] The Adjusted $R^2$
> The **adjusted $R^{2}$, or $\bar{R}^{2}$,** is given by $\bar{R}^{2}:= 1 - \frac{n-1}{n-p}(1-R^{2}).$Written in terms of the sums of squares: $\bar{R}^{2}=1-\frac{n-1}{n-p} \frac{\mathrm{RSS}}{\mathrm{TSS}}=1-\frac{\mathrm{RSS} / (n-p)}{\mathrm{TSS} / (n-1)}=1-\frac{\hat{\sigma}_{\epsilon}^{2}}{\hat{\sigma}_{Y}^{2}},$where the $\hat{\sigma}^{2}$s are the unbiased sample estimators of the respective variances.

- The second term can be interpreted as (the inverse of) *the signal-to-noise ratio*.
- A large $\bar{R}^{2}$ indicates that, according to the model, the signal-to-noise ratio is large -- in contrast, a poor fit needs to use the noise term to "explain away" more variance, thereby increasing $\hat{\sigma}_{\epsilon}^{2}$ and decreasing $\bar{R}^{2}$.
- Overfitting is accounted for by using the unbiased estimator $\hat{\sigma}^{2}_{\epsilon}$.

> [!theorem|*] $R^2$-adjusted for Model Comparison
> For two OLS models indexed $1,2$, comparing their $\bar{R}^{2}$ gives $\begin{align*} \bar{R}^{2}_{1} > \bar{R}^{2}_{2} & \iff \frac{\mathrm{RSS}_{1}}{n-p_{1}} < \frac{\mathrm{RSS}_{2}}{n-p_{2}}\\ &\iff \frac{\mathrm{RSS}_{2} / (n-p_{2})}{\mathrm{RSS}_{1} / (n-p_{1})} > 1. \end{align*}$This resembles the $F$-statistic, but it does not necessarily follow the F-distribution since the two terms are not necessarily independent.

- *This selection criterion is far more lenient than the regular $F$-test for nested models*: the $F$-test favors the complex model only when a similar ratio exceeds $F^{-1}(1-\alpha) \gg 1$, where $F^{-1}(1-\alpha)$ is the $(1-\alpha)$-quantile of the relevant F-distribution and $\alpha$ is some significance level.
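As a numerical sanity check of this leniency claim, here is a short sketch (assuming `numpy` and `scipy` are available; the simulated data, seed, and variable names are illustrative only) that fits a nested pair of models, compares their $\bar{R}^{2}$, and contrasts the decision with the corresponding $F$-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                 # candidate extra predictor (pure noise here)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

def rss_and_p(X, y):
    """RSS and parameter count of the OLS fit of y on an intercept plus X."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2), Xd.shape[1]

tss = np.sum((y - y.mean()) ** 2)
rss1, p1 = rss_and_p(x1[:, None], y)                  # model 1: y ~ 1 + x1
rss2, p2 = rss_and_p(np.column_stack([x1, x2]), y)    # model 2: y ~ 1 + x1 + x2

adj1 = 1 - (n - 1) / (n - p1) * rss1 / tss
adj2 = 1 - (n - 1) / (n - p2) * rss2 / tss

# For nested models, adjusted R^2 prefers the larger model exactly when F > 1,
# whereas the F-test requires F to exceed a much larger critical value.
F = ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))
print("adjusted R^2 prefers the larger model:", adj2 > adj1)
print("F statistic:", round(F, 3), "| F > 1:", F > 1)
print("F-test rejects at 5%:", F > stats.f.ppf(0.95, p2 - p1, n - p2))
```

For nested models, $\bar{R}^{2}_{2}>\bar{R}^{2}_{1}$ is algebraically equivalent to $F>1$, which makes the leniency explicit: the adjusted $R^{2}$ accepts any extra predictor whose $F$-ratio merely exceeds $1$.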