**Stein's paradox** is a consequence of the [[Bias-Variance Tradeoff|bias-variance tradeoff]] under quadratic loss: the MLE is strictly dominated in $l_{2}$ risk by the following estimator:
> [!definition|*] James-Stein Estimator
> For independent variables $X_{i} \sim N(\mu_{i}, 1)$, $i=1,\dots,p$, the **James-Stein estimator** of $\mu=(\mu_{1},\dots,\mu_{p})$ is $\hat{\mu}^{\mathrm{JSE}}:=\left(1 - \frac{p-2}{\| \mathbf{X} \| _{2}^{2}} \right)\mathbf{X},$where $\mathbf{X}=(X_{1},\dots,X_{p})$, and $\| \cdot \|_{2}$ is the Euclidean $l_{2}$ norm.
>
> When $p \ge 3$, $\hat{\mu}^\mathrm{JSE}$ strictly dominates the MLE.
- Hence *the MLE of $\mu$ is not admissible*; but neither is the James-Stein estimator, since it is in turn dominated by another estimator (e.g. its positive-part version).
- The reason the JSE is better is that *it shrinks the estimates towards the origin*: the shrinkage introduces bias but reduces variance by more, so the overall MSE is lower (a small simulation sketch follows this list).
- This tradeoff only works when $p \ge 3$: see [this video](https://www.youtube.com/watch?v=cUqoHQDinCM) for a visual explanation.
- In addition, if $p=1$, Stein's lemma, which states $\mathbb{E}[(X_{i}-\mu_{i})h(\mathbf{X})]=\mathbb{E}[\partial h / \partial X_{i}]$, does not apply: it requires $h$ to be almost (weakly) differentiable with integrable partial derivatives, and this fails for the shrinkage term. If $p=2$, the factor $p-2$ is $0$, so the James-Stein estimator coincides with the MLE.
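A minimal Monte Carlo sketch of this dominance (the dimension $p=10$ and the particular $\pmb{\mu}$ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_rep = 10, 20000
mu = np.linspace(-2, 2, p)                 # arbitrary true mean vector

X = rng.normal(mu, 1.0, size=(n_rep, p))   # X_i ~ N(mu_i, 1), independent replicates

# MLE: X itself
mse_mle = ((X - mu) ** 2).sum(axis=1).mean()

# James-Stein: shrink every replicate toward the origin
shrink = 1 - (p - 2) / (X ** 2).sum(axis=1, keepdims=True)
mse_js = ((shrink * X - mu) ** 2).sum(axis=1).mean()

print(f"MLE risk ≈ {mse_mle:.3f}  (theory: p = {p})")
print(f"JSE risk ≈ {mse_js:.3f}  (strictly smaller for p >= 3)")
```

With these settings the MLE's empirical risk should come out near $p=10$ while the JSE's is noticeably smaller; the gap shrinks as $\| \pmb{\mu} \|$ grows.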
Shrinking towards the origin is clearly arbitrary: another popular option is to shrink toward the grand mean $\bar{X}$; a standard form of this variant is given after the bullets below.
- This can be interpreted as the effect of incorporating prior information about $\mathbb{E}[X]$, estimated with $\bar{X}$, i.e. adding indirect evidence ($X_{j \ne i}$) to the direct evidence $X_{i}$.
- It represents a *compromise between the null hypothesis of full equality ($\hat{\mu}=(\bar{X},\dots,\bar{X})$) and the MLE’s tacit assumption of full independence between $\mu_{1},\dots,\mu_{p}$*.
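Concretely, the standard form of this variant shrinks the deviations from $\bar{X}$, with $p-3$ in place of $p-2$ because one degree of freedom is spent estimating $\bar{X}$:
$$\hat{\mu}_{i} = \bar{X} + \left(1 - \frac{p-3}{\sum_{j}(X_{j}-\bar{X})^{2}}\right)(X_{i}-\bar{X}),$$
which dominates the MLE when $p \ge 4$.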
More generally, this shrinkage can be used for incorporating information from other samples.
- For example, we can fit a regression model $\hat{\mu}^\text{reg}$ to predict $\mu$ from other covariates, and shrink $\hat{\mu}^\mathrm{MLE}$ towards $\hat{\mu}^\text{reg}$; a rough sketch is below.
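A rough sketch of that idea, under the assumption that the subspace version of James-Stein replaces $p-2$ by $p-q-2$, where $q$ is the number of fitted regression coefficients (so it needs $p \ge q+3$):

```python
import numpy as np

def js_toward_regression(x, W):
    """Shrink the MLE x (length p) toward an OLS fit on covariates W (p x q).

    Assumption: the subspace James-Stein estimator uses p - q - 2 in the
    numerator, where q = number of fitted coefficients.
    """
    p, q = len(x), W.shape[1]
    beta, *_ = np.linalg.lstsq(W, x, rcond=None)   # hat{mu}^reg = W @ beta
    mu_reg = W @ beta
    resid_ss = ((x - mu_reg) ** 2).sum()           # ||x - hat{mu}^reg||^2
    shrink = 1 - (p - q - 2) / resid_ss
    return mu_reg + shrink * (x - mu_reg)

# Hypothetical usage: x = observed per-unit effects, W = intercept + one covariate
# x_js = js_toward_regression(x, np.column_stack([np.ones(len(x)), w]))
```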
### Empirical Bayes Derivation of James-Stein
#bayesian/empirical_bayes #bayesian/bayesian_interpretation
Assume the hierarchical model $\begin{align*}
\pmb{\mu} &\sim N(0, \sigma^{2}I),\\
\mathbf{Z} ~|~ \pmb{\mu} & \sim N(\pmb{\mu}, I),
\end{align*}$
then under the $l_{2}$ loss the [[Decision Theory#Bayes Rules|Bayes rule]] is the posterior mean $\hat{\pmb{\mu}}_{\mathrm{Bayes}}(\mathbf{z})=\mathbb{E}[\pmb{\mu} ~|~ \mathbf{z}]=\left( 1-\frac{1}{\sigma^{2}+1} \right)\mathbf{z},$which has (frequentist) risk$\begin{align*}
R(\pmb{\mu})&= \mathbb{E}_{\mathbf{Z}}[\|\pmb{\mu}-\hat{\pmb{\mu}}(\mathbf{Z}) \|^{2}]\\ &= \| \pmb{\mu} \| ^{2}-2\pmb{\mu}^{T}\mathbb{E}[\hat{\pmb{\mu}}(\mathbf{Z})]+\mathbb{E}\| \hat{\pmb{\mu}}(\mathbf{Z}) \|^{2}\\
&= \left( 1 -2B\right)\| \pmb{\mu} \|^{2} +B ^{2}\mathbb{E}[\| \mathbf{Z} \| ^{2} ~|~ \pmb{\mu}] \\
&= \left( 1-B \right)^{2}\| \pmb{\mu} \| ^{2}+B^{2}\sum_{i}\mathrm{Var}(Z_{i}~|~\pmb{\mu}),
\end{align*}$where $B=1- 1/(\sigma^{2}+1)$ is the shrinkage factor. Substituting $\mathrm{Var}(Z_{i}~|~\pmb{\mu})=1$, and for the Bayes risk averaging $R(\pmb{\mu})$ over the prior (so that $\mathbb{E}\| \pmb{\mu} \|^{2}=n\sigma^{2}$), we get the frequentist and Bayes risks $\begin{align*}
R(\pmb{\mu})&= (1-B)^{2}\| \pmb{\mu} \|^{2} +nB^{2},\\[0.8em]
r&= n(1-B)^{2} \sigma^{2}+nB^{2}\\
&= nB.
\end{align*}$
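The final simplification to $nB$ is easy to miss; a quick symbolic check (sympy, with $\sigma^{2}$ written as `sigma2`):

```python
import sympy as sp

sigma2, n = sp.symbols("sigma2 n", positive=True)
B = sigma2 / (sigma2 + 1)                        # B = 1 - 1/(sigma^2 + 1)
bayes_risk = n * (1 - B) ** 2 * sigma2 + n * B ** 2
print(sp.simplify(bayes_risk - n * B))           # prints 0, i.e. r = nB
```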
Of course in practice we do not know $\sigma^{2}$. Marginally $\mathbf{Z} \sim N(0, (\sigma^{2}+1)I)$, which gives $\mathbb{E}\left[ \frac{1}{\sum_{i}Z_{i}^{2}} \right]=\frac{1}{(\sigma^{2}+1)(n-2)},$so $(n-2)/\sum_{i}Z_{i}^{2}$ is an unbiased estimate of $1/(\sigma^{2}+1)$, and we arrive at the James-Stein estimator by replacing $B$ with $\hat{B}=1- \frac{n-2}{\sum_{i}Z^{2}_{i}}.$
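Putting the plug-in together, a minimal sketch (the specific $n$ and $\sigma^{2}$ below are arbitrary) showing that the empirical-Bayes $\hat{B}$ tracks the oracle $B$ and that the resulting rule is exactly the James-Stein formula:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 50, 4.0

mu = rng.normal(0.0, np.sqrt(sigma2), size=n)    # mu ~ N(0, sigma^2 I)
z = rng.normal(mu, 1.0)                          # Z | mu ~ N(mu, I)

B_true = 1 - 1 / (sigma2 + 1)                    # oracle shrinkage factor
B_hat = 1 - (n - 2) / (z ** 2).sum()             # empirical-Bayes plug-in
mu_js = B_hat * z                                # = James-Stein estimator

print(f"true B = {B_true:.3f}, plug-in B_hat = {B_hat:.3f}")
print(f"avg squared error, JS: {((mu_js - mu) ** 2).mean():.3f}  MLE: {((z - mu) ** 2).mean():.3f}")
```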