**Stein's paradox** is a consequence of the [[Bias-Variance Tradeoff|bias-variance tradeoff]] under quadratic loss: the MLE is strictly dominated in $l_{2}$ risk by the following estimator:
> [!definition|*] James-Stein Estimator
> For independent variables $X_{i} \sim N(\mu_{i}, 1)$, $i=1,\dots,p$, the **James-Stein estimator** of $\mu=(\mu_{1},\dots,\mu_{p})$ is $\hat{\mu}^{\mathrm{JSE}}:=\left(1 - \frac{p-2}{\| \mathbf{X} \| _{2}^{2}} \right)\mathbf{X},$where $\mathbf{X}=(X_{1},\dots,X_{p})$, and $\| \cdot \|_{2}$ is the Euclidean $l_{2}$ norm.
>
> When $p \ge 3$, $\hat{\mu}^\mathrm{JSE}$ strictly dominates the MLE.
- Hence *the MLE of $\mu$ is not admissible*; but neither is the James-Stein estimator, since it is in turn dominated by another estimator (e.g. its positive-part version).
- The reason the JSE is better is that *it shrinks the estimates towards the origin*: the shrinkage introduces bias but reduces variance by more, so the overall MSE is lower (a small simulation sketch follows this list).
- This tradeoff only works when $p \ge 3$: see [this video](https://www.youtube.com/watch?v=cUqoHQDinCM) for a visual explanation.
- In addition, if $p=1$, Stein's lemma, which states $\mathbb{E}[(X_{i}-\mu_{i})h(\mathbf{X})]=\mathbb{E}[\partial h / \partial X_{i}]$, does not apply: it requires $h$ to be almost (weakly) differentiable with integrable partial derivatives, and this fails for the shrinkage term. If $p=2$, the factor $p-2$ is $0$, so the James-Stein estimator coincides with the MLE.
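A minimal Monte Carlo sketch of this dominance (the dimension $p=10$ and the particular $\pmb{\mu}$ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_rep = 10, 20000
mu = np.linspace(-2, 2, p)                 # arbitrary true mean vector

X = rng.normal(mu, 1.0, size=(n_rep, p))   # X_i ~ N(mu_i, 1), independent replicates

# MLE: X itself
mse_mle = ((X - mu) ** 2).sum(axis=1).mean()

# James-Stein: shrink every replicate toward the origin
shrink = 1 - (p - 2) / (X ** 2).sum(axis=1, keepdims=True)
mse_js = ((shrink * X - mu) ** 2).sum(axis=1).mean()

print(f"MLE risk ≈ {mse_mle:.3f}  (theory: p = {p})")
print(f"JSE risk ≈ {mse_js:.3f}  (strictly smaller for p >= 3)")
```

With these settings the MLE's empirical risk should come out near $p=10$ while the JSE's is noticeably smaller; the gap shrinks as $\| \pmb{\mu} \|$ grows.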
Shrinking towards the origin is clearly arbitrary: another popular option is to shrink toward the grand mean $\bar{X}$; a standard form of this variant is given after the bullets below.
- This can be interpreted as the effect of incorporating prior information about $\mathbb{E}[X]$, estimated with $\bar{X}$, i.e. adding indirect evidence ($X_{j \ne i}$) to the direct evidence $X_{i}$.
- It represents a *compromise between the null hypothesis of full equality ($\hat{\mu}=(\bar{X},\dots,\bar{X})$) and the MLE’s tacit assumption of full independence between $\mu_{1},\dots,\mu_{p}$*.
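Concretely, the standard form of this variant shrinks the deviations from $\bar{X}$, with $p-3$ in place of $p-2$ because one degree of freedom is spent estimating $\bar{X}$:
$$\hat{\mu}_{i} = \bar{X} + \left(1 - \frac{p-3}{\sum_{j}(X_{j}-\bar{X})^{2}}\right)(X_{i}-\bar{X}),$$
which dominates the MLE when $p \ge 4$.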
More generally, this shrinkage can be used for incorporating information from other samples.
- For example, we can fit a regression model $\hat{\mu}^\text{reg}$ to predict $\mu$ from other covariates, and shrink $\hat{\mu}^\mathrm{MLE}$ towards $\hat{\mu}^\text{reg}$; a rough sketch is below.
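A rough sketch of that idea, under the assumption that the subspace version of James-Stein replaces $p-2$ by $p-q-2$, where $q$ is the number of fitted regression coefficients (so it needs $p \ge q+3$):

```python
import numpy as np

def js_toward_regression(x, W):
    """Shrink the MLE x (length p) toward an OLS fit on covariates W (p x q).

    Assumption: the subspace James-Stein estimator uses p - q - 2 in the
    numerator, where q = number of fitted coefficients.
    """
    p, q = len(x), W.shape[1]
    beta, *_ = np.linalg.lstsq(W, x, rcond=None)   # hat{mu}^reg = W @ beta
    mu_reg = W @ beta
    resid_ss = ((x - mu_reg) ** 2).sum()           # ||x - hat{mu}^reg||^2
    shrink = 1 - (p - q - 2) / resid_ss
    return mu_reg + shrink * (x - mu_reg)

# Hypothetical usage: x = observed per-unit effects, W = intercept + one covariate
# x_js = js_toward_regression(x, np.column_stack([np.ones(len(x)), w]))
```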
### Empirical Bayes Derivation of James-Stein
#bayesian/empirical_bayes #bayesian/bayesian_interpretation
Assume the hierarchical model $\begin{align*}
\pmb{\mu} &\sim N(0, \sigma^{2}I),\\
\mathbf{Z} ~|~ \pmb{\mu} & \sim N(\pmb{\mu}, I),
\end{align*}$
then under the $l_{2}$ loss the [[Decision Theory#Bayes Rules|Bayes rule]] is the posterior mean $\hat{\pmb{\mu}}_{\mathrm{Bayes}}(\mathbf{z})=\mathbb{E}[\pmb{\mu} ~|~ \mathbf{z}]=\left( 1-\frac{1}{\sigma^{2}+1} \right)\mathbf{z},$which has (frequentist) risk$\begin{align*}
R(\pmb{\mu})&= \mathbb{E}_{\mathbf{Z}}[\|\pmb{\mu}-\hat{\pmb{\mu}}(\mathbf{Z}) \|^{2}]\\ &= \| \pmb{\mu} \| ^{2}-2\pmb{\mu}^{T}\mathbb{E}[\hat{\pmb{\mu}}(\mathbf{Z})]+\mathbb{E}\| \hat{\pmb{\mu}}(\mathbf{Z}) \|^{2}\\
&= \left( 1 -2B\right)\| \pmb{\mu} \|^{2} +B ^{2}\mathbb{E}[\| \mathbf{Z} \| ^{2} ~|~ \pmb{\mu}] \\
&= \left( 1-B \right)^{2}\| \pmb{\mu} \| ^{2}+B^{2}\sum_{i}\mathrm{Var}(Z_{i}~|~\pmb{\mu}),
\end{align*}$where $B=1- 1/(\sigma^{2}+1)$ is the shrinkage factor. Substituting $\mathrm{Var}(Z_{i}~|~\pmb{\mu})=1$, and for the Bayes risk averaging $R(\pmb{\mu})$ over the prior (so that $\mathbb{E}\| \pmb{\mu} \|^{2}=n\sigma^{2}$), we get the frequentist and Bayes risks $\begin{align*}
R(\pmb{\mu})&= (1-B)^{2}\| \pmb{\mu} \|^{2} +nB^{2},\\[0.8em]
r&= n(1-B)^{2} \sigma^{2}+nB^{2}\\
&= nB.
\end{align*}$
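The final simplification to $nB$ is easy to miss; a quick symbolic check (sympy, with $\sigma^{2}$ written as `sigma2`):

```python
import sympy as sp

sigma2, n = sp.symbols("sigma2 n", positive=True)
B = sigma2 / (sigma2 + 1)                        # B = 1 - 1/(sigma^2 + 1)
bayes_risk = n * (1 - B) ** 2 * sigma2 + n * B ** 2
print(sp.simplify(bayes_risk - n * B))           # prints 0, i.e. r = nB
```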
Of course in practice we do not know $\sigma^{2}$. Marginally $\mathbf{Z} \sim N(0, (\sigma^{2}+1)I)$, which gives $\mathbb{E}\left[ \frac{1}{\sum_{i}Z_{i}^{2}} \right]=\frac{1}{(\sigma^{2}+1)(n-2)},$so $(n-2)/\sum_{i}Z_{i}^{2}$ is an unbiased estimate of $1/(\sigma^{2}+1)$, and we arrive at the James-Stein estimator by replacing $B$ with $\hat{B}=1- \frac{n-2}{\sum_{i}Z^{2}_{i}}.$
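Putting the plug-in together, a minimal sketch (the specific $n$ and $\sigma^{2}$ below are arbitrary) showing that the empirical-Bayes $\hat{B}$ tracks the oracle $B$ and that the resulting rule is exactly the James-Stein formula:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 50, 4.0

mu = rng.normal(0.0, np.sqrt(sigma2), size=n)    # mu ~ N(0, sigma^2 I)
z = rng.normal(mu, 1.0)                          # Z | mu ~ N(mu, I)

B_true = 1 - 1 / (sigma2 + 1)                    # oracle shrinkage factor
B_hat = 1 - (n - 2) / (z ** 2).sum()             # empirical-Bayes plug-in
mu_js = B_hat * z                                # = James-Stein estimator

print(f"true B = {B_true:.3f}, plug-in B_hat = {B_hat:.3f}")
print(f"avg squared error, JS: {((mu_js - mu) ** 2).mean():.3f}  MLE: {((z - mu) ** 2).mean():.3f}")
```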