> [!warning]- Regularity Conditions
> Most of this note assumes a few regularity conditions on the family of distributions $\{ f(\cdot;\,\theta) \,|\, \theta \in \Theta\}$:
> - $[\mathrm{R{1}}]$ They have a common support.
> - $[\mathrm{R} 2]$ The parameter space $\Theta$ is open.
> - $[\mathrm{R}{3}]$ The derivative $\frac{ \partial f }{ \partial \theta }$ exists and is dominated by some integrable function.
>
> These conditions will be assumed implicitly (and not restated) for the rest of this note.

Suppose again we have a sample $\mathbf{X}=(X_{1},\dots,X_{n})$ from a distribution $f(\cdot\,;\theta)$, and we want to estimate $\gamma:= g(\theta)$ with the point estimator $T(\mathbf{X})$. Obviously with a finite sample, $T(\mathbf{X})$ cannot be arbitrarily precise, so *there is a lower bound on its MSE*.

- This bound serves as a baseline for comparing different estimators -- those that attain the bound have to be the best ones.

> [!idea] Limiting the scope to unbiased estimators
> If we wish to compare the [[Point Estimators#^e44b87|MSE]] of two arbitrary estimators $T_{1},T_{2}$, it seems natural to check whether one is **uniformly better** than the other, i.e. if $T_{1}$ is uniformly better than $T_{2}$, then $\forall \theta \in \Theta,~\mathrm{MSE}_{\theta}(T_{1}) \le \mathrm{MSE}_{\theta}(T_{2}).$
> However, nothing is uniformly better than the trivial estimator $T_{0}\equiv \theta_{0} \in \Theta$, as $\mathrm{MSE}_{\theta_{0}}(T_{0})=0$, so this comparison is not useful.
>
> The issue here is that [[Point Estimators#^e44b87|biased]] estimators like $T_{0}$ can be pointlessly pathological, making it impossible to find a uniformly better estimator. Hence we restrict the search to **minimum variance unbiased estimators (MVUEs)**.

## Fisher and Observed Information

Ideally, the more information a sample contains, the smaller the MSE can be made -- this necessitates measures of the amount of information in a sample.

Given a sample $\mathbf{x}$, the **score function** $S(\theta,\mathbf{x}):\Theta \to \mathbb{R}^{k}$ is $S(\theta,\mathbf{x})=\frac{ \partial }{ \partial \theta } l(\mathbf{x};\theta)$where $l$ is the log-likelihood. Hence the MLE $\hat{\theta}$ satisfies $S(\hat{\theta},\mathbf{x})=0$.

- $\mathbb{E}_{\theta}[S(\theta,\mathbf{X})]=0$, which is verified by exchanging the derivative with the integral.

> [!definition|*] Fisher Information
> The **Fisher information** of a parameter $\theta$ is the variance of the score function (taken over all possible samples $\mathbf{X}$): $I_{\mathbf{X}}(\theta)= \mathrm{Var}_{\theta}(S(\theta, \mathbf{X}))=\mathbb{E}_{\theta}[S^{2}]$

- A large $\mathrm{Var}(S)$ means that the value of $S$ is sensitive to $\mathbf{X}$, i.e. the likelihood decays sharply around the MLE $\hat{\theta}^\mathrm{MLE}(\mathbf{X})$, and there is a lot of information in $\mathbf{X}$, enough to narrow down on $\theta$.

> [!definition|*] Observed Information
> Assuming second-order differentiability of the log-likelihood $l$, the **observed information** is $J(\theta,\mathbf{x})=-l''(\mathbf{x};\theta)$If $\theta=\hat{\theta}^\mathrm{MLE}$, it measures how sharply the log-likelihood falls off around its peak.
>
> If there are multiple sources of information (e.g. two independent samples $X, Y$), then use subscripts to differentiate (e.g. $J_{X},J_{Y}$).
^7df282

- The observed information appears in the second-order approximation of the log-likelihood $l(\theta)$ near $\hat{\theta}$: $l(\theta) \approx l(\hat{ \theta}) + \cancel{(\theta-\hat{\theta})\left.\frac{\partial l}{\partial \theta} \right|_{\hat{\theta}}}+\frac{1}{2}(\theta-\hat{\theta})^{2} \underbrace{\left.\frac{\partial^{2} l}{\partial \theta^{2}}\right|_{\hat{\theta}}}_{-J(\hat{\theta})}.$ The larger $J(\hat{ \theta})$ is, the faster the log-likelihood decays as $\theta$ moves away from $\hat{\theta}$. In other words, *large information = precise MLE*.

> [!info]+ In Higher Dimensions
> - The score function is $S=\nabla_{\theta}\,l=\left( \frac{ \partial l }{ \partial \theta_{1} },\dots,\frac{ \partial l }{ \partial \theta_{k} } \right)$.
> - The Fisher information matrix is $I_{\mathbf{X}}(\theta)=\mathrm{Cov}(S)$.
> - The observed information matrix is $J(\theta,\mathbf{x})$ given by $J_{ij}=-\frac{ \partial^{2} l }{ \partial \theta_{i} \partial \theta_{j}}$

### Properties of the Fisher Information

> [!theorem|*] Fisher and Observed Information
> *Fisher information is the expected observed information*: $I_{\mathbf{X}}(\theta)=\mathbb{E}_{\mathbf{X} ~|~\theta}[J(\theta,\mathbf{X})],$and in higher dimensions, the same holds for each entry of the matrices.

The score function, and by extension the informations, are *additive across independent samples*:

- For independent variables $X,Y$ whose respective distributions share the parameter $\theta$, $\begin{align*} I_{(X,Y)}(\theta)&= I_{X}(\theta)+I_{Y}(\theta),\\ J_{X,Y}(\theta,\mathbf{x}, \mathbf{y})&= J_{X}(\theta,\mathbf{x})+J_{Y}(\theta,\mathbf{y}). \end{align*}$
- In particular, if $X_{1},\dots,X_{n}\overset{\mathrm{iid.}}{\sim} f(\cdot\,;\theta)$, then the whole sample $\mathbf{X}$ has information $I_{\mathbf{X}}=n \cdot i_{X},$where $i_{X}:= I_{X_{1}}$ is the *Fisher information in a single observation*. This usually makes computations easier.

> [!warning]
> *Fisher information changes under reparametrization*: if $\phi \mapsto \theta(\phi)$ is a differentiable reparametrization, then $I^{*}_{\mathbf{X}}(\phi)=I_{\mathbf{X}}(\theta(\phi))\cdot \theta'(\phi)^{2}$where $I^{*}$ and $I$ are the Fisher informations for $\phi$ and $\theta$ respectively, which are in general distinct functions.
>
> > [!proof]- Sketch proof
> > Use the variance definition of Fisher information, and note that the score function of $\phi$ is $S_{\phi}=\frac{ \partial }{ \partial \phi }l(\mathbf{x};\theta(\phi))=S_{\theta}\theta'(\phi)$. When taking the variance, the scaling term $\theta'(\phi)$ is squared.
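These properties are easy to sanity-check numerically. Below is a minimal simulation sketch (my own, not part of the original derivation) for a $\mathrm{Bernoulli}(\theta)$ sample: it checks that the score has mean zero, that $\mathrm{Var}(S)=\mathbb{E}[J]=n\,i_{X}(\theta)$, and that the log-odds reparametrization $\phi=\log\frac{\theta}{1-\theta}$ rescales the information by $\theta'(\phi)^{2}=(\theta(1-\theta))^{2}$. The parameter value, sample size, and simulation count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_sims = 0.3, 50, 200_000            # arbitrary illustration values

# n_sims iid samples of size n from Bernoulli(theta); s is the per-sample sum
X = rng.binomial(1, theta, size=(n_sims, n))
s = X.sum(axis=1)

# log-likelihood: l(theta) = s*log(theta) + (n - s)*log(1 - theta)
score    = s / theta - (n - s) / (1 - theta)        # S(theta, x) = dl/dtheta
obs_info = s / theta**2 + (n - s) / (1 - theta)**2  # J(theta, x) = -d^2 l / dtheta^2

i_single = 1 / (theta * (1 - theta))                # Fisher info of one observation
print(score.mean())                     # ~ 0:           E[S] = 0
print(score.var(), n * i_single)        # both ~ n*i_X:  I_X = Var(S), additivity
print(obs_info.mean(), n * i_single)    # both ~ n*i_X:  I_X = E[J]

# Log-odds reparametrization: theta(phi) = 1/(1 + e^{-phi}), theta'(phi) = theta*(1-theta),
# so I*(phi) = I(theta(phi)) * theta'(phi)^2 should match the variance of the rescaled score.
score_phi = score * theta * (1 - theta)
print(score_phi.var(), n * i_single * (theta * (1 - theta))**2)
```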
### Information and Statistics

Refer to [[Minimality and Sufficiency#Sufficiency|sufficiency]].

> [!theorem|*] Sufficient statistics do not lose information
> Under some regularity conditions, the *sufficient statistic yields the same observed (and hence Fisher) information as the original sample*. That is, if $\mathbf{X} \sim f(\cdot\,;\theta)$ and $T(\mathbf{X})$ is sufficient, then $\begin{align*} J_{\mathbf{X}}(\theta,\mathbf{x})&= J_{T}(\theta,T(\mathbf{x})),\\ I_{\mathbf{X}}(\theta)&= I_{T(\mathbf{X})}(\theta). \end{align*}$This relationship is $\ge$ in general for non-sufficient statistics.
>
> > [!proof]-
> > By sufficiency (and the factorization criterion), we may pick $g(T;\theta)$ to be the density of $T$, and $h$ the density of $\mathbf{X} ~|~ T$, such that $f(\mathbf{x};\theta)=g(T;\theta)h(\mathbf{x}).$Then computing the observed information $-\frac{ \partial^{2} }{ \partial \theta^{2} }\text{loglik}$ from the original sample $\mathbf{X}=\mathbf{x}$ gives $J_{\mathbf{X}}(\theta,\mathbf{x})\overset{(1)}{=}-\frac{ \partial ^{2} }{ \partial \theta^{2} }\log g(T;\theta)\overset{(2)}{=}J_{T}(\theta,T(\mathbf{x})), $where $(1)$ follows from $h(\mathbf{x})$ being constant here, and $(2)$ is by the choice of $g$ and the definition of observed information.
> >
> > Since the Fisher information is just the expectation of the observed information, they are equal as well.

^3cd3db

This property makes computing the information of certain distributions easier: if $S\sim f(\cdot\,; \theta)$ has the same distribution as a sufficient statistic of $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim} g(\cdot\,; \theta)$, then $I_{S}(\theta)=ni_{X}(\theta)$. The latter is usually easier to compute.

> [!examples] Deriving Fisher information of a binomial variable
> Suppose $S \sim \mathrm{Binom}(n, \theta)$ where $n$ is known. Then with some hairy algebra we can find that the Fisher information is $I_{S}(\theta)=\frac{n}{\theta(1-\theta)}$.
>
> But alternatively, recall that $S:=\sum_{i}X_{i}\sim \mathrm{Binom}(n,\theta)$ is a sufficient statistic of the sample $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim}\mathrm{Bernoulli}(\theta)$, where each observation has Fisher information $i_{X}(\theta)=\frac{1}{\theta(1-\theta)}$; we can find this with minimal fuss.
>
> Since information is additive, the whole sample has information $\frac{n}{\theta(1-\theta)}$, and that must equal the information in $S=\sum_{i}X_{i}$.

## Cramer-Rao Lower Bounds

> [!tldr]
> The Fisher information determines the **CRLB**, a lower bound on the variance of unbiased, regular estimators.

A statistic $T$ is **regular** if it allows the exchange of differentiation and integration in $\begin{align*} \int T(x)\frac{ \partial }{ \partial \theta } L(x;\theta) \, dx &= \frac{ \partial }{ \partial \theta } \int T(x)L(x;\theta) \, dx \\ &= \frac{ \partial }{ \partial \theta } \mathbb{E}_{\theta}[T(X)]. \end{align*}$

> [!theorem|*] Cramer-Rao Lower Bound
>
> In estimating $\gamma=g(\theta)$, if the information $I_{X}<\infty$ and $g'\ne 0$, then for any regular, unbiased estimator $T$ of $\gamma$, the **Cramer-Rao lower bound (CRLB)** holds: $\mathrm{MSE}_{\theta}(T)=\mathrm{Var}_{\theta}(T) \ge \frac{g'(\theta)^{2}}{I_{X}(\theta)}.$The equality is attained if and only if for (almost) all $x,\theta$, $T(x)=g(\theta)+\frac{g'(\theta)S(\theta,x)}{I_{X}(\theta)},$and of course the expression must simplify to be independent of $\theta$.
>
> > [!info]- In higher dimensions
> > Variances of vector-valued variables are compared with the Loewner order: $A \preceq B$ if $B-A$ is positive semi-definite.
> > The CRLB is given by $\mathrm{Var}_{\theta}(T) \succeq J_{g}I_{X}^{-1}J_{g}^{T}$where $J_{g}(\theta)_{ij}=\frac{ \partial g_{i} }{ \partial \theta_{j} }$ is the Jacobian.
>
> > [!proof]- Sketch proof for scalar estimators
> > First prove that $\mathrm{Cov}_{\theta}(T, S(\theta,\mathbf{X}))=g'(\theta)$ using regularity. Then consider the random variable $T-c(\theta)S(\theta,\mathbf{X})$ where $c(\theta):= g'(\theta) / I_{X}(\theta)$. Its variance is non-negative and can be shown to equal $\mathrm{Var}_{\theta}(T)-\mathrm{CRLB}$.
> > If the equality holds (so the latter is $0$), then $T-c(\theta)S(\theta,\mathbf{X})$ must be ($\mathrm{a.s.}$) constant, so being unbiased it equals $g(\theta)$, and $T=g(\theta)+c(\theta)S(\theta,\mathbf{X})$.

In particular, if $g$ is the identity function and we are directly estimating the scalar-valued parameter $\theta$, the CRLB simplifies to $\mathrm{MSE}_{\theta}(\hat{\theta})=\mathrm{Var}_{\theta}(\hat{\theta}) \ge I_{X}(\theta)^{-1}.$Recalling that $\hat{\theta}_{\mathrm{MLE}}$ is asymptotically $N(\theta, I_{X_{1:n}}^{-1}(\theta))$ as the sample size $n \to \infty$, we see that it asymptotically attains the CRLB. Moreover, $I_{X_{1:n}} =O(n)$, so this lower bound on the variance decays like $1/n$ as the sample size grows.

If $T^{*}$ is a biased estimator of $\theta$, then it is an unbiased estimator of $\theta+\mathrm{bias}(T^{*};\theta)$, so applying the CRLB gives $\mathrm{Var}_{\theta}(T^{*})\ge \frac{\left[1+\frac{ \partial }{ \partial \theta }\mathrm{bias}(T^{*};\theta)\right]^{2}}{I_{X}(\theta)}.$

### Attaining the CRLB

> [!definition|*] Efficiency
> If $T$ is an unbiased estimator of $\gamma=g(\theta)$, its **efficiency** is the ratio between the CRLB and its variance: $\mathrm{eff}_{\theta}(T,\gamma)=\frac{\mathrm{CRLB}}{\mathrm{Var}_{\theta}(T)}=\frac{g'(\theta)^{2}}{I_{X}(\theta)\mathrm{Var}_{\theta}(T)}$
> By definition, $\mathrm{eff}_{\theta} \in [0,1]$ for regular unbiased estimators, and $T$ is called **efficient** if it attains the CRLB ($\mathrm{eff}_{\theta}=1~\forall \theta$).

However, regular, unbiased estimators do not attain the CRLB in general: if $T$ is efficient, then the distribution must be from an [[Exponential Families|exponential family]].

> [!theorem|*] Efficiency in 1-parameter exponential families
>
> If $\mathcal{P}$ is a strictly $1$-parameter exponential family with canonical statistic $T$, then *$T$ is an efficient estimator of $\mathbb{E}_{\theta}[T]$*.
> - This does not necessarily hold in higher dimensions.
>
> > [!proof]-
> > First note that in (1-parameter) exponential families, $\begin{align*} S(\eta,T)&= \frac{ \partial }{ \partial \eta }(T\cdot\eta-B(\eta)) =T-B',\\ J(\eta,T)&= -\frac{ \partial }{ \partial \eta }S(\eta,T)=B'',\\ I_{\eta}(T)&= \mathbb{E}_{\eta}[J(\eta,T)]=B'' \quad (\text{non-random}). \end{align*}$Therefore, for the natural parametrization, the CRLB reduces to $\frac{\left( \frac{ \partial }{ \partial \eta }\mathbb{E}_{\eta}[T] \right)^{2}}{I_{\eta}(T)}=\frac{(B'')^{2}}{B''}=B''=\mathrm{Var}(T).$
> >
> > ---
> >
> > For a non-natural parametrization $\theta \mapsto \eta(\theta)$, $\begin{align*} S(\theta,T)&= \frac{ \partial }{ \partial \theta }(T\cdot\eta(\theta)-B(\eta(\theta))) =T\eta'-B'\eta',\\ J(\theta,T)&=- \frac{ \partial }{ \partial \theta }S(\theta,T)=-T\eta'' + B ''\cdot(\eta')^{2}+B'\eta'',\\ I_{\theta}(T)&= \mathbb{E}_{\theta}[J(\theta,T)]=\cancel{-B'\eta''} + B ''\cdot(\eta')^{2}+\cancel{B'\eta''}=B''\cdot(\eta')^{2}, \end{align*}$where the cancellation uses $\mathbb{E}_{\theta}[T]=B'$. The same result follows: $\mathrm{CRLB}=\frac{\left( \frac{ \partial \mathbb{E}_{\theta}[T] }{ \partial \theta } \right)^{2}}{I_{\theta}(T)}=\frac{\left( \frac{ \partial \mathbb{E}_{\eta(\theta)}[T] }{ \partial \eta } \eta' \right)^{2}}{B '' \cdot (\eta')^{2}}=B '' = \mathrm{Var}(T). $
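As a quick numerical illustration of efficiency (my own sketch, not part of the note's derivation): for a $\mathrm{Bernoulli}(\theta)$ sample, the sample mean $\bar{X}$ is a rescaling of the canonical statistic $\sum_{i}X_{i}$ and is an unbiased estimator of $\theta=\mathbb{E}_{\theta}[\bar{X}]$; its variance matches the CRLB $g'(\theta)^{2}/I_{\mathbf{X}}(\theta)=\theta(1-\theta)/n$, i.e. its efficiency is $1$. The constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_sims = 0.3, 50, 200_000        # arbitrary illustration values

X = rng.binomial(1, theta, size=(n_sims, n))
theta_hat = X.mean(axis=1)                 # X_bar: rescaled canonical statistic

crlb = theta * (1 - theta) / n             # g'(theta)^2 / I_X(theta) with g = identity
print(theta_hat.var(), crlb)               # approximately equal: X_bar attains the CRLB
print(crlb / theta_hat.var())              # efficiency ~ 1
```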
> [!theorem|*] MLE and the CRLB
> If the CRLB is attained by some unbiased estimator, it must be the MLE:
> - More precisely, if $\theta$ has the MLE $\hat{\theta}_{\mathrm{MLE}}$, and there is an unbiased $\tilde{\theta}$ that attains the CRLB, then $\hat{\theta}_{\mathrm{MLE}}=\tilde{\theta}\,\,\mathrm{a.s.}$.
>
> > [!proof]-
> > Suppose $\tilde{\theta}$ is efficient and unbiased; then $\tilde{\theta}-\theta=S(\theta,x) / I_{X}(\theta)~\mathrm{a.s.}$
> > Then plugging in $\theta=\hat{\theta}_{\mathrm{MLE}}$ gives $\tilde{\theta}=\hat{\theta}_{\mathrm{MLE}}~\mathrm{a.s.}$ since $S(\hat{\theta}_{\mathrm{MLE}},x)=0$.

## Minimum Variance Unbiased Estimators

> [!tldr]
> - **Rao-Blackwell** gives a way to improve unbiased estimators using sufficient statistics.
> - For sufficient statistics that are complete, **Lehmann-Scheffé** guarantees that the improvement is optimal.

In general, the CRLB is not achievable. However, it's still valuable to find the **minimum variance unbiased estimator (MVUE)**.

> [!definition|*] MVUE
> An unbiased estimator $T$ of $g(\theta)$ is the **MVUE** if it is uniformly better than any other unbiased estimator $T'$. That is, $\forall \theta \in \Theta,~\underbrace{\mathrm{Var}_{\theta}(T)}_{=\mathrm{MSE}_{\theta}(T)} \le \underbrace{\mathrm{Var}_{\theta}(T')}_{=\mathrm{MSE}_{\theta}(T')}.$Here MSE and variance are equal since the estimators are unbiased.

> [!theorem|*] Rao-Blackwell
>
> Suppose $X \sim f(\cdot\,; \theta)$, $T(X)$ is sufficient, and $\hat{\gamma}$ is an unbiased estimator of $\gamma=g(\theta)$.
>
> Then *incorporating the information of $T$ gives a better unbiased estimator* $\hat{\gamma}_{T}= \mathbb{E}_{\theta}[\hat{\gamma}\,|\,T]$:
> - $\hat{\gamma}_{T}$ does not depend on $\theta$, so it can be used as an estimator.
> - $\mathbb{E}_{\theta}[\hat{\gamma}_{T}]=\gamma$, so $\hat{\gamma}_{T}$ is unbiased.
> - $\mathrm{Var}(\hat{\gamma}_{T}) \le \mathrm{Var}(\hat{\gamma})$, so $\hat{\gamma}_{T}$ is a better estimator.
> - Equality holds if and only if $\hat{\gamma}_{T}=\hat{\gamma}\,\,\mathrm{a.s.}$
>
> > [!info]- In higher dimensions
> > If $\gamma \in \mathbb{R}^{k}$ is vector-valued, then the variance inequality becomes $\mathrm{Cov}(\hat{\gamma}_{T}) \preceq \mathrm{Cov}(\hat{\gamma})$.
>
> > [!proof]-
> > *No dependence on $\theta$*: conditioning on the sufficient statistic $T$ removes the dependence on $\theta$: $\mathbb{E}_{\theta}[\hat{\gamma}(X)\,|\,T]=\int _{\mathcal{X}}\hat{\gamma}(x) \underbrace{f(x\,|\,\theta,T)}_{\text{indep. of }\theta} \, dx $
> > *Unbiasedness*: $\mathbb{E}[\hat{\gamma}_{T}]=\mathbb{E}_{T}[\mathbb{E}_{X}[\hat{\gamma}(X)\,|\,T]]=\mathbb{E}[\hat{\gamma}]=\gamma,$where subscripts after $\mathbb{E}$ indicate the variable over which the expectation is taken.
> >
> > *Less variance*: since both $\hat{\gamma},\hat{\gamma}_{T}$ are unbiased, it's enough to show that $\mathbb{E}[\hat{\gamma}^{2}] \ge \mathbb{E}[\hat{\gamma}_{T}^{2}]$: $\mathbb{E}_{}[\hat{\gamma}_{T}^{2}]=\mathbb{E}_{T}[\mathbb{E}_{X}[\hat{\gamma} \,|\, T]^{2}] \le \mathbb{E}_{T}[\mathbb{E}_{X}[\hat{\gamma}^{2} \,|\, T]]=\mathbb{E}[\hat{\gamma}^{2}]$where the inequality is just $\mathbb{E}[W]^{2} \le \mathbb{E}[W^{2}]$ with $W=\hat{\gamma}$, conditioned on $T$.
> >
> > For that to be an equality, the conditional variance $\mathrm{Var}(\hat{\gamma}\,|\,T)$ must be $0\,\,\mathrm{a.s.}$, i.e. given $T$, $\hat{\gamma}=\mathrm{const.}=\hat{\gamma}_{T}\,\,\mathrm{a.s.}$

One consequence of Rao-Blackwell is that an unbiased estimator $\hat{\gamma}$ can always be improved, unless it is (solely) a function of some sufficient statistic $T$.
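To see the variance reduction concretely, here is a small simulation sketch (my own example, not from the note): for $X_{1},\dots,X_{n}\overset{\mathrm{iid.}}{\sim}\mathrm{Pois}(\theta)$ and $\gamma=P(X_{1}=0)=e^{-\theta}$, the crude unbiased estimator $\hat{\gamma}=\mathbb{1}\{X_{1}=0\}$ can be Rao-Blackwellized with the sufficient statistic $T=\sum_{i}X_{i}$; since $X_{1}\,|\,T=t \sim \mathrm{Binom}(t,1/n)$, the improved estimator is $\hat{\gamma}_{T}=\mathbb{E}[\hat{\gamma}\,|\,T]=(1-1/n)^{T}$. The parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_sims = 2.0, 10, 500_000        # arbitrary illustration values
gamma = np.exp(-theta)                     # target: gamma = P(X_1 = 0)

X = rng.poisson(theta, size=(n_sims, n))
T = X.sum(axis=1)                          # sufficient statistic for theta

crude = (X[:, 0] == 0).astype(float)       # unbiased, but ignores most of the sample
rb    = (1 - 1 / n) ** T                   # E[crude | T]: the Rao-Blackwellized estimator

for name, est in [("crude", crude), ("Rao-Blackwell", rb)]:
    print(name, est.mean(), est.var())     # both means ~ gamma; rb has much smaller variance
```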
### MVUE from Complete Estimators

> [!definition|*] Completeness
> A family of distributions $\mathcal{P}=\{ P_{\theta}\,|\, \theta \in \Theta \}, X \sim P_{\theta}$ is **complete** if there are no non-trivial unbiased estimators of $0$: for all $h(X)$, $\big( \mathbb{E}_{\theta}[h(X)]=0\,\, \forall \theta \in \Theta \big) \Longrightarrow \big(h(X)=0 \,\,\mathrm{a.s.}\,\, \forall \theta \in \Theta\big).$Equivalently, any two unbiased estimators of the same quantity $\gamma$ are $\mathrm{a.e.}$ equal.
>
> A statistic $T(X)$ is **complete** if the family of distributions of $T$, $\mathcal{P}_{T}=\{ P^{(T)}_{\theta}\,|\, \theta \in \Theta \}$ with $T \sim P^{(T)}_{\theta}$, is complete, i.e. $\big( \mathbb{E}_{\theta}[h(T)]=0\,\, \forall \theta \in \Theta \big) \Longrightarrow \big(h(T)=0 \,\,\mathrm{a.s.}\,\, \forall \theta \in \Theta\big).$

MVUEs can be obtained from complete, sufficient statistics in general, without needing to attain the CRLB, which is only possible for exponential families.

> [!math|{"type":"theorem","number":"","setAsNoteMathLink":false,"title":"Lehmann-Scheffé","label":"lehmann-scheff"}] Theorem (Lehmann-Scheffé).
> If $\hat{\gamma}$ is an unbiased estimator of $\gamma$ and $T$ is a complete and sufficient statistic, then $\hat{\gamma}_{T}:= \mathbb{E}[\hat{\gamma}\,|\, T]$ is an MVUE of $\gamma$.
>
> > [!proof]-
> > For any other unbiased estimator $\tilde{\gamma}$, consider $\tilde{\gamma}_{T}=\mathbb{E}[\tilde{\gamma}\,|\,T]$. Both $\hat{\gamma}_{T},\tilde{\gamma}_{T}$ are functions of $T$ alone (by sufficiency they do not depend on $\theta$), and they are both unbiased estimators of $\gamma$.
> >
> > By completeness of $T$, they are $\mathrm{a.e.}$ equal, so $\mathrm{Var}(\hat{\gamma}_{T}) = \mathrm{Var}(\tilde{\gamma}_{T})\le \mathrm{Var}(\tilde{\gamma})$for any unbiased estimator $\tilde{\gamma}$.

One important corollary is that *if an unbiased estimator $\hat{\gamma}$ is a function of a complete, sufficient statistic, it must be an MVUE*.

- For example, in a sample $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim}N(\mu,\sigma^{2})$, $\mathbf{T}=\left( \sum_{i}X_{i},\sum_{i}X_{i}^{2}\right)$ is complete and sufficient (see the next section), so $(\bar{X}, S^{2})$ are MVUEs of $(\mu,\sigma^{2})$.
- For $X_{1},\dots,X_{n} \sim U[0,\theta]$, $X_{\mathrm{max}}$ is complete and sufficient, so the estimator $\hat{\theta}= \frac{n+1}{n}X_{\mathrm{max}}$ is the MVUE of $\theta$; the CRLB does not apply here because the family is not regular (its support depends on $\theta$). A simulation sketch at the end of this note compares it with the method-of-moments estimator $2\bar{X}$.

### Finding Complete Statistics

> [!theorem|*] Complete Statistics of Exponential Families
> If $\mathcal{P}$ is a full-rank strictly $k$-parameter exponential family, its canonical statistic $T$ is a complete, sufficient statistic.

A general necessary and sufficient condition for an MVUE is being uncorrelated with every unbiased estimator of $0$ that has finite variance: $\begin{gather} \mathcal{U}:= \{ u:\mathcal{X} \to \mathbb{R} \,|\, \mathrm{Var}(u)<\infty, \mathbb{E}[u]=0\}\\[0.4em] \hat{\gamma}\text{ is MVUE } \iff \forall u \in \mathcal{U}: \mathbb{E}[\hat{\gamma}u]=0 \end{gather}$

> [!proof]
> $(\Rightarrow)$: for any $c$, $\hat{\gamma}+cu$ is also unbiased, so $\begin{align*} 0 &\le \mathrm{Var}(\hat{\gamma}+cu) -\mathrm{Var}(\hat{\gamma})\\ &= 2c\,\mathbb{E}[\hat{\gamma}u]+c^{2}\mathrm{Var}(u). \end{align*}$ This quadratic in $c$ is non-negative for every $c$ and vanishes at $c=0$, which is only possible if its linear coefficient vanishes, i.e. $\mathbb{E}[\hat{\gamma}u]=0$.
>
> $(\Leftarrow)$: Take any other unbiased estimator $\tilde{\gamma}$ and use $u=\hat{\gamma}-\tilde{\gamma}$: $0 = \mathbb{E}[\hat{\gamma}(\hat{\gamma}-\tilde{\gamma})]= \underbrace{\mathbb{E}[\hat{\gamma}^{2}]-\gamma^{2}}_{\mathrm{Var}(\hat{\gamma})}-\mathrm{Cov}(\hat{\gamma},\tilde{\gamma}).$Hence $\mathrm{Var}(\hat{\gamma})^{2}=\mathrm{Cov}(\hat{\gamma},\tilde{\gamma})^{2}$, and Cauchy-Schwarz gives $\mathrm{Var}(\hat{\gamma})^{2}=\mathrm{Cov}(\hat{\gamma},\tilde{\gamma})^{2} \le \mathrm{Var}(\hat{\gamma})\mathrm{Var}(\tilde{\gamma}).$Dividing both sides by $\mathrm{Var}(\hat{\gamma})$ gives $\mathrm{Var}(\hat{\gamma})\le \mathrm{Var}(\tilde{\gamma})$, so $\hat{\gamma}$ is the MVUE.

> [!warning] Complete, sufficient statistics do not always exist
> If $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim} \mathrm{Unif}[\theta, \theta+1]$, then the sample does not admit a complete, sufficient statistic.
>
> Furthermore, this model does not even have an MVUE of $\theta$.
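As a follow-up to the $U[0,\theta]$ example above, here is a simulation sketch (my own, with arbitrary parameter values) comparing the MVUE $\frac{n+1}{n}X_{\mathrm{max}}$ with the method-of-moments estimator $2\bar{X}$, which is also unbiased but is not a function of the complete, sufficient statistic $X_{\mathrm{max}}$. Both are unbiased, but the MVUE's variance $\frac{\theta^{2}}{n(n+2)}$ is far below $\mathrm{Var}(2\bar{X})=\frac{\theta^{2}}{3n}$, and even below the formal value $\theta^{2}/n$ of the CRLB, which does not apply in this non-regular model.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_sims = 5.0, 20, 500_000        # arbitrary illustration values

X = rng.uniform(0, theta, size=(n_sims, n))
mvue = (n + 1) / n * X.max(axis=1)         # (n+1)/n * X_max, built from the complete sufficient statistic
mom  = 2 * X.mean(axis=1)                  # 2 * X_bar, unbiased but not a function of X_max

print("MVUE:", mvue.mean(), mvue.var(), theta**2 / (n * (n + 2)))   # mean ~ theta, var ~ theta^2/(n(n+2))
print("MoM :", mom.mean(),  mom.var(),  theta**2 / (3 * n))         # mean ~ theta, var ~ theta^2/(3n)
print("formal CRLB value (not applicable here):", theta**2 / n)
```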