> [!tldr]
> **Point estimators** map a sample $\mathbf{X}$ to single-valued estimates of quantities that depend only on the distribution, e.g. the parameter.
> - Common estimators arise from the **method of moments** and [[Maximum Likelihood Estimator|MLEs (covered in its own page)]].
> - Their estimates have errors and sometimes **bias**.
> - **Consistent** estimators converge to the true value as the sample grows.

Suppose we have a sample $\mathbf{X}=(X_{1},\dots,X_{n})$ from a distribution $f(\cdot\,;\theta)$, and we want to estimate $\gamma:= g(\theta)$. A **point estimator** is a statistic $T:\mathcal{X} \to g(\Theta)$, and its value $T(\mathbf{X})$ is the **estimate**.

## Method of Moments

Often $\gamma=g(\theta)$ can be written as a function of the distribution's moments, say $\gamma=h(m_{1},\dots,m_{r}),$ where $m_{k}=\mathbb{E}_{\theta}[X^{k}]$ are functions of $\theta$.

> [!definition|*] Method of Moments
> Plugging the sample estimators of the moments into $h$ gives the **moment estimator** of $\gamma$: $\hat{\gamma}_{\mathrm{MME}}:=h(\hat{m}_{1},\dots,\hat{m}_{r}),$ where $\hat{m}_{k}=\mathrm{avg}(X_{i}^{k})$ is the sample estimator of the $k$th moment, for $k=1,2,\dots$.

> [!examples] Using the method of moments
> For $X_{i} \overset{\mathrm{iid.}}{\sim} \mathrm{Bernoulli}(p)$, the parameter $p$ satisfies $p=m_{1}:= \mathbb{E}[X_{i}]$, so its moment estimator is $\hat{p}=\hat{m}_{1}=\bar{X}$.
>
> For $X_{i} \overset{\mathrm{iid.}}{\sim} N(\mu,\sigma^{2})$, the parameter $\sigma^{2}$ satisfies $\sigma^{2}=m_{2}-m_{1}^{2}$, so its moment estimator is $\widehat{\sigma^{2}}=\hat{m}_{2}-\hat{m}_{1}^{2}=\mathrm{avg}(X_{i}^{2})-(\bar{X})^{2}.$ Note that this equals the biased MLE $\sum_{i}(X_{i}-\bar{X})^{2} / n$.

## Bias, Error, and Consistency

A point estimator might be inaccurate, systematically over- or underestimating the true value $\gamma=g(\theta)$. It can also be imprecise, with a large variance.

> [!definition|*] Measures of Error
> The **bias** measures the average inaccuracy of the estimator: $\mathrm{bias}(T;\theta):=\mathbb{E}_{\theta}[T(\mathbf{X})-g(\theta)].$ An estimator with zero bias is **unbiased**.
>
> The **mean squared error (MSE)** of the estimator is $\mathrm{MSE}_{\theta}(T):=\mathbb{E}_{\theta}\left[(T(\mathbf{X})-g(\theta))^{2} \right],$ also known as the **quadratic loss** (as an example of [[Loss Functions]]).
^e44b87

Analogous to the Pythagorean theorem, bias and variance are like orthogonal components of the MSE:

> [!theorem|*] Bias-Variance Decomposition
> $\mathrm{MSE}_{\theta}(T)=\underbrace{\mathrm{bias}(T;\theta)^{2}}_{\substack{\text{systematic over/}\\ \text{underestimation}}}+\underbrace{\mathrm{Var}_{\theta}(T)}_{\text{imprecision}},$ so in particular, an unbiased estimator's MSE equals its variance.
>
> > [!proof]- Proof
> > Note that for the error $e(\mathbf{X}):= T(\mathbf{X})-g(\theta)$, we have $\mathrm{MSE}_{\theta}(T)=\mathbb{E}[e(\mathbf{X})^{2}]$ and $\mathrm{bias}(T;\theta)^{2}=\mathbb{E}[e(\mathbf{X})]^{2}$, so $\mathrm{MSE}_{\theta}(T)-\mathrm{bias}(T;\theta)^{2} = \mathrm{Var}_{\theta}(e(\mathbf{X})) = \mathrm{Var}_{\theta}(T),$ where the last equality holds because $e(\mathbf{X})$ is just $T(\mathbf{X})$ shifted by a constant.
^cf7800

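As a quick numerical sanity check of the decomposition (a minimal simulation sketch assuming `numpy` is available; the parameter values, sample size, and replication count are arbitrary illustrative choices), we can draw many Normal samples, compute the moment estimator $\widehat{\sigma^{2}}$ from the example above on each, and compare its empirical MSE against $\mathrm{bias}^{2}+\mathrm{Var}$:

```python
# Monte Carlo check of MSE = bias^2 + Var for the (biased) moment estimator
# of sigma^2 under iid. N(mu, sigma^2) samples.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 4.0      # true parameters (illustrative choices)
n, n_reps = 20, 100_000    # sample size and number of simulated samples

# Each row is one sample of size n; compute sigma^2-hat = avg(X_i^2) - (X-bar)^2 per row.
X = rng.normal(mu, np.sqrt(sigma2), size=(n_reps, n))
sigma2_hat = (X**2).mean(axis=1) - X.mean(axis=1)**2

mse = np.mean((sigma2_hat - sigma2)**2)   # estimates E[(T - g(theta))^2]
bias = np.mean(sigma2_hat) - sigma2       # estimates E[T] - g(theta)
var = np.var(sigma2_hat)                  # estimates Var(T)

print(f"MSE          ≈ {mse:.4f}")
print(f"bias^2 + Var ≈ {bias**2 + var:.4f}")   # agrees with the MSE above
```

The two printed values agree, and the bias comes out slightly negative, as expected for the "divide by $n$" variance noted in the example above.
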
This decomposition explains the **bias-variance tradeoff**.
- A more flexible model can be less biased, but it might have higher variance, because its parameters are harder to estimate, it is more sensitive to individual data observations, etc.
- Conversely, a biased model might perform well if its estimates are much more stable than those of unbiased models, or when it happens to be close to the truth, so that the bias is not an issue.

As the sample size $n \to \infty$, we hope that the estimator $\hat{\theta}_{n}$ converges to the true parameter $\theta$. **Consistency** characterizes this convergence.

> [!definition|*] Consistency
>
> A sequence of estimators $(\hat{\theta}_{n})$, indexed by the sample size $n$ and estimating a value $\theta$, is **consistent** if it converges to $\theta$.
>
> Different senses of convergence lead to different types of consistency:
> - **Consistent in probability** if $\hat{\theta}_{n} \to \theta$ in probability.
> - **Consistent in MSE** if $\mathrm{MSE}_{\theta}(\hat{\theta}_{n}) \to 0$ for all $\theta \in \Theta$.
> - **Consistent almost surely** if $\hat{\theta}_{n} \to \theta\ \mathrm{a.s.}$ under the probability measure $\mathbb{P}_{\theta}$.

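A similar sketch (again assuming `numpy`, with illustrative parameter choices) illustrates consistency in MSE for the Bernoulli moment estimator $\hat{p}=\bar{X}$ from the example above: $\bar{X}$ is unbiased, so its MSE equals its variance $p(1-p)/n$, which shrinks to zero as $n$ grows.

```python
# Consistency in MSE: the empirical MSE of p-hat = X-bar under Bernoulli(p)
# shrinks towards zero as the sample size n grows.
import numpy as np

rng = np.random.default_rng(1)
p, n_reps = 0.3, 2_000    # true parameter (illustrative) and number of replications

for n in [10, 100, 1_000, 10_000]:
    X = rng.binomial(1, p, size=(n_reps, n))  # n_reps Bernoulli(p) samples of size n
    p_hat = X.mean(axis=1)                    # moment estimator for each sample
    mse = np.mean((p_hat - p)**2)             # empirical MSE at this n
    print(f"n = {n:6d}   MSE ≈ {mse:.6f}   (theory: p(1-p)/n = {p*(1-p)/n:.6f})")
```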