> [!tldr]
> **Point estimators** map a sample $\mathbf{X}$ to a single value estimating a quantity that depends only on the distribution, e.g. its parameter.
> - Common estimators arise from **method of moments** and [[Maximum Likelihood Estimator|MLEs (covered in its own page)]].
> - Their estimates have errors and sometimes **bias**.
> - **Consistent** estimators converge to the true value with more samples.
Suppose we have a sample $\mathbf{X}=(X_{1},\dots,X_{n})$ from a distribution $f(\cdot\,;\theta)$, and we want to estimate $\gamma:= g(\theta)$. A **point estimator** is a statistic $T:\mathcal{X} \to g(\Theta)$, and its value $T(\mathbf{X})$ is the **estimate**.
## Method of Moments
Often $\gamma=g(\theta)$ can be written as a function of the distribution's moments, say $\gamma=h(m_{1},\dots,m_{r}),$ where $m_{k}=\mathbb{E}_{\theta}[X^{k}]$ are functions of $\theta$.
> [!definition|*] Method of Moments
> Plugging in estimators of the moments gives the **moment estimator** of $\gamma$: $\hat{\gamma}_{\mathrm{MME}}:=h(\hat{m}_{1},\dots,\hat{m}_{r}),$ where $\hat{m}_{k}=\mathrm{avg}(X_{i}^{k})$ is the sample estimator of the $k$th moment, for $k=1,2,\dots$.
> [!examples] Using method of moments
> For $X_{i} \overset{\mathrm{iid.}}{\sim} \mathrm{Bernoulli}(p)$, the parameter $p$ has $p=m_{1}:= \mathbb{E}[X_{i}]$, so its moment estimator is $\hat{p}=\hat{m}_{1}=\bar{X}$.
>
> For $X_{i} \overset{\mathrm{iid.}}{\sim} N(\mu,\sigma^{2})$, the parameter $\sigma^{2}$ has $\sigma^{2}=m_{2}-m_{1}^{2}$, so the moment estimator is $\widehat{\sigma^{2}}=\hat{m}_{2}-\hat{m}_{1}^{2}=\mathrm{avg}(X_{i}^{2})-(\bar{X})^{2}.$
Note that this equals the biased MLE $\sum_{i}(X_{i}-\bar{X})^{2} / n$.
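A minimal NumPy sketch (the helper name `mme_variance` is just for illustration) comparing this moment estimator of $\sigma^{2}$ with the built-in variances on simulated normal data:

```python
import numpy as np

rng = np.random.default_rng(0)

def mme_variance(x):
    """Moment estimator of sigma^2: avg(X_i^2) - (avg X_i)^2."""
    return np.mean(x**2) - np.mean(x)**2

x = rng.normal(loc=2.0, scale=3.0, size=1_000)  # true sigma^2 = 9
print(mme_variance(x))       # moment estimator
print(np.var(x))             # identical: np.var divides by n by default (the biased MLE)
print(np.var(x, ddof=1))     # unbiased sample variance, divides by n - 1
```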
## Bias, Error, and Consistency
A point estimator might be inaccurate, e.g. consistently underestimating the true value $\gamma=g(\theta)$; it can also be imprecise, with a large variance.
> [!definition|*] Measures of Error
> The **bias** measures the average inaccuracy of the estimator: $\mathrm{bias}(T;\theta):=\mathbb{E}_{\theta}[T(\mathbf{X})-g(\theta)].$ An estimator with zero bias is **unbiased**.
>
> The **mean squared error (MSE)** of the estimator is $\mathrm{MSE}_{\theta}(T):=\mathbb{E}_{\theta}\left[(T(\mathbf{X})-g(\theta))^{2} \right],$ also known as the **quadratic loss** (as an example of [[Loss Functions]]).
^e44b87
In analogy with the Pythagorean theorem, bias and variance act like orthogonal components of the MSE:
> [!theorem|*] Bias-Variance Decomposition
> $\mathrm{MSE}_{\theta}(T)=\underbrace{\mathrm{bias}(T;\theta)^{2}}_{\substack{\text{consistent over/}\\ \text{underestimations}}}+\underbrace{\mathrm{Var}_{\theta}(T)}_{\text{imprecision}},$ so in particular, the MSE of an unbiased estimator equals its variance.
> > [!proof]- Proof
> > Note that for the error $e(\mathbf{X}):= T(\mathbf{X})-g(\theta)$, we have $\mathrm{MSE}_{\theta}(T)=\mathbb{E}[e(\mathbf{X})^{2}]$ and $\mathrm{bias}(T;\theta)^{2}=\mathbb{E}[e(\mathbf{X})]^{2}$, so $\mathrm{MSE}_{\theta}(T)-\mathrm{bias}(T;\theta)^{2} = \mathrm{Var}_{\theta}(e(\mathbf{X})) = \mathrm{Var}_{\theta}(T),$ where the last equality holds because $e(\mathbf{X})$ is just $T(\mathbf{X})$ shifted by a constant.
^cf7800
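A Monte Carlo sketch (arbitrary parameter values) that checks this decomposition numerically for the biased variance estimator from the example above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 100_000
mu, sigma2 = 0.0, 4.0                      # true parameters

# T(X) = biased variance estimator, sum_i (X_i - Xbar)^2 / n
samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
T = np.var(samples, axis=1)                # ddof=0: divide by n

bias = T.mean() - sigma2                   # approximates E[T] - g(theta)
var = T.var()                              # approximates Var(T)
mse = np.mean((T - sigma2) ** 2)           # approximates E[(T - g(theta))^2]

print(bias)                                # close to -sigma2/n = -0.2
print(mse, bias**2 + var)                  # the two agree up to Monte Carlo error
```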
This decomposition explains the **bias-variance tradeoff**.
- A more flexible model can be less biased, but it might have higher variance: its parameters are harder to estimate, it is more sensitive to individual observations, etc.
- Conversely, a biased model might perform well if its estimates are much more stable than those of unbiased models, or if it happens to be close to the truth, so that the bias is not an issue.
As the sample size $n \to \infty$, we hope that the estimator $\hat{\theta}_{n}$ converges to the true parameter $\theta$. **Consistency** characterizes this convergence.
> [!definition|*] Consistency
>
> A sequence of estimators $(\hat{\theta}_{n})$ of a value $\theta$, indexed by the sample size $n$, is **consistent** if it converges to $\theta$.
>
> Different senses of convergence lead to different types of consistency:
> - **Consistent in probability** if $\hat{\theta}_{n} \to \theta$ in probability.
> - **Consistent in MSE** if $\mathrm{MSE}_{\theta}(\hat{\theta}_{n}) \to 0$ for all $\theta \in \Theta$.
> - **Consistent almost surely** if $\hat{\theta}_{n} \to \theta \,\,\,\mathrm{a.s.}$ under the probability measure $\mathbb{P}_{\theta}$.
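An illustrative simulation of consistency in MSE for $\hat{p}_{n}=\bar{X}_{n}$ with $X_{i} \overset{\mathrm{iid.}}{\sim} \mathrm{Bernoulli}(p)$ (values chosen arbitrarily); the Monte Carlo MSE shrinks like $p(1-p)/n$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, reps = 0.3, 50_000                       # true parameter, Monte Carlo replicates

for n in (10, 100, 1_000, 10_000):
    # Xbar_n = (number of successes) / n; simulate all replicates at once
    p_hat = rng.binomial(n, p, size=reps) / n
    mse = np.mean((p_hat - p) ** 2)         # Monte Carlo estimate of the MSE
    print(n, mse)                           # roughly p(1-p)/n, -> 0 as n grows
```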