> [!definition|*] Maximum Likelihood Estimator
>
> The **maximum likelihood estimator (MLE)** $\hat{\theta}_{\mathrm{MLE}}$ of a parameter $\theta$ is a statistic mapping $\mathbf{X} \mapsto \underset{\theta \in \Theta}{\arg\max}~L(\mathbf{X};\theta).$ Here $L$ is the likelihood of the sample; it can equivalently be replaced by the log-likelihood $l$, since the logarithm is strictly increasing.
### Solving the MLE
> [!examples] Solving for the Closed Form
> Most commonly, the MLE is found by differentiating $L$ or $l$ and setting the derivatives to $0$. For example, for normal samples $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim} N(\mu,\sigma_{0}^{2})$, $\begin{align*}
> l(\mathbf{X};\mu)&= \mathrm{const.}-\frac{1}{2\sigma_{0}^{2}}\sum_{i}(X_{i}-\mu)^{2}\\
> &\Longrightarrow\frac{ \partial l }{ \partial \mu } \propto \sum_{i}X_{i}-n\mu\\
> & \Longrightarrow \hat{\mu}_{\mathrm{MLE}}=\bar{X}.
> \end{align*}$ If necessary, one can verify that this is the global maximum (rather than a minimum or saddle point) by checking that the second derivative is negative (in higher dimensions, that the Hessian is negative-definite).
> ---
> For certain distributions the derivative of the log-likelihood never vanishes, and the MLE lies on the boundary of the set of parameters consistent with the data. For example, $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim}U[0,\theta]$ has likelihood $\theta^{-n}\mathbf{1}\{ \theta \geq \max_{i} X_{i} \}$, which is decreasing in $\theta$, so $\hat{\theta}_{\mathrm{MLE}}=\max\{ X_{i} \}_{i=1}^{n}$, the infimum of the parameters with nonzero likelihood.
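
As a quick numerical check (a minimal sketch, not from the notes: the sample values, $\sigma_{0}=1$, and the use of SciPy's optimizer are assumptions for illustration), maximizing the normal log-likelihood numerically recovers the closed form $\hat{\mu}_{\mathrm{MLE}}=\bar{X}$, and the uniform boundary case reduces to taking the sample maximum:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Simulated normal sample with known variance sigma_0^2 = 1 (illustrative values).
rng = np.random.default_rng(1)
x = rng.normal(loc=2.5, scale=1.0, size=200)

# Negative log-likelihood in mu; minimizing it maximizes l(X; mu).
neg_loglik = lambda mu: -np.sum(norm.logpdf(x, loc=mu, scale=1.0))
mu_hat = minimize_scalar(neg_loglik, bounds=(-10.0, 10.0), method="bounded").x
print(mu_hat, x.mean())  # the two agree up to numerical tolerance

# Boundary case: for U[0, theta] the likelihood theta^{-n} is decreasing in theta,
# so the MLE is simply the sample maximum.
u = rng.uniform(0.0, 4.0, size=200)
print(u.max())
```
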
When no closed form is available, the MLE can be solved numerically via the **Newton-Raphson method**, applied to the log-likelihood equation $\frac{\partial l}{\partial \theta}=0$.
> [!algorithm] Newton-Raphson for the MLE
> $[1]$ Start with an initial guess $\theta_{0}$.
> $[2]$ Given the previous guess $\theta_{k}$, a linear approximation about $\theta_{k}$ gives $\left.\frac{\partial l}{\partial \theta}\right|_{\hat{\theta}} \approx \left.\frac{\partial l}{\partial \theta}\right|_{\theta_{k}}+(\hat{\theta}-\theta_{k})\cdot \left.\frac{\partial^2 l}{\partial \theta^2}\right|_{\theta_{k}}.$ Since the left-hand side vanishes at the MLE, setting it to $0$ and rearranging for $\hat{\theta}$ gives the new approximation $\theta_{k+1}$: $\hat{\theta} \approx \theta_{k+1} := \theta_{k}+\frac{U(\theta_{k})}{J(\theta_{k})},$ where $U=\frac{\partial l}{\partial \theta}$ is the score function and $J=-\frac{\partial^{2} l}{\partial \theta^{2}}$ the observed information.
>
> $[3]$ Repeat until the iterates converge to the desired tolerance.
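
A minimal sketch of the iteration (not from the notes: the Gamma-shape model, SciPy's `digamma`/`polygamma`, and the numerical values are illustrative assumptions), applying $\theta_{k+1}=\theta_{k}+U(\theta_{k})/J(\theta_{k})$ in a case where the score equation has no closed-form solution:

```python
import numpy as np
from scipy.special import digamma, polygamma

# Simulated Gamma(shape = 3, rate = 2) data, with the rate beta treated as known;
# the score equation for the shape alpha cannot be solved in closed form.
rng = np.random.default_rng(0)
beta = 2.0
x = rng.gamma(shape=3.0, scale=1.0 / beta, size=500)
n = len(x)

def score(alpha):
    # U(alpha) = dl/d(alpha) = n log(beta) + sum(log x_i) - n * digamma(alpha)
    return n * np.log(beta) + np.sum(np.log(x)) - n * digamma(alpha)

def obs_info(alpha):
    # J(alpha) = -d^2 l / d(alpha)^2 = n * trigamma(alpha) > 0
    return n * polygamma(1, alpha)

alpha = 1.0                        # initial guess theta_0
for _ in range(50):
    step = score(alpha) / obs_info(alpha)
    alpha += step                  # theta_{k+1} = theta_k + U/J
    if abs(step) < 1e-10:          # stop once sufficiently converged
        break

print(alpha)                       # close to the true shape 3.0 for a large sample
```
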
### Properties of the MLE
> [!theorem|*] MLE in Exponential Families
>
> If $\eta=(\eta_{1},\dots,\eta_{k})$ is the natural parameter of a strictly $k$-parameter exponential family, then its MLE $\hat{\eta}_{\mathrm{MLE}}$ is unique.
> > [!proof]-
> > Since the family is strictly $k$-parameter, the covariance matrix $\mathrm{Cov}_{\eta}(T(\mathbf{X}))=\left( \frac{ \partial^{2} }{ \partial \eta_{i}\partial \eta_{j} }B(\eta) \right)_{i,j \in \{ 1,\dots,k \}}$ is positive-definite. This is $(-1)$ times the Hessian of $l$ with respect to $\eta$, hence $l$ is strictly concave in $\eta$ and its maximum (i.e. the MLE), when it exists, is unique.
^d6d206
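
For instance (a simple illustrative case): a Bernoulli($p$) observation written in natural form has $\eta=\log\frac{p}{1-p}$, $T(x)=x$ and $B(\eta)=\log(1+e^{\eta})$, so $\frac{\mathrm{d}^{2}}{\mathrm{d}\eta^{2}}B(\eta)=\frac{e^{\eta}}{(1+e^{\eta})^{2}}=\mathrm{Var}_{\eta}(X)>0$. The log-likelihood of a sample is therefore strictly concave in $\eta$, and (provided $0<\bar{X}<1$) the unique MLE is $\hat{\eta}=\log\frac{\bar{X}}{1-\bar{X}}$.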
> [!theorem|*] Invariance Property of the MLE
> The **invariance property** of the MLE states that if $\hat{\theta}$ is the MLE of $\theta$, then for any function $g$, $g(\hat{\theta})$ is the MLE of $g(\theta)$.
> - That is, *the MLE of the function is the function of the MLE*.
>
> > [!proof]- Proof for injective functions
> > Denote the likelihood as a function of $\theta$ by $L(\mathbf{x},\theta)$, and as a function of $\psi=g(\theta)$ by $L^{*}(\mathbf{x},\psi)\equiv L(\mathbf{x},g^{-1}(\psi))$ ($g$ is invertible onto its image since it is injective).
> > Then $\sup_{\psi}L^{*}=\sup_{\theta}L$, and the right-hand side is attained at $\hat{\theta}$: $\sup_{\theta}L=L(\mathbf{x},\hat{\theta})$.
> > Hence if the MLE $\hat{\psi}$ attains the left-hand side, i.e. $L^{*}(\mathbf{x},\hat{\psi})=L(\mathbf{x},g^{-1}(\hat{\psi}))=\sup_{\psi}L^{*}$, then $L(\mathbf{x},g^{-1}(\hat{\psi}))=L(\mathbf{x},\hat{\theta})$, so $g^{-1}(\hat{\psi})$ also maximizes $L$ and equals $\hat{\theta}$.
> > Hence $\hat{\psi}=g(\hat{\theta})$ is the MLE of $g(\theta)$.
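
For instance (an illustrative case): for $X_{1},\dots,X_{n}\overset{\mathrm{iid.}}{\sim}\mathrm{Poisson}(\lambda)$, the MLE of $\lambda$ is $\bar{X}$, so by invariance the MLE of $P(X=0)=e^{-\lambda}$ is $e^{-\bar{X}}$, with no further maximization required.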
> [!theorem|*] Asymptotic Normality of the MLE
> The **asymptotic normality** of the MLE $\hat{\theta}$ states that under some regularity conditions (see lecture notes), as $n \to \infty$, $ (I_{n}(\theta))^{1 / 2} (\hat{\theta}-\theta)\xrightarrow{d}N(0,1),$ where $I_{n}(\theta)$ is the expected (Fisher) information from $n$ samples. Equivalently, for large $n$, $\hat{\theta} \overset{D}{\approx} N(\theta, I_{n}(\theta)^{-1}).$
>
> This quantifies the fact that the larger the information $I_{n}(\theta)$, the more accurate the MLE.
^f3ac9d
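
A minimal simulation sketch (not part of the notes: the Poisson model, sample size, replication count and seed are assumptions chosen for illustration). For $X_{i}\overset{\mathrm{iid.}}{\sim}\mathrm{Poisson}(\lambda)$, $\hat{\lambda}_{\mathrm{MLE}}=\bar{X}$ and $I_{n}(\lambda)=n/\lambda$, so $(I_{n}(\lambda))^{1/2}(\hat{\lambda}-\lambda)$ should be approximately $N(0,1)$ for large $n$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 4.0, 200, 10_000      # true parameter, sample size, replications

# The MLE of a Poisson rate is the sample mean; simulate it `reps` times.
mle = rng.poisson(lam, size=(reps, n)).mean(axis=1)

# Standardize by the expected information I_n(lambda) = n / lambda.
z = np.sqrt(n / lam) * (mle - lam)

# Should be close to 0, 1 and the standard-normal 97.5% point (about 1.96).
print(z.mean(), z.std(), np.quantile(z, 0.975))
```
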
---