> [!definition|*] Maximum Likelihood Estimator
>
> The **maximum likelihood estimator (MLE)** $\hat{\theta}_{\mathrm{MLE}}$ of a parameter $\theta$ is a statistic mapping $\mathbf{X} \mapsto \underset{\theta \in \Theta}{\arg\max}~L(\mathbf{X};\theta).$Here $L$ is the likelihood of the sample; it can be replaced by the log-likelihood $l$ due to the monotonicity of logarithms.

### Solving the MLE

> [!examples] Solving for the Closed Form
> Most commonly, the MLE is found by differentiating $L$ or $l$ and setting the derivatives to $0$. For example, for normal samples $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim} N(\mu,\sigma_{0}^{2})$, $\begin{align*} l(\mathbf{X};\mu)&= \mathrm{const.}+\mathrm{const.}\times\sum_{i}(X_{i}-\mu)^{2}\\ &\Longrightarrow\frac{ \partial l }{ \partial \mu } \propto n\mu - \sum_{i}X_{i}\\ & \Longrightarrow \hat{\mu}_{\mathrm{MLE}}=\bar{X}. \end{align*}$If necessary, one can verify that this is the global maximum (rather than a minimum or saddle point) by checking that the second derivative is negative (in higher dimensions, that the Hessian is negative-definite).
> ---
> For certain distributions, the derivative of the log-likelihood never vanishes, and the MLE lies on the boundary of the set of possible parameters. For example, $X_{1},\dots,X_{n} \overset{\mathrm{iid.}}{\sim}U[0,\theta]$ has $\hat{\theta}_{\mathrm{MLE}}=\max\{ X_{i} \}_{i=1}^{n}$, the infimum of the parameters consistent with the sample.

The MLE can also be computed numerically via the **Newton-Raphson method**, applied to the log-likelihood equation $\frac{\partial l}{\partial \theta}=0$.

> [!algorithm] Newton-Raphson for the MLE
> $[1]$ Start with an initial guess $\theta_{0}$.
> $[2]$ Given the previous guess $\theta_{n}$, linear approximation gives $\left.\frac{\partial l}{\partial \theta}\right|_{\hat{\theta}} \approx \left.\frac{\partial l}{\partial \theta}\right|_{\theta_{n}}+(\hat{\theta}-\theta_{n})\cdot \left.\frac{\partial^2 l}{\partial \theta^2}\right|_{\theta_{n}}.$Since the left-hand side vanishes at the MLE, rearranging for $\hat{\theta}$ gives the next approximation $\theta_{n+1}$: $ \hat{\theta} \approx \theta_{n+1} := \theta_{n}+\frac{U(\theta_{n})}{J(\theta_{n})},$where $U=\frac{\partial l}{\partial \theta}$ is the score function and $J=-\frac{\partial^{2} l}{\partial \theta^{2}}$ is the observed information.
>
> $[3]$ Repeat until sufficient convergence.
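As a quick illustration (not from the lecture notes), the following is a minimal sketch of the update $\theta_{n+1}=\theta_{n}+U(\theta_{n})/J(\theta_{n})$. The exponential-rate model, the function names, and the starting value are arbitrary choices for the example; this model has the closed-form MLE $1/\bar{X}$, which is used only to check the numerical answer.

```python
import numpy as np

def newton_raphson_mle(score, obs_info, theta0, tol=1e-10, max_iter=100):
    """Iterate theta_{n+1} = theta_n + U(theta_n) / J(theta_n) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / obs_info(theta)
        theta += step
        if abs(step) < tol:
            break
    return theta

# Illustrative model: X_1, ..., X_n iid Exponential with rate theta, so that
# l(theta) = n log(theta) - theta * sum(x), U(theta) = n/theta - sum(x),
# and J(theta) = -l''(theta) = n / theta^2.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=500)        # true rate 2.5
n, s = len(x), x.sum()

theta_hat = newton_raphson_mle(score=lambda t: n / t - s,
                               obs_info=lambda t: n / t ** 2,
                               theta0=1.0)
print(theta_hat, n / s)                             # both approximately 2.5
```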
### Properties of the MLE

> [!theorem|*] MLE in Exponential Families
>
> If $\eta=(\eta_{1},\dots,\eta_{k})$ is the natural parameter of a strictly $k$-parameter exponential family, then its MLE $\hat{\eta}_{\mathrm{MLE}}$ is unique.
> > [!proof]-
> > Since the family is strictly $k$-parameter, the covariance matrix $\mathrm{Cov}_{\eta}(T(\mathbf{X}))=\left( \frac{ \partial^{2} }{ \partial \eta_{i}\partial \eta_{j} }B(\eta) \right)_{i,j \in \{ 1,\dots,k \}}$ is strictly positive-definite. This is $(-1)$ times the Hessian of $l$, hence $l$ is strictly concave with respect to $\eta$ and its maximum (i.e. the MLE) is unique. ^d6d206

> [!theorem|*] Invariance Property of the MLE
> The **invariance property** of the MLE states that if $\hat{\theta}$ is the MLE of $\theta$, then for any function $g$, $g(\hat{\theta})$ is the MLE of $g(\theta)$.
> - That is, *the MLE of the function is the function of the MLE*.
>
> > [!proof]- Proof for injective functions
> > Denote the likelihood as a function of $\theta$ by $L(\mathbf{x},\theta)$, and as a function of $\psi=g(\theta)$ by $L^{*}(\mathbf{x},\psi)\equiv L(\mathbf{x},g^{-1}(\psi))$ ($g$ is invertible since it is injective).
> > Then $\sup_{\psi}L^{*}=\sup_{\theta}L$, and the right-hand side is attained at $\hat{\theta}$, i.e. $\mathrm{RHS}=L(\mathbf{x},\hat{\theta})$.
> > Hence if the MLE $\hat{\psi}$ attains the left-hand side, $\mathrm{LHS}=L^{*}(\mathbf{x},\hat{\psi})=L(\mathbf{x},g^{-1}(\hat{\psi}))$, equating it with $\mathrm{RHS}$ gives $g^{-1}(\hat{\psi})=\hat{\theta}$.
> > Hence $\hat{\psi}=g(\hat{\theta})$ is the MLE of $g(\theta)$.

> [!theorem|*] Asymptotic Normality of the MLE
> The **asymptotic normality** of the MLE $\hat{\theta}$ states that, under some regularity conditions (see lecture notes), as $n \to \infty$, $ (I_{n}(\theta))^{1 / 2} (\hat{\theta}-\theta)\xrightarrow{d}N(0,1),$where $I_{n}$ is the expected information from a sample of size $n$. Equivalently, for large $n$, $\hat{\theta} \overset{D}{\approx} N(\theta, I_{n}(\theta)^{-1}).$
>
> This quantifies the fact that the larger the information $I_{n}(\theta)$, the more accurate the MLE. ^f3ac9d
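The asymptotic normality above can also be checked empirically. Below is a minimal simulation sketch (the exponential-rate model, sample size, and repetition count are arbitrary illustrative choices): it standardises the MLE by $(I_{n}(\theta))^{1/2}$ and confirms the result is close to $N(0,1)$.

```python
import numpy as np

# Sketch: X_1, ..., X_n iid Exponential with rate theta, MLE = 1 / sample mean,
# expected information I_n(theta) = n / theta^2.
rng = np.random.default_rng(1)
theta, n, reps = 2.5, 500, 5000

samples = rng.exponential(scale=1 / theta, size=(reps, n))
mle = 1 / samples.mean(axis=1)
z = np.sqrt(n / theta ** 2) * (mle - theta)   # should be approximately N(0, 1)

print(z.mean(), z.std())                      # roughly 0 and 1
```

---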