Consider a test of $H_{0}$ against $H_{1}$ with some data $\mathbf{X}=\mathbf{x}$. The **likelihood ratio** (of $H_{0}$ to $H_{1}$) is
$\lambda(\mathbf{x})=\frac{L(\mathbf{x}\,|\,H_{0})}{L(\mathbf{x}\,|\,H_{1})}$
- Data that are consistent with $H_{0}$ have large ratios, and small ratios mean disagreement; hence we can use the likelihood ratio as a criterion for rejection.

> [!definition|*] Likelihood Ratio Tests
>
> **Likelihood ratio tests (LRT)** are tests with critical region of the form $C=\{ \mathbf{x}:\, \lambda(\mathbf{x}) \le k \}$.

### The Neyman-Pearson Lemma

> [!bigidea]
> Using the likelihood ratio as the criterion to reject $H_{0}$ produces the most powerful test.

Consider simple hypotheses $H_{0}: \theta=\theta_{0}$ and $H_{1}:\theta=\theta_{1}$. Here the likelihood ratio is
$\lambda(\mathbf{x})=\frac{L(\mathbf{x};\theta_{0})}{L(\mathbf{x};\theta_{1})}$

> [!lemma|*] Neyman-Pearson
>
> Given these simple hypotheses and $\alpha >0$, the LRT is the most powerful test that has size $\le \alpha$.
> - More precisely, let $\mathcal{T}$ be the LRT with the critical region $C=\left\{ \mathbf{x}: \frac{L(\mathbf{x};\theta_{0})}{L(\mathbf{x};\theta_{1})} \le k_{\alpha} \right\}$, where $k_{\alpha}$ is chosen so that the test has size $\alpha$. Then any other test $\mathcal{T}^{*}$ with size $\le \alpha$ must have $\mathrm{power}(\mathcal{T}) \ge \mathrm{power}(\mathcal{T}^{*})$.
> - That is, the criterion "reject $H_{0}$ when $\lambda$ is too small" produces the most powerful test.
>
> > [!proof]-
> > For any test with size $\le\alpha$ and critical region $A$, we shall prove that it has less (or equal) power. Consider the function
> > $F \equiv (\mathbb{1}_{C}-\mathbb{1}_{A})\left\{ L(\mathbf{x};\theta_{1})-\frac{1}{k}L(\mathbf{x};\theta_{0}) \right\}$
> > which is non-negative (both factors are non-negative when $\mathbf{x} \in C$, and both non-positive otherwise).
> >
> > Then its integral is also non-negative, and noting that $\int _{R} L(\mathbf{x};\theta)\,d\mathbf{x}=\mathbb{P}(\mathbf{X} \in R \,|\,\theta)$, expand the braces to get
> > $\begin{align*}
> > 0 \le\int_{\mathbb{R}^{n}} F &= \int _{C} \{\dots\} - \int _{A} \{\dots\} \\
> > &= \underbrace{\int_{C} L(\mathbf{x};\theta_{1}) }_{\text{power}(C)} - \frac{1}{k} \underbrace{ \int _{C} L(\mathbf{x};\theta_{0})}_{\text{size}(C)=\alpha}\\
> > &\quad - \underbrace{\int_{A} L(\mathbf{x};\theta_{1}) }_{\text{power}(A)} + \frac{1}{k} \underbrace{ \int _{A} L(\mathbf{x};\theta_{0})}_{\text{size}(A) \le \alpha}\\
> > &\le \text{power}(C)-\text{power}(A)
> > \end{align*}$
> > hence $\text{power}(C) \ge \text{power}(A)$.

This does not mean that tests based on p-values are bad; likelihood ratios and p-values are sometimes just different ways to derive the same critical region:

> [!examples] p-values and LRT giving the same critical region
> Suppose the test is on $\mathbf{X} \sim N(\mu,\sigma^{2})$ with unknown mean and variance. The hypotheses are $H_{0}:\mu=0$ and $H_{1}:\mu>0$.
>
> Then both the likelihood ratio criterion and the p-value of a t-test lead to the critical region $\{ \mathbf{x}:\bar{\mathbf{x}} \ge c_{\alpha} \}$ for some constant $c_{\alpha}$.
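To see the lemma in action numerically, here is a minimal simulation sketch for a simple-vs-simple Gaussian test with *known* variance (a simplification of the unknown-variance example above); the sample size, alternative mean and $\alpha$ below are arbitrary illustrative choices, not from the notes. It checks that the rule "reject when $\lambda(\mathbf{x}) \le k_{\alpha}$" picks out exactly the same samples as "reject when $\bar{\mathbf{x}} \ge c_{\alpha}$", and estimates the power of that region.

```python
import numpy as np
from scipy import stats

# Illustrative setup (numbers are arbitrary, not from the notes):
# X_1,...,X_n iid N(mu, 1) with known variance 1,
# simple hypotheses H0: mu = 0 vs H1: mu = mu1 > 0.
rng = np.random.default_rng(0)
n, mu1, alpha = 20, 0.5, 0.05

# For this model, lambda(x) <= k_alpha is algebraically equivalent to xbar >= c_alpha.
# Under H0, xbar ~ N(0, 1/n), so c_alpha is its upper-alpha quantile, and the
# matching likelihood-ratio threshold is log k_alpha = n (mu1^2/2 - mu1 * c_alpha).
c_alpha = stats.norm.ppf(1 - alpha, loc=0.0, scale=1 / np.sqrt(n))
log_k_alpha = n * (mu1**2 / 2 - mu1 * c_alpha)

# Simulate data under H1 and apply both rejection rules.
x = rng.normal(mu1, 1.0, size=(100_000, n))
xbar = x.mean(axis=1)
log_lr = (stats.norm.logpdf(x, loc=0.0, scale=1.0).sum(axis=1)
          - stats.norm.logpdf(x, loc=mu1, scale=1.0).sum(axis=1))

reject_lrt = log_lr <= log_k_alpha    # the LRT criterion
reject_mean = xbar >= c_alpha         # the "large sample mean" criterion
print("rules agree on every sample:", bool(np.all(reject_lrt == reject_mean)))
print("estimated power of the LRT :", reject_lrt.mean())
```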
### Uniformly Most Powerful Tests

Suppose instead of testing $H_{0}:\theta=\theta_{0}$ against the simple alternative $H_{1}:\theta=\theta_{1}$, the alternative is $H_{1}:\theta \in \Theta_{1}$. Then the likelihood ratio is a function over $\Theta_{1}$:
$\lambda(\mathbf{x},\theta_{1})=\frac{L(\mathbf{x};\theta_{0})}{L(\mathbf{x};\theta_{1})}$

> [!definition|*] Uniformly Most Powerful Tests
> A **uniformly most powerful (UMP)** test $\phi$ of size $\alpha$ has:
> - Its size $\mathbb{E}[\phi ~|~ \theta] \le \alpha$ whenever $\theta \in \Theta_{0}$.
> - Its power $\mathbb{E}[\phi ~|~ \theta] \ge \mathbb{E}[\phi' ~|~ \theta]$ whenever $\theta \in \Theta_{1}$, where $\phi'$ is any other test with size $\le \alpha$.

For example, in testing a normal sample for $H_{0}:\mu=0$ against $H_{1}: \mu >0$, the size-$\alpha$ critical region is the same for every simple alternative $\mu=\mu_{1}>0$, hence the critical region $\{ \mathbf{x}:\bar{\mathbf{x}} \ge c_{\alpha} \}$ is UMP.

*Usually, UMP critical regions only appear in one-sided tests*, where all alternatives favor data that deviate in the same direction; this motivates the following definition.

> [!definition|*] Monotone Likelihood Ratio
> A model $f(x;\theta)$ has **monotone likelihood ratio (MLR)** if for any $\theta_{1} \ge \theta_{2}$, the likelihood ratio
> $x \mapsto \frac{L(x;\theta_{1})}{L(x;\theta_{2})}$
> can be written as a non-decreasing function of some statistic $T$.

Here $T$ can be thought of as a statistic that is more likely to take large values when $\theta$ is larger.

> [!examples]
> If $L$ is exponential with rate $\theta$, then the likelihood ratio is
> $\lambda(x)\propto \exp(-(\theta_{1}-\theta_{2})x),$
> which is monotone increasing in $T=-x$: intuitively, a high rate $\theta_{1}$ makes small values of $X$ more likely, hence large values of $T$.
>
> If $L$ is Gaussian (with known variance), then the ratio is
> $\begin{align*} \lambda(x) &\propto \exp\left( -\frac{(x-\theta_{1})^{2}-(x-\theta_{2})^{2}}{2\sigma ^{2}} \right)\\ &= \exp\left( \frac{\mathrm{const.}+2(\theta_{1}-\theta_{2})x}{2\sigma^{2}} \right), \end{align*}$
> so $T=x$ gives the MLR property.

> [!theorem|*] MLR in one-parameter exponential families
> If $f(\cdot;\theta)$ is a one-parameter [[Exponential Families|exponential family]] with form
> $f(x;\theta)=\exp(\eta(\theta)\cdot T(x)-B(\theta))h(x),$
> and $\eta(\theta)$ is monotone increasing in $\theta$ (e.g. $\eta=\theta$ or $\log \theta$), then it has MLR with the canonical statistic $T$.
>
> > [!proof]-
> > The likelihood ratio is
> > $\lambda(x)\propto\exp[(\eta(\theta_{1})-\eta(\theta_{2}))T(x)],$
> > where $\eta(\theta_{1})-\eta(\theta_{2})>0$ assuming $\theta_{1}> \theta_{2}$, so the ratio is monotone increasing in $T$.

> [!theorem|*] MLR models have one-sided UMP
> If $f(x;\theta)$ is MLR with statistic $T$, and $T$ is continuously distributed, then for testing $H_{0}:\theta \le \theta_{0}$ (or $H_{0}:\theta = \theta_{0}$) against $H_{1}: \theta > \theta_{0}$,
> - The test with critical region $C:=\{ x: T(x) > t_{0} \}$ is UMP among tests with the same size (i.e. $\mathbb{P}[X \in C ~|~ \theta_{0}]$);
> - We can find $t_{0}$ such that the size is $\alpha$, for any choice of $\alpha$.
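As a concrete sketch of this recipe for the exponential-rate example above (the sample size, $\theta_{0}$ and $\alpha$ below are arbitrary illustrative choices): since $T=-\sum_{i}x_{i}$, rejecting for large $T$ means rejecting when $\sum_{i}x_{i}$ is small, and the size-$\alpha$ cutoff comes from the $\mathrm{Gamma}(n,\theta_{0})$ null distribution of $\sum_{i}X_{i}$.

```python
import numpy as np
from scipy import stats

# Exponential(rate theta) samples; the MLR statistic from the example above is
# T = -sum(x), so "T large" means "sum(x) small".
# Testing H0: theta <= theta0 against H1: theta > theta0.
# n, theta0 and alpha are arbitrary illustrative choices.
rng = np.random.default_rng(1)
n, theta0, alpha = 15, 1.0, 0.05

# Under theta = theta0, sum(X_i) ~ Gamma(shape=n, rate=theta0), so the size-alpha
# UMP test rejects H0 when sum(x) falls below the lower-alpha quantile.
cutoff = stats.gamma.ppf(alpha, a=n, scale=1 / theta0)

# Check the size at the boundary theta = theta0, and the power at larger rates.
for theta in (theta0, 1.5, 2.0):
    x = rng.exponential(scale=1 / theta, size=(50_000, n))
    reject_rate = np.mean(x.sum(axis=1) <= cutoff)
    label = "size" if theta == theta0 else "power"
    print(f"theta = {theta:.1f}: estimated {label} = {reject_rate:.3f}")
```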
### Generalized Likelihood Ratio Tests

> [!bigidea]
> In general, likelihood ratio tests can determine whether a simpler model is good enough for the data.

Consider a hypothesis test of two composite hypotheses:
$\begin{align*} H_{0}:\, &\theta \in \Theta_{0}\\ H_{1}:\, &\theta \in \Theta_{1} \supset \Theta_{0} \end{align*}$
Hence $H_{0}$ is a simplification of $H_{1}$, in the sense that $\theta$ has fewer possible values.
- For example, for $H_{0}:N(0,\sigma^{2})$ and $H_{1}:N(\mu,\sigma^{2})$, the parameter spaces are $\Theta_{0}=\{ 0 \} \times \mathbb{R}^{>0}$ and $\Theta_{1}=\mathbb{R} \times \mathbb{R}^{>0}$.

The null hypothesis is **nested within** the alternative, and the test decides whether simplifying $H_{1}$ to $H_{0}$ is reasonable.
- We accept $H_{0}$ if the simpler $H_{0}$ explains the data reasonably well, which corresponds to a large likelihood ratio.

> [!definition|*] Generalized Likelihood Ratios
>
> For such a test, the **likelihood ratio** is defined to be
> $\lambda(\mathbf{x})=\frac{\sup_{\theta \in \Theta_{0}}L(\theta,\mathbf{x})}{\sup_{\theta \in \Theta_{1}}L(\theta,\mathbf{x})}=\frac{L(\hat{\theta}_{0},\mathbf{x})}{L(\hat{\theta}_{1},\mathbf{x})}$
> where $\hat{\theta}_{0},\hat{\theta}_{1}$ are the MLEs of $\theta$ over $\Theta_{0}$ and $\Theta_{1}$; these MLEs can differ, since $\Theta_{1}$ is larger.

The **(generalized) likelihood ratio test** is the test with the corresponding critical region
$C_{\alpha}=\{ \mathbf{x} : \lambda(\mathbf{x}) \le k_{\alpha} \}$
for some $k_{\alpha}$ chosen so that the size of the test is $\alpha$:
$\text{size}=\sup_{\theta \in \Theta_{0}}\mathbb{P}(\lambda(\mathbf{X})\le k_{\alpha}\,|\,\theta)=\alpha.$

> [!idea] Finding the Distribution of the Likelihood Ratio
> To actually get a test, we need to find such a $k_{\alpha}$ by:
> - $[1]$ directly computing it, if we can simplify the condition $\lambda(\mathbf{X}) \le k_{\alpha}$ to an equivalent condition on a statistic of $\mathbf{X}$ with a known null distribution. This is only possible for a few distributions of $\mathbf{X}$, e.g. $N(\mu,\sigma^{2})$ in example $3.9$ in the notes.
> - $[2]$ approximating the distribution of the likelihood ratio statistic $\Lambda=-2\log(\lambda(\mathbf{X}))$; see the following section.

### The Likelihood Ratio Statistic

> [!bigidea]
> The log-of-ratio-of-likelihoods is asymptotically $\chi^{2}$.

> [!definition|*] Likelihood Ratio Statistic
>
> Confusingly, the **likelihood ratio statistic** is essentially the *logarithm* of the likelihood ratio:
> $\begin{align*} \Lambda(\mathbf{X}) &\equiv -2\log \lambda(\mathbf{X})\\ &= -2\left(\log\left[\sup_{\theta \in \Theta_{0}}L(\theta,\mathbf{X})\right]-\log\left[\sup_{\theta \in \Theta_{1}}L(\theta,\mathbf{X})\right]\right) \end{align*}$

> [!theorem|*] Wilks' Theorem
>
> When we assume $H_{0}$ to be true, along with a few regularity conditions,
> $\Lambda(\mathbf{X})=-2\log(\lambda(\mathbf{X})) \overset{D}{\approx}\chi^{2}_{p}$
> where $p=\dim\Theta_{1}-\dim\Theta_{0}$. The dimension of a parameter space is the number of parameters that can be chosen freely, e.g. $2$ for $N(\mu,\sigma^{2})$.

^cd7692

Then asymptotically, the critical region of the generalized LRT is
$C \approx \{ \mathbf{x}:\, \Lambda(\mathbf{x}) \ge l_{\alpha} \}$
where $l_{\alpha}$ is the upper $\alpha$ quantile of $\chi_{p}^{2}$.
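To make Wilks' theorem concrete, here is a minimal simulation sketch for one nested pair chosen for illustration: an $\mathrm{Exponential}(\theta)$ rate model with $H_{0}:\theta=\theta_{0}$ against an unrestricted rate, so $p=1$. The sample size, $\theta_{0}$ and number of replications are arbitrary choices; under $H_{0}$ the simulated values of $\Lambda$ should have quantiles close to those of $\chi^{2}_{1}$.

```python
import numpy as np
from scipy import stats

# Nested pair for illustration: X_1,...,X_n iid Exponential(rate theta),
# H0: theta = theta0 (0 free parameters) vs H1: theta > 0 unrestricted
# (1 free parameter), so p = 1.  n, theta0, reps are arbitrary choices.
rng = np.random.default_rng(2)
n, theta0, reps = 50, 2.0, 20_000

def log_lik(x, theta):
    """Log-likelihood of an iid Exponential(rate theta) sample."""
    return x.size * np.log(theta) - theta * x.sum()

lam = np.empty(reps)
for r in range(reps):
    x = rng.exponential(scale=1 / theta0, size=n)   # data generated under H0
    theta_hat = 1 / x.mean()                        # unrestricted MLE of the rate
    lam[r] = -2 * (log_lik(x, theta0) - log_lik(x, theta_hat))

# Compare simulated quantiles of Lambda with chi^2_1 quantiles; they should be close.
for q in (0.90, 0.95, 0.99):
    print(f"{q:.0%} quantile: simulated {np.quantile(lam, q):.2f}, "
          f"chi2_1 {stats.chi2.ppf(q, df=1):.2f}")
```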
---
## Testing Goodness of Fit

Consider a sample of $n$ observations, each of which independently belongs to one of the categories $1,\dots,k$.

> [!definition|*] Terminologies in goodness of fit tests
>
> - The **observed** count of category $i$ is denoted $O_{i}$ or $n_{i}$.
> - A certain model would predict the **expected** count of category $i$, which is denoted $E_{i}$.
> - The **goodness of fit** refers to how well the model's expectations match the observed data.

In this case, $H_{0}$ is that the model is sufficient, and $H_{1}$ is usually that some more complex model is required (so $H_{0} \subseteq H_{1}$). Most commonly,
$\begin{align*} H_{0}: \,&\{ M\,|\, M \in \text{the family of models} \}\\ H_{1}: \,&\{ M\,|\, M \text{ is any model} \} \end{align*}$
where "any model" means the bare minimum of consistency: the probabilities assigned to the categories add up to exactly $1$.

### The Multinomial Distribution

> [!definition|*] Multinomial distributions
>
> A population following a **multinomial distribution** has probabilities $\pi=(\pi_{1},\dots,\pi_{k})$, where $\pi_{i}=\mathbb{P}(X \text{ is in category } i)$ for a random sample $X$ from the population. Hence $\sum_{i}\pi_{i}=1$.

For a multinomial distribution and an i.i.d. sample $\mathbf{X}=(X_{1},\dots,X_{n})$ with observed counts $n_{1},\dots,n_{k}$, the likelihood is
$L(\mathbf{X};\pi)=\frac{n!}{n_{1}!\cdots n_{k}!}\prod^{k}_{i=1}\pi_{i}^{n_{i}}$

As seen in the lectures, if no assumptions are made about $\pi$ (other than that $\sum_{i} \pi_{i}=1$), the MLE is
$\hat{\pi}_{i}=\frac{n_{i}}{n}=\text{proportion of category $i$ in the sample};$
in other words, this is the MLE over the space of "anything that makes sense".

### LRT on Models of Multinomials

> [!bigidea]
> Models of multinomials can be tested with the likelihood ratio statistic, and accepted/rejected based on their p-values.

The MLE above allows any $\pi$, as long as it sums to $1$; models might impose further restrictions on $\pi$, say dependence on a parameter $\theta$.

> [!examples] Hardy–Weinberg equilibrium
> For example, the three categories in Hardy–Weinberg equilibrium have probabilities
> $\pi(\theta)=\left( (1-\theta)^{2},\,2\theta(1-\theta),\,\theta^{2}\right)$
> In this case the MLE is denoted $\pi(\hat{\theta})$, a single-variable optimization over $\theta$.

Suppose we want to test a model $\pi(\theta)$ on the multinomial data, a simplification of the hypothesis of "anything that makes sense". This is an LRT on:
$\begin{align*} &H_{0}:\pi=\pi(\theta)\\ &H_{1}: \pi \text{ can be anything as long as } \sum_{i}\pi_{i}=1 \end{align*}$
The likelihood ratio statistic is then
$\begin{align*} \Lambda&= -2\log\left( \frac{\sup_{H_{0}}L}{\sup_{H_{1}}L} \right)\\ &= -2 \log\left( \frac{L(\pi(\hat{\theta}))}{L(\hat{\pi})} \right)\\ &= \overset{(\text{see notes})}{\dots}\\ &= 2\sum_{i}n_{i}\log\left( \frac{n_{i}}{n\pi_{i}(\hat{\theta})} \right)\\ &= 2\sum_{i}O_{i}\log\left( \frac{O_{i}}{E_{i}} \right) \end{align*}$
where $O_{i}=n_{i}$ and $E_{i}=n\pi_{i}(\hat{\theta})$ are the observed and expected counts.

The asymptotic distribution of $\Lambda$ is $\chi^{2}_{k-1-q}$, where $k$ is the number of categories and $q$ the number of parameters in the model (e.g. $1$ in Hardy–Weinberg).
- $\dim(\Theta_{0})=q$ by definition.
- $\dim(\Theta_{1})=k-1$ because we can freely choose the first $k-1$ terms in $\pi$, but $\pi_{k}=1-\sum_{i=1}^{k-1} \pi_{i}$ is determined.

Using $\Lambda$ as the statistic, we can then test the hypotheses $H_{0}$ and $H_{1}$ by the p-value computed from its distribution:
$\begin{align*} \Lambda&= 2\sum_{i}O_{i}\log\left( \frac{O_{i}}{E_{i}} \right) \sim \chi^{2}_{k-1-q} \,\,(\text{assuming $H_{0}$ is true})\\ p &= \mathbb{P}(\Lambda \ge \Lambda_{\text{obs}} \,|\, H_{0}) \end{align*}$
where $\Lambda_{\text{obs}}$ is the observed value.
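As a worked sketch of this recipe on the Hardy–Weinberg model: the genotype counts below are made up purely for illustration, and the restricted MLE $\hat{\theta}=(n_{2}+2n_{3})/(2n)$ comes from maximizing the restricted likelihood in $\theta$.

```python
import numpy as np
from scipy import stats

# Made-up observed counts for the three Hardy-Weinberg categories
# ((1-theta)^2, 2*theta*(1-theta), theta^2); purely illustrative numbers.
O = np.array([233, 385, 129])
n, k, q = O.sum(), 3, 1          # k categories, q = 1 free model parameter

# Restricted MLE of theta under the Hardy-Weinberg model.
theta_hat = (O[1] + 2 * O[2]) / (2 * n)
pi_hat = np.array([(1 - theta_hat) ** 2,
                   2 * theta_hat * (1 - theta_hat),
                   theta_hat ** 2])
E = n * pi_hat                   # expected counts under the fitted model

# Likelihood ratio statistic and its asymptotic chi^2_{k-1-q} p-value.
Lambda = 2 * np.sum(O * np.log(O / E))
p_value = stats.chi2.sf(Lambda, df=k - 1 - q)
print(f"theta_hat = {theta_hat:.3f}, Lambda = {Lambda:.3f}, p = {p_value:.3f}")
```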
### Pearson's Chi-Squared Test

Suppose we don't want to compute the logarithm in the likelihood ratio statistic, because we are lazy, or we don't have computers. A Taylor expansion of $O_{i}\log(O_{i} / E_{i})$ around $O_{i}=E_{i}$ gives
$O_{i}\log\left( \frac{O_{i}}{E_{i}} \right) \approx (O_{i}-E_{i})+\frac{(O_{i}-E_{i})^{2}}{2E_{i}},$
hence substituting into the likelihood ratio statistic gives
$\Lambda \approx \sum_{i} \frac{(O_{i}-E_{i})^{2}}{E_{i}}$
where the linear terms sum to $0$ since $\sum_{i}O_{i}=\sum_{i}E_{i}=n$.

> [!definition|*] Pearson's $\chi^2$ statistic and test
>
> This approximation is **Pearson's chi-squared statistic** $P$, and as its name suggests, when $H_{0}$ is true,
> $P \equiv \sum_{i} \frac{(O_{i}-E_{i})^{2}}{E_{i}} \approx \chi^{2}_{k-1-q}$
> Using $P \approx \chi^{2}_{k-1-q}$ to test $H_{0}$ is called **Pearson's chi-squared test**. Although the distribution is approximate, it usually gives the same result as the LRT based on the likelihood ratio statistic, provided there are enough samples in each category.
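A minimal sketch comparing the two statistics on the same made-up Hardy–Weinberg counts as in the previous sketch: `scipy.stats.chisquare` computes Pearson's $P$, and its `ddof` argument removes one degree of freedom for the estimated $\theta$, giving $k-1-q$ degrees of freedom.

```python
import numpy as np
from scipy import stats

# Same made-up Hardy-Weinberg counts and fitted expected counts as in the
# LRT sketch above; q = 1 parameter was estimated from the data.
O = np.array([233, 385, 129])
n, k, q = O.sum(), 3, 1
theta_hat = (O[1] + 2 * O[2]) / (2 * n)
E = n * np.array([(1 - theta_hat) ** 2,
                  2 * theta_hat * (1 - theta_hat),
                  theta_hat ** 2])

# Pearson's statistic by hand, and via scipy (ddof=q drops a dof for theta_hat).
P = np.sum((O - E) ** 2 / E)
P_scipy, p_value = stats.chisquare(O, f_exp=E, ddof=q)

# The LRT statistic on the same data, for comparison; the two should be close.
Lambda = 2 * np.sum(O * np.log(O / E))
print(f"Pearson P = {P:.3f} (scipy: {P_scipy:.3f}), p-value = {p_value:.3f}")
print(f"LRT Lambda = {Lambda:.3f}")
```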