Consider the standard two-hypothesis setup
$\begin{align*} &H_{0i}: Z_{i} \sim f_{0}, && \text{probability} = \pi_{0}\\ &H_{1i}: Z_{i} \sim f_{1}, && \text{probability} = \pi_{1} = 1-\pi_{0}. \end{align*}$

Definitions: for a subset $\mathcal{S}$ of the sample space $\mathcal{X}$,
$\begin{align*} N(\mathcal{S}) &:= \#\{ i ~|~ z_{i} \in \mathcal{S} \} && \text{no. of observations in } \mathcal{S}\\ N_{0}(\mathcal{S}) &:= \#\{ i ~|~ z_{i} \in \mathcal{S},~ H_{0i} \text{ is true} \} && \text{no. of null obs. in } \mathcal{S}\\ e(\mathcal{S}) &:= \mathbb{E}[N(\mathcal{S})] = N \cdot F(\mathcal{S}) && \text{expectation of } N(\mathcal{S})\\ e_{0}(\mathcal{S}) &:= \mathbb{E}[N_{0}(\mathcal{S})] = N \cdot F_{0}(\mathcal{S}) \cdot \pi_{0}, && \text{expectation of } N_{0}(\mathcal{S}) \end{align*}$
where of course $N_{0}$ is not observable.

## Estimation of Null Probability $\pi_{0}$

If we select a "null area" $\mathcal{A}$ where $f_{1}(x) = 0$ for all $x \in \mathcal{A}$ (the **zero assumption**), then
$\forall x \in \mathcal{A},~~ f(x) = \pi_{0}f_{0}(x) + \pi_{1}f_{1}(x) = \pi_{0}f_{0}(x),$
so assuming we know $f_{0}$,
$F(\mathcal{A}) := \int_{\mathcal{A}} f(x)~dx = \pi_{0}\int_{\mathcal{A}} f_{0}(x)~dx =: \pi_{0}F_{0}(\mathcal{A}),$
and estimating $F(\mathcal{A}) = \mathbb{E}[N(\mathcal{A})]/N$ with $N(\mathcal{A})/N$, we can estimate $\pi_{0}$ as
$\hat{\pi}_{0}(\mathcal{A}; f_{0}) := \frac{N(\mathcal{A})}{N \cdot F_{0}(\mathcal{A})}.$

Note that this estimate depends on both the choice of $\mathcal{A}$ and the null distribution $f_{0}$.

- A bad choice of $\mathcal{A}$ can violate the zero assumption, so that $N(\mathcal{A})$ counts many non-null observations and $\hat{\pi}_{0}$ is inflated.
- A bad choice of $f_{0}$ affects $F_{0}(\mathcal{A})$ -- usually this can be mitigated with an empirically estimated null $\hat{f}_{0}$ instead of an a priori selected null.

> [!exposition] How a bad $f_{0}$ causes issues
> In particular, if the true $f_{0}^{\text{truth}}$ has less mass in $\mathcal{A}$ than a theoretical $f_{0}^{\text{theo}}$, then our estimate $\hat{\pi}_{0}(\mathcal{A}; f_{0}^{\text{theo}}) = \frac{N(\mathcal{A})}{N \cdot \int_{\mathcal{A}} f_{0}^{\text{theo}}(x)~dx}$ will be biased downwards. So in fact, for a rejection region $\mathcal{R}$, *the estimated [[False Discovery Rate Control|tail false discovery rate]] using an incorrect $f_{0}^{\text{theo}}$, given by $\widehat{\mathrm{Fdr}}^{\text{theo}} := \frac{\hat{\pi}_{0}F_{0}^{\text{theo}}(\mathcal{R})}{\hat{F}(\mathcal{R})}$, is negatively correlated with the actual false discovery proportion $\mathrm{Fdp} := \frac{\#\{\text{null } z_{i} \in \mathcal{R}\}}{\#\{z_{i} \in \mathcal{R}\}} = \frac{N_{0}(\mathcal{R})/N}{\hat{F}(\mathcal{R})}$ that it is supposed to estimate*: with the total number of nulls fixed and $\mathcal{A}, \mathcal{R}$ disjoint, the larger $N_{0}(\mathcal{R})$ is, the smaller $N_{0}(\mathcal{A})$ is, and hence the smaller $\hat{\pi}_{0} := \hat{\pi}_{0}(\mathcal{A}; f_{0}^{\text{theo}}) \propto N(\mathcal{A}) = N_{0}(\mathcal{A})$ is (the equality holding under the zero assumption).

## Estimation of Null Distribution $f_{0}$

### Why Theoretical Null Distributions Fail

The first issue is identifiability: mass can be shifted freely between $f_{0}$ and $f_{1}$. To resolve this, we can require $f_{0}$ to come from a parametrized family and apply the zero assumption on some $\mathcal{A}$, doing inference on the subset
$\begin{align*} I(\mathcal{A}) &:= \{ i \in \{ 1,\dots,N \} ~|~ z_{i} \in \mathcal{A} \},\\ \mathbf{z}(\mathcal{A}) &:= (z_{i} ~|~ i \in I(\mathcal{A})). \end{align*}$
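Before setting up the likelihood, here is a minimal numerical sketch of the quantities defined so far -- $I(\mathcal{A})$, $\mathbf{z}(\mathcal{A})$, $N(\mathcal{A})$, and the plug-in estimate $\hat{\pi}_{0}(\mathcal{A}; f_{0})$ under a theoretical $\mathcal{N}(0,1)$ null. The simulation parameters and the central window $\mathcal{A} = [-1, 1]$ are illustrative assumptions, not choices made above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated z-values (illustrative): pi0 = 0.9 of them are null N(0, 1),
# the rest are non-null, shifted to N(3, 1).
N, pi0_true = 10_000, 0.9
is_null = rng.random(N) < pi0_true
z = np.where(is_null, rng.normal(0.0, 1.0, N), rng.normal(3.0, 1.0, N))

# Null area A = [-1, 1], chosen hoping the zero assumption f1 = 0 on A holds.
a_lo, a_hi = -1.0, 1.0
I_A = np.where((z >= a_lo) & (z <= a_hi))[0]   # I(A): indices with z_i in A
z_A = z[I_A]                                   # z(A)
N_A = len(I_A)                                 # N(A)

# F0(A) under the theoretical N(0, 1) null.
F0_A = norm.cdf(a_hi) - norm.cdf(a_lo)

# pi0_hat(A; f0) = N(A) / (N * F0(A))
pi0_hat = N_A / (N * F0_A)
print(f"pi0_hat = {pi0_hat:.3f}  (truth: {pi0_true})")
```

Note that the $\mathcal{N}(3,1)$ alternative still puts a little mass in $[-1,1]$, so the zero assumption is mildly violated here and $\hat{\pi}_{0}$ comes out slightly inflated -- exactly the failure mode in the first bullet above.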
For an $f_{0}$ parametrized by $\theta$, one can do MLE on the likelihood
$\begin{align*} l(\pi_{0}, \theta; \mathbf{z}(\mathcal{A})) &= \mathbb{P}[\text{there are } N(\mathcal{A}) \text{ observations in } \mathcal{A}] \\ &~~~~~~~~~~~~ \cdot (\text{their density in } \mathcal{A})\\[0.4em] &= \left\{ {N \choose N(\mathcal{A})} p^{N(\mathcal{A})}(1-p)^{N - N(\mathcal{A})} \right\} \cdot \left\{ \prod_{i \in I(\mathcal{A})} \frac{f_{0}(z_{i};\theta)}{F_{0}(\mathcal{A})} \right\}. \end{align*}$

- Here the first term is a binomial probability with $p := \pi_{0}F_{0}(\mathcal{A}) = \mathbb{P}[i\text{th observation is in } \mathcal{A}]$ for any $i$ -- it follows from assuming independence between the $Z_{i}$.
- The second term is the product of each $Z_{i}$'s conditional density given $\{ Z_{i} \in \mathcal{A} \}$.

If $f_{0}$ is from an [[Exponential Families|exponential family]], standard results allow us to optimize the two terms separately:

- The first term is maximized at $\hat{p} = N(\mathcal{A})/N$, i.e. $\hat{\pi}_{0} := N(\mathcal{A}) / (N \cdot F_{0}(\mathcal{A}))$.
- The second term usually requires iterated numerical optimization for $\theta$ (see the sketch below).

[[Empirical Bayes]]
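And a sketch of the second-term optimization. Assuming (illustratively) a normal empirical null $f_{0}(\cdot\,;\theta) = \mathcal{N}(\mu, \sigma^{2})$, the conditional likelihood over $\mathcal{A}$ is a truncated-normal likelihood, which we can maximize numerically. This continues the previous sketch (reusing `z_A`, `N_A`, `N`, `a_lo`, `a_hi`) and is one generic way to carry out the iteration, not a specific algorithm prescribed above.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_cond_loglik(params, z_A, a_lo, a_hi):
    """Negative log of prod_{i in I(A)} f0(z_i; theta) / F0(A; theta)
    for a normal null f0 = N(mu, sigma^2) restricted to A = [a_lo, a_hi]."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)   # parametrize by log(sigma) to keep sigma > 0
    F0_A = norm.cdf(a_hi, mu, sigma) - norm.cdf(a_lo, mu, sigma)
    return -(norm.logpdf(z_A, mu, sigma).sum() - len(z_A) * np.log(F0_A))

# z_A, N_A, N, a_lo, a_hi as computed in the previous sketch.
res = minimize(neg_cond_loglik, x0=[0.0, 0.0], args=(z_A, a_lo, a_hi),
               method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# Plug the fitted empirical null back into pi0_hat(A; f0_hat).
F0_A_hat = norm.cdf(a_hi, mu_hat, sigma_hat) - norm.cdf(a_lo, mu_hat, sigma_hat)
pi0_hat = N_A / (N * F0_A_hat)
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}, pi0_hat = {pi0_hat:.3f}")
```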