Estimating parameters, predicting new values, or inferring probability distributions are all decisions made based on data. This process can be formalized within decision theory.
Under the standard setup of $X \sim f(\cdot\,; \theta)$ and parameter space $\Theta$,
> [!definition|*] Decisions and Losses
> The **action space** $\mathcal{A}$ contains the individual **actions** that can be the outcome of the study, e.g. $\{ 0,1 \}$ for hypothesis testing, or $\Theta$ for parameter estimation.
>
> The **decision rule** $\Delta:\mathcal{X} \to \mathcal{A}$ maps data to actions. For example, it can be $\mathbf{1}_{X \in \mathcal{C}}$ for a hypothesis test with critical region $\mathcal{C}$.
>
> The set of decision rules is denoted $\mathcal{D}$.
>
> The **loss function** $L:\Theta \times \mathcal{A} \to \mathbb{R}^{+}$ measures how incorrect an action $A$ is, given that the truth is $\theta$. For examples, see [[Loss Functions]].
Decision rules can be **randomized** or **deterministic**:
- A deterministic rule maps data $\mathbf{X}$ to a single action $A$ as defined above.
- A randomized rule instead maps to a probability distribution on $\mathcal{A}$. Examples include stochastic algorithms like K-means and stochastic gradient descent.
- Of course a probabilistic mixture of deterministic rules produces a randomized rule.
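As a minimal sketch of the distinction (assuming hypothetical $\mathrm{N}(\theta,1)$ data, with point estimates of $\theta$ as the actions; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic rule: the data determine a single action (here, a point estimate of theta).
def delta_deterministic(x):
    return x.mean()

# Randomized rule: the data determine a distribution over actions, from which we sample.
# Here: an equal-weight mixture of two deterministic rules (sample mean and sample median).
def delta_randomized(x, rng=rng):
    rules = [np.mean, np.median]
    return rules[rng.integers(len(rules))](x)

x = rng.normal(loc=1.0, scale=1.0, size=20)   # data from N(theta = 1, 1)
print(delta_deterministic(x))   # the same action on every call
print(delta_randomized(x))      # the action itself is random
```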
<div style="page-break-after: always;"></div>
## Risks and Comparing Decision Rules
> [!summary] TLDR: Risks
> **Loss function** $L:\Theta \times \mathcal{A} \to \mathbb{R}^{+}$, error of an action $A$ for a particular $\theta$.
>
> **Frequentist risk** $R: \Theta \times \mathcal{D} \to \mathbb{R}^{+}$, error of a rule $\Delta$ for a particular $\theta$, averaged over inputs $X$ only.
>
> **Bayes risk** (or simply **risk**) $r: \mathcal{D} \to \mathbb{R}^{+}$ given $\theta \sim\pi$, error of a rule $\Delta$, averaged over both $\theta \sim \pi$ and $X \sim f(\cdot\,;\theta)$.
>
> **Posterior risk** $\Lambda: \mathcal{X} \times\mathcal{D} \to \mathbb{R}^{+}$ given $\theta \sim\pi$, error of a rule $\Delta$, averaged over all $\theta \sim \pi(\cdot\,|\, x)$.
> [!definition|*] Frequentist Risk, Admissibility
> The **frequentist risk** measures the expected loss $L$ of a decision rule $\Delta$, treating the "truth" $\theta$ as a constant: $R(\theta,\Delta):= \mathbb{E}_{X}[L(\theta, \Delta(X))]=\int _{\mathcal{X}}L(\theta, \Delta(x)) f(x;\theta)\, dx. $
> A rule $\Delta_{2}$ **strictly dominates** $\Delta_{1}$ if $R(\theta,\Delta_{1}) \ge R(\theta,\Delta_{2})$ is true $\forall \theta \in \Theta$, and the inequality is strict for some $\theta$. A decision rule is **inadmissible** if it is strictly dominated by some other rule.
^913a85
- Strict dominance only defines a partial order, so it needn't yield a unique best rule: two decision rules can each outperform the other under different values of $\theta$ (see the sketch below), so neither strictly dominates.
- Nor does it single out useful rules in general: for example with quadratic loss (MSE), [[Information and Bounding Errors|no estimator is uniformly better than a trivial estimator]].
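As an illustration (a Monte Carlo sketch under assumed $X_{1},\dots,X_{n} \sim \mathrm{N}(\theta,1)$ data and squared-error loss; the two rules and all constants are chosen only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 10, 20_000

def frequentist_risk(delta, theta):
    """Monte Carlo estimate of R(theta, Delta) = E_X[(Delta(X) - theta)^2]."""
    x = rng.normal(theta, 1.0, size=(n_rep, n))
    return np.mean((delta(x) - theta) ** 2)

delta_mean   = lambda x: x.mean(axis=1)          # Delta_1: sample mean
delta_shrunk = lambda x: 0.5 * x.mean(axis=1)    # Delta_2: shrink the mean toward 0

for theta in [0.0, 0.3, 1.0, 3.0]:
    r1 = frequentist_risk(delta_mean, theta)
    r2 = frequentist_risk(delta_shrunk, theta)
    print(f"theta = {theta:3.1f}:  R(mean) = {r1:.3f}   R(shrunk) = {r2:.3f}")
# The shrunken mean wins near theta = 0 but loses badly for large |theta|,
# so neither rule strictly dominates the other.
```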
### Bayes Rules
Bayes risk averages the frequentist risk with a prior $\pi(\theta)$, measuring loss on average, and *Bayes rules are those that do the best on average*.
> [!definition|*] Bayes Risk and Bayes Rules
> The **Bayes risk** of decision rule $\Delta$ is the expected frequentist risk, integrated over the prior $\pi(\theta)$: $r(\pi, \Delta):= \int R(\theta,\Delta)\pi(\theta) \, d\theta. $A decision rule $\Delta ^{*}$ is a **Bayes rule** or **Bayes estimator** wrt. $\pi$ if it minimizes the Bayes risk: $\Delta ^{*}= \underset{\Delta \in \mathcal{D}}{\arg\min} ~r(\pi,\Delta).$If such a rule exists for $\pi$, denote $r_{\pi}:= r(\pi, \Delta ^{*})$.
Bayes risk handles worst cases through the **least favorable prior** $\pi$, the prior that is hardest to deal with, i.e. $r_{\pi} \ge r_{\pi'}$ for any other prior $\pi'$.
- Note that the two risks $r_{\pi},r_{\pi'}$ are in general minimized by different rules.
Bayes rules are admissible under mild assumptions:
> [!lemma|*] Admissibility of Bayes Rules
> If $\Delta$ is a Bayes rule of prior $\pi$ and $r(\pi,\Delta) < \infty$, then it is admissible if any of the following is true:
> - $\Delta$ is the unique Bayes rule of $\pi$;
> - $R(\cdot\,;\Delta)$ is continuous in $\theta$, and $\pi$ is non-zero in $\Theta$ (i.e. we don't a priori assume some parameters to be impossible).
>
> > [!proof]-
> > Consider another rule $\Delta'$ and assume for contradiction that $\Delta$ is strictly dominated by $\Delta'$.
> >
> > $[1]$ We prove $r(\pi, \Delta) = r(\pi, \Delta')$ by showing both $\le$ and $\ge$. The $\le$ follows from the definition of Bayes rules. By strict dominance $\forall \theta \in \Theta,~ R(\theta,\Delta)\ge R(\theta, \Delta')$, and taking expectations over $\pi$ gives $\ge$. This makes $\Delta'$ a Bayes rule of $\pi$ too, contradicting uniqueness.
> >
> > $[2]$ Consider the set $A:= \{ \theta \,|\, R(\theta, \Delta') < R(\theta, \Delta) \}$ on which $\Delta$ does strictly worse than $\Delta'$; it is nonempty by strict dominance. By continuity of $R$, $A$ must contain an open subset $A ^{*}$, and $\pi > 0$ on $A^{*}$. Integrating $R(\theta,\Delta') \le R(\theta,\Delta)$ (strict on $A^{*}$) against $\pi$ then gives $r(\pi, \Delta') < r(\pi, \Delta)$, contradicting the minimality of $r(\pi, \Delta)$.
One particular case is when $r$
### Minimax Rules
Minimax rules are the other extreme, only considering the worst case scenario instead of the average.
> [!definition|*] Maximum Risk and Minimax Rules
> The maximum risk of a rule $\Delta$ is just the worst-case scenario: the supremum of frequentist risks, $\sup_{\theta \in \Theta}R(\theta;\Delta)$.
>
> The **minimax** rule minimizes this maximum risk: $\Delta ^{*}$ is minimax if $\Delta ^{*}=\underset{\Delta \in \mathcal{D}}{\arg\min}\left[ \sup_{\theta \in \Theta} R(\theta;\Delta) \right].$
- Since minimax focuses on the worst case, it *might guard against a worst-case $\theta$ that is highly unlikely, and degrade its average performance in doing so*. This is the mirror image of how quicksort is worst-case $O(n^{2})$ but $O(n\log n)$ on average: optimizing the average and optimizing the worst case are different objectives.
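Continuing the Monte Carlo sketch above (same assumed model and rules; the prior and parameter range are arbitrary choices), the two criteria can disagree about which rule is better:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_rep = 10, 10_000
theta_grid = np.linspace(-3.0, 3.0, 13)        # stand-in for Theta = [-3, 3]
prior_draws = rng.normal(0.0, 0.1, size=200)   # prior pi = N(0, 0.1^2), concentrated near 0

def frequentist_risk(delta, theta):
    x = rng.normal(theta, 1.0, size=(n_rep, n))
    return np.mean((delta(x) - theta) ** 2)

rules = {"mean":   lambda x: x.mean(axis=1),
         "shrunk": lambda x: 0.5 * x.mean(axis=1)}

for name, delta in rules.items():
    max_risk = max(frequentist_risk(delta, t) for t in theta_grid)           # sup_theta R(theta, Delta)
    bayes_risk = np.mean([frequentist_risk(delta, t) for t in prior_draws])  # r(pi, Delta)
    print(f"{name:6s}  max risk = {max_risk:.3f}   Bayes risk = {bayes_risk:.3f}")
# "shrunk" is better on average under this prior; "mean" is better in the worst case.
```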
The two notions of decision rules being the best on average (Bayes) and in the worst case (minimax) are connected in the following theorem:
> [!theorem|*] Bayes and Minimax Rules
> If a prior $\pi$ admits a Bayes rule with Bayes risk $r_{\pi}$, and there is a rule $\Delta$ with $\sup_{\theta}R(\theta,\Delta) \le r_{\pi},$then $\Delta$ must be both a Bayes rule (of $\pi$) and a minimax rule, with $R(\theta, \Delta)=r_{\pi}~\mathrm{a.e.}$ (wrt. $\pi$) in that case.
>
> Furthermore, it is the unique minimax rule if it is the unique Bayes rule.
>
> > [!proof]- Proof
> > $[\Delta\text{ is Bayes rule}]$ since $r(\pi,\Delta)\le \sup_{\theta}R(\theta,\Delta)\le r_{\pi} \le r(\pi,\Delta)$, where the first inequality is averaging over $\pi$, the second is the supremum assumption, and the last is the definition of $r_{\pi}$ as the minimal Bayes risk.
> >
> > $[\Delta \text{ is minimax}]$ since any other rule $\Delta' \ne \Delta$ has $\sup_{\theta}R(\theta,\Delta')\ge \mathbb{E}_{\theta \sim \pi}[R(\theta,\Delta')]=r(\pi,\Delta') \overset{(*)}{\ge} r_{\pi} \ge \sup_{\theta}R(\theta,\Delta).$
> > $[R(\theta,\Delta)=r_{\pi}~\mathrm{a.e.}]$ since otherwise $r(\pi, \Delta) < r_{\pi}$, giving a contradiction.
> >
> > $[\text{unique Bayes} \Rightarrow \text{unique minimax}]$ since the inequality $(*)$ will be strict if the Bayes rule is unique.
- Note that this theorem does not comment on the existence of such a rule $\Delta$, nor does an arbitrary Bayes rule satisfy the inequality in general.
In the case where $\Delta$ is itself a Bayes rule of $\pi$, the above theorem strengthens to the following:
> [!theorem|*] Bayes and Minimax Rule in Least Favorable Priors
> If the above inequality is achieved by a Bayes rule itself: $\sup_{\theta}R(\theta,\Delta_{\text{Bayes}}) \le r_{\pi},$then it is a minimax rule, and $\pi$ must be the least favorable prior.
>
> > [!proof]-
> > That $\Delta_{\text{Bayes}}$ is minimax is just the previous theorem.
> >
> > If $\pi'$ is any (other) prior with its Bayes rule $\Delta_{\text{Bayes}}'$, then $r_{\pi'}=\int R(\theta,\Delta'_{\mathrm{Bayes}})\,\pi'(\theta)\,d\theta\le\int R(\theta,\Delta_{\mathrm{Bayes}})\,\pi'(\theta)\,d\theta\le\sup_{\theta}R(\theta,\Delta_{\mathrm{Bayes}})=r_{\pi}.$So $\pi$ must be the least favorable prior.
>
- One corollary is that if a Bayes rule has constant frequentist risk over all $\theta$, then it is minimax.
- This can be useful for finding minimax Bayes rules if we can tweak the prior with hyperparameters $\phi$, say -- after finding a Bayes rule as a function of $x,\phi$, we can solve for $\phi$ that makes the frequentist risk constant, if there is a solution.
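For example (the standard Beta-Binomial case, sketched with `sympy`; the symbols follow an assumed setup $X \sim \mathrm{Binom}(n,\theta)$ with a symmetric $\mathrm{Beta}(a,a)$ prior): solving for the hyperparameter $a$ that flattens the frequentist risk recovers the classic minimax estimator $(x+\sqrt{n}/2)/(n+\sqrt{n})$.

```python
import sympy as sp

theta, a, n = sp.symbols("theta a n", positive=True)

# Bayes estimator under a Beta(a, a) prior and l2 loss: the posterior mean,
# which for X ~ Binom(n, theta) is delta(X) = (X + a) / (n + 2a).
bias = (n * theta + a) / (n + 2 * a) - theta
var  = n * theta * (1 - theta) / (n + 2 * a) ** 2
risk = sp.expand(bias ** 2 + var)    # frequentist risk R(theta, delta) = bias^2 + variance

# Solve for the hyperparameter a that makes the risk constant in theta.
a_star = sp.solve(sp.Eq(risk.coeff(theta, 2), 0), a)[0]
print(a_star)                                               # sqrt(n)/2
print(sp.simplify(risk.coeff(theta, 1).subs(a, a_star)))    # 0: the linear term vanishes too
print(sp.simplify(risk.subs(a, a_star)))                    # a constant (= n / (4*(n + sqrt(n))**2)), free of theta
```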
### Posterior Risks
> [!idea] Minimizing the posterior risk is a handy way of finding the Bayes rule.
> [!definition|*] Posterior Risk
> With prior $\theta \sim \pi$ and observed data $X=x$, the **posterior risk** of a rule $\Delta$ wrt. the prior $\pi$ is $\Lambda(x,\Delta):= \mathbb{E}_{\theta \,|\, x}[L(\theta,\Delta(x)) \,|\, X=x]=\int L(\theta,\Delta(x)) \cdot \pi(\theta \,|\, x) \, d\theta.$Since $\Delta$ only matters through its action $\Delta(x)$, we may also pass an action $A$ (or just $y \in \mathbb{R}$ for prediction tasks) in its stead.
Only Bayes rules can (but not necessarily do) minimize the posterior risk. This is because *Bayes risk equals the expected posterior risk* (see proof).
> [!theorem|*] Posterior Risk Minimizer is Bayes
> If a prior $\pi$ allows finite Bayes risk, and the function $c: x \mapsto \underset{y \in \mathcal{A}}{\arg\min} ~\Lambda(x, y)$ is defined $\mathrm{a.e.}$ in $\mathcal{X}$, then $\Delta(x):= c(x)$ is a Bayes rule of $\pi$.
>
> That is, if we can minimize $\Lambda$ pointwise in $x$ by choosing some action $y$ based on data $x$, then this map from $x$ to $y$ must be the Bayes rule, assuming finite risk.
>
>
> > [!proof]-
> > Bayes risk of $\Delta$ equals the expected posterior risk: $r(\pi,\Delta)= \mathbb{E}_{\theta,X}[L(\theta, \Delta(X))]=\mathbb{E}_{X}\Big[\mathbb{E}_{\theta \,|\, X}[L(\theta, \Delta(X))]\Big]= \mathbb{E}_{X}[\Lambda(X, \Delta)].$Hence if $\Delta$ minimizes $\Lambda(x,\,\cdot)~\mathrm{a.e.}$, it also minimizes the Bayes risk when the expectation is finite (by assumption).
> [!examples] Bayes Estimator of Binomial Parameter
> Suppose $X \sim \mathrm{Binom}(n, \theta)$ with prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$. Then the posterior is $\theta ~|~ \{ X=x \} \sim \mathrm{Beta}(\alpha+x, \beta+n-x),$so the Bayes rule can be found by minimizing the posterior risk.
>
> For example, the posterior risk under $l_{2}$ loss is minimized by the posterior mean $(\alpha + x) / (\alpha +\beta + n)$, which is a weighted average of the prior mean $\alpha / (\alpha + \beta)$ and the MLE $x / n$.
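A quick numerical check of this (a sketch with `scipy`, using arbitrary illustrative values $\alpha=2,\beta=3,n=10,x=7$): minimizing the posterior risk $\Lambda(x,\,\cdot)$ under $l_{2}$ loss directly, by numerical integration, recovers the posterior mean.

```python
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import beta

# Posterior theta | x ~ Beta(alpha + x, beta + n - x) for arbitrary illustrative values.
alpha_, beta_, n, x = 2.0, 3.0, 10, 7
posterior = beta(alpha_ + x, beta_ + n - x)

def posterior_risk(y):
    """Lambda(x, y) = E[(theta - y)^2 | X = x], computed by numerical integration."""
    return quad(lambda t: (t - y) ** 2 * posterior.pdf(t), 0.0, 1.0)[0]

y_star = minimize_scalar(posterior_risk, bounds=(0.0, 1.0), method="bounded").x
print(y_star)                                   # numerically ~0.600
print((alpha_ + x) / (alpha_ + beta_ + n))      # posterior mean (alpha + x)/(alpha + beta + n) = 0.6
```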
<div style="page-break-after: always;"></div>
## Decision Theory in Machine Learning
For an estimation problem $Y=f(X)$ with loss $L$, suppose we have a dataset $D=(\mathbf{X}, \mathbf{Y})$ and use it to fit an estimate $\hat{h}$ from $\mathcal{F}$, the space of all deterministic relationships mapping $\mathcal{X} \to \mathcal{Y}$.
- Here $f$ is some general relationship, which need not be deterministic; $\hat{h}$, however, must be deterministic.
### Risk due to Random Data
> [!idea] In practice, we do not get to access the true distribution of $(X,Y)$, and have to rely on a sampled dataset $D=(\mathbf{x}, \mathbf{y})$. This causes the fitted rule to deviate from the optimal Bayes rule.
The risk of an estimated rule $\hat{h}$ is $R(\hat{h})=\mathbb{E}_{X,Y}[L(\hat{h}(X), Y)],$where the training dataset $D$ (hence $\hat{h}$) is treated as a constant, and the only randomness is the $(X,Y)$ at which the loss is computed. For $l_{2}$ loss, this becomes $R(\hat{h})=\mathbb{E}_{X,Y}[(\hat{h}(X)-Y)^{2}].$
If $h^{\ast}$ is the minimizer of $R$ (constrained or not), then the **excess risk** of $\hat{h}$ is $R(\hat{h})-R(h^{\ast})$, i.e. the gap between the estimate and the absolute best.
For $l_{2}$ loss, the excess risk of $\hat{h}$ is: $\begin{align*} R(\hat{h})&-R(h^{\ast})\\
&= \mathbb{E}_{X}\Big[\mathbb{E}_{Y~|~X}\big[(Y-\hat{h}(x))^{2}-(Y-h^{\ast}(x))^{2}~|~ X=x\big]\Big]\\
&= \mathbb{E}_{X}\Big[\mathbb{E}_{Y~|~X}\big[\hat{h}(x)^{2}-h^{\ast}(x)^{2}-2Y\hat{h}(x)+2Yh^{\ast}(x)~|~ X=x\big]\Big]\\
&= \mathbb{E}_{X}\Big[ \hat{h}(x)^{2}-2h^{\ast}(x)\hat{h}(x)+h^{\ast}(x)^{2}\Big]\\
&= \mathbb{E}_{X}[(\hat{h}(x)-h^{\ast}(x))^{2}],
\end{align*}$where we used the identity $\mathbb{E}_{Y ~|~ X}[Y ~|~ X=x]=h^{\ast}(x)$, i.e. that the conditional mean is the unconstrained minimizer under $l_{2}$ loss.
### Empirical Risk and Overfitting
> [!definition|*] Empirical Risk
> The **empirical risk** of a rule $h$ on a dataset $D$ is $\hat{R}_{D}(h):= \frac{1}{| D |}\sum_{i=1}^{| D |}L(h(x_{i}), y_{i}),$i.e. the averaged loss.
>
> Note that the empirical risk depends on the dataset, and if $D$ is used to train (estimate) the rule $h=\hat{h}_{D}$, it is also called the **training error**.
^2aa114
The **generalization gap** of a rule $\hat{h}_{D}$ fitted on the dataset $D$ is $R(\hat{h}_{D})-\hat{R}_{D}(\hat{h}_{D})$, i.e. the frequentist risk minus the training error -- a model might achieve perfect accuracy on the training set by memorizing both the underlying pattern and random noise, but it will generalize poorly to a new dataset. In this case the model is said to have **overfit**.
Because of this, training error is not indicative of risk in general: in particular, $\mathbb{E}_{D}[\hat{R}_{D}(\hat{h}_{D})] \le R(h^{\ast}) \le R(\hat{h}_{d}),$where the $d$ in $\mathrm{RHS}$ is any fixed dataset, and $h^{\ast}$ is assumed to lie in the class over which we fit. That is, *the expected training error is even less than the risk of the optimal Bayes rule*.
> [!proof]-
> This is because the $\mathrm{LHS}$ integrand satisfies $\hat{R}_{D}(\hat{h}_{D})\le \hat{R}_{D}(g)$ for any rule $g$ fixed independently of $D$ (in particular $g=h^{\ast}$ or $g=\hat{h}_{d}$), since $\hat{h}_{D}$ is selected by ERM on that same dataset. Now average over $D$: because the observations in $D$ are drawn independently (of each other and of $d$), $\mathbb{E}_{D}[\hat{R}_{D}(g)]=R(g)$. Taking $g=h^{\ast}$ gives the first inequality, and $R(h^{\ast})\le R(\hat{h}_{d})$ holds because $h^{\ast}$ minimizes the risk.
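A small simulation of the gap (a sketch; the sinusoidal truth, noise level, and degree-$9$ polynomial class are arbitrary choices): the training error of an over-flexible ERM fit sits below the Bayes risk, while its true risk sits above it.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Data from Y = sin(2*pi*X) + eps, X ~ U(0,1), eps ~ N(0, 0.3^2)."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

# ERM over a flexible class (degree-9 polynomials) on a small training set D.
x_train, y_train = sample(15)
h_hat = np.polyfit(x_train, y_train, deg=9)

train_error = np.mean((np.polyval(h_hat, x_train) - y_train) ** 2)  # empirical risk on D
x_test, y_test = sample(100_000)
risk_estimate = np.mean((np.polyval(h_hat, x_test) - y_test) ** 2)  # ~ R(h_hat)

print(f"training error ~ {train_error:.3f}")
print(f"risk of h_hat  ~ {risk_estimate:.3f}")
print(f"Bayes risk     = {0.3 ** 2:.3f}  (noise variance; risk of h*(x) = sin(2*pi*x))")
# Typically: training error < Bayes risk < risk of the fitted rule.
```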
### Risk due to Modeling
> [!idea] If we choose the estimated relationship from a model $\mathcal{H}$, doing so causes deviations from the (unconstrained) Bayes rule.
The optimization of Bayes risk happens in the space of all rules $\mathcal{F}=\{ f: \mathcal{X} \to \mathcal{Y} \}$, but this space is too complex to handle, so instead we restrict the search by requiring $\hat{h} \in \mathcal{H} \subset \mathcal{F},$where $\mathcal{H}$ is the **hypothesis class**, a set of functions we can "pick from" and realistically optimize over.
- E.g. the set of hyperplanes in $\mathcal{X} \times \mathcal{Y}$ for linear regression, or the set of piecewise polynomials with degree at most $3$ for cubic splines.
This gives the optimizer $h^{\ast}_{\mathcal{H}} \in \mathcal{H}$, the new target that is actually achievable.
However, this restriction itself incurs risk, leading to the estimation-approximation decomposition of the excess risk:
> [!theorem|*] Estimation-Approximation Error
> $\text{excess risk}=R(\hat{h})-R(h^{\ast})=\underbrace{R(\hat{h})-R(h^{\ast}_{\mathcal{H}})}_{\text{estimation error}} + \underbrace{R(h^{\ast}_{\mathcal{H}})-R(h^{\ast})}_{\text{approximation error}},$where the **estimation error** is caused by fitting $\hat{h}$ on a random, finite dataset, and the **approximation error** is caused by restricting to the hypothesis class $\mathcal{H}$.
This is analogous to the bias-variance trade-off, where the estimation error plays the role of variance, and the approximation error is the bias due to restricting the scope to $\mathcal{H}$.
- A larger hypothesis class $\mathcal{H}$ is harder to fit, hence increases estimation error.
- A smaller hypothesis class might be further away from the true best $h^{\ast}$, causing high approximation error.
- However, *the decompositions are not equivalent* (even under $l_{2}$ loss), as estimation-approximation uses the intermediary $h^{\ast}_{\mathcal{H}}$ (a suboptimal goal), while bias-variance uses $\bar{h}:=\mathbb{E}_{\mathcal{D}}[\hat{h}_{\mathcal{D}}]$ (the average result) -- see below.
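A rough numerical sketch of the decomposition (with an assumed truth $h^{\ast}(x)=\sin(2\pi x)$, noise level $0.3$, and polynomial hypothesis classes; the best-in-class rule $h^{\ast}_{\mathcal{H}}$ is approximated by fitting on a very large dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
h_star = lambda x: np.sin(2 * np.pi * x)     # true regression function (the unconstrained optimum)

def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, h_star(x) + rng.normal(0.0, sigma, n)

def risk(coefs, grid=np.linspace(0.0, 1.0, 10_001)):
    """R(h) = E[(Y - h(X))^2] = sigma^2 + E[(h*(X) - h(X))^2] for X ~ U(0,1), via a fine grid."""
    return sigma ** 2 + np.mean((h_star(grid) - np.polyval(coefs, grid)) ** 2)

x_small, y_small = sample(30)        # small training set -> h_hat (carries estimation error)
x_large, y_large = sample(200_000)   # huge training set  -> ~ h*_H, the best rule in the class

for deg in [1, 5]:                   # two hypothesis classes: linear vs degree-5 polynomials
    h_hat  = np.polyfit(x_small, y_small, deg)
    h_best = np.polyfit(x_large, y_large, deg)
    estimation    = risk(h_hat) - risk(h_best)    # R(h_hat) - R(h*_H)
    approximation = risk(h_best) - sigma ** 2     # R(h*_H)  - R(h*), where R(h*) = sigma^2
    print(f"deg = {deg}:  estimation ~ {estimation:.3f}   approximation ~ {approximation:.3f}")
# The linear class has a large approximation error but a small estimation error;
# the richer degree-5 class trades approximation error for estimation error.
```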
### Bayes Rules in ML
> [!theorem|*] Bayes Rule in Estimation
> If unconstrained, the Bayes rule (or Bayes predictor) is given by $h^{\ast}(x)=\underset{\hat{y}}{\arg\min}~\mathbb{E}[L(Y, \hat{y}) ~|~ X=x],$but obviously we need the distribution $Y ~|~X$: if it is $g(y ;x)$, then the Bayes predictor is $\begin{align*}
h^{\ast}(x)&= \underset{\hat{y}}{\arg\min}~\mathbb{E}_{Y ~|~ X}[L(Y, \hat{y}) ~|~ X=x]\\
&= \underset{\hat{y}}{\arg\min}~ \int _\mathcal{Y} L(y,\hat{y}) g(y;x) ~ dy.
\end{align*}$
> [!idea] If the objective/loss is strictly convex (e.g. quadratic loss), this rule is unique, hence admissible.
In general, to find the Bayes rule we fix $X=x$ and choose the action minimizing the conditional expected loss, which requires knowing (or estimating) the conditional distribution $Y ~|~ X$.
For example, a classification problem with classifier $g:\mathcal{X}\to \{ 1,\dots,K \}$ has risk $\mathbb{E}_{X,Y}[L(g(X), Y)],$
which reduces to the following with **0-1 loss**: $\mathbb{E}_{X}[\mathbb{P}[Y \ne g(x) ~|~ X=x]],$which is optimized by $\begin{align*}
g^{\ast}(x)&= \underset{k}{\arg\min}~\mathbb{P}[Y \ne k ~|~ X=x]\\
&= \underset{k}{\arg\max}~\mathbb{P}[Y=k ~|~ X=x],
\end{align*}$i.e. the most likely class given $X=x$.
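As a small sketch (with assumed class priors and Gaussian class-conditional densities, chosen arbitrarily): the Bayes classifier picks the class with the larger posterior probability, equivalently the larger $\pi_{k}\,f_{k}(x)$.

```python
import numpy as np
from scipy.stats import norm

# Assumed model: Y = 0 w.p. 0.7 with X | Y=0 ~ N(0,1);  Y = 1 w.p. 0.3 with X | Y=1 ~ N(2,1).
priors = np.array([0.7, 0.3])
cond = [norm(0.0, 1.0), norm(2.0, 1.0)]

def bayes_classifier(x):
    """Under 0-1 loss: pick the class maximizing P[Y = k | X = x], i.e. prior_k * f_k(x)."""
    joint = np.array([p * d.pdf(x) for p, d in zip(priors, cond)])
    return np.argmax(joint, axis=0)

# Monte Carlo estimate of its risk (the Bayes error rate) -- no classifier can do better.
rng = np.random.default_rng(0)
y = (rng.random(100_000) < priors[1]).astype(int)
x = np.where(y == 1, rng.normal(2.0, 1.0, y.size), rng.normal(0.0, 1.0, y.size))
print("estimated Bayes error:", np.mean(bayes_classifier(x) != y))
```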
### Risks of Point Estimates
In estimating $\theta$ with point estimate $\hat{\theta}:\mathcal{X} \to \Theta$, the common loss functions $L(\theta, \hat{\theta})$ include:
- 0-1 loss on interval with radius $b$: $\mathbf{1}_{|\theta-\hat{\theta}|>b}$.
- $l_{1}$ (linear) loss: $| \theta-\hat{\theta} |$.
- $l_{2}$ (quadratic) loss: $(\theta-\hat{\theta})^{2}$.
> [!theorem|*] Bayes Rules for the Three Losses
> The Bayes rules for the losses are:
> - Posterior mode for 0-1 loss when $b \to 0$ and $\pi(\theta \,|\, x)$ is continuous.
> - Posterior median for $l_{1}$ loss.
> - Posterior mean for $l_{2}$ loss.
>
> > [!proof]-
> > We find the rules that minimize the posterior risk; by the theorem above, this suffices for them to be Bayes rules.
> > $[\text{0-1 loss}]$ The posterior risk is $\Lambda(x, \hat{\theta})=1-\mathbb{P}[| \theta-\hat{\theta} |<b \,|\, X=x] \approx 1- 2b \cdot\pi(\hat{\theta} \,|\, x),$as $b \to 0$, and this is minimized by the posterior mode.
> >
> > $[l_{1}\text{ loss}]$ The posterior risk is $\Lambda(x, \hat{\theta})=\int _{(-\infty,\hat{\theta})}(\hat{\theta}-\theta)\cdot \pi(\theta \,|\, x) \, d\theta +\int _{(\hat{\theta}, \infty)}(\theta - \hat{\theta})\cdot \pi(\theta \,|\, x) \, d\theta.$Its derivative is $\frac{ \partial \Lambda(x,\hat{\theta}) }{ \partial \hat{\theta} } =\mathbb{P}[\theta < \hat{\theta} \,|\,X= x] - \mathbb{P} [\theta > \hat{\theta} \,|\, X=x].$Setting it to zero gives the posterior median as the minimizer.
> >
> > $[l_{2} \text{ loss}]$ Apply bias-variance decomposition to the MSE $\mathbb{E}[(\theta-\hat{\theta})^{2}\,|\, X=x]$ while treating $\theta$ as a variable "approximating" the constant $\hat{\theta}$. This gives $\begin{align*}
> \Lambda(x,\hat{\theta})&= \mathbb{E}[(\theta-\hat{\theta})^{2}\,|\, X=x]\\
> &= \underbrace{(\mathbb{E}[\theta \,|\, X=x]-\hat{\theta})^{2}}_{\text{bias}^{2}} + \mathrm{Var}(\theta \,|\, X=x).
> \end{align*}$Since we can't do anything to $\mathrm{Var}(\theta \,|\, X=x)$, we can minimize the bias to 0 by setting $\hat{\theta} = \mathbb{E}[\theta \,|\, X=x]$, the posterior mean.
^bf20fe
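As a quick illustration (a sketch with an arbitrary skewed posterior $\theta \,|\, x \sim \mathrm{Beta}(2,5)$): the three losses lead to three different point estimates.

```python
from scipy.stats import beta

posterior = beta(2, 5)          # an assumed, skewed posterior theta | x

post_mean   = posterior.mean()         # Bayes estimate under l2 (quadratic) loss
post_median = posterior.median()       # Bayes estimate under l1 (absolute) loss
post_mode   = (2 - 1) / (2 + 5 - 2)    # mode of Beta(a, b) is (a-1)/(a+b-2); 0-1 loss with radius -> 0

print(post_mean, post_median, post_mode)    # ~0.286, ~0.264, 0.200 -- three different estimates
```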
<div style="page-break-after: always;"></div>
## Decision Theory for Hypothesis Testing
Suppose the test is done by the rule $\Delta:\mathcal{X} \to \{ 0,1 \}$, where $1$ indicates rejection of the null hypothesis.
For simplicity, let the loss be constant in each hypothesis, i.e. $L(\Delta (x),\theta)=\begin{cases}
a &\text{if }H_{0} \text{ and }\Delta(x) = 1 \text{ (rejected $H_{0}$)}, \\
b &\text{if }H_{1} \text{ and }\Delta(x) = 0 \text{ (failed to reject $H_{0}$)}.
\end{cases}$and $0$ when the decision is correct. Here $a,b$ can be different constants -- if the test is for poison ($H_{1}$ being poisonous), choosing a large $b$ relative to $a$ amounts to rather being safe than sorry.
Let $g_{0},g_{1}$ be the marginal likelihoods of $X$ under $H_{0},H_{1}$, so $g_{i}:= \int _{\Theta_{i}} f(x;\theta)\, \pi(\theta \,|\, H_{i})~ d\theta, $where $\pi(\cdot \,|\, H_{i})$ is the prior conditioned on $H_{i}$; in particular $g_{i}=f(\cdot;\theta_{i})$ for simple hypotheses. Then,
> [!theorem|*] Likelihood Ratio Test is Bayes
> The likelihood ratio test with critical region $\left\{ x: \frac{g_{0}(x)}{g_{1}(x)} \le \frac{b\pi_{1}}{a\pi_{0}} \right\}$ is a Bayes rule. Here $\pi_{i}$ are the prior probabilities of the two hypotheses.
>
> >[!proof]-
> > The Bayes risk is $\begin{align*}
r(\Delta)&= a \cdot\pi_{0}\mathbb{P}[X \in C ~|~ H_{0}]+b \cdot \pi_{1}\mathbb{P}[X \notin C ~|~ H_{1}]\\
&= a\pi_{0} \int _{C} g_{0} ~ dx + b\pi_{1}\int _{\Omega-C} g_{1}~ dx \\
&= b\pi_{1}+\int _{C} a\pi_{0}g_{0}-b\pi_{1}g_{1} ~ dx.
\end{align*}$Therefore we choose $C$ that minimizes the second term, i.e. wherever the integrand is negative. This is exactly the LRT given above.
In particular, when $a=b=1$, this is the **maximum a posteriori (MAP)** test, i.e. $\Delta(x)= \underset{i}{\arg\max}~\mathbb{P}[H_{i} ~|~ X=x].$
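A small simulation of this (a sketch with assumed simple hypotheses $H_{0}: X \sim \mathrm{N}(0,1)$ vs $H_{1}: X \sim \mathrm{N}(2,1)$, prior weights $\pi_{0}=0.8,\pi_{1}=0.2$, and asymmetric losses $a=1, b=5$):

```python
import numpy as np
from scipy.stats import norm

pi0, pi1 = 0.8, 0.2      # prior probabilities of H0, H1
a, b = 1.0, 5.0          # b >> a: missing H1 ("poison") is costlier than a false alarm
g0, g1 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf    # likelihoods under the simple hypotheses

def bayes_test(x):
    """Reject H0 (return 1) exactly when g0(x)/g1(x) <= b*pi1 / (a*pi0)."""
    return (g0(x) / g1(x) <= (b * pi1) / (a * pi0)).astype(int)

# Monte Carlo estimate of the Bayes risk of this rule.
rng = np.random.default_rng(0)
h1 = rng.random(100_000) < pi1                      # which hypothesis generated each draw
x = np.where(h1, rng.normal(2.0, 1.0, h1.size), rng.normal(0.0, 1.0, h1.size))
d = bayes_test(x)
loss = np.where(h1, b * (d == 0), a * (d == 1))     # loss a for a false rejection, b for a miss
print("estimated Bayes risk:", loss.mean())         # no other test should do better on average
```

Setting $a=b$ makes the threshold $\pi_{1}/\pi_{0}$, recovering the MAP test above.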