Estimating parameters, predicting new values, or inferring probability distributions are all decisions made based on data. This process can be formalized within decision theory.
Under the standard setup of $X \sim f(\cdot\,; \theta)$ and parameter space $\Theta$,
> [!definition|*] Decisions and Losses
> The **action space** $\mathcal{A}$ contains the individual **actions** that can be the outcome of the study, e.g. $\{ 0,1 \}$ for hypothesis testing, or $\Theta$ for parameter estimation.
>
> The **decision rule** $\Delta:\mathcal{X} \to \mathcal{A}$ maps data to actions. For example, it can be $\mathbf{1}_{X \in \mathcal{C}}$ for a hypothesis test with critical region $\mathcal{C}$.
>
> The set of decision rules is denoted $\mathcal{D}$.
>
> The **loss function** $L:\Theta \times \mathcal{A} \to \mathbb{R}^{+}$ measures how incorrect an action $A$ is, given that the truth is $\theta$. For examples, see [[Loss Functions]].
Decision rules can be **randomized** or **deterministic**:
- A deterministic rule maps data $\mathbf{X}$ to a single action $A$ as defined above.
- A randomized rule instead maps to a probability distribution on $\mathcal{A}$. Examples include stochastic algorithms like K-means and stochastic gradient descent.
- Of course a probabilistic mixture of deterministic rules produces a randomized rule.
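As a minimal sketch of the distinction (assuming hypothetical $\mathrm{N}(\theta,1)$ data, with point estimates of $\theta$ as the actions; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic rule: the data determine a single action (here, a point estimate of theta).
def delta_deterministic(x):
    return x.mean()

# Randomized rule: the data determine a distribution over actions, from which we sample.
# Here: an equal-weight mixture of two deterministic rules (sample mean and sample median).
def delta_randomized(x, rng=rng):
    rules = [np.mean, np.median]
    return rules[rng.integers(len(rules))](x)

x = rng.normal(loc=1.0, scale=1.0, size=20)   # data from N(theta = 1, 1)
print(delta_deterministic(x))   # the same action on every call
print(delta_randomized(x))      # the action itself is random
```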
<div style="page-break-after: always;"></div>
## Risks and Comparing Decision Rules
> [!summary] TLDR: Risks
> **Loss function** $L:\Theta \times \mathcal{A} \to \mathbb{R}^{+}$, error of an action $A$ for a particular $\theta$.
>
> **Frequentist risk** $R: \Theta \times \mathcal{D} \to \mathbb{R}^{+}$, error of a rule $\Delta$ for a particular $\theta$, averaged over inputs $X$ only.
>
> **Bayes risk** (or simply **risk**) $r: \mathcal{D} \to \mathbb{R}^{+}$ given $\theta \sim\pi$, error of a rule $\Delta$, averaged over both $\theta \sim \pi$ and $X \sim f(\cdot\,;\theta)$.
>
> **Posterior risk** $\Lambda: \mathcal{X} \times\mathcal{D} \to \mathbb{R}^{+}$ given $\theta \sim\pi$, error of a rule $\Delta$, averaged over all $\theta \sim \pi(\cdot\,|\, x)$.
> [!definition|*] Frequentist Risk, Admissibility
> The **frequentist risk** measures the expected loss $L$ of a decision rule $\Delta$, treating the "truth" $\theta$ as a constant: $R(\theta,\Delta):= \mathbb{E}_{X}[L(\theta, \Delta(X))]=\int _{\mathcal{X}}L(\theta, \Delta(x)) f(x;\theta)\, dx. $
> A rule $\Delta_{2}$ **strictly dominates** $\Delta_{1}$ if $R(\theta,\Delta_{1}) \ge R(\theta,\Delta_{2})$ is true $\forall \theta \in \Theta$, and the inequality is strict for some $\theta$. A decision rule is **inadmissible** if it is strictly dominated by some other rule.
^913a85
- Strict dominance only defines a partial order, so it needn't yield a unique best rule: two decision rules can each outperform the other under different values of $\theta$ (see the sketch below), so neither strictly dominates.
- Nor does it single out useful rules in general: for example with quadratic loss (MSE), [[Information and Bounding Errors|no estimator is uniformly better than a trivial estimator]].
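As an illustration (a Monte Carlo sketch under assumed $X_{1},\dots,X_{n} \sim \mathrm{N}(\theta,1)$ data and squared-error loss; the two rules and all constants are chosen only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 10, 20_000

def frequentist_risk(delta, theta):
    """Monte Carlo estimate of R(theta, Delta) = E_X[(Delta(X) - theta)^2]."""
    x = rng.normal(theta, 1.0, size=(n_rep, n))
    return np.mean((delta(x) - theta) ** 2)

delta_mean   = lambda x: x.mean(axis=1)          # Delta_1: sample mean
delta_shrunk = lambda x: 0.5 * x.mean(axis=1)    # Delta_2: shrink the mean toward 0

for theta in [0.0, 0.3, 1.0, 3.0]:
    r1 = frequentist_risk(delta_mean, theta)
    r2 = frequentist_risk(delta_shrunk, theta)
    print(f"theta = {theta:3.1f}:  R(mean) = {r1:.3f}   R(shrunk) = {r2:.3f}")
# The shrunken mean wins near theta = 0 but loses badly for large |theta|,
# so neither rule strictly dominates the other.
```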
### Bayes Rules
Bayes risk averages the frequentist risk with a prior $\pi(\theta)$, measuring loss on average, and *Bayes rules are those that do the best on average*.
> [!definition|*] Bayes Risk and Bayes Rules
> The **Bayes risk** of decision rule $\Delta$ is the expected frequentist risk, integrated over the prior $\pi(\theta)$: $r(\pi, \Delta):= \int R(\theta,\Delta)\pi(\theta) \, d\theta. $A decision rule $\Delta ^{*}$ is a **Bayes rule** or **Bayes estimator** wrt. $\pi$ if it minimizes the Bayes risk: $\Delta ^{*}= \underset{\Delta \in \mathcal{D}}{\arg\min} ~r(\pi,\Delta).$If such a rule exists for $\pi$, denote $r_{\pi}:= r(\pi, \Delta ^{*})$.
Bayes risk handles worst cases through the **least favorable prior** $\pi$, the prior that is hardest to deal with, i.e. $r_{\pi} \ge r_{\pi'}$ for any other prior $\pi'$.
- Note that the two risks $r_{\pi},r_{\pi'}$ are in general minimized by different rules.
Bayes rules are admissible under mild assumptions:
> [!lemma|*] Admissibility of Bayes Rules
> If $\Delta$ is a Bayes rule of prior $\pi$ and $r(\pi,\Delta) < \infty$, then it is admissible if any of the following is true:
> - $\Delta$ is the unique Bayes rule of $\pi$;
> - $R(\cdot\,;\Delta)$ is continuous in $\theta$, and $\pi$ is non-zero in $\Theta$ (i.e. we don't a priori assume some parameters to be impossible).
>
> > [!proof]-
> > Consider another rule $\Delta'$ and assume for contradiction that $\Delta$ is strictly dominated by $\Delta'$.
> >
> > $[1]$ We prove $r(\pi, \Delta) = r(\pi, \Delta')$ by showing both $\le$ and $\ge$. The $\le$ follows from the definition of Bayes rules. By strict dominance $\forall \theta \in \Theta,~ R(\theta,\Delta)\ge R(\theta, \Delta')$, and taking expectations over $\pi$ gives $\ge$. This makes $\Delta'$ a Bayes rule of $\pi$ too, contradicting uniqueness.
> >
> > $[2]$ Consider the set $A:= \{ \theta \,|\, R(\theta, \Delta') < R(\theta, \Delta) \}$ on which $\Delta$ does strictly worse than $\Delta'$; it is nonempty by strict dominance. By continuity of $R$, $A$ must contain an open subset $A ^{*}$, and $\pi > 0$ on $A^{*}$. Integrating $R(\theta,\Delta') \le R(\theta,\Delta)$ (strict on $A^{*}$) against $\pi$ then gives $r(\pi, \Delta') < r(\pi, \Delta)$, contradicting the minimality of $r(\pi, \Delta)$.
One particular case is when $r$
### Minimax Rules
Minimax rules are the other extreme, only considering the worst case scenario instead of the average.
> [!definition|*] Maximum Risk and Minimax Rules
> The maximum risk of a rule $\Delta$ is just the worst-case scenario: the supremum of frequentist risks, $\sup_{\theta \in \Theta}R(\theta;\Delta)$.
>
> The **minimax** rule minimizes this maximum risk: $\Delta ^{*}$ is minimax if $\Delta ^{*}=\underset{\Delta \in \mathcal{D}}{\arg\min}\left[ \sup_{\theta \in \Theta} R(\theta;\Delta) \right].$
- Since minimax focuses on the worst case, it *might guard against a worst-case $\theta$ that is highly unlikely, and degrade its average performance in doing so*. This is the mirror image of how quicksort is worst-case $O(n^{2})$ but $O(n\log n)$ on average: optimizing the average and optimizing the worst case are different objectives.
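Continuing the Monte Carlo sketch above (same assumed model and rules; the prior and parameter range are arbitrary choices), the two criteria can disagree about which rule is better:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_rep = 10, 10_000
theta_grid = np.linspace(-3.0, 3.0, 13)        # stand-in for Theta = [-3, 3]
prior_draws = rng.normal(0.0, 0.1, size=200)   # prior pi = N(0, 0.1^2), concentrated near 0

def frequentist_risk(delta, theta):
    x = rng.normal(theta, 1.0, size=(n_rep, n))
    return np.mean((delta(x) - theta) ** 2)

rules = {"mean":   lambda x: x.mean(axis=1),
         "shrunk": lambda x: 0.5 * x.mean(axis=1)}

for name, delta in rules.items():
    max_risk = max(frequentist_risk(delta, t) for t in theta_grid)           # sup_theta R(theta, Delta)
    bayes_risk = np.mean([frequentist_risk(delta, t) for t in prior_draws])  # r(pi, Delta)
    print(f"{name:6s}  max risk = {max_risk:.3f}   Bayes risk = {bayes_risk:.3f}")
# "shrunk" is better on average under this prior; "mean" is better in the worst case.
```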
The two notions of decision rules being the best on average (Bayes) and in the worst case (minimax) are connected in the following theorem:
> [!theorem|*] Bayes and Minimax Rules
> If a prior $\pi$ admits a Bayes rule with Bayes risk $r_{\pi}$, and there is a rule $\Delta$ with $\sup_{\theta}R(\theta,\Delta) \le r_{\pi},$then $\Delta$ must be both a Bayes rule (of $\pi$) and a minimax rule, with $R(\theta, \Delta)=r_{\pi}~\mathrm{a.e.}$ (wrt. $\pi$) in that case.
>
> Furthermore, it is the unique minimax rule if it is the unique Bayes rule.
>
> > [!proof]- Proof
> > $[\Delta\text{ is Bayes rule}]$ since $r(\pi,\Delta)\le \sup_{\theta}R(\theta,\Delta)\le r_{\pi} \le r(\pi,\Delta)$, where the first inequality is averaging over $\pi$, the second is the supremum assumption, and the last is the definition of $r_{\pi}$ as the minimal Bayes risk.
> >
> > $[\Delta \text{ is minimax}]$ since any other rule $\Delta' \ne \Delta$ has $\sup_{\theta}R(\theta,\Delta')\ge \mathbb{E}_{\theta \sim \pi}[R(\theta,\Delta')]=r(\pi,\Delta') \overset{(*)}{\ge} r_{\pi} \ge \sup_{\theta}R(\theta,\Delta).$
> > $[R(\theta,\Delta)=r_{\pi}~\mathrm{a.e.}]$ since otherwise $r(\pi, \Delta) < r_{\pi}$, giving a contradiction.
> >
> > $[\text{unique Bayes} \Rightarrow \text{unique minimax}]$ since the inequality $(*)$ will be strict if the Bayes rule is unique.
- Note that this theorem does not comment on the existence of such a rule $\Delta$, nor does an arbitrary Bayes rule satisfy the inequality in general.
In the case where $\Delta$ is itself a Bayes rule of $\pi$, the above theorem strengthens to the following:
> [!theorem|*] Bayes and Minimax Rule in Least Favorable Priors
> If the above inequality is achieved by a Bayes rule itself: $\sup_{\theta}R(\theta,\Delta_{\text{Bayes}}) \le r_{\pi},$then it is a minimax rule, and $\pi$ must be the least favorable prior.
>
> > [!proof]-
> > That $\Delta_{\text{Bayes}}$ is minimax is just the previous theorem.
> >
> > If $\pi'$ is any (other) prior with its Bayes rule $\Delta_{\text{Bayes}}'$, then $r_{\pi'}=\int R(\theta,\Delta'_{\mathrm{Bayes}})\,\pi'(\theta)\,d\theta\le\int R(\theta,\Delta_{\mathrm{Bayes}})\,\pi'(\theta)\,d\theta\le\sup_{\theta}R(\theta,\Delta_{\mathrm{Bayes}})=r_{\pi}.$So $\pi$ must be the least favorable prior.
>
- One corollary is that if a Bayes rule has constant frequentist risk over all $\theta$, then it is minimax.
- This can be useful for finding minimax Bayes rules if we can tweak the prior with hyperparameters $\phi$, say -- after finding a Bayes rule as a function of $x,\phi$, we can solve for $\phi$ that makes the frequentist risk constant, if there is a solution.
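For example (the standard Beta-Binomial case, sketched with `sympy`; the symbols follow an assumed setup $X \sim \mathrm{Binom}(n,\theta)$ with a symmetric $\mathrm{Beta}(a,a)$ prior): solving for the hyperparameter $a$ that flattens the frequentist risk recovers the classic minimax estimator $(x+\sqrt{n}/2)/(n+\sqrt{n})$.

```python
import sympy as sp

theta, a, n = sp.symbols("theta a n", positive=True)

# Bayes estimator under a Beta(a, a) prior and l2 loss: the posterior mean,
# which for X ~ Binom(n, theta) is delta(X) = (X + a) / (n + 2a).
bias = (n * theta + a) / (n + 2 * a) - theta
var  = n * theta * (1 - theta) / (n + 2 * a) ** 2
risk = sp.expand(bias ** 2 + var)    # frequentist risk R(theta, delta) = bias^2 + variance

# Solve for the hyperparameter a that makes the risk constant in theta.
a_star = sp.solve(sp.Eq(risk.coeff(theta, 2), 0), a)[0]
print(a_star)                                               # sqrt(n)/2
print(sp.simplify(risk.coeff(theta, 1).subs(a, a_star)))    # 0: the linear term vanishes too
print(sp.simplify(risk.subs(a, a_star)))                    # a constant (= n / (4*(n + sqrt(n))**2)), free of theta
```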
### Posterior Risks
> [!idea] Minimizing the posterior risk is a handy way of finding the Bayes rule.
> [!definition|*] Posterior Risk
> With prior $\theta \sim \pi$ and observed data $X=x$, the **posterior risk** of a rule $\Delta$ wrt. the prior $\pi$ is $\Lambda(x,\Delta):= \mathbb{E}_{\theta \,|\, x}[L(\theta,\Delta(x)) \,|\, X=x]=\int L(\theta,\Delta(x)) \cdot \pi(\theta \,|\, x) \, d\theta.$Since $\Delta$ only matters through its action $\Delta(x)$, we may also pass an action $A$ (or just $y \in \mathbb{R}$ for prediction tasks) in its stead.
Only Bayes rules can (but not necessarily do) minimize the posterior risk. This is because *Bayes risk equals the expected posterior risk* (see proof).
> [!theorem|*] Posterior Risk Minimizer is Bayes
> If a prior $\pi$ allows finite Bayes risk, and the function $c: x \mapsto \underset{y \in \mathcal{A}}{\arg\min} ~\Lambda(x, y)$ is defined $\mathrm{a.e.}$ in $\mathcal{X}$, then $\Delta(x):= c(x)$ is a Bayes rule of $\pi$.
>
> That is, if we can minimize $\Lambda$ pointwise in $x$ by choosing some action $y$ based on data $x$, then this map from $x$ to $y$ must be the Bayes rule, assuming finite risk.
>
>
> > [!proof]-
> > Bayes risk of $\Delta$ equals the expected posterior risk: $r(\pi,\Delta)= \mathbb{E}_{\theta,X}[L(\theta, \Delta(X))]=\mathbb{E}_{X}\Big[\mathbb{E}_{\theta \,|\, X}[L(\theta, \Delta(X))]\Big]= \mathbb{E}_{X}[\Lambda(X, \Delta)].$Hence if $\Delta$ minimizes $\Lambda(x,\,\cdot)~\mathrm{a.e.}$, it also minimizes the Bayes risk when the expectation is finite (by assumption).
> [!examples] Bayes Estimator of Binomial Parameter
> Suppose $X \sim \mathrm{Binom}(n, \theta)$ with prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$. Then the posterior is $\theta ~|~ \{ X=x \} \sim \mathrm{Beta}(\alpha+x, \beta+n-x),$so the Bayes rule can be found by minimizing the posterior risk.
>
> For example, the posterior risk under $l_{2}$ loss is minimized by the posterior mean $(\alpha + x) / (\alpha +\beta + n)$, which is a weighted average of the prior mean $\alpha / (\alpha + \beta)$ and the MLE $x / n$.
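A quick numerical check of this (a sketch with `scipy`, using arbitrary illustrative values $\alpha=2,\beta=3,n=10,x=7$): minimizing the posterior risk $\Lambda(x,\,\cdot)$ under $l_{2}$ loss directly, by numerical integration, recovers the posterior mean.

```python
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import beta

# Posterior theta | x ~ Beta(alpha + x, beta + n - x) for arbitrary illustrative values.
alpha_, beta_, n, x = 2.0, 3.0, 10, 7
posterior = beta(alpha_ + x, beta_ + n - x)

def posterior_risk(y):
    """Lambda(x, y) = E[(theta - y)^2 | X = x], computed by numerical integration."""
    return quad(lambda t: (t - y) ** 2 * posterior.pdf(t), 0.0, 1.0)[0]

y_star = minimize_scalar(posterior_risk, bounds=(0.0, 1.0), method="bounded").x
print(y_star)                                   # numerically ~0.600
print((alpha_ + x) / (alpha_ + beta_ + n))      # posterior mean (alpha + x)/(alpha + beta + n) = 0.6
```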
<div style="page-break-after: always;"></div>
## Decision Theory in Machine Learning
For an estimation problem $Y=f(X)$ with loss $L$, suppose we have a dataset $D=(\mathbf{X}, \mathbf{Y})$ and use it to fit an estimate $\hat{h}$ from $\mathcal{F}$, the space of all deterministic relationships mapping $\mathcal{X} \to \mathcal{Y}$.
- Here $f$ is some general relationship, which need not be deterministic; $\hat{h}$, however, must be deterministic.
### Risk due to Random Data
> [!idea] In practice, we do not get to access the true distribution of $(X,Y)$, and have to rely on a sampled dataset $D=(\mathbf{x}, \mathbf{y})$. This causes the fitted rule to deviate from the optimal Bayes rule.
The risk of an estimated rule $\hat{h}$ is $R(\hat{h})=\mathbb{E}_{X,Y}[L(\hat{h}(X), Y)],$where the training dataset $D$ (hence $\hat{h}$) is treated as a constant, and the only randomness is the $(X,Y)$ at which the loss is computed. For $l_{2}$ loss, this becomes $R(\hat{h})=\mathbb{E}_{X,Y}[(\hat{h}(X)-Y)^{2}].$
If $h^{\ast}$ is the minimizer of $R$ (constrained or not), then the **excess risk** of $\hat{h}$ is $R(\hat{h})-R(h^{\ast})$, i.e. the gap between the estimate and the absolute best.
For $l_{2}$ loss, the excess risk of $\hat{h}$ is: $\begin{align*} R(\hat{h})&-R(h^{\ast})\\
&= \mathbb{E}_{X}\Big[\mathbb{E}_{Y~|~X}\big[(Y-\hat{h}(x))^{2}-(Y-h^{\ast}(x))^{2}~|~ X=x\big]\Big]\\
&= \mathbb{E}_{X}\Big[\mathbb{E}_{Y~|~X}\big[\hat{h}(x)^{2}-h^{\ast}(x)^{2}-2Y\hat{h}(x)+2Yh^{\ast}(x)~|~ X=x\big]\Big]\\
&= \mathbb{E}_{X}\Big[ \hat{h}(x)^{2}-2h^{\ast}(x)\hat{h}(x)+h^{\ast}(x)^{2}\Big]\\
&= \mathbb{E}_{X}[(\hat{h}(x)-h^{\ast}(x))^{2}],
\end{align*}$where we used the identity $\mathbb{E}_{Y ~|~ X}[Y ~|~ X=x]=h^{\ast}(x)$, i.e. that the conditional mean is the unconstrained minimizer under $l_{2}$ loss.
### Empirical Risk and Overfitting
> [!definition|*] Empirical Risk
> The **empirical risk** of a rule $h$ on a dataset $D$ is $\hat{R}_{D}(h):= \frac{1}{| D |}\sum_{i=1}^{| D |}L(h(x_{i}), y_{i}),$i.e. the averaged loss.
>
> Note that the empirical risk depends on the dataset, and if $D$ is used to train (estimate) the rule $h=\hat{h}_{D}$, it is also called the **training error**.
^2aa114
The **generalization gap** of a rule $\hat{h}_{D}$ fitted on the dataset $D$ is $R(\hat{h}_{D})-\hat{R}_{D}(\hat{h}_{D})$, i.e. the frequentist risk minus the training error -- a model might achieve perfect accuracy on the training set by memorizing both the underlying pattern and random noise, but it will generalize poorly to a new dataset. In this case the model is said to have **overfit**.
Because of this, training error is not indicative of risk in general: in particular, $\mathbb{E}_{D}[\hat{R}_{D}(\hat{h}_{D})] \le R(h^{\ast}) \le R(\hat{h}_{d}),$where the $d$ in $\mathrm{RHS}$ is any fixed dataset, and $h^{\ast}$ is assumed to lie in the class over which we fit. That is, *the expected training error is even less than the risk of the optimal Bayes rule*.
> [!proof]-
> This is because the $\mathrm{LHS}$ integrand satisfies $\hat{R}_{D}(\hat{h}_{D})\le \hat{R}_{D}(g)$ for any rule $g$ fixed independently of $D$ (in particular $g=h^{\ast}$ or $g=\hat{h}_{d}$), since $\hat{h}_{D}$ is selected by ERM on that same dataset. Now average over $D$: because the observations in $D$ are drawn independently (of each other and of $d$), $\mathbb{E}_{D}[\hat{R}_{D}(g)]=R(g)$. Taking $g=h^{\ast}$ gives the first inequality, and $R(h^{\ast})\le R(\hat{h}_{d})$ holds because $h^{\ast}$ minimizes the risk.
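A small simulation of the gap (a sketch; the sinusoidal truth, noise level, and degree-$9$ polynomial class are arbitrary choices): the training error of an over-flexible ERM fit sits below the Bayes risk, while its true risk sits above it.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Data from Y = sin(2*pi*X) + eps, X ~ U(0,1), eps ~ N(0, 0.3^2)."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

# ERM over a flexible class (degree-9 polynomials) on a small training set D.
x_train, y_train = sample(15)
h_hat = np.polyfit(x_train, y_train, deg=9)

train_error = np.mean((np.polyval(h_hat, x_train) - y_train) ** 2)  # empirical risk on D
x_test, y_test = sample(100_000)
risk_estimate = np.mean((np.polyval(h_hat, x_test) - y_test) ** 2)  # ~ R(h_hat)

print(f"training error ~ {train_error:.3f}")
print(f"risk of h_hat  ~ {risk_estimate:.3f}")
print(f"Bayes risk     = {0.3 ** 2:.3f}  (noise variance; risk of h*(x) = sin(2*pi*x))")
# Typically: training error < Bayes risk < risk of the fitted rule.
```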
### Risk due to Modeling
> [!idea] If we choose the estimated relationship from a model $\mathcal{H}$, doing so causes deviations from the (unconstrained) Bayes rule.
The optimization of Bayes risk happens in the space of all rules $\mathcal{F}=\{ f: \mathcal{X} \to \mathcal{Y} \}$, but this space is too complex to handle, so instead we restrict the search by requiring $\hat{h} \in \mathcal{H} \subset \mathcal{F},$where $\mathcal{H}$ is the **hypothesis class**, a set of functions we can "pick from" and realistically optimize over.
- E.g. the set of hyperplanes in $\mathcal{X} \times \mathcal{Y}$ for linear regression, or the set of piecewise polynomials with degree at most $3$ for cubic splines.
This gives the optimizer $h^{\ast}_{\mathcal{H}} \in \mathcal{H}$, the new target that is actually achievable.
However, this restriction itself incurs risk, leading to the estimation-approximation decomposition of the excess risk:
> [!theorem|*] Estimation-Approximation Error
> $\text{excess risk}=R(\hat{h})-R(h^{\ast})=\underbrace{R(\hat{h})-R(h^{\ast}_{\mathcal{H}})}_{\text{estimation error}} + \underbrace{R(h^{\ast}_{\mathcal{H}})-R(h^{\ast})}_{\text{approximation error}},$where the **estimation error** is caused by fitting $\hat{h}$ on a random, finite dataset, and the **approximation error** is caused by restricting to the hypothesis class $\mathcal{H}$.
This is analogous to the bias-variance trade-off, where the estimation error plays the role of variance, and the approximation error is the bias due to restricting the scope to $\mathcal{H}$.
- A larger hypothesis class $\mathcal{H}$ is harder to fit, hence increases estimation error.
- A smaller hypothesis class might be further away from the true best $h^{\ast}$, causing high approximation error.
- However, *the decompositions are not equivalent* (even under $l_{2}$ loss), as estimation-approximation uses the intermediary $h^{\ast}_{\mathcal{H}}$ (a suboptimal goal), while bias-variance uses $\bar{h}:=\mathbb{E}_{\mathcal{D}}[\hat{h}_{\mathcal{D}}]$ (the average result) -- see below.
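A rough numerical sketch of the decomposition (with an assumed truth $h^{\ast}(x)=\sin(2\pi x)$, noise level $0.3$, and polynomial hypothesis classes; the best-in-class rule $h^{\ast}_{\mathcal{H}}$ is approximated by fitting on a very large dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
h_star = lambda x: np.sin(2 * np.pi * x)     # true regression function (the unconstrained optimum)

def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, h_star(x) + rng.normal(0.0, sigma, n)

def risk(coefs, grid=np.linspace(0.0, 1.0, 10_001)):
    """R(h) = E[(Y - h(X))^2] = sigma^2 + E[(h*(X) - h(X))^2] for X ~ U(0,1), via a fine grid."""
    return sigma ** 2 + np.mean((h_star(grid) - np.polyval(coefs, grid)) ** 2)

x_small, y_small = sample(30)        # small training set -> h_hat (carries estimation error)
x_large, y_large = sample(200_000)   # huge training set  -> ~ h*_H, the best rule in the class

for deg in [1, 5]:                   # two hypothesis classes: linear vs degree-5 polynomials
    h_hat  = np.polyfit(x_small, y_small, deg)
    h_best = np.polyfit(x_large, y_large, deg)
    estimation    = risk(h_hat) - risk(h_best)    # R(h_hat) - R(h*_H)
    approximation = risk(h_best) - sigma ** 2     # R(h*_H)  - R(h*), where R(h*) = sigma^2
    print(f"deg = {deg}:  estimation ~ {estimation:.3f}   approximation ~ {approximation:.3f}")
# The linear class has a large approximation error but a small estimation error;
# the richer degree-5 class trades approximation error for estimation error.
```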
### Bayes Rules in ML
> [!theorem|*] Bayes Rule in Estimation
> If unconstrained, the Bayes rule (or Bayes predictor) is given by $h^{\ast}(x)=\underset{\hat{y}}{\arg\min}~\mathbb{E}[L(Y, \hat{y}) ~|~ X=x],$but obviously we need the distribution $Y ~|~X$: if it is $g(y ;x)$, then the Bayes predictor is $\begin{align*}
h^{\ast}(x)&= \underset{\hat{y}}{\arg\min}~\mathbb{E}_{Y ~|~ X}[L(Y, \hat{y}) ~|~ X=x]\\
&= \underset{\hat{y}}{\arg\min}~ \int _\mathcal{Y} L(y,\hat{y}) g(y;x) ~ dy.
\end{align*}$
> [!idea] If the objective/loss is strictly convex (e.g. quadratic loss), this rule is unique, hence admissible.
In general, to find the Bayes rule we fix $X=x$ and choose the action minimizing the conditional expected loss, which requires knowing (or estimating) the conditional distribution $Y ~|~ X$.
For example, a classification problem with classifier $g:\mathcal{X}\to \{ 1,\dots,K \}$ has risk $\mathbb{E}_{X,Y}[L(g(X), Y)],$
which reduces to the following with **0-1 loss**: $\mathbb{E}_{X}[\mathbb{P}[Y \ne g(x) ~|~ X=x]],$which is optimized by $\begin{align*}
g^{\ast}(x)&= \underset{k}{\arg\min}~\mathbb{P}[Y \ne k ~|~ X=x]\\
&= \underset{k}{\arg\max}~\mathbb{P}[Y=k ~|~ X=x],
\end{align*}$i.e. the most likely class given $X=x$.
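As a small sketch (with assumed class priors and Gaussian class-conditional densities, chosen arbitrarily): the Bayes classifier picks the class with the larger posterior probability, equivalently the larger $\pi_{k}\,f_{k}(x)$.

```python
import numpy as np
from scipy.stats import norm

# Assumed model: Y = 0 w.p. 0.7 with X | Y=0 ~ N(0,1);  Y = 1 w.p. 0.3 with X | Y=1 ~ N(2,1).
priors = np.array([0.7, 0.3])
cond = [norm(0.0, 1.0), norm(2.0, 1.0)]

def bayes_classifier(x):
    """Under 0-1 loss: pick the class maximizing P[Y = k | X = x], i.e. prior_k * f_k(x)."""
    joint = np.array([p * d.pdf(x) for p, d in zip(priors, cond)])
    return np.argmax(joint, axis=0)

# Monte Carlo estimate of its risk (the Bayes error rate) -- no classifier can do better.
rng = np.random.default_rng(0)
y = (rng.random(100_000) < priors[1]).astype(int)
x = np.where(y == 1, rng.normal(2.0, 1.0, y.size), rng.normal(0.0, 1.0, y.size))
print("estimated Bayes error:", np.mean(bayes_classifier(x) != y))
```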
### Risks of Point Estimates
In estimating $\theta$ with point estimate $\hat{\theta}:\mathcal{X} \to \Theta$, the common loss functions $L(\theta, \hat{\theta})$ include:
- 0-1 loss on interval with radius $b$: $\mathbf{1}_{|\theta-\hat{\theta}|>b}$.
- $l_{1}$ (linear) loss: $| \theta-\hat{\theta} |$.
- $l_{2}$ (quadratic) loss: $(\theta-\hat{\theta})^{2}$.
> [!theorem|*] Bayes Rules for the Three Losses
> The Bayes rules for the losses are:
> - Posterior mode for 0-1 loss when $b \to 0$ and $\pi(\theta \,|\, x)$ is continuous.
> - Posterior median for $l_{1}$ loss.
> - Posterior mean for $l_{2}$ loss.
>
> > [!proof]-
> > We find the rules that minimize the posterior risk; by the theorem above, this suffices for them to be Bayes rules.
> > $[\text{0-1 loss}]$ The posterior risk is $\Lambda(x, \hat{\theta})=1-\mathbb{P}[| \theta-\hat{\theta} |<b \,|\, X=x] \approx 1- 2b \cdot\pi(\hat{\theta} \,|\, x),$as $b \to 0$, and this is minimized by the posterior mode.
> >
> > $[l_{1}\text{ loss}]$ The posterior risk is $\Lambda(x, \hat{\theta})=\int _{(-\infty,\hat{\theta})}(\hat{\theta}-\theta)\cdot \pi(\theta \,|\, x) \, d\theta +\int _{(\hat{\theta}, \infty)}(\theta - \hat{\theta})\cdot \pi(\theta \,|\, x) \, d\theta.$Its derivative is $\frac{ \partial \Lambda(x,\hat{\theta}) }{ \partial \hat{\theta} } =\mathbb{P}[\theta < \hat{\theta} \,|\,X= x] - \mathbb{P} [\theta > \hat{\theta} \,|\, X=x].$Setting it to zero gives the posterior median as the minimizer.
> >
> > $[l_{2} \text{ loss}]$ Apply bias-variance decomposition to the MSE $\mathbb{E}[(\theta-\hat{\theta})^{2}\,|\, X=x]$ while treating $\theta$ as a variable "approximating" the constant $\hat{\theta}$. This gives $\begin{align*}
> \Lambda(x,\hat{\theta})&= \mathbb{E}[(\theta-\hat{\theta})^{2}\,|\, X=x]\\
> &= \underbrace{(\mathbb{E}[\theta \,|\, X=x]-\hat{\theta})^{2}}_{\text{bias}^{2}} + \mathrm{Var}(\theta \,|\, X=x).
> \end{align*}$Since we can't do anything to $\mathrm{Var}(\theta \,|\, X=x)$, we can minimize the bias to 0 by setting $\hat{\theta} = \mathbb{E}[\theta \,|\, X=x]$, the posterior mean.
^bf20fe
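As a quick illustration (a sketch with an arbitrary skewed posterior $\theta \,|\, x \sim \mathrm{Beta}(2,5)$): the three losses lead to three different point estimates.

```python
from scipy.stats import beta

posterior = beta(2, 5)          # an assumed, skewed posterior theta | x

post_mean   = posterior.mean()         # Bayes estimate under l2 (quadratic) loss
post_median = posterior.median()       # Bayes estimate under l1 (absolute) loss
post_mode   = (2 - 1) / (2 + 5 - 2)    # mode of Beta(a, b) is (a-1)/(a+b-2); 0-1 loss with radius -> 0

print(post_mean, post_median, post_mode)    # ~0.286, ~0.264, 0.200 -- three different estimates
```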
<div style="page-break-after: always;"></div>
## Decision Theory for Hypothesis Testing
Suppose the test is done by the rule $\Delta:\mathcal{X} \to \{ 0,1 \}$, where $1$ indicates rejection of the null hypothesis.
For simplicity, let the loss be constant in each hypothesis, i.e. $L(\Delta (x),\theta)=\begin{cases}
a &\text{if }H_{0} \text{ and }\Delta(x) = 1 \text{ (rejected $H_{0}$)}, \\
b &\text{if }H_{1} \text{ and }\Delta(x) = 0 \text{ (failed to reject $H_{0}$)}.
\end{cases}$and $0$ when the decision is correct. Here $a,b$ can be different constants -- if the test is for poison ($H_{1}$ being poisonous), choosing a large $b$ relative to $a$ amounts to rather being safe than sorry.
Let $g_{0},g_{1}$ be the marginal likelihoods of $X$ under $H_{0},H_{1}$, so $g_{i}:= \int _{\Theta_{i}} f(x;\theta)\, \pi(\theta \,|\, H_{i})~ d\theta, $where $\pi(\cdot \,|\, H_{i})$ is the prior conditioned on $H_{i}$; in particular $g_{i}=f(\cdot;\theta_{i})$ for simple hypotheses. Then,
> [!theorem|*] Likelihood Ratio Test is Bayes
> The likelihood ratio test with critical region $\left\{ x: \frac{g_{0}(x)}{g_{1}(x)} \le \frac{b\pi_{1}}{a\pi_{0}} \right\}$ is a Bayes rule. Here $\pi_{i}$ are the prior probabilities of the two hypotheses.
>
> >[!proof]-
> > The Bayes risk is $\begin{align*}
r(\Delta)&= a \cdot\pi_{0}\mathbb{P}[X \in C ~|~ H_{0}]+b \cdot \pi_{1}\mathbb{P}[X \notin C ~|~ H_{1}]\\
&= a\pi_{0} \int _{C} g_{0} ~ dx + b\pi_{1}\int _{\Omega-C} g_{1}~ dx \\
&= b\pi_{1}+\int _{C} a\pi_{0}g_{0}-b\pi_{1}g_{1} ~ dx.
\end{align*}$Therefore we choose $C$ that minimizes the second term, i.e. wherever the integrand is negative. This is exactly the LRT given above.
In particular, when $a=b=1$, this is the **maximum a posteriori (MAP)** test, i.e. $\Delta(x)= \underset{i}{\arg\max}~\mathbb{P}[H_{i} ~|~ X=x].$
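A small simulation of this (a sketch with assumed simple hypotheses $H_{0}: X \sim \mathrm{N}(0,1)$ vs $H_{1}: X \sim \mathrm{N}(2,1)$, prior weights $\pi_{0}=0.8,\pi_{1}=0.2$, and asymmetric losses $a=1, b=5$):

```python
import numpy as np
from scipy.stats import norm

pi0, pi1 = 0.8, 0.2      # prior probabilities of H0, H1
a, b = 1.0, 5.0          # b >> a: missing H1 ("poison") is costlier than a false alarm
g0, g1 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf    # likelihoods under the simple hypotheses

def bayes_test(x):
    """Reject H0 (return 1) exactly when g0(x)/g1(x) <= b*pi1 / (a*pi0)."""
    return (g0(x) / g1(x) <= (b * pi1) / (a * pi0)).astype(int)

# Monte Carlo estimate of the Bayes risk of this rule.
rng = np.random.default_rng(0)
h1 = rng.random(100_000) < pi1                      # which hypothesis generated each draw
x = np.where(h1, rng.normal(2.0, 1.0, h1.size), rng.normal(0.0, 1.0, h1.size))
d = bayes_test(x)
loss = np.where(h1, b * (d == 0), a * (d == 1))     # loss a for a false rejection, b for a miss
print("estimated Bayes risk:", loss.mean())         # no other test should do better on average
```

Setting $a=b$ makes the threshold $\pi_{1}/\pi_{0}$, recovering the MAP test above.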