> [!tldr] ABC Algorithm
> The ABC (approximate Bayesian computation) algorithm draws approximate samples $\theta \sim \pi(\theta ~|~ y)$ without requiring knowledge of the likelihood $p(y~|~\theta)$.
>
> It only requires the ability to sample $\theta \sim \pi(\theta)$ and $Y \sim p(y ~|~ \theta)$ (not to compute the densities themselves).
Suppose our data $y$ comes from a very cursed process, so we cannot compute $p(y ~|~ \theta)$ explicitly and can only simulate it.
Or we cannot normalize the likelihood (e.g. imagine a lottery where we know the number of tickets each $y$ holds given $\theta$, but not the total number of tickets), so we only know $p(y~|~\theta)=f(y,\theta) / c(\theta)$ for some unknown $c(\theta)$.
This requires a so-called **likelihood-free** approach, and ABC is a specific case of this class of methods.
### Motivation
Consider a [[Rejection Sampling]] algorithm that uses $\pi(\theta)$ as the proposal, for discrete $y \in \mathcal{Y}$:
> [!algorithm] Not-so-approximate BC
> Input $y_{\mathrm{obs}}$ for which we want to sample $\theta \sim \pi(\theta ~|~ y_{\mathrm{obs}})$.
> Repeat:
> - Propose $\theta \sim \pi(\theta)$ and for that value of $\theta$, draw $y \sim p(y ~|~ \theta)$.
> - If the drawn $y = y_{\mathrm{obs}}$, then accept the proposal $\theta$ and return.
> - Else, continue the loop.
- Note that a given proposal $\theta$ is accepted with probability $\mathbb{P}(\text{draw }y_{\mathrm{obs}} ~|~ \theta)= \frac{\mathbb{P}[\text{draw }y_{\mathrm{obs}} \text{ and }\theta]}{\mathbb{P}[\text{draw }\theta]} =\left( \underset{\text{target}}{\pi(\theta ~|~ y_{\mathrm{obs}})}\Big/ \underset{\text{proposal}}{\pi(\theta)} \right) \cdot \underset{\substack{\text{overall} \\ \text{acceptance prob.}}}{\mathbb{P}(\text{draw }y_{\mathrm{obs}})},$so this is indeed a rejection sampling algorithm, and it returns $\theta \sim \pi(\theta ~|~ y_{\mathrm{obs}})$ as desired.
- The only issue is that $\mathcal{Y}$ must be discrete (up to $p$-null sets), else the algorithm will run forever almost surely.
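The exact-match loop above can be sketched in a few lines. The toy model here (Uniform prior, Binomial data) is a hypothetical choice for illustration, not one from the note; the key point is that only *simulation* of $y$ is used, never the likelihood density:

```python
import random

def exact_bc_sample(y_obs, n, rng=random.Random(0)):
    """Exact-match rejection sampler: one draw of theta ~ pi(theta | y_obs).

    Hypothetical toy model (not from the note): theta ~ Uniform(0, 1) prior,
    y | theta ~ Binomial(n, theta).  We only simulate y; the binomial
    likelihood is never evaluated.
    """
    while True:
        theta = rng.random()                             # propose theta ~ pi(theta)
        y = sum(rng.random() < theta for _ in range(n))  # draw y ~ p(y | theta)
        if y == y_obs:                                   # accept iff simulated y matches
            return theta

# With a Uniform prior and Binomial likelihood the true posterior is
# Beta(y_obs + 1, n - y_obs + 1), so draws should concentrate near
# its mean (y_obs + 1) / (n + 2), roughly 2/3 here.
draws = [exact_bc_sample(y_obs=7, n=10) for _ in range(500)]
print(sum(draws) / len(draws))
```

Note the discreteness of $\mathcal{Y}$ is what makes `y == y_obs` happen in finite time; the acceptance rate here is the marginal $p(y_{\mathrm{obs}})$.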
> [!idea] To handle continuous data, we can simply discretize it by settling with acceptance when the sampled $y$ is "close enough" to $y_{\mathrm{obs}}$.
But how do we define "close"?
- Since $\mathcal{Y}$ can be very high dimensional (e.g. $\mathbb{R}^{n}$ for a large $n$), Euclidean distance $\| y-y_{\mathrm{obs}} \|$ is bad because of [[Curse of Dimensionality]], i.e. it's almost impossible to sample another $y$ that is in $B(y_{\mathrm{obs}},r;d_{2}):= \{ y \in \mathcal{Y} ~|~ d_{2}(y,y_{\mathrm{obs}}) < r\}$.
- However, we don't really need entry-wise closeness: if $\theta = \mathbb{E}[Y]$, we should treat $y=(1,2)$ and $y=(2,1)$ the same, even though their $l_{2}$ distance is large.
Therefore, we can use summary statistics to filter for information that actually matters for inferring $\theta$.
- This reduces the dimension of the dataset from some large $n$ to that of the summary statistic (e.g. $2$, if the statistic is the sample mean and standard deviation $(\bar{y},s_{y})$).
- When a summary statistic $S$ is [[Minimality and Sufficiency#Sufficiency|sufficient]], then $\theta ~|~\{S(Y)=S(y_{\mathrm{obs}}) \}$ is identically distributed as $\theta ~|~ \{ Y=y_{\mathrm{obs}} \}$, so no information is lost in the process.
In conclusion, we make two improvements to handle continuous data:
- First define "closeness" with a summary statistic $S$.
- Second, instead of requiring exact equality, we settle for $S(Y) \approx S(y_{\mathrm{obs}})$.
> [!exposition] Formally defining "closeness"
>
> Formally, take a summary statistic $S: \mathcal{Y} \to \mathbb{R}^{s}$ (for a small $s$) and metric $d$ on $\mathbb{R}^{s}$, we can define the "distance" between $y,y'$ as
> $d_{S}(y,y'):= d(S(y), S(y')),$
>
> i.e. the difference between their summary statistics.
> - For example, with $S: y \mapsto \bar{y}$ and $d=d_{2}$, $d_{S}$ simply measures the difference between the sample means.
>
> With this, we can define "close enough" (with radius $\delta$) as $B(y_{\mathrm{obs}}, \delta;d_{S}):= \{ y \in \mathcal{Y} ~|~ d_{S}(y,y_{\mathrm{obs}}) < \delta \},$i.e. the set of $y$ whose summary statistic is similar to that of $y_{\mathrm{obs}}$. When $y_{\mathrm{obs}}$ and $S$ are fixed, we simply denote this as $B(\delta)$.
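As a concrete sketch of $d_S$, the snippet below (names `d_S` and `mean_sd` are illustrative, not from any library) uses the note's example statistic $(\bar{y}, s_y)$ with the Euclidean metric:

```python
import math

def d_S(y, y_prime, S, d=math.dist):
    """Distance between datasets via their summary statistics: d(S(y), S(y'))."""
    return d(S(y), S(y_prime))

def mean_sd(y):
    """Example summary statistic from the note: (sample mean, sample sd)."""
    n = len(y)
    m = sum(y) / n
    s = math.sqrt(sum((v - m) ** 2 for v in y) / (n - 1))
    return (m, s)

# Entry-wise very different datasets can still be "close":
print(d_S([1.0, 2.0], [2.0, 1.0], S=mean_sd))  # → 0.0
```

This makes the earlier point concrete: $(1,2)$ and $(2,1)$ are far apart in $l_2$ distance on $\mathcal{Y}$, yet indistinguishable through $S$.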
> [!algorithm] ABC
> Input $y_{\mathrm{obs}}$ for which we want to sample $\theta \sim \pi(\theta ~|~ y_{\mathrm{obs}})$. Fix the radius $\delta$ and summary statistic $S$ for defining "closeness", and define the ball $B:= B(y_{\mathrm{obs}},\delta; d_{S})$.
>
> Repeat:
> - Propose $\theta \sim \pi(\theta)$ and for that value of $\theta$, draw $y \sim p(y ~|~ \theta)$.
> - If the drawn $y \in B$, accept the proposal $\theta$ and return.
> - Else, continue the loop.
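The ABC loop can be sketched as follows. The model choices (Gaussian prior and likelihood, sample-mean statistic) are hypothetical, chosen because the sample mean is sufficient there, so for small $\delta$ the draws should approximate the exact conjugate posterior:

```python
import random
import statistics

def abc_sample(y_obs, delta, n, rng=random.Random(0)):
    """One draw from the ABC posterior pi(theta | S(Y) in B(delta)).

    Hypothetical toy model (not from the note): theta ~ N(0, 1) prior,
    y_1..y_n | theta iid N(theta, 1), summary statistic S = sample mean.
    """
    s_obs = statistics.fmean(y_obs)
    while True:
        theta = rng.gauss(0.0, 1.0)                    # theta ~ pi(theta)
        y = [rng.gauss(theta, 1.0) for _ in range(n)]  # y ~ p(y | theta)
        if abs(statistics.fmean(y) - s_obs) < delta:   # d_S(y, y_obs) < delta
            return theta

# S is sufficient here, so for small delta the ABC posterior approximates
# the exact posterior N(n * ybar / (n + 1), 1 / (n + 1)).
rng = random.Random(1)
y_obs = [rng.gauss(1.0, 1.0) for _ in range(20)]
draws = [abc_sample(y_obs, delta=0.1, n=20) for _ in range(300)]
print(statistics.fmean(draws))
```

Shrinking `delta` sharpens the approximation but lowers the acceptance rate; this trade-off is exactly the "blurring" discussed below.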
This is equivalent to rejection sampling with the target $\pi(\theta ~|~ y \in B)$ (rather than $\pi(\theta ~|~ y=y_{\mathrm{obs}})$): $\mathbb{P}[\theta \text{ is accepted} ~|~ \theta \text{ is proposed}]= p(y\in B ~|~ \theta)=\frac{\pi(\theta ~|~ y\in B)}{\pi(\theta)}\cdot p(y \in B).$
Therefore, we define this target as
> [!theorem|*] ABC Posterior
> The ABC algorithm samples from the **ABC posterior** $\pi_{\mathrm{ABC}}(\theta)=\pi(\theta ~|~ Y \in B(\delta)).$
Alternatively, we can interpret its target as a blurred version of the posterior:
$\begin{align*}
\pi(\theta ~|~ y \in B)&= \frac{p(B,\theta)}{p(B)}\\
&= \int _{B} \frac{p(y,\theta)}{p(B)} ~ dy \\
&= \int _{B} \frac{p(y)}{p(B)}\cdot \textcolor{cyan}{\pi(\theta ~|~ y)} ~ dy \\
&= \int _{B} p(y ~|~ y \in B) \cdot\textcolor{cyan}{\pi(\theta ~|~ y)} ~ dy\\
&= \mathbb{E}_{y}[\textcolor{cyan}{\pi(\theta ~|~ y)} ~|~ y \in B],
\end{align*} $
i.e. the posterior $\textcolor{cyan}{\pi(\theta ~|~ y)}$ blurred over $B$, weighted by $p(y ~|~ y\in B)$.
> [!idea] The ABC posterior is a blurred version of the exact posterior.
### Regression Adjustment
There is a simple technique for improving the approximation $\pi(\theta ~|~ B) \approx \pi(\theta ~|~ y_{\mathrm{obs}})$. Let $\theta \sim \pi(\theta ~|~y)$ (for some $y \in B$). We hope to slightly transform $\theta$ so that it matches the distribution of $\theta' \sim \pi(\theta ~|~ y_{\mathrm{obs}})$.
> [!theorem|*] Regression Adjustment
> Suppose $\mathbb{E}[\theta ~|~ y]=\mu(S(y))$ is a function of $S$ (e.g. when $S$ is sufficient), and is smooth enough that for small $\delta$, it is approximately linear: $\mu(s) \approx \mu(s_{\mathrm{obs}})+(s-s_{\mathrm{obs}})^{T}\beta$ for some coefficients $\beta$, where $s:= S(y)$ and $s_{\mathrm{obs}}:= S(y_{\mathrm{obs}})$.
>
> Furthermore, suppose the posteriors have $\mu(s)$ as a location parameter, i.e. $\theta-\mu(s) \overset{\mathrm{d}}{=}\theta'-\mu(s_{\mathrm{obs}}).$
>
> Then, solving for $\theta'$, $\theta-(s-s_{\mathrm{obs}})^{T}\beta \approx \theta-\mu(s)+\mu(s_{\mathrm{obs}}) \overset{\mathrm{d}}{=} \theta' \sim \pi(\cdot ~|~ y_{\mathrm{obs}}).$
Therefore, we can obtain a (better) approximate sample from $\pi(\theta ~|~ y_{\mathrm{obs}})$ by using $\theta-(s-s_{\mathrm{obs}})^{T}{\beta}$ instead of $\theta$. The $(s-s_{\mathrm{obs}})^{T}\beta$ term is the **regression adjustment**, and since $\beta$ is the sensitivity of $\mathbb{E}[\theta ~|~ S]$ w.r.t. $S$, it can be estimated by a linear regression of $(\theta_{1},\theta_{2},\dots)$ on $(S(y_{1}),S(y_{2}),\dots)$, i.e. on the accepted pairs $\theta_{i} \sim \pi_{\mathrm{ABC}}$, $y_{i}\sim p(y~|~\theta_{i})$ produced by the ABC algorithm. With the fitted coefficient $\hat{\beta}$, we then return ${\theta}'_{i}:= \theta_{i}-(S(y_{i})-S(y_{\mathrm{obs}}))^{T}\hat{\beta}$ as the regression-adjusted samples.