```dataview
table without id
  File as "Topics",
  join(sort(map(filter(file.tags, (tag) => any(map(this.domain_tags, (domtag) => contains(tag, domtag + "/")))), (x) => regexreplace(replace(x, "_", " "), "#(" + join(this.domain_tags, "|") + ")/", ""))), ", ") as "Type",
  dateformat(file.mtime, "yyyy-MM-dd") as "Last Modified"
from ""
FLATTEN "[[" + file.path + "|" + truncate(file.name, 30) + "]]" as File
where ((domain and contains(domain, this.file.link) and (file.name != this.file.name)) or any(map(file.tags, (x) => econtains(this.domain_tags, substring(x, 1)))) or any(map(file.tags, (x) => any(map(this.domain_tags, (domtag) => contains(x, domtag + "/")))))) and !contains(file.path, "2 - Snippets") and !contains(file.tags, "subdomain")
sort file.mtime desc
```

```dataview
table without id
  File as "Snippets",
  join(sort(map(filter(file.tags, (tag) => any(map(this.domain_tags, (domtag) => contains(tag, domtag + "/")))), (x) => regexreplace(replace(x, "_", " "), "#(" + join(this.domain_tags, "|") + ")/", ""))), ", ") as "Type",
  dateformat(file.mtime, "yyyy-MM-dd") as "Last Modified"
from "2 - Snippets"
FLATTEN "[[" + file.path + "|" + truncate(file.name, 30) + "]]" as File
where ((domain and contains(domain, this.file.link) and (file.name != this.file.name)) or any(map(file.tags, (x) => econtains(this.domain_tags, substring(x, 1)))) or any(map(file.tags, (x) => any(map(this.domain_tags, (domtag) => contains(x, domtag + "/"))))))
sort file.mtime desc
```

> [!theorem|*] Bayes' Theorem
> Recall that **Bayes' theorem** for discrete random variables $X,Y$ states that
> $\mathbb{P}(X|Y)=\frac{\mathbb{P}(Y|X)\mathbb{P}(X)}{\mathbb{P}(Y)},$
> and for continuous random variables $X,Y$ with pdfs $f_{X}$ and $f_{Y}$, the conditional pdf is
> $f_{X|Y}(x|y)=\frac{f_{Y|X}(y|x)f_{X}(x)}{f_{Y}(y)},$
> where $f_{A|B}(a|b)=f_{A,B}(a,b) / f_{B}(b)$.

In Bayesian inference, the parameters are treated as random variables instead of fixed values, and they have their own distributions:

> [!definition|*] Priors and Posteriors
> The **prior (distribution)** $\pi(\theta)$ is the distribution of the parameter we assume to be true, before any data is seen.
>
> After collecting data, the updated distribution of $\theta$ is a new distribution called the **posterior (distribution)** $\pi(\theta|\mathbf{x})$:
> $\text{prior} \xrightarrow[\text{and update}]{\text{observe data}}\text{posterior}$

- We can use priors to express a formerly held or default opinion (e.g. ignorance, if we use [[Uninformative Priors]]).

The "update" on the prior is done via Bayes' theorem:
$\pi(\theta|\mathbf{x})=\frac{f(\mathbf{x}|\theta)\pi(\theta)} {f(\mathbf{x})}=\frac{f(\mathbf{x}|\theta)\pi(\theta)}{\int _{\Theta} f(\mathbf{x}|\theta)\pi(\theta) \, d\theta },$
but in most cases we don't need to compute the denominator, and can instead identify the posterior with the following:

> [!lemma|*] Updating the Bayesian Prior
> Since the denominator $f(\mathbf{x})$ is just a constant in the Bayesian context,
> $\underset{\text{posterior}}{\pi(\theta|\mathbf{x})} \propto \underset{\text{likelihood}}{f(\mathbf{x}|\theta)} \times \underset{\text{prior}}{\pi(\theta)},$
> so if we can identify the $\mathrm{RHS}$ as some familiar distribution, we can pinpoint the posterior without solving for the normalizing denominator. ^8a333e

- We can choose priors that make it easy to identify the RHS, simplifying our calculations (see [[Conjugate Priors]]); a small numerical sketch of such an update follows below.
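As a quick numerical check of the proportionality above, here is a minimal sketch, assuming a Beta prior on a Bernoulli success probability and Binomial data with made-up numbers. It compares the posterior obtained by normalizing $\text{likelihood}\times\text{prior}$ on a grid against the Beta posterior read off by conjugacy:

```python
import numpy as np
from scipy import stats

# Illustrative, assumed setup: theta is a Bernoulli success probability with a
# Beta(a, b) prior, and we observe k successes in n trials (Binomial likelihood).
a, b = 2.0, 2.0      # prior hyperparameters (chosen arbitrarily)
n, k = 20, 14        # observed data (made up)

# Unnormalized posterior on a grid: likelihood(theta) * prior(theta)
theta = np.linspace(0.0, 1.0, 2001)
unnorm = stats.binom.pmf(k, n, theta) * stats.beta.pdf(theta, a, b)
grid_posterior = unnorm / (unnorm.sum() * (theta[1] - theta[0]))  # numerical normalization

# Recognizing the RHS as a Beta kernel gives the posterior directly, Beta(a + k, b + n - k),
# with no need to evaluate the normalizing integral f(x).
conjugate_posterior = stats.beta.pdf(theta, a + k, b + n - k)

print(np.abs(grid_posterior - conjugate_posterior).max())  # small, up to grid error
```

The same pattern (recognize the kernel, read off the posterior) is what [[Conjugate Priors]] formalizes.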
## Bayesian Inference

### Credible Intervals

Credible intervals are the Bayesian equivalent of confidence intervals.

> [!definition|*] Credible Intervals
> A $100(1-\alpha)\%$ **credible set** is any set $C \subseteq \Theta$ such that the probability of $\theta \,|\, \mathbf{x}$ being in it is $100(1-\alpha)\%$:
> $C: \int _{C} \pi(\theta|\mathbf{x}) \, d\theta =1-\alpha.$

* The difference between the Bayesian credible interval $C$ and the frequentist confidence interval $I$ is just what is considered variable:
  $\begin{align*} &\text{Bayesian:} &&\text{variable $\theta|\mathbf{x}$ has probability $1-\alpha$ of being in $C$;}\\ &\text{Frequentist:} &&\text{variable $I(\mathbf{x})$ has probability $1-\alpha$ of containing $\theta$.} \end{align*}$

A **credible interval** $(\theta_{1},\theta_{2})$ is just a credible set that is also an interval.

A credible interval is **equal-tailed** if $\mathbb{P}(\theta|\mathbf{x} \le \theta_{1})=\mathbb{P}(\theta|\mathbf{x} \ge \theta_{2})$; for example, the two highlighted tails have equal area:

![[EqualTailsInterval.png#invert]]

A credible set/interval $C$ is **highest posterior density (HPD)** if it has the form
$C = \{ \theta:\pi(\theta|\mathbf{x}) \ge p_{\min} \}$
for some constant $p_{\min}$. Equivalently,
$C: \forall \theta \in C,\ \theta' \not\in C,\ \pi(\theta|\mathbf{x})>\pi(\theta'|\mathbf{x}).$
For example, the highlighted region is an HPD set, where $p_{\min}$ is the horizontal line:

![[HPDInterval.png#invert]]

### Prediction with the Posterior

Given observations $\mathbf{X}=(X_{1},\dots,X_{n})$ and the posterior $\theta \sim\pi(\theta|\mathbf{x})$ computed from them, the **(posterior) predictive density** of a new observation $X_{n+1}$ is $f(X_{n+1}|\mathbf{x})$.

* *That is, our updated belief about the underlying distribution $f$, after incorporating the knowledge from the previous data $\mathbf{x}$.*

The predictive density can be computed by:
$\begin{align*} f(X_{n+1}|\mathbf{x})&= \int f(X_{n+1},\theta|\mathbf{x}) \, d\theta &&\substack{\text{definition of }\\\text{marginal density}}\\ &= \int f(X_{n+1}|\theta,\mathbf{x}) \,\cdot\,\pi(\theta|\mathbf{x}) \, d\theta &&\substack{\text{definition of}\\\text{conditional density}}\\ &= \int \underset{\text{prediction}}{f(X_{n+1}|\theta)} \,\cdot\, \underset{\text{posterior}}{\pi(\theta|\mathbf{x})} \, d\theta &&\substack{\text{conditional independence of } X_{n+1} \\\text{ and } \mathbf{x}=(x_{1},\dots,x_{n}) \text{ given } \theta} \end{align*}$
i.e. the prediction averaged over the posterior.
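To make the last identity concrete, here is a minimal Monte Carlo sketch reusing the assumed Beta-Binomial setup from the earlier snippet: it averages the prediction over posterior draws and compares the result with the closed-form answer for a single future Bernoulli trial.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed Beta-Binomial setup (same made-up numbers as before);
# the posterior is Beta(a + k, b + n - k).
a, b, n, k = 2.0, 2.0, 20, 14
posterior = stats.beta(a + k, b + n - k)

# Monte Carlo version of  f(x_{n+1} | x) = ∫ f(x_{n+1} | θ) π(θ | x) dθ :
# draw θ from the posterior, then average the prediction f(x_{n+1} | θ) over the draws.
theta_draws = posterior.rvs(size=100_000, random_state=rng)
p_next_success = stats.bernoulli.pmf(1, theta_draws).mean()

# Closed form in this conjugate case: P(X_{n+1} = 1 | x) = E[θ | x] = (a + k) / (a + b + n).
print(p_next_success, (a + k) / (a + b + n))
```

For non-conjugate models the same average is typically taken over MCMC draws from the posterior instead of exact posterior samples.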
## Hierarchical Bayesian Models

With data $\mathbf{Y}=(Y_{1},\dots,Y_{n})\sim(f(\theta_{1}),\dots,f(\theta_{n}))$, where $\theta_{1},\dots,\theta_{n}$ are parameters, simpler approaches to inferring their values include:

- Pooling the data and estimating $\hat{\theta}=\hat{\theta}_{1}=\dots =\hat{\theta}_{n}$ as a frequentist approach, or using a common Bayesian prior $\pi(\theta)$ and posterior $\pi(\theta\,|\,\mathbf{Y})$.
- Modelling each data point $Y_{i}$ and parameter $\theta_{i} \approx \hat{\theta}_{i}$ completely separately, ignoring the rest of the data set.

Neither of these makes much sense, motivating the **hierarchical model**: the parameters $\theta_{1},\dots,\theta_{n}$ are different, but all drawn from the same distribution $\pi(\theta;\phi)$, which is determined by the **hyperparameters** $\phi$; the hyperparameters are themselves treated as variables to be inferred.

> [!definition|*] Hierarchical Bayesian Methods
>
> A hierarchical Bayesian model contains:
> - The hyperparameters $\phi \sim P$, the **hyperprior**, with density $p(\phi)$;
> - The (conditional) prior $\theta \,|\,\phi \sim \pi_{\phi}$, with density $p(\theta \,|\,\phi)$;
> - The likelihood $y \,|\,\theta,\phi \sim P_{\theta}$, with density $p(y \,|\, \theta)$.
>
> Note that $P_{\theta}$ is independent of $\phi$.
>
> Independence assumptions include (for $i \ne j$):
> - $\theta_{i}$ and $\theta_{j}$ are not necessarily independent, but $\theta_{i}\,|\,\phi$ and $\theta_{j} \,|\, \phi$ are.
> - $y_{i}\,|\,\phi$ and $y_{j}\,|\,\phi$ are independent.
>
> The **joint prior** is $p(\theta,\phi)=p(\phi)\,p(\theta \,|\,\phi)$.
> The **joint posterior** is $p(\theta,\phi \,|\,y) \propto p(y\,|\,\theta,\phi)\,p(\theta,\phi)=p(y \,|\, \theta)\,p(\theta , \phi)$.

http://www2.stat.duke.edu/~pdh10/Teaching/581/LectureNotes/bayes.pdf
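To illustrate the three layers in the definition above, here is a minimal forward simulation of an assumed normal-normal hierarchy (all distributional choices and numbers are made up for illustration): drawing $\phi$, then $\theta_{i}\,|\,\phi$, then $y_{ij}\,|\,\theta_{i}$ mirrors the hyperprior / conditional prior / likelihood structure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed normal-normal hierarchy, purely illustrative.
n_groups, n_obs = 8, 30

# Hyperprior:  phi = (mu, tau) ~ p(phi)
mu = rng.normal(0.0, 5.0)        # population-level mean
tau = abs(rng.normal(0.0, 2.0))  # between-group spread

# Conditional prior:  theta_i | phi ~ N(mu, tau^2), i.i.d. given phi
theta = rng.normal(mu, tau, size=n_groups)

# Likelihood:  y_ij | theta_i ~ N(theta_i, 1), independent of phi given theta_i
y = rng.normal(theta[:, None], 1.0, size=(n_groups, n_obs))

# Per-group sample means target the individual theta_i;
# the pooled mean only targets the shared hyperparameter mu.
print("group means:", np.round(y.mean(axis=1), 2))
print("pooled mean:", round(y.mean(), 2), "| true mu:", round(mu, 2))
```

Inference then works on the joint posterior $p(\theta,\phi\,|\,y)$ given above, rather than on the pooled or fully separate estimates.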