(for any $S \subset C \cap D$, not necessarily their separator) if they satisfy the marginal equality on $S$: $\sum_{x_{C-S}}\phi_{C}(x_{C})=\sum_{x_{D-S}}\phi_{D}(x_{D}) ~~~~~(\ast)$
> >
> > First a few useful facts:
> > - $[1]$ Consistency over $S$ implies consistency over any $S' \subset S$, as we can sum both sides of $(\ast)$ over $x_{S-S'}$.
> > - $[2]$ If there is another node $B$ where $\phi_{C},\phi_{B}$ are consistent (over $C \cap B$) and $\phi_{D},\phi_{B}$ are consistent (over $D \cap B$), then $\phi_{C},\phi_{D}$ are consistent over $B\cap C\cap D$ (easy proof using $[1]$).
> >
> > Now we prove the theorem itself.
> >
> > By truncating the path, we WLOG prove the result for the two terminal vertices $C,D$, i.e. that $\phi_{C},\phi_{D}$ are consistent. Induct on the length of the path, where the base case ($C \sim D$ in $\mathcal{T}$) is trivially true.
> >
> > Suppose $C \sim B$ in the path. By assumption $\phi_{C},\phi_{B}$ are consistent over the set $S_{1}:= C \cap B$.
> >
> > On the other hand, since $\pi_{DB}$ is a shorter path, induction gives $\phi_{D}$ consistent with $\phi_{B}$ over the set $S_{2}:=D \cap B$.
> >
> > Now $\phi_{C},\phi_{D},\phi_{B}$ are consistent over $S_{1} \cap S_{2}=B\cap C\cap D$ due to $[2]$, but by definition of the junction tree, $C\cap D \subset B$, so $\phi_{C},\phi_{D}$ are consistent over $B\cap C\cap D=C\cap D$ (i.e. the original definition).

> [!idea] Therefore, to achieve $(2)$, we only need to guarantee consistency between neighbors on a junction tree.

### Message Passing

> [!bigidea] Big Idea
> Message passing is an algorithm to get consistency between neighbors on a junction tree.

We define an operation on the junction tree $\mathcal{T}$ that makes the potentials $\phi_{C},\phi_{D}$ of neighboring cliques $C,D$ consistent.
- WLOG suppose $S$ is the separator of $C$ in the running intersection sequence, so that $S=C \cap D$: during the construction of $\mathcal{T}$, $C:=C_{k}$ was attached to $C_{\sigma(k)}=:D$, and $S=C_{k} \cap C_{\sigma(k)}$ by definition of separators. So their edge looks like

```mermaid
graph LR
A1["(further down the tree)"] --- C
C --- S(("S"))
S --- D
D --- R["(towards the root)"]
```

Note that $S$ is not a node of the tree.

> [!definition|*] Message Passing
> The **message passing operation** (from $C$ to $D$) is defined as $\begin{align*} \phi_{S}'(x_{S})&:= \sum_{x_{C\setminus S}} \phi_{C}(x_{C}),\\ \phi_{D}'(x_{D})&:=\frac{\phi_{D}(x_{D})\phi'_{S}(x_{S})}{\phi_{S}(x_{S})}, \end{align*}$ and the new $\phi'_{S},\phi'_{D}$ will be used as updated versions of $\phi_{S},\phi_{D}$.

We want to repeatedly apply this operation to achieve step $(2)$, so we verify that it has the desired properties (a small numerical sketch of the update follows this list):
- It preserves $(\dagger)$, because it only modifies the terms $\frac{\phi_{C}}{\phi_{S}}\cdot \frac{\phi_{D}}{\dots} \longrightarrow \frac{\phi_{C}}{\phi_{S}'}\cdot \frac{\phi_{D}'}{\dots},$ where $\dots$ is the (unchanged) potential of $D$'s own separator. The two sides are equal by the choice of $\phi_{D}'$, since $\phi'_{D}/\phi'_{S}=\phi_{D}/\phi_{S}$.
- $\phi_{C}$ and $\phi_{S}'$ are now consistent, by construction of $\phi'_{S}$.
- $\phi'_{S}$ and $\phi_{D}'$ are consistent if $\phi_{S}$ and $\phi_{D}$ were: $\begin{align*} \sum_{x_{D \setminus S}}\phi'_{D}(x_{D})&= \sum_{x_{D \setminus S}} \frac{\phi_{D}(x_{D})\phi'_{S}(x_{S})}{\phi_{S}(x_{S})}\\ &= \frac{\phi'_{S}(x_{S})}{\phi_{S}(x_{S})}\sum_{x_{D \setminus S}} \phi_{D}(x_{D})\\ &= \phi'_{S}(x_{S}), \end{align*}$ where the last step uses $\sum_{x_{D \setminus S}} \phi_{D}(x_{D})=\phi_{S}(x_{S})$, i.e. the assumed consistency of $\phi_{S},\phi_{D}$.
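To make the update concrete, here is a minimal numerical sketch of a single message pass on tabular potentials. The clique/separator choice, the array-axis conventions, and the numbers are made up purely for illustration (this is not the `pass_message` routine used later):

```python
import numpy as np

# Toy setup: cliques C = {1, 2}, D = {2, 3}, separator S = {2}, all variables binary.
# phi_C[x1, x2], phi_D[x2, x3], phi_S[x2]; the entries are arbitrary nonnegative numbers.
phi_C = np.array([[0.2, 0.5],
                  [0.3, 1.0]])
phi_D = np.array([[0.4, 0.6],
                  [0.7, 0.3]])
phi_S = np.array([1.0, 1.0])

# One message pass from C to D:
phi_S_new = phi_C.sum(axis=0)                      # phi'_S(x_S) = sum over x_{C \ S} of phi_C(x_C)
phi_D_new = phi_D * (phi_S_new / phi_S)[:, None]   # phi'_D = phi_D * phi'_S / phi_S

# phi_C and phi'_S are consistent by construction:
assert np.allclose(phi_C.sum(axis=0), phi_S_new)

# (dagger) is preserved: the product phi_C * phi_D / phi_S over (x1, x2, x3) is unchanged.
before = phi_C[:, :, None] * phi_D[None, :, :] / phi_S[None, :, None]
after = phi_C[:, :, None] * phi_D_new[None, :, :] / phi_S_new[None, :, None]
assert np.allclose(before, after)
```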
Therefore, the effect of one pass of this operation (from $C$ to $D$) is:

```mermaid
graph LR
C --- |"consistent"| S(("S"))
S --- |"consistent if<br>already consistent"| D
```

i.e. the first edge $C \sim S$ is made consistent unconditionally, while consistency of $S \sim D$ is preserved.

Suppose we do it once more, but in reverse (passing from $D$ to $C$); then we would achieve:

```mermaid
graph LR
C --- |"consistency preserved<br>from first pass"| S(("S"))
S --- |"now made consistent"| D
```

This makes $\phi_{C},\phi_{D}$ consistent since $S=C\cap D$.

> [!idea] Applying message passing twice to the neighbors $C,D$ on $\mathcal{T}$ makes sure $\phi_{C},\phi_{D}$ are consistent.

### Belief Propagation

Therefore, *our goal is to apply this operation on every edge of $\mathcal{T}$ twice: once "forward" and once "backward"*. To make sure that the directions are well-defined, we simply take a topological order $<$ on $\mathcal{T}$, and define "forward" as descending (from the leaves to a root) and "backward" as ascending (from that root to the leaves).

The forward pass is called **collection**, the backward pass is **distribution**, and the entire process is called **belief propagation**:

```python fold
def pass_message(coming_from, going_to, separator, potentials):
    ...  # modifies potentials in place
    return

def collect(T: JunctionTree, potentials):
    for C in reversed(T.topological_order[1:]):
        # going from leaves to the root, skipping the root
        D = T.sigma[C]
        S = T.separator_of[C]
        pass_message(coming_from = C, going_to = D, separator = S, potentials = potentials)
    return

def distribute(T: JunctionTree, potentials):
    for C in T.topological_order[1:]:
        # going from the root to the leaves,
        # skipping the root since it has no predecessor
        D = T.sigma[C]
        S = T.separator_of[C]
        pass_message(coming_from = D, going_to = C, separator = S, potentials = potentials)
    return

def propagate_belief(T: JunctionTree, potentials):
    collect(T, potentials)
    distribute(T, potentials)
```

(of course `JunctionTree` and `pass_message` are not implemented, so the code won't run)

> [!theorem|*] Belief Propagation Gives Global Consistency
> Suppose we have a junction tree $\mathcal{T}$ and initialized potentials $\{ \phi_{C}(x_{C}) \}\cup \{ \phi_{S}(x_{S}) \}$ that satisfy $(\dagger)$.
>
> After applying belief propagation, the potentials are consistent and still satisfy $(\dagger)$, so we have computed $\phi_{C}(x_{C})=p(x_{C}), ~~\phi_{S}(x_{S})=p(x_{S}) ~~~\forall C \in \mathcal{C},S\in \mathcal{S}.$

## Directed or Non-Factorizable Graphs

For a directed graph $G$, we can simply use the result that
![[Directed Graphical Models#^39ee4c]]
so we can moralize $G$ and use the belief propagation algorithm to compute marginals of $p$ on $G^{m}$... if $G^{m}$ is factorizable.

### Non-Factorizable Graphs

If $G$ is not factorizable, we can no longer construct a junction tree $\mathcal{T}$ for it. Now what?

Interestingly, if $G$ (directed or not) is a subgraph of a larger graph $G'$, then $p\text{ Markov wrt. } G \Longrightarrow p\text{ Markov wrt. }G'.$

> [!proof]-
> For directed graphs, this is because $\mathrm{pa}_{G}(v) \subset \mathrm{pa}_{G'}(v)$, so trivially $p(x_{v}~|~x_{\mathrm{pa}_{G}(v)})$ satisfies the role of $p(x_{v}~|~x_{\mathrm{pa}_{G'}(v)})$ in [[Directed Graphical Models#^e23296|the definition of factorizability (on a directed graph)]].
>
> For undirected graphs, we simply let $\psi_{C}\equiv 1$ for any newly created cliques in [[Undirected Graphical Models#^c6aa89|the definition]].
Therefore, *we can simply embed our model into a larger graph*, in particular by [[Undirected Graphical Models#^c929b1|triangulating]] it, i.e. adding edges until it is triangulated.
- This comes with the downside that inference on a larger graph is more expensive.
- Finding a triangulation that gives the smallest cliques is a very hard problem in general.

Going back to the lung disease example, the moral graph is:

![[DirectedGraphMoralizedExample.png#invert|center|300]]

But it is not factorizable, since the cycle $(E,L,S,B)$ does not contain a chord. To make it factorizable, we triangulate the cycle to get

![[TriangulatedMoralGraph.png#invert|center|300]]

In conclusion, the **junction tree algorithm** for finding clique marginals on a directed graph $G$ is given by:

> [!algorithm] Junction Tree Algorithm
> $[1]$ Moralize the graph to get $G^{m}$.
> $[2]$ Triangulate $G^m$ to make it decomposable.
> $[3]$ Construct a junction tree $\mathcal{T}$ of the triangulated graph.
> $[4]$ Run the belief propagation algorithm on $\mathcal{T}$, obtaining
> $[\text{END}]$ the set of potentials $\{ \phi_{C}(x_{C}) \}$ that equal the marginals $\{ p(x_{C}) \}$.

## Computations

### Adding Evidence

In this context, **evidence** is just new information that we wish to condition on, e.g. $p(X_{\text{tuberculosis}},X_{\text{cancer}})\xrightarrow[\text{went to Asia}]{\text{evidence:}}p(X_{\text{tuberculosis}},X_{\text{cancer}}~|~X_{\text{Asia}}=1).$

In general terms, with evidence (event) $E$, we are now interested in $p(x_{C} ~|~ E)$. To do so, we can simply *modify one potential to be conditional on $E$, and rerun belief propagation to make everything consistent again.*

We only consider the simple case where $E=\{ X_{e}=t \}$ for some known number $t$. Imitating the update $p(x ~|~E)= \frac{p(x)\cdot \mathbf{1}\{ x_{e} =t\}}{p(X_{e}=t)},$ we can similarly update, for some clique $C^{\ast} \ni e$, $\phi'_{C^{\ast}}(x_{C^{\ast}}) := \frac{\phi_{C^{\ast}}(x_{C^{\ast}})\cdot \mathbf{1}_{x_{e}=t}}{p(X_{e}=t)}.$
- We can compute the denominator as $p(X_{e}=t)=\sum_{x_{C^{\ast}-\{ e \}}}\phi_{C^{\ast}}(x_{C^{\ast}-\{ e \}}, x_{e}=t)$, i.e. by marginalising $\phi_{C^{\ast}}$ over the rest of $C^{\ast}$ (valid because, after belief propagation, $\phi_{C^{\ast}}=p(x_{C^{\ast}})$).

For example, if we inject the evidence $\{ X=0 \}$ into a 2-clique of binary variables $\{ X,Y \}$, its potential/density $\phi_{XY}(x,y)=p(x,y)$ will be updated to $\phi'_{XY}(x,y)=p(x,y ~|~ X=0)$, as $\underset{\phi_{XY}(x,y)=p(x,y)}{\boxed{\begin{array} {c|cc} x \setminus y & 0 & 1\\ \hline 0 & 0.05 & 0.45\\ 1 & 0.1 & 0.4 \end{array}}} \xrightarrow{\text{given }\{ X=0 \}} \underset{\phi'_{XY}(x,y)=p(x,y ~|~ X=0)}{\boxed{\begin{array} {c|cc} x \setminus y & 0 & 1\\ \hline 0 & 0.1 & 0.9\\ 1 & 0 & 0 \end{array}}}$

Note that $(\dagger)$ now holds for $p(x ~|~ E)$ (instead of $p(x)$): if $S$ is the separator corresponding to $C^{\ast}$, $\begin{align*} \prod_{k} \frac{\phi_{C_{k}}(x_{C_{k}})}{\phi_{S_{k}}(x_{S_{k}})}&= \frac{\phi'_{C^{\ast}}(x_{C^{\ast}})}{\phi_{S}(x_{S})}\prod_{k : C_{k} \ne C^{\ast}} \frac{\phi_{C_{k}}(x_{C_{k}})}{\phi_{S_{k}}(x_{S_{k}})}\\ &= \frac{\mathbf{1}_{x_{e}=t}}{p(X_{e}=t)}\prod_{k} \frac{\phi_{C_{k}}(x_{C_{k}})}{\phi_{S_{k}}(x_{S_{k}})}\\[0.4em] &= \frac{p(x)\cdot \mathbf{1}\{ x_{e} =t\}}{p(X_{e}=t)}&& \left[\substack{\{ \phi_{C_{k}} ~|~ k\}\cup \{ \phi_{S_{k}}~|~k \} \\ \text{ satisfies } (\dagger) \text{ for }p(x)} \right]\\[0.8em] &= p(x ~|~ X_{e}=t). \end{align*}$

Therefore, running belief propagation again and making the potentials consistent will produce potentials $\phi_{C}=p(x_{C}~|~X_{e}=t)$.
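For concreteness, here is a minimal standalone sketch of this evidence update on the $2\times 2$ table above (the array layout and variable names are chosen only for illustration):

```python
import numpy as np

# The 2x2 example above: phi_XY[x, y] = p(x, y) for binary X, Y.
phi_XY = np.array([[0.05, 0.45],
                   [0.10, 0.40]])

t = 0                                # the evidence {X = 0}
p_evidence = phi_XY[t, :].sum()      # p(X = 0), marginalising phi_XY over the rest of the clique

# phi'_XY(x, y) = phi_XY(x, y) * 1{x = 0} / p(X = 0)
phi_XY_new = np.zeros_like(phi_XY)
phi_XY_new[t, :] = phi_XY[t, :] / p_evidence

assert np.allclose(phi_XY_new, [[0.1, 0.9],
                                [0.0, 0.0]])   # matches the updated table above
```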
Luckily, we only need to run one direction of belief propagation, namely *the `distribute` direction with $\mathcal{T}$ rooted at $C^{\ast}$.*

Consider the edge between neighboring cliques $C,C^{\ast}$:

```mermaid
graph LR
A1["(further down the tree)"] --- C
C ---|consistent| S(("S"))
S ---|probably not<br>consistent| D[C*]
```

where $\phi_{C},\phi_{S}$ are consistent as they were never altered. After just one pass of `distribute`:

```mermaid
graph LR
A1["(further down the tree)"] --- C
C ---|consistency unchanged| S(("S"))
S ---|made consistent| D[C*]
```

Similar logic applies to the rest of the tree, so indeed one pass of `distribute` is enough to make all potentials consistent again. Since $(\dagger)$ is satisfied with $p(x ~|~ E)$, every marginal must now also be conditioned on $E$.

Therefore, the algorithm for adding evidence is:

> [!algorithm] Adding Evidence
> Input evidence $E=\{ X_{e}=t \}$.
>
> $[1]$ Find a clique $C^{\ast}$ with $e \in C^{\ast}$. Find its index $k^{\ast}:C_{k^{\ast}}=C^{\ast}$ in the running intersection sequence.
> $[2]$ Modify $\phi_{C^{\ast}} \leftarrow \phi_{C^{\ast}}\cdot \mathbf{1}\{ x_{e}=t \} / p(X_{e}=t)$.
> $[3]$ Re-root $\mathcal{T}$ at $C^{\ast}$, and find a new topological order.
> $[4]$ Run `distribute(T, potentials)`.

### Adding Evidence Economically

The algorithm above updates the entire tree's potentials with the evidence $E$. What if we are only interested in $p(x_{z}~|~x_{e})$ for a single variable $Z$ (corresponding to a vertex $z$ contained in some clique $D$)?

In fact, we can restrict to the path between $C^{\ast}$ (the clique containing $e$) and $D$ on the junction tree. Then we only need to run `distribute` from $C^{\ast}$ as root, along that path.

> [!examples] Finding the path on the junction tree
> As an example, suppose we want to compute $p(z ~|~ S=s)$ on the graph
> ```mermaid
> graph LR
> A[Z, T] --- B["Y, Z (D)"] --- C[X, Y] --- D[X, V, U]
> B --- E[Y, W]
> C --- F["S,X (C*)"]
> ```
> Then we only need to consider the subgraph
> ```mermaid
> graph LR
> A[Z, T] --- B[Y, Z] --- C[X, Y] --- D[X, V, U]
> B --- E[Y, W]
> C --- F[S,X]
> classDef blue fill:#558,stroke:#333,stroke-width:4px;
> class B blue
> class C blue
> class F blue
> ```
> We inject evidence $\{ S=s \}$ into the potential $\phi_{SX}$, then run `distribute` to update $\phi_{XY}$ and then $\phi_{YZ}$.

This produces the right marginal $p(x_{D} ~|~ x_{e})$ because:
- When we run the full algorithm of adding evidence (which updates the whole junction tree), the topological order is arbitrary, as long as it starts at $C^{\ast} \ni e$.
- Therefore, we can WLOG let the cliques on the $C^{\ast}\sim D$ path come first, so the economical algorithm described above simply terminates the full algorithm early.
- Since each potential, in particular $\phi_{D}$, is only updated once (when `distribute` calls `pass_message` from its parent in $\mathcal{T}$), terminating early still produces the correct $\phi_{D}=p(x_{D}~|~x_{e})$, just like the full algorithm.

### Combining Conditionals

Suppose after adding evidence, we have computed $p(x_{C_{1}}~|~x_{E}),~p(x_{C_{2}}~|~x_{E}).$ How do we merge the two distributions to obtain $p(x_{C_{1} \cup C_{2}} ~|~ x_{E})$?
Suppose the junction tree looks like

```mermaid
%%{ init: { 'theme': 'base', 'themeVariables': { 'darkMode': true, 'primaryColor': '#bbb', 'secondaryColor': '#ddd', 'primaryTextColor': '#333' } } }%%
flowchart LR
subgraph branchC["Branch with C1"]
direction LR
rest1["Rest of the branch"] --- C1
end
subgraph branchD["Branch with D"]
direction LR
D --- |"(...)"|C2
C2 --- rest2["Rest of the branch"]
end
C1 --- S(("S"))
S --- D
```

where $C_{1} \sim D$ in the junction tree for some $D$ with separator $S$. Define $V_{i}:= \{ v ~|~ \exists C \in \text{branch with }C_{i}, v \in C \}$ to be the set of vertices that appear in the same branch as $C_{i}$.

We are tempted to say $X_{V_{1}} \perp X_{V_{2}} ~|~ X_{S}$, but is that true?
- Since $X\sim p$ is Markov wrt. $G$, we know that if $V_{1},V_{2}$ are separated by $S$ in $G$, then the independence must hold (global Markov).

> [!lemma|*] Separators actually Separate Stuff
> Indeed, $V_{1} \perp V_{2} ~|~ S$ in $G$.
>
> > [!proof]-
> > Take any vertices $v_{1} \in V_{1}-S,v_{2} \in V_{2}-S$, and let $\pi$ be any path in $G$ between $v_{1},v_{2}$. We show that $\pi \cap S \neq \varnothing$, so that removing $S$ cuts off $\pi$.
> >
> > (Forget what $C_{1}, C_{2}$ meant outside this proof -- we don't need them here.)
> >
> > Let $u_{1}u_{2}$ be the edge via which $\pi$ (starting from $v_{1}$) leaves $V_{1} \cup S$ for the first time, and pick a clique $C_{1}$ in the first branch with $u_{1} \in C_{1} \subset V_{1} \cup S$ and a clique $C_{2}$ in the second branch with $u_{2} \in C_{2} \subset V_{2} \cup S$ (these exist because $u_{1} \in V_{1}\cup S$ while $u_{2} \notin V_{1}\cup S$). Since $u_{1}\sim u_{2}$ in $G$, they must also share a clique, say $C$. As a node of the junction tree, $C$ lies in the same branch as either $C_{1}$ or $C_{2}$. In the former case, the path in $\mathcal{T}$ from $C$ to $C_{2}$ crosses the edge with separator $S$, so by definition of junction trees $u_{2} \in C \cap C_{2} \subset S$. In the latter case, the same argument gives $u_{1} \in C \cap C_{1} \subset S$. So removing $S$ cuts off $\pi$.

Therefore, we can say $X_{C_{1}} \perp X_{C_{2}} ~|~ X_{S}$ (since $C_{i} \subset V_{i}$, and this remains true after further conditioning on the evidence $x_{E}$), and combine the conditional distributions as $p(x_{C_{1} \cup C_{2}} ~|~ x_{E})=p(x_{C_{1}}~|~x_{E},x_{S})\cdot p(x_{C_{2}} ~|~x_{E},x_{S})\cdot p(x_{S}~|~x_{E}).$
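As a sanity check, here is a minimal numerical sketch of this combination rule, in the simplest special case where $C_{1}$ and $C_{2}$ are adjacent cliques with separator $S=C_{1}\cap C_{2}$; the tables and names below are made up purely for illustration:

```python
import numpy as np

# Toy check: C1 = {A, S}, C2 = {S, B}, separator S, all variables binary, A ⊥ B | S.
p_S = np.array([0.3, 0.7])              # p(s | E)
p_A_given_S = np.array([[0.9, 0.1],     # p(a | s, E), rows indexed by s
                        [0.2, 0.8]])
p_B_given_S = np.array([[0.6, 0.4],     # p(b | s, E), rows indexed by s
                        [0.5, 0.5]])

# The "true" joint p(a, s, b | E), used only to check the answer at the end.
p_true = np.einsum('s,sa,sb->asb', p_S, p_A_given_S, p_B_given_S)

# What belief propagation hands us: the two clique conditionals.
p_C1 = p_true.sum(axis=2)               # p(a, s | E), indexed [a, s]
p_C2 = p_true.sum(axis=0)               # p(s, b | E), indexed [s, b]

# Combine:  p(a, s, b | E) = p(a | s, E) * p(b | s, E) * p(s | E)
#                          = p(a, s | E) * p(s, b | E) / p(s | E).
p_sep = p_C1.sum(axis=0)                # p(s | E), recovered from a clique marginal
p_combined = p_C1[:, :, None] * p_C2[None, :, :] / p_sep[None, :, None]

assert np.allclose(p_combined, p_true)
```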