(for any $S \subset C \cap D$, not necessarily their separator) if they satisfy the marginal equality on $S$: $\sum_{x_{C-S}}\phi_{C}(x_{C})=\sum_{x_{D-S}}\phi_{D}(x_{D}) ~~~~~(\ast)$
> >
> > First a few useful facts:
> > - $[1]$ Consistency over $S$ implies consistency over any $S' \subset S$, as we can sum both sides of $(\ast)$ over $x_{S-S'}$.
> > - $[2]$ If there is another node $B$ where $\phi_{C},\phi_{B}$ are consistent (over $C \cap B$) and $\phi_{D},\phi_{B}$ are consistent (over $D \cap B$), then $\phi_{C},\phi_{D}$ are consistent over $B\cap C\cap D$ (easy proof using $[1]$).
> >
> > Now we prove the theorem itself.
> >
> > By truncating the path, we WLOG prove the result for the two terminal vertices $C,D$, i.e. that $\phi_{C},\phi_{D}$ are consistent. Induct on the length of the path, where the base case ($C \sim D$ in $\mathcal{T}$) is trivially true.
> >
> > Suppose $C \sim B$ in the path. By assumption $\phi_{C},\phi_{B}$ are consistent over the set $S_{1}:= C \cap B$.
> >
> > On the other hand, since $\pi_{DB}$ is a shorter path, induction gives $\phi_{D}$ consistent with $\phi_{B}$ over the set $S_{2}:=D \cap B$.
> >
> > Now $\phi_{C},\phi_{D},\phi_{B}$ are consistent over $S_{1} \cap S_{2}=B\cap C\cap D$ due to $[2]$, but by definition of the junction tree, $C\cap D \subset B$, so $\phi_{C},\phi_{D}$ are consistent over $B\cap C\cap D=C\cap D$ (i.e. the original definition).

> [!idea] Therefore, to achieve $(2)$, we only need to guarantee consistency between neighbors on a junction tree.

### Message Passing

> [!bigidea] Big Idea
> Message passing is an algorithm to get consistency between neighbors on a junction tree.

We define an operation on the junction tree $\mathcal{T}$ that makes the potentials $\phi_{C},\phi_{D}$ of neighboring cliques $C,D$ consistent.
- WLOG suppose $S$ is the separator of $C$ in the running intersection sequence, so that $S=C \cap D$: during the construction of $\mathcal{T}$, $C:=C_{k}$ was attached to $C_{\sigma(k)}=:D$, and $S=C_{k} \cap C_{\sigma(k)}$ by definition of separators. So their edge looks like

```mermaid
graph LR
A1["(further down the tree)"] --- C
C --- S(("S"))
S --- D
D --- R["(towards the root)"]
```

Note that $S$ is not a node of the tree.

> [!definition|*] Message Passing
> The **message passing operation** (from $C$ to $D$) is defined as $\begin{align*} \phi_{S}'(x_{S})&:= \sum_{x_{C\setminus S}} \phi_{C}(x_{C}),\\ \phi_{D}'(x_{D})&:=\frac{\phi_{D}(x_{D})\phi'_{S}(x_{S})}{\phi_{S}(x_{S})}, \end{align*}$ and the new $\phi'_{S},\phi'_{D}$ will be used as updated versions of $\phi_{S},\phi_{D}$.

We want to repeatedly apply this operation to achieve step $(2)$, so we verify that it has the desired properties (a small numerical sketch of the update follows this list):
- It preserves $(\dagger)$, because it only modifies the terms $\frac{\phi_{C}}{\phi_{S}}\cdot \frac{\phi_{D}}{\dots} \longrightarrow \frac{\phi_{C}}{\phi_{S}'}\cdot \frac{\phi_{D}'}{\dots},$ where $\dots$ is the (unchanged) potential of $D$'s own separator. The two sides are equal by the choice of $\phi_{D}'$, since $\phi'_{D}/\phi'_{S}=\phi_{D}/\phi_{S}$.
- $\phi_{C}$ and $\phi_{S}'$ are now consistent, by construction of $\phi'_{S}$.
- $\phi'_{S}$ and $\phi_{D}'$ are consistent if $\phi_{S}$ and $\phi_{D}$ were: $\begin{align*} \sum_{x_{D \setminus S}}\phi'_{D}(x_{D})&= \sum_{x_{D \setminus S}} \frac{\phi_{D}(x_{D})\phi'_{S}(x_{S})}{\phi_{S}(x_{S})}\\ &= \frac{\phi'_{S}(x_{S})}{\phi_{S}(x_{S})}\sum_{x_{D \setminus S}} \phi_{D}(x_{D})\\ &= \phi'_{S}(x_{S}), \end{align*}$ where the last step uses $\sum_{x_{D \setminus S}} \phi_{D}(x_{D})=\phi_{S}(x_{S})$, i.e. the assumed consistency of $\phi_{S},\phi_{D}$.
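To make the update concrete, here is a minimal numerical sketch of a single message pass on tabular potentials. The clique/separator choice, the array-axis conventions, and the numbers are made up purely for illustration (this is not the `pass_message` routine used later):

```python
import numpy as np

# Toy setup: cliques C = {1, 2}, D = {2, 3}, separator S = {2}, all variables binary.
# phi_C[x1, x2], phi_D[x2, x3], phi_S[x2]; the entries are arbitrary nonnegative numbers.
phi_C = np.array([[0.2, 0.5],
                  [0.3, 1.0]])
phi_D = np.array([[0.4, 0.6],
                  [0.7, 0.3]])
phi_S = np.array([1.0, 1.0])

# One message pass from C to D:
phi_S_new = phi_C.sum(axis=0)                      # phi'_S(x_S) = sum over x_{C \ S} of phi_C(x_C)
phi_D_new = phi_D * (phi_S_new / phi_S)[:, None]   # phi'_D = phi_D * phi'_S / phi_S

# phi_C and phi'_S are consistent by construction:
assert np.allclose(phi_C.sum(axis=0), phi_S_new)

# (dagger) is preserved: the product phi_C * phi_D / phi_S over (x1, x2, x3) is unchanged.
before = phi_C[:, :, None] * phi_D[None, :, :] / phi_S[None, :, None]
after = phi_C[:, :, None] * phi_D_new[None, :, :] / phi_S_new[None, :, None]
assert np.allclose(before, after)
```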
Therefore, the effect of one pass of this operation (from $C$ to $D$) is:

```mermaid
graph LR
C --- |"consistent"| S(("S"))
S --- |"consistent if<br>already consistent"| D
```

i.e. the first edge $C \sim S$ is made consistent unconditionally, while consistency of $S \sim D$ is preserved.

Suppose we do it once more, but in reverse (passing from $D$ to $C$); then we would achieve:

```mermaid
graph LR
C --- |"consistency preserved<br>from first pass"| S(("S"))
S --- |"now made consistent"| D
```

This makes $\phi_{C},\phi_{D}$ consistent since $S=C\cap D$.

> [!idea] Applying message passing twice to the neighbors $C,D$ on $\mathcal{T}$ makes sure $\phi_{C},\phi_{D}$ are consistent.

### Belief Propagation

Therefore, *our goal is to apply this operation on every edge of $\mathcal{T}$ twice: once "forward" and once "backward"*. To make sure that the directions are well-defined, we simply take a topological order $<$ on $\mathcal{T}$, and define "forward" as descending (from the leaves to a root) and "backward" as ascending (from that root to the leaves).

The forward pass is called **collection**, the backward pass is **distribution**, and the entire process is called **belief propagation**:

```python fold
def pass_message(coming_from, going_to, separator, potentials):
    ...  # modifies potentials in place
    return

def collect(T: JunctionTree, potentials):
    for C in reversed(T.topological_order[1:]):
        # going from leaves to the root, skipping the root
        D = T.sigma[C]
        S = T.separator_of[C]
        pass_message(coming_from = C, going_to = D, separator = S, potentials = potentials)
    return

def distribute(T: JunctionTree, potentials):
    for C in T.topological_order[1:]:
        # going from the root to the leaves,
        # skipping the root since it has no predecessor
        D = T.sigma[C]
        S = T.separator_of[C]
        pass_message(coming_from = D, going_to = C, separator = S, potentials = potentials)
    return

def propagate_belief(T: JunctionTree, potentials):
    collect(T, potentials)
    distribute(T, potentials)
```

(of course `JunctionTree` and `pass_message` are not implemented, so the code won't run)

> [!theorem|*] Belief Propagation Gives Global Consistency
> Suppose we have a junction tree $\mathcal{T}$ and initialized potentials $\{ \phi_{C}(x_{C}) \}\cup \{ \phi_{S}(x_{S}) \}$ that satisfy $(\dagger)$.
>
> After applying belief propagation, the potentials are consistent and still satisfy $(\dagger)$, so we have computed $\phi_{C}(x_{C})=p(x_{C}), ~~\phi_{S}(x_{S})=p(x_{S}) ~~~\forall C \in \mathcal{C},S\in \mathcal{S}.$

## Directed or Non-Factorizable Graphs

For a directed graph $G$, we can simply use the result that
![[Directed Graphical Models#^39ee4c]]
so we can moralize $G$ and use the belief propagation algorithm to compute marginals of $p$ on $G^{m}$... if $G^{m}$ is factorizable.

### Non-Factorizable Graphs

If $G$ is not factorizable, we can no longer construct a junction tree $\mathcal{T}$ for it. Now what?

Interestingly, if $G$ (directed or not) is a subgraph of a larger graph $G'$, then $p\text{ Markov wrt. } G \Longrightarrow p\text{ Markov wrt. }G'.$

> [!proof]-
> For directed graphs, this is because $\mathrm{pa}_{G}(v) \subset \mathrm{pa}_{G'}(v)$, so trivially $p(x_{v}~|~x_{\mathrm{pa}_{G}(v)})$ satisfies the role of $p(x_{v}~|~x_{\mathrm{pa}_{G'}(v)})$ in [[Directed Graphical Models#^e23296|the definition of factorizability (on a directed graph)]].
>
> For undirected graphs, we simply let $\psi_{C}\equiv 1$ for any newly created cliques in [[Undirected Graphical Models#^c6aa89|the definition]].
Therefore, *we can simply embed our model into a larger graph*, in particular by [[Undirected Graphical Models#^c929b1|triangulating]] it, i.e. adding edges until it is triangulated.
- This comes with the downside that inference on a larger graph is more expensive.
- Finding a triangulation that gives the smallest cliques is a very hard problem in general.

Going back to the lung disease example, the moral graph is:

![[DirectedGraphMoralizedExample.png#invert|center|300]]

But it is not factorizable, since the cycle $(E,L,S,B)$ does not contain a chord. To make it factorizable, we triangulate the cycle to get

![[TriangulatedMoralGraph.png#invert|center|300]]

In conclusion, the **junction tree algorithm** for finding clique marginals on a directed graph $G$ is given by:

> [!algorithm] Junction Tree Algorithm
> $[1]$ Moralize the graph to get $G^{m}$.
> $[2]$ Triangulate $G^m$ to make it decomposable.
> $[3]$ Construct a junction tree $\mathcal{T}$ of the triangulated graph.
> $[4]$ Run the belief propagation algorithm on $\mathcal{T}$, obtaining
> $[\text{END}]$ the set of potentials $\{ \phi_{C}(x_{C}) \}$ that equal the marginals $\{ p(x_{C}) \}$.

## Computations

### Adding Evidence

In this context, **evidence** is just new information that we wish to condition on, e.g. $p(X_{\text{tuberculosis}},X_{\text{cancer}})\xrightarrow[\text{went to Asia}]{\text{evidence:}}p(X_{\text{tuberculosis}},X_{\text{cancer}}~|~X_{\text{Asia}}=1).$

In general terms, with evidence (event) $E$, we are now interested in $p(x_{C} ~|~ E)$. To do so, we can simply *modify one potential to be conditional on $E$, and rerun belief propagation to make everything consistent again.*

We only consider the simple case where $E=\{ X_{e}=t \}$ for some known number $t$. Imitating the update $p(x ~|~E)= \frac{p(x)\cdot \mathbf{1}\{ x_{e} =t\}}{p(X_{e}=t)},$ we can similarly update, for some clique $C^{\ast} \ni e$, $\phi'_{C^{\ast}}(x_{C^{\ast}}) := \frac{\phi_{C^{\ast}}(x_{C^{\ast}})\cdot \mathbf{1}_{x_{e}=t}}{p(X_{e}=t)}.$
- We can compute the denominator as $p(X_{e}=t)=\sum_{x_{C^{\ast}-\{ e \}}}\phi_{C^{\ast}}(x_{C^{\ast}-\{ e \}}, x_{e}=t)$, i.e. by marginalising $\phi_{C^{\ast}}$ over the rest of $C^{\ast}$ (valid because, after belief propagation, $\phi_{C^{\ast}}=p(x_{C^{\ast}})$).

For example, if we inject the evidence $\{ X=0 \}$ into a 2-clique of binary variables $\{ X,Y \}$, its potential/density $\phi_{XY}(x,y)=p(x,y)$ will be updated to $\phi'_{XY}(x,y)=p(x,y ~|~ X=0)$, as $\underset{\phi_{XY}(x,y)=p(x,y)}{\boxed{\begin{array} {c|cc} x \setminus y & 0 & 1\\ \hline 0 & 0.05 & 0.45\\ 1 & 0.1 & 0.4 \end{array}}} \xrightarrow{\text{given }\{ X=0 \}} \underset{\phi'_{XY}(x,y)=p(x,y ~|~ X=0)}{\boxed{\begin{array} {c|cc} x \setminus y & 0 & 1\\ \hline 0 & 0.1 & 0.9\\ 1 & 0 & 0 \end{array}}}$

Note that $(\dagger)$ now holds for $p(x ~|~ E)$ (instead of $p(x)$): if $S$ is the separator corresponding to $C^{\ast}$, $\begin{align*} \prod_{k} \frac{\phi_{C_{k}}(x_{C_{k}})}{\phi_{S_{k}}(x_{S_{k}})}&= \frac{\phi'_{C^{\ast}}(x_{C^{\ast}})}{\phi_{S}(x_{S})}\prod_{k : C_{k} \ne C^{\ast}} \frac{\phi_{C_{k}}(x_{C_{k}})}{\phi_{S_{k}}(x_{S_{k}})}\\ &= \frac{\mathbf{1}_{x_{e}=t}}{p(X_{e}=t)}\prod_{k} \frac{\phi_{C_{k}}(x_{C_{k}})}{\phi_{S_{k}}(x_{S_{k}})}\\[0.4em] &= \frac{p(x)\cdot \mathbf{1}\{ x_{e} =t\}}{p(X_{e}=t)}&& \left[\substack{\{ \phi_{C_{k}} ~|~ k\}\cup \{ \phi_{S_{k}}~|~k \} \\ \text{ satisfies } (\dagger) \text{ for }p(x)} \right]\\[0.8em] &= p(x ~|~ X_{e}=t). \end{align*}$

Therefore, running belief propagation again and making the potentials consistent will produce potentials $\phi_{C}=p(x_{C}~|~X_{e}=t)$.
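For concreteness, here is a minimal standalone sketch of this evidence update on the $2\times 2$ table above (the array layout and variable names are chosen only for illustration):

```python
import numpy as np

# The 2x2 example above: phi_XY[x, y] = p(x, y) for binary X, Y.
phi_XY = np.array([[0.05, 0.45],
                   [0.10, 0.40]])

t = 0                                # the evidence {X = 0}
p_evidence = phi_XY[t, :].sum()      # p(X = 0), marginalising phi_XY over the rest of the clique

# phi'_XY(x, y) = phi_XY(x, y) * 1{x = 0} / p(X = 0)
phi_XY_new = np.zeros_like(phi_XY)
phi_XY_new[t, :] = phi_XY[t, :] / p_evidence

assert np.allclose(phi_XY_new, [[0.1, 0.9],
                                [0.0, 0.0]])   # matches the updated table above
```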
Luckily, we only need to run one direction of belief propagation, namely *the `distribute` direction with $\mathcal{T}$ rooted at $C^{\ast}$.*

Consider the edge between neighboring cliques $C,C^{\ast}$:

```mermaid
graph LR
A1["(further down the tree)"] --- C
C ---|consistent| S(("S"))
S ---|probably not<br>consistent| D[C*]
```

where $\phi_{C},\phi_{S}$ are consistent as they were never altered. After just one pass of `distribute`:

```mermaid
graph LR
A1["(further down the tree)"] --- C
C ---|consistency unchanged| S(("S"))
S ---|made consistent| D[C*]
```

Similar logic applies to the rest of the tree, so indeed one pass of `distribute` is enough to make all potentials consistent again. Since $(\dagger)$ is satisfied with $p(x ~|~ E)$, every marginal must now also be conditioned on $E$.

Therefore, the algorithm for adding evidence is:

> [!algorithm] Adding Evidence
> Input evidence $E=\{ X_{e}=t \}$.
>
> $[1]$ Find a clique $C^{\ast}$ with $e \in C^{\ast}$. Find its index $k^{\ast}:C_{k^{\ast}}=C^{\ast}$ in the running intersection sequence.
> $[2]$ Modify $\phi_{C^{\ast}} \leftarrow \phi_{C^{\ast}}\cdot \mathbf{1}\{ x_{e}=t \} / p(X_{e}=t)$.
> $[3]$ Re-root $\mathcal{T}$ at $C^{\ast}$, and find a new topological order.
> $[4]$ Run `distribute(T, potentials)`.

### Adding Evidence Economically

The algorithm above updates the entire tree's potentials with the evidence $E$. What if we are only interested in $p(x_{z}~|~x_{e})$ for a single variable $Z$ (corresponding to a vertex $z$ contained in some clique $D$)?

In fact, we can restrict to the path between $C^{\ast}$ (the clique containing $e$) and $D$ on the junction tree. Then we only need to run `distribute` from $C^{\ast}$ as root, along that path.

> [!examples] Finding the path on the junction tree
> As an example, suppose we want to compute $p(z ~|~ S=s)$ on the graph
> ```mermaid
> graph LR
> A[Z, T] --- B["Y, Z (D)"] --- C[X, Y] --- D[X, V, U]
> B --- E[Y, W]
> C --- F["S,X (C*)"]
> ```
> Then we only need to consider the subgraph
> ```mermaid
> graph LR
> A[Z, T] --- B[Y, Z] --- C[X, Y] --- D[X, V, U]
> B --- E[Y, W]
> C --- F[S,X]
> classDef blue fill:#558,stroke:#333,stroke-width:4px;
> class B blue
> class C blue
> class F blue
> ```
> We inject evidence $\{ S=s \}$ into the potential $\phi_{SX}$, then run `distribute` to update $\phi_{XY}$ and then $\phi_{YZ}$.

This produces the right marginal $p(x_{D} ~|~ x_{e})$ because:
- When we run the full algorithm of adding evidence (which updates the whole junction tree), the topological order is arbitrary, as long as it starts at $C^{\ast} \ni e$.
- Therefore, we can WLOG let the cliques on the $C^{\ast}\sim D$ path come first, so the economical algorithm described above simply terminates the full algorithm early.
- Since each potential, in particular $\phi_{D}$, is only updated once (when `distribute` calls `pass_message` from its parent in $\mathcal{T}$), terminating early still produces the correct $\phi_{D}=p(x_{D}~|~x_{e})$, just like the full algorithm.

### Combining Conditionals

Suppose after adding evidence, we have computed $p(x_{C_{1}}~|~x_{E}),~p(x_{C_{2}}~|~x_{E}).$ How do we merge the two distributions to obtain $p(x_{C_{1} \cup C_{2}} ~|~ x_{E})$?
Suppose the junction tree looks like

```mermaid
%%{ init: { 'theme': 'base', 'themeVariables': { 'darkMode': true, 'primaryColor': '#bbb', 'secondaryColor': '#ddd', 'primaryTextColor': '#333' } } }%%
flowchart LR
subgraph branchC["Branch with C1"]
direction LR
rest1["Rest of the branch"] --- C1
end
subgraph branchD["Branch with D"]
direction LR
D --- |"(...)"|C2
C2 --- rest2["Rest of the branch"]
end
C1 --- S(("S"))
S --- D
```

where $C_{1} \sim D$ in the junction tree for some $D$ with separator $S$. Define $V_{i}:= \{ v ~|~ \exists C \in \text{branch with }C_{i}, v \in C \}$ to be the set of vertices that appear in the same branch as $C_{i}$.

We are tempted to say $X_{V_{1}} \perp X_{V_{2}} ~|~ X_{S}$, but is that true?
- Since $X\sim p$ is Markov wrt. $G$, we know that if $V_{1},V_{2}$ are separated by $S$ in $G$, then the independence must hold (global Markov).

> [!lemma|*] Separators actually Separate Stuff
> Indeed, $V_{1} \perp V_{2} ~|~ S$ in $G$.
>
> > [!proof]-
> > Take any vertices $v_{1} \in V_{1}-S,v_{2} \in V_{2}-S$, and let $\pi$ be any path in $G$ between $v_{1},v_{2}$. We show that $\pi \cap S \neq \varnothing$, so that removing $S$ cuts off $\pi$.
> >
> > (Forget what $C_{1}, C_{2}$ meant outside this proof -- we don't need them here.)
> >
> > Let $u_{1}u_{2}$ be the edge via which $\pi$ (starting from $v_{1}$) leaves $V_{1} \cup S$ for the first time, and pick a clique $C_{1}$ in the first branch with $u_{1} \in C_{1} \subset V_{1} \cup S$ and a clique $C_{2}$ in the second branch with $u_{2} \in C_{2} \subset V_{2} \cup S$ (these exist because $u_{1} \in V_{1}\cup S$ while $u_{2} \notin V_{1}\cup S$). Since $u_{1}\sim u_{2}$ in $G$, they must also share a clique, say $C$. As a node of the junction tree, $C$ lies in the same branch as either $C_{1}$ or $C_{2}$. In the former case, the path in $\mathcal{T}$ from $C$ to $C_{2}$ crosses the edge with separator $S$, so by definition of junction trees $u_{2} \in C \cap C_{2} \subset S$. In the latter case, the same argument gives $u_{1} \in C \cap C_{1} \subset S$. So removing $S$ cuts off $\pi$.

Therefore, we can say $X_{C_{1}} \perp X_{C_{2}} ~|~ X_{S}$ (since $C_{i} \subset V_{i}$, and this remains true after further conditioning on the evidence $x_{E}$), and combine the conditional distributions as $p(x_{C_{1} \cup C_{2}} ~|~ x_{E})=p(x_{C_{1}}~|~x_{E},x_{S})\cdot p(x_{C_{2}} ~|~x_{E},x_{S})\cdot p(x_{S}~|~x_{E}).$
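As a sanity check, here is a minimal numerical sketch of this combination rule, in the simplest special case where $C_{1}$ and $C_{2}$ are adjacent cliques with separator $S=C_{1}\cap C_{2}$; the tables and names below are made up purely for illustration:

```python
import numpy as np

# Toy check: C1 = {A, S}, C2 = {S, B}, separator S, all variables binary, A ⊥ B | S.
p_S = np.array([0.3, 0.7])              # p(s | E)
p_A_given_S = np.array([[0.9, 0.1],     # p(a | s, E), rows indexed by s
                        [0.2, 0.8]])
p_B_given_S = np.array([[0.6, 0.4],     # p(b | s, E), rows indexed by s
                        [0.5, 0.5]])

# The "true" joint p(a, s, b | E), used only to check the answer at the end.
p_true = np.einsum('s,sa,sb->asb', p_S, p_A_given_S, p_B_given_S)

# What belief propagation hands us: the two clique conditionals.
p_C1 = p_true.sum(axis=2)               # p(a, s | E), indexed [a, s]
p_C2 = p_true.sum(axis=0)               # p(s, b | E), indexed [s, b]

# Combine:  p(a, s, b | E) = p(a | s, E) * p(b | s, E) * p(s | E)
#                          = p(a, s | E) * p(s, b | E) / p(s | E).
p_sep = p_C1.sum(axis=0)                # p(s | E), recovered from a clique marginal
p_combined = p_C1[:, :, None] * p_C2[None, :, :] / p_sep[None, :, None]

assert np.allclose(p_combined, p_true)
```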