We can state conditional independence properties of a random vector $X \sim p$ using a [[Graph Theory|graph]]. Let $A\perp B ~|~ S$ denote graphical separation of vertex sets $A,B$ by $S$ in a graph, and $X \perp Y ~|~ Z$ denote the usual conditional independence of random variables $X,Y$ given $Z$.
> [!definition|*] Pairwise, Local, and Global Markov Properties
> A distribution $X\sim p$ is **pairwise Markov** for a graph $G=(V,E)$ if
> $\forall i,j \in V, i \not \sim j \Longrightarrow X_{i} \perp X_{j} ~|~ X_{V-\{ i,j \}}.$
> That is, a missing edge $ij$ indicates that $X_{i} \perp X_{j}$, conditioning on all else.
>
> $p$ is **local Markov** if
> $\forall i \in V:~ X_{i} \perp X_{V - \mathrm{cl}(i)} ~|~ X_{\mathrm{bd}(i)},$
> where $\mathrm{bd}(i)$ is the "boundary", i.e. the set of neighbours, of $i$ in $\mathcal{G}$, and $\mathrm{cl}(i):= \{ i \}\cup \mathrm{bd}(i)$ is its closure. That is, $X_{i}$ is independent of all non-neighbours, conditioning on its neighbours.
>
> Lastly, $p$ is **global Markov** if
> $\forall A,B, S \subset V, A \perp B ~|~ S ~\Longrightarrow~X_{A} \perp X_{B} ~|~ X_{S}$
> That is, graphical separation by $S$ implies conditional independence given $X_{S}$.
Clearly global Markov implies local Markov, which in turn implies pairwise Markov.
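The premise $A \perp B ~|~ S$ of the global Markov property is a purely graphical condition: delete $S$ and search for an $A$–$B$ path. A minimal pure-Python sketch (the adjacency-dict representation and the name `separated` are illustrative assumptions, not notation from these notes):

```python
from collections import deque

def separated(adj, A, B, S):
    """Return True iff S separates A from B in the undirected graph adj
    (dict: vertex -> set of neighbours): BFS from A in the graph with S
    deleted must not reach B."""
    blocked = set(S)
    seen = set(A) - blocked
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        if v in B:
            return False          # found an A-B path avoiding S
        for w in adj[v]:
            if w not in blocked and w not in seen:
                seen.add(w)
                queue.append(w)
    return True
```

For the path graph $1 - 2 - 3$, `separated(adj, {1}, {3}, {2})` holds while `separated(adj, {1}, {3}, set())` does not, matching $X_{1}\perp X_{3} ~|~ X_{2}$ for a Markov chain.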
> [!theorem|*] Equivalence of Markov Properties (for strictly positive distributions)
> If $p$ is strictly positive, then the pairwise Markov property implies the global Markov property, and so the Markov properties are all equivalent.
>
> > [!proof]-
> > Consider sets $A,B,S$ such that $A \perp B ~|~ S$; let $\tilde{A}$ be the union of the connected components of $\mathcal{G}_{V-S}$ that intersect $A$, and $\tilde{B}:= V-\tilde{A}-S$.
> >
> > Now induct on the number of vertices in $\tilde{B}$: removing some $b\in \tilde{B}$, the inductive hypothesis gives the global Markov relationship $\tilde{A} \perp \tilde{B}-\{ b \} ~|~ S \Rightarrow X_{\tilde{A}} \perp X_{\tilde{B}-\{ b \}} ~|~ X_{S}$.
> >
> > For each $a,a' \in \tilde{A}$, pairwise Markov gives (noting $a \not\sim b$ and $a' \not\sim b$, since they lie in different connected components of $\mathcal{G}_{V-S}$)
> >
> > $\begin{align*}
> > X_{a}\perp X_{b} ~|~ X_{V-\{ a,a',b \}}, X_{a'}\\
> > X_{a'}\perp X_{b} ~|~ X_{V-\{ a,a',b \}}, X_{a}
> > \end{align*}$
> >
> > Since $p$ is strictly positive, we can use the intersection graphoid axiom to get $X_{b} \perp X_{a, a'} ~|~ X_{V-\{ a,a',b \}}.$
> >
> > Proceeding inductively over $\tilde{A}$, we can "move" all $a''\in \tilde{A}$ from the conditioning side to the left, giving us $X_{b}\perp X_{\tilde{A}} ~|~ X_{V-\tilde{A}-\{ b \}} ~~~\Big(\!\!\!\iff X_{b }\perp X_{\tilde{A}} ~|~ X_{S}, X_{\tilde{B}-\{ b \}}\Big)$
> > (This is why we used $\tilde{A},\tilde{B}$ instead of $A,B$: otherwise we could not split off $X_{S}$ in the parentheses above.)
> >
> > Now the contraction graphoid axiom, applied to this together with the inductive assumption $X_{\tilde{B}-\{ b \}}\perp X_{\tilde{A}} ~|~ X_{S}$, gives $X_{\tilde{B}} \perp X_{\tilde{A}} ~|~ X_{S}.$
> > Lastly, using the decomposition axiom, we can subset the two variables to get $X_{B}\perp X_{A} ~|~ X_{S}.$
>
## Decomposition of Graphs
> [!definition|*] Triangulated/Chordal Graphs
> A graph $G = (V,E)$ is **triangulated** or **chordal** if every cycle $C$ of length $\ge 4$ contains at least one chord: an edge $ij \in E$ between two vertices of $C$ that is not itself an edge of $C$.
^c929b1
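Chordality can be tested mechanically by maximum cardinality search (the Tarjan–Yannakakis test): $G$ is chordal iff the reverse of an MCS ordering is a perfect elimination ordering, i.e. each vertex's earlier-ordered neighbours form a complete set. A naive sketch, assuming the same adjacency-dict representation as above (function names are illustrative):

```python
def is_chordal(adj):
    """Tarjan-Yannakakis chordality test via maximum cardinality search."""
    vertices = list(adj)
    weight = {v: 0 for v in vertices}   # number of already-ordered neighbours
    order, numbered = [], set()
    while len(order) < len(vertices):
        # MCS: greedily pick the vertex with the most ordered neighbours
        v = max((u for u in vertices if u not in numbered), key=weight.get)
        order.append(v)
        numbered.add(v)
        for w in adj[v]:
            if w not in numbered:
                weight[w] += 1
    pos = {v: i for i, v in enumerate(order)}
    # reverse(order) is a perfect elimination ordering iff every vertex's
    # earlier-ordered neighbours are pairwise adjacent
    for v in order:
        earlier = [w for w in adj[v] if pos[w] < pos[v]]
        for i in range(len(earlier)):
            for j in range(i + 1, len(earlier)):
                if earlier[j] not in adj[earlier[i]]:
                    return False
    return True
```

A $4$-cycle fails the test; adding a chord makes it pass.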
> [!theorem|*] Equivalences with Decomposability
> For a graph $G$, all of the following are equivalent:
> - (i) $G$ is decomposable.
> - (ii) $G$ is triangulated/chordal.
> - (iii) For any $a,b \in V$, every minimal $ab$-separator $S$ is complete.
> - (iv) For any clique $C_{1}\in\mathcal{C}(G)$, we can construct a running intersection sequence of cliques $(C_{1},\dots,C_{K})$ that starts with the chosen $C_{1}$.
> [!proof]- Proof that $(i) \Rightarrow (ii)$
> If $G$ is complete then it is certainly chordal. Otherwise, induct on its size: it has a decomposition $(A,S,B)$ with $G_{A \cup S},G_{B \cup S}$ decomposable (hence chordal by induction) and $S$ complete. Any $k$-cycle in $G$ ($k\ge 4$) either lies entirely within $A\cup S$ or $B \cup S$ (and hence has a chord by induction), or contains vertices from both $A$ and $B$. In the latter case, since $S$ separates $A,B$, the cycle must pass through $S$ at two distinct vertices that are non-consecutive on the cycle (vertices of $A$ and of $B$ lie between them); as $S$ is complete, these two vertices are adjacent, providing a chord.
> [!proof]- Proof that $(ii)\Rightarrow (iii)$
> Assume for contradiction that some $a,b$ have an incomplete minimal separator $S$, with $s_{1},s_{2}\in S$, $s_{1} \not \sim s_{2}$. Then by minimality, there are $ab$-paths $\pi_{1},\pi_{2}$ with $\pi_{i} \cap S=\{ s_{i} \}$ for $i=1,2$ (otherwise, removing the corresponding $s_{i}$ would still separate $a,b$, contradicting the minimality of $S$).
>
> Now consider the connected components $A,B$ of $a,b$ respectively in $G_{V-S}$. Each $s_{i}$ has a neighbour in $A$ and a neighbour in $B$ (on $\pi_{i}$). Now let $\pi_{l}$ be a shortest path joining $s_{1}$ to $s_{2}$ with all intermediate vertices in $A$, and $\pi_{r}$ one with all intermediate vertices in $B$; write $l_{1},l_{2}$ (resp. $r_{1},r_{2}$) for the neighbours of $s_{1},s_{2}$ on $\pi_{l}$ (resp. $\pi_{r}$).
> ```mermaid
> flowchart LR
> subgraph A
> l1 ---|πl| l2
> end
>
> subgraph B
> r1 ---|πr| r2
> end
> s1 --- l1
> s1 --- r1
> l2 --- s2
> r2 --- s2
> ```
>
> We claim that $s_{1}\sim l_{1}\xrightarrow{\pi_{l}}l_{2}\sim s_{2}\sim r_{2}\xrightarrow{\pi_{r}}r_{1}\sim s_{1}$
>
> is a cycle of length at least $4$ with no chords. It is a cycle since it contains no repeated vertices, and it has at least $4$ vertices since $s_{1},s_{2},l_{1},r_{1}$ are certainly distinct, even if $l_{1}=l_{2}$ and $r_{1}=r_{2}$. It contains no chords: (1) intermediate vertices of $\pi_{l}$ and $\pi_{r}$ are not neighbours, as they lie in the different connected components $A$ and $B$; (2) there are no chords between $s_{i}$ and those intermediate vertices, nor within $\pi_{l}$ or $\pi_{r}$ themselves, by the minimality of $\pi_{l},\pi_{r}$; and (3) $s_{1} \not \sim s_{2}$.
>
> This contradicts chordality.
> [!proof]- Proof that $(iii)\Rightarrow (iv)$
> If $G$ is complete, we are done; otherwise, choose $a\not \sim b$ with a minimal (hence complete) $ab$-separator $S$, and let $A$ be the connected component of $a$ in $G_{V-S}$ and $B := V-A-S$. Since $C_{1}$ is a clique, it is a subset of either $A \cup S$ or $B \cup S$, as there are no edges between $A$ and $B$. We prove the result for $C_{1}\subset A\cup S$; the proof is identical for $C_{1} \subset B \cup S$.
>
> Inducting on the number of cliques, we have a RIP $(C_{1},\dots,C_{m}), (S_{1},\dots,S_{m})$ of $G_{A \cup S}$ that starts with $C_{1}$, and some RIP $(C_{m+1},\dots,C_{m+n}), (S_{m+1},\dots,S_{m+n})$ of $G_{B \cup S}$ that starts with a clique containing $S$, say $\tilde{S}=: C_{m+1}$ (possible since $S$ is complete, hence contained in some clique of $G_{B \cup S}$, and (iv) holds inductively).
>
> We claim that concatenating the two sequences gives an RIP of $G$, i.e. that $(C_{1},\dots,C_{m},\dots,C_{m+n}), (S_{1},\dots,S_{m},\dots,S_{m+n})$
> is an RIP. Again, any clique of $G$ is contained in either $A\cup S$ or $B \cup S$, so this sequence contains all cliques. The RIP is inherited for any $j \le m$, and for $j > m+1$, $\begin{align*}
> C_{j} \cap \Big( \bigcup_{i<j} C_{i} \Big) &= \Big(C_{j} \cap \bigcup_{i=1}^{m} C_{i} \Big)\cup \Big(C_{j} \cap \bigcup _{i=m+1}^{j-1}C_{i}\Big) \\
> &\subseteq( C_{j} \cap S) \cup \Big(C_{j} \cap \bigcup _{i=m+1}^{j-1}C_{i}\Big) &&[C_{j}\subseteq B\cup S, ~ \textstyle\bigcup_{i=1}^{m} C_{i} \subseteq A \cup S]\\
> &\subseteq (C_{j} \cap C_{m+1}) \cup \Big(C_{j} \cap \bigcup _{i=m+1}^{j-1}C_{i}\Big) &&[S \subseteq \tilde{S}= C_{m+1}]\\
> &= C_{j} \cap \bigcup_{i=m+1}^{j-1}C_{i} ~\subseteq~ C_{j} \cap C_{\sigma(j)} &&[\text{RIP in }G_{B\cup S}].
> \end{align*}$
> For $j = m+1$, the intersection with the history is $\subseteq (B \cup S)\cap(A\cup S) = S$, which is contained in some clique $C_{i}$ with $i \le m$, since $S$ is complete in $G_{A \cup S}$.
>
> [!proof]- Proof that $(iv)\Rightarrow (i)$
> Given cliques $(C_{1},\dots)$ and separators $(S_{1},\dots)$ that satisfy running intersection, we can decompose $G$ recursively by taking
> $A_{k}=C_{k}-S_{k},~ S_{k}=S_{k},~ B_{k}=\Big(\bigcup_{k' < k}C_{k'}\Big)-S_{k},~~k=K, \dots , 1.$
^582a3a
## Factorisation
Let $\mathcal{C}(G)$ be the set of **cliques** in $G$, where a clique is a maximal complete subgraph.
> [!definition|*] Factorisation
> $p$ **factorises** with respect to $G$ if we can find functions $\psi_{C}$ for each clique $C \in \mathcal{C}(G)$ such that $p(x) = \prod_{C \in \mathcal{C}(G)}\psi_{C}(x_{C}).$
^c6aa89
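As a numerical sanity check: building $p$ from clique potentials on the chain $1 - 2 - 3$ (cliques $\{1,2\},\{2,3\}$; the potential values below are arbitrary hypothetical choices) yields $X_{1} \perp X_{3} ~|~ X_{2}$, as the global Markov property demands:

```python
import itertools

def check_chain_ci():
    """Max violation of p(x1,x3|x2) = p(x1|x2) p(x3|x2) for a factorised p."""
    # hypothetical positive potentials on cliques {1,2} and {2,3}
    psi12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 3.0}
    psi23 = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}
    states = list(itertools.product([0, 1], repeat=3))
    Z = sum(psi12[a, b] * psi23[b, c] for a, b, c in states)
    p = {x: psi12[x[0], x[1]] * psi23[x[1], x[2]] / Z for x in states}
    err = 0.0
    for a, b, c in states:
        p2 = sum(p[x] for x in states if x[1] == b)
        p12 = sum(p[x] for x in states if x[:2] == (a, b))
        p23 = sum(p[x] for x in states if x[1:] == (b, c))
        err = max(err, abs(p[a, b, c] / p2 - (p12 / p2) * (p23 / p2)))
    return err
```

The returned violation is at floating-point precision.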
> [!theorem|*] Factorisation implies Global Markov
> Factorisation implies the global Markov property: if $p$ factorises wrt. $G$, then it is globally Markov wrt. $G$.
>
> > [!proof]-
> > Suppose $A \perp B ~|~ S$ in $G$. Let $\tilde{A}$ be the union of the connected components of $G-S$ that meet $A$, and define $\tilde{B}:= V-\tilde{A}-S$. Then any clique $C$ is a subset of (at least) one of $\tilde{A}\cup S$ and $\tilde{B}\cup S$. So the product $\prod_{C\in\mathcal{C}(G)}\psi_{C}$ can be split into cliques $C \subseteq \tilde{A}\cup S$ and cliques $C\subseteq \tilde{B}\cup S$ (assigning any clique contained in both, i.e. $C \subseteq S$, to just one side), and $p(x)$ can be written as a product of a function of $x_{\tilde{A}\cup S}$ and a function of $x_{\tilde{B}\cup S}$, which implies $X_{\tilde{A}} \perp X_{\tilde{B}} ~|~ X_{S}$, hence $X_{A} \perp X_{B} ~|~ X_{S}$.
If $\forall x \in \mathcal{X},~p(x) > 0$ (i.e. if $p$ has support over the entirety of $\mathcal{X}$), then pairwise Markov implies factorisation, so in fact all of the above concepts are equivalent, and we can simply say that $p$ is **Markov** wrt. $G$.
Using [[Hammersley-Clifford Theorem]], with an arbitrary $z\in\mathcal{X}$, we write $p(x)=\prod_{j=1}^{d} \frac{p(x_{j}~|~x_{:j - 1}, z_{j+1:})}{p(z_{j} ~|~ x_{: j-1},z_{j+1:})}.$
By pairwise Markov, the $j$th term does not depend on $x_{j'}$ unless $j \sim j'$ in $G$. Therefore, denoting the neighbours of $j$ indexed before it by $N_{<}(j):= \{ j'\sim j : j'<j \}$, and defining $N_{>}(j)$ similarly, pairwise Markov reduces the $j$th term to $\frac{p(x_{j}~|~x_{:j - 1}, z_{j+1:})}{p(z_{j} ~|~ x_{: j-1},z_{j+1:})}= \frac{p(x_{j} ~|~ x_{N_{<}(j)}, z_{N_{>}(j)})}{p(z_{j}~|~ x_{N_{<}(j)}, z_{N_{>}(j)})}=: \phi_{j}(x_{j}, x_{N_{<}(j)}),$treating $z$ as a constant. Therefore, the product becomes $p=\prod_{j}\phi_{j}(x_{j}, x_{N_{<}(j)}).$
Now assigning each term $\phi_{j}$ to a clique $C$ containing $\{ j \}\cup N_{<}(j)$, $p= \prod_{C\in \mathcal{C}} \underbrace{\left( \prod_{j\text{ assigned to }C} \phi_{j}(x_{j}, x_{N_{<}(j)}) \right)}_{=: \psi_{C}},$
we conclude that $p$ factorises wrt. $G$. Here $\psi_{C}$ is a function of $x_{C}$ because each assigned factor depends only on $(x_{j},x_{N_{<}(j)})$, and $\{ j \}\cup N_{<}(j) \subseteq C$.
$p$ factorises wrt. a graph $G$ with decomposition $(A,S,B)$ if and only if $p(x_{A\cup S})$ and $p(x_{B\cup S})$ factorise wrt. $G_{A \cup S}$ and $G_{B\cup S}$ respectively, and $p(x_{V})\cdot p(x_{S})=p(x_{A \cup S})\cdot p(x_{B \cup S}),$i.e. $X_{A- S} \perp X_{B- S} ~|~X_{S}$.
^464932
> [!proof]-
> Any clique $C\in \mathcal{C}$ is a subset of either $A\cup S$ or $B \cup S$ (and of both only if $C \subseteq S$, i.e. $S$ itself is a clique), since $A\perp_{s}B ~|~ S$ leaves no edges between $A$ and $B$. So the factorisation can be partitioned: $p(x_{V}) =\prod_{C\in \mathcal{C}}\psi_{C}= \prod_{C\subseteq A\cup S}\psi_{C}\prod_{C\subseteq B \cup S}\psi_{C};$
> when $S$ is a clique, just assign it to one of the products. This proves conditional independence, as the first term is a function of $x_{A \cup S}$ and the second that of $x_{B \cup S}$.
>
> Now integrating over $B-S$, we have $p(x_{A \cup S})= \prod_{C\subseteq A \cup S} \psi_{C}\cdot \underbrace{\int \prod_{C\subseteq B \cup S} \psi_{C}(x_{C})~ dx_{B-S} }_{=:\tilde{\psi}(x_{S})},$
> which is a factorisation of $p(x_{A \cup S})$ once we merge $\tilde{\psi}(x_{S})$ into the potential of the clique of $G_{A \cup S}$ that contains $S$. So $p(x_{A \cup S})$ factorises over $G_{A \cup S}$.
>
> A similar proof holds for $B \cup S$.
If $p$ factorises with respect to $G$, and $G$ is decomposable with cliques $(C_{1},\dots,C_{K})$ that satisfy running intersection with separator sets $(S_{1},\dots,S_{K})$, then $p$ can be written as
$p(x)=\prod_{k} \frac{p(x_{C_{k}})}{p(x_{S_{k}})},$
^39a55a
so in the definition of factorisability, we can take $\psi_{C_{k}}:= p(x_{C_{k}}) / p(x_{S_{k}})=p(x_{C_{k}-S_{k}} ~|~ x_{S_{k}}).$
The important consequence is that we can conduct inference separately on each of the $x_{C_{k}}$, as this theorem guarantees $X_{C_{k}-S_{k}} \perp X_{C_{i}-S_{k}}~|~X_{S_{k}}$ for any $i <k$. This reduces the inference over $\mathbb{R}^{| V |}$ (including interactions between cliques) to smaller inferences over $\mathbb{R}^{| C_{k} |}$.
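The clique-marginal form can be checked numerically on the chain $1 - 2 - 3$, with RIP cliques $C_{1}=\{1,2\}$, $C_{2}=\{2,3\}$ and separator $S_{2}=\{2\}$ (the potential values are hypothetical):

```python
import itertools

def clique_marginal_identity():
    """Max violation of p(x) = p(x_{12}) p(x_{23}) / p(x_2) for a factorised p."""
    # hypothetical positive potentials on cliques {1,2} and {2,3}
    psi12 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 0.5}
    psi23 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
    states = list(itertools.product([0, 1], repeat=3))
    Z = sum(psi12[a, b] * psi23[b, c] for a, b, c in states)
    p = {x: psi12[x[0], x[1]] * psi23[x[1], x[2]] / Z for x in states}
    err = 0.0
    for a, b, c in states:
        p12 = sum(p[x] for x in states if x[:2] == (a, b))
        p23 = sum(p[x] for x in states if x[1:] == (b, c))
        p2 = sum(p[x] for x in states if x[1] == b)
        err = max(err, abs(p[a, b, c] - p12 * p23 / p2))
    return err
```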
If we have data $\mathbf{x}$ and want to make inference using a [[Contingency Table|contingency table]], the log-likelihood is $\begin{align*}
l(\theta;\mathbf{x})&= \sum_{x \in \mathcal{X}}n(x;\mathbf{x})\log p(x)\\
&= \sum_{x} n(x;\mathbf{x}) \sum_{k}\log p(x_{C_{k}-S_{k}} ~|~ x_{S_{k}})\\
&= \sum_{k} \sum_{x_{C_{k}} \in \mathcal{X}_{C_{k}}} n(x_{C_{k}}; \mathbf{x}) \log p(x_{C_{k}-S_{k}} ~|~ x_{S_{k}}).
\end{align*}$As a result, we can infer $p(x_{C_{k}})$ by looking at $n(x_{C_{k}}; \mathbf{x})$.
> [!help]- Notation and explanations
> $n(x;\mathbf{x})$ is the contingency table function that counts the number of times $x$ appeared in $\mathbf{x}$, and $\mathcal{X}_{C_{k}}$ is the sample space of the RVs indexed by $C_{k}$.
>
> The last equality follows from fixing an $x_{C_{k}}$ and considering all terms that contribute to the coefficient of $\log p(x_{C_{k}-S_{k}} ~|~ x_{S_{k}})$: it gains $+n(x';\mathbf{x})$ (for $x' \in \mathcal{X}$) if and only if $x'_{C_{k}}=x_{C_{k}}$, i.e. whenever $x_{C_{k}}$ appears in $\mathbf{x}$.
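For a decomposable model, the MLE therefore only needs the clique counts: $\hat p(x)=\prod_{k} \hat p(x_{C_{k}})/\hat p(x_{S_{k}})$ with $\hat p(x_{C_{k}})=n(x_{C_{k}};\mathbf{x})/n$. A sketch on the chain $1-2-3$ with a small hypothetical dataset, checking that the fitted $\hat p$ reproduces the empirical clique proportions:

```python
from collections import Counter
import itertools

def mle_from_clique_counts():
    """Decomposable MLE for the chain 1-2-3 from clique counts; returns the
    max gap between fitted and empirical marginals on clique {1,2}."""
    data = [(0, 0, 0), (0, 1, 1), (1, 1, 1), (1, 0, 0), (0, 0, 1), (1, 1, 0)]
    n = len(data)
    n12 = Counter((a, b) for a, b, c in data)   # counts on clique C1 = {1,2}
    n23 = Counter((b, c) for a, b, c in data)   # counts on clique C2 = {2,3}
    n2 = Counter(b for a, b, c in data)         # counts on separator S2 = {2}
    # MLE: p_hat(x) = p_hat(x_{C1}) p_hat(x_{C2}) / p_hat(x_{S2})
    p_hat = {(a, b, c): n12[a, b] * n23[b, c] / (n * n2[b])
             for (a, b, c) in itertools.product([0, 1], repeat=3)}
    err = 0.0
    for a, b in itertools.product([0, 1], repeat=2):
        fitted = sum(v for x, v in p_hat.items() if x[:2] == (a, b))
        err = max(err, abs(fitted - n12[a, b] / n))
    return err
```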
### Non-Decomposable Graphs
If we run the IPS algorithm (iterative proportional scaling, also called iterative proportional fitting, IPF) on a decomposable graph, it actually completes the fitting in one iteration:
> [!theorem|*] IPS on Decomposable Graph
> Let $p$ be a distribution that is Markov wrt. a decomposable graph $\mathcal{G}$ with running intersection cliques $\mathcal{C}=(C_{1},\dots)$ and separators $\mathcal{S}=(S_{1},\dots)$.
>
> If we run the IPS algorithm that starts with $p^{(0)}= \mathrm{Unif}(\mathcal{X})$ and ask it to match $p(x_{C_{i}})$ on each $i=1,\dots$, it will converge to $p$ after one iteration, i.e. after adjusting each marginal $p(x_{C_{i}})$ exactly once.
>
> > [!proof]-
> >
> >
> > Denote by $H_{j}:= \cup_{i \le j}C_{i}$ the history, i.e. all vertices that have been updated after step $j$; the vertices in $V-H_{j}$ are the ones that haven't been updated. For the rest of the proof, let $\textcolor{orange}{\text{orange}=\text{uniform distribution}}$, and $\textcolor{cyan}{\text{cyan}=\text{the target }p}$.
> >
> > We shall prove that after the $j$th step, $p^{(j)}(x_{V})= \textcolor{cyan}{\prod_{i \le j} p(x_{C_{i}-S_{i}} ~|~ x_{S_{i}})} \cdot \textcolor{orange}{\mathrm{Unif}(\mathcal{X}_{V-H_{j}})}.$So it consists of the $\textcolor{orange}{\text{un-updated uniform density}}$ and $\textcolor{cyan}{\text{the correct density}}$, and hopefully all of $\textcolor{orange}{\text{orange}}$ will change into $\textcolor{cyan}{\text{cyan}}$.
> >
> > For induction, we want to prove that on the $(j+1)$th step, the IPF moves $x_{C_{j+1}-S_{j+1}}$ from the $\textcolor{orange}{\text{orange}}$ half to the $\textcolor{cyan}{\text{cyan}}$ half, i.e. it changes $\textcolor{orange}{\mathrm{Unif}(\mathcal{X}_{C_{j+1}-S_{j+1}})}$ to $\textcolor{cyan}{p(x_{C_{j+1}-S_{j+1}} ~|~ x_{S_{j+1}})}$.
> >
> > On the $(j+1)$th step, the IPF rescales the distribution by $p(x_{C_{j+1}}) / p^{(j)}(x_{C_{j+1}}),$and the denominator can be written as $p^{(j)}(x_{C_{j+1}})=\textcolor{cyan}{p^{(j)}(x_{S_{j+1}})}\cdot \textcolor{orange}{p^{(j)}(x_{C_{j+1}-S_{j+1}}~|~x_{S_{j+1}})}.$
> > Let's figure out what the two terms are (hint: the color code).
> > - Sum/integrate out the untouched vertices to get $\textcolor{cyan}{p^{(j)}(x_{H_{j}})=\prod_{i \le j}p(x_{C_{i}-S_{i}} ~|~ x_{S_{i}})=p(x_{H_{j}})},$since the middle term is a factorization of $p(x_{H_{j}})$ over the (sub)graph $\mathcal{G}[H_{j}]$. Therefore, all updated vertices had the correct marginal $p(x_{H_{j}})$.
> > - Since $S_{j+1} = C_{j+1}\cap H_{j} \subseteq H_{j}$ (by running intersection) consists of updated vertices, it also has the correct marginal, i.e. $\textcolor{cyan}{p^{(j)}(x_{S_{j+1}})=p(x_{S_{j+1}})}$.
> > - On the other hand, all vertices in $C_{j+1}-S_{j+1}$ have not been touched, so they still have the uniform distribution with density $\textcolor{orange}{p^{(j)}(x_{C_{j+1}-S_{j+1}}~|~ x_{S_{j+1}})= \mathrm{Unif}(\mathcal{X}_{C_{j+1}-S_{j+1}} )}$.
> >
> > Therefore, the update is $\textcolor{cyan}{\frac{p(x_{C_{j+1}})}{p(x_{S_{j+1}})}}\cdot \textcolor{orange}{\frac{1}{\mathrm{Unif}(\mathcal{X}_{C_{j+1}-S_{j+1}} )}}.$When multiplied to the density $p^{(j)}(x_{V})$, it replaces the $\textcolor{orange}{\text{un-updated}}$ density of $x_{C_{j+1}-S_{j+1}}$ with $\textcolor{cyan}{\text{the correct density}}$. This completes the induction.
> >
> > Lastly, simply let $j=| \mathcal{C} |$ to get the result after updating every clique: $p^{(| \mathcal{C} |)}(x_{V})=\textcolor{cyan}{\prod_{i}p(x_{C_{i}-S_{i}}~|~x_{S_{i}})}\cdot \textcolor{orange}{\mathrm{Unif}(\mathcal{X}_{\varnothing})}=\textcolor{cyan}{\prod_{i}p(x_{C_{i}-S_{i}}~|~x_{S_{i}})}=p(x_{V}).$ That is, the IPF fits the target $p$ perfectly after one update of every clique.
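The one-sweep claim is easy to verify empirically. A sketch for the binary chain $1-2-3$ (hypothetical potentials; cliques in RIP order $\{1,2\},\{2,3\}$), starting IPS from the uniform distribution and matching each clique marginal exactly once:

```python
import itertools

def ipf_one_sweep():
    """Run one IPS sweep on the chain 1-2-3; return max gap to the target."""
    # target: p(x) proportional to psi12(x1,x2) psi23(x2,x3), Markov wrt the chain
    psi12 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 0.5}
    psi23 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
    states = list(itertools.product([0, 1], repeat=3))
    Z = sum(psi12[a, b] * psi23[b, c] for a, b, c in states)
    target = {x: psi12[x[0], x[1]] * psi23[x[1], x[2]] / Z for x in states}

    def marginal(p, idx):
        m = {}
        for x, v in p.items():
            key = tuple(x[i] for i in idx)
            m[key] = m.get(key, 0.0) + v
        return m

    p = {x: 1 / len(states) for x in states}      # p^(0) = Unif
    for clique in [(0, 1), (1, 2)]:               # one update per clique
        tm, cm = marginal(target, clique), marginal(p, clique)
        p = {x: v * tm[tuple(x[i] for i in clique)] / cm[tuple(x[i] for i in clique)]
             for x, v in p.items()}
    return max(abs(p[x] - target[x]) for x in states)
```

After the single sweep the fitted distribution coincides with the target up to floating-point error.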