On a directed graph $G=(V,E)$, define the following sets for a vertex $v \in V$: $\begin{align*} \text{parents } \mathrm{pa}(v)&:= \{ u \in V ~|~ \overrightarrow{uv}\in E \} \\ \text{children } \mathrm{ch}(v)&:= \{ u \in V ~|~ \overrightarrow{vu}\in E \}\\[0.4em] \text{ancestors } \mathrm{an}(v)&:= \{ u \in V ~|~ u\to v\}\\[0.4em] \text{descendants } \mathrm{de}(v)&:= \{ u \in V ~|~ v\to u\}\\[0.4em] \text{non-descendants } \mathrm{nd}(v)&:= \{ u \in V ~|~ v\not \to u\} \end{align*}$Here $v\to v$ is assumed, so *$v$ is both its own ancestor and descendant*. Note that $\mathrm{pa}(v) \subset\mathrm{nd}(v)$. ### Factorisation and Markov Properties A **topological ordering** of $G$ is an ordering $<$ such that $u<v$ whenever $u \in \mathrm{pa}(v)$. It always exists for [[Directed Acyclic Graph|directed acyclic graphs (DAGs)]]. Equivalently, it defines a sequence $(v_{1},\dots, v_{p})$ with $v_{i}<v_{j} \iff i<j.$ Given such a sequence, we can factor a distribution $p$ using conditional probabilities: $p(x)=\prod_{i}p(x_{v_{i}} ~|~x_{v_{1}},\dots,x_{v_{i-1}}).$Note that this equality holds for all distributions of $X_{V}$. ^d47a62 > [!definition|*] Factorization (on a Directed Graph) > In contrast, we say $p$ **factorises** with respect to $G$ if $p(x)=\prod_{v \in V}p(x_{v} ~|~x_{\mathrm{pa}(v)}),$i.e. it suffices to condition on the direct parents. ^e23296 Moreover, when $p$ factorises, $p(x_{v} ~|~ x_{\mathrm{nd}(v)})=p(x_{v} ~|~ x_{\mathrm{pa}(v)})$, or equivalently $X_{v} \perp X_{\mathrm{nd}(v)-\mathrm{pa}(v)} ~|~ X_{\mathrm{pa}(v)}.$ ^9bd6c6 > [!proof]- > By integrating out $x_{\mathrm{de}(v)\setminus \{ v \}}$ and $x_{\mathrm{de}(v)}$ respectively, we have $\begin{align*} p(x_{\mathrm{nd}(v)},x_{v})&= p(x_{v} ~|~ x_{\mathrm{pa}(v)})\prod_{u\in \mathrm{nd}(v) }p(x_{u} ~|~ x_{\mathrm{pa}(u)}),\\ p(x_{\mathrm{nd}(v)})&= \prod_{u\in \mathrm{nd}(v)}p(x_{u} ~|~ x_{\mathrm{pa}(u)}). \end{align*}$ > > Now taking their ratio gives the result. 
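As a concrete sketch, the factorisation $p(x)=\prod_{v}p(x_{v}~|~x_{\mathrm{pa}(v)})$ can be evaluated by visiting vertices in a topological order. The graph and conditional tables below are hypothetical illustrations, not taken from any particular example above:

```python
from collections import deque

def topological_order(parents):
    """Kahn's algorithm; `parents` maps each vertex to its set of parents."""
    indeg = {v: len(ps) for v, ps in parents.items()}
    children = {v: [] for v in parents}
    for v, ps in parents.items():
        for u in ps:
            children[u].append(v)
    queue = deque(v for v, d in indeg.items() if d == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in children[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    assert len(order) == len(parents), "graph has a cycle"
    return order

def joint_probability(x, parents, cpt):
    """p(x) = prod_v p(x_v | x_pa(v)); cpt[v] maps (x_v, tuple of parent values) -> prob."""
    p = 1.0
    for v in topological_order(parents):
        pa_values = tuple(x[u] for u in sorted(parents[v]))
        p *= cpt[v][(x[v], pa_values)]
    return p

# Hypothetical 3-vertex DAG a -> b -> c with binary variables.
parents = {"a": set(), "b": {"a"}, "c": {"b"}}
cpt = {
    "a": {(0, ()): 0.6, (1, ()): 0.4},
    "b": {(0, (0,)): 0.7, (1, (0,)): 0.3, (0, (1,)): 0.2, (1, (1,)): 0.8},
    "c": {(0, (0,)): 0.5, (1, (0,)): 0.5, (0, (1,)): 0.1, (1, (1,)): 0.9},
}
print(joint_probability({"a": 1, "b": 1, "c": 1}, parents, cpt))  # 0.4 * 0.8 * 0.9 = 0.288
```

Since each conditional table is normalized, summing `joint_probability` over all $2^{3}$ assignments recovers $1$, as any valid factorised distribution must.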
This is called the **local Markov property** (on the DAG $G$), so factorizability implies the local Markov property. - In plain English, I (${v}$) am independent of my grandparents ($\mathrm{an}(v)-\mathrm{pa}(v)$) or the person next door ($\mathrm{nd}(v)$ in general) when conditioned on my parents ($\mathrm{pa}(v)$). I can still be related to my descendants ($\mathrm{de}(v)$). ### Inference Given the factorisation $p(x)=\prod_{v} p(x_{v} ~|~x_{\mathrm{pa}(v)})$ and data $\mathbf{x}=(x^{(1)},\dots, x^{(N)})$, we can infer each term separately, as the log-likelihood is $\begin{align*} l&= \sum_{n=1}^{N}\log p(x^{(n)})\\ &= \sum_{n} \sum_{v}\log p(x^{(n)}_{v}~|~x^{(n)}_{\mathrm{pa}(v)})\\ &= \sum_{v}\left( \sum_{n} \log p(x^{(n)}_{v}~|~x^{(n)}_{\mathrm{pa}(v)})\right), \end{align*}$ allowing us to optimize each term $p(x_{v}~|~x_{\mathrm{pa}(v)})$ separately. In the case of discrete variables in a [[Contingency Table|contingency table]], $\begin{align*} l&= \sum_{x} n(x)\log p(x)\\ &= \sum_{x}n(x)\sum_{v}\log p(x_{v}~|~x_{\mathrm{pa}(v)})\\ &= \sum_{v} \sum_{x_{v},x_{\mathrm{pa}(v)}} n(x_{v},x_{\mathrm{pa}(v)}) \cdot\log p(x_{v}~|~x_{\mathrm{pa}(v)}), \end{align*}$ so each term $p(x_{v}~|~x_{\mathrm{pa}(v)})$ can be estimated with its MLE $\hat{p}(x_{v}~|~x_{\mathrm{pa}(v)})= \frac{n(x_{v},x_{\mathrm{pa}(v)})}{n(x_{\mathrm{pa}(v)})}.$ ^cf4041 ### Moral Graphs ![[Directed Acyclic Graph#^d17b82]] > [!definition|*] Moral Graph (of a DAG) > For a DAG $G$, its **moral graph** $G^{m}=(V^{m},E^{m})$ is an undirected graph with $\begin{align*} V^{m}&:= V,\\ E^{m}&:=\{ ij ~|~(i \to j) \text{ OR } (j \to i) \text{ OR } (i,j \text{ share a child})\}. \end{align*}$That is, $G^{m}$ is formed by joining the two parents of every v-structure, then removing the directions on the edges. - Roughly speaking, we "*marry separated parents*" in $G$ until no divorces exist. For example, the left is a directed graph, and the right is its moral graph. 
![[DirectedGraphExample.png#invert|w50]]![[DirectedGraphMoralizedExample.png#invert|w50]] > [!theorem|*] Markov Property is Shared with Moral Graphs > > If $p$ factorises wrt. $G$, then it also factorises wrt. $G^{m}$. > > [!proof]- > > Suppose $p$ factorises as $p(x_{V})=\prod_{v}p(x_{v}~|~x_{\mathrm{pa}(v)})$. > > > > Note that in $G^{m}$, any two parents of a vertex $v$ are connected, so $\{ v \} \cup \mathrm{pa}(v)$ is complete in $G^{m}$, and hence contained in some clique $C \in \mathcal{C}(G^{m})$. Therefore, the term $p(x_{v}~|~ x_{\mathrm{pa}(v)})$ is a function of $x_{C}$. > > > > Now, gathering all such terms assigned to $C$ and defining $\phi_{C}(x_{C})$ as their product gives exactly the factorisation of $p$ wrt. the undirected graph $G^{m}$. ^39ee4c $p$ satisfies the **global Markov property** if whenever there are $A,B,S$ such that $A,B$ are separated by $S$ in the moral graph $(G[\mathrm{an}(A\cup B\cup S)])^{m}$, we have $X_{A}\perp X_{B}~|~X_{S}$. In DAGs, local Markov, global Markov, and factorizability are equivalent. - This does not require positivity, unlike the analogous result for [[Undirected Graphical Models]]. - Factorizability $\Rightarrow$ global Markov: for any $A,B,S$, let $W:= \mathrm{an}(A\cup B\cup S)$. Since $W$ is ancestral, $p$ factorising wrt. $G$ implies that it also factorises wrt. $G[W]$, and hence wrt. $(G[W])^{m}=(G^{m})[W]$. So if $S$ separates $A$ and $B$ in $(G[W])^{m}$, the global Markov property for undirected graphs gives $X_{A}\perp X_{B}~|~X_{S}$. - Global Markov $\Rightarrow$ local Markov: for any $v$, consider $W:= \{ v \} \cup \mathrm{nd}(v)$. This is an ancestral set, in which $A:= \{ v \}$ and $B:= \mathrm{nd}(v)-\mathrm{pa}(v)$ are separated by $S:= \mathrm{pa}(v)$. So global Markov gives $X_{v} \perp X_{\mathrm{nd}(v)-\mathrm{pa}(v)} ~|~ X_{\mathrm{pa}(v)}$, i.e. local Markov. 
- Local Markov $\Rightarrow$ factorizability: in any topological ordering, $\{ v': v'<v \} \subset \mathrm{nd}(v)$ (otherwise some descendant $u$ of $v$ would satisfy $u<v$, contradicting the ordering), so local Markov reduces each chain-rule factor: $p(x_{v_{i}} ~|~x_{v_{1}},\dots,x_{v_{i-1}})=p(x_{v_{i}} ~|~ x_{\mathrm{pa}(v_{i})})$ for all $i$.
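Moralisation itself is mechanical: drop the direction on every edge and "marry" every pair of parents that share a child. A minimal sketch (the helper name and graph are hypothetical, not from the text):

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of G^m; `parents` maps each vertex to its set of parents."""
    edges = set()
    for v, ps in parents.items():
        # drop the direction on each original edge u -> v
        for u in ps:
            edges.add(frozenset({u, v}))
        # "marry" every pair of parents sharing the child v
        for u, w in combinations(sorted(ps), 2):
            edges.add(frozenset({u, w}))
    return edges

# Hypothetical collider a -> c <- b: moralisation adds the marriage edge a - b.
parents = {"a": set(), "b": set(), "c": {"a", "b"}}
print(sorted(tuple(sorted(e)) for e in moralize(parents)))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

On the collider, the output shows that $\{c\}\cup\mathrm{pa}(c)=\{a,b,c\}$ becomes complete in $G^{m}$, exactly the property the theorem's proof relies on.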