## Markov Gaussians on DAGs

An arbitrary distribution $p$ is Markov on a DAG $G=(V,E)$ if it satisfies

> [!embed]
> ![[Directed Graphical Models#^9bd6c6]]

and we want to study what this implies for Gaussians. Suppose $X_{V}\sim N(0, \Sigma)$, but we don't assume it's Markov wrt. the graph $G=(V,E)$ just yet.

> [!exposition]- Long derivation of the matrix form
> Let $\beta_{v}:=\beta_{v \sim U}$ be the [[Population Linear Regression]] coefficients of $X_{v} \sim X_{U}$ for some $U \subset V$, given by $\beta_{v}=(\Sigma_{U,U})^{-1}\Sigma_{U,v}$, where the subscripts indicate subsetting $\Sigma$.
> - Compare it to the [[Linear Regression Methods#Least Squares Linear Regression|sample OLS coefficients]] given by $(X^T X)^{-1}(X^{T}y)$ (the first term is an estimate of $\Sigma^{-1}_{X,X}$ and the second of $\Sigma_{X,y}$, if we assume $X\sim$ some Gaussian too).
>
> Then we have $X_{v}~|~X_{U}\sim N(X_{U}^{T}\beta_{v}, \dots)$.
>
> When we take $U=V-\{ v \}$, we equivalently get $X_{v}=\beta_{v}^{T}X_{V-\{ v \}}+\epsilon_{v}=\sum_{u \ne v}\beta_{vu}X_{u}+\epsilon_{v}$ for some $\epsilon_{v} \sim N(0, \dots)$. In matrix form, this is $X_{V}=BX_{V}+\epsilon_{V}$, where $B=(b_{vu})$. In general, *$\epsilon_{V}$ has a non-diagonal covariance, and $B$ is dense*.

However, if we require $X_{V}\sim N(0, \Sigma)$ to be Markov wrt. $G$, then:

> [!exposition]- More derivation
> We must have $X_{v} \perp X_{\mathrm{nd}(v)-\mathrm{pa}(v)}~|~X_{\mathrm{pa}(v)}$, i.e. the conditional distribution $X_{v}~|~X_{\mathrm{nd}(v)}\sim N(\beta^{T}_{v}X_{\mathrm{nd}(v)},\dots)$ must not depend on $X_{\mathrm{nd}(v)-\mathrm{pa}(v)}$, i.e.
>
> $\beta_{v}\text{'s non-$0$ entries correspond to }\mathrm{pa}(v).$
>
> In the special case of $v$ being the last vertex in some topological order, we have $\mathrm{nd}(v)=V-\{ v \}=: U$, so $p(x_{V})=p(x_{v}~|~x_{U})\cdot p(x_{U})$.
> Therefore, if $p(x_{v} ~|~ x_{U})$ satisfies $X_{v} \perp X_{U-\mathrm{pa}(v)}~|~X_{\mathrm{pa}(v)}$, and $p(x_{U})$ is Markov wrt. $G[V-\{ v \}]$, then we get a full factorization of $p(x_{V})$, implying that $p(x_{V})$ is Markov wrt. $G$.
>
> This gives an inductive argument that $p(x_{V})$ is Markov iff $\forall v\in V, ~~\beta_{v}\text{'s non-$0$ entries correspond to }\mathrm{pa}(v).$
> This simplifies $X_{v}=\beta_{v}^{T}X_{V-\{ v \}}+\epsilon_{v}=\sum_{u \in \mathrm{pa}(v)}\beta_{vu}X_{u}+\epsilon_{v}$, resulting in a sparse $B$ (i.e. $b_{vu}=\beta_{vu}\ne0$ only when $u$ is a parent of $v$) whose sparsity structure equals that of $G$'s adjacency matrix transposed.
>
> Moreover, $\epsilon_{v}=X_{v}-\sum_{u\in\mathrm{pa}(v)}\beta_{vu}X_{u}$ is uncorrelated with, hence independent of, $X_{\mathrm{pa}(v)}$, hence also independent of any $\epsilon_{u}$ computed after it (the induction happens in reverse topological order). As a result, $\epsilon_{V} \text{ has fully independent entries, and its covariance }\Sigma_{\epsilon} \text{ is diagonal.}$

> [!theorem|*] Gaussians Markov wrt. a DAG
> If $p$ is a Gaussian density that is Markov wrt. a DAG $G$, then $X_{V}=BX_{V}+\epsilon_{V}$ for some Gaussian $\epsilon_{V}$ with diagonal covariance $D$, and $B$ is sparse with
> $b_{vu}=\begin{cases} (\beta_{v\sim \mathrm{pa}(v)})_{u} &\text{if }u \in \mathrm{pa}(v), \\ 0 &\text{otherwise}. \end{cases}$
> Here $(\beta_{v \sim \mathrm{pa}(v)})_{u}$ is $X_{u}$'s coefficient in the population regression $X_{v} \sim X_{\mathrm{pa}(v)}$.
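To see the theorem concretely, here is a minimal numerical sketch (assuming `numpy`; the DAG, edge weights, and noise variances are invented for illustration). We pick a sparse $B$ and diagonal $D$ for a small DAG, form $\Sigma=(I-B)^{-1}D(I-B)^{-T}$, and check that the population regression of each $X_{v}$ on $X_{\mathrm{pa}(v)}$ recovers exactly the nonzero entries of $B$.

```python
import numpy as np

# Hypothetical DAG on vertices 0, 1, 2, 3 with edges 0->1, 0->2, 1->3, 2->3
# (structure, weights, and noise variances are made up for illustration).
pa = {0: [], 1: [0], 2: [0], 3: [1, 2]}
p = 4

# Structural coefficients: B[v, u] != 0 only when u is a parent of v.
B = np.zeros((p, p))
B[1, 0], B[2, 0], B[3, 1], B[3, 2] = 0.8, -0.5, 1.2, 0.7
D = np.diag([1.0, 0.5, 2.0, 1.5])  # diagonal noise covariance

# Implied covariance of X_V = (I - B)^{-1} eps:  Sigma = (I-B)^{-1} D (I-B)^{-T}.
I = np.eye(p)
Sigma = np.linalg.inv(I - B) @ D @ np.linalg.inv(I - B).T

# Population regression of X_v on its parents recovers row v of B:
# beta_{v ~ pa(v)} = Sigma[pa, pa]^{-1} Sigma[pa, v].
B_hat = np.zeros((p, p))
for v, parents in pa.items():
    if parents:
        B_hat[v, parents] = np.linalg.solve(
            Sigma[np.ix_(parents, parents)], Sigma[parents, v]
        )

assert np.allclose(B_hat, B)  # b_{vu} = (beta_{v ~ pa(v)})_u on pa(v), 0 elsewhere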
> [!lemma|*] Independence of Gaussian components on a graph
> If Gaussian $X \sim p$ is Markov wrt. a graph $G$, then $X_{u} \not\perp X_{v} ~|~ X_{V-\{ u,v \}} \iff (u \sim v, \text{ or } \exists w:~ u \to w \leftarrow v ).$
>
> > [!proof]-
> > From the above result, solving for $X_{V}$ gives $X_{V}=(I-B)^{-1}\epsilon_{V}$, so
> > $$\begin{align*}
> > \Sigma= \mathrm{Cov}(X_{V})&= (I-B)^{-1}D(I-B)^{-T},\\
> > K:= \Sigma^{-1}&= (I-B)^{T}D^{-1}(I-B).
> > \end{align*}$$
> > Note that the terms of $K_{uv}$ are (summed over $w \in V$) $\underbrace{(I-B)_{w,u}}_{\substack{\ne 0 \text{ when }u \to w \\ \text{or }u=w}}~\cdot ~\underbrace{(I-B)_{w,v}}_{\substack{\ne 0 \text{ when }v \to w \\ \text{or }v=w}} ~/~ d_{ww}.$ The only non-zero terms correspond to (1) $u\sim v$ and (2) $u\to w\leftarrow v$.

### Covariance Computations

Given a structural equation $X_{V}=BX_{V}+\epsilon$, or equivalently $X_{V}=(I-B)^{-1}\epsilon$, how do we compute the pairwise covariances? Of course we can brute-force it by computing $\mathrm{Cov}(X_{V})=(I-B)^{-1}D(I-B)^{-T}$ where $D:=\mathrm{Cov}(\epsilon)$, but that is very expensive if we are only interested in a few pairs.

Since the sparsity of $B$ is the same as that of $A^{T}$, where $A$ is the adjacency matrix of $G$, and $A^{T}$ is nilpotent due to acyclicity, so is $B$ (inductively for $k=1,\dots$, we can show that $B^{k}_{ij}=0$ when $A^{Tk}_{ij}=0$). This allows us to write $(I-B)^{-1}=I+B+\dots+ B^{p-1}$, where $p$ is the number of nodes/variables and $B^{p}=0$.

Just as $A^{Tk}_{ij}$ counts the directed paths $j\to i$ of length $k$, $B^{k}_{ij}$ records the "correlation" that $X_{i}$ receives from $X_{j}$ along directed paths of $k$ edges. So in particular, we have $\mathrm{Var}(X_{i})=\sum_{m=0}^{p-1}\sum_{n=0}^{p-1}(B^{m}DB^{Tn})_{ii}=\sum_{m=0}^{p-1}\sum_{n=0}^{p-1}\sum_{k}B^{m}_{ik}D_{kk}B^{Tn}_{ki}.$

How do we interpret this formula? Imagine mixing together the noise contributions from the entries of $\epsilon$. Fixing $k$, we consider the amount coming from $\epsilon_{k}$. For each $m,n$, $B^{m}_{ik}$ and $B^{Tn}_{ki}=B^{n}_{ik}$ represent two paths:

```mermaid
flowchart TD
    A("εk") -- "path of length m" --> X[Xi]
    A -- "path of length n" --> X
```

When they meet at $X_{i}$, the two contributions are $B_{ik}^{m}\epsilon_{k}$ and $B^{n}_{ik}\epsilon_{k}$ respectively, producing a covariance of $\mathrm{Cov}(B_{ik}^{m}\epsilon_{k}, B_{ik}^{n}\epsilon_{k})=B_{ik}^{m}B_{ik}^{n}\mathrm{Var}(\epsilon_{k})=B_{ik}^{m}B_{ki}^{Tn}D_{kk}.$ Summing over all such pairs of paths (by their lengths $m,n$) and over the source $k$ gives the variance of $X_{i}$.

Moving on to covariances, we instead have

```mermaid
flowchart TD
    A("εk") -- "path of length m" --> Xi[Xi]
    A -- "path of length n" --> Xj[Xj]
```

and their covariance $B_{ik}^{m}B_{kj}^{Tn}D_{kk}$ now goes toward $\mathrm{Cov}(X_{i},X_{j})$, giving the virtually identical equation $\mathrm{Cov}(X_{i},X_{j})=\sum_{m=0}^{p-1}\sum_{n=0}^{p-1}(B^{m}DB^{Tn})_{ij}=\sum_{m=0}^{p-1}\sum_{n=0}^{p-1}\sum_{k}B^{m}_{ik}D_{kk}B^{Tn}_{kj}.$
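A quick sanity check of the series formula (a sketch assuming `numpy`, reusing the invented DAG and coefficients from the earlier example): we verify that $B$ is nilpotent, that the geometric series equals $(I-B)^{-1}$, and that the double sum over path lengths reproduces the brute-force covariance.

```python
import numpy as np

# Same invented DAG/coefficients as before: edges 0->1, 0->2, 1->3, 2->3.
p = 4
B = np.zeros((p, p))
B[1, 0], B[2, 0], B[3, 1], B[3, 2] = 0.8, -0.5, 1.2, 0.7
D = np.diag([1.0, 0.5, 2.0, 1.5])

# Acyclicity makes B nilpotent, so (I - B)^{-1} = I + B + ... + B^{p-1}.
powers = [np.linalg.matrix_power(B, m) for m in range(p)]
assert np.allclose(np.linalg.matrix_power(B, p), 0)
assert np.allclose(sum(powers), np.linalg.inv(np.eye(p) - B))

def cov_via_path_lengths(i, j):
    """Cov(X_i, X_j) as the double sum over path lengths m, n of (B^m D B^Tn)_{ij}."""
    return sum((powers[m] @ D @ powers[n].T)[i, j]
               for m in range(p) for n in range(p))

# Matches the brute-force covariance matrix entry by entry.
Sigma = np.linalg.inv(np.eye(p) - B) @ D @ np.linalg.inv(np.eye(p) - B).T
assert np.isclose(cov_via_path_lengths(1, 3), Sigma[1, 3])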
But this is just matrix algebra -- it doesn't really simplify the computation. How do we exploit the graphical model? We know that $B^{m}_{ik}$ is non-zero only if there is a path $\pi: k\to i$ of length $m$, and each such path contributes $\rho(\pi):= \prod_{u \to v \in \pi}B_{vu}$, the product of its edge coefficients, so $B^{m}_{ik}=\sum_{\substack{\pi: k\to i\\\text{length }m}}\rho(\pi).$ So the double sum $\sum_{m}\sum_{n}$ above simply sums over all paths $k\to i$ and $k\to j$. Rearranging the sum, we get $\mathrm{Cov}(X_{i},X_{j})=\sum_{k}D_{kk}\left( \sum_{\pi: k\to i}\rho(\pi) \right)\left( \sum_{\pi': {k} \to j}\rho(\pi') \right).$

If we pair $\pi: k\to i$ with $\pi': k\to j$ into a **trek** $\tau:= (\pi,\pi')$ from $i$ to $j$ with source $k$, we can further define the **trek covariance** $c(\tau):=D_{kk}\rho(\pi)\rho(\pi')$, and the covariance is just $\mathrm{Cov}(X_{i},X_{j})=\sum_{\tau\in \mathcal{T}_{ij}}c(\tau)$, where $\mathcal{T}_{ij}$ is the set of all treks from $i$ to $j$ (regardless of source).

Is this any cheaper to compute? Since trek sources can only lie in $\mathrm{an}(i)\cap \mathrm{an}(j)$, we simply find all $k\to i$ and $k\to j$ paths for $k$ in that set, compute the respective sums of their $\rho$'s, then plug them into the formula above.

## Gaussians on Causal Graphs

Suppose we wish to study the effect of some treatment $T$ on a response $Y$, in the presence of other variables $X$ and under a [[Causal Graphs|causal graph]] $G=(V,E)$ with vertices $V=(T,X,Y)$. Then by [[Causal Graphs#^79040c|results about adjustment sets]], we can find a subset $C \subset V$ such that $p(x_{y}~|~\mathrm{do}(x_{t}))= \sum_{x_{C}} p(x_{y} ~|~x_{t},x_{C})\cdot p(x_{C}).$

If we are only interested in the expectation of $Y$, we have
$$\begin{align*}
\mathbb{E}[Y ~|~ \mathrm{do}(x_{t})]&= \sum_{x_{y}}x_{y}\cdot p(x_{y}~|~\mathrm{do}(x_{t}))\\
&= \sum_{x_{y}}x_{y} \sum_{x_{C}}p(x_{y} ~|~x_{t},x_{C})\cdot p(x_{C})\\
&= \sum_{x_{C}}p(x_{C})\sum_{x_{y}}x_{y}\,p(x_{y}~|~x_{t},x_{C})\\
&= \sum_{x_{C}} \mathbb{E}[Y ~|~ x_{t},x_{C}]\cdot p(x_{C}).
\end{align*}$$

The expectation in the last line is particularly easy to find if $Y~|~X_{t},X_{C}$ is Gaussian with some unspecified $\mu,\sigma^{2}$:

- Regardless of the true relationship between $\mu$ and $X_{t},X_{C}$, the best linear unbiased predictor of $\mathbb{E}[Y ~|~ x_{t},x_{C}]$ is given by the [[Linear Regression Methods#Least Squares Linear Regression|OLS]] regression $Y \sim X_{t}+X_{C}$.
- If we further assume $(Y,X_{t},X_{C})$ is jointly Gaussian, then the conditional expectation indeed has this linear form.

Denote by $\beta_{0},\beta_{t},\beta_{C}$ the estimated OLS coefficients ($\beta_{0}$ being the intercept, with hats dropped for simplicity), and let $\beta=(\beta_{0},\beta_{t},\beta_{C}^{T})^{T}$ be their concatenation. Then $\hat{\mathbb{E}}[Y ~|~x_{t},x_{C}]=(1,x_{t},x_{C})\beta,$ and plugging into the adjustment formula gives
$$\begin{align*}
\hat{\mathbb{E}}[Y ~|~\mathrm{do}(x_{t})]&= \sum_{x_{C}}p(x_{C})\cdot\hat{\mathbb{E}}[Y ~|~x_{t},x_{C}]\\
&= \beta_{0}+\beta_{t}\cdot x_{t}+\left( \sum_{x_{C}}p(x_{C})\,x_{C} \right)\beta_{C}\\
&= \beta_{0}+\beta_{t}\cdot x_{t}+\mathbb{E}[x_{C}]\beta_{C}.
\end{align*}$$

Therefore, *$T$'s average causal effect on $Y$ is given by $\beta_{t}$*, regardless of $\mathbb{E}[x_{C}]$ (which we can WLOG take to be $0$ by centering each variable in $x_{C}$).
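A small simulation sketch of this last point (assuming `numpy`; the confounded graph $C \to T$, $C \to Y$, $T \to Y$ and all coefficients are invented for illustration): with the valid adjustment set $\{C\}$, the OLS coefficient on $T$ in $Y \sim T + C$ recovers the average causal effect, whereas the unadjusted regression $Y \sim T$ does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Invented confounded SEM: C -> T, C -> Y, T -> Y, with true causal effect 2.0.
C = rng.normal(size=n)
T = 1.5 * C + rng.normal(size=n)
Y = 2.0 * T - 3.0 * C + rng.normal(size=n)

def ols(y, *cols):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones_like(y), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_adjusted = ols(Y, T, C)  # adjust for C: coefficient on T ~= 2.0 (the ACE)
beta_naive = ols(Y, T)        # no adjustment: biased by the path T <- C -> Y

print(beta_adjusted[1], beta_naive[1])
```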