## Markov Gaussians on DAGs
An arbitrary distribution $p$ is Markov on a DAG $G=(V,E)$ if it satisfies
> [!embed]
> ![[Directed Graphical Models#^9bd6c6]]
and we want to study what this implies for Gaussian distributions.
Suppose $X_{V}\sim N(0, \Sigma)$, but we don't assume it's Markov wrt. the graph $G=(V,E)$ just yet.
> [!exposition]- Long Ahh derivation of the matrix form
> Let $\beta_{v}:=\beta_{v \sim U}$ be the [[Population Linear Regression]] coefficients of $X_{v} \sim X_{U}$ for some $U \subseteq V-\{ v \}$, given by $\beta_{v}=(\Sigma_{U,U})^{-1}\Sigma_{U,v},$ where the subscripts indicate subsetting $\Sigma$.
> - Compare this to the [[Linear Regression Methods#Least Squares Linear Regression|sample OLS coefficients]] $(X^T X)^{-1}(X^{T}y)$: when the rows of $(X, y)$ are draws from the joint distribution, $\frac{1}{n}X^{T}X$ estimates $\Sigma_{X,X}$ and $\frac{1}{n}X^{T}y$ estimates $\Sigma_{X,y}$ (the $1/n$ factors cancel), so the population formula is the limit of OLS.
>
> Then we have $X_{v}~|~X_{U}\sim N(X_{U}^{T}\beta_{v}, \dots)$.
> Taking $U=V-\{ v \}$, we equivalently get $X_{v}=\beta_{v}^{T}X_{V-\{ v \}}+\epsilon_{v}=\sum_{u \ne v}\beta_{vu}X_{u}+\epsilon_{v},$ for some $\epsilon_{v} \sim N(0, \dots)$.
In matrix form, this is $X_{V}=BX_{V}+\epsilon_{V},$ where $B=(b_{vu})$ with $b_{vu}=\beta_{vu}$ and a zero diagonal. In general, *$\epsilon_{V}$ has a non-diagonal covariance, and $B$ is dense*.
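As a quick numerical sanity check (my own sketch, not part of the derivation; `Sigma`, `resid_cov`, and the dimension are illustrative choices), the snippet below builds a generic covariance, computes each $\beta_{v}=(\Sigma_{U,U})^{-1}\Sigma_{U,v}$ with $U=V-\{ v \}$, and prints $B$ and $\mathrm{Cov}(\epsilon_{V})$; generically both are dense / non-diagonal.

```python
# Hedged sketch: regress each coordinate of a zero-mean Gaussian on all the
# others and inspect B and Cov(eps). Assumes only numpy; names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)          # a generic positive-definite covariance

B = np.zeros((d, d))
for v in range(d):
    U = [u for u in range(d) if u != v]  # U = V - {v}
    # beta_v = (Sigma_{U,U})^{-1} Sigma_{U,v}
    beta_v = np.linalg.solve(Sigma[np.ix_(U, U)], Sigma[np.ix_(U, [v])]).ravel()
    B[v, U] = beta_v                     # b_{vu} = beta_{vu}, zero diagonal

# eps_V = (I - B) X_V, so Cov(eps_V) = (I - B) Sigma (I - B)^T
resid_cov = (np.eye(d) - B) @ Sigma @ (np.eye(d) - B).T

print(np.round(B, 2))          # dense: generically no off-diagonal zeros
print(np.round(resid_cov, 2))  # non-diagonal residual covariance
```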
However, if we require $X_{V}\sim N(0, \Sigma)$ to be Markov wrt. $G$, then:
> [!exposition]- More derivation
> We must have $X_{v} \perp X_{\mathrm{nd}(v)-\mathrm{pa}(v)}~|~X_{\mathrm{pa}(v)},$ i.e. the conditional distribution $X_{v}~|~X_{\mathrm{nd}(v)}\sim N(\beta^{T}_{v}X_{\mathrm{nd}(v)},\dots)$ (now taking $U=\mathrm{nd}(v)$) must not depend on $X_{\mathrm{nd}(v)-\mathrm{pa}(v)}$, i.e.
>
> $\beta_{v}\text{'s non-$0$ entries correspond to }\mathrm{pa}(v).$
>
> In the special case of $v$ being the last vertex in some topological order, we have $\mathrm{nd}(v)=V-\{ v \}=: U$, so $p(x_{V})=p(x_{v}~|~x_{U})\cdot p(x_{U}).$
> Therefore, if $p(x_{v} ~|~ x_{U})$ satisfies $X_{v} \perp X_{U-\mathrm{pa}(v)}~|~X_{\mathrm{pa}(v)},$ and $p(x_{U})$ is Markov wrt. $G[V-\{ v \}]$, then we get the full factorization $p(x_{V})=\prod_{w \in V}p(x_{w}~|~x_{\mathrm{pa}(w)})$, implying that $p(x_{V})$ is Markov wrt. $G$.
>
> This gives an inductive argument that $p(x_{V})$ is Markov wrt. $G$ iff $\forall v\in V, ~~\beta_{v \sim \mathrm{nd}(v)}\text{'s non-$0$ entries correspond to }\mathrm{pa}(v).$
> This simplifies the structural equation to $X_{v}=\beta_{v}^{T}X_{\mathrm{nd}(v)}+\epsilon_{v}=\sum_{u \in \mathrm{pa}(v)}\beta_{vu}X_{u}+\epsilon_{v},$ resulting in a sparse $B$ (i.e. $b_{vu}=\beta_{vu}\ne0$ only when $u$ is a parent of $v$) whose sparsity structure equals that of $G$. Moreover, these $\epsilon_{v}$ are pairwise uncorrelated and jointly Gaussian, hence independent, so $\epsilon_{V}$ now has a *diagonal* covariance, in contrast to the general case above (see the numerical sketch below).
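To see the sparse case concretely, here is another small sketch of mine (the DAG $0\to1$, $0\to2$, $1\to3$, $2\to3$, the edge weights, and the noise variances are arbitrary illustrative choices): build a linear Gaussian SEM with $B$ supported on parent edges and diagonal noise $D$, form $\Sigma=(I-B)^{-1}D(I-B)^{-T}$, then regress each $X_{v}$ on $X_{\mathrm{nd}(v)}$ and check that the non-zero population coefficients sit exactly on $\mathrm{pa}(v)$.

```python
# Hedged sketch: linear Gaussian SEM on the DAG 0->1, 0->2, 1->3, 2->3.
# Recover beta_{v ~ nd(v)} from Sigma and check its support equals pa(v).
import numpy as np

d = 4
pa = {0: [], 1: [0], 2: [0], 3: [1, 2]}           # parent sets in G
nd = {0: [], 1: [0, 2], 2: [0, 1], 3: [0, 1, 2]}  # non-descendants of each vertex

B = np.zeros((d, d))
B[1, 0], B[2, 0], B[3, 1], B[3, 2] = 0.8, -0.5, 1.2, 0.7  # arbitrary edge weights
D = np.diag([1.0, 0.5, 2.0, 1.5])                         # diagonal Cov(eps_V)

# X_V = B X_V + eps_V  =>  X_V = (I - B)^{-1} eps_V
IB = np.linalg.inv(np.eye(d) - B)
Sigma = IB @ D @ IB.T

for v in range(d):
    U = nd[v]
    if not U:
        continue  # a vertex with no non-descendants has nothing to regress on
    # population regression of X_v on X_{nd(v)}: (Sigma_{U,U})^{-1} Sigma_{U,v}
    beta_v = np.linalg.solve(Sigma[np.ix_(U, U)], Sigma[np.ix_(U, [v])]).ravel()
    coeffs = {u: round(float(b), 3) for u, b in zip(U, beta_v)}
    support = [u for u, b in coeffs.items() if abs(b) > 1e-9]
    print(f"v={v}: beta={coeffs}, support={support}, pa(v)={pa[v]}")
```

The recovered coefficients agree with the chosen $b_{vu}$, and swapping $\mathrm{nd}(v)$ for $V-\{ v \}$ in the regression generally breaks the sparsity, which is why the non-descendant set matters in this argument.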