Consider fitting some distribution $p$ for an [[Undirected Graphical Models|undirected graphical model]] with a factorizable graph $G$. Let $(C_{1},\dots,C_{K})=\mathcal{C}(G)$ be a sequence of cliques of $G$ that satisfies the running intersection property, with separators $(S_{1},\dots,S_{K})$.

Even with factorizability this is no trivial matter, but things are manageable if we fit Gaussians, i.e. $\mathcal{X}=\mathbb{R}^{p}$ and $p=N(\pmb{\mu}, \Sigma)$.
- WLOG we assume $\pmb{\mu}=\mathbf{0}$; otherwise just shift the data $\mathbf{x}$ by that amount.
- Since Gaussians have support over the entire $\mathcal{X}$, they satisfy the positivity condition, so we do not need to distinguish between the various Markov properties and factorizability.

Let $K:= \Sigma^{-1}$ be the **concentration matrix** of $p$. Then $p$ is Markov wrt. $G$ if and only if $\forall i,j \in V, ~i \not \sim j, ~~ K_{ij}=0.$ Therefore, $K$ has the same sparsity structure as the adjacency matrix of $G$ (or sparser).

> [!proof]-
> A special property of multivariate Gaussians guarantees that $K_{ij}=0$ is equivalent to $X_{i} \perp X_{j} ~|~ X_{V-\{ i,j \}}$. This is the pairwise Markov property, so $p$ is Markov.

Using the [[Undirected Graphical Models#^464932|characterization of factorization on a decomposition]], any Markov $p$ and decomposition $(A,B,S)$ satisfy $\log p(x_{V})=\log p(x_{A\cup S})+\log p(x_{B\cup S})-\log p(x_{S}),$ so how does this translate to the Gaussian setting, in particular in terms of the concentration matrices of $x_{V},x_{A\cup S},x_{B\cup S},x_{S}$?

For $M \subset V$, slice $\Sigma$ to define $\Sigma_{M}:=\Sigma[M,M]$, the covariance matrix of $x_{M}$, and denote by $\Sigma^{-1}_{M}:= (\Sigma_{M})^{-1}$ its concentration matrix ($\ne K_{M}$ in general). Then the identity above becomes
$$\mathrm{const.}+x_{V}^{T}Kx_{V}=\mathrm{same~const.}+x_{A\cup S}^{T}\Sigma_{A\cup S}^{-1}x_{A\cup S}+x_{B\cup S}^{T}\Sigma_{B\cup S}^{-1}x_{B\cup S}-x_{S}^{T}\Sigma_{S}^{-1}x_{S},$$
so padding the vectors and matrices with $0$s to make them $p\times 1$ and $p\times p$ respectively, we get:

> [!theorem|*] Gaussians on Decomposable Graphs
> If $G$ is decomposable with decomposition $(A,B,S)$, then we can also decompose a Gaussian $p$ (Markov wrt. $G$) with
> $$K=\{ \Sigma_{A \cup S}^{-1} \}_{A \cup S}+\{ \Sigma_{B \cup S}^{-1} \}_{B \cup S}- \{ \Sigma_{S}^{-1} \}_{S},$$
> where $\Sigma_{U}$ for $U \subset V$ is the $[U,U]$ block of $\Sigma$ (in `R`, `sigma[U, U]`; in `numpy`, `sigma[np.ix_(U, U)]`), and $\Sigma^{-1}_{U}:= (\Sigma_{U})^{-1}$ is the concentration matrix of $x_{U}$, which is $\ne K_{U}$ in general.
>
> > [!help]- Definition of the padding $\{ M \}_{U}$
> > Let $U \subset V=\{ 1,\dots,p \}$ and $M \in \mathbb{R}^{| U |\times | U |}$. Then $\{ M \}_{U}$ is the matrix $M$ padded with zeros to a $|V| \times |V|$ matrix whose $[U,U]$ block is $M$. More precisely,
> > ```python
> > import numpy as np
> >
> > def padding(M, U, V):
> >     # embed the |U| x |U| matrix M into a |V| x |V| zero matrix at the [U, U] block
> >     padded_matrix = np.zeros((len(V), len(V)))
> >     padded_matrix[np.ix_(U, U)] = M
> >     return padded_matrix
> > ```
> ^a6cfdf

In the context of a decomposable graph $G$, we have an

![[Undirected Graphical Models#^582a3a|induced decomposition]]

so the theorem above gives:

> [!theorem|*] Gaussians on RI Clique Sequences
> $$K=\sum_{k}\left(\{ \Sigma^{-1}_{C_{k}} \}_{C_{k}}-\{ \Sigma^{-1}_{S_{k}} \}_{S_{k}}\right).$$
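As a quick sanity check, here is a minimal `numpy` sketch (not from the original notes: the 3-variable chain graph, the numbers, and the `padded_inverse` helper are made up for illustration) verifying that the clique-sum formula recovers $K$ exactly for a Gaussian that is Markov wrt. the chain:

```python
import numpy as np

def padded_inverse(cov, subset, p):
    # p x p zero matrix whose [subset, subset] block is cov[subset, subset]^{-1}
    out = np.zeros((p, p))
    out[np.ix_(subset, subset)] = np.linalg.inv(cov[np.ix_(subset, subset)])
    return out

# chain graph X1 -- X2 -- X3 (indices 0, 1, 2):
# cliques C1 = {0, 1}, C2 = {1, 2}; separator S2 = {1}; Markov <=> K[0, 2] = 0
K = np.array([[2.0, 0.5, 0.0],
              [0.5, 2.0, 0.7],
              [0.0, 0.7, 2.0]])
Sigma = np.linalg.inv(K)

K_rebuilt = (padded_inverse(Sigma, [0, 1], 3)
             + padded_inverse(Sigma, [1, 2], 3)
             - padded_inverse(Sigma, [1], 3))

assert np.allclose(K, K_rebuilt)  # the clique-sum formula recovers K exactly
```

If $K_{13}$ were non-zero (so $p$ not Markov wrt. the chain), the reconstruction would no longer equal $K$. The R implementation below applies the same padded-inverse construction to the marks covariance.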
```R fold title:"Implementation in R"
padded_inverse = function(cov, subset){
  ###
  # Given a covariance matrix `cov` and a subset of
  # variables `subset`, returns the "padded inverse"
  # given by
  #   {cov[subset, subset] ^ -1}_{subset, subset}
  # i.e. a matrix with the same size as `cov`, filled
  # with 0's except for the rows/columns indexed by
  # `subset`, where it contains the concentration
  # matrix of `subset` (i.e. the inverse of its covariance).
  ###
  subset_cov = cov[subset, subset]  # covariance of the subset
  subset_con = solve(subset_cov)    # concentration of the subset

  # prepare an empty/zero matrix
  p = nrow(cov)
  padded_con = matrix(0, nrow = p, ncol = p)
  colnames(padded_con) = colnames(cov)
  rownames(padded_con) = rownames(cov)

  # fill in the non-zero entries with the concentration
  padded_con[subset, subset] = subset_con
  return(padded_con)
}
```

```R
# a = analysis, s = stats, l = algebra, m = mech., v = vectors
graph_concentration = (
  padded_inverse(marks_cov, c('a', 's', 'l'))
  + padded_inverse(marks_cov, c('m', 'l', 'v'))
  - padded_inverse(marks_cov, c('l'))
)
graph_covariance = solve(graph_concentration)
```

## MLE Inference of $\Sigma$

Suppose we have an iid. sample $\mathbf{x}=(x^{(1)},\dots,x^{(n)})$ and we want to estimate $\Sigma$. We could fit an MLE willy-nilly, say the sample covariance matrix $\hat{\Sigma}:= \frac{1}{n}\sum_{i}x^{(i)}x^{(i)T},$ but of course there is no guarantee that $N(\mathbf{0},\hat{\Sigma})$ will be Markov/factorizable wrt. our graph $G$.

### Constraints on the MLE

Let our estimator be called $\hat{\Sigma}^{G}$, with concentration $\hat{K}^{G}:= (\hat{\Sigma}^{G})^{-1}$, so the Markov/factorizability constraint can be written as
$$\forall i,j \in V, ~i \not \sim j,~~ \hat{K}^{G}_{ij}=0.~~~({\dagger})$$

Under this restriction, the sufficient statistics of $\{ p \text{ Gaussian}~|~ p~\text{Markov wrt. } G \}$ are the terms of $X^{T}KX$ that multiply non-zero entries of $K$, i.e. $\{ X_{i}X_{j} ~|~ i,j\in V, ~i \sim j \text{ or } i=j \}.$ Because of [[Minimality and Sufficiency#^8d3447|a special property of exponential families]], the MLE $\hat{\Sigma}^{G},\hat{K}^{G}$ must *match expectations* $\mathbb{E}[X_{i}X_{j}]=\mathrm{Cov}(X_{i},X_{j})=\hat{\Sigma}^{G}_{ij}$ with the observed sufficient statistics $\overline{x_{i}x_{j}}=\hat{\Sigma}_{ij}$, where again $\hat{\Sigma}$ is the sample covariance, i.e. the unconstrained MLE. That is,
$$\forall i,j \in V, ~i \sim j \text{ or } i=j,~~ \hat{\Sigma}^{G}_{ij}=\hat{\Sigma}_{ij}.~~~({\ddagger})$$

> [!examples] Student Test Score Example
>
> For example, if the graph is
> ```mermaid
> graph TD
>   A[Analysis A] --- L[Algebra L]
>   A --- S[Statistics S]
>   L --- S
>   V[Vectors V] --- L
>   L --- M[Mechanics M]
>   V --- M
> ```
> then the student's score $X=(X_{a},X_{s},X_{l},X_{v},X_{m})$ has cliques (of size $>2$) $\{ a,l,s \}$ and $\{ l,v,m \}$.
>
> Therefore, $\hat{K}^{G}$ must look like
> $$\hat{K}^{G}= \begin{array}{c|cc|c|cc} & a & s & l & v & m\\ \hline a & * & * & * & & \\ s & * & * & * & & \\ \hline l & * & * & * & * & * \\ \hline v & & & * & * & * \\ m & & & * & * & * \end{array} ~~~(\dagger)$$
> to match the graph (the blank entries are $0$), and $\hat{\Sigma}^{G}$ must look like
> $$\hat{\Sigma}^{G}= \begin{array}{c|cc|c|cc} & a & s & l & v & m\\ \hline a & \hat{\Sigma}_{aa} & \hat{\Sigma}_{as}& \hat{\Sigma}_{al} & ? & ? \\ s & \hat{\Sigma}_{as} & \hat{\Sigma}_{ss}& \hat{\Sigma}_{sl} & ? & ? \\ \hline l & \hat{\Sigma}_{al} & \hat{\Sigma}_{sl} & \hat{\Sigma}_{ll} & \hat{\Sigma}_{lv} & \hat{\Sigma}_{lm} \\ \hline v & ? & ? & \hat{\Sigma}_{lv} & \hat{\Sigma}_{vv} & \hat{\Sigma}_{vm} \\ m & ? & ? & \hat{\Sigma}_{lm} & \hat{\Sigma}_{vm} & \hat{\Sigma}_{mm} \end{array}~~~({\ddagger})$$
> to match the observed moments from $\hat{\Sigma}$ (the $?$ entries are not directly constrained).
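To make $({\dagger})$ and $({\ddagger})$ concrete, here is a minimal `numpy` sketch (the helper names and index sets are my own, assuming the variable order $a,s,l,v,m$ from the tables above) encoding the two checks a candidate pair $(\hat{K}^{G},\hat{\Sigma}^{G})$ has to pass against the sample covariance $\hat{\Sigma}$:

```python
import numpy as np

# variable order: a, s, l, v, m (matching the tables above)
cliques   = [[0, 1, 2], [2, 3, 4]]            # {a, s, l} and {l, v, m}
non_edges = [(0, 3), (0, 4), (1, 3), (1, 4)]  # a-v, a-m, s-v, s-m

def satisfies_sparsity(K_G):
    # (dagger): zero concentration on every missing edge
    return all(np.isclose(K_G[i, j], 0.0) for i, j in non_edges)

def matches_moments(Sigma_G, Sigma_hat):
    # (double dagger): clique blocks of Sigma_G agree with the sample covariance
    return all(np.allclose(Sigma_G[np.ix_(C, C)], Sigma_hat[np.ix_(C, C)])
               for C in cliques)
```

The next subsection gives an explicit construction for which both checks hold.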
### Fitting the Constrained MLE

Now how do we find the MLE $\hat{\Sigma}^{G},\hat{K}^{G}$ satisfying these requirements? Answer: we explicitly construct $\hat{K}^{G}$ so that it has the correct sparsity structure (has $0$s where it should), and then check that its inverse $\hat{\Sigma}^{G}$ has the correct moments.

Consider taking the clique-sequence formula $K=\sum_{k}\left(\{ \Sigma^{-1}_{C_{k}} \}_{C_{k}}-\{ \Sigma^{-1}_{S_{k}} \}_{S_{k}}\right)$ and replacing each block of $\Sigma$ with the corresponding block of the sample covariance $\hat{\Sigma}$, giving
$$\hat{K}^{G}\overset{?}{=}\sum_{k}\{\hat{\Sigma}^{-1}_{C_{k}} \}_{C_{k}}-\{\hat{\Sigma}^{-1}_{S_{k}} \}_{S_{k}}.$$
It is obvious that this $\hat{K}^{G}$ has the correct sparsity structure (i.e. $({\dagger})$ is satisfied), since the padding only touches entries within cliques and separators. We claim that it also matches the required moments, so $({\ddagger})$ is satisfied too (see the numerical sketch at the end of this section).

> [!warning]- Long Ahh Justification
>
> We want to verify that for any clique $C \in \mathcal{C}(G)$, we have $\hat{\Sigma}^{G}_{C}=\hat{\Sigma}_{C}$, so that in particular for any pair $\{ i,j \}$ with $i \sim j$ (they share a clique), $\mathbb{E}_{\hat{K}^{G}}[X_{i}X_{j}]$ is correct.
>
> One can show that, in the setting of the [[#^a6cfdf|decomposition theorem above]], the MLE $\hat{K}^{G}$ can be decomposed as
> $$\begin{align*}
> \hat{K}^{G}&= \{ \hat{K}^{G[A\cup S]} \}_{A\cup S}+\{ \hat{K}^{G[B\cup S]} \}_{B\cup S}-\{ \hat{\Sigma}_{S}^{-1} \}_{S},\\
> \hat{\Sigma}^{G}_{A\cup S}&= (\hat{K}^{G[A \cup S]})^{-1},\\
> \hat{\Sigma}^{G}_{B\cup S}&= (\hat{K}^{G[B \cup S]})^{-1},
> \end{align*}$$
> where
> - the first identity shows how to assemble this MLE from those of smaller subgraphs: for a subset of variables $U \subset V$, $\hat{K}^{G[U]}$ is the MLE of $\mathrm{Cov}(X_{U})^{-1}$ for the marginal distribution $p(X_{U})$, constrained to the induced subgraph $G[U]$;
> - the second and third identities say that this combination gives the correct marginal covariances on $A \cup S$ and $B \cup S$.
>
> > [!proof] See Lauritzen's book for the proof... It's just a ton of matrix algebra.
>
> Using the second/third identity: any clique $C$ of $G$ lies entirely inside $A\cup S$ or $B\cup S$ (WLOG the former), so its block in the final MLE satisfies
> $$\hat{\Sigma}^{G}_{C}=\left((\hat{K}^{G[A \cup S]})^{-1}\right)_{C}\overset{!}{=}\hat{\Sigma}_{C},$$
> where $(!)$ follows by inducting on the size of the graph(ical model): we reduced the original model over $G$ to a smaller one over $G[A\cup S]$, and the induction hypothesis says the smaller MLE $(\hat{K}^{G[A\cup S]})^{-1}$ matches the unconstrained MLE $\hat{\Sigma}$ on every clique block, in particular on $(C,C)$.
>
> Applying the same induction along the induced decomposition, namely
> ![[Undirected Graphical Models#^582a3a]]
> we can conclude that $\hat{\Sigma}^{G}$ gives any clique $C_{k}$ the correct covariance $$\hat{\Sigma}^{G}_{C_{k}}=\hat{\Sigma}_{C_{k}}$$ that matches the unconstrained MLE $\hat{\Sigma}$.
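Putting the pieces together, here is a minimal end-to-end `numpy` sketch (simulated placeholder data and the same hypothetical 5-variable setup as in the sketches above, not from the original notes) that builds $\hat{K}^{G}$ via the clique-sum formula and confirms $({\dagger})$ and $({\ddagger})$ numerically:

```python
import numpy as np

def padded_inverse(cov, subset, p):
    # p x p zero matrix whose [subset, subset] block is cov[subset, subset]^{-1}
    out = np.zeros((p, p))
    out[np.ix_(subset, subset)] = np.linalg.inv(cov[np.ix_(subset, subset)])
    return out

rng = np.random.default_rng(0)
n, p = 500, 5                                  # variable order: a, s, l, v, m
X = rng.standard_normal((n, p))                # placeholder data (mean assumed 0)
Sigma_hat = X.T @ X / n                        # unconstrained MLE (sample covariance)

cliques    = [[0, 1, 2], [2, 3, 4]]            # {a, s, l} and {l, v, m}
separators = [[2]]                             # RI separators (the first one is empty)
K_G = (sum(padded_inverse(Sigma_hat, C, p) for C in cliques)
       - sum(padded_inverse(Sigma_hat, S, p) for S in separators))
Sigma_G = np.linalg.inv(K_G)

# (dagger): zeros on the missing edges a-v, a-m, s-v, s-m
assert all(np.isclose(K_G[i, j], 0.0) for i, j in [(0, 3), (0, 4), (1, 3), (1, 4)])
# (double dagger): clique blocks of Sigma_G reproduce the sample covariance
assert all(np.allclose(Sigma_G[np.ix_(C, C)], Sigma_hat[np.ix_(C, C)]) for C in cliques)
```

Both assertions pass: the padding guarantees $({\dagger})$ by construction, and the clique blocks of $\hat{\Sigma}^{G}$ reproduce those of $\hat{\Sigma}$, which is exactly the moment-matching claim justified above.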