Consider fitting some distribution $p$ for an [[Undirected Graphical Models|undirected graphical model]] with a factorizable graph $G$. Let $(C_{1},\dots,C_{K})=\mathcal{C}(G)$ be a sequence of cliques of $G$ that satisfy the running intersection property, with separators $(S_{1},..,S_{K})$.
Even with factorizability, this is no trivial matter, but things are manageable if we are fitting Gaussians with $\mathcal{X}=\mathbb{R}^{p}$ and $p=N(\pmb{\mu}, \Sigma)$.
- WLOG we assume $\pmb{\mu}=0$, else just shift the data $\mathbf{x}$ by that amount.
- Since Gaussians have support on all of $\mathcal{X}$, they satisfy the positivity condition, so we do not need to distinguish between the various Markov properties and factorizability.
Let $K:= \Sigma^{-1}$ be the **concentration matrix** of $p$. Then $p$ is Markov wrt. $G$ if and only if $\forall i,j \in V, ~i \not \sim j, ~~ K_{ij}=0.$
Therefore, $K$ has the same sparsity structure as the adjacency matrix of $G$ (or a sparser one).
> [!proof]-
> A special property of multivariate Gaussians guarantees that $K_{ij}=0$ is equivalent to $X_{i} \perp X_{j} ~|~ X_{V\setminus\{ i,j \}}$. This is exactly the pairwise Markov property, and by positivity it is equivalent to the other Markov properties, so $p$ is Markov.
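As a quick numerical illustration (a minimal sketch on a hypothetical 3-variable chain graph $1 - 2 - 3$, whose only non-edge is $\{ 1,3 \}$): the concentration matrix carries the zero at the non-edge, while the covariance matrix is dense.
```R
# chain graph 1 -- 2 -- 3: the only non-edge is {1, 3}
K <- rbind(c( 2, -1,  0),
           c(-1,  2, -1),
           c( 0, -1,  2))
Sigma <- solve(K)
round(Sigma, 3)         # dense: X1 and X3 are marginally correlated...
round(solve(Sigma), 3)  # ...but K[1, 3] = 0: X1 is independent of X3 given X2
```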
Using the [[Undirected Graphical Models#^464932|characterization of factorization on a decomposition]], any Markov $p$ satisfies, for any decomposition $(A,B,S)$,
$\log p(x_{V})=\log p(x_{A})+\log p(x_{B})-\log p(x_{S}),$
How does this translate into the Gaussian setting, in particular in terms of the concentration matrices of $x_{V},x_{A},x_{B},x_{S}$?
For $M \subset V$, slice $\Sigma$ to define $\Sigma_{M}:=\Sigma[M,M]$, the covariance matrix of $x_{M}$, and write $\Sigma^{-1}_{M}:= (\Sigma_{M})^{-1}$ for its concentration matrix ($\ne K_{M}$ in general). Then the above becomes
$-\tfrac{1}{2}x_{V}^{T}Kx_{V}+\mathrm{const.}=-\tfrac{1}{2}\left(x_{A}^{T}\Sigma_{A}^{-1}x_{A}+x_{B}^{T}\Sigma_{B}^{-1}x_{B}-x_{S}^{T}\Sigma_{S}^{-1}x_{S}\right)+\mathrm{const.}',$
and since this identity holds for every $x$ (plugging in $x=\mathbf{0}$ shows the two constants agree), the quadratic forms must be equal. Padding the vectors and matrices with $0$'s to make them $p\times 1$ and $p\times p$ respectively, we get:
> [!theorem|*] Gaussians on Decomposable Graphs
> If $G$ is decomposable with decomposition $(A,B,S)$, then the concentration matrix of any Gaussian $p$ that is Markov wrt. $G$ decomposes as $K=\{ \Sigma_{A \cup S}^{-1} \}_{A \cup S}+\{ \Sigma_{B \cup S}^{-1} \}_{B \cup S}- \{ \Sigma_{S}^{-1} \}_{S},$ where $\Sigma_{U}$ for $U \subset V$ is the $[U,U]$ block of $\Sigma$ (in `R`, `sigma[U, U]`; in `numpy`, `sigma[np.ix_(U, U)]`), and $\Sigma^{-1}_{U}:= (\Sigma_{U})^{-1}$ is the concentration matrix of $x_{U}$, which is $\ne K_{U}$ in general.
>
> > [!help]- Definition of the padding $\{ M \}_{U}$
> > Let $U \subset V=\{ 1,\dots,p \}$ and $M \in \mathbb{R}^{| U |\times | U |}$. Then $\{ M \}_{U}$ is $M$ padded with zeros to a $| V | \times | V |$ matrix, with $M$ occupying the $[U,U]$ block. More precisely,
> > ```python
> > import numpy as np
> >
> > def padding(M, U, V):
> >     # M: |U| x |U| matrix; U, V: lists of (0-based) indices with U a subset of V
> >     padded_matrix = np.zeros((len(V), len(V)))
> >     padded_matrix[np.ix_(U, U)] = M  # np.ix_ selects the whole [U, U] block
> >     return padded_matrix
> > ```
>
^a6cfdf
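As a sanity check of the theorem (a minimal sketch, reusing the toy chain from above with the decomposition $A=\{ 1 \}$, $B=\{ 3 \}$, $S=\{ 2 \}$; the `pad` helper is a hypothetical stand-in for the $\{ M \}_{U}$ padding):
```R
# zero-pad an |U| x |U| matrix M into the [U, U] block of a p x p zero matrix
pad <- function(M, U, p) {
  out <- matrix(0, p, p)
  out[U, U] <- M
  out
}
A <- 1; B <- 3; S <- 2; p <- 3
K_rebuilt <- pad(solve(Sigma[c(A, S), c(A, S)]), c(A, S), p) +
  pad(solve(Sigma[c(B, S), c(B, S)]), c(B, S), p) -
  pad(solve(Sigma[S, S, drop = FALSE]), S, p)
all.equal(K_rebuilt, K)  # TRUE: the padded block concentrations recombine to K
```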
In the context of a decomposable graph $G$, ![[Undirected Graphical Models#^582a3a|induced decomposition]]
Applying the theorem above recursively along this induced decomposition gives
> [!theorem|*] Gaussians on RI Clique Sequences
> $K=\sum_{k}\left(\{ \Sigma^{-1}_{C_{k}} \}_{C_{k}}-\{ \Sigma^{-1}_{S_{k}} \}_{S_{k}}\right).$
```R fold title:"Implementation in R"
padded_inverse = function(cov, subset){
###
# Given a covariance matrix `cov` and a subset of
# variables `subset`, returns the "padded inverse"
# given by
# {cov[subset, subset] ^ -1}_{subset, subset}
# i.e. a matrix with the same size as cov, filled
# with 0's except for the row/columns indexed by
# `subset`, where it contains the concentration
# matrix of `subset` (i.e. inverse of its covariance).
###
subset_cov = cov[subset, subset] # covariance of the subset
subset_con = solve(subset_cov) # concentration of the subset
# prepare an empty/zero matrix
p = nrow(cov)
padded_con = matrix(0, nrow = p, ncol = p)
colnames(padded_con) = colnames(cov)
rownames(padded_con) = rownames(cov)
# fill in the non-zero entries with the concentration
padded_con[subset, subset] = subset_con
return(padded_con)
}
```
```R
# a = analysis, s = stats, l = algebra, m = mech., v = vector
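# marks_cov: observed covariance matrix of the five scores (assumed available)
# cliques: {a, s, l} and {m, l, v}; separator: {l}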
graph_concentration = (
padded_inverse(marks_cov, c('a', 's', 'l'))
+ padded_inverse(marks_cov, c('m', 'l', 'v'))
- padded_inverse(marks_cov, c('l'))
)
graph_covariance = solve(graph_concentration)
```
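A couple of quick checks on the result (a sketch, assuming the objects from the block above; exact zeros appear at the non-edges, and, as discussed further below, the fitted covariance agrees with the observed one on each clique):
```R
# non-edges {a,v}, {a,m}, {s,v}, {s,m} get exactly zero concentration
graph_concentration[c('a', 's'), c('v', 'm')]
# on a clique, the fitted covariance matches the observed covariance
all.equal(graph_covariance[c('a', 's', 'l'), c('a', 's', 'l')],
          marks_cov[c('a', 's', 'l'), c('a', 's', 'l')])
```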
## MLE Inference of $\Sigma$
Suppose we have an i.i.d. sample $\mathbf{x}=(x^{(1)},\dots,x^{(n)})$, and we want to estimate $\Sigma$.
We could fit an MLE willy-nilly, say using the sample covariance matrix
$\hat{\Sigma}:= \frac{1}{n}\sum_{i}x^{(i)}x^{(i)T},$
but of course there is no guarantee that $N(\mathbf{0},\hat{\Sigma})$ will be Markov/factorizable wrt. our graph $G$.
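For concreteness, a minimal sketch (assuming the observations are stacked as the rows of an $n \times p$ matrix `X`, already centred so that $\pmb{\mu}=\mathbf{0}$):
```R
# unconstrained MLE of Sigma under mu = 0
n <- nrow(X)
Sigma_hat <- crossprod(X) / n  # t(X) %*% X / n = (1/n) * sum_i x_i x_i^T
```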
### Constraints on the MLE
Let our estimator be called $\hat{\Sigma}^{G}$ with concentration $\hat{K}^{G}:= (\hat{\Sigma}^{G})^{-1}$, so the Markov/factorizability constraint can be written as
$\forall i,j \in V, i \not \sim j,~~ \hat{K}^{G}_{ij}=0.~~~({\dagger})$
Under this restriction, the sufficient statistics of the family $\{ p \text{ Gaussian}~|~ p~\text{Markov wrt. } G \}$ are the terms of $X^{T}KX=\sum_{i}K_{ii}X_{i}^{2}+2\sum_{i<j,\, i \sim j}K_{ij}X_{i}X_{j}$ that multiply non-zero entries of $K$, i.e. $\{ X_{i}^{2} ~|~ i \in V \} \cup \{ X_{i}X_{j} ~|~ i,j\in V,~ i \sim j \}.$
Because of [[Minimality and Sufficiency#^8d3447|a special property of exponential families]], the MLE $\hat{\Sigma}^{G},\hat{K}^{G}$ must *match expectations*, $\mathbb{E}[X_{i}X_{j}]=\mathrm{Cov}(X_{i},X_{j})=\hat{\Sigma}^{G}_{ij}$, with the observed sufficient statistics $\overline{x_{i}x_{j}}=\hat{\Sigma}_{ij}$, where again $\hat{\Sigma}$ is the sample covariance and the (unconstrained) MLE. That is, $\forall i,j \in V,~ i \sim j \text{ or } i=j,~~ \hat{\Sigma}^{G}_{ij}=\hat{\Sigma}_{ij}.~~~({\ddagger})$
> [!examples] Student Test Score Example
>
> For example, if the graph is
> ```mermaid
> graph TD
> A[Analysis A] --- L[Algebra L]
> A --- S[Statistics S]
> L --- S
> V[Vectors V] --- L
> L --- M[Mechanics M]
> V --- M
> ```
>
> and the student's score $X=(X_{a},X_{s},X_{l},X_{v},X_{m})$ has maximal cliques (of size $>2$) $\{ a,l,s \}$ and $\{ l,v,m \}$.
>
> Therefore, $\hat{K}^{G}$ must look like $\hat{K}^{G}= \begin{array}{c|cc|c|cc} & a & s & l & v & m\\ \hline a & * & * & * & & \\ s & * & * & * & & \\ \hline l & * & * & * & * & * \\ \hline v & & & * & * & * \\ m & & & * & * & * \end{array} ~~~(\dagger)$ to match the graph (the blank entries are $0$).
>
> $\hat{\Sigma}^{G}$ must look like
> $\hat{\Sigma}^{G}= \begin{array}{c|cc|c|cc} & a & s & l & v & m\\ \hline a & \hat{\Sigma}_{aa} & \hat{\Sigma}_{as}& \hat{\Sigma}_{al} & ? & ? \\ s & \hat{\Sigma}_{as} & \hat{\Sigma}_{ss}& \hat{\Sigma}_{sl} & ? & ? \\ \hline l & \hat{\Sigma}_{al} & \hat{\Sigma}_{sl} & \hat{\Sigma}_{ll} & \hat{\Sigma}_{lv} & \hat{\Sigma}_{lm} \\ \hline v & ?& ?& \hat{\Sigma}_{lv} & \hat{\Sigma}_{vv} & \hat{\Sigma}_{vm} \\ m & ? & ? & \hat{\Sigma}_{lm} & \hat{\Sigma}_{vm} & \hat{\Sigma}_{mm} \end{array}~~~({\ddagger})$ to match the observed moments from $\hat{\Sigma}$ (the $?$ entries are not directly constrained; they are pinned down by $(\dagger)$).
### Fitting the Constrained MLE
Now how do we find the MLE $\hat{\Sigma}^{G},\hat{K}^{G}$ that satisfy those requirements? Answer: we explicitly construct $\hat{K}^{G}$ in such a way that it has the correct sparsity structure (has $0