> [!tldr] Log-Linear Model
> The log-linear model is a probability model on a vector-valued RV $X=(X_{1},\dots,X_{p})$ taking values $x \in \mathcal{X}$.
>
> It models $\log p(x)=\sum_{A \subset[p]}\lambda_{A}(x_{A}),$ where $[p]:=\{ 1,\dots,p \}$ is the indexing set of the features, and for identifiability, it imposes the constraint
> $\begin{align*}
> \lambda_{A}({x}_{A})=0 \quad&\text{whenever}\quad \exists j \in A:\, x_{j} = 0 \\[0.4em]
> &\,\iff \text{some entry of }{x}_{A} \text{ takes its baseline value},
> \end{align*}$
> where a baseline is fixed for each entry (WLOG $0$), so *$\lambda_{A}$ models the effect of having $A$ all active.*
This is equivalent to modeling $\log p(x)$ with a linear model (i.e. a [[Generalized Linear Models|Poisson GLM]] with log-link) with an interaction term for every subset $A \subset [p]$ of features.
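To make the parameterization concrete, here is a minimal sketch (Python; the $\lambda$ values are arbitrary made-up numbers) for $p=3$ binary variables with baseline $0$: under the corner constraint, each $\lambda_{A}$ reduces to a single scalar that contributes to $\log p(x)$ only in cells where all of $x_{A}$ are active, and $\lambda_{\varnothing}$ absorbs the normalizing constant.

```python
from itertools import chain, combinations
import numpy as np

p = 3
subsets = list(chain.from_iterable(combinations(range(p), r) for r in range(p + 1)))

rng = np.random.default_rng(0)
lam = {A: rng.normal(scale=0.5) for A in subsets}   # arbitrary interaction terms

def log_p_unnorm(x, lam):
    # lambda_A(x_A) is nonzero only when every entry of x_A is non-baseline (= 1)
    return sum(v for A, v in lam.items() if all(x[j] == 1 for j in A))

cells = [tuple(b) for b in np.ndindex(*(2,) * p)]   # all 2^p cells
logits = np.array([log_p_unnorm(x, lam) for x in cells])
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # lambda_{emptyset} absorbs the normalizer
print(dict(zip(cells, probs.round(4))))
```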
### Application to Contingency Tables
The [[Contingency Table|contingency table]] of a sample $\mathbf{x}:=\{ x^{(n)}: n=1,\dots,N \}$, assuming the samples are i.i.d., has cell counts jointly following
$n(x;\mathbf{x})\sim \mathrm{Multinom}(N, \pi),$ with $\pi_{x}=p(x)$.
Also, a collection of independent Poisson RVs $R_{i}\overset{\mathrm{indep.}}{\sim}\mathrm{Po}(\mu_{i})$ with sum $S:= \sum_{i}R_{i}$ satisfies $(R_{1},\dots,R_{k}) ~|~ S \sim \mathrm{Multinom}\left( S, \pi \right)$ with probabilities $\pi_{i}:= \mu_{i} / \sum_{i'}\mu_{i'}$.
Equating the two, we model each cell count $n(x;\mathbf{x})$ as an independent Poisson with rate $\mu _{x}\propto \mathbb{P}[X=x]$ (then condition on their sum being $N$).
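A quick simulation sketch (not from the note; the `mu` and `N` values are made up) of the Poisson-to-multinomial lemma: keeping only the draws whose sum equals $N$ recovers multinomial behavior.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, 3.0, 5.0])
N = 10

draws = rng.poisson(mu, size=(200_000, len(mu)))    # R_i ~ Po(mu_i), independent rows
kept = draws[draws.sum(axis=1) == N]                # condition on the sum S = N

pi = mu / mu.sum()
# conditional mean of R_1 vs the multinomial mean N * pi_1 (both approx 2.0)
print(kept[:, 0].mean(), N * pi[0])
```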
Now modeling each rate $\mu_{x}$ with the log-linear model, we have $\begin{align*}
n(x;\mathbf{x})&\sim \mathrm{Po}(N\mu_{x}), &&[\text{iid. assumption}]\\
\log \mu_{x}&= \sum_{A \subset[p]}\lambda_{A}(x_{A}). &&[\text{log-linear model}]
\end{align*}$
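For the saturated model (every $\lambda_{A}$ free), the MLE matches $\mu_{x}/N$ to the empirical cell proportions, so $\lambda$ can be read off by Möbius inversion of the empirical log-probabilities. A sketch, assuming binary features and a made-up $2\times 2$ table:

```python
from itertools import chain, combinations
import numpy as np

def powerset(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

counts = {(0, 0): 30, (0, 1): 10, (1, 0): 20, (1, 1): 40}   # hypothetical 2x2 table
N = sum(counts.values())
logp = {x: np.log(c / N) for x, c in counts.items()}

def cell(B, p=2):
    # the cell with x_j = 1 exactly on B, baseline 0 elsewhere
    return tuple(1 if j in B else 0 for j in range(p))

# log p(x) = sum_{A subset supp(x)} lambda_A, so Moebius inversion gives
# lambda_A = sum_{B subset A} (-1)^{|A \ B|} log p(cell(B)).
lam = {tuple(A): sum((-1) ** (len(A) - len(B)) * logp[cell(B)] for B in powerset(A))
       for A in powerset(range(2))}
print(lam)   # lam[(0, 1)] is the log odds-ratio of the 2x2 table
```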
### Fitting with Conditional Independence Assumptions
We do not need to include a $\lambda_{A}$ for every single subset $A \subset V$. In particular, suppose we require the variables $V:= [p]$ to be "factorized" as $X_{A}\perp X_{B} ~|~ X_{S}, \text{ where }A,B,S \text{ is a partition of }V,$ similar to [[Undirected Graphical Models#^464932|factorization on a decomposable graph $G=(A,S,B)$]] but without any assumption on $A,B,S$ (i.e. they don't have to be cliques). It is not immediately obvious how to fit a log-linear model $\lambda$ that satisfies this independence.
Using standard properties of conditional independence, the distribution must have $p(x_{V})\cdot p(x_{S})=p(x_{A \cup S})\cdot p(x_{B \cup S}).$
Taking logarithms under the log-linear model, we get $ \sum_{U \subset V}\lambda_{U}= \sum_{U \subset A\cup S}\lambda_{U}+ \sum_{U \subset B \cup S}\lambda_{U}-\sum_{U \subset S}\lambda_{U}, ~~~~~(\dagger)$ where the factorization forces $\lambda_{U}=0$ for any $U$ with $U\cap A \ne \varnothing$ and $U \cap B \ne \varnothing$ (such $U$ appears on the left of $(\dagger)$ but in none of the sums on the right); in other words, *there is no interaction between $A$ and $B$ beyond what's in $S$.*
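A numeric sketch of this claim (Python; the $\lambda$ values are arbitrary, with $A=\{0\}$, $S=\{1\}$, $B=\{2\}$ zero-indexed): zeroing out every cross-interaction $\lambda_{U}$ yields a joint that satisfies $p(x)\,p(x_{S})=p(x_{A\cup S})\,p(x_{B\cup S})$ on every cell.

```python
from itertools import chain, combinations
import numpy as np

A, S, B = {0}, {1}, {2}
subsets = list(chain.from_iterable(combinations(range(3), r) for r in range(4)))

rng = np.random.default_rng(2)
lam = {U: rng.normal() for U in subsets}
for U in subsets:
    if set(U) & A and set(U) & B:       # cross-interaction between A and B: forced to zero
        lam[U] = 0.0

cells = [tuple(b) for b in np.ndindex(2, 2, 2)]
logits = np.array([sum(v for U, v in lam.items() if all(x[j] for j in U)) for x in cells])
probs = np.exp(logits)
probs /= probs.sum()
joint = dict(zip(cells, probs))

def marg(x, keep):
    # marginal probability of the event {X_j = x_j for all j in keep}
    return sum(pi for y, pi in joint.items() if all(y[j] == x[j] for j in keep))

for x in cells:
    assert np.isclose(joint[x] * marg(x, S), marg(x, A | S) * marg(x, B | S))
print("p(x) p(x_S) = p(x_{A u S}) p(x_{B u S}) holds on all cells")
```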
How does this help us find $\lambda$? We have broken the task of fitting $\lambda_{U}$ on $X_{V}$ into fitting it on $X_{A\cup S},X_{B\cup S},X_{S}$, corresponding to the three terms of $(\dagger)$:
- Each $\lambda_{U}$ can be fitted from less data, i.e. a smaller marginal table: if $U \subset A \cup S$ (and $U \not \subset S$), for example, we can ignore $X_{B}$ altogether and fit $\lambda_{U}$ on the $A \cup S$-marginal, as $\lambda_{U}$ does not appear in $\sum_{U' \subset B \cup S}\lambda_{U'}$ or $\sum_{U' \subset S}\lambda_{U'}$ at all.
- Moreover, by fitting $\lambda$ on those three subsets (which is straightforward without independence assumptions), we can obtain a log-linear model on $V$ that satisfies the conditional independence that we assume.
Let $\lambda_{U}^{(V')}$ denote the $\lambda_{U}$ fitted using the marginal $x_{V'}$, where $V' \subset V$; then we use $(\dagger)$ to fit $\lambda^{(V)}$ by *defining* it via $\sum_{U \subset V}\lambda_{U}^{(V)}:=\sum_{U \subset A\cup S}\lambda_{U}^{(A \cup S)}+ \sum_{U \subset B \cup S}\lambda_{U}^{(B \cup S)}-\sum_{U \subset S}\lambda_{U}^{(S)}.$
So for each $U$, we identify terms on the two sides of $(\dagger)$ (see the sketch below):
- For $U \subset A \cup S$ and $U \not \subset S$ (similarly $B \cup S$), let $\lambda_{U}^{(V)}:=\lambda_{U}^{(A \cup S)}$.
- For $U \subset S$, let $\lambda^{(V)}_{U}:=\lambda^{(A \cup S)}_{U}+\lambda^{(B \cup S)}_{U}- \lambda^{(S)}_{U}$.
- For the remaining $U$ (those intersecting both $A$ and $B$), let $\lambda^{(V)}_{U}:=0$.
Now this model follows the conditional independence assumption we made.
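The sketch below assembles $\lambda^{(V)}$ exactly this way for three binary variables: fit each marginal table (here saturated, by the same Möbius inversion as above) and combine by the case analysis. The toy factors `f`, `g` are made up; since the toy joint satisfies $X_{A}\perp X_{B}~|~X_{S}$, the assembled model reproduces it exactly.

```python
from itertools import chain, combinations
import numpy as np

def powerset(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

A, S, B = {0}, {1}, {2}
cells = [tuple(b) for b in np.ndindex(2, 2, 2)]

# toy joint with X_0 independent of X_2 given X_1: p(x) propto f(x_0,x_1) g(x_1,x_2)
f = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 1}
g = {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 4}
w = np.array([f[x[0], x[1]] * g[x[1], x[2]] for x in cells], dtype=float)
joint = dict(zip(cells, w / w.sum()))

def marg_logp(W):
    # log of the marginal table over W, keyed by x_W (tuple over sorted(W))
    tbl = {}
    for x, pi in joint.items():
        key = tuple(x[j] for j in sorted(W))
        tbl[key] = tbl.get(key, 0.0) + pi
    return {k: np.log(v) for k, v in tbl.items()}

def fit_lambda(W):
    # saturated fit on the W-marginal via Moebius inversion (binary features)
    logp, Ws = marg_logp(W), sorted(W)
    def cell(Bs):
        return tuple(1 if j in Bs else 0 for j in Ws)
    return {U: sum((-1) ** (len(U) - len(Bs)) * logp[cell(Bs)] for Bs in powerset(U))
            for U in powerset(Ws)}

lamAS, lamBS, lamS = fit_lambda(A | S), fit_lambda(B | S), fit_lambda(S)

lamV = {}
for U in map(set, powerset(range(3))):
    key = tuple(sorted(U))
    if U <= S:
        lamV[key] = lamAS[key] + lamBS[key] - lamS[key]
    elif U <= A | S:
        lamV[key] = lamAS[key]
    elif U <= B | S:
        lamV[key] = lamBS[key]
    else:
        lamV[key] = 0.0     # U meets both A and B: no cross-interaction

# sanity check: the assembled lambda^(V) reproduces the joint exactly
logq = np.array([sum(v for U, v in lamV.items() if all(x[j] for j in U)) for x in cells])
assert np.allclose(np.exp(logq), w / w.sum())
print("assembled lambda^(V) reproduces the joint")
```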
- The sample's log-likelihood is $\begin{align*}
l(\mathbf{x})&= \sum_{x}n(x;\mathbf{x})\cdot\log p(x) \\
&= \sum_{x}n(x;\mathbf{x}) \cdot \left( \sum_{U \subset A\cup S}\lambda_{U}+ \sum_{U \subset B \cup S}\lambda_{U}-\sum_{U \subset S}\lambda_{U} \right)\\
&= \sum_{U \subset A \cup S}\sum_{x}n(x;\mathbf{x})\cdot\lambda_{U}(x_{U}) +\sum_{U \subset B \cup S} (\dots) - \sum_{U \subset S} (\dots)\\
&= \sum_{U \subset A \cup S} \lambda_{U}\cdot n(x_{U}=\mathbf{1}; \mathbf{x}) +\sum_{U \subset B \cup S} (\dots) - \sum_{U \subset S} (\dots),
\end{align*}$
where the last line assumes binary features: $\lambda_{U}(x_{U})$ is then a single scalar $\lambda_{U}$ appearing exactly in the cells with $x_{U}=\mathbf{1}$ (all of $U$ active), and $n(x_{U}=\mathbf{1};\mathbf{x})$ counts the samples in those cells. So the likelihood touches the data only through the marginal counts on $A \cup S$, $B \cup S$, and $S$.
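A sketch verifying the exchange of sums in the last step: for binary features with arbitrary made-up $\lambda$'s and cell counts, $\sum_{x} n(x;\mathbf{x})\,\lambda_{U}(x_{U}) = \lambda_{U}\cdot n(x_{U}=\mathbf{1};\mathbf{x})$ for each $U$, so the two ways of computing $l(\mathbf{x})$ agree.

```python
from itertools import chain, combinations
import numpy as np

p = 3
subsets = list(chain.from_iterable(combinations(range(p), r) for r in range(p + 1)))
cells = [tuple(b) for b in np.ndindex(*(2,) * p)]

rng = np.random.default_rng(3)
lam = {U: rng.normal(scale=0.3) for U in subsets}
logits = np.array([sum(v for U, v in lam.items() if all(x[j] for j in U)) for x in cells])
lam[()] -= np.log(np.exp(logits).sum())   # fold the normalizer into lambda_{emptyset}
logp = {x: sum(v for U, v in lam.items() if all(x[j] for j in U)) for x in cells}

n = dict(zip(cells, rng.integers(0, 20, size=len(cells))))   # hypothetical cell counts

ll_direct = sum(n[x] * logp[x] for x in cells)                       # sum_x n(x) log p(x)
ll_margin = sum(lam[U] * sum(n[x] for x in cells if all(x[j] for j in U))
                for U in subsets)                                    # sum_U lam_U n(x_U = 1)
assert np.isclose(ll_direct, ll_margin)
print(ll_direct, ll_margin)
```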