There are cases where the traditional fixed-effect linear model (essentially [[Linear Regression Methods]]) does not satisfy modeling needs.

> [!examples]
> Consider an experiment where six students were randomly assigned to two different study plans (treatment levels $0,1$), encoded as $X$. Their scores on $4$ exams taken after the study plans are the response $Y$.
> To account for their different starting points, we also include student-specific effects, with the students' identities one-hot encoded as $Z$.
>
> Naive modeling with fixed effects yields the model
> $$Y=X\beta+Zb+\epsilon, ~\epsilon \sim N(0, \sigma^{2}_{\epsilon}),$$
> and the six students' scores are $\mathrm{iid.}$ instances of $Y$.
>
> However, it has the following issues:
> - (1) By design, $X$ is nested inside $Z$ (each student receives exactly one treatment), which causes perfect collinearity.
> - (2) $Z$ completely confounds the effect of $X$ -- we are interested in estimating $\beta$, but not $b$ (which will not generalize beyond the 6 test subjects).

This motivates the **mixed model**, where we model $b$ as **random effects**:
$$b_{1:q} \sim N(\mathbf{0}, \Sigma), ~\pmb{\epsilon}_{1:n} \sim N(\mathbf{0}, \Lambda), ~b \perp \pmb{\epsilon},$$
where subscripts indicate vector dimensions; $q=6$ in the example above, and $n=4\times 6=24$ is the total number of exams taken by those $6$ students. This is because *we only care about the random effects' variance $\Sigma$ relative to the fixed effect $\beta$'s size*, i.e. whether the treatment's effect is large compared to the natural variation between students.

### Relationship to Ridge-Like Penalties

Since
$$Y~|~ b \sim N(X\beta+Zb, \Lambda),$$
the posterior log-likelihood of the model is
$$\begin{align*} l(\beta,b)&= \log p(\mathbf{y}~|~b ;\beta) +\log \pi(b)\\ &\approx -\| \mathbf{y}-\mathbf{X}\beta-\mathbf{Z}b \|_{\Lambda^{-1}}^{2}- \| b \|_{\Sigma^{-1}}^{2}, \end{align*}$$
where $\approx$ means that the two are optimized by the same $\beta,b$, and $\|v \|_{M}^{2}:=v^{T}Mv$ is the squared Mahalanobis norm under $M$.

Rewriting with the concatenations $\mathbf{C}:= [\mathbf{X},\mathbf{Z}]$ and $\gamma:=\begin{bmatrix} \beta \\ b\end{bmatrix}$, it "simplifies" to
$$J(\gamma):=-\| \mathbf{y}-\mathbf{C}\gamma \| ^{2}_{\Lambda^{-1}}-\| \gamma \|^{2}_{\widetilde{\Sigma^{-1}}},$$
where $\widetilde{\Sigma^{-1}}:= \begin{bmatrix} 0 & 0 \\ 0 & \Sigma^{-1}\end{bmatrix}$ is $\Sigma^{-1}$ padded with zeros to avoid penalizing $\beta$.

Therefore, maximizing the likelihood wrt. $\gamma=\begin{bmatrix}\beta\\b\end{bmatrix}$ is equivalent to a weighted linear regression with a [[Ridge Penalty]] on $\gamma$, with penalty matrix $\widetilde{\Sigma^{-1}}$.

### BLUP of $\beta,b$

From the ridge-like formulation, it is easy to derive the BLUP (best linear unbiased prediction) of $\beta,b$: setting
$$\nabla _{\gamma}J(\gamma) \propto \mathbf{C}^{T}\Lambda^{-1}\mathbf{y}-\mathbf{C}^{T}\Lambda^{-1}\mathbf{C}\gamma-\widetilde{\Sigma^{-1}}\gamma \equiv \mathbf{0}$$
and solving gives the BLUP
$$\tilde{\gamma}:= \begin{bmatrix} \tilde{\beta} \\ \tilde{b} \end{bmatrix}= (\mathbf{C}^{T}\Lambda^{-1}\mathbf{C}+\widetilde{\Sigma^{-1}})^{-1}\mathbf{C}^{T}\Lambda^{-1}\mathbf{y}.$$
The predictor has a tilde instead of a hat because we don't know $\Lambda,\Sigma$ in practice, and those have to be estimated.
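To make the closed form concrete, here is a minimal `numpy` sketch on the study-plan toy design, plugging in the isotropic special case $\Sigma=\sigma^{2}I$, $\Lambda=\lambda^{2}I$; the variance components and the simulated $\beta,b$ values are illustrative assumptions, not quantities from the note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design from the example: 6 students, 2 study plans (3 students each),
# 4 exams per student -> n = 24 observations.
n_students, n_exams, n_plans = 6, 4, 2
student = np.repeat(np.arange(n_students), n_exams)   # subject index per row
plan = (student >= 3).astype(int)                      # students 0-2 -> plan 0, 3-5 -> plan 1

X = np.eye(n_plans)[plan]          # one-hot fixed-effect design (24 x 2)
Z = np.eye(n_students)[student]    # one-hot random-effect design (24 x 6)
C = np.hstack([X, Z])              # concatenated design (24 x 8)

# Illustrative variance components (assumed known here; estimated in practice).
sigma2, lam2 = 4.0, 1.0            # Sigma = sigma2 * I_q, Lambda = lam2 * I_n

# Simulate data from the mixed model y = X beta + Z b + eps.
beta_true = np.array([70.0, 75.0])
b_true = rng.normal(0.0, np.sqrt(sigma2), n_students)
y = X @ beta_true + Z @ b_true + rng.normal(0.0, np.sqrt(lam2), n_students * n_exams)

# Padded Sigma^{-1}: zero block for beta (unpenalized), Sigma^{-1} = I / sigma2 for b.
pad = np.zeros((n_plans + n_students, n_plans + n_students))
pad[n_plans:, n_plans:] = np.eye(n_students) / sigma2

# Ridge-like BLUP: (C' Lambda^{-1} C + padded Sigma^{-1})^{-1} C' Lambda^{-1} y,
# with Lambda^{-1} = I / lam2 in this homoscedastic sketch.
gamma_tilde = np.linalg.solve(C.T @ C / lam2 + pad, C.T @ y / lam2)

beta_tilde, b_tilde = gamma_tilde[:n_plans], gamma_tilde[n_plans:]
print("BLUP of beta:", beta_tilde)
print("BLUP of b:   ", b_tilde)
```

Note that in this toy design the penalized system is invertible even though $\mathbf{C}$ itself is rank-deficient (issue (1) above): every null direction of $\mathbf{C}$ has a nonzero $b$-component, which the penalty $\widetilde{\Sigma^{-1}}$ controls.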
In the homoscedastic case where $\Lambda=\sigma^{2}_{\epsilon}I$, we have
$$\tilde{\gamma}= (\mathbf{C}^{T}\mathbf{C}+\sigma^{2}_{\epsilon}\cdot\widetilde{\Sigma^{-1}})^{-1}\mathbf{C}^{T}\mathbf{y}.$$

### Estimation of Random Effect Variance in Balanced Data

Index the data with $i=1,\dots,I$ for treatment (fixed effect) levels, $j=1,\dots,J$ for subject (random effect) levels, and $k$ for the observations made on each subject.
- We assume each subject has the same number ($K$) of observations, as in a **balanced dataset**.

So with one-hot encoding, the naive model (which treats $b_{j}$ as a fixed effect) is
$$y_{ijk}~|~b_{j}=\beta_{i}+b_{j}+\epsilon_{ijk}\sim N(\beta_{i}+b_{j}, \Lambda_{ijk,ijk}).$$
On the other hand, pooling the $K$ observations of the same subject and using random effects gives
$$\begin{align*} \bar{y}_{ij\cdot}:= \mathrm{avg}_{k}\{ y_{ijk}\} &= \beta_{i} + b_{j} + \mathrm{avg}_{k}\{ \epsilon_{ijk} \}\\ &\sim N\left( \beta_{i}, \Sigma_{jj}+\frac{1}{K^{2}}\|{\Lambda}_{ij\cdot,ij\cdot}\| \right), \end{align*}$$
where $\| \cdot \|$ here denotes the sum of every entry of the matrix, so that $\frac{1}{K^{2}}\|\Lambda_{ij\cdot,ij\cdot}\|$ is exactly the variance of the averaged noise.

If we assume $\Sigma=\sigma^{2}I$ and $\Lambda=\lambda^{2}I$, then the two reduce to
$$\begin{align*} y_{ijk}~|~b_{j} &\sim N(\ast, \lambda^{2}),\\ \bar{y}_{ij\cdot} &\sim N\left( \ast, \sigma^{2}+\frac{\lambda^{2}}{K} \right). \end{align*}$$
Therefore, we can get (unbiased) estimates of the two variances using usual linear modeling, and solve for $\widehat{\sigma^{2}}$ and $\widehat{\lambda^{2}}$:

> [!exposition]
> Let $\mathcal{M}_{0}$ be the naive linear model $Y=X\beta+Zb+\epsilon$ that treats $b$ as fixed, and fit OLS to get $\mathrm{RSS}_{0}$. An unbiased estimate of $\lambda^{2}$ is $\widehat{\lambda^{2}}=\frac{\mathrm{RSS}_{0}}{n-q},$ since the fitted values are the $q$ subject means: with $X$ nested inside $Z$, the design $[\mathbf{X},\mathbf{Z}]$ has rank $q$.
> - In the study plan example, the denominator is $24\text{ exams}-6\text{ students}=18.$
>
> Let $\mathcal{M}_{1}$ be the pooled model $\bar{y}_{ij\cdot}=\beta_{i}+\tilde{\epsilon}_{ij}$, where $\tilde{\epsilon}_{ij}=b_{j}+\bar{\epsilon}_{ij\cdot}$ is the pooled noise and random effect combined. Fit it to get $\mathrm{RSS}_{1}$ and the unbiased estimate $\widehat{\sigma^{2}+\frac{\lambda^{2}}{K}}=\frac{\mathrm{RSS}_{1}}{q-I}.$
> - In the example, the denominator is $6 \text{ students}-2\text{ plans}=4$.
>
> Combining the two gives $\widehat{\sigma^{2}}=\frac{\mathrm{RSS}_{1}}{q-I}-\frac{\widehat{\lambda^{2}}}{K}$.
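As a sanity check of the recipe above, here is a small `numpy` sketch on simulated balanced data; the true variance components and fixed effects are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy layout: I = 2 plans, q = 6 students (3 per plan), K = 4 exams each.
I, q, K = 2, 6, 4
n = q * K
student = np.repeat(np.arange(q), K)
plan_of_student = np.array([0, 0, 0, 1, 1, 1])
plan = plan_of_student[student]

# Simulate from the homoscedastic mixed model (sigma2, lam2 are illustrative).
sigma2, lam2 = 4.0, 1.0
beta_true = np.array([70.0, 75.0])
b_true = rng.normal(0.0, np.sqrt(sigma2), q)
y = beta_true[plan] + b_true[student] + rng.normal(0.0, np.sqrt(lam2), n)

# M0: within-subject residuals around each student's mean give RSS_0
# with n - q degrees of freedom (the design [X, Z] has rank q).
subject_means = np.array([y[student == j].mean() for j in range(q)])
rss0 = np.sum((y - subject_means[student]) ** 2)
lam2_hat = rss0 / (n - q)

# M1: per-student means regressed on the plan means give RSS_1
# with q - I degrees of freedom; its mean square estimates sigma2 + lam2 / K.
plan_means = np.array([subject_means[plan_of_student == i].mean() for i in range(I)])
rss1 = np.sum((subject_means - plan_means[plan_of_student]) ** 2)
sigma2_hat = rss1 / (q - I) - lam2_hat / K

print(f"lambda^2 estimate: {lam2_hat:.3f}  (truth {lam2})")
print(f"sigma^2 estimate:  {sigma2_hat:.3f}  (truth {sigma2})")
```

With only $q-I=4$ degrees of freedom for the between-subject mean square, the moment-based $\widehat{\sigma^{2}}$ is noisy and can even come out negative in small samples, in which case it is typically truncated at zero.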