Suppose we want to study a response $Y$ that may be affected by a treatment $T$, other covariates $X$, and potentially unobserved ones $U$. We have collected data in the design matrix $\mathbf{X}$, which includes $T$, $X$, and a column of constants.
- Write $\mathbf{X}^{\ast}$ as the block containing $X$ only.
Let the full OLS $Y \sim 1+T+X$ return coefficients $\hat{\beta}=(\hat{\beta}_{0}, \hat{\beta}_{T},\hat{\beta}_{X})$. Then we can interpret $\hat{\beta}_{T}$ as the "best" constant-valued estimate of the effect of the treatment.
Then by [[Orthogonal Projection, Confounding, and Missing Variable in OLS#^4760b7|FWL theorem]], $\hat{\beta}_{T}$ can also be obtained by:
- Let $\tilde{T}$ be the OLS residuals from regressing $T \sim 1+X$ (the intercept must be included, since the constant is among the other regressors); this orthogonalizes the observed values of $T$ wrt. $X$.
- Regressing $Y\sim 1+\tilde{T}$ then returns the same $\hat{\beta}_{T}$.
> [!idea] Therefore, OLS can be interpreted as finding the effect of the treatment that is orthogonal to that of other covariates.
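A minimal numerical check of this equivalence (a sketch on simulated data, `numpy` only; all variable names here are ours):
```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
t = 0.5 * x + rng.normal(size=n)   # treatment correlated with the covariate
y = 1.0 + 2.0 * t + 3.0 * x + rng.normal(size=n)

# Full regression Y ~ 1 + T + X
beta_full = np.linalg.lstsq(np.column_stack([np.ones(n), t, x]), y, rcond=None)[0]

# FWL: residualize T on (1, X), then regress Y on (1, T_tilde)
G = np.column_stack([np.ones(n), x])
t_tilde = t - G @ np.linalg.lstsq(G, t, rcond=None)[0]
beta_fwl = np.linalg.lstsq(np.column_stack([np.ones(n), t_tilde]), y, rcond=None)[0]

assert np.isclose(beta_full[1], beta_fwl[1])   # both recover the same beta_T
```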
## Justification for Using OLS for Causal Inference
But in causal inference, we wish to find $\mathbb{E}[Y_{1}-Y_{0}]$, and with the assumption that there is no confounding or selection bias, we have $Y_{0,1}\perp T ~|~ X$, so the above equals $\mathbb{E}_{X}\big[\mathbb{E}[Y_{1}~|~X]-\mathbb{E}[Y_{0}~|~X]\big]=\mathbb{E}_{X}\big[\mathbb{E}[Y_{1}~|~X,T]-\mathbb{E}[Y_{0}~|~X,T]\big].$ Now linear regression estimates $\mathbb{E}[Y~|~X,T]$ because of [[Decision Theory#^bf20fe|the mean being optimal for the $l_{2}$ loss used by OLS]], and the coefficient $\hat{\beta}_{T}$ gives what we want: $\begin{align*}
\hat{\beta}_{T}&= \hat{\mathbb{E}}[Y~|~X,T=1]-\hat{\mathbb{E}}[Y ~|~ X, T=0] &[\text{additive model}]\\
&= \hat{\mathbb{E}}[Y_{1}~|~X,T=1]-\hat{\mathbb{E}}[Y_{0}~|~X,T=0] &[\text{def. of treatment}]\\
&= \hat{\mathbb{E}}[Y_{1}~|~X]-\hat{\mathbb{E}}[Y_{0}~|~X], & [\text{non-confoundedness}]
\end{align*}$ where $\hat{\mathbb{E}}$ is the OLS estimate of the expectation. Furthermore, the OLS model assumes that the effect is constant (i.e. $\hat{\beta}_{T}$ does not vary with $X$), so averaging over $X$ gives
*$\hat{\beta}_{T}=\hat{\mathbb{E}}[Y_{1}-Y_{0}].$*
Note that $\hat{\beta}_{T}$ is already an estimate of the expectation $\mathbb{E}[Y_{1}-Y_{0}~|~X]$, and averaging over $\mathbf{X}^{\ast}$ is in fact another layer of estimation (using sample mean to estimate population mean $\mathbb{E}_{X}$).
For a continuous treatment (assuming differentiability), we can instead interpret $\beta_{T}=\partial Y / \partial T$ as the **sensitivity** (e.g. the elasticity if $Y$ is sales and $T$ the price; see the sketch below).
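To make the elasticity connection precise (a sketch, assuming $Y,T>0$): in a log-log specification the coefficient is exactly the elasticity, since $\log Y=\beta_{0}+\beta_{T}\log T+\dots \implies \beta_{T}=\frac{\partial \log Y}{\partial \log T}=\frac{\partial Y / Y}{\partial T / T}.$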
## Adding Predictors
As explained in [[Confounding and Selection Bias in Causal Inference]], we want to add confounding variables into our model to control for them, but not things like mediators (controlling for which causes selection bias).
On the other hand, there are other types of predictors that can help improve the model.
### Good Predictors of the Response
Suppose there is a covariate $Z$ that has no causal relationship with $T$ and is uncorrelated with it -- so controlling for it affects neither confounding nor selection bias. For example,
```mermaid
flowchart TD
Z[new covariate Z] --> Y[response Y]
X[already in model X] --> Y
T[treatment T] --> Y
```
As $Z \perp T$, we can assume that the observed values $\mathbf{z},\mathbf{t}$ are also (almost) orthogonal (in the linear algebra sense), and the FWL theorem guarantees that the OLS coefficient $\hat{\beta}_{T}$ will be (nearly) identical when we add $Z$ to the model.
Should we do so? It depends on a number of factors:
- In practice the orthogonality is not strict: $\mathbf{z}^{T}\mathbf{t}$ is only approximately $0$, so the coefficient will change a bit.
- More importantly, *if $Z$ is a good predictor of $Y$, adding it to the model will greatly decrease $\mathrm{RSS}$ and by extension $\hat{\sigma}^{2}$, making the coefficient estimates more significant* (see the simulation after this list).
- Conversely, adding a poor predictor inflates $\hat{\sigma}^{2}$ (if we use the $\mathrm{RSS} / (n-p)$ estimate), since $\mathrm{RSS}$ barely drops while $n-p$ shrinks.
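A quick simulation of this variance effect (a sketch with `statsmodels`; the data-generating numbers and names are ours):
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
t = rng.normal(size=n)                 # treatment
z = rng.normal(size=n)                 # good predictor of Y, independent of T
y = 1.0 + 0.3 * t + 2.0 * z + rng.normal(size=n)

without_z = sm.OLS(y, sm.add_constant(t)).fit()
with_z = sm.OLS(y, sm.add_constant(np.column_stack([t, z]))).fit()

# beta_T barely moves (FWL + near-orthogonality), but its standard error shrinks
print(without_z.params[1], without_z.bse[1])
print(with_z.params[1], with_z.bse[1])
```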
### Instrumental Variables
Suppose there is [[Orthogonal Projection, Confounding, and Missing Variable in OLS#Missing Variable Bias in OLS|unobserved variable bias]], where the causal graph is
```mermaid
flowchart TD
Z[instrumental variable Z] --> T[treatment T]
T --> Y[response Y]
U[unobserved confounder U] --> T
U --> Y
X[covariates X] --> Y
```
where $U$ can be either left out on purpose (sensitive questions that invite biased answers), neglected (we forgot to ask), or simply not measurable (e.g. intelligence).
This causes the issue that if the true effect of $T$ is $\beta_{T}$, running OLS with the observed variables gives the model $Y=\beta_{0}+\beta_{T}T+\beta_{X}X+\underbrace{\beta_{U}U+\epsilon}_{\epsilon'},$ with errors $\epsilon'$ that are correlated with $T$ (through $U$), biasing $\hat{\beta}_{T}$.
But in this case *we can estimate the effect of treatment via an instrumental variable*, defined as:
> [!definition|*] Instrumental Variable
> A variable $Z$ is an **instrumental variable** for treatment $T$ and response $Y$ if:
> - $Z \not \perp T$, and ideally $\rho_{ZT}$ is large, i.e. $Z$ has an effect on $T$, ideally one well-approximated by a linear relationship (the first-stage, or relevance, condition).
> - $Y \perp Z ~|~ T$, i.e. the **exclusion restriction** that $Z$ only affects $Y$ via $T$.
With those assumptions we have $\mathrm{Cov}(Z, Y~|~X)=\mathrm{Cov}(Z, \beta_{T}T~|~X)=\beta_{T}\cdot \mathrm{Cov}(Z,T~|~X).$
Solving for $\beta_{T}$, we find $\beta_{T}=\frac{\mathrm{Cov}(Y, Z~|~X)}{\mathrm{Cov}(Z,T~|~X)}\approx\frac{\hat{\beta}_{Z:Y\sim Z+X}}{ \hat{\beta}_{Z: T\sim Z+X}} := \frac{\text{Reduced Form}}{\text{1st Stage}}.$ Therefore, we only need to run the regressions $Y \sim Z+X$ and $T \sim Z+X$ to determine the effect of $T$, even if there are unobserved confounders.
- This process is implemented in the Python module `linearmodels` as `iv.IV2SLS`, which also provides standard error estimates; see the sketch after this list.
- In the case of a binary IV, the $\mathrm{RHS}$ further reduces to the **Wald estimator** $\frac{\bar{y}_{1}-\bar{y}_{0}}{\bar{t}_{1}-\bar{t}_{0}},$ where a subscript of $0$ indicates averaging over observations with $Z=0$, and similarly for $1$. You can think of this as the slope between the two points $(\bar{t}_{0}, \bar{y}_{0})$ and $(\bar{t}_{1}, \bar{y}_{1})$.
> [!info] Vector-valued IV and treatments
> Of course $Z,T$ can be high-dimensional, e.g. a dummy of a categorical variable. In that case, $\hat{\beta}_{T \sim Z}$ is a $(z \times t)$ coefficient-matrix, where $z,t$ are dimensions of the two RVs.
>
> `linearmodels.iv.IV2SLS` also handles that.
- The standard error of $\hat{\beta}_{Y \sim T}$ naturally depends on the strength of the correlation between $T$ and $Z$: a weak instrument causes a huge standard error in the estimate.
- The 2SLS estimates are biased towards the OLS $\hat{\beta}_{T:Y \sim T + X}$, so it is a biased (but consistent) estimator of the true relationship.
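A sketch of both routes on simulated data (assuming `linearmodels` is installed; the data-generating numbers and names are ours):
```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

rng = np.random.default_rng(2)
n = 2_000
u = rng.normal(size=n)                    # unobserved confounder
z = rng.normal(size=n)                    # instrument
t = 0.8 * z + u + rng.normal(size=n)      # first stage
y = 2.0 * t + u + rng.normal(size=n)      # true beta_T = 2
df = pd.DataFrame({"y": y, "t": t, "z": z})

naive = np.polyfit(t, y, 1)[0]            # OLS slope of Y ~ T, biased by U
rf = np.polyfit(z, y, 1)[0]               # reduced form: slope of Y ~ Z
fs = np.polyfit(z, t, 1)[0]               # first stage:  slope of T ~ Z
print(naive, rf / fs)                     # the ratio is close to 2; naive OLS is not

# The same estimate, with standard errors, via linearmodels
res = IV2SLS.from_formula("y ~ 1 + [t ~ z]", df).fit()
print(res.params["t"], res.std_errors["t"])
```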
### Compliance and LATE
In real-life experiments, there is a difference between the assignment of treatment and the actual receipt of it, with the causal graph looking like
```mermaid
flowchart LR
Z[Assignment Z] --> T[Treatment T]
T --> Y[Response Y]
```
(with other covariates omitted). In this way, *assignment $Z$ can be thought of as another layer of treatment -- that on $T$*. Therefore we can write $T_{0},T_{1}$ to give $T=\begin{cases}
T_{0} & \text{if }Z=0, \\
T_{1} & \text{if }Z=1,
\end{cases}$ and if the assignment is suitably random, we have $T_{0},T_{1} \perp Z$.
According to the values $T_{0},T_{1}$ take, we have the following classification:
- If $T_{i}=i$, they are **compliers**.
- If $T_{i} = \lnot i$, they are **defiers** (rare in most experiments).
- If $T_{i}=1$ for both $i$, they are **always-takers**.
- If $T_{i}=0$ for both, they are **never-takers** (e.g. people with Nokia phones who cannot receive a fancy ad-treatment).
The last three can muddy the waters when estimating the ATE; for the case of always-takers: ![[NonCompliance.png#invert|center]]
- We cannot estimate the effect $T \to Y$ with the replacement $Z\to Y$, as averaging over $\{ j: Z_{j}=i \}$ to estimate $\mathbb{E}[Y_{i}]$ would include always- and never-takers, biasing the result.
- We also cannot use the naïve estimate $\hat{\mathbb{E}}[Y~|~T=1]-\hat{\mathbb{E}}[Y ~|~ T=0]$ by averaging over $\{ j:T_{j}=i \}$ (even if we can observe $T$) like in perfect randomized control trials (RCTs), because *although $Z$ is randomized, $T$ can be affected by confounders*:
```mermaid
flowchart TD
Z[Assignment Z] --> T[Treatment T]
U[Confounder U] --> T
T --> Y[Response Y]
U --> Y
```
Notice how similar this graph is to that of IVs: *treat assignment $Z$ as the IV for treatment $T$*, and the Wald estimator becomes $\frac{\bar{y}_{1}-\bar{y}_{0}}{\bar{t}_{1}-\bar{t}_{0}}=\frac{\hat{\mathbb{E}}[Y~|~Z=1]-\hat{\mathbb{E}}[Y ~|~ Z=0]}{\hat{\mathbb{E}}[T~|~Z=1]-\hat{\mathbb{E}}[T ~|~ Z=0]}.$ The denominator is just $\hat{\mathbb{E}}[T_{1}-T_{0}]$, an estimate of the compliance rate $\mathbb{P}[\text{compliance}]$: always/never-takers contribute $0$ to the expectation, and we assume there are no defiers.
The numerator can be further conditioned to be $\begin{align}
\hat{\mathbb{E}}[Y&~|~T=T_{1},Z=1]-\hat{\mathbb{E}}[Y ~|~ T=T_{0},Z=0] \\
&= \hat{\mathbb{E}}[Y~|~T=T_{1}]-\hat{\mathbb{E}}[Y ~|~ T=T_{0}] &[\substack{\text{exclusion} \\ \text{restriction}}] \\
&=\mathbb{P}(\text{compliance}) \cdot \hat{\mathbb{E}}[Y_{1}-Y_{0} ~|~ \text{compliance}].\end{align}$Again, the last step follows from always/never-takers having the two terms cancel out.
Therefore, the Wald estimator reduces to $\hat{\mathbb{E}}[Y_{1}-Y_{0} ~|~ \text{compliance}],$ also known as the **local average treatment effect (LATE)**. Assuming compliance is good, this will be close to the true ATE.
- Another way of thinking is that a good compliance makes the first stage $Z \to T$ strong, giving a good IV estimate in the end.
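As a minimal sketch of this computation (the helper name is ours, not a library function):
```python
import numpy as np

def wald_estimator(y, t, z):
    """LATE estimate from a binary assignment z (a sketch)."""
    y, t, z = map(np.asarray, (y, t, z))
    itt = y[z == 1].mean() - y[z == 0].mean()         # intent-to-treat effect on Y
    compliance = t[z == 1].mean() - t[z == 0].mean()  # estimated compliance rate
    return itt / compliance
```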
### Regression Discontinuity Design
A special case of treatment occurs when a **running variable** $R$ (e.g. time, age, score) reaches a certain threshold: $T := \mathbf{1}_{R \ge r_{0}},$ so the observed response is $Y=Y_{0}\mathbf{1}_{R < r_{0}}+Y_{1}\mathbf{1}_{R \ge r_{0}}.$
![[RDD.png#invert|center]]
*Assuming continuity of $Y_{0,1}$ wrt. $R$, we can take one-sided limits from both sides to find a treatment effect*: $\begin{align*}
&\lim_{R \to r_{0}-}Y= \lim_{R \to r_{0}}Y_{0} ,\\
&\lim_{R \to r_{0}+}Y= \lim_{R \to r_{0}}Y_{1},\\[0.4em]
&\Longrightarrow \mathbb{E}[ \lim_{R \to r_{0}+}Y-\lim _{R \to r_{0}-}Y]=\mathbb{E}[Y_{1}-Y_{0} ~|~ R=r_{0}].
\end{align*}$*Therefore, the discontinuity in expectation at the threshold is the local ATE.*
- Note that other (potentially confounding) covariates are not included because of the strong assumption that $Y_{0},Y_{1}$ are continuous.
- Assuming that all other covariates $X$ are continuous across $R=r_{0}$ is also sufficient.
- This fails when **bunching** occurs, i.e. one side of the threshold has significantly more observations, e.g. when teachers give mercy passes to students. *This makes two sides of the threshold no longer comparable* (in the mercy pass example, it dilutes the difference).
In practice, this difference can be estimated by the OLS $Y\sim 1+\mathbf{1}_{R\ge r_{0}}+(\text{more predictors, e.g. } R,\ R\cdot \mathbf{1}_{R \ge r_{0}},\dots).$ WLOG let $r_{0}=0$ (which can be achieved by centering $R$ in the dataset), and let $\hat{\beta}_{0},\hat{\beta}_{1},\dots$ be the OLS coefficients; *the discontinuity jump in expectation is $\widehat{\mathrm{ATE}}=\hat{\mathbb{E}}[ \lim_{R \to r_{0}+}Y-\lim _{R \to r_{0}-}Y]=\hat{\beta}_{1}.$*
- Since we only care about a local fit at the threshold, we can also use weighted regression to de-emphasize data far from the threshold, as in the sketch below.
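A sketch of this local, weighted fit with a triangular kernel (function and parameter names are ours):
```python
import numpy as np
import statsmodels.api as sm

def rdd_jump(r, y, r0, bandwidth):
    """Sharp-RDD jump via a weighted local-linear fit on both sides (a sketch)."""
    r, y = np.asarray(r), np.asarray(y)
    rc = r - r0                                       # center the running variable
    above = (rc >= 0).astype(float)                   # treatment indicator
    X = sm.add_constant(np.column_stack([above, rc, rc * above]))
    w = np.clip(1 - np.abs(rc) / bandwidth, 0, None)  # triangular kernel weights
    return sm.WLS(y, X, weights=w).fit().params[1]    # coefficient on the indicator
```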
There might also be non-compliance issues, e.g. with a legal drinking age of $r_{0}=21$, teens might drink illegally (always-takers), and there are adults who don't drink (never-takers). Therefore, the causal graph is instead
```mermaid
flowchart LR
R[Assignment/Threshold R] --> T[Treatment T]
T --> Y[Response Y]
```
and since treatment take-up at the cutoff is probabilistic, this is called a **fuzzy RDD**:
![[FuzzyRDD.png#invert|center]]
We can use IV techniques and the Wald estimator to estimate the effect:
- Compute $\bar{t}_{0,1}$ as the proportion of treatment received among people below/above the threshold.
- Compute $\bar{y}_{0,1}$ as the average response among those people.
- Compute the Wald estimator $\hat{\beta}_{1}=\frac{\bar{y}_{1}-\bar{y}_{0}}{\bar{t}_{1}-\bar{t}_{0}}=\hat{\mathbb{E}}[Y_{1}-Y_{0} ~|~ \text{compliance}, R=r_{0}].$
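Putting the three steps together within a window around the cutoff (a sketch; all names are ours):
```python
import numpy as np

def fuzzy_rdd_late(r, t, y, r0, bandwidth):
    """Wald/LATE estimate near the cutoff (a sketch)."""
    r, t, y = map(np.asarray, (r, t, y))
    keep = np.abs(r - r0) <= bandwidth                # local window around threshold
    z = r[keep] >= r0                                 # above-threshold indicator as IV
    jump_y = y[keep][z].mean() - y[keep][~z].mean()   # discontinuity in the response
    jump_t = t[keep][z].mean() - t[keep][~z].mean()   # discontinuity in take-up
    return jump_y / jump_t
```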
### Heterogeneous Effects and Interaction Terms
In the [[#Justification for Using OLS for Causal Inference|justification]], we assumed the treatment to have a constant effect $\beta_{T}$, which is of course equal to the ATE. However, in reality the effect can be heterogeneous, and we are interested in the **conditional ATE** $\mathbb{E}[Y_{1}-Y_{0} ~|~ X]$.
For a continuous treatment, the effect of interest is instead $\mathbb{E}\left[ \frac{\partial Y}{\partial T}~|~ X \right],$ the conditional expected sensitivity.
Suppose the regression is $Y\sim 1+T+X+TX$; then
$\hat{\beta}_{T}+X'\hat{\beta}_{TX}= \hat{\mathbb{E}}[Y_{1}~|~X]-\hat{\mathbb{E}}[Y_{0}~|~X],$
where $\hat{\beta}_{TX}$ is the vector of coefficients on the interaction terms $TX$, so the estimated conditional ATE varies linearly in $X$.
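A sketch of fitting this interaction model and reading off the conditional effect (simulated data; names and numbers are ours):
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1_000
x = rng.normal(size=n)
t = rng.binomial(1, 0.5, size=n)                        # randomized binary treatment
y = 1.0 + (0.5 + 1.5 * x) * t + x + rng.normal(size=n)  # true effect: 0.5 + 1.5x
df = pd.DataFrame({"y": y, "t": t, "x": x})

res = smf.ols("y ~ t * x", data=df).fit()               # expands to 1 + t + x + t:x
cate = res.params["t"] + df["x"] * res.params["t:x"]    # beta_T + x * beta_TX
print(cate.head())                                      # per-unit estimated effects
```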