Suppose we want to study a response $Y$ that may be affected by a treatment $T$, other covariates $X$, and potentially unobserved ones $U$. We have collected data in the design matrix $\mathbf{X}$, which includes $T$, $X$, and a column of constants.
- Write $\mathbf{X}^{\ast}$ as the block containing $X$ only.
Let the full OLS $Y \sim 1+T+X$ return coefficients $\hat{\beta}=(\hat{\beta}_{0}, \hat{\beta}_{T},\hat{\beta}_{X})$. Then we can interpret $\hat{\beta}_{T}$ as the "best" constant-valued estimate of the effect of the treatment.
Then by [[Orthogonal Projection, Confounding, and Missing Variable in OLS#^4760b7|FWL theorem]], $\hat{\beta}_{T}$ can also be obtained by:
- Let $\tilde{T}$ be the OLS residuals from regressing $T \sim 1+X$ (the intercept must be included, since the constant is among the other regressors); this orthogonalizes the observed values of $T$ wrt. $X$.
- Regressing $Y\sim 1+\tilde{T}$ then returns the same $\hat{\beta}_{T}$.
> [!idea] Therefore, OLS can be interpreted as finding the effect of the treatment that is orthogonal to that of other covariates.
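A minimal numerical check of this equivalence (a sketch on simulated data, `numpy` only; all variable names here are ours):
```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
t = 0.5 * x + rng.normal(size=n)   # treatment correlated with the covariate
y = 1.0 + 2.0 * t + 3.0 * x + rng.normal(size=n)

# Full regression Y ~ 1 + T + X
beta_full = np.linalg.lstsq(np.column_stack([np.ones(n), t, x]), y, rcond=None)[0]

# FWL: residualize T on (1, X), then regress Y on (1, T_tilde)
G = np.column_stack([np.ones(n), x])
t_tilde = t - G @ np.linalg.lstsq(G, t, rcond=None)[0]
beta_fwl = np.linalg.lstsq(np.column_stack([np.ones(n), t_tilde]), y, rcond=None)[0]

assert np.isclose(beta_full[1], beta_fwl[1])   # both recover the same beta_T
```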
## Justification for Using OLS for Causal Inference
But in causal inference, we wish to find $\mathbb{E}[Y_{1}-Y_{0}]$, and with the assumption that there is no confounding or selection bias, we have $Y_{0,1}\perp T ~|~ X$, so the above equals $\mathbb{E}_{X}\big[\mathbb{E}[Y_{1}~|~X]-\mathbb{E}[Y_{0}~|~X]\big]=\mathbb{E}_{X}\big[\mathbb{E}[Y_{1}~|~X,T]-\mathbb{E}[Y_{0}~|~X,T]\big].$ Now linear regression estimates $\mathbb{E}[Y~|~X,T]$ because of [[Decision Theory#^bf20fe|the mean being optimal for the $l_{2}$ loss used by OLS]], and the coefficient $\hat{\beta}_{T}$ gives what we want: $\begin{align*}
\hat{\beta}_{T}&= \hat{\mathbb{E}}[Y~|~X,T=1]-\hat{\mathbb{E}}[Y ~|~ X, T=0] &[\text{additive model}]\\
&= \hat{\mathbb{E}}[Y_{1}~|~X,T=1]-\hat{\mathbb{E}}[Y_{0}~|~X,T=0] &[\text{def. of treatment}]\\
&= \hat{\mathbb{E}}[Y_{1}~|~X]-\hat{\mathbb{E}}[Y_{0}~|~X], & [\text{non-confoundedness}]
\end{align*}$ where $\hat{\mathbb{E}}$ is the OLS estimate of the expectation. Furthermore, the OLS model assumes that the effect is constant (i.e. $\hat{\beta}_{T}$ does not vary with $X$), so averaging over $X$ gives
*$\hat{\beta}_{T}=\hat{\mathbb{E}}[Y_{1}-Y_{0}].$*
Note that $\hat{\beta}_{T}$ is already an estimate of the expectation $\mathbb{E}[Y_{1}-Y_{0}~|~X]$, and averaging over $\mathbf{X}^{\ast}$ is in fact another layer of estimation (using sample mean to estimate population mean $\mathbb{E}_{X}$).
For a continuous treatment (assuming differentiability), we can instead interpret $\beta_{T}=\partial Y / \partial T$ as the **sensitivity** (e.g. the elasticity if $Y$ is sales and $T$ the price; see the sketch below).
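To make the elasticity connection precise (a sketch, assuming $Y,T>0$): in a log-log specification the coefficient is exactly the elasticity, since $\log Y=\beta_{0}+\beta_{T}\log T+\dots \implies \beta_{T}=\frac{\partial \log Y}{\partial \log T}=\frac{\partial Y / Y}{\partial T / T}.$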
## Adding Predictors
As explained in [[Confounding and Selection Bias in Causal Inference]], we want to add confounding variables into our model to control for them, but not things like mediators (controlling for which causes selection bias).
On the other hand, there are other types of predictors that can help improve the model.
### Good Predictors of the Response
Suppose there is a covariate $Z$ that has no causal relationship with $T$ and is uncorrelated with it -- so controlling for it affects neither confounding nor selection bias. For example,
```mermaid
flowchart TD
Z[new covariate Z] --> Y[response Y]
X[already in model X] --> Y
T[treatment T] --> Y
```
As $Z \perp T$, we can assume that the observed values $\mathbf{z},\mathbf{t}$ are also (almost) orthogonal (in the linear algebra sense), and the FWL theorem guarantees that the OLS coefficient $\hat{\beta}_{T}$ will be (nearly) identical when we add $Z$ to the model.
Should we do so? It depends on a number of factors:
- In practice the orthogonality is not strict: $\mathbf{z}^{T}\mathbf{t}$ is only approximately $0$, so the coefficient will change a bit.
- More importantly, *if $Z$ is a good predictor of $Y$, adding it to the model will greatly decrease $\mathrm{RSS}$ and by extension $\hat{\sigma}^{2}$, making the coefficient estimates more significant* (see the simulation after this list).
- Conversely, adding a poor predictor inflates $\hat{\sigma}^{2}$ (if we use the $\mathrm{RSS} / (n-p)$ estimate), since $\mathrm{RSS}$ barely drops while $n-p$ shrinks.
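A quick simulation of this variance effect (a sketch with `statsmodels`; the data-generating numbers and names are ours):
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
t = rng.normal(size=n)                 # treatment
z = rng.normal(size=n)                 # good predictor of Y, independent of T
y = 1.0 + 0.3 * t + 2.0 * z + rng.normal(size=n)

without_z = sm.OLS(y, sm.add_constant(t)).fit()
with_z = sm.OLS(y, sm.add_constant(np.column_stack([t, z]))).fit()

# beta_T barely moves (FWL + near-orthogonality), but its standard error shrinks
print(without_z.params[1], without_z.bse[1])
print(with_z.params[1], with_z.bse[1])
```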
### Instrumental Variables
Suppose there is [[Orthogonal Projection, Confounding, and Missing Variable in OLS#Missing Variable Bias in OLS|unobserved variable bias]], where the causal graph is
```mermaid
flowchart TD
Z[instrumental variable Z] --> T[treatment T]
T --> Y[response Y]
U[unobserved confounder U] --> T
U --> Y
X[covariates X] --> Y
```
where $U$ can be either left out on purpose (sensitive questions that invite biased answers), neglected (we forgot to ask), or simply not measurable (e.g. intelligence).
This causes the issue that if the true effect of $T$ is $\beta_{T}$, running OLS with the observed variables gives the model $Y=\beta_{0}+\beta_{T}T+\beta_{X}X+\underbrace{\beta_{U}U+\epsilon}_{\epsilon'},$ with errors $\epsilon'$ that are correlated with $T$ (through $U$), biasing $\hat{\beta}_{T}$.
But in this case *we can estimate the effect of treatment via an instrumental variable*, defined as:
> [!definition|*] Instrumental Variable
> A variable $Z$ is an **instrumental variable** for treatment $T$ and response $Y$ if:
> - $Z \not \perp T$, and ideally $\rho_{ZT}$ is large, i.e. $Z$ has an effect on $T$, ideally one well-approximated by a linear relationship (the first-stage, or relevance, condition).
> - $Y \perp Z ~|~ T$, i.e. the **exclusion restriction** that $Z$ only affects $Y$ via $T$.
With those assumptions we have $\mathrm{Cov}(Z, Y~|~X)=\mathrm{Cov}(Z, \beta_{T}T~|~X)=\beta_{T}\cdot \mathrm{Cov}(Z,T~|~X).$
Solving for $\beta_{T}$, we find $\beta_{T}=\frac{\mathrm{Cov}(Y, Z~|~X)}{\mathrm{Cov}(Z,T~|~X)}\approx\frac{\hat{\beta}_{Z:Y\sim Z+X}}{ \hat{\beta}_{Z: T\sim Z+X}} := \frac{\text{Reduced Form}}{\text{1st Stage}}.$ Therefore, we only need to run the regressions $Y \sim Z+X$ and $T \sim Z+X$ to determine the effect of $T$, even if there are unobserved confounders.
- This process is implemented in the Python module `linearmodels` as `iv.IV2SLS`, which also provides standard error estimates; see the sketch after this list.
- In the case of a binary IV, the $\mathrm{RHS}$ further reduces to the **Wald estimator** $\frac{\bar{y}_{1}-\bar{y}_{0}}{\bar{t}_{1}-\bar{t}_{0}},$ where a subscript of $0$ indicates averaging over observations with $Z=0$, and similarly for $1$. You can think of this as the slope between the two points $(\bar{t}_{0}, \bar{y}_{0})$ and $(\bar{t}_{1}, \bar{y}_{1})$.
> [!info] Vector-valued IV and treatments
> Of course $Z,T$ can be high-dimensional, e.g. a dummy of a categorical variable. In that case, $\hat{\beta}_{T \sim Z}$ is a $(z \times t)$ coefficient-matrix, where $z,t$ are dimensions of the two RVs.
>
> `linearmodels.iv.IV2SLS` also handles that.
- The standard error of $\hat{\beta}_{Y \sim T}$ naturally depends on the strength of the correlation between $T$ and $Z$: a weak instrument causes a huge standard error in the estimate.
- The 2SLS estimates are biased towards the OLS $\hat{\beta}_{T:Y \sim T + X}$, so it is a biased (but consistent) estimator of the true relationship.
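A sketch of both routes on simulated data (assuming `linearmodels` is installed; the data-generating numbers and names are ours):
```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

rng = np.random.default_rng(2)
n = 2_000
u = rng.normal(size=n)                    # unobserved confounder
z = rng.normal(size=n)                    # instrument
t = 0.8 * z + u + rng.normal(size=n)      # first stage
y = 2.0 * t + u + rng.normal(size=n)      # true beta_T = 2
df = pd.DataFrame({"y": y, "t": t, "z": z})

naive = np.polyfit(t, y, 1)[0]            # OLS slope of Y ~ T, biased by U
rf = np.polyfit(z, y, 1)[0]               # reduced form: slope of Y ~ Z
fs = np.polyfit(z, t, 1)[0]               # first stage:  slope of T ~ Z
print(naive, rf / fs)                     # the ratio is close to 2; naive OLS is not

# The same estimate, with standard errors, via linearmodels
res = IV2SLS.from_formula("y ~ 1 + [t ~ z]", df).fit()
print(res.params["t"], res.std_errors["t"])
```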
### Compliance and LATE
In real-life experiments, there is a difference between the assignment of treatment and the actual receipt of it, with the causal graph looking like
```mermaid
flowchart LR
Z[Assignment Z] --> T[Treatment T]
T --> Y[Response Y]
```
(with other covariates omitted). In this way, *assignment $Z$ can be thought of as another layer of treatment -- that on $T$*. Therefore we can write $T_{0},T_{1}$ to give $T=\begin{cases}
T_{0} & \text{if }Z=0, \\
T_{1} & \text{if }Z=1,
\end{cases}$ and if the assignment is suitably random, we have $T_{0},T_{1} \perp Z$.
According to the values $T_{0},T_{1}$ take, we have the following classification:
- If $T_{i}=i$, they are **compliers**.
- If $T_{i} = \lnot i$, they are **defiers** (rare in most experiments).
- If $T_{i}=1$ for both $i$, they are **always-takers**.
- If $T_{i}=0$ for both, they are **never-takers** (e.g. people with Nokia phones who cannot receive a fancy ad-treatment).
The last three can muddy the waters when estimating the ATE; for the case of always-takers: ![[NonCompliance.png#invert|center]]
- We cannot estimate the effect $T \to Y$ with the replacement $Z\to Y$, as averaging over $\{ j: Z_{j}=i \}$ to estimate $\mathbb{E}[Y_{i}]$ would include always- and never-takers, biasing the result.
- We also cannot use the naïve estimate $\hat{\mathbb{E}}[Y~|~T=1]-\hat{\mathbb{E}}[Y ~|~ T=0]$ by averaging over $\{ j:T_{j}=i \}$ (even if we can observe $T$) like in perfect randomized control trials (RCTs), because *although $Z$ is randomized, $T$ can be affected by confounders*:
```mermaid
flowchart TD
Z[Assignment Z] --> T[Treatment T]
U[Confounder U] --> T
T --> Y[Response Y]
U --> Y
```
Notice how similar this graph is to that of IVs: *treat assignment $Z$ as the IV for treatment $T$*, and the Wald estimator becomes $\frac{\bar{y}_{1}-\bar{y}_{0}}{\bar{t}_{1}-\bar{t}_{0}}=\frac{\hat{\mathbb{E}}[Y~|~Z=1]-\hat{\mathbb{E}}[Y ~|~ Z=0]}{\hat{\mathbb{E}}[T~|~Z=1]-\hat{\mathbb{E}}[T ~|~ Z=0]}.$ The denominator is just $\hat{\mathbb{E}}[T_{1}-T_{0}]$, an estimate of the compliance rate $\mathbb{P}[\text{compliance}]$: always/never-takers contribute $0$ to the expectation, and we assume there are no defiers.
The numerator can be further conditioned to be $\begin{align}
\hat{\mathbb{E}}[Y&~|~T=T_{1},Z=1]-\hat{\mathbb{E}}[Y ~|~ T=T_{0},Z=0] \\
&= \hat{\mathbb{E}}[Y~|~T=T_{1}]-\hat{\mathbb{E}}[Y ~|~ T=T_{0}] &[\substack{\text{exclusion} \\ \text{restriction}}] \\
&=\mathbb{P}(\text{compliance}) \cdot \hat{\mathbb{E}}[Y_{1}-Y_{0} ~|~ \text{compliance}].\end{align}$Again, the last step follows from always/never-takers having the two terms cancel out.
Therefore, the Wald estimator reduces to $\hat{\mathbb{E}}[Y_{1}-Y_{0} ~|~ \text{compliance}],$ also known as the **local average treatment effect (LATE)**. Assuming compliance is good, this will be close to the true ATE.
- Another way of thinking is that a good compliance makes the first stage $Z \to T$ strong, giving a good IV estimate in the end.
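As a minimal sketch of this computation (the helper name is ours, not a library function):
```python
import numpy as np

def wald_estimator(y, t, z):
    """LATE estimate from a binary assignment z (a sketch)."""
    y, t, z = map(np.asarray, (y, t, z))
    itt = y[z == 1].mean() - y[z == 0].mean()         # intent-to-treat effect on Y
    compliance = t[z == 1].mean() - t[z == 0].mean()  # estimated compliance rate
    return itt / compliance
```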
### Regression Discontinuity Design
A special case of treatment occurs when a **running variable** $R$ (e.g. time, age, score) reaches a certain threshold: $T := \mathbf{1}_{R \ge r_{0}},$ so the observed response is $Y=Y_{0}\mathbf{1}_{R < r_{0}}+Y_{1}\mathbf{1}_{R \ge r_{0}}.$
![[RDD.png#invert|center]]
*Assuming continuity of $Y_{0,1}$ wrt. $R$, we can take one-sided limits from both sides to find a treatment effect*: $\begin{align*}
&\lim_{R \to r_{0}-}Y= \lim_{R \to r_{0}}Y_{0} ,\\
&\lim_{R \to r_{0}+}Y= \lim_{R \to r_{0}}Y_{1},\\[0.4em]
&\Longrightarrow \mathbb{E}[ \lim_{R \to r_{0}+}Y-\lim _{R \to r_{0}-}Y]=\mathbb{E}[Y_{1}-Y_{0} ~|~ R=r_{0}].
\end{align*}$*Therefore, the discontinuity in expectation at the threshold is the local ATE.*
- Note that other (potentially confounding) covariates are not included because of the strong assumption that $Y_{0},Y_{1}$ are continuous.
- Assuming that all other covariates $X$ are continuous across $R=r_{0}$ is also sufficient.
- This fails when **bunching** occurs, i.e. one side of the threshold has significantly more observations, e.g. when teachers give mercy passes to students. *This makes two sides of the threshold no longer comparable* (in the mercy pass example, it dilutes the difference).
In practice, this difference can be estimated by the OLS $Y\sim 1+\mathbf{1}_{R\ge r_{0}}+(\text{more predictors, e.g. } R,\ R\cdot \mathbf{1}_{R \ge r_{0}},\dots).$ WLOG let $r_{0}=0$ (which can be achieved by centering $R$ in the dataset), and let $\hat{\beta}_{0},\hat{\beta}_{1},\dots$ be the OLS coefficients; *the discontinuity jump in expectation is $\widehat{\mathrm{ATE}}=\hat{\mathbb{E}}[ \lim_{R \to r_{0}+}Y-\lim _{R \to r_{0}-}Y]=\hat{\beta}_{1}.$*
- Since we only care about a local fit at the threshold, we can also use weighted regression to de-emphasize data far from the threshold, as in the sketch below.
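A sketch of this local, weighted fit with a triangular kernel (function and parameter names are ours):
```python
import numpy as np
import statsmodels.api as sm

def rdd_jump(r, y, r0, bandwidth):
    """Sharp-RDD jump via a weighted local-linear fit on both sides (a sketch)."""
    r, y = np.asarray(r), np.asarray(y)
    rc = r - r0                                       # center the running variable
    above = (rc >= 0).astype(float)                   # treatment indicator
    X = sm.add_constant(np.column_stack([above, rc, rc * above]))
    w = np.clip(1 - np.abs(rc) / bandwidth, 0, None)  # triangular kernel weights
    return sm.WLS(y, X, weights=w).fit().params[1]    # coefficient on the indicator
```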
There might also be non-compliance issues, e.g. with a legal drinking age of $r_{0}=21$, teens might drink illegally (always-takers), and there are adults who don't drink (never-takers). Therefore, the causal graph is instead
```mermaid
flowchart LR
R[Assignment/Threshold R] --> T[Treatment T]
T --> Y[Response Y]
```
and since treatment take-up at the cutoff is probabilistic, this is called a **fuzzy RDD**:
![[FuzzyRDD.png#invert|center]]
We can use IV techniques and the Wald estimator to estimate the effect:
- Compute $\bar{t}_{0,1}$ as the proportion of treatment received among people below/above the threshold.
- Compute $\bar{y}_{0,1}$ as the average response among those people.
- Compute the Wald estimator $\hat{\beta}_{1}=\frac{\bar{y}_{1}-\bar{y}_{0}}{\bar{t}_{1}-\bar{t}_{0}}=\hat{\mathbb{E}}[Y_{1}-Y_{0} ~|~ \text{compliance}, R=r_{0}].$
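Putting the three steps together within a window around the cutoff (a sketch; all names are ours):
```python
import numpy as np

def fuzzy_rdd_late(r, t, y, r0, bandwidth):
    """Wald/LATE estimate near the cutoff (a sketch)."""
    r, t, y = map(np.asarray, (r, t, y))
    keep = np.abs(r - r0) <= bandwidth                # local window around threshold
    z = r[keep] >= r0                                 # above-threshold indicator as IV
    jump_y = y[keep][z].mean() - y[keep][~z].mean()   # discontinuity in the response
    jump_t = t[keep][z].mean() - t[keep][~z].mean()   # discontinuity in take-up
    return jump_y / jump_t
```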
### Heterogeneous Effects and Interaction Terms
In the [[#Justification for Using OLS for Causal Inference|justification]], we assumed the treatment to have a constant effect $\beta_{T}$, which is of course equal to the ATE. However, in reality the effect can be heterogeneous, and we are interested in the **conditional ATE** $\mathbb{E}[Y_{1}-Y_{0} ~|~ X]$.
For a continuous treatment, the effect of interest is instead $\mathbb{E}\left[ \frac{\partial Y}{\partial T}~|~ X \right],$ the conditional expected sensitivity.
Suppose the regression is $Y\sim 1+T+X+TX$; then
$\hat{\beta}_{T}+X'\hat{\beta}_{TX}= \hat{\mathbb{E}}[Y_{1}~|~X]-\hat{\mathbb{E}}[Y_{0}~|~X],$
where $\hat{\beta}_{TX}$ is the vector of coefficients on the interaction terms $TX$, so the estimated conditional ATE varies linearly in $X$.
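A sketch of fitting this interaction model and reading off the conditional effect (simulated data; names and numbers are ours):
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1_000
x = rng.normal(size=n)
t = rng.binomial(1, 0.5, size=n)                        # randomized binary treatment
y = 1.0 + (0.5 + 1.5 * x) * t + x + rng.normal(size=n)  # true effect: 0.5 + 1.5x
df = pd.DataFrame({"y": y, "t": t, "x": x})

res = smf.ols("y ~ t * x", data=df).fit()               # expands to 1 + t + x + t:x
cate = res.params["t"] + df["x"] * res.params["t:x"]    # beta_T + x * beta_TX
print(cate.head())                                      # per-unit estimated effects
```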