> [!tldr]
> In causal models with binary treatment $T$, response $Y$, and covariates $X$, the **propensity score** $e(x):= \mathbb{E}[T ~|~ X=x]$is a tool for controlling for the covariates. In particular, it has the properties:
> - $T \perp X ~|~ e(X)$ (i.e. the covariates affect treatment only via the propensity score)
> - If $Y_{0},Y_{1} \perp T ~|~ X$, then $Y_{0},Y_{1} \perp T ~|~ e(X)$, i.e. controlling for the propensity is enough for controlling for $X$.
> [!proof]- Proof for the second property
> Assuming $T$ is binary $0/1$, knowing $\mathbb{P}[T=1]$ is enough to determine its distribution.
>
> Therefore, $T ~|~ Y_{0},Y_{1},e(X)$ has the probability equal to $\begin{align*}
\mathbb{P}[T&= 1 ~|~Y_{0,1},e(X)]\\
&= \mathbb{E}_{X}[\mathbb{E}_{T}[T ~|~Y_{0,1},e(X),X]~|~Y_{0,1},e(X)]\\
&= \mathbb{E}_{X}[\mathbb{E}_{T}[T ~|~ Y_{0,1}, X]~|~ Y_{0,1},e(X)] & [e(X)\in m\sigma(X)]\\
&= \mathbb{E}_{X}[\mathbb{E}_{T}[T~|~X] ~|~Y_{0,1},e(X)] &[T \perp Y_{0,1} ~|~ X]\\
&= \mathbb{E}_{X}[e(X) ~|~ Y_{0,1}, e(X)]\\
&= e(X).
\end{align*}$Therefore $T\perp Y_{0},Y_{1}$ when conditioned on $e(X)$.
## Estimation of Propensity Scores
One strength of propensity scores is that it can be estimated with highly non-linear classification methods (e.g. [[Boosting]]), whereas controlling for confounders with regression methods usually require additive effects of $T$, like [[Linear Regression in Causal Inference|OLS]].
That being said, a very good fit $\hat{e}$ of the propensity $e$ is neither required or even desirable (see [[#Potential Variance Issues|the section about variance issues below]]). In practice, [[Linear Classification Methods#Logistic Regression as a Linear Model|logistic regression]] is usually enough.
## Using Propensity Scores
The most direct application is to replace $X$ with $e(X)$ in regressing $Y$, giving the model
$Y \sim T+e(X),$which in theory gives unconfounded estimation of $\beta_{T}$.
It can also be used for [[Nonparametric Controlling in Causal Inference#Matching|matching]], where we match observations with a similar propensity score but received different treatments.
- Looking for nearest neighbors in all covariates $X$ suffers from curse of dimensionality and takes into account irrelevant covariates; this is mitigated by looking at the propensity score alone.
### Inverse Probability Weighted Estimator
> [!exposition] Deriving the IPWE (population form)
> Consider the approximation $\mathbb{E}[Y~|~X,T=1]-\mathbb{E}[Y~|~X,T=0]$: taking its average gives an estimation of the ATE. The first term can be written as
> $\begin{align*}
> \mathbb{E}[Y&~|~X,T=1]\\
> &= \frac{\mathbb{E}[Y_{1}~|~X]\cdot e(X)}{e(X)}\\
> &= \mathbb{E}\left[ \frac{Y_{1}T}{e(X)}~|~X \right] &\left[ \substack{Y_{1}\perp T ~|~X, \text{ so}\\\mathbb{E}[Y~|~X]} \right],\\
> &= \mathbb{E}\left[ \frac{YT}{e(X)}~|~X \right] &[YT=Y_{1}T ~\mathrm{a.s.}]
> \end{align*}$
> and similarly the second term is $\mathbb{E}[Y~|~X,T=0]=\mathbb{E}\left[ \frac{Y(1-T)}{1-e(X)} ~|~X\right],$and the estimator is
> $\begin{align*}
> \mathbb{E}&\left[ \frac{(1-e)YT-eY(1-T)}{e(1-e)} ~|~ X\right]\\
> &= \mathbb{E}\left[ Y\frac{T-e}{e(1-e)} ~|~ X\right]
> \end{align*},$
> where $e=e(X)$ for simplicity. Now average over $X$ gives $\mathbb{E}_{X,Y,T}\left[ Y\frac{T-e(X)}{e(X)(1-e(X))} \right],$and approximation of the ATE.
We can produce a finite sample estimation of this expectation, with modeled propensity $\hat{e}$:
> [!theorem|*] The IPWE
> The **inverse probability weighted estimator (IPWE)** is given by $\begin{align*}
&\hat{\beta}_{T}:=\frac{1}{n}\sum_{i}y_{i} w_{i},\\
&\text{where }w_{i}:= \frac{t_{i}-\hat{e}(x_{i})}{\hat{e}(x_{i})(1-\hat{e}(x_{i}))}~= \begin{cases}
1 / \hat{e}(x_{i}) & \text{if }t_{i}=1 \\
-1 / (1-\hat{e}(x_{i})) & \text{if }t_{i}=0
\end{cases}
\end{align*}$
i.e. a weighted average of the sample responses.
The (magnitude of) weights look like:
![[InverseProbabilityWeighing.png#invert|center|w80]]
In particular, *it puts extra weight on "surprising" samples*, i.e. those with high propensity but was not treated ($t_{i}-\hat{e}(x_{i})\approx -2$), or those with low propensity but was treated ($t_{i} -\hat{e}(x_{i}) \approx 2$).
If we treat $T \sim X$ as a classification problem, *the weight is high for observations with $T$ near or across the decision boundary given by $\hat{e}$.* It works like a soft margin (cf. [[Hyperplane Classification Methods#Support Vector Classifiers|support vector machines]]).
This estimator also gives a special property of propensity scores: consider the special case $Y=T$, then the estimator reduces to $\begin{align*}
\mathbb{E}_{X,Y,T}&\left[ Y\frac{T-e(X)}{e(X)(1-e(X))} \right]\\
&= \mathbb{E}_{X,T}\left[ \frac{T}{e(X)} \right]\\
&= \mathbb{E}_{X}\left[ \mathbb{E}_{T~|~X}\left[ \frac{T}{e(X)} \right] \right]\\
&= 1.
\end{align*}$
But since the integrand is $0$ when $Y=T=0$, we have $1=\mathbb{E}_{T,X}\left[ \frac{\mathbf{1}_{T=1}}{e(X)} \right].$Similarly, letting $Y=1-T$ gives $\mathbb{E}_{X,T}\left[\frac{1-T}{1-e(X)} \right]=1=\mathbb{E}_{X,T}\left[ \frac{\mathbf{1}_{T=0}}{1-e(X)} \right].$
Therefore, we can expect $\sum_{i} \frac{\mathbf{1}_{t_{i}=1}}{e(x_{i})}\approx n \approx \sum_{i} \frac{\mathbf{1}_{t_{i}=0}}{1-e(x_{i})}$to hold in a finite sample.
### Potential Variance Issues of IPWE
> [!warning] The main purpose for propensity scores is to control for confounders, not to model $e$ accurately.
The IPWE needs **positivity** to work well, i.e. the treated and untreated need to mix well in the predictor space: ideally it should look like
![[GoodPositivity.png#invert|center]]
This prevents the weights $w_{i}$ from blowing up.
On the other hand, *if the model $\hat{e}$ is extremely confident (i.e. giving near-$1$ or near-$0$ scores only), then one incorrect assignment will blow up the variance*: suppose $\hat{e}(x_{i})=\epsilon$ but $t_{i}=1$, then the weight is $w_{i}=\frac{1}{\hat{e}(x_{i})}=\epsilon^{-1},$and the IPWE will have the huge variance of $\mathrm{Var}(\hat{\beta}_{T}^{\mathrm{IPWE}})\approx \epsilon^{-2}\mathrm{Var}(Y_{i}).$
Therefore, a very well-fit model of $e$ doesn't necessarily improve the causal inference, but can instead make the IPWE wildly off. It can be illustrated with this simulation:
Consider the case where $T$ can be almost-perfectly predicted by a covariate $X$, e.g. $T\sim \begin{cases}
\mathrm{Ber}(\epsilon) &\text{if }X=0, \\
\mathrm{Ber}(1-\epsilon) & \text{if }X=1.
\end{cases}$
In this case:
- If $\epsilon$ is very small, a model like [[Linear Classification Methods#Logistic Regression as a Linear Model|logistic regression]] will produce extreme propensity scores, but it is also very unlikely for it to make incorrect assignments.
- If $\epsilon$ is large, weights blowing up is not an issue.
- If $\epsilon$ is moderately small, however, the variance of the IPWE will be huge.
We use the code below to compare modeling $e$ with and without $X$.
```python fold
import numpy as np
from sklearn.linear_model import LogisticRegression
from seaborn import kdeplot
from matplotlib import pyplot as plt
def compute_ipwe(data, predictor_columns):
x = data[predictor_columns]
t = data["treatment"].to_numpy(dtype = int)
y = data["response"].to_numpy()
propensity_model = LogisticRegression(
C = 1e6,
max_iter=1000).fit(x, t)
propensity_scores = (propensity_model
.predict_proba(x)[:, 1])
reweighed_responses = (
y * t / propensity_scores +
y * (1 - t) / (1 - propensity_scores)
)
return reweighed_responses.mean()
def generate_data(n = 1000, epsilon = 0.1):
awesome_predictor = [0] * (n // 2) + [1] * (n // 2)
treatment = np.concatenate([
np.random.binomial(1, epsilon, n // 2),
np.random.binomial(1, 1 - epsilon, n // 2)])
response = treatment + np.random.normal(0, 1, size = n)
data = pd.DataFrame({
"intercept": 1,
"awesome_predictor": awesome_predictor,
"treatment": treatment,
"response": response})
return data
def simulate_one_iteration(n, epsilon, columns):
data = generate_data(n, epsilon)
ipwe = compute_ipwe(data, columns)
return ipwe
def simulate(n_simulations = 200, n = 1000, epsilon = 0.1,
columns,):
ipwes = []
for _ in range(n_simulations):
simulated_ipwe = one_iteration(n, epsilon, columns)
ipwes.append(simulated_ipwe)
return ipwes
epsilon = 0.01
result_with = simulate(epsilon = epsilon,
columns = ["intercept",
"awesome_predictor"])
result_without = simulate(epsilon = epsilon,
columns = ["intercept"])
kdeplot(result_with, fill = True,
label = "With awesome predictor")
kdeplot(result_without, fill = True,
label = "Without awesome predictor")
plt.legend()
plt.xlabel("IPWE")
plt.title(f"Simulated Distribution of IPWE with ϵ = {epsilon}")
```
These are the distribution of simulated IPWEs for $\epsilon=0.1,0.01,0.001$ respectively.
![[BigEpsilonIPWE.png|w30]]![[ModeratelySmallIPWE.png|w30]]![[TinyEpsilonIPWE.png|w30]]
- The distribution in orange doesn't use $X$, but still gives a good estimate.
- In contrast, the blue distribution using $X$ deteriorates greatly for small $\epsilon$ in terms of variance.
- Although variance is not an issue for very small $\epsilon$, it starts to show finite sample biases.
### Augmented IPWE
The AIPWE combines the IPWE and the standard method of using regression (e.g. [[Linear Regression in Causal Inference|OLS]]) to control for confounding:
> [!definition|*] The AIPWE
> Suppose $\hat{\mu}(X,T):= \hat{\mathbb{E}}[Y~|~X,T]$ is the prediction of the regression model, and $\hat{e}(X)$ is the estimated propensity score. The **AIPWE** is given by $\underbrace{\frac{1}{n}\sum_{i}\hat{\mu}(x_{i},1)-\hat{\mu}(x_{i},0)}_{\text{regresion estimate}} + \underbrace{\frac{1}{n}\sum_{i}w_{i}e_{i}}_{\text{IPWE on residuls}},$where $w_{i}$ is the IPWE weights determined by $t_{i},\hat{e}(x_{i})$, and $e_{i}:= y_{i}-\hat{\mu}(x_{i},t_{i})$ are the residuals.
It is **doubly robust** in the sense that the estimator is unbiased as long as one of $\hat{\mu}$, $\hat{e}$ is correct:
- If $\hat{\mu}$ is unbiased, then the first term is the unbiased estimate of treatment effect, and the second term has mean $0$.
- If $\hat{e}$ is correct, then the second term is an unbiased estimator of $\mathbb{E}[E_{1}-E_{0}]$, where $E$ is the residual random variable (dependent on$X,Y,T$) and $E_{0,1}$ are the RVs it equal to given $T=0,1$. It equals $E_{i}=Y_{i}-\hat{\mu}(X,i),$so the second term estimates $\mathbb{E}[E_{1}-E_{0}]=\mathbb{E}[Y_{1}-Y_{0}]-\mathbb{E}[\hat{\mu}(X,1)-\hat{\mu}(X,0)],$where the second term cancels out the bias in $\hat{\mu}$.