> [!tldr] > In causal models with binary treatment $T$, response $Y$, and covariates $X$, the **propensity score** $e(x):= \mathbb{E}[T ~|~ X=x]$is a tool for controlling for the covariates. In particular, it has the properties: > - $T \perp X ~|~ e(X)$ (i.e. the covariates affect treatment only via the propensity score) > - If $Y_{0},Y_{1} \perp T ~|~ X$, then $Y_{0},Y_{1} \perp T ~|~ e(X)$, i.e. controlling for the propensity is enough for controlling for $X$. > [!proof]- Proof for the second property > Assuming $T$ is binary $0/1$, knowing $\mathbb{P}[T=1]$ is enough to determine its distribution. > > Therefore, $T ~|~ Y_{0},Y_{1},e(X)$ has the probability equal to $\begin{align*} \mathbb{P}[T&= 1 ~|~Y_{0,1},e(X)]\\ &= \mathbb{E}_{X}[\mathbb{E}_{T}[T ~|~Y_{0,1},e(X),X]~|~Y_{0,1},e(X)]\\ &= \mathbb{E}_{X}[\mathbb{E}_{T}[T ~|~ Y_{0,1}, X]~|~ Y_{0,1},e(X)] & [e(X)\in m\sigma(X)]\\ &= \mathbb{E}_{X}[\mathbb{E}_{T}[T~|~X] ~|~Y_{0,1},e(X)] &[T \perp Y_{0,1} ~|~ X]\\ &= \mathbb{E}_{X}[e(X) ~|~ Y_{0,1}, e(X)]\\ &= e(X). \end{align*}$Therefore $T\perp Y_{0},Y_{1}$ when conditioned on $e(X)$. ## Estimation of Propensity Scores One strength of propensity scores is that it can be estimated with highly non-linear classification methods (e.g. [[Boosting]]), whereas controlling for confounders with regression methods usually require additive effects of $T$, like [[Linear Regression in Causal Inference|OLS]]. That being said, a very good fit $\hat{e}$ of the propensity $e$ is neither required or even desirable (see [[#Potential Variance Issues|the section about variance issues below]]). In practice, [[Linear Classification Methods#Logistic Regression as a Linear Model|logistic regression]] is usually enough. ## Using Propensity Scores The most direct application is to replace $X$ with $e(X)$ in regressing $Y$, giving the model $Y \sim T+e(X),$which in theory gives unconfounded estimation of $\beta_{T}$. It can also be used for [[Nonparametric Controlling in Causal Inference#Matching|matching]], where we match observations with a similar propensity score but received different treatments. - Looking for nearest neighbors in all covariates $X$ suffers from curse of dimensionality and takes into account irrelevant covariates; this is mitigated by looking at the propensity score alone. ### Inverse Probability Weighted Estimator > [!exposition] Deriving the IPWE (population form) > Consider the approximation $\mathbb{E}[Y~|~X,T=1]-\mathbb{E}[Y~|~X,T=0]$: taking its average gives an estimation of the ATE. The first term can be written as > $\begin{align*} > \mathbb{E}[Y&~|~X,T=1]\\ > &= \frac{\mathbb{E}[Y_{1}~|~X]\cdot e(X)}{e(X)}\\ > &= \mathbb{E}\left[ \frac{Y_{1}T}{e(X)}~|~X \right] &\left[ \substack{Y_{1}\perp T ~|~X, \text{ so}\\\mathbb{E}[Y~|~X]} \right],\\ > &= \mathbb{E}\left[ \frac{YT}{e(X)}~|~X \right] &[YT=Y_{1}T ~\mathrm{a.s.}] > \end{align*}$ > and similarly the second term is $\mathbb{E}[Y~|~X,T=0]=\mathbb{E}\left[ \frac{Y(1-T)}{1-e(X)} ~|~X\right],$and the estimator is > $\begin{align*} > \mathbb{E}&\left[ \frac{(1-e)YT-eY(1-T)}{e(1-e)} ~|~ X\right]\\ > &= \mathbb{E}\left[ Y\frac{T-e}{e(1-e)} ~|~ X\right] > \end{align*},$ > where $e=e(X)$ for simplicity. Now average over $X$ gives $\mathbb{E}_{X,Y,T}\left[ Y\frac{T-e(X)}{e(X)(1-e(X))} \right],$and approximation of the ATE. We can produce a finite sample estimation of this expectation, with modeled propensity $\hat{e}$: > [!theorem|*] The IPWE > The **inverse probability weighted estimator (IPWE)** is given by $\begin{align*} &\hat{\beta}_{T}:=\frac{1}{n}\sum_{i}y_{i} w_{i},\\ &\text{where }w_{i}:= \frac{t_{i}-\hat{e}(x_{i})}{\hat{e}(x_{i})(1-\hat{e}(x_{i}))}~= \begin{cases} 1 / \hat{e}(x_{i}) & \text{if }t_{i}=1 \\ -1 / (1-\hat{e}(x_{i})) & \text{if }t_{i}=0 \end{cases} \end{align*}$ i.e. a weighted average of the sample responses. The (magnitude of) weights look like: ![[InverseProbabilityWeighing.png#invert|center|w80]] In particular, *it puts extra weight on "surprising" samples*, i.e. those with high propensity but was not treated ($t_{i}-\hat{e}(x_{i})\approx -2$), or those with low propensity but was treated ($t_{i} -\hat{e}(x_{i}) \approx 2$). If we treat $T \sim X$ as a classification problem, *the weight is high for observations with $T$ near or across the decision boundary given by $\hat{e}$.* It works like a soft margin (cf. [[Hyperplane Classification Methods#Support Vector Classifiers|support vector machines]]). This estimator also gives a special property of propensity scores: consider the special case $Y=T$, then the estimator reduces to $\begin{align*} \mathbb{E}_{X,Y,T}&\left[ Y\frac{T-e(X)}{e(X)(1-e(X))} \right]\\ &= \mathbb{E}_{X,T}\left[ \frac{T}{e(X)} \right]\\ &= \mathbb{E}_{X}\left[ \mathbb{E}_{T~|~X}\left[ \frac{T}{e(X)} \right] \right]\\ &= 1. \end{align*}$ But since the integrand is $0$ when $Y=T=0$, we have $1=\mathbb{E}_{T,X}\left[ \frac{\mathbf{1}_{T=1}}{e(X)} \right].$Similarly, letting $Y=1-T$ gives $\mathbb{E}_{X,T}\left[\frac{1-T}{1-e(X)} \right]=1=\mathbb{E}_{X,T}\left[ \frac{\mathbf{1}_{T=0}}{1-e(X)} \right].$ Therefore, we can expect $\sum_{i} \frac{\mathbf{1}_{t_{i}=1}}{e(x_{i})}\approx n \approx \sum_{i} \frac{\mathbf{1}_{t_{i}=0}}{1-e(x_{i})}$to hold in a finite sample. ### Potential Variance Issues of IPWE > [!warning] The main purpose for propensity scores is to control for confounders, not to model $e$ accurately. The IPWE needs **positivity** to work well, i.e. the treated and untreated need to mix well in the predictor space: ideally it should look like ![[GoodPositivity.png#invert|center]] This prevents the weights $w_{i}$ from blowing up. On the other hand, *if the model $\hat{e}$ is extremely confident (i.e. giving near-$1$ or near-$0$ scores only), then one incorrect assignment will blow up the variance*: suppose $\hat{e}(x_{i})=\epsilon$ but $t_{i}=1$, then the weight is $w_{i}=\frac{1}{\hat{e}(x_{i})}=\epsilon^{-1},$and the IPWE will have the huge variance of $\mathrm{Var}(\hat{\beta}_{T}^{\mathrm{IPWE}})\approx \epsilon^{-2}\mathrm{Var}(Y_{i}).$ Therefore, a very well-fit model of $e$ doesn't necessarily improve the causal inference, but can instead make the IPWE wildly off. It can be illustrated with this simulation: Consider the case where $T$ can be almost-perfectly predicted by a covariate $X$, e.g. $T\sim \begin{cases} \mathrm{Ber}(\epsilon) &\text{if }X=0, \\ \mathrm{Ber}(1-\epsilon) & \text{if }X=1. \end{cases}$ In this case: - If $\epsilon$ is very small, a model like [[Linear Classification Methods#Logistic Regression as a Linear Model|logistic regression]] will produce extreme propensity scores, but it is also very unlikely for it to make incorrect assignments. - If $\epsilon$ is large, weights blowing up is not an issue. - If $\epsilon$ is moderately small, however, the variance of the IPWE will be huge. We use the code below to compare modeling $e$ with and without $X$. ```python fold import numpy as np from sklearn.linear_model import LogisticRegression from seaborn import kdeplot from matplotlib import pyplot as plt def compute_ipwe(data, predictor_columns): x = data[predictor_columns] t = data["treatment"].to_numpy(dtype = int) y = data["response"].to_numpy() propensity_model = LogisticRegression( C = 1e6, max_iter=1000).fit(x, t) propensity_scores = (propensity_model .predict_proba(x)[:, 1]) reweighed_responses = ( y * t / propensity_scores + y * (1 - t) / (1 - propensity_scores) ) return reweighed_responses.mean() def generate_data(n = 1000, epsilon = 0.1): awesome_predictor = [0] * (n // 2) + [1] * (n // 2) treatment = np.concatenate([ np.random.binomial(1, epsilon, n // 2), np.random.binomial(1, 1 - epsilon, n // 2)]) response = treatment + np.random.normal(0, 1, size = n) data = pd.DataFrame({ "intercept": 1, "awesome_predictor": awesome_predictor, "treatment": treatment, "response": response}) return data def simulate_one_iteration(n, epsilon, columns): data = generate_data(n, epsilon) ipwe = compute_ipwe(data, columns) return ipwe def simulate(n_simulations = 200, n = 1000, epsilon = 0.1, columns,): ipwes = [] for _ in range(n_simulations): simulated_ipwe = one_iteration(n, epsilon, columns) ipwes.append(simulated_ipwe) return ipwes epsilon = 0.01 result_with = simulate(epsilon = epsilon, columns = ["intercept", "awesome_predictor"]) result_without = simulate(epsilon = epsilon, columns = ["intercept"]) kdeplot(result_with, fill = True, label = "With awesome predictor") kdeplot(result_without, fill = True, label = "Without awesome predictor") plt.legend() plt.xlabel("IPWE") plt.title(f"Simulated Distribution of IPWE with ϵ = {epsilon}") ``` These are the distribution of simulated IPWEs for $\epsilon=0.1,0.01,0.001$ respectively. ![[BigEpsilonIPWE.png|w30]]![[ModeratelySmallIPWE.png|w30]]![[TinyEpsilonIPWE.png|w30]] - The distribution in orange doesn't use $X$, but still gives a good estimate. - In contrast, the blue distribution using $X$ deteriorates greatly for small $\epsilon$ in terms of variance. - Although variance is not an issue for very small $\epsilon$, it starts to show finite sample biases. ### Augmented IPWE The AIPWE combines the IPWE and the standard method of using regression (e.g. [[Linear Regression in Causal Inference|OLS]]) to control for confounding: > [!definition|*] The AIPWE > Suppose $\hat{\mu}(X,T):= \hat{\mathbb{E}}[Y~|~X,T]$ is the prediction of the regression model, and $\hat{e}(X)$ is the estimated propensity score. The **AIPWE** is given by $\underbrace{\frac{1}{n}\sum_{i}\hat{\mu}(x_{i},1)-\hat{\mu}(x_{i},0)}_{\text{regresion estimate}} + \underbrace{\frac{1}{n}\sum_{i}w_{i}e_{i}}_{\text{IPWE on residuls}},$where $w_{i}$ is the IPWE weights determined by $t_{i},\hat{e}(x_{i})$, and $e_{i}:= y_{i}-\hat{\mu}(x_{i},t_{i})$ are the residuals. It is **doubly robust** in the sense that the estimator is unbiased as long as one of $\hat{\mu}$, $\hat{e}$ is correct: - If $\hat{\mu}$ is unbiased, then the first term is the unbiased estimate of treatment effect, and the second term has mean $0$. - If $\hat{e}$ is correct, then the second term is an unbiased estimator of $\mathbb{E}[E_{1}-E_{0}]$, where $E$ is the residual random variable (dependent on$X,Y,T$) and $E_{0,1}$ are the RVs it equal to given $T=0,1$. It equals $E_{i}=Y_{i}-\hat{\mu}(X,i),$so the second term estimates $\mathbb{E}[E_{1}-E_{0}]=\mathbb{E}[Y_{1}-Y_{0}]-\mathbb{E}[\hat{\mu}(X,1)-\hat{\mu}(X,0)],$where the second term cancels out the bias in $\hat{\mu}$.