Confounding and Selection Bias in Causal Inference

![[ConfoundingAndSelectionBias.png#invert|center]] > [!tldr] Confounding Bias > **Confounding bias** occurs when there is a **confounding variable** with causal effect on both the treatment and the response. In most cases, it is **pretreatment**: in terms of a [[Causal Graphs|causal graph]], > ```mermaid > flowchart LR > X[confounder X] --> T[treatment T] > X --> Y[response Y] > T --> Y > ``` > See [[Causal Graphs#Indirect Influences]] for an example. > > Another possibility is a **intermediate/post-treatment confounder**, which only occurs when the treatment is prolonged (e.g. education; otherwise the confounder cannot be later than its effect, i.e. the treatment): > ```mermaid > flowchart LR > T[earlier treatment T] --> X[confounder X] --> TT[later treatment T] > X --> Y[response Y] > T --> Y > TT --> Y > ``` > For example $X$ is education, $T$ is good study habits (which is fostered by previous education and contributes to attaining higher education in the future), and $Y$ is salary. > > However, if $X$ can be controlled, (at least in this causal model) we can inference causality in $Y_{0},Y_{1} \perp T ~|~ X.$ > [!tldr] Selection Bias > Selection bias occurs when an inappropriate variable is controlled for when studying causality, e.g. when trying to prevent [[Confounding and Selection Bias in Causal Inference]]. > > In a [[Causal Graphs|causal graph]], this can occur like: > ```mermaid > flowchart LR > T[treatment T] --> Y[response Y] > T --> X[controlled variable X] > Y --> X > ``` > > Alternatively, in a case of **mediators**, the causal graph looks like > ```mermaid > flowchart LR > T[treatment T] --> Y[response Y] > T --> X[mediator X] > X --> Y > ``` > In both cases, marginally we have $Y_{0},Y_{1} \perp T$, but by controlling $X$, this may not hold, and causal inference breaks down. For example, consider the [[Simpson's Paradox#IRL Example UCB Admission Rate by Sex]] example: - Let $T$ indicate sex ($1=$ woman), and assume for example's sake being a woman causes the student to apply for harder programs, whose difficulty is coded by $X$, so $T \to X$. - Let the response $Y$ be a binary $0/1$ encoding of admission/rejection, so obviously $X \to Y$. - Now $X$ is a mediator, so if we controlled for $X$ (e.g. looking at the data at the per-department level), we would conclude that being a woman "caused" higher admission rates; if we do not control for $X$ (i.e. looking at the data pooled across UCB), we would conclude the contrary.