For continuous treatment (e.g. price), we are interested in the **sensitivity** $\mathbb{E}[\partial Y / \partial T]$, or the local version $\mathbb{E}[\partial Y / \partial T ~|~ X]$. We denote it by $\beta$ (possibly a function of $X$).

### Estimating the ATE

Since we are only interested in the ATE, we may model $Y=\alpha(X)+\beta T+\epsilon,$ where $\epsilon$ is an error term with mean $0$, orthogonal to $X,T$. In other words, $Y$ is linear in $T$. Therefore, we can run the [[Population Linear Regression]] $Y \sim T$, using [[Orthogonal Projection, Confounding, and Missing Variable in OLS#^4760b7|the FWL theorem]] to get $\beta=\frac{\mathrm{Cov}(Y^{\ast},T^{\ast})}{\mathrm{Var}(T^{\ast})},$ where $Y^{\ast}:= Y-\mu_{Y}(X)$ and $T^{\ast}:= T-\mu_{T}(X)$ are made orthogonal w.r.t. $X$ by subtracting their respective conditional expectations $\mu_{\ast}(X):=\mathbb{E}[\ast ~|~ X]$.

- *$\mu_{T}$ is a generalization of the propensity score*.

Now, replacing everything with their finite-sample estimators, we get the following recipe (sketched in code below):

- First, replace $\mu_{Y},\mu_{T}$ with regression models $\hat{\mu}_{Y},\hat{\mu}_{T}$ by fitting $Y,T$ against $X$.
- Next, compute the residuals $\mathbf{y}^{\ast}:=\mathbf{y}-\hat{\mu}_{Y}(\mathbf{X})$ and $\mathbf{t}^{\ast}:=\mathbf{t}-\hat{\mu}_{T}(\mathbf{X})$.
- Finally, run an OLS $Y^{\ast} \sim T^{\ast}$ using the residual data $\mathbf{y}^{\ast}$ and $\mathbf{t}^{\ast}$.

> [!connection]
> Notice that this is similar in form to the [[Propensity Scores#Inverse Probability Weighted Estimator|IPWE]] for binary treatment, where $\mathrm{Var}(T^{\ast})=\mathbb{E}_{X}[e(X)(1-e(X))]$, and $\mathbb{E}[Y(T-e(X))]=\mathbb{E}[(Y-\mu_{Y}(X))(T-e(X))]=\mathrm{Cov}(Y^{\ast},T^{\ast}).$

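As a minimal sketch of this recipe (assuming NumPy arrays `X`, `t`, `y`; the gradient-boosting nuisance models and the function name are illustrative choices, not prescribed by the derivation):

```python
# Minimal sketch of the residual-on-residual (FWL) recipe for the ATE.
# Model choices (gradient boosting for the nuisance fits) are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def ate_via_residual_on_residual(X, t, y):
    # Step 1: fit mu_Y(X) and mu_T(X) with some regression model.
    mu_y_hat = GradientBoostingRegressor().fit(X, y)
    mu_t_hat = GradientBoostingRegressor().fit(X, t)

    # Step 2: residualize Y and T against X.
    y_star = y - mu_y_hat.predict(X)
    t_star = t - mu_t_hat.predict(X)

    # Step 3: OLS of y* on t* (no intercept needed, both are ~mean-zero),
    # i.e. Cov(y*, t*) / Var(t*).
    return float(np.sum(t_star * y_star) / np.sum(t_star ** 2))
```

Note that this version uses in-sample residuals; the overfitting caveat at the end of this note applies, and the out-of-fold variant is sketched there.
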
### Estimating the CATE

To estimate the CATE, we need a more refined model $Y=\alpha(X)+\beta(X)T+\epsilon,$ where $\beta$ is allowed to depend on $X$. Now the CATE is $\mathrm{CATE}(x)=\mathbb{E}[\beta(X)~|~X=x]=\beta(x).$

WLOG, by centering $Y,T$ (conditionally on $X$), we have the model $Y^{\ast}=\alpha ^{\ast}(X)+\beta(X)T^{\ast}+\epsilon,$ where $\alpha ^{\ast}$ absorbs the centering terms, and we may assume $\mathbb{E}[Y^{\ast}~|~ X]=\mathbb{E}[T^{\ast}~|~ X]=0$: now taking $\mathbb{E}[\cdot~|~X]$ on both sides (and using $\mathbb{E}[T^{\ast}~|~X]=\mathbb{E}[\epsilon~|~X]=0$) gives $0=\mathbb{E}[Y^{\ast}~|~X]=\alpha ^{\ast}(X).$ Therefore, the model reduces to $Y^{\ast}=\beta(X)T^{\ast}+\epsilon.$

From this, we obtain two ways of modeling $\beta$:

One way is to simply run the regression $Y^{\ast}\sim T^{\ast}(1+X+X^{2}+\dots)$, i.e. regress $Y^{\ast}$ on the interaction features $T^{\ast},T^{\ast}X,T^{\ast}X^{2},\dots$ (no intercept), effectively replacing a generic $\beta$ with a polynomial. If the OLS coefficients are $\hat{b}_{0},\dots$, the estimator is then $\hat{\beta}(x)=\hat{b}_{0}+\hat{b}_{1}x+\hat{b}_{2}x^{2}+\cdots.$

Alternatively, we can solve for $\beta(X)=\frac{Y^{\ast}-\epsilon}{T^{\ast}}.$ To estimate the full relationship $\beta(X)$, we need (an estimate of) its observed values $\pmb{\beta}$:

- First, replace the unknown $\epsilon$ with $0$, and $Y^{\ast},T^{\ast}$ with the residuals $\mathbf{y}^{\ast}:=\mathbf{y}-\hat{\mu}_{Y}(\mathbf{X})$ and $\mathbf{t}^{\ast}:=\mathbf{t}-\hat{\mu}_{T}(\mathbf{X})$.
- Now compute $\pmb{\beta}:=\frac{\mathbf{y}^{\ast}}{\mathbf{t}^{\ast}},$ where the division is element-wise.
- *However, this can involve division by near-$0$ values of $t^{\ast}_{i}$, making it unstable*.

Fortunately, this idea can be remedied with the **R-loss**, i.e. the $L_{2}$ loss of $\hat{\beta}$ when predicting $Y^{\ast}$ with $\hat{\beta}(X)T^{\ast}$: $L(\hat{\beta}):= \| \mathbf{y}^{\ast}-\hat{\beta}(\mathbf{X})\mathbf{t}^{\ast} \|^{2}.$ Now we can transform this into a weighted OLS problem: $L(\hat{\beta})=\sum_{i}t^{\ast 2}_{i}\left( \frac{y^{\ast}_{i}}{t^{\ast}_{i}}-\hat{\beta}(x_{i}) \right)^{2},$ i.e. *the weighted $L_{2}$ loss when predicting $Y^{\ast} / T^{\ast}$ with $\hat{\beta}(X)$*. The near-$0$ divisions are no longer an issue, since those terms are heavily de-emphasized by a weight of $t^{\ast 2}_{i} \approx 0$.

### Overfitting and CV Residuals

As in regular ML, overfitting can be an issue: for example, an overfitted $\hat{\mu}_{T}$ makes the residual $T-\hat{\mu}_{T}(X)$ underestimate the true $T^{\ast}$, which shrinks its variance and hence inflates the variance of $\hat{\beta}$. One remedy is to use [[Cross Validation]] and take **out-of-fold residuals**: for the $k$th fold, train the model on $\mathcal{T}\setminus\mathcal{V}_{k}$ and take its residuals on the held-out $\mathcal{V}_{k}$ (see the sketch below).

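A sketch combining the weighted-OLS form of the R-loss with out-of-fold residuals (assuming scikit-learn: `cross_val_predict` provides out-of-fold nuisance predictions, and the polynomial basis for $\hat{\beta}$ mirrors the $Y^{\ast}\sim T^{\ast}(1+X+X^{2}+\dots)$ idea; all model and parameter choices here are illustrative):

```python
# Sketch: CATE via the weighted-OLS form of the R-loss, with out-of-fold
# residuals to avoid the overfitting issue above. Model/basis choices are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import PolynomialFeatures

def cate_via_r_loss(X, t, y, cv=5, degree=2):
    # Out-of-fold predictions: each point is predicted by a model trained on the
    # other folds, so the residuals are not shrunk by overfitting.
    y_star = y - cross_val_predict(GradientBoostingRegressor(), X, y, cv=cv)
    t_star = t - cross_val_predict(GradientBoostingRegressor(), X, t, cv=cv)

    # Weighted-OLS form of the R-loss: predict y*/t* from X with weights t*^2.
    # (Exact zeros in t_star would need guarding in practice.)
    pseudo_outcome = y_star / t_star
    weights = t_star ** 2

    # beta(X) modeled as a polynomial in X, as in Y* ~ T*(1 + X + X^2 + ...).
    poly = PolynomialFeatures(degree=degree)
    beta_model = LinearRegression(fit_intercept=False).fit(
        poly.fit_transform(X), pseudo_outcome, sample_weight=weights
    )

    # Return the CATE estimate as a callable beta_hat(x).
    return lambda X_new: beta_model.predict(poly.transform(X_new))
```

The weighted fit minimizes exactly $\sum_{i}t^{\ast 2}_{i}(y^{\ast}_{i}/t^{\ast}_{i}-\hat{\beta}(x_{i}))^{2}$; swapping the polynomial regression for any regressor that accepts `sample_weight` gives an R-learner-style variant.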