Validating model via predictions #732

cheecharron · 2024-01-12T16:00:31Z

Hello. I am analyzing trial results in which there are four treatments. Although the modeled results return contrasts between three of the treatments and a designated control group, I would like to generate estimates of the outcome for each treatment so that I can validate the predictions against the observed values in the holdout sample. A similar issue (#290) was raised before for generating predictions with two treatment groups, and the following code was forwarded by @Thomas9292 for generating predictions:

# Create holdout set
X_train, X_test, t_train, t_test, y_train, y_test_actual = train_test_split(df_confounder, df_treatment, target, test_size=0.2)

# Fit learner on training set
learner = XGBTRegressor()
learner.fit(X=X_train, treatment=t_train, y=y_train)

# Predict the TE for test, and request the components (predictions for t=1 and t=0)
te_test_preds, yhat_c, yhat_t = learner.predict(X_test, t_test, return_components=True)

# Mask the yhats to correspond with the observed treatment (we can only test accuracy for those)
yhat_c = yhat_c[1] * (1 - t_test)
yhat_t = yhat_t[1] * t_test
yhat_test = yhat_t + yhat_c

# Model prediction error
MSE = mean_squared_error(y_test_actual, yhat_test)
print(f"{'Model MSE:':25}{MSE}")

# Also plotted actuals vs. predictions in here, will spare you the code

Apparently the above code generates predicted outcomes for the treatment group (yhat_t) and control group (yhat_c). When I apply this code to my data, yhat_t returns three vectors, which I assume correspond to predictions for each of the three non-control treatments. However, yhat_c also returns three vectors. Do the three yhat_c vectors represent three different predictions for the control groups? Whatever the case, how might I generate predictions for each treatment, including the control group?

The text was updated successfully, but these errors were encountered:

t-tte · 2024-02-09T18:30:12Z

It's expected that the T-Learner returns a number of control vectors corresponding to the number of treatments. This is because the implementation simply loops over each treatment and estimates a separate model for it. So the yhat_cs are the predicted control outcomes from each of those separate models. To get the predicted control outcome for the units in each of the three treatment groups, you need to apply a similar masking as in the code snippet that you provided, for each of the three control vectors. So, for the units in the first treatment group, you select the control observations from the first control vector. And so on for the remaining pairs.

cheecharron · 2024-02-12T19:17:25Z

Thanks for your response, @t-tte. I see that I can generate predictions for each of the treatment groups and even for the control group for the T-learner, but it looks like I cannot generate predictions for the control group for other meta-learners (although I can generate predictions for the treatment groups for all but R-learners). Is it possible to generate predictions for outcomes for the treatment groups for an uplift tree? That is, I'd rather have the prediction for the raw outcome for each treatment than an uplift score for each treatment so that I can compare to observed values.

ras44 · 2024-06-18T12:10:16Z

hi @cheecharron as you mentioned, you can get conditional response surface predictions for T-learners via the return_components argument:

causalml/causalml/inference/meta/tlearner.py

Line 143 in a031566

return te, yhat_cs, yhat_ts

For the S-Learner, it's similar:

causalml/causalml/inference/meta/slearner.py

Line 94 in a031566

self, X, treatment=None, y=None, p=None, return_components=False, verbose=True

For the R meta learner, it's not as much about predicting the treatment or control conditional response curves, but directly calculating the CATE through the Robinson's transformation, which uses the conditional mean outcome and the propensity score (see https://arxiv.org/pdf/1712.04912). So in this case, you could calculate the counterfactual for each of your treatment and control cases by adding/subtracting the predicted CATE, but the R learner does not actually model the "factual" surfaces. And so, it doesn't produce a prediction that you can compare to an observed outcome as described in #290.

For the DR meta learner, the situation is similar (ref equations 2 and 3 in https://arxiv.org/pdf/2004.14497).

Regarding outcome predictions for an uplift tree, see the full_output argument at:

causalml/causalml/inference/tree/uplift.pyx

Line 2519 in a031566

def predict(self, X, full_output=False):

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validating model via predictions #732

Validating model via predictions #732

cheecharron commented Jan 12, 2024

t-tte commented Feb 9, 2024

cheecharron commented Feb 12, 2024

ras44 commented Jun 18, 2024

Validating model via predictions #732

Validating model via predictions #732

Comments

cheecharron commented Jan 12, 2024

t-tte commented Feb 9, 2024

cheecharron commented Feb 12, 2024

ras44 commented Jun 18, 2024