Uplift Curves with TMLE Example
This notebook demonstrates the issue of using the uplift curve when the true treatment effect is unknown, and how to address it by using TMLE as a proxy for the true treatment effect.
[1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
[2]:
import os
base_path = os.path.abspath("../")
os.chdir(base_path)
[3]:
import logging
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
import sys
import warnings
warnings.simplefilter("ignore", UserWarning)
from lightgbm import LGBMRegressor
[4]:
import causalml
from causalml.dataset import synthetic_data
from causalml.inference.meta import BaseXRegressor, TMLELearner
from causalml.metrics.visualize import *
from causalml.propensity import calibrate
import importlib.metadata
print(importlib.metadata.version('causalml'))
Failed to import duecredit due to No module named 'duecredit'
0.15.3.dev0
[5]:
logger = logging.getLogger('causalml')
logger.setLevel(logging.DEBUG)
plt.style.use('fivethirtyeight')
Generate Synthetic Data
[6]:
# Generate synthetic data using mode 1
y, X, treatment, tau, b, e = synthetic_data(mode=1, n=1000000, p=10, sigma=5.)
[7]:
X_train, X_test, y_train, y_test, e_train, e_test, treatment_train, treatment_test, tau_train, tau_test, b_train, b_test = train_test_split(X, y, e, treatment, tau, b, test_size=0.5, random_state=42)
Calculate Individual Treatment Effects (ITE/CATE)
[8]:
# X Learner
learner_x = BaseXRegressor(learner=LGBMRegressor())
learner_x.fit(X=X_train, treatment=treatment_train, y=y_train)
cate_x_test = learner_x.predict(X=X_test, p=e_test, treatment=treatment_test).flatten()
[9]:
alpha=0.2
bins=30
plt.figure(figsize=(12,8))
plt.hist(cate_x_test, alpha=alpha, bins=bins, label='X Learner')
plt.hist(tau_test, alpha=alpha, bins=bins, label='Actual')
plt.title('Distribution of CATE Predictions by X-Learner and Actual')
plt.xlabel('Individual Treatment Effect (ITE/CATE)')
plt.ylabel('# of Samples')
_=plt.legend()
Validate CATE without TMLE
[10]:
df = pd.DataFrame({'y': y_test, 'w': treatment_test, 'tau': tau_test, 'X-Learner': cate_x_test, 'Actual': tau_test})
Uplift Curve with the Ground Truth
If the true treatment effect is known, as in simulation, the uplift curve of a model uses the cumulative sum of the treatment effect sorted by the model's CATE estimates.
In the figure below, the uplift curve of the X-learner shows a positive lift close to the optimal lift of the ground truth.
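As a rough illustration of this computation, here is a minimal sketch (a hypothetical helper, not causalml's implementation) that reuses cate_x_test and tau_test from the cells above:
def gain_with_ground_truth(cate_pred, tau):
    # Rank units from highest to lowest predicted CATE, then cumulatively
    # sum the true treatment effect over the targeted population.
    order = np.argsort(cate_pred)[::-1]
    return np.cumsum(tau[order])
gain_x = gain_with_ground_truth(cate_x_test, tau_test)  # X-learner curve
gain_optimal = gain_with_ground_truth(tau_test, tau_test)  # optimal curve, sorted by true tau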
[11]:
plot(df, outcome_col='y', treatment_col='w', treatment_effect_col='tau')
Uplift Curve without the Ground Truth
If the true treatment effect is unknown, as in practice, the uplift curve of a model uses the cumulative difference of the average outcomes between the treatment and control groups, sorted by the model's CATE estimates.
In the figure below, the uplift curves of the X-learner as well as the ground truth show no lift, which is incorrect.
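The same idea can be sketched as follows (again a hypothetical helper, not causalml's implementation): track the running difference of average outcomes between treated and control units, scaled by the cumulative population size.
def gain_without_ground_truth(cate_pred, y, w):
    # Rank by predicted CATE, then compute the cumulative difference of
    # mean outcomes between treatment (w=1) and control (w=0) groups.
    order = np.argsort(cate_pred)[::-1]
    y, w = y[order], w[order]
    n_t = np.maximum(np.cumsum(w), 1)  # running treated count
    n_c = np.maximum(np.cumsum(1 - w), 1)  # running control count
    lift = np.cumsum(y * w) / n_t - np.cumsum(y * (1 - w)) / n_c
    return lift * np.arange(1, len(y) + 1)  # scale by targeted population
gain_x_hat = gain_without_ground_truth(cate_x_test, y_test, treatment_test)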
[12]:
plot(df.drop('tau', axis=1), outcome_col='y', treatment_col='w')
TMLE
Uplift Curve with TMLE as the Ground Truth
By using TMLE as a proxy for the ground truth, the uplift curve of the X-learner becomes close to the original curve based on the ground truth.
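Conceptually, the TMLE-based curve segments the test set by the model's CATE score and estimates each segment's ATE with TMLE. A minimal single-segment sketch, assuming TMLELearner.estimate_ate() accepts the keyword arguments below and returns the point estimate with its confidence bounds:
# TMLE estimate of the ATE in the top 20% segment by predicted X-learner CATE.
tmle = TMLELearner(learner=LGBMRegressor())
top = cate_x_test >= np.quantile(cate_x_test, 0.8)
ate, ate_lb, ate_ub = tmle.estimate_ate(X=X_test[top], p=e_test[top],
                                        treatment=treatment_test[top], y=y_test[top])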
[13]:
n_fold = 5
kf = KFold(n_splits=n_fold)
[14]:
df = pd.DataFrame({'y': y_test, 'w': treatment_test, 'p': e_test, 'X-Learner': cate_x_test, 'Actual': tau_test})
[15]:
inference_cols = []
for i in range(X_test.shape[1]):
    col = 'col_' + str(i)
    df[col] = X_test[:, i]
    inference_cols.append(col)
[16]:
df.head()
[16]:
| | y | w | p | X-Learner | Actual | col_0 | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.299468 | 0 | 0.875235 | 0.689955 | 0.812923 | 0.801219 | 0.824627 | 0.418361 | 0.576936 | 0.810729 | 0.186007 | 0.883184 | 0.057571 | 0.084963 | 0.782511 |
| 1 | -2.601411 | 1 | 0.715290 | 0.950119 | 0.864145 | 0.885407 | 0.842883 | 0.014536 | 0.974505 | 0.858550 | 0.548230 | 0.164607 | 0.762274 | 0.198254 | 0.647855 |
| 2 | 9.295828 | 1 | 0.895537 | 0.675432 | 0.637853 | 0.406232 | 0.869474 | 0.808828 | 0.525918 | 0.526959 | 0.023063 | 0.903683 | 0.566092 | 0.242138 | 0.219698 |
| 3 | 2.362346 | 0 | 0.230146 | 0.555949 | 0.497591 | 0.914335 | 0.080846 | 0.501873 | 0.912275 | 0.405199 | 0.922577 | 0.054477 | 0.054306 | 0.385622 | 0.244462 |
| 4 | -6.428204 | 1 | 0.772851 | 0.541349 | 0.551009 | 0.700812 | 0.401207 | 0.450781 | 0.988744 | 0.537332 | 0.124579 | 0.700980 | 0.135383 | 0.087629 | 0.198028 |
[17]:
tmle_df = get_tmlegain(df, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=False)
[18]:
tmle_df
[18]:
| | X-Learner | Actual |
|---|---|---|
| 0.0 | 0.000000 | 0.000000 |
| 0.2 | 0.162729 | 0.181960 |
| 0.4 | 0.289292 | 0.312707 |
| 0.6 | 0.401203 | 0.413857 |
| 0.8 | 0.474771 | 0.496008 |
| 1.0 | 0.536501 | 0.536501 |
Uplift Curve without Confidence Intervals
Here we can directly use the plot_tmlegain() function to generate the results and plot the uplift curve.
[19]:
plot_tmlegain(df, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=False)
We also provide an API to call plot() directly by passing kind='gain' and tmle=True.
[20]:
plot(df, kind='gain', tmle=True, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=False)
AUUC Score
[21]:
auuc_score(df, tmle=True, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=False)
[21]:
X-Learner 0.310749
Actual 0.323505
dtype: float64
Uplift Curve with Confidence Intervals
[22]:
tmle_df = get_tmlegain(df, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=True)
[23]:
tmle_df
[23]:
| | X-Learner | Actual | X-Learner LB | Actual LB | X-Learner UB | Actual UB |
|---|---|---|---|---|---|---|
| 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 0.2 | 0.162729 | 0.181960 | 0.144712 | 0.162806 | 0.180746 | 0.201114 |
| 0.4 | 0.289292 | 0.312707 | 0.253433 | 0.275556 | 0.325151 | 0.349859 |
| 0.6 | 0.401203 | 0.413857 | 0.349491 | 0.362746 | 0.452916 | 0.464968 |
| 0.8 | 0.474771 | 0.496008 | 0.407328 | 0.429929 | 0.542213 | 0.562086 |
| 1.0 | 0.536501 | 0.536501 | 0.498278 | 0.498278 | 0.574724 | 0.574724 |
[24]:
plot_tmlegain(df, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=True)
[25]:
plot(df, kind='gain', tmle=True, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=True)
Qini Curve with TMLE as the Ground Truth
Qini Curve without Confidence Intervals
[26]:
qini = get_tmleqini(df, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=False)
[27]:
qini
[27]:
| | X-Learner | Actual |
|---|---|---|
| 0.0 | 0.000000 | 0.000000 |
| 100000.0 | 59451.339999 | 74162.340931 |
| 200000.0 | 103923.696240 | 127661.597180 |
| 300000.0 | 135436.896364 | 153502.216545 |
| 400000.0 | 149594.578171 | 166344.875062 |
| 500000.0 | 138989.103266 | 138989.103266 |
[28]:
plot_tmleqini(df, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=False)
We also provide an API to call plot() directly by passing kind='qini' and tmle=True.
[29]:
plot(df, kind='qini', tmle=True, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=False)
Qini Score
[30]:
qini_score(df, tmle=True, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=False)
[30]:
X-Learner 28404.717374
Actual 40615.470531
Random 0.000000
dtype: float64
Qini Curve with Confidence Intervals
[31]:
qini = get_tmleqini(df, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=True)
[32]:
qini
[32]:
| | X-Learner | Actual | X-Learner LB | Actual LB | X-Learner UB | Actual UB |
|---|---|---|---|---|---|---|
| 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 100000.0 | 59451.339999 | 74162.340931 | 52869.065243 | 66355.766067 | 66033.614756 | 81968.915795 |
| 200000.0 | 103923.696240 | 127661.597180 | 91071.983173 | 112490.548288 | 116775.409307 | 142832.646073 |
| 300000.0 | 135436.896364 | 153502.216545 | 118121.046182 | 134765.053280 | 152752.746546 | 172239.379810 |
| 400000.0 | 149594.578171 | 166344.875062 | 129251.502323 | 145267.815499 | 169937.654019 | 187421.934626 |
| 500000.0 | 138989.103266 | 138989.103266 | 138989.103266 | 138989.103266 | 138989.103266 | 138989.103266 |
[33]:
plot_tmleqini(df, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=True)
[34]:
plot(df, kind='qini', tmle=True, inference_col=inference_cols, outcome_col='y', treatment_col='w', p_col='p',
n_segment=5, cv=kf, calibrate_propensity=True, ci=True)