Feature Selection for Uplift Trees by Zhao et al. (2020)

This notebook includes two parts:

  • Feature selection: demonstrate how to use the filter methods to select the most important numeric features

  • Performance evaluation: evaluate the AUUC performance with the top-feature datasets

(Paper reference: Zhao, Zhenyu, et al. “Feature Selection Methods for Uplift Modeling.” arXiv preprint arXiv:2005.03447 (2020).)

[1]:
import numpy as np
import pandas as pd
[2]:
from causalml.dataset import make_uplift_classification

Import the FilterSelect class to use the filter methods

[3]:
from causalml.feature_selection.filters import FilterSelect
[4]:
from causalml.inference.tree import UpliftRandomForestClassifier
from causalml.inference.meta import BaseXRegressor, BaseRRegressor, BaseSRegressor, BaseTRegressor
from causalml.metrics import plot_gain, auuc_score
[5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
[6]:
import logging

logger = logging.getLogger('causalml')
logging.basicConfig(level=logging.INFO)

Generate dataset

Generate synthetic data using the built-in function.

[7]:
# define parameters for simulation

y_name = 'conversion'                              # binary outcome column
treatment_group_keys = ['control', 'treatment1']   # one control and one treatment group
n = 10000                                          # sample size per treatment group
n_classification_features = 50                     # total number of base features
n_classification_informative = 10                  # features informative for the outcome
n_classification_repeated = 0                      # no repeated (duplicate) features
n_uplift_increase_dict = {'treatment1': 8}         # number of features driving uplift up
n_uplift_decrease_dict = {'treatment1': 4}         # number of features driving uplift down
delta_uplift_increase_dict = {'treatment1': 0.1}   # positive uplift size for treatment1
delta_uplift_decrease_dict = {'treatment1': -0.1}  # negative uplift size for treatment1

random_seed = 20200808
[8]:
df, X_names = make_uplift_classification(
    treatment_name=treatment_group_keys,
    y_name=y_name,
    n_samples=n,
    n_classification_features=n_classification_features,
    n_classification_informative=n_classification_informative,
    n_classification_repeated=n_classification_repeated,
    n_uplift_increase_dict=n_uplift_increase_dict,
    n_uplift_decrease_dict=n_uplift_decrease_dict,
    delta_uplift_increase_dict=delta_uplift_increase_dict,
    delta_uplift_decrease_dict=delta_uplift_decrease_dict,
    random_seed=random_seed
)
INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
[9]:
df.head()
[9]:
treatment_group_key x1_informative x2_informative x3_informative x4_informative x5_informative x6_informative x7_informative x8_informative x9_informative ... x56_uplift_increase x57_uplift_increase x58_uplift_increase x59_increase_mix x60_uplift_decrease x61_uplift_decrease x62_uplift_decrease x63_uplift_decrease conversion treatment_effect
0 treatment1 -4.004496 -1.250351 -2.800557 -0.368288 -0.115549 -2.492826 0.369516 0.290526 0.465153 ... 0.496144 1.847680 -0.337894 -0.672058 1.180352 0.778013 0.931000 2.947160 0 0
1 treatment1 -3.170028 -0.135293 1.484246 -2.131584 -0.760103 1.764765 0.972124 1.407131 -1.027603 ... 0.574955 3.578138 0.678118 -0.545227 -0.143942 -0.015188 1.189643 1.943692 1 0
2 treatment1 -0.763386 -0.785612 1.218781 -0.725835 1.044489 -1.521071 -2.266684 -1.614818 -0.113647 ... 0.985076 1.079181 0.578092 0.574370 -0.477429 0.679070 1.650897 2.768897 1 0
3 control 0.887727 0.049095 -2.242776 1.530966 0.392623 -0.203071 -0.549329 0.107296 -0.542277 ... -0.175352 0.683330 0.567545 0.349622 -0.789203 2.315184 0.658607 1.734836 0 0
4 control -1.672922 -1.156145 3.871476 -1.883713 -0.220122 -4.615669 0.141980 -0.933756 -0.765592 ... 0.485798 -0.355315 0.982488 -0.007260 2.895155 0.261848 -1.337001 -0.639983 1 0

5 rows × 66 columns

[10]:
# Look at the conversion rate and sample size in each group
df.pivot_table(values='conversion',
               index='treatment_group_key',
               aggfunc=[np.mean, np.size],
               margins=True)
[10]:
                         mean       size
                   conversion conversion
treatment_group_key
control               0.50180      10000
treatment1            0.59750      10000
All                   0.54965      20000
[11]:
X_names
[11]:
['x1_informative',
 'x2_informative',
 'x3_informative',
 'x4_informative',
 'x5_informative',
 'x6_informative',
 'x7_informative',
 'x8_informative',
 'x9_informative',
 'x10_informative',
 'x11_irrelevant',
 'x12_irrelevant',
 'x13_irrelevant',
 'x14_irrelevant',
 'x15_irrelevant',
 'x16_irrelevant',
 'x17_irrelevant',
 'x18_irrelevant',
 'x19_irrelevant',
 'x20_irrelevant',
 'x21_irrelevant',
 'x22_irrelevant',
 'x23_irrelevant',
 'x24_irrelevant',
 'x25_irrelevant',
 'x26_irrelevant',
 'x27_irrelevant',
 'x28_irrelevant',
 'x29_irrelevant',
 'x30_irrelevant',
 'x31_irrelevant',
 'x32_irrelevant',
 'x33_irrelevant',
 'x34_irrelevant',
 'x35_irrelevant',
 'x36_irrelevant',
 'x37_irrelevant',
 'x38_irrelevant',
 'x39_irrelevant',
 'x40_irrelevant',
 'x41_irrelevant',
 'x42_irrelevant',
 'x43_irrelevant',
 'x44_irrelevant',
 'x45_irrelevant',
 'x46_irrelevant',
 'x47_irrelevant',
 'x48_irrelevant',
 'x49_irrelevant',
 'x50_irrelevant',
 'x51_uplift_increase',
 'x52_uplift_increase',
 'x53_uplift_increase',
 'x54_uplift_increase',
 'x55_uplift_increase',
 'x56_uplift_increase',
 'x57_uplift_increase',
 'x58_uplift_increase',
 'x59_increase_mix',
 'x60_uplift_decrease',
 'x61_uplift_decrease',
 'x62_uplift_decrease',
 'x63_uplift_decrease']

Feature selection with filter methods

method = F (F filter)
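
For each feature, the F filter fits a linear regression of the outcome on the feature, the treatment indicator, and their interaction, and ranks features by the F-statistic of the interaction term; the order argument raises the polynomial order of the interaction terms being tested (Zhao et al., 2020).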

[12]:
filter_method = FilterSelect()
[13]:
# F Filter with order 1
method = 'F'
f_imp = filter_method.get_importance(df, X_names, y_name, method,
                                     treatment_group='treatment1')
f_imp.head()
[13]:
method feature rank score p_value misc
0 F filter x53_uplift_increase 1.0 190.321410 4.262512e-43 df_num: 1.0, df_denom: 19996.0, order:1
0 F filter x57_uplift_increase 2.0 127.136380 2.127676e-29 df_num: 1.0, df_denom: 19996.0, order:1
0 F filter x3_informative 3.0 66.273458 4.152970e-16 df_num: 1.0, df_denom: 19996.0, order:1
0 F filter x4_informative 4.0 59.407590 1.341417e-14 df_num: 1.0, df_denom: 19996.0, order:1
0 F filter x62_uplift_decrease 5.0 3.957507 4.667636e-02 df_num: 1.0, df_denom: 19996.0, order:1
[14]:
# F Filter with order 2
method = 'F'
f_imp = filter_method.get_importance(df, X_names, y_name, method,
                                     treatment_group='treatment1', order=2)
f_imp.head()
[14]:
method feature rank score p_value misc
0 F filter x53_uplift_increase 1.0 107.368286 4.160720e-47 df_num: 2.0, df_denom: 19994.0, order:2
0 F filter x57_uplift_increase 2.0 70.138050 4.423736e-31 df_num: 2.0, df_denom: 19994.0, order:2
0 F filter x3_informative 3.0 36.499465 1.504356e-16 df_num: 2.0, df_denom: 19994.0, order:2
0 F filter x4_informative 4.0 31.780547 1.658731e-14 df_num: 2.0, df_denom: 19994.0, order:2
0 F filter x55_uplift_increase 5.0 27.494904 1.189886e-12 df_num: 2.0, df_denom: 19994.0, order:2
[15]:
# F Filter with order 3
method = 'F'
f_imp = filter_method.get_importance(df, X_names, y_name, method,
                                     treatment_group='treatment1', order=3)
f_imp.head()
[15]:
method feature rank score p_value misc
0 F filter x53_uplift_increase 1.0 72.064224 2.373628e-46 df_num: 3.0, df_denom: 19992.0, order:3
0 F filter x57_uplift_increase 2.0 46.841718 3.710784e-30 df_num: 3.0, df_denom: 19992.0, order:3
0 F filter x3_informative 3.0 24.089980 1.484634e-15 df_num: 3.0, df_denom: 19992.0, order:3
0 F filter x4_informative 4.0 23.097310 6.414267e-15 df_num: 3.0, df_denom: 19992.0, order:3
0 F filter x55_uplift_increase 5.0 18.072880 1.044117e-11 df_num: 3.0, df_denom: 19992.0, order:3

method = LR (likelihood ratio test)
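
The LR filter works analogously, but fits a logistic regression and ranks features by the likelihood-ratio test statistic for the interaction term(s).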

[16]:
# LR Filter with order 1
method = 'LR'
lr_imp = filter_method.get_importance(df, X_names, y_name, method,
                                      treatment_group='treatment1')
lr_imp.head()
[16]:
method feature rank score p_value misc
0 LR filter x53_uplift_increase 1.0 203.811674 0.000000e+00 df: 1, order: 1
0 LR filter x57_uplift_increase 2.0 133.175328 0.000000e+00 df: 1, order: 1
0 LR filter x3_informative 3.0 64.366711 9.992007e-16 df: 1, order: 1
0 LR filter x4_informative 4.0 52.389798 4.550804e-13 df: 1, order: 1
0 LR filter x62_uplift_decrease 5.0 4.064347 4.379760e-02 df: 1, order: 1
[17]:
# LR Filter with order 2
method = 'LR'
lr_imp = filter_method.get_importance(df, X_names, y_name, method,
                                      treatment_group='treatment1', order=2)
lr_imp.head()
[17]:
method feature rank score p_value misc
0 LR filter x53_uplift_increase 1.0 277.639095 0.000000e+00 df: 2, order: 2
0 LR filter x57_uplift_increase 2.0 156.134112 0.000000e+00 df: 2, order: 2
0 LR filter x55_uplift_increase 3.0 71.478979 3.330669e-16 df: 2, order: 2
0 LR filter x3_informative 4.0 44.938973 1.744319e-10 df: 2, order: 2
0 LR filter x4_informative 5.0 29.179971 4.609458e-07 df: 2, order: 2
[18]:
# LR Filter with order 3
method = 'LR'
lr_imp = filter_method.get_importance(df, X_names, y_name, method,
                                      treatment_group='treatment1', order=3)
lr_imp.head()
[18]:
method feature rank score p_value misc
0 LR filter x53_uplift_increase 1.0 290.389201 0.000000e+00 df: 3, order: 3
0 LR filter x57_uplift_increase 2.0 153.942614 0.000000e+00 df: 3, order: 3
0 LR filter x55_uplift_increase 3.0 70.626667 3.108624e-15 df: 3, order: 3
0 LR filter x3_informative 4.0 45.477851 7.323235e-10 df: 3, order: 3
0 LR filter x4_informative 5.0 30.466528 1.100881e-06 df: 3, order: 3

method = KL (KL divergence)
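
The KL filter discretizes each feature into n_bins bins and scores it by the divergence between the treatment and control conversion rates across the bins, weighted by bin population (see the sketch after the output table below).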

[19]:
method = 'KL'
kl_imp = filter_method.get_importance(df, X_names, y_name, method,
                                      treatment_group='treatment1',
                                      n_bins=10)
kl_imp.head()
[19]:
method feature rank score p_value misc
0 KL filter x53_uplift_increase 1.0 0.022997 NaN number_of_bins: 10
0 KL filter x57_uplift_increase 2.0 0.014884 NaN number_of_bins: 10
0 KL filter x4_informative 3.0 0.012103 NaN number_of_bins: 10
0 KL filter x3_informative 4.0 0.010179 NaN number_of_bins: 10
0 KL filter x55_uplift_increase 5.0 0.003836 NaN number_of_bins: 10
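
To make the score concrete, here is a rough re-computation of the KL score for the top feature, assuming equal-frequency bins and a population-weighted KL divergence between the per-bin treatment and control conversion rates as described in the paper; the library's exact binning may differ, so expect a result close to, but not necessarily identical to, the table above.

# Sketch: approximate KL filter score for one feature (assumptions above)
def kl_binary(p, q, eps=1e-6):
    # KL divergence between Bernoulli(p) and Bernoulli(q)
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

feature = 'x53_uplift_increase'
bins = pd.qcut(df[feature], q=10, duplicates='drop')  # 10 equal-frequency bins
score = 0.0
for _, g in df.groupby(bins, observed=True):
    w = len(g) / len(df)  # bin population weight
    p_t = g.loc[g['treatment_group_key'] == 'treatment1', y_name].mean()
    p_c = g.loc[g['treatment_group_key'] == 'control', y_name].mean()
    score += w * kl_binary(p_t, p_c)
print(score)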

We find that all three filter methods are able to rank most of the informative and uplift-increase features at the top.
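
To back this up, a quick count of how many of each filter's top 10 features are informative or uplift-increase by name (note that f_imp and lr_imp hold the order-3 results at this point):

for name, imp in [('F', f_imp), ('LR', lr_imp), ('KL', kl_imp)]:
    top10 = imp['feature'].head(10)
    n_relevant = top10.str.contains('informative|uplift_increase').sum()
    print(f'{name} filter: {n_relevant} of top 10 features are informative or uplift_increase')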

Performance evaluation

Evaluate the AUUC (Area Under the Uplift Curve) scores of several uplift models when trained on the top-feature datasets

[20]:
# train test split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)
[21]:
# convert treatment column to 1 (treatment1) and 0 (control)
treatments = np.where((df_test['treatment_group_key']=='treatment1'), 1, 0)
print(treatments[:10])
print(df_test['treatment_group_key'][:10])
[0 0 1 1 0 1 1 0 0 0]
18998       control
11536       control
8552     treatment1
2652     treatment1
19671       control
13244    treatment1
3075     treatment1
8746        control
18530       control
5066        control
Name: treatment_group_key, dtype: object

Uplift Random Forest Classifier

[22]:
uplift_model = UpliftRandomForestClassifier(control_name='control', max_depth=8)
[23]:
# using all features
features = X_names
uplift_model.fit(X = df_train[features].values,
                 treatment = df_train['treatment_group_key'].values,
                 y = df_train[y_name].values)
y_preds = uplift_model.predict(df_test[features].values)

Select the top N features based on the KL filter

[24]:
top_n = 10
top_10_features = kl_imp['feature'][:top_n]
print(top_10_features)
0    x53_uplift_increase
0    x57_uplift_increase
0         x4_informative
0         x3_informative
0    x55_uplift_increase
0         x1_informative
0    x56_uplift_increase
0    x51_uplift_increase
0         x38_irrelevant
0    x58_uplift_increase
Name: feature, dtype: object
[25]:
top_n = 15
top_15_features = kl_imp['feature'][:top_n]
print(top_15_features)
0    x53_uplift_increase
0    x57_uplift_increase
0         x4_informative
0         x3_informative
0    x55_uplift_increase
0         x1_informative
0    x56_uplift_increase
0    x51_uplift_increase
0         x38_irrelevant
0    x58_uplift_increase
0         x48_irrelevant
0         x15_irrelevant
0         x27_irrelevant
0    x62_uplift_decrease
0         x23_irrelevant
Name: feature, dtype: object
[26]:
top_n = 20
top_20_features = kl_imp['feature'][:top_n]
print(top_20_features)
0    x53_uplift_increase
0    x57_uplift_increase
0         x4_informative
0         x3_informative
0    x55_uplift_increase
0         x1_informative
0    x56_uplift_increase
0    x51_uplift_increase
0         x38_irrelevant
0    x58_uplift_increase
0         x48_irrelevant
0         x15_irrelevant
0         x27_irrelevant
0    x62_uplift_decrease
0         x23_irrelevant
0         x29_irrelevant
0         x6_informative
0         x45_irrelevant
0         x40_irrelevant
0         x25_irrelevant
Name: feature, dtype: object

Train the uplift model again using the top N features

[27]:
# using top 10 features
features = top_10_features

uplift_model.fit(X = df_train[features].values,
                 treatment = df_train['treatment_group_key'].values,
                 y = df_train[y_name].values)
y_preds_t10 = uplift_model.predict(df_test[features].values)
[28]:
# using top 15 features
features = top_15_features

uplift_model.fit(X = df_train[features].values,
                 treatment = df_train['treatment_group_key'].values,
                 y = df_train[y_name].values)
y_preds_t15 = uplift_model.predict(df_test[features].values)
[29]:
# using top 20 features
features = top_20_features

uplift_model.fit(X = df_train[features].values,
                 treatment = df_train['treatment_group_key'].values,
                 y = df_train[y_name].values)
y_preds_t20 = uplift_model.predict(df_test[features].values)
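
Before y_preds and the y_preds_t* arrays are overwritten by the meta-learners below, the uplift forest can already be scored with the auuc_score helper imported at the top. A minimal sketch (the model-name column labels are our own):

# Sketch: AUUC of the uplift random forest on the test set.
# Each prediction array has one column per treatment group (here only treatment1).
df_preds = pd.DataFrame({
    'All': y_preds.flatten(),
    'Top 10': y_preds_t10.flatten(),
    'Top 15': y_preds_t15.flatten(),
    'Top 20': y_preds_t20.flatten(),
    y_name: df_test[y_name].values,
    'treatment': treatments,  # 1 = treatment1, 0 = control (built above)
})
print(auuc_score(df_preds, outcome_col=y_name, treatment_col='treatment'))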

R Learner with a Random Forest Regressor as the base learner

[32]:
r_rf_learner = BaseRRegressor(
    RandomForestRegressor(
        n_estimators = 100,
        max_depth = 8,
        min_samples_leaf = 100
    ),
control_name='control')
[33]:
# using all features
features = X_names
r_rf_learner.fit(X = df_train[features].values,
                 treatment = df_train['treatment_group_key'].values,
                 y = df_train[y_name].values)
y_preds = r_rf_learner.predict(df_test[features].values)
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment1 with R-loss
[34]:
# using top 10 features
features = top_10_features
r_rf_learner.fit(X = df_train[features].values,
                 treatment = df_train['treatment_group_key'].values,
                 y = df_train[y_name].values)
y_preds_t10 = r_rf_learner.predict(df_test[features].values)
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment1 with R-loss
[35]:
# using top 15 features
features = top_15_features
r_rf_learner.fit(X = df_train[features].values,
                 treatment = df_train['treatment_group_key'].values,
                 y = df_train[y_name].values)
y_preds_t15 = r_rf_learner.predict(df_test[features].values)
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment1 with R-loss
[36]:
# using top 20 features
features = top_20_features
r_rf_learner.fit(X = df_train[features].values,
                 treatment = df_train['treatment_group_key'].values,
                 y = df_train[y_name].values)
y_preds_t20 = r_rf_learner.predict(df_test[features].values)
INFO:causalml:Generating propensity score
INFO:causalml:Calibrating propensity scores.
INFO:causalml:generating out-of-fold CV outcome estimates
INFO:causalml:training the treatment effect model for treatment1 with R-loss

S Learner with a Random Forest Regressor as the base learner

[39]:
slearner_rf = BaseSRegressor(
    RandomForestRegressor(
        n_estimators = 100,
        max_depth = 8,
        min_samples_leaf = 100
    ),
    control_name='control')
[40]:
# using all features
features = X_names
slearner_rf.fit(X = df_train[features].values,
                treatment = df_train['treatment_group_key'].values,
                y = df_train[y_name].values)
y_preds = slearner_rf.predict(df_test[features].values)
[41]:
# using top 10 features
features = top_10_features
slearner_rf.fit(X = df_train[features].values,
                treatment = df_train['treatment_group_key'].values,
                y = df_train[y_name].values)
y_preds_t10 = slearner_rf.predict(df_test[features].values)
[42]:
# using top 15 features
features = top_15_features
slearner_rf.fit(X = df_train[features].values,
                treatment = df_train['treatment_group_key'].values,
                y = df_train[y_name].values)
y_preds_t15 = slearner_rf.predict(df_test[features].values)
[43]:
# using top 20 features
features = top_20_features
slearner_rf.fit(X = df_train[features].values,
                treatment = df_train['treatment_group_key'].values,
                y = df_train[y_name].values)
y_preds_t20 = slearner_rf.predict(df_test[features].values)
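
The same scoring pattern applies to the R learner and the S learner: collect the prediction vectors, attach the outcome and the binary treatment flag, then compute AUUC and plot the cumulative gain with the helpers imported at the top. A sketch for the S-learner predictions just computed:

# Sketch: AUUC and gain curves for the S learner with all / top-N feature sets
df_preds = pd.DataFrame({
    'All': y_preds.flatten(),
    'Top 10': y_preds_t10.flatten(),
    'Top 15': y_preds_t15.flatten(),
    'Top 20': y_preds_t20.flatten(),
    y_name: df_test[y_name].values,
    'treatment': treatments,
})
print(auuc_score(df_preds, outcome_col=y_name, treatment_col='treatment'))
plot_gain(df_preds, outcome_col=y_name, treatment_col='treatment')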