基于逻辑回归的数据生成函数用于提升分类问题
此数据生成函数使用逻辑回归作为基础的数据生成模型。该函数能够更好地控制特征模式:特征如何与结果基线和治疗效果相关联。它支持6种不同的模式:线性、二次、三次、Relu、正弦和余弦。
本笔记本展示了如何使用此数据生成函数来生成数据,并可视化特征模式。
[1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
导入数据生成函数
[2]:
from causalml.dataset import make_uplift_classification_logistic
The sklearn.utils.testing module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.utils. Anything that cannot be imported from sklearn.utils is now part of the private API.
生成数据
[47]:
df, feature_name = make_uplift_classification_logistic( n_samples=100000,
treatment_name=['control', 'treatment1', 'treatment2', 'treatment3'],
y_name='conversion',
n_classification_features=10,
n_classification_informative=5,
n_classification_redundant=0,
n_classification_repeated=0,
n_uplift_dict={'treatment1': 2, 'treatment2': 2, 'treatment3': 3},
n_mix_informative_uplift_dict={'treatment1': 1, 'treatment2': 1, 'treatment3': 0},
delta_uplift_dict={'treatment1': 0.05, 'treatment2': 0.02, 'treatment3': -0.05},
feature_association_list = ['linear','quadratic','cubic','relu','sin','cos'],
random_select_association = False,
random_seed=20200416
)
[48]:
df.head()
[48]:
| treatment_group_key | x1_informative | x1_informative_transformed | x2_informative | x2_informative_transformed | x3_informative | x3_informative_transformed | x4_informative | x4_informative_transformed | x5_informative | ... | conversion_prob | control_conversion_prob | control_true_effect | treatment1_conversion_prob | treatment1_true_effect | treatment2_conversion_prob | treatment2_true_effect | treatment3_conversion_prob | treatment3_true_effect | conversion | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | treatment1 | -0.194205 | -0.192043 | 1.791408 | 1.572609 | 0.678028 | 0.080696 | -0.169306 | -0.683035 | -1.837155 | ... | 0.126770 | 0.076138 | 0.0 | 0.126770 | 0.050632 | 0.087545 | 0.011407 | 0.029396 | -0.046742 | 0 |
| 1 | treatment1 | -0.898070 | -0.894462 | 0.252125 | -0.663393 | -0.842844 | -0.156004 | -0.047769 | -0.683035 | -0.251752 | ... | 0.064278 | 0.070799 | 0.0 | 0.064278 | -0.006522 | 0.101076 | 0.030277 | 0.050778 | -0.020021 | 0 |
| 2 | treatment1 | 0.701002 | 0.701325 | 0.239320 | -0.667867 | 1.700766 | 1.278676 | -0.734568 | -0.683035 | -1.130113 | ... | 0.018480 | 0.014947 | 0.0 | 0.018480 | 0.003534 | 0.018055 | 0.003109 | 0.019327 | 0.004380 | 0 |
| 3 | control | -1.653684 | -1.648524 | -0.119123 | -0.698492 | -0.037645 | -0.000355 | 0.687429 | 0.495943 | -1.427400 | ... | 0.102799 | 0.102799 | 0.0 | 0.101410 | -0.001390 | 0.040230 | -0.062569 | 0.030753 | -0.072046 | 0 |
| 4 | treatment3 | 1.057909 | 1.057498 | -2.019523 | 2.190564 | -0.950180 | -0.223370 | -1.505741 | -0.683035 | -0.399457 | ... | 0.012964 | 0.106241 | 0.0 | 0.171309 | 0.065068 | 0.114526 | 0.008285 | 0.012964 | -0.093277 | 0 |
5 行 × 47 列
[49]:
feature_name
[49]:
['x1_informative', 'x2_informative', 'x3_informative', 'x4_informative', 'x5_informative', 'x6_irrelevant', 'x7_irrelevant', 'x8_irrelevant', 'x9_irrelevant', 'x10_irrelevant', 'x11_uplift', 'x12_uplift', 'x13_uplift', 'x14_uplift', 'x15_uplift', 'x16_uplift', 'x17_uplift', 'x18_mix', 'x19_mix']
实验组平均值
[50]:
df.groupby(['treatment_group_key'])['conversion'].mean()
[50]:
treatment_group_key
control 0.09896
treatment1 0.15088
treatment2 0.12042
treatment3 0.04972
Name: conversion, dtype: float64
可视化特征模式
[51]:
# Extract control and treatment1 for illustration
treatment_group_keys = ['control','treatment1']
y_name='conversion'
df1 = df[df['treatment_group_key'].isin(treatment_group_keys)].reset_index(drop=True)
df1.groupby(['treatment_group_key'])['conversion'].mean()
[51]:
treatment_group_key
control 0.09896
treatment1 0.15088
Name: conversion, dtype: float64
[53]:
color_dict = {'control':'#2471a3','treatment1':'#FF5733','treatment2':'#5D6D7E'
,'treatment3':'#34495E','treatment4':'#283747'}
hatch_dict = {'control':'','treatment1':'//'}
x_name_plot = ['x11_uplift', 'x12_uplift', 'x2_informative', 'x5_informative']
x_new_name_plot = ['Uplift Feature 1', 'Uplift Feature 2', 'Classification Feature 1','Classification Feature 2']
opacity = 0.8
plt.figure(figsize=(20, 3))
subplot_list = [141,142,143,144]
counter = 0
bar_width = 0.9/len(treatment_group_keys)
for x_name_i in x_name_plot:
bins = np.percentile(df1[x_name_i].values, np.linspace(0, 100, 11))[:-1]
df1['x_bin'] = np.digitize(df1[x_name_i].values, bins)
df_gb = df1.groupby(['treatment_group_key','x_bin'],as_index=False)[y_name].mean()
plt.subplot(subplot_list[counter])
for ti in range(len(treatment_group_keys)):
x_index = [ti * bar_width - len(treatment_group_keys)/2*bar_width + xi for xi in range(10)]
plt.bar(x_index,
df_gb[df_gb['treatment_group_key']==treatment_group_keys[ti]][y_name].values,
bar_width,
alpha=opacity,
color=color_dict[treatment_group_keys[ti]],
hatch = hatch_dict[treatment_group_keys[ti]],
label=treatment_group_keys[ti]
)
plt.xticks(range(10), [int(xi+10) for xi in np.linspace(0, 100, 11)[:-1]])
plt.xlabel(x_new_name_plot[counter],fontsize=16)
plt.ylabel('Conversion',fontsize=16)
#plt.title(x_name_i)
if counter == 0:
plt.legend(treatment_group_keys, loc=2,fontsize=16)
plt.ylim([0.,0.3])
counter+=1
在上图中,Uplift Feature 1 对处理效果呈现线性模式,Uplift Feature 2 对处理效果呈现二次模式,Classification Feature 1 对处理和对照的基线呈现二次模式,而 Classification Feature 2 对处理和对照的基线呈现正弦模式。
[ ]: