对Lalonde和IHDP数据集应用反驳测试#

导入依赖项#

[1]:

import dowhy
from dowhy import CausalModel
import pandas as pd
import numpy as np

# Config dict to set the logging level
import logging.config
DEFAULT_LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'loggers': {
        '': {
            'level': 'WARN',
        },
    }
}

logging.config.dictConfig(DEFAULT_LOGGING)
# Disabling warnings output
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

加载数据集#

婴儿健康与发展计划数据集 (IHDP)#

使用的测量数据包括儿童的出生体重、头围、早产周数、出生顺序、是否为第一胎、新生儿健康指数（参见Scott和Bauer 1989年）、性别、双胞胎状态，以及怀孕期间的行为——吸烟、饮酒、吸毒，以及母亲在分娩时的测量数据——年龄、婚姻状况、教育程度（未高中毕业、高中毕业、上过大学但未毕业、大学毕业）、怀孕期间是否工作、是否接受过产前护理，以及干预开始时家庭居住的地点（共8个）。共有6个连续协变量和19个二元协变量。

参考#

希尔, J. L. (2011). 贝叶斯非参数建模用于因果推断. 计算与图形统计杂志, 20(1), 217-240. https://doi.org/10.1198/jcgs.2010.08162

[2]:

data = pd.read_csv("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/IHDP/csv/ihdp_npci_1.csv", header = None)
col =  ["treatment", "y_factual", "y_cfactual", "mu0", "mu1" ,]
for i in range(1,26):
    col.append("x"+str(i))
data.columns = col
data = data.astype({"treatment":'bool'}, copy=False)
data.head()

[2]:

	治疗	实际结果	反事实结果	mu0	mu1	x1	x2	x3	x4	x5	...	x16	x17	x18	x19
0	True	5.599916	4.318780	3.268256	6.854457	-0.528603	-0.343455	1.128554	0.161703	-0.316603	...	1	1	1	1
1	False	6.875856	7.856495	6.636059	7.562718	-1.736945	-1.802002	0.383828	2.244320	-0.629189	...	1	1	1	1
2	False	2.996273	6.633952	1.570536	6.121617	-0.807451	-0.202946	-0.360898	-0.879606	0.808706	...	1	0	1	1
3	False	1.366206	5.697239	1.244738	5.889125	0.390083	0.596582	-1.850350	-0.879606	-0.004017	...	1	0	1	1
4	假	1.963538	6.202582	1.685048	6.191994	-1.045229	-0.602710	0.011465	0.161703	0.683672	...	1	1	1	1

5 行 × 30 列

Lalonde 数据集#

一个包含445个观测值和以下12个变量的数据框。

age: 年龄，单位为岁。
educ: 受教育年限。
black: 黑人指标变量。
hisp: 西班牙裔的指示变量。
已婚：婚姻状况的指示变量。
nodegr: 高中文凭的指示变量。
re74: 1974年的实际收入。
re75: 1975年的实际收入。
re78: 1978年的实际收入。
u74: 1974年收入为零的指示变量。
u75: 1975年收入为零的指示变量。
treat: 一个指示变量，用于表示处理状态。

参考文献#

Dehejia, Rajeev 和 Sadek Wahba. 1999. ``非实验研究中的因果效应：重新评估培训项目的评估.’’ 美国统计协会杂志 94 (448): 1053-1062.

LaLonde, Robert. 1986. ``评估培训项目的计量经济学评估.’’ 美国经济评论 76:604-620.

[3]:

import dowhy.datasets

lalonde = dowhy.datasets.lalonde_dataset()

lalonde.head()

[3]:

	处理	年龄	教育	黑人	无学位	1978年收入	1974年失业	1975年失业
0	假	23.0	10.0	1.0	1.0	0.00	1.0	1.0
1	假	26.0	12.0	0.0	0.0	12383.68	1.0	1.0
2	假	22.0	9.0	1.0	1.0	0.00	1.0	1.0
3	假	18.0	9.0	1.0	1.0	10740.08	1.0	1.0
4	假	45.0	11.0	1.0	1.0	11796.47	1.0	1.0

第一步：构建模型#

IHDP#

[4]:

# Create a causal model from the data and given common causes

common_causes = []

for i in range(1, 26):
    common_causes += ["x"+str(i)]

ihdp_model = CausalModel(
                data=data,
                treatment='treatment',
                outcome='y_factual',
                common_causes=common_causes
            )
ihdp_model.view_model(layout="dot")
from IPython.display import Image, display
display(Image(filename="causal_model.png"))

../_images/example_notebooks_dowhy_refutation_testing_12_0.png

../_images/example_notebooks_dowhy_refutation_testing_12_1.png

拉隆德#

[5]:

lalonde_model = CausalModel(
                data=lalonde,
                treatment='treat',
                outcome='re78',
                common_causes='nodegr+black+hisp+age+educ+married'.split('+')
            )
lalonde_model.view_model(layout="dot")
from IPython.display import Image, display
display(Image(filename="causal_model.png"))

../_images/example_notebooks_dowhy_refutation_testing_14_0.png

../_images/example_notebooks_dowhy_refutation_testing_14_1.png

步骤2：识别#

IHDP#

[6]:

#Identify the causal effect for the ihdp dataset
ihdp_identified_estimand = ihdp_model.identify_effect(proceed_when_unidentifiable=True)
print(ihdp_identified_estimand)

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
     d                                                                         ↪
────────────(E[y_factual|x25,x21,x12,x6,x18,x3,x15,x16,x8,x4,x11,x2,x24,x17,x2 ↪
d[treatment]                                                                   ↪

↪
↪ 3,x7,x9,x5,x22,x13,x19,x1,x20,x10,x14])
↪
Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→y_factual then P(y_factual|treatment,x25,x21,x12,x6,x18,x3,x15,x16,x8,x4,x11,x2,x24,x17,x23,x7,x9,x5,x22,x13,x19,x1,x20,x10,x14,U) = P(y_factual|treatment,x25,x21,x12,x6,x18,x3,x15,x16,x8,x4,x11,x2,x24,x17,x23,x7,x9,x5,x22,x13,x19,x1,x20,x10,x14)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

拉隆德#

[7]:

#Identify the causal effect for the lalonde dataset
lalonde_identified_estimand = lalonde_model.identify_effect(proceed_when_unidentifiable=True)
print(lalonde_identified_estimand)

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
   d
────────(E[re78|age,nodegr,educ,married,black,hisp])
d[treat]
Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78 then P(re78|treat,age,nodegr,educ,married,black,hisp,U) = P(re78|treat,age,nodegr,educ,married,black,hisp)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

步骤3：估计（使用倾向得分加权）#

IHDP#

[8]:

ihdp_estimate = ihdp_model.estimate_effect(
                    ihdp_identified_estimand,
                    method_name="backdoor.propensity_score_weighting"
                )

print("The Causal Estimate is " + str(ihdp_estimate.value))

The Causal Estimate is 4.028748218389962

拉隆德#

[9]:

lalonde_estimate = lalonde_model.estimate_effect(
                        lalonde_identified_estimand,
                        method_name="backdoor.propensity_score_weighting"
                    )

print("The Causal Estimate is " + str(lalonde_estimate.value))

The Causal Estimate is 1639.8089610946918

步骤4：反驳#

IHDP#

添加随机共同原因#

[10]:

ihdp_refute_random_common_cause = ihdp_model.refute_estimate(
                                        ihdp_identified_estimand,
                                        ihdp_estimate,
                                        method_name="random_common_cause"
                                    )

print(ihdp_refute_random_common_cause)

Refute: Add a random common cause
Estimated effect:4.028748218389962
New effect:4.028748218389964
p value:1.0

用安慰剂替代治疗#

[11]:

ihdp_refute_placebo_treatment = ihdp_model.refute_estimate(
                                    ihdp_identified_estimand,
                                    ihdp_estimate,
                                    method_name="placebo_treatment_refuter",
                                    placebo_type="permute"
                                )

print(ihdp_refute_placebo_treatment)

Refute: Use a Placebo Treatment
Estimated effect:4.028748218389962
New effect:-0.4415444471465237
p value:0.02

移除数据的随机子集#

[12]:

ihdp_refute_random_subset = ihdp_model.refute_estimate(
                                    ihdp_identified_estimand,
                                    ihdp_estimate,
                                    method_name="data_subset_refuter",
                                    subset_fraction=0.8
                            )
print(ihdp_refute_random_subset)

Refute: Use a subset of data
Estimated effect:4.028748218389962
New effect:4.03306148112277
p value:0.96

拉隆德#

添加随机共同原因#

[13]:

lalonde_refute_random_common_cause = lalonde_model.refute_estimate(
                                            lalonde_identified_estimand,
                                            lalonde_estimate,
                                            method_name="random_common_cause"
                                        )

print(lalonde_refute_random_common_cause)

Refute: Add a random common cause
Estimated effect:1639.8089610946918
New effect:1639.8089610946924
p value:1.0

用安慰剂替代治疗#

[14]:

lalonde_refute_placebo_treatment = lalonde_model.refute_estimate(
                                        lalonde_identified_estimand,
                                        lalonde_estimate,
                                        method_name="placebo_treatment_refuter",
                                        placebo_type="permute"
                                    )

print(lalonde_refute_placebo_treatment)

Refute: Use a Placebo Treatment
Estimated effect:1639.8089610946918
New effect:-156.25759930326174
p value:0.82

移除数据的随机子集#

[15]:

lalonde_refute_random_subset = lalonde_model.refute_estimate(
                                    lalonde_identified_estimand,
                                    lalonde_estimate,
                                    method_name="data_subset_refuter",
                                    subset_fraction=0.9
                                )

print(lalonde_refute_random_subset)

Refute: Use a subset of data
Estimated effect:1639.8089610946918
New effect:1656.1546989508256
p value:0.8600000000000001