加权广义线性模型¶
[1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
加权GLM:泊松响应数据¶
加载数据¶
在这个例子中,我们将使用affair数据集,利用一些外生变量来预测婚外情发生率。
将生成权重以显示 freq_weights
等同于重复数据记录。另一方面,var_weights
等同于聚合数据。
[2]:
print(sm.datasets.fair.NOTE)
::
Number of observations: 6366
Number of variables: 9
Variable name definitions:
rate_marriage : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,
4 = good, 5 = very good
age : Age
yrs_married : No. years married. Interval approximations. See
original paper for detailed explanation.
children : No. children
religious : How relgious, 1 = not, 2 = mildly, 3 = fairly,
4 = strongly
educ : Level of education, 9 = grade school, 12 = high
school, 14 = some college, 16 = college graduate,
17 = some graduate school, 20 = advanced degree
occupation : 1 = student, 2 = farming, agriculture; semi-skilled,
or unskilled worker; 3 = white-colloar; 4 = teacher
counselor social worker, nurse; artist, writers;
technician, skilled worker, 5 = managerial,
administrative, business, 6 = professional with
advanced degree
occupation_husb : Husband's occupation. Same as occupation.
affairs : measure of time spent in extramarital affairs
See the original paper for more details.
将数据加载到 pandas 数据框中。
[3]:
data = sm.datasets.fair.load_pandas().data
因变量(内生变量)是 affairs
[4]:
data.describe()
[4]:
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|
count | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 | 6366.000000 |
mean | 4.109645 | 29.082862 | 9.009425 | 1.396874 | 2.426170 | 14.209865 | 3.424128 | 3.850141 | 0.705374 |
std | 0.961430 | 6.847882 | 7.280120 | 1.433471 | 0.878369 | 2.178003 | 0.942399 | 1.346435 | 2.203374 |
min | 1.000000 | 17.500000 | 0.500000 | 0.000000 | 1.000000 | 9.000000 | 1.000000 | 1.000000 | 0.000000 |
25% | 4.000000 | 22.000000 | 2.500000 | 0.000000 | 2.000000 | 12.000000 | 3.000000 | 3.000000 | 0.000000 |
50% | 4.000000 | 27.000000 | 6.000000 | 1.000000 | 2.000000 | 14.000000 | 3.000000 | 4.000000 | 0.000000 |
75% | 5.000000 | 32.000000 | 16.500000 | 2.000000 | 3.000000 | 16.000000 | 4.000000 | 5.000000 | 0.484848 |
max | 5.000000 | 42.000000 | 23.000000 | 5.500000 | 4.000000 | 20.000000 | 6.000000 | 6.000000 | 57.599991 |
[5]:
data[:3]
[5]:
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 0.111111 |
1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 3.230769 |
2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 1.400000 |
在下面我们将主要使用泊松分布。虽然使用十进制事务也可以,但我们将它们转换为整数以获得计数分布。
[6]:
data["affairs"] = np.ceil(data["affairs"])
data[:3]
[6]:
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | |
---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 1.0 |
1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 4.0 |
2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 2.0 |
[7]:
(data["affairs"] == 0).mean()
[7]:
np.float64(0.6775054979579014)
[8]:
np.bincount(data["affairs"].astype(int))
[8]:
array([4313, 934, 488, 180, 130, 172, 7, 21, 67, 2, 0,
0, 17, 0, 0, 0, 3, 12, 8, 0, 0, 0,
0, 0, 2, 2, 2, 3, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1])
压缩和聚合观测数据¶
我们的原始数据集中有6366个观测值。当我们只考虑一些选定的变量时,我们会有较少的唯一观测值。接下来,我们以两种方式合并观测值,首先我们合并所有变量值相同的观测值,其次我们合并具有相同解释变量的观测值。
具有唯一观测值的数据集¶
我们使用 pandas 的 groupby 来合并相同的观察值,并创建一个新变量 freq
,用于计算对应行中具有这些值的观察次数。
[9]:
data2 = data.copy()
data2["const"] = 1
dc = (
data2["affairs rate_marriage age yrs_married const".split()]
.groupby("affairs rate_marriage age yrs_married".split())
.count()
)
dc.reset_index(inplace=True)
dc.rename(columns={"const": "freq"}, inplace=True)
print(dc.shape)
dc.head()
(476, 5)
[9]:
affairs | rate_marriage | age | yrs_married | freq | |
---|---|---|---|---|---|
0 | 0.0 | 1.0 | 17.5 | 0.5 | 1 |
1 | 0.0 | 1.0 | 22.0 | 2.5 | 3 |
2 | 0.0 | 1.0 | 27.0 | 2.5 | 1 |
3 | 0.0 | 1.0 | 27.0 | 6.0 | 5 |
4 | 0.0 | 1.0 | 27.0 | 9.0 | 1 |
具有唯一解释变量的数据集(外生变量)¶
对于下一个数据集,我们将具有相同解释变量值的观测值进行合并。然而,由于响应变量在合并的观测值之间可能不同,我们计算所有合并观测值的响应变量的均值和总和。
我们再次使用 pandas groupby
来合并观察结果并创建新变量。我们还把 MultiIndex
转换为简单的索引。
[10]:
gr = data["affairs rate_marriage age yrs_married".split()].groupby(
"rate_marriage age yrs_married".split()
)
df_a = gr.agg(["mean", "sum", "count"])
def merge_tuple(tpl):
if isinstance(tpl, tuple) and len(tpl) > 1:
return "_".join(map(str, tpl))
else:
return tpl
df_a.columns = df_a.columns.map(merge_tuple)
df_a.reset_index(inplace=True)
print(df_a.shape)
df_a.head()
(130, 6)
[10]:
rate_marriage | age | yrs_married | affairs_mean | affairs_sum | affairs_count | |
---|---|---|---|---|---|---|
0 | 1.0 | 17.5 | 0.5 | 0.000000 | 0.0 | 1 |
1 | 1.0 | 22.0 | 2.5 | 3.900000 | 39.0 | 10 |
2 | 1.0 | 27.0 | 2.5 | 3.400000 | 17.0 | 5 |
3 | 1.0 | 27.0 | 6.0 | 0.900000 | 9.0 | 10 |
4 | 1.0 | 27.0 | 9.0 | 1.333333 | 4.0 | 3 |
在将观测值合并后,我们得到了一个包含467个唯一观测值的数据框dc
,以及一个包含130个观测值的数据框df_a
,其中包含解释变量的唯一值。
[11]:
print("number of rows: \noriginal, with unique observations, with unique exog")
data.shape[0], dc.shape[0], df_a.shape[0]
number of rows:
original, with unique observations, with unique exog
[11]:
(6366, 476, 130)
分析¶
在下文中,我们将原始数据的GLM-Poisson结果与结合了多重性或聚合的观测值的模型进行比较,其中多重性或聚合由权重或暴露给出。
原始数据¶
[12]:
glm = smf.glm(
"affairs ~ rate_marriage + age + yrs_married",
data=data,
family=sm.families.Poisson(),
)
res_o = glm.fit()
print(res_o.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 6366
Model: GLM Df Residuals: 6362
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Wed, 16 Oct 2024 Deviance: 15375.
Time: 18:27:14 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.2420
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[13]:
res_o.pearson_chi2 / res_o.df_resid
[13]:
np.float64(5.078702313363161)
浓缩数据(具有频率的唯一观测值)¶
结合相同的观测值并使用频率权重来考虑观测值的重复性,会产生完全相同的结果。当我们希望获得关于观测值的信息而不是所有相同观测值的聚合信息时,某些结果属性将有所不同。例如,残差不会考虑freq_weights
。
[14]:
glm = smf.glm(
"affairs ~ rate_marriage + age + yrs_married",
data=dc,
family=sm.families.Poisson(),
freq_weights=np.asarray(dc["freq"]),
)
res_f = glm.fit()
print(res_f.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 6362
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Wed, 16 Oct 2024 Deviance: 15375.
Time: 18:27:14 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9754
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[15]:
res_f.pearson_chi2 / res_f.df_resid
[15]:
np.float64(5.078702313363207)
使用 var_weights
而不是 freq_weights
¶
接下来,我们将var_weights
与freq_weights
进行比较。当内生变量反映的是平均值而非相同观测值时,通常会使用var_weights
。我没有看到理论上有任何理由说明它在一般情况下会产生相同的结果。
这产生了相同的结果,但df_resid
与freq_weights
示例不同,因为var_weights
不会改变有效观察的数量。
[16]:
glm = smf.glm(
"affairs ~ rate_marriage + age + yrs_married",
data=dc,
family=sm.families.Poisson(),
var_weights=np.asarray(dc["freq"]),
)
res_fv = glm.fit()
print(res_fv.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 472
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Wed, 16 Oct 2024 Deviance: 15375.
Time: 18:27:14 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9754
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
由于错误的 df_resid
,从结果计算的离散度是不正确的。如果我们使用原始的 df_resid
,则是正确的。
[17]:
res_fv.pearson_chi2 / res_fv.df_resid, res_f.pearson_chi2 / res_f.df_resid
[17]:
(np.float64(68.45488160512018), np.float64(5.078702313363207))
聚合或平均数据(解释变量的唯一值)¶
在这些情况下,我们将具有相同解释变量值的观测值进行合并。对应的响应变量可以是总和或平均值。
使用 exposure
¶
如果我们的因变量是所有合并观测的响应之和,那么在泊松假设下,分布保持不变,但我们有不同的暴露
,由一个聚合观测所代表的个体数量给出。
参数估计和参数协方差与原始数据相同,但对数似然、偏差和皮尔逊卡方值不同
[18]:
glm = smf.glm(
"affairs_sum ~ rate_marriage + age + yrs_married",
data=df_a,
family=sm.families.Poisson(),
exposure=np.asarray(df_a["affairs_count"]),
)
res_e = glm.fit()
print(res_e.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_sum No. Observations: 130
Model: GLM Df Residuals: 126
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -740.75
Date: Wed, 16 Oct 2024 Deviance: 967.46
Time: 18:27:14 Pearson chi2: 926.
No. Iterations: 6 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[19]:
res_e.pearson_chi2 / res_e.df_resid
[19]:
np.float64(7.350789109179585)
使用 var_weights¶
我们也可以使用因变量所有组合值的平均值。在这种情况下,方差将与一个组合观测所反映的总暴露的倒数相关。
[20]:
glm = smf.glm(
"affairs_mean ~ rate_marriage + age + yrs_married",
data=df_a,
family=sm.families.Poisson(),
var_weights=np.asarray(df_a["affairs_count"]),
)
res_a = glm.fit()
print(res_a.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_mean No. Observations: 130
Model: GLM Df Residuals: 126
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -5954.2
Date: Wed, 16 Oct 2024 Deviance: 967.46
Time: 18:27:14 Pearson chi2: 926.
No. Iterations: 5 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
比较¶
我们在上面的摘要打印中看到,params
和 cov_params
以及相关的 Wald 推断在不同版本之间是一致的。我们通过比较不同版本中的单个结果属性来总结这一点。
参数估计 params
,参数的标准误差 bse
和参数的 pvalues
对于参数为零的检验结果一致。然而,似然和拟合优度统计量 llf
、deviance
和 pearson_chi2
仅部分一致。具体来说,聚合版本与使用原始数据的结果不一致。
警告:llf
、deviance
和 pearson_chi2
的行为在未来的版本中可能仍会发生变化。
响应变量在解释变量的唯一值下的总和和平均值都具有适当的似然解释。然而,这种解释并未在这些三个统计量中体现出来。从计算角度来看,这可能是由于在使用聚合数据时缺少调整。然而,从理论上讲,我们可以考虑这些情况,特别是在似然分析不合适且结果应解释为拟似然估计的情况下,对于var_weights
的错误指定情况。var_weights
的定义存在歧义,因为它们既可以用于正确指定似然的平均值,也可以用于拟似然情况下的方差调整。我们目前并未尝试匹配似然规范。然而,在下一节中,我们将展示,当我们假设基础模型正确指定时,似然比类型的检验仍然为所有聚合版本产生相同的结果。
[21]:
results_all = [res_o, res_f, res_e, res_a]
names = "res_o res_f res_e res_a".split()
[22]:
pd.concat([r.params for r in results_all], axis=1, keys=names)
[22]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
Intercept | 2.715533 | 2.715533 | 2.715533 | 2.715533 |
rate_marriage | -0.495180 | -0.495180 | -0.495180 | -0.495180 |
age | -0.029914 | -0.029914 | -0.029914 | -0.029914 |
yrs_married | -0.010763 | -0.010763 | -0.010763 | -0.010763 |
[23]:
pd.concat([r.bse for r in results_all], axis=1, keys=names)
[23]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
Intercept | 0.107360 | 0.107360 | 0.107360 | 0.107360 |
rate_marriage | 0.011874 | 0.011874 | 0.011874 | 0.011874 |
age | 0.004471 | 0.004471 | 0.004471 | 0.004471 |
yrs_married | 0.004294 | 0.004294 | 0.004294 | 0.004294 |
[24]:
pd.concat([r.pvalues for r in results_all], axis=1, keys=names)
[24]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
Intercept | 3.756282e-141 | 3.756280e-141 | 3.756282e-141 | 3.756282e-141 |
rate_marriage | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
age | 2.221918e-11 | 2.221918e-11 | 2.221918e-11 | 2.221918e-11 |
yrs_married | 1.219200e-02 | 1.219200e-02 | 1.219200e-02 | 1.219200e-02 |
[25]:
pd.DataFrame(
np.column_stack([[r.llf, r.deviance, r.pearson_chi2] for r in results_all]),
columns=names,
index=["llf", "deviance", "pearson chi2"],
)
[25]:
res_o | res_f | res_e | res_a | |
---|---|---|---|---|
llf | -10350.913296 | -10350.913296 | -740.748534 | -5954.219866 |
deviance | 15374.679054 | 15374.679054 | 967.455734 | 967.455734 |
pearson chi2 | 32310.704118 | 32310.704118 | 926.199428 | 926.199428 |
似然比类型检验¶
我们上面看到,似然性和相关统计量在聚合数据和原始个体数据之间并不一致。我们在下面说明,似然比检验和偏差差异在不同版本之间是一致的,然而皮尔逊卡方检验并不一致。
如前所述:这还不够清楚,可能会发生变化。
作为一个测试案例,我们删除age
变量,并计算似然比类型统计量,作为简化或受限模型与完整或无约束模型之间的差异。
原始观测值和频率权重¶
[26]:
glm = smf.glm(
"affairs ~ rate_marriage + yrs_married", data=data, family=sm.families.Poisson()
)
res_o2 = glm.fit()
# print(res_f2.summary())
res_o2.pearson_chi2 - res_o.pearson_chi2, res_o2.deviance - res_o.deviance, res_o2.llf - res_o.llf
[26]:
(np.float64(52.913431618326285),
np.float64(45.726693322503706),
np.float64(-22.863346661253672))
[27]:
glm = smf.glm(
"affairs ~ rate_marriage + yrs_married",
data=dc,
family=sm.families.Poisson(),
freq_weights=np.asarray(dc["freq"]),
)
res_f2 = glm.fit()
# print(res_f2.summary())
res_f2.pearson_chi2 - res_f.pearson_chi2, res_f2.deviance - res_f.deviance, res_f2.llf - res_f.llf
[27]:
(np.float64(52.91343161864643),
np.float64(45.726693322507344),
np.float64(-22.863346661253672))
聚合数据: exposure
和 var_weights
¶
注意:LR 检验与原始观测结果一致,pearson_chi2
不同且符号错误。
[28]:
glm = smf.glm(
"affairs_sum ~ rate_marriage + yrs_married",
data=df_a,
family=sm.families.Poisson(),
exposure=np.asarray(df_a["affairs_count"]),
)
res_e2 = glm.fit()
res_e2.pearson_chi2 - res_e.pearson_chi2, res_e2.deviance - res_e.deviance, res_e2.llf - res_e.llf
[28]:
(np.float64(-31.618527525112313),
np.float64(45.72669332250621),
np.float64(-22.863346661252535))
[29]:
glm = smf.glm(
"affairs_mean ~ rate_marriage + yrs_married",
data=df_a,
family=sm.families.Poisson(),
var_weights=np.asarray(df_a["affairs_count"]),
)
res_a2 = glm.fit()
res_a2.pearson_chi2 - res_a.pearson_chi2, res_a2.deviance - res_a.deviance, res_a2.llf - res_a.llf
[29]:
(np.float64(-31.61852752510174),
np.float64(45.72669332250632),
np.float64(-22.863346661253672))
研究皮尔逊卡方统计量¶
首先,我们进行一些健全性检查,确保在计算pearson_chi2
和resid_pearson
时没有基本的错误。
[30]:
res_e2.pearson_chi2, res_e.pearson_chi2, (res_e2.resid_pearson ** 2).sum(), (
res_e.resid_pearson ** 2
).sum()
[30]:
(np.float64(894.5809002315154),
np.float64(926.1994277566278),
np.float64(894.5809002315154),
np.float64(926.1994277566276))
[31]:
res_e._results.resid_response.mean(), res_e.model.family.variance(res_e.mu)[
:5
], res_e.mu[:5]
[31]:
(np.float64(1.624410008214629e-13),
array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]),
array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]))
[32]:
(res_e._results.resid_response ** 2 / res_e.model.family.variance(res_e.mu)).sum()
[32]:
np.float64(926.1994277566278)
[33]:
res_e2._results.resid_response.mean(), res_e2.model.family.variance(res_e2.mu)[
:5
], res_e2.mu[:5]
[33]:
(np.float64(3.235702304384456e-14),
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]),
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]))
[34]:
(res_e2._results.resid_response ** 2 / res_e2.model.family.variance(res_e2.mu)).sum()
[34]:
np.float64(894.5809002315154)
[35]:
(res_e2._results.resid_response ** 2).sum(), (res_e._results.resid_response ** 2).sum()
[35]:
(np.float64(51204.85737832326), np.float64(47104.64779595978))
导致符号错误的一个可能原因是,我们正在减去由不同分母分割的二次项。在一些相关情况下,文献中的建议是使用一个共同的分母。我们可以在完整和简化模型中使用相同的方差假设来比较皮尔逊卡方统计量。
在这种情况下,我们在所有版本中都得到了相同的皮尔逊卡方缩放差异,即简化模型与完整模型之间的差异。(问题#3616旨在进一步跟踪此问题。)
[36]:
(
(res_e2._results.resid_response ** 2 - res_e._results.resid_response ** 2)
/ res_e2.model.family.variance(res_e2.mu)
).sum()
[36]:
np.float64(44.43314175121921)
[37]:
(
(res_a2._results.resid_response ** 2 - res_a._results.resid_response ** 2)
/ res_a2.model.family.variance(res_a2.mu)
* res_a2.model.var_weights
).sum()
[37]:
np.float64(44.43314175121932)
[38]:
(
(res_f2._results.resid_response ** 2 - res_f._results.resid_response ** 2)
/ res_f2.model.family.variance(res_f2.mu)
* res_f2.model.freq_weights
).sum()
[38]:
np.float64(44.43314175121789)
[39]:
(
(res_o2._results.resid_response ** 2 - res_o._results.resid_response ** 2)
/ res_o2.model.family.variance(res_o2.mu)
).sum()
[39]:
np.float64(44.433141751219665)
余数¶
笔记本的其余部分仅包含一些额外的检查,可以忽略。
[40]:
np.exp(res_e2.model.exposure)[:5], np.asarray(df_a["affairs_count"])[:5]
[40]:
(array([ 1., 10., 5., 10., 3.]), array([ 1, 10, 5, 10, 3]))
[41]:
res_e2.resid_pearson.sum() - res_e.resid_pearson.sum()
[41]:
np.float64(-9.66481794586447)
[42]:
res_e2.mu[:5]
[42]:
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538])
[43]:
res_a2.pearson_chi2, res_a.pearson_chi2, res_a2.resid_pearson.sum(), res_a.resid_pearson.sum()
[43]:
(np.float64(894.5809002315157),
np.float64(926.1994277566174),
np.float64(-42.34720713518753),
np.float64(-32.68238918933032))
[44]:
(
(res_a2._results.resid_response ** 2)
/ res_a2.model.family.variance(res_a2.mu)
* res_a2.model.var_weights
).sum()
[44]:
np.float64(894.5809002315157)
[45]:
(
(res_a._results.resid_response ** 2)
/ res_a.model.family.variance(res_a.mu)
* res_a.model.var_weights
).sum()
[45]:
np.float64(926.1994277566174)
[46]:
(
(res_a._results.resid_response ** 2)
/ res_a.model.family.variance(res_a2.mu)
* res_a.model.var_weights
).sum()
[46]:
np.float64(850.1477584802965)
[47]:
res_e.model.endog[:5], res_e2.model.endog[:5]
[47]:
(array([ 0., 39., 17., 9., 4.]), array([ 0., 39., 17., 9., 4.]))
[48]:
res_a.model.endog[:5], res_a2.model.endog[:5]
[48]:
(array([0. , 3.9 , 3.4 , 0.9 , 1.33333333]),
array([0. , 3.9 , 3.4 , 0.9 , 1.33333333]))
[49]:
res_a2.model.endog[:5] * np.exp(res_e2.model.exposure)[:5]
[49]:
array([ 0., 39., 17., 9., 4.])
[50]:
res_a2.model.endog[:5] * res_a2.model.var_weights[:5]
[50]:
array([ 0., 39., 17., 9., 4.])
[51]:
from scipy import stats
stats.chi2.sf(27.19530754604785, 1), stats.chi2.sf(29.083798806764687, 1)
[51]:
(np.float64(1.8390448369994545e-07), np.float64(6.931421143170174e-08))
[52]:
res_o.pvalues
[52]:
Intercept 3.756282e-141
rate_marriage 0.000000e+00
age 2.221918e-11
yrs_married 1.219200e-02
dtype: float64
[53]:
print(res_e2.summary())
print(res_e.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_sum No. Observations: 130
Model: GLM Df Residuals: 127
Model Family: Poisson Df Model: 2
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -763.61
Date: Wed, 16 Oct 2024 Deviance: 1013.2
Time: 18:27:15 Pearson chi2: 895.
No. Iterations: 6 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.0754 0.050 41.512 0.000 1.977 2.173
rate_marriage -0.4947 0.012 -41.743 0.000 -0.518 -0.471
yrs_married -0.0360 0.002 -17.542 0.000 -0.040 -0.032
=================================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs_sum No. Observations: 130
Model: GLM Df Residuals: 126
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -740.75
Date: Wed, 16 Oct 2024 Deviance: 967.46
Time: 18:27:15 Pearson chi2: 926.
No. Iterations: 6 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================
[54]:
print(res_f2.summary())
print(res_f.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 6363
Model Family: Poisson Df Model: 2
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10374.
Date: Wed, 16 Oct 2024 Deviance: 15420.
Time: 18:27:15 Pearson chi2: 3.24e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9729
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.0754 0.050 41.512 0.000 1.977 2.173
rate_marriage -0.4947 0.012 -41.743 0.000 -0.518 -0.471
yrs_married -0.0360 0.002 -17.542 0.000 -0.040 -0.032
=================================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: affairs No. Observations: 476
Model: GLM Df Residuals: 6362
Model Family: Poisson Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -10351.
Date: Wed, 16 Oct 2024 Deviance: 15375.
Time: 18:27:15 Pearson chi2: 3.23e+04
No. Iterations: 6 Pseudo R-squ. (CS): 0.9754
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 2.7155 0.107 25.294 0.000 2.505 2.926
rate_marriage -0.4952 0.012 -41.702 0.000 -0.518 -0.472
age -0.0299 0.004 -6.691 0.000 -0.039 -0.021
yrs_married -0.0108 0.004 -2.507 0.012 -0.019 -0.002
=================================================================================