加权广义线性模型

[1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

加权GLM:泊松响应数据

加载数据

在这个例子中,我们将使用affair数据集,利用一些外生变量来预测婚外情发生率。

将生成权重以显示 freq_weights 等同于重复数据记录。另一方面,var_weights 等同于聚合数据。

[2]:
print(sm.datasets.fair.NOTE)
::

    Number of observations: 6366
    Number of variables: 9
    Variable name definitions:

        rate_marriage   : How rate marriage, 1 = very poor, 2 = poor, 3 = fair,
                        4 = good, 5 = very good
        age             : Age
        yrs_married     : No. years married. Interval approximations. See
                        original paper for detailed explanation.
        children        : No. children
        religious       : How relgious, 1 = not, 2 = mildly, 3 = fairly,
                        4 = strongly
        educ            : Level of education, 9 = grade school, 12 = high
                        school, 14 = some college, 16 = college graduate,
                        17 = some graduate school, 20 = advanced degree
        occupation      : 1 = student, 2 = farming, agriculture; semi-skilled,
                        or unskilled worker; 3 = white-colloar; 4 = teacher
                        counselor social worker, nurse; artist, writers;
                        technician, skilled worker, 5 = managerial,
                        administrative, business, 6 = professional with
                        advanced degree
        occupation_husb : Husband's occupation. Same as occupation.
        affairs         : measure of time spent in extramarital affairs

    See the original paper for more details.

将数据加载到 pandas 数据框中。

[3]:
data = sm.datasets.fair.load_pandas().data

因变量(内生变量)是 affairs

[4]:
data.describe()
[4]:
rate_marriage age yrs_married children religious educ occupation occupation_husb affairs
count 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000 6366.000000
mean 4.109645 29.082862 9.009425 1.396874 2.426170 14.209865 3.424128 3.850141 0.705374
std 0.961430 6.847882 7.280120 1.433471 0.878369 2.178003 0.942399 1.346435 2.203374
min 1.000000 17.500000 0.500000 0.000000 1.000000 9.000000 1.000000 1.000000 0.000000
25% 4.000000 22.000000 2.500000 0.000000 2.000000 12.000000 3.000000 3.000000 0.000000
50% 4.000000 27.000000 6.000000 1.000000 2.000000 14.000000 3.000000 4.000000 0.000000
75% 5.000000 32.000000 16.500000 2.000000 3.000000 16.000000 4.000000 5.000000 0.484848
max 5.000000 42.000000 23.000000 5.500000 4.000000 20.000000 6.000000 6.000000 57.599991
[5]:
data[:3]
[5]:
rate_marriage age yrs_married children religious educ occupation occupation_husb affairs
0 3.0 32.0 9.0 3.0 3.0 17.0 2.0 5.0 0.111111
1 3.0 27.0 13.0 3.0 1.0 14.0 3.0 4.0 3.230769
2 4.0 22.0 2.5 0.0 1.0 16.0 3.0 5.0 1.400000

在下面我们将主要使用泊松分布。虽然使用十进制事务也可以,但我们将它们转换为整数以获得计数分布。

[6]:
data["affairs"] = np.ceil(data["affairs"])
data[:3]
[6]:
rate_marriage age yrs_married children religious educ occupation occupation_husb affairs
0 3.0 32.0 9.0 3.0 3.0 17.0 2.0 5.0 1.0
1 3.0 27.0 13.0 3.0 1.0 14.0 3.0 4.0 4.0
2 4.0 22.0 2.5 0.0 1.0 16.0 3.0 5.0 2.0
[7]:
(data["affairs"] == 0).mean()
[7]:
np.float64(0.6775054979579014)
[8]:
np.bincount(data["affairs"].astype(int))
[8]:
array([4313,  934,  488,  180,  130,  172,    7,   21,   67,    2,    0,
          0,   17,    0,    0,    0,    3,   12,    8,    0,    0,    0,
          0,    0,    2,    2,    2,    3,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    1,    1,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    1])

压缩和聚合观测数据

我们的原始数据集中有6366个观测值。当我们只考虑一些选定的变量时,我们会有较少的唯一观测值。接下来,我们以两种方式合并观测值,首先我们合并所有变量值相同的观测值,其次我们合并具有相同解释变量的观测值。

具有唯一观测值的数据集

我们使用 pandas 的 groupby 来合并相同的观察值,并创建一个新变量 freq,用于计算对应行中具有这些值的观察次数。

[9]:
data2 = data.copy()
data2["const"] = 1
dc = (
    data2["affairs rate_marriage age yrs_married const".split()]
    .groupby("affairs rate_marriage age yrs_married".split())
    .count()
)
dc.reset_index(inplace=True)
dc.rename(columns={"const": "freq"}, inplace=True)
print(dc.shape)
dc.head()
(476, 5)
[9]:
affairs rate_marriage age yrs_married freq
0 0.0 1.0 17.5 0.5 1
1 0.0 1.0 22.0 2.5 3
2 0.0 1.0 27.0 2.5 1
3 0.0 1.0 27.0 6.0 5
4 0.0 1.0 27.0 9.0 1

具有唯一解释变量的数据集(外生变量)

对于下一个数据集,我们将具有相同解释变量值的观测值进行合并。然而,由于响应变量在合并的观测值之间可能不同,我们计算所有合并观测值的响应变量的均值和总和。

我们再次使用 pandas groupby 来合并观察结果并创建新变量。我们还把 MultiIndex 转换为简单的索引。

[10]:
gr = data["affairs rate_marriage age yrs_married".split()].groupby(
    "rate_marriage age yrs_married".split()
)
df_a = gr.agg(["mean", "sum", "count"])


def merge_tuple(tpl):
    if isinstance(tpl, tuple) and len(tpl) > 1:
        return "_".join(map(str, tpl))
    else:
        return tpl


df_a.columns = df_a.columns.map(merge_tuple)
df_a.reset_index(inplace=True)
print(df_a.shape)
df_a.head()
(130, 6)
[10]:
rate_marriage age yrs_married affairs_mean affairs_sum affairs_count
0 1.0 17.5 0.5 0.000000 0.0 1
1 1.0 22.0 2.5 3.900000 39.0 10
2 1.0 27.0 2.5 3.400000 17.0 5
3 1.0 27.0 6.0 0.900000 9.0 10
4 1.0 27.0 9.0 1.333333 4.0 3

在将观测值合并后,我们得到了一个包含467个唯一观测值的数据框dc,以及一个包含130个观测值的数据框df_a,其中包含解释变量的唯一值。

[11]:
print("number of rows: \noriginal, with unique observations, with unique exog")
data.shape[0], dc.shape[0], df_a.shape[0]
number of rows:
original, with unique observations, with unique exog
[11]:
(6366, 476, 130)

分析

在下文中,我们将原始数据的GLM-Poisson结果与结合了多重性或聚合的观测值的模型进行比较,其中多重性或聚合由权重或暴露给出。

原始数据

[12]:
glm = smf.glm(
    "affairs ~ rate_marriage + age + yrs_married",
    data=data,
    family=sm.families.Poisson(),
)
res_o = glm.fit()
print(res_o.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                 6366
Model:                            GLM   Df Residuals:                     6362
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Wed, 16 Oct 2024   Deviance:                       15375.
Time:                        18:27:14   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.2420
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[13]:
res_o.pearson_chi2 / res_o.df_resid
[13]:
np.float64(5.078702313363161)

浓缩数据(具有频率的唯一观测值)

结合相同的观测值并使用频率权重来考虑观测值的重复性,会产生完全相同的结果。当我们希望获得关于观测值的信息而不是所有相同观测值的聚合信息时,某些结果属性将有所不同。例如,残差不会考虑freq_weights

[14]:
glm = smf.glm(
    "affairs ~ rate_marriage + age + yrs_married",
    data=dc,
    family=sm.families.Poisson(),
    freq_weights=np.asarray(dc["freq"]),
)
res_f = glm.fit()
print(res_f.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                     6362
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Wed, 16 Oct 2024   Deviance:                       15375.
Time:                        18:27:14   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9754
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[15]:
res_f.pearson_chi2 / res_f.df_resid
[15]:
np.float64(5.078702313363207)

使用 var_weights 而不是 freq_weights

接下来,我们将var_weightsfreq_weights进行比较。当内生变量反映的是平均值而非相同观测值时,通常会使用var_weights。我没有看到理论上有任何理由说明它在一般情况下会产生相同的结果。

这产生了相同的结果,但df_residfreq_weights示例不同,因为var_weights不会改变有效观察的数量。

[16]:
glm = smf.glm(
    "affairs ~ rate_marriage + age + yrs_married",
    data=dc,
    family=sm.families.Poisson(),
    var_weights=np.asarray(dc["freq"]),
)
res_fv = glm.fit()
print(res_fv.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                      472
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Wed, 16 Oct 2024   Deviance:                       15375.
Time:                        18:27:14   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9754
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================

由于错误的 df_resid,从结果计算的离散度是不正确的。如果我们使用原始的 df_resid,则是正确的。

[17]:
res_fv.pearson_chi2 / res_fv.df_resid, res_f.pearson_chi2 / res_f.df_resid
[17]:
(np.float64(68.45488160512018), np.float64(5.078702313363207))

聚合或平均数据(解释变量的唯一值)

在这些情况下,我们将具有相同解释变量值的观测值进行合并。对应的响应变量可以是总和或平均值。

使用 exposure

如果我们的因变量是所有合并观测的响应之和,那么在泊松假设下,分布保持不变,但我们有不同的暴露,由一个聚合观测所代表的个体数量给出。

参数估计和参数协方差与原始数据相同,但对数似然、偏差和皮尔逊卡方值不同

[18]:
glm = smf.glm(
    "affairs_sum ~ rate_marriage + age + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    exposure=np.asarray(df_a["affairs_count"]),
)
res_e = glm.fit()
print(res_e.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            affairs_sum   No. Observations:                  130
Model:                            GLM   Df Residuals:                      126
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -740.75
Date:                Wed, 16 Oct 2024   Deviance:                       967.46
Time:                        18:27:14   Pearson chi2:                     926.
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[19]:
res_e.pearson_chi2 / res_e.df_resid
[19]:
np.float64(7.350789109179585)

使用 var_weights

我们也可以使用因变量所有组合值的平均值。在这种情况下,方差将与一个组合观测所反映的总暴露的倒数相关。

[20]:
glm = smf.glm(
    "affairs_mean ~ rate_marriage + age + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    var_weights=np.asarray(df_a["affairs_count"]),
)
res_a = glm.fit()
print(res_a.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:           affairs_mean   No. Observations:                  130
Model:                            GLM   Df Residuals:                      126
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -5954.2
Date:                Wed, 16 Oct 2024   Deviance:                       967.46
Time:                        18:27:14   Pearson chi2:                     926.
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================

比较

我们在上面的摘要打印中看到,paramscov_params 以及相关的 Wald 推断在不同版本之间是一致的。我们通过比较不同版本中的单个结果属性来总结这一点。

参数估计 params,参数的标准误差 bse 和参数的 pvalues 对于参数为零的检验结果一致。然而,似然和拟合优度统计量 llfdeviancepearson_chi2 仅部分一致。具体来说,聚合版本与使用原始数据的结果不一致。

警告llfdeviancepearson_chi2 的行为在未来的版本中可能仍会发生变化。

响应变量在解释变量的唯一值下的总和和平均值都具有适当的似然解释。然而,这种解释并未在这些三个统计量中体现出来。从计算角度来看,这可能是由于在使用聚合数据时缺少调整。然而,从理论上讲,我们可以考虑这些情况,特别是在似然分析不合适且结果应解释为拟似然估计的情况下,对于var_weights的错误指定情况。var_weights的定义存在歧义,因为它们既可以用于正确指定似然的平均值,也可以用于拟似然情况下的方差调整。我们目前并未尝试匹配似然规范。然而,在下一节中,我们将展示,当我们假设基础模型正确指定时,似然比类型的检验仍然为所有聚合版本产生相同的结果。

[21]:
results_all = [res_o, res_f, res_e, res_a]
names = "res_o res_f res_e res_a".split()
[22]:
pd.concat([r.params for r in results_all], axis=1, keys=names)
[22]:
res_o res_f res_e res_a
Intercept 2.715533 2.715533 2.715533 2.715533
rate_marriage -0.495180 -0.495180 -0.495180 -0.495180
age -0.029914 -0.029914 -0.029914 -0.029914
yrs_married -0.010763 -0.010763 -0.010763 -0.010763
[23]:
pd.concat([r.bse for r in results_all], axis=1, keys=names)
[23]:
res_o res_f res_e res_a
Intercept 0.107360 0.107360 0.107360 0.107360
rate_marriage 0.011874 0.011874 0.011874 0.011874
age 0.004471 0.004471 0.004471 0.004471
yrs_married 0.004294 0.004294 0.004294 0.004294
[24]:
pd.concat([r.pvalues for r in results_all], axis=1, keys=names)
[24]:
res_o res_f res_e res_a
Intercept 3.756282e-141 3.756280e-141 3.756282e-141 3.756282e-141
rate_marriage 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
age 2.221918e-11 2.221918e-11 2.221918e-11 2.221918e-11
yrs_married 1.219200e-02 1.219200e-02 1.219200e-02 1.219200e-02
[25]:
pd.DataFrame(
    np.column_stack([[r.llf, r.deviance, r.pearson_chi2] for r in results_all]),
    columns=names,
    index=["llf", "deviance", "pearson chi2"],
)
[25]:
res_o res_f res_e res_a
llf -10350.913296 -10350.913296 -740.748534 -5954.219866
deviance 15374.679054 15374.679054 967.455734 967.455734
pearson chi2 32310.704118 32310.704118 926.199428 926.199428

似然比类型检验

我们上面看到,似然性和相关统计量在聚合数据和原始个体数据之间并不一致。我们在下面说明,似然比检验和偏差差异在不同版本之间是一致的,然而皮尔逊卡方检验并不一致。

如前所述:这还不够清楚,可能会发生变化。

作为一个测试案例,我们删除age变量,并计算似然比类型统计量,作为简化或受限模型与完整或无约束模型之间的差异。

原始观测值和频率权重

[26]:
glm = smf.glm(
    "affairs ~ rate_marriage + yrs_married", data=data, family=sm.families.Poisson()
)
res_o2 = glm.fit()
# print(res_f2.summary())
res_o2.pearson_chi2 - res_o.pearson_chi2, res_o2.deviance - res_o.deviance, res_o2.llf - res_o.llf
[26]:
(np.float64(52.913431618326285),
 np.float64(45.726693322503706),
 np.float64(-22.863346661253672))
[27]:
glm = smf.glm(
    "affairs ~ rate_marriage + yrs_married",
    data=dc,
    family=sm.families.Poisson(),
    freq_weights=np.asarray(dc["freq"]),
)
res_f2 = glm.fit()
# print(res_f2.summary())
res_f2.pearson_chi2 - res_f.pearson_chi2, res_f2.deviance - res_f.deviance, res_f2.llf - res_f.llf
[27]:
(np.float64(52.91343161864643),
 np.float64(45.726693322507344),
 np.float64(-22.863346661253672))

聚合数据: exposurevar_weights

注意:LR 检验与原始观测结果一致,pearson_chi2 不同且符号错误。

[28]:
glm = smf.glm(
    "affairs_sum ~ rate_marriage + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    exposure=np.asarray(df_a["affairs_count"]),
)
res_e2 = glm.fit()
res_e2.pearson_chi2 - res_e.pearson_chi2, res_e2.deviance - res_e.deviance, res_e2.llf - res_e.llf
[28]:
(np.float64(-31.618527525112313),
 np.float64(45.72669332250621),
 np.float64(-22.863346661252535))
[29]:
glm = smf.glm(
    "affairs_mean ~ rate_marriage + yrs_married",
    data=df_a,
    family=sm.families.Poisson(),
    var_weights=np.asarray(df_a["affairs_count"]),
)
res_a2 = glm.fit()
res_a2.pearson_chi2 - res_a.pearson_chi2, res_a2.deviance - res_a.deviance, res_a2.llf - res_a.llf
[29]:
(np.float64(-31.61852752510174),
 np.float64(45.72669332250632),
 np.float64(-22.863346661253672))

研究皮尔逊卡方统计量

首先,我们进行一些健全性检查,确保在计算pearson_chi2resid_pearson时没有基本的错误。

[30]:
res_e2.pearson_chi2, res_e.pearson_chi2, (res_e2.resid_pearson ** 2).sum(), (
    res_e.resid_pearson ** 2
).sum()
[30]:
(np.float64(894.5809002315154),
 np.float64(926.1994277566278),
 np.float64(894.5809002315154),
 np.float64(926.1994277566276))
[31]:
res_e._results.resid_response.mean(), res_e.model.family.variance(res_e.mu)[
    :5
], res_e.mu[:5]
[31]:
(np.float64(1.624410008214629e-13),
 array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]),
 array([ 5.42753476, 46.42940306, 19.98971769, 38.50138978, 11.18341883]))
[32]:
(res_e._results.resid_response ** 2 / res_e.model.family.variance(res_e.mu)).sum()
[32]:
np.float64(926.1994277566278)
[33]:
res_e2._results.resid_response.mean(), res_e2.model.family.variance(res_e2.mu)[
    :5
], res_e2.mu[:5]
[33]:
(np.float64(3.235702304384456e-14),
 array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]),
 array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538]))
[34]:
(res_e2._results.resid_response ** 2 / res_e2.model.family.variance(res_e2.mu)).sum()
[34]:
np.float64(894.5809002315154)
[35]:
(res_e2._results.resid_response ** 2).sum(), (res_e._results.resid_response ** 2).sum()
[35]:
(np.float64(51204.85737832326), np.float64(47104.64779595978))

导致符号错误的一个可能原因是,我们正在减去由不同分母分割的二次项。在一些相关情况下,文献中的建议是使用一个共同的分母。我们可以在完整和简化模型中使用相同的方差假设来比较皮尔逊卡方统计量。

在这种情况下,我们在所有版本中都得到了相同的皮尔逊卡方缩放差异,即简化模型与完整模型之间的差异。(问题#3616旨在进一步跟踪此问题。)

[36]:
(
    (res_e2._results.resid_response ** 2 - res_e._results.resid_response ** 2)
    / res_e2.model.family.variance(res_e2.mu)
).sum()
[36]:
np.float64(44.43314175121921)
[37]:
(
    (res_a2._results.resid_response ** 2 - res_a._results.resid_response ** 2)
    / res_a2.model.family.variance(res_a2.mu)
    * res_a2.model.var_weights
).sum()
[37]:
np.float64(44.43314175121932)
[38]:
(
    (res_f2._results.resid_response ** 2 - res_f._results.resid_response ** 2)
    / res_f2.model.family.variance(res_f2.mu)
    * res_f2.model.freq_weights
).sum()
[38]:
np.float64(44.43314175121789)
[39]:
(
    (res_o2._results.resid_response ** 2 - res_o._results.resid_response ** 2)
    / res_o2.model.family.variance(res_o2.mu)
).sum()
[39]:
np.float64(44.433141751219665)

余数

笔记本的其余部分仅包含一些额外的检查,可以忽略。

[40]:
np.exp(res_e2.model.exposure)[:5], np.asarray(df_a["affairs_count"])[:5]
[40]:
(array([ 1., 10.,  5., 10.,  3.]), array([ 1, 10,  5, 10,  3]))
[41]:
res_e2.resid_pearson.sum() - res_e.resid_pearson.sum()
[41]:
np.float64(-9.66481794586447)
[42]:
res_e2.mu[:5]
[42]:
array([ 4.77165474, 44.4026604 , 22.2013302 , 39.14749309, 10.54229538])
[43]:
res_a2.pearson_chi2, res_a.pearson_chi2, res_a2.resid_pearson.sum(), res_a.resid_pearson.sum()
[43]:
(np.float64(894.5809002315157),
 np.float64(926.1994277566174),
 np.float64(-42.34720713518753),
 np.float64(-32.68238918933032))
[44]:
(
    (res_a2._results.resid_response ** 2)
    / res_a2.model.family.variance(res_a2.mu)
    * res_a2.model.var_weights
).sum()
[44]:
np.float64(894.5809002315157)
[45]:
(
    (res_a._results.resid_response ** 2)
    / res_a.model.family.variance(res_a.mu)
    * res_a.model.var_weights
).sum()
[45]:
np.float64(926.1994277566174)
[46]:
(
    (res_a._results.resid_response ** 2)
    / res_a.model.family.variance(res_a2.mu)
    * res_a.model.var_weights
).sum()
[46]:
np.float64(850.1477584802965)
[47]:
res_e.model.endog[:5], res_e2.model.endog[:5]
[47]:
(array([ 0., 39., 17.,  9.,  4.]), array([ 0., 39., 17.,  9.,  4.]))
[48]:
res_a.model.endog[:5], res_a2.model.endog[:5]
[48]:
(array([0.        , 3.9       , 3.4       , 0.9       , 1.33333333]),
 array([0.        , 3.9       , 3.4       , 0.9       , 1.33333333]))
[49]:
res_a2.model.endog[:5] * np.exp(res_e2.model.exposure)[:5]
[49]:
array([ 0., 39., 17.,  9.,  4.])
[50]:
res_a2.model.endog[:5] * res_a2.model.var_weights[:5]
[50]:
array([ 0., 39., 17.,  9.,  4.])
[51]:
from scipy import stats

stats.chi2.sf(27.19530754604785, 1), stats.chi2.sf(29.083798806764687, 1)
[51]:
(np.float64(1.8390448369994545e-07), np.float64(6.931421143170174e-08))
[52]:
res_o.pvalues
[52]:
Intercept        3.756282e-141
rate_marriage     0.000000e+00
age               2.221918e-11
yrs_married       1.219200e-02
dtype: float64
[53]:
print(res_e2.summary())
print(res_e.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            affairs_sum   No. Observations:                  130
Model:                            GLM   Df Residuals:                      127
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -763.61
Date:                Wed, 16 Oct 2024   Deviance:                       1013.2
Time:                        18:27:15   Pearson chi2:                     895.
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.0754      0.050     41.512      0.000       1.977       2.173
rate_marriage    -0.4947      0.012    -41.743      0.000      -0.518      -0.471
yrs_married      -0.0360      0.002    -17.542      0.000      -0.040      -0.032
=================================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            affairs_sum   No. Observations:                  130
Model:                            GLM   Df Residuals:                      126
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -740.75
Date:                Wed, 16 Oct 2024   Deviance:                       967.46
Time:                        18:27:15   Pearson chi2:                     926.
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================
[54]:
print(res_f2.summary())
print(res_f.summary())
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                     6363
Model Family:                 Poisson   Df Model:                            2
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10374.
Date:                Wed, 16 Oct 2024   Deviance:                       15420.
Time:                        18:27:15   Pearson chi2:                 3.24e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9729
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.0754      0.050     41.512      0.000       1.977       2.173
rate_marriage    -0.4947      0.012    -41.743      0.000      -0.518      -0.471
yrs_married      -0.0360      0.002    -17.542      0.000      -0.040      -0.032
=================================================================================
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                affairs   No. Observations:                  476
Model:                            GLM   Df Residuals:                     6362
Model Family:                 Poisson   Df Model:                            3
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10351.
Date:                Wed, 16 Oct 2024   Deviance:                       15375.
Time:                        18:27:15   Pearson chi2:                 3.23e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.9754
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.7155      0.107     25.294      0.000       2.505       2.926
rate_marriage    -0.4952      0.012    -41.702      0.000      -0.518      -0.472
age              -0.0299      0.004     -6.691      0.000      -0.039      -0.021
yrs_married      -0.0108      0.004     -2.507      0.012      -0.019      -0.002
=================================================================================

Last update: Oct 16, 2024