Regression Plots

[1]:
%matplotlib inline
[2]:
from statsmodels.compat import lzip
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Duncan's Prestige Dataset

Load the Data

We can use a utility function to load any R dataset available from the great Rdatasets package.

[3]:
prestige = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
[4]:
prestige.head()
[4]:
            type  income  education  prestige
rownames
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90
[5]:
prestige_model = ols("prestige ~ income + education", data=prestige).fit()
[6]:
print(prestige_model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.828
Model:                            OLS   Adj. R-squared:                  0.820
Method:                 Least Squares   F-statistic:                     101.2
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           8.65e-17
Time:                        18:27:52   Log-Likelihood:                -178.98
No. Observations:                  45   AIC:                             364.0
Df Residuals:                      42   BIC:                             369.4
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -6.0647      4.272     -1.420      0.163     -14.686       2.556
income         0.5987      0.120      5.003      0.000       0.357       0.840
education      0.5458      0.098      5.555      0.000       0.348       0.744
==============================================================================
Omnibus:                        1.279   Durbin-Watson:                   1.458
Prob(Omnibus):                  0.528   Jarque-Bera (JB):                0.520
Skew:                           0.155   Prob(JB):                        0.771
Kurtosis:                       3.426   Cond. No.                         163.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Influence Plots

Influence plots show the (externally) studentized residuals vs. the leverage of each observation as measured by the hat matrix.

Externally studentized residuals are residuals that are scaled by their standard deviation, where

\[var(\hat{\epsilon}_i)=\hat{\sigma}^2_i(1-h_{ii})\]

with

\[\hat{\sigma}^2_i=\frac{1}{n - p - 1}\sum_{j \neq i}^{n}\hat{\epsilon}^2_j\]

\(n\) is the number of observations and \(p\) is the number of regressors. \(h_{ii}\) is the \(i\)-th diagonal element of the hat matrix

\[H=X(X^{\prime}X)^{-1}X^{\prime}\]
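As a quick check of these definitions, one can recompute the hat matrix diagonal directly from \(X\) and compare it with what statsmodels reports through get_influence(); a minimal sketch, assuming the prestige_model fitted above:

# A minimal sketch, using the prestige_model fitted above
influence = prestige_model.get_influence()
hat_diag = influence.hat_matrix_diag  # h_ii
studentized = influence.resid_studentized_external  # externally studentized residuals

# Recompute h_ii directly from H = X (X'X)^{-1} X'
X = prestige_model.model.exog
H = X @ np.linalg.inv(X.T @ X) @ X.T
np.testing.assert_allclose(hat_diag, np.diag(H))
print(studentized[:5])  # these are the values on the influence plot's y-axis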

The influence of each point can be visualized by the criterion keyword argument. Options are Cook's distance and DFFITS, two measures of influence.

[7]:
fig = sm.graphics.influence_plot(prestige_model, criterion="cooks")
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_12_0.png

As you can see, there are a few worrisome observations. Both contractor and reporter have low leverage but a large residual. RR.engineer has a small residual and large leverage. Conductor and minister have both high leverage and large residuals and, therefore, large influence.
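Either criterion mentioned above can be passed; for example, the same plot with point sizes given by DFFITS instead of Cook's distance:

# The same influence plot, sized by DFFITS rather than Cook's distance
fig = sm.graphics.influence_plot(prestige_model, criterion="DFFITS")
fig.tight_layout(pad=1.0)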

Partial Regression Plots (Duncan)

Since we are doing multivariate regressions, we cannot just look at individual bivariate plots to discern relationships. Instead, we want to look at the relationship of the dependent variable and independent variables conditional on the other independent variables. We can do this through partial regression plots, otherwise known as added variable plots.

In a partial regression plot, to discern the relationship between the response variable and the \(k\)-th variable, we compute the residuals by regressing the response variable versus the independent variables excluding \(X_k\). We can denote this by \(X_{\sim k}\). We then compute the residuals by regressing \(X_k\) on \(X_{\sim k}\). The partial regression plot is the plot of the former residuals versus the latter residuals.

The notable points of this plot are that the fitted line has slope \(\beta_k\) and intercept zero. The residuals of this plot are the same as those of the least squares fit of the original model with full \(X\). You can discern the effects of the individual data values on the estimation of a coefficient easily. If obs_labels is True, then these points are annotated with their observation label. You can also see violations of the underlying assumptions such as homoskedasticity and linearity.
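To make the construction concrete, here is a minimal by-hand version of the partial regression plot for income, assuming the prestige data loaded above; plot_partregress, used next, does the same thing in one call:

# Residuals of prestige regressed on all regressors except income
y_resid = ols("prestige ~ education", data=prestige).fit().resid
# Residuals of income regressed on the remaining regressors
x_resid = ols("income ~ education", data=prestige).fit().resid

# The slope of y_resid on x_resid equals the income coefficient from the
# full model (0.5987 above), and the intercept is zero
partial_fit = sm.OLS(y_resid, sm.add_constant(x_resid)).fit()
print(partial_fit.params)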

[8]:
fig = sm.graphics.plot_partregress(
    "prestige", "income", ["income", "education"], data=prestige
)
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_16_0.png
[9]:
fig = sm.graphics.plot_partregress("prestige", "income", ["education"], data=prestige)
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_17_0.png

As you can see, the partial regression plot confirms the influence of conductor, minister, and RR.engineer on the partial relationship between income and prestige. These cases greatly decrease the effect of income on prestige. Dropping these cases confirms this.

[10]:
subset = ~prestige.index.isin(["conductor", "RR.engineer", "minister"])
prestige_model2 = ols(
    "prestige ~ income + education", data=prestige, subset=subset
).fit()
print(prestige_model2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.876
Model:                            OLS   Adj. R-squared:                  0.870
Method:                 Least Squares   F-statistic:                     138.1
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           2.02e-18
Time:                        18:27:53   Log-Likelihood:                -160.59
No. Observations:                  42   AIC:                             327.2
Df Residuals:                      39   BIC:                             332.4
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -6.3174      3.680     -1.717      0.094     -13.760       1.125
income         0.9307      0.154      6.053      0.000       0.620       1.242
education      0.2846      0.121      2.345      0.024       0.039       0.530
==============================================================================
Omnibus:                        3.811   Durbin-Watson:                   1.468
Prob(Omnibus):                  0.149   Jarque-Bera (JB):                2.802
Skew:                          -0.614   Prob(JB):                        0.246
Kurtosis:                       3.303   Cond. No.                         158.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For a quick check of all the regressors, you can use plot_partregress_grid. These plots will not label the points, but you can use them to identify problems and then use plot_partregress to get more information.

[11]:
fig = sm.graphics.plot_partregress_grid(prestige_model)
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_21_0.png

Component-Component plus Residual (CCPR) Plots

The CCPR plot provides a way to judge the effect of one regressor on the response variable by taking into account the effects of the other independent variables. The partial residuals plot is defined as \(\text{Residuals} + B_iX_i\) versus \(X_i\). The component adds \(B_iX_i\) versus \(X_i\) to show where the fitted line would lie. Care should be taken if \(X_i\) is highly correlated with any of the other independent variables. If this is the case, the variance evident in the plot will be an underestimate of the true variance.
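A minimal by-hand version of this for education, assuming prestige_model from above, shows what plot_ccpr, used next, draws:

# Partial residuals: model residuals plus the education component B_i * X_i
b_edu = prestige_model.params["education"]
partial_resid = prestige_model.resid + b_edu * prestige["education"]

fig, ax = plt.subplots()
ax.plot(prestige["education"], partial_resid, "o")
# The component B_i * X_i shows where the fitted line lies
ax.plot(prestige["education"], b_edu * prestige["education"], "-")
ax.set_xlabel("education")
ax.set_ylabel("residuals + B_edu * education")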

[12]:
fig = sm.graphics.plot_ccpr(prestige_model, "education")
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_24_0.png

As you can see, the relationship between the variation in prestige explained by education conditional on income seems to be linear, though you can see there are some observations that are exerting considerable influence on the relationship. We can quickly look at more than one variable by using plot_ccpr_grid.

[13]:
fig = sm.graphics.plot_ccpr_grid(prestige_model)
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_26_0.png

Single Variable Regression Diagnostics

The plot_regress_exog function is a convenience function that gives a 2x2 plot containing the dependent variable and fitted values with confidence intervals vs. the independent variable chosen, the residuals of the model vs. the chosen independent variable, a partial regression plot, and a CCPR plot. This function can be used for quickly checking modeling assumptions with respect to a single regressor.

[14]:
fig = sm.graphics.plot_regress_exog(prestige_model, "education")
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_29_0.png

Fit Plot

The plot_fit function plots the fitted values versus a chosen independent variable. It includes prediction confidence intervals and optionally plots the true dependent variable.

[15]:
fig = sm.graphics.plot_fit(prestige_model, "education")
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_32_0.png

Statewide Crime 2009 Dataset

Compare the following to http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter4/statareg_self_assessment_answers4.htm, though the data used here differ from the data in that example. You could run that example by uncommenting the necessary cells below.

[16]:
# import pandas as pd
# dta = pd.read_csv("http://www.stat.ufl.edu/~aa/social/csv_files/statewide-crime-2.csv")
# dta = dta.set_index("State").dropna()
# dta.rename(columns={"VR" : "crime",
#                    "MR" : "murder",
#                    "M"  : "pctmetro",
#                    "W"  : "pctwhite",
#                    "H"  : "pcths",
#                    "P"  : "poverty",
#                    "S"  : "single"
#                    }, inplace=True)
#
# crime_model = ols("murder ~ pctmetro + poverty + pcths + single", data=dta).fit()
[17]:
dta = sm.datasets.statecrime.load_pandas().data
[18]:
crime_model = ols("murder ~ urban + poverty + hs_grad + single", data=dta).fit()
print(crime_model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     50.08
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           3.42e-16
Time:                        18:27:54   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -44.1024     12.086     -3.649      0.001     -68.430     -19.774
urban          0.0109      0.015      0.707      0.483      -0.020       0.042
poverty        0.4121      0.140      2.939      0.005       0.130       0.694
hs_grad        0.3059      0.117      2.611      0.012       0.070       0.542
single         0.6374      0.070      9.065      0.000       0.496       0.779
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
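
One way to follow up on the large condition number flagged in note [2] is to compute variance inflation factors for the regressors; a minimal sketch using statsmodels' variance_inflation_factor:

from statsmodels.stats.outliers_influence import variance_inflation_factor

exog = crime_model.model.exog
names = crime_model.model.exog_names
# Skip column 0, the intercept
for i in range(1, exog.shape[1]):
    print(names[i], variance_inflation_factor(exog, i))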

Partial Regression Plots (Crime Data)

[19]:
fig = sm.graphics.plot_partregress_grid(crime_model)
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_39_0.png
[20]:
fig = sm.graphics.plot_partregress(
    "murder", "hs_grad", ["urban", "poverty", "single"], data=dta
)
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_40_0.png

Leverage-Resid2 Plot

Closely related to the influence_plot is the leverage-resid2 plot.

[21]:
fig = sm.graphics.plot_leverage_resid2(crime_model)
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_43_0.png

Influence Plot

[22]:
fig = sm.graphics.influence_plot(crime_model)
fig.tight_layout(pad=1.0)
../../../_images/examples_notebooks_generated_regression_plots_45_0.png

Using robust regression to correct for outliers

Part of the problem here in recreating the Stata results is that M-estimators are not robust to leverage points. MM-estimators should do better with this example.

[23]:
from statsmodels.formula.api import rlm
[24]:
rob_crime_model = rlm(
    "murder ~ urban + poverty + hs_grad + single",
    data=dta,
    M=sm.robust.norms.TukeyBiweight(3),
).fit(conv="weights")
print(rob_crime_model.summary())
                    Robust linear Model Regression Results
==============================================================================
Dep. Variable:                 murder   No. Observations:                   51
Model:                            RLM   Df Residuals:                       46
Method:                          IRLS   Df Model:                            4
Norm:                   TukeyBiweight
Scale Est.:                       mad
Cov Type:                          H1
Date:                Wed, 16 Oct 2024
Time:                        18:27:55
No. Iterations:                    50
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -4.2986      9.494     -0.453      0.651     -22.907      14.310
urban          0.0029      0.012      0.241      0.809      -0.021       0.027
poverty        0.2753      0.110      2.499      0.012       0.059       0.491
hs_grad       -0.0302      0.092     -0.328      0.743      -0.211       0.150
single         0.2902      0.055      5.253      0.000       0.182       0.398
==============================================================================

If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore.
[25]:
# rob_crime_model = rlm("murder ~ pctmetro + poverty + pcths + single", data=dta, M=sm.robust.norms.TukeyBiweight()).fit(conv="weights")
# print(rob_crime_model.summary())

There is not yet an influence diagnostics method as part of RLM, but we can recreate them. (This depends on the status of issue #888.)

[26]:
weights = rob_crime_model.weights
idx = weights > 0
# Keep only observations with nonzero robust weight
X = rob_crime_model.model.exog[idx.values]
# Normalize the weights and compute the weighted hat matrix diagonal,
# h_ii = diag(X (X'X)^{-1} X'), via the pseudo-inverse
ww = weights[idx] / weights[idx].mean()
hat_matrix_diag = ww * (X * np.linalg.pinv(X).T).sum(1)
# Normalized squared residuals, as in plot_leverage_resid2
resid = rob_crime_model.resid
resid2 = resid ** 2
resid2 /= resid2.sum()
nobs = int(idx.sum())
hm = hat_matrix_diag.mean()
rm = resid2.mean()
[27]:
from statsmodels.graphics import utils

fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(resid2[idx], hat_matrix_diag, "o")
# Label each point with its state name
ax = utils.annotate_axes(
    range(nobs),
    labels=rob_crime_model.model.data.row_labels[idx],
    points=lzip(resid2[idx], hat_matrix_diag),
    offset_points=[(-5, 5)] * nobs,
    size="large",
    ax=ax,
)
ax.set_xlabel("resid2")
ax.set_ylabel("leverage")
# Reference lines at the mean normalized squared residual and mean leverage
ylim = ax.get_ylim()
ax.vlines(rm, *ylim)
xlim = ax.get_xlim()
ax.hlines(hm, *xlim)
ax.margins(0, 0)
../../../_images/examples_notebooks_generated_regression_plots_53_0.png

Last update: Oct 16, 2024