Regression Plots¶
[1]:
%matplotlib inline
[2]:
from statsmodels.compat import lzip
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Duncan's Prestige Dataset¶
Load the Data¶
We can use a utility function to load any R dataset available from the great Rdatasets package.
[3]:
prestige = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
[4]:
prestige.head()
[4]:
| rownames | type | income | education | prestige |
|---|---|---|---|---|
| accountant | prof | 62 | 86 | 82 |
| pilot | prof | 72 | 76 | 83 |
| architect | prof | 75 | 92 | 90 |
| author | prof | 55 | 90 | 76 |
| chemist | prof | 64 | 86 | 90 |
[5]:
prestige_model = ols("prestige ~ income + education", data=prestige).fit()
[6]:
print(prestige_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: prestige R-squared: 0.828
Model: OLS Adj. R-squared: 0.820
Method: Least Squares F-statistic: 101.2
Date: Wed, 16 Oct 2024 Prob (F-statistic): 8.65e-17
Time: 18:27:52 Log-Likelihood: -178.98
No. Observations: 45 AIC: 364.0
Df Residuals: 42 BIC: 369.4
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -6.0647 4.272 -1.420 0.163 -14.686 2.556
income 0.5987 0.120 5.003 0.000 0.357 0.840
education 0.5458 0.098 5.555 0.000 0.348 0.744
==============================================================================
Omnibus: 1.279 Durbin-Watson: 1.458
Prob(Omnibus): 0.528 Jarque-Bera (JB): 0.520
Skew: 0.155 Prob(JB): 0.771
Kurtosis: 3.426 Cond. No. 163.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Influence Plots¶
Influence plots show the (externally) studentized residuals vs. the leverage of each observation as measured by the hat matrix.
Externally studentized residuals are residuals that are scaled by their standard deviation where
\(\text{var}(\hat{\epsilon}_i) = \hat{\sigma}^2_i(1 - h_{ii})\)
with
\(\hat{\sigma}^2_i = \frac{1}{n - p - 1}\sum_{j \neq i}^{n}\hat{\epsilon}_j^2\)
\(n\) is the number of observations and \(p\) is the number of regressors. \(h_{ii}\) is the \(i\)-th diagonal element of the hat matrix
\(H = X(X^{\prime}X)^{-1}X^{\prime}\)
The influence of each point can be visualized by the criterion keyword argument. Options are Cook's distance and DFFITS, two measures of influence.
[7]:
fig = sm.graphics.influence_plot(prestige_model, criterion="cooks")
fig.tight_layout(pad=1.0)

As you can see, there are a few worrisome observations. Both contractor and reporter have low leverage but a large residual. RR.engineer has a small residual and large leverage. Conductor and minister have both high leverage and large residuals and, therefore, large influence.
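The quantities behind this plot are also available numerically from the fitted results. A minimal sketch using the OLS influence API (the variable names here are just illustrative):
infl = prestige_model.get_influence()
# Externally studentized residuals and hat matrix diagonal, as plotted above
student_resid = infl.resid_studentized_external
leverage = infl.hat_matrix_diag
# The two influence criteria; each returns a tuple whose first element is the measure
cooks_d = infl.cooks_distance[0]
dffits = infl.dffits[0]
print(prestige.index[np.argmax(cooks_d)])  # the most influential observation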
Partial Regression Plots (Duncan)¶
Since we are doing multivariate regressions, we cannot just look at individual bivariate plots to discern relationships. Instead, we want to look at the relationship of the dependent variable and independent variables conditional on the other independent variables. We can do this through partial regression plots, otherwise known as added variable plots.
In a partial regression plot, to discern the relationship between the response variable and the \(k\)-th variable, we compute the residuals by regressing the response variable versus the independent variables excluding \(X_k\). We can denote this by \(X_{\sim k}\). We then compute the residuals by regressing \(X_k\) on \(X_{\sim k}\). The partial regression plot is the plot of the former residuals versus the latter residuals.
The notable points of this plot are that the fitted line has slope \(\beta_k\) and intercept zero, as the sketch below demonstrates. The residuals of this plot are the same as those of the least squares fit of the original model with full \(X\). You can easily discern the effects of individual data values on the estimation of a coefficient. If obs_labels is True, then these points are annotated with their observation label. You can also see violations of underlying assumptions such as homoskedasticity and linearity.
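To make the slope claim concrete, here is a hand-rolled version of the income partial regression for the prestige model above (a sketch; by the Frisch-Waugh-Lovell theorem, the printed slope matches the full-model income coefficient, 0.5987):
# Residualize the response and the focus regressor against the remaining regressor
ey = ols("prestige ~ education", data=prestige).fit().resid
ex = ols("income ~ education", data=prestige).fit().resid
# Regressing the former residuals on the latter recovers the full-model coefficient
print(np.dot(ex, ey) / np.dot(ex, ex))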
[8]:
fig = sm.graphics.plot_partregress(
"prestige", "income", ["income", "education"], data=prestige
)
fig.tight_layout(pad=1.0)

[9]:
fig = sm.graphics.plot_partregress("prestige", "income", ["education"], data=prestige)
fig.tight_layout(pad=1.0)

As you can see, the partial regression plot confirms the influence of conductor, minister, and RR.engineer on the partial relationship between income and prestige. These cases greatly decrease the effect of income on prestige. Dropping them confirms this.
[10]:
subset = ~prestige.index.isin(["conductor", "RR.engineer", "minister"])
prestige_model2 = ols(
"prestige ~ income + education", data=prestige, subset=subset
).fit()
print(prestige_model2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: prestige R-squared: 0.876
Model: OLS Adj. R-squared: 0.870
Method: Least Squares F-statistic: 138.1
Date: Wed, 16 Oct 2024 Prob (F-statistic): 2.02e-18
Time: 18:27:53 Log-Likelihood: -160.59
No. Observations: 42 AIC: 327.2
Df Residuals: 39 BIC: 332.4
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -6.3174 3.680 -1.717 0.094 -13.760 1.125
income 0.9307 0.154 6.053 0.000 0.620 1.242
education 0.2846 0.121 2.345 0.024 0.039 0.530
==============================================================================
Omnibus: 3.811 Durbin-Watson: 1.468
Prob(Omnibus): 0.149 Jarque-Bera (JB): 2.802
Skew: -0.614 Prob(JB): 0.246
Kurtosis: 3.303 Cond. No. 158.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
For a quick check of all the regressors, you can use plot_partregress_grid. These plots will not label the points, but you can use them to identify problems and then use plot_partregress to get more information.
[11]:
fig = sm.graphics.plot_partregress_grid(prestige_model)
fig.tight_layout(pad=1.0)

Component-Component plus Residual (CCPR) Plots¶
The CCPR plot provides a way to judge the effect of one regressor on the response variable by taking into account the effects of the other independent variables. The partial residuals plot is defined as \(\text{Residuals} + B_iX_i\) versus \(X_i\). The component adds \(B_iX_i\) versus \(X_i\) to show where the fitted line would lie. Care should be taken if \(X_i\) is highly correlated with any of the other independent variables; if this is the case, the variance evident in the plot will be an underestimate of the true variance. The definition can be reproduced by hand, as sketched below.
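A minimal hand-rolled CCPR plot for education, using the fitted prestige_model from above (variable names are illustrative):
# Partial residuals for education: model residuals plus the education component
b_edu = prestige_model.params["education"]
partial_resid = prestige_model.resid + b_edu * prestige["education"]
fig, ax = plt.subplots()
ax.plot(prestige["education"], partial_resid, "o")
# The component itself shows where the fitted line lies
ax.plot(prestige["education"], b_edu * prestige["education"], "-")
ax.set_xlabel("education")
ax.set_ylabel("Residuals + B_education * education")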
[12]:
fig = sm.graphics.plot_ccpr(prestige_model, "education")
fig.tight_layout(pad=1.0)

As you can see, the relationship between the variation in prestige explained by education conditional on income seems to be linear, though there are some observations exerting considerable influence on the relationship. We can quickly look at more than one variable by using plot_ccpr_grid.
[13]:
fig = sm.graphics.plot_ccpr_grid(prestige_model)
fig.tight_layout(pad=1.0)

Single Variable Regression Diagnostics¶
The plot_regress_exog function is a convenience function that gives a 2x2 plot containing the dependent variable and fitted values with confidence intervals vs. the chosen independent variable, the residuals of the model vs. the chosen independent variable, a partial regression plot, and a CCPR plot. This function can be used for quickly checking modeling assumptions with respect to a single regressor.
[14]:
fig = sm.graphics.plot_regress_exog(prestige_model, "education")
fig.tight_layout(pad=1.0)

Fit Plot¶
The plot_fit function plots the fitted values versus a chosen independent variable. It includes prediction confidence intervals and optionally plots the true dependent variable. The same intervals can also be computed directly, as sketched below.
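A brief sketch of obtaining those intervals from the results object via get_prediction:
# Mean (confidence) and observation (prediction) intervals for the fitted values
pred = prestige_model.get_prediction()
frame = pred.summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]].head())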
[15]:
fig = sm.graphics.plot_fit(prestige_model, "education")
fig.tight_layout(pad=1.0)

Statewide Crime 2009 Dataset¶
Compare the following to http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter4/statareg_self_assessment_answers4.htm
Though the data here are not the same as in that example, you could run that example by uncommenting the necessary cells below.
[16]:
# import pandas as pd
# dta = pd.read_csv("http://www.stat.ufl.edu/~aa/social/csv_files/statewide-crime-2.csv")
# dta = dta.set_index("State").dropna()
# dta = dta.rename(columns={"VR": "crime",
#                           "MR": "murder",
#                           "M": "pctmetro",
#                           "W": "pctwhite",
#                           "H": "pcths",
#                           "P": "poverty",
#                           "S": "single"})
#
# crime_model = ols("murder ~ pctmetro + poverty + pcths + single", data=dta).fit()
[17]:
dta = sm.datasets.statecrime.load_pandas().data
[18]:
crime_model = ols("murder ~ urban + poverty + hs_grad + single", data=dta).fit()
print(crime_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: murder R-squared: 0.813
Model: OLS Adj. R-squared: 0.797
Method: Least Squares F-statistic: 50.08
Date: Wed, 16 Oct 2024 Prob (F-statistic): 3.42e-16
Time: 18:27:54 Log-Likelihood: -95.050
No. Observations: 51 AIC: 200.1
Df Residuals: 46 BIC: 209.8
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -44.1024 12.086 -3.649 0.001 -68.430 -19.774
urban 0.0109 0.015 0.707 0.483 -0.020 0.042
poverty 0.4121 0.140 2.939 0.005 0.130 0.694
hs_grad 0.3059 0.117 2.611 0.012 0.070 0.542
single 0.6374 0.070 9.065 0.000 0.496 0.779
==============================================================================
Omnibus: 1.618 Durbin-Watson: 2.507
Prob(Omnibus): 0.445 Jarque-Bera (JB): 0.831
Skew: -0.220 Prob(JB): 0.660
Kurtosis: 3.445 Cond. No. 5.80e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
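The large condition number flagged in note [2] can be probed with variance inflation factors, one common multicollinearity check. A sketch using statsmodels' variance_inflation_factor:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF for each column of the design matrix (the first entry is the intercept)
exog = crime_model.model.exog
for i, name in enumerate(crime_model.model.exog_names):
    print(name, variance_inflation_factor(exog, i))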
Partial Regression Plots (Crime Data)¶
[19]:
fig = sm.graphics.plot_partregress_grid(crime_model)
fig.tight_layout(pad=1.0)

[20]:
fig = sm.graphics.plot_partregress(
"murder", "hs_grad", ["urban", "poverty", "single"], data=dta
)
fig.tight_layout(pad=1.0)

Leverage-Resid² Plot¶
Closely related to the influence_plot is the leverage-resid2 plot.
[21]:
fig = sm.graphics.plot_leverage_resid2(crime_model)
fig.tight_layout(pad=1.0)

Influence Plot¶
[22]:
fig = sm.graphics.influence_plot(crime_model)
fig.tight_layout(pad=1.0)

Using Robust Regression to Correct for Outliers¶
Part of the problem here in recreating the Stata results is that M-estimators are not robust to leverage points. MM-estimators should do better with this example.
[23]:
from statsmodels.formula.api import rlm
[24]:
rob_crime_model = rlm(
"murder ~ urban + poverty + hs_grad + single",
data=dta,
M=sm.robust.norms.TukeyBiweight(3),
).fit(conv="weights")
print(rob_crime_model.summary())
Robust linear Model Regression Results
==============================================================================
Dep. Variable: murder No. Observations: 51
Model: RLM Df Residuals: 46
Method: IRLS Df Model: 4
Norm: TukeyBiweight
Scale Est.: mad
Cov Type: H1
Date: Wed, 16 Oct 2024
Time: 18:27:55
No. Iterations: 50
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -4.2986 9.494 -0.453 0.651 -22.907 14.310
urban 0.0029 0.012 0.241 0.809 -0.021 0.027
poverty 0.2753 0.110 2.499 0.012 0.059 0.491
hs_grad -0.0302 0.092 -0.328 0.743 -0.211 0.150
single 0.2902 0.055 5.253 0.000 0.182 0.398
==============================================================================
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore.
[25]:
# rob_crime_model = rlm("murder ~ pctmetro + poverty + pcths + single", data=dta, M=sm.robust.norms.TukeyBiweight()).fit(conv="weights")
# print(rob_crime_model.summary())
There is not yet an influence diagnostics method as part of RLM, but we can recreate them. (This depends on the status of issue #888.)
[26]:
# Keep only the observations that received a non-zero robust weight
weights = rob_crime_model.weights
idx = weights > 0
X = rob_crime_model.model.exog[idx.values]
# Normalize the weights and compute a weighted hat matrix diagonal (leverage)
ww = weights[idx] / weights[idx].mean()
hat_matrix_diag = ww * (X * np.linalg.pinv(X).T).sum(1)
# Squared residuals, normalized to sum to one
resid = rob_crime_model.resid
resid2 = resid ** 2
resid2 /= resid2.sum()
nobs = int(idx.sum())
# Mean leverage and mean normalized squared residual, used as reference lines below
hm = hat_matrix_diag.mean()
rm = resid2.mean()
[27]:
from statsmodels.graphics import utils
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(resid2[idx], hat_matrix_diag, "o")
ax = utils.annotate_axes(
range(nobs),
labels=rob_crime_model.model.data.row_labels[idx],
points=lzip(resid2[idx], hat_matrix_diag),
offset_points=[(-5, 5)] * nobs,
size="large",
ax=ax,
)
ax.set_xlabel("resid2")
ax.set_ylabel("leverage")
ylim = ax.get_ylim()
ax.vlines(rm, *ylim)
xlim = ax.get_xlim()
ax.hlines(hm, *xlim)
ax.margins(0, 0)
