序数回归¶
[1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.miscmodels.ordinal_model import OrderedModel
从UCLA网站加载Stata数据文件。这个笔记本的灵感来自于https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/,这是UCLA的一个R笔记本。
[2]:
url = "https://stats.idre.ucla.edu/stat/data/ologit.dta"
data_student = pd.read_stata(url)
[3]:
data_student.head(5)
[3]:
| apply | pared | public | gpa | |
|---|---|---|---|---|
| 0 | very likely | 0 | 0 | 3.26 |
| 1 | somewhat likely | 1 | 0 | 3.21 |
| 2 | unlikely | 1 | 1 | 3.94 |
| 3 | somewhat likely | 0 | 0 | 2.81 |
| 4 | somewhat likely | 0 | 0 | 2.53 |
[4]:
data_student.dtypes
[4]:
apply category
pared int8
public int8
gpa float32
dtype: object
[5]:
data_student['apply'].dtype
[5]:
CategoricalDtype(categories=['unlikely', 'somewhat likely', 'very likely'], ordered=True, categories_dtype=object)
这个数据集是关于本科生在给定三个外生变量的情况下申请研究生院的概率:- 他们的平均绩点(gpa),一个介于0到4之间的浮点数。- pared,一个二进制值,表示至少有一位家长上过研究生院。- 以及public,一个二进制值,表示学生当前的本科院校是公立还是私立。
apply,目标变量是具有有序类别的分类变量:不太可能 < 有点可能 < 非常可能。它是一个pd.Serie类型的分类变量,这比NumPy数组更受欢迎。
该模型基于一个数值型的潜在变量 \(y_{latent}\),我们无法直接观察到它,但可以通过外生变量计算出来。此外,我们可以使用这个 \(y_{latent}\) 来定义我们可以观察到的 \(y\)。
更多详情请参阅OrderedModel的文档,UCLA网页或这本书。
Probit 序数回归:¶
[6]:
mod_prob = OrderedModel(data_student['apply'],
data_student[['pared', 'public', 'gpa']],
distr='probit')
res_prob = mod_prob.fit(method='bfgs')
res_prob.summary()
Optimization terminated successfully.
Current function value: 0.896869
Iterations: 17
Function evaluations: 21
Gradient evaluations: 21
[6]:
| Dep. Variable: | apply | Log-Likelihood: | -358.75 |
|---|---|---|---|
| Model: | OrderedModel | AIC: | 727.5 |
| Method: | Maximum Likelihood | BIC: | 747.5 |
| Date: | Wed, 16 Oct 2024 | ||
| Time: | 18:27:00 | ||
| No. Observations: | 400 | ||
| Df Residuals: | 395 | ||
| Df Model: | 3 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| pared | 0.5981 | 0.158 | 3.789 | 0.000 | 0.289 | 0.908 |
| public | 0.0102 | 0.173 | 0.059 | 0.953 | -0.329 | 0.349 |
| gpa | 0.3582 | 0.157 | 2.285 | 0.022 | 0.051 | 0.665 |
| unlikely/somewhat likely | 1.2968 | 0.468 | 2.774 | 0.006 | 0.381 | 2.213 |
| somewhat likely/very likely | 0.1873 | 0.074 | 2.530 | 0.011 | 0.042 | 0.332 |
在我们的模型中,我们有3个外生变量(如果我们保持文档的符号,即\(\beta\)),因此我们需要估计3个系数。
这3个估计值及其标准误差可以在摘要表中检索到。
由于目标变量中有3个类别(不太可能、有些可能、非常可能),我们需要估计两个阈值。如方法文档中OrderedModel.transform_threshold_params所述,第一个估计的阈值是实际值,所有其他阈值都是以累积指数增量表示的。实际阈值可以通过以下方式计算:
[7]:
num_of_thresholds = 2
mod_prob.transform_threshold_params(res_prob.params[-num_of_thresholds:])
[7]:
array([ -inf, 1.29684541, 2.50285885, inf])
Logit 序数回归:¶
[8]:
mod_log = OrderedModel(data_student['apply'],
data_student[['pared', 'public', 'gpa']],
distr='logit')
res_log = mod_log.fit(method='bfgs', disp=False)
res_log.summary()
[8]:
| Dep. Variable: | apply | Log-Likelihood: | -358.51 |
|---|---|---|---|
| Model: | OrderedModel | AIC: | 727.0 |
| Method: | Maximum Likelihood | BIC: | 747.0 |
| Date: | Wed, 16 Oct 2024 | ||
| Time: | 18:27:00 | ||
| No. Observations: | 400 | ||
| Df Residuals: | 395 | ||
| Df Model: | 3 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| pared | 1.0476 | 0.266 | 3.942 | 0.000 | 0.527 | 1.569 |
| public | -0.0586 | 0.298 | -0.197 | 0.844 | -0.642 | 0.525 |
| gpa | 0.6158 | 0.261 | 2.363 | 0.018 | 0.105 | 1.127 |
| unlikely/somewhat likely | 2.2035 | 0.780 | 2.827 | 0.005 | 0.676 | 3.731 |
| somewhat likely/very likely | 0.7398 | 0.080 | 9.236 | 0.000 | 0.583 | 0.897 |
[9]:
predicted = res_log.model.predict(res_log.params, exog=data_student[['pared', 'public', 'gpa']])
predicted
[9]:
array([[0.54884071, 0.35932276, 0.09183653],
[0.30558191, 0.47594216, 0.21847593],
[0.22938356, 0.47819057, 0.29242587],
...,
[0.69380357, 0.25470075, 0.05149568],
[0.54884071, 0.35932276, 0.09183653],
[0.50896793, 0.38494062, 0.10609145]])
[10]:
pred_choice = predicted.argmax(1)
print('Fraction of correct choice predictions')
print((np.asarray(data_student['apply'].values.codes) == pred_choice).mean())
Fraction of correct choice predictions
0.5775
使用自定义累积cLogLog分布的序数回归:¶
除了logit和probit回归外,SciPy.stats包中的任何连续分布都可以用于distr参数。或者,可以通过从rv_continuous创建子类并实现一些方法来定义自己的分布。
[11]:
# using a SciPy distribution
res_exp = OrderedModel(data_student['apply'],
data_student[['pared', 'public', 'gpa']],
distr=stats.expon).fit(method='bfgs', disp=False)
res_exp.summary()
[11]:
| Dep. Variable: | apply | Log-Likelihood: | -360.84 |
|---|---|---|---|
| Model: | OrderedModel | AIC: | 731.7 |
| Method: | Maximum Likelihood | BIC: | 751.6 |
| Date: | Wed, 16 Oct 2024 | ||
| Time: | 18:27:01 | ||
| No. Observations: | 400 | ||
| Df Residuals: | 395 | ||
| Df Model: | 3 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| pared | 0.4690 | 0.117 | 4.021 | 0.000 | 0.240 | 0.698 |
| public | -0.1308 | 0.149 | -0.879 | 0.379 | -0.422 | 0.161 |
| gpa | 0.2198 | 0.134 | 1.638 | 0.101 | -0.043 | 0.483 |
| unlikely/somewhat likely | 1.5370 | 0.405 | 3.792 | 0.000 | 0.742 | 2.332 |
| somewhat likely/very likely | 0.4082 | 0.093 | 4.403 | 0.000 | 0.226 | 0.590 |
[12]:
# minimal definition of a custom scipy distribution.
class CLogLog(stats.rv_continuous):
def _ppf(self, q):
return np.log(-np.log(1 - q))
def _cdf(self, x):
return 1 - np.exp(-np.exp(x))
cloglog = CLogLog()
# definition of the model and fitting
res_cloglog = OrderedModel(data_student['apply'],
data_student[['pared', 'public', 'gpa']],
distr=cloglog).fit(method='bfgs', disp=False)
res_cloglog.summary()
[12]:
| Dep. Variable: | apply | Log-Likelihood: | -359.75 |
|---|---|---|---|
| Model: | OrderedModel | AIC: | 729.5 |
| Method: | Maximum Likelihood | BIC: | 749.5 |
| Date: | Wed, 16 Oct 2024 | ||
| Time: | 18:27:01 | ||
| No. Observations: | 400 | ||
| Df Residuals: | 395 | ||
| Df Model: | 3 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| pared | 0.5167 | 0.161 | 3.202 | 0.001 | 0.200 | 0.833 |
| public | 0.1081 | 0.168 | 0.643 | 0.520 | -0.221 | 0.438 |
| gpa | 0.3344 | 0.154 | 2.168 | 0.030 | 0.032 | 0.637 |
| unlikely/somewhat likely | 0.8705 | 0.455 | 1.912 | 0.056 | -0.022 | 1.763 |
| somewhat likely/very likely | 0.0989 | 0.071 | 1.384 | 0.167 | -0.041 | 0.239 |
使用公式 - 内生变量的处理¶
Pandas 的有序分类和数值型数据在公式中作为因变量是支持的。其他类型将会引发一个 ValueError。
[13]:
modf_logit = OrderedModel.from_formula("apply ~ 0 + pared + public + gpa", data_student,
distr='logit')
resf_logit = modf_logit.fit(method='bfgs')
resf_logit.summary()
Optimization terminated successfully.
Current function value: 0.896281
Iterations: 22
Function evaluations: 24
Gradient evaluations: 24
[13]:
| Dep. Variable: | apply | Log-Likelihood: | -358.51 |
|---|---|---|---|
| Model: | OrderedModel | AIC: | 727.0 |
| Method: | Maximum Likelihood | BIC: | 747.0 |
| Date: | Wed, 16 Oct 2024 | ||
| Time: | 18:27:01 | ||
| No. Observations: | 400 | ||
| Df Residuals: | 395 | ||
| Df Model: | 3 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| pared | 1.0476 | 0.266 | 3.942 | 0.000 | 0.527 | 1.569 |
| public | -0.0586 | 0.298 | -0.197 | 0.844 | -0.642 | 0.525 |
| gpa | 0.6158 | 0.261 | 2.363 | 0.018 | 0.105 | 1.127 |
| unlikely/somewhat likely | 2.2035 | 0.780 | 2.827 | 0.005 | 0.676 | 3.731 |
| somewhat likely/very likely | 0.7398 | 0.080 | 9.236 | 0.000 | 0.583 | 0.897 |
使用数值代码作为因变量是支持的,但会丢失类别级别的名称。级别和名称对应于因变量的唯一值,按字母数字顺序排序,就像在不使用公式的情况下一样。
[14]:
data_student["apply_codes"] = data_student['apply'].cat.codes * 2 + 5
data_student["apply_codes"].head()
[14]:
0 9
1 7
2 5
3 7
4 7
Name: apply_codes, dtype: int8
[15]:
OrderedModel.from_formula("apply_codes ~ 0 + pared + public + gpa", data_student,
distr='logit').fit().summary()
Optimization terminated successfully.
Current function value: 0.896281
Iterations: 421
Function evaluations: 663
[15]:
| Dep. Variable: | apply_codes | Log-Likelihood: | -358.51 |
|---|---|---|---|
| Model: | OrderedModel | AIC: | 727.0 |
| Method: | Maximum Likelihood | BIC: | 747.0 |
| Date: | Wed, 16 Oct 2024 | ||
| Time: | 18:27:02 | ||
| No. Observations: | 400 | ||
| Df Residuals: | 395 | ||
| Df Model: | 3 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| pared | 1.0477 | 0.266 | 3.942 | 0.000 | 0.527 | 1.569 |
| public | -0.0587 | 0.298 | -0.197 | 0.844 | -0.642 | 0.525 |
| gpa | 0.6157 | 0.261 | 2.362 | 0.018 | 0.105 | 1.127 |
| 5.0/7.0 | 2.2033 | 0.780 | 2.826 | 0.005 | 0.675 | 3.731 |
| 7.0/9.0 | 0.7398 | 0.080 | 9.236 | 0.000 | 0.583 | 0.897 |
[16]:
resf_logit.predict(data_student.iloc[:5])
[16]:
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | 0.548841 | 0.359323 | 0.091837 |
| 1 | 0.305582 | 0.475942 | 0.218476 |
| 2 | 0.229384 | 0.478191 | 0.292426 |
| 3 | 0.616118 | 0.312690 | 0.071191 |
| 4 | 0.656003 | 0.283398 | 0.060599 |
直接使用字符串值作为因变量会引发ValueError。
[17]:
data_student["apply_str"] = np.asarray(data_student["apply"])
data_student["apply_str"].head()
[17]:
0 very likely
1 somewhat likely
2 unlikely
3 somewhat likely
4 somewhat likely
Name: apply_str, dtype: object
[18]:
data_student.apply_str = pd.Categorical(data_student.apply_str, ordered=True)
data_student.public = data_student.public.astype(float)
data_student.pared = data_student.pared.astype(float)
[19]:
OrderedModel.from_formula("apply_str ~ 0 + pared + public + gpa", data_student,
distr='logit')
[19]:
<statsmodels.miscmodels.ordinal_model.OrderedModel at 0x115f15510>
使用公式 - 模型中无常数项¶
OrderedModel 的参数化要求模型中没有常数项,无论是显式的还是隐式的。常数项相当于移动所有阈值,因此不能单独识别。
Patsy的公式规范不允许在没有显式或隐式常数的情况下生成设计矩阵,如果解释变量中包含分类变量(或可能是样条函数)。作为变通方法,statsmodels会移除显式截距。
因此,有两种有效的情况可以获得不带截距的设计矩阵。
指定一个没有显式和隐式截距的模型,这在模型中只有数值变量时是可能的。
指定一个带有显式截距的模型,statsmodels 将会移除它。
具有隐式截距的模型将会过度参数化,参数估计将不完全确定,cov_params将不可逆,标准误差可能包含nans。
在下面我们看一个带有附加分类变量的示例。
[20]:
nobs = len(data_student)
data_student["dummy"] = (np.arange(nobs) < (nobs / 2)).astype(float)
显式截距,将被移除:
注意 “1 +” 在这里是多余的,因为它是patsy的默认设置。
[21]:
modfd_logit = OrderedModel.from_formula("apply ~ 1 + pared + public + gpa + C(dummy)", data_student,
distr='logit')
resfd_logit = modfd_logit.fit(method='bfgs')
print(resfd_logit.summary())
Optimization terminated successfully.
Current function value: 0.896247
Iterations: 26
Function evaluations: 28
Gradient evaluations: 28
OrderedModel Results
==============================================================================
Dep. Variable: apply Log-Likelihood: -358.50
Model: OrderedModel AIC: 729.0
Method: Maximum Likelihood BIC: 752.9
Date: Wed, 16 Oct 2024
Time: 18:27:02
No. Observations: 400
Df Residuals: 394
Df Model: 4
===============================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------
C(dummy)[T.1.0] 0.0326 0.198 0.164 0.869 -0.356 0.421
pared 1.0489 0.266 3.945 0.000 0.528 1.570
public -0.0589 0.298 -0.198 0.843 -0.643 0.525
gpa 0.6153 0.261 2.360 0.018 0.104 1.126
unlikely/somewhat likely 2.2183 0.785 2.826 0.005 0.680 3.757
somewhat likely/very likely 0.7398 0.080 9.237 0.000 0.583 0.897
===============================================================================================
[22]:
modfd_logit.k_vars
[22]:
4
[23]:
modfd_logit.k_constant
[23]:
0
隐式截距 创建过度参数化的模型
在公式中指定“0 +”会删除显式截距。然而,分类编码现在改为包含一个隐式截距。在这个例子中,创建的虚拟变量 C(dummy)[0.0] 和 C(dummy)[1.0] 的总和为一。
OrderedModel.from_formula("apply ~ 0 + pared + public + gpa + C(dummy)", data_student, distr='logit')
为了查看在过度参数化情况下会发生什么,我们可以通过显式指定是否存在常数来避免模型中的常数检查。我们使用 hasconst=False,尽管模型中存在隐式常数。
两个虚拟变量列的参数和第一个阈值不是单独识别的。这些参数的估计值和标准误差的可用性是任意的,并且取决于不同环境中的数值细节。
一些汇总指标,如对数似然值,在收敛容差和数值精度范围内不受此影响。预测也应该是可能的。然而,推断不可用,或无效。
[24]:
modfd2_logit = OrderedModel.from_formula("apply ~ 0 + pared + public + gpa + C(dummy)", data_student,
distr='logit', hasconst=False)
resfd2_logit = modfd2_logit.fit(method='bfgs')
print(resfd2_logit.summary())
Optimization terminated successfully.
Current function value: 0.896247
Iterations: 24
Function evaluations: 26
Gradient evaluations: 26
OrderedModel Results
==============================================================================
Dep. Variable: apply Log-Likelihood: -358.50
Model: OrderedModel AIC: 731.0
Method: Maximum Likelihood BIC: 758.9
Date: Wed, 16 Oct 2024
Time: 18:27:02
No. Observations: 400
Df Residuals: 393
Df Model: 5
===============================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------
C(dummy)[0.0] -0.6834 nan nan nan nan nan
C(dummy)[1.0] -0.6508 nan nan nan nan nan
pared 1.0489 nan nan nan nan nan
public -0.0588 nan nan nan nan nan
gpa 0.6153 nan nan nan nan nan
unlikely/somewhat likely 1.5349 nan nan nan nan nan
somewhat likely/very likely 0.7398 nan nan nan nan nan
===============================================================================================
/Users/cw/baidu/code/fin_tool/github/statsmodels/venv/lib/python3.11/site-packages/statsmodels/base/model.py:595: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
warnings.warn('Inverting hessian failed, no bse or cov_params '
[25]:
resfd2_logit.predict(data_student.iloc[:5])
[25]:
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | 0.544858 | 0.361972 | 0.093170 |
| 1 | 0.301918 | 0.476667 | 0.221416 |
| 2 | 0.226434 | 0.477700 | 0.295867 |
| 3 | 0.612254 | 0.315481 | 0.072264 |
| 4 | 0.652280 | 0.286188 | 0.061532 |
[26]:
resf_logit.predict()
[26]:
array([[0.54884071, 0.35932276, 0.09183653],
[0.30558191, 0.47594216, 0.21847593],
[0.22938356, 0.47819057, 0.29242587],
...,
[0.69380357, 0.25470075, 0.05149568],
[0.54884071, 0.35932276, 0.09183653],
[0.50896793, 0.38494062, 0.10609145]])
二元模型与Logit的比较¶
如果因变量是有序分类变量且只有两个级别,那么该模型也可以通过Logit模型进行估计。
在这种情况下,模型(理论上)是相同的,除了常数的参数化。Logit与其他大多数模型一样,通常需要一个截距。这对应于OrderedModel中的阈值参数,但符号相反。
实现方式有所不同,并非所有相同的结果统计和后估计功能都可用。估计的参数和其他结果统计主要基于优化的收敛容差而有所不同。
[27]:
from statsmodels.discrete.discrete_model import Logit
from statsmodels.tools.tools import add_constant
我们从数据中删除中间类别,保留两个极端类别。
[28]:
mask_drop = data_student['apply'] == "somewhat likely"
data2 = data_student.loc[~mask_drop, :].copy()
# we need to remove the category also from the Categorical Index
data2['apply'] = data2['apply'].cat.remove_categories("somewhat likely")
data2["apply"].head()
[28]:
0 very likely
2 unlikely
5 unlikely
8 unlikely
10 unlikely
Name: apply, dtype: category
Categories (2, object): ['unlikely' < 'very likely']
[29]:
mod_log = OrderedModel(data2['apply'],
data2[['pared', 'public', 'gpa']],
distr='logit')
res_log = mod_log.fit(method='bfgs', disp=False)
res_log.summary()
[29]:
| Dep. Variable: | apply | Log-Likelihood: | -102.87 |
|---|---|---|---|
| Model: | OrderedModel | AIC: | 213.7 |
| Method: | Maximum Likelihood | BIC: | 228.0 |
| Date: | Wed, 16 Oct 2024 | ||
| Time: | 18:27:03 | ||
| No. Observations: | 260 | ||
| Df Residuals: | 256 | ||
| Df Model: | 3 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| pared | 1.2861 | 0.438 | 2.934 | 0.003 | 0.427 | 2.145 |
| public | 0.4014 | 0.444 | 0.903 | 0.366 | -0.470 | 1.272 |
| gpa | 0.7854 | 0.489 | 1.605 | 0.108 | -0.174 | 1.744 |
| unlikely/very likely | 4.4147 | 1.485 | 2.974 | 0.003 | 1.505 | 7.324 |
Logit 模型默认没有常数项,我们必须将其添加到解释变量中。
在估计中的收敛容差主要导致的数值精度范围内,Logit 和有序模型的结果基本上是相同的。
唯一的区别在于常数的符号,Logit 和 OrdereModel 的常数符号相反。这是由于 OrderedModel 中使用切点参数化而不是在设计矩阵中包含常数列的结果。
[30]:
ex = add_constant(data2[['pared', 'public', 'gpa']], prepend=False)
mod_logit = Logit(data2['apply'].cat.codes, ex)
res_logit = mod_logit.fit(method='bfgs', disp=False)
[31]:
res_logit.summary()
[31]:
| Dep. Variable: | y | No. Observations: | 260 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 256 |
| Method: | MLE | Df Model: | 3 |
| Date: | Wed, 16 Oct 2024 | Pseudo R-squ.: | 0.07842 |
| Time: | 18:27:03 | Log-Likelihood: | -102.87 |
| converged: | True | LL-Null: | -111.62 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0005560 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| pared | 1.2861 | 0.438 | 2.934 | 0.003 | 0.427 | 2.145 |
| public | 0.4014 | 0.444 | 0.903 | 0.366 | -0.470 | 1.272 |
| gpa | 0.7854 | 0.489 | 1.605 | 0.108 | -0.174 | 1.744 |
| const | -4.4148 | 1.485 | -2.974 | 0.003 | -7.324 | -1.505 |
在OrderedModel中,稳健标准误差同样可用,其方式与discrete.Logit中相同。例如,我们指定HAC协方差类型,尽管我们处理的是横截面数据,并且自相关并不适用。
[32]:
res_logit_hac = mod_logit.fit(method='bfgs', disp=False, cov_type="hac", cov_kwds={"maxlags": 2})
res_log_hac = mod_log.fit(method='bfgs', disp=False, cov_type="hac", cov_kwds={"maxlags": 2})
[33]:
res_logit_hac.bse.values - res_log_hac.bse
[33]:
pared 7.966873e-08
public -1.820026e-07
gpa 1.183549e-07
unlikely/very likely 3.345510e-07
dtype: float64