Patsy: Contrast Coding Systems for Categorical Variables

Note

This document is based heavily on this excellent resource from UCLA.

A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.

In fact, dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of contrasts for a categorical variable with k levels is a set of k-1 functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. Dummy coding is not wrong per se. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients, as in ANOVA. Linear regression models do not assume independence of the coefficients, and thus dummy coding is often the only coding that is taught in this context.
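
To make the sum-to-zero distinction concrete, here is a minimal numpy sketch (deliberately not using patsy; the names `treatment` and `sum_code` are illustrative only). The contrast columns sum to zero and are therefore independent of the intercept; the dummy columns are not:

```python
import numpy as np

k = 4
# treatment (dummy) coding: an indicator column for each non-reference level
treatment = np.eye(k)[:, 1:]
# sum (deviation) coding: identity for the first k-1 levels, last row all -1
sum_code = np.vstack([np.eye(k - 1), -np.full(k - 1, 1.0)])

print(treatment.sum(axis=0))  # -> [1. 1. 1.]
print(sum_code.sum(axis=0))   # -> [0. 0. 0.]
```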

To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let's load the data.

Example Data

In [1]: import pandas

In [2]: url = 'https://stats.idre.ucla.edu/stat/data/hsb2.csv'

In [3]: hsb2 = pandas.read_csv(url)

It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American, 4 = Caucasian).

In [4]: hsb2.groupby('race')['write'].mean()
Out[4]: 
race
1    46.458333
2    58.000000
3    48.200000
4    54.055172
Name: write, dtype: float64

Treatment (Dummy) Coding

Dummy coding is likely the most well-known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be

In [5]: from patsy.contrasts import Treatment

In [6]: levels = [1,2,3,4]

In [7]: contrast = Treatment(reference=0).code_without_intercept(levels)

In [8]: print(contrast.matrix)
[[0. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Here we used reference=0, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and thus are not independent of the intercept. To be explicit, let's look at how this would encode the race variable.

In [9]: contrast.matrix[hsb2.race-1, :][:20]
Out[9]: 
array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

This is a bit of a trick, as the race category conveniently maps to zero-based indices. If it did not, this conversion would happen under the hood, so this will not work in general but is nonetheless a useful exercise to fix ideas. The output below illustrates a regression using the Treatment contrast.

In [10]: from statsmodels.formula.api import ols

In [11]: mod = ols("write ~ C(race, Treatment)", data=hsb2)

In [12]: res = mod.fit()

In [13]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        18:36:20   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  46.4583      1.842     25.218      0.000      42.825      50.091
C(race, Treatment)[T.2]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Treatment)[T.3]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Treatment)[T.4]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         8.25
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted this.
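
The Treatment coefficients can also be checked from first principles. Below is a self-contained numpy sketch on synthetic data (the group means 46, 58, 48, 54 merely echo the write means above; all names are illustrative): with dummy columns, the intercept estimates the reference level mean and each slope estimates a level mean minus the reference mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data: four groups whose true means echo the write means above
group = np.repeat(np.arange(4), 20)
y = np.array([46.0, 58.0, 48.0, 54.0])[group] + rng.normal(0, 3, group.size)

# treatment coding with level 0 as the reference
T = np.eye(4)[:, 1:]
X = np.column_stack([np.ones(y.size), T[group]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

gm = np.array([y[group == g].mean() for g in range(4)])
# intercept == reference level mean; slopes == level mean - reference mean
```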

Simple Coding

Like Treatment coding, Simple coding compares each level to a fixed reference level. However, with Simple coding, the intercept is the grand mean of all the levels of the factor. See User-Defined Coding below for how to implement the Simple contrast.

In [14]: contrast = Simple().code_without_intercept(levels)

In [15]: print(contrast.matrix)
[[-0.25 -0.25 -0.25]
 [ 0.75 -0.25 -0.25]
 [-0.25  0.75 -0.25]
 [-0.25 -0.25  0.75]]

In [16]: mod = ols("write ~ C(race, Simple)", data=hsb2)

In [17]: res = mod.fit()

In [18]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        18:36:20   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Simple)[Simp.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Simple)[Simp.2]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Simple)[Simp.3]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
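
The two properties of Simple coding noted above — the intercept is the mean of the level means, and each slope is a level mean minus the reference mean — can be checked with plain numpy on synthetic, deliberately unbalanced data (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic, deliberately unbalanced groups
y = np.concatenate([rng.normal(10, 1, 5),
                    rng.normal(14, 1, 9),
                    rng.normal(12, 1, 6)])
group = np.repeat([0, 1, 2], [5, 9, 6])

# simple contrast for k = 3 levels: -1/k everywhere, (k-1)/k on the diagonal
k = 3
contrast = np.full((k, k - 1), -1.0 / k)
contrast[1:, :][np.diag_indices(k - 1)] = (k - 1.0) / k

X = np.column_stack([np.ones(y.size), contrast[group]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
gm = np.array([y[group == g].mean() for g in range(k)])
# even with unbalanced groups, the intercept is the mean of the group means
```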

Sum (Deviation) Coding

Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared with all the others, level 2 with all the others, and level 3 with all the others.

In [19]: from patsy.contrasts import Sum

In [20]: contrast = Sum().code_without_intercept(levels)

In [21]: print(contrast.matrix)
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [-1. -1. -1.]]

In [22]: mod = ols("write ~ C(race, Sum)", data=hsb2)

In [23]: res = mod.fit()

In [24]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        18:36:20   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Sum)[S.1]    -5.2200      1.631     -3.200      0.002      -8.437      -2.003
C(race, Sum)[S.2]     6.3216      2.160      2.926      0.004       2.061      10.582
C(race, Sum)[S.3]    -3.4784      1.732     -2.008      0.046      -6.895      -0.062
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         6.72
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean, where the grand mean is the mean of the means of the dependent variable at each level.

In [25]: hsb2.groupby('race')['write'].mean().mean()
Out[25]: np.float64(51.67837643678162)
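
A numpy sketch on synthetic data can likewise confirm that under Sum coding each slope is a level mean minus the grand mean (a hedged illustration, not patsy's implementation; the group means echo the write means above):

```python
import numpy as np

rng = np.random.default_rng(3)
group = np.repeat(np.arange(4), 15)
y = np.array([46.0, 58.0, 48.0, 54.0])[group] + rng.normal(0, 2, group.size)

# sum coding: identity for levels 1..k-1, the last level coded all -1
S = np.vstack([np.eye(3), -np.ones(3)])
X = np.column_stack([np.ones(y.size), S[group]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
gm = np.array([y[group == g].mean() for g in range(4)])
# intercept == grand mean; each slope == level mean - grand mean
```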

Backward Difference Coding

In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.

In [26]: from patsy.contrasts import Diff

In [27]: contrast = Diff().code_without_intercept(levels)

In [28]: print(contrast.matrix)
[[-0.75 -0.5  -0.25]
 [ 0.25 -0.5  -0.25]
 [ 0.25  0.5  -0.25]
 [ 0.25  0.5   0.75]]

In [29]: mod = ols("write ~ C(race, Diff)", data=hsb2)

In [30]: res = mod.fit()

In [31]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        18:36:20   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Diff)[D.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Diff)[D.2]    -9.8000      3.388     -2.893      0.004     -16.481      -3.119
C(race, Diff)[D.3]     5.8552      2.153      2.720      0.007       1.610      10.101
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         8.30
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For example, here the coefficient on level 1 is the mean of write at level 2 compared with the mean at level 1. That is,

In [32]: res.params["C(race, Diff)[D.1]"]
Out[32]: np.float64(11.541666666666618)

In [33]: hsb2.groupby('race').mean()["write"].loc[2] - \
   ....:     hsb2.groupby('race').mean()["write"].loc[1]
   ....: 
Out[33]: np.float64(11.541666666666664)
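
The same relationship holds for every coefficient. Below is a self-contained numpy sketch with a hypothetical helper, `backward_diff`, that reproduces the Diff matrix shown above and verifies on synthetic data that each slope equals a successive difference of level means:

```python
import numpy as np

def backward_diff(k):
    # hypothetical helper reproducing patsy's Diff coding for k levels
    m = np.zeros((k, k - 1))
    for j in range(k - 1):
        m[: j + 1, j] = (j + 1 - k) / k
        m[j + 1:, j] = (j + 1) / k
    return m

rng = np.random.default_rng(42)
group = np.repeat(np.arange(4), 30)
y = np.array([46.0, 58.0, 48.0, 54.0])[group] + rng.normal(0, 2, group.size)

X = np.column_stack([np.ones(y.size), backward_diff(4)[group]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
gm = np.array([y[group == g].mean() for g in range(4)])
# each slope equals the mean at a level minus the mean at the prior level
```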

Helmert Coding

Our version of Helmert coding is sometimes referred to as reverse Helmert coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Hence, the name "reverse" is sometimes applied to differentiate it from forward Helmert coding. This comparison does not make much sense for a nominal variable such as race, but we would use the Helmert contrast like so:

In [34]: from patsy.contrasts import Helmert

In [35]: contrast = Helmert().code_without_intercept(levels)

In [36]: print(contrast.matrix)
[[-1. -1. -1.]
 [ 1. -1. -1.]
 [ 0.  2. -1.]
 [ 0.  0.  3.]]

In [37]: mod = ols("write ~ C(race, Helmert)", data=hsb2)

In [38]: res = mod.fit()

In [39]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        18:36:20   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Helmert)[H.2]     5.7708      1.643      3.512      0.001       2.530       9.011
C(race, Helmert)[H.3]    -1.3431      0.867     -1.548      0.123      -3.054       0.368
C(race, Helmert)[H.4]     0.7923      0.372      2.130      0.034       0.059       1.526
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.26
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To illustrate, the comparison on level 4 is the mean of the dependent variable at the previous three levels taken away from the mean at level 4

In [40]: grouped = hsb2.groupby('race')

In [41]: grouped.mean()["write"].loc[4] - grouped.mean()["write"].loc[:3].mean()
Out[41]: np.float64(3.169061302681982)

As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same.

In [42]: k = 4

In [43]: 1./k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[:k-1].mean())
Out[43]: np.float64(0.7922653256704955)

In [44]: k = 3

In [45]: 1./k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[:k-1].mean())
Out[45]: np.float64(-1.3430555555555561)
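
One way to obtain the actual mean differences directly is to rescale each Helmert column by its level count. The sketch below is an assumption-laden illustration, not patsy API: the hypothetical `helmert` helper mirrors the matrix above, and dividing column j by its level count (2, 3, 4) turns each coefficient into an actual mean difference, level versus the mean of all prior levels.

```python
import numpy as np

def helmert(k):
    # hypothetical helper mirroring patsy's (reverse) Helmert matrix
    m = np.zeros((k, k - 1))
    for j in range(1, k):
        m[:j, j - 1] = -1.0
        m[j, j - 1] = float(j)
    return m

rng = np.random.default_rng(1)
group = np.repeat(np.arange(4), 25)
y = np.array([46.0, 58.0, 48.0, 54.0])[group] + rng.normal(0, 2, group.size)
gm = np.array([y[group == g].mean() for g in range(4)])

# rescale column j by its level count so coefficients are mean differences
scaled = helmert(4) / np.arange(2, 5)
X = np.column_stack([np.ones(y.size), scaled[group]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```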

Orthogonal Polynomial Coding

The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced, numeric variable. Therefore, this type of coding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order k-1. Since race is not an ordered factor variable, let's use read as an example. First we need to create an ordered categorical variable from read.

In [46]: import numpy as np; _, bins = np.histogram(hsb2.read, 3)

In [47]: try: # requires numpy main
   ....:     readcat = np.digitize(hsb2.read, bins, True)
   ....: except:
   ....:     readcat = np.digitize(hsb2.read, bins)
   ....: 

In [48]: hsb2['readcat'] = readcat

In [49]: hsb2.groupby('readcat').mean()['write']
Out[49]: 
readcat
0    46.000000
1    44.980392
2    53.356436
3    60.127660
Name: write, dtype: float64

In [50]: from patsy.contrasts import Poly

In [51]: levels = hsb2.readcat.unique().tolist()

In [52]: contrast = Poly().code_without_intercept(levels)

In [53]: print(contrast.matrix)
[[-0.6708  0.5    -0.2236]
 [-0.2236 -0.5     0.6708]
 [ 0.2236 -0.5    -0.6708]
 [ 0.6708  0.5     0.2236]]
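
The Poly matrix above can be reproduced (up to sign and rounding) by QR-orthonormalizing a Vandermonde matrix of equally spaced scores — a numpy-only sketch offered as an illustration of where the orthogonal polynomial columns come from:

```python
import numpy as np

# four equally spaced scores standing in for the ordered levels
scores = np.array([0.0, 1.0, 2.0, 3.0])
X = np.vander(scores, 4, increasing=True)  # columns: 1, x, x^2, x^3
Q, _ = np.linalg.qr(X)

poly = Q[:, 1:]                  # drop the (constant) intercept column
poly = poly * np.sign(poly[-1])  # QR column signs are arbitrary; fix them
print(np.round(poly, 4))
```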

In [54]: mod = ols("write ~ C(readcat, Poly)", data=hsb2)

In [55]: res = mod.fit()

In [56]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.320
Model:                            OLS   Adj. R-squared:                  0.309
Method:                 Least Squares   F-statistic:                     30.73
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           2.51e-16
Time:                        18:36:20   Log-Likelihood:                -694.54
No. Observations:                 200   AIC:                             1397.
Df Residuals:                     196   BIC:                             1410.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                     51.1161      2.018     25.324      0.000      47.135      55.097
C(readcat, Poly).Linear       11.3501      5.348      2.122      0.035       0.803      21.897
C(readcat, Poly).Quadratic     3.8954      4.037      0.965      0.336      -4.066      11.857
C(readcat, Poly).Cubic        -2.4598      1.998     -1.231      0.220      -6.400       1.480
==============================================================================
Omnibus:                        9.741   Durbin-Watson:                   1.699
Prob(Omnibus):                  0.008   Jarque-Bera (JB):               10.263
Skew:                          -0.535   Prob(JB):                      0.00591
Kurtosis:                       2.703   Cond. No.                         13.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As you can see, readcat has a significant linear effect on the dependent variable write, but the quadratic and cubic effects are not significant.

User-Defined Coding

If you want to use your own coding, you must do so by writing a coding class that contains a `code_with_intercept` and a `code_without_intercept` method, each returning a patsy.contrast.ContrastMatrix instance.

In [57]: from patsy.contrasts import ContrastMatrix
   ....: 
   ....: def _name_levels(prefix, levels):
   ....:     return ["[%s%s]" % (prefix, level) for level in levels]
   ....: 

In [58]: class Simple(object):
   ....:     def _simple_contrast(self, levels):
   ....:         nlevels = len(levels)
   ....:         contr = -1./nlevels * np.ones((nlevels, nlevels-1))
   ....:         contr[1:][np.diag_indices(nlevels-1)] = (nlevels-1.)/nlevels
   ....:         return contr
   ....: 
   ....:     def code_with_intercept(self, levels):
   ....:         contrast = np.column_stack((np.ones(len(levels)),
   ....:                                    self._simple_contrast(levels)))
   ....:         return ContrastMatrix(contrast, _name_levels("Simp.", levels))
   ....: 
   ....:     def code_without_intercept(self, levels):
   ....:         contrast = self._simple_contrast(levels)
   ....:         return ContrastMatrix(contrast, _name_levels("Simp.", levels[:-1]))
   ....: 

In [60]: mod = ols("write ~ C(race, Simple)", data=hsb2)
   ....: res = mod.fit()
   ....: print(res.summary())
   ....: 
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        18:36:20   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Simple)[Simp.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Simple)[Simp.2]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Simple)[Simp.3]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Last update: Oct 16, 2024