AutoGluon Tabular - Feature Engineering
Introduction
Feature engineering involves taking raw tabular data and converting it into a format ready for the machine learning model to read, and trying to enhance certain columns ("features" in ML jargon) to give the machine learning models more information, in the hope of getting more accurate results.
AutoGluon does some of this for you. This document describes how that works, and how you can extend it. We describe the default behavior here, much of which is configurable, along with pointers on how to alter the behavior from the defaults.
Column Types
AutoGluon Tabular recognizes the following types of features, and has separate processing for them:
| Feature Type | Example Values |
| --- | --- |
| boolean | A, B |
| numerical | 1.3, 2.0, -1.6 |
| categorical | Red, Blue, Yellow |
| datetime | 1/31/2021, Mar-31 |
| text | Mary had a little lamb |
In addition, other AutoGluon prediction modules recognize additional feature types; these can also be enabled in AutoGluon Tabular by using the MultiModal option.
| Feature Type | Example Values |
| --- | --- |
| image | path/image123.png |
Column Type Detection
- Boolean columns are any columns with only 2 unique values.
- Any string columns are deemed categorical unless they are text (see below). Some models perform better if you tell them which columns are categorical and which are continuous.
- Numeric columns are passed through without change, except to identify them as float or int. Currently, numeric columns are not tested to determine whether they are likely to be categorical. You can force them to be treated as categorical with the Pandas syntax .astype("category"); see below.
- Text columns are detected by first checking whether most rows are unique. If they are, and multiple separate words are detected in most rows, the column is a text column. For details see common/features/infer_types.py in the source.
- Datetime columns are detected by trying to convert them to Pandas datetimes. Pandas detects a wide range of datetime formats. If many of the values in a column are successfully converted, they are datetimes. Currently, datetimes that appear to be purely numeric (e.g. 20210530) are not correctly detected. Any NaN values are set to the column mean. For details see common/features/infer_types.py.
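If you want to check what AutoGluon will infer for your own columns before training, you can run the type inference directly. Below is a minimal sketch, assuming the FeatureMetadata.from_df helper exported from autogluon.tabular; the toy dataframe is invented for illustration:
import pandas as pd
from autogluon.tabular import FeatureMetadata
df = pd.DataFrame({
    'num': [1.3, 2.0, -1.6],
    'cat': ['Red', 'Blue', 'Yellow'],
    'date': pd.to_datetime(['2021-01-31', '2021-03-31', '2021-05-30']),
})
# from_df runs the same raw/special dtype inference used by the feature generators.
print(FeatureMetadata.from_df(df))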
Problem Type Detection
If the user does not specify whether the problem is a classification or a regression problem, the 'label' column is examined to try to guess. Several things point towards a regression problem: the values are floating-point non-integers, and there are a large number of unique values. Within classification, both multiclass and binary (n=2 categories) detection is supported. For details see utils/utils.py.
To override the automatic inference, explicitly pass problem_type (one of 'binary', 'regression', 'multiclass') to TabularPredictor(). For example:
predictor = TabularPredictor(label='class', problem_type='multiclass').fit(train_data)
Automatic Feature Engineering
Numerical Columns
Numerical columns, both integer and floating point, currently have no automated feature engineering.
Categorical Columns
Since many downstream models require categories to be encoded as integers, each categorical feature is mapped to monotonically increasing integers.
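The effect resembles pandas' own category codes; here is a minimal plain-pandas illustration (not AutoGluon's exact implementation, which additionally minimizes category memory usage):
import pandas as pd
colors = pd.Series(['Red', 'Blue', 'Yellow', 'Blue'], dtype='category')
# Each category is assigned an integer code; repeated values share a code.
print(colors.cat.codes.tolist())  # [1, 0, 2, 0] (categories sorted: Blue=0, Red=1, Yellow=2)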
Datetime Columns
Columns recognized as datetime are converted to several features:
- A numerical Pandas datetime. Note that this has minimum and maximum values specified by pandas.Timestamp.min and pandas.Timestamp.max respectively, which may affect dates very far into the future or past.
- Several extracted columns, the default being [year, month, day, dayofweek]. This is configurable via the DatetimeFeatureGenerator; a plain-pandas sketch of the extraction follows this list.
Note that missing, invalid and out-of-range features generated by the above logic will be converted to the mean value across all valid rows.
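Roughly what this produces, sketched in plain pandas (the real work is done by DatetimeFeatureGenerator; the column names below mirror the output shown later in this tutorial):
import pandas as pd
dates = pd.to_datetime(pd.Series(['2021-01-31', '2021-03-31']))
features = pd.DataFrame({
    'C': dates.astype('int64'),       # raw timestamp in nanoseconds
    'C.year': dates.dt.year,
    'C.month': dates.dt.month,
    'C.day': dates.dt.day,
    'C.dayofweek': dates.dt.dayofweek,
})
print(features)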
Text Columns
If the MultiModal option is enabled, text columns are processed using a full Transformer neural network model with pretrained NLP models.
Otherwise, they are processed in two simpler ways:
- An n-gram feature generator extracts n-grams (short strings) from the text feature, adding many additional columns, one for each n-gram. These columns are 'n-hot' encoded: they contain 1 or more if the original feature contains the n-gram one or more times, and 0 otherwise. By default, all text columns are concatenated before this stage is applied, and the n-grams are individual words, not substrings of words. You can configure this via the TextNgramFeatureGenerator class. The n-gram generation is done in generators/text_ngram.py; a minimal sketch of the idea follows this list.
- Some additional numerical features are calculated, such as word counts, character counts, the proportion of capitalized characters, etc. This is configurable via the TextSpecialFeatureGenerator and is done in generators/text_special.py.
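The fit logs later in this tutorial show a scikit-learn CountVectorizer being fit for the n-gram stage. Here is a minimal sketch of the same idea, using a few of the strings from the example below:
from sklearn.feature_extraction.text import CountVectorizer
docs = ['ghi jkl d ef', 'd ef ef ef', 'abc jkl abc abc']
vectorizer = CountVectorizer()  # word-level tokens, not character substrings
counts = vectorizer.fit_transform(docs)
# The default token pattern drops single-character words, so 'd' gets no column.
print(vectorizer.get_feature_names_out())  # ['abc' 'ef' 'ghi' 'jkl']
print(counts.toarray())  # per-row word counts: [[0 1 1 1], [0 3 0 0], [3 0 0 1]]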
Additional Processing
Columns containing only 1 value are dropped before being passed to models.
Columns containing duplicates of other columns are removed before being passed to models. A rough pandas equivalent of these two cleanup steps is sketched below.
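This sketch is only illustrative, not how AutoGluon implements it internally (the fit logs below show the responsible generators, DropUniqueFeatureGenerator and DropDuplicatesFeatureGenerator):
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [7, 7, 7], 'c': [1, 2, 3]})
df = df.loc[:, df.nunique(dropna=False) > 1]  # drop 'b': only 1 unique value
df = df.T.drop_duplicates().T                 # drop 'c': duplicate of 'a'
print(df.columns.tolist())  # ['a']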
Feature Engineering Example
By default, a feature generator called AutoMLPipelineFeatureGenerator is used. Let's see this in action. We'll create a dataframe containing a floating-point column, an integer column, a datetime column, a categorical column, and a free-text column. First, let's look at the raw data we created.
from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd
import numpy as np
import random
from sklearn.datasets import make_regression
from datetime import datetime
x, y = make_regression(n_samples=100, n_features=5, n_targets=1, random_state=1)
dfx = pd.DataFrame(x, columns=['A','B','C','D','E'])
dfy = pd.DataFrame(y, columns=['label'])
# Create an integer column, a datetime column, a categorical column and a string column to demonstrate how they are processed.
dfx['B'] = (dfx['B']).astype(int)
dfx['C'] = datetime(2000,1,1) + pd.to_timedelta(dfx['C'].astype(int), unit='D')
dfx['D'] = pd.cut(dfx['D'] * 10, [-np.inf,-5,0,5,np.inf],labels=['v','w','x','y'])
dfx['E'] = pd.Series(list(' '.join(random.choice(["abc", "d", "ef", "ghi", "jkl"]) for i in range(4)) for j in range(100)))
dataset = TabularDataset(dfx)
print(dfx)
A B C D E
0 -0.545774 0 2000-01-01 y ghi jkl d ef
1 -0.468674 0 2000-01-02 x d ef ef ef
2 1.767960 0 1999-12-31 v abc jkl abc abc
3 -0.118771 1 2000-01-01 y ef ef ghi jkl
4 0.630196 0 1999-12-31 w abc jkl d jkl
.. ... .. ... .. ...
95 -1.182318 -1 2000-01-01 v ef ef abc abc
96 0.562761 0 2000-01-01 v d d d d
97 -0.797270 0 2000-01-01 w jkl ghi ef abc
98 0.502741 0 1999-12-31 y ghi abc abc ef
99 2.056356 0 1999-12-30 w ef ghi jkl ef
[100 rows x 5 columns]
Now let's call the default feature generator, AutoMLPipelineFeatureGenerator, with no arguments, and see what it does.
from autogluon.features.generators import AutoMLPipelineFeatureGenerator
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)
|  | A | B | D | E | C | C.year | C.month | C.day | C.dayofweek | E.char_count | E.symbol_ratio. | __nlp__.abc | __nlp__.ef | __nlp__.ghi | __nlp__.jkl | __nlp__._total_ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.545774 | 0 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 5 | 3 | 0 | 1 | 1 | 1 | 3 |
| 1 | -0.468674 | 0 | 2 | NaN | 946771200000000000 | 2000 | 1 | 2 | 6 | 3 | 5 | 0 | 3 | 0 | 0 | 1 |
| 2 | 1.767960 | 0 | 0 | 1 | 946598400000000000 | 1999 | 12 | 31 | 4 | 8 | 0 | 3 | 0 | 0 | 1 | 2 |
| 3 | -0.118771 | 1 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 6 | 2 | 0 | 2 | 1 | 1 | 3 |
| 4 | 0.630196 | 0 | 1 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 6 | 2 | 1 | 0 | 0 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | -1.182318 | -1 | 0 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 6 | 2 | 2 | 2 | 0 | 0 | 2 |
| 96 | 0.562761 | 0 | 0 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 0 | 8 | 0 | 0 | 0 | 0 | 0 |
| 97 | -0.797270 | 0 | 1 | 5 | 946684800000000000 | 2000 | 1 | 1 | 5 | 7 | 1 | 1 | 1 | 1 | 1 | 4 |
| 98 | 0.502741 | 0 | 3 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 7 | 1 | 2 | 1 | 1 | 0 | 3 |
| 99 | 2.056356 | 0 | 1 | NaN | 946512000000000000 | 1999 | 12 | 30 | 3 | 6 | 2 | 0 | 2 | 1 | 1 | 3 |
100 rows × 16 columns
We can see that:
- The floating-point and integer columns 'A' and 'B' are unchanged.
- The datetime column 'C' has been converted to its raw value (in nanoseconds), and also parsed into additional year, month, day, and day-of-week columns.
- The string categorical column 'D' has been mapped 1:1 to integers, since many models accept only numerical input.
- The free-text column 'E' has been mapped to some summary features ('char_count' etc.) as well as an N-hot matrix indicating which of the common words each text contains.
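We can also ask the fitted generator directly what it inferred. A minimal sketch, assuming the feature_metadata_in and feature_metadata attributes that fitted feature generators expose:
# Inferred types of the raw input columns, and of the generated output features.
print(auto_ml_pipeline_feature_generator.feature_metadata_in)
print(auto_ml_pipeline_feature_generator.feature_metadata)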
To get more details, we should call the pipeline as part of TabularPredictor.fit(). We need to combine the dfx and dfy DataFrames, since fit() expects a single dataframe.
df = pd.concat([dfx, dfy], axis=1)
predictor = TabularPredictor(label='label')
predictor.fit(df, hyperparameters={'GBM' : {}}, feature_generator=auto_ml_pipeline_feature_generator)
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_095424"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version: 1.2b20241127
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count: 8
Memory Avail: 28.80 GB / 30.95 GB (93.1%)
Disk Space Avail: 213.30 GB / 255.99 GB (83.3%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
presets='best' : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
presets='high' : Strong accuracy with fast inference speed.
presets='good' : Good accuracy with very fast inference speed.
presets='medium' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "/home/ci/autogluon/docs/tutorials/tabular/AutogluonModels/ag-20241127_095424"
Train Data Rows: 100
Train Data Columns: 5
Label Column: label
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
Label info (max, min, mean, stddev): (186.98105511749836, -267.99365510467214, 9.38193, 71.29287)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type: regression
Preprocessing data ...
Using Feature Generators to preprocess the data ...
AutoMLPipelineFeatureGenerator is already fit, so the training data will be processed via .transform() instead of .fit_transform().
Types of features in original data (raw dtype, special dtypes):
('category', []) : 1 | ['D']
('datetime', []) : 1 | ['C']
('float', []) : 1 | ['A']
('int', []) : 1 | ['B']
('object', ['text']) : 1 | ['E']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 1 | ['D']
('category', ['text_as_category']) : 1 | ['E']
('float', []) : 1 | ['A']
('int', []) : 1 | ['B']
('int', ['binned', 'text_special']) : 2 | ['E.char_count', 'E.symbol_ratio. ']
('int', ['datetime_as_int']) : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
('int', ['text_ngram']) : 5 | ['__nlp__.abc', '__nlp__.ef', '__nlp__.ghi', '__nlp__.jkl', '__nlp__._total_']
Data preprocessing and feature engineering runtime = 0.03s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 80, Val Rows: 20
User-specified model hyperparameters to be fit:
{
'GBM': [{}],
}
Fitting 1 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBM ...
-60.6523 = Validation score (-root_mean_squared_error)
0.24s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
Ensemble Weights: {'LightGBM': 1.0}
-60.6523 = Validation score (-root_mean_squared_error)
0.0s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 0.3s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 6226.2 rows/s (20 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/home/ci/autogluon/docs/tutorials/tabular/AutogluonModels/ag-20241127_095424")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f1661fcc290>
Reading the output, note that:
- The string categorical column 'D', despite being mapped to integers, is still recognized as categorical.
- The integer column 'B' was not recognized as categorical, even though it has only a few unique values:
print(len(set(dfx['B'])))
5
To flag it as categorical, we can explicitly mark it as categorical in the original dataframe:
dfx["B"] = dfx["B"].astype("category")
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 29463.99 MB
Train Data (Original) Memory Usage: 0.01 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting DatetimeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['E']
CountVectorizer fit with vocabulary size = 4
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('datetime', []) : 1 | ['C']
('float', []) : 1 | ['A']
('object', ['text']) : 1 | ['E']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('category', ['text_as_category']) : 1 | ['E']
('float', []) : 1 | ['A']
('int', ['binned', 'text_special']) : 2 | ['E.char_count', 'E.symbol_ratio. ']
('int', ['datetime_as_int']) : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
('int', ['text_ngram']) : 5 | ['__nlp__.abc', '__nlp__.ef', '__nlp__.ghi', '__nlp__.jkl', '__nlp__._total_']
0.1s = Fit runtime
5 features in original data used to generate 16 features in processed data.
Train Data (Processed) Memory Usage: 0.01 MB (0.0% of available memory)
|  | A | B | D | E | C | C.year | C.month | C.day | C.dayofweek | E.char_count | E.symbol_ratio. | __nlp__.abc | __nlp__.ef | __nlp__.ghi | __nlp__.jkl | __nlp__._total_ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.545774 | 1 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 5 | 3 | 0 | 1 | 1 | 1 | 3 |
| 1 | -0.468674 | 1 | 2 | NaN | 946771200000000000 | 2000 | 1 | 2 | 6 | 3 | 5 | 0 | 3 | 0 | 0 | 1 |
| 2 | 1.767960 | 1 | 0 | 1 | 946598400000000000 | 1999 | 12 | 31 | 4 | 8 | 0 | 3 | 0 | 0 | 1 | 2 |
| 3 | -0.118771 | 2 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 6 | 2 | 0 | 2 | 1 | 1 | 3 |
| 4 | 0.630196 | 1 | 1 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 6 | 2 | 1 | 0 | 0 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | -1.182318 | 0 | 0 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 6 | 2 | 2 | 2 | 0 | 0 | 2 |
| 96 | 0.562761 | 1 | 0 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 0 | 8 | 0 | 0 | 0 | 0 | 0 |
| 97 | -0.797270 | 1 | 1 | 5 | 946684800000000000 | 2000 | 1 | 1 | 5 | 7 | 1 | 1 | 1 | 1 | 1 | 4 |
| 98 | 0.502741 | 1 | 3 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 7 | 1 | 2 | 1 | 1 | 0 | 3 |
| 99 | 2.056356 | 1 | 1 | NaN | 946512000000000000 | 1999 | 12 | 30 | 3 | 6 | 2 | 0 | 2 | 1 | 1 | 3 |
100 rows × 16 columns
Missing Value Handling
To illustrate how missing values are handled, let's set the first row to all NaNs:
dfx.iloc[0] = np.nan
dfx.head()
|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 0 | NaN | NaN | NaT | NaN | NaN |
| 1 | -0.468674 | 0 | 2000-01-02 | x | d ef ef ef |
| 2 | 1.767960 | 0 | 1999-12-31 | v | abc jkl abc abc |
| 3 | -0.118771 | 1 | 2000-01-01 | y | ef ef ghi jkl |
| 4 | 0.630196 | 0 | 1999-12-31 | w | abc jkl d jkl |
Now if we reprocess:
auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
auto_ml_pipeline_feature_generator.fit_transform(X=dfx)
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 29463.87 MB
Train Data (Original) Memory Usage: 0.01 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting DatetimeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['E']
CountVectorizer fit with vocabulary size = 4
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('datetime', []) : 1 | ['C']
('float', []) : 1 | ['A']
('object', ['text']) : 1 | ['E']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('category', ['text_as_category']) : 1 | ['E']
('float', []) : 1 | ['A']
('int', ['binned', 'text_special']) : 3 | ['E.char_count', 'E.word_count', 'E.symbol_ratio. ']
('int', ['datetime_as_int']) : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
('int', ['text_ngram']) : 5 | ['__nlp__.abc', '__nlp__.ef', '__nlp__.ghi', '__nlp__.jkl', '__nlp__._total_']
0.1s = Fit runtime
5 features in original data used to generate 17 features in processed data.
Train Data (Processed) Memory Usage: 0.01 MB (0.0% of available memory)
|  | A | B | D | E | C | C.year | C.month | C.day | C.dayofweek | E.char_count | E.word_count | E.symbol_ratio. | __nlp__.abc | __nlp__.ef | __nlp__.ghi | __nlp__.jkl | __nlp__._total_ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | NaN | NaN | NaN | NaN | 946687418181818240 | 2000 | 1 | 1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | -0.468674 | 1 | 2 | NaN | 946771200000000000 | 2000 | 1 | 2 | 6 | 4 | 1 | 6 | 0 | 3 | 0 | 0 | 1 |
| 2 | 1.767960 | 1 | 0 | 1 | 946598400000000000 | 1999 | 12 | 31 | 4 | 9 | 1 | 1 | 3 | 0 | 0 | 1 | 2 |
| 3 | -0.118771 | 2 | 3 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 7 | 1 | 3 | 0 | 2 | 1 | 1 | 3 |
| 4 | 0.630196 | 1 | 1 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 7 | 1 | 3 | 1 | 0 | 0 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | -1.182318 | 0 | 0 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 7 | 1 | 3 | 2 | 2 | 0 | 0 | 2 |
| 96 | 0.562761 | 1 | 0 | NaN | 946684800000000000 | 2000 | 1 | 1 | 5 | 1 | 1 | 9 | 0 | 0 | 0 | 0 | 0 |
| 97 | -0.797270 | 1 | 1 | 5 | 946684800000000000 | 2000 | 1 | 1 | 5 | 8 | 1 | 2 | 1 | 1 | 1 | 1 | 4 |
| 98 | 0.502741 | 1 | 3 | NaN | 946598400000000000 | 1999 | 12 | 31 | 4 | 8 | 1 | 2 | 2 | 1 | 1 | 0 | 3 |
| 99 | 2.056356 | 1 | 1 | NaN | 946512000000000000 | 1999 | 12 | 30 | 3 | 7 | 1 | 3 | 0 | 2 | 1 | 1 | 3 |
100 rows × 17 columns
We see that the floating-point, integer, categorical and text fields 'A', 'B', 'D' and 'E' all retained their NaNs, but the datetime column 'C' has been set to the mean of the non-NaN values. A quick way to verify this is sketched below.
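We can sanity-check the datetime fill: the value inserted for row 0 should equal the mean of the valid timestamps. A quick hedged check (exact integer rounding may vary by pandas version):
transformed = auto_ml_pipeline_feature_generator.transform(X=dfx)
valid_mean = int(dfx['C'].dropna().astype('int64').mean())  # mean of the non-NaT timestamps, in ns
print(transformed['C'].iloc[0], valid_mean)  # both approximately 946687418181818240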
Customization of Feature Engineering
To customize your feature generation pipeline, it is recommended to call PipelineFeatureGenerator, passing in non-default parameters to other feature generators as required. For example, if we think downstream models would benefit from removing rare categorical values and replacing them with NaN, we can supply the argument maximum_num_cat to CategoryFeatureGenerator, as follows:
from autogluon.features.generators import PipelineFeatureGenerator, CategoryFeatureGenerator, IdentityFeatureGenerator
from autogluon.common.features.types import R_INT, R_FLOAT
mypipeline = PipelineFeatureGenerator(
generators = [[
CategoryFeatureGenerator(maximum_num_cat=10), # Overridden from default.
IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
]]
)
If we then dump the transformed data, we can see that all columns have been converted to numerical (since that is what most models require), and that rare categorical values have been replaced with NaN:
mypipeline.fit_transform(X=dfx)
Fitting PipelineFeatureGenerator...
Available Memory: 29463.87 MB
Train Data (Original) Memory Usage: 0.01 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Unused Original Features (Count: 1): ['C']
These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
Features can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.
These features do not need to be present at inference time.
('datetime', []) : 1 | ['C']
Types of features in original data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('float', []) : 1 | ['A']
('object', ['text']) : 1 | ['E']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 2 | ['B', 'D']
('category', ['text_as_category']) : 1 | ['E']
('float', []) : 1 | ['A']
0.0s = Fit runtime
4 features in original data used to generate 4 features in processed data.
Train Data (Processed) Memory Usage: 0.00 MB (0.0% of available memory)
|  | B | D | E | A |
| --- | --- | --- | --- | --- |
| 0 | NaN | NaN | NaN | NaN |
| 1 | 1 | 2 | NaN | -0.468674 |
| 2 | 1 | 0 | 1 | 1.767960 |
| 3 | 2 | 3 | NaN | -0.118771 |
| 4 | 1 | 1 | NaN | 0.630196 |
| ... | ... | ... | ... | ... |
| 95 | 0 | 0 | NaN | -1.182318 |
| 96 | 1 | 0 | NaN | 0.562761 |
| 97 | 1 | 1 | 5 | -0.797270 |
| 98 | 1 | 3 | NaN | 0.502741 |
| 99 | 1 | 1 | NaN | 2.056356 |
100 rows × 4 columns
For more on custom feature engineering, see the detailed notebook examples/tabular/example_custom_feature_generator.py.