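Adding a custom model to AutoGluon (Advanced)
Tip: If you are new to AutoGluon, review Predicting Columns in a Table - Quick Start to learn the basics of the AutoGluon API.
In this tutorial we cover advanced custom model options that go beyond the topics covered in Adding a custom model to AutoGluon.
This tutorial assumes that you have fully read Adding a custom model to AutoGluon.
Loading the data
First we load the data. For this tutorial we use the adult income dataset because it contains a mix of integer, float, and categorical features.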
from autogluon.tabular import TabularDataset
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv') # can be local CSV file as well, returns Pandas DataFrame
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv') # another Pandas DataFrame
label = 'class' # specifies which column we want to predict
train_data = train_data.sample(n=1000, random_state=0) # subsample for faster demo
train_data.head(5)
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6118 | 51 | Private | 39264 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | >50K |
| 23204 | 58 | Private | 51662 | 10th | 6 | Married-civ-spouse | Other-service | Wife | White | Female | 0 | 0 | 8 | United-States | <=50K |
| 29590 | 40 | Private | 326310 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 44 | United-States | <=50K |
| 18116 | 37 | Private | 222450 | HS-grad | 9 | Never-married | Sales | Not-in-family | White | Male | 0 | 2339 | 40 | El-Salvador | <=50K |
| 33964 | 62 | Private | 109190 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 15024 | 0 | 40 | United-States | >50K |
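Force features to be passed to models without preprocessing / dropping
You might want to do this if your model logic requires a particular column to always be present, regardless of its content. For example, suppose you are fine-tuning a pre-trained language model that expects a feature indicating the language of the text in a given row, which dictates how the text is preprocessed, but the training data contains only one language. Without this adjustment, the language identifier feature would be dropped before the model is fit.
Force features to not be dropped in model-specific preprocessing
To avoid dropping features in custom models because they have only one unique value, add the following _get_default_auxiliary_params method to your custom model class: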
from autogluon.core.models import AbstractModel

class DummyModel(AbstractModel):
    def _fit(self, X, **kwargs):
        print(f'Before {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        X = self.preprocess(X)
        print(f'After  {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        print(X.head(5))

class DummyModelKeepUnique(DummyModel):
    def _get_default_auxiliary_params(self) -> dict:
        default_auxiliary_params = super()._get_default_auxiliary_params()
        extra_auxiliary_params = dict(
            drop_unique=False,  # Whether to drop features that have only 1 unique value, default is True
        )
        default_auxiliary_params.update(extra_auxiliary_params)
        return default_auxiliary_params
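To make the effect of the drop_unique flag concrete, here is a minimal plain-Python sketch. It is not AutoGluon's implementation: the helper names find_constant_features and drop_constant_features, and the toy columns, are hypothetical and exist only for this illustration. It shows how a column with a single unique value (such as a language identifier in single-language data) disappears by default and survives when the flag is turned off.

```python
# Plain-Python illustration only; this is NOT how AutoGluon implements drop_unique.
def find_constant_features(columns: dict) -> list:
    """Return the names of columns that contain at most one distinct value."""
    return [name for name, values in columns.items() if len(set(values)) <= 1]


def drop_constant_features(columns: dict, drop_unique: bool = True) -> dict:
    """Mimic the effect of a drop_unique-style flag on a dict of columns."""
    if not drop_unique:
        return dict(columns)
    constant = set(find_constant_features(columns))
    return {name: values for name, values in columns.items() if name not in constant}


# Hypothetical toy data: 'language' has a single unique value.
toy_columns = {
    "age": [51, 58, 40],
    "language": ["en", "en", "en"],
}

kept_default = drop_constant_features(toy_columns)
kept_override = drop_constant_features(toy_columns, drop_unique=False)
print(list(kept_default))   # ['age']
print(list(kept_override))  # ['age', 'language']
```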
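Force features to not be dropped in global preprocessing
While the above fix for model-specific preprocessing works if the feature is still present after global preprocessing, it won't help if the feature was already dropped before reaching the model. For this, we need to create a new feature generator class that separates the preprocessing logic between normal features and user override features.
Here is an example implementation: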
# WARNING: To use this in practice, you must put this code in a separate Python file
#  from the main process and import it, or else it will not be serializable.
from autogluon.features import BulkFeatureGenerator, AutoMLPipelineFeatureGenerator, IdentityFeatureGenerator

class CustomFeatureGeneratorWithUserOverride(BulkFeatureGenerator):
    def __init__(self, automl_generator_kwargs: dict = None, **kwargs):
        generators = self._get_default_generators(automl_generator_kwargs=automl_generator_kwargs)
        super().__init__(generators=generators, **kwargs)

    def _get_default_generators(self, automl_generator_kwargs: dict = None):
        if automl_generator_kwargs is None:
            automl_generator_kwargs = dict()
        generators = [
            [
                # Preprocessing logic that handles normal features
                AutoMLPipelineFeatureGenerator(banned_feature_special_types=['user_override'], **automl_generator_kwargs),
                # Preprocessing logic that handles special features the user wishes to treat separately; here we simply skip preprocessing for these features.
                IdentityFeatureGenerator(infer_features_in_args=dict(required_special_types=['user_override'])),
            ],
        ]
        return generators
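The above code splits the preprocessing logic of features depending on whether they are tagged with the 'user_override' special type in feature metadata. To tag three features ['age', 'native-country', 'dummy_feature'] in this way, you can do the following: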
# add a useless dummy feature to show that it is not dropped in preprocessing
train_data['dummy_feature'] = 'dummy value'
test_data['dummy_feature'] = 'dummy value'
from autogluon.tabular import FeatureMetadata
feature_metadata = FeatureMetadata.from_df(train_data)
print('Before inserting overrides:')
print(feature_metadata)
feature_metadata = feature_metadata.add_special_types(
    {
        'age': ['user_override'],
        'native-country': ['user_override'],
        'dummy_feature': ['user_override'],
    }
)
print('After inserting overrides:')
print(feature_metadata)
Before inserting overrides:
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 10 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
After inserting overrides:
('int', []) : 5 | ['fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
('int', ['user_override']) : 1 | ['age']
('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('object', ['user_override']) : 2 | ['native-country', 'dummy_feature']
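Note that this is only one example implementation of a custom feature generator with bifurcated preprocessing logic. Users can make their tagging and feature generator logic arbitrarily complex to fit their needs.
In this example, we perform standard preprocessing on the non-tagged features, and we pass the tagged features through IdentityFeatureGenerator, a no-op that does not alter the features in any way.
Instead of IdentityFeatureGenerator, you could use any kind of feature generator to suit your needs.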
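The routing idea behind the bifurcated generator can be sketched in plain Python. This is only an illustration of the partitioning concept, not AutoGluon code: the function split_by_special_type and the toy mapping are hypothetical. One branch receives features that do not carry the tag, the other receives only features that do.

```python
# Plain-Python illustration of routing features by a special-type tag.
# split_by_special_type is a hypothetical helper, not part of AutoGluon.
def split_by_special_type(special_types: dict, tag: str = "user_override"):
    """Partition feature names by whether `tag` is among their special types."""
    untagged = [name for name, types in special_types.items() if tag not in types]
    tagged = [name for name, types in special_types.items() if tag in types]
    return untagged, tagged


# Toy mapping of feature name -> special types, mirroring the tags used in this tutorial.
toy_special_types = {
    "age": ["user_override"],
    "fnlwgt": [],
    "native-country": ["user_override"],
    "education": [],
    "dummy_feature": ["user_override"],
}

normal_features, override_features = split_by_special_type(toy_special_types)
print(normal_features)    # ['fnlwgt', 'education']
print(override_features)  # ['age', 'native-country', 'dummy_feature']
```

In a design like this, every feature lands in exactly one branch, so no column is processed twice or silently lost.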
Putting it all together
# Separate features and labels
X = train_data.drop(columns=[label])
y = train_data[label]
X_test = test_data.drop(columns=[label])
y_test = test_data[label]
# preprocess the label column, as done in the prior custom model tutorial
from autogluon.core.data import LabelCleaner
from autogluon.core.utils import infer_problem_type
# Construct a LabelCleaner to neatly convert labels to float/integers during model training/inference, can also use to inverse_transform back to original.
problem_type = infer_problem_type(y=y) # Infer problem type (or else specify directly)
label_cleaner = LabelCleaner.construct(problem_type=problem_type, y=y)
y_preprocessed = label_cleaner.transform(y)
y_test_preprocessed = label_cleaner.transform(y_test)
# Make sure to specify your custom feature metadata to the feature generator
my_custom_feature_generator = CustomFeatureGeneratorWithUserOverride(feature_metadata_in=feature_metadata)
X_preprocessed = my_custom_feature_generator.fit_transform(X)
X_test_preprocessed = my_custom_feature_generator.transform(X_test)
Notice that the user_override features were not preprocessed:
print(list(X_preprocessed.columns))
X_preprocessed.head(5)
['fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'native-country', 'dummy_feature']
| | fnlwgt | education-num | sex | capital-gain | capital-loss | hours-per-week | workclass | education | marital-status | occupation | relationship | race | age | native-country | dummy_feature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6118 | 39264 | 10 | 0 | 0 | 0 | 40 | 3 | 14 | 1 | 4 | 5 | 4 | 51 | United-States | dummy value |
| 23204 | 51662 | 6 | 0 | 0 | 0 | 8 | 3 | 0 | 1 | 8 | 5 | 4 | 58 | United-States | dummy value |
| 29590 | 326310 | 10 | 1 | 0 | 0 | 44 | 3 | 14 | 1 | 3 | 0 | 4 | 40 | United-States | dummy value |
| 18116 | 222450 | 9 | 1 | 0 | 2339 | 40 | 3 | 11 | 3 | 12 | 1 | 4 | 37 | El-Salvador | dummy value |
| 33964 | 109190 | 13 | 1 | 15024 | 0 | 40 | 3 | 9 | 1 | 4 | 0 | 4 | 62 | United-States | dummy value |
Now let's see what happens when we send this data to fit a dummy model:
dummy_model = DummyModel()
dummy_model.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)
Before DummyModel Preprocessing (15 features):
['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After DummyModel Preprocessing (14 features):
['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
       age  workclass  fnlwgt      education  education-num  \
6118    51    Private   39264   Some-college             10
23204   58    Private   51662           10th              6
29590   40    Private  326310   Some-college             10
18116   37    Private  222450        HS-grad              9
33964   62    Private  109190      Bachelors             13

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female
23204   Married-civ-spouse     Other-service            Wife   White   Female
29590   Married-civ-spouse      Craft-repair         Husband   White     Male
18116        Never-married             Sales   Not-in-family   White     Male
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male

       capital-gain  capital-loss  hours-per-week  native-country
6118              0             0              40   United-States
23204             0             0               8   United-States
29590             0             0              44   United-States
18116             0          2339              40     El-Salvador
33964         15024             0              40   United-States
<__main__.DummyModel at 0x7fbb4adfb710>
Notice how the model dropped dummy_feature during the preprocess call. Now let's see what happens if we use DummyModelKeepUnique:
dummy_model_keep_unique = DummyModelKeepUnique()
dummy_model_keep_unique.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)
Before DummyModelKeepUnique Preprocessing (15 features):
['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After DummyModelKeepUnique Preprocessing (15 features):
['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
       age  workclass  fnlwgt      education  education-num  \
6118    51    Private   39264   Some-college             10
23204   58    Private   51662           10th              6
29590   40    Private  326310   Some-college             10
18116   37    Private  222450        HS-grad              9
33964   62    Private  109190      Bachelors             13

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female
23204   Married-civ-spouse     Other-service            Wife   White   Female
29590   Married-civ-spouse      Craft-repair         Husband   White     Male
18116        Never-married             Sales   Not-in-family   White     Male
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male

       capital-gain  capital-loss  hours-per-week  native-country  \
6118              0             0              40   United-States
23204             0             0               8   United-States
29590             0             0              44   United-States
18116             0          2339              40     El-Salvador
33964         15024             0              40   United-States

       dummy_feature
6118     dummy value
23204    dummy value
29590    dummy value
18116    dummy value
33964    dummy value
<__main__.DummyModelKeepUnique at 0x7fbb49d42f50>
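Now dummy_feature is no longer dropped!
The above code logic can be re-used for testing your own complex model implementations: simply replace DummyModelKeepUnique with your custom model and check that it keeps the features you want to use.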
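When testing your own model, the before/after comparison can be automated instead of read by eye. The sketch below is a small plain-Python helper (the name dropped_features is hypothetical) that reports which columns disappeared between two column lists. The lists used here are copied from the DummyModel output shown earlier.

```python
# Hypothetical helper for comparing column lists before and after preprocessing.
def dropped_features(before: list, after: list) -> list:
    """Return features present in `before` but missing from `after`, in original order."""
    after_set = set(after)
    return [name for name in before if name not in after_set]


# Column lists as printed by DummyModel in this tutorial.
before = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
          'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
          'hours-per-week', 'native-country', 'dummy_feature']
after = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
         'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
         'hours-per-week', 'native-country']

dropped = dropped_features(before, after)
print(dropped)  # ['dummy_feature']
```

With real data, the same helper could be called with list(X.columns) before and after self.preprocess(X) inside your model's _fit.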
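Keeping features via TabularPredictor
Now let's demonstrate how to do this via TabularPredictor in far fewer lines of code. Note that this code will raise an exception if run in this tutorial, because the custom model and feature generator must exist in other files to be serializable. Therefore, we will not run the code in the tutorial. (It will also raise an exception because DummyModel is not a real model.)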
from autogluon.tabular import TabularPredictor

feature_generator = CustomFeatureGeneratorWithUserOverride()
predictor = TabularPredictor(label=label)
predictor.fit(
    train_data=train_data,
    feature_metadata=feature_metadata,  # feature metadata with your overrides
    feature_generator=feature_generator,  # your custom feature generator that handles the overrides
    hyperparameters={
        'GBM': {},  # Can fit your custom model alongside default models
        DummyModel: {},  # Will drop dummy_feature
        DummyModelKeepUnique: {},  # Will not drop dummy_feature
        # DummyModel: {'ag_args_fit': {'drop_unique': False}},  # This is another way to get same result as using DummyModelKeepUnique
    }
)
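The requirement that custom classes live in a separate, importable file follows from how Python's pickle module works: it stores a reference to the module and class name, not the class code. The sketch below illustrates this with the standard library only. It assumes pickle-style serialization, and the module name my_custom_models and class MyCustomThing are hypothetical stand-ins for your own file and classes.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical module name and class, used only for this illustration.
MODULE_NAME = "my_custom_models"
MODULE_SOURCE = '''
class MyCustomThing:
    def __init__(self, drop_unique=False):
        self.drop_unique = drop_unique
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the class into its own file, as the warning in this tutorial recommends.
    Path(tmp_dir, f"{MODULE_NAME}.py").write_text(MODULE_SOURCE)
    sys.path.insert(0, tmp_dir)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        payload = pickle.dumps(module.MyCustomThing(drop_unique=False))
        restored = pickle.loads(payload)
    finally:
        sys.path.remove(tmp_dir)

# The pickled bytes reference the module by name rather than embedding the class code.
stores_module_path = MODULE_NAME.encode() in payload
print(stores_module_path)
print(type(restored).__module__, restored.drop_unique)
```

Because only the import path is stored, any process that later loads the object must be able to import the same module, which is why a standalone file is a safer home for custom models and feature generators than a notebook cell.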