Adding a custom model to AutoGluon (Advanced)


Tip: If you are new to AutoGluon, review Predicting Columns in a Table - Quick Start to learn the basics of the AutoGluon API.

In this tutorial we will cover advanced custom model options that go beyond the topics covered in Adding a custom model to AutoGluon.

This tutorial assumes that you have already fully read Adding a custom model to AutoGluon.

Loading the data

First we will load the data. For this tutorial we will use the Adult Income dataset because it has a mix of integer, float, and categorical features.

from autogluon.tabular import TabularDataset

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')  # can be local CSV file as well, returns Pandas DataFrame
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame
label = 'class'  # specifies which column we want to predict
train_data = train_data.sample(n=1000, random_state=0)  # subsample for faster demo

train_data.head(5)
       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10
23204   58   Private   51662           10th              6
29590   40   Private  326310   Some-college             10
18116   37   Private  222450        HS-grad              9
33964   62   Private  109190      Bachelors             13

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female
23204   Married-civ-spouse     Other-service            Wife   White   Female
29590   Married-civ-spouse      Craft-repair         Husband   White     Male
18116        Never-married             Sales   Not-in-family   White     Male
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male

       capital-gain  capital-loss  hours-per-week  native-country   class
6118              0             0              40   United-States    >50K
23204             0             0               8   United-States   <=50K
29590             0             0              44   United-States   <=50K
18116             0          2339              40     El-Salvador   <=50K
33964         15024             0              40   United-States    >50K

Force features to be passed to models without preprocessing / dropping

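You might want to do this if your model logic requires a particular column to always be present, regardless of its content. For example, suppose you are fine-tuning a pre-trained language model that expects a feature indicating the language of the text in a given row, which dictates how the text is preprocessed, but the training data contains only one language. Without this adjustment, the language identifier feature would be dropped before the model is fit.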

Force features to not be dropped in model-specific preprocessing

To prevent a custom model from dropping features that have only one unique value, add the following _get_default_auxiliary_params method to your custom model class:

from autogluon.core.models import AbstractModel

class DummyModel(AbstractModel):
    def _fit(self, X, **kwargs):
        print(f'Before {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        X = self.preprocess(X)
        print(f'After  {self.__class__.__name__} Preprocessing ({len(X.columns)} features):\n\t{list(X.columns)}')
        print(X.head(5))

class DummyModelKeepUnique(DummyModel):
    def _get_default_auxiliary_params(self) -> dict:
        default_auxiliary_params = super()._get_default_auxiliary_params()
        extra_auxiliary_params = dict(
            drop_unique=False,  # Whether to drop features that have only 1 unique value, default is True
        )
        default_auxiliary_params.update(extra_auxiliary_params)
        return default_auxiliary_params
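
To make the effect of drop_unique concrete, here is a minimal plain-Python sketch of the idea. It is not AutoGluon's actual implementation: the helper drop_single_valued_columns and the dict-of-lists table layout are hypothetical and chosen only for illustration.

```python
def drop_single_valued_columns(table, drop_unique=True):
    """Return a copy of `table` without columns that hold a single unique value.

    `table` maps column name -> list of values (a stand-in for a DataFrame).
    This helper is hypothetical and only mirrors the idea behind `drop_unique`.
    """
    if not drop_unique:
        # Keep every column, including constant ones.
        return dict(table)
    return {name: values for name, values in table.items() if len(set(values)) > 1}


table = {
    "age": [51, 58, 40],
    "language": ["en", "en", "en"],  # constant column, like a one-language dataset
}

kept_default = drop_single_valued_columns(table)
kept_all = drop_single_valued_columns(table, drop_unique=False)

print(list(kept_default))  # ['age']
print(list(kept_all))      # ['age', 'language']
```

With the default setting the constant column disappears, which is the behavior that DummyModelKeepUnique switches off.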

Force features to not be dropped in global preprocessing

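The model-specific fix above works only if the feature still exists after global preprocessing; it does not help if the feature was already dropped before reaching the model. To handle that case, we need to create a new feature generator class that separates the preprocessing logic for normal features from the logic for user-override features.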

Here is an example implementation:

# WARNING: To use this in practice, you must put this code in a separate Python file
#  from the main process and import it, or else it will not be serializable.
from autogluon.features import BulkFeatureGenerator, AutoMLPipelineFeatureGenerator, IdentityFeatureGenerator


class CustomFeatureGeneratorWithUserOverride(BulkFeatureGenerator):
    def __init__(self, automl_generator_kwargs: dict = None, **kwargs):
        generators = self._get_default_generators(automl_generator_kwargs=automl_generator_kwargs)
        super().__init__(generators=generators, **kwargs)

    def _get_default_generators(self, automl_generator_kwargs: dict = None):
        if automl_generator_kwargs is None:
            automl_generator_kwargs = dict()

        generators = [
            [
                # Preprocessing logic that handles normal features
                AutoMLPipelineFeatureGenerator(banned_feature_special_types=['user_override'], **automl_generator_kwargs),

                # Preprocessing logic that handles special features user wishes to treat separately, here we simply skip preprocessing for these features.
                IdentityFeatureGenerator(infer_features_in_args=dict(required_special_types=['user_override'])),
            ],
        ]
        return generators

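The code above splits the preprocessing logic of a feature based on whether it is tagged with the 'user_override' special type in the feature metadata. To tag the three features ['age', 'native-country', 'dummy_feature'] this way, you can do the following: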

# add a useless dummy feature to show that it is not dropped in preprocessing
train_data['dummy_feature'] = 'dummy value'
test_data['dummy_feature'] = 'dummy value'

from autogluon.tabular import FeatureMetadata
feature_metadata = FeatureMetadata.from_df(train_data)

print('Before inserting overrides:')
print(feature_metadata)

feature_metadata = feature_metadata.add_special_types(
    {
        'age': ['user_override'],
        'native-country': ['user_override'],
        'dummy_feature': ['user_override'],
    }
)

print('After inserting overrides:')
print(feature_metadata)
Before inserting overrides:
('int', [])    :  6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 10 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
After inserting overrides:
('int', [])                   : 5 | ['fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
('int', ['user_override'])    : 1 | ['age']
('object', [])                : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('object', ['user_override']) : 2 | ['native-country', 'dummy_feature']

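Note that this is only one example implementation of a custom feature generator with bifurcated preprocessing logic. You can make the tagging and feature generator logic as complex as your needs require. In this example, we perform the standard preprocessing on the untagged features and pass the tagged features through IdentityFeatureGenerator, a no-op that leaves the features unchanged. In place of IdentityFeatureGenerator, you could use any kind of feature generator that suits your needs.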
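
The bifurcated logic can also be illustrated without AutoGluon. The following is a rough plain-Python sketch under stated assumptions: split_by_special_type and preprocess_row are hypothetical helpers, the special types are held in an ordinary dict, and a lowercase/strip transform stands in for the standard preprocessing pipeline.

```python
def split_by_special_type(special_types, tag="user_override"):
    """Partition feature names into (normal, tagged) based on a special-type tag."""
    normal = [name for name, tags in special_types.items() if tag not in tags]
    tagged = [name for name, tags in special_types.items() if tag in tags]
    return normal, tagged


def preprocess_row(row, special_types, normal_transform):
    """Apply `normal_transform` to untagged features and pass tagged ones through."""
    normal, tagged = split_by_special_type(special_types)
    out = {name: normal_transform(row[name]) for name in normal}
    out.update({name: row[name] for name in tagged})  # identity pass-through
    return out


# Hypothetical metadata: feature name -> list of special types
special_types = {
    "workclass": [],
    "age": ["user_override"],
    "dummy_feature": ["user_override"],
}
row = {"workclass": " Private", "age": 51, "dummy_feature": "dummy value"}

result = preprocess_row(
    row,
    special_types,
    normal_transform=lambda value: str(value).strip().lower(),  # stand-in transform
)
print(result)  # {'workclass': 'private', 'age': 51, 'dummy_feature': 'dummy value'}
```

As in the feature generator above, the untagged feature is transformed while the tagged features come out unchanged and are placed after the normal ones.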

Putting it all together

# Separate features and labels
X = train_data.drop(columns=[label])
y = train_data[label]
X_test = test_data.drop(columns=[label])
y_test = test_data[label]

# preprocess the label column, as done in the prior custom model tutorial
from autogluon.core.data import LabelCleaner
from autogluon.core.utils import infer_problem_type
# Construct a LabelCleaner to neatly convert labels to float/integers during model training/inference, can also use to inverse_transform back to original.
problem_type = infer_problem_type(y=y)  # Infer problem type (or else specify directly)
label_cleaner = LabelCleaner.construct(problem_type=problem_type, y=y)
y_preprocessed = label_cleaner.transform(y)
y_test_preprocessed = label_cleaner.transform(y_test)

# Make sure to specify your custom feature metadata to the feature generator
my_custom_feature_generator = CustomFeatureGeneratorWithUserOverride(feature_metadata_in=feature_metadata)

X_preprocessed = my_custom_feature_generator.fit_transform(X)
X_test_preprocessed = my_custom_feature_generator.transform(X_test)

Note that the user-override features were not preprocessed:

print(list(X_preprocessed.columns))
X_preprocessed.head(5)
['fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'age', 'native-country', 'dummy_feature']
       fnlwgt  education-num  sex  capital-gain  capital-loss  hours-per-week  \
6118    39264             10    0             0             0              40
23204   51662              6    0             0             0               8
29590  326310             10    1             0             0              44
18116  222450              9    1             0          2339              40
33964  109190             13    1         15024             0              40

       workclass  education  marital-status  occupation  relationship  race  \
6118           3         14               1           4             5     4
23204          3          0               1           8             5     4
29590          3         14               1           3             0     4
18116          3         11               3          12             1     4
33964          3          9               1           4             0     4

       age  native-country  dummy_feature
6118    51   United-States    dummy value
23204   58   United-States    dummy value
29590   40   United-States    dummy value
18116   37     El-Salvador    dummy value
33964   62   United-States    dummy value

Now let's see what happens when we send this data to fit a dummy model:

dummy_model = DummyModel()
dummy_model.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)
Before DummyModel Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After  DummyModel Preprocessing (14 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female   
23204   Married-civ-spouse     Other-service            Wife   White   Female   
29590   Married-civ-spouse      Craft-repair         Husband   White     Male   
18116        Never-married             Sales   Not-in-family   White     Male   
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male   

       capital-gain  capital-loss  hours-per-week  native-country  
6118              0             0              40   United-States  
23204             0             0               8   United-States  
29590             0             0              44   United-States  
18116             0          2339              40     El-Salvador  
33964         15024             0              40   United-States
<__main__.DummyModel at 0x7fbb4adfb710>

Notice how the model dropped dummy_feature during the preprocess call. Now let's see what happens if we use DummyModelKeepUnique:

dummy_model_keep_unique = DummyModelKeepUnique()
dummy_model_keep_unique.fit(X=X, y=y, feature_metadata=my_custom_feature_generator.feature_metadata)
Before DummyModelKeepUnique Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
After  DummyModelKeepUnique Preprocessing (15 features):
	['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dummy_feature']
       age workclass  fnlwgt      education  education-num  \
6118    51   Private   39264   Some-college             10   
23204   58   Private   51662           10th              6   
29590   40   Private  326310   Some-college             10   
18116   37   Private  222450        HS-grad              9   
33964   62   Private  109190      Bachelors             13   

            marital-status        occupation    relationship    race      sex  \
6118    Married-civ-spouse   Exec-managerial            Wife   White   Female   
23204   Married-civ-spouse     Other-service            Wife   White   Female   
29590   Married-civ-spouse      Craft-repair         Husband   White     Male   
18116        Never-married             Sales   Not-in-family   White     Male   
33964   Married-civ-spouse   Exec-managerial         Husband   White     Male   

       capital-gain  capital-loss  hours-per-week  native-country  \
6118              0             0              40   United-States   
23204             0             0               8   United-States   
29590             0             0              44   United-States   
18116             0          2339              40     El-Salvador   
33964         15024             0              40   United-States   

      dummy_feature  
6118    dummy value  
23204   dummy value  
29590   dummy value  
18116   dummy value  
33964   dummy value
<__main__.DummyModelKeepUnique at 0x7fbb49d42f50>

Now dummy_feature is no longer dropped!

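The above code logic can be reused to test your own complex model implementations: replace DummyModelKeepUnique with your custom model and check that it keeps the features you want to use.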
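
The retention check can be automated with a small helper. This is only a sketch: missing_features is a hypothetical function, and the column lists below are shortened stand-ins for what list(X.columns) would return after your model's preprocess call.

```python
def missing_features(required, columns_after_preprocessing):
    """Return the required feature names that are absent after preprocessing."""
    present = set(columns_after_preprocessing)
    return [name for name in required if name not in present]


required = ["age", "native-country", "dummy_feature"]

# Shortened, illustrative column lists (not real model output)
after_dummy_model = ["age", "workclass", "native-country"]
after_keep_unique = ["age", "workclass", "native-country", "dummy_feature"]

print(missing_features(required, after_dummy_model))   # ['dummy_feature']
print(missing_features(required, after_keep_unique))   # []
```

An empty list means every feature you rely on survived preprocessing.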

Keeping features via TabularPredictor

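Now let's demonstrate how to do this via TabularPredictor in fewer lines of code. Note that this code will raise an exception if run inside this tutorial, because the custom model and feature generator must exist in separate files to be serializable. Therefore, we will not run this code in the tutorial. (It would also raise an exception because DummyModel is not a real model.)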

from autogluon.tabular import TabularPredictor

feature_generator = CustomFeatureGeneratorWithUserOverride()
predictor = TabularPredictor(label=label)
predictor.fit(
    train_data=train_data,
    feature_metadata=feature_metadata,  # feature metadata with your overrides
    feature_generator=feature_generator,  # your custom feature generator that handles the overrides
    hyperparameters={
        'GBM': {},  # Can fit your custom model alongside default models
        DummyModel: {},  # Will drop dummy_feature
        DummyModelKeepUnique: {},  # Will not drop dummy_feature
        # DummyModel: {'ag_args_fit': {'drop_unique': False}},  # This is another way to get same result as using DummyModelKeepUnique
    }
)
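
Why the separate-file requirement exists can be sketched with Python's standard pickle module, assuming the objects are serialized with pickle or a similar by-reference mechanism. Pickle records only the import path of a class, not its source code, so whichever process loads the object must be able to import that path. The example below uses fractions.Fraction purely as a stand-in for a custom class that lives in an importable module.

```python
import pickle
from fractions import Fraction

# Fraction stands in for a custom class defined in its own importable module.
payload = pickle.dumps(Fraction(1, 2))

# The payload stores the module and class names, not the class definition.
print(b"fractions" in payload)  # True
print(b"Fraction" in payload)   # True

# Loading works because the `fractions` module can be imported here.
restored = pickle.loads(payload)
print(restored == Fraction(1, 2))  # True
```

A class defined directly in a notebook or script belongs to the __main__ module, which another process or a later session generally cannot import in the same way. Placing the custom model and feature generator in their own Python file gives them a stable import path.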