Adding a custom model to AutoGluon¶
Tip: If you are new to AutoGluon, review Predicting Columns in a Table - Quick Start to learn the basics of the AutoGluon API.
This tutorial describes how to add a custom model to AutoGluon that can be trained, hyperparameter-tuned, and ensembled alongside the default models (default model documentation).
In this example, we create a custom Random Forest model for use in AutoGluon. All models in AutoGluon inherit from the AbstractModel class (AbstractModel source code) and must follow its API to work alongside other models.
Note that while this tutorial provides a basic model implementation, it does not cover many of the aspects used in most of the implemented models.
To best understand how to implement more advanced functionality, refer to the source code of the following models (a rough sketch of time-limit handling is also given after the model implementation below):
| Functionality | Reference Implementation |
|---|---|
| Respecting time limits / early stopping logic | |
| Respecting memory usage limits | LGBModel and RFModel |
| Sample weight support | LGBModel |
| Use of validation data and eval_metric | LGBModel |
| GPU training support | LGBModel |
| Save / load logic of non-serializable models | |
| Advanced problem type support (Softclass, Quantile) | RFModel |
| Text feature type support | |
| Image feature type support | |
| Lazy import of package dependencies | LGBModel |
| Custom HPO logic | LGBModel |
Implementing a custom model¶
Here we define the custom model we will be using for the rest of the tutorial.
The most important methods that must be implemented are _fit and _preprocess.
To compare with the official AutoGluon Random Forest implementation, see the RFModel source code.
Follow along with the code comments to better understand how the code works.
import numpy as np
import pandas as pd
from autogluon.core.models import AbstractModel
from autogluon.features.generators import LabelEncoderFeatureGenerator
class CustomRandomForestModel(AbstractModel):
def __init__(self, **kwargs):
# Simply pass along kwargs to parent, and init our internal `_feature_generator` variable to None
super().__init__(**kwargs)
self._feature_generator = None
# The `_preprocess` method takes the input data and transforms it to the internal representation usable by the model.
# `_preprocess` is called by `preprocess` and is used during model fit and model inference.
def _preprocess(self, X: pd.DataFrame, is_train=False, **kwargs) -> np.ndarray:
print(f'Entering the `_preprocess` method: {len(X)} rows of data (is_train={is_train})')
X = super()._preprocess(X, **kwargs)
if is_train:
# X will be the training data.
self._feature_generator = LabelEncoderFeatureGenerator(verbosity=0)
self._feature_generator.fit(X=X)
if self._feature_generator.features_in:
# This converts categorical features to numeric via stateful label encoding.
X = X.copy()
X[self._feature_generator.features_in] = self._feature_generator.transform(X=X)
# Add a fillna call to handle missing values.
# Some algorithms will be able to handle NaN values internally (LightGBM).
# In those cases, you can simply pass the NaN values into the inner model.
# Finally, convert to numpy for optimized memory usage and because sklearn RF works with raw numpy input.
return X.fillna(0).to_numpy(dtype=np.float32)
# The `_fit` method takes the input training data (and optionally the validation data) and trains the model.
def _fit(self,
X: pd.DataFrame, # training data
y: pd.Series, # training labels
# X_val=None, # val data (unused in RF model)
# y_val=None, # val labels (unused in RF model)
# time_limit=None, # time limit in seconds (ignored in tutorial)
**kwargs): # kwargs includes many other potential inputs, refer to AbstractModel documentation for details
print('Entering the `_fit` method')
# First we import the required dependencies for the model. Note that we do not import them outside of the method.
# This enables AutoGluon to be highly extensible and modular.
# For an example of best practices when importing model dependencies, refer to LGBModel.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# Valid self.problem_type values include ['binary', 'multiclass', 'regression', 'quantile', 'softclass']
if self.problem_type in ['regression', 'softclass']:
model_cls = RandomForestRegressor
else:
model_cls = RandomForestClassifier
# Make sure to call preprocess on X near the start of `_fit`.
# This is necessary because the data is converted via preprocess during predict, and needs to be in the same format as during fit.
X = self.preprocess(X, is_train=True)
# This fetches the user-specified (and default) hyperparameters for the model.
params = self._get_model_params()
print(f'Hyperparameters: {params}')
# self.model should be set to the trained inner model, so that internally during predict we can call `self.model.predict(...)`
self.model = model_cls(**params)
self.model.fit(X, y)
print('Exiting the `_fit` method')
# The `_set_default_params` method defines the default hyperparameters of the model.
# User-specified parameters will override these values on a key-by-key basis.
def _set_default_params(self):
default_params = {
'n_estimators': 300,
'n_jobs': -1,
'random_state': 0,
}
for param, val in default_params.items():
self._set_default_param_value(param, val)
# The `_get_default_auxiliary_params` method defines various model-agnostic parameters such as maximum memory usage and valid input column dtypes.
# For most users who build custom models, they will only need to specify the valid/invalid dtypes to the model here.
def _get_default_auxiliary_params(self) -> dict:
default_auxiliary_params = super()._get_default_auxiliary_params()
extra_auxiliary_params = dict(
# the total set of raw dtypes are: ['int', 'float', 'category', 'object', 'datetime']
# object feature dtypes include raw text and image paths, which should only be handled by specialized models
# datetime raw dtypes are generally converted to int in upstream pre-processing,
# so models generally shouldn't need to explicitly support datetime dtypes.
valid_raw_types=['int', 'float', 'category'],
# Other options include `valid_special_types`, `ignored_type_group_raw`, and `ignored_type_group_special`.
# Refer to AbstractModel for more details on available options.
)
default_auxiliary_params.update(extra_auxiliary_params)
return default_auxiliary_params
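The model above ignores the time_limit argument that AutoGluon passes into _fit. As a rough illustration of the "Respecting time limits" row in the table above, here is a minimal sketch (our own, using scikit-learn's warm_start; not AutoGluon's actual RFModel/LGBModel logic) of how _fit could grow the forest incrementally and stop before the budget runs out:
import time

class TimeLimitedCustomRFModel(CustomRandomForestModel):
    # Sketch only (classification case): grow the forest in chunks via warm_start
    # and stop once ~90% of the time budget is consumed, returning a valid
    # (smaller) forest instead of overrunning the limit.
    def _fit(self, X, y, time_limit=None, **kwargs):
        from sklearn.ensemble import RandomForestClassifier
        start = time.time()
        X = self.preprocess(X, is_train=True)
        params = self._get_model_params()
        n_estimators_final = params.pop('n_estimators', 300)
        self.model = RandomForestClassifier(n_estimators=0, warm_start=True, **params)
        while self.model.n_estimators < n_estimators_final:
            self.model.n_estimators = min(self.model.n_estimators + 50, n_estimators_final)
            self.model.fit(X, y)  # warm_start: only the newly added trees are fit
            if time_limit is not None and (time.time() - start) > 0.9 * time_limit:
                break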
Loading the data¶
Next we will load the data. For this tutorial we will use the adult income dataset because it has a mix of int, float, and categorical features.
from autogluon.tabular import TabularDataset
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv') # can be local CSV file as well, returns Pandas DataFrame
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv') # another Pandas DataFrame
label = 'class' # specifies which column we want to predict
train_data = train_data.sample(n=1000, random_state=0) # subsample for faster demo
train_data.head(5)
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6118 | 51 | Private | 39264 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | >50K |
| 23204 | 58 | Private | 51662 | 10th | 6 | Married-civ-spouse | Other-service | Wife | White | Female | 0 | 0 | 8 | United-States | <=50K |
| 29590 | 40 | Private | 326310 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 44 | United-States | <=50K |
| 18116 | 37 | Private | 222450 | HS-grad | 9 | Never-married | Sales | Not-in-family | White | Male | 0 | 2339 | 40 | El-Salvador | <=50K |
| 33964 | 62 | Private | 109190 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 15024 | 0 | 40 | United-States | >50K |
Training a custom model without TabularPredictor¶
Below we will demonstrate how to train the model outside of TabularPredictor. This is useful for debugging and minimizing the amount of code you need to understand while implementing the model.
This process is similar to what happens internally when calling fit on TabularPredictor, but is simplified and minimal.
If the data was already clean (all numeric), we could call fit directly with the data, but the adult dataset is not.
Cleaning labels¶
The first step to making the input data valid for the model is cleaning the labels.
Currently, they are strings, but we need to convert them to numeric values (0 and 1) for binary classification.
Luckily, AutoGluon already implements logic both to detect that this is binary classification (via infer_problem_type) and a converter to map the labels to 0 and 1 (LabelCleaner):
# Separate features and labels
X = train_data.drop(columns=[label])
y = train_data[label]
X_test = test_data.drop(columns=[label])
y_test = test_data[label]
from autogluon.core.data import LabelCleaner
from autogluon.core.utils import infer_problem_type
# Construct a LabelCleaner to neatly convert labels to float/integers during model training/inference, can also use to inverse_transform back to original.
problem_type = infer_problem_type(y=y) # Infer problem type (or else specify directly)
label_cleaner = LabelCleaner.construct(problem_type=problem_type, y=y)
y_clean = label_cleaner.transform(y)
print(f'Labels cleaned: {label_cleaner.inv_map}')
print(f'inferred problem type as: {problem_type}')
print('Cleaned label values:')
y_clean.head(5)
Labels cleaned: {' <=50K': 0, ' >50K': 1}
inferred problem type as: binary
Cleaned label values:
6118 1
23204 0
29590 0
18116 0
33964 1
Name: class, dtype: uint8
Cleaning features¶
Next, we need to clean the features. Currently, features like 'workclass' are object dtypes (strings), but we actually want to use them as categorical features. Most models won't accept string inputs, so we need to convert the strings to numbers.
AutoGluon contains an entire module dedicated to cleaning, transforming, and generating features called autogluon.features. Here we will use the same feature generator used internally by TabularPredictor to convert the object dtypes to category and minimize memory usage.
from autogluon.common.utils.log_utils import set_logger_verbosity
from autogluon.features.generators import AutoMLPipelineFeatureGenerator
set_logger_verbosity(2) # Set logger so more detailed logging is shown for tutorial
feature_generator = AutoMLPipelineFeatureGenerator()
X_clean = feature_generator.fit_transform(X)
X_clean.head(5)
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 29487.26 MB
Train Data (Original) Memory Usage: 0.57 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']
0.1s = Fit runtime
14 features in original data used to generate 14 features in processed data.
Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
| | age | fnlwgt | education-num | sex | capital-gain | capital-loss | hours-per-week | workclass | education | marital-status | occupation | relationship | race | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6118 | 51 | 39264 | 10 | 0 | 0 | 0 | 40 | 3 | 14 | 1 | 4 | 5 | 4 | 24 |
| 23204 | 58 | 51662 | 6 | 0 | 0 | 0 | 8 | 3 | 0 | 1 | 8 | 5 | 4 | 24 |
| 29590 | 40 | 326310 | 10 | 1 | 0 | 0 | 44 | 3 | 14 | 1 | 3 | 0 | 4 | 24 |
| 18116 | 37 | 222450 | 9 | 1 | 0 | 2339 | 40 | 3 | 11 | 3 | 12 | 1 | 4 | 6 |
| 33964 | 62 | 109190 | 13 | 1 | 15024 | 0 | 40 | 3 | 9 | 1 | 4 | 0 | 4 | 24 |
AutoMLPipelineFeatureGenerator does not fill missing values for numeric features, nor does it rescale the values of numeric features or one-hot encode categoricals. If a model needs these operations, you'll need to add them to your _preprocess method, and you may find some of the FeatureGenerator classes useful for this.
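For example, here is a minimal sketch (our own addition, assuming scikit-learn is available; it is not required by the tutorial's model) of fitting such transformers inside _preprocess when is_train=True and reusing them at inference time:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

class ScaledCustomRandomForestModel(CustomRandomForestModel):
    def _preprocess(self, X, is_train=False, **kwargs):
        X = super()._preprocess(X, is_train=is_train, **kwargs)  # numpy array from the parent
        if is_train:
            # Fit stateful transformers on the training data only.
            # (Imputation is redundant here since the parent already fills NaNs with 0,
            # but it shows the pattern for models that skip the fillna step.)
            self._imputer = SimpleImputer(strategy='mean').fit(X)
            self._scaler = StandardScaler().fit(self._imputer.transform(X))
        return self._scaler.transform(self._imputer.transform(X))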
Fitting the model¶
We are now ready to fit the model with the cleaned features and labels.
custom_model = CustomRandomForestModel()
# We could also specify hyperparameters to override defaults
# custom_model = CustomRandomForestModel(hyperparameters={'max_depth': 10})
custom_model.fit(X=X_clean, y=y_clean) # Fit custom model
# To save to disk and load the model, do the following:
# load_path = custom_model.path
# custom_model.save()
# del custom_model
# custom_model = CustomRandomForestModel.load(path=load_path)
Entering the `_fit` method
Entering the `_preprocess` method: 1000 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Warning: No name was specified for model, defaulting to class name: CustomRandomForestModel
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_095057CustomRandomForestModel"
Warning: No path was specified for model, defaulting to: /home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20241127_095057
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Model CustomRandomForestModel's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
<__main__.CustomRandomForestModel at 0x7f49307c54d0>
Predicting with a trained model¶
Now that the model is fit, we can make predictions on new data. Remember that we need to perform the same data and label transformations on the new data as we did on the training data.
# Prepare test data
X_test_clean = feature_generator.transform(X_test)
y_test_clean = label_cleaner.transform(y_test)
X_test.head(5)
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | Private | 169085 | 11th | 7 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 20 | United-States |
| 1 | 17 | Self-emp-not-inc | 226203 | 12th | 8 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 45 | United-States |
| 2 | 47 | Private | 54260 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1887 | 60 | United-States |
| 3 | 21 | Private | 176262 | Some-college | 10 | Never-married | Exec-managerial | Own-child | White | Female | 0 | 0 | 30 | United-States |
| 4 | 17 | Private | 241185 | 12th | 8 | Never-married | Prof-specialty | Own-child | White | Male | 0 | 0 | 20 | United-States |
Get raw predictions from the test data:
y_pred = custom_model.predict(X_test_clean)
print(y_pred[:5])
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
[0 0 1 0 0]
Note that these predictions are of the positive class (whichever class was inferred to be 1). To get more interpretable results, do the following:
y_pred_orig = label_cleaner.inverse_transform(y_pred)
y_pred_orig.head(5)
0 <=50K
1 <=50K
2 >50K
3 <=50K
4 <=50K
dtype: object
Scoring with a trained model¶
By default, models have an eval_metric specific to the problem_type. For binary classification, this is accuracy.
We can get the accuracy score of the model by doing the following:
score = custom_model.score(X_test_clean, y_test_clean)
print(f'Test score ({custom_model.eval_metric.name}) = {score}')
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Test score (accuracy) = 0.8424608455317842
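Any Scorer from autogluon.core.metrics can also be applied to the predictions directly. As a usage sketch (assuming the same session as above; for binary problems, AbstractModel.predict_proba returns the positive-class probabilities):
from autogluon.core.metrics import roc_auc
y_pred_proba_test = custom_model.predict_proba(X_test_clean)  # positive-class probabilities
print(f'Test score (roc_auc) = {roc_auc(y_test_clean, y_pred_proba_test)}')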
Training a bagged custom model without TabularPredictor¶
Some of the more advanced functionality in AutoGluon, such as bagging, can be easily accessed once a model inherits from AbstractModel.
You can even bag your custom model in a couple lines of code. This is a quick way to get quality improvements on nearly any model:
from autogluon.core.models import BaggedEnsembleModel
bagged_custom_model = BaggedEnsembleModel(CustomRandomForestModel())
# Parallel folding currently doesn't work with a class not defined in a separate module because of an underlying pickle serialization issue
# You don't need the following line if you put your custom model in a separate file and import it.
bagged_custom_model.params['fold_fitting_strategy'] = 'sequential_local'
bagged_custom_model.fit(X=X_clean, y=y_clean, k_fold=10) # Perform 10-fold bagging
bagged_score = bagged_custom_model.score(X_test_clean, y_test_clean)
print(f'Test score ({bagged_custom_model.eval_metric.name}) = {bagged_score} (bagged)')
print(f'Bagging increased model accuracy by {round(bagged_score - score, 4) * 100}%!')
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 900 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 100 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Test score (accuracy) = 0.8435868563824342 (bagged)
Bagging increased model accuracy by 0.11%!
Warning: No name was specified for model, defaulting to class name: CustomRandomForestModel
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_095058CustomRandomForestModel"
Warning: No path was specified for model, defaulting to: /home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20241127_095058
Warning: No name was specified for model, defaulting to class name: BaggedEnsembleModel
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_095058BaggedEnsembleModel"
Warning: No path was specified for model, defaulting to: /home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20241127_095058
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Model CustomRandomForestModel's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model 's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Fitting 10 child models (S1F1 - S1F10) | Fitting with SequentialLocalFoldFittingStrategy
Model S1F1's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model S1F2's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model S1F3's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model S1F4's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model S1F5's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model S1F6's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model S1F7's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model S1F8's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model S1F9's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Model S1F10's eval_metric inferred to be 'accuracy' because problem_type='binary' and eval_metric was not specified during init.
Note that the bagged model trained 10 CustomRandomForestModels on different splits of the training data. When making a prediction, the bagged model averages the predictions of these 10 models.
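The bagged ensemble predicts like any other model (a usage sketch, assuming the session above):
y_pred_bagged = bagged_custom_model.predict(X_test_clean)
y_pred_bagged_orig = label_cleaner.inverse_transform(pd.Series(y_pred_bagged))
y_pred_bagged_orig.head(5)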
Training a custom model with TabularPredictor¶
While not using TabularPredictor reduces the amount of code we need to worry about while developing and debugging our model, eventually we want to leverage TabularPredictor to get the most out of our model.
The code to train the model from raw data with TabularPredictor is very simple. There is no need to specify a LabelCleaner, FeatureGenerator, or validation set; it is all handled internally.
Here we train 3 CustomRandomForestModels with different hyperparameters.
from autogluon.tabular import TabularPredictor
# custom_hyperparameters = {CustomRandomForestModel: {}} # train 1 CustomRandomForestModel Model with default hyperparameters
custom_hyperparameters = {CustomRandomForestModel: [{}, {'max_depth': 10}, {'max_features': 0.9, 'max_depth': 20}]} # Train 3 CustomRandomForestModel with different hyperparameters
predictor = TabularPredictor(label=label).fit(train_data, hyperparameters=custom_hyperparameters)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 10}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_features': 0.9, 'max_depth': 20}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_095105"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version: 1.2b20241127
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count: 8
Memory Avail: 28.75 GB / 30.95 GB (92.9%)
Disk Space Avail: 213.64 GB / 255.99 GB (83.5%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
presets='best' : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
presets='high' : Strong accuracy with fast inference speed.
presets='good' : Good accuracy with very fast inference speed.
presets='medium' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20241127_095105"
Train Data Rows: 1000
Train Data Columns: 14
Label Column: class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [' >50K', ' <=50K']
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type: binary
Preprocessing data ...
Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 29438.65 MB
Train Data (Original) Memory Usage: 0.56 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']
0.1s = Fit runtime
14 features in original data used to generate 14 features in processed data.
Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.09s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200
User-specified model hyperparameters to be fit:
{
'<class '__main__.CustomRandomForestModel'>': [{}, {'max_depth': 10}, {'max_features': 0.9, 'max_depth': 20}],
}
Custom Model Type Detected: <class '__main__.CustomRandomForestModel'>
Custom Model Type Detected: <class '__main__.CustomRandomForestModel'>
Custom Model Type Detected: <class '__main__.CustomRandomForestModel'>
Fitting 3 L1 models, fit_strategy="sequential" ...
Fitting model: CustomRandomForestModel ...
0.835 = Validation score (accuracy)
0.52s = Training runtime
0.05s = Validation runtime
Fitting model: CustomRandomForestModel_2 ...
0.845 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitting model: CustomRandomForestModel_3 ...
0.84 = Validation score (accuracy)
0.52s = Training runtime
0.05s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
Ensemble Weights: {'CustomRandomForestModel_2': 1.0}
0.845 = Validation score (accuracy)
0.0s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 1.84s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 4179.6 rows/s (200 batch size)
Disabling decision threshold calibration for metric `accuracy` due to having fewer than 10000 rows of validation data for calibration, to avoid overfitting (200 rows).
`accuracy` is generally not improved through threshold calibration. Force calibration via specifying `calibrate_decision_threshold=True`.
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20241127_095105")
Predictor leaderboard¶
Here we show the stats of each model trained. Notice that a WeightedEnsemble model was also trained. This model tries to get a better validation score by ensembling the predictions of the other models.
predictor.leaderboard(test_data)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CustomRandomForestModel_2 | 0.846044 | 0.845 | accuracy | 0.087140 | 0.047109 | 0.499932 | 0.087140 | 0.047109 | 0.499932 | 1 | True | 2 |
| 1 | WeightedEnsemble_L2 | 0.846044 | 0.845 | accuracy | 0.089154 | 0.047852 | 0.502505 | 0.002014 | 0.000743 | 0.002573 | 2 | True | 4 |
| 2 | CustomRandomForestModel | 0.840414 | 0.835 | accuracy | 0.098934 | 0.046829 | 0.517806 | 0.098934 | 0.046829 | 0.517806 | 1 | True | 1 |
| 3 | CustomRandomForestModel_3 | 0.828846 | 0.840 | accuracy | 0.087710 | 0.047476 | 0.519437 | 0.087710 | 0.047476 | 0.519437 | 1 | True | 3 |
Predicting with the fit predictor¶
Here we predict with the fit predictor. This automatically uses the best model (the one with the highest validation score) to predict.
y_pred = predictor.predict(test_data)
# y_pred = predictor.predict(test_data, model='CustomRandomForestModel_3') # If we want a specific model to predict
y_pred.head(5)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
0 <=50K
1 <=50K
2 >50K
3 <=50K
4 <=50K
Name: class, dtype: object
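To get class probabilities rather than labels, use predict_proba (a usage sketch, assuming the predictor fit above):
y_pred_proba = predictor.predict_proba(test_data)
y_pred_proba.head(5)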
Hyperparameter tuning a custom model with TabularPredictor¶
We can easily hyperparameter-tune custom models by specifying a hyperparameter search space in place of exact values.
Here we hyperparameter-tune the custom model for 20 seconds:
from autogluon.common import space
custom_hyperparameters_hpo = {CustomRandomForestModel: {
'max_depth': space.Int(lower=5, upper=30),
'max_features': space.Real(lower=0.1, upper=1.0),
'criterion': space.Categorical('gini', 'entropy'),
}}
# Hyperparameter tune CustomRandomForestModel for 20 seconds
predictor = TabularPredictor(label=label).fit(train_data,
hyperparameters=custom_hyperparameters_hpo,
hyperparameter_tune_kwargs='auto', # enables HPO
time_limit=20)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 5, 'max_features': 0.1, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 20, 'max_features': 0.7436704297351775, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 8, 'max_features': 0.8625265649057129, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 26, 'max_features': 0.4459435365634299, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 11, 'max_features': 0.15104167958569886, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 6, 'max_features': 0.8125525342743981, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 19, 'max_features': 0.6112401049845391, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 30, 'max_features': 0.16393245237809825, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 25, 'max_features': 0.11819655769629316, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 10, 'max_features': 0.8003410758548655, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 5, 'max_features': 0.9807565080094875, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 22, 'max_features': 0.5153314260276387, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 24, 'max_features': 0.20644698328203992, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 6, 'max_features': 0.22901795866814179, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 5, 'max_features': 0.5696634895750645, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 28, 'max_features': 0.3381000508941643, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 23, 'max_features': 0.5105352989948937, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 5, 'max_features': 0.11691082039271963, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 10, 'max_features': 0.6508861504501793, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 22, 'max_features': 0.9493732706631618, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 15, 'max_features': 0.42355711051640743, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 6, 'max_features': 0.7278680763345383, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 30, 'max_features': 0.7000900439011009, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 16, 'max_features': 0.2893443049664568, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 5, 'max_features': 0.38388551583176544, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 17, 'max_features': 0.6131770933760917, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 16, 'max_features': 0.9895364542533036, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 20, 'max_features': 0.2879890804853512, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 17, 'max_features': 0.6877974929188586, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 19, 'max_features': 0.5196796955706757, 'criterion': 'gini'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 200 rows of data (is_train=False)
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_095107"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version: 1.2b20241127
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count: 8
Memory Avail: 28.75 GB / 30.95 GB (92.9%)
Disk Space Avail: 213.62 GB / 255.99 GB (83.5%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
presets='best' : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
presets='high' : Strong accuracy with fast inference speed.
presets='good' : Good accuracy with very fast inference speed.
presets='medium' : Fast training time, ideal for initial prototyping.
Warning: hyperparameter tuning is currently experimental and may cause the process to hang.
Beginning AutoGluon training ... Time limit = 20s
AutoGluon will save models to "/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20241127_095107"
Train Data Rows: 1000
Train Data Columns: 14
Label Column: class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [' >50K', ' <=50K']
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type: binary
Preprocessing data ...
Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 29436.65 MB
Train Data (Original) Memory Usage: 0.56 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']
0.1s = Fit runtime
14 features in original data used to generate 14 features in processed data.
Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.08s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200
User-specified model hyperparameters to be fit:
{
'<class '__main__.CustomRandomForestModel'>': [{'max_depth': Int: lower=5, upper=30, 'max_features': Real: lower=0.1, upper=1.0, 'criterion': Categorical['gini', 'entropy']}],
}
Custom Model Type Detected: <class '__main__.CustomRandomForestModel'>
Fitting 1 L1 models, fit_strategy="sequential" ...
Hyperparameter tuning model: CustomRandomForestModel ... Tuning model for up to 17.92s of the 19.91s of remaining time.
Stopping HPO to satisfy time limit...
Fitted model: CustomRandomForestModel/T1 ...
0.805 = Validation score (accuracy)
0.47s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T2 ...
0.835 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T3 ...
0.825 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T4 ...
0.855 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T5 ...
0.835 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T6 ...
0.83 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T7 ...
0.845 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T8 ...
0.845 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T9 ...
0.835 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T10 ...
0.845 = Validation score (accuracy)
0.51s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T11 ...
0.85 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T12 ...
0.835 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T13 ...
0.84 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T14 ...
0.835 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T15 ...
0.845 = Validation score (accuracy)
0.48s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T16 ...
0.85 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T17 ...
0.85 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T18 ...
0.805 = Validation score (accuracy)
0.48s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T19 ...
0.845 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T20 ...
0.835 = Validation score (accuracy)
0.51s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T21 ...
0.85 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T22 ...
0.83 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T23 ...
0.84 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T24 ...
0.845 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T25 ...
0.845 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T26 ...
0.845 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T27 ...
0.835 = Validation score (accuracy)
0.52s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T28 ...
0.85 = Validation score (accuracy)
0.49s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T29 ...
0.84 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitted model: CustomRandomForestModel/T30 ...
0.835 = Validation score (accuracy)
0.5s = Training runtime
0.05s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 19.92s of the -0.00s of remaining time.
Ensemble Weights: {'CustomRandomForestModel/T4': 1.0}
0.855 = Validation score (accuracy)
0.0s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 20.05s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 4129.9 rows/s (200 batch size)
Disabling decision threshold calibration for metric `accuracy` due to having fewer than 10000 rows of validation data for calibration, to avoid overfitting (200 rows).
`accuracy` is generally not improved through threshold calibration. Force calibration via specifying `calibrate_decision_threshold=True`.
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20241127_095107")
Predictor leaderboard (HPO)¶
The leaderboard for the HPO run shows models with a '/Tx' suffix in their name. This indicates the HPO trial in which they were fit.
leaderboard_hpo = predictor.leaderboard()
leaderboard_hpo
| | model | score_val | eval_metric | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CustomRandomForestModel/T4 | 0.855 | accuracy | 0.047702 | 0.494681 | 0.047702 | 0.494681 | 1 | True | 4 |
| 1 | WeightedEnsemble_L2 | 0.855 | accuracy | 0.048427 | 0.497273 | 0.000725 | 0.002592 | 2 | True | 31 |
| 2 | CustomRandomForestModel/T11 | 0.850 | accuracy | 0.046314 | 0.496783 | 0.046314 | 0.496783 | 1 | True | 11 |
| 3 | CustomRandomForestModel/T28 | 0.850 | accuracy | 0.046404 | 0.488681 | 0.046404 | 0.488681 | 1 | True | 28 |
| 4 | CustomRandomForestModel/T16 | 0.850 | accuracy | 0.046493 | 0.490231 | 0.046493 | 0.490231 | 1 | True | 16 |
| 5 | CustomRandomForestModel/T17 | 0.850 | accuracy | 0.047140 | 0.491716 | 0.047140 | 0.491716 | 1 | True | 17 |
| 6 | CustomRandomForestModel/T21 | 0.850 | accuracy | 0.047389 | 0.496071 | 0.047389 | 0.496071 | 1 | True | 21 |
| 7 | CustomRandomForestModel/T15 | 0.845 | accuracy | 0.046412 | 0.482677 | 0.046412 | 0.482677 | 1 | True | 15 |
| 8 | CustomRandomForestModel/T8 | 0.845 | accuracy | 0.046805 | 0.487744 | 0.046805 | 0.487744 | 1 | True | 8 |
| 9 | CustomRandomForestModel/T24 | 0.845 | accuracy | 0.047056 | 0.492893 | 0.047056 | 0.492893 | 1 | True | 24 |
| 10 | CustomRandomForestModel/T25 | 0.845 | accuracy | 0.047136 | 0.490474 | 0.047136 | 0.490474 | 1 | True | 25 |
| 11 | CustomRandomForestModel/T19 | 0.845 | accuracy | 0.047263 | 0.499039 | 0.047263 | 0.499039 | 1 | True | 19 |
| 12 | CustomRandomForestModel/T7 | 0.845 | accuracy | 0.047345 | 0.496344 | 0.047345 | 0.496344 | 1 | True | 7 |
| 13 | CustomRandomForestModel/T26 | 0.845 | accuracy | 0.047765 | 0.499947 | 0.047765 | 0.499947 | 1 | True | 26 |
| 14 | CustomRandomForestModel/T10 | 0.845 | accuracy | 0.047879 | 0.508763 | 0.047879 | 0.508763 | 1 | True | 10 |
| 15 | CustomRandomForestModel/T13 | 0.840 | accuracy | 0.046828 | 0.492749 | 0.046828 | 0.492749 | 1 | True | 13 |
| 16 | CustomRandomForestModel/T29 | 0.840 | accuracy | 0.047030 | 0.500797 | 0.047030 | 0.500797 | 1 | True | 29 |
| 17 | CustomRandomForestModel/T23 | 0.840 | accuracy | 0.048148 | 0.500881 | 0.048148 | 0.500881 | 1 | True | 23 |
| 18 | CustomRandomForestModel/T9 | 0.835 | accuracy | 0.046202 | 0.485823 | 0.046202 | 0.485823 | 1 | True | 9 |
| 19 | CustomRandomForestModel/T5 | 0.835 | accuracy | 0.046503 | 0.487067 | 0.046503 | 0.487067 | 1 | True | 5 |
| 20 | CustomRandomForestModel/T30 | 0.835 | accuracy | 0.046667 | 0.496747 | 0.046667 | 0.496747 | 1 | True | 30 |
| 21 | CustomRandomForestModel/T2 | 0.835 | accuracy | 0.046779 | 0.496093 | 0.046779 | 0.496093 | 1 | True | 2 |
| 22 | CustomRandomForestModel/T20 | 0.835 | accuracy | 0.046856 | 0.514037 | 0.046856 | 0.514037 | 1 | True | 20 |
| 23 | CustomRandomForestModel/T14 | 0.835 | accuracy | 0.047088 | 0.485311 | 0.047088 | 0.485311 | 1 | True | 14 |
| 24 | CustomRandomForestModel/T27 | 0.835 | accuracy | 0.047269 | 0.515591 | 0.047269 | 0.515591 | 1 | True | 27 |
| 25 | CustomRandomForestModel/T12 | 0.835 | accuracy | 0.047595 | 0.491838 | 0.047595 | 0.491838 | 1 | True | 12 |
| 26 | CustomRandomForestModel/T22 | 0.830 | accuracy | 0.046718 | 0.491145 | 0.046718 | 0.491145 | 1 | True | 22 |
| 27 | CustomRandomForestModel/T6 | 0.830 | accuracy | 0.047780 | 0.488095 | 0.047780 | 0.488095 | 1 | True | 6 |
| 28 | CustomRandomForestModel/T3 | 0.825 | accuracy | 0.048805 | 0.504421 | 0.048805 | 0.504421 | 1 | True | 3 |
| 29 | CustomRandomForestModel/T18 | 0.805 | accuracy | 0.046407 | 0.475975 | 0.046407 | 0.475975 | 1 | True | 18 |
| 30 | CustomRandomForestModel/T1 | 0.805 | accuracy | 0.048273 | 0.473814 | 0.048273 | 0.473814 | 1 | True | 1 |
Getting the hyperparameters of a trained model¶
Let's get the hyperparameters of the model with the highest validation score.
best_model_name = leaderboard_hpo[leaderboard_hpo['stack_level'] == 1]['model'].iloc[0]
predictor_info = predictor.info()
best_model_info = predictor_info['model_info'][best_model_name]
print(best_model_info)
print(f'Best Model Hyperparameters ({best_model_name}):')
print(best_model_info['hyperparameters'])
{'name': 'CustomRandomForestModel/T4', 'model_type': 'CustomRandomForestModel', 'problem_type': 'binary', 'eval_metric': 'accuracy', 'stopping_metric': 'accuracy', 'fit_time': 0.49468064308166504, 'num_classes': 2, 'quantile_levels': None, 'predict_time': 0.04770231246948242, 'val_score': 0.855, 'hyperparameters': {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 26, 'max_features': 0.4459435365634299, 'criterion': 'entropy'}, 'hyperparameters_fit': {}, 'hyperparameters_nondefault': ['max_depth', 'max_features', 'criterion', 'n_estimators', 'n_jobs', 'random_state'], 'ag_args_fit': {'max_memory_usage_ratio': 1.0, 'max_time_limit_ratio': 1.0, 'max_time_limit': None, 'min_time_limit': 0, 'valid_raw_types': ['int', 'float', 'category'], 'valid_special_types': None, 'ignored_type_group_special': None, 'ignored_type_group_raw': None, 'get_features_kwargs': None, 'get_features_kwargs_extra': None, 'predict_1_batch_size': None, 'temperature_scalar': None}, 'num_features': 14, 'features': ['age', 'fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'native-country'], 'feature_metadata': <autogluon.common.features.feature_metadata.FeatureMetadata object at 0x7f492f09b090>, 'memory_size': 4803201, 'compile_time': None, 'is_initialized': True, 'is_fit': True, 'is_valid': True, 'can_infer': True, 'has_learning_curves': False, 'val_in_fit': True, 'unlabeled_in_fit': False, 'num_cpus': 8, 'num_gpus': 0.0}
Best Model Hyperparameters (CustomRandomForestModel/T4):
{'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 26, 'max_features': 0.4459435365634299, 'criterion': 'entropy'}
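The same predictor.info() dictionary describes every trained model, not just the best one. Below is a minimal sketch (reusing the predictor_info variable from above) that prints the validation score and hyperparameters of each HPO trial:
# Each entry of `predictor_info['model_info']` is a metadata dict like the one printed above,
# containing keys such as 'val_score' and 'hyperparameters'.
for model_name, model_info in predictor_info['model_info'].items():
    print(f"{model_name}: val_score={model_info['val_score']}")
    print(f"  hyperparameters: {model_info['hyperparameters']}")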
Training the custom model alongside other models with TabularPredictor¶
Finally, we will train the custom model (with its tuned hyperparameters) alongside the default AutoGluon models.
All this requires is getting the default models' hyperparameter dictionary via get_hyperparameter_config
and adding CustomRandomForestModel as a key.
from autogluon.tabular.configs.hyperparameter_configs import get_hyperparameter_config
# Now we can add the custom model with tuned hyperparameters to be trained alongside the default models:
custom_hyperparameters = get_hyperparameter_config('default')
custom_hyperparameters[CustomRandomForestModel] = best_model_info['hyperparameters']
print(custom_hyperparameters)
{'NN_TORCH': {}, 'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, {'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'ag_args': {'name_suffix': 'Large', 'priority': 0, 'hyperparameter_tune_kwargs': None}}], 'CAT': {}, 'XGB': {}, 'FASTAI': {}, 'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}], 'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}], 'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}], <class '__main__.CustomRandomForestModel'>: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 26, 'max_features': 0.4459435365634299, 'criterion': 'entropy'}}
predictor = TabularPredictor(label=label).fit(train_data, hyperparameters=custom_hyperparameters) # Train the default models plus a single tuned CustomRandomForestModel
# predictor = TabularPredictor(label=label).fit(train_data, hyperparameters=custom_hyperparameters, presets='best_quality') # We can even use the custom model in a multi-layer stack ensemble
predictor.leaderboard(test_data)
Entering the `_fit` method
Entering the `_preprocess` method: 800 rows of data (is_train=True)
Hyperparameters: {'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 26, 'max_features': 0.4459435365634299, 'criterion': 'entropy'}
Exiting the `_fit` method
Entering the `_preprocess` method: 200 rows of data (is_train=False)
Entering the `_preprocess` method: 9769 rows of data (is_train=False)
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_095130"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version: 1.2b20241127
Python Version: 3.11.9
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count: 8
Memory Avail: 28.73 GB / 30.95 GB (92.8%)
Disk Space Avail: 213.52 GB / 255.99 GB (83.4%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
presets='best' : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
presets='high' : Strong accuracy with fast inference speed.
presets='good' : Good accuracy with very fast inference speed.
presets='medium' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20241127_095130"
Train Data Rows: 1000
Train Data Columns: 14
Label Column: class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [' >50K', ' <=50K']
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type: binary
Preprocessing data ...
Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 29415.95 MB
Train Data (Original) Memory Usage: 0.56 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']
0.1s = Fit runtime
14 features in original data used to generate 14 features in processed data.
Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.08s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200
User-specified model hyperparameters to be fit:
{
'NN_TORCH': [{}],
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, {'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'ag_args': {'name_suffix': 'Large', 'priority': 0, 'hyperparameter_tune_kwargs': None}}],
'CAT': [{}],
'XGB': [{}],
'FASTAI': [{}],
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
'<class '__main__.CustomRandomForestModel'>': [{'n_estimators': 300, 'n_jobs': -1, 'random_state': 0, 'max_depth': 26, 'max_features': 0.4459435365634299, 'criterion': 'entropy'}],
}
Custom Model Type Detected: <class '__main__.CustomRandomForestModel'>
Fitting 14 L1 models, fit_strategy="sequential" ...
Fitting model: KNeighborsUnif ...
0.725 = Validation score (accuracy)
0.03s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
0.71 = Validation score (accuracy)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMXT ...
0.85 = Validation score (accuracy)
0.28s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
0.84 = Validation score (accuracy)
0.29s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestGini ...
0.84 = Validation score (accuracy)
0.66s = Training runtime
0.05s = Validation runtime
Fitting model: RandomForestEntr ...
0.835 = Validation score (accuracy)
0.59s = Training runtime
0.05s = Validation runtime
Fitting model: CatBoost ...
0.86 = Validation score (accuracy)
1.86s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesGini ...
0.815 = Validation score (accuracy)
0.61s = Training runtime
0.05s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.82 = Validation score (accuracy)
0.58s = Training runtime
0.05s = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 7: early stopping
0.84 = Validation score (accuracy)
3.06s = Training runtime
0.01s = Validation runtime
Fitting model: XGBoost ...
0.845 = Validation score (accuracy)
0.26s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetTorch ...
0.855 = Validation score (accuracy)
3.37s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMLarge ...
0.795 = Validation score (accuracy)
0.78s = Training runtime
0.0s = Validation runtime
Fitting model: CustomRandomForestModel ...
0.855 = Validation score (accuracy)
0.52s = Training runtime
0.05s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
Ensemble Weights: {'LightGBMXT': 0.267, 'CustomRandomForestModel': 0.267, 'CatBoost': 0.2, 'RandomForestGini': 0.133, 'ExtraTreesEntr': 0.133}
0.885 = Validation score (accuracy)
0.09s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 13.57s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 1326.6 rows/s (200 batch size)
Disabling decision threshold calibration for metric `accuracy` due to having fewer than 10000 rows of validation data for calibration, to avoid overfitting (200 rows).
`accuracy` is generally not improved through threshold calibration. Force calibration via specifying `calibrate_decision_threshold=True`.
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20241127_095130")
| model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CatBoost | 0.852902 | 0.860 | accuracy | 0.011480 | 0.004135 | 1.864036 | 0.011480 | 0.004135 | 1.864036 | 1 | True | 7 |
1 | WeightedEnsemble_L2 | 0.850957 | 0.885 | accuracy | 0.356746 | 0.150763 | 4.000829 | 0.004435 | 0.000707 | 0.089826 | 2 | True | 15 |
2 | LightGBMXT | 0.850752 | 0.850 | accuracy | 0.018226 | 0.003682 | 0.283427 | 0.018226 | 0.003682 | 0.283427 | 1 | True | 3 |
3 | NeuralNetFastAI | 0.848193 | 0.840 | accuracy | 0.136087 | 0.009625 | 3.055832 | 0.136087 | 0.009625 | 3.055832 | 1 | True | 10 |
4 | LightGBM | 0.841335 | 0.840 | accuracy | 0.013442 | 0.003443 | 0.292836 | 0.013442 | 0.003443 | 0.292836 | 1 | True | 4 |
5 | RandomForestGini | 0.840004 | 0.840 | accuracy | 0.110018 | 0.047464 | 0.660194 | 0.110018 | 0.047464 | 0.660194 | 1 | True | 5 |
6 | XGBoost | 0.838162 | 0.845 | accuracy | 0.058857 | 0.006267 | 0.259349 | 0.058857 | 0.006267 | 0.259349 | 1 | True | 11 |
7 | RandomForestEntr | 0.837240 | 0.835 | accuracy | 0.116725 | 0.046871 | 0.587866 | 0.116725 | 0.046871 | 0.587866 | 1 | True | 6 |
8 | CustomRandomForestModel | 0.834988 | 0.855 | accuracy | 0.108045 | 0.047540 | 0.521197 | 0.108045 | 0.047540 | 0.521197 | 1 | True | 14 |
9 | NeuralNetTorch | 0.833248 | 0.855 | accuracy | 0.047004 | 0.010044 | 3.368351 | 0.047004 | 0.010044 | 3.368351 | 1 | True | 12 |
10 | ExtraTreesGini | 0.831917 | 0.815 | accuracy | 0.100978 | 0.046512 | 0.606500 | 0.100978 | 0.046512 | 0.606500 | 1 | True | 8 |
11 | LightGBMLarge | 0.829461 | 0.795 | accuracy | 0.066073 | 0.004562 | 0.780304 | 0.066073 | 0.004562 | 0.780304 | 1 | True | 13 |
12 | ExtraTreesEntr | 0.829358 | 0.820 | accuracy | 0.104541 | 0.047234 | 0.582150 | 0.104541 | 0.047234 | 0.582150 | 1 | True | 9 |
13 | KNeighborsUnif | 0.744600 | 0.725 | accuracy | 0.041047 | 0.013742 | 0.030056 | 0.041047 | 0.013742 | 0.030056 | 1 | True | 1 |
14 | KNeighborsDist | 0.710922 | 0.710 | accuracy | 0.040276 | 0.013443 | 0.010359 | 0.040276 | 0.013443 | 0.010359 | 1 | True | 2 |
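With training complete, the predictor can be used for inference and evaluation like any other TabularPredictor; the tuned CustomRandomForestModel participates in WeightedEnsemble_L2 automatically. A minimal usage sketch, reusing the test_data variable from earlier in the tutorial:
# Predict with the best model (WeightedEnsemble_L2 in the leaderboard above)
predictions = predictor.predict(test_data)
print(predictions.head())

# Evaluate on the held-out test data using the predictor's eval_metric ('accuracy')
print(predictor.evaluate(test_data))

# Inference can also be restricted to a single model, such as our custom model
predictions_custom = predictor.predict(test_data, model='CustomRandomForestModel')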
Summary¶
That's all it takes to add a custom model to AutoGluon. If you create a custom model, consider submitting a PR so that we can officially add it to AutoGluon!
For more tutorials, refer to Predicting Columns in a Table - Quick Start and Predicting Columns in a Table - In Depth.
For a tutorial on advanced custom model functionality, refer to Adding a custom model to AutoGluon (Advanced).