Hyperparameter Optimization in AutoMM


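Hyperparameter optimization (HPO) is a method that helps with the challenge of tuning the hyperparameters of machine learning models. ML algorithms have multiple complex hyperparameters that generate an enormous search space, and the search space of deep learning methods is even larger than that of traditional ML algorithms. Tuning over a large search space is a tough challenge, but AutoMM provides several options that let you guide the fitting process based on your domain knowledge and the constraints of your compute resources.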

Create Image Dataset

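In this tutorial, we again use a subset of the Shopee-IET dataset from Kaggle for demonstration. Each image contains one clothing item, and the corresponding label specifies its clothing category. Our subset of the data contains the following possible labels: BabyPants, BabyShirt, womencasualshoes, womenchiffontop.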

We can load the dataset by automatically downloading it from a URL:

import warnings
warnings.filterwarnings('ignore')
from datetime import datetime

from autogluon.multimodal.utils.misc import shopee_dataset
download_dir = './ag_automm_tutorial_hpo'
train_data, test_data = shopee_dataset(download_dir)
train_data = train_data.sample(frac=0.5)
print(train_data)
Downloading ./ag_automm_tutorial_hpo/file.zip from https://automl-mm-bench.s3.amazonaws.com/vision_datasets/shopee.zip...
image  label
640  /home/ci/autogluon/docs/tutorials/multimodal/a...      3
354  /home/ci/autogluon/docs/tutorials/multimodal/a...      1
434  /home/ci/autogluon/docs/tutorials/multimodal/a...      2
785  /home/ci/autogluon/docs/tutorials/multimodal/a...      3
433  /home/ci/autogluon/docs/tutorials/multimodal/a...      2
..                                                 ...    ...
661  /home/ci/autogluon/docs/tutorials/multimodal/a...      3
754  /home/ci/autogluon/docs/tutorials/multimodal/a...      3
349  /home/ci/autogluon/docs/tutorials/multimodal/a...      1
727  /home/ci/autogluon/docs/tutorials/multimodal/a...      3
168  /home/ci/autogluon/docs/tutorials/multimodal/a...      0

[400 rows x 2 columns]

This dataset contains 400 data points in total. The image column stores the path to the actual image, and the label column holds the label class.
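Because `sample(frac=0.5)` draws rows at random, the four classes may not be equally represented in the 400-row subset. Below is a minimal sketch of a balance check using only the standard library. The `labels` list is made-up illustration data, not the tutorial's actual labels, and `class_shares` is a hypothetical helper name.

```python
from collections import Counter

def class_shares(labels):
    """Return {label: fraction of rows}, ordered by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# Hypothetical label column with four classes (0-3), for illustration only.
labels = [3, 1, 2, 3, 2, 3, 3, 1, 3, 0]
shares = class_shares(labels)
for label, share in shares.items():
    print(f"label {label}: {share:.0%}")
```

With the real data, you would pass `train_data["label"].tolist()` instead of the hand-written list.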

Regular Model Fitting

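Recall that, with AutoGluon's predefined default settings, we can fit a model using MultiModalPredictor in just three lines of code. The extra lines around `fit` below only measure how long fitting takes: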

from autogluon.multimodal import MultiModalPredictor
predictor_regular = MultiModalPredictor(label="label")
start_time = datetime.now()
predictor_regular.fit(
    train_data=train_data,
    hyperparameters = {"model.timm_image.checkpoint_name": "ghostnet_100"}
)
end_time = datetime.now()
elapsed_seconds = (end_time - start_time).total_seconds()
elapsed_min = divmod(elapsed_seconds, 60)
print("Total fitting time: ", f"{int(elapsed_min[0])}m{int(elapsed_min[1])}s")
Total fitting time:  0m55s
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_103022"
=================== System Info ===================
AutoGluon Version:  1.2b20241127
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count:          8
Pytorch Version:    2.5.1+cu124
CUDA Version:       12.4
Memory Avail:       28.43 GB / 30.95 GB (91.9%)
Disk Space Avail:   174.90 GB / 255.99 GB (68.3%)
===================================================
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
4 unique label values:  [3, 1, 2, 0]
If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022
    ```
Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: Tesla T4
GPU 0 Memory: 0.43GB/15.0GB (Used/Total)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name              | Type                            | Params | Mode 
------------------------------------------------------------------------------
0 | model             | TimmAutoModelForImagePrediction | 3.9 M  | train
1 | validation_metric | MulticlassAccuracy              | 0      | train
2 | loss_func         | CrossEntropyLoss                | 0      | train
------------------------------------------------------------------------------
3.9 M     Trainable params
0         Non-trainable params
3.9 M     Total params
15.627    Total estimated model params size (MB)
418       Modules in train mode
0         Modules in eval mode
Epoch 0, global step 1: 'val_accuracy' reached 0.17500 (best 0.17500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=0-step=1.ckpt' as top 3
Epoch 0, global step 3: 'val_accuracy' reached 0.23750 (best 0.23750), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=0-step=3.ckpt' as top 3
Epoch 1, global step 4: 'val_accuracy' reached 0.25000 (best 0.25000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=1-step=4.ckpt' as top 3
Epoch 1, global step 6: 'val_accuracy' reached 0.30000 (best 0.30000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=1-step=6.ckpt' as top 3
Epoch 2, global step 7: 'val_accuracy' reached 0.35000 (best 0.35000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=2-step=7.ckpt' as top 3
Epoch 2, global step 9: 'val_accuracy' reached 0.42500 (best 0.42500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=2-step=9.ckpt' as top 3
Epoch 3, global step 10: 'val_accuracy' reached 0.46250 (best 0.46250), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=3-step=10.ckpt' as top 3
Epoch 3, global step 12: 'val_accuracy' reached 0.61250 (best 0.61250), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=3-step=12.ckpt' as top 3
Epoch 4, global step 13: 'val_accuracy' reached 0.60000 (best 0.61250), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=4-step=13.ckpt' as top 3
Epoch 4, global step 15: 'val_accuracy' reached 0.61250 (best 0.61250), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=4-step=15.ckpt' as top 3
Epoch 5, global step 16: 'val_accuracy' reached 0.61250 (best 0.61250), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=5-step=16.ckpt' as top 3
Epoch 5, global step 18: 'val_accuracy' reached 0.66250 (best 0.66250), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=5-step=18.ckpt' as top 3
Epoch 6, global step 19: 'val_accuracy' reached 0.66250 (best 0.66250), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=6-step=19.ckpt' as top 3
Epoch 6, global step 21: 'val_accuracy' reached 0.66250 (best 0.66250), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=6-step=21.ckpt' as top 3
Epoch 7, global step 22: 'val_accuracy' reached 0.70000 (best 0.70000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=7-step=22.ckpt' as top 3
Epoch 7, global step 24: 'val_accuracy' was not in top 3
Epoch 8, global step 25: 'val_accuracy' reached 0.67500 (best 0.70000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=8-step=25.ckpt' as top 3
Epoch 8, global step 27: 'val_accuracy' reached 0.70000 (best 0.70000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=8-step=27.ckpt' as top 3
Epoch 9, global step 28: 'val_accuracy' reached 0.68750 (best 0.70000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=9-step=28.ckpt' as top 3
Epoch 9, global step 30: 'val_accuracy' reached 0.72500 (best 0.72500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=9-step=30.ckpt' as top 3
Epoch 10, global step 31: 'val_accuracy' was not in top 3
Epoch 10, global step 33: 'val_accuracy' reached 0.72500 (best 0.72500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=10-step=33.ckpt' as top 3
Epoch 11, global step 34: 'val_accuracy' reached 0.71250 (best 0.72500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=11-step=34.ckpt' as top 3
Epoch 11, global step 36: 'val_accuracy' was not in top 3
Epoch 12, global step 37: 'val_accuracy' reached 0.72500 (best 0.72500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022/epoch=12-step=37.ckpt' as top 3
Epoch 12, global step 39: 'val_accuracy' was not in top 3
Epoch 13, global step 40: 'val_accuracy' was not in top 3
Epoch 13, global step 42: 'val_accuracy' was not in top 3
Epoch 14, global step 43: 'val_accuracy' was not in top 3
Epoch 14, global step 45: 'val_accuracy' was not in top 3
Start to fuse 3 checkpoints via the greedy soup algorithm.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103022")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).

Let's check the test accuracy of the fitted model:

scores = predictor_regular.evaluate(test_data, metrics=["accuracy"])
print('Top-1 test acc: %.3f' % scores["accuracy"])
Top-1 test acc: 0.738

Use HPO During Model Fitting

If you want more control over the fitting process, you can specify various options for hyperparameter optimization (HPO) in MultiModalPredictor by adding options to hyperparameters and hyperparameter_tune_kwargs.

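MultiModalPredictor offers several options. It uses the Ray Tune library (ray.tune) in the backend, so we need to pass in either a Tune search space or an AutoGluon search space, which is converted to a Tune search space.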

  1. Define the search space for the hyperparameter values used in neural network training:

    hyperparameters = {
            "optimization.learning_rate": tune.uniform(0.00005, 0.005),
            "optimization.optim_type": tune.choice(["adamw", "sgd"]),
            "optimization.max_epochs": tune.choice([10, 20]),  # integers, not strings
            "model.timm_image.checkpoint_name": tune.choice(["swin_base_patch4_window7_224", "convnext_base_in22ft1k"])
            }
    

    This is an example, not an exhaustive list. You can find the full list of supported options in Customize AutoMM.

  2. Define the search strategy for HPO with hyperparameter_tune_kwargs. You can pass in a string or initialize a ray.tune.schedulers.TrialScheduler object.

    a. Specifying how to search through your chosen hyperparameter space (supports `random` and `bayes`):
    "searcher": "bayes"
    
    b. Specifying how to schedule jobs to train a network under a particular hyperparameter configuration (supports `FIFO` and `ASHA`):
    "scheduler": "ASHA"
    
    c. Number of HPO trials to run:
    "num_trials": 20
    
    d. Number of checkpoints to keep on disk per trial, see Ray documentation for more details. Must be >= 1. (default is 3):
    "num_to_keep": 3
    

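To make the two pieces above concrete, the following plain-Python sketch mimics what a `random` searcher does with a search space: it draws `num_trials` independent configurations. It does not use Ray Tune. `SPACE`, `sample_config`, and `random_search` are hypothetical names, and the bounds are modeled on the example search space in step 1.

```python
import random

# Each entry maps a hyperparameter name to a sampler that takes a random.Random.
SPACE = {
    "optimization.learning_rate": lambda rng: rng.uniform(0.00005, 0.005),
    "optimization.optim_type": lambda rng: rng.choice(["adamw", "sgd"]),
    "optimization.max_epochs": lambda rng: rng.choice([10, 20]),
}

def sample_config(space, rng):
    """Draw one configuration by sampling every hyperparameter independently."""
    return {name: sampler(rng) for name, sampler in space.items()}

def random_search(space, num_trials, seed=0):
    """Return num_trials independently sampled configurations."""
    rng = random.Random(seed)
    return [sample_config(space, rng) for _ in range(num_trials)]

configs = random_search(SPACE, num_trials=20)
for config in configs[:3]:
    print(config)
```

A `bayes` searcher differs in that later draws are informed by the scores of earlier trials instead of being independent.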
Let's run HPO over combinations of different learning rates and backbone models:

from ray import tune

predictor_hpo = MultiModalPredictor(label="label")

hyperparameters = {
            "optimization.learning_rate": tune.uniform(0.00005, 0.001),
            "model.timm_image.checkpoint_name": tune.choice(["ghostnet_100",
                                                             "mobilenetv3_large_100"])
}
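hyperparameter_tune_kwargs = {
    "searcher": "bayes",  # search strategy; "random" is the other supported option
    "scheduler": "ASHA",  # early-stopping scheduler; "FIFO" is the alternative
    "num_trials": 2,      # number of HPO trials to run
    "num_to_keep": 3,     # checkpoints kept on disk per trial
}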
start_time_hpo = datetime.now()
predictor_hpo.fit(
        train_data=train_data,
        hyperparameters=hyperparameters,
        hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
    )
end_time_hpo = datetime.now()
elapsed_seconds_hpo = (end_time_hpo - start_time_hpo).total_seconds()
elapsed_min_hpo = divmod(elapsed_seconds_hpo, 60)
print("Total fitting time: ", f"{int(elapsed_min_hpo[0])}m{int(elapsed_min_hpo[1])}s")
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_103119"
=================== System Info ===================
AutoGluon Version:  1.2b20241127
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count:          8
Pytorch Version:    2.5.1+cu124
CUDA Version:       12.4
Memory Avail:       27.49 GB / 30.95 GB (88.8%)
Disk Space Avail:   174.88 GB / 255.99 GB (68.3%)
===================================================
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
4 unique label values:  [3, 1, 2, 0]
If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Removing non-optimal trials and only keep the best one.
Start to fuse 3 checkpoints via the greedy soup algorithm.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_103119")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).

Tune Status

Current time:2024-11-27 10:33:02
Running for: 00:01:37.90
Memory: 5.2/30.9 GiB

System Info

Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4096.000: None | Iter 1024.000: None | Iter 256.000: None | Iter 64.000: None | Iter 16.000: 0.871874988079071 | Iter 4.000: 0.6593749821186066 | Iter 1.000: 0.32499998807907104
Logical resource usage: 8.0/8 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:T4)
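The `Bracket` line lists milestones at iterations 1, 4, 16, 64, and so on up to 4096. That pattern matches a rung schedule that starts at 1 and multiplies by 4 at each level. The sketch below reproduces it and shows a simplified version of the keep-or-stop decision made at a rung. The values `grace_period=1`, `reduction_factor=4`, and `max_t=4096` are inferred from the log, not read from AutoGluon's configuration, and `should_continue` is a simplification, not Ray Tune's actual implementation.

```python
def rung_milestones(grace_period, reduction_factor, max_t):
    """Iterations at which trials are compared: grace_period * reduction_factor**k."""
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones

def should_continue(score, scores_at_rung, reduction_factor):
    """Simplified rule: continue only if score is in the top 1/reduction_factor
    of all scores recorded at this rung so far (higher is better)."""
    ranked = sorted(scores_at_rung + [score], reverse=True)
    keep = max(1, len(ranked) // reduction_factor)
    return score >= ranked[keep - 1]

print(rung_milestones(grace_period=1, reduction_factor=4, max_t=4096))
# [1, 4, 16, 64, 256, 1024, 4096]

# Validation accuracies of earlier trials at one rung (made-up values).
seen = [0.30, 0.45, 0.52, 0.61, 0.33, 0.48, 0.58]
print(should_continue(0.65, seen, reduction_factor=4))  # True
print(should_continue(0.40, seen, reduction_factor=4))  # False
```

With only two trials, as in this tutorial, there is little to compare at each rung, which is consistent with `num_stopped=0` in the log.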

Trial Status

Trial name  status      loc             model.names           model.timm_image.checkpoint_name  optimization.learning_rate  iter  total time (s)  val_accuracy
64b09fc1    TERMINATED  10.0.0.57:6256  ('timm_image', _6b80  mobilenetv3_lar_18e0              0.000156298                 28    37.1824         0.875
0fdb8f17    TERMINATED  10.0.0.57:6456  ('timm_image', _d6c0  mobilenetv3_lar_18e0              0.000934244                 32    42.2214         0.9125

Trial Progress

Trial name should_checkpoint val_accuracy
0fdb8f17 True 0.9125
64b09fc1 True 0.875
Total fitting time:  1m46s
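
The log line `Removing non-optimal trials and only keep the best one` corresponds to picking the trial with the highest validation accuracy. Below is a minimal sketch of that selection, with the two trials from the status table typed in by hand. `trials` and `best` are illustrative names, not AutoGluon objects.

```python
trials = [
    {"trial": "64b09fc1", "learning_rate": 0.000156298, "val_accuracy": 0.875},
    {"trial": "0fdb8f17", "learning_rate": 0.000934244, "val_accuracy": 0.9125},
]

# Pick the trial whose validation accuracy is highest.
best = max(trials, key=lambda t: t["val_accuracy"])
print(best["trial"], best["val_accuracy"])
```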

Let's check the test accuracy of the model fitted with HPO:

scores_hpo = predictor_hpo.evaluate(test_data, metrics=["accuracy"])
print('Top-1 test acc: %.3f' % scores_hpo["accuracy"])
Top-1 test acc: 0.912

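In the training log, you should see a line reporting the current best trial, similar to the following (trial IDs and values differ between runs):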

Current best trial: 47aef96a with val_accuracy=0.862500011920929 and parameters={'optimization.learning_rate': 0.0007195214018085505, 'model.timm_image.checkpoint_name': 'ghostnet_100'}

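After our simple 2-trial HPO run, which searched over different learning rates and models, we obtained a better test accuracy (0.912) than the out-of-the-box solution from the previous section (0.738). HPO helps select the hyperparameter combination with the highest validation accuracy.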
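
To put the two runs side by side, the arithmetic below uses only the numbers printed in this tutorial (test accuracy 0.738 vs. 0.912, fitting time 0m55s vs. 1m46s). It is a worked comparison of one run, not a benchmark, and results vary between runs.

```python
baseline_acc, hpo_acc = 0.738, 0.912
baseline_secs, hpo_secs = 55, 1 * 60 + 46

abs_gain = hpo_acc - baseline_acc
# Fraction of the baseline's test errors removed by the HPO model.
error_reduction = abs_gain / (1 - baseline_acc)
# How much longer the HPO fit took than the regular fit.
time_ratio = hpo_secs / baseline_secs

print(f"absolute accuracy gain: {abs_gain:.3f}")
print(f"relative error reduction: {error_reduction:.1%}")
print(f"fitting time ratio: {time_ratio:.2f}x")
```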

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.