Single GPU Billion-scale Model Training via Parameter-Efficient Finetuning


As pointed out by a recent paper from the Stanford Institute for Human-Centered Artificial Intelligence, AI is undergoing a paradigm shift with the rise of "foundation models", i.e., giant models trained on a diverse collection of datasets, generally in a self-supervised way. These foundation models, which are the key of AutoMM, can be easily adapted to downstream applications. However, as these foundation models grow in size, finetuning them becomes increasingly difficult. Below is a figure from the Microsoft research blog that demonstrates this trend:

Scaling of foundation models

The goal of AutoMM is to help anyone solve machine learning problems via open source foundation models, including these giant models. To finetune such large-scale models, we adopt the recently popularized parameter-efficient finetuning technique. The idea is to either finetune a small subset of the weights in the foundation model (e.g., BitFit), or add a tiny tunable structure on top of the fixed backbone (e.g., Prompt Tuning, LoRA, Adapter, MAM Adapter, IA^3). These techniques effectively reduce peak memory usage and model training time while maintaining performance.
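To make the recipe concrete, below is a minimal, self-contained PyTorch sketch of the "ia3_bias" idea on a toy network: freeze the backbone, keep only bias terms trainable (the BitFit part), and attach small learned scaling vectors to linear outputs (a simplification of IA^3, which in the paper rescales keys, values, and FFN activations). This is an illustration only, not AutoMM's internal implementation.

import torch
import torch.nn as nn

def make_ia3_bias_toy(model: nn.Module) -> nn.Module:
    # BitFit part: freeze everything except bias terms
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
    # IA^3-style part: learned per-channel scaling of each linear output
    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.register_parameter(
                "ia3_scale", nn.Parameter(torch.ones(module.out_features)))
            module.register_forward_hook(
                lambda mod, args, out: out * mod.ia3_scale)
    return model

toy = make_ia3_bias_toy(nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)))
n_trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in toy.parameters())
print(f"trainable: {n_trainable} / {n_total}")  # only biases and IA^3 scales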

In this tutorial, we introduce how to apply parameter-efficient finetuning in MultiModalPredictor. We first introduce how to adopt the "ia3_bias" algorithm for parameter-efficient finetuning. Afterwards, we show how you can simply combine "ia3_bias" with gradient checkpointing to finetune the XL variant of Google's FLAN-T5 via a single NVIDIA T4 GPU.

Prepare Dataset

The Cross-Lingual Amazon Product Review Sentiment dataset contains Amazon product reviews in four languages. Here, we load the English and German folds of the dataset. In the label column, 0 means negative sentiment and 1 means positive sentiment. For demonstration purposes, we downsample the training data to 1000 samples. We will train the model on the English data and directly evaluate its performance on the German and Japanese test sets.

!wget --quiet https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
!unzip -q -o amazon_review_sentiment_cross_lingual.zip -d .
import os
import shutil

# Keep the Hugging Face model cache in a local folder so it can be removed easily
os.environ["TRANSFORMERS_CACHE"] = "cache"

def clear_cache():
    if os.path.exists("cache"):
        shutil.rmtree("cache")

clear_cache()
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

train_en_df = pd.read_csv("amazon_review_sentiment_cross_lingual/en_train.tsv",
                          sep="\t",
                          header=None,
                          names=["label", "text"]) \
                .sample(1000, random_state=123).reset_index(drop=True)

test_en_df = pd.read_csv("amazon_review_sentiment_cross_lingual/en_test.tsv",
                          sep="\t",
                          header=None,
                          names=["label", "text"]) \
               .sample(200, random_state=123).reset_index(drop=True)
test_de_df = pd.read_csv("amazon_review_sentiment_cross_lingual/de_test.tsv",
                          sep="\t", header=None, names=["label", "text"]) \
               .sample(200, random_state=123).reset_index(drop=True)

test_jp_df = pd.read_csv("amazon_review_sentiment_cross_lingual/jp_test.tsv",
                          sep="\t", header=None, names=["label", "text"]) \
               .sample(200, random_state=123).reset_index(drop=True)
train_en_df.head(5)
label text
0 0 This is a film that literally sees little wron...
1 0 This music is pretty intelligent, but not very...
2 0 One of the best pieces of rock ever recorded, ...
3 0 Reading the posted reviews here, is like revis...
4 1 I've just finished page 341, the last page. It...
test_jp_df.head(5)
label text
0 1 原作はビクトル・ユーゴの長編小説だが、私が子供の頃読んだのは短縮版の「ああ無情」。それでもこ...
1 1 ほかの作品のレビューにみんな書いているのに、何故この作品について書いている人が一人しかいない...
2 0 一番の問題点は青島が出ていない事でしょう。 TV番組では『芸人が出ていればバラエティだから...
3 0 昔、 りんたろう監督によるアニメ「カムイの剣」があった。 「カムイの剣」…を観た人なら本作...
4 1 以前のアルバムを聴いていないのでなんとも言えないが、クラシックなメタルを聞いてきた耳には、と...

Finetuning Multilingual Model with IA3 + BitFit

In AutoMM, to enable efficient finetuning, just specify optimization.efficient_finetune to be "ia3_bias".

from autogluon.multimodal import MultiModalPredictor
import uuid

model_path = f"./tmp/{uuid.uuid4().hex}-multilingual_ia3"
predictor = MultiModalPredictor(label="label",
                                path=model_path)
predictor.fit(train_en_df,
              presets="multilingual",
              hyperparameters={
                  "optimization.efficient_finetune": "ia3_bias",
                  "optimization.lr_decay": 0.9,
                  "optimization.learning_rate": 3e-03,
                  "optimization.end_lr": 3e-03,
                  "optimization.max_epochs": 2,
                  "optimization.warmup_steps": 0,
                  "env.batch_size": 32,
              })
=================== System Info ===================
AutoGluon Version:  1.2b20241127
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count:          8
Pytorch Version:    2.5.1+cu124
CUDA Version:       12.4
Memory Avail:       28.43 GB / 30.95 GB (91.9%)
Disk Space Avail:   184.86 GB / 255.99 GB (72.2%)
===================================================
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values:  [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/efd14d7ba9894d22a5c32335ad68afc7-multilingual_ia3
    ```
Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: Tesla T4
GPU 0 Memory: 0.43GB/15.0GB (Used/Total)
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 278 M  | train
1 | validation_metric | BinaryAUROC                  | 0      | train
2 | loss_func         | CrossEntropyLoss             | 0      | train
---------------------------------------------------------------------------
122 K     Trainable params
278 M     Non-trainable params
278 M     Total params
1,112.955 Total estimated model params size (MB)
28        Modules in train mode
213       Modules in eval mode
Epoch 0, global step 12: 'val_roc_auc' reached 0.75150 (best 0.75150), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/efd14d7ba9894d22a5c32335ad68afc7-multilingual_ia3/epoch=0-step=12.ckpt' as top 1
Epoch 0, global step 25: 'val_roc_auc' reached 0.88195 (best 0.88195), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/efd14d7ba9894d22a5c32335ad68afc7-multilingual_ia3/epoch=0-step=25.ckpt' as top 1
Epoch 1, global step 37: 'val_roc_auc' reached 0.89321 (best 0.89321), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/efd14d7ba9894d22a5c32335ad68afc7-multilingual_ia3/epoch=1-step=37.ckpt' as top 1
Epoch 1, global step 50: 'val_roc_auc' reached 0.89626 (best 0.89626), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/efd14d7ba9894d22a5c32335ad68afc7-multilingual_ia3/epoch=1-step=50.ckpt' as top 1
`Trainer.fit` stopped: `max_epochs=2` reached.
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/tmp/efd14d7ba9894d22a5c32335ad68afc7-multilingual_ia3")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7fd4bec80a90>

Only a small fraction of the parameters are trainable: in the run above, about 122 K out of 278 M (well under 0.5%). Remarkably, the model trained purely on English data also performs well on the test sets, even the German and Japanese ones. It obtained results comparable to full finetuning, as shown in AutoMM for Text - Multilingual Problems.
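As a quick sanity check of that figure, the trainable fraction can be computed for any torch.nn.Module with the pattern below (shown on a generic module rather than reaching into the predictor's internals, which are not public API):

import torch.nn as nn

def trainable_fraction(module: nn.Module) -> float:
    # Ratio of parameters that will actually receive gradient updates
    n_trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in module.parameters())
    return n_trainable / n_total

# The numbers reported in the training log above:
print(f"{122_000 / 278_000_000:.3%}")  # ~0.044% of 278 M parameters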

score_in_en = predictor.evaluate(test_en_df)
score_in_de = predictor.evaluate(test_de_df)
score_in_jp = predictor.evaluate(test_jp_df)
print('Score in the English Testset:', score_in_en)
print('Score in the German Testset:', score_in_de)
print('Score in the Japanese Testset:', score_in_jp)
Score in the English Testset: {'roc_auc': 0.9409607074665862}
Score in the German Testset: {'roc_auc': 0.9071514423076923}
Score in the Japanese Testset: {'roc_auc': 0.8726342710997442}
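Beyond evaluate(), the fitted predictor can also score raw text directly. The sketch below uses predict() and predict_proba(); the example sentences are made up for illustration:

example_df = pd.DataFrame({"text": [
    "The book was a complete waste of time.",  # expect label 0
    "Dieses Album ist einfach fantastisch!",   # expect label 1
]})
print(predictor.predict(example_df))
print(predictor.predict_proba(example_df))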

Training FLAN-T5-XL on a Single GPU

By combining gradient checkpointing and parameter-efficient finetuning, it is feasible to finetune google/flan-t5-xl, which has close to two billion parameters, with a single T4 GPU available on AWS G4 instances. To turn on gradient checkpointing, you simply set "model.hf_text.gradient_checkpointing" to True. To accelerate the training, we downsample the number of training samples to 200.
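Before launching the full fit, here is a minimal torch.utils.checkpoint sketch of the recompute idea that the "model.hf_text.gradient_checkpointing" flag enables inside the HF backbone (a toy block, not AutoMM's code):

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU())
x = torch.randn(8, 256, requires_grad=True)

# Intermediate activations inside `block` are not kept for backward;
# they are recomputed on the fly, trading compute for lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()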

# Free up disk space used by the previous run
clear_cache()
shutil.rmtree(model_path)
from autogluon.multimodal import MultiModalPredictor

train_en_df_downsample = train_en_df.sample(200, random_state=123)

new_model_path = f"./tmp/{uuid.uuid4().hex}-multilingual_ia3_gradient_checkpoint"
predictor = MultiModalPredictor(label="label",
                                path=new_model_path)
predictor.fit(train_en_df_downsample,
              presets="multilingual",
              hyperparameters={
                  "model.hf_text.checkpoint_name": "google/flan-t5-xl",
                  "model.hf_text.gradient_checkpointing": True,
                  "model.hf_text.low_cpu_mem_usage": True,
                  "optimization.efficient_finetune": "ia3_bias",
                  "optimization.lr_decay": 0.9,
                  "optimization.learning_rate": 3e-03,
                  "optimization.end_lr": 3e-03,
                  "optimization.max_epochs": 1,
                  "optimization.warmup_steps": 0,
                  "env.batch_size": 1,
                  "env.eval_batch_size_ratio": 1
              })

Global seed set to 123
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 1.2 B 
1 | validation_metric | AUROC                        | 0     
2 | loss_func         | CrossEntropyLoss             | 0     
-------------------------------------------------------------------
203 K     Trainable params
1.2 B     Non-trainable params
1.2 B     Total params
4,894.913 Total estimated model params size (MB)
Epoch 0, global step 20: 'val_roc_auc' reached 0.88802 (best 0.88802), saving model to '/home/ubuntu/autogluon/docs/tutorials/multimodal/advanced_topics/multilingual_ia3_gradient_checkpoint/epoch=0-step=20.ckpt' as top 1
Epoch 0, global step 40: 'val_roc_auc' reached 0.94531 (best 0.94531), saving model to '/home/ubuntu/autogluon/docs/tutorials/multimodal/advanced_topics/multilingual_ia3_gradient_checkpoint/epoch=0-step=40.ckpt' as top 1
`Trainer.fit` stopped: `max_epochs=1` reached.

<autogluon.multimodal.predictor.MultiModalPredictor at 0x7fd58c4dbca0>
score_in_en = predictor.evaluate(test_en_df)
print('Score in the English Testset:', score_in_en)
Score in the English Testset: {'roc_auc': 0.931263189629183}
# Free up disk space used by this run
clear_cache()
shutil.rmtree(new_model_path)

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.