Faster Prediction with TensorRT¶
TensorRT, built on the NVIDIA CUDA® parallel programming model, enables us to optimize inference by leveraging libraries, development tools, and technologies in NVIDIA AI, autonomous machines, high-performance computing, and graphics. AutoGluon-MultiModal is now integrated with TensorRT via the predictor.optimize_for_inference() interface. This tutorial demonstrates how to leverage TensorRT to improve inference speed, which can help increase efficiency in deployment environments.
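At a high level, the integration boils down to three steps: train a predictor, reload it, and call optimize_for_inference() before predicting. The sketch below previews the workflow detailed in the rest of this tutorial (train_data, test_data, and the label column name are placeholders for your own data).
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor(label="your_label_column").fit(train_data)  # train as usual
predictor = MultiModalPredictor.load(predictor.path)  # reload a fresh copy to optimize
predictor.optimize_for_inference()  # swap in an onnxruntime/TensorRT-backed module
probabilities = predictor.predict_proba(test_data)  # predict as usual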
import os
import numpy as np
import time
import warnings
from IPython.display import clear_output
warnings.filterwarnings('ignore')
np.random.seed(123)
Install Required Packages¶
Since the tensorrt/onnx/onnxruntime-gpu packages are currently optional dependencies of autogluon.multimodal, we need to ensure these packages are correctly installed.
try:
    import tensorrt, onnx, onnxruntime
    print(f"tensorrt=={tensorrt.__version__}, onnx=={onnx.__version__}, onnxruntime=={onnxruntime.__version__}")
except ImportError:
    !pip install autogluon.multimodal[tests]
    !pip install -U "tensorrt>=10.0.0b0,<11.0"
    clear_output()
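Since TensorRT execution requires a CUDA-capable NVIDIA GPU, a quick sanity check can save debugging time later. This is a minimal sketch using torch, which autogluon.multimodal already depends on.
import torch
# TensorRT optimization will fail without a CUDA-capable GPU.
assert torch.cuda.is_available(), "No CUDA device found"
print(torch.cuda.get_device_name(0))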
Dataset¶
For demonstration, we use a simplified and subsampled version of PetFinder dataset. The task is to predict the animals’ adoption rates based on their adoption profile information. In this simplified version, the adoption speed is grouped into two categories: 0 (slow) and 1 (fast).
First, let's download and prepare the dataset.
download_dir = './ag_automm_tutorial'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_for_tutorial.zip'
from autogluon.core.utils.loaders import load_zip
load_zip.unzip(zip_file, unzip_dir=download_dir)
Downloading ./ag_automm_tutorial/file.zip from https://automl-mm-bench.s3.amazonaws.com/petfinder_for_tutorial.zip...
Next, we will load the CSV files.
import pandas as pd
dataset_path = download_dir + '/petfinder_for_tutorial'
train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
test_data = pd.read_csv(f'{dataset_path}/test.csv', index_col=0)
label_col = 'AdoptionSpeed'
We need to expand the image paths to load them in training.
image_col = 'Images'
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0]) # Use the first image for a quick tutorial
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])
def path_expander(path, base_folder):
    path_l = path.split(';')
    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])
train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
Each animal's adoption profile includes pictures, a text description, and various tabular features such as age, breed, name, and color.
Training¶
Now let's fit the predictor with the training data. Here we set a tight time budget for a quick demonstration.
from autogluon.multimodal import MultiModalPredictor
hyperparameters = {
    "optimization.max_epochs": 2,
    "model.names": ["numerical_mlp", "categorical_mlp", "timm_image", "hf_text", "fusion_mlp"],
    "model.timm_image.checkpoint_name": "mobilenetv3_small_100",
    "model.hf_text.checkpoint_name": "google/electra-small-discriminator",
}
predictor = MultiModalPredictor(label=label_col).fit(
    train_data=train_data,
    hyperparameters=hyperparameters,
    time_limit=120, # seconds
)
clear_output()
Under the hood, AutoMM automatically infers the problem type (classification or regression), detects the data modalities, selects the related models from the multimodal model pool, and trains the selected models. If multiple backbones are available, AutoMM appends a late-fusion model (MLP or transformer) on top of them.
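For example, we can inspect what AutoMM inferred for this dataset via the public problem_type and class_labels attributes (the values in the comments are what one would expect for this binary task, not captured output):
# Check the automatically inferred problem type and class labels.
print(predictor.problem_type)  # expected: 'binary'
print(predictor.class_labels)  # expected: the two adoption-speed classes, 0 and 1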
Prediction with Default PyTorch Module¶
Given a multimodal dataframe without the label column, we can predict the labels.
Note that we will use a small sample of the test data here for benchmarking. Later, we will evaluate on the whole test dataset to assess any accuracy loss.
batch_size = 2
n_trials = 10
sample = test_data.head(batch_size)
# Use first prediction for initialization (e.g., allocating memory)
y_pred = predictor.predict_proba(sample)
pred_time = []
for _ in range(n_trials):
    tic = time.time()
    y_pred = predictor.predict_proba(sample)
    elapsed = time.time() - tic
    pred_time.append(elapsed)
    print(f"elapsed (pytorch): {elapsed*1000:.1f} ms (batch_size={batch_size})")
elapsed (pytorch): 365.4 ms (batch_size=2)
elapsed (pytorch): 372.4 ms (batch_size=2)
elapsed (pytorch): 384.0 ms (batch_size=2)
elapsed (pytorch): 369.6 ms (batch_size=2)
elapsed (pytorch): 351.4 ms (batch_size=2)
elapsed (pytorch): 373.6 ms (batch_size=2)
elapsed (pytorch): 377.3 ms (batch_size=2)
elapsed (pytorch): 358.3 ms (batch_size=2)
elapsed (pytorch): 359.3 ms (batch_size=2)
elapsed (pytorch): 363.4 ms (batch_size=2)
Prediction with TensorRT Module¶
First, let's load a new predictor and optimize it for inference.
model_path = predictor.path
trt_predictor = MultiModalPredictor.load(path=model_path)
trt_predictor.optimize_for_inference()
# Again, use first prediction for initialization (e.g., allocating memory)
y_pred_trt = trt_predictor.predict_proba(sample)
clear_output()
Load pretrained checkpoint: /home/ci/autogluon/docs/tutorials/multimodal/advanced_topics/AutogluonModels/ag-20241127_104114/model.ckpt
Under the hood, optimize_for_inference() generates an onnxruntime-based module that can be used as a drop-in replacement for torch.nn.Module. It replaces the internal torch-based module predictor._model for optimized inference.
Warning
The function optimize_for_inference() modifies the internal model definition for inference only. Calling predictor.fit() after this will result in an error.
It is recommended to reload the model with MultiModalPredictor.load in order to refit the model.
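As a quick sanity check (not a stable API, since _model is an internal attribute), we can confirm that the internal torch module was indeed swapped out:
# After optimize_for_inference(), predictor._model is replaced by an
# onnxruntime-backed module rather than the original torch.nn.Module.
print(type(trt_predictor._model).__name__)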
Then, we can perform prediction or extract embeddings as usual (an embedding example follows the timing loop below). For a fair comparison of inference speed, here we run prediction multiple times.
pred_time_trt = []
for _ in range(n_trials):
    tic = time.time()
    y_pred_trt = trt_predictor.predict_proba(sample)
    elapsed = time.time() - tic
    pred_time_trt.append(elapsed)
    print(f"elapsed (tensorrt): {elapsed*1000:.1f} ms (batch_size={batch_size})")
To verify the correctness of the predictions, we can compare the results side by side.
Let's take a look at the expected results and the TensorRT results.
y_pred, y_pred_trt
Since we use mixed precision (FP16) by default, there can be some loss of accuracy. We can see that the probabilities are very close, and in most cases it is safe to assume the results are comparable. For more details, refer to the reduced-precision section of the TensorRT Developer Guide.
np.testing.assert_allclose(y_pred, y_pred_trt, atol=0.01)
Visualize Inference Speed¶
We can compute the inference speed by dividing the batch size by the average prediction time.
infer_speed = batch_size/np.mean(pred_time)
infer_speed_trt = batch_size/np.mean(pred_time_trt)
Then, let's visualize the speed improvement.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.set_figheight(1.5)
ax.barh(["PyTorch", "TensorRT"], [infer_speed, infer_speed_trt])
ax.annotate(f"{infer_speed:.1f} rows/s", xy=(infer_speed, 0))
ax.annotate(f"{infer_speed_trt:.1f} rows/s", xy=(infer_speed_trt, 1))
_ = plt.xlabel('Inference Speed (rows per second)')
Compare Evaluation Metrics¶
Now that we can achieve better inference speed with optimize_for_inference(), does it affect the underlying accuracy?
Let's start with an evaluation over the whole test dataset.
metric = predictor.evaluate(test_data)
metric_trt = trt_predictor.evaluate(test_data)
clear_output()
metric_df = pd.DataFrame.from_dict({"PyTorch": metric, "TensorRT": metric_trt})
metric_df
The evaluation results are expected to be very close.
If there is any significant gap between the evaluation results, try disabling mixed precision by using the CUDA execution provider:
predictor.optimize_for_inference(providers=["CUDAExecutionProvider"])
See Execution Providers for the full list of providers.
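To check which providers your local onnxruntime build actually supports before passing one in, query it directly with get_available_providers():
import onnxruntime
# e.g., ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
print(onnxruntime.get_available_providers())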
Other Examples¶
You may go to AutoMM Examples to explore other examples of AutoMM.
Customization¶
To learn how to customize AutoMM, please refer to Customize AutoMM.