TensorRT-LLM Deployment

Note

Please read the TensorRT-LLM checkpoint workflow before going through this section.

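The ModelOpt toolkit supports automatic conversion of ModelOpt-exported LLMs into TensorRT-LLM checkpoints and then into engines for accelerated inference.

The conversion happens in two steps:

Convert checkpoints exported from Huggingface, NeMo, or ModelOpt to the TensorRT-LLM checkpoint format.
Build the TensorRT-LLM engine from the TensorRT-LLM checkpoint.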

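Export Quantized Model

After a model is quantized, it can be exported to the TensorRT-LLM checkpoint format, which is stored as:

A single JSON file (config.json) that records the model structure and metadata.
A group of safetensors files, each recording the local calibrated model for a single GPU rank (the model weights and scaling factors for that GPU).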
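To make the on-disk layout concrete, the following is a minimal sketch for inspecting an exported directory. It assumes only what is described above: a `config.json` file plus one or more `*.safetensors` files in the export directory. The helper name `describe_checkpoint` is hypothetical and not part of ModelOpt or TensorRT-LLM.

```python
import json
from pathlib import Path


def describe_checkpoint(export_dir):
    """Return the parsed config.json and the sorted safetensors file names.

    Hypothetical helper: it only reads the directory layout described above
    and does not load any tensor data.
    """
    root = Path(export_dir)
    with open(root / "config.json", encoding="utf-8") as f:
        config = json.load(f)
    # One safetensors file is expected per GPU rank.
    shards = sorted(p.name for p in root.glob("*.safetensors"))
    return config, shards
```

Calling `describe_checkpoint(export_dir)` after an export gives a quick view of the recorded metadata and of how many per-rank weight files were written.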

The export API (export_tensorrt_llm_checkpoint) can be used as follows:

import torch

from modelopt.torch.export import export_tensorrt_llm_checkpoint

# Run the export without tracking gradients.
with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model as str, e.g. gpt, gptj, llama.
        dtype,  # The weights data type used to export the unquantized layers.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel,  # The number of GPUs used at inference time for tensor parallelism.
        inference_pipeline_parallel,  # The number of GPUs used at inference time for pipeline parallelism.
    )

If the export_tensorrt_llm_checkpoint call succeeds, the TensorRT-LLM checkpoint is saved. Otherwise, for example when the decoder_type is not supported, a torch state_dict checkpoint is saved instead.
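Because the export can silently fall back to a torch state_dict, it may be worth checking the output directory before moving on. The sketch below is a heuristic, not an official check: the helper name `looks_like_trtllm_checkpoint` is hypothetical, and it assumes that a successful export writes `config.json` plus exactly one safetensors file per GPU rank, with the rank count equal to tensor parallel size times pipeline parallel size.

```python
from pathlib import Path


def looks_like_trtllm_checkpoint(export_dir, tensor_parallel=1, pipeline_parallel=1):
    """Heuristically check that export_dir holds a TensorRT-LLM checkpoint.

    Assumption: one safetensors file per GPU rank, where the number of
    ranks is tensor_parallel * pipeline_parallel.
    """
    root = Path(export_dir)
    expected_ranks = tensor_parallel * pipeline_parallel
    shards = list(root.glob("*.safetensors"))
    return (root / "config.json").is_file() and len(shards) == expected_ranks
```

If this returns False after an export, the directory most likely contains the fallback state_dict rather than a TensorRT-LLM checkpoint.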

Model support matrix for the TensorRT-LLM checkpoint export
Model / Quantization	FP16 / BF16	FP8	INT8_SQ	INT4_AWQ
GPT2	Yes	Yes	Yes	No
GPTJ	Yes	Yes	Yes	Yes
LLAMA 2	Yes	Yes	Yes	Yes
LLAMA 3	Yes	Yes	No	Yes
Mistral	Yes	Yes	Yes	Yes
Mixtral 8x7B	Yes	Yes	No	Yes
Falcon 40B, 180B	Yes	Yes	Yes	Yes
Falcon 7B	Yes	Yes	Yes	No
MPT 7B, 30B	Yes	Yes	Yes	Yes
Baichuan 1, 2	Yes	Yes	Yes	Yes
ChatGLM2, 3 6B	Yes	No	No	Yes
Bloom	Yes	Yes	Yes	Yes
Phi-1, 2, 3	Yes	Yes	Yes	Yes
Nemotron 8	Yes	Yes	No	Yes
Gemma 2B, 7B	Yes	Yes	No	Yes
RecurrentGemma	Yes	Yes	Yes	Yes
StarCoder 2	Yes	Yes	Yes	Yes
Qwen-1, 1.5	Yes	Yes	Yes	Yes
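The matrix above can also be encoded as a small lookup so that a script can reject an unsupported model and quantization pair before starting a long export. This is only an illustrative transcription of the table: the names `QUANT_FORMATS`, `SUPPORT_ROWS`, and `is_supported` are hypothetical and are not provided by ModelOpt.

```python
# Column order follows the support matrix above.
QUANT_FORMATS = ("FP16 / BF16", "FP8", "INT8_SQ", "INT4_AWQ")

# One flag per column: "Y" = supported, "N" = not supported.
SUPPORT_ROWS = {
    "GPT2": "YYYN",
    "GPTJ": "YYYY",
    "LLAMA 2": "YYYY",
    "LLAMA 3": "YYNY",
    "Mistral": "YYYY",
    "Mixtral 8x7B": "YYNY",
    "Falcon 40B, 180B": "YYYY",
    "Falcon 7B": "YYYN",
    "MPT 7B, 30B": "YYYY",
    "Baichuan 1, 2": "YYYY",
    "ChatGLM2, 3 6B": "YNNY",
    "Bloom": "YYYY",
    "Phi-1, 2, 3": "YYYY",
    "Nemotron 8": "YYNY",
    "Gemma 2B, 7B": "YYNY",
    "RecurrentGemma": "YYYY",
    "StarCoder 2": "YYYY",
    "Qwen-1, 1.5": "YYYY",
}


def is_supported(model, quant_format):
    """Look up one cell of the support matrix; raises on unknown names."""
    flags = SUPPORT_ROWS[model]
    return flags[QUANT_FORMATS.index(quant_format)] == "Y"
```

For example, `is_supported("GPT2", "INT4_AWQ")` reflects the "No" entry in the GPT2 row, while `is_supported("LLAMA 3", "FP8")` reflects a "Yes".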

Convert to TensorRT-LLM

Once the TensorRT-LLM checkpoint is available, follow the TensorRT-LLM build API to build and deploy the quantized LLM.
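As a rough illustration of the build step, the sketch below assembles a command line for the `trtllm-build` tool that ships with TensorRT-LLM. It assumes the tool is installed and accepts the `--checkpoint_dir` and `--output_dir` options; the directory names and the helper `trtllm_build_command` are hypothetical, and the available options should be confirmed against the installed TensorRT-LLM version.

```python
import shlex


def trtllm_build_command(checkpoint_dir, engine_dir):
    """Assemble (but do not run) a trtllm-build invocation.

    Assumption: trtllm-build accepts --checkpoint_dir and --output_dir.
    """
    return [
        "trtllm-build",
        "--checkpoint_dir", str(checkpoint_dir),
        "--output_dir", str(engine_dir),
    ]


# Hypothetical paths, for illustration only.
command = trtllm_build_command("./trtllm_ckpt", "./trtllm_engine")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where TensorRT-LLM is installed.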