TensorRT-LLM Build Workflow

Overview

The build workflow contains two major steps.

1. Create a TensorRT-LLM model from an existing model checkpoint exported by a training framework.
2. Build the TensorRT-LLM model into a TensorRT-LLM engine.

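To generalize the TensorRT-LLM optimization features to all models, and to share the same workflow across different models for TensorRT-LLM users, TensorRT-LLM has conventions for how models are defined and how they are imported.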

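The TensorRT-LLM checkpoint convention is documented in the TensorRT-LLM Checkpoint document, and all decoder-only models have been migrated to adopt it. Model-specific convert_checkpoint.py scripts are shipped as source code in the examples directory, and a trtllm-build CLI tool has been added. However, shipping the checkpoint conversion scripts as examples, outside the core TensorRT-LLM library, has some disadvantages: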

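- TensorRT-LLM evolves quickly, and a model's definition code may change to improve performance, which can leave convert_checkpoint.py out of date.
- TensorRT-LLM is creating a new set of high-level APIs that handle model conversion, engine building, and inference in one class for easier use. The high-level APIs therefore need to call the weight conversion code, which should be part of the TensorRT-LLM core library rather than the examples. The conversion code for different models should also share the same interface, so that the high-level APIs do not need model-specific code for each model.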

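To mitigate these issues, the model-specific convert_checkpoint.py scripts are being refactored. Most of the conversion code will move into the core library, next to the model definition; see tensorrt_llm/models/llama/ for an example. There is now a new set of APIs for importing models and converting weights. The 0.9 release refactored the LLaMA model class to adopt the new APIs, and refactoring of the other models is in progress.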

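Conversion APIs

The weight conversion API for the LLaMA model looks like the following. A TopModelMixin class is introduced that declares the from_hugging_face() interface. The LLaMAForCausalLM class inherits TopModelMixin (not as a direct parent, but in its base class hierarchy) and implements the interface.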

from typing import Optional

from tensorrt_llm.mapping import Mapping


class TopModelMixin:
    @classmethod
    def from_hugging_face(cls,
                          hf_model_dir: str,
                          dtype: Optional[str] = 'float16',
                          mapping: Optional[Mapping] = None,
                          **kwargs):
        raise NotImplementedError("Subclass shall override this")


# TopModelMixin is part of the base class hierarchy
class LLaMAForCausalLM(DecoderModelForCausalLM):
    @classmethod
    def from_hugging_face(cls,
                          hf_model_dir,
                          dtype='float16',
                          mapping: Optional[Mapping] = None) -> 'LLaMAForCausalLM':
        # 1. Create a TensorRT-LLM LLaMA model object
        # 2. Convert the Hugging Face checkpoint to the weights dict TensorRT-LLM expects
        # 3. Load the weights into the LLaMA model object
        ...

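The convert_checkpoint.py script in the examples/llama/ directory of the GitHub repository can then be simplified significantly. Even if the model definition code of the TensorRT-LLM LLaMA class changes for some reason, the from_hugging_face API stays the same, so existing workflows that use this interface are not affected.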

# Other arguments are omitted here for simplicity.
llama = LLaMAForCausalLM.from_hugging_face(model_dir, dtype, mapping=mapping)
llama.save_checkpoint(output_dir, save_config=(rank==0))

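The from_hugging_face API deliberately does not save the checkpoint to disk; it returns an in-memory object instead. Call save_checkpoint to save the model. This keeps the flow flexible and makes a convert-then-build flow within a single process faster. For large models, saving to and loading from disk is typically slow and should be avoided where possible.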
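
The interface pattern described above can be illustrated with a minimal, self-contained sketch. It does not use TensorRT-LLM itself: ToyTopModelMixin and ToyLlama are hypothetical stand-ins, and the "conversion" step just fabricates a tiny weights dict. The point is only that the factory method returns an in-memory object and writes nothing to disk.

```python
from typing import Dict, List, Optional


class ToyTopModelMixin:
    """Hypothetical stand-in for TopModelMixin: it only declares the interface."""

    @classmethod
    def from_hugging_face(cls, hf_model_dir: str,
                          dtype: Optional[str] = "float16", **kwargs):
        raise NotImplementedError("Subclass shall override this")


class ToyLlama(ToyTopModelMixin):
    """Hypothetical model class; real weights are replaced by a small dict."""

    def __init__(self, dtype: str):
        self.dtype = dtype
        self.weights: Dict[str, List[float]] = {}

    def load(self, weights: Dict[str, List[float]]) -> None:
        self.weights = dict(weights)

    @classmethod
    def from_hugging_face(cls, hf_model_dir: str,
                          dtype: Optional[str] = "float16", **kwargs):
        model = cls(dtype)                          # create the model object
        weights = {"lm_head.weight": [0.0, 1.0]}    # pretend to convert the checkpoint
        model.load(weights)                         # load weights into the object
        return model                                # nothing is written to disk


# The path is a placeholder; the toy class never reads it.
llama = ToyLlama.from_hugging_face("/path/to/hf_model", dtype="bfloat16")
```

Because the caller receives an object rather than a directory, it can decide afterwards whether to save a checkpoint or go straight to building.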

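Because LLaMA models are also released in other formats, such as the Meta checkpoint, the LLaMAForCausalLM class provides a from_meta_ckpt function for that format. This function is not declared in the TopModelMixin class because it is LLaMA-specific and other models do not use it.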

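In the 0.9 release, only LLaMA was refactored. Because popular LLaMA models (and their variants) are released in the Hugging Face and Meta checkpoint formats, only these two functions were implemented.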

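Future releases may add from_jax, from_nemo, from_keras, or other factory methods for different training checkpoints. For example, the Gemma 2B model and the convert_checkpoint.py file in the examples/gemma directory support the JAX and Keras formats in addition to the Hugging Face format. Model developers can choose to implement any subset of these factory methods for the models they contribute to TensorRT-LLM.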

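For formats that the TensorRT-LLM model developers do not support, you are still free to implement your own weight conversion outside the core library. The flow looks like this: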

config = read_config_from_the_custom_training_checkpoint(model_dir)
llama = LLaMAForCausalLM(config)

# option 1:
# Create a weights dict and then call LLaMAForCausalLM.load
weights_dict = convert_weights_from_custom_training_checkpoint(model_dir)
llama.load(weights_dict)

# option 2:
# Internally assign the model parameters directly
convert_and_load_weights_into_trtllm_llama(llama, model_dir)
# Use the llama object as usual, to save the checkpoint or build engines

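This kind of custom weight loading has limitations and pitfalls. If the model definition lives inside the TensorRT-LLM core library while the weight loading and conversion live outside it, the conversion code may need to be updated when a new version of TensorRT-LLM is released.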
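
One way to notice such drift early is to compare the names produced by the custom converter with the names the model definition expects before loading. The following is a minimal sketch under stated assumptions: diff_weight_names is a hypothetical helper, not a TensorRT-LLM API, and the parameter names are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def diff_weight_names(expected: Iterable[str],
                      provided: Dict[str, object]) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) parameter names, each sorted."""
    expected_set = set(expected)
    provided_set = set(provided)
    missing = sorted(expected_set - provided_set)
    unexpected = sorted(provided_set - expected_set)
    return missing, unexpected


# Hypothetical names: what the model definition expects...
expected_names = [
    "transformer.layers.0.attention.qkv.weight",
    "lm_head.weight",
]
# ...and what an out-of-date custom converter produced.
converted = {
    "transformer.layers.0.attention.qkv.weight": [0.1],
    "lm_head.weights": [0.2],  # renamed parameter the converter did not track
}

missing, unexpected = diff_weight_names(expected_names, converted)
```

A non-empty result in either list signals that the external conversion code and the installed model definition have diverged.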

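Quantization APIs

TensorRT-LLM relies on the NVIDIA Modelopt toolkit to support some quantization methods, such as FP8, W4A16_AWQ, and W4A8_AWQ. It also has its own quantization implementations for SmoothQuant, INT8 KV cache, and INT4/INT8 weight-only.

In the TensorRT-LLM 0.8 release: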

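- For quantization algorithms supported by Modelopt, a standalone script, examples/quantization/quantize.py, exports TensorRT-LLM checkpoints; the trtllm-build command must then be run to build the checkpoints into engines.
- For non-Modelopt quantization algorithms, users need to use the per-model convert_checkpoint.py scripts to export TensorRT-LLM checkpoints.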

The quantize() interface unifies these different quantization flows. A default implementation has been added to the PretrainedModel class.

class PretrainedModel:
    @classmethod
    def quantize(
        cls,
        hf_model_dir,
        output_dir,
        quant_config: QuantConfig,
        mapping: Optional[Mapping] = None):  # some args are omitted here
        # Internally quantize the given Hugging Face model using Modelopt
        # and save the checkpoint to output_dir
        ...

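The default implementation handles only the quantization supported by Modelopt. The LLaMA class inherits from PretrainedModel and dispatches Modelopt quantization to the superclass's default implementation.
If Modelopt does not yet support a new model, the model developer raises an error in the subclass implementation.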

# PretrainedModel is part of the base class hierarchy
class LLaMAForCausalLM(DecoderModelForCausalLM):
    @classmethod
    def quantize(
        cls,
        hf_model_dir,
        output_dir,
        quant_config: QuantConfig,
        mapping: Optional[Mapping] = None):  # some args are omitted here
        use_modelopt_quantization = ...  # decide whether to use Modelopt or native quantization
        if use_modelopt_quantization:
            super().quantize(hf_model_dir,
                             output_dir,
                             quant_config)
        else:
            # handle TensorRT-LLM native, model-specific quantization here,
            # or raise an exception if the algorithm is not supported
            raise NotImplementedError("Unsupported quantization algorithm")
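
The dispatch logic above can be exercised with a small runnable sketch. The classes below are hypothetical stand-ins rather than TensorRT-LLM classes, the algorithm names are plain strings, and the returned strings merely record which path was taken; a real implementation would write a checkpoint to output_dir.

```python
# Algorithms the text attributes to Modelopt; the native set is an illustrative subset.
MODELOPT_ALGOS = {"FP8", "W4A16_AWQ", "W4A8_AWQ"}
NATIVE_ALGOS = {"SMOOTH_QUANT", "INT8_KV_CACHE", "WEIGHT_ONLY"}


class ToyPretrainedModel:
    """Hypothetical base class whose default handles Modelopt algorithms only."""

    @classmethod
    def quantize(cls, hf_model_dir: str, output_dir: str, quant_algo: str) -> str:
        if quant_algo not in MODELOPT_ALGOS:
            raise NotImplementedError(f"{quant_algo} is not handled by the default path")
        return f"modelopt:{quant_algo}"


class ToyQuantLlama(ToyPretrainedModel):
    """Hypothetical subclass that adds native algorithms on top of the default."""

    @classmethod
    def quantize(cls, hf_model_dir: str, output_dir: str, quant_algo: str) -> str:
        if quant_algo in MODELOPT_ALGOS:
            # Delegate to the superclass default implementation.
            return super().quantize(hf_model_dir, output_dir, quant_algo)
        if quant_algo in NATIVE_ALGOS:
            return f"native:{quant_algo}"
        raise NotImplementedError(f"{quant_algo} is not supported for this model")


modelopt_path = ToyQuantLlama.quantize("hf_dir", "out_dir", "FP8")
native_path = ToyQuantLlama.quantize("hf_dir", "out_dir", "SMOOTH_QUANT")
```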

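The quantize API is designed to use multi-GPU resources internally for quantization. For example, a LLaMA 70B BF16 model takes 140 GB of memory, and FP8 quantization takes another 70 GB. At least 210 GB of memory is therefore required, and 4 A100 (or H100) GPUs are needed to quantize the LLaMA 70B model. If you call the quantize API from an MPI program, take care that it is called by rank 0 only.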
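
The memory figures above follow from bytes per parameter: 2 bytes for BF16 and 1 byte for FP8. The sketch below reproduces the arithmetic. The 80 GB per-GPU capacity is an assumption made here for illustration and is not stated in the text, and the result counts weights only.

```python
import math

# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 10**9


num_params = 70 * 10**9                              # LLaMA 70B
source_gb = weight_memory_gb(num_params, "bf16")     # original BF16 weights
quant_gb = weight_memory_gb(num_params, "fp8")       # additional FP8 copy
total_gb = source_gb + quant_gb

gpu_memory_gb = 80                                   # assumed capacity per GPU
min_gpus_for_weights = math.ceil(total_gb / gpu_memory_gb)
```

Under the 80 GB assumption, the weights alone set a lower bound of three devices. The figure of four GPUs given above sits above that bound, which would be consistent with needing headroom beyond the weights, although the text does not break the number down.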

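Using the quantize API in an MPI program looks like the following; only the rank 0 process calls it. In a non-MPI program, the if rank == 0 check and mpi_barrier() are not needed.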

quant_config = QuantConfig()
quant_config.quant_algo = quant_mode.W4A16_AWQ
mapping = Mapping(world_size=tp_size, tp_size=tp_size)
if rank == 0:
    LLaMAForCausalLM.quantize(hf_model_dir,
                          checkpoint_dir,
                          quant_config=quant_config)
mpi_barrier()  # wait for rank 0 to finish the quantization
llama = LLaMAForCausalLM.from_checkpoint(checkpoint_dir, rank)
engine = build(llama, build_config)
engine.save(engine_dir)
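
The "rank 0 works, everyone waits at a barrier, then everyone loads" pattern can be tried without MPI or GPUs. The sketch below simulates ranks with threads and uses threading.Barrier in place of mpi_barrier(); the dictionary standing in for the checkpoint directory and all names are illustrative assumptions.

```python
import threading
from typing import Dict, List

WORLD_SIZE = 4
barrier = threading.Barrier(WORLD_SIZE)   # stand-in for mpi_barrier()
checkpoint_store: Dict[str, str] = {}     # stand-in for the checkpoint directory
quantize_calls: List[int] = []            # records which ranks ran quantization
loaded: Dict[int, str] = {}               # what each rank read after the barrier


def worker(rank: int) -> None:
    if rank == 0:
        # Only rank 0 performs the expensive quantization step.
        quantize_calls.append(rank)
        checkpoint_store["checkpoint"] = "quantized"
    # Every rank waits here until rank 0 has finished writing.
    barrier.wait()
    # After the barrier, each rank can safely load the checkpoint.
    loaded[rank] = checkpoint_store["checkpoint"]


threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the barrier, a rank other than 0 could try to read the checkpoint before it exists, which is the failure the barrier in the MPI flow guards against.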

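examples/quantization/quantize.py is kept for backward compatibility.

Build APIs

The tensorrt_llm.build API builds a TensorRT-LLM model object into a TensorRT-LLM engine. This new API replaces the older flow of creating a builder, creating a network object, tracing the model into the network, and building the TensorRT engine. The API is used as follows: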

llama = ... # create LLaMAForCausalLM object
build_config = BuildConfig(max_batch_size=1)
engine = tensorrt_llm.build(llama, build_config)
engine.save(engine_dir)

The llama object can be created by any of the methods described in the conversion APIs or quantization APIs sections.

The trtllm-build CLI tool is a thin wrapper around the tensorrt_llm.build API. The CLI tool's flags are kept close to the fields of the BuildConfig class.
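
A thin CLI wrapper whose flags track a configuration class can be sketched by generating the flags from the class's fields. This is an illustration of the design idea only, not how trtllm-build is implemented: ToyBuildConfig is a hypothetical dataclass, max_batch_size is the one field taken from the text, and max_input_len is included purely as an example of a second field.

```python
import argparse
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ToyBuildConfig:
    """Hypothetical stand-in for a build configuration class."""
    max_batch_size: int = 1
    max_input_len: int = 1024   # illustrative second field


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="toy-build")
    # One flag per config field, so the CLI and the class cannot drift apart.
    for f in fields(ToyBuildConfig):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser


def config_from_args(argv: List[str]) -> ToyBuildConfig:
    args = make_parser().parse_args(argv)
    return ToyBuildConfig(**vars(args))


config = config_from_args(["--max_batch_size", "8"])
```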

If a model is saved to disk and built into an engine later, TensorRT-LLM provides a from_checkpoint API to deserialize the checkpoint.

# TensorRT-LLM code
class PretrainedModel:
    @classmethod
    def from_checkpoint(cls,
                        ckpt_dir: str,
                        rank: int = 0,
                        config: Optional[PretrainedConfig] = None):
        # Internally load the model weights from the given checkpoint directory
        ...

Call the from_checkpoint API to deserialize the checkpoint into a model object. The tensorrt_llm.build API can then be called to build the engine.

llama = LLaMAForCausalLM.from_checkpoint(checkpoint_dir)
engine = build(llama, build_config)
engine.save(engine_dir)
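
The save-then-load round trip can be sketched with a toy class. Everything here is a simplified assumption: ToyModel is hypothetical, weights are stored as JSON, and the file names are chosen for this sketch rather than taken from the actual checkpoint format. The save_config flag mirrors the save_config=(rank==0) usage shown earlier, so that only one rank writes the shared configuration.

```python
import json
import os
import tempfile
from typing import Dict, List


class ToyModel:
    """Hypothetical model holding a config dict and per-rank weights."""

    def __init__(self, config: Dict[str, str], weights: Dict[str, List[float]]):
        self.config = config
        self.weights = weights

    def save_checkpoint(self, output_dir: str, rank: int = 0,
                        save_config: bool = True) -> None:
        os.makedirs(output_dir, exist_ok=True)
        if save_config:
            # The shared config is written once, typically by rank 0 only.
            with open(os.path.join(output_dir, "config.json"), "w") as f:
                json.dump(self.config, f)
        # Each rank writes its own weights file.
        with open(os.path.join(output_dir, f"rank{rank}.json"), "w") as f:
            json.dump(self.weights, f)

    @classmethod
    def from_checkpoint(cls, ckpt_dir: str, rank: int = 0) -> "ToyModel":
        with open(os.path.join(ckpt_dir, "config.json")) as f:
            config = json.load(f)
        with open(os.path.join(ckpt_dir, f"rank{rank}.json")) as f:
            weights = json.load(f)
        return cls(config, weights)


model = ToyModel({"dtype": "float16"}, {"lm_head.weight": [0.5, 1.5]})
with tempfile.TemporaryDirectory() as ckpt_dir:
    model.save_checkpoint(ckpt_dir, rank=0, save_config=True)
    restored = ToyModel.from_checkpoint(ckpt_dir, rank=0)
```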

CLI Tools

All of the weight conversion, quantization, and build APIs described above have corresponding CLI tools for convenience.

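- Model-specific convert_checkpoint.py scripts are located in the examples/<model xxx>/ folders.
- A unified quantization script is located at examples/quantization/quantize.py and can be shared by all supported models.
- A trtllm-build CLI tool builds all models from TensorRT-LLM checkpoints.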

Note the following considerations when using the CLI tools:

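- These scripts and tools are meant to be used for scripting. Do not import the Python functions or classes defined in them. TensorRT-LLM does not guarantee that the contents of these scripts remain compatible with previous versions. The options of these tools may also change when that cannot be avoided.
- The scripts in the examples folder may use internal or unstable TensorRT-LLM APIs, and may not work if the version of the examples does not match the installed version of TensorRT-LLM. Several GitHub issues were caused by such version mismatches:
  - https://github.com/NVIDIA/TensorRT-LLM/issues/1293
  - https://github.com/NVIDIA/TensorRT-LLM/issues/1252
  - https://github.com/NVIDIA/TensorRT-LLM/issues/1079
- You should always install the same TensorRT-LLM version that is specified in examples/<model xxx>/requirements.txt.
- In the future, the per-model conversion scripts may or may not be unified into a single script shared by all models, given that different models can have different attributes. However, the TensorRT-LLM team will try to keep the flags for the same feature consistent across scripts.
- The TensorRT-LLM team encourages use of the new low-level conversion, quantization, and build APIs instead of these scripts. The conversion APIs will be added model by model, which may take several releases.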