Model config export

Code that exports optimized models to TensorRT-LLM checkpoints.

Functions

`export_tensorrt_llm_checkpoint`	Exports the torch model to a TensorRT-LLM checkpoint and saves it to export_dir.
`torch_to_tensorrt_llm_checkpoint`	Converts the torch model to a TensorRT-LLM checkpoint for each GPU rank.

export_tensorrt_llm_checkpoint(model, decoder_type, dtype=None, export_dir='/tmp', inference_tensor_parallel=0, inference_pipeline_parallel=1, naive_fp8_quantization=False, use_nfs_workspace=False)

Exports the torch model to a TensorRT-LLM checkpoint and saves it to export_dir.

Parameters:

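model (Module) – The torch model.
decoder_type (str) – The type of the decoder, e.g. gpt, gptj, llama.
dtype (dtype | None) – The weights data type used to export the unquantized layers, or the default model data type if None.
export_dir (Path | str) – The target export path.
inference_tensor_parallel (int) – The target inference-time tensor parallelism. The calibration tensor parallelism is merged or split to match it. Defaults to 0, which keeps the calibration tensor parallelism without a manually configured merge or split.
inference_pipeline_parallel (int) – The target inference-time pipeline parallelism. The calibration pipeline parallelism is merged or split to match it. Defaults to 1, meaning no pipeline parallelism.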
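naive_fp8_quantization (bool) – Quantizes the model naively to FP8 without calibration. All scaling factors are set to 1.
use_nfs_workspace (bool) – If True, an NFS workspace is created under export_dir and used as shared memory for cross-process/node communication.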

For tensorrt_llm deployment, the representation is saved under export_dir. The model_config is saved as two files:

.json: The nested dict that maps to the PretrainedConfig in TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/modeling_utils.py

.safetensors: The file holding the list of weights, stored in safetensors format. Unique to each rank.
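
The call and its output layout can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the function is importable from modelopt.torch.export, and the helpers export_checkpoint and list_checkpoint_files are hypothetical names introduced here. The import is done lazily so the file-listing helper also works where the library is not installed.

```python
from pathlib import Path
from typing import Dict, List


def export_checkpoint(model, decoder_type: str, export_dir: str) -> None:
    """Export `model` using the default parallelism values documented above."""
    # Assumption: the function is exposed under modelopt.torch.export.
    # Imported lazily so the rest of this sketch runs without the library.
    from modelopt.torch.export import export_tensorrt_llm_checkpoint

    export_tensorrt_llm_checkpoint(
        model,
        decoder_type,                   # e.g. "llama"
        export_dir=export_dir,
        inference_tensor_parallel=0,    # documented default
        inference_pipeline_parallel=1,  # documented default: no pipeline parallelism
    )


def list_checkpoint_files(export_dir: str) -> Dict[str, List[str]]:
    """Group the files in export_dir by the two documented suffixes."""
    root = Path(export_dir)
    return {
        "json": sorted(p.name for p in root.glob("*.json")),
        "safetensors": sorted(p.name for p in root.glob("*.safetensors")),
    }
```

After an export, list_checkpoint_files(export_dir) gives a quick check that the .json config and the per-rank .safetensors weight files were written. The exact file names are not specified here, so the helper matches on suffix only.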

torch_to_tensorrt_llm_checkpoint(model, decoder_type, dtype=None, inference_tensor_parallel=0, inference_pipeline_parallel=1, naive_fp8_quantization=False, workspace_path=None)

Converts the torch model to a TensorRT-LLM checkpoint for each GPU rank.

The TensorRT-LLM checkpoint is the LLM model format that the TensorRT-LLM build API consumes during engine building. https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md

Parameters:

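model (Module) – The torch model.
decoder_type (str) – The type of the decoder, e.g. gpt, gptj, llama.
dtype (dtype | None) – The weights data type used to export the unquantized layers, or the default model data type if None.
inference_tensor_parallel (int) – The target inference-time tensor parallelism. The calibration tensor parallelism is merged or split to match it. Defaults to 0, which keeps the calibration tensor parallelism without a manually configured merge or split.
inference_pipeline_parallel (int) – The target inference-time pipeline parallelism. The calibration pipeline parallelism is merged or split to match it. Defaults to 1, meaning no pipeline parallelism.
naive_fp8_quantization (bool) – Quantizes the model naively to FP8 without calibration. All scaling factors are set to 1.
workspace_path (Path | str | None) – Path to the NFS directory used for post-processing cross-rank communication.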

Yields:

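A tuple of:

tensorrt_llm_config: A dict that maps to the PretrainedConfig in TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/modeling_utils.py
weights: A dict that stores all model weights and scaling factors for each rank.
per_layer_quantization: A dict that contains the layer-wise quantization information for all quantized layers when mixed precision is used, or an empty dict otherwise.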

Return type:

Iterator[Tuple[Dict[str, Any], Dict[str, Tensor], Dict[str, Any]]]
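
Because the function yields one tuple per rank instead of returning a single value, callers iterate over it. The sketch below shows one way to consume such an iterator. summarize_checkpoints is a hypothetical helper, and its "index" field is only the position in iteration order, which is assumed rather than confirmed to match the GPU rank.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Shape of each yielded item: (tensorrt_llm_config, weights, per_layer_quantization).
RankCheckpoint = Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]


def summarize_checkpoints(checkpoints: Iterable[RankCheckpoint]) -> List[Dict[str, Any]]:
    """Collect a small summary for every tuple produced by the iterator."""
    summaries = []
    for index, (config, weights, per_layer_quantization) in enumerate(checkpoints):
        summaries.append(
            {
                "index": index,  # iteration position, not necessarily the rank
                "config_keys": sorted(config),
                "num_weights": len(weights),
                # A non-empty dict signals mixed-precision quantization.
                "mixed_precision": bool(per_layer_quantization),
            }
        )
    return summaries
```

With the library installed, the helper could be fed directly, e.g. summarize_checkpoints(torch_to_tensorrt_llm_checkpoint(model, "llama")). Each tuple is processed as it is yielded, so the helper itself keeps only the small summaries.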