NVIDIA TensorRT Model Optimizer¶
NVIDIA TensorRT Model Optimizer is a library for optimizing models for inference on NVIDIA GPUs. It includes post-training quantization (PTQ) and quantization-aware training (QAT) tools for large language models (LLMs), vision-language models (VLMs), and diffusion models.
We recommend installing the library with:
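pip install nvidia-modelopt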
Quantizing HuggingFace Models with PTQ¶
You can quantize HuggingFace models using the example scripts provided in the TensorRT Model Optimizer repository; the primary scripts for LLM PTQ are located in the examples/llm_ptq directory.
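If you do not already have the repository locally, one way to get the example scripts is to clone the public TensorRT Model Optimizer GitHub project (the URL below refers to that project; the exact script names and flags are documented in the directory itself):

git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq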
The following example shows how to quantize a model using modelopt's PTQ API:
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Load the model from HuggingFace
model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>")

# Select the quantization config, for example, FP8
config = mtq.FP8_DEFAULT_CFG

# Define a forward loop function for calibration
def forward_loop(model):
    for data in calib_set:
        model(data)

# PTQ with in-place replacement of quantized modules
model = mtq.quantize(model, config, forward_loop)
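The snippet above assumes a calib_set iterable of tokenized batches is already defined. A minimal sketch of how it could be built, assuming the model's own tokenizer and a small list of representative prompts (the prompts and their number here are illustrative assumptions, not part of the original example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<path_or_model_id>")

# A handful of representative prompts; in practice, use a few hundred samples
# drawn from data similar to your deployment traffic.
calib_texts = [
    "Explain the difference between PTQ and QAT.",
    "Summarize the plot of a well-known novel in two sentences.",
]

# Tokenize each prompt and move it to the model's device so that
# `model(data)` inside forward_loop runs the calibration forward passes.
calib_set = [
    tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    for text in calib_texts
]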
Once the model is quantized, you can export it to a quantized checkpoint using the export API:
import torch
from modelopt.torch.export import export_hf_checkpoint

with torch.inference_mode():
    export_hf_checkpoint(
        model,  # The quantized model.
        export_dir,  # The directory where the exported files will be stored.
    )
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy nvidia/Llama-3.1-8B-Instruct-FP8 (the FP8 quantized checkpoint derived from meta-llama/Llama-3.1-8B-Instruct) with vLLM:
from vllm import LLM, SamplingParams

def main():
    model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
    # Ensure you specify quantization='modelopt' when loading the modelopt checkpoint
    llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True)

    sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
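If you prefer serving the checkpoint behind vLLM's OpenAI-compatible server instead of the offline LLM interface, a minimal sketch of the equivalent CLI invocation (passing the quantization method explicitly, as in the Python example above):

vllm serve nvidia/Llama-3.1-8B-Instruct-FP8 --quantization modelopt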