快速入门指南

这是尝试使用TensorRT-LLM的起点。具体来说，本快速入门指南使您能够快速设置并使用TensorRT-LLM发送HTTP请求。

先决条件

本快速入门使用Meta Llama 3.1模型。该模型受特定许可证约束。要下载模型文件，请同意条款并通过Hugging Face进行身份验证。
完成安装步骤。
从Hugging Face Hub拉取Llama 3.1 8B模型的聊天调优变体的权重和分词器文件。
```
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
```

LLM API

LLM API 是一个 Python API，旨在直接在 Python 中简化 TensorRT-LLM 的设置和推理。它通过简单地指定 HuggingFace 仓库名称或模型检查点来实现模型优化。LLM API 通过管理检查点转换、引擎构建、引擎加载和模型推理来简化流程，所有这些都通过一个 Python 对象完成。

这里是一个简单的例子，展示如何使用LLM API与TinyLlama。

from tensorrt_llm import LLM, SamplingParams


def main():

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program need to be protected for spawning processes.
if __name__ == '__main__':
    main()

要了解更多关于LLM API的信息，请查看API介绍和LLM示例介绍。

将模型编译为TensorRT引擎

使用来自GitHub仓库examples/llama目录的Llama模型定义。该模型定义是一个最小示例，展示了TensorRT-LLM中可用的一些优化。

# From the root of the cloned repository, start the TensorRT-LLM container
make -C docker release_run LOCAL_USER=1

# Log in to huggingface-cli
# You can get your token from huggingface.co/settings/token
huggingface-cli login --token *****

# Convert the model into TensorRT-LLM checkpoint format
cd examples/llama
pip install -r requirements.txt
pip install --upgrade transformers # Llama 3.1 requires transformer 4.43.0+ version.
python3 convert_checkpoint.py --model_dir Meta-Llama-3.1-8B-Instruct --output_dir llama-3.1-8b-ckpt

# Compile model
trtllm-build --checkpoint_dir llama-3.1-8b-ckpt \
    --gemm_plugin float16 \
    --output_dir ./llama-3.1-8b-engine

当你使用TensorRT-LLM API创建模型定义时，你会从NVIDIA TensorRT原语构建一个操作图，这些操作构成了你的神经网络的层。这些操作映射到特定的内核；为GPU预写的程序。

在这个例子中，我们包含了gpt_attention插件，它实现了一个类似于FlashAttention的融合注意力内核，以及gemm插件，它执行带有FP32累积的矩阵乘法。我们还将整个模型的期望精度指定为FP16，与您从Hugging Face下载的权重的默认精度相匹配。有关插件和量化的更多信息，请参阅Llama示例和数值精度部分。

运行模型

现在你已经有了模型引擎，运行引擎并执行推理。

python3 ../run.py --engine_dir ./llama-3.1-8b-engine  --max_output_len 100 --tokenizer_dir Meta-Llama-3.1-8B-Instruct --input_text "How do I count to nine in French?"

使用Triton推理服务器部署

要创建您的LLM的生产就绪部署，请使用Triton Inference Server后端用于TensorRT-LLM，以利用TensorRT-LLM C++运行时进行快速推理执行，并包括诸如飞行中批处理和分页KV缓存等优化。带有TensorRT-LLM后端的Triton Inference Server可通过NVIDIA NGC提供的预构建容器获得。

克隆 TensorRT-LLM 后端仓库：

cd ..
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend

请参考End to end workflow to run llama 7b在TensorRT-LLM后端仓库中，使用Triton推理服务器部署模型。

下一步

在本快速入门指南中，您：

已安装并构建了TensorRT-LLM
检索了模型权重
编译并运行了模型
使用Triton推理服务器部署了模型
作为使用基于FastAPI的OpenAI API服务器部署引擎的替代方案，您可以使用trtllm-serve CLI。

更多示例，请参考：

examples/ 展示了如何在最新的LLMs上运行快速基准测试。