Serving a Torch-TensorRT model with Triton

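Optimization and deployment go hand in hand in any discussion of machine learning infrastructure. Once network-level optimization is done to get the maximum performance, the next step is to deploy the model.

Serving the optimized model, however, comes with its own set of considerations and challenges, such as building infrastructure that supports concurrent model execution and supporting clients over HTTP or gRPC.

Triton Inference Server addresses these problems and more. Let's walk step by step through optimizing a model with Torch-TensorRT, deploying it on Triton Inference Server, and building a client to query the model.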

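Step 1: Optimize your model with Torch-TensorRT

This step will be familiar to most Torch-TensorRT users. For this demonstration, we will use a ResNet50 model from Torchhub.

We will work in the //examples/triton directory, which contains the scripts used in this tutorial.

Start by pulling the NGC PyTorch Docker container. You may need to create an NGC account and obtain an API key. Sign up and log in with your key, following the NGC instructions shown after you register.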

# YY.MM is the year and month of the publishing tag for NVIDIA's PyTorch
# container, e.g. 24.08
# NOTE: Use the same publishing tag for both the PyTorch container and the Triton containers

docker run -it --gpus all -v ${PWD}:/scratch_space nvcr.io/nvidia/pytorch:YY.MM-py3
cd /scratch_space

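Inside the container, we can export the model to the correct directory in the Triton model repository. The export script below uses the Dynamo frontend of Torch-TensorRT to compile the PyTorch model to TensorRT. We then save the model using TorchScript as the serialization format, which Triton supports.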

import torch
import torch_tensorrt

# Skip torch.hub's forked-repo validation check
torch.hub._validate_not_a_forked_repo=lambda a,b,c: True

# load model
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")

# Compile with Torch TensorRT;
trt_model = torch_tensorrt.compile(model,
    inputs= [torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions= {torch_tensorrt.dtype.f16}
)

ts_trt_model = torch.jit.trace(trt_model, torch.rand(1, 3, 224, 224).to("cuda"))

# Save the model
torch.jit.save(ts_trt_model, "/triton_example/model_repository/resnet50/1/model.pt")

You can run the script with the following command (from //examples/triton):

docker run --gpus all -it --rm -v ${PWD}:/triton_example nvcr.io/nvidia/pytorch:YY.MM-py3 python /triton_example/export.py

This saves a serialized TorchScript version of the ResNet model in the correct directory of the model repository.
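
The export script writes to a versioned directory and takes for granted that it already exists; as far as we know, torch.jit.save does not create missing parent directories. A minimal sketch, assuming you want to create the directory up front (the helper name ensure_version_dir is illustrative and not part of Torch-TensorRT or Triton):

```python
from pathlib import Path

def ensure_version_dir(repo_root, model_name="resnet50", version=1):
    """Create <repo_root>/<model_name>/<version>/ if missing and return the model path."""
    version_dir = Path(repo_root) / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # The tutorial saves the TorchScript file under the name model.pt.
    return version_dir / "model.pt"
```

In the export script this could be used as model_path = ensure_version_dir("/triton_example/model_repository"), passing str(model_path) to torch.jit.save.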

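Step 2: Set up Triton Inference Server

If you are new to Triton Inference Server and want to learn more, we highly recommend checking out the Triton GitHub repository.

To use Triton, we need to create a model repository. As the name suggests, a model repository is a repository of the models that the inference server hosts. Although Triton can serve models from multiple repositories, this example uses the simplest possible form of a model repository.

The structure of this repository should look like this: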

model_repository
|
+-- resnet50
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.pt
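
Before starting the server, it can help to confirm that the files are where the tree above says they should be. The following is a small sketch under that assumption; check_model_repository is a hypothetical helper, not a Triton utility:

```python
from pathlib import Path

def check_model_repository(repo_root, model_name="resnet50", version="1"):
    """Return the expected files that are missing (an empty list means the layout is complete)."""
    model_dir = Path(repo_root) / model_name
    expected = [
        model_dir / "config.pbtxt",        # model configuration
        model_dir / version / "model.pt",  # serialized TorchScript model
    ]
    return [str(path) for path in expected if not path.is_file()]
```

For example, check_model_repository("model_repository") returns the paths that still need to be created.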

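Triton needs two files to serve a model: the model itself and a model configuration file, typically provided as config.pbtxt. For the model we prepared in Step 1, the following configuration can be used: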

name: "resnet50"
backend: "pytorch"
max_batch_size : 0
input [
  {
    name: "x"
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [1, 1000]
  }
]

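The config.pbtxt file describes the exact model configuration, including the names and shapes of the input and output layers, their data types, and scheduling and batching details. If you are new to Triton, we highly recommend reading the model configuration section of the Triton documentation for more details.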
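
One detail of the configuration that is easy to miss is how max_batch_size interacts with dims. Per Triton's model configuration rules, a max_batch_size of 0 means dims describe the whole tensor, which is why the batch dimension of 1 appears explicitly in the configuration; with a value greater than 0, Triton prepends a variable batch dimension. A minimal sketch of that rule (full_shape is an illustrative helper, not a Triton API):

```python
def full_shape(max_batch_size, dims):
    """Full tensor shape implied by max_batch_size and dims in config.pbtxt."""
    if max_batch_size > 0:
        # Batching enabled: a variable batch dimension (-1) is prepended to dims.
        return [-1] + list(dims)
    # Batching disabled: dims already describe the whole tensor.
    return list(dims)

print(full_shape(0, [1, 3, 224, 224]))  # the configuration used in this tutorial
print(full_shape(8, [3, 224, 224]))     # a hypothetical batched variant
```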

With the model repository set up, we can launch the Triton server with the docker command below. Refer to the container's NGC catalog page for the pull tag.

# Make sure that the TensorRT version in the Triton container
# and the TensorRT version in the environment used to optimize the model
# are the same. As a rule of thumb, containers with the same publishing tag ship the same TensorRT version.

docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:YY.MM-py3 tritonserver --model-repository=/triton_example/model_repository

This should spin up a Triton Inference Server. The next step is to build a simple HTTP client to query the server.
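
The server can take a while to load the model, so a client may want to wait until it reports ready. The sketch below polls the /v2/health/ready endpoint of Triton's HTTP API using only the standard library; it assumes the server is reachable on localhost:8000 as mapped in the docker command, and the function names are illustrative:

```python
import time
import urllib.error
import urllib.request

def server_ready(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET /v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: treat as not ready.
        return False

def wait_until_ready(base_url="http://localhost:8000", attempts=30, delay=1.0):
    """Poll the readiness endpoint; return True as soon as the server is ready."""
    for _ in range(attempts):
        if server_ready(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling wait_until_ready() before sending the first inference request avoids racing the server's startup.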

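Step 3: Build a Triton client to query the server

Before proceeding, make sure you have a sample image on hand. If you don't, download an example image to test inference. This section walks through a very basic client. For more detailed examples, refer to the Triton Client Repository.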

wget  -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"

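Next, we need to install the dependencies for building a Python client. These differ from client to client. For a full list of the languages Triton supports, refer to Triton's client repository.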

pip install torchvision
pip install attrdict
pip install nvidia-pyindex
pip install tritonclient[all]

Let's dive into the client. First, we write a small preprocessing function to resize and normalize the query image.

import numpy as np
from torchvision import transforms
from PIL import Image
import tritonclient.http as httpclient
from tritonclient.utils import triton_to_np_dtype

# preprocessing function
def rn50_preprocess(img_path="/triton_example/img1.jpg"):
  img = Image.open(img_path)
  preprocess = transforms.Compose(
      [
          transforms.Resize(256),
          transforms.CenterCrop(224),
          transforms.ToTensor(),
          transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
      ]
  )
  return preprocess(img).unsqueeze(0).numpy()

transformed_img = rn50_preprocess()
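
To make the normalization step concrete: ToTensor scales 8-bit pixel values to the range [0, 1], and Normalize then computes (value - mean) / std per channel. The following pure-Python sketch applies the same arithmetic to a single RGB pixel; it only illustrates the formula and is not a replacement for the torchvision transforms:

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means used in rn50_preprocess
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations used in rn50_preprocess

def normalize_pixel(rgb):
    """Apply (value / 255 - mean) / std to one 8-bit RGB pixel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))

print(normalize_pixel((0, 0, 0)))        # a black pixel maps to negative values
print(normalize_pixel((255, 255, 255)))  # a white pixel maps to positive values
```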

Building the client takes three basic steps. First, we set up a connection with Triton Inference Server.

# Setting up client
client = httpclient.InferenceServerClient(url="localhost:8000")

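Second, we specify the names of the model's input and output layers. These can be obtained at export time and should already be specified in your config.pbtxt.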

inputs = httpclient.InferInput("x", transformed_img.shape, datatype="FP32")
inputs.set_data_from_numpy(transformed_img, binary_data=True)

outputs = httpclient.InferRequestedOutput("output0", binary_data=True, class_count=1000)

Finally, we send an inference request to Triton Inference Server.

# Querying the server
results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
inference_output = results.as_numpy('output0')
print(inference_output[:5])

The output should look like this:

[b'12.468750:90' b'11.523438:92' b'9.664062:14' b'8.429688:136'
 b'8.234375:11']

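The output format here is <confidence_score>:<classification_index>. To learn how to map these to label names and more, refer to Triton Inference Server's documentation.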
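
Judging from the sample output, each entry is a byte string holding a score and a class index separated by a colon. A short sketch of how such entries could be turned into numeric pairs (parse_classification is an illustrative helper; it ignores any label field that may follow the index when the model provides labels):

```python
def parse_classification(entries):
    """Split classification strings into (score, class_index) pairs."""
    parsed = []
    for entry in entries:
        text = entry.decode("utf-8") if isinstance(entry, bytes) else str(entry)
        # Fields are colon-separated; only the first two are used here.
        score, index = text.split(":")[:2]
        parsed.append((float(score), int(index)))
    return parsed

sample = [b'12.468750:90', b'11.523438:92', b'9.664062:14', b'8.429688:136', b'8.234375:11']
for score, index in parse_classification(sample):
    print(f"class {index}: score {score:.4f}")
```

In the client, the same function could be applied to inference_output[:5].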

You can quickly try out this client with:

# Remember to use the same publishing tag for all steps (e.g. 24.08)

docker run -it --net=host -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:YY.MM-py3-sdk bash -c "pip install torchvision && python /triton_example/client.py"
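
Since every step relies on the same publishing tag, one way to keep the three container images in sync is to derive their names from a single value. This is only a sketch built from the image names used in the commands of this tutorial; image_names is a hypothetical helper:

```python
import re

def image_names(tag):
    """Build the three container image names used in this tutorial from one YY.MM tag."""
    if not re.fullmatch(r"\d{2}\.\d{2}", tag):
        raise ValueError(f"expected a tag of the form YY.MM, got {tag!r}")
    registry = "nvcr.io/nvidia"
    return {
        "pytorch": f"{registry}/pytorch:{tag}-py3",
        "triton_server": f"{registry}/tritonserver:{tag}-py3",
        "triton_sdk": f"{registry}/tritonserver:{tag}-py3-sdk",
    }

for role, image in image_names("24.08").items():
    print(role, image)
```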