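Torch-TensorRT (FX Frontend) User Guide

Torch-TensorRT (FX Frontend) is a tool that converts a PyTorch model through torch.fx into a TensorRT engine optimized for running on NVIDIA GPUs. TensorRT is the inference engine developed by NVIDIA; it combines various optimizations, including kernel fusion, graph optimization, and low precision. The tool is developed in a Python environment, which makes this workflow easily accessible to researchers and engineers. A user goes through a few stages to use this tool, and we introduce them here.

> Torch-TensorRT (FX Frontend) is in Beta and is currently recommended for use with PyTorch nightly.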

# Test an example by
$ python py/torch_tensorrt/fx/example/lower_example.py

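Converting a PyTorch Model to a TensorRT Engine

In general, users are welcome to use compile() to convert a model to a TensorRT engine. It is a wrapper API that covers the major steps needed for this conversion. Please refer to lower_example.py under examples/fx for example usage.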

def compile(
    module: nn.Module,
    input,
    max_batch_size=2048,
    max_workspace_size=33554432,
    explicit_batch_dimension=False,
    lower_precision=LowerPrecision.FP16,
    verbose_log=False,
    timing_cache_prefix="",
    save_timing_cache=False,
    cuda_graph_batch_size=-1,
    dynamic_batch=True,
) -> nn.Module:

    """
    Takes in original module, input and lowering setting, run lowering workflow to turn module
    into lowered module, or so called TRTModule.

    Args:
        module: Original module for lowering.
        input: Input for module.
        max_batch_size: Maximum batch size (must be >= 1 to be set, 0 means not set)
        max_workspace_size: Maximum size of workspace given to TensorRT.
        explicit_batch_dimension: Use explicit batch dimension in TensorRT if set True, otherwise use implicit batch dimension.
        lower_precision: lower_precision config given to TRTModule.
        verbose_log: Enable verbose log for TensorRT if set True.
        timing_cache_prefix: Timing cache file name for timing cache used by fx2trt.
        save_timing_cache: Update timing cache with current timing cache data if set to True.
        cuda_graph_batch_size: Cuda graph batch size, default to be -1.
        dynamic_batch: batch dimension (dim=0) is dynamic.
    Returns:
        A torch.nn.Module lowered by TensorRT.
    """
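
The default max_workspace_size in the signature above is a byte count. As a quick sanity check (plain arithmetic, nothing here calls Torch-TensorRT, and the helper name to_mib is only illustrative), it is the same value as the 1 << 25 passed to interpreter.run() later in this guide, which is 32 MiB:

```python
# Plain arithmetic; no Torch-TensorRT call is involved.
DEFAULT_MAX_WORKSPACE_SIZE = 33554432  # default value from the compile() signature


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / (1 << 20)


# The literal in the signature and the shift used later are the same number.
assert DEFAULT_MAX_WORKSPACE_SIZE == 1 << 25
print(to_mib(DEFAULT_MAX_WORKSPACE_SIZE))  # 32.0
```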

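In this section, we walk through an example to illustrate the major steps of the FX path. Users can refer to fx2trt_example.py in examples/fx.

Step 1: Trace the model with acc_tracer

Acc_tracer is a tracer inherited from the FX tracer. It comes with an args normalizer that converts all args to kwargs and passes them to the TRT converters.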

import torch_tensorrt.fx.tracer.acc_tracer.acc_tracer as acc_tracer

# Build the model which needs to be a PyTorch nn.Module.
my_pytorch_model = build_model()

# Prepare inputs to the model. Inputs have to be a List of Tensors
inputs = [Tensor, Tensor, ...]

# Trace the model with acc_tracer.
acc_mod = acc_tracer.trace(my_pytorch_model, inputs)

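Common errors:

"symbolically traced variables cannot be used as inputs to control flow": this means the model contains dynamic control flow. Please refer to the "Dynamic Control Flow" section of the FX guide.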
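
To see why data-dependent branching breaks tracing, the toy below mimics the idea with a stand-in class. ToyProxy and ToyTraceError are hypothetical names made up for this illustration and are not torch.fx classes. The real tracer uses its own proxy objects, but the cause of the failure is the same: during tracing there is no concrete value to branch on.

```python
class ToyTraceError(Exception):
    """Hypothetical error type standing in for the tracer's error."""


class ToyProxy:
    """Hypothetical stand-in for a symbolically traced value (holds no data)."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # A comparison can be recorded symbolically and yields another proxy.
        return ToyProxy(f"gt({self.name}, {other})")

    def __bool__(self):
        # A branch needs a concrete True/False, which a symbolic value lacks.
        raise ToyTraceError(
            "symbolically traced variables cannot be used as inputs to control flow"
        )


def forward(x):
    if x > 0:  # data-dependent branch: Python calls bool() on the result
        return "positive"
    return "non-positive"


try:
    forward(ToyProxy("x"))
    traced = True
except ToyTraceError:
    traced = False

print(traced)  # False
```

With a concrete number, forward(3) runs normally; only the symbolic stand-in fails, which is why such branches have to be removed or rewritten before tracing.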

Step 2: Build the TensorRT engine

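TensorRT handles the batch dimension in one of two modes: explicit batch dimension and implicit batch dimension. Implicit batch mode was used by early versions of TensorRT and is now deprecated, but it is still supported for backward compatibility. In explicit batch mode, all dimensions are explicit and can be dynamic, that is, their length can change at execution time. Many new features, such as dynamic shapes and loops, are available only in this mode. Users can still choose implicit batch mode by setting explicit_batch_dimension=False in compile(). We do not recommend it because future TensorRT versions will lack support for it.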

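Explicit batch is the default mode, and it must be set for dynamic shapes. For most vision tasks, users who want an effect similar to implicit mode, where only the batch dimension changes, can enable dynamic_batch in compile(). It has some requirements:
1. Shapes of inputs, outputs, and activations are fixed except for the batch dimension.
2. Inputs, outputs, and activations have the batch dimension as the major (leading) dimension.
3. No operator in the model modifies the batch dimension (permute, transpose, split, etc.) or computes over the batch dimension (sum, softmax, etc.).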

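As an example of the last requirement, if we have a 3D tensor t with shape (batch, sequence, dimension), an operation such as torch.transpose(t, 0, 2) moves the batch dimension and therefore violates it. If any of these three requirements is not satisfied, we need to specify the inputs as InputTensorSpec with a dynamic range.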
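
A small, library-free sketch of the requirement that operators leave the batch dimension alone: it swaps two entries of a labeled shape tuple the way a transpose would and checks whether the batch dimension is still at position 0. The helper names are illustrative only and do not exist in Torch-TensorRT.

```python
def transpose_shape(shape, dim0, dim1):
    """Return the shape that results from swapping two dimensions."""
    out = list(shape)
    out[dim0], out[dim1] = out[dim1], out[dim0]
    return tuple(out)


def batch_is_leading(shape, batch_label="batch"):
    """True when the batch dimension is still the first dimension."""
    return len(shape) > 0 and shape[0] == batch_label


t_shape = ("batch", "sequence", "dimension")

moved = transpose_shape(t_shape, 0, 2)  # touches dim 0, so batch moves
kept = transpose_shape(t_shape, 1, 2)   # leaves dim 0 alone

print(moved, batch_is_leading(moved))  # ('dimension', 'sequence', 'batch') False
print(kept, batch_is_leading(kept))    # ('batch', 'dimension', 'sequence') True
```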

import torch
from torch_tensorrt.fx import InputTensorSpec, TRTInterpreter

# InputTensorSpec is a dataclass we use to store input information.
# There're two ways we can build input_specs.
# Option 1, build it manually.
input_specs = [
  InputTensorSpec(shape=(1, 2, 3), dtype=torch.float32),
  InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]
# Option 2, build it using sample_inputs where user provide a sample
inputs = [
    torch.rand((1, 2, 3), dtype=torch.float32),
    torch.rand((1, 4, 5), dtype=torch.float32),
]
input_specs = InputTensorSpec.from_tensors(inputs)

# IMPORTANT: If dynamic shape is needed, we need to build it slightly differently.
input_specs = [
    InputTensorSpec(
        shape=(-1, 2, 3),
        dtype=torch.float32,
        # Currently we only support one set of dynamic range. User may set other dimensions but it is not promised to work for any models
        # (min_shape, optimize_target_shape, max_shape)
        # For more information refer to fx/input_tensor_spec.py
        shape_ranges = [
            ((1, 2, 3), (4, 2, 3), (100, 2, 3)),
        ],
    ),
    InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]

# Build a TRT interpreter. Set explicit_batch_dimension accordingly.
interpreter = TRTInterpreter(
    acc_mod, input_specs, explicit_batch_dimension=True  # set False for implicit batch mode
)

# The output of TRTInterpreter run() is wrapped as TRTInterpreterResult.
# The TRTInterpreterResult contains required parameter to build TRTModule,
# and other informational output from TRTInterpreter run.
class TRTInterpreterResult(NamedTuple):
    engine: Any
    input_names: Sequence[str]
    output_names: Sequence[str]
    serialized_cache: bytearray

# max_batch_size: set accordingly for the maximum batch size you will use.
# max_workspace_size: set to the maximum size we can afford for temporary buffers.
# lower_precision: the precision model layers run in (TensorRT will choose the best-performing precision).
# sparse_weights: allow the builder to examine weights and use optimized functions when weights have suitable sparsity.
# force_fp32_output: force the output to be fp32.
# strict_type_constraints: usually set to False unless we want to control the precision of certain layers for numeric reasons.
# algorithm_selector: set up algorithm selection for certain layers.
# timing_cache: enable the timing cache for TensorRT.
# profiling_verbosity: TensorRT logging level.
trt_interpreter_result = interpreter.run(
    max_batch_size=64,
    max_workspace_size=1 << 25,
    sparse_weights=False,
    force_fp32_output=False,
    strict_type_constraints=False,
    algorithm_selector=None,
    timing_cache=None,
    profiling_verbosity=None,
)
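
Each entry in shape_ranges above is a (min_shape, optimize_target_shape, max_shape) triple. The helper below is a hypothetical checker, not part of Torch-TensorRT, that sketches the consistency one would expect between such a triple and a shape containing -1. It assumes that static dimensions must match in all three shapes and that dynamic dimensions must satisfy min <= opt <= max.

```python
def check_shape_range(shape, shape_range):
    """Return True if (min, opt, max) is consistent with `shape`.

    Hypothetical helper for illustration; -1 in `shape` marks a dynamic dim.
    """
    min_shape, opt_shape, max_shape = shape_range
    if not (len(shape) == len(min_shape) == len(opt_shape) == len(max_shape)):
        return False
    for dim, lo, opt, hi in zip(shape, min_shape, opt_shape, max_shape):
        if dim == -1:
            # Dynamic dimension: the three values must be ordered.
            if not (lo <= opt <= hi):
                return False
        elif not (dim == lo == opt == hi):
            # Static dimension: every shape must repeat the same value.
            return False
    return True


good = check_shape_range((-1, 2, 3), ((1, 2, 3), (4, 2, 3), (100, 2, 3)))
bad = check_shape_range((-1, 2, 3), ((8, 2, 3), (4, 2, 3), (100, 2, 3)))
print(good, bad)  # True False
```

Running a check like this before building an engine can catch a mistyped range early, since engine builds are comparatively slow.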

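Common errors:

"RuntimeError: Conversion of function xxx not currently supported!": this means we do not support the xxx operator. Please refer to the "How to Add a Missing Op" section below for further instructions.

Step 3: Run the model

One way is to use TRTModule, which is basically a PyTorch nn.Module.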

from torch_tensorrt.fx import TRTModule
mod = TRTModule(
    trt_interpreter_result.engine,
    trt_interpreter_result.input_names,
    trt_interpreter_result.output_names)
# Just like all other PyTorch modules
outputs = mod(*inputs)
torch.save(mod, "trt.pt")
reload_trt_mod = torch.load("trt.pt")
reload_model_output = reload_trt_mod(*inputs)

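So far, we have explained in detail the major steps for converting a PyTorch model into a TensorRT engine. Users are welcome to refer to the source code for explanations of some parameters. In the conversion scheme, there are two important components. One is the acc tracer, which converts a PyTorch model into an acc graph. The other is the FX path converter, which converts the operations in the acc graph to the corresponding TensorRT operations and builds the TensorRT engine for them.

Acc Tracer

Acc tracer is a custom FX symbolic tracer. It does a couple more things than the default FX symbolic tracer. We mainly depend on it to convert PyTorch ops or builtin ops to acc ops. fx2trt uses acc ops for two main purposes: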

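Many PyTorch ops and builtin ops do similar things, such as torch.add, builtin.add, and torch.Tensor.add. With acc tracer, we normalize these three ops to a single acc_ops.add. This reduces the number of converters we need to write.
acc ops only have kwargs, which makes writing converters easier because we do not need extra logic to look for arguments in both args and kwargs.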
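
Both points can be illustrated with a library-free toy. The names below (toy_add, method_style_add, NORMALIZATION_TABLE, normalize_call) are hypothetical and only mimic the idea; the real tracer rewrites FX graph nodes rather than dispatching Python calls like this.

```python
import operator


def toy_add(*, input, other):
    """Hypothetical kwargs-only op standing in for acc_ops.add."""
    return input + other


def method_style_add(input, other):
    """Hypothetical stand-in for a second 'add' variant."""
    return input + other


# Several source callables map to ONE normalized op. Each entry lists the
# normalized kwarg names in the positional order of the source callable.
NORMALIZATION_TABLE = {
    operator.add: (toy_add, ("input", "other")),
    method_style_add: (toy_add, ("input", "other")),
}


def normalize_call(fn, *args, **kwargs):
    """Return (normalized_op, kwargs_only_arguments) for a call to `fn`."""
    target, param_names = NORMALIZATION_TABLE[fn]
    normalized = dict(zip(param_names, args))  # positional args become kwargs
    normalized.update(kwargs)
    return target, normalized


op_a, kwargs_a = normalize_call(operator.add, 1, 2)
op_b, kwargs_b = normalize_call(method_style_add, 1, other=2)

print(op_a is op_b, kwargs_a == kwargs_b)  # True True
print(op_a(**kwargs_a))                    # 3
```

Because both calls end up as the same op with the same kwargs, a single converter written against toy_add's keyword names would cover both variants.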

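FX2TRT

After symbolic tracing, we have the graph representation of a PyTorch model. fx2trt leverages the power of fx.Interpreter. fx.Interpreter goes through the whole graph node by node and calls the function that each node represents. fx2trt overrides the original behavior of calling the function by invoking the corresponding converter for each node. Each converter function adds the corresponding TensorRT layer(s).

Below is an example of a converter function. The decorator is used to register this converter function with the corresponding node. In this example, we register the converter to an fx node whose target is acc_ops.sigmoid.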

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The rest arguments are attributes of fx node.
    """
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)
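
The registration and dispatch pattern described in this section can be sketched without TensorRT. Everything below is a toy under stated assumptions: toy_converter, CONVERTERS, and run_toy_interpreter are hypothetical names, the "network" is a plain list of strings rather than a TensorRT network, and nodes are simple (name, target, kwargs) tuples rather than FX nodes.

```python
CONVERTERS = {}  # toy registry: node target -> converter function


def toy_converter(target):
    """Hypothetical decorator that registers a converter for a node target."""
    def register(fn):
        CONVERTERS[target] = fn
        return fn
    return register


@toy_converter("sigmoid")
def convert_sigmoid(network, kwargs, name):
    network.append(f"{name}: SIGMOID({kwargs['input']})")
    return name  # stands in for the layer's output tensor


@toy_converter("add")
def convert_add(network, kwargs, name):
    network.append(f"{name}: SUM({kwargs['input']}, {kwargs['other']})")
    return name


def run_toy_interpreter(nodes):
    """Visit nodes in order and call the registered converter for each one."""
    network, env = [], {}
    for name, target, kwargs in nodes:
        if target == "placeholder":
            env[name] = name  # graph input
            continue
        if target not in CONVERTERS:
            raise RuntimeError(
                f"Conversion of function {target} not currently supported!"
            )
        # Replace references to earlier nodes with their converted outputs.
        resolved = {key: env.get(value, value) for key, value in kwargs.items()}
        env[name] = CONVERTERS[target](network, resolved, name)
    return network


graph = [
    ("x", "placeholder", {}),
    ("add_1", "add", {"input": "x", "other": "x"}),
    ("sigmoid_1", "sigmoid", {"input": "add_1"}),
]
layers = run_toy_interpreter(graph)
print(layers)  # ['add_1: SUM(x, x)', 'sigmoid_1: SIGMOID(add_1)']
```

A node whose target has no registered converter raises the same kind of error listed under the common errors for Step 2, which is the situation the next section addresses.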

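How to Add a Missing Op

You can add it wherever you want; just remember to import the file so that all acc ops and mappers are registered before tracing with acc_tracer.

Step 1. Add a new acc op

TODO: Explain more of the logic behind acc ops, such as when we want to break down an op and when we want to reuse other ops.

In acc tracer, we convert a node in the graph to an acc op if a mapping from that node to an acc op is registered.

Two things are needed for the conversion to acc ops to happen. First, an acc op function must be defined. Second, a mapping must be registered.

Defining an acc op is simple: we first need a function, and then we register the function as an acc op with the register_acc_op decorator (see acc_normalizer.py). For example, the following code adds an acc op named foo() that adds two given inputs.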

# NOTE: all acc ops should only take kwargs as inputs, therefore we need the "*"
# at the beginning.
@register_acc_op
def foo(*, input, other, alpha):
    return input + alpha * other

There are two ways to register a mapping. One is register_acc_op_mapping(). Let's register a mapping from torch.add to the foo() we just created. We need to add the register_acc_op_mapping decorator to it.

this_arg_is_optional = True

@register_acc_op_mapping(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

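op_and_target determines which node triggers this mapping. op and target are attributes of an FX node. In acc_normalization, when we see a node whose op and target match those set in op_and_target, we trigger the mapping. Since we want to map from torch.add, op is call_function and target is torch.add.

arg_replacement_tuples determines how we construct the kwargs of the new acc op node from the args and kwargs of the original node. Each tuple in arg_replacement_tuples represents one argument mapping rule and contains two or three elements. The first element is the name of the argument in the original node, and the second element is the name of the corresponding argument in the acc op node. The third element is a boolean that indicates whether this kwarg is optional in the original node; we only need to specify it when it is True. The order of the tuples matters because the position of a tuple determines the position of the argument in the original node's args. We use this information to map args from the original node to kwargs in the acc op node.

We do not need to specify arg_replacement_tuples if none of the following is true: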

The kwargs of the original node and of the acc op node have different names.
There are optional arguments.
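
The mapping rules can be made concrete with a toy model. build_acc_kwargs below is a hypothetical function written for this guide, not the library's implementation. It assumes that a keyword in the original call wins over a positional value and that an optional argument missing from the original call is simply left out, so the acc op's own default applies.

```python
def build_acc_kwargs(args, kwargs, arg_replacement_tuples):
    """Map an original node's args/kwargs to acc op kwargs (toy model)."""
    new_kwargs = {}
    for position, rule in enumerate(arg_replacement_tuples):
        orig_name, new_name = rule[0], rule[1]
        optional = len(rule) == 3 and rule[2]
        if orig_name in kwargs:
            new_kwargs[new_name] = kwargs[orig_name]
        elif position < len(args):
            # The tuple's position tells us where to look in the args.
            new_kwargs[new_name] = args[position]
        elif not optional:
            raise TypeError(f"missing required argument {orig_name!r}")
    return new_kwargs


rules = [
    ("input", "input"),
    ("other", "other"),
    ("alpha", "alpha", True),  # third element marks the argument as optional
]

positional = build_acc_kwargs(("x", "y"), {}, rules)
mixed = build_acc_kwargs(("x",), {"other": "y", "alpha": 2}, rules)

print(positional)  # {'input': 'x', 'other': 'y'}
print(mixed)       # {'input': 'x', 'other': 'y', 'alpha': 2}
```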

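The other way to register a mapping is through register_custom_acc_mapper_fn(). This approach is designed to reduce redundant op registration, because it lets you use a function to map to one or more existing acc ops through some combination. In the function, you can do basically whatever you want. Let's use an example to explain how it works.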

@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

@register_custom_acc_mapper_fn(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
def custom_mapper(node: torch.fx.Node, _: nn.Module) -> torch.fx.Node:
    """
    `node` is original node, which is a call_function node with target
    being torch.add.
    """
    alpha = 1
    if "alpha" in node.kwargs:
        alpha = node.kwargs["alpha"]
    foo_kwargs = {"input": node.kwargs["input"], "other": node.kwargs["other"], "alpha": alpha}
    with node.graph.inserting_before(node):
        foo_node = node.graph.call_function(foo, kwargs=foo_kwargs)
        foo_node.meta = node.meta.copy()
        return foo_node

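In the custom mapper function, we construct an acc op node and return it. The node we return here takes over all the children of the original node (see acc_normalizer.py).

The last step is to add unit tests for the new acc op or mapper function. Unit tests for these belong in test_acc_tracer.py.

Step 2. Add a new converter

All the converters developed for acc ops are in acc_op_converter.py. It gives a good example of how a converter is added.

Essentially, a converter is the mechanism that maps an acc op to TensorRT layer(s). If we can find all the TensorRT layers we need, we can start adding a converter for the node using the TensorRT APIs.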

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The rest arguments are attributes of fx node.
    """
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)

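We need to use the tensorrt_converter decorator to register the converter. The argument to the decorator is the target of the fx node that we want to convert. In the converter, we can find the inputs to the fx node in kwargs. As in the example, the original node is acc_ops.sigmoid, which has only one argument, "input", in acc_ops.py. We get the input and check whether it is a TensorRT tensor. After that, we add a sigmoid layer to the TensorRT network and return the output of the layer. The output we return is passed on to the children of acc_ops.sigmoid by fx.Interpreter.

What if we cannot find a corresponding layer in TensorRT that does the same thing as the node?

In this case, we need to do a bit more work. TensorRT provides plugins that serve as custom layers. We have not implemented this feature yet. We will update this guide once it is enabled.

The last step is to add unit tests for the new converter. Users can add the corresponding unit tests in this folder.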