
Quick Start

This tutorial is for newcomers to Apache TVM. It walks through a simple example that shows how to use Apache TVM to compile a basic neural network.

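Overview

Apache TVM is a machine learning compilation framework that follows the principles of Python-first development and universal deployment. It takes in pre-trained machine learning models, then compiles them into deployable modules that can be embedded and run anywhere. Apache TVM also supports customizing the optimization process to introduce new optimizations, libraries, code generation, and more.

Apache TVM can help to:

Optimize the performance of machine learning workloads by composing libraries and code generation.
Deploy machine learning workloads to a diverse set of new environments, including new runtimes and new hardware.
Continuously improve and customize the machine learning deployment pipeline in Python by quickly customizing library dispatching and bringing in custom operators and code generation.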

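Overall Flow

Next we show the overall flow of compiling a neural network model with Apache TVM, including how to optimize, deploy, and run the model. The overall flow is illustrated in the figure below: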

../../_static/downloads/tvm_overall_flow.svg

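The overall flow consists of the following steps:

Construct or Import a Model: Construct a neural network model or import a pre-trained model from another framework (e.g. PyTorch, ONNX), and create the TVM IRModule, which contains all the information needed for compilation, including high-level Relax functions for the computational graph and low-level TensorIR functions for tensor programs.
Perform Composable Optimizations: Perform a series of optimization transformations, such as graph optimizations, tensor program optimizations, and library dispatching.
Build and Universal Deployment: Build the optimized model into a deployable module for the universal runtime, and execute it on different devices, such as CPUs, GPUs, or other accelerators.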

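Construct or Import a Model

Before we get started, let's construct a neural network model. To keep things simple in this tutorial, we define a two-layer MLP directly in this script with the TVM Relax frontend, which is an API similar to PyTorch.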

import tvm
from tvm import relax
from tvm.relax.frontend import nn


class MLPModel(nn.Module):
    def __init__(self):
        super(MLPModel, self).__init__()
        self.fc1 = nn.Linear(784, 256)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        return x

Then we can export the model to a TVM IRModule, which is the central intermediate representation in TVM.

mod, param_spec = MLPModel().export_tvm(
    spec={"forward": {"x": nn.spec.Tensor((1, 784), "float32")}}
)
mod.show()

# from tvm.script import ir as I
# from tvm.script import relax as R

@I.ir_module
class Module:
    @R.function
    def forward(x: R.Tensor((1, 784), dtype="float32"), fc1_weight: R.Tensor((256, 784), dtype="float32"), fc1_bias: R.Tensor((256,), dtype="float32"), fc2_weight: R.Tensor((10, 256), dtype="float32"), fc2_bias: R.Tensor((10,), dtype="float32")) -> R.Tensor((1, 10), dtype="float32"):
        R.func_attr({"num_input": 1})
        with R.dataflow():
            permute_dims: R.Tensor((784, 256), dtype="float32") = R.permute_dims(fc1_weight, axes=None)
            matmul: R.Tensor((1, 256), dtype="float32") = R.matmul(x, permute_dims, out_dtype="void")
            add: R.Tensor((1, 256), dtype="float32") = R.add(matmul, fc1_bias)
            relu: R.Tensor((1, 256), dtype="float32") = R.nn.relu(add)
            permute_dims1: R.Tensor((256, 10), dtype="float32") = R.permute_dims(fc2_weight, axes=None)
            matmul1: R.Tensor((1, 10), dtype="float32") = R.matmul(relu, permute_dims1, out_dtype="void")
            add1: R.Tensor((1, 10), dtype="float32") = R.add(matmul1, fc2_bias)
            gv: R.Tensor((1, 10), dtype="float32") = add1
            R.output(gv)
        return gv
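
To make the printed Relax function easier to read, here is a minimal pure-Python sketch of the same sequence of operations (transpose the weight, multiply, add the bias, apply ReLU, then repeat for the second layer). It is only an illustration and not how TVM executes the module; the helper names and the tiny 3 -> 2 -> 1 shapes are assumptions chosen so the arithmetic can be checked by hand.

```python
from typing import List

Matrix = List[List[float]]


def permute_dims(w: Matrix) -> Matrix:
    """Transpose a 2-D matrix, mirroring R.permute_dims with axes=None."""
    return [list(col) for col in zip(*w)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add_bias(a: Matrix, bias: List[float]) -> Matrix:
    """Add a bias vector to every row, mirroring the broadcast in R.add."""
    return [[v + b for v, b in zip(row, bias)] for row in a]


def relu(a: Matrix) -> Matrix:
    """Element-wise max(v, 0), mirroring R.nn.relu."""
    return [[max(v, 0.0) for v in row] for row in a]


def forward(x, fc1_weight, fc1_bias, fc2_weight, fc2_bias):
    """Same operator order as the Relax `forward` function shown above."""
    hidden = relu(add_bias(matmul(x, permute_dims(fc1_weight)), fc1_bias))
    return add_bias(matmul(hidden, permute_dims(fc2_weight)), fc2_bias)


# Hypothetical tiny shapes (3 -> 2 -> 1) instead of 784 -> 256 -> 10.
x = [[1.0, 2.0, 3.0]]
fc1_weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # shape (2, 3)
fc1_bias = [0.0, 1.0]
fc2_weight = [[2.0, 1.0]]  # shape (1, 2)
fc2_bias = [0.5]

print(forward(x, fc1_weight, fc1_bias, fc2_weight, fc2_bias))  # [[4.5]]
```

As in the IRModule, weights are stored as (out_features, in_features), which is why each layer transposes its weight before the matrix product.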

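Perform Optimization Transformations

Apache TVM uses a pipeline to transform and optimize programs. The pipeline encapsulates a series of transformation steps designed to achieve two goals (at the same level):

Model optimizations: such as operator fusion and layout rewrites.
Tensor program optimization: mapping operators to low-level implementations (either libraries or code generation).

Note

These two are goals, not stages of the pipeline. The two kinds of optimization may be performed at the same level, or separately in two stages.

Note

In this tutorial we only demonstrate the overall flow by using the zero optimization pipeline, rather than optimizing for any specific target.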

mod = relax.get_pipeline("zero")(mod)
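
The idea that a pipeline is a module-to-module function built from a sequence of transformation steps can be sketched in plain Python. Everything below is a toy assumption for illustration: the "module" is just a list of operator names, and `fuse_matmul_add` and `make_pipeline` are hypothetical helpers, not TVM APIs or the contents of the zero pipeline.

```python
from typing import Callable, List

# Toy stand-in for an IRModule: an ordered list of operator names.
ToyModule = List[str]
ToyPass = Callable[[ToyModule], ToyModule]


def fuse_matmul_add(ops: ToyModule) -> ToyModule:
    """Toy operator fusion: merge each matmul directly followed by add."""
    fused: ToyModule = []
    i = 0
    while i < len(ops):
        if ops[i] == "matmul" and i + 1 < len(ops) and ops[i + 1] == "add":
            fused.append("fused_matmul_add")
            i += 2  # consume both operators
        else:
            fused.append(ops[i])
            i += 1
    return fused


def make_pipeline(passes: List[ToyPass]) -> ToyPass:
    """Compose passes into one callable that maps a module to a new module."""

    def pipeline(mod: ToyModule) -> ToyModule:
        for transform in passes:
            mod = transform(mod)
        return mod

    return pipeline


# Operator order taken from the Relax `forward` function printed earlier.
toy_mod = ["permute_dims", "matmul", "add", "relu", "permute_dims", "matmul", "add"]
toy_pipeline = make_pipeline([fuse_matmul_add])
optimized = toy_pipeline(toy_mod)
print(optimized)
# ['permute_dims', 'fused_matmul_add', 'relu', 'permute_dims', 'fused_matmul_add']
```

The calling convention matches the line above: the pipeline object is called on the module and returns a transformed module, leaving the input untouched.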

Build and Universal Deployment

After the optimization, we can build the model into a deployable module and run it on different devices.

import numpy as np

target = tvm.target.Target("llvm")
ex = tvm.compile(mod, target)
device = tvm.cpu()
vm = relax.VirtualMachine(ex, device)
data = np.random.rand(1, 784).astype("float32")
tvm_data = tvm.nd.array(data, device=device)
params = [np.random.rand(*param.shape).astype("float32") for _, param in param_spec]
params = [tvm.nd.array(param, device=device) for param in params]
print(vm["forward"](tvm_data, *params).numpy())

[[27143.484 25515.908 26329.016 24623.168 26308.906 26924.281 24331.305
  25950.732 25848.861 26064.36 ]]
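
The printed values are large because the input, weights, and biases are all drawn from a uniform distribution on [0, 1) rather than trained. As a rough sanity check (a back-of-the-envelope sketch, not part of the TVM workflow), the expected output can be estimated by hand and compared with a small pure-Python simulation. The seed and variable names are arbitrary assumptions, and the simulated numbers will differ from the run above because the random draws differ.

```python
import random

IN_FEATURES, HIDDEN, OUT_FEATURES = 784, 256, 10

# Analytic estimate: the product of two independent uniform(0, 1) values
# averages 0.25, while a bias or a second-layer weight averages 0.5.
hidden_mean = IN_FEATURES * 0.25 + 0.5          # 196.5
output_mean = HIDDEN * hidden_mean * 0.5 + 0.5  # 25152.5
print(output_mean)

# Monte Carlo check of the same two-layer computation with a fixed seed.
rng = random.Random(0)
x = [rng.random() for _ in range(IN_FEATURES)]

hidden = []
for _ in range(HIDDEN):
    w = [rng.random() for _ in range(IN_FEATURES)]
    pre_activation = sum(a * b for a, b in zip(x, w)) + rng.random()
    hidden.append(max(pre_activation, 0.0))  # ReLU is a no-op here: all positive

outputs = []
for _ in range(OUT_FEATURES):
    v = [rng.random() for _ in range(HIDDEN)]
    outputs.append(sum(h * c for h, c in zip(hidden, v)) + rng.random())

print(min(outputs), max(outputs))
```

Both the estimate and the simulation land in the mid-20,000s, the same order of magnitude as the values printed by the virtual machine.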

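Our goal is to bring machine learning to applications in any language of interest, with minimal runtime support.

Each function in the IRModule becomes a runnable function at runtime. For example, in LLM use cases we can call the prefill and decode functions directly.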
```
prefill_logits = vm["prefill"](inputs, weight, kv_cache)
decoded_logits = vm["decode"](inputs, weight, kv_cache)
```

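The TVM runtime comes with native data structures, such as NDArray, and can also exchange data with the existing ecosystem without copying (for example, DLPack exchange with PyTorch).

# Convert a PyTorch tensor to a TVM NDArray
x_tvm = tvm.nd.from_dlpack(torch.utils.dlpack.to_dlpack(x_torch))
# Convert a TVM NDArray to a PyTorch tensor
x_torch = torch.from_dlpack(x_tvm.to_dlpack())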

The TVM runtime also works in non-Python environments, so it is suitable for settings such as mobile devices.

// C++ snippet
runtime::Module vm = ex.GetFunction("load_executable")();
vm.GetFunction("init")(...);
NDArray out = vm.GetFunction("prefill")(data, weight, kv_cache);

// Java snippet
Module vm = ex.getFunction("load_executable").invoke();
vm.getFunction("init").pushArg(...).invoke();
NDArray out = vm.getFunction("prefill").pushArg(data).pushArg(weight).pushArg(kv_cache).invoke();

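Read Next

This tutorial demonstrated the overall flow of using Apache TVM to compile a neural network model. For more advanced or specific topics, please refer to the other TVM tutorials.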
