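(prototype) PyTorch 2 Export Post Training Quantization

Created On: Oct 02, 2023 | Last Updated: Oct 23, 2024 | Last Verified: Nov 05, 2024

This tutorial introduces the steps to do post training static quantization in graph mode based on torch._export.export. Compared to FX Graph Mode Quantization, this flow is expected to have significantly higher model coverage (88% on 14K models), better programmability, and a simplified UX.

Being exportable by torch.export.export is a prerequisite for using this flow. You can find the supported constructs in Export DB.

The high level architecture of quantization 2 with quantizer could look like this: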

float_model(Python)                          Example Input
    \                                              /
     \                                            /
--------------------------------------------------------
|                        export                        |
--------------------------------------------------------
                            |
                    FX Graph in ATen     Backend Specific Quantizer
                            |                       /
---------------------------------------------------------
|                     prepare_pt2e                      |
---------------------------------------------------------
                            |
                     Calibrate/Train
                            |
---------------------------------------------------------
|                    convert_pt2e                       |
---------------------------------------------------------
                            |
                    Quantized Model
                            |
---------------------------------------------------------
|                       Lowering                        |
---------------------------------------------------------
                            |
        Executorch, Inductor or <Other Backends>

The PyTorch 2 export quantization API looks like this:

import torch
class M(torch.nn.Module):
   def __init__(self):
      super().__init__()
      self.linear = torch.nn.Linear(5, 10)

   def forward(self, x):
      return self.linear(x)


example_inputs = (torch.randn(1, 5),)
m = M().eval()

# Step 1. program capture
# This is available for pytorch 2.5+, for more details on lower pytorch versions
# please check `Export the model with torch.export` section
m = torch.export.export_for_training(m, example_inputs).module()
# we get a model with aten ops


# Step 2. quantization
from torch.ao.quantization.quantize_pt2e import (
  prepare_pt2e,
  convert_pt2e,
)

from torch.ao.quantization.quantizer.xnnpack_quantizer import (
  XNNPACKQuantizer,
  get_symmetric_quantization_config,
)
# backend developer will write their own Quantizer and expose methods to allow
# users to express how they
# want the model to be quantized
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)

# calibration omitted

m = convert_pt2e(m)
# we have a model with aten ops doing integer computations when possible

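Motivation of PyTorch 2 Export Quantization

In PyTorch versions prior to 2, we have FX Graph Mode Quantization, which uses QConfigMapping and BackendConfig for customization. QConfigMapping allows modeling users to specify how they want their model to be quantized, and BackendConfig allows backend developers to specify the ways of quantization supported by their backend. While that API covers most use cases relatively well, it is not fully extensible. The current API has two main limitations:

1. Limitations around expressing quantization intentions for complicated operator patterns (how an operator pattern should be observed/quantized) using the existing objects: QConfig and QConfigMapping.
2. Limited support for how users can express their intention of how they want their model to be quantized. For example, if users want to quantize every other linear layer in the model, or the quantization behavior depends on the actual shape of the Tensor (for example, only observe/quantize inputs and outputs when the linear has a 3D input), backend developers or modeling users need to change the core quantization API/flow.

A few improvements could make the existing flow better:

3. We use QConfigMapping and BackendConfig as separate objects. QConfigMapping describes the user's intention of how they want their model to be quantized, and BackendConfig describes what kind of quantization a backend supports. BackendConfig is backend-specific, but QConfigMapping is not, so a user can provide a QConfigMapping that is incompatible with a specific BackendConfig, which is not a great UX. Ideally, we can structure this better by making both the configuration (QConfigMapping) and the quantization capability (BackendConfig) backend-specific, so there will be less confusion about incompatibilities.
4. In QConfig we expose observer/fake_quant observer classes as objects for the user to configure quantization, which increases the number of things a user needs to care about: not only the dtype, but also how the observation should happen. These details could be hidden from the user to make the user flow simpler.

Here is a summary of the benefits of the new API: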

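Programmability (addressing 1. and 2.): When a user's quantization needs are not covered by the available quantizers, users can build their own quantizer and compose it with other quantizers.
Simplified UX (addressing 3.): Provides a single instance with which both the backend and users interact. Thus you no longer need a user-facing quantization config mapping to map the user's intent plus a separate quantization config that backends interact with to configure what the backend supports. We will still have a method for users to query what is supported in a quantizer. With a single instance, composing different quantization capabilities also becomes more natural than before.

For example, XNNPACK does not support embedding_byte, and we have native support for this in ExecuTorch. Thus, if we had an ExecuTorchQuantizer that only quantizes embedding_byte, it could be composed with XNNPACKQuantizer. (Previously, this meant concatenating the two BackendConfig objects together, and since options in QConfigMapping are not backend-specific, users also needed to figure out by themselves how to specify configurations that match the quantization capabilities of the combined backend. With a single quantizer instance, we can compose two quantizers and query the composed quantizer for capabilities, which makes it less error prone and cleaner, for example, composed_quantizer.quantization_capabilities().)
Separation of concerns (addressing 4.): As we design the quantizer API, we also decouple the specification of quantization, as expressed in terms of dtype, min/max (number of bits), symmetry, and so on, from the observer concept. Currently, the observer captures both the quantization specification and how to observe (histogram vs. MinMax observer). With this change, modeling users no longer need to interact with observer and fake quant objects.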
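To make the composition idea concrete, here is a minimal, library-independent sketch. ToyQuantizer and ToyComposedQuantizer are hypothetical classes invented for this illustration and are not part of torch.ao.quantization; the rule that earlier quantizers take precedence is likewise an assumption of the sketch.

```python
# Hypothetical toy classes for illustration only, not the torch.ao.quantization API.
from dataclasses import dataclass, field


@dataclass
class ToyQuantizer:
    """Stands in for a backend-specific quantizer."""
    name: str
    supported_ops: set = field(default_factory=set)

    def quantization_capabilities(self):
        # Report which operator names this quantizer knows how to annotate.
        return set(self.supported_ops)

    def annotate(self, ops):
        # Map each supported op to the name of this quantizer.
        return {op: self.name for op in ops if op in self.supported_ops}


@dataclass
class ToyComposedQuantizer:
    """Applies several quantizers in order; earlier ones take precedence (assumption)."""
    quantizers: list

    def quantization_capabilities(self):
        caps = set()
        for q in self.quantizers:
            caps |= q.quantization_capabilities()
        return caps

    def annotate(self, ops):
        result = {}
        for q in self.quantizers:
            for op, owner in q.annotate(ops).items():
                result.setdefault(op, owner)  # keep the first annotation
        return result


embedding_q = ToyQuantizer("executorch", {"embedding_byte"})
xnnpack_q = ToyQuantizer("xnnpack", {"linear", "conv2d"})
composed = ToyComposedQuantizer([embedding_q, xnnpack_q])

capabilities = composed.quantization_capabilities()
annotations = composed.annotate(["embedding_byte", "linear", "softmax"])
print(sorted(capabilities))
print(annotations)
```

The point of the sketch is that a single composed object can both answer "what can be quantized?" and perform the annotation, so the user does not have to keep two separate configuration objects in sync.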

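Define Helper Functions and Prepare the Dataset

We'll start by doing the necessary imports, defining some helper functions, and preparing the data. These steps are identical to the ones in Static Quantization with Eager Mode in PyTorch.

To run the code in this tutorial using the entire ImageNet dataset, first download ImageNet by following the instructions in ImageNet Data. Unzip the downloaded file into the data_path folder.

Download the torchvision resnet18 model and rename it to data/resnet18_pretrained_float.pth.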

import os
import sys
import time
import numpy as np

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

import torchvision
from torchvision import datasets
from torchvision.models.resnet import resnet18
import torchvision.transforms as transforms

# Set up warnings
import warnings
warnings.filterwarnings(
    action='ignore',
    category=DeprecationWarning,
    module=r'.*'
)
warnings.filterwarnings(
    action='default',
    module=r'torch.ao.quantization'
)

# Specify random seed for repeatable results
_ = torch.manual_seed(191009)


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


def accuracy(output, target, topk=(1,)):
    """
    Computes the accuracy over the k top predictions for the specified
    values of k.
    """
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


def evaluate(model, criterion, data_loader):
    model.eval()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    cnt = 0
    with torch.no_grad():
        for image, target in data_loader:
            output = model(image)
            loss = criterion(output, target)
            cnt += 1
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            top1.update(acc1[0], image.size(0))
            top5.update(acc5[0], image.size(0))
    print('')

    return top1, top5

def load_model(model_file):
    model = resnet18(weights=None)  # `pretrained=False` is deprecated in recent torchvision
    state_dict = torch.load(model_file, weights_only=True)
    model.load_state_dict(state_dict)
    model.to("cpu")
    return model

def print_size_of_model(model):
    torch.save(model.state_dict(), "temp.p")
    print("Size (MB):", os.path.getsize("temp.p")/1e6)
    os.remove("temp.p")

def prepare_data_loaders(data_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    dataset = torchvision.datasets.ImageNet(
        data_path, split="train", transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))
    dataset_test = torchvision.datasets.ImageNet(
        data_path, split="val", transform=transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]))

    train_sampler = torch.utils.data.RandomSampler(dataset)
    test_sampler = torch.utils.data.SequentialSampler(dataset_test)

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=train_batch_size,
        sampler=train_sampler)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=eval_batch_size,
        sampler=test_sampler)

    return data_loader, data_loader_test

data_path = '~/.data/imagenet'
saved_model_dir = 'data/'
float_model_file = 'resnet18_pretrained_float.pth'

train_batch_size = 30
eval_batch_size = 50

data_loader, data_loader_test = prepare_data_loaders(data_path)
example_inputs = (next(iter(data_loader))[0],)  # trailing comma makes this a tuple
criterion = nn.CrossEntropyLoss()
float_model = load_model(saved_model_dir + float_model_file).to("cpu")
float_model.eval()

# create another instance of the model since
# we need to keep the original model around
model_to_quantize = load_model(saved_model_dir + float_model_file).to("cpu")
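The accuracy and AverageMeter helpers above compute top-k accuracy per batch and a running average weighted by batch size. The following pure-Python sketch shows the same arithmetic on made-up numbers; topk_accuracy is a hypothetical helper written for this illustration, not a replacement for the tensor-based version.

```python
def topk_accuracy(scores, targets, k):
    """Percentage of samples whose target is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores in this row.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += int(target in topk)
    return 100.0 * hits / len(targets)


# Three samples, four classes (made-up scores).
scores = [
    [0.1, 0.7, 0.1, 0.1],  # predicts class 1
    [0.4, 0.3, 0.2, 0.1],  # predicts class 0, class 1 is second
    [0.1, 0.2, 0.3, 0.4],  # predicts class 3
]
targets = [1, 1, 0]

top1 = topk_accuracy(scores, targets, k=1)
top2 = topk_accuracy(scores, targets, k=2)

# Running average weighted by batch size, as AverageMeter.update(val, n) does.
total, count = 0.0, 0
for batch_acc, batch_size in [(80.0, 30), (60.0, 10)]:
    total += batch_acc * batch_size
    count += batch_size
weighted_avg = total / count

print(top1, top2, weighted_avg)
```

Weighting by batch size matters when the last batch is smaller than the others; an unweighted mean of the two batch accuracies above would give 70.0 rather than 75.0.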

Set the Model to Eval Mode

For post training quantization, we'll need to set the model to eval mode.

model_to_quantize.eval()

Export the Model with torch.export

Here is how you can use torch.export to export the model:

example_inputs = (torch.rand(2, 3, 224, 224),)
# for pytorch 2.5+
exported_model = torch.export.export_for_training(model_to_quantize, example_inputs).module()

# for pytorch 2.4 and before
# from torch._export import capture_pre_autograd_graph
# exported_model = capture_pre_autograd_graph(model_to_quantize, example_inputs)

# or capture with dynamic dimensions
# for pytorch 2.5+
dynamic_shapes = tuple(
  {0: torch.export.Dim("dim")} if i == 0 else None
  for i in range(len(example_inputs))
)
exported_model = torch.export.export_for_training(model_to_quantize, example_inputs, dynamic_shapes=dynamic_shapes).module()

# for pytorch 2.4 and before
# dynamic_shape API may vary as well
# from torch._export import dynamic_dim
# exported_model = capture_pre_autograd_graph(model_to_quantize, example_inputs, constraints=[dynamic_dim(example_inputs[0], 0)])

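Import the Backend Specific Quantizer and Configure How to Quantize the Model

The following code snippet describes how to quantize the model: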

from torch.ao.quantization.quantizer.xnnpack_quantizer import (
  XNNPACKQuantizer,
  get_symmetric_quantization_config,
)
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
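As a rough, library-independent illustration of what "symmetric" quantization means in general: the zero point is fixed at 0 and the scale is chosen from the largest absolute value. This is a sketch under stated assumptions, not the XNNPACKQuantizer implementation; the [-127, 127] range and the helper names symmetric_qparams and quantize are choices made for this example.

```python
def symmetric_qparams(values, quant_max=127):
    """Scale for symmetric quantization; the zero point is always 0."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / quant_max if max_abs > 0 else 1.0
    return scale, 0


def quantize(values, scale, zero_point, quant_min=-127, quant_max=127):
    # Round to the nearest integer step, shift by the zero point, then clamp.
    return [
        max(quant_min, min(quant_max, round(v / scale) + zero_point))
        for v in values
    ]


weights = [-0.5, 0.0, 0.25, 1.27]  # made-up weight values
scale, zero_point = symmetric_qparams(weights)
q_weights = quantize(weights, scale, zero_point)
print(scale, zero_point, q_weights)
```

Because the zero point is 0, a float value of 0.0 always maps to the integer 0, which keeps the integer arithmetic for weights simple.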

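Quantizer is backend-specific, and each Quantizer provides its own way for users to configure their model. As an example, here are the different configuration APIs supported by XNNPACKQuantizer: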

# qconfig_opt is an optional quantization config
(
    quantizer.set_global(qconfig_opt)
    .set_object_type(torch.nn.Conv2d, qconfig_opt)  # can be a module type
    .set_object_type(torch.nn.functional.linear, qconfig_opt)  # or torch functional op
    .set_module_name("foo.bar", qconfig_opt)
)
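When global, per-type, and per-module settings are all present, some rule has to decide which one applies to a given operator. The sketch below assumes that the most specific setting wins (module name, then object type, then global); this precedence is an assumption made for illustration, so check the documentation of the quantizer you use. resolve_config and the string config names are hypothetical and are not part of torch.ao.quantization.

```python
# Hypothetical resolver for illustration only, not part of torch.ao.quantization.
def resolve_config(node, global_cfg=None, type_cfgs=None, name_cfgs=None):
    """Pick a config for a node, preferring the most specific setting."""
    type_cfgs = type_cfgs or {}
    name_cfgs = name_cfgs or {}
    if node["module_name"] in name_cfgs:
        return name_cfgs[node["module_name"]]
    if node["op_type"] in type_cfgs:
        return type_cfgs[node["op_type"]]
    return global_cfg


nodes = [
    {"module_name": "foo.bar", "op_type": "linear"},
    {"module_name": "foo.baz", "op_type": "linear"},
    {"module_name": "stem", "op_type": "conv2d"},
]
resolved = [
    resolve_config(
        n,
        global_cfg="int8_symmetric",
        type_cfgs={"linear": "int8_per_channel"},
        name_cfgs={"foo.bar": None},  # None is used here to mean "leave in float"
    )
    for n in nodes
]
print(resolved)
```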

Note

Check out our tutorial that describes how to write a new Quantizer.

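Prepare the Model for Post Training Quantization

prepare_pt2e folds BatchNorm operators into the preceding Conv2d operators and inserts observers at appropriate places in the model.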

prepared_model = prepare_pt2e(exported_model, quantizer)
print(prepared_model.graph)
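Folding BatchNorm into a preceding convolution works because, in eval mode, BatchNorm is a fixed per-channel affine transform. A minimal sketch of the arithmetic for one output channel follows, using a 1x1 convolution that reduces to a scalar multiply; fold_bn is a hypothetical helper and all numbers are made up.

```python
import math


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into the conv."""
    factor = gamma / math.sqrt(var + eps)
    return weight * factor, (bias - mean) * factor + beta


# One output channel of a 1x1 convolution reduces to y = weight * x + bias.
weight, bias = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0
eps = 1e-5

x = 2.0
conv_out = weight * x + bias
bn_out = gamma * (conv_out - mean) / math.sqrt(var + eps) + beta

folded_w, folded_b = fold_bn(weight, bias, gamma, beta, mean, var, eps)
folded_out = folded_w * x + folded_b
print(bn_out, folded_out)
```

After folding, the BatchNorm operator disappears from the graph, so observers only need to watch the output of the fused convolution.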

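Calibration

The calibration function is run after the observers are inserted in the model. The purpose of calibration is to run through some sample examples that are representative of the workload (for example, a sample of the training data set) so that the observers in the model can observe the statistics of the Tensors. We can later use this information to calculate quantization parameters.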

def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)
calibrate(prepared_model, data_loader_test)  # run calibration on sample data
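To illustrate how observed statistics can turn into quantization parameters, here is a minimal pure-Python sketch of a min/max style observer. MinMaxTracker is a hypothetical class written for this example; real observers may be more elaborate (for example, histogram-based), and the int8 range [-128, 127] is an assumption.

```python
class MinMaxTracker:
    """Tracks the running min and max of every value it sees."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, batch):
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))

    def calculate_qparams(self, quant_min=-128, quant_max=127):
        # Extend the range to include 0 so that 0.0 is exactly representable.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (quant_max - quant_min)
        if scale == 0:
            scale = 1.0  # degenerate case: every observed value was 0
        zero_point = quant_min - round(lo / scale)
        return scale, zero_point


tracker = MinMaxTracker()
for batch in ([0.0, 1.5, 3.0], [-1.0, 0.5, 4.1]):  # two calibration batches
    tracker.observe(batch)

scale, zero_point = tracker.calculate_qparams()
print(tracker.min_val, tracker.max_val, scale, zero_point)
```

With these parameters the observed minimum maps to quant_min and the observed maximum maps to quant_max, which is why calibration data should resemble the real workload.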

Convert the Calibrated Model to a Quantized Model

convert_pt2e takes a calibrated model and produces a quantized model.

quantized_model = convert_pt2e(prepared_model)
print(quantized_model)

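At this step, we currently have two representations that you can choose from, but the exact representation we offer in the long term might change based on feedback from PyTorch users.

Q/DQ Representation (default)

As in the previous documentation for representations, all quantized operators are represented as dequantize -> fp32_op -> quantize.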

def quantized_linear(x_int8, x_scale, x_zero_point, weight_int8, weight_scale, weight_zero_point, bias_fp32, output_scale, output_zero_point):
    # int8 bounds, assumed here to be the same for input, weight and output
    quant_min, quant_max = -128, 127
    x_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor(
        x_int8, x_scale, x_zero_point, quant_min, quant_max, torch.int8)
    weight_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor(
        weight_int8, weight_scale, weight_zero_point, quant_min, quant_max, torch.int8)
    weight_permuted = torch.ops.aten.permute_copy.default(weight_fp32, [1, 0])
    out_fp32 = torch.ops.aten.addmm.default(bias_fp32, x_fp32, weight_permuted)
    out_int8 = torch.ops.quantized_decomposed.quantize_per_tensor(
        out_fp32, output_scale, output_zero_point, quant_min, quant_max, torch.int8)
    return out_int8
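The same dequantize -> float op -> quantize pattern can be traced by hand on a tiny example. The following pure-Python sketch computes one output feature of a linear layer; the helper names and all numbers are made up for illustration, and the int8 range [-128, 127] is an assumption.

```python
def quantize(x, scale, zero_point, quant_min=-128, quant_max=127):
    # Round to the nearest integer step, shift by the zero point, then clamp.
    q = round(x / scale) + zero_point
    return max(quant_min, min(quant_max, q))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


def qdq_linear(x_q, x_scale, x_zp, w_q, w_scale, w_zp, bias, out_scale, out_zp):
    """dequantize -> float dot product plus bias -> quantize."""
    x_fp = [dequantize(q, x_scale, x_zp) for q in x_q]
    w_fp = [dequantize(q, w_scale, w_zp) for q in w_q]
    out_fp = sum(a * b for a, b in zip(x_fp, w_fp)) + bias
    return quantize(out_fp, out_scale, out_zp)


# Made-up quantized input, weights and parameters for one output feature.
x_q, x_scale, x_zp = [10, -20, 30], 0.5, 0
w_q, w_scale, w_zp = [4, 2, -6], 0.25, 0
bias, out_scale, out_zp = 1.0, 0.5, 3

out_q = qdq_linear(x_q, x_scale, x_zp, w_q, w_scale, w_zp, bias, out_scale, out_zp)
print(out_q)
```

Here the float result is 5*1 + (-10)*0.5 + 15*(-1.5) + 1 = -21.5, which quantizes to -21.5 / 0.5 + 3 = -40.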

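Reference Quantized Model Representation

We will have a special representation for selected ops, for example, quantized linear. Other ops are represented as dq -> float32_op -> q, and q/dq are decomposed into more primitive operators. You can get this representation by using convert_pt2e(..., use_reference_representation=True).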

# Reference quantized pattern for quantized linear (schematic; qmin and qmax stand for the int8 bounds)
def quantized_linear(x_int8, x_scale, x_zero_point, weight_int8, weight_scale, weight_zero_point, bias_fp32, output_scale, output_zero_point):
    x_int16 = x_int8.to(torch.int16)
    weight_int16 = weight_int8.to(torch.int16)
    acc_int32 = torch.ops.out_dtype(torch.mm, torch.int32, (x_int16 - x_zero_point), (weight_int16 - weight_zero_point))
    bias_scale = x_scale * weight_scale
    bias_int32 = out_dtype(torch.ops.aten.div.Tensor, torch.int32, bias_fp32, bias_scale)
    acc_int32 = acc_int32 + bias_int32
    acc_int32 = torch.ops.out_dtype(torch.ops.aten.mul.Scalar, torch.int32, acc_int32, x_scale * weight_scale / output_scale) + output_zero_point
    out_int8 = torch.ops.aten.clamp(acc_int32, qmin, qmax).to(torch.int8)
    return out_int8

See here for the most up-to-date reference representations.
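The integer-only idea behind the reference representation can be checked by hand. The sketch below accumulates in integers, converts the bias to accumulator units, and rescales once at the end; it then compares the result with the float computation on the same made-up numbers. reference_linear is a hypothetical helper, and the int8 range [-128, 127] is an assumption.

```python
def reference_linear(x_q, x_scale, x_zp, w_q, w_scale, w_zp, bias,
                     out_scale, out_zp, quant_min=-128, quant_max=127):
    """Integer accumulation followed by a single rescale to the output grid."""
    # Integer accumulator: sum of (x - x_zp) * (w - w_zp).
    acc = sum((a - x_zp) * (b - w_zp) for a, b in zip(x_q, w_q))
    # Bias expressed in accumulator units, whose scale is x_scale * w_scale.
    bias_scale = x_scale * w_scale
    acc += round(bias / bias_scale)
    # Rescale the accumulator to the output scale, then shift and clamp.
    out = round(acc * (bias_scale / out_scale)) + out_zp
    return max(quant_min, min(quant_max, out))


# Made-up quantized input, weights and parameters for one output feature.
x_q, x_scale, x_zp = [10, -20, 30], 0.5, 0
w_q, w_scale, w_zp = [4, 2, -6], 0.25, 0
bias, out_scale, out_zp = 1.0, 0.5, 3

out_q = reference_linear(x_q, x_scale, x_zp, w_q, w_scale, w_zp, bias, out_scale, out_zp)

# Float reference: dequantize everything, compute in float, quantize the result.
out_fp = sum(((a - x_zp) * x_scale) * ((b - w_zp) * w_scale)
             for a, b in zip(x_q, w_q)) + bias
out_q_float_path = round(out_fp / out_scale) + out_zp
print(out_q, out_q_float_path)
```

In this example both paths agree exactly. In general the two can differ by a rounding step, because the integer path rounds the bias before accumulation.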

Checking Model Size and Accuracy Evaluation

Now we can compare the size and model accuracy with the baseline model.

# Baseline model size and accuracy
print("Size of baseline model")
print_size_of_model(float_model)

top1, top5 = evaluate(float_model, criterion, data_loader_test)
print("Baseline Float Model Evaluation accuracy: %2.2f, %2.2f"%(top1.avg, top5.avg))

# Quantized model size and accuracy
print("Size of model after quantization")
# export again to remove unused weights
quantized_model = torch.export.export_for_training(quantized_model, example_inputs).module()
print_size_of_model(quantized_model)

top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("[before serialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

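Note

We can't do performance evaluation now since the model is not yet lowered to the target device; it is just a representation of quantized computation in ATen operators.

Note

The weights are still in fp32 right now. We may do constant propagation for quantize ops to get integer weights in the future.

If you want to get better accuracy or performance, try configuring the quantizer in different ways. Each quantizer has its own way of configuration, so please consult the documentation for the quantizer you are using to learn more about how you can have more control over how to quantize a model.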

Save and Load Quantized Model

We'll show how to save and load the quantized model.

# 0. Store reference output for the example inputs and check evaluation accuracy:
example_inputs = (next(iter(data_loader))[0],)
ref = quantized_model(*example_inputs)
top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("[before serialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

# 1. Export the model and Save ExportedProgram
pt2e_quantized_model_file_path = saved_model_dir + "resnet18_pt2e_quantized.pth"
# capture the model to get an ExportedProgram
quantized_ep = torch.export.export(quantized_model, example_inputs)
# use torch.export.save to save an ExportedProgram
torch.export.save(quantized_ep, pt2e_quantized_model_file_path)


# 2. Load the saved ExportedProgram
loaded_quantized_ep = torch.export.load(pt2e_quantized_model_file_path)
loaded_quantized_model = loaded_quantized_ep.module()

# 3. Check results for example inputs and check evaluation accuracy again:
res = loaded_quantized_model(*example_inputs)
print("diff:", ref - res)

top1, top5 = evaluate(loaded_quantized_model, criterion, data_loader_test)
print("[after serialization/deserialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

Output:

[before serialization] Evaluation accuracy on test dataset: 79.82, 94.55
diff: tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

[after serialization/deserialization] Evaluation accuracy on test dataset: 79.82, 94.55

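Debugging the Quantized Model

You can use Numeric Suite to help with debugging in eager mode and FX graph mode. The new version of Numeric Suite that works with PyTorch 2 Export models is still in development.

Lowering and Performance Evaluation

The model produced at this point is not the final model that runs on the device. It is a reference quantized model that captures the intended quantized computation from the user, expressed as ATen operators and some additional quantize/dequantize operators. To get a model that runs on real devices, we need to lower the model. For example, for models that run on edge devices, we can lower with delegation and ExecuTorch runtime operators.

Conclusion

In this tutorial, we went through the overall quantization flow in PyTorch 2 Export Quantization using XNNPACKQuantizer and got a quantized model that can be further lowered to a backend that supports inference with the XNNPACK backend. To use this for your own backend, please first follow the tutorial and implement a Quantizer for your backend, and then quantize the model with that Quantizer.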