PyTorch Quantization

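Key advantages of ModelOpt's PyTorch quantization:

Support for advanced quantization formats, e.g., block-wise INT4 and FP8.
Native support for LLM models in Hugging Face and NeMo.
Advanced quantization algorithms, e.g., SmoothQuant and AWQ.
Deployment support for ONNX and NVIDIA TensorRT.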

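Note

ModelOpt quantization is fake quantization, which means it only simulates low-precision computation in PyTorch. Real speedup and memory savings should be achieved by exporting the model to a deployment framework.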
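
To make the idea of fake quantization concrete, here is a minimal sketch in plain Python. It is not ModelOpt code: the function name `fake_quantize` and the symmetric per-tensor INT8 scheme are assumptions chosen for illustration. Values are rounded onto an integer grid and immediately converted back to floats, so the result still lives in floating point but carries the rounding error of the low-precision format.

```python
def fake_quantize(values, num_bits=8):
    """Quantize to a symmetric integer grid, then dequantize back to float.

    Hypothetical helper for illustration only; not part of ModelOpt.
    """
    qmax = 2 ** (num_bits - 1) - 1           # 127 for INT8
    amax = max(abs(v) for v in values)       # per-tensor absolute maximum
    if amax == 0:
        return list(values)
    scale = amax / qmax                      # float step between grid points
    out = []
    for v in values:
        q = round(v / scale)                 # integer code
        q = max(-qmax, min(qmax, q))         # clamp to the representable range
        out.append(q * scale)                # back to float: "fake" quantized
    return out


x = [0.1, -0.5, 1.27, 0.004]
x_fq = fake_quantize(x)
print(x_fq)
```

No integer kernels run here, which is why no speedup or memory saving appears until the model is exported to a deployment framework.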

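Tip

This guide covers the usage of ModelOpt quantization. For details on the quantization formats and recommended use cases, see Quantization Formats.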

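Apply Post-Training Quantization (PTQ)

PTQ can be achieved with a simple calibration on a small set of training or evaluation data (typically 128-512 samples) after converting a regular PyTorch model to a quantized model. The simplest way to quantize a model with ModelOpt is to use mtq.quantize().

mtq.quantize() takes a model, a quantization config, and a forward loop callable as input. The quantization config specifies which layers to quantize, their quantization formats, and the algorithm to use for calibration. See Quantization Configs for the list of quantization configs supported by default. You can also define your own quantization config as described in Customize Quantizer Config.

ModelOpt supports calibration algorithms such as AWQ, SmoothQuant, and max. See mtq.calibrate for more details.

The forward loop passes data through the model sequentially to collect statistics for calibration. It should wrap the calibration data loader and the model.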
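
The role of the forward loop can be illustrated without ModelOpt. The sketch below assumes that "max" calibration means tracking the largest absolute value seen during the calibration passes; `MaxCalibrator`, `toy_model`, and `calib_batches` are hypothetical names invented for this example. The forward loop only feeds batches through the model, and the statistics are collected as a side effect of each forward pass.

```python
class MaxCalibrator:
    """Tracks the running absolute maximum of everything it observes."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, values):
        self.amax = max(self.amax, max(abs(v) for v in values))


def toy_model(batch, calibrator):
    # A stand-in "layer": record input statistics, then compute an output.
    calibrator.observe(batch)
    return [2.0 * v for v in batch]


calib_batches = [[0.5, -1.0], [3.0, 0.25], [-2.0, 1.5]]
calibrator = MaxCalibrator()


def forward_loop(model):
    # Wraps the calibration data; outputs are discarded on purpose.
    for batch in calib_batches:
        model(batch)


forward_loop(lambda batch: toy_model(batch, calibrator))

# After calibration, the collected maximum determines the INT8 scale.
scale = calibrator.amax / 127
print(calibrator.amax, scale)
```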

Here is an example of performing PTQ with ModelOpt:

import modelopt.torch.quantization as mtq

# Setup the model
model = get_model()

# Select quantization config
config = mtq.INT8_SMOOTHQUANT_CFG

# Quantization needs calibration data. Set up the calibration data loader
# An example of creating a calibration data loader looks like the following:
data_loader = get_dataloader(num_samples=calib_size)


# Define forward_loop. Please wrap the data loader in the forward_loop
def forward_loop(model):
    for batch in data_loader:
        model(batch)


# Quantize the model and perform calibration (PTQ)
model = mtq.quantize(model, config, forward_loop)

To verify that the quantizer nodes are placed correctly in the model, print a summary of the quantized model as shown below:

# Print quantization summary after successfully quantizing the model with mtq.quantize
# This will show the quantizers inserted in the model and their configurations
mtq.print_quant_summary(model)

After PTQ, the model can be exported to ONNX with the normal PyTorch ONNX export flow.

torch.onnx.export(model, sample_input, onnx_file)

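ModelOpt also supports direct export of Hugging Face or NeMo LLM models to TensorRT-LLM for deployment. See TensorRT-LLM Deployment for more details.

Quantization-Aware Training (QAT)

QAT is a technique for fine-tuning a quantized model to recover the model quality lost to quantization. QAT requires more compute than PTQ, but it is highly effective at recovering model quality.

A model quantized with mtq.quantize() can be fine-tuned directly with QAT. Typically during QAT, the quantizer states are frozen and the model weights are fine-tuned.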
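
The split between frozen quantizer state and trainable weights can be sketched with a single scalar weight. This is an illustration only: it assumes gradients pass through the rounding step unchanged (the straight-through estimator, a common choice for QAT that this guide does not confirm ModelOpt uses), and `fake_quant`, the learning rate, and the data are made up for the example.

```python
def fake_quant(w, scale, qmax=127):
    """Round w onto the grid defined by a fixed scale (hypothetical helper)."""
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


scale = 0.05             # quantizer state: calibrated once, then frozen
w = 0.30                 # model weight: the only thing being trained
x, y_target = 2.0, 1.0   # single training sample
lr = 0.05

for step in range(200):
    y = fake_quant(w, scale) * x         # forward pass uses the quantized weight
    grad_y = 2.0 * (y - y_target)        # d(loss)/dy for squared error
    grad_w = grad_y * x                  # assumption: d(fake_quant)/dw treated as 1
    w -= lr * grad_w                     # update the full-precision weight

print(w, fake_quant(w, scale) * x)
```

The full-precision weight keeps moving until its quantized value produces the target output, while `scale` never changes.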

Here is an example of performing QAT:

import modelopt.torch.quantization as mtq

# Select quantization config
config = mtq.INT8_DEFAULT_CFG


# Define forward loop for calibration
def forward_loop(model):
    for data in calib_set:
        model(data)


# Replace regular modules with quantized modules and calibrate; QAT follows
model = mtq.quantize(model, config, forward_loop)

# Fine-tune with original training pipeline
# Adjust learning rate and training duration
train(model, train_loader, optimizer, scheduler, ...)

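Tip

We recommend running QAT for about 10% of the original training epochs. For LLMs, we find that QAT fine-tuning for even less than 1% of the original pre-training duration is often sufficient to recover model quality.

Storing and Restoring Quantized Models

The model weights and quantizer states need to be saved for future use or to resume training. See Saving & Restoring ModelOpt-Modified Models to learn how to save and restore a quantized model.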

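Optimal Partial Quantization Using AutoQuantize (`auto_quantize`)

auto_quantize, or AutoQuantize, is a PTQ algorithm from ModelOpt that quantizes a model by searching for the best quantization format per layer while meeting user-specified performance constraints. AutoQuantize lets you trade off model accuracy against performance. See auto_quantize for more details on API usage.

Currently, AutoQuantize supports only effective_bits as the performance constraint (for both weight-only quantization and weight-and-activation quantization). The effective_bits constraint specifies the effective number of bits of the quantized model.

You can specify an effective_bits constraint such as 8.8 for partial quantization with FP8_DEFAULT_CFG. AutoQuantize will skip quantizing the layers most sensitive to quantization so that the final partially quantized model has 8.8 effective bits. Because the highly quantization-sensitive layers are left unquantized, this model will be more accurate than one quantized with the default configuration.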
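
As a rough worked example of where a number like 8.8 can come from, the sketch below assumes that effective bits is the parameter-weighted average of per-layer bit widths and that unquantized layers count as 16 bits. Neither assumption is spelled out in this guide, and the layer sizes are invented.

```python
def effective_bits(layers):
    """Parameter-weighted average bit width.

    layers: list of (num_params, bits_per_weight) tuples.
    Hypothetical helper; the exact definition used by ModelOpt may differ.
    """
    total_params = sum(n for n, _ in layers)
    total_bits = sum(n * b for n, b in layers)
    return total_bits / total_params


layers = [
    (9_000_000, 8),    # 90% of parameters quantized to an 8-bit format
    (1_000_000, 16),   # 10% of parameters skipped and kept at 16 bits
]
print(effective_bits(layers))
```

Under these assumptions, skipping the most sensitive 10% of parameters raises the average from 8 bits to 8.8 bits.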

Here is how to perform AutoQuantize:

import modelopt.torch.quantization as mtq
import modelopt.torch.opt as mto

# Define the model & calibration dataloader
model = ...
calib_dataloader = ...

# Define forward_step function.
# forward_step should take the model and data as input and return the output
def forward_step(model, data):
    output = model(data)
    return output

# Define loss function which takes the model output and data as input and returns the loss
def loss_func(output, data):
    loss = ...
    return loss


# Perform AutoQuantize
model, search_state_dict = mtq.auto_quantize(
    model,
    constraints = {"effective_bits": 4.8},
    # supported quantization formats are listed in `modelopt.torch.quantization.config.choices`
    quantization_formats=["W4A8_AWQ_BETA_CFG", "FP8_DEFAULT_CFG", None],
    data_loader=calib_dataloader,
    forward_step=forward_step,
    loss_func=loss_func,
    # ... (remaining arguments elided)
    )

# Save the searched model for future use
mto.save(model, "auto_quantize_model.pt")

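Advanced Topics

TensorQuantizer

Under the hood, ModelOpt mtq.quantize() inserts TensorQuantizer (quantizer modules) into model layers such as linear and convolution layers, and patches their forward methods to perform quantization.

The quantization parameters are described in QuantizerAttributeConfig. They can be set at initialization by passing a QuantizerAttributeConfig, or later by calling TensorQuantizer.set_from_attribute_config(). If the quantization parameters are not set explicitly, the quantizer uses the default values.

Here is an example of creating a quantizer module: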

from modelopt.torch.quantization.config import QuantizerAttributeConfig
from modelopt.torch.quantization.nn import TensorQuantizer

# Create quantizer module with default quantization parameters
quantizer = TensorQuantizer()

quant_x = quantizer(x)  # Quantize input x

# Create quantizer module with custom quantization parameters
# Example setting for INT4 block-wise quantization
quantizer_custom = TensorQuantizer(QuantizerAttributeConfig(num_bits=4, block_sizes={-1: 128}))

# Quantize input with custom quantization parameters
quant_x = quantizer_custom(x)  # Quantize input x
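
To show why block-wise scaling matters at 4 bits, here is a plain-Python sketch that is independent of ModelOpt. It assumes a block size along the last dimension means each contiguous group of elements gets its own scale; `blockwise_fake_quantize` is a hypothetical helper, and a block size of 4 is used instead of 128 to keep the data readable.

```python
def blockwise_fake_quantize(row, block_size, num_bits=4):
    """Fake-quantize a 1-D row with one symmetric scale per block."""
    qmax = 2 ** (num_bits - 1) - 1                 # 7 for INT4
    out = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(v) for v in block)          # per-block absolute maximum
        scale = amax / qmax if amax > 0 else 1.0
        for v in block:
            q = max(-qmax, min(qmax, round(v / scale)))
            out.append(q * scale)
    return out


# Small values in the first block, large values in the second.
row = [0.01, -0.02, 0.03, 0.07, 1.0, -3.0, 5.0, 7.0]

per_block = blockwise_fake_quantize(row, block_size=4)
per_tensor = blockwise_fake_quantize(row, block_size=len(row))
print(per_block)
print(per_tensor)
```

With a single scale for the whole row, the small values all collapse to zero; with per-block scales they are preserved.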

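Customize Quantizer Config

ModelOpt inserts input quantizers, weight quantizers, and output quantizers into common layers, but the output quantizers are disabled by default. Advanced users who want to customize the default quantizer configuration can update the config dictionary passed to mtq.quantize using wildcard or filter-function matching.

Here is an example of passing a customized quantizer config to mtq.quantize: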

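import copy

import modelopt.torch.quantization as mtq

# Select quantization config (deep copy so the library default is not mutated)
config = copy.deepcopy(mtq.INT8_DEFAULT_CFG)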
config["quant_cfg"]["*.bmm.output_quantizer"] = {
    "enable": True
}  # Enable output quantizer for bmm layer

# Perform PTQ/QAT
model = mtq.quantize(model, config, forward_loop)
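
The two matching styles can be tried out on plain strings. The sketch below assumes shell-style wildcard semantics as implemented by Python's fnmatch module; the quantizer names are hypothetical and do not come from a real model.

```python
from fnmatch import fnmatchcase

# Hypothetical fully qualified quantizer names inside a model.
quantizer_names = [
    "layers.0.attn.bmm.output_quantizer",
    "layers.0.attn.bmm.input_quantizer",
    "layers.0.mlp.fc1.output_quantizer",
    "layers.0.mlp.fc1.weight_quantizer",
]

# Wildcard matching: "*" matches any run of characters, including dots.
pattern = "*.bmm.output_quantizer"
matched = [n for n in quantizer_names if fnmatchcase(n, pattern)]


# Filter-function matching: any callable from name to bool.
def is_weight_quantizer(name):
    return name.endswith("weight_quantizer")


weight_matches = [n for n in quantizer_names if is_weight_quantizer(n)]

print(matched)
print(weight_matches)
```

Only the bmm output quantizer matches the pattern, so the other output quantizers stay disabled.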

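Custom Quantized Modules and Quantizer Placement

modelopt.torch.quantization has a default set of quantized modules (see modelopt.torch.quantization.nn.modules for the detailed list) and quantizer placement rules (input, output, and weight quantizers). In some cases, however, you may want to define a custom quantized module and/or customize the quantizer placement.

ModelOpt provides a way to define a custom quantized module and register it with the quantization framework. This allows you to:

Handle unsupported modules, e.g., a subclassed linear layer that needs quantization.
Customize the quantizer placement, e.g., placing a quantizer in a special location such as the KV cache of an attention layer.

Here is an example of defining a custom quantized LayerNorm module: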

import torch.nn as nn
import torch.nn.functional as F

from modelopt.torch.quantization.nn import TensorQuantizer


class QuantLayerNorm(nn.LayerNorm):
    def __init__(self, normalized_shape):
        super().__init__(normalized_shape)
        self._setup()

    def _setup(self):
        # Method to setup the quantizers
        self.input_quantizer = TensorQuantizer()
        self.weight_quantizer = TensorQuantizer()

    def forward(self, input):
        # You can customize the quantizer placement anywhere in the forward method
        input = self.input_quantizer(input)
        weight = self.weight_quantizer(self.weight)
        return F.layer_norm(input, self.normalized_shape, weight, self.bias, self.eps)

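After defining the custom quantized module, you need to register it so that the mtq.quantize API can automatically replace the original module with the quantized version. Note that the custom QuantLayerNorm must have a _setup method that instantiates the quantizer attributes called in the forward method. Here is the code to register the custom quantized module: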

import modelopt.torch.quantization as mtq

# Register the custom quantized module
mtq.register(original_cls=nn.LayerNorm, quantized_cls=QuantLayerNorm)

# Perform PTQ
# nn.LayerNorm modules in the model will be replaced with the QuantLayerNorm module
model = mtq.quantize(model, config, forward_loop)

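If you define a custom quantized module, you may also need to customize the quantization config. See Customize Quantizer Config for more details.

Fast Evaluation

Weight folding avoids re-quantizing the weights in every inference forward pass, which speeds up evaluation. It can be done with the following code: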

# Fold quantizer together with weight tensor
mtq.fold_weight(quantized_model)

# Run model evaluation
user_evaluate_func(quantized_model)

Note

After weight folding, the model can no longer be exported to ONNX or fine-tuned with QAT.
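
The effect of weight folding can be sketched with a toy layer in plain Python. `ToyQuantLinear` and `fake_quantize` are hypothetical stand-ins, not ModelOpt classes. Before folding, every forward pass re-quantizes the weight; after folding, the quantized weight is stored once and reused, and the outputs are identical.

```python
def fake_quantize(values, scale, qmax=127):
    """Round each value onto the grid defined by scale, then dequantize."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


class ToyQuantLinear:
    """A dot-product 'layer' with a weight quantizer (illustration only)."""

    def __init__(self, weight, scale):
        self.weight = list(weight)
        self.scale = scale
        self.folded = False
        self.quantize_calls = 0

    def fold_weight(self):
        # Overwrite the weight with its quantized version, once.
        self.weight = fake_quantize(self.weight, self.scale)
        self.folded = True

    def forward(self, x):
        if self.folded:
            w = self.weight                          # already quantized
        else:
            self.quantize_calls += 1
            w = fake_quantize(self.weight, self.scale)
        return sum(wi * xi for wi, xi in zip(w, x))


layer = ToyQuantLinear(weight=[0.123, -0.456, 0.789], scale=0.01)
x = [1.0, 2.0, 3.0]

before = [layer.forward(x) for _ in range(3)]
calls_before = layer.quantize_calls
layer.fold_weight()
after = [layer.forward(x) for _ in range(3)]

print(before == after, calls_before, layer.quantize_calls)
```

In this sketch, folding discards the original full-precision weight, which is consistent with the restriction on further QAT noted above.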

Migrate from pytorch_quantization

ModelOpt PyTorch quantization is refactored and extended from pytorch_quantization.

Previous users of pytorch_quantization can migrate to modelopt.torch.quantization simply by replacing the import statements.