Model Quantization

User-facing quantization API.

Functions

calibrate

Adjusts weights and scaling factors based on the selected algorithm.

postprocess_amax

Experimental API to postprocess the amax values after calibration.

quantize

Quantizes and calibrates the model.

auto_quantize

API for AutoQuantize which quantizes the model by searching for the best quantization format per-layer.

disable_quantizer

Disable quantizers by wildcard or filter function.

enable_quantizer

Enable quantizers by wildcard or filter function.

print_quant_summary

Print a summary of all quantizer modules in the model.

fold_weight

Fold weight quantizers for fast evaluation.

auto_quantize(model, constraints={'effective_bits': 4.8}, quantization_formats=['W4A8_AWQ_BETA_CFG', 'FP8_DEFAULT_CFG', None], data_loader=None, forward_step=None, loss_func=None, forward_backward_step=None, num_calib_steps=512, num_score_steps=128, verbose=False)

API for AutoQuantize which quantizes the model by searching for the best quantization format per-layer.

auto_quantize uses a gradient-based sensitivity score to rank the per-layer quantization formats and search for the best quantization format per-layer.

Parameters:
  • model (Module) – A pytorch model with quantizer modules.

  • constraints (Dict[str, float | str]) –

    Constraints for the search. Currently we support only effective_bits. effective_bits specifies the effective number of bits for the quantized model.

    Here is an example of a valid effective_bits constraint:

    # For an effective quantization bits of 4.8
    constraints = {"effective_bits": 4.8}
    

  • quantization_formats (List[str | None]) –

    A list of the string names of the quantization formats to search for. The supported quantization formats are as listed by modelopt.torch.quantization.config.choices.

    In addition, the quantization format can also be None which implies skipping quantization for the layer.

    Note

    The quantization formats will be applied on a per-layer match basis. The global model level name based quantizer attribute setting will be ignored. For example, in FP8_DEFAULT_CFG quantizer configuration the key "*lm_head*": {"enable": False} disables quantization for the lm_head layer. However in auto_quantize, the quantization format for the lm_head layer will be searched. This is because the key "*lm_head*" sets the quantizer attributes based on the global model level name, not per-layer basis. The keys "*input_quantizer", "*weight_quantizer" etc. in FP8_DEFAULT_CFG match on a per-layer basis - hence the corresponding quantizers will be set as specified.

    Here is an example quantization_formats argument:

    # A valid `quantization_formats` argument
    # This will search for the best per-layer quantization from FP8, W4A8_AWQ or No quantization
    quantization_formats = ["FP8_DEFAULT_CFG", "W4A8_AWQ", None]
    

  • data_loader (Iterable) – An iterator yielding data for calibrating the quantized layers and estimating the auto_quantize scores.

  • forward_step (Callable[[Module, Any], Any | Tensor]) –

    A callable that takes the model and a batch of data from the data_loader as input, forwards the data through the model and returns the model output. This is a required argument.

    Here is an example of a valid forward_step:

    # Takes the model and a batch of data as input and returns the model output
    def forward_step(model, batch) -> torch.Tensor:
        output = model(batch)
        return output
    

  • loss_func (Callable[[Any, Any], Tensor]) –

    (Optional) A callable that takes the model output and the batch of data as input and computes the loss. The model output is the output given by forward_step. .backward() will be called on the loss.

    Here is an example of a valid loss_func:

    # Takes the model output and a batch of data as input and returns the loss
    def loss_func(output, batch) -> torch.Tensor:
        ...
        return loss
    
    
    # The loss should be a scalar tensor so that loss.backward() can be called
    loss = loss_func(output, batch)
    loss.backward()
    

    If this argument is not provided, forward_backward_step should be provided.

  • forward_backward_step (Callable[[Module, Any], Any] | None) –

    (Optional) A callable that takes the model and a batch of data from the data_loader as input, forwards the data through the model, computes the loss and runs backward on the loss.

    Here is an example of a valid forward_backward_step argument:

    # Takes the model and a batch of data as input and runs the forward and backward pass
    def forward_backward_step(model, batch) -> None:
        output = model(batch)
        loss = my_loss_func(output, batch)
        run_custom_backward(loss)
    

    If this argument is not provided, loss_func should be provided.

  • num_calib_steps (int) – Number of batches to use for calibrating the quantized model. Suggested value is 512.

  • num_score_steps (int) – Number of batches to use for estimating the auto_quantize scores. Suggested value is 128. A higher value could increase the time taken to perform auto_quantize.

  • verbose (bool) – If True, prints the search progress/intermediate results.

Returns: A tuple (model, state_dict) where model is the searched and quantized model and

state_dict contains the history and detailed statistics of the search procedure.

Note

auto_quantize groups certain layers and restricts them to have the same quantization format. For example, the Q, K, V linear layers belonging to the same transformer layer will share the same quantization format. This is to ensure compatibility with TensorRT-LLM, which fuses these three linear layers into a single linear layer.

The list of regular expression pattern rules defined in rules is used to specify the layer groups. The first captured group in a regex pattern (i.e. pattern.match(name).group(1)) is used to group the layers. All layers sharing the same first captured group will have the same quantization format.

For example, the rule r"^(.*?)\.(q_proj|k_proj|v_proj)$" groups the q_proj, k_proj and v_proj linear layers belonging to the same transformer layer.

You can modify the rules to group the layers as needed.

from modelopt.torch.quantization.algorithms import AutoQuantizeSearcher

# To additionally group the layers belonging to same `mlp` layer,
# add the following rule
AutoQuantizeSearcher.rules.append(r"^(.*?)\.mlp")

# Perform `auto_quantize`
model, state_dict = auto_quantize(model, ...)

Note

The auto_quantize API and algorithm are experimental and subject to change. auto_quantize searched models may not be readily deployable to TensorRT-LLM yet.
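
For reference, here is a minimal end-to-end sketch of calling auto_quantize based on the parameters described above. The names model, calib_dataloader and compute_loss are placeholders for your own model, calibration data and loss computation, and the import alias mtq is an assumption:

import modelopt.torch.quantization as mtq

# Forward a batch through the model and return the model output
def forward_step(model, batch):
    return model(batch)

# Compute a scalar loss from the model output so that loss.backward() can be called
def loss_func(output, batch):
    return compute_loss(output, batch)  # placeholder loss computation

model, state_dict = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},
    quantization_formats=["FP8_DEFAULT_CFG", "W4A8_AWQ_BETA_CFG", None],
    data_loader=calib_dataloader,
    forward_step=forward_step,
    loss_func=loss_func,
    num_calib_steps=512,
    num_score_steps=128,
    verbose=True,
)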

calibrate(model, algorithm='max', forward_loop=None)

Adjusts weights and scaling factors based on the selected algorithm.

Parameters:
  • model (Module) – A pytorch model with quantizer modules.

  • algorithm (str | MaxCalibConfig | SmoothQuantCalibConfig | AWQLiteCalibConfig | AWQClipCalibConfig | AWQFullCalibConfig | RealQuantizeConfig | None) – A string or dictionary specifying the calibration algorithm to use. Supported algorithms are "max", "smoothquant", "awq_lite", "awq_full", and "awq_clip". If a dictionary is passed, the key "method" should specify the calibration algorithm to use. Other key-value pairs in this dictionary will be passed as kwargs to the algorithm. An example dictionary argument: {"method": "awq_clip", "max_co_batch_size": 4096}. If None, no calibration is performed. For real quantization, the key method should be real_quantize, and the calibration algorithm used should be specified in additional_algorithm.

  • forward_loop (Callable[[Module], None] | None) – A callable that takes the model as an argument and forwards calibration data through the model. This is not required for weight-only quantization with the "max" algorithm.

Return type:

Module

Returns: The calibrated pytorch model.
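
A short usage sketch: calib_dataloader is a placeholder for your calibration data and the import alias mtq is an assumption. The string and dictionary forms of algorithm follow the description above.

import modelopt.torch.quantization as mtq

# Forward calibration data through the model; no return value is needed
def forward_loop(model):
    for batch in calib_dataloader:
        model(batch)

# String algorithm with default settings
model = mtq.calibrate(model, algorithm="max", forward_loop=forward_loop)

# Dictionary algorithm: extra keys are passed as kwargs to the calibration algorithm
model = mtq.calibrate(
    model,
    algorithm={"method": "awq_clip", "max_co_batch_size": 4096},
    forward_loop=forward_loop,
)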

disable_quantizer(model, wildcard_or_filter_func)

Disable quantizers by wildcard or filter function.

Parameters:
  • model (Module) –

  • wildcard_or_filter_func (str | Callable) –

enable_quantizer(model, wildcard_or_filter_func)

Enable quantizers by wildcard or filter function.

Parameters:
  • model (Module) –

  • wildcard_or_filter_func (str | Callable) –
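
A short sketch of both calls. The wildcard is matched against quantizer module names; the filter-function form shown here assumes the callable receives the quantizer module name, and the import alias mtq is also an assumption:

import modelopt.torch.quantization as mtq

# Disable quantizers by wildcard, e.g. all quantizers under lm_head
mtq.disable_quantizer(model, "*lm_head*")

# Disable quantizers by filter function
mtq.disable_quantizer(model, lambda name: "input_quantizer" in name)

# Re-enable them with the same kind of wildcard or filter function
mtq.enable_quantizer(model, "*lm_head*")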

fold_weight(model)

Fold weight quantizers for fast evaluation.

Parameters:

model (Module) –

postprocess_amax(model, key, post_process_fn)

Experimental API to postprocess the amax values after calibration.

Parameters:
  • model (Module) –

  • key (str) –

Return type:

Module
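
A short sketch, assuming key is a wildcard matched against quantizer module names and post_process_fn receives the calibrated amax tensor and returns the post-processed value (the import alias mtq is an assumption):

import torch
import modelopt.torch.quantization as mtq

# Clamp the calibrated amax of all input quantizers to a minimum value
model = mtq.postprocess_amax(model, "*input_quantizer", lambda amax: torch.clamp(amax, min=0.01))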

print_quant_summary(model)

Print a summary of all quantizer modules in the model.

Parameters:

model (Module) –

quantize(model, config, forward_loop=None)

Quantizes and calibrates the model.

This method replaces modules with their quantized counterparts and performs calibration as specified by quant_cfg. forward_loop is used to forward data through the model and gather calibration statistics.

Parameters:
  • model (Module) – A pytorch model

  • config (Dict[str, Any]) –

    A dictionary or an instance of QuantizeConfig specifying the values for the keys "quant_cfg" and "algorithm". The "quant_cfg" key specifies the quantization configurations. The "algorithm" key specifies the algorithm argument to calibrate.

    The quantization configuration is a dictionary mapping wildcards or filter functions to quantizer attributes. The wildcards or filter functions are matched against the quantizer module names. The quantizer modules have names ending with weight_quantizer and input_quantizer and they perform weight quantization and input quantization (or activation quantization) respectively. The quantizer modules are instances of TensorQuantizer. The quantizer attributes are defined by QuantizerAttributeConfig. See QuantizerAttributeConfig for details on the quantizer attributes and their values.

    An example config dictionary is given below:
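
    The following is a sketch only: the wildcard patterns and attribute values are illustrative, and QuantizerAttributeConfig defines the full set of supported attributes.

    config = {
        "quant_cfg": {
            # Per-channel INT8 weight quantization and per-tensor INT8 input quantization
            "*weight_quantizer": {"num_bits": 8, "axis": 0},
            "*input_quantizer": {"num_bits": 8, "axis": None},
            # Skip quantization for the lm_head layer
            "*lm_head*": {"enable": False},
            "default": {"enable": False},
        },
        "algorithm": "max",
    }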

    See Quantization Formats to learn more about the supported quantization formats. See Quantization Configs for more details on the config dictionary.

  • forward_loop (Callable[[Module], None] | None) –

    A callable that forwards all calibration data through the model. This is used to gather statistics for calibration. It should take model as the argument. It does not need to return anything.

    This argument is not required for weight-only quantization with the "max" algorithm.

    Here are a few examples of correct forward_loop definitions:

    Example 1:

    def forward_loop(model) -> None:
        # iterate over the data loader and forward data through the model
        for batch in data_loader:
            model(batch)
    

    Example 2:

    def forward_loop(model) -> float:
        # evaluate the model on the task
        return evaluate(model, task, ....)
    

    Example 3:

    def forward_loop(model) -> None:
        # run evaluation pipeline
        evaluator.model = model
        evaluator.evaluate()
    

    Note

    Calibration does not require forwarding the entire dataset through the model. Please subsample the dataset or reduce the number of batches if needed.

Return type:

Module

Returns: A pytorch model which has been quantized and calibrated.
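
For reference, here is a minimal end-to-end usage sketch. The names model and calib_dataloader are placeholders, and the import alias mtq together with the use of the predefined FP8_DEFAULT_CFG configuration are assumptions:

import modelopt.torch.quantization as mtq

# Forward calibration data through the model to gather calibration statistics
def forward_loop(model):
    for batch in calib_dataloader:
        model(batch)

# Quantize and calibrate using a predefined configuration; a hand-written
# config dictionary like the one shown above also works
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)

# Inspect the inserted quantizer modules and their settings
mtq.print_quant_summary(model)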