Model Quantization
User-facing quantization API.
Functions
- `calibrate` – Adjusts weights and scaling factors based on the selected algorithms.
- `postprocess_amax` – Experimental API to postprocess the amax values after calibration.
- `quantize` – Quantizes and calibrates the model.
- `auto_quantize` – Quantizes the model by searching for the best per-layer quantization formats.
- `disable_quantizer` – Disables quantizers by wildcard or filter function.
- `enable_quantizer` – Enables quantizers by wildcard or filter function.
- `print_quant_summary` – Prints a summary of all quantizer modules in the model.
- `fold_weight` – Folds weight quantizers for fast evaluation.
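The usage sketches in the entries below assume the package is imported as follows; the `mtq` alias is a common convention and an assumption here, not a requirement of the API:
```python
# Assumed import alias used in the usage sketches below
import modelopt.torch.quantization as mtq
```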
- auto_quantize(model, constraints={'effective_bits': 4.8}, quantization_formats=['W4A8_AWQ_BETA_CFG', 'FP8_DEFAULT_CFG', None], data_loader=None, forward_step=None, loss_func=None, forward_backward_step=None, num_calib_steps=512, num_score_steps=128, verbose=False)
API for AutoQuantize, which quantizes the model by searching for the best per-layer quantization formats. `auto_quantize` uses a gradient-based sensitivity score to rank the per-layer quantization formats and searches for the best quantization format for each layer.
- Parameters:
model (Module) – A pytorch model with quantizer modules.
constraints (Dict[str, float | str]) –
Constraints for the search. Currently we support only `effective_bits`. `effective_bits` specifies the effective number of quantization bits for the quantized model. Here is an example of a valid `effective_bits` constraint:
```python
# For an effective quantization bits of 4.8
constraints = {"effective_bits": 4.8}
```
quantization_formats (List[str | None]) –
A list of the string names of the quantization formats to search for. The supported quantization formats are as listed by `modelopt.torch.quantization.config.choices`. In addition, the quantization format can also be `None`, which implies skipping quantization for the layer.
Note
The quantization formats will be applied on a per-layer match basis. The global model-level, name-based quantizer attribute setting will be ignored. For example, in the `FP8_DEFAULT_CFG` quantizer configuration the key `"*lm_head*": {"enable": False}` disables quantization for the `lm_head` layer. However, in `auto_quantize` the quantization format for the `lm_head` layer will be searched. This is because the key `"*lm_head*"` sets the quantizer attributes based on the global model-level name, not on a per-layer basis. The keys `"*input_quantizer"`, `"*weight_quantizer"`, etc. in `FP8_DEFAULT_CFG` match on a per-layer basis, hence the corresponding quantizers will be set as specified.
Here is an example `quantization_formats` argument:
```python
# A valid `quantization_formats` argument
# This will search for the best per-layer quantization from FP8, W4A8_AWQ or no quantization
quantization_formats = ["FP8_DEFAULT_CFG", "W4A8_AWQ", None]
```
data_loader (Iterable) – An iterator yielding data for calibrating the quantized layers and estimating the `auto_quantize` scores.
forward_step (Callable[[Module, Any], Any | Tensor]) –
A callable that takes the model and a batch of data from the `data_loader` as input, forwards the data through the model, and returns the model output. This is a required argument. Here is an example of a valid `forward_step`:
```python
# Takes the model and a batch of data as input and returns the model output
def forward_step(model, batch) -> torch.Tensor:
    output = model(batch)
    return output
```
loss_func (Callable[[Any, Any], Tensor]) –
(Optional) A callable that takes the model output and the batch of data as input and computes the loss. The model output is the output given by `forward_step`. `.backward()` will be called on the loss. Here is an example of a valid `loss_func`:
```python
# Takes the model output and a batch of data as input and returns the loss
def loss_func(output, batch) -> torch.Tensor:
    ...
    return loss

# The loss should be a scalar tensor so that loss.backward() can be called
loss = loss_func(output, batch)
loss.backward()
```
If this argument is not provided, `forward_backward_step` should be provided.
forward_backward_step (Callable[[Module, Any], Any] | None) –
(Optional) A callable that takes a batch of data from the `data_loader`, forwards it through the model, computes the loss, and runs backward on the loss. Here is an example of a valid `forward_backward_step` argument:
```python
# Takes the model and a batch of data as input and runs forward and backward passes
def forward_backward_step(model, batch) -> None:
    output = model(batch)
    loss = my_loss_func(output, batch)
    run_custom_backward(loss)
```
If this argument is not provided, `loss_func` should be provided.
num_calib_steps (int) – Number of batches to use for calibrating the quantized model. Suggested value is 512.
num_score_steps (int) – Number of batches to use for estimating the `auto_quantize` scores. Suggested value is 128. A higher value could increase the time taken to perform `auto_quantize`.
verbose (bool) – If True, prints the search progress/intermediate results.
- Returns: A tuple (model, state_dict) where `model` is the searched and quantized model and `state_dict` contains the history and detailed statistics of the search procedure.
Note
`auto_quantize` groups certain layers and restricts them to have the same quantization format. For example, the Q, K, V linear layers belonging to the same transformer layer will have the same quantization format. This is to ensure compatibility with TensorRT-LLM, which fuses these three linear layers into a single linear layer. The list of regex pattern rules defined in `rules` is used to specify the layer groups. The first captured group in a regex pattern (i.e. `pattern.match(name).group(1)`) is used to group the layers; all layers sharing the same first captured group will have the same quantization format. For example, the rule `r"^(.*?)\.(q_proj|k_proj|v_proj)$"` groups the q_proj, k_proj, v_proj linear layers belonging to the same transformer layer. You can modify the rules to group the layers as needed.
```python
from modelopt.torch.quantization.algorithms import AutoQuantizeSearcher

# To additionally group the layers belonging to the same `mlp` layer,
# add the following rule
AutoQuantizeSearcher.rules.append(r"^(.*?)\.mlp")

# Perform `auto_quantize`
model, state_dict = auto_quantize(model, ...)
```
Note
The `auto_quantize` API and algorithm are experimental and subject to change. The searched `auto_quantize` model might not be directly deployable to TensorRT-LLM yet.
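Putting the parameters together, here is a minimal end-to-end sketch of calling `auto_quantize`. The toy model, random calibration data, and dummy loss are illustrative assumptions; only the `auto_quantize` call itself follows the signature documented above:
```python
import torch
import torch.nn as nn
from modelopt.torch.quantization import auto_quantize

# Toy model and calibration data for illustration only
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
data_loader = [torch.randn(8, 64) for _ in range(32)]

def forward_step(model, batch):
    # Forward a batch through the model and return the output
    return model(batch)

def loss_func(output, batch):
    # Dummy scalar loss so that .backward() can be called on it
    return output.float().pow(2).mean()

model, state_dict = auto_quantize(
    model,
    constraints={"effective_bits": 4.8},
    quantization_formats=["FP8_DEFAULT_CFG", "W4A8_AWQ_BETA_CFG", None],
    data_loader=data_loader,
    forward_step=forward_step,
    loss_func=loss_func,
    num_calib_steps=32,
    num_score_steps=32,
    verbose=True,
)
```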
- calibrate(model, algorithm='max', forward_loop=None)
Adjusts weights and scaling factors based on the selected algorithms.
- Parameters:
model (Module) – A pytorch model with quantizer modules.
algorithm (str | MaxCalibConfig | SmoothQuantCalibConfig | AWQLiteCalibConfig | AWQClipCalibConfig | AWQFullCalibConfig | RealQuantizeConfig | None) – A string or dictionary specifying the calibration algorithm to use. Supported algorithms are `"max"`, `"smoothquant"`, `"awq_lite"`, `"awq_full"`, and `"awq_clip"`. If a dictionary is passed, the key `"method"` should specify the calibration algorithm to use. Other key-value pairs in this dictionary will be passed as kwargs to the algorithm. An example dictionary argument: `{"method": "awq_clip", "max_co_batch_size": 4096}`. If `None`, no calibration is performed. For real quantization, the key `method` should be `real_quantize`, and the calibration algorithm used should be specified in `additional_algorithm`.
forward_loop (Callable[[Module], None] | None) – A callable that takes the model as its argument and forwards calibration data through the model. This is not required for weight-only quantization with the `"max"` algorithm.
- Return type:
Module
Returns: The calibrated pytorch model.
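As an illustrative sketch (not part of the reference), calibration with the default `"max"` algorithm might look like the following; the toy model and data are assumptions, and the model is assumed to already contain quantizer modules (e.g. inserted by `quantize`):
```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy model and calibration data for illustration only; in practice the model
# should already contain quantizer modules before calling calibrate()
model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
calib_data = [torch.randn(4, 32) for _ in range(8)]

def forward_loop(model):
    # Forward the calibration data through the model to collect statistics
    for batch in calib_data:
        model(batch)

model = mtq.calibrate(model, algorithm="max", forward_loop=forward_loop)
```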
- disable_quantizer(model, wildcard_or_filter_func)
Disables quantizers by wildcard or filter function.
- Parameters:
model (Module) –
wildcard_or_filter_func (str | Callable) –
- enable_quantizer(model, wildcard_or_filter_func)
Enables quantizers by wildcard or filter function.
- Parameters:
model (Module) –
wildcard_or_filter_func (str | Callable) –
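For illustration, a hedged sketch of toggling quantizers by wildcard; the `"*lm_head*"` pattern is an example value, and `model` is assumed to be a model that already contains quantizer modules:
```python
import modelopt.torch.quantization as mtq

# Disable quantization for quantizer modules whose names match the wildcard
mtq.disable_quantizer(model, "*lm_head*")

# Re-enable them later with the same wildcard (a filter function is also accepted)
mtq.enable_quantizer(model, "*lm_head*")
```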
- fold_weight(model)
Folds weight quantizers for fast evaluation.
- Parameters:
model (Module) –
- postprocess_amax(model, key, post_process_fn)
Experimental API to postprocess the amax values after calibration.
- Parameters:
model (Module) –
key (str) –
- Return type:
Module
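As an illustrative sketch only: the wildcard key, the clamping threshold, and the assumption that `post_process_fn` receives the calibrated amax tensor are not confirmed by this reference, and `model` is assumed to be an already-calibrated model:
```python
import torch
import modelopt.torch.quantization as mtq

# Clamp very small calibrated amax values of input quantizers (illustrative values)
model = mtq.postprocess_amax(
    model, "*input_quantizer", lambda amax: torch.clamp(amax, min=0.01)
)
```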
- print_quant_summary(model)
Prints a summary of all quantizer modules in the model.
- Parameters:
model (Module) –
- quantize(model, config, forward_loop=None)
Quantizes and calibrates the model.
This method performs the quantization replacement of the modules and calibrates the model as specified by `quant_cfg`. `forward_loop` is used to forward data through the model and gather statistics for calibration.
- Parameters:
model (Module) – A pytorch model
config (Dict[str, Any]) –
A dictionary or an instance of `QuantizeConfig` specifying the values for the keys `"quant_cfg"` and `"algorithm"`. It is basically a dictionary specifying the values for the keys `"quant_cfg"` and `"algorithm"`. The `"quant_cfg"` key specifies the quantization configurations. The `"algorithm"` key specifies the `algorithm` argument to `calibrate`.
Quantization configurations is a dictionary mapping wildcards or filter functions to its quantizer attributes. The wildcards or filter functions are matched against the quantizer module names. The quantizer modules have names ending with `weight_quantizer` and `input_quantizer`, and they perform weight quantization and input quantization (or activation quantization) respectively. The quantizer modules are instances of `TensorQuantizer`. The quantizer attributes are defined by `QuantizerAttributeConfig`. See `QuantizerAttributeConfig` for details on the quantizer attributes and their values.
An example `config` dictionary is shown in the usage sketch at the end of this section. See Quantization Formats to learn more about the supported quantization formats. See Quantization Configs for more details on the `config` dictionary.
forward_loop (Callable[[Module], None] | None) –
A callable that forwards all calibration data through the model. This is used to gather statistics for calibration. It should take model as the argument. It does not need to return anything.
This argument is not required for weight-only quantization with the `"max"` algorithm.
Here are a few examples of correct `forward_loop` definitions:
Example 1:
```python
def forward_loop(model) -> None:
    # iterate over the data loader and forward data through the model
    for batch in data_loader:
        model(batch)
```
Example 2:
```python
def forward_loop(model) -> float:
    # evaluate the model on the task
    return evaluate(model, task, ....)
```
Example 3:
```python
def forward_loop(model) -> None:
    # run evaluation pipeline
    evaluator.model = model
    evaluator.evaluate()
```
Note
Calibration does not require forwarding the entire dataset through the model. Please subsample the dataset or reduce the number of batches if needed.
- Return type:
Module
Returns: A pytorch model which has been quantized and calibrated.
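As referenced in the `config` parameter description above, here is a hedged usage sketch of `quantize`. The toy model, calibration data, and the hand-written quantizer attribute values in `config` are illustrative assumptions; predefined configs such as `FP8_DEFAULT_CFG` can be used instead:
```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy model and calibration data for illustration only
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
calib_data = [torch.randn(8, 64) for _ in range(16)]

# Example config: 8-bit per-channel weight quantization and per-tensor input
# (activation) quantization, calibrated with the "max" algorithm. The attribute
# values are illustrative; see QuantizerAttributeConfig for the full options.
config = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},
        "default": {"enable": False},
    },
    "algorithm": "max",
}

def forward_loop(model) -> None:
    # Forward calibration data through the model to gather statistics
    for batch in calib_data:
        model(batch)

model = mtq.quantize(model, config, forward_loop=forward_loop)
mtq.print_quant_summary(model)
```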