
Applying Quantization to Speech Recognition Models

Introduction to Quantization

Quantization is commonly used for low-latency applications of SpeechBrain automatic speech recognition models, such as real-time speech recognition.

Quantization works by converting a model's weights and activations from floating-point values to lower-resolution numbers, such as 8-bit integers. This not only reduces the model's memory footprint but also reduces inference latency, since integer arithmetic is generally faster than floating-point arithmetic.

The conversion maps values from a given range into the quantized range, "clipping" each value to the nearest value representable at the chosen resolution. In general, quantization revolves around two concepts: the zero point and the scale factor.

  • Zero point: the quantized value to which 0 is mapped during quantization.

  • Scale factor: the factor by which the data range is scaled to fit into the quantized range.

Together, the zero point and the scale factor describe how the mapping works.

In other words,

\(y = \operatorname{round}\left(\frac{x}{S} + Z\right)\)

where \(x\) is the original value, \(y\) is the quantized value, \(S\) is the scale factor, and \(Z\) is the zero point.
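To make the mapping concrete, here is a small sketch using PyTorch's built-in per-tensor quantization; the values of \(S\) and \(Z\) below are arbitrary and chosen only for illustration.

# A minimal numerical illustration of the mapping above; the scale and zero point
# are arbitrary example values, not ones taken from the tutorial's model.
import torch

x = torch.tensor([-1.0, 0.0, 0.5, 1.0])    # original float values
scale, zero_point = 1 / 127.5, 128          # example S and Z

x_q = torch.quantize_per_tensor(x, scale, zero_point, dtype=torch.quint8)
print(x_q.int_repr())    # quantized integers, i.e. round(x / S + Z), clamped to [0, 255]
print(x_q.dequantize())  # approximate reconstruction, (q - Z) * S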


Quantization Methods

Quantization can be categorized by when it is performed: in quantization-aware training (QAT), quantization is incorporated into the training process, whereas in post-training quantization (PTQ), quantization is applied only after the model has been trained. This tutorial is about quantizing pretrained models, so it focuses on the latter.

PTQ can be further divided into two approaches based on when the model's activations are quantized: dynamic quantization quantizes them during inference, while static quantization does so before inference takes place.

For all types of quantization, the weights can be quantized ahead of time, because they depend on the model itself rather than on the input data. This means the range of the weight values is already known at quantization time, so the weights can be quantized without any further information.

The model's activations, i.e. the values obtained after applying the activation functions, depend on the input data, however. This means the range of the activation values can change at run time, which is what motivates the different quantization approaches.

Dynamic Quantization


In dynamic quantization, submodules are converted to quantized versions during a preparation step so that their weights are quantized appropriately. Then, during inference, each quantized layer observes its incoming data and adjusts its quantization parameters based on what it observes. This happens over and over as inference proceeds, hence the name "dynamic" quantization.
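As a minimal sketch (on a toy module rather than an ASR submodule), this is what dynamic quantization looks like with PyTorch's quantize_dynamic: the weights of the listed module types are quantized immediately, while the activation quantization parameters are chosen on the fly at inference time.

# Dynamic-quantization sketch on a toy model (illustrative only, not part of the ASR model).
import torch
import torch.nn as nn

toy = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
toy_q = torch.ao.quantization.quantize_dynamic(
    toy, {nn.Linear}, dtype=torch.qint8   # quantize the weights of all nn.Linear layers now
)
print(toy_q)                              # the Linear layers become DynamicQuantizedLinear
out = toy_q(torch.randn(2, 16))           # activation ranges are observed per call at run time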

Static Quantization


Unlike dynamic quantization, static quantization makes no adjustments at run time. Instead, observer modules are inserted at the selected locations where quantization will take place, and the model is run on a set of representative data samples. The observer modules then choose quantization parameters based on the data fed to the model, and these parameters remain fixed at run time.
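Here is a minimal static-quantization sketch on a toy module (again illustrative only): QuantStub/DeQuantStub mark the boundaries, prepare inserts the observers, a few calibration passes let them record activation ranges, and convert freezes the quantization parameters.

# Static-quantization sketch on a toy model (illustrative only, not part of the ASR model).
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # input enters the quantized region here
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()  # output leaves the quantized region here

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

toy = ToyModel().eval()
toy.qconfig = torch.ao.quantization.default_qconfig         # observers + 8-bit quantization
torch.ao.quantization.prepare(toy, inplace=True)

# Calibration: run representative inputs so the observers can record activation ranges
for _ in range(8):
    toy(torch.randn(2, 16))

torch.ao.quantization.convert(toy, inplace=True)
print(toy)  # fc is now a QuantizedLinear with a fixed scale and zero point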

Comparing Dynamic and Static Quantization

Dynamic quantization does not fix the zero point and scale factor; it adjusts them based on the data observed at run time. In contrast, static quantization requires an initial calibration phase, during which observer modules record the range of the activation values and use it to determine the zero point and scale factor.

The advantage of dynamic quantization is that it requires no calibration and suits modules whose input data range may vary widely. Static quantization, on the other hand, needs no on-the-fly quantization adjustments at run time, which can reduce latency, possibly at the cost of some accuracy.

Purpose of This Tutorial

This tutorial shows how to adapt PyTorch's quantization functions so that they can be applied to SpeechBrain models, and how to benchmark the quantized models.

It focuses on pretrained automatic speech recognition (ASR) models, which can be easily loaded and used via the library's speechbrain.inference.ASR module.

Prerequisites

Install SpeechBrain

%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH

Install Additional Dependencies

kenlm and pygtrie are external libraries that our chosen model depends on for its n-gram-related functionality. If your model does not use them, you may not need them; replace these installs with whatever external libraries your model requires.

%%capture
!pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install pygtrie

Imports

import gc
import numpy as np
import os
import sentencepiece
import speechbrain
import time
import torch
import torch.nn as nn
import tqdm

from collections import Counter
from copy import deepcopy

Model Selection

In this tutorial we will use a Wav2Vec 2.0 model trained with CTC on the CommonVoice English dataset.

The Wav2Vec 2.0 model is transformer-based. It is also an encoder-only ASR model, meaning it has no decoder layers and instead relies on a decoding function. While the encoder does not use a language model, the decoding function can optionally use one for n-gram rescoring, which is why kenlm needs to be installed.

from speechbrain.inference.ASR import EncoderASR

asr_model = EncoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-commonvoice-14-en",
    savedir="/content/pretrained_ASR/asr-wav2vec2-commonvoice-14-en",
)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-large-lv60 and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
WARNING:speechbrain.lobes.models.huggingface_transformers.huggingface:speechbrain.lobes.models.huggingface_transformers.huggingface - Wav2Vec2Model is frozen.

Let's take a closer look at the model's submodules.

asr_model
EncoderASR(
  (mods): ModuleDict(
    (encoder): LengthsCapableSequential(
      (wav2vec2): Wav2Vec2(
        (model): Wav2Vec2Model(
          (feature_extractor): Wav2Vec2FeatureEncoder(
            (conv_layers): ModuleList(
              (0): Wav2Vec2LayerNormConvLayer(
                (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
                (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
                (activation): GELUActivation()
              )
              (1-4): 4 x Wav2Vec2LayerNormConvLayer(
                (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
                (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
                (activation): GELUActivation()
              )
              (5-6): 2 x Wav2Vec2LayerNormConvLayer(
                (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,))
                (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
                (activation): GELUActivation()
              )
            )
          )
          (feature_projection): Wav2Vec2FeatureProjection(
            (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (projection): Linear(in_features=512, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (encoder): Wav2Vec2EncoderStableLayerNorm(
            (pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
              (conv): ParametrizedConv1d(
                1024, 1024, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
                (parametrizations): ModuleDict(
                  (weight): ParametrizationList(
                    (0): _WeightNorm()
                  )
                )
              )
              (padding): Wav2Vec2SamePadLayer()
              (activation): GELUActivation()
            )
            (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (layers): ModuleList(
              (0-23): 24 x Wav2Vec2EncoderLayerStableLayerNorm(
                (attention): Wav2Vec2Attention(
                  (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
                )
                (dropout): Dropout(p=0.1, inplace=False)
                (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
                (feed_forward): Wav2Vec2FeedForward(
                  (intermediate_dropout): Dropout(p=0.1, inplace=False)
                  (intermediate_dense): Linear(in_features=1024, out_features=4096, bias=True)
                  (intermediate_act_fn): GELUActivation()
                  (output_dense): Linear(in_features=4096, out_features=1024, bias=True)
                  (output_dropout): Dropout(p=0.1, inplace=False)
                )
                (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
              )
            )
          )
        )
      )
      (enc): Sequential(
        (linear1): Linear(
          (w): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bn1): BatchNorm1d(
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (activation): LeakyReLU(negative_slope=0.01)
        (drop): Dropout(p=0.15, inplace=False)
        (linear2): Linear(
          (w): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bn2): BatchNorm1d(
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (activation2): LeakyReLU(negative_slope=0.01)
        (drop2): Dropout(p=0.15, inplace=False)
        (linear3): Linear(
          (w): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bn3): BatchNorm1d(
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (activation3): LeakyReLU(negative_slope=0.01)
      )
      (ctc_lin): Linear(
        (w): Linear(in_features=1024, out_features=1000, bias=True)
      )
      (log_softmax): Softmax()
    )
  )
  (decoding_function): CTCBeamSearcher()
)

Note that not all modules can be quantized, and some cannot be quantized with certain methods. Pay particular attention to the modules in the lists below, which can be quantized without custom modifications to work around PyTorch's limitations (a quick way to inspect which of these types your model contains is sketched after the lists):

Dynamically quantizable modules

  • nn.Linear

  • nn.LSTM

  • nn.GRU

  • nn.RNNCell

  • nn.GRUCell

  • nn.LSTMCell

  • nn.EmbeddingBag

  • nn.Embedding

Statically quantizable modules

  • nn.Linear

  • nn.Conv1d/2d/3d

  • nn.EmbeddingBag

  • nn.Embedding
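As a quick, optional check of which layer types the loaded model actually contains (and hence which entries of the lists above are relevant), we can count the module classes; this is only an inspection helper, not part of the quantization itself.

# Inspection helper only: count the module types inside the loaded model.
layer_counts = Counter(type(m).__name__ for m in asr_model.mods.modules())
print(layer_counts.most_common(10))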

With this information, we can begin to determine our quantization scheme. From our chosen model, we can identify the following submodules:

  • encoder.wav2vec2.model.feature_extractor: contains 7 nn.Conv1d layers, which must be quantized statically.

  • encoder.wav2vec2.model.feature_projection: contains 1 nn.Linear layer, which can be quantized either dynamically or statically.

  • encoder.wav2vec2.model.encoder.pos_conv_embed: contains a ParametrizedConv1d layer, for which quantization has not yet been implemented in PyTorch.

  • encoder.wav2vec2.model.encoder.layers: static quantization is not yet properly implemented for modules that rely on attention (e.g. submodules containing transformer layers), so only dynamic quantization can be applied here.

  • encoder.enc: a sequence of nn.Linear and nn.BatchNorm1d layers. Unfortunately, PyTorch does not allow BatchNorm layers to be statically quantized unless they follow a convolutional layer, so this submodule must be quantized dynamically.

  • encoder.ctc_lin: contains 1 nn.Linear layer, which can be quantized either dynamically or statically.

Note that we have only singled out the model's "major" submodules; quantization can be done at a finer granularity by applying different quantization strategies to specific layers within the submodules picked out above. (For example, we could statically quantize specific nn.Linear layers inside encoder.wav2vec2.model.encoder.layers even though the submodule as a whole cannot be quantized that way.)

However, quantization carries overhead, since inputs must be quantized and outputs dequantized, so quantizing at too fine a granularity is not recommended. For example, statically quantizing several layers together requires only one quantization and one dequantization, whereas quantizing them separately means repeatedly dequantizing and requantizing as data flows from one layer to the next.

Given the limitations of quantization and empirically gathered data, for this model we will dynamically quantize encoder.wav2vec2.model.encoder.layers and encoder.enc, and statically quantize encoder.wav2vec2.model.feature_extractor and encoder.wav2vec2.model.feature_projection.

encoder.ctc_lin will not be quantized, since experiments showed that quantizing it has a large impact on WER (word error rate, a measure of accuracy).

Since submodules respond differently to different quantization methods, you may need to experiment with combinations of dynamic and static quantization to find what works best for your model.

Data Download and Preprocessing

Download the LibriSpeech dev-clean dataset, which contains audio samples and their corresponding transcriptions. This is the dataset we will use to evaluate model performance before and after quantization. It was chosen because it is relatively small (we do not need a large dataset to evaluate the model's performance) and because it is clean, i.e. free of background noise or audio artifacts that could needlessly interfere with the model's accuracy.

Some additional preprocessing is needed to convert the dataset into a format suitable for applying our model and for comparing the model's output against the reference transcriptions. We want a list of audio-reference pairs so that the model's output on each audio sample can be compared with the correct reference transcription.

%%capture
!mkdir librispeech_dev_clean
!wget https://www.openslr.org/resources/12/dev-clean.tar.gz -P /content
!tar -xvf dev-clean.tar.gz -C librispeech_dev_clean
from speechbrain.dataio.dataio import read_audio

# Retrieve the downloaded speech data as a list of audio-reference pairs
def get_samples(root):
    audios = []
    references = []
    for book in os.listdir(root):
        for chapter in os.listdir(f"{root}/{book}"):
            for file in os.listdir(f"{root}/{book}/{chapter}"):
                if file.endswith("txt"):
                    with open(f"{root}/{book}/{chapter}/{file}", "r") as f:
                        for line in f.readlines():
                            audio_path, reference = line.split(" ", 1)
                            full_audio_path = f"{root}/{book}/{chapter}/{audio_path}.flac"
                            audios.append(read_audio(full_audio_path))
                            references.append(reference)
    return audios, references
audios, references = get_samples("/content/librispeech_dev_clean/LibriSpeech/dev-clean")
assert len(audios) == len(references)

Quantization Setup

Utility Functions

Here we define get_module and set_module, utility functions for retrieving and setting submodules inside a module given a string. These are needed to perform localized quantization, i.e. replacing a submodule with a quantized version without quantizing anything else.

The utilities are based on the getattr and setattr functions, but allow nested attributes, e.g.

module_string = "encoder.wav2vec2.model.feature_projection"

This allows nested submodules to be retrieved and set.

def get_module(model, module_string):
    curr = model.mods
    for attr in module_string.split("."):
        if attr.isnumeric():
            curr = curr[int(attr)]
        else:
            curr = getattr(curr, attr)
    return curr

def set_module(model, module_string, new_module):
    curr = model.mods
    attrs = module_string.split(".")
    for attr in attrs[:-1]:
        if attr.isnumeric():
            curr = curr[int(attr)]
        else:
            curr = getattr(curr, attr)
    if attrs[-1].isnumeric():
        curr[int(attrs[-1])] = new_module
    else:
        setattr(curr, attrs[-1], new_module)
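For example, the utilities can be used as follows (a purely illustrative round trip on the model loaded above):

# Retrieve a nested submodule by its string path, then put it back unchanged.
proj = get_module(asr_model, "encoder.wav2vec2.model.feature_projection")
print(type(proj).__name__)  # Wav2Vec2FeatureProjection
set_module(asr_model, "encoder.wav2vec2.model.feature_projection", proj)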

Static Quantization Wrapper

Static quantization requires QuantStub and DeQuantStub modules to mark the boundaries between quantized and non-quantized modules, and to indicate where the quantization observers should be placed during calibration.

During calibration, the quantization observers record the range of the data to determine the scale factor and zero point for quantization, leading to better quantization results.

In addition, when static quantization is applied, QuantStub and DeQuantStub are converted into layers that quantize and dequantize the input tensors, respectively, allowing the quantized module to interface smoothly with non-quantized modules.

Note that __getattr__ is overridden below so that attributes of the wrapped model can still be retrieved through the wrapper.

Also, DeQuantStub must be able to handle tuples returned from the model, i.e. multiple return values, since DeQuantStub's forward function does not handle tuples by itself.

from torch.ao.quantization import QuantStub, DeQuantStub

class StaticQuant(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.quant = QuantStub()
        self.model = model
        self.dequant = DeQuantStub()

    def __getattr__(self, name):
        if name in self.__dict__:
            return self.__dict__[name]
        elif name in self.__dict__['_modules']:
            return self.__dict__['_modules'][name]
        else:
            return getattr(self.__dict__['_modules']['model'], name)

    def forward(self, x, *args, **kwargs):
        x = self.quant(x)
        x = self.model(x, *args, **kwargs)
        if isinstance(x, tuple):
            return tuple(self.dequant(output) for output in x)
        else:
            return self.dequant(x)

Quantization Function

This is a custom quantization function that allows both dynamic and static quantization of submodules. It also offers extra flexibility by allowing hyperparameters such as the quantization resolution and other quantization configurations to be supplied. This makes it simpler to apply a combination of quantization strategies to our model.

See the docstring for details.

def custom_quantize(
        model,
        dynamic_modules=None,
        static_modules=None,
        calibration_samples=None,
        dynamic_targets=None,
        dynamic_dtype=torch.qint8,
        static_qconfig=torch.ao.quantization.default_qconfig,
):
    """Performs in-place quantization of an ASR model

    The quantization is customizable. A combination of dynamic and static
    quantization can be performed on specific submodules that are passed into
    this function.

    Names of submodules passed into this class are implicitly assumed to be
    nested fields of ``model.mods``. For example, the ``model.mods.encoder.enc``
    submodule should be passed in as ``encoder.enc``.

    Reference https://pytorch.org/docs/stable/quantization.html for
    what torch modules can and cannot be dynamically/statically quantized.

    Arguments
    ---------
    model : torch.nn.Module
        Model to be quantized.
    dynamic_modules : list[str]
        Names of the submodules to be dynamically quantized. They should be
        formatted as stated above.
    static_modules : list[str]
        Names of the submodules to be statically quantized. They should be
        formatted as stated above.
    calibration_samples : list[torch.Tensor]
        Sample inputs used for calibration during static quantization.
    dynamic_targets : set[torch.nn.Module]
        Torch modules to be quantized during dynamic quantization.
    dynamic_dtype : torch.dtype
        The torch datatype that values will be converted to during dynamic
        quantization. This should be a quantized datatype, such as
        ``torch.quint8``, ``torch.qint8``, ``torch.qint32``
    static_qconfig : torch.ao.quantization.qconfig.QConfig
        The quantization config for static quantization, which, among other
        things, specifies the observer modules that will be inserted
        and the resolution of quantization.

    Returns
    -------
    None
    """

    ##################################################
    # Dynamic Quantization                           #
    ##################################################
    if dynamic_modules is not None and len(dynamic_modules) > 0:
        if dynamic_targets is None:
            dynamic_targets = {
                torch.nn.LSTM,
                torch.nn.GRU,
                torch.nn.RNNCell,
                torch.nn.GRUCell,
                torch.nn.LSTMCell,
                torch.nn.Linear
            }

        for module in dynamic_modules:
            torch.quantization.quantize_dynamic(
                get_module(model, module),
                dynamic_targets,
                dtype=dynamic_dtype,
                inplace=True,
            )

    ##################################################
    # Static Quantization                            #
    ##################################################
    if static_modules is not None and len(static_modules) > 0:
        if calibration_samples is None or len(calibration_samples) == 0:
            raise Exception("No calibration samples provided for static quantization.")

        for module in static_modules:
            set_module(
                model,
                module,
                StaticQuant(get_module(model, module)),
            )
            get_module(model, module).qconfig = static_qconfig

        torch.ao.quantization.prepare(model, inplace=True)

        for sample in calibration_samples:
            model.transcribe_batch(sample.unsqueeze(0), torch.tensor([1.0]))

        torch.ao.quantization.convert(model, inplace=True)

Benchmark Setup

We will focus on two main performance metrics for ASR: the real-time factor (RTF) and the word error rate (WER).

RTF is the ratio of total inference time to the total length of the input audio. This matters because an RTF below 1 means that inference takes less time than playing back the audio, which can allow real-time speech recognition (other sources of latency aside).
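For example, if transcribing 200 seconds of audio takes 100 seconds of encoder compute, the RTF is 100 / 200 = 0.5, i.e. roughly twice as fast as real time.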

WER is the ratio of the number of word-level errors (substitutions, deletions, insertions) the model makes with respect to the reference text to the number of words in the reference text.

Together, these two metrics let us evaluate the model's latency and accuracy before and after quantization.

WER

The Levenshtein distance, or edit distance, is at the core of the WER metric. It measures the number of substitutions, deletions, and/or insertions needed to transform one string into another, and can be computed with a dynamic-programming approach.

The main difference between the Levenshtein distance and WER is that the former operates on strings at the character level, whereas the latter counts substitutions/deletions/insertions of whole words.
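As an illustration (independent of SpeechBrain's own implementation, which we use below), a word-level edit distance can be computed with the classic dynamic-programming recurrence:

# Minimal word-level edit-distance sketch; dp[i][j] is the minimum number of
# substitutions/deletions/insertions needed to turn ref[:i] into hyp[:j].
def word_edit_distance(ref, hyp):
    ref, hyp = ref.split(), hyp.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[-1][-1]

# WER = word-level edit distance / number of reference words
print(word_edit_distance("the cat sat", "the cat sit on") / 3)  # 2 / 3 ≈ 0.67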

SpeechBrain provides utility functions for measuring WER and other related metrics.

from speechbrain.utils.edit_distance import accumulatable_wer_stats

def compute_wer(references, hypotheses):
    if isinstance(references, str):
        references = [references.split()]
    else:
        references = [ref.split() for ref in references]
    if isinstance(hypotheses, str):
        hypotheses = [hypotheses.split()]
    else:
        hypotheses = [hyp.split() for hyp in hypotheses]
    if len(references) != len(hypotheses):
        raise Exception("Number of references is not equal to the number of hypotheses")
    stats = accumulatable_wer_stats(references, hypotheses, Counter())
    return stats['WER']

Modifying EncoderASR's transcribe_batch

We modify the existing transcribe_batch method so that the encoder's forward function can be timed.

Different ASR types have different transcribe_batch implementations, so you may need to adjust this accordingly for your own model.

import functools

# Functions necessary for preprocessing the input and generating transcriptions

def preprocess_input(model: EncoderASR, input):
    with torch.no_grad():
        wavs = input.unsqueeze(0)
        wav_lens = torch.tensor([1.0])
        wavs = wavs.float()
        wavs, wav_lens = wavs.to(model.device), wav_lens.to(model.device)
        return wavs, wav_lens

def generate(model, predictions):
    is_ctc_text_encoder_tokenizer = isinstance(
        model.tokenizer, speechbrain.dataio.encoder.CTCTextEncoder
    )
    if isinstance(model.hparams.decoding_function, functools.partial):
        if is_ctc_text_encoder_tokenizer:
            predicted_words = [
                "".join(model.tokenizer.decode_ndim(token_seq))
                for token_seq in predictions
            ]
        else:
            predicted_words = [
                model.tokenizer.decode_ids(token_seq)
                for token_seq in predictions
            ]
    else:
        predicted_words = [hyp[0].text for hyp in predictions]
    return predicted_words

Note that we only care about the changes in inference time related to quantization, not the overhead of input preprocessing or word generation. This is why we only record the duration of the encoder's forward function.

def timed_transcribe(model: EncoderASR, input):
    with torch.no_grad():
        wavs, wav_lens = preprocess_input(model, input)
        start = time.time()
        encoder_out = model.mods.encoder(wavs, wav_lens)
        end = time.time()
        duration = end - start
        predictions = model.decoding_function(encoder_out, wav_lens)
        predicted_words = generate(model, predictions)
    return predicted_words[0], duration

Benchmarking Model Performance

Latency measurements are usually unstable at first, so a warm-up phase is introduced to ensure a more accurate performance evaluation.

def benchmark(model, samples, references):
    total_audio_length = sum([sample.shape[0] / 16000 for sample in samples])
    total_cpu_time = 0
    outputs = []

    for sample in tqdm.tqdm(samples[:10], desc="warming up"):
        timed_transcribe(model, sample)

    for sample in tqdm.tqdm(samples, desc="evaluating"):
        output, duration = timed_transcribe(model, sample)
        outputs.append(output)
        total_cpu_time += duration

    wer = compute_wer(references, outputs)
    rtf = total_cpu_time / total_audio_length
    return wer, rtf

Quantization and Benchmarking

With the setup code needed for quantization and benchmarking in place, we can move on to actually quantizing the model and benchmarking it before and after quantization.

Selecting Data

To save time, select a subset of the audio data for benchmarking the model.

n = 100
audio_subset = audios[:n]
ref_subset = references[:n]

Original Model

# Deepcopy the original model to avoid propagating unwanted changes
original_model = deepcopy(asr_model)
original_model.eval()
wer, rtf = benchmark(original_model, audio_subset, ref_subset)
warming up: 100%|██████████| 10/10 [01:40<00:00, 10.01s/it]
evaluating: 100%|██████████| 100/100 [09:32<00:00,  5.73s/it]
print(f"Original Model\nWER(%): {wer}\nRTF: {rtf}")
Original Model
WER(%): 6.067291781577496
RTF: 0.7967449480673793

To avoid exceeding the session's RAM limits, delete the model after benchmarking.

del original_model
gc.collect()
0

Quantized Model

First, let's review the model architecture:

asr_model
EncoderASR(
  (mods): ModuleDict(
    (encoder): LengthsCapableSequential(
      (wav2vec2): Wav2Vec2(
        (model): Wav2Vec2Model(
          (feature_extractor): Wav2Vec2FeatureEncoder(
            (conv_layers): ModuleList(
              (0): Wav2Vec2LayerNormConvLayer(
                (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
                (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
                (activation): GELUActivation()
              )
              (1-4): 4 x Wav2Vec2LayerNormConvLayer(
                (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
                (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
                (activation): GELUActivation()
              )
              (5-6): 2 x Wav2Vec2LayerNormConvLayer(
                (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,))
                (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
                (activation): GELUActivation()
              )
            )
          )
          (feature_projection): Wav2Vec2FeatureProjection(
            (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (projection): Linear(in_features=512, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (encoder): Wav2Vec2EncoderStableLayerNorm(
            (pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
              (conv): ParametrizedConv1d(
                1024, 1024, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
                (parametrizations): ModuleDict(
                  (weight): ParametrizationList(
                    (0): _WeightNorm()
                  )
                )
              )
              (padding): Wav2Vec2SamePadLayer()
              (activation): GELUActivation()
            )
            (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (layers): ModuleList(
              (0-23): 24 x Wav2Vec2EncoderLayerStableLayerNorm(
                (attention): Wav2Vec2Attention(
                  (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
                  (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
                )
                (dropout): Dropout(p=0.1, inplace=False)
                (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
                (feed_forward): Wav2Vec2FeedForward(
                  (intermediate_dropout): Dropout(p=0.1, inplace=False)
                  (intermediate_dense): Linear(in_features=1024, out_features=4096, bias=True)
                  (intermediate_act_fn): GELUActivation()
                  (output_dense): Linear(in_features=4096, out_features=1024, bias=True)
                  (output_dropout): Dropout(p=0.1, inplace=False)
                )
                (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
              )
            )
          )
        )
      )
      (enc): Sequential(
        (linear1): Linear(
          (w): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bn1): BatchNorm1d(
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (activation): LeakyReLU(negative_slope=0.01)
        (drop): Dropout(p=0.15, inplace=False)
        (linear2): Linear(
          (w): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bn2): BatchNorm1d(
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (activation2): LeakyReLU(negative_slope=0.01)
        (drop2): Dropout(p=0.15, inplace=False)
        (linear3): Linear(
          (w): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (bn3): BatchNorm1d(
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (activation3): LeakyReLU(negative_slope=0.01)
      )
      (ctc_lin): Linear(
        (w): Linear(in_features=1024, out_features=1000, bias=True)
      )
      (log_softmax): Softmax()
    )
  )
  (decoding_function): CTCBeamSearcher()
)

As discussed earlier, in this tutorial we apply dynamic quantization to the attention layers and the sequential linear layers, and static quantization to the other quantizable layers (except ctc_lin, which was empirically observed to respond poorly to quantization).

Keep in mind that not all PyTorch layers can be quantized, and some can only be quantized dynamically or only statically, so there are constraints on which modules to quantize and which quantization methods to use.

Feel free to experiment with your own model to find what works best.

dynamic_modules = [
    "encoder.wav2vec2.model.encoder.layers",
    "encoder.enc"
]
static_modules = [
    "encoder.wav2vec2.model.feature_projection",
    "encoder.wav2vec2.model.feature_extractor",
]

Randomly select calibration samples to use for static quantization.

from operator import itemgetter

np.random.seed(1337)
indices = np.random.choice(len(audios), 10)
calibration_samples = list(itemgetter(*indices)(audios))

We now have everything we need to quantize the model.

# Deepcopy the original model to avoid propagating unwanted changes
quantized_model = deepcopy(asr_model)
custom_quantize(
    model=quantized_model,
    dynamic_modules=dynamic_modules,
    static_modules=static_modules,
    calibration_samples=calibration_samples,
)

This is the quantized model. Note how the specified submodules have been replaced with their quantized versions.

quantized_model
EncoderASR(
  (mods): ModuleDict(
    (encoder): LengthsCapableSequential(
      (wav2vec2): Wav2Vec2(
        (model): Wav2Vec2Model(
          (feature_extractor): Static(
            (quant): Quantize(scale=tensor([0.1671]), zero_point=tensor([60]), dtype=torch.quint8)
            (model): Wav2Vec2FeatureEncoder(
              (conv_layers): ModuleList(
                (0): Wav2Vec2LayerNormConvLayer(
                  (conv): QuantizedConv1d(1, 512, kernel_size=(10,), stride=(5,), scale=0.23443543910980225, zero_point=67)
                  (layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
                  (activation): GELUActivation()
                )
                (1): Wav2Vec2LayerNormConvLayer(
                  (conv): QuantizedConv1d(512, 512, kernel_size=(3,), stride=(2,), scale=0.8026854991912842, zero_point=62)
                  (layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
                  (activation): GELUActivation()
                )
                (2): Wav2Vec2LayerNormConvLayer(
                  (conv): QuantizedConv1d(512, 512, kernel_size=(3,), stride=(2,), scale=1.169354796409607, zero_point=89)
                  (layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
                  (activation): GELUActivation()
                )
                (3): Wav2Vec2LayerNormConvLayer(
                  (conv): QuantizedConv1d(512, 512, kernel_size=(3,), stride=(2,), scale=0.8424969911575317, zero_point=66)
                  (layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
                  (activation): GELUActivation()
                )
                (4): Wav2Vec2LayerNormConvLayer(
                  (conv): QuantizedConv1d(512, 512, kernel_size=(3,), stride=(2,), scale=0.592667818069458, zero_point=54)
                  (layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
                  (activation): GELUActivation()
                )
                (5): Wav2Vec2LayerNormConvLayer(
                  (conv): QuantizedConv1d(512, 512, kernel_size=(2,), stride=(2,), scale=0.4864558279514313, zero_point=68)
                  (layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
                  (activation): GELUActivation()
                )
                (6): Wav2Vec2LayerNormConvLayer(
                  (conv): QuantizedConv1d(512, 512, kernel_size=(2,), stride=(2,), scale=0.4137037694454193, zero_point=41)
                  (layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
                  (activation): GELUActivation()
                )
              )
            )
            (dequant): DeQuantize()
          )
          (feature_projection): Static(
            (quant): Quantize(scale=tensor([0.0369]), zero_point=tensor([5]), dtype=torch.quint8)
            (model): Wav2Vec2FeatureProjection(
              (layer_norm): QuantizedLayerNorm((512,), eps=1e-05, elementwise_affine=True)
              (projection): QuantizedLinear(in_features=512, out_features=1024, scale=0.7401247620582581, zero_point=64, qscheme=torch.per_tensor_affine)
              (dropout): QuantizedDropout(p=0.1, inplace=False)
            )
            (dequant): DeQuantize()
          )
          (encoder): Wav2Vec2EncoderStableLayerNorm(
            (pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
              (conv): ParametrizedConv1d(
                1024, 1024, kernel_size=(128,), stride=(1,), padding=(64,), groups=16
                (parametrizations): ModuleDict(
                  (weight): ParametrizationList(
                    (0): _WeightNorm()
                  )
                )
              )
              (padding): Wav2Vec2SamePadLayer()
              (activation): GELUActivation()
            )
            (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (layers): ModuleList(
              (0-23): 24 x Wav2Vec2EncoderLayerStableLayerNorm(
                (attention): Wav2Vec2Attention(
                  (k_proj): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
                  (v_proj): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
                  (q_proj): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
                  (out_proj): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
                )
                (dropout): Dropout(p=0.1, inplace=False)
                (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
                (feed_forward): Wav2Vec2FeedForward(
                  (intermediate_dropout): Dropout(p=0.1, inplace=False)
                  (intermediate_dense): DynamicQuantizedLinear(in_features=1024, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
                  (intermediate_act_fn): GELUActivation()
                  (output_dense): DynamicQuantizedLinear(in_features=4096, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
                  (output_dropout): Dropout(p=0.1, inplace=False)
                )
                (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
              )
            )
          )
        )
      )
      (enc): Sequential(
        (linear1): Linear(
          (w): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
        )
        (bn1): BatchNorm1d(
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (activation): LeakyReLU(negative_slope=0.01)
        (drop): Dropout(p=0.15, inplace=False)
        (linear2): Linear(
          (w): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
        )
        (bn2): BatchNorm1d(
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (activation2): LeakyReLU(negative_slope=0.01)
        (drop2): Dropout(p=0.15, inplace=False)
        (linear3): Linear(
          (w): DynamicQuantizedLinear(in_features=1024, out_features=1024, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
        )
        (bn3): BatchNorm1d(
          (norm): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (activation3): LeakyReLU(negative_slope=0.01)
      )
      (ctc_lin): Linear(
        (w): Linear(in_features=1024, out_features=1000, bias=True)
      )
      (log_softmax): Softmax()
    )
  )
  (decoding_function): CTCBeamSearcher()
)

Next, we benchmark the quantized model.

quantized_model.eval()
wer, rtf = benchmark(quantized_model, audio_subset, ref_subset)
warming up: 100%|██████████| 10/10 [01:16<00:00,  7.61s/it]
evaluating: 100%|██████████| 100/100 [07:12<00:00,  4.32s/it]
print(f"Quantized Model\nWER(%): {wer}\nRTF: {rtf}")
Quantized Model
WER(%): 7.335907335907336
RTF: 0.6004914075674289

We can observe a significant reduction in RTF along with a reasonable increase in WER, which indicates that the quantization was successful.

Finally, if you need to run more quantization benchmarks with other models, you can delete this model to free up RAM.

del quantized_model
gc.collect()
4479

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}