Fine-tuning or using Whisper, wav2vec 2.0, HuBERT, and more with SpeechBrain and HuggingFace
This tutorial describes how to combine (use and fine-tune) pretrained models from the HuggingFace Transformers library, such as Whisper, wav2vec 2.0, HuBERT, WavLM, and others. These models can easily be plugged into SpeechBrain to tackle speech- and audio-related tasks: automatic speech recognition, speaker recognition, spoken language understanding, and more.
What about pretraining? Pretraining large SSL models is very complex, from the resources required (dozens of GPUs running for hundreds of hours) to reproducibility issues caused by the pipeline. For now, SpeechBrain only offers pretraining of wav2vec 2.0 models.
Why SpeechBrain? Many reasons could be given to motivate the use of SpeechBrain. However, in the very specific context of pretrained models, SpeechBrain lets researchers and users connect these architectures to state-of-the-art speech and audio technologies. For instance, SpeechBrain allows you to easily fine-tune a pretrained wav2vec 2.0 model combined with a transformer decoder, a beam search algorithm, and a transformer language model to build a state-of-the-art speech recognizer. It also helps you simply use a pretrained Whisper encoder to perform emotion recognition. To the best of our knowledge, most other toolkits do not allow you to do this.
Architectures of interest in this tutorial. We will only consider two of the most recent pretrained models: wav2vec 2.0 and Whisper. However, SpeechBrain supports many other models: WavLM, HuBERT …
Wav2vec 2.0 is a transformer-based encoder architecture for self-supervised representation learning of speech. Please refer to the official paper for more details: wav2vec2.
Illustration of wav2vec 2.0 (source).
Whisper is a full transformer (encoder-decoder) trained on a large amount of semi-supervised data (more than 600,000 hours of speech). Please refer to the official paper for more details: whisper.
Illustration of Whisper (source).
With this tutorial, you will learn how to:
Instantiate a wav2vec 2.0 or Whisper model to extract features from an audio file.
Use the wav2vec 2.0 and Whisper encoders as a module of your pipeline (ASR, TIMIT).
Fine-tune Whisper as an encoder-decoder architecture (ASR, LibriSpeech).
Understand the current limitations of our integration.
Prerequisites
Wav2Vec 2.0 and Whisper from HuggingFace
Wav2vec 2.0 models were initially shared through the Fairseq GitHub repository and recently moved to HuggingFace thanks to a nice integration with the HuggingFace Transformers API. The same happened to the Whisper model, which moved from the original repository to the HuggingFace Transformers API. Hence, if you want to use a pretrained Transformer model within SpeechBrain, all you need is a HuggingFace repository! (e.g. "facebook/wav2vec2-large-lv60", "openai/whisper-large", or "microsoft/wavlm-large").
But first, let's install all the required packages…
%%capture
# Installing SpeechBrain
BRANCH = 'develop'
!git clone https://github.com/speechbrain/speechbrain.git -b $BRANCH
%cd /content/speechbrain/
!python -m pip install .
Install the HuggingFace Transformers interface.
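The original install cell is not reproduced here; a minimal sketch of what it typically amounts to (version unpinned, adjust to your environment) is:
%%capture
# Install the HuggingFace Transformers library used by the SpeechBrain lobes below.
!pip install transformers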
Finally, let's download and load an audio file to play with.
%%capture
!wget https://www.dropbox.com/s/u8qyvuyie2op286/spk1_snt1.wav
import speechbrain as sb
source = sb.dataio.dataio.read_audio('spk1_snt1.wav').squeeze()
print(source.shape)
Here is the imported signal:
import matplotlib.pyplot as plt
plt.figure(1)
plt.plot(source)
plt.show()
from IPython.display import Audio
Audio('spk1_snt1.wav')
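One practical note: these pretrained models expect 16 kHz, single-channel audio. The file above is already at 16 kHz, but if yours is not, here is a minimal resampling sketch using torchaudio (assuming torchaudio is available, as it is a SpeechBrain dependency):
import torchaudio

# Load the waveform together with its original sampling rate.
signal, orig_freq = torchaudio.load('spk1_snt1.wav')
if orig_freq != 16000:
    # Resample to the 16 kHz rate expected by wav2vec 2.0 and Whisper.
    resampler = torchaudio.transforms.Resample(orig_freq=orig_freq, new_freq=16000)
    signal = resampler(signal)
print(signal.shape, orig_freq)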
The wav2vec 2.0, HuBERT, WavLM, and Whisper models are available in SpeechBrain as lobes. Their implementations can therefore be found in:
speechbrain.lobes.models.huggingface_transformers.wav2vec2
speechbrain.lobes.models.huggingface_transformers.whisper
Now, let's instantiate each of them. Note that, as is almost always the case in SpeechBrain, the returned objects are standard PyTorch modules.
# BE CAREFUL: IF YOU ARE NOT CONNECTED TO A GPU RUNTIME, THIS WILL CRASH
# This only happens on Colab; you can of course load the models on CPU as well.
from speechbrain.lobes.models.huggingface_transformers.wav2vec2 import Wav2Vec2
from speechbrain.lobes.models.huggingface_transformers.whisper import Whisper
# HuggingFace model hub
model_hub_w2v2 = "facebook/wav2vec2-base-960h"
model_hub_whisper = "openai/whisper-tiny"
model_w2v2 = Wav2Vec2(model_hub_w2v2, save_path='/content/pretrained/')
model_whisper = Whisper(model_hub_whisper, save_path='/content/pretrained/')
Here, we can explore the models…
print(model_whisper)
Now, we can try to extract audio features with these models! However, our example involves two different models that require different forward operations if the goal is simply to obtain latent representations of the audio input. Wav2vec 2.0 is a transformer encoder, so we just need to take the output of its last layer. Whisper, on the other hand, is a fully trained encoder-decoder, so we must make sure to retrieve only the output of the encoder!
source = source.unsqueeze(0)
print(source.shape)
fea_w2v2 = model_w2v2(source)
print(fea_w2v2.shape)
# This can be given as an argument when we instantiate the model as well
model_whisper.encoder_only=True
fea_whisper = model_whisper(source)
print(fea_whisper.shape)
What am I looking at?
These features correspond to the contextual representations obtained after the transformer (see C in the initial wav2vec 2.0 illustration). Hence, for the base model, this output has a dimensionality of 768 (as described in the paper). Then, wav2vec 2.0 produces output frames at 50 Hz, and the audio file lasts 2.87 seconds, which explains the 143 we obtain on the time dimension. Indeed, the shape is [batch, time, features]. The same logic applies to Whisper, as we retrieve the last hidden states of its transformer encoder.
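As a quick sanity check of that frame-rate reasoning (a minimal sketch, assuming spk1_snt1.wav is a 16 kHz recording):
sample_rate = 16000  # assumption: the file is sampled at 16 kHz
duration = source.shape[1] / sample_rate
print(f"Audio duration: {duration:.2f} s")
# At ~50 frames per second, we expect roughly duration * 50 frames.
print(f"Expected ~{int(duration * 50)} wav2vec 2.0 frames, got {fea_w2v2.shape[1]}")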
Wav2Vec 2.0 and Whisper encoders as part of your pipeline (ASR, TIMIT)
So far, we have only seen how to run inference on a single audio file with a pretrained wav2vec 2.0 or Whisper model. Of course, if you only want to extract features, you can simply loop over your dataset and store everything… or you can use SpeechBrain to plug these models directly into your pipeline and compute the features on the fly (and fine-tune them!).
Indeed, if you are familiar with our YAML format (if not, please check our tutorial first), the Wav2Vec2 and Whisper lobes can simply be added as a block to your hyperparameter file:
For wav2vec 2.0:
wav2vec2: !new:speechbrain.lobes.models.huggingface_transformers.wav2vec2.Wav2Vec2
source: !ref <wav2vec2_hub>
freeze: True
save_path: !ref <save_folder>/wav2vec2_checkpoint
For Whisper:
whisper: !new:speechbrain.lobes.models.huggingface_transformers.whisper.Whisper
    source: !ref <whisper_hub>
    freeze: True
    encoder_only: True
    save_path: !ref <save_folder>/whisper_checkpoint
freeze lets you either fine-tune (False) or freeze (True) the neural parameters. Note that you can also choose to freeze only the encoder of Whisper, or only the feature extractor of wav2vec 2.0 (see the sketch right after this paragraph). Within your pipeline, you then have two PyTorch module objects that can be used as standard layers to propagate your data!
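As a minimal sketch of that partial-freezing option (the flag names freeze_feature_extractor and freeze_encoder are assumptions that may differ between SpeechBrain versions; check the lobe docstrings):
# Fine-tune the wav2vec 2.0 transformer layers while keeping its CNN feature
# extractor frozen (flag name assumed).
partially_frozen_w2v2 = Wav2Vec2(
    model_hub_w2v2,
    save_path='/content/pretrained/',
    freeze=False,
    freeze_feature_extractor=True,
)

# Fine-tune only the Whisper decoder while keeping its encoder frozen
# (flag name assumed).
partially_frozen_whisper = Whisper(
    model_hub_whisper,
    save_path='/content/pretrained/',
    freeze=False,
    encoder_only=False,
    freeze_encoder=True,
)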
From this point on, you will need some basic knowledge of SpeechBrain. If anything is unclear, please refer to the prerequisites listed at the beginning of this tutorial.
Now, let's take a closer look at the LibriSpeech ASR (CTC) recipe that can be found here.
If you are not familiar with CTC-based ASR, please refer to our simplified and heavily commented template.
In the following, we only highlight the important parts of the code needed to use a Whisper or wav2vec 2.0 model in your recipe!
Understanding the YAML parameters.
In this setup, we want to fine-tune the Whisper or wav2vec 2.0 model for our downstream task. More precisely, the architecture of the model is:
[ wav -> wav2vec2 or whisper -> Dense ] = encoder
To achieve this, our YAML file is composed of several key components (if you are interested in Whisper, remove the w2v2 references, and vice versa):
[...]
# Hub id of the biggest already fine-tuned English wav2vec 2.0 model.
# Hub id of the medium Whisper model as well.
wav2vec2_hub: "facebook/wav2vec2-large-960h-lv60-self"
whisper_hub: "openai/whisper-medium"
freeze_pretrained: False
lr_pretrained: 0.0001
[...]
# The instantiation of the SpeechBrain lobe
wav2vec2: !new:speechbrain.lobes.models.huggingface_transformers.wav2vec2.Wav2Vec2
source: !ref <wav2vec2_hub>
freeze: !ref <freeze_pretrained>
save_path: !ref <save_folder>/wav2vec2_checkpoint
# The instantiation of the SpeechBrain lobe
whisper: !new:speechbrain.lobes.models.huggingface_transformers.whisper.Whisper
source: !ref <whisper_hub>
freeze: !ref <freeze_pretrained>
encoder_only: True
save_path: !ref <save_folder>/whisper_checkpoint
# A simple DNN that receives the output of the pretrained model as input.
# Here, the output dimensionality of the LARGE wav2vec2 and MEDIUM whisper is 1024.
enc: !new:speechbrain.lobes.models.VanillaNN.VanillaNN
input_shape: [null, null, 1024]
activation: !ref <activation>
dnn_blocks: !ref <dnn_layers>
dnn_neurons: !ref <dnn_neurons>
[...]
# Two optimizers and schedulers:
# 1. One to train the randomly initialized modules (encoder, decoder, etc.).
# 2. One to slowly fine-tune only the pretrained (w2v2 or whisper) part.
adam_opt_class: !name:torch.optim.AdamW
lr: !ref <lr>
pretrained_opt_class: !name:torch.optim.AdamW
lr: !ref <lr_pretrained>
lr_annealing_adam: !new:speechbrain.nnet.schedulers.NewBobScheduler
initial_value: !ref <lr>
improvement_threshold: 0.0025
annealing_factor: 0.8
patient: 0
lr_annealing_pretrained: !new:speechbrain.nnet.schedulers.NewBobScheduler
initial_value: !ref <lr_pretrained>
improvement_threshold: 0.0025
annealing_factor: 0.9
# We add wav2vec2 / whisper to the modules list so they are moved to the GPU(s).
# Remove the one that is not used!
modules:
wav2vec2: !ref <wav2vec2>
whisper: !ref <whisper>
enc: !ref <enc>
emb: !ref <emb>
dec: !ref <dec>
ctc_lin: !ref <ctc_lin>
seq_lin: !ref <seq_lin>
# We do not add the wav2vec2 / whisper to the model list, so we can apply one optimizer
# to the randomly initialized model and the other to the pretrained model.
model: !new:torch.nn.ModuleList
- [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]
# We add the wav2vec2 / whisper to our checkpointer so the model can be saved!
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
model: !ref <model>
wav2vec2: !ref <wav2vec2>
whisper: !ref <whisper>
lr_annealing_adam: !ref <lr_annealing_adam>
        lr_annealing_pretrained: !ref <lr_annealing_pretrained>
counter: !ref <epoch_counter>
We merge everything within the Python recipe file:
class ASR(sb.Brain):
def compute_forward(self, batch, stage):
[...]
# The compute forward is strictly identical to any compute_forward method
# for ASR, except that we just call the wav2vec2 / whisper on the wavs instead of computing acoustic features (FBANKs, MFCCs ...).
feats = self.modules.wav2vec2(wavs)
feats = self.modules.whisper(wavs)
x = self.modules.enc(feats)
[...]
def init_optimizers(self):
# Initializes the whisper optimizer and model optimizer. The same can be done for wav2vec2.
self.pretrained_optimizer = self.hparams.pretrained_opt_class(
self.modules.whisper.parameters()
)
self.adam_optimizer = self.hparams.adam_opt_class(
self.hparams.model.parameters()
)
[...]
def on_stage_end(self, stage, stage_loss, epoch):
        # Gets called at the end of an epoch.
[...]
if stage == sb.Stage.VALID:
# Here we apply our learning_rate annealing on both optimizers
old_lr_adam, new_lr_adam = self.hparams.lr_annealing_adam(wer)
old_lr_pretrained, new_lr_pretrained = self.hparams.lr_annealing_pretrained(wer)
sb.nnet.schedulers.update_learning_rate(
self.adam_optimizer, new_lr_adam
)
sb.nnet.schedulers.update_learning_rate(
                self.pretrained_optimizer, new_lr_pretrained
)
def fit_batch(self, batch):
# Override of the Brain Class fit_batch function.
# Managing automatic mixed precision
[...]
outputs = self.compute_forward(batch, sb.Stage.TRAIN)
loss = self.compute_objectives(outputs, batch, sb.Stage.TRAIN)
loss.backward()
# Here we manage both optimizers
# (Learning enc+dec and Fine-tuning wav2vec2).
if self.check_gradients(loss):
self.pretrained_optimizer.step()
self.adam_optimizer.step()
self.pretrained_optimizer.zero_grad()
self.adam_optimizer.zero_grad()
return loss.detach().cpu()
Note: of course, if you use a frozen wav2vec 2.0 model, there is no need for two different optimizers ;-) (a minimal sketch of that case follows below). And that's it! If you run your recipe this way, your Whisper / wav2vec 2.0 pretrained encoder becomes part of your architecture and is fine-tuned (or not) according to your needs.
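For that frozen case, a minimal sketch of what init_optimizers reduces to (reusing the hparams names from the YAML above) could be:
    def init_optimizers(self):
        # With a frozen wav2vec2 / whisper, a single optimizer over the randomly
        # initialized modules is enough; no pretrained_optimizer is needed.
        self.adam_optimizer = self.hparams.adam_opt_class(
            self.hparams.model.parameters()
        )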
Using Whisper as a fully pretrained encoder-decoder
Whisper is a full transformer. In theory, this means you can use it directly for zero-shot speech recognition or speech translation. In practice, you will most likely want to fine-tune it on your in-house dataset. Both options can be done in SpeechBrain; we just need to modify our YAML and recipe accordingly. Indeed, we no longer need a DNN decoder, since Whisper already has one. We also no longer rely on the CTC loss, since the transformer decoder can be trained with the negative log-likelihood. Finally, we have to decide whether to connect our model to greedy decoding, or to a more complex beam search decoder with or without language model scoring! Here is a summary of what SpeechBrain supports for Whisper:
Feature extraction
Encoder fine-tuning
Encoder-decoder zero-shot automatic speech recognition or speech translation
Encoder-decoder fine-tuning
Greedy decoding
Beam search decoding with and without a language model
Here, we will focus on fine-tuning the base Whisper model on LibriSpeech with greedy decoding.
To achieve this, we first need to modify the previous YAML file and Python script. Here, encoder_only must be set to False, since we want to keep the decoder. We also need to integrate a search function that takes the most likely token predicted by the decoder and feeds it back (concatenated with the previous tokens) to the decoder in an autoregressive fashion (a conceptual sketch follows below).
In contrast to the previous example, we do not need to add a language-modeling head on top of the Whisper decoder: it is already there when you download the Whisper model. You now have everything you need to fine-tune the Whisper encoder-decoder!
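To make that autoregressive loop concrete, here is a purely conceptual sketch of greedy decoding. It is not the actual S2SWhisperGreedySearch implementation, and decoder_step is a hypothetical callable returning the decoder logits for the current token sequence:
import torch

def greedy_decode_sketch(decoder_step, enc_out, bos_index, eos_index, max_steps):
    # Start from the BOS token and repeatedly append the most likely next token.
    tokens = torch.tensor([[bos_index]], device=enc_out.device)
    for _ in range(max_steps):
        logits = decoder_step(enc_out, tokens)           # shape: [1, len(tokens), vocab]
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == eos_index:               # stop once EOS is produced
            break
    return tokens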
Let's see what this looks like in practice:
[...]
whisper_hub: "openai/whisper-medium"
freeze_pretrained: False
lr_pretrained: 0.0001
# We need to specify the language of the input audio.
language: english
# These values are used during decoding (see the tokenizer snippet after this
# YAML block for how to look them up with the HuggingFace tokenizer).
# The first one designates the first token added at the start of the search.
# The second one is the token that stops the expansion of a hypothesis once it reaches eos.
timestamp_index: 50363
eos_index: 50257
# This value bounds the number of decoding steps:
# e.g., if the encoded speech is [B, T, F], the maximum number of steps is T * max_decode_ratio.
max_decode_ratio: 0.5
[...]
# The instantiation of the SpeechBrain lobe
whisper: !new:speechbrain.lobes.models.huggingface_transformers.whisper.Whisper
source: !ref <whisper_hub>
freeze: !ref <freeze_pretrained>
encoder_only: False # :)
save_path: !ref <save_folder>/whisper_checkpoint
[...]
pretrained_opt_class: !name:torch.optim.AdamW
lr: !ref <lr_pretrained>
lr_annealing_whisper: !new:speechbrain.nnet.schedulers.NewBobScheduler
initial_value: !ref <lr_pretrained>
improvement_threshold: 0.0025
annealing_factor: 0.9
# We add Whisper to the modules list so it is moved to the GPU(s).
modules:
whisper: !ref <whisper>
# We create the searcher used to decode the Whisper model.
valid_greedy_searcher: !new:speechbrain.decoders.seq2seq.S2SWhisperGreedySearch
model: !ref <whisper>
bos_index: !ref <timestamp_index>
eos_index: !ref <eos_index>
min_decode_ratio: 0
max_decode_ratio: !ref <max_decode_ratio>
# We add the whisper to our checkpointer so the model can be saved!
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
whisper: !ref <whisper>
scheduler_whisper: !ref <lr_annealing_whisper>
counter: !ref <epoch_counter>
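As a side note, the special-token indices used above (timestamp_index and eos_index) come from the Whisper tokenizer. A minimal sketch of how to look them up (assuming the HuggingFace transformers tokenizer API):
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-medium")
# Expected to match the values used in the YAML above.
print(tokenizer.convert_tokens_to_ids("<|notimestamps|>"))  # timestamp_index
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))     # eos_index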
We merge everything within the Python recipe file:
class ASR(sb.Brain):
def compute_forward(self, batch, stage):
wavs, wav_lens = batch.sig
bos_tokens, bos_tokens_lens = batch.tokens_bos
[...]
# The compute forward is similar to any compute_forward method for ASR
# with Transformers in SpeechBrain.
# Forward encoder + decoder
enc_out, logits, _ = self.modules.whisper(wavs, bos_tokens)
log_probs = self.hparams.log_softmax(logits)
hyps = None
if stage != sb.Stage.TRAIN:
# perform greedy searcher and return the hypotheses found
hyps, _ = self.hparams.valid_greedy_searcher(enc_out, wav_lens)
[...]
return log_probs, hyps, wav_lens
def compute_objectives(self, predictions, batch, stage):
log_probs, hyps, wav_lens, = predictions
tokens_eos, tokens_eos_lens = batch.tokens_eos
[...]
# compute the NLL loss
loss = self.hparams.nll_loss(
log_probs, tokens_eos, tokens_eos_lens,
)
if stage != sb.Stage.TRAIN:
tokens, tokens_lens = batch.tokens
# Decode token terms to words
predicted_words = self.tokenizer.batch_decode(
hyps, skip_special_tokens=True
)
# Convert indices to words
target_words = undo_padding(tokens, tokens_lens)
target_words = self.tokenizer.batch_decode(
target_words, skip_special_tokens=True
)
# Compute our metrics
self.wer_metric.append(ids, predicted_words, target_words)
self.cer_metric.append(ids, predicted_words, target_words)
[...]
return loss
def on_stage_end(self, stage, stage_loss, epoch):
"""Gets called at the end of an epoch."""
# Compute/store important stats
stage_stats = {"loss": stage_loss}
if stage == sb.Stage.TRAIN:
self.train_stats = stage_stats
else:
stage_stats["CER"] = self.cer_metric.summarize("error_rate")
stage_stats["WER"] = self.wer_metric.summarize("error_rate")
# Perform end-of-iteration things, like annealing, logging, etc.
if stage == sb.Stage.VALID:
old_lr_whisper, new_lr_whisper = self.hparams.lr_annealing_whisper(
stage_stats["loss"]
)
sb.nnet.schedulers.update_learning_rate(
self.optimizer, new_lr_whisper
)
self.hparams.train_logger.log_stats(
stats_meta={"epoch": epoch, "lr_whisper": old_lr_whisper},
train_stats=self.train_stats,
valid_stats=stage_stats,
)
self.checkpointer.save_and_keep_only(
meta={"WER": stage_stats["WER"]}, min_keys=["WER"],
)
elif stage == sb.Stage.TEST:
self.hparams.train_logger.log_stats(
stats_meta={"Epoch loaded": self.hparams.epoch_counter.current},
test_stats=stage_stats,
)
with open(self.hparams.wer_file, "w") as w:
self.wer_metric.write_stats(w)
And with that, you are ready to fine-tune the latest Whisper models on the dataset of your choice!
You can play with this model and try to improve it by using beam search decoding instead of greedy search, or you can directly scale up and use the largest available Whisper model… all of this is possible with SpeechBrain!
Citing SpeechBrain
If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:
@misc{speechbrainV1,
title={Open-Source Conversational AI with {SpeechBrain} 1.0},
author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
year={2024},
eprint={2407.00463},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}