
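Speech Recognition From Scratch

Ready to build your own speech recognizer with SpeechBrain?

You're in luck: this tutorial is what you need. We will guide you through the whole process of setting up an offline, end-to-end, attention-based speech recognizer.

Before we dive in, let's take a quick look at speech recognition and at the techniques that SpeechBrain supports.

Let's get started! 🚀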

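Overview of Speech Recognition

The figure shows an example of a typical speech recognition pipeline used in SpeechBrain:

The speech recognition process starts directly from the raw waveform 🎤.

The raw waveform is corrupted with various speech augmentation techniques, such as time/frequency dropout, speed change, additive noise, and reverberation. These distortions are activated randomly according to user-specified probabilities and are applied on the fly, so the augmented signals never need to be stored on disk.

For a closer look at the corruption techniques, see our tutorials on speech augmentation and environmental corruption.

Next, we extract speech features, such as the Short-Time Fourier Transform (STFT), spectrograms, FBANKs, and MFCCs. Thanks to an efficient, GPU-friendly implementation, these features can be computed on the fly.

For more details, see our tutorials on speech representation and speech features.

The features are then fed to the speech recognizer, a neural network that maps the input sequence of features to an output sequence of tokens (e.g., phonemes, characters, subwords, words). SpeechBrain supports popular techniques such as Connectionist Temporal Classification (CTC), Transducers, and encoder-decoder models with attention (using both RNN- and Transformer-based systems).

The posterior probabilities over the output tokens are processed by a beamsearcher, which explores alternative hypotheses and outputs the best one. Optionally, the alternatives can be rescored with an external language model based on RNNs or Transformers 🤖.

Not all of these modules are mandatory. For example, data corruption can be skipped if it does not help a specific task, and greedy search can replace beam search when fast decoding is needed.

Now, let's discuss the different technologies that support speech recognition in more detail: 🚀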

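Connectionist Temporal Classification (CTC)

CTC is the simplest speech recognition system available in SpeechBrain.

It produces a prediction at every time step. CTC introduces a special token, blank, that lets the network output nothing when it is uncertain. The CTC cost function uses dynamic programming to sum over all possible alignments.

For each alignment, the corresponding probability can be computed. The final CTC cost is the sum of the probabilities of all possible alignments (in practice, its negative logarithm is minimized), computed efficiently with the forward algorithm. This is the forward algorithm described in the Hidden Markov Model literature, not the forward pass of a neural network.

In encoder-decoder architectures, an attention mechanism learns the alignment between the input and output sequences. In CTC, the alignment is not learned; instead, we integrate (marginalize) over all possible alignments.

Essentially, implementing CTC amounts to adding a specialized cost function on top of a speech recognizer, which is often based on recurrent neural networks (RNNs), although not necessarily. 🧠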
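
To make the "sum over all alignments" concrete, here is a minimal pure-Python sketch (not SpeechBrain's implementation). It computes the probability of a target sequence twice: by enumerating every frame-level path, and with the forward algorithm. The posteriors, the token indexes, and the function names are made up for illustration.

```python
import itertools
import math

BLANK = 0

def collapse(path):
    """Map a frame-level path to a label sequence: merge repeats, then drop blanks."""
    out = []
    prev = None
    for tok in path:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

def ctc_prob_brute_force(probs, target):
    """Sum the probability of every frame-level path that collapses to `target`."""
    n_tokens = len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_tokens), repeat=len(probs)):
        if collapse(path) == list(target):
            total += math.prod(probs[t][tok] for t, tok in enumerate(path))
    return total

def ctc_prob_forward(probs, target):
    """Same quantity computed with the forward algorithm (dynamic programming)."""
    # Extended target: blank, y1, blank, y2, ..., blank
    ext = [BLANK]
    for tok in target:
        ext += [tok, BLANK]
    S = len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

# Toy posteriors: 4 frames, 3 tokens (index 0 is blank); each row sums to 1
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
    [0.7, 0.1, 0.2],
]
target = [1, 2]
p_dp = ctc_prob_forward(probs, target)
p_bf = ctc_prob_brute_force(probs, target)
print(f"forward: {p_dp:.6f}  brute force: {p_bf:.6f}  CTC loss: {-math.log(p_dp):.4f}")
```

The brute-force version visits every possible path, so its cost grows exponentially with the number of frames, while the forward algorithm only needs one pass over the frames. Real implementations work in the log domain for numerical stability.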

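Transducers

As shown in the figure, Transducers extend CTC by introducing an autoregressive predictor and a joint network.

The encoder converts the input features into a sequence of encoded representations. The predictor, on the other hand, generates a latent representation based on the previously emitted outputs. The joint network combines the two, and a softmax classifier predicts the current output token. During training, a CTC-like loss (the transducer loss) is applied after the classifier.

For a deeper look at Transducers, check out this informative tutorial by Loren Lugosch: Transducer Tutorial 📚.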

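Encoder-Decoder with Attention 👂

Another widely used approach to speech recognition relies on an encoder-decoder architecture.

The encoder processes a sequence of speech features (or the raw samples directly) and generates a sequence of states, denoted h.
The decoder takes the last hidden state and generates N output tokens. The decoder is typically autoregressive: the previous output is fed back into the input. Decoding stops when the end-of-sentence (eos) token is predicted.
Encoders and decoders can be built with various neural architectures, such as RNNs, CNNs, Transformers, or combinations of them.

Attention provides a dynamic connection between the encoder and decoder states. SpeechBrain supports different attention types, including content-based and location-aware attention for RNN-based systems, and key-value attention for Transformers. To help convergence, a CTC loss is often applied on top of the encoder. 🚀

This architecture is flexible and adaptable, which makes it effective for speech recognition across a variety of applications.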

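Beamsearch

The beamsearcher used in encoder-decoder models follows an autoregressive process. Here is how it operates:

Initialization: the process starts from the <bos> (beginning-of-sequence) token.
Prediction: the model predicts the N most promising next tokens given the current input.
Feeding alternatives: these N alternatives are fed back into the decoder to generate future hypotheses.
Selection: the best N hypotheses are selected according to some criterion or scoring mechanism.
Iteration: the loop continues until the <eos> (end-of-sequence) token is predicted.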
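
The steps above can be sketched with a toy example. This is not the SpeechBrain beamsearcher: the "decoder" is a hypothetical lookup table whose next-token probabilities depend only on the previous token, whereas a real decoder also conditions on the encoder states. A beam size of 1 corresponds to greedy search.

```python
import math

BOS, EOS = "<bos>", "<eos>"

# Hypothetical toy "decoder": next-token probabilities given the previous token
NEXT_TOKEN_PROBS = {
    BOS: {"THE": 0.6, "A": 0.4},
    "THE": {"CITY": 0.5, "CAT": 0.4, EOS: 0.1},
    "A": {"CITY": 0.9, EOS: 0.1},
    "CITY": {EOS: 1.0},
    "CAT": {EOS: 1.0},
}

def beam_search(beam_size, max_len=10):
    # Initialization: a single hypothesis (token list, log-probability)
    beams = [([BOS], 0.0)]
    finished = []
    for _ in range(max_len):
        # Prediction: expand every active hypothesis with each possible next token
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Selection: keep only the best `beam_size` hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        # Iteration stops when every surviving hypothesis has produced <eos>
        if not beams:
            break
    return max(finished, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(beam_size=1)
beam_tokens, beam_score = beam_search(beam_size=3)
print(greedy_tokens, round(math.exp(greedy_score), 3))
print(beam_tokens, round(math.exp(beam_score), 3))
```

In this toy table, greedy search commits to "THE" at the first step and ends with a lower-probability sentence, while the wider beam keeps "A" alive long enough to find the better hypothesis.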

[Figure: SpeechBrain-Page-2 (1).png]

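We encourage readers who are not yet familiar with speech recognition to learn more about this technology before moving on. Besides scientific papers, you can find excellent tutorials and blog posts online.

After this brief overview, let's see how to develop a speech recognition system (encoder-decoder + CTC) with SpeechBrain.

For simplicity, training is done on a small open-source dataset called mini-librispeech, which contains only a few hours of training data. In a real scenario, you need much more training material (e.g., 100 or even 1000 hours) to reach acceptable performance.

Installation

To run the code fast enough, we suggest using a GPU (Runtime => change runtime type => GPU). In this tutorial, we refer to the code in speechbrain/templates/speech_recognition/ASR.

Before starting, let's install speechbrain: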

%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH

# Clone SpeechBrain repository
!git clone https://github.com/speechbrain/speechbrain/

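Which steps are needed?

1. Prepare your data

Create the data manifest files (in CSV or JSON format), which specify the location of the speech data and the corresponding text annotations.
Use a tool such as mini_librispeech_prepare.py to generate these manifest files.

2. Train a tokenizer

Decide the basic units (e.g., characters, phonemes, subwords, words) used to train the speech recognizer and the language model.

Run the tokenizer training script: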

cd speechbrain/templates/speech_recognition/Tokenizer
python train.py tokenizer.yaml

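3. Train a language model

Train a language model on a large text corpus (preferably in the same language domain as the target application).

Example training script for the language model: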

pip install datasets
cd speechbrain/templates/speech_recognition/LM
python train.py RNNLM.yaml

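4. Train the speech recognizer

Train the speech recognizer with the chosen model (e.g., CRDNN), which here uses an autoregressive GRU decoder with an attention mechanism.

Sequences are generated with beamsearch coupled with the trained language model: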

cd speechbrain/templates/speech_recognition/ASR
python train.py train.yaml

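5. Use the speech recognizer (inference)

After training, deploy the trained speech recognizer for inference.
Classes such as EncoderDecoderASR in SpeechBrain simplify the inference process.

Each of these steps is essential for building an effective end-to-end speech recognizer.

We now describe all these steps in detail.

Step 1: Prepare your data

Data preparation is a crucial first step in training an end-to-end speech recognizer. Its main goal is to generate the data manifest files, which tell SpeechBrain where to find the audio data and the corresponding transcriptions. These manifest files are written in the widely used CSV and JSON formats and play a key role in organizing the training process.

Data manifest files

Let's look at the structure of a data manifest file in JSON format: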

{
  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE"
  },
  "1867-154075-0001": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0001.flac",
    "length": 14.9,
    "words": "THAT DROPPED HIM INTO THE COAL BIN DID HE GET COAL DUST ON HIS SHOES RIGHT AND HE DIDN'T HAVE SENSE ENOUGH TO WIPE IT OFF AN AMATEUR A RANK AMATEUR I TOLD YOU SAID THE MAN OF THE SNEER WITH SATISFACTION"
  },
  "1867-154075-0028": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0028.flac",
    "length": 16.41,
    "words": "MY NAME IS JOHN MARK I'M DOONE SOME CALL ME RONICKY DOONE I'M GLAD TO KNOW YOU RONICKY DOONE I IMAGINE THAT NAME FITS YOU NOW TELL ME THE STORY OF WHY YOU CAME TO THIS HOUSE OF COURSE IT WASN'T TO SEE A GIRL"
  }
}

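This structure is hierarchical, with the unique identifier of the spoken sentence as the first key. Each entry specifies the key fields: the path of the speech recording, its length in seconds, and the sequence of words uttered.

A special variable, data_root, allows the data folder to be changed dynamically from the command line or from the YAML hyperparameter file.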
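
As an illustration of how the data_root placeholder works, the sketch below parses a manifest in the format shown above and substitutes the placeholder by hand. SpeechBrain handles this substitution when it loads the manifest; the load_manifest helper, the shortened transcriptions, and the /tmp/data folder are hypothetical and only serve to show the mechanics.

```python
import json

# A manifest in the format shown above (transcriptions shortened for the example)
manifest_text = """
{
  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD"
  },
  "1867-154075-0001": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0001.flac",
    "length": 14.9,
    "words": "THAT DROPPED HIM INTO THE COAL BIN"
  }
}
"""

def load_manifest(text, data_root):
    """Parse a JSON manifest and substitute the {data_root} placeholder."""
    entries = json.loads(text)
    for entry in entries.values():
        entry["wav"] = entry["wav"].replace("{data_root}", data_root)
    return entries

# "/tmp/data" is a hypothetical data folder
manifest = load_manifest(manifest_text, data_root="/tmp/data")
total_seconds = sum(entry["length"] for entry in manifest.values())
for utt_id, entry in manifest.items():
    print(utt_id, entry["wav"], len(entry["words"].split()), "words")
print(f"total duration: {total_seconds:.2f} s")
```

Because the audio paths are stored relative to the placeholder, the same manifest can be reused after the dataset is moved, for example to the local disk of a compute node.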

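Preparation script

You need to write a preparation script for your own dataset, because every dataset has its own format. The mini_librispeech_prepare.py script, tailored to the mini-librispeech dataset, serves as a template. It automatically downloads the publicly available data, searches for the audio files and transcriptions, and creates the JSON files.

Use this script as a starting point for preparing your target dataset. It is a practical guide to organizing the training, validation, and test phases into three separate data manifest files.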
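
For a custom dataset, the core of a preparation script can be as small as the sketch below. It is not mini_librispeech_prepare.py: the write_manifests function, the split fractions, and the utterance records are assumptions made for illustration, and a real script would collect the records by scanning the dataset folder and reading the audio durations.

```python
import json
import os
import random
import tempfile

def write_manifests(utterances, out_dir, valid_frac=0.1, test_frac=0.1, seed=1234):
    """Split utterance records into train/valid/test JSON manifests."""
    records = sorted(utterances, key=lambda u: u["id"])
    random.Random(seed).shuffle(records)  # fixed seed keeps the split reproducible
    n_valid = max(1, int(len(records) * valid_frac))
    n_test = max(1, int(len(records) * test_frac))
    splits = {
        "valid": records[:n_valid],
        "test": records[n_valid:n_valid + n_test],
        "train": records[n_valid + n_test:],
    }
    paths = {}
    for name, items in splits.items():
        manifest = {
            u["id"]: {"wav": u["wav"], "length": u["length"], "words": u["words"]}
            for u in items
        }
        paths[name] = os.path.join(out_dir, f"{name}.json")
        with open(paths[name], "w", encoding="utf-8") as fout:
            json.dump(manifest, fout, indent=2)
    return paths

# Made-up records standing in for the result of scanning a dataset folder
utterances = [
    {"id": f"utt{i:03d}", "wav": f"{{data_root}}/audio/utt{i:03d}.wav",
     "length": 2.0 + 0.1 * i, "words": "HELLO WORLD"}
    for i in range(20)
]
out_dir = tempfile.mkdtemp()
paths = write_manifests(utterances, out_dir)
sizes = {}
for name, path in paths.items():
    with open(path, encoding="utf-8") as fin:
        sizes[name] = len(json.load(fin))
print(sizes)
```

In practice you may prefer to split by speaker rather than by utterance, so that no speaker appears in more than one set.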

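Copy your data locally

On HPC clusters and similar environments, you can speed up your code by copying the data to a local folder on the compute node. This does not apply to Google Colab, but reading data from the local filesystem rather than a shared one makes the code run significantly faster.

Keep these considerations in mind as you prepare the data for training your speech recognizer. 🚀🎙️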

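Step 2: Tokenizer

Choosing the basic tokens for your speech recognizer is a key decision that affects model performance. There are several options, each with its own advantages and challenges.

Using characters as tokens

A straightforward approach is to predict characters, converting the word sequence into a character sequence. For example:

THE CITY OF MONTREAL => ['T','H','E', '_', 'C','I','T','Y','_', 'O', 'F', '_', 'M','O','N','T','R','E','A','L']

The advantages of this approach are the small total number of tokens and the ability to generalize to unseen words. The drawback is that long sequences must be predicted.

Using words as tokens

Predicting full words is another option:

THE CITY OF MONTREAL => ['THE','CITY','OF','MONTREAL']

The advantage is a short output sequence. However, the system cannot generalize to new words, and tokens may be allocated to words with little training material.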
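
The trade-off between the two choices can be seen in a few lines of Python. This is only a sketch: the helper functions, the tiny vocabulary, and the "<unk>" symbol are assumptions made for the example.

```python
def char_tokenize(sentence):
    """Split into characters, using '_' for the space between words."""
    return [("_" if ch == " " else ch) for ch in sentence]

def word_tokenize(sentence, vocabulary):
    """Split into words; anything outside the vocabulary becomes '<unk>'."""
    return [w if w in vocabulary else "<unk>" for w in sentence.split()]

# Hypothetical vocabulary collected from the training transcriptions
vocabulary = {"THE", "CITY", "OF", "MONTREAL"}

seen = "THE CITY OF MONTREAL"
unseen = "THE CITY OF TORONTO"

print(char_tokenize(seen))
print(word_tokenize(seen, vocabulary))
print(word_tokenize(unseen, vocabulary))
print(len(char_tokenize(seen)), "characters vs", len(seen.split()), "words")
```

Characters can spell any new word but make the output sequence several times longer; words keep the sequence short but cannot represent a word that was never seen in training.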

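Byte Pair Encoding (BPE) tokens

A middle ground is Byte Pair Encoding (BPE), a technique inherited from data compression. It allocates tokens to the most frequent sequences of characters:

THE CITY OF MONTREAL => ['THE', '▁CITY', '▁OF', '▁MO', 'NT', 'RE', 'AL']

BPE finds tokens based on the most frequent pairs of characters, which makes the token length flexible.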
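
The merging idea behind BPE can be sketched as follows. This is a toy version of the algorithm, not the SentencePiece implementation: it works on a handful of words, marks the start of each word with '▁', and repeatedly merges the most frequent adjacent pair of symbols. The learn_bpe function and the tiny corpus are made up for illustration.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version of the algorithm)."""
    # Each word starts as a tuple of characters, with '▁' marking the word start
    corpus = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken deterministically by the pair itself
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

words = "THE CITY OF MONTREAL THE CITY THE".split()
merges, corpus = learn_bpe(words, num_merges=3)
print(merges)
print(sorted(corpus))
```

After three merges the frequent word THE has become a single token, while the rare word MONTREAL is still split into characters. The number of merges plays the role of the vocabulary size discussed next.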

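How many BPE tokens?

The number of tokens is a hyperparameter that depends on the amount of speech data available. As a reference, 1k to 10k tokens are reasonable for a dataset such as LibriSpeech (1000 hours of English sentences).

Train a tokenizer

SpeechBrain relies on SentencePiece for tokenization. To find the tokens for the training transcriptions, run the following code: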

cd speechbrain/templates/speech_recognition/Tokenizer
python train.py tokenizer.yaml

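This step plays a key role in shaping the behavior of your speech recognizer. Experiment with different tokenization strategies to find the one that best fits your dataset and goals. 🚀🔍

Let's train the tokenizer: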

%cd /content/speechbrain/templates/speech_recognition/Tokenizer
!python train.py tokenizer.yaml

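The code might take a while to run because the data must be downloaded and prepared. As for all other recipes in SpeechBrain, we have a training script (train.py) and a hyperparameter file (tokenizer.yaml). Let's take a closer look at the latter first: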

# ############################################################################
# Tokenizer: subword BPE tokenizer with unigram 1K
# Training: Mini-LibriSpeech
# Authors:  Abdel Heba 2021
#           Mirco Ravanelli 2021
# ############################################################################


# Set up folders for reading from and writing to
data_folder: ../data
output_folder: ./save

# Path where data-specification files are stored
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json

# Tokenizer parameters
token_type: unigram  # ["unigram", "bpe", "char"]
token_output: 1000  # index(blank/eos/bos/unk) = 0
character_coverage: 1.0
annotation_read: words # field to read

# Tokenizer object
tokenizer: !name:speechbrain.tokenizers.SentencePiece.SentencePiece
   model_dir: !ref <output_folder>
   vocab_size: !ref <token_output>
   annotation_train: !ref <train_annotation>
   annotation_read: !ref <annotation_read>
   model_type: !ref <token_type> # ["unigram", "bpe", "char"]
   character_coverage: !ref <character_coverage>
   annotation_list_to_check: [!ref <train_annotation>, !ref <valid_annotation>]
   annotation_format: json

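The tokenizer is trained on the training annotations only. Here we set a vocabulary size of 1000. Instead of the standard BPE algorithm, we use a variant based on unigram smoothing. See sentencepiece for more information. The tokenizer will be saved in the specified output_folder.

Let's now take a look at the training script train.py: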

if __name__ == "__main__":

    # Load hyperparameters file with command-line overrides
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    # Create experiment directory
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )

    # Data preparation, to be run on only one process.
    prepare_mini_librispeech(
        data_folder=hparams["data_folder"],
        save_json_train=hparams["train_annotation"],
        save_json_valid=hparams["valid_annotation"],
        save_json_test=hparams["test_annotation"],
    )

    # Train tokenizer
    hparams["tokenizer"]()

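Essentially, we prepare the data with the prepare_mini_librispeech script and then run the sentencepiece tokenizer wrapped in speechbrain.tokenizers.SentencePiece.SentencePiece.

Let's look at the files generated by the tokenizer. In the specified output folder (Tokenizer/save), you can find two files: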

1000_unigram.model
1000_unigram.vocab

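The first is a binary file that contains all the information needed to tokenize an input text. The second is a text file that lists the allocated tokens together with their log probabilities: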

▁THE  -3.2458
S -3.36618
ED  -3.84476
▁ -3.91777
E -3.92101
▁AND  -3.92316
▁A  -3.97359
▁TO -4.00462
▁OF -4.08116
....
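
If you want to inspect the vocabulary programmatically, a listing like the one above can be parsed in a few lines. The sketch below embeds the first lines of the listing as a string and assumes that the two columns are separated by a tab, as is usual for SentencePiece .vocab files; the parse_vocab helper is hypothetical.

```python
import math

# First lines of the listing above; the tab separator is an assumption
vocab_text = """\
▁THE\t-3.2458
S\t-3.36618
ED\t-3.84476
▁\t-3.91777
E\t-3.92101
▁AND\t-3.92316
▁A\t-3.97359
▁TO\t-4.00462
▁OF\t-4.08116
"""

def parse_vocab(text):
    """Return a {token: log_probability} dict from .vocab-style lines."""
    vocab = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        token, log_prob = line.rsplit("\t", 1)
        vocab[token] = float(log_prob)
    return vocab

vocab = parse_vocab(vocab_text)
for token, log_prob in sorted(vocab.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{token!r}: log-prob {log_prob:.4f}, prob {math.exp(log_prob):.4f}")
```

The '▁' symbol marks a token that starts a new word, so '▁THE' is the word THE while 'S' and 'ED' are word-internal pieces.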

Now let's see how to use the learned model to tokenize text:

import torch
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load("/content/speechbrain/templates/speech_recognition/Tokenizer/save/1000_unigram.model")

# Encode as pieces
print(sp.encode_as_pieces('THE CITY OF MONTREAL'))

# Encode as ids
print(sp.encode_as_ids('THE CITY OF MONTREAL'))

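Note that the sentencepiece tokenizer also assigns a unique index to each allocated token. These indexes correspond to the outputs of the neural networks that we use for the language model and for automatic speech recognition (ASR).

Step 3: Train a language model

A language model (LM) plays a key role in improving the performance of a speech recognizer. In this tutorial, we adopt shallow fusion: linguistic information is integrated into the beamsearcher of the speech recognizer to rescore partial hypotheses. This means rescoring the partial hypotheses provided by the speech recognizer with a language score, which penalizes token sequences that are "unlikely" to be observed.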
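
A common way to write shallow fusion is as a weighted sum of log-probabilities: the score of a hypothesis is the recognizer score plus a weight times the language model score. The sketch below assumes this formulation; the hypotheses and their probabilities are made up, and the weight of 0.5 is only an example value.

```python
import math

def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight):
    """Combine the recognizer and language-model scores of one hypothesis."""
    return asr_log_prob + lm_weight * lm_log_prob

# Made-up scores for two acoustically similar hypotheses
hypotheses = {
    "RECOGNIZE SPEECH": {"asr": math.log(0.40), "lm": math.log(0.050)},
    "WRECK A NICE BEACH": {"asr": math.log(0.45), "lm": math.log(0.001)},
}

def best_hypothesis(lm_weight):
    """Return the hypothesis with the highest fused score."""
    return max(
        hypotheses,
        key=lambda h: shallow_fusion_score(
            hypotheses[h]["asr"], hypotheses[h]["lm"], lm_weight
        ),
    )

print("without LM:", best_hypothesis(lm_weight=0.0))
print("with LM   :", best_hypothesis(lm_weight=0.5))
```

With a weight of zero the acoustically stronger but linguistically unlikely hypothesis wins; with the language model switched on, the more plausible sentence is preferred.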

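Text corpus

Training a language model usually requires a large text corpus, on which the model learns to predict the most likely next token. If your application does not have a large enough text corpus, you may skip this part. Also, training a language model on a large text corpus is computationally demanding, so consider starting from a pretrained model and fine-tuning it if needed.

In this tutorial, we train a language model on the training transcriptions of mini-librispeech. Keep in mind that this is a simplified demonstration for educational purposes.

Train a language model

We will train a simple RNN-based language model that estimates the next token given the previous ones.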

[Figure: SpeechBrain-Page-3 (1).png]

To train it, run the following code:

!pip install datasets
%cd /content/speechbrain/templates/speech_recognition/LM
!python train.py RNNLM.yaml #--device='cpu'

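As the output shows, both the training and validation losses decrease steadily over time.

Before diving into the code, let's look at what is generated in the specified output_folder:

train_log.txt: contains the statistics (e.g., train_loss, valid_loss) computed at each epoch.
log.txt: a detailed logger that provides a timestamp for each basic operation.
env.log: lists all the dependencies used along with their versions, which helps replicability.
train.py, hyperparams.yaml: copies of the experiment file and of its hyperparameters, which are essential for reproducibility.
save: the folder where the learned model is stored.

In the save folder, the subfolders contain the checkpoints saved during training, named in the format CKPT+date+time. Typically, you find two checkpoints here: the best one (i.e., the oldest, which achieved the best performance) and the latest one (i.e., the most recent). If there is only one checkpoint, the last epoch is also the best.

Each checkpoint folder contains all the information needed to resume training, including the model, optimizer, scheduler, and epoch counter. The parameters of the RNNLM model are stored in the model.ckpt file, in a binary format readable with torch.load.

Hyperparameters

For the complete file, see RNNLM.yaml in speechbrain/templates/speech_recognition/LM.

In the first part, we define the basic configuration, such as the random seed, the output folder paths, and the training logger: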

seed: 2602
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/RNNLM/
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

The next part specifies the paths of the text corpora used for training, validation, and test:

lm_train_data: data/train.txt
lm_valid_data: data/valid.txt
lm_test_data: data/test.txt

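Unlike other recipes, the language model (LM) works directly on large raw text corpora, with no JSON/CSV manifest files, and relies on HuggingFace datasets for efficiency.

Next come the training logger and the tokenizer specification (we use the tokenizer trained in the previous step):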

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

tokenizer_file: ../Tokenizer/save/1000_unigram.model

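Then we define the main training hyperparameters, including the number of epochs, batch size, and learning rate, along with the key architecture parameters, such as the embedding dimension, RNN size, number of layers, and output dimension: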

number_of_epochs: 20
batch_size: 80
lr: 0.001
accu_steps: 1
ckpt_interval_minutes: 15

emb_dim: 256
rnn_size: 512
layers: 2
output_neurons: 1000

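Next, we declare the objects used to train the language model, including the RNN model, the cost function, the optimizer, and the learning rate scheduler: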

model: !new:templates.speech_recognition.LM.custom_model.CustomModel
    embedding_dim: !ref <emb_dim>
    rnn_size: !ref <rnn_size>
    layers: !ref <layers>

compute_cost: !name:speechbrain.nnet.losses.nll_loss

optimizer: !name:torch.optim.Adam
    lr: !ref <lr>
    betas: (0.9, 0.98)
    eps: 0.000000001

lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0
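
To give an intuition for the lr_annealing entry, here is a simplified pure-Python sketch of a NewBob-style schedule. It is our reading of the idea (anneal the learning rate when the relative improvement of the validation loss falls below a threshold), not the SpeechBrain code; the function name and the validation losses are made up, while the default values mirror the YAML above.

```python
def newbob_schedule(valid_losses, initial_value, improvement_threshold=0.0025,
                    annealing_factor=0.8, patient=0):
    """Return the learning rate in effect after each epoch (simplified sketch)."""
    lr = initial_value
    current_patient = patient
    previous = None
    history = []
    for loss in valid_losses:
        if previous is not None:
            # Relative improvement with respect to the previous epoch
            improvement = (previous - loss) / previous if previous != 0 else 0.0
            if improvement < improvement_threshold:
                if current_patient == 0:
                    lr *= annealing_factor  # too little progress: anneal
                    current_patient = patient
                else:
                    current_patient -= 1    # wait a bit longer before annealing
        previous = loss
        history.append(lr)
    return history

# Made-up validation losses: steady progress, then a plateau
losses = [5.0, 4.0, 3.5, 3.499, 3.3, 3.31]
lrs = newbob_schedule(losses, initial_value=0.001)
for epoch, (loss, lr) in enumerate(zip(losses, lrs), start=1):
    print(f"epoch {epoch}: valid_loss={loss:.3f} lr={lr:.6f}")
```

With patient set to 0, the learning rate is multiplied by the annealing factor as soon as one epoch fails to improve enough, which happens at epochs 4 and 6 in this example.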

The YAML file ends with the specification of the epoch counter, the tokenizer, and the checkpointer:

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

modules:
    model: !ref <model>

tokenizer: !new:sentencepiece.SentencePieceProcessor

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        counter: !ref <epoch_counter>

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        tokenizer: !ref <tokenizer>
    paths:
        tokenizer: !ref <tokenizer_file>

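The Pretrainer class connects the tokenizer object to the pretrained tokenizer file.

Experiment file

Let's now see how the objects, functions, and hyperparameters declared in the yaml file are used in train.py to implement the language model.

Let's start from the main function of train.py: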

# Recipe begins!
if __name__ == "__main__":

    # Reading command line arguments
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize ddp (useful only for multi-GPU DDP training)
    sb.utils.distributed.ddp_init_group(run_opts)

    # Load hyperparameters file with command-line overrides
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    # Create experiment directory
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )

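Here we perform some preliminary operations: parsing the command line, initializing distributed data-parallel training (needed when multiple GPUs are used), creating the output folder, and reading the yaml file.

After the yaml file is read with load_hyperpyyaml, all the objects declared in the hyperparameter file are initialized and made available in a dictionary (together with the other functions and parameters reported in the yaml file). For instance, we will have hparams['model'], hparams['optimizer'], hparams['batch_size'], and so on.

Data-IO pipeline

We then call a special function that creates the dataset objects for training, validation, and test.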

    # Create dataset objects "train", "valid", and "test"
    train_data, valid_data, test_data = dataio_prepare(hparams)

Let's take a closer look at it.

def dataio_prepare(hparams):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions.

    The language model is trained with the text files specified by the user in
    the hyperparameter file.

    Arguments
    ---------
    hparams : dict
        This dictionary is loaded from the `train.yaml` file, and it includes
        all the hyperparameters needed for dataset construction and loading.

    Returns
    -------
    datasets : list
        List containing "train", "valid", and "test" sets that correspond
        to the appropriate DynamicItemDataset object.
    """

    logging.info("generating datasets...")

    # Prepare datasets
    datasets = load_dataset(
        "text",
        data_files={
            "train": hparams["lm_train_data"],
            "valid": hparams["lm_valid_data"],
            "test": hparams["lm_test_data"],
        },
    )

    # Convert huggingface's dataset to DynamicItemDataset via a magical function
    train_data = sb.dataio.dataset.DynamicItemDataset.from_arrow_dataset(
        datasets["train"]
    )
    valid_data = sb.dataio.dataset.DynamicItemDataset.from_arrow_dataset(
        datasets["valid"]
    )
    test_data = sb.dataio.dataset.DynamicItemDataset.from_arrow_dataset(
        datasets["test"]
    )

    datasets = [train_data, valid_data, test_data]
    tokenizer = hparams["tokenizer"]

    # Define text processing pipeline. We start from the raw text and then
    # encode it using the tokenizer. The tokens with bos are used for feeding
    # the neural network, the tokens with eos for computing the cost function.
    @sb.utils.data_pipeline.takes("text")
    @sb.utils.data_pipeline.provides("text", "tokens_bos", "tokens_eos")
    def text_pipeline(text):
        yield text
        tokens_list = tokenizer.encode_as_ids(text)
        tokens_bos = torch.LongTensor([hparams["bos_index"]] + (tokens_list))
        yield tokens_bos
        tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
        yield tokens_eos

    sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)

    # 4. Set outputs to add into the batch. The batch variable will contain
    # all these fields (e.g, batch.id, batch.text, batch.tokens_bos,..)
    sb.dataio.dataset.set_output_keys(
        datasets, ["id", "text", "tokens_bos", "tokens_eos"],
    )
    return train_data, valid_data, test_data

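The first part simply converts the HuggingFace dataset into the DynamicItemDataset used in SpeechBrain.

Notice that we expose the text processing function text_pipeline, which takes the text of one sentence as input and processes it in different ways.

The text processing function converts the raw text into the corresponding tokens (in index form). We also create variants of the token sequence: one with the beginning-of-sentence token <bos> prepended, and one with the end-of-sentence token <eos> appended as the last element. Their usefulness will become clear later.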
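
The relationship between the two variants is easier to see on a concrete sequence. In the sketch below the token ids and the index values are made up; in the recipe they come from the tokenizer and from the bos_index and eos_index hyperparameters.

```python
# Hypothetical indexes; in the recipe they are read from the hyperparameter file
BOS_INDEX = 0
EOS_INDEX = 0

tokens_list = [12, 7, 431, 58]  # made-up output of tokenizer.encode_as_ids(text)

tokens_bos = [BOS_INDEX] + tokens_list   # fed to the network as input
tokens_eos = tokens_list + [EOS_INDEX]   # used as the prediction target

# At step t the network has seen tokens_bos[:t + 1] and must predict tokens_eos[t]
for t, target in enumerate(tokens_eos):
    print(f"step {t}: input so far {tokens_bos[:t + 1]} -> target {target}")
```

The two sequences have the same length and are shifted by one position, so every input token is paired with the token that follows it, and the last target is the end-of-sentence token.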

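Before returning the dataset objects, dataio_prepare specifies the keys that we want in the output. As we will see later, these keys are available in the brain class as batch.id, batch.text, batch.tokens_bos, and so on. For more information on the data loader, see the dedicated tutorial.

After defining the datasets, the main function can initialize the brain class: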

    # Initialize the Brain object to prepare for LM training.
    lm_brain = LM(
        modules=hparams["modules"],
        opt_class=hparams["optimizer"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
    )

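The brain class implements all the functionality needed to support the training and validation loops. Its fit and evaluate methods perform training and test, respectively: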

    lm_brain.fit(
        lm_brain.hparams.epoch_counter,
        train_data,
        valid_data,
        train_loader_kwargs=hparams["train_dataloader_opts"],
        valid_loader_kwargs=hparams["valid_dataloader_opts"],
    )

    # Load best checkpoint for evaluation
    test_stats = lm_brain.evaluate(
        test_data,
        min_key="loss",
        test_loader_kwargs=hparams["test_dataloader_opts"],
    )

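The training and validation datasets are passed to the fit method, while the test dataset is passed to the evaluate method.

Let's now take a look at the most important methods defined in the brain class.

Forward computations

Let's start with the forward function, which defines all the computations needed to transform the input text into the output predictions.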

    def compute_forward(self, batch, stage):
        """Predicts the next word given the previous ones.

        Arguments
        ---------
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        predictions : torch.Tensor
            A tensor containing the posterior probabilities (predictions).
        """
        batch = batch.to(self.device)
        tokens_bos, _ = batch.tokens_bos
        pred = self.hparams.model(tokens_bos)
        return pred

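The chain of computations is very simple here. We put the batch on the right device and feed the encoded tokens into the model. Specifically, we feed the tokens with <bos>. Prepending the <bos> token shifts all the tokens by one element. This way, the input corresponds to the previous token, while the model tries to predict the current one.

Compute objectives

Let's now look at the compute_objectives method, which takes the targets and the predictions and computes a loss function: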

    def compute_objectives(self, predictions, batch, stage):
        """Computes the loss given the predicted and targeted outputs.

        Arguments
        ---------
        predictions : torch.Tensor
            The posterior probabilities from `compute_forward`.
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        loss : torch.Tensor
            A one-element tensor used for backpropagating the gradient.
        """
        batch = batch.to(self.device)
        tokens_eos, tokens_len = batch.tokens_eos
        loss = self.hparams.compute_cost(
            predictions, tokens_eos, length=tokens_len
        )
        return loss

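The predictions are those computed in the forward method. The cost function is evaluated by comparing these predictions with the target tokens. Here we use the tokens with the special <eos> token because we also want to predict when the sentence ends.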
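
The role of the length argument can be illustrated with a small pure-Python version of a negative log-likelihood that ignores padded positions. This is a sketch of the idea, not the SpeechBrain nll_loss: the function name, the toy log-probabilities, and the use of absolute lengths (number of valid steps) are assumptions of the example.

```python
import math

def masked_nll(log_probs, targets, lengths):
    """Average negative log-likelihood over the non-padded positions.

    log_probs: [batch][time][vocab] log-probabilities
    targets:   [batch][time] token ids (padded to the same length)
    lengths:   [batch] number of valid steps in each sentence
    """
    total, count = 0.0, 0
    for sent_log_probs, sent_targets, n in zip(log_probs, targets, lengths):
        for t in range(n):  # positions at or beyond n are padding and are skipped
            total -= sent_log_probs[t][sent_targets[t]]
            count += 1
    return total / count

uniform = [math.log(0.25)] * 4                        # no preference among 4 tokens
confident = [math.log(0.97)] + [math.log(0.01)] * 3   # strongly prefers token 0
log_probs = [
    [confident, confident, uniform],
    [confident, uniform, uniform],   # the last two steps are padding
]
targets = [[0, 0, 3], [0, 0, 0]]
lengths = [3, 1]
loss = masked_nll(log_probs, targets, lengths)
print(f"loss = {loss:.4f}")
```

Without the mask, the padded steps of the second sentence would contribute to the loss even though they carry no information.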

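Other methods

Besides these two important functions, the brain class uses a few other methods. In particular, fit_batch trains on each batch of data (it computes the gradients with the backward method and applies the updates with the optimizer step). on_stage_end is called at the end of each stage (e.g., at the end of each training epoch) and mainly takes care of statistics management, learning rate annealing, and checkpointing. For a more detailed description of the brain class, see the dedicated tutorial. For more information on checkpointing, see the checkpointing tutorial.

Step 4: Train the attention-based end-to-end speech recognizer

It is now time to train our attention-based end-to-end speech recognizer. This offline recognizer uses a sophisticated architecture: the encoder combines convolutional, recurrent, and fully connected models, and the decoder is an autoregressive GRU.

The encoder and the decoder are connected by an attention mechanism. To improve performance, the final sequence of words is obtained with beamsearch coupled with the RNNLM trained earlier.

Architecture overview:

Encoder: combines convolutional, recurrent, and fully connected models.
Decoder: an autoregressive GRU decoder.
Attention mechanism: enhances the information flow between the encoder and the decoder.
CTC (Connectionist Temporal Classification): trained jointly with the attention-based system and applied on top of the encoder.
Data augmentation: augmentation techniques are applied to the data to improve overall system performance.

Train the speech recognizer

To train the speech recognizer, run the following code: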

%cd /content/speechbrain/templates/speech_recognition/ASR
!python train.py train.yaml --number_of_epochs=1  --batch_size=2  --enable_add_reverb=False --enable_add_noise=False #To speed up

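Running this code on Google Colab may take a considerable amount of time. By monitoring the log, you will see the loss improve progressively after each epoch.

As in the RNNLM section, the specified output_folder contains the files and folders discussed earlier. In addition, a file called wer_test.txt (the test_wer_file entry in train.yaml) is saved, with a detailed report of the word error rate (WER) for each test sentence. The file reports not only the WER values but also the alignment with the reference transcription, which allows a deeper analysis: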

%WER 3.09 [ 1622 / 52576, 167 ins, 171 del, 1284 sub ]
%SER 33.66 [ 882 / 2620 ]
Scored 2620 sentences, 0 not present in hyp.
================================================================================
ALIGNMENTS

Format:
<utterance-id>, WER DETAILS
<eps> ; reference  ; on ; the ; first ;  line
  I   ;     S      ; =  ;  =  ;   S   ;   D  
 and  ; hypothesis ; on ; the ; third ; <eps>
================================================================================
672-122797-0033, %WER 0.00 [ 0 / 2, 0 ins, 0 del, 0 sub ]
A ; STORY
= ;   =  
A ; STORY
================================================================================
2094-142345-0041, %WER 0.00 [ 0 / 1, 0 ins, 0 del, 0 sub ]
DIRECTION
    =    
DIRECTION
================================================================================
2830-3980-0026, %WER 50.00 [ 1 / 2, 0 ins, 0 del, 1 sub ]
VERSE ; TWO
  S   ;  =
FIRST ; TWO
================================================================================
237-134500-0025, %WER 50.00 [ 1 / 2, 0 ins, 0 del, 1 sub ]
OH ;  EMIL
=  ;   S  
OH ; AMIEL
================================================================================
7127-75947-0012, %WER 0.00 [ 0 / 2, 0 ins, 0 del, 0 sub ]
INDEED ; AH
  =    ; =
INDEED ; AH
================================================================================
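
The numbers in this report follow from a word-level edit distance: WER is the number of insertions, deletions, and substitutions divided by the number of reference words. The sketch below is a minimal pure-Python version of that computation, not the SpeechBrain ErrorRateStats class; the function names are made up, and the sentences are taken from the report above.

```python
def wer_details(reference, hypothesis):
    """Return (insertions, deletions, substitutions) of the best word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total edits, ins, del, sub) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, j, 0, 0)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # match: no extra edit
                continue
            e, n_ins, n_del, n_sub = cost[i - 1][j - 1]
            sub = (e + 1, n_ins, n_del, n_sub + 1)
            e, n_ins, n_del, n_sub = cost[i][j - 1]
            ins = (e + 1, n_ins + 1, n_del, n_sub)
            e, n_ins, n_del, n_sub = cost[i - 1][j]
            dele = (e + 1, n_ins, n_del + 1, n_sub)
            cost[i][j] = min(sub, ins, dele, key=lambda c: c[0])
    _, n_ins, n_del, n_sub = cost[-1][-1]
    return n_ins, n_del, n_sub

def wer(reference, hypothesis):
    """Word error rate in percent."""
    n_ins, n_del, n_sub = wer_details(reference, hypothesis)
    return 100.0 * (n_ins + n_del + n_sub) / len(reference.split())

print(wer("VERSE TWO", "FIRST TWO"))
print(wer_details("on the first line", "and on the third"))
# Corpus-level figure from the header of the report
print(round(100.0 * (167 + 171 + 1284) / 52576, 2))
```

The corpus-level WER in the first line of the report is computed in the same way, by pooling the errors of all sentences before dividing by the total number of reference words.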

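Let's now take a closer look at the hyperparameters (train.yaml) and the experiment script (train.py).

Hyperparameters

The hyperparameter file starts with the definition of basic things, such as the seed and the path settings: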

# Seed needs to be set at top of yaml, before objects with parameters are instantiated
seed: 2602
__set_seed: !apply:torch.manual_seed [!ref <seed>]

# If you plan to train a system on an HPC cluster with a big dataset,
# we strongly suggest doing the following:
# 1- Compress the dataset in a single tar or zip file.
# 2- Copy your dataset locally (i.e., the local disk of the computing node).
# 3- Uncompress the dataset in the local folder.
# 4- Set data_folder with the local path
# Reading data from the local disk of the compute node (e.g. $SLURM_TMPDIR with SLURM-based clusters) is very important.
# It allows you to read the data much faster without slowing down the shared filesystem.

data_folder: ../data # In this case, data will be automatically downloaded here.
data_folder_noise: !ref <data_folder>/noise # The noisy sequences for data augmentation will automatically be downloaded here.
data_folder_rir: !ref <data_folder>/rir # The impulse responses used for data augmentation will automatically be downloaded here.

# Data for augmentation
NOISE_DATASET_URL: https://www.dropbox.com/scl/fi/a09pj97s5ifan81dqhi4n/noises.zip?rlkey=j8b0n9kdjdr32o1f06t0cw5b7&dl=1
RIR_DATASET_URL: https://www.dropbox.com/scl/fi/linhy77c36mu10965a836/RIRs.zip?rlkey=pg9cu8vrpn2u173vhiqyu743u&dl=1

output_folder: !ref results/CRDNN_BPE_960h_LM/<seed>
test_wer_file: !ref <output_folder>/wer_test.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# Language model (LM) pretraining
# NB: To avoid mismatch, the speech recognizer must be trained with the same
# tokenizer used for LM training. Here, we download everything from the
# speechbrain HuggingFace repository. However, a local path pointing to a
# directory containing the lm.ckpt and tokenizer.ckpt may also be specified
# instead. E.g if you want to use your own LM / tokenizer.
pretrained_path: speechbrain/asr-crdnn-rnnlm-librispeech


# Path where data manifest files will be stored. The data manifest files are created by the
# data preparation script
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json
noise_annotation: ../noise.csv
rir_annotation: ../rir.csv

skip_prep: False

# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

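data_folder is the path where mini-librispeech is stored. If the dataset is not available there, it is downloaded to this location. As mentioned, the script also supports data augmentation. For that, we use the impulse responses and noise sequences of the open rir dataset (again, they are downloaded here if not available).

We also specify the folder where the language model is stored. In this case, we use the official pretrained language model available on HuggingFace, but you can change it and use the one trained in the previous step (you should point to the checkpoint folder where the best model.ckpt is stored). Importantly, the set of tokens used for the LM must exactly match the one used to train the speech recognizer.

We also have to specify the data manifest files for training, validation, and test. If they are not available, these files are created by the data preparation script called in train.py.

After that, we define a number of parameters for training, feature extraction, model definition, and decoding: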

# Training parameters
number_of_epochs: 15
number_of_ctc_epochs: 5
batch_size: 8
lr: 1.0
ctc_weight: 0.5
sorting: ascending
ckpt_interval_minutes: 15 # save checkpoint every N min
label_smoothing: 0.1

# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>

valid_dataloader_opts:
    batch_size: !ref <batch_size>

test_dataloader_opts:
    batch_size: !ref <batch_size>


# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # Number of tokens (same as LM)
blank_index: 0
bos_index: 0
eos_index: 0
unk_index: 0

# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
valid_beam_size: 8
test_beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
lm_weight: 0.50
ctc_weight_decode: 0.0
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25

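For instance, we define the number of epochs, the initial learning rate, the batch size, the weight of the CTC loss, and many other parameters.

By setting sorting to ascending, we sort all the sentences by increasing length before creating the batches. This minimizes the need for zero padding and thus makes training faster, without losing performance (at least for this task and this model).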
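
The effect of sorting on padding can be checked with a small simulation. The durations below are random made-up values and the total_padding helper is hypothetical; the point is only to compare how much padding is needed when consecutive utterances are batched in random order versus ascending order.

```python
import random

def total_padding(lengths, batch_size):
    """Sum the padding needed when consecutive items are batched together."""
    padding = 0.0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every item is padded up to the longest item in its batch
        padding += sum(max(batch) - n for n in batch)
    return padding

rng = random.Random(0)
# Made-up utterance durations in seconds
durations = [round(rng.uniform(1.0, 16.0), 2) for _ in range(64)]

random_order = total_padding(durations, batch_size=8)
ascending = total_padding(sorted(durations), batch_size=8)
print(f"padding with random order   : {random_order:.1f} s")
print(f"padding with ascending order: {ascending:.1f} s")
```

When the utterances are sorted, each batch contains sentences of similar length, so very little computation is spent on padded frames.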

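Many other parameters are defined, such as those for data augmentation. For the exact meaning of each of them, refer to the docstrings of the functions/classes that use the hyperparameter.

In the next block, we define the most important classes needed to implement the speech recognizer: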

# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>

# Feature normalization (mean and std)
normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global

# Added noise and reverb come from OpenRIR dataset, automatically
# downloaded and prepared with this Environmental Corruption class.
env_corrupt: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <data_folder_rir>
    babble_prob: 0.0
    reverb_prob: 0.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15

# Adds speed change + time and frequency dropouts (time-domain implementation).
augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]

# The CRDNN model is an encoder that combines CNNs, RNNs, and DNNs.
encoder: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>
    use_rnnp: False

# Embedding (from indexes to an embedding space of dimension emb_size).
embedding: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>

# Attention-based RNN decoder.
decoder: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>

# Linear transformation on the top of the encoder.
ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>

# Linear transformation on the top of the decoder.
seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>

# Final softmax (for log posteriors computation).
log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

# Cost definition for the CTC part.
ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
    blank_index: !ref <blank_index>

# Tokenizer initialization
tokenizer: !new:sentencepiece.SentencePieceProcessor

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class
modules:
    encoder: !ref <encoder>
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    ctc_lin: !ref <ctc_lin>
    seq_lin: !ref <seq_lin>
    normalize: !ref <normalize>
    env_corrupt: !ref <env_corrupt>
    lm_model: !ref <lm_model>

# Gathering all the submodels in a single model object.
model: !new:torch.nn.ModuleList
    - - !ref <encoder>
      - !ref <embedding>
      - !ref <decoder>
      - !ref <ctc_lin>
      - !ref <seq_lin>

# This is the RNNLM that is used according to the Huggingface repository
# NB: It has to match the pre-trained RNNLM!!
lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True  # For inference

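For instance, we define the functions for computing the features and normalizing them. We define the classes for environmental corruption and data augmentation (see the dedicated tutorial), as well as the architecture of the encoder, the decoder, and the other models needed by the speech recognizer.

We then report the parameters for beam search: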

# Define scorers for beam search

# If ctc_scorer is set, the decoder uses CTC + attention beamsearch. This
# improves the performance, but slows down decoding.
ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    ctc_fc: !ref <ctc_lin>

# If coverage_scorer is set, coverage penalty is applied based on accumulated
# attention weights during beamsearch.
coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer
    vocab_size: !ref <output_neurons>

# If the lm_scorer is set, a language model
# is applied (with a weight specified in scorer).
rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer
    language_model: !ref <lm_model>
    temperature: !ref <temperature_lm>

# Gathering all scorers in a scorer instance for beamsearch:
# - full_scorers are scorers which score on full vocab set, while partial_scorers
# are scorers which score on pruned tokens.
# - The number of pruned tokens is decided by scorer_beam_scale * beam_size.
# - For some scorers like ctc_scorer, ngramlm_scorer, putting them
# into full_scorers list would be too heavy. partial_scorers are more
# efficient because they score on pruned tokens at little cost of
# performance drop. For other scorers, please see the speechbrain.decoders.scorer.
test_scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    scorer_beam_scale: 1.5
    full_scorers: [
        !ref <rnnlm_scorer>,
        !ref <coverage_scorer>]
    partial_scorers: [!ref <ctc_scorer>]
    weights:
        rnnlm: !ref <lm_weight>
        coverage: !ref <coverage_penalty>
        ctc: !ref <ctc_weight_decode>

valid_scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    full_scorers: [!ref <coverage_scorer>]
    weights:
        coverage: !ref <coverage_penalty>

# Beamsearch is applied on the top of the decoder. For a description of
# the other parameters, please see the speechbrain.decoders.S2SRNNBeamSearcher.

# It makes sense to have a lighter search during validation. In this case,
# we don't use scorers during decoding.
valid_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <valid_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <valid_scorer>

# The final decoding on the test set can be more computationally demanding.
# In this case, we use the LM + CTC probabilities during decoding as well,
# which are defined in scorer.
# Please, remove scorer if you need a faster decoder.
test_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <test_scorer>

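We use different beam-search hyperparameters for validation and test. In particular, the validation stage uses a smaller beam size: validation runs at the end of every epoch, so it should be fast, whereas evaluation runs only once at the end, so we can afford a more accurate search.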
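To make the speed/accuracy trade-off of the beam size concrete, here is a minimal beam-search sketch over a toy next-token table. It is not the S2SRNNBeamSearcher implementation: the table `TOY_MODEL`, the helper `toy_step`, and the probabilities are all made up for illustration, and no scorers are involved.

```python
import math

# Toy next-token distributions, keyed by the last token of the prefix.
# These probabilities are invented for illustration only.
TOY_MODEL = {
    "<bos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "y": 0.1},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
    "z": {"<eos>": 1.0},
}


def toy_step(prefix):
    """Return log-probabilities of the next token given the prefix."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[prefix[-1]].items()}


def beam_search(step_fn, beam_size, max_len=10, bos="<bos>", eos="<eos>"):
    """Keep the `beam_size` best hypotheses at every decoding step."""
    beams = [([bos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = [
            (prefix + [tok], score + logp)
            for prefix, score in beams
            for tok, logp in step_fn(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp in candidates[:beam_size]:
            # Hypotheses ending with eos are complete; the others are expanded.
            (finished if hyp[0][-1] == eos else beams).append(hyp)
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(toy_step, beam_size=1)
wide_tokens, wide_score = beam_search(toy_step, beam_size=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(wide_tokens, round(math.exp(wide_score), 2))
```

With a beam of 1 the search commits to the locally best first token and ends up with a less likely sentence, while a beam of 2 recovers the more likely one at the cost of scoring more candidates per step.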

Finally, we declare the last objects needed by the training recipe, such as lr_annealing, optimizer, checkpointer, etc.:

# This function manages learning rate annealing over the epochs.
# We here use the NewBoB algorithm, that anneals the learning rate if
# the improvements over two consecutive epochs is less than the defined
# threshold.
lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0

# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
opt_class: !name:torch.optim.Adadelta
    lr: !ref <lr>
    rho: 0.95
    eps: 1.e-8

# Functions that compute the statistics to track during the validation step.
error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True

# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        normalizer: !ref <normalize>
        counter: !ref <epoch_counter>

# This object is used to pretrain the language model and the tokenizers
# (defined above). In this case, we also pretrain the ASR model (to make
# sure the model converges on a small amount of data)
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
        model: !ref <model>
    paths:
        lm: !ref <pretrained_path>/lm.ckpt
        tokenizer: !ref <pretrained_path>/tokenizer.ckpt
        model: !ref <pretrained_path>/asr.ckpt

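The final object is the pretrainer, which links the language model, the tokenizer, and the acoustic speech recognition model with the corresponding files used for pretraining. Here we pretrain the acoustic model as well. With such a small dataset, it is very hard to make an end-to-end speech recognizer converge, so we pretrain it with another model (you should skip this part when training on a larger dataset).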
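Going back to the lr_annealing entry declared above, the annealing rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the NewBobScheduler code: it assumes a lower-is-better validation metric and ignores the patience counter (which is 0 in the YAML anyway). The function name `newbob_update` and the metric values are hypothetical.

```python
def newbob_update(lr, metric_history, new_metric,
                  improvement_threshold=0.0025, annealing_factor=0.8):
    """Return the learning rate to use after observing `new_metric`.

    Simplified sketch: lower-is-better metric, no patience counter.
    """
    if metric_history:
        prev = metric_history[-1]
        # Relative improvement with respect to the previous epoch.
        improvement = (prev - new_metric) / prev if prev != 0 else 0.0
        if improvement < improvement_threshold:
            lr *= annealing_factor
    metric_history.append(new_metric)
    return lr


lr = 1.0
history = []
for valid_metric in [2.0, 1.5, 1.499, 1.2]:  # made-up validation values
    lr = newbob_update(lr, history, valid_metric)
    print(f"valid_metric={valid_metric:.3f} -> lr={lr:.2f}")
```

The learning rate is only reduced at the third epoch, where the relative improvement falls below the threshold.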

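Experiment file

Let's now see how the different elements declared in the yaml file are connected in train.py. The training script closely follows the one already described for the language model.

The main function starts with some basic functionality, such as parsing the command line, initializing distributed data-parallel (needed for multi-GPU training), and reading the yaml file.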

if __name__ == "__main__":

    # Reading command line arguments
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize ddp (useful only for multi-GPU DDP training)
    sb.utils.distributed.ddp_init_group(run_opts)

    # Load hyperparameters file with command-line overrides
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    # Create experiment directory
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )

    # Data preparation, to be run on only one process.
    if not hparams["skip_prep"]:
        sb.utils.distributed.run_on_main(
            prepare_mini_librispeech,
            kwargs={
                "data_folder": hparams["data_folder"],
                "save_json_train": hparams["train_annotation"],
                "save_json_valid": hparams["valid_annotation"],
                "save_json_test": hparams["test_annotation"],
            },
        )
    sb.utils.distributed.run_on_main(hparams["prepare_noise_data"])
    sb.utils.distributed.run_on_main(hparams["prepare_rir_data"])

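The yaml file is read with the load_hyperpyyaml function. After reading it, all the declared objects are initialized and made available, along with the other functions and variables, through the hparams dictionary (e.g., hparams['model'], hparams['test_search'], hparams['batch_size']).

After that, we run the data preparation, whose goal is to create the data manifest files (if they are not already available). This operation writes some files to disk. For this reason, we use sb.utils.distributed.run_on_main to make sure it is executed by the main process only, which avoids possible conflicts when using multiple GPUs with DDP. For more information on multi-GPU training in SpeechBrain, please see this tutorial.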
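The idea behind running a function on the main process only can be sketched as follows. This is not the SpeechBrain helper itself: it assumes the launcher exposes the process rank through a RANK environment variable, it omits the synchronization between processes that a real implementation needs, and the names `is_main_process`, `run_on_main_sketch`, and `prepare_data` are hypothetical.

```python
import os


def is_main_process(env=None):
    """Assumption: the launcher exports the process rank as RANK (default 0)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def run_on_main_sketch(func, kwargs=None, env=None):
    """Call `func` on the main process only; other ranks skip it.

    A real implementation would also make the other processes wait
    until the main one is done; that barrier is omitted here.
    """
    if is_main_process(env):
        return func(**(kwargs or {}))
    return None


def prepare_data(save_json):
    # Stand-in for a data-preparation function that writes a file.
    return f"wrote {save_json}"


print(run_on_main_sketch(prepare_data, {"save_json": "train.json"}, env={"RANK": "0"}))
print(run_on_main_sketch(prepare_data, {"save_json": "train.json"}, env={"RANK": "1"}))
```

Only the call simulated with rank 0 runs the preparation; the one with rank 1 returns without touching the disk.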

Data-IO pipeline

At this point, we can create the dataset objects for the training, validation, and test loops:

    # We can now directly create the datasets for training, valid, and test
    datasets = dataio_prepare(hparams)

This function lets users fully customize the data-reading pipeline. Let's take a closer look at it:

def dataio_prepare(hparams):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions.


    Arguments
    ---------
    hparams : dict
        This dictionary is loaded from the `train.yaml` file, and it includes
        all the hyperparameters needed for dataset construction and loading.

    Returns
    -------
    datasets : dict
        Dictionary containing "train", "valid", and "test" keys that correspond
        to the DynamicItemDataset objects.
    """
    # Define audio pipeline. In this case, we simply read the path contained
    # in the variable wav with the audio reader.
    @sb.utils.data_pipeline.takes("wav")
    @sb.utils.data_pipeline.provides("sig")
    def audio_pipeline(wav):
        """Load the audio signal. This is done on the CPU in the `collate_fn`."""
        sig = sb.dataio.dataio.read_audio(wav)
        return sig

    # Define text processing pipeline. We start from the raw text and then
    # encode it using the tokenizer. The tokens with BOS are used for feeding
    # decoder during training, the tokens with EOS for computing the cost function.
    # The tokens without BOS or EOS is for computing CTC loss.
    @sb.utils.data_pipeline.takes("words")
    @sb.utils.data_pipeline.provides(
        "words", "tokens_list", "tokens_bos", "tokens_eos", "tokens"
    )
    def text_pipeline(words):
        """Processes the transcriptions to generate proper labels"""
        yield words
        tokens_list = hparams["tokenizer"].encode_as_ids(words)
        yield tokens_list
        tokens_bos = torch.LongTensor([hparams["bos_index"]] + (tokens_list))
        yield tokens_bos
        tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
        yield tokens_eos
        tokens = torch.LongTensor(tokens_list)
        yield tokens

    # Define datasets from json data manifest file
    # Define datasets sorted by ascending lengths for efficiency
    datasets = {}
    data_folder = hparams["data_folder"]
    for dataset in ["train", "valid", "test"]:
        datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
            json_path=hparams[f"{dataset}_annotation"],
            replacements={"data_root": data_folder},
            dynamic_items=[audio_pipeline, text_pipeline],
            output_keys=[
                "id",
                "sig",
                "words",
                "tokens_bos",
                "tokens_eos",
                "tokens",
            ],
        )
        hparams[f"{dataset}_dataloader_opts"]["shuffle"] = False

    # Sorting training data in ascending order makes the code much
    # faster because we minimize zero-padding. In most cases, this
    # does not harm the performance.
    if hparams["sorting"] == "ascending":
        datasets["train"] = datasets["train"].filtered_sorted(sort_key="length")
        hparams["train_dataloader_opts"]["shuffle"] = False

    elif hparams["sorting"] == "descending":
        datasets["train"] = datasets["train"].filtered_sorted(
            sort_key="length", reverse=True
        )
        hparams["train_dataloader_opts"]["shuffle"] = False

    elif hparams["sorting"] == "random":
        hparams["train_dataloader_opts"]["shuffle"] = True
        pass

    else:
        raise NotImplementedError(
            "sorting must be random, ascending or descending"
        )
    return datasets

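Within dataio_prepare, we define the sub-functions that process the entries defined in the JSON files. The first function, audio_pipeline, takes the path of the audio signal (wav) and reads it. It returns a tensor containing the speech sentence that was read. The input entry of this function (i.e., wav) must have the same name as the corresponding key in the data manifest file: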

  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE"
  },

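Similarly, we define another function called text_pipeline that processes the transcriptions and puts them in a format usable by the model. The function reads the string words defined in the JSON file and tokenizes it (outputting the index of each token). It returns the token sequence with the special begin-of-sentence token prepended (tokens_bos), the version with the end-of-sentence token appended (tokens_eos), and the plain token sequence (tokens). We will see later why these additional elements are needed.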
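The three label variants can be illustrated with plain Python lists. This is only a sketch of what text_pipeline yields: the token ids are made up, and the helper `build_targets` is hypothetical (the recipe builds torch.LongTensor objects instead of lists).

```python
def build_targets(token_ids, bos_index, eos_index):
    """Build the three label sequences used by the recipe (as plain lists)."""
    return {
        "tokens_bos": [bos_index] + token_ids,  # fed to the decoder
        "tokens_eos": token_ids + [eos_index],  # target of the seq2seq loss
        "tokens": list(token_ids),              # target of the CTC loss
    }


targets = build_targets([17, 4, 92], bos_index=0, eos_index=0)

# At step t the decoder receives tokens_bos[t] and must predict tokens_eos[t].
pairs = list(zip(targets["tokens_bos"], targets["tokens_eos"]))
print(pairs)
```

Shifting the same sequence by one position in this way gives the autoregressive decoder its input (the previous token) and its target (the next token) at every step.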

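We then create the DynamicItemDataset and connect it with the processing functions defined above. We also define the desired output keys. These keys will be available in the brain class through the batch variable as: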

batch.id
batch.sig
batch.words
batch.tokens_bos
batch.tokens_eos
batch.tokens

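The last part of the dataio_prepare function manages data sorting. In this case, we sort the data in ascending order to minimize zero-padding and speed up training. For more information on the dataloaders, please see this tutorial.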
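A small calculation shows why sorting by length reduces zero-padding. The sketch below counts the padded frames when consecutive items are grouped into batches; the lengths and the helper name `padded_frames` are made up for illustration.

```python
def padded_frames(lengths, batch_size):
    """Count padding frames when consecutive items are batched together."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every item is padded up to the longest item of its batch.
        total += sum(max(batch) - n for n in batch)
    return total


lengths = [16, 3, 9, 4]  # made-up utterance lengths
unsorted_padding = padded_frames(lengths, batch_size=2)
sorted_padding = padded_frames(sorted(lengths), batch_size=2)
print(unsorted_padding, sorted_padding)
```

In this toy case, sorting cuts the padding from 18 frames to 8, because sentences of similar length end up in the same batch.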

After defining the dataio function, we perform the pretraining of the language model, the ASR model, and the tokenizer:

    run_on_main(hparams["pretrainer"].collect_files)
    hparams["pretrainer"].load_collected(device=run_opts["device"])

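We use the run_on_main wrapper here because the collect_files method might need to download the pretrained models from the web. Even when using multiple GPUs with DDP, this operation should be done by a single process only.

At this point, we initialize the Brain class and use it to run training and evaluation: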

    # Trainer initialization
    asr_brain = ASR(
        modules=hparams["modules"],
        opt_class=hparams["opt_class"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
    )

    # Training
    asr_brain.fit(
        asr_brain.hparams.epoch_counter,
        datasets["train"],
        datasets["valid"],
        train_loader_kwargs=hparams["train_dataloader_opts"],
        valid_loader_kwargs=hparams["valid_dataloader_opts"],
    )

    # Load best checkpoint for evaluation
    test_stats = asr_brain.evaluate(
        test_set=datasets["test"],
        min_key="WER",
        test_loader_kwargs=hparams["test_dataloader_opts"],
    )

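For more information on how the Brain class works, please see this tutorial. Note that the fit and evaluate methods take dataset objects as input. From each dataset, a PyTorch dataloader is created automatically. The dataloader creates the batches used for training and evaluation.

When the sampled speech sentences have different lengths, zero-padding is applied. To keep track of the actual length of each sentence within a batch, the dataloader also returns a special tensor containing the relative lengths. For instance, let's assume batch.sig[0] is the variable containing the input waveforms as a [batch, time] tensor: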

tensor([[1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1]])

batch.sig[1] will then contain the following relative lengths:

tensor([0.5000, 0.7500, 1.0000])

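With this information, we can exclude the zero-padded steps from some computations (e.g., feature normalization, statistical pooling, losses, etc.).

Why relative lengths instead of absolute lengths?

Relative lengths are preferred over absolute lengths because the time resolution changes dynamically within a neural network. Several operations, including pooling, strided convolution, transposed convolution, and FFT computation, can change the number of time steps in a sequence.

With relative lengths, the actual number of time steps at each stage of the computation is easy to recover: it is obtained by multiplying the relative length by the total length of the tensor at that stage. The same length information therefore remains valid across the changes in time resolution introduced by the various network operations.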
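The conversion from relative to absolute lengths can be sketched in plain Python. The helper names `absolute_lengths` and `padding_mask` and the tensor sizes are hypothetical; the sketch only shows that one relative-length vector serves every time resolution.

```python
def absolute_lengths(rel_lens, total_steps):
    """Recover the number of valid time steps at a given resolution."""
    return [round(r * total_steps) for r in rel_lens]


def padding_mask(rel_lens, total_steps):
    """1 for valid time steps, 0 for zero-padded ones."""
    return [
        [1 if t < n else 0 for t in range(total_steps)]
        for n in absolute_lengths(rel_lens, total_steps)
    ]


rel_lens = [0.5, 0.75, 1.0]

# The same relative lengths work before and after a 4x time pooling.
lens_before = absolute_lengths(rel_lens, 400)
lens_after = absolute_lengths(rel_lens, 100)
mask = padding_mask(rel_lens, 4)
print(lens_before, lens_after)
print(mask)
```

A mask like the last one is what lets a loss or a normalization step ignore the zero-padded positions.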

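Forward computations

In the Brain class, we have to define some important methods, such as:

compute_forward, which specifies all the computations needed to transform the input waveform into the output posterior probabilities.
compute_objectives, which computes the loss function given the labels and the predictions made by the model.

Let's take a look at compute_forward first: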

    def compute_forward(self, batch, stage):
        """Runs all the computation of the CTC + seq2seq ASR. It returns the
        posterior probabilities of the CTC and seq2seq networks.

        Arguments
        ---------
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        predictions : dict
            At training time it returns predicted seq2seq log probabilities.
            If needed it also returns the ctc output log probabilities.
            At validation/test time, it returns the predicted tokens as well.
        """
        # We first move the batch to the appropriate device.
        batch = batch.to(self.device)

        feats, self.feat_lens = self.prepare_features(stage, batch.sig)
        tokens_bos, _ = self.prepare_tokens(stage, batch.tokens_bos)

        # Running the encoder (prevent propagation to feature extraction)
        encoded_signal = self.modules.encoder(feats.detach())

        # Embed tokens and pass tokens & encoded signal to decoder
        embedded_tokens = self.modules.embedding(tokens_bos.detach())
        decoder_outputs, _ = self.modules.decoder(
            embedded_tokens, encoded_signal, self.feat_lens
        )

        # Output layer for seq2seq log-probabilities
        logits = self.modules.seq_lin(decoder_outputs)
        predictions = {"seq_logprobs": self.hparams.log_softmax(logits)}

        if self.is_ctc_active(stage):
            # Output layer for ctc log-probabilities
            ctc_logits = self.modules.ctc_lin(encoded_signal)
            predictions["ctc_logprobs"] = self.hparams.log_softmax(ctc_logits)

        elif stage != sb.Stage.TRAIN:
            if stage == sb.Stage.VALID:
                hyps, _, _, _ = self.hparams.valid_search(
                    encoded_signal, self.feat_lens
                )
            elif stage == sb.Stage.TEST:
                hyps, _, _, _ = self.hparams.test_search(
                    encoded_signal, self.feat_lens
                )

            predictions["tokens"] = hyps

        return predictions

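The function takes the batch variable and the current stage (which can be sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST). We then put the batch on the right device, compute the features, and encode them with our CRDNN encoder. For more information on feature computation, please see this tutorial, while for more details on speech augmentation, take a look here. After that, we feed the encoded states into an autoregressive attention-based decoder that makes predictions over the tokens. At validation and test time, we apply beam search on top of the token predictions. Our system applies an additional CTC loss on top of the encoder. If needed, the CTC loss can be turned off after N epochs.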
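The way the auxiliary CTC loss is switched on and off can be sketched as follows. This is an assumption-laden illustration, not the recipe's code: stages are plain strings here, the hyperparameter name `number_of_ctc_epochs` and all numeric values are assumed, and the interpolation mirrors the weighted sum controlled by ctc_weight.

```python
def is_ctc_active(stage, current_epoch, number_of_ctc_epochs):
    """CTC is used during training only, and only for the first N epochs."""
    return stage == "train" and current_epoch <= number_of_ctc_epochs


def total_loss(seq_loss, ctc_loss, ctc_weight, ctc_active):
    """Interpolate the seq2seq and CTC losses when CTC is active."""
    if not ctc_active:
        return seq_loss
    return (1 - ctc_weight) * seq_loss + ctc_weight * ctc_loss


# Made-up loss values and hyperparameters, for illustration only.
for epoch in (3, 6):
    active = is_ctc_active("train", epoch, number_of_ctc_epochs=5)
    loss = total_loss(seq_loss=2.0, ctc_loss=4.0, ctc_weight=0.5, ctc_active=active)
    print(f"epoch={epoch} ctc_active={active} loss={loss}")
```

Once the epoch counter passes the threshold, the model is trained with the seq2seq loss alone.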

Compute objectives

Let's now take a look at the compute_objectives function:

    def compute_objectives(self, predictions, batch, stage):
        """Computes the loss given the predicted and targeted outputs. We here
        do multi-task learning and the loss is a weighted sum of the ctc + seq2seq
        costs.

        Arguments
        ---------
        predictions : dict
            The output dict from `compute_forward`.
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        loss : torch.Tensor
            A one-element tensor used for backpropagating the gradient.
        """

        # Compute sequence loss against targets with EOS
        tokens_eos, tokens_eos_lens = self.prepare_tokens(
            stage, batch.tokens_eos
        )
        loss = sb.nnet.losses.nll_loss(
            log_probabilities=predictions["seq_logprobs"],
            targets=tokens_eos,
            length=tokens_eos_lens,
            label_smoothing=self.hparams.label_smoothing,
        )

        # Add ctc loss if necessary. The total cost is a weighted sum of
        # ctc loss + seq2seq loss
        if self.is_ctc_active(stage):
            # Load tokens without EOS as CTC targets
            tokens, tokens_lens = self.prepare_tokens(stage, batch.tokens)
            loss_ctc = self.hparams.ctc_cost(
                predictions["ctc_logprobs"], tokens, self.feat_lens, tokens_lens
            )
            loss *= 1 - self.hparams.ctc_weight
            loss += self.hparams.ctc_weight * loss_ctc

        if stage != sb.Stage.TRAIN:
            # Convert predicted tokens from indexes to words
            predicted_words = [
                self.hparams.tokenizer.decode_ids(prediction).split(" ")
                for prediction in predictions["tokens"]
            ]
            target_words = [words.split(" ") for words in batch.words]

            # Monitor word error rate and character error rate at
            # valid and test time.
            self.wer_metric.append(batch.id, predicted_words, target_words)
            self.cer_metric.append(batch.id, predicted_words, target_words)

        return loss

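Based on the predictions and the targets, we compute the negative log-likelihood (NLL) loss and, if needed, the connectionist temporal classification (CTC) loss. The two losses are combined with a weight (ctc_weight). At validation and test time, we also compute the word error rate (WER) and the character error rate (CER).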
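For readers unfamiliar with these metrics, the sketch below computes a WER and a CER with a standard edit distance. It is a minimal illustration, not the ErrorRateStats implementation; the function names and the example sentences are made up, and the exact treatment of spaces in the CER may differ in SpeechBrain.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """Total number of edits divided by total reference tokens, in percent."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    ref_tokens = sum(len(r) for r in refs)
    return 100.0 * edits / ref_tokens


target_words = ["THE CAT SAT".split(" ")]
predicted_words = ["THE CAT SIT".split(" ")]

wer = error_rate(target_words, predicted_words)
# For the CER, compare character sequences instead of word sequences.
cer = error_rate(
    [list(" ".join(words)) for words in target_words],
    [list(" ".join(words)) for words in predicted_words],
)
print(f"WER={wer:.2f}% CER={cer:.2f}%")
```

One wrong word out of three gives a WER of about 33%, while the same mistake costs a single character out of eleven in the CER.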

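Other methods

Beyond the two main functions, compute_forward and compute_objectives, the code also includes the on_stage_start and on_stage_end functions. The former initializes the statistics objects, such as the word error rate (WER) and the character error rate (CER). The latter takes care of several key aspects:

Statistics update: updates the statistics tracked during training.
Learning rate annealing: adjusts the learning rate over the epochs.
Logging: records the key information during training.
Checkpointing: creates and stores the checkpoints needed to resume training.

Together, these functions complete the training pipeline of the speech recognizer.

That's all. You can now run the code and train your speech recognizer.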

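Pretrain and fine-tune

When training from scratch is not the best option, you can start from a pretrained model and fine-tune it.

Note that for this to work, the architecture of your model must exactly match that of the pretrained model.

A convenient way to do this is to use the Pretrainer class in the YAML file. If you want to pretrain the encoder of the speech recognizer, you can use the following snippet: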

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        encoder: !ref <encoder>
    paths:
        encoder: !ref <encoder_ptfile>

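Here, !ref <encoder> points to the encoder model defined earlier in the YAML file, while encoder_ptfile is the path where the pretrained model is stored.

To run the pretraining, make sure to call the pretrainer in the train.py file: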

run_on_main(hparams["pretrainer"].collect_files)
hparams["pretrainer"].load_collected(device=run_opts["device"])

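These calls must be made before calling the fit method of the Brain class.

For a more complete explanation and practical examples, please see our tutorial on pretraining and fine-tuning.

Step 5: Inference

At this point, we can use the trained speech recognizer. For this type of ASR model, SpeechBrain provides some classes (take a look here), such as EncoderDecoderASR, that make inference easier. For instance, we can transcribe an audio file with a pretrained model hosted in our HuggingFace repository in only 4 lines of code: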

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="/content/pretrained_model")
audio_file = 'speechbrain/asr-crdnn-rnnlm-librispeech/example.wav'
asr_model.transcribe_file(audio_file)

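But how does this work with your custom ASR system?

Use your custom speech recognizer

At this point, you have two options to train and deploy your speech recognizer on your own data:

Use the available interfaces (e.g., EncoderDecoderASR):
- This is the most elegant and convenient option.
- Your model must comply with some constraints to fit the proposed interface.
- Integrating your custom ASR model with an existing interface is simpler and easier to maintain.
Build your own custom interface:
- Craft an interface tailored to your custom ASR model.
- It gives you the flexibility to address unique requirements and specifications.
- It is ideal for scenarios where the existing interfaces do not fully meet your needs.

Note: these solutions are not limited to ASR and extend to other tasks, such as speaker recognition and source separation.

Use the `EncoderDecoderASR` interface

The EncoderDecoderASR class interface allows you to decouple your trained model from the training recipe and to infer (or encode) on any new audio file in a few lines of code. The class has the following methods: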

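encode_batch: applies the encoder to an input batch and returns the encoded features.
transcribe_file: transcribes a single audio file given in input.
transcribe_batch: transcribes an input batch.

In fact, if you fulfill the few constraints that we detail in the next paragraphs, you can simply do: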

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="your_local_folder", hparams_file='your_file.yaml', savedir="pretrained_model")
audio_file = 'your_file.wav'
asr_model.transcribe_file(audio_file)

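However, to allow such a generalization over all the possible EncoderDecoder ASR pipelines, you have to respect some constraints when deploying your system:

Necessary modules. As you can see in the EncoderDecoderASR class, the modules defined in your yaml file MUST contain certain elements with specific names. In practice, you need a tokenizer, an encoder, and a decoder. The encoder can simply be a speechbrain.nnet.containers.LengthsCapableSequential composed of a sequence of feature computation, normalization, and model encoding.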

    HPARAMS_NEEDED = ["tokenizer"]
    MODULES_NEEDED = [
        "encoder",
        "decoder",
    ]
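To see what these two lists imply in practice, here is a minimal sketch of the kind of check an interface could run on the loaded hyperparameters. It is an illustration only, not the actual validation code of the Pretrained class; the function `check_interface` and the placeholder objects are hypothetical.

```python
HPARAMS_NEEDED = ["tokenizer"]
MODULES_NEEDED = ["encoder", "decoder"]


def check_interface(hparams, modules):
    """Raise an error listing every required entry that is missing."""
    missing = [f"hparams['{k}']" for k in HPARAMS_NEEDED if k not in hparams]
    missing += [f"modules['{k}']" for k in MODULES_NEEDED if k not in modules]
    if missing:
        raise ValueError("Missing required entries: " + ", ".join(missing))


# A yaml that forgot to expose the decoder under `modules`.
hparams = {"tokenizer": object(), "modules": {"encoder": object()}}
try:
    check_interface(hparams, hparams["modules"])
    message = "ok"
except ValueError as err:
    message = str(err)
print(message)
```

Naming the entries exactly as the interface expects is what allows the same inference code to work with any model.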

You also need to declare these entities in the YAML file and create the following dictionary called modules:

encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null, !ref <n_mels>]
    compute_features: !ref <compute_features>
    normalize: !ref <normalize>
    model: !ref <enc>

decoder: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>


modules:
    encoder: !ref <encoder>
    decoder: !ref <decoder>
    lm_model: !ref <lm_model>

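In this case, enc is a CRDNN, but it could be any custom neural network instance.

Why do you need to ensure this? Simply because these are the modules we call when running inference with the EncoderDecoderASR class. Here is an example from the encode_batch() function: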

[...]
    wavs = wavs.float()
    wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)
    encoder_out = self.modules.encoder(wavs, wav_lens)
    return encoder_out

What if I have a complex asr_encoder structure with multiple deep neural networks and so on? Simply put everything into a torch.nn.ModuleList in your yaml:

asr_encoder: !new:torch.nn.ModuleList
    - [!ref <enc>, my_different_blocks ... ]

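Call the pretrainer to load the checkpoints. Finally, you need to define a call to the pretrainer, which loads the different checkpoints of your trained model into the corresponding SpeechBrain modules. In short, it loads the weights of your encoder and language model, or even just your tokenizer.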

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        asr: !ref <asr_model>
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
    paths:
      asr: !ref <asr_model_ptfile>
      lm: !ref <lm_model_ptfile>
      tokenizer: !ref <tokenizer_ptfile>

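The loadables field creates a link between a file (e.g., lm, whose checkpoint is stored at <lm_model_ptfile>) and a yaml instance (e.g., <lm_model>), which is nothing more than your language model.

If you respect these two constraints, it should work! Here is a complete example of a yaml used for inference only: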

# ############################################################################
# Model: E2E ASR with attention-based ASR
# Encoder: CRDNN model
# Decoder: GRU + beamsearch + RNNLM
# Tokens: BPE with unigram
# Authors:  Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga 2020
# ############################################################################


# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # index(blank/eos/bos) = 0
blank_index: 0

# Decoding parameters
bos_index: 0
eos_index: 0
min_decode_ratio: 0.0
max_decode_ratio: 1.0
beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
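lm_weight: 0.50
ctc_weight_decode: 0.0  # assumed value: referenced by the scorer below; reuse the one from your training yaml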
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25

normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global

compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>

enc: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>

emb: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>

dec: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>

ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>

seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>

log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True  # For inference

tokenizer: !new:sentencepiece.SentencePieceProcessor

asr_model: !new:torch.nn.ModuleList
    - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]

# We compose the inference (encoder) pipeline.
encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null, !ref <n_mels>]
    compute_features: !ref <compute_features>
    normalize: !ref <normalize>
    model: !ref <enc>

ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    ctc_fc: !ref <ctc_lin>

coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer
    vocab_size: !ref <output_neurons>

rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer
    language_model: !ref <lm_model>
    temperature: !ref <temperature_lm>

scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    scorer_beam_scale: 1.5
    full_scorers: [
        !ref <rnnlm_scorer>,
        !ref <coverage_scorer>]
    partial_scorers: [!ref <ctc_scorer>]
    weights:
        rnnlm: !ref <lm_weight>
        coverage: !ref <coverage_penalty>
        ctc: !ref <ctc_weight_decode>

decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <emb>
    decoder: !ref <dec>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <scorer>

modules:
    encoder: !ref <encoder>
    decoder: !ref <decoder>
    lm_model: !ref <lm_model>

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        asr: !ref <asr_model>
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>

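As you can see, it is a standard YAML file, but with a pretrainer that loads the model. It is similar to the yaml file used for training. We only have to remove all the parts that are specific to training (e.g., training parameters, optimizers, checkpointers, etc.) and add the pretrainer together with the encoder and decoder elements, which link the needed modules with their pretrained files.

Develop your own inference interface

Although the EncoderDecoderASR class is designed to be as generic as possible, you might need a more complex inference scheme that better fits your needs. In this case, you have to develop your own interface. To do so, follow these steps:

Create your custom interface inheriting from Pretrained (code here):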

from speechbrain.inference.interfaces import Pretrained


class MySuperTask(Pretrained):
    # Here, do not hesitate to also add some required modules
    # for further transparency.
    HPARAMS_NEEDED = ["mymodule1", "mymodule2"]
    MODULES_NEEDED = [
        "mytask_enc",
        "my_searcher",
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Do whatever is needed here w.r.t your system

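This gives your class access to useful functions such as .from_hparams(), which fetches and loads a model based on a HyperPyYAML file, and load_audio(), which loads a given audio file. Most likely, the methods already available in the Pretrained class will fit your needs. If not, you can override them to implement your custom functionality.

Develop your interface and its different functionalities. Unfortunately, we cannot provide an example that is generic enough here. You can add to this class any function that you think makes inference on your data/model easier and more natural. For instance, we can create a function that simply encodes a wav file using the mytask_enc module.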

import torch
from speechbrain.inference.interfaces import Pretrained


class MySuperTask(Pretrained):
    # Here, do not hesitate to also add some required modules
    # for further transparency.
    HPARAMS_NEEDED = ["mymodule1", "mymodule2"]
    MODULES_NEEDED = [
        "mytask_enc",
        "my_searcher",
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Do whatever is needed here w.r.t your system

    def encode_file(self, path):
        waveform = self.load_audio(path)
        # Fake a batch of size 1:
        batch = waveform.unsqueeze(0).to(self.device)
        # A single unpadded sentence has a relative length of 1.0
        rel_lens = torch.tensor([1.0]).to(self.device)
        with torch.no_grad():
            # Encode the batch with the mytask_enc module
            # (Pretrained exposes the loaded modules as self.mods)
            encoder_out = self.mods.mytask_enc(batch, rel_lens)

        return encoder_out

We can now use your interface as follows:

from my_interface import MySuperTask  # hypothetical module where the class above is saved

my_model = MySuperTask.from_hparams(source="your_local_folder", hparams_file='your_file.yaml', savedir="pretrained_model")
audio_file = 'your_file.wav'
encoded = my_model.encode_file(audio_file)

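As you can see, this formalism is extremely flexible and lets you create a complete interface that can do anything you want with your pretrained model.

We provide different generic interfaces for E2E ASR, speaker recognition, source separation, speech enhancement, and more. Please have a look here if you are interested!

Customize your speech recognizer

In the general case, you will have your own data and want to use your own model. Let's discuss in more detail how to customize your recipe.

Suggestion: start from a working recipe (like the one used for this template) and make only the minimal modifications needed to customize it. Test your model step by step. Make sure your model can overfit a tiny dataset composed of a few sentences. If it does not overfit, there is likely a bug in your model.

Train with your data

When changing the dataset, all you have to do is update the data preparation script so that it creates JSON files in the expected format. The train.py script expects JSON files like this one: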

{
  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE"
  },
  "1867-154075-0001": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0001.flac",
    "length": 14.9,
    "words": "THAT DROPPED HIM INTO THE COAL BIN DID HE GET COAL DUST ON HIS SHOES RIGHT AND HE DIDN'T HAVE SENSE ENOUGH TO WIPE IT OFF AN AMATEUR A RANK AMATEUR I TOLD YOU SAID THE MAN OF THE SNEER WITH SATISFACTION"
  },

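You have to parse your dataset and create JSON files with a unique ID for each sentence, the path of the audio signal (wav), the length of the speech sentence in seconds (length), and the word transcription (words). That's all!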
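A data preparation script for a new corpus could build such a manifest along these lines. This is only a sketch under stated assumptions: the utterance list, the ids, the paths, and the function `build_manifest` are hypothetical, and the durations are assumed to be known already (a real script would read them from the audio files).

```python
import json


def build_manifest(utterances):
    """Map each unique utterance id to its wav path, length, and words."""
    manifest = {}
    for utt_id, wav_path, duration, words in utterances:
        if utt_id in manifest:
            raise ValueError(f"Duplicate utterance id: {utt_id}")
        manifest[utt_id] = {
            "wav": wav_path,
            "length": round(duration, 2),  # length in seconds
            "words": words,
        }
    return manifest


# Hypothetical corpus entries: (id, path, duration in seconds, transcription).
utterances = [
    ("spk1-utt1", "{data_root}/my_corpus/spk1/utt1.wav", 3.204, "HELLO WORLD"),
    ("spk1-utt2", "{data_root}/my_corpus/spk1/utt2.wav", 1.5, "GOOD MORNING"),
]

manifest = build_manifest(utterances)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Writing `manifest_json` to the train, valid, and test annotation paths gives files with the same layout as the example above, with the {data_root} placeholder left in place for the recipe to replace.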

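Train with your own model

At some point, you might have your own model and want to plug it into the speech recognition pipeline. For instance, you might want to replace our CRDNN encoder with something different. To do so, you have to create your own class and specify there the list of computations of your neural network. You can take a look at the models already available in speechbrain.lobes.models. If your model is a plain pipeline of computations, you can use the sequential container. If it is a more complex chain of computations, you can create it as an instance of torch.nn.Module and define the __init__ and forward methods there, like here.

Once you have defined your model, you only have to declare it in the yaml file and use it in train.py.

Important:
When plugging in a new model, you have to retune the most important hyperparameters of the system (e.g., learning rate, batch size, and architectural parameters) to make it work well.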

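Conclusion

In this tutorial, we showed how to create an end-to-end speech recognizer from scratch using SpeechBrain. The proposed system contains all the basic ingredients needed to develop a state-of-the-art system (i.e., data augmentation, tokenization, language models, beam search, attention, etc.).

We described all the steps using a small dataset only. In a real case, you have to train with much more data (see, for instance, our LibriSpeech recipes).

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries: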

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
