
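Inferring on your trained SpeechBrain model

In this tutorial, we will learn the different ways of running inference with a trained model. Note that this is unrelated to loading pretrained models for further training or transfer learning. If you are interested in those topics, please refer to the corresponding tutorials.

Prerequisites

Context

In this example, we consider a user who wants to transcribe some audio files with a custom speech recognizer that they trained themselves. If you are interested in using pretrained models that are available online, please refer to the pretrained tutorial. What follows can be extended to any task supported by SpeechBrain, since we provide a unified way of handling all of them.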

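Different options available

At this point, you have three options:

Define a custom Python function in your ASR class (which extends Brain). This introduces a strong coupling between the training recipe and your transcription code. It is convenient for prototyping and for obtaining simple transcripts on your datasets. However, it is not recommended for deployment.
Use an already available interface (e.g. EncoderDecoderASR, introduced in the pretrained tutorial). This is probably the most elegant and convenient way. However, your model must comply with some constraints to fit the proposed interface.
Build your own interface, perfectly fitted to your custom ASR model.

Important note: all these solutions also apply to other tasks (speaker recognition, source separation, ...).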

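1. Custom function in the training script

The goal of this approach is to let the user call, at the end of train.py, a function that transcribes a given dataset: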

    # Trainer initialization
    asr_brain = ASR(
        modules=hparams["modules"],
        opt_class=hparams["opt_class"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
    )

    # Training
    asr_brain.fit(
        asr_brain.hparams.epoch_counter,
        datasets["train"],
        datasets["valid"],
        train_loader_kwargs=hparams["train_dataloader_opts"],
        valid_loader_kwargs=hparams["valid_dataloader_opts"],
    )

    # Load best checkpoint for evaluation
    test_stats = asr_brain.evaluate(
        test_set=datasets["test"],
        min_key="WER",
        test_loader_kwargs=hparams["test_dataloader_opts"],
    )

    # Load best checkpoint for transcription !!!!!!
    # You need to create this function w.r.t your system architecture !!!!!!
    transcripts = asr_brain.transcribe_dataset(
        dataset=datasets["your_dataset"], # Must be obtained from the dataio_function
        min_key="WER", # We load the model with the lowest WER
        loader_kwargs=hparams["transcribe_dataloader_opts"], # opts for the dataloading
    )

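As you can see, there is a strong coupling with the training recipe, since the Brain class must be instantiated.

Note 1: You can remove .fit() and .evaluate() if you do not want to call them. They are kept here only to better show where the function is used.

Note 2: Here, the .transcribe_dataset() function takes a dataset object to transcribe. You could also simply use a path instead. It is entirely up to you to implement this function as you wish.

Now: what should go in this function? Here, we give an example based on the template, but you will need to adapt it to your system.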

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

import speechbrain as sb


# Add this method to your ASR class (the one that extends sb.Brain).
def transcribe_dataset(
    self,
    dataset,  # Must be obtained from the dataio_function
    min_key,  # We load the model with the lowest WER
    loader_kwargs,  # opts for the dataloading
):

    # If dataset isn't a DataLoader, we create it.
    if not isinstance(dataset, DataLoader):
        loader_kwargs["ckpt_prefix"] = None
        dataset = self.make_dataloader(
            dataset, sb.Stage.TEST, **loader_kwargs
        )

    self.on_evaluate_start(min_key=min_key)  # We call on_evaluate_start, which loads the best model
    self.modules.eval()  # We set the model to eval mode (remove dropout etc.)

    # Now we iterate over the dataset and we simply compute_forward and decode
    with torch.no_grad():

        transcripts = []
        for batch in tqdm(dataset, dynamic_ncols=True):
            
            # Make sure that your compute_forward returns the predictions !!!
            # In the case of the template, when stage = TEST, a beam search is applied
            # in compute_forward().
            out = self.compute_forward(batch, stage=sb.Stage.TEST)
            p_seq, wav_lens, predicted_tokens = out
            
            # We go from tokens to words.
            predicted_words = self.tokenizer(
                predicted_tokens, task="decode_from_list"
            )
            transcripts.append(predicted_words)
            
    return transcripts

The pipeline is simple: load the model -> run compute_forward -> detokenize.
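Because transcribe_dataset() appends one entry per batch, its return value is nested: a list of batches, each holding one list of words per utterance. A minimal sketch of flattening it into one text line per utterance, assuming that the tokenizer's decode_from_list task returns a list of word lists. The helper name flatten_transcripts is hypothetical and not part of SpeechBrain:

```python
def flatten_transcripts(batched_transcripts):
    """Turn [[words, words, ...], ...] (one inner list per batch)
    into a flat list with one sentence string per utterance."""
    sentences = []
    for batch in batched_transcripts:
        for words in batch:
            sentences.append(" ".join(words))
    return sentences


# Two batches: the first with two utterances, the second with one.
example = [
    [["hello", "world"], ["good", "morning"]],
    [["speech", "brain"]],
]
print(flatten_transcripts(example))
```

If your compute_forward() returns something other than token lists, adapt the inner loop accordingly.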

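2. Using the `EncoderDecoderASR` interface

The EncoderDecoderASR interface class lets you decouple your trained model from the training recipe and infer (or encode) on any new audio file in a few lines of code. If you are not interested in ASR, you will find many other interfaces for your needs in the interfaces.py file. This solution should be preferred if you intend to deploy your model in production, i.e. if you plan to use your model often and in a stable way. Of course, this requires you to slightly rework the yaml file.

The class has the following methods:

encode_batch: applies the encoder to an input batch and returns the encoded features.
transcribe_file: transcribes a single audio file given as input.
transcribe_batch: transcribes an input batch.

In fact, if you fulfill the few constraints detailed in the next paragraph, you can simply do: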

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="your_folder", hparams_file='your_file.yaml', savedir="pretrained_model")
asr_model.transcribe_file('your_file.wav')

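However, to allow such a generalization over all possible EncoderDecoder ASR pipelines, you must take a few constraints into account when deploying your system:

Necessary modules. As you can see in the EncoderDecoderASR class, the modules defined in your yaml file MUST contain certain elements with specific names. In practice, you need a tokenizer, an encoder and a decoder. The encoder can simply be a speechbrain.nnet.containers.LengthsCapableSequential composed of a sequence of feature computation, normalization and model encoding.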

    HPARAMS_NEEDED = ["tokenizer"]
    MODULES_NEEDED = [
        "encoder",
        "decoder",
    ]
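The two lists above act as a contract between the interface and the YAML file. The following toy class is a hedged illustration of how such a contract can be enforced at construction time. It is a sketch written for this tutorial, not the SpeechBrain implementation, and the class name InterfaceSketch is hypothetical:

```python
class InterfaceSketch:
    """Toy stand-in for an inference interface; not the SpeechBrain class."""

    HPARAMS_NEEDED = ["tokenizer"]
    MODULES_NEEDED = ["encoder", "decoder"]

    def __init__(self, modules, hparams):
        # Fail early, with an explicit message, if the YAML lacks an entry.
        missing_mods = [m for m in self.MODULES_NEEDED if m not in modules]
        missing_hp = [h for h in self.HPARAMS_NEEDED if h not in hparams]
        if missing_mods or missing_hp:
            raise ValueError(
                f"missing modules: {missing_mods}, missing hparams: {missing_hp}"
            )
        self.modules = modules
        self.hparams = hparams


# A 'modules' dict without a decoder is rejected.
try:
    InterfaceSketch(modules={"encoder": object()}, hparams={"tokenizer": object()})
except ValueError as err:
    print(err)
```

Checking names up front means a misnamed YAML entry is reported when the model is loaded rather than in the middle of a transcription.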

You also need to declare these entities in the YAML file and create the following dictionary called modules:

encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null, !ref <n_mels>]
    compute_features: !ref <compute_features>
    normalize: !ref <normalize>
    model: !ref <enc>

ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    ctc_fc: !ref <ctc_lin>

coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer
    vocab_size: !ref <output_neurons>

rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer
    language_model: !ref <lm_model>
    temperature: !ref <temperature_lm>

scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    scorer_beam_scale: 1.5
    full_scorers: [
        !ref <rnnlm_scorer>,
        !ref <coverage_scorer>]
    partial_scorers: [!ref <ctc_scorer>]
    weights:
        rnnlm: !ref <lm_weight>
        coverage: !ref <coverage_penalty>
        ctc: !ref <ctc_weight_decode>

decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <emb>
    decoder: !ref <dec>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <scorer>

modules:
    encoder: !ref <encoder>
    decoder: !ref <decoder>
    lm_model: !ref <lm_model>

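In this case, enc is a CRDNN, but it could be any custom neural network instance.

Why do you need to ensure this? Simply because these are the modules that we call when running inference with the EncoderDecoderASR class. Here is an excerpt of the encode_batch() function.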

[...]
    wavs = wavs.float()
    wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)
    encoder_out = self.mods.encoder(wavs, wav_lens)
    return encoder_out

What if I have a complex asr_encoder structure with multiple deep neural networks and so on? Simply put everything in a torch.nn.ModuleList in your yaml:

asr_encoder: !new:torch.nn.ModuleList
    - [!ref <enc>, my_different_blocks ... ]

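Call to the pretrainer to load the checkpoints. Finally, you need to define a call to the pretrainer, which will load the different checkpoints of your trained model into the corresponding SpeechBrain modules. In short, it will load the weights of your encoder, your language model, or even just your tokenizer.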

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        asr: !ref <asr_model>
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
    paths:
        asr: !ref <asr_model_ptfile>
        lm: !ref <lm_model_ptfile>
        tokenizer: !ref <tokenizer_ptfile>

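The loadables field creates a link between a file (e.g. lm, whose checkpoint is given by <lm_model_ptfile>) and a yaml instance (e.g. <lm_model>), which is nothing more than your lm.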
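The pairing between loadables and paths can be pictured as two dictionaries that share the same keys. The sketch below is a plain-Python illustration of that idea, not the Pretrainer implementation. The function name resolve_checkpoints, the default_dir argument and the fallback to "<name>.ckpt" for a loadable without an explicit path are all assumptions made for this example:

```python
def resolve_checkpoints(loadables, paths, default_dir="pretrained_model"):
    """Map every loadable name to the checkpoint file it should be read from."""
    unknown = sorted(set(paths) - set(loadables))
    if unknown:
        raise ValueError(f"paths given for unknown loadables: {unknown}")
    resolved = {}
    for name in loadables:
        # Assumed fallback: '<name>.ckpt' inside the default directory.
        resolved[name] = paths.get(name, f"{default_dir}/{name}.ckpt")
    return resolved


loadables = {"asr": "asr_model", "lm": "lm_model", "tokenizer": "tokenizer"}
paths = {"lm": "checkpoints/lm.ckpt"}
for name, ckpt in resolve_checkpoints(loadables, paths).items():
    print(f"{name} <- {ckpt}")
```

The point to retain is that the keys (asr, lm, tokenizer) are what tie a checkpoint file to the object that receives its parameters.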

If you respect these two constraints, it should work! Here, we give a complete example of a yaml that is only used for inference:

# ############################################################################
# Model: E2E ASR with attention-based ASR
# Encoder: CRDNN model
# Decoder: GRU + beamsearch + RNNLM
# Tokens: BPE with unigram
# Authors:  Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga 2020
# ############################################################################


# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # index(blank/eos/bos) = 0
blank_index: 0

# Decoding parameters
bos_index: 0
eos_index: 0
min_decode_ratio: 0.0
max_decode_ratio: 1.0
test_beam_size: 80  # Name matches the <test_beam_size> reference used by the decoder below
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
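lm_weight: 0.50
ctc_weight_decode: 0.0  # Assumed placeholder (absent from the original file but referenced by the scorer); use the value from your recipe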
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25

normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global

compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>

enc: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>

emb: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>

dec: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>

ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>

seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>

log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True  # For inference

tokenizer: !new:sentencepiece.SentencePieceProcessor

asr_model: !new:torch.nn.ModuleList
    - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]

# We compose the inference (encoder) pipeline.
encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null, !ref <n_mels>]
    compute_features: !ref <compute_features>
    normalize: !ref <normalize>
    model: !ref <enc>


ctc_scorer: !new:speechbrain.decoders.scorer.CTCScorer
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    ctc_fc: !ref <ctc_lin>

coverage_scorer: !new:speechbrain.decoders.scorer.CoverageScorer
    vocab_size: !ref <output_neurons>

rnnlm_scorer: !new:speechbrain.decoders.scorer.RNNLMScorer
    language_model: !ref <lm_model>
    temperature: !ref <temperature_lm>

scorer: !new:speechbrain.decoders.scorer.ScorerBuilder
    scorer_beam_scale: 1.5
    full_scorers: [
        !ref <rnnlm_scorer>,
        !ref <coverage_scorer>]
    partial_scorers: [!ref <ctc_scorer>]
    weights:
        rnnlm: !ref <lm_weight>
        coverage: !ref <coverage_penalty>
        ctc: !ref <ctc_weight_decode>

decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <emb>
    decoder: !ref <dec>
    linear: !ref <seq_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    temperature: !ref <temperature>
    scorer: !ref <scorer>
    

modules:
    encoder: !ref <encoder>
    decoder: !ref <decoder>
    lm_model: !ref <lm_model>

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        asr: !ref <asr_model>
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>

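As you can see, it is a standard YAML file, but with a pretrainer that loads the model. It is similar to the yaml file used for training. We only have to remove everything that is specific to training (e.g. training parameters, optimizers, checkpointers, etc.) and add the pretrainer together with the encoder and decoder elements, which link the needed modules to their pretrained files.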
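When training-specific entries are stripped from a yaml file, it is easy to leave behind a `!ref <name>` that points to a key that no longer exists. The following sketch is one possible way to spot such dangling references before loading the file. It is a rough text-based check written for this tutorial (the function find_dangling_refs is hypothetical), it only understands simple `!ref <name>` references to top-level keys, and it is not a replacement for HyperPyYAML's own parsing:

```python
import re

REF_PATTERN = re.compile(r"!ref\s+<([A-Za-z_][A-Za-z0-9_]*)")
KEY_PATTERN = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*):", re.MULTILINE)


def find_dangling_refs(yaml_text):
    """Return referenced names that are never defined as a top-level key."""
    # Drop comments so that commented-out references are ignored.
    lines = [line.split("#", 1)[0] for line in yaml_text.splitlines()]
    text = "\n".join(lines)
    defined = set(KEY_PATTERN.findall(text))
    referenced = set(REF_PATTERN.findall(text))
    return sorted(referenced - defined)


snippet = """
n_mels: 40
beam_size: 80
compute_features: !new:speechbrain.lobes.features.Fbank
    n_mels: !ref <n_mels>
decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
    beam_size: !ref <test_beam_size>
"""
print(find_dangling_refs(snippet))
```

Running such a check on an inference yaml before calling from_hparams() gives a short list of names to define or rename.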

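3. Developing your own inference interface

While the EncoderDecoderASR class has been designed to be as generic as possible, you may need a more complex inference scheme that better fits your needs. In that case, you have to develop your own interface. To do so, follow these steps:

Create your custom interface, inheriting from Pretrained (see the code in interfaces.py):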

from speechbrain.inference.interfaces import Pretrained


class MySuperTask(Pretrained):
    # Here, do not hesitate to also add some required modules
    # for further transparency.
    HPARAMS_NEEDED = ["mymodule1", "mymodule2"]
    MODULES_NEEDED = [
        "mytask_enc",
        "my_searcher",
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Do whatever is needed here w.r.t your system

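This gives your class access to useful functions such as .from_hparams(), which fetches and loads the model based on a HyperPyYAML file, and load_audio(), which loads a given audio file. Most likely, the methods coded in the Pretrained class will fit your needs. If not, you can override them to implement your custom functionality.

Develop your interface and its different functionalities. Unfortunately, we cannot provide an example that is generic enough here. You can add to this class any function that you think makes inference on your data or model easier and more natural. For instance, we can create here a function that simply encodes a wav file using the mytask_enc module.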

import torch
from speechbrain.inference.interfaces import Pretrained


class MySuperTask(Pretrained):
    # Here, do not hesitate to also add some required modules
    # for further transparency.
    HPARAMS_NEEDED = ["mymodule1", "mymodule2"]
    MODULES_NEEDED = [
        "mytask_enc",
        "my_searcher",
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Do whatever is needed here w.r.t your system

    def encode_file(self, path):
        waveform = self.load_audio(path)
        # Fake a batch:
        batch = waveform.unsqueeze(0).to(self.device)
        rel_lens = torch.tensor([1.0]).to(self.device)
        with torch.no_grad():
            # Encode with the mytask_enc module only (assumes it accepts
            # a batch of waveforms and their relative lengths).
            encoder_out = self.mods.mytask_enc(batch, rel_lens)

        return encoder_out
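encode_file() passes a relative length of 1.0 because a single file fills its one-item batch entirely. With several files of different durations, the shorter ones are padded and each relative length becomes the true length divided by the longest length in the batch. A minimal plain-Python sketch of that convention follows. The helper pad_and_relative_lengths is hypothetical, and lists of floats stand in for tensors:

```python
def pad_and_relative_lengths(waveforms, pad_value=0.0):
    """Right-pad waveforms to a common length and return relative lengths."""
    max_len = max(len(w) for w in waveforms)
    padded = [list(w) + [pad_value] * (max_len - len(w)) for w in waveforms]
    rel_lens = [len(w) / max_len for w in waveforms]
    return padded, rel_lens


batch, rel_lens = pad_and_relative_lengths([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]])
print(batch)     # the second waveform is padded with two zeros
print(rel_lens)  # [1.0, 0.5]
```

An encode_batch-style method in your interface would receive the padded batch and these relative lengths, so the model can ignore the padded part.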

Now, we can use your interface as follows:

from speechbrain.inference.my_super_task import MySuperTask

my_model = MySuperTask.from_hparams(source="your_local_folder", hparams_file='your_file.yaml', savedir="pretrained_model")
audio_file = 'your_file.wav'
encoded = my_model.encode_file(audio_file)

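As you can see, this formalism is extremely flexible and lets you create a holistic interface that can be used to do anything you want with your pretrained model.

We provide different generic interfaces for end-to-end ASR (E2E ASR), speaker recognition, source separation, speech enhancement, etc. If interested, take a look at the interfaces provided with SpeechBrain.

General pretrained inference

In some cases, users might want to develop their inference interface in an external file. This can be done with the foreign_class function. You can take a look at the example reported here: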

from speechbrain.inference.interfaces import foreign_class
classifier = foreign_class(source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP", pymodule_file="custom_interface.py", classname="CustomEncoderWav2vec2Classifier")
out_prob, score, index, text_lab = classifier.classify_file("speechbrain/emotion-recognition-wav2vec2-IEMOCAP/anger.wav")
print(text_lab)

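In this case, the inference interface is not a class written in speechbrain.inference.interfaces, but is coded in an external file (custom_interface.py).

This can be useful if the interface you need is not available in speechbrain.inference.interfaces. If you wish, you can add it there. With foreign_class, however, you also have the option of fetching the inference code from any other path.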
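To give an intuition of what loading a class from an external file involves, here is a generic standard-library sketch based on importlib. It only illustrates the general mechanism and is not how foreign_class is implemented. The helper load_class_from_file and the toy CustomClassifier class are hypothetical:

```python
import importlib.util
import pathlib
import tempfile


def load_class_from_file(py_file, classname):
    """Import the Python file at py_file and return the class named classname."""
    spec = importlib.util.spec_from_file_location("custom_interface", str(py_file))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, classname)


# Demo: write a tiny interface file, then load its class by name.
with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / "custom_interface.py"
    source.write_text(
        "class CustomClassifier:\n"
        "    def classify_file(self, path):\n"
        "        return 'label for ' + path\n"
    )
    cls = load_class_from_file(source, "CustomClassifier")
    result = cls().classify_file("anger.wav")

print(result)
```

Keeping the interface in its own file means the inference code can travel with the model files instead of living inside the toolkit.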

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}