ESPnet2


Main changes from ESPnet1

Chainer free
- Chainer has been dropped completely.
- The development of Chainer is stopped at v7: https://chainer.org/announcement/2019/12/05/released-v7.html
Kaldi free
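- Compiling Kaldi is not mandatory.
- If you find a recipe that requires Kaldi, please report it. This should be treated as a bug in ESPnet2.
- We can still optionally support features developed in Kaldi.
- We still follow the Kaldi style, i.e., we depend on Kaldi's utils/ directory.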
On the fly feature extraction & text preprocessing for training
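- There is no need to create feature files before training; waveform data can be input directly.
- Both raw waveform input and extracted features are supported.
- Text preprocessing, such as character tokenization or tokenization with sentencepiece, can also be applied during training.
- Self-supervised learning representations from s3prl are supported.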
Discarding the JSON format describing the training corpus.
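- Why did we drop the JSON format? Because the dict object generated from a large JSON file consumes a lot of memory, and parsing such a large JSON file also takes a long time.
Support distributed data-parallel training (not yet thoroughly tested)
- Single-node multi-GPU training with DistributedDataParallel is also supported.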
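As an illustration of the on-the-fly text preprocessing mentioned above, the following minimal sketch tokenizes a transcript into characters and maps the tokens to IDs at load time, so no tokenized file has to be prepared in advance. The function names, the `<space>` and `<unk>` symbols, and the toy vocabulary are assumptions made for this example; this is not ESPnet2's actual preprocessing code.

```python
from typing import Dict, List


def char_tokenize(text: str, space_symbol: str = "<space>") -> List[str]:
    """Split a transcript into characters, replacing blanks with a visible symbol."""
    return [space_symbol if ch == " " else ch for ch in text.strip()]


def tokens_to_ids(tokens: List[str], token2id: Dict[str, int], unk: str = "<unk>") -> List[int]:
    """Map tokens to integer IDs, falling back to the unknown symbol."""
    return [token2id.get(tok, token2id[unk]) for tok in tokens]


# Hypothetical toy vocabulary: three special symbols followed by A-Z.
vocab = ["<blank>", "<unk>", "<space>"] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
token2id = {tok: i for i, tok in enumerate(vocab)}

# Applied per utterance while loading data, instead of in a separate offline step.
tokens = char_tokenize("GO TO AN4")
ids = tokens_to_ids(tokens, token2id)
print(tokens)
print(ids)
```

Because "4" is not in the toy vocabulary, it is mapped to the `<unk>` ID.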

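Understanding ESPnet2 recipes

A recipe is a set of scripts that lets users fully reproduce an experiment, including data preparation, model definition, training, evaluation, and model release.

You can find the new recipes in egs2 (short for "examples for ESPnet2"):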

espnet/  # Python modules of espnet1
espnet2/ # Python modules of espnet2
egs/     # espnet1 recipes
egs2/    # espnet2 recipes

egs2 recipes are always organized as egs2/<dataset>/<task>. For example, a user should be able to reproduce a full experiment as follows:

# Dataset: an4, Task: ASR
cd egs2/an4/asr1/

# Run the full experiment
./run.sh

Note that the usage of recipes is almost the same as in ESPnet1.

Now let's go through how these recipes work, step by step.

Change to the base directory

# e.g.
cd egs2/an4/asr1/

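an4 is a small corpus that is freely available, so it is well suited to this tutorial. You can run any other recipe in the same way, e.g. wsj, librispeech, and so on.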

Note that all scripts should be run from the egs2/<dataset>/<task> directory level.

# Doesn't work
cd egs2/an4/
./asr1/run.sh
./asr1/scripts/<some-script>.sh

# Doesn't work
cd egs2/an4/asr1/local/
./data.sh

# Works
cd egs2/an4/asr1
./run.sh
./scripts/<some-script>.sh
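The rule above can be checked programmatically. The sketch below uses a hypothetical helper, `is_recipe_level`, which is not part of ESPnet; it only tests whether a path ends exactly two levels below egs2.

```python
from pathlib import PurePosixPath


def is_recipe_level(path: str) -> bool:
    """Return True if `path` looks like .../egs2/<dataset>/<task>."""
    parts = PurePosixPath(path).parts
    if "egs2" not in parts:
        return False
    # Index of the last "egs2" component in the path.
    idx = len(parts) - 1 - parts[::-1].index("egs2")
    # Exactly two components (<dataset>, <task>) must follow it.
    return len(parts) - idx - 1 == 2


for candidate in ["egs2/an4", "egs2/an4/asr1/local", "egs2/an4/asr1"]:
    print(candidate, is_recipe_level(candidate))
```

In practice you would pass the current working directory (for example `os.getcwd()`) before launching run.sh.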

Directory structure of each recipe

egs2/an4/asr1/
  - conf/      # Configuration files for training, inference, etc.
  - scripts/   # Bash utilities of espnet2
  - pyscripts/ # Python utilities of espnet2
  - steps/     # From Kaldi utilities
  - utils/     # From Kaldi utilities
  - db.sh      # The directory path of each corpora
  - path.sh    # Setup script for environment variables
  - cmd.sh     # Configuration for your backend of job scheduler
  - run.sh     # Entry point
  - asr.sh     # Invoked by run.sh

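Changing the configuration

Before running run.sh, you need to edit db.sh to specify where your corpora are. For example, when working with the egs2/wsj recipe, you need to change the paths of WSJ0 and WSJ1 in db.sh.
Some corpora can be downloaded freely from the web, and they are initially set to "downloads/". If you have already downloaded one, you can also change this to your own corpus path.
path.sh sets up the environment for run.sh. Note that the Python interpreter used by ESPnet is not the Python currently in your terminal, but the one installed under tools/. You therefore need to source path.sh before using this Python.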
```
. path.sh
python
```
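cmd.sh specifies the backend of the job scheduling system. If you have no such system in your local machine environment, you do not need to change anything in this file. See "Using Job scheduling system".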

`run.sh`

./run.sh

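run.sh is an example script, which we often call a "recipe", that runs all the stages of a DNN experiment: data preparation, training, and evaluation.

Checking the training status

Show the log file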

% tail -f exp/*_train_*/train.log
[host] 2020-04-05 16:34:54,278 (trainer:192) INFO: 2/40epoch started. Estimated time to finish: 7 minutes and 58.63 seconds
[host] 2020-04-05 16:34:56,315 (trainer:453) INFO: 2epoch:train:1-10batch: iter_time=0.006, forward_time=0.076, loss=50.873, loss_att=35.801, loss_ctc=65.945, acc=0.471, backward_time=0.072, optim_step_time=0.006, lr_0=1.000, train_time=0.203
[host] 2020-04-05 16:34:58,046 (trainer:453) INFO: 2epoch:train:11-20batch: iter_time=4.280e-05, forward_time=0.068, loss=44.369, loss_att=28.776, loss_ctc=59.962, acc=0.506, backward_time=0.055, optim_step_time=0.006, lr_0=1.000, train_time=0.173
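If you want to track these values outside of TensorBoard, log lines like the ones above can be parsed with a few regular expressions. This is a minimal sketch that assumes the log format shown above; the function name and the returned field names are made up for illustration.

```python
import re
from typing import Dict, Optional

# "2epoch:train:1-10batch:" -> epoch, mode, first and last batch index.
HEADER = re.compile(r"(\d+)epoch:(\w+):(\d+)-(\d+)batch:")
# "loss=50.873" or "iter_time=4.280e-05" -> (key, numeric value).
KEY_VALUE = re.compile(r"(\w+)=([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_train_log_line(line: str) -> Optional[Dict[str, object]]:
    """Extract epoch information and metrics from one train.log line.

    Returns None for lines that do not report batch statistics.
    """
    header = HEADER.search(line)
    if header is None:
        return None
    record: Dict[str, object] = {
        "epoch": int(header.group(1)),
        "mode": header.group(2),
        "first_batch": int(header.group(3)),
        "last_batch": int(header.group(4)),
    }
    for key, value in KEY_VALUE.findall(line[header.end():]):
        record[key] = float(value)
    return record


line = (
    "[host] 2020-04-05 16:34:56,315 (trainer:453) INFO: 2epoch:train:1-10batch: "
    "iter_time=0.006, forward_time=0.076, loss=50.873, loss_att=35.801, "
    "loss_ctc=65.945, acc=0.471, backward_time=0.072, optim_step_time=0.006, "
    "lr_0=1.000, train_time=0.203"
)
record = parse_train_log_line(line)
print(record["epoch"], record["loss"], record["acc"])
```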

Show the training status in image files

# Accuracy plot
# (eog is Eye of GNOME Image Viewer)
eog exp/*_train_*/images/acc.img
# Attention plot
eog exp/*_train_*/att_ws/<sample-id>/<param-name>.img

Use TensorBoard

tensorboard --logdir exp/*_train_*/tensorboard/

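Tips

Relationship between mini-batch size and number of GPUs

In ESPnet2, the batch size behaves differently from ESPnet1 during multi-GPU training: the total batch size does not change regardless of the number of GPUs. Therefore, if you increase the number of GPUs, you need to increase the batch size manually. See this document for more information.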
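The relationship can be sketched with simple arithmetic. The snippet assumes that the total mini-batch is split evenly across GPUs; both helper names are hypothetical.

```python
def per_gpu_batch_size(total_batch_size: int, ngpu: int) -> int:
    """Share of the (fixed) total mini-batch that each GPU processes."""
    return total_batch_size // ngpu


def total_batch_size_for(per_gpu: int, ngpu: int) -> int:
    """Total batch size to configure so that each GPU keeps `per_gpu` samples."""
    return per_gpu * ngpu


# With a total batch size of 20, going from 1 to 4 GPUs shrinks the per-GPU share.
print(per_gpu_batch_size(20, 1))   # 20
print(per_gpu_batch_size(20, 4))   # 5
# To keep 20 samples per GPU on 4 GPUs, the configured batch size must grow.
print(total_batch_size_for(20, 4))  # 80
```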

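Evaluating with a specified experiment directory

If you have already trained a model, you may wonder how to give it to run.sh when evaluating later. By default, the directory name is generated automatically from the given options (e.g. asr_args, lm_args). You can override this with --asr_exp and --lm_exp.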

# For ASR recipe
./run.sh --skip_data_prep true --skip_train true --asr_exp <your_asr_exp_directory> --lm_exp <your_lm_exp_directory>

# For TTS recipe
./run.sh --skip_data_prep true --skip_train true --tts_exp <your_tts_exp_directory>

Evaluating without training using a pretrained model

./run.sh --download_model <model_name> --skip_train true

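You need to fill in model_name yourself. You can search for pretrained models on Hugging Face using the espnet tag. For our pretrained models, see https://github.com/espnet/espnet_model_zoo

Evaluation using OpenAI Whisper

ESPnet2 provides a script for inference and scoring with OpenAI's Whisper. It can be used to evaluate speech generation models. An example follows: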

#!/usr/bin/env bash
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail

whisper_tag=medium    # whisper model tag, e.g., small, medium, large, etc
cleaner=whisper_en
hyp_cleaner=whisper_en
nj=1
test_sets="test/WSJ/test_eval92"
# decode_options is used in Whisper model's transcribe method
decode_options="{language: en, task: transcribe, temperature: 0, beam_size: 10, fp16: False}"

for x in ${test_sets}; do
    wavscp=dump/raw/${x}/wav.scp    # path to wav.scp
    outdir=whisper-${whisper_tag}_outputs/${x}  # path to save output
    gt_text=dump/raw/${x}/text      # path to groundtruth text file (for scoring only)

    scripts/utils/evaluate_asr.sh \
        --whisper_tag ${whisper_tag} \
        --nj ${nj} \
        --gpu_inference true \
        --stage 2 \
        --stop_stage 3 \
        --cleaner ${cleaner} \
        --hyp_cleaner ${hyp_cleaner} \
        --decode_options "${decode_options}" \
        --gt_text ${gt_text} \
        ${wavscp} \
        ${outdir}
done

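ESPnet encourages you to share your results using platforms such as Hugging Face.

To make sharing your models easier, the last three stages of each task simplify this process. The model is packed into a zip file and uploaded to the selected platform(s) (one or two).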

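For Hugging Face, you first need to create a repository (<my_repo> = <user_name>/<repo_name>). Remember to install git-lfs before continuing. Then, run run.sh as follows: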

./run.sh --stage <packing stage> --skip-packing false --skip-upload-hf false --hf-repo <my_repo>

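The stage number differs depending on the task. Check the task-specific shell script (e.g. asr1/asr.sh) for the number to specify. With the flags above set, the packed model is uploaded to Hugging Face.

Using self-supervised learning representations as features

ESPnet supports using self-supervised learning representations (SSLR) in place of traditional spectral features. In some cases, SSLRs can improve performance.

To use SSLRs in your task, you need to make a few modifications.

Install S3PRL with tools/installers/install_s3prl.sh.
To use HuBERT/Wav2Vec, install fairseq with tools/installers/install_fairseq.sh.

Below are various tips for using SSLR.

To reduce the time spent in the collect_stats step, specify --feats_normalize uttmvn in run.sh and pass it as an argument to asr.sh or the other task-specific script. (Recommended)

In the configuration file, specify the frontend and the preencoder. Taking HuBERT as an example: the upstream name can be any model supported by S3PRL. multilayer_feature: True means that the final representation is a weighted sum of the hidden states of all layers of the SSLR model.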

frontend: s3prl
frontend_conf:
   frontend_conf:
      upstream: hubert_large_ll60k  # Note: If the upstream model is changed, please change the input_size in the preencoder accordingly.
   download_dir: ./hub
   multilayer_feature: True
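To make the weighted sum of layers concrete, here is a small sketch in plain Python for a single frame. It assumes the per-layer weights are normalized with a softmax before summing, which is an assumption of this example and not something stated above; all names are hypothetical.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_states: List[List[float]], layer_weights: List[float]) -> List[float]:
    """Combine one frame's hidden state from every layer into a single vector.

    layer_states[l][d] is dimension d of layer l; layer_weights holds one
    (learnable, here fixed) scalar per layer.
    """
    norm = softmax(layer_weights)
    dim = len(layer_states[0])
    return [sum(w * layer[d] for w, layer in zip(norm, layer_states)) for d in range(dim)]


# Three layers with 2-dimensional states; equal weights give a plain average.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
combined = weighted_layer_sum(states, [0.0, 0.0, 0.0])
print(combined)
```

The output has the same dimension as a single layer, which is why input_size in the preencoder depends only on the upstream model.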

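The preencoder here reduces the dimension of the encoder input to lower memory consumption. input_size depends on the upstream model, while output_size can be set to any value.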

preencoder: linear
preencoder_conf:
   input_size: 1024  # Note: If the upstream model is changed, please change this value accordingly.
   output_size: 80

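Different upstream models have different frame shifts; for example, HuBERT and Wav2Vec2.0 have a 20 ms frame shift. It is therefore sometimes necessary to change the downsampling rate (input_layer) in the encoder configuration. For example, input_layer: conv2d2 yields a total frame shift of 40 ms, which is sufficient for some tasks.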
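The arithmetic behind this can be sketched as follows. The factor for conv2d2 follows from the example above; the other table entries are given on the assumption that plain conv2d subsamples by 4 and that the number in conv2d6 and conv2d8 is the subsampling factor. The names in the snippet are hypothetical.

```python
# Assumed subsampling factor of each input_layer choice (see the note above).
SUBSAMPLING = {"conv2d2": 2, "conv2d": 4, "conv2d6": 6, "conv2d8": 8}


def total_frame_shift_ms(upstream_shift_ms: float, input_layer: str) -> float:
    """Frame shift seen by the encoder body after the input layer."""
    return upstream_shift_ms * SUBSAMPLING[input_layer]


# HuBERT / Wav2Vec2.0 features arrive every 20 ms.
print(total_frame_shift_ms(20.0, "conv2d2"))  # 40.0
print(total_frame_shift_ms(20.0, "conv2d"))   # 80.0
```

With a 20 ms upstream, the default-style 4x subsampling already gives an 80 ms shift, which is the reason a smaller factor such as conv2d2 can be preferable.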

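Streaming ASR

ESPnet supports streaming Transformer/Conformer ASR with blockwise synchronous beam search. See the paper for more details.

Training

To achieve streaming ASR, use a blockwise Transformer/Conformer encoder in the configuration file. Taking the blockwise Transformer as an example: the encoder name can be contextual_block_transformer or contextual_block_conformer.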

encoder: contextual_block_transformer
encoder_conf:
    block_size: 40         # block size for block processing
    hop_size: 16           # hop size for block processing
    look_ahead: 16         # look-ahead size for block processing
    init_average: true     # whether to use average input as initial context
    ctx_pos_enc: true      # whether to use positional encoding for the context vectors
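To give an intuition for block_size and hop_size, the sketch below cuts a sequence of frames into overlapping blocks. It is a generic overlapping-window illustration under the assumption that consecutive blocks start hop_size frames apart; it is not the exact procedure used inside the encoder, and the function name is hypothetical.

```python
from typing import List, Tuple


def block_ranges(num_frames: int, block_size: int = 40, hop_size: int = 16) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of overlapping blocks.

    Short inputs are processed as a single block; the last block may be
    shorter than block_size.
    """
    if num_frames <= block_size:
        return [(0, num_frames)]
    ranges = []
    start = 0
    while start + block_size < num_frames:
        ranges.append((start, start + block_size))
        start += hop_size
    ranges.append((start, num_frames))
    return ranges


for start, end in block_ranges(100):
    print(start, end)
```

Each block overlaps its predecessor by block_size - hop_size frames, which is what lets a block see some past and some look-ahead context around the frames it is responsible for.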

Decoding

To enable online decoding, add the argument --use_streaming true to run.sh.

./run.sh --stage 12 --use_streaming true

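FAQ

Issue about 'NoneType' object has no attribute 'max' during training: please make sure that the forward_train function is used during training; see here for more details.
I trained the model successfully, but I hit the above issue during decoding: you may have forgotten to specify --use_streaming true to select streaming inference.

Real-time factor and latency

To calculate the real-time factor (RTF) and (non-streaming) latency, the script utils/calculate_rtf.py has been reworked and now works for both ESPnet1 and ESPnet2. The script computes the inference time from time markers in the decoding log files and reports the average RTF and average latency over all decoded utterances. For ESPnet2, the script is run automatically after the decoding stage finishes (see the Limitations section below), but it can also be run as a stand-alone script:

Usage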

usage: calculate_rtf.py [-h] [--log-dir LOG_DIR]
                        [--log-name {decode,asr_inference}]
                        [--input-shift INPUT_SHIFT]
                        [--start-times-marker {input lengths,speech length}]
                        [--end-times-marker {prediction,best hypo}]

calculate real time factor (RTF)

optional arguments:
  -h, --help            show this help message and exit
  --log-dir LOG_DIR     path to logging directory
  --log-name {decode,asr_inference}
                        name of logfile, e.g., 'decode' (espnet1) and
                        'asr_inference' (espnet2)
  --input-shift INPUT_SHIFT
                        shift of inputs in milliseconds
  --start-times-marker {input lengths,speech length}
                        String marking start of decoding in logfile, e.g.,
                        'input lengths' (espnet1) and 'speech length'
                        (espnet2)
  --end-times-marker {prediction,best hypo}
                        String marking end of decoding in logfile, e.g.,
                        'prediction' (espnet1) and 'best hypo' (espnet2)

Notes

The default settings still target ESPnet1:

--log-name 'decode'
--input-shift 10.0
--start-times-marker 'input lengths'
--end-times-marker 'prediction'

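For ESPnet2, frame shifts other than 10 ms are possible through different frontend/feature configurations. So, unlike ESPnet1, which logs the number of input feature frames at a fixed 10 ms frame shift, ESPnet2 logs the number of speech samples, and the audio sample shift in milliseconds (1 / sampling rate x 1000) must be given to --input-shift (see --input-shift 0.0625 in the example below for a 16000 Hz sampling rate).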
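The value passed to --input-shift can be derived from the sampling rate as follows (a small sketch; the helper names are hypothetical):

```python
def input_shift_ms(sample_rate_hz: int) -> float:
    """Duration of one audio sample in milliseconds: 1 / sampling rate x 1000."""
    return 1000.0 / sample_rate_hz


def audio_duration_sec(num_samples: int, sample_rate_hz: int) -> float:
    """Audio duration recovered from a logged number of speech samples."""
    return num_samples * input_shift_ms(sample_rate_hz) / 1000.0


print(input_shift_ms(16000))             # 0.0625
print(audio_duration_sec(32000, 16000))  # 2.0
```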

Example

From espnet/egs2/librispeech/asr1, the following command runs the decoding stage with a pretrained ESPnet2 model:

./run.sh --stage 12  --use_streaming false --skip_data_prep true --skip_train true --download_model byan/librispeech_asr_train_asr_conformer_raw_bpe_batch_bins30000000_accum_grad3_optim_conflr0.001_sp

The latency and RTF results for the test_clean subset of the Librispeech test sets can be found in espnet/egs2/librispeech/asr1/exp/byan/librispeech_asr_train_asr_conformer_raw_bpe_batch_bins30000000_accum_grad3_optim_conflr0.001_sp/decode_asr_lm_lm_train_lm_transformer2_en_bpe5000_valid.loss.ave_asr_model_valid.acc.ave/test_clean/logdir/calculate_rtf.log:

# ../../../utils/calculate_rtf.py --log-dir exp/byan/librispeech_asr_train_asr_conformer_raw_bpe_batch_bins30000000_accum_grad3_optim_conflr0.001_sp/decode_asr_lm_lm_train_lm_transformer2_en_bpe5000_valid.loss.ave_asr_model_valid.acc.ave/test_clean/logdir --log-name asr_inference --input-shift 0.0625 --start-times-marker "speech length" --end-times-marker "best hypo"
Total audio duration: 19452.481 [sec]
Total decoding time: 137762.231 [sec]
RTF: 7.082
Latency: 52581.004 [ms/sentence]
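The numbers in this log can be reproduced by hand. The sketch below assumes that RTF is the total decoding time divided by the total audio duration, and that latency is the total decoding time divided by the number of decoded sentences. The sentence count of 2620 is an assumption (it does not appear in the log); it is the commonly cited size of test_clean. The function names are hypothetical and this is not the code of calculate_rtf.py.

```python
def real_time_factor(total_decoding_time_sec: float, total_audio_duration_sec: float) -> float:
    """Decoding time spent per second of audio; values above 1 are slower than real time."""
    return total_decoding_time_sec / total_audio_duration_sec


def mean_latency_ms(total_decoding_time_sec: float, num_sentences: int) -> float:
    """Average decoding time per sentence, in milliseconds."""
    return total_decoding_time_sec * 1000.0 / num_sentences


total_audio = 19452.481      # "Total audio duration" from the log, in seconds
total_decoding = 137762.231  # "Total decoding time" from the log, in seconds
num_sentences = 2620         # assumed number of utterances in test_clean

rtf = real_time_factor(total_decoding, total_audio)
latency = mean_latency_ms(total_decoding, num_sentences)
print(f"RTF: {rtf:.3f}")
print(f"Latency: {latency:.3f} [ms/sentence]")
```

An RTF of about 7 means that decoding took roughly seven times longer than the audio itself in this run.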

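Limitations

Only non-streaming inference mode is currently supported.
Decoding stage 12 in asr.sh runs the RTF and latency calculation automatically if asr_inference_tool == "espnet2.bin.asr_inference"; other inference tools such as k2 and maskctc remain to be implemented.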

Transducer ASR

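Important: If you encounter any issue related to warp-transducer, please open an issue in our fork repository.

ESPnet2 supports models trained with the (RNN-)Transducer loss, a.k.a. Transducer models. There are currently two versions of these models in ESPnet2: one under asr and the other under asr_transducer. The first version is designed as a supplement to the CTC-Attention ASR models, while the second is designed independently and dedicated to the Transducer task. For the latter, we use ESPnetASRTransducerModel instead of ESPnetASRModel, and a new task called ASRTransducerTask replaces ASRTask.

For the user, this means two things. First, some features or modules may not be supported depending on the version used. Second, the usage of common ASR features or modules may differ between the models. In addition, some core modules (e.g. preencoder or postencoder) may be missing in the standalone version until they are validated.

The following sections of this tutorial are dedicated to the version under asr_transducer. Users should therefore note that most of the features described here may not be available in the other version.

General usage

To enable Transducer model training or decoding in your experiments, the following option should be supplied to asr.sh in run.sh: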

asr.sh --asr_task asr_transducer [...]

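For Transducer loss computation during training, we rely by default on a fork of warp-transducer. See here for the installation procedure.

Note: We provide FastEmit regularization [Yu et al., 2021] during loss computation. To enable it, set fastemit_lambda in model_conf: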

model_conf:
  fastemit_lambda: Regularization parameter for FastEmit. (float, default = 0.0)

In addition, we support training with the Pruned RNN-T loss [Kuang et al. 2022] provided by the k2 toolkit. To enable it, set use_k2_pruned_loss to True in model_conf. The loss computation can then be controlled through k2_pruned_loss_args in model_conf:

model_conf:
  use_k2_pruned_loss: True
  k2_pruned_loss_args:
    prune_range: How many tokens by frame are used to compute the pruned loss. (int, default = 5)
    simple_loss_scaling: The weight to scale the simple loss after warm-up. (float, default = 0.5)
    lm_scale: The scale factor to smooth the LM part. (float, default = 0.0)
    am_scale: The scale factor to smooth the AM part. (float, default = 0.0)
    loss_type: Define the type of path to take for loss computation, either 'regular', 'smoothed' or 'constrained'. (str, default = "regular")

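Note: Because this version can limit the number of tokens emitted per time step during training, we also provide the parameter validation_nstep. It lets users apply a similar constraint during validation, when reporting CER and/or WER: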

model_conf:
  validation_nstep: Maximum number of symbol expansions at each time step when reporting CER or/and WER using mAES.

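For more information, see the Inference section and the "modified Adaptive Expansion Search" algorithm.

Architecture

The architecture is composed of three modules: encoder, decoder, and joint network. Each module has one (or three) configuration(s) with various parameters to set up its internal parts. The following sections describe the mandatory and optional parameters for each module.

Encoder

For the encoder, we propose a single encoder type encapsulating the following blocks: Branchformer, Conformer, Conv 1D, and E-Branchformer. It is similar to the custom encoder in ESPnet1, meaning we do not need to set the parameter encoder: [type] here. Instead, the encoder architecture is defined by three configurations passed to encoder_conf:

input_conf (Dict): The configuration for the input block.
main_conf (Dict): The main configuration, containing the parameters shared by all blocks.
body_conf (List[Dict]): The list of configurations for each block of the encoder architecture, excluding the input block.

The first and second configurations are optional. If needed, the following parameters can be modified in each configuration: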

main_conf:
  pos_wise_act_type: Conformer position-wise feed-forward activation type. (str, default = "swish")
  conv_mod_act_type: Conformer convolution module activation type. (str, default = "swish")
  pos_enc_dropout_rate: Dropout rate for the positional encoding layer, if used. (float, default = 0.0)
  pos_enc_max_len: Positional encoding maximum length. (int, default = 5000)
  simplified_att_score: Whether to use simplified attention score computation. (bool, default = False)
  norm_type: X-former normalization module type. (str, default = "layer_norm")
  conv_mod_norm_type: Branchformer convolution module normalization type. (str, default = "layer_norm")
  after_norm_eps: Epsilon value for the final normalization module. (float, default = 1e-05 or 0.25 for BasicNorm)
  after_norm_partial: Partial value for the final normalization module, if norm_type = 'rms_norm'. (float, default = -1.0)
  blockdrop_rate: Probability threshold of dropping out each encoder block. (float, default = 0.0)
  # For more information on the parameters below, please refer to espnet2/asr_transducer/activation.py
  ftswish_threshold: Threshold value for FTSwish activation formulation.
  ftswish_mean_shift: Mean shifting value for FTSwish activation formulation.
  hardtanh_min_val: Minimum value of the linear region range for HardTanh activation. (float, default = -1.0)
  hardtanh_max_val: Maximum value of the linear region range for HardTanh. (float, default = 1.0)
  leakyrelu_neg_slope: Negative slope value for LeakyReLU activation formulation.
  smish_alpha: Alpha value for Smish variant activation formulation. (float, default = 1.0)
  smish_beta: Beta value for Smish variant activation formulation. (float, default = 1.0)
  softplus_beta: Beta value for softplus activation formulation in Mish activation. (float, default = 1.0)
  softplus_threshold: Values above this revert to a linear function in Mish activation. (int, default = 20)
  swish_beta: Beta value for E-Swish activation formulation. (float, default = 20)

input_conf:
  block_type: Input block type, either "conv2d" or "vgg". (str, default = "conv2d")
  conv_size: Convolution output size. For "vgg", the two convolution outputs can be controlled by passing a tuple. (int, default = 256)
  subsampling_factor: Subsampling factor of the input block, either 2 (only conv2d), 4 or 6. (int, default = 4)

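The only mandatory configuration is body_conf, which defines the architecture of the encoder body block by block. Each block has its own set of mandatory and optional parameters depending on its type (defined by block_type):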

    # Branchformer
    - block_type: branchformer
      hidden_size: Hidden (and output) dimension. (int)
      linear_size: Dimension of the Linear layers. (int)
      conv_mod_kernel_size: Size of the convolving kernel in the ConvolutionalSpatialGatingUnit module. (int)
      heads (optional): Number of heads in multi-head attention. (int, default = 4)
      norm_eps (optional): Epsilon value for the normalization module. (float, default = 1e-05 or 0.25 for BasicNorm)
      norm_partial (optional): Partial value for the normalization module, if norm_type = 'rms_norm'. (float, default = -1.0)
      conv_mod_norm_eps (optional): Epsilon value for ConvolutionalSpatialGatingUnit module normalization. (float, default = 1e-05 or 0.25 for BasicNorm)
      conv_mod_norm_partial (optional): Partial value for the ConvolutionalSpatialGatingUnit module normalization, if conv_norm_type = 'rms_norm'. (float, default = -1.0)
      dropout_rate (optional): Dropout rate for some intermediate layers. (float, default = 0.0)
      att_dropout_rate (optional): Dropout rate for the attention module. (float, default = 0.0)

    # Conformer
    - block_type: conformer
      hidden_size: Hidden (and output) dimension. (int)
      linear_size: Dimension of feed-forward module. (int)
      conv_mod_kernel_size: Size of the convolving kernel in the ConformerConvolution module. (int)
      heads (optional): Number of heads in multi-head attention. (int, default = 4)
      norm_eps (optional): Epsilon value for normalization module. (float, default = 1e-05 or 0.25 for BasicNorm)
      norm_partial (optional): Partial value for the normalization module, if norm_type = 'rms_norm'. (float, default = -1.0)
      conv_mod_norm_eps (optional): Epsilon value for Batchnorm1d in the ConformerConvolution module. (float, default = 1e-05)
      conv_mod_norm_momentum (optional): Momentum value for Batchnorm1d in the ConformerConvolution module. (float, default = 0.1)
      dropout_rate (optional): Dropout rate for some intermediate layers. (float, default = 0.0)
      att_dropout_rate (optional): Dropout rate for the attention module. (float, default = 0.0)
      pos_wise_dropout_rate (optional): Dropout rate for the position-wise feed-forward module. (float, default = 0.0)

    # Conv 1D
    - block_type: conv1d
      output_size: Output size. (int)
      kernel_size: Size of the convolving kernel. (int or Tuple)
      stride (optional): Stride of the sliding blocks. (int or tuple, default = 1)
      dilation (optional): Parameter to control the stride of elements within the neighborhood. (int or tuple, default = 1)
      groups (optional): Number of blocked connections from input channels to output channels. (int, default = 1)
      bias (optional): Whether to add a learnable bias to the output. (bool, default = True)
      relu (optional): Whether to use a ReLU activation after convolution. (bool, default = True)
      batch_norm (optional): Whether to use batch normalization after convolution. (bool, default = False)
      dropout_rate (optional): Dropout rate for the Conv1d outputs. (float, default = 0.0)

    # E-Branchformer
    - block_type: ebranchformer
      hidden_size: Hidden (and output) dimension. (int)
      linear_size: Dimension of the feed-forward module and other linear layers. (int)
      conv_mod_kernel_size: Size of the convolving kernel in the ConvolutionalSpatialGatingUnit module. (int)
      depthwise_conv_kernel_size: Size of the convolving kernel in the DepthwiseConvolution module. (int, default = conv_mod_kernel_size)
      heads (optional): Number of heads in multi-head attention. (int, default = 4)
      norm_eps (optional): Epsilon value for the normalization module. (float, default = 1e-05 or 0.25 for BasicNorm)
      norm_partial (optional): Partial value for the normalization module, if norm_type = 'rms_norm'. (float, default = -1.0)
      conv_mod_norm_eps (optional): Epsilon value for ConvolutionalSpatialGatingUnit module normalization. (float, default = 1e-05 or 0.25 for BasicNorm)
      conv_mod_norm_partial (optional): Partial value for the ConvolutionalSpatialGatingUnit module normalization, if conv_norm_type = 'rms_norm'. (float, default = -1.0)
      dropout_rate (optional): Dropout rate for some intermediate layers. (float, default = 0.0)
      att_dropout_rate (optional): Dropout rate for the attention module. (float, default = 0.0)

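In addition, each block has a parameter num_blocks to build N repetitions of the defined block (int, default = 1). This is useful when you want a group of blocks sharing the same parameters without writing out each configuration separately.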
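The effect of num_blocks can be pictured as expanding the list of block configurations before the encoder is built. The function below is a hypothetical sketch of that expansion, not ESPnet2's actual building code.

```python
import copy
from typing import Any, Dict, List


def expand_body_conf(body_conf: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Repeat each block configuration num_blocks times (default: once)."""
    expanded: List[Dict[str, Any]] = []
    for block in body_conf:
        block = dict(block)                    # do not modify the caller's dict
        repeats = block.pop("num_blocks", 1)   # num_blocks is consumed here
        expanded.extend(copy.deepcopy(block) for _ in range(repeats))
    return expanded


body_conf = [
    {"block_type": "conv1d", "output_size": 128, "kernel_size": 3},
    {"block_type": "conv1d", "output_size": 256, "kernel_size": 2},
    {"block_type": "conformer", "hidden_size": 256, "linear_size": 1024,
     "conv_mod_kernel_size": 31, "num_blocks": 14},
]
expanded = expand_body_conf(body_conf)
print(len(expanded))
print([block["block_type"] for block in expanded[:4]])
```

Here three entries describe a 16-block body: two Conv 1D blocks followed by 14 identical Conformer blocks.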

Example 1: conv 2D + 2x Conv 1D + 14x Conformer.

encoder_conf:
    main_conf:
      pos_wise_act_type: swish
      pos_enc_dropout_rate: 0.1
      conv_mod_act_type: swish
    input_conf:
      block_type: conv2d
      conv_size: 256
      subsampling_factor: 4
    body_conf:
    - block_type: conv1d
      output_size: 128
      kernel_size: 3
    - block_type: conv1d
      output_size: 256
      kernel_size: 2
    - block_type: conformer
      linear_size: 1024
      hidden_size: 256
      heads: 8
      dropout_rate: 0.1
      pos_wise_dropout_rate: 0.1
      att_dropout_rate: 0.1
      conv_mod_kernel_size: 31
      num_blocks: 14

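Decoder

For the decoder, four types of blocks are available: stateless ('stateless'), RNN ('rnn'), MEGA ('mega'), or RWKV ('rwkv'). Contrary to the encoder, the parameters are shared across the blocks, meaning only one block needs to be defined in the configuration. The type of the block stack is defined by passing the corresponding type string to the decoder parameter. The internal parts are defined by the decoder_conf field, which has the following controllable parameters: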

decoder_conf:
  embed_size: Size of the embedding layer (int, default = 256).
  num_blocks: Number of decoder blocks/layers (int, default = 4 for MEGA or 1 for RNN).
  rnn_type (RNN only): Type of RNN cells (str, default = "lstm").
  hidden_size (RNN only): Size of the hidden layers (int, default = 256).
  block_size (MEGA/RWKV only): Size of the block's input/output (int, default = 512).
  linear_size (MEGA/RWKV only): Feed-Forward module hidden size (int, default = 1024).
  attention_size (RWKV only): Hidden-size of the attention module. (int, default = None).
  context_size (RWKV only): Context size for the WKV kernel module (int, default = 1024).
  qk_size (MEGA only): Shared query and key size for attention module (int, default = 128).
  v_size (MEGA only): Value size for attention module (int, default = 1024).
  chunk_size (MEGA only): Chunk size for attention computation (int, default = -1, i.e. full context).
  num_heads (MEGA only): Number of EMA heads (int, default = 4).
  rel_pos_bias (MEGA only): Type of relative position bias in attention module (str, default = "simple").
  max_positions (MEGA only): Maximum number of position for RelativePositionBias (int, default = 2048).
  truncation_length (MEGA only): Maximum length for truncation in EMA module (int, default = None).
  normalization_type (MEGA/RWKV only): Normalization layer type (str, default = "layer_norm").
  normalization_args (MEGA/RWKV only): Normalization layer arguments (dict, default = {}).
  activation_type (MEGA only): Activation function type (str, default = "swish").
  activation_args (MEGA only): Activation function arguments (dict, default = {}).
  rescale_every (RWKV only): Whether to rescale input every N blocks during inference (int, default = 0)
  dropout_rate (excl. RWKV): Dropout rate for main block modules (float, default = 0.0).
  embed_dropout_rate: Dropout rate for embedding layer (float, default = 0.0).
  att_dropout_rate (MEGA/RWKV only): Dropout rate for the attention module.
  ema_dropout_rate (MEGA only): Dropout rate for the EMA module.
  ffn_dropout_rate (MEGA/RWKV only): Dropout rate for the feed-forward module.

Example 1: RNN decoder.

decoder: rnn
decoder_conf:
    rnn_type: lstm
    num_layers: 2
    embed_size: 256
    hidden_size: 256
    dropout_rate: 0.1
    embed_dropout_rate: 0.1

Example 2: MEGA decoder.

decoder: mega
decoder_conf:
    block_size: 256
    linear_size: 2048
    qk_size: 128
    v_size: 1024
    max_positions: 1024
    num_heads: 4
    rel_pos_bias_type: "rotary"
    chunk_size: 256
    num_blocks: 6
    dropout_rate: 0.1
    ffn_dropout_rate: 0.1
    att_dropout_rate: 0.1
    embed_dropout_rate: 0.1

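Joint network

Currently, we only provide the standard joint network module composed of three linear layers and an activation function. The module definition is optional, but the following parameters can be modified through the configuration parameter joint_network_conf: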

joint_network_conf:
  joint_space_size: Size of the joint space (int, default = 256).
  joint_act_type: Type of activation in the joint network (str, default = "tanh").

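The options related to the activation function can also be modified through the parameters introduced in the Encoder section (see the main_conf description).

Multi-task learning

We also support multi-task learning with two auxiliary tasks: CTC (connectionist temporal classification) and cross-entropy with a label smoothing option (called LM loss here). The contribution of the auxiliary tasks to the overall task is defined as follows:

L_tot = (λ_trans x L_trans) + (λ_auxCTC x L_auxCTC) + (λ_auxLM x L_auxLM)

where the losses (L_*) are, in order: the Transducer loss, the CTC loss, and the LM loss. The lambda values define their respective contributions to the total loss. Each task can be parameterized with the following options, passed to model_conf: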

model_conf:
  transducer_weight: Weight of the Transducer loss (float, default = 1.0)
  auxiliary_ctc_weight: Weight of the CTC loss. (float, default = 0.0)
  auxiliary_ctc_dropout_rate: Dropout rate for the CTC loss inputs. (float, default = 0.0)
  auxiliary_lm_loss_weight: Weight of the LM loss. (float, default = 0.2)
  auxiliary_lm_loss_smoothing: Smoothing rate for LM loss. If > 0, label smoothing is enabled. (float, default = 0.0)

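Note: No other auxiliary task is currently supported in ESPnet2.

Inference

Various decoding algorithms are also available for Transducer by setting the search_type parameter in your decoding configuration:

Beam search algorithm without prefix search [Graves, 2012]. (search_type: default)
Time Synchronous Decoding [Saon et al., 2020]. (search_type: tsd)
Alignment-Length Synchronous Decoding [Saon et al., 2020]. (search_type: alsd)
modified Adaptive Expansion Search, based on [Kim et al., 2021] and [Boyer et al., 2021]. (search_type: maes)

The algorithms share two parameters to control the beam size (beam_size) and the normalization of partial/final hypotheses (score_norm). In addition, three algorithms have specific parameters:

Time Synchronous Decoding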

search_type: tsd
max_sym_exp: Number of maximum symbol expansions at each time step. (int > 1, default = 3)

Alignment-Length Synchronous Decoding

search_type: alsd
u_max: Maximum expected target sequence length. (int, default = 50)

modified Adaptive Expansion Search

search_type: maes
nstep: Number of maximum expansion steps at each time step (int, default = 2)
expansion_gamma: Number of additional candidates in expanded hypotheses selection. (int, default = 2)
expansion_beta: Allowed logp difference for prune-by-value method. (float, default = 2.3)

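Note: Except for the default algorithm, the described parameters control the trade-off between performance and decoding speed. The optimal value of each parameter is task-dependent; a higher value typically increases decoding time in exchange for better performance, while a lower value reduces decoding time at a possible cost in performance.

Note 2: The algorithms in the standalone version are the same as in the other version. However, due to design choices, some parts were reworked and minor optimizations were added along the way.

Streaming

To enable streaming capabilities for Transducer models, we support dynamic chunk training and chunk-by-chunk decoding as proposed in [Zhang et al., 2021]. Our implementation is based on the version proposed in Icefall, which itself derives from the original WeNet implementation.

For a complete explanation of the different procedures and parameters, we refer the reader to the corresponding paper.

Training

To train a streaming model, the parameter dynamic_chunk_training should be set to True in main_conf (see the Encoder section). From there, the user has access to two parameters to control the dynamic chunk selection (short_chunk_threshold and short_chunk_size) and another to control the left context in the causal convolution and attention modules (num_left_chunks).

All these parameters can be configured through main_conf, introduced in the Encoder section: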

dynamic_chunk_training: Whether to train streaming model with dynamic chunks. (bool, default = False)
short_chunk_threshold: Chunk length threshold (in percent) for dynamic chunk selection. (float, default = 0.75)
short_chunk_size: Minimum number of frames during dynamic chunk training. (int, default = 25)
num_left_chunks: The number of left chunks the attention module can see during training, where the actual size is defined by `short_chunk_threshold` and `short_chunk_size`. (int, default = 0, i.e. full context)

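Decoding

To perform chunk-by-chunk inference, the parameter streaming should be set to True in the decoding configuration (otherwise, offline decoding is performed). Two parameters are available to control the decoding process: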

decoding_window: The input audio length, in milliseconds, to process during decoding. (int, default = 640)
left_context: Number of previous frames (AFTER subsampling) the attention module can see in current chunk. (int, default = 32)
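To relate these two parameters to more familiar units, the following sketch converts decoding_window to audio samples and encoder frames, and left_context to milliseconds. It assumes a 16 kHz sampling rate, a 10 ms feature frame shift, and the default subsampling_factor of 4, and it ignores edge effects of windowing and convolution; the helper names are hypothetical.

```python
def window_in_samples(decoding_window_ms: int, sample_rate_hz: int = 16000) -> int:
    """Number of audio samples covered by one decoding window."""
    return decoding_window_ms * sample_rate_hz // 1000


def window_in_encoder_frames(
    decoding_window_ms: int, frame_shift_ms: int = 10, subsampling_factor: int = 4
) -> int:
    """Approximate number of frames per window AFTER subsampling."""
    return decoding_window_ms // (frame_shift_ms * subsampling_factor)


def left_context_ms(
    left_context_frames: int, frame_shift_ms: int = 10, subsampling_factor: int = 4
) -> int:
    """Approximate audio span covered by the left context frames."""
    return left_context_frames * frame_shift_ms * subsampling_factor


print(window_in_samples(640))         # 10240 samples
print(window_in_encoder_frames(640))  # 16 frames after subsampling
print(left_context_ms(32))            # 1280 ms of past context
```

Under these assumptions, the default settings process 16 encoder frames at a time while attending to roughly 1.28 seconds of history.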

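Note: All search algorithms except ALSD support chunk-by-chunk inference.

FAQ

How do I add a new block type to the custom encoder?

The paths given are relative to the directory espnet2/asr_transducer/encoder/

Adding support for a new block type can be done in three main steps:

Write your block class in encoder/blocks/. The class should have the following methods: __init__(...), forward(...) (training + offline), chunk_forward(...) (online decoding), and reset_streaming_cache(...) (online cache definition). For more details on implementing the internal parts, we refer the user to the existing block definitions and the Streaming section.
In building.py, write a block constructor method and add a new condition for your block type in build_body_blocks(...) that calls it. If additional parameters need to be shared across blocks, you can add them in build_main_parameters(...) and pass main_conf to your constructor.
In validation.py, add a new condition to validate_block_arguments(...) so that the mandatory block parameters are set and validated before building (if not already covered).

For more information or examples, see the named files. If you need to add other classes related to the new block, they should be added inside the block class or under modules/.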