
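Usage

If you are a new user, we recommend reading the ESPnet2 tutorial, since ESPnet1 is the older implementation. Most development has now moved to ESPnet2. Because of this shift, some of the information in this document may be outdated.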

Directory structure

espnet/              # Python modules
utils/               # Utility scripts of ESPnet
test/                # Unit test
test_utils/          # Unit test for executable scripts
egs/                 # The complete recipe for each corpora
    an4/             # AN4 is tiny corpus and can be obtained freely, so it might be suitable for tutorial
      asr1/          # ASR recipe
          - run.sh   # Executable script
          - cmd.sh   # To select the backend for job scheduler
          - path.sh  # Setup script for environment variables
          - conf/    # Containing Configuration files
          - steps/   # The steps scripts from Kaldi
          - utils/   # The utils scripts from Kaldi
      tts1/          # TTS recipe
    ...

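Execution of example scripts

Move to an example directory under the egs directory. We provide recipes for several major ASR benchmarks, including WSJ, CHiME-4, and TED. The following directory is an example of running an ASR experiment with the CMU Census Database (AN4) recipe.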

$ cd egs/an4/asr1

Once inside the directory, execute the following main script with the chainer backend:

$ ./run.sh --backend chainer

or execute the following main script with the pytorch backend:

$ ./run.sh --backend pytorch

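With this main script, you can run the full ASR experiment pipeline, including

Data download
Data preparation (Kaldi style)
Feature extraction (Kaldi style)
Dictionary and JSON format data preparation
Training based on chainer or pytorch
Recognition and scoring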

Logging

You can monitor the training progress (loss and accuracy for the training and validation data) with the following command

$ tail -f exp/${expdir}/train.log

When we use ./run.sh --verbose 0 (--verbose 0 is the default in most recipes), it gives the following information

epoch       iteration   main/loss   main/loss_ctc  main/loss_att  validation/main/loss  validation/main/loss_ctc  validation/main/loss_att  main/acc    validation/main/acc  elapsed_time  eps
:
:
6           89700       63.7861     83.8041        43.768                                                                                   0.731425                         136184        1e-08
6           89800       71.5186     93.9897        49.0475                                                                                  0.72843                          136320        1e-08
6           89900       72.1616     94.3773        49.9459                                                                                  0.730052                         136473        1e-08
7           90000       64.2985     84.4583        44.1386        72.506                94.9823                   50.0296                   0.740617    0.72476              137936        1e-08
7           90100       81.6931     106.74         56.6462                                                                                  0.733486                         138049        1e-08
7           90200       74.6084     97.5268        51.6901                                                                                  0.731593                         138175        1e-08
     total [#################.................................] 35.54%
this epoch [#####.............................................] 10.84%
     91300 iter, 7 epoch / 20 epochs
   0.71428 iters/sec. Estimated time to finish: 2 days, 16:23:34.613215.

Note that the an4 recipe uses --verbose 1 by default, since this recipe is often used for debugging purposes.
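The progress footer in the sample log can be sanity-checked by hand. The sketch below assumes that the remaining time is simply the number of remaining iterations divided by the reported throughput; this is an assumption about how the estimate is derived, and the helper `estimate_remaining_seconds` is hypothetical, not part of ESPnet.

```python
from datetime import timedelta


def estimate_remaining_seconds(done_iters, percent_done, iters_per_sec):
    """Estimate remaining training time from the progress footer values."""
    # Total iterations implied by the completed fraction.
    total_iters = done_iters / (percent_done / 100.0)
    remaining_iters = total_iters - done_iters
    return remaining_iters / iters_per_sec


# Values copied from the sample log footer:
# 91300 iterations done, 35.54% of the total, 0.71428 iters/sec.
remaining = estimate_remaining_seconds(91300, 35.54, 0.71428)
print(timedelta(seconds=round(remaining)))
```

With these values the result comes out at roughly 2 days and 16 hours, which is in line with the estimate printed in the log; small differences are expected because the percentage is rounded.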

In addition, Tensorboard events are automatically logged in the tensorboard/${expname} folder. Therefore, once Tensorboard is installed, you can easily compare several experiments with

$ tensorboard --logdir tensorboard

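and connect to the given address (default: localhost:6006) to browse the logged events. Note that we do not include the installation of Tensorboard, in order to keep our installation process simple. If you want to use Tensorboard, please install it manually (pip install tensorflow; pip install tensorboard).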

Change options in run.sh

We rely on utils/parse_options.sh to parse command-line arguments in shell scripts, and it is used in run.sh:

e.g. if the script has an ngpu option

#!/usr/bin/env bash
# run.sh
ngpu=1
. utils/parse_options.sh
echo ${ngpu}

Then you can change the value as follows:

$ ./run.sh --ngpu 2
2

Use of GPU

Training: If you want to use GPUs in your experiment, please set --ngpu option in run.sh appropriately, e.g.,

  # use single gpu
  $ ./run.sh --ngpu 1

  # use multi-gpu
  $ ./run.sh --ngpu 3

  # if you want to specify gpus, set CUDA_VISIBLE_DEVICES as follows
  # (Note that if you use slurm, this specification is not needed)
  $ CUDA_VISIBLE_DEVICES=0,1,2 ./run.sh --ngpu 3

  # use cpu
  $ ./run.sh --ngpu 0

The default setup uses a single GPU (--ngpu 1).

ASR decoding: ESPnet also supports the GPU-based decoding for fast recognition.
- Please manually remove the following lines in run.sh:
```
#### use CPU for decoding
ngpu=0
```
- Set 1 or more values for the --batchsize option in asr_recog.py to enable GPU decoding
- And execute the script (e.g., run.sh --stage 5 --ngpu 1)
- You will get a significant speed-up by using GPU decoding

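ESPnet1 Transducer

Important: If you encounter any issue related to the Transducer loss, please open an issue in our fork of warp-transducer.

ESPnet supports models trained with the Transducer loss, i.e. Transducer models. To train such a model, the following should be set in the training config: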

criterion: loss
model-module: "espnet.nets.pytorch_backend.e2e_asr_transducer:E2E"

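Architecture

ESPnet currently provides several Transducer architectures:

RNN-Transducer (default, e.g.: etype: blstm with dtype: lstm)
Custom-Transducer (e.g.: etype: custom and dtype: custom)
Mixed Custom/RNN-Transducer (e.g.: etype: custom with dtype: lstm)

The architecture specification is separated into the encoder and decoder parts, which the user defines through etype and dtype, respectively, in the training config. If custom is specified for either argument, a customizable architecture is used for the corresponding part. Otherwise, an RNN-based architecture is selected.

Here, the custom architecture is a unique feature of the Transducer model in ESPnet. It adds flexibility to the architecture definition and eases the reproduction of some SOTA Transducer models that mix different layer types or parameters within the same model part (encoder or decoder). As such, the architecture definition differs from the RNN architecture:

Each block (or layer) of the custom architecture should be specified individually through the enc-block-arch or/and dec-block-arch parameters: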

# e.g: Conv-Transformer encoder
etype: custom
enc-block-arch:
        - type: conv1d
          idim: 80
          odim: 32
          kernel_size: [3, 7]
          stride: [1, 2]
        - type: conv1d
          idim: 32
          odim: 32
          kernel_size: 3
          stride: 2
        - type: conv1d
          idim: 32
          odim: 384
          kernel_size: 3
          stride: 1
        - type: transformer
          d_hidden: 384
          d_ff: 1536
          heads: 4

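Different block types are allowed for the custom encoder (tdnn, conformer or transformer) and the custom decoder (causal-conv1d or transformer). Each type has a set of mandatory and optional parameters: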

# 1D convolution (TDNN) block
- type: conv1d
  idim: [Input dimension. (int)]
  odim: [Output dimension. (int)]
  kernel_size: [Size of the context window. (int or tuple)]
  stride (optional): [Stride of the sliding blocks. (int or tuple, default = 1)]
  dilation (optional): [Parameter to control the stride of elements within the neighborhood. (int or tuple, default = 1)]
  groups (optional): [Number of blocked connections from input channels to output channels. (int, default = 1)]
  bias (optional): [Whether to add a learnable bias to the output. (bool, default = True)]
  use-relu (optional): [Whether to use a ReLU activation after convolution. (bool, default = True)]
  use-batchnorm (optional): [Whether to use batch normalization after convolution. (bool, default = False)]
  dropout-rate (optional): [Dropout-rate for TDNN block. (float, default = 0.0)]

# Transformer
- type: transformer
  d_hidden: [Input/output dimension of Transformer block. (int)]
  d_ff: [Hidden dimension of the Feed-forward module. (int)]
  heads: [Number of heads in multi-head attention. (int)]
  dropout-rate (optional): [Dropout-rate for Transformer block. (float, default = 0.0)]
  pos-dropout-rate (optional): [Dropout-rate for positional encoding module. (float, default = 0.0)]
  att-dropout-rate (optional): [Dropout-rate for attention module. (float, default = 0.0)]

# Conformer
- type: conformer
  d_hidden: [Input/output dimension of Conformer block (int)]
  d_ff: [Hidden dimension of the Feed-forward module. (int)]
  heads: [Number of heads in multi-head attention. (int)]
  macaron_style: [Whether to use macaron style. (bool)]
  use_conv_mod: [Whether to use convolutional module. (bool)]
  conv_mod_kernel (required if use_conv_mod = True): [Number of kernel in convolutional module. (int)]
  dropout-rate (optional): [Dropout-rate for Conformer block. (float, default = 0.0)]
  pos-dropout-rate (optional): [Dropout-rate for positional encoding module. (float, default = 0.0)]
  att-dropout-rate (optional): [Dropout-rate for attention module. (float, default = 0.0)]

# Causal Conv1d
- type: causal-conv1d
  idim: [Input dimension. (int)]
  odim: [Output dimension. (int)]
  kernel_size: [Size of the context window. (int)]
  stride (optional): [Stride of the sliding blocks. (int, default = 1)]
  dilation (optional): [Parameter to control the stride of elements within the neighborhood. (int, default = 1)]
  groups (optional): [Number of blocked connections from input channels to output channels. (int, default = 1)]
  bias (optional): [Whether to add a learnable bias to the output. (bool, default = True)]
  use-relu (optional): [Whether to use a ReLU activation after convolution. (bool, default = True)]
  use-batchnorm (optional): [Whether to use batch normalization after convolution. (bool, default = False)]
  dropout-rate (optional): [Dropout-rate for TDNN block. (float, default = 0.0)]

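The defined architecture can be repeated by specifying the total number of blocks/layers in the architecture through the enc-block-repeat or/and dec-block-repeat parameters: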

# e.g.: 2x (Causal-Conv1d + Transformer) decoder
dtype: custom
dec-block-arch:
        - type: causal-conv1d
          idim: 256
          odim: 256
          kernel_size: 5
        - type: transformer
          d_hidden: 256
          d_ff: 256
          heads: 4
          dropout-rate: 0.1
          att-dropout-rate: 0.4
dec-block-repeat: 2
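To make the effect of the repeat parameter concrete, here is a minimal sketch. It assumes that repetition tiles the defined blocks in order, as the "2x (Causal-Conv1d + Transformer)" comment suggests; the helper `expand_blocks` is hypothetical and is not the ESPnet implementation.

```python
def expand_blocks(block_arch, repeat=1):
    """Return the full list of block definitions after repetition."""
    if repeat < 1:
        raise ValueError("repeat must be >= 1")
    # Copy each dict so that editing one expanded block does not affect another.
    return [dict(block) for _ in range(repeat) for block in block_arch]


# Same two-block decoder definition as in the config example.
dec_block_arch = [
    {"type": "causal-conv1d", "idim": 256, "odim": 256, "kernel_size": 5},
    {"type": "transformer", "d_hidden": 256, "d_ff": 256, "heads": 4},
]
layers = expand_blocks(dec_block_arch, repeat=2)
print([block["type"] for block in layers])
```

Under this assumption the decoder ends up with four blocks, alternating causal-conv1d and transformer.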

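Multi-task learning

We also support multi-task learning with various auxiliary losses, such as: CTC, cross-entropy with label smoothing (LM loss), auxiliary Transducer, and symmetric KL divergence. The four losses can be trained simultaneously with the main Transducer loss to jointly optimize the total loss, defined as:

$$ \mathcal{L}_{tot} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2 + \lambda_3 \mathcal{L}_3 + \lambda_4 \mathcal{L}_4 + \lambda_5 \mathcal{L}_5 $$

where the losses are, in order: the main Transducer loss, the CTC loss, the auxiliary Transducer loss, the symmetric KL divergence loss, and the LM loss. The lambda values define the contribution of each loss to the total loss. Each loss can be selected or omitted independently, depending on the task.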
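The weighted combination described above can be sketched in a few lines. This is only an illustration: `total_loss` is a hypothetical helper, the loss values are made up, and the main Transducer weight of 1.0 is an assumption; the CTC and LM weights reuse the defaults listed in the configuration options (0.5 and 0.2).

```python
def total_loss(losses, weights):
    """Weighted sum of the enabled losses; omitted losses contribute nothing."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Only three of the five losses are enabled in this illustrative setup.
losses = {"transducer": 10.0, "ctc": 4.0, "lm": 2.0}
weights = {"transducer": 1.0, "ctc": 0.5, "lm": 0.2}
print(total_loss(losses, weights))
```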

Each loss can be defined in the training config, along with its specific options, as follows:

# Transducer loss (L1)
transducer-loss-weight: [Weight of the main Transducer loss (float)]

# CTC loss (L2)
use-ctc-loss: True
ctc-loss-weight (optional): [Weight of the CTC loss. (float, default = 0.5)]
ctc-loss-dropout-rate (optional): [Dropout rate for encoder output representation. (float, default = 0.0)]

# Auxiliary Transducer loss (L3)
use-aux-transducer-loss: True
aux-transducer-loss-weight (optional): [Weight of the auxiliary Transducer loss. (float, default = 0.4)]
aux-transducer-loss-enc-output-layers (required if use-aux-transducer-loss = True): [List of intermediate encoder layer IDs to compute auxiliary Transducer loss(es). (list)]
aux-transducer-loss-mlp-dim (optional): [Hidden dimension for the MLP network. (int, default = 320)]
aux-transducer-loss-mlp-dropout-rate (optional): [Dropout rate for the MLP network. (float, default = 0.0)]

# Symmetric KL divergence loss (L4)
# Note: It can be only used in addition to the auxiliary Transducer loss.
use-symm-kl-div-loss: True
symm-kl-div-loss-weight (optional): [Weight of the symmetric KL divergence loss. (float, default = 0.2)]

# LM loss (L5)
use-lm-loss: True
lm-loss-weight (optional): [Weight of the LM loss. (float, default = 0.2)]
lm-loss-smoothing-rate (optional): [Smoothing rate for LM loss. If > 0, label smoothing is enabled. (float, default = 0.0)]

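Inference

Various decoding algorithms are also available for Transducer by setting the beam-size and search-type parameters in the decode config.

Greedy search constrained to one emission per time step (beam-size: 1).
Beam search algorithm without prefix search (beam-size: >1 and search-type: default).
Time Synchronous Decoding [Saon et al., 2020] (beam-size: >1 and search-type: tsd).
Alignment-Length Synchronous Decoding [Saon et al., 2020] (beam-size: >1 and search-type: alsd).
N-step Constrained beam search modified from [Kim et al., 2020] (beam-size: >1 and search-type: nsc).
modified Adaptive Expansion Search based on [Kim et al., 2021] and NSC (beam-size: >1 and search-type: maes).

The algorithms share two parameters to control the beam size (beam-size) and the final hypothesis normalization (score-norm-transducer). The specific parameters for each algorithm are: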

# Default beam search
search-type: default

# Time-synchronous decoding
search-type: tsd
max-sym-exp: [Number of maximum symbol expansions at each time step (int)]

# Alignment-length synchronous decoding
search-type: alsd
u-max: [Maximum output sequence length (int)]

# N-step Constrained beam search
search-type: nsc
nstep: [Number of maximum expansion steps at each time step (int)]
        # nstep = max-sym-exp + 1 (blank)
prefix-alpha: [Maximum prefix length in prefix search (int)]

# modified Adaptive Expansion Search
search-type: maes
nstep: [Number of maximum expansion steps at each time step (int, > 1)]
prefix-alpha: [Maximum prefix length in prefix search (int)]
expansion-gamma: [Number of additional candidates in expanded hypotheses selection (int)]
expansion-beta: [Allowed logp difference for prune-by-value method (float, > 0)]

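Except for the default algorithm, the described parameters are used to control the performance and decoding speed. The optimal value for each parameter is task-dependent; a high value will typically increase decoding time in favor of performance, while a low value will reduce decoding time at the expense of performance.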
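As a quick way to see which algorithm-specific parameters go with each search type, the listing above can be turned into a lookup table. The sketch below is illustrative only: `ALGORITHM_PARAMS` and `unset_decode_params` are hypothetical names, and the check says nothing about whether ESPnet itself supplies defaults for parameters left unset.

```python
# Parameters listed for each search type in the decoding options above.
ALGORITHM_PARAMS = {
    "default": [],
    "tsd": ["max-sym-exp"],
    "alsd": ["u-max"],
    "nsc": ["nstep", "prefix-alpha"],
    "maes": ["nstep", "prefix-alpha", "expansion-gamma", "expansion-beta"],
}


def unset_decode_params(config):
    """List the algorithm-specific parameters absent from a decoding config."""
    search_type = config.get("search-type", "default")
    if search_type not in ALGORITHM_PARAMS:
        raise ValueError(f"unknown search-type: {search_type}")
    return [key for key in ALGORITHM_PARAMS[search_type] if key not in config]


# An NSC config that sets nstep but leaves prefix-alpha out.
config = {"beam-size": 5, "search-type": "nsc", "nstep": 2}
print(unset_decode_params(config))
```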

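Additional notes

Similarly to training with CTC, Transducer does not output the validation accuracy. Thus, the optimum model is selected by its loss value (i.e., --recog_model model.loss.best).
There are several differences between MTL and Transducer training/decoding options. Users should refer to espnet/espnet/nets/pytorch_backend/e2e_asr_transducer.py for an overview and to espnet/espnet/nets/pytorch_backend/transducer/arguments for all possible arguments.
FastEmit regularization [Yu et al., 2021] is available through the training parameter --fastemit-lambda (default = 0.0).
RNN-decoder pre-initialization using a language model (LM) is supported. Note that regular decoder keys are expected. The LM state dict keys (predictor.*) will be renamed according to the AM state dict keys (dec.*).
Transformer-decoder pre-initialization using a Transformer LM is not supported yet.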

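Changing the training configuration

The default configurations for training and decoding are written in conf/train.yaml and conf/decode.yaml, respectively. They can be overwritten by specific arguments, e.g.: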

# e.g.
asr_train.py --config conf/train.yaml --batch-size 24
# e.g.--config2 and --config3 are also provided and the latter option can overwrite the former.
asr_train.py --config conf/train.yaml --config2 conf/new.yaml

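With these approaches you need to edit run.sh, which can be inconvenient. Instead of giving arguments directly, we recommend that you modify the yaml file and pass it to run.sh: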

# e.g.
./run.sh --train-config conf/train_modified.yaml
# e.g.
./run.sh --train-config conf/train_modified.yaml --decode-config conf/decode_modified.yaml

We also provide a utility to generate a yaml file from an input yaml file:

# e.g. You can give any parameters as '-a key=value' and '-a' is repeatable.
#      This generates new file at 'conf/train_batch-size24_epochs10.yaml'
./run.sh --train-config $(change_yaml.py conf/train.yaml -a batch-size=24 -a epochs=10)
# e.g. '-o' option specifies the output file name instead of auto named file.
./run.sh --train-config $(change_yaml.py conf/train.yaml -o conf/train2.yaml -a batch-size=24)

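How to set minibatch

From espnet v0.4.0, we have three options in --batch-count to specify the minibatch size (see espnet.utils.batchfy for the implementation):

--batch-count seq --batch-seqs 32 --batch-seq-maxlen-in 800 --batch-seq-maxlen-out 150.
This option is compatible with the old setting before v0.4.0. It counts the minibatch size as the number of sequences and reduces the size when the maximum length of the input or output sequences is greater than 800 or 150, respectively.
--batch-count bin --batch-bins 100000.
This creates the minibatch that has the maximum number of bins under 100,000 in the padded input/output minibatch tensor (i.e., max(ilen) * idim + max(olen) * odim). Basically, this option makes training iterations faster than --batch-count seq. If you already have the best --batch-seqs x config, try --batch-bins $((x * (mean(ilen) * idim + mean(olen) * odim))).
--batch-count frame --batch-frames-in 800 --batch-frames-out 100 --batch-frames-inout 900.
This creates the minibatch that has the maximum number of input, output and input+output frames under 800, 100 and 900, respectively. You can set only some of the --batch-frames-xxx options. Like --batch-bins, this option makes training iterations faster than --batch-count seq. If you already have the best --batch-seqs x config, try --batch-frames-in $((x * mean(ilen) * idim)) --batch-frames-out $((x * mean(olen) * odim)).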
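The suggested conversion from a known-good --batch-seqs value to a --batch-bins starting point can be worked through as follows. This is a sketch of the formula x * (mean(ilen) * idim + mean(olen) * odim) only; `suggest_batch_bins` is a hypothetical helper, and the lengths and dimensions below are made-up illustrative values, not taken from any recipe.

```python
from statistics import mean


def suggest_batch_bins(batch_seqs, ilens, olens, idim, odim):
    """Convert a --batch-seqs value into a --batch-bins starting point."""
    # Average cost of one utterance, counted in bins (input plus output).
    bins_per_utt = mean(ilens) * idim + mean(olens) * odim
    return int(batch_seqs * bins_per_utt)


# Illustrative corpus statistics (assumed values).
ilens = [400, 600, 500]   # input lengths in frames
olens = [20, 40, 30]      # output lengths in tokens
batch_bins = suggest_batch_bins(32, ilens, olens, idim=83, odim=52)
print(f"--batch-count bin --batch-bins {batch_bins}")
```

The result is only a starting point; the value may still need tuning against available GPU memory.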

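How to use finetuning

ESPnet currently supports two finetuning operations: transfer learning and freezing. We expect the user to define the following options in the main training config (e.g.: conf/train*.yaml). If needed, they can be passed directly to (asr|tts|vc)_train.py by adding the prefix -- to the options.

Transfer learning

The transfer learning option is split between encoder initialization (enc-init) and decoder initialization (dec-init). However, the same model can be specified for both options.
Each option takes a snapshot path (e.g.: [espnet_model_path]/results/snapshot.ep.1) or a model path (e.g.: [espnet_model_path]/results/model.loss.best) as argument.
Additionally, a comma-separated list of encoder and decoder modules can be specified through the options enc-init-mods and dec-init-mods to control which modules are transferred.
For each specified module, we only expect a match with the beginning of the target model module name. Thus, multiple modules can be specified with the same key if they share a common prefix.
Mandatory: enc-init: /home/usr/espnet/egs/vivos/asr1/exp/train_nodev_pytorch_train/results/model.loss.best -> specify a model pre-trained on VIVOS for transfer learning.
Example 1: enc-init-mods: 'enc.' -> transfer all encoder parameters.
Example 2: enc-init-mods: 'enc.embed.,enc.0.' -> transfer the encoder embedding layer and first layer parameters.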
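The prefix matching used by enc-init-mods and dec-init-mods can be illustrated with a toy state dict. This is a minimal sketch of the matching rule as described above, not the ESPnet code; `select_transfer_params` is a hypothetical helper, and real checkpoints hold tensors rather than strings.

```python
def select_transfer_params(state_dict, init_mods):
    """Keep the entries whose key starts with one of the comma-separated prefixes."""
    prefixes = tuple(p.strip() for p in init_mods.split(",") if p.strip())
    return {k: v for k, v in state_dict.items() if k.startswith(prefixes)}


# Toy source model; the values stand in for parameter tensors.
source = {
    "enc.embed.weight": "w0",
    "enc.0.weight": "w1",
    "enc.1.weight": "w2",
    "dec.embed.weight": "w3",
}
# Mirrors Example 2: embedding layer and first encoder layer only.
print(sorted(select_transfer_params(source, "enc.embed.,enc.0.")))
# Mirrors Example 1: every encoder parameter.
print(sorted(select_transfer_params(source, "enc.")))
```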

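Freezing

The freezing option can be enabled through freeze-mods (freeze_param in espnet2).
The option takes a comma-separated list of model modules as argument. As before, we do not expect an exact match for the specified modules.
Example 1: freeze-mods: 'enc.embed.' -> freeze the encoder embedding layer parameters.
Example 2: freeze-mods: 'dec.embed,dec.0.' -> freeze the decoder embedding layer and first layer parameters.
Example 3 (espnet2): freeze_param: 'encoder.embed' -> freeze the encoder embedding layer parameters.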
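Because modules are matched by prefix rather than by exact name, the trailing dot in a key matters. The sketch below uses a stand-in `Param` class instead of real model parameters; `freeze_by_prefix` is a hypothetical helper that follows the matching rule described above, not the ESPnet implementation.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Stand-in for a model parameter; only the trainable flag matters here."""
    requires_grad: bool = True


def freeze_by_prefix(named_params, freeze_mods):
    """Disable gradients for parameters whose name starts with a listed prefix."""
    prefixes = tuple(p.strip() for p in freeze_mods.split(",") if p.strip())
    frozen = []
    for name, param in named_params.items():
        if name.startswith(prefixes):
            param.requires_grad = False
            frozen.append(name)
    return frozen


names = ["enc.1.weight", "enc.10.weight", "enc.2.weight"]
# Without the trailing dot, 'enc.1' also matches layer 10.
print(freeze_by_prefix({n: Param() for n in names}, "enc.1"))
# With the trailing dot, only layer 1 is frozen.
print(freeze_by_prefix({n: Param() for n in names}, "enc.1."))
```

Ending each key with a dot, as in most of the examples above, avoids freezing layers that merely share leading digits.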

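Important notes

Given a pre-trained source model, the modules specified for transfer learning are expected to have the same parameters (i.e.: layers and units) as the target model modules.
We also support initialization of the RNN-Transducer decoder with a pre-trained RNN LM.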
RNN models use different key names for encoder and decoder parts compared to Transformer, Conformer or Custom models:
- RNN models use enc. for the encoder part and dec. for the decoder part.
- Transformer/Conformer/Custom models use encoder. for the encoder part and decoder. for the decoder part.

Chainer and PyTorch backends

	Chainer	Pytorch
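Performance	◎	◎
Speed	○	◎
Multi-GPU	supported	supported
VGG-like encoder	supported	supported
Transformer	supported	supported
RNNLM integration	supported	supported
#Attention types	3 (no attention, dot, location)	12 including variants of multihead
TTS recipe support	no support	supported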