Change the configuration for training


Show usage

There are two ways to show the command-line options: --help and --print_config

# Show the command-line options
python -m espnet2.bin.asr_train --help
# Show the full configuration in YAML format
python -m espnet2.bin.asr_train --print_config

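In this section we use espnet2.bin.asr_train as an example, but the other training tools based on the Task class share the same interface, so you can substitute any of them for it.

Note that ESPnet2 always uses _ rather than - as the separator in option names, to avoid confusion.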

# Bad
--batch-size
# Good
--batch_size

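A notable feature of --print_config is that it dynamically shows the configuration resolved from the given arguments, so you can look up the parameters of any swappable class.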

% # Show parameters of Adam optimizer
% python -m espnet2.bin.asr_train --optim adam --print_config
...
optim: adam
optim_conf:
    lr: 0.001
    betas:
    - 0.9
    - 0.999
    eps: 1.0e-08
    weight_decay: 0
    amsgrad: false
...
% # Show parameters of ReduceLROnPlateau scheduler
% python -m espnet2.bin.asr_train --scheduler ReduceLROnPlateau --print_config
...
scheduler: reducelronplateau
scheduler_conf:
    mode: min
    factor: 0.1
    patience: 10
    verbose: false
    threshold: 0.0001
    threshold_mode: rel
    cooldown: 0
    min_lr: 0
    eps: 1.0e-08
...
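As a rough illustration of how a parameter listing like the ones above could be derived, the sketch below reads the default keyword arguments of a class constructor with Python's inspect module. ToyAdam and default_kwargs are hypothetical names introduced here, and this is only an assumption about the general approach, not a description of how --print_config is actually implemented.

```python
import inspect


class ToyAdam:
    """Hypothetical stand-in for an optimizer class (not torch.optim.Adam)."""

    def __init__(self, params, lr=0.001, betas=(0.9, 0.999), eps=1e-08,
                 weight_decay=0, amsgrad=False):
        self.params = params


def default_kwargs(cls):
    """Return {argument name: default value} for the constructor of cls."""
    signature = inspect.signature(cls.__init__)
    return {
        name: param.default
        for name, param in signature.parameters.items()
        if param.default is not inspect.Parameter.empty
    }


optim_conf = default_kwargs(ToyAdam)
for key, value in optim_conf.items():
    print(f"{key}: {value}")
```

Arguments without a default, such as params, are skipped, which is why only tunable values appear in the listing.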

Configuration file

You can find the configuration files for DNN training in conf/train_*.yaml.

ls conf/

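We adopt ConfigArgParse for this configuration system. A configuration file in YAML format has the same effect as command-line arguments. For example, the following two invocations are equivalent: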

# config.yaml
foo: 3
bar: 4

python -m espnet2.bin.asr_train --config conf/config.yaml
python -m espnet2.bin.asr_train --foo 3 --bar 4
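The equivalence can be pictured with the standard library alone. The sketch below flattens a config mapping into command-line tokens and checks that argparse parses both forms to the same result. foo and bar are the placeholder options from the example above, and config_to_args is a hypothetical helper, not part of ESPnet or ConfigArgParse.

```python
import argparse


def config_to_args(config):
    """Flatten a {option: value} mapping into command-line tokens."""
    args = []
    for key, value in config.items():
        args += [f"--{key}", str(value)]
    return args


parser = argparse.ArgumentParser()
parser.add_argument("--foo", type=int)
parser.add_argument("--bar", type=int)

config = {"foo": 3, "bar": 4}  # contents of the config.yaml shown above
from_file = parser.parse_args(config_to_args(config))
from_cli = parser.parse_args(["--foo", "3", "--bar", "4"])
print(vars(from_file) == vars(from_cli))
```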

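Change the configuration for dict-type values

Some parameters are named *_conf, e.g. optim_conf and decoder_conf, and take dict-type values. We also provide a way to set the nested values inside such dict objects.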

# e.g. Change parameters one by one
python -m espnet2.bin.asr_train --optim_conf lr=0.1 --optim_conf rho=0.8
# e.g. Give the parameters in yaml format
python -m espnet2.bin.asr_train --optim_conf "{lr: 0.1, rho: 0.8}"
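To make the one-by-one form concrete, here is a minimal sketch of merging repeated key=value arguments into a single dict. merge_conf and parse_value are hypothetical helpers, and values are interpreted with ast.literal_eval as a stand-in for whatever value parsing the real tool performs.

```python
import ast


def parse_value(text):
    """Interpret text as a Python literal when possible, else keep the string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def merge_conf(assignments):
    """Merge 'key=value' strings into one dict; later keys override earlier ones."""
    conf = {}
    for item in assignments:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        conf[key.strip()] = parse_value(value.strip())
    return conf


optim_conf = merge_conf(["lr=0.1", "rho=0.8"])
print(optim_conf)
```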

Resume training process

python -m espnet2.bin.asr_train --resume true

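The state of the training process is saved as checkpoint.pth at the end of every epoch, and training can be resumed from the start of the next epoch. The checkpoint contains the following states:

Model state
Optimizer state
Scheduler state
Reporter state
torch.cuda.amp state (torch 1.6 and later)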
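The save-per-epoch and resume-from-next-epoch behavior can be sketched with a toy loop. Everything here is illustrative: the JSON file, the train function, and the single float standing in for the model are assumptions made for the example, not ESPnet's checkpoint format.

```python
import json
import os
import tempfile


def train(output_dir, max_epoch, resume):
    """Toy training loop that checkpoints after every epoch.

    Returns the final state and the list of epochs actually run.
    """
    path = os.path.join(output_dir, "checkpoint.json")  # hypothetical file name
    state = {"epoch": 0, "model": 0.0}
    if resume and os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # restore what was saved at the last epoch

    epochs_run = []
    for epoch in range(state["epoch"] + 1, max_epoch + 1):
        state["model"] += 1.0  # stand-in for one epoch of parameter updates
        state["epoch"] = epoch
        with open(path, "w") as f:
            json.dump(state, f)  # saved at the end of the epoch
        epochs_run.append(epoch)
    return state, epochs_run


with tempfile.TemporaryDirectory() as tmp:
    first_state, first_epochs = train(tmp, max_epoch=2, resume=False)
    resumed_state, resumed_epochs = train(tmp, max_epoch=5, resume=True)

print(first_epochs, resumed_epochs)
```

The second call skips the epochs already completed and continues from epoch 3.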

Transfer learning / Fine-tuning using a pretrained model

Use --init_param <file_path>:<src_key>:<dst_key>:<exclude_keys>

# Load all parameters
python -m espnet2.bin.asr_train --init_param model.pth
# Load only the parameters starting with "decoder"
python -m espnet2.bin.asr_train --init_param model.pth:decoder
# Load only the parameters starting with "decoder" and set it to model.decoder
python -m espnet2.bin.asr_train --init_param model.pth:decoder:decoder
# Set parameters to model.decoder
python -m espnet2.bin.asr_train --init_param decoder.pth::decoder
# Load all parameters excluding "decoder.embed"
python -m espnet2.bin.asr_train --init_param model.pth:::decoder.embed
# Load all parameters excluding "encoder" and "decoder.embed"
python -m espnet2.bin.asr_train --init_param model.pth:::encoder,decoder.embed
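The colon-separated argument can be read as four optional fields. The following sketch splits such a string and applies the prefix selection and exclusion described in the comments above to a list of parameter names. parse_init_param, select_names, and the parameter names are hypothetical; the handling of the destination key inside the model is deliberately left out.

```python
def parse_init_param(spec):
    """Split 'path:src_key:dst_key:exclude_keys' into its four fields.

    Missing or empty fields become None; exclude_keys becomes a list.
    """
    fields = spec.split(":")
    if len(fields) > 4:
        raise ValueError(f"too many ':' separated fields in {spec!r}")
    fields += [""] * (4 - len(fields))
    path, src_key, dst_key, excludes = (f or None for f in fields)
    exclude_keys = excludes.split(",") if excludes else []
    return path, src_key, dst_key, exclude_keys


def select_names(names, src_key=None, exclude_keys=()):
    """Keep names starting with src_key and not starting with any excluded key."""
    kept = []
    for name in names:
        if src_key is not None and not name.startswith(src_key):
            continue
        if any(name.startswith(e) for e in exclude_keys):
            continue
        kept.append(name)
    return kept


# Hypothetical parameter names of a saved model
names = ["encoder.embed.weight", "decoder.embed.weight", "decoder.out.bias"]
path, src_key, dst_key, exclude_keys = parse_init_param(
    "model.pth:::encoder,decoder.embed"
)
print(select_names(names, src_key, exclude_keys))
```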

Freeze parameters

python -m espnet2.bin.asr_train --freeze_param encoder.enc encoder.decoder

Change logging interval

The intermediate state of training is logged once every specified number of iterations:

python -m espnet2.bin.asr_train --log_interval 100

Weights & Biases integration

About Weights & Biases: https://docs.wandb.com/

Installation and setup
See: https://docs.wandb.com/quickstart
```
wandb login
```

Enable wandb

python -m espnet2.bin.asr_train --use_wandb true

Then open the URL that is displayed.

[Option] Use an HTTPS proxy

export HTTPS_PROXY=...your proxy
export CURL_CA_BUNDLE=your.pem
export CURL_CA_BUNDLE=   # Disable SSL certificate verification

The relation between mini-batch size and number of GPUs

The batch size can be changed as follows:

# Change both of the batch_size for training and validation
python -m espnet2.bin.asr_train --batch_size 20
# Change the batch_size for validation
python -m espnet2.bin.asr_train --valid_batch_size 200

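During multi-GPU training, the batch size behaves differently from ESPnet1.

ESPnet1: The batch size is multiplied by the number of GPUs.

python -m espnet.bin.asr_train --batch_size 10 --ngpu 2 # Actual batch size is 20; each GPU device is assigned 10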

ESPnet2: The batch size is not changed regardless of the number of GPUs.
- Therefore, you should set a batch size larger than the number of GPUs.
```
python -m espnet2.bin.asr_train --batch_size 10 --ngpu 2 # Actual batch size is 10; each GPU device is assigned 5
```
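The two rules can be compared side by side. This is a small sketch of the arithmetic only; espnet1_batch and espnet2_batch are hypothetical helper names, and an even split across GPUs is assumed.

```python
def espnet1_batch(batch_size, ngpu):
    """ESPnet1 rule from the text: batch_size is per GPU, so the total scales."""
    return {"total": batch_size * ngpu, "per_gpu": batch_size}


def espnet2_batch(batch_size, ngpu):
    """ESPnet2 rule from the text: batch_size is the total, split across GPUs."""
    if batch_size < ngpu:
        # In this sketch a GPU would otherwise receive no sample at all
        raise ValueError("batch_size should not be smaller than the number of GPUs")
    return {"total": batch_size, "per_gpu": batch_size // ngpu}


print(espnet1_batch(10, 2))
print(espnet2_batch(10, 2))
```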

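Change mini-batch type

We adopt variable mini-batch sizes that take the dimensions of the input features into account, in order to make the best use of GPU memory.

There are 6 types: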

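| batch_type | Option to change batch size | Variable batch size | Requirement |
|---|---|---|---|
| unsorted | --batch_size | No | - |
| sorted | --batch_size | No | Length information of features |
| folded | --batch_size | Yes | Length information of features |
| length | --batch_bins | Yes | Length information of features |
| numel | --batch_bins | Yes | Shape information of features |
| catbel | --batch_size | No | - |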

Note that --batch_size is ignored if --batch_type=length or --batch_type=numel.

`--batch_type unsorted`

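This mode does nothing special: it simply creates fixed-size mini-batches without sorting by length. It may be suitable if you intend to use ESPnet for tasks other than sequence-to-sequence.

Unlike the other modes, this mode does not require information about the feature dimensions. In other words, preparing a shape file is not mandatory for this mode: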

python -m espnet2.bin.asr_train \
  --batch_size 10 --batch_type unsorted \
  --train_data_path_and_name_and_type "train.scp,feats,npy" \
  --valid_data_path_and_name_and_type "valid.scp,feats,npy" \
  --train_shape_file "train.scp" \
  --valid_shape_file "valid.scp"

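This system may seem strange, and you may also find --*_shape_file verbose, because the training corpus can be fully described by --*_data_path_and_name_and_type alone.

From the implementation viewpoint, we separate the data sources for Dataset and BatchSampler in PyTorch; --*_data_path_and_name_and_type and --*_shape_file correspond to them, respectively. From the viewpoint of training strategy, because the batch size can be adjusted according to the length/dimension of each feature, the shape information must be prepared before training.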

`--batch_type sorted`

This mode creates fixed-size mini-batches from samples sorted by length. It requires length information.

python -m espnet2.bin.asr_train \
  --batch_size 10 --batch_type sorted \
  --train_data_path_and_name_and_type "train.scp,feats,npy" \
  --train_data_path_and_name_and_type "train2.scp,feats2,npy" \
  --valid_data_path_and_name_and_type "valid.scp,feats,npy" \
  --valid_data_path_and_name_and_type "valid2.scp,feats2,npy" \
  --train_shape_file "train_length.txt" \
  --valid_shape_file "valid_length.txt"

e.g. length.txt

sample_id1 1230
sample_id2 156
sample_id3 890
...

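The first column is the sample ID and the second is the length of the corresponding feature. Note that our recipes pass a shape file as input instead.

e.g. shape.txt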

sample_id1 1230,80
sample_id2 156,80
sample_id3 890,80
...

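This file describes the full shape of each feature: the first number is the sequence length, and the second and later numbers are the feature dimensions, i.e. Length,Dim1,Dim2,...

--batch_type sorted, --batch_type folded, and --batch_type length refer only to the first number; the full shape information is required only for --batch_type numel.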
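As a small illustration of the file format described above, the sketch below parses shape-file lines and extracts the length (the first number) that the length-based batch types would use. parse_shape_line is a hypothetical helper written for this example.

```python
def parse_shape_line(line):
    """Parse 'sample_id Length,Dim1,Dim2,...' into (sample_id, shape tuple)."""
    sample_id, shape_text = line.split(maxsplit=1)
    shape = tuple(int(n) for n in shape_text.split(","))
    return sample_id, shape


# The last line has only a length, as in length.txt
lines = ["sample_id1 1230,80", "sample_id2 156,80", "sample_id3 890"]
shapes = dict(parse_shape_line(line) for line in lines)
lengths = {sample_id: shape[0] for sample_id, shape in shapes.items()}
print(lengths)
```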

`--batch_type folded`

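In ESPnet1, this mode is referred to as seq.

This mode creates mini-batches whose size is base_batch_size // max_i(1 + L_i // f_i), where L_i is the maximum length of the i-th feature in the mini-batch and f_i is the --fold_length corresponding to that feature. This mode requires length information.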

python -m espnet2.bin.asr_train \
  --batch_size 20 --batch_type folded \
  --train_data_path_and_name_and_type "train.scp,feats,npy" \
  --train_data_path_and_name_and_type "train2.scp,feats2,npy" \
  --valid_data_path_and_name_and_type "valid.scp,feats,npy" \
  --valid_data_path_and_name_and_type "valid2.scp,feats2,npy" \
  --train_shape_file "train_length.scp" \
  --train_shape_file "train_length2.scp" \
  --valid_shape_file "valid_length.scp" \
  --valid_shape_file "valid_length2.scp" \
  --fold_length 5000 \
  --fold_length 300

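Note that the number of repetitions of *_shape_file must equal the number of --fold_length options, but you do not need to provide as many shape files as data files. For example, you can give the following: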

python -m espnet2.bin.asr_train \
  --batch_size 20 --batch_type folded \
  --train_data_path_and_name_and_type "train.scp,feats,npy" \
  --train_data_path_and_name_and_type "train2.scp,feats2,npy" \
  --valid_data_path_and_name_and_type "valid.scp,feats,npy" \
  --valid_data_path_and_name_and_type "valid2.scp,feats2,npy" \
  --train_shape_file "train_length.txt" \
  --valid_shape_file "valid_length.txt" \
  --fold_length 5000

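In this example, the length of the first feature is taken into account, while the second feature is ignored. The same technique also applies to --batch_type length and --batch_type numel.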
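To make the folded batch-size formula concrete, here is a worked sketch. folded_batch_size is a hypothetical helper, and the maximum lengths 12000 and 500 are made-up values paired with the fold lengths 5000 and 300 used in the first command above.

```python
def folded_batch_size(base_batch_size, max_lengths, fold_lengths):
    """Return base_batch_size // max_i(1 + L_i // f_i)."""
    if len(max_lengths) != len(fold_lengths):
        raise ValueError("need one fold length per feature")
    factor = max(
        1 + length // fold for length, fold in zip(max_lengths, fold_lengths)
    )
    return base_batch_size // factor


# 1 + 12000 // 5000 = 3 and 1 + 500 // 300 = 2, so the factor is 3
print(folded_batch_size(20, [12000, 500], [5000, 300]))  # 20 // 3 = 6
```

When every feature is shorter than its fold length the factor is 1 and the full base batch size is used.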

`--batch_type length`

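In ESPnet1, this mode is referred to as frame.

You need to specify --batch_bins instead of --batch_size to determine the mini-batch size. Each mini-batch contains, as nearly as possible, the same number of bins, where the number of bins is the total length of all features in the mini-batch; i.e. bins = sum(len(feat) for feats in batch for feat in feats). This mode requires length information.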

python -m espnet2.bin.asr_train \
  --batch_bins 10000 --batch_type length \
  --train_data_path_and_name_and_type "train.scp,feats,npy" \
  --train_data_path_and_name_and_type "train2.scp,feats2,npy" \
  --valid_data_path_and_name_and_type "valid.scp,feats,npy" \
  --valid_data_path_and_name_and_type "valid2.scp,feats2,npy" \
  --train_shape_file "train_length.txt" \
  --train_shape_file "train_length2.txt" \
  --valid_shape_file "valid_length.txt" \
  --valid_shape_file "valid_length2.txt"
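The idea of filling each mini-batch up to a bin budget can be sketched with a simple greedy packer. This is a simplification for illustration, assuming one feature per sample; length_batches is a hypothetical function and is not ESPnet's batch sampler.

```python
def length_batches(lengths, batch_bins):
    """Greedily pack samples, sorted by length, so each batch has <= batch_bins bins.

    lengths maps sample_id -> feature length (a single feature per sample).
    A sample longer than batch_bins ends up alone in its own batch.
    """
    batches, current, bins = [], [], 0
    for sample_id in sorted(lengths, key=lengths.get):
        size = lengths[sample_id]
        if current and bins + size > batch_bins:
            batches.append(current)
            current, bins = [], 0
        current.append(sample_id)
        bins += size
    if current:
        batches.append(current)
    return batches


lengths = {"a": 300, "b": 100, "c": 500, "d": 200, "e": 400}
print(length_batches(lengths, batch_bins=600))
```

Short samples end up in larger batches and long samples in smaller ones, which is the variable batch size this mode is meant to provide.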

`--batch_type numel`

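In ESPnet1, this mode is referred to as bins.

You need to specify --batch_bins instead of --batch_size to determine the mini-batch size. Each mini-batch contains, as nearly as possible, the same number of bins, counted in elements; i.e. bins = sum(numel(feat) for feats in batch for feat in feats), where numel returns the product of all dimensions of each feature's shape: shape[0] * shape[1] * ...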

python -m espnet2.bin.asr_train \
  --batch_bins 200000 --batch_type numel \
  --train_data_path_and_name_and_type "train.scp,feats,npy" \
  --train_data_path_and_name_and_type "train2.scp,feats2,npy" \
  --valid_data_path_and_name_and_type "valid.scp,feats,npy" \
  --valid_data_path_and_name_and_type "valid2.scp,feats2,npy" \
  --train_shape_file "train_shape.txt" \
  --train_shape_file "train_shape2.txt" \
  --valid_shape_file "valid_shape.txt" \
  --valid_shape_file "valid_shape2.txt"
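The bin count for numel can be worked through with the shapes from the shape.txt example. In this sketch, numel and batch_bins are hypothetical helper names, and each sample is assumed to carry a single feature.

```python
import math


def numel(shape):
    """Number of elements of a feature: shape[0] * shape[1] * ..."""
    return math.prod(shape)


def batch_bins(batch_shapes):
    """Sum numel over every feature of every sample in the batch."""
    return sum(numel(shape) for shapes in batch_shapes for shape in shapes)


# One feature per sample, using the shapes from the shape.txt example
batch = [[(1230, 80)], [(156, 80)], [(890, 80)]]
print(batch_bins(batch))  # 98400 + 12480 + 71200 = 182080
```

With --batch_bins 200000 as in the command above, these three samples would fit in one mini-batch under this counting.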

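`--batch_type catbel`

This batch type targets classification tasks. It ensures that, within each mini-batch, all samples belong to different classes. --batch_size determines the mini-batch size. This batch type is not compatible with the default sequence iterator_type; it is designed to be used with the category iterator_type. Therefore, rather than explicitly specifying --batch_type catbel, it is recommended to use --iterator_type category, which automatically sets batch_type to catbel. A preprocessor that adjusts sample durations so that mini-batches can be formed, such as espnet2/train/preprocessor/SpkPreprocessor, must also be used.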

python -m espnet2.bin.spk_train \
  --batch_size 256 --iterator_type category \
  --train_data_path_and_name_and_type "train.scp,feats,npy" \
  --valid_data_path_and_name_and_type "valid.scp,feats,npy" \
  --train_shape_file "train_shape.txt" \
  --valid_shape_file "valid_shape.txt"
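The constraint that all samples in a mini-batch belong to different classes can be illustrated with a toy batch builder. This is a sketch under simple assumptions (it draws at most one sample per category per batch and may produce smaller batches at the end); category_batches and the utterance and speaker names are hypothetical, and this is not ESPnet's sampler.

```python
from collections import defaultdict, deque


def category_batches(utt2category, batch_size):
    """Build batches in which no two samples share a category."""
    pools = defaultdict(deque)
    for utt, category in utt2category.items():
        pools[category].append(utt)

    batches = []
    while True:
        # Categories that still have samples, fullest first for even draining
        remaining = sorted(
            (c for c in pools if pools[c]), key=lambda c: (-len(pools[c]), c)
        )
        if not remaining:
            break
        batches.append([pools[c].popleft() for c in remaining[:batch_size]])
    return batches


utt2category = {"u1": "spk1", "u2": "spk1", "u3": "spk2", "u4": "spk3"}
batches = category_batches(utt2category, batch_size=2)
print(batches)
```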

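Gradient accumulation

There are several ways to deal with a model that is too large for the memory capacity of the GPU device during training.

Using a larger number of GPUs
Using half-precision tensors
Using torch.utils.checkpoint
Gradient accumulation

Gradient accumulation is a technique for handling mini-batches larger than what fits in memory at once.

A mini-batch is split into several pieces; the forward and backward passes are run on each piece in turn while the gradients are accumulated, and the optimizer update is triggered once every fixed number of forward passes, as follows: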

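# accum_grad is the number of pieces a mini-batch is split into
for i, batch in enumerate(iterator):
    loss = net(batch)
    (loss / accum_grad).backward()  # Gradients are accumulated
    if (i + 1) % accum_grad == 0:
        optim.step()       # Update once every accum_grad forward passes
        optim.zero_grad()  # Reset the accumulated gradients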

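Pass --accum_grad to use this option.

python -m espnet2.bin.asr_train --accum_grad 2

The effective batch_size is then almost the same as accum_grad * batch_size, except for:

The random state
Statistical layers that depend on the mini-batch, e.g. BatchNormalization
Cases where batch_size is not the same in every iteration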
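The near-equivalence stated above can be checked numerically on a toy problem. The sketch below uses a one-parameter model y = w * x with a mean-squared-error loss, both chosen only for illustration, and assumes the batch divides evenly into accum_grad pieces. mean_grad and accumulated_grad are hypothetical helpers.

```python
import math


def mean_grad(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, accum_grad):
    """Accumulate (piece gradient / accum_grad) over equally sized pieces."""
    piece_size = len(batch) // accum_grad
    total = 0.0
    for i in range(accum_grad):
        piece = batch[i * piece_size:(i + 1) * piece_size]
        # Mirrors (loss / accum_grad).backward() on one piece
        total += mean_grad(w, piece) / accum_grad
    return total


data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0)]  # toy (x, y) pairs
full = mean_grad(0.5, data)
accumulated = accumulated_grad(0.5, data, accum_grad=2)
print(math.isclose(full, accumulated))
```

If the pieces had different sizes, the two values would no longer match exactly, which corresponds to the last exception in the list.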

Automatic Mixed Precision training

python -m espnet2.bin.asr_train --use_amp true

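Reproducibility and determinism

Several factors can make training non-reproducible.

Parameter initialization that differs between PyTorch/ESPnet versions.
The reduction order of floating-point values during multi-GPU training.
- It is not clear whether NCCL is deterministic.
Differences in random seed.
- We fix the random seed for each epoch.
CuDNN or other non-deterministic CUDA operations: see https://pytorch.org/docs/stable/notes/randomness.html

By default, CuDNN runs in deterministic mode in our training; this can be turned off as follows:

python -m espnet2.bin.asr_train --cudnn_deterministic false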
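The remark above about fixing the random seed for each epoch can be pictured with a seeded shuffle. This is only a sketch of the idea: epoch_order is a hypothetical helper, and deriving the per-epoch seed as seed + epoch is an assumption made for illustration.

```python
import random


def epoch_order(sample_ids, seed, epoch):
    """Shuffle sample_ids with a generator seeded from (seed, epoch)."""
    rng = random.Random(seed + epoch)  # assumed seeding scheme, for illustration
    order = list(sample_ids)
    rng.shuffle(order)
    return order


ids = ["utt1", "utt2", "utt3", "utt4", "utt5"]
run_a = [epoch_order(ids, seed=0, epoch=e) for e in range(1, 4)]
run_b = [epoch_order(ids, seed=0, epoch=e) for e in range(1, 4)]
print(run_a == run_b)
```

Because the order of an epoch depends only on the seed and the epoch number, a run that is resumed at a given epoch sees the same order as an uninterrupted run.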