Weakly-supervised Learning (Speech-to-Text)


This is a template of S2T1 recipe for ESPnet2. It is based on ASR1, but follows the style of OpenAI's Whisper to train a single encoder-decoder model for various speech processing tasks. Specifically, it uses special tokens as task specifiers (e.g., transcribe, translate) or prediction targets (e.g., language ID) so that a single model can perform multiple tasks for multiple languages. It further supports conditional generation where the condition is the previous sentence within the long talk.

More details can be found in our OWSM paper (ASRU 2023).

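Recipe flow
1. Data preparation
- ESPnet format:
2. Speed perturbation
3. Wav format
4. Remove long or short data
5. Generate token list
6. LM statistics collection
7. LM training
8. LM perplexity
9. Ngram LM training
10. S2T statistics collection
11. S2T training
12. S2T inference
13. S2T scoring
14-15. (Optional) Pack results for upload
How to run
OWSM training
How to fine-tune pre-trained OWSM
Related work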

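Recipe flow

The S2T1 recipe consists of 16 stages.

Data preparation

Data preparation stage.

ESPnet format:

It calls local/data.sh to create Kaldi-style data directories under data/ for the training and validation sets.

The training data has the following format: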

<sop> prev<sos><category><task><starttime1> utt1<endtime1><starttime2> utt2<endtime2><eos>

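Here <sop> is a special token that marks the start of the previous (prompt) sentence. Timestamps are also treated as special tokens, because the audio has a fixed length (30 s) and a fixed resolution (20 ms or 40 ms). An example looks like this: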

<sop> I'm going to talk today about energy and climate.<sos><en><transcribe><0.00> And that might seem a bit surprising, because my full-time work at the foundation is mostly about vaccines and seeds, about the things that we need to invent and deliver to help the poorest two billion live better lives.<14.12><15.36> But energy and climate are extremely important to these people; in fact, more important than to anyone else on the planet.<24.26><eos>
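
To make the format concrete, here is a minimal Python sketch that assembles a target string from timed segments. The helper names (time_token, build_text, build_full_sequence) are hypothetical and not part of ESPnet; the sketch assumes a 20 ms resolution and two-decimal timestamp tokens as in the example above. In the actual recipe, <sop>, <sos>, and <eos> are inserted by the preprocessor, so build_full_sequence only illustrates the final layout.

```python
def time_token(seconds, resolution=0.02):
    """Quantize a time in seconds to the grid and render it as a special token."""
    steps = round(seconds / resolution)
    return f"<{steps * resolution:.2f}>"


def build_text(lang, task, segments):
    """Build the `text` entry: <lang><task><start1> utt1<end1><start2> utt2<end2>..."""
    parts = [f"<{lang}><{task}>"]
    for start, end, utt in segments:
        parts.append(f"{time_token(start)} {utt}{time_token(end)}")
    return "".join(parts)


def build_full_sequence(prev, text):
    """Illustrate the full layout; the recipe's preprocessor adds these tokens itself."""
    return f"<sop> {prev}<sos>{text}<eos>"


segments = [
    (0.0, 14.12, "And that might seem a bit surprising."),
    (15.36, 24.26, "But energy and climate are extremely important."),
]
text = build_text("en", "transcribe", segments)
print(build_full_sequence("I'm going to talk today about energy and climate.", text))
```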

Three text files need to be generated during data preparation:

text contains the normal target sentence, i.e., the text between <sos> and <eos>.
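- This must contain all the special tokens described above.
- If your data has no timestamps, use <notimestamps> instead. Even then, you still need to make sure that the non-linguistic symbol list (nlsyms_txt) covers the full range of possible timestamps.
- <sop>, <sos>, and <eos> are actually inserted during preprocessing/data loading. You only need to provide <language><task><starttime1> utt1<endtime1><starttime2> utt2<endtime2>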
- For ST (speech translation), text should contain the translation in the target language.
- For examples of data preparation and the expected format, see https://github.com/espnet/espnet/tree/master/egs2/owsm_v1/s2t1/local.
text.prev contains the previous sentence, i.e., the text between <sop> and <sos>. This might be unavailable at the beginning of a talk. In such cases, a special token <na> will be used.
- This should not contain any special tokens other than <na>. In the example above, take the text between <sop> and <sos> and put it here.
text.ctc contains the ASR transcript without any special token, which is used for the CTC loss. For ASR utterances, this can be derived from text, but for ST utterances, this is in a different language. If the ASR transcription is not available, <na> will be used.
- This should not contain any special tokens (e.g. timestamps). For ASR, just take the text between <task> and <eos> and put it here (timestamps removed). For ST, text.ctc should contain the ASR transcript of the source language and thus will be different from text (the target language translation).
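
For ASR utterances, deriving the text.ctc entry from the text entry amounts to stripping every special token. A minimal sketch follows; strip_special_tokens is a hypothetical helper, and the regular expression assumes that special tokens are the only angle-bracketed, whitespace-free spans in your transcripts.

```python
import re

# Matches <en>, <transcribe>, <0.00>, <notimestamps>, ... (no spaces inside the brackets).
SPECIAL_TOKEN = re.compile(r"<[^<>\s]+>")


def strip_special_tokens(text):
    """Remove special tokens and collapse the leftover whitespace."""
    return " ".join(SPECIAL_TOKEN.sub(" ", text).split())


text = "<en><transcribe><0.00> And that might seem a bit surprising.<3.50><4.10> But energy matters.<6.20>"
print(strip_special_tokens(text))
```

For ST utterances this derivation does not apply, since text.ctc must hold the source-language transcript rather than the translation.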

Additional notes:

The text.prev file should be called text.prev (with a period). The same applies to text.ctc.
- However, in the training config, text_prev_name must be text_prev (and text_ctc_name must be text_ctc).
If you support multiple tasks (e.g. ASR and ST), all utterances are expected to be put into text, text.prev, and text.ctc. The task token (e.g. <asr>) will distinguish between different tasks.
If the same utterance is used multiple times (e.g. once in ASR and once in ST), each copy of the utterance needs a unique ID. You can append the task to the utterance ID to make it unique.
- Use utils/fix_data_dir.sh --utt_extra_files "utt2num_samples text.ctc text.prev" SPLIT to make sure the files stay sorted.
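
The unique-ID note can be sketched as follows. The helper name and the suffix convention (utterance ID, underscore, task) are assumptions for illustration; any scheme works as long as every copy gets a distinct ID and the same IDs are used in wav.scp, text, text.prev, and text.ctc.

```python
def make_unique_entries(entries):
    """Map (utt_id, task, text) triples to {unique_id: text}, sorted by ID."""
    unique = {}
    for utt_id, task, text in entries:
        new_id = f"{utt_id}_{task}"  # hypothetical naming scheme
        if new_id in unique:
            raise ValueError(f"duplicate utterance ID: {new_id}")
        unique[new_id] = text
    # Kaldi-style files are expected to be sorted by utterance ID.
    return dict(sorted(unique.items()))


entries = [
    ("utt001", "st", "<en><st><notimestamps> translated text"),
    ("utt001", "asr", "<en><asr><notimestamps> transcribed text"),
]
for utt_id, text in make_unique_entries(entries).items():
    print(utt_id, text)
```

The language and task tokens in the example strings are placeholders; use the ones defined by your own token list.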

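Speed perturbation

Augment the training data with speed perturbation. data/${train_set}_spXX will be generated (XX denotes the speed factor). This step is optional. Note that the timestamps need to be adjusted accordingly.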
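
As a rough illustration of the timestamp adjustment, the sketch below rescales every timestamp token in a text entry. It assumes that a speed factor s scales durations by 1/s (as with the sox speed effect) and that timestamps sit on a 20 ms grid; rescale_timestamps is a hypothetical helper, not the recipe's own code.

```python
import re

TIME_TOKEN = re.compile(r"<(\d+\.\d{2})>")


def rescale_timestamps(text, speed, resolution=0.02):
    """Divide every <time> token by the speed factor and snap it back to the grid."""

    def repl(match):
        new_time = float(match.group(1)) / speed
        steps = round(new_time / resolution)
        return f"<{steps * resolution:.2f}>"

    return TIME_TOKEN.sub(repl, text)


original = "<en><transcribe><0.00> hello<14.12>"
print(rescale_timestamps(original, 0.9))  # slower audio, later timestamps
print(rescale_timestamps(original, 1.1))  # faster audio, earlier timestamps
```

With factors below 1.0, a segment that ended close to 30 s can move past the 30 s limit, so such utterances may need to be dropped or re-segmented.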

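Wav format

Convert the audio files listed in wav.scp into a single format (wav / flac / kaldi_ark).

Remove long or short data

Remove utterances that are too long or too short.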
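
Conceptually, this stage is a duration filter over utt2num_samples. The sketch below shows the idea only; the thresholds are assumptions chosen for illustration (the 30 s upper bound mirrors the fixed input length, and 16 kHz matches the fs value used later in the fine-tuning config), not the values hard-coded in s2t.sh.

```python
def filter_by_duration(utt2num_samples, fs=16000, min_sec=0.1, max_sec=30.0):
    """Keep only utterances whose duration lies within [min_sec, max_sec]."""
    kept = {}
    for utt_id, num_samples in utt2num_samples.items():
        duration = num_samples / fs
        if min_sec <= duration <= max_sec:
            kept[utt_id] = num_samples
    return kept


utt2num_samples = {
    "utt_ok": 16000 * 12,    # 12 s, kept
    "utt_long": 16000 * 45,  # 45 s, dropped
    "utt_short": 800,        # 0.05 s, dropped
}
print(filter_by_duration(utt2num_samples))
```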

Generate token list

Generate the token list from the training data. BPE tokens are used.
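
Because timestamps are special tokens, the non-linguistic symbol list used for BPE training has to cover all of them, as noted in the data preparation stage. A minimal sketch for enumerating them is shown below. It assumes two-decimal tokens from <0.00> to <30.00>; timestamp_tokens is a hypothetical helper, and integer centiseconds are used to avoid floating-point formatting surprises.

```python
def timestamp_tokens(max_seconds=30, step_centis=2):
    """List every timestamp token on the grid, e.g. <0.00>, <0.02>, ..., <30.00>."""
    last = max_seconds * 100
    return [f"<{c // 100}.{c % 100:02d}>" for c in range(0, last + 1, step_centis)]


tokens = timestamp_tokens()  # 20 ms resolution
print(len(tokens), tokens[0], tokens[1], tokens[-1])
```

These lines would be written to the symbol list together with the other special tokens such as <na>, <notimestamps>, and the language and task tokens.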

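LM statistics collection

A neural-network (NN) based language model (LM) is optional for the S2T1 task. You can skip stages 6-9 by setting --use_lm false.

Statistics calculation stage. It collects the shape information of the LM texts and calculates statistics for LM training.

LM training

NN-based LM training stage. You can change the training setup with the --lm_config and --lm_args options.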

LM perplexity

NN-based LM evaluation stage. Perplexity (PPL) is computed for the trained model.
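
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that definition on a list of per-token natural-log probabilities; it is a generic illustration, not the recipe's evaluation script.

```python
import math


def perplexity(token_log_probs):
    """PPL = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 8))
```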

See also:

Change the configuration for training

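Ngram LM training

N-gram based LM training stage.

S2T statistics collection

Statistics calculation stage. It collects the shape information of the input and output texts for S2T training.

S2T training

S2T model training stage. You can change the training setup with the --s2t_config and --s2t_args options.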

S2T inference

S2T inference stage. We can perform ASR or ST on any prepared test data.

S2T scoring

Calculate ASR error rates (char / word / token).
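
The error rates reported here are edit-distance based: the number of substitutions, insertions, and deletions divided by the reference length, computed over characters, words, or tokens. The sketch below is a self-contained illustration of that definition and is not the scoring tool the recipe invokes.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


ref = "but energy and climate are important"
hyp = "but energy in climate are very important"
print("WER:", error_rate(ref.split(), hyp.split()))
print("CER:", error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", ""))))
```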

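14-15. (Optional) Pack results for upload

Packing stage. It packs the trained model files and uploads them to Hugging Face.

See also:

ESPnet Model Zoo
Upload the trained model to Hugging Face for sharing. More information can be found in the docs.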

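How to run

OWSM training

We have created several recipes for OWSM training. Please check egs2/mixed_v1, egs2/mixed_v2, and egs2/mixed_v3 for more information.

How to fine-tune pre-trained OWSM

A pre-trained OWSM model can be fine-tuned on a specific dataset. Here, we use AISHELL-1 as an example.

Prepare the s2t1 recipe

We use this s2t1 template to fine-tune OWSM, so we first create this directory under our custom dataset directory egs2/aishell.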

egs2/TEMPLATE/s2t1/setup.sh egs2/aishell/s2t1

Then, we download a pre-trained model, e.g., espnet/owsm_v2_ebranchformer, using the following commands:

# go to the created dir
cd egs2/aishell/s2t1

# source path.sh
. ./path.sh

# download model from hugging face using espnet_model_zoo_download
# we use dummy names for required arguments and we do not run any actual stage (thus --stage 100)
./s2t.sh --download_model espnet/owsm_v2_ebranchformer --stage 100 --train_set dummy --valid_set dummy2 --test_sets dummy3

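The downloaded model is saved in the local cache and then uncompressed. An exp directory is created automatically, containing symbolic links to the checkpoint and the config file.

To use a pre-trained model, we need the following important files:

config: a yaml file containing all training arguments and the token list. It is named config.yaml.
model checkpoint: it is named xxx.pth. In this example, it is valid.total_count.ave_5best.till25epoch.pth.
stats: used to normalize the input speech features when feats_normalize is set to global_mvn.
bpe model: the BPE model used by sentencepiece.

The path to stats can be found in config.yaml, e.g.: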

grep stats_file exp/espnet/owsm_v2_ebranchformer/config.yaml

The path to bpemodel can also be found in config.yaml, e.g.:

grep bpemodel exp/espnet/owsm_v2_ebranchformer/config.yaml

In the following sections, we will manually copy these two files to the correct locations.
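
If you prefer to look up both paths from a script rather than with grep, a small standard-library sketch is given below. find_config_values is a hypothetical helper that scans the config text line by line, and the sample excerpt uses made-up paths; the real values come from your downloaded config.yaml.

```python
import re


def find_config_values(config_text, key):
    """Return every value assigned to `key` in a YAML-like text, similar to `grep key`."""
    pattern = re.compile(rf"^[ \t]*{re.escape(key)}:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)
    return [m.group(1) for m in pattern.finditer(config_text)]


# Hypothetical excerpt for illustration only.
sample = """\
bpemodel: /hypothetical/cache/data/token_list/bpe_unigram50000/bpe.model
normalize: global_mvn
normalize_conf:
    stats_file: /hypothetical/cache/exp/s2t_stats/train/feats_stats.npz
"""
print(find_config_values(sample, "stats_file"))
print(find_config_values(sample, "bpemodel"))
```

To run it on the real file, read exp/espnet/owsm_v2_ebranchformer/config.yaml into a string and pass it in place of sample.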

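Prepare data in OWSM format

The data should be prepared in the OWSM format. See 1. Data preparation for more details.

Because AISHELL-1 is already included in OWSM v1, we can reuse those preparation scripts. For your own data, please write the scripts yourself and make sure that the special tokens, such as language codes, are consistent with the pre-trained model. Note that we do not generate a new vocabulary for fine-tuning. Instead, we directly use the vocabulary from the pre-trained model.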

cd local/
ln -s ../../../mixed_v1/s2t1/local/utils.py ./
ln -s ../../../mixed_v1/s2t1/local/prepare_aishell.* ./
cd ..

# modify data_dir and execute:
./local/prepare_aishell.sh

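The prepared data will be stored in a new directory named data.

Next, we execute various stages in s2t.sh. For simplicity, we create a run.sh as shown below, which is mostly copied from the OWSM v2 recipe.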

#!/usr/bin/env bash
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail

train_set=AISHELL-1/train
valid_set=AISHELL-1/dev
test_sets="AISHELL-1/dev"

nbpe=50000  # this should be consistent with the pre-trained model
s2t_config=conf/train_s2t_ebf_conv2d_size1024_e12_d12.yaml
inference_config=conf/decode_s2t.yaml

# inference only args
# --cleaner whisper_basic --hyp_cleaner whisper_basic
./s2t.sh \
    --stage 3 \
    --stop_stage 4 \
    --use_lm false \
    --num_nodes 1 \
    --ngpu 4 \
    --nj 32 \
    --gpu_inference true \
    --inference_nj 4 \
    --num_splits_s2t 1 \
    --feats_type raw \
    --audio_format flac.ark \
    --token_type bpe \
    --nbpe ${nbpe} \
    --bpe_input_sentence_size 10000000 \
    --s2t_config "${s2t_config}" \
    --inference_config "${inference_config}" \
    --train_set "${train_set}" \
    --valid_set "${valid_set}" \
    --test_sets "${test_sets}" \
    --bpe_train_text "dump/raw/${train_set}/text" \
    --bpe_nlsyms data/nlsyms.txt \
    --lm_train_text "dump/raw/${train_set}/text" "$@"

We run stages 3 and 4 to format the data:

./run.sh --stage 3 --stop_stage 4

We create the BPE token directory ourselves. This corresponds to stage 5, but we do not generate a new token list.

mkdir -p data/token_list/bpe_unigram50000
cp path_to_bpe_model data/token_list/bpe_unigram50000 # path_to_bpe_model is in config.yaml

# we extract the token list
python -c "import yaml; config = yaml.safe_load(open('exp/espnet/owsm_v2_ebranchformer/config.yaml', 'r')); open('data/token_list/bpe_unigram50000/tokens.txt', 'w').write('\n'.join(config['token_list']))"

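Fine-tune the model

We create a training config for fine-tuning. It is modified from the original config in config.yaml. Note that you may need to tune the training hyperparameters, such as the learning rate, because the model can easily overfit a small training set.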

preprocessor: s2t
preprocessor_conf:
    text_prev_name: text_prev
    text_ctc_name: text_ctc
    fs: 16000
    na_symbol: '<na>'
    speech_length: 30
    speech_resolution: 0.02
    speech_init_silence: 30
    text_prev_apply_prob: 0.0   # we do not use previous prompt
    time_apply_prob: 0.0    # we do not use any timestamp for fine-tuning
    notime_symbol: '<notimestamps>'
    first_time_symbol: '<0.00>'
    last_time_symbol: '<30.00>'

frontend_conf:
    n_fft: 512
    win_length: 400
    hop_length: 160

specaug: specaug
specaug_conf:
    apply_time_warp: false
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 10

encoder: e_branchformer
encoder_conf:
    output_size: 1024
    attention_heads: 16
    attention_layer_type: selfattn
    pos_enc_layer_type: abs_pos
    rel_pos_type: latest
    cgmlp_linear_units: 4096
    cgmlp_conv_kernel: 31
    use_linear_after_conv: false
    gate_activation: identity
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    layer_drop_rate: 0.0
    linear_units: 4096
    positionwise_layer_type: linear
    use_ffn: true
    macaron_ffn: true
    merge_conv_kernel: 31

decoder: transformer
decoder_conf:
    attention_heads: 16
    linear_units: 4096
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    sym_na: '<na>'

# NOTE: you may need to tune these hyperparams
optim: adamw
optim_conf:
    lr: 1.0e-04
    betas:
    - 0.9
    - 0.98
    eps: 1.0e-06
    weight_decay: 0.0
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 5000

# NOTE: we are using 4 GPUs with 48GB memory
batch_type: unsorted
batch_size: 16
accum_grad: 4
max_epoch: 20
patience: none
init: none
best_model_criterion:
-   - valid
    - acc
    - max
-   - valid
    - total_count
    - max
keep_nbest_models: 5
use_amp: true
num_workers: 4
unused_parameters: false
seed: 2023
num_att_plot: 1

# fine-tune
init_param:
- exp/espnet/owsm_v2_ebranchformer/valid.total_count.ave_5best.till25epoch.pth
ignore_init_mismatch: false
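
When tuning lr and warmup_steps, it can help to see the learning-rate curve they imply. The sketch below assumes the warmuplr scheduler follows the usual inverse-square-root warmup rule (linear increase up to warmup_steps, then decay proportional to step^-0.5, peaking at lr); warmup_lr is a hypothetical helper written for this illustration, so please check it against the scheduler implementation in your ESPnet version.

```python
def warmup_lr(step, base_lr=1.0e-4, warmup_steps=5000):
    """Learning rate at a 1-based optimizer step under inverse-square-root warmup."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)


for step in (1, 2500, 5000, 20000):
    print(step, f"{warmup_lr(step):.3e}")
```

Under this rule the peak value equals lr and is reached exactly at warmup_steps, so lowering lr or shortening the warmup are the two direct knobs if the model overfits or diverges early.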

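We need to collect the shapes of the speech and text, but we skip collecting the mean and variance because we are using a pre-trained model. We run stage 10 with a smaller batch size: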

./run.sh --stage 10 --stop_stage 10 --feats_normalize utterance_mvn --s2t_args "--model_conf extract_feats_in_collect_stats=false --batch_size 5"

Then, we copy the existing mean and variance to the correct location:

cp path_to_train_stats exp/s2t_stats_raw_bpe50000/train/  # path_to_train_stats is in config.yaml

Now, we can start training:

./run.sh --stage 11 --stop_stage 11

Related work

@article{peng2023reproducing,
  title={Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data},
  author={Peng, Yifan and Tian, Jinchuan and Yan, Brian and Berrebbi, Dan and Chang, Xuankai and Li, Xinjian and Shi, Jiatong and Arora, Siddhant and Chen, William and Sharma, Roshan and others},
  journal={arXiv preprint arXiv:2309.13876},
  year={2023}
}
