Data Preparation

Generate Manifest

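DeepSpeech2 on PaddlePaddle accepts a text manifest file as its dataset interface. A manifest file summarizes a set of speech data: each line is a JSON object holding the metadata of one audio clip (e.g. file path, transcription, duration), for example: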

{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"}
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0007.flac", "duration": 4.275, "text": "a cold lucid indifference reigned in his soul"}

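To use your own data, you only need to generate a manifest file of this kind that summarizes the dataset. Given such a manifest, the training, inference and all other modules know how to access the audio files together with their metadata, including the transcription labels.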

For how to generate such manifest files, see examples/librispeech/local/librispeech.py, which downloads the data and generates the manifest files for the LibriSpeech dataset.
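
For custom data, a manifest could be produced along the following lines. This is only a minimal sketch and not the logic of librispeech.py: it assumes uncompressed WAV audio, so that the standard-library wave module can report the duration (FLAC files such as those in LibriSpeech would need a different reader), and the helper names wav_duration, write_manifest and read_manifest are hypothetical. Only the three JSON keys are taken from the manifest example above.

```python
import json
import wave


def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(samples, manifest_path):
    """Write one JSON object per line for each (audio_path, transcript) pair."""
    with open(manifest_path, "w", encoding="utf-8") as fout:
        for audio_path, text in samples:
            entry = {
                "audio_filepath": audio_path,
                "duration": wav_duration(audio_path),
                "text": text,
            }
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(manifest_path):
    """Parse a manifest back into a list of dicts, skipping blank lines."""
    with open(manifest_path, encoding="utf-8") as fin:
        return [json.loads(line) for line in fin if line.strip()]
```

Calling write_manifest with a list of (audio path, transcript) pairs yields a file in the one-object-per-line layout shown earlier, and read_manifest illustrates how a downstream module might load it again.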

Compute Mean and Standard Deviation for the Normalizer

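To apply z-score normalization (zero mean, unit standard deviation) to the audio features, the mean and standard deviation of the features must be estimated in advance from a sample of the training data: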

python3 utils/compute_mean_std.py \
--num_samples 2000 \
--spectrum_type linear \
--manifest_path examples/librispeech/data/manifest.train \
--output_path examples/librispeech/data/mean_std.npz

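This computes the mean and standard deviation of the power spectrum features over 2000 randomly sampled audio clips listed in examples/librispeech/data/manifest.train, and saves the results to examples/librispeech/data/mean_std.npz for later use.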
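
The idea behind the statistics can be illustrated with NumPy. The sketch below is not the code of utils/compute_mean_std.py: it uses random stand-in arrays instead of real spectrogram features, the function names are hypothetical, and the .npz key names mean and std are an assumption rather than something confirmed by the tool's output.

```python
import os
import tempfile

import numpy as np


def compute_mean_std(feature_list):
    """Estimate per-dimension mean and std over the frames of all clips.

    feature_list: list of arrays shaped (num_frames, feature_dim).
    """
    frames = np.concatenate(feature_list, axis=0)
    return frames.mean(axis=0), frames.std(axis=0)


def apply_zscore(features, mean, std, eps=1e-20):
    """Z-score: subtract the mean and divide by the std, per dimension."""
    return (features - mean) / (std + eps)


# Random stand-in data in place of real audio features (assumption).
rng = np.random.default_rng(0)
clips = [rng.normal(5.0, 3.0, size=(n, 4)) for n in (120, 80, 200)]

mean, std = compute_mean_std(clips)

# Round-trip through an .npz file; the key names are assumed.
npz_path = os.path.join(tempfile.mkdtemp(), "mean_std_sketch.npz")
np.savez(npz_path, mean=mean, std=std)
stats = np.load(npz_path)

normalized = apply_zscore(np.concatenate(clips), stats["mean"], stats["std"])
```

Estimating the statistics once on a sample and storing them means the same normalization can be reused for training and inference without rescanning the data.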

Build Vocabulary

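A vocabulary of possible characters is required to convert a transcription into a list of token indices for training and, during decoding, to convert a list of indices back into text. Such a character-based vocabulary can be built with utils/build_vocab.py: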

python3 utils/build_vocab.py \
--count_threshold 0 \
--vocab_path examples/librispeech/data/eng_vocab.txt \
--manifest_paths examples/librispeech/data/manifest.train

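This writes a vocabulary file, examples/librispeech/data/eng_vocab.txt, containing every character that appears in the transcriptions listed in examples/librispeech/data/manifest.train, without vocabulary truncation (--count_threshold 0).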
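
How a character vocabulary maps between text and indices can be sketched as follows. This is an illustration rather than the implementation of utils/build_vocab.py: the function names are hypothetical, the audio paths in the sample lines are placeholders, and it is assumed that characters whose count falls below the threshold are dropped.

```python
import json
from collections import Counter


def build_vocab(manifest_lines, count_threshold=0):
    """Collect characters from the "text" field, most frequent first."""
    counter = Counter()
    for line in manifest_lines:
        counter.update(json.loads(line)["text"])
    # Assumed rule: keep characters seen at least count_threshold times.
    return [ch for ch, count in counter.most_common() if count >= count_threshold]


def encode(text, vocab):
    """Map a transcript to a list of token indices."""
    index_of = {ch: i for i, ch in enumerate(vocab)}
    return [index_of[ch] for ch in text]


def decode(indices, vocab):
    """Map a list of token indices back to text."""
    return "".join(vocab[i] for i in indices)


# Two manifest lines with placeholder audio paths (assumption).
manifest_lines = [
    '{"audio_filepath": "a.flac", "duration": 3.275, '
    '"text": "stuff it into you his belly counselled him"}',
    '{"audio_filepath": "b.flac", "duration": 4.275, '
    '"text": "a cold lucid indifference reigned in his soul"}',
]

vocab = build_vocab(manifest_lines, count_threshold=0)
ids = encode("his soul", vocab)
restored = decode(ids, vocab)
```

With a threshold of 0 every character is kept, so any transcript from the manifest round-trips through encode and decode. In this sketch, a character missing from the vocabulary raises a KeyError in encode.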