Speech-to-Text Translation
This page is under construction.
Documentation for the ESPnet-ST-v2 project, to be presented at ACL 2023. This branch will be merged into the ESPnet main branch. More details can be found here.
To use this development version, clone this branch and then proceed with the normal ESPnet2 installation (Kaldi is not required):
git clone https://github.com/brianyan918/espnet-ml.git
cd espnet-ml
git checkout md_pr
Results
The tables below provide a performance snapshot of different core architectures for offline and simultaneous ST. Links for downloading our example models or for building your own models from scratch are also provided.
Here we're reporting MuST-C-v2 English-to-German results. These example models were trained using only the MuST-C-v2 ST corpus. No additional MT data was used. These models use ASR pre-training on the same data for faster convergence (ASR config).
Offline ST

| Model | BLEU | Model Link | Training Config | Decoding Config |
|---|---|---|---|---|
| Attentional Encoder-Decoder | 25.7 | link | link | |
| Multi-Decoder Attentional Encoder-Decoder | 27.6 | | | |
| CTC/Attention | 28.6 | link | link | |
| Multi-Decoder CTC/Attention | 28.8 | | | |
| Transducer | 27.6 | | | |
Simultaneous ST
| Model | BLEU | AL (s) | Model Link | Training Config | Decoding Config |
|---|---|---|---|---|---|
| Blockwise Attentional Encoder-Decoder | 22.8 | 3.23 | | | |
| Label-Synchronous Blockwise CTC/Attention | 24.4 | 3.23 | | | |
| Time-Synchronous Blockwise CTC/Attention | 24.6 | 2.34 | | | |
| Blockwise Transducer | 22.9 | 2.37 | | | |
Offline ST Models
Core Architectures
Attentional Encoder-Decoder

Why choose this model?
The attentional encoder-decoder is a commonly used model. Its relatively simple architecture makes it a solid foundation for trying out new auxiliary techniques, and a reasonable choice for getting started with ST.
Caveats
The most significant weakness to watch out for is the end-detection problem: autoregressive decoders rely on a length penalty/bonus hyperparameter to stabilize output lengths. This hyperparameter dependence carries a risk of over-tuning.
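To make this concrete, here is a minimal sketch (plain Python, not ESPnet's actual beam search) of how a length bonus typically enters hypothesis scoring; the `penalty` value is illustrative:

```python
def hypothesis_score(token_logps, penalty):
    """Score a partial beam-search hypothesis.

    token_logps: per-token log-probabilities of the hypothesis so far.
    penalty: length bonus (>0 favors longer outputs, <0 shorter ones).
    Without this term, summing log-probabilities alone biases the search
    toward premature end-of-sentence tokens -- the end-detection problem.
    """
    return sum(token_logps) + penalty * len(token_logps)

# A longer hypothesis only overtakes a shorter one if the bonus is tuned
# well for the corpus at hand -- hence the risk of over-tuning.
print(hypothesis_score([-0.5, -0.6], penalty=0.3))              # short
print(hypothesis_score([-0.5, -0.6, -0.7, -0.4], penalty=0.3))  # long
```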
CTC/Attention

Why choose this model?
The CTC/attention incorporates non-autoregressive hard alignment (CTC) and autoregressive soft alignment (attention) into a single model. CTC counteracts several weaknesses of its attentional counterpart via joint training/decoding (more details). Notably, the CTC/attention alleviates the end-detection problem of the pure attention approach. Compared to the attentional encoder-decoder, CTC/attention produces superior translation quality.
Caveats
Joint training/decoding incurs additional computational cost. Empirically, CTC/attention is 10-20% slower than pure attention.
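The extra cost comes from scoring every candidate expansion with both scorers. A conceptual sketch of the joint score (the name `ctc_weight` follows the usual ESPnet convention, but this is not the actual decoding code):

```python
def joint_score(ctc_logp, attn_logp, ctc_weight=0.3):
    """Interpolate CTC and attention log-probabilities for one hypothesis.

    Every beam expansion must be evaluated by *both* the CTC prefix
    scorer and the attention decoder, rather than by attention alone.
    """
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * attn_logp
```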
Transducer

Why choose this model?
The transducer is an autoregressive hard-alignment model. Unlike models with attentional decoders, transducer models typically use a shallow LSTM decoder. Notably, the transducer's decoder avoids the quadratic computational complexity of its attentional counterpart -- inference is significantly faster.
Caveats
Translation quality lags behind that of CTC/attention due to the transducer's low-capacity decoder. Further, the transducer's loss function must marginalize over all possible alignment paths -- this makes training relatively slow. We also found that transducers are more difficult to train to convergence, likely due to the monotonic property of this framework. We address this with a hierarchical encoding scheme (described in Section 4.1 of this paper) that encourages the encoder to take on the burden of re-ordering.
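For reference, the marginalization mentioned above is exactly what a transducer loss computes. A minimal sketch using torchaudio's `rnnt_loss` (an illustration with random tensors, not the training code from this branch):

```python
import torch
from torchaudio.functional import rnnt_loss

B, T, U, V = 2, 50, 10, 500  # batch, frames, target length, vocab size

# Joint-network outputs over every (frame, target-position) pair.
logits = torch.randn(B, T, U + 1, V, requires_grad=True)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

# The loss sums over all monotonic alignments of targets to frames,
# which is why each training step is relatively expensive.
loss = rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
loss.backward()
```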
Multi-Decoder

Why choose this model?
The multi-decoder is an end-to-end differentiable cascade, consisting of an ASR subnet and an MT subnet. This approach inherits several strengths from cascaded approaches, the most prominent of which is the ability to perform search/retrieval over intermediate ASR representations (more details). The translation quality of the multi-decoder is greater than that of the attentional encoder-decoder. The multi-decoder approach can also be applied to CTC/attention models -- this combination results in the strongest performance amongst our example models.
Caveats
Multi-decoder inference involves two consecutive beam searches, one for the ASR subnet and another for the MT subnet. The model size is also larger than that of a single encoder-decoder (although the trainable parameter counts are similar if both approaches use ASR multi-tasking).
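A conceptual sketch of the two-pass data flow (the `beam_search`, `asr_subnet`, and `mt_subnet` names are hypothetical stand-ins for the actual ESPnet components):

```python
def multi_decoder_translate(speech, asr_subnet, mt_subnet, beam_search):
    """Two-pass inference: an ASR beam search followed by an MT beam search."""
    # First search: speech -> intermediate ASR hypothesis.
    asr_hyp = beam_search(asr_subnet.encode(speech), asr_subnet.decoder)
    # The MT subnet consumes the ASR decoder's hidden states rather than
    # raw text, which keeps the cascade end-to-end differentiable.
    mt_input = asr_subnet.decoder_states(asr_hyp)
    # Second search: intermediate representation -> translation.
    return beam_search(mt_input, mt_subnet.decoder)
```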
Auxiliary Techniques
ASR Pre-training

Initializing encoder parameters from an ASR model is a convenient way to speed up the convergence of ST training, shortening the experimental cycle. Models with ASR pre-training also typically perform better.
An example of ASR pre-training can be found in this config.
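At its core, this initialization is a filtered state-dict load. A minimal sketch (the checkpoint path is hypothetical, and it assumes the ASR and ST models name their encoder submodule `encoder` with matching shapes, as ESPnet2 models do):

```python
import torch
from torch import nn

def init_encoder_from_asr(st_model: nn.Module, asr_ckpt: str) -> None:
    """Copy encoder.* parameters from an ASR checkpoint into an ST model."""
    asr_state = torch.load(asr_ckpt, map_location="cpu")
    encoder_params = {k: v for k, v in asr_state.items()
                      if k.startswith("encoder.")}
    # strict=False leaves the decoder and other modules freshly initialized.
    st_model.load_state_dict(encoder_params, strict=False)

# Usage (hypothetical path):
# init_encoder_from_asr(st_model, "exp/asr_train/valid.acc.ave.pth")
```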
SSL Front-end/Encoder

Self-supervised learning representations can be used as front-end features or for encoder initialization. The former approach is generally less computationally demanding, since SSL models can be very large.
An example using an SSL front-end can be found in this config. For more information on using SSL representations, see here.
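As an illustration of the front-end variant (using torchaudio's wav2vec 2.0 bundle here rather than ESPnet's actual S3PRL front-end wiring), the SSL model simply replaces filterbank extraction:

```python
import torch
import torchaudio

# A frozen pre-trained SSL model used as a feature extractor.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_model = bundle.get_model().eval()

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
with torch.no_grad():
    features, _ = ssl_model.extract_features(waveform)

# `features` is a list of per-layer representations; a learned weighted
# sum over layers typically feeds the ST encoder in place of filterbanks.
print(len(features), features[-1].shape)
```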
LLM Decoder

Initializing decoder parameters from a pre-trained large language model can significantly boost performance. This typically increases model size considerably, but fewer iterations are required for convergence.
An example using pre-trained LLM initialization can be found in this config.
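The decoder-side analogue is again a partial parameter load. A sketch using a Hugging Face mBART checkpoint as the source of pre-trained weights (the key remapping onto an ESPnet decoder is model-specific and only hinted at here):

```python
from transformers import MBartForConditionalGeneration

# Hypothetical choice of pre-trained model.
mbart = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
decoder_state = mbart.model.decoder.state_dict()

# In practice each key must be renamed to match the ST decoder's parameter
# names before calling st_decoder.load_state_dict(..., strict=False).
n_params = sum(v.numel() for v in decoder_state.values())
print(f"{len(decoder_state)} tensors, {n_params / 1e6:.0f}M decoder parameters")
```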
Hierarchical Encoding

Building deeper, more sophisticated encoders can improve ST performance. We have found that hierarchical encoding, where initial layers are trained towards an ASR CTC objective and final layers are trained towards an ST CTC objective, encourages the encoder to take on more of the input-to-output re-ordering required for translation.
An example of hierarchical encoding can be found in this config.
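A conceptual sketch of the hierarchical objective (a hypothetical module, not the actual ESPnet code; `inter_hs` and `final_hs` stand for intermediate-layer and final-layer encoder states):

```python
import torch
from torch import nn

class HierarchicalCTC(nn.Module):
    """ASR CTC on an intermediate encoder layer, ST CTC on the final layer."""

    def __init__(self, dim, asr_vocab, st_vocab, asr_weight=0.3):
        super().__init__()
        self.asr_head = nn.Linear(dim, asr_vocab)
        self.st_head = nn.Linear(dim, st_vocab)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.asr_weight = asr_weight

    def forward(self, inter_hs, final_hs, hs_lens,
                asr_tgt, asr_tgt_lens, st_tgt, st_tgt_lens):
        # (batch, time, dim) -> (time, batch, vocab) log-probs for CTCLoss.
        asr_lp = self.asr_head(inter_hs).log_softmax(-1).transpose(0, 1)
        st_lp = self.st_head(final_hs).log_softmax(-1).transpose(0, 1)
        asr_loss = self.ctc(asr_lp, asr_tgt, hs_lens, asr_tgt_lens)
        st_loss = self.ctc(st_lp, st_tgt, hs_lens, st_tgt_lens)
        # Layers below the ASR loss learn recognition; layers above are
        # free to re-order towards the translation output.
        return self.asr_weight * asr_loss + (1 - self.asr_weight) * st_loss
```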
Code Components

The three main code components can be found in espnet2/st/espnet_model.py, espnet2/tasks/st.py, and espnet2/bin/st_inference.py.
espnet2/st/espnet_model.py defines the ST model's initialization, forward pass, and loss functions.
espnet2/tasks/st.py is the task wrapper that handles the data loaders, training loop, and so on.
espnet2/bin/st_inference.py defines the Speech2Text API, which handles inference.
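A sketch of typical usage of the Speech2Text API (argument names follow the espnet2 inference convention and the paths are hypothetical; check st_inference.py on this branch for the exact signature):

```python
import soundfile
from espnet2.bin.st_inference import Speech2Text

speech2text = Speech2Text(
    st_train_config="exp/st_train/config.yaml",    # hypothetical paths
    st_model_file="exp/st_train/valid.acc.ave.pth",
    beam_size=10,
)

speech, rate = soundfile.read("example.wav")  # 16 kHz mono audio
nbest = speech2text(speech)
text, tokens, token_ids, hyp = nbest[0]       # best hypothesis first
print(text)
```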
Tips for Modifying the Code
If you are developing a new feature and want to debug new training/inference logic, you can run the Python command directly (bypassing the recipe scripts) with a Python debugger such as pdb.

After running a training or inference stage, ESPnet generates a log file. At the top of these log files you will find the corresponding Python command. Note: you may need to set --multiprocessing_distributed False when debugging.
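For example, if the log shows that training was launched via espnet2.bin.st_train, you can re-run it under the debugger by prepending pdb (an illustrative command; copy the actual arguments from your own log file):

python -m pdb -m espnet2.bin.st_train --config conf/train_st.yaml --multiprocessing_distributed false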
