speechbrain.lobes.models.transformer.Conformer module

Conformer implementation.

Authors

Jianyuan Zhong 2020
Samuele Cornell 2021
Sylvain de Langen 2023

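Summary

Classes:

`ConformerDecoder`	This class implements the Conformer decoder.
`ConformerDecoderLayer`	This is an implementation of the Conformer decoder layer.
`ConformerEncoder`	This class implements the Conformer encoder.
`ConformerEncoderLayer`	This is an implementation of the Conformer encoder layer.
`ConformerEncoderLayerStreamingContext`	Streaming metadata and state for a `ConformerEncoderLayer`.
`ConformerEncoderStreamingContext`	Streaming metadata and state for a `ConformerEncoder`.
`ConvolutionModule`	This is an implementation of the convolution module in Conformer.

Reference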

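class speechbrain.lobes.models.transformer.Conformer.ConformerEncoderLayerStreamingContext(mha_left_context_size: int, mha_left_context: Tensor | None = None, dcconv_left_context: Tensor | None = None)

Bases: object

Streaming metadata and state for a ConformerEncoderLayer.

The multi-head attention and the Dynamic Chunk Convolution both need to save some left context, which is inserted as left padding.

See the ConvolutionModule documentation for more details.

mha_left_context_size: int: For this layer, specifies how many frames of input should be saved. Usually the same value is used for all layers, but this can be changed.

mha_left_context: Tensor | None = None: Left context to insert at the left of the current chunk as input to the multi-head attention. It can be None (when processing the first chunk) or hold at most mha_left_context_size frames, because during the first few chunks there may not be enough left context to fill it.

dcconv_left_context: Tensor | None = None

Left context to insert at the left of the convolution, following the Dynamic Chunk Convolution method.

Unlike mha_left_context, the number of frames kept here is fixed and is inferred from the kernel size of the convolution module.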
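
The bookkeeping described above for mha_left_context can be sketched in plain Python. This is a minimal illustration, not SpeechBrain's implementation: `update_left_context` is a hypothetical helper, and frames are represented as integers rather than tensor rows. It shows why the saved context is shorter than mha_left_context_size during the first chunks and capped afterwards.

```python
from typing import List, Optional


def update_left_context(
    saved: Optional[List[int]], chunk: List[int], max_frames: int
) -> List[int]:
    """Return the frames to keep as left context for the next chunk.

    Hypothetical helper for illustration only; `saved` is None before the
    first chunk, mirroring the documented default of mha_left_context.
    """
    if max_frames == 0:
        return []
    history = (saved or []) + chunk
    # Keep only the most recent `max_frames` frames.
    return history[-max_frames:]


context = None  # first chunk: nothing has been saved yet
sizes = []
for start in range(0, 12, 4):  # three chunks of 4 frames each
    chunk = list(range(start, start + 4))
    context = update_left_context(context, chunk, max_frames=6)
    sizes.append(len(context))

print(sizes)    # [4, 6, 6]
print(context)  # [6, 7, 8, 9, 10, 11]
```

After the first chunk only 4 frames are available, so the buffer is not yet full; from the second chunk on it stays at the configured maximum of 6.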

class speechbrain.lobes.models.transformer.Conformer.ConformerEncoderStreamingContext(dynchunktrain_config: DynChunkTrainConfig, layers: List[ConformerEncoderLayerStreamingContext])

Bases: object

Streaming metadata and state for a ConformerEncoder.

dynchunktrain_config: DynChunkTrainConfig: Dynamic Chunk Training configuration holding the chunk size and context size information.

layers: List[ConformerEncoderLayerStreamingContext]: Streaming metadata and state for each layer of the encoder.

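class speechbrain.lobes.models.transformer.Conformer.ConvolutionModule(input_size, kernel_size=31, bias=True, activation=<class 'speechbrain.nnet.activations.Swish'>, dropout=0.0, causal=False, dilation=1)

Bases: Module

This is an implementation of the convolution module in Conformer.

Parameters:

input_size (int) – The expected size of the input embedding dimension.
kernel_size (int, optional) – Kernel size of the non-bottleneck convolutional layer.
bias (bool, optional) – Whether to use a bias in the non-bottleneck convolutional layer.
activation (torch.nn.Module) – Activation function used after the non-bottleneck convolutional layer.
dropout (float, optional) – Dropout rate.
causal (bool, optional) – Whether the convolution should be causal.
dilation (int, optional) – Dilation factor for the non-bottleneck convolutional layer.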

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = ConvolutionModule(512, 3)
>>> output = net(x)
>>> output.shape
torch.Size([8, 60, 512])

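forward(x: Tensor, mask: Tensor | None = None, dynchunktrain_config: DynChunkTrainConfig | None = None)

Applies the convolution to an input tensor x.

Parameters:

x (torch.Tensor) – Input tensor to the convolution module.
mask (torch.Tensor, optional) – Mask to be applied over the output of the convolution using masked_fill_, if specified.
dynchunktrain_config (DynChunkTrainConfig, optional) – If specified, makes the module support Dynamic Chunk Convolution (DCConv) as implemented in "Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR". This allows future frames to be masked while preserving better accuracy than a fully causal convolution, at a small speed cost. It should only be used for training (or, if you know what you are doing, for masked evaluation at inference time), because the forward_streaming functions should be used at inference time.

Returns:

out – The output tensor.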

Return type:

torch.Tensor
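
To make the idea of masking future frames per chunk more concrete, here is a small plain-Python sketch. It is a simplified reading of the description above, not SpeechBrain code: `visible_frames` is a hypothetical helper that assumes a centered kernel with odd size and assumes that a frame may look at earlier chunks but never past the end of its own chunk.

```python
def visible_frames(
    t: int, kernel_size: int, chunk_size: int, num_frames: int
) -> range:
    """Input frames a centered convolution at output frame `t` may read
    when everything after the end of the current chunk is masked.

    Hypothetical helper for illustration; assumes an odd kernel size.
    """
    half = (kernel_size - 1) // 2
    # Exclusive index of the first frame belonging to the next chunk.
    chunk_end = (t // chunk_size + 1) * chunk_size
    left = max(0, t - half)
    right = min(t + half + 1, chunk_end, num_frames)  # exclusive
    return range(left, right)


# Kernel of 5 frames, chunks of 4 frames, 12 frames in total.
print(list(visible_frames(5, 5, 4, 12)))  # [3, 4, 5, 6, 7]
print(list(visible_frames(7, 5, 4, 12)))  # [5, 6, 7]
print(list(visible_frames(8, 5, 4, 12)))  # [6, 7, 8, 9, 10]
```

Frame 7 is the last frame of its chunk, so its right-hand context is cut off, while it still sees frames 5 and 6. A fully causal convolution would instead cut the right-hand context for every frame.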

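class speechbrain.lobes.models.transformer.Conformer.ConformerEncoderLayer(d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL')

Bases: Module

This is an implementation of the Conformer encoder layer.

Parameters:

d_model (int) – The expected size of the input embedding.
d_ffn (int) – Hidden size of the self-attention feed-forward layer.
nhead (int) – Number of attention heads.
kernel_size (int, optional) – Kernel size of the convolution module.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
activation (torch.nn.Module) – Activation function used in each Conformer layer.
bias (bool, optional) – Whether to use a bias in the convolution module.
dropout (float, optional) – Dropout rate for the encoder.
causal (bool, optional) – Whether the convolutions should be causal.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular multi-head attention.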

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> output = net(x, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])

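forward(x, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor = None, dynchunktrain_config: DynChunkTrainConfig | None = None)

Parameters:

x (torch.Tensor) – The input sequence to the encoder layer.
src_mask (torch.Tensor, optional) – The mask for the source sequence.
src_key_padding_mask (torch.Tensor, optional) – The mask for the source keys, per batch.
pos_embs (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the positional embeddings of the input sequence.
dynchunktrain_config (Optional[DynChunkTrainConfig]) – Dynamic Chunk Training configuration object for streaming, used here specifically to apply Dynamic Chunk Convolution to the convolution module.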

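forward_streaming(x, context: ConformerEncoderLayerStreamingContext, pos_embs: Tensor = None)

Conformer layer streaming forward pass (typically for models trained with Dynamic Chunk Training), to be used at inference time. Relies on a mutable context object, as initialized by make_streaming_context, that should be reused across chunks. Invoked by ConformerEncoder.forward_streaming.

Parameters:

x (torch.Tensor) – Input tensor for this layer. Batching is supported as long as the context is kept consistent.
context (ConformerEncoderLayerStreamingContext) – Mutable streaming context; the same object should be passed across calls.
pos_embs (torch.Tensor, optional) – Positional embeddings, if used.

Returns:

x (torch.Tensor) – Output tensor.
self_attn (list) – List of self-attention values.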

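make_streaming_context(mha_left_context_size: int)

Creates a blank streaming context for this encoder layer.

Parameters: mha_left_context_size (int) – How many left frames should be saved and used as left context for the current chunk when streaming.
Return type: ConformerEncoderLayerStreamingContext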

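class speechbrain.lobes.models.transformer.Conformer.ConformerEncoder(num_layers, d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL', output_hidden_states=False, layerdrop_prob=0.0)

Bases: Module

This class implements the Conformer encoder.

Parameters:

num_layers (int) – Number of layers.
d_model (int) – Embedding dimension size.
d_ffn (int) – Hidden size of the self-attention feed-forward layer.
nhead (int) – Number of attention heads.
kernel_size (int, optional) – Kernel size of the convolution module.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
activation (torch.nn.Module) – Activation function used in each Conformer layer.
bias (bool, optional) – Whether to use a bias in the convolution module.
dropout (float, optional) – Dropout rate for the encoder.
causal (bool, optional) – Whether the convolutions should be causal.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular multi-head attention.
output_hidden_states (bool, optional) – Whether the model should output the hidden states as a list of tensors.
layerdrop_prob (float) – The probability of dropping an entire layer.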

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_emb = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoder(1, 512, 512, 8)
>>> output, _ = net(x, pos_embs=pos_emb)
>>> output.shape
torch.Size([8, 60, 512])

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConformerEncoder
>>> x = torch.rand((8, 60, 512)); pos_emb = torch.rand((1, 2*60-1, 512));
>>> net = ConformerEncoder(4, 512, 512, 8, output_hidden_states=True)
>>> output, _, hs = net(x, pos_embs=pos_emb)
>>> hs[0].shape
torch.Size([8, 60, 512])
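
The layerdrop_prob parameter above is described as the probability of dropping an entire layer. The following plain-Python sketch illustrates that idea only; `layers_to_run` is a hypothetical helper, and the assumption that layers are dropped only in training mode, with an independent draw per layer, is ours rather than something stated on this page.

```python
import random
from typing import List


def layers_to_run(
    num_layers: int, layerdrop_prob: float, training: bool, rng: random.Random
) -> List[int]:
    """Indices of the layers kept for one forward pass.

    Hypothetical helper for illustration; each layer is dropped
    independently with probability `layerdrop_prob` (assumption).
    """
    if not training or layerdrop_prob == 0.0:
        return list(range(num_layers))
    return [i for i in range(num_layers) if rng.random() >= layerdrop_prob]


rng = random.Random(0)
# At evaluation time every layer runs, whatever the probability.
print(layers_to_run(4, 0.5, training=False, rng=rng))  # [0, 1, 2, 3]
# With probability 1.0 every layer is skipped during training.
print(layers_to_run(4, 1.0, training=True, rng=rng))   # []
# With probability 0.5 a random subset of the layers runs.
print(layers_to_run(12, 0.5, training=True, rng=rng))
```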

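forward(src, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor | None = None, dynchunktrain_config: DynChunkTrainConfig | None = None)

Parameters:

src (torch.Tensor) – The input sequence to the encoder.
src_mask (torch.Tensor, optional) – The mask for the source sequence.
src_key_padding_mask (torch.Tensor, optional) – The mask for the source keys, per batch.
pos_embs (torch.Tensor, torch.nn.Module) – Module or tensor containing the positional embeddings of the input sequence. If custom pos_embs are given, they need to have the shape (1, 2*S-1, E), where S is the sequence length and E is the embedding dimension.
dynchunktrain_config (Optional[DynChunkTrainConfig]) – Dynamic Chunk Training configuration object for streaming, used here specifically to apply Dynamic Chunk Convolution to the convolution module.

Returns:

output (torch.Tensor) – The output of the Conformer.
attention_lst (list) – The attention values.
hidden_state_lst (list, optional) – The outputs of the hidden layers of the encoder. Only returned if output_hidden_states is set to True.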
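
The 2*S-1 in the pos_embs shape above matches the number of distinct relative offsets between two positions of a length-S sequence. The sketch below only checks that count; `relative_offsets` is a hypothetical helper, and the ordering it uses (largest offset first) is an assumption for illustration, not a statement about how SpeechBrain lays out the embeddings.

```python
from typing import List


def relative_offsets(seq_len: int) -> List[int]:
    """All distinct offsets i - j for positions i, j in range(seq_len).

    Hypothetical helper for illustration; the ordering is an assumption.
    """
    return list(range(seq_len - 1, -seq_len, -1))


offsets = relative_offsets(4)
print(offsets)       # [3, 2, 1, 0, -1, -2, -3]
print(len(offsets))  # 7, i.e. 2*4 - 1
```

For the S = 60 used in the examples on this page, the same count gives 2*60-1 = 119 positions.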

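forward_streaming(src: Tensor, context: ConformerEncoderStreamingContext, pos_embs: Tensor | None = None)

Conformer streaming forward pass (typically for models trained with Dynamic Chunk Training), to be used at inference time. Relies on a mutable context object, as initialized by make_streaming_context, that should be reused across chunks.

Parameters:

src (torch.Tensor) – Input tensor. Batching is supported as long as the context is kept consistent.
context (ConformerEncoderStreamingContext) – Mutable streaming context; the same object should be passed across calls.
pos_embs (torch.Tensor, optional) – Positional embeddings, if used.

Returns:

output (torch.Tensor) – The output of the streaming Conformer.
attention_lst (list) – The attention values.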
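
The calling pattern described above (create the context once, then pass the same object for every chunk) can be sketched without SpeechBrain. Everything below is a toy stand-in: `toy_forward_streaming`, the dictionary-based context and its keys are hypothetical, and the "output" is simply the number of left-context frames that were available for each frame.

```python
from typing import Dict, List


def split_into_chunks(frames: List[int], chunk_size: int) -> List[List[int]]:
    """Cut a frame sequence into consecutive chunks (the last may be short)."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def toy_forward_streaming(chunk: List[int], context: Dict) -> List[int]:
    """Hypothetical stand-in for a streaming forward call.

    Reports how much left context each frame could use, then updates the
    context in place so the next call sees the new history.
    """
    seen = len(context["left"])
    context["left"] = (context["left"] + chunk)[-context["max_left"]:]
    return [seen] * len(chunk)


# The context is created once and mutated by every call.
context = {"left": [], "max_left": 4}
outputs = []
for chunk in split_into_chunks(list(range(10)), chunk_size=4):
    outputs.extend(toy_forward_streaming(chunk, context))

print(outputs)  # [0, 0, 0, 0, 4, 4, 4, 4, 4, 4]
```

Creating a fresh context inside the loop would reset the history on every chunk, and every frame would report 0 frames of left context.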

make_streaming_context(dynchunktrain_config: DynChunkTrainConfig)

Creates a blank streaming context for the encoder.

Parameters: dynchunktrain_config (DynChunkTrainConfig) – Dynamic Chunk Training configuration object for streaming.
Return type: ConformerEncoderStreamingContext

class speechbrain.lobes.models.transformer.Conformer.ConformerDecoderLayer(d_model, d_ffn, nhead, kernel_size, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=True, attention_type='RelPosMHAXL')

Bases: Module

This is an implementation of the Conformer decoder layer.

Parameters:

d_model (int) – The expected size of the input embedding.
d_ffn (int) – Hidden size of the self-attention feed-forward layer.
nhead (int) – Number of attention heads.
kernel_size (int) – Kernel size of the convolution module.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
activation (torch.nn.Module, optional) – Activation function used in each Conformer layer.
bias (bool, optional) – Whether to use a bias in the convolution module.
dropout (float, optional) – Dropout rate.
causal (bool, optional) – Whether the convolutions should be causal.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular multi-head attention.

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> output = net(x, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)

Parameters:

tgt (torch.Tensor) – The input sequence to the decoder layer.
memory (torch.Tensor) – The sequence from the last layer of the encoder.
tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.
memory_mask (torch.Tensor, optional) – The mask for the memory sequence.
tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys, per batch.
memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys, per batch.
pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.
pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.

Returns:

x (torch.Tensor) – The output tensor.
self_attn (torch.Tensor)
self_attn (torch.Tensor) – The self-attention tensor.
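
A decoder's tgt_mask is commonly used to stop each target position from attending to later positions. The sketch below builds such a look-ahead mask in plain Python. `lookahead_mask` is a hypothetical helper, and the convention that True marks a blocked position is an assumption for illustration; the mask representation actually expected by this layer may differ (for example a float tensor), so check the attention module being used.

```python
from typing import List


def lookahead_mask(size: int) -> List[List[bool]]:
    """Square mask where entry [i][j] is True when position i must not
    attend to position j, i.e. when j lies in the future of i.

    Hypothetical helper; the True-means-blocked convention is an assumption.
    """
    return [[j > i for j in range(size)] for i in range(size)]


for row in lookahead_mask(3):
    print(row)
# [False, True, True]
# [False, False, True]
# [False, False, False]
```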

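class speechbrain.lobes.models.transformer.Conformer.ConformerDecoder(num_layers, nhead, d_ffn, d_model, kdim=None, vdim=None, dropout=0.0, activation=<class 'speechbrain.nnet.activations.Swish'>, kernel_size=3, bias=True, causal=True, attention_type='RelPosMHAXL')

Bases: Module

This class implements the Conformer decoder.

Parameters:

num_layers (int) – Number of layers.
nhead (int) – Number of attention heads.
d_ffn (int) – Hidden size of the self-attention feed-forward layer.
d_model (int) – Embedding dimension size.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
dropout (float, optional) – Dropout rate.
activation (torch.nn.Module, optional) – Activation function used after the non-bottleneck convolutional layer.
kernel_size (int, optional) – Kernel size of the convolutional layer.
bias (bool, optional) – Whether to use a bias in the convolution module.
causal (bool, optional) – Whether the convolutions should be causal.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular multi-head attention.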

Example

>>> import torch
>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = ConformerDecoder(1, 8, 1024, 512, attention_type="regularMHA")
>>> output, _, _ = net(tgt, src)
>>> output.shape
torch.Size([8, 60, 512])

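forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)

Parameters:

tgt (torch.Tensor) – The input sequence to the decoder.
memory (torch.Tensor) – The sequence from the last layer of the encoder.
tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.
memory_mask (torch.Tensor, optional) – The mask for the memory sequence.
tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys, per batch.
memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys, per batch.
pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.
pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.

Returns:

output (torch.Tensor) – The Conformer decoder output.
self_attns (list) – The self-attention values, one entry per layer.
multihead_attns (list) – The multi-head attention values, one entry per layer.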