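speechbrain.lobes.models.Tacotron2 module

Neural network modules for the Tacotron2 end-to-end neural Text-to-Speech (TTS) model

Authors * Georges Abous-Rjeili 2021 * Artem Ploujnikov 2021

Summary

Classes:

`Attention`	The Tacotron attention layer.
`ConvNorm`	A 1D convolution layer with Xavier initialization
`Decoder`	The Tacotron decoder
`Encoder`	The Tacotron2 encoder module, consisting of a sequence of 1-d convolution banks (3 by default) and a bidirectional LSTM
`LinearNorm`	A linear layer with Xavier initialization
`LocationLayer`	A location-based attention layer consisting of a Xavier-initialized convolutional layer followed by a dense layer
`Loss`	The Tacotron loss implementation
`LossStats`	alias of `TacotronLoss`
`Postnet`	The Tacotron postnet consists of a number of 1-d convolutional layers with Xavier initialization and a tanh activation, with batch normalization.
`Prenet`	The Tacotron pre-net module, consisting of a specified number of normalized (Xavier-initialized) linear layers
`Tacotron2`	The Tacotron2 text-to-speech model, based on the NVIDIA implementation.
`TextMelCollate`	Zero-pads model inputs and targets based on the number of frames per step

Functions:

`dynamic_range_compression`	Dynamic range compression for audio signals
`infer`	An inference hook for pretrained synthesizers
`mel_spectogram`	Calculates the mel spectrogram for a raw audio signal

Reference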

class speechbrain.lobes.models.Tacotron2.LinearNorm(in_dim, out_dim, bias=True, w_init_gain='linear')[source]

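Bases: Module

A linear layer with Xavier initialization

Parameters:

in_dim (int) – the input dimension
out_dim (int) – the output dimension
bias (bool) – whether or not to use a bias
w_init_gain (str) – the weight initialization gain type (see torch.nn.init.calculate_gain); defaults to 'linear'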
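
To make the role of w_init_gain concrete, here is a minimal sketch of how the gain scales the sampling range of Xavier initialization. It assumes the uniform variant (as in torch.nn.init.xavier_uniform_); the helper name xavier_uniform_bound and the GAINS table are illustrative and not part of SpeechBrain.

```python
import math

# Gains as returned by torch.nn.init.calculate_gain for a few nonlinearities.
GAINS = {"linear": 1.0, "tanh": 5.0 / 3.0, "relu": math.sqrt(2.0)}


def xavier_uniform_bound(fan_in, fan_out, w_init_gain="linear"):
    """Half-width a of the U(-a, a) range used by Xavier uniform init."""
    gain = GAINS[w_init_gain]
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


# A layer with in_dim=5 and out_dim=3 (hypothetical sizes for illustration).
bound_linear = xavier_uniform_bound(5, 3, "linear")
bound_tanh = xavier_uniform_bound(5, 3, "tanh")
print(round(bound_linear, 4), round(bound_tanh, 4))
```

A larger gain widens the range from which the initial weights are drawn, which is why layers followed by a tanh are usually initialized with the 'tanh' gain.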

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import LinearNorm
>>> layer = LinearNorm(in_dim=5, out_dim=3)
>>> x = torch.randn(3, 5)
>>> y = layer(x)
>>> y.shape
torch.Size([3, 3])

forward(x)[source]

Computes the forward pass

Parameters: x (torch.Tensor) – a (batch, features) input tensor
Returns: output – the linear layer output
Return type: torch.Tensor

class speechbrain.lobes.models.Tacotron2.ConvNorm(in_channels, out_channels, kernel_size=1, stride=1, padding=None, dilation=1, bias=True, w_init_gain='linear')[source]

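Bases: Module

A 1D convolution layer with Xavier initialization

Parameters:

in_channels (int) – the number of input channels
out_channels (int) – the number of output channels
kernel_size (int) – the kernel size
stride (int) – the convolutional stride
padding (int) – the amount of padding to include. If not provided, it will be calculated as dilation * (kernel_size - 1) / 2
dilation (int) – the dilation of the convolution
bias (bool) – whether or not to use a bias
w_init_gain (str) – the weight initialization gain type (see torch.nn.init.calculate_gain); defaults to 'linear'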
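
The default padding rule can be checked with a few lines of arithmetic. The sketch below is not SpeechBrain code: the helper names are hypothetical, and the output-length formula is the one given in the torch.nn.Conv1d documentation. It shows that, for odd kernel sizes and stride 1, the default padding keeps the sequence length unchanged.

```python
def default_padding(kernel_size, dilation=1):
    """Padding used when none is supplied: dilation * (kernel_size - 1) / 2."""
    return int(dilation * (kernel_size - 1) / 2)


def conv1d_output_length(length, kernel_size, stride=1, padding=None, dilation=1):
    """Output length of a 1D convolution (formula from the torch.nn.Conv1d docs)."""
    if padding is None:
        padding = default_padding(kernel_size, dilation)
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# kernel_size=3 on a length-5 input: the length is preserved.
out_len = conv1d_output_length(5, kernel_size=3)
print(default_padding(3), out_len)
```

The same rule applied to a kernel of size 31 on a length-64 input gives a padding of 15 and an output length of 64.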

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import ConvNorm
>>> layer = ConvNorm(in_channels=10, out_channels=5, kernel_size=3)
>>> x = torch.randn(3, 10, 5)
>>> y = layer(x)
>>> y.shape
torch.Size([3, 5, 5])

forward(signal)[source]

Computes the forward pass

Parameters: signal (torch.Tensor) – the input to the convolutional layer
Returns: output – the output of the convolutional layer
Return type: torch.Tensor

class speechbrain.lobes.models.Tacotron2.LocationLayer(attention_n_filters=32, attention_kernel_size=31, attention_dim=128)[source]

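Bases: Module

A location-based attention layer consisting of a Xavier-initialized convolutional layer followed by a dense layer

Parameters:

attention_n_filters (int) – the number of filters used in attention
attention_kernel_size (int) – the kernel size of the attention layer
attention_dim (int) – the dimension of the linear attention layers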

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import LocationLayer
>>> layer = LocationLayer()
>>> attention_weights_cat = torch.randn(3, 2, 64)
>>> processed_attention = layer(attention_weights_cat)
>>> processed_attention.shape
torch.Size([3, 64, 128])

forward(attention_weights_cat)[source]

Performs the forward pass for the attention layer

Parameters: attention_weights_cat (torch.Tensor) – the concatenated attention weights
Returns: processed_attention – the attention layer output
Return type: torch.Tensor

class speechbrain.lobes.models.Tacotron2.Attention(attention_rnn_dim=1024, embedding_dim=512, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31)[source]

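Bases: Module

The Tacotron attention layer. Location-based attention is used.

Parameters:

attention_rnn_dim (int) – the dimension of the RNN to which the attention layer is applied
embedding_dim (int) – the embedding dimension
attention_dim (int) – the dimension of the memory cell
attention_location_n_filters (int) – the number of location filters
attention_location_kernel_size (int) – the kernel size of the location layer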

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import (
... Attention)
>>> from speechbrain.lobes.models.transformer.Transformer import (
... get_mask_from_lengths)
>>> layer = Attention()
>>> attention_hidden_state = torch.randn(2, 1024)
>>> memory = torch.randn(2, 173, 512)
>>> processed_memory = torch.randn(2, 173, 128)
>>> attention_weights_cat = torch.randn(2, 2, 173)
>>> memory_lengths = torch.tensor([173, 91])
>>> mask = get_mask_from_lengths(memory_lengths)
>>> attention_context, attention_weights = layer(
...    attention_hidden_state,
...    memory,
...    processed_memory,
...    attention_weights_cat,
...    mask
... )
>>> attention_context.shape, attention_weights.shape
(torch.Size([2, 512]), torch.Size([2, 173]))

get_alignment_energies(query, processed_memory, attention_weights_cat)[source]

Computes the alignment energies

Parameters:

query (torch.Tensor) – decoder output (batch, n_mel_channels * n_frames_per_step)
processed_memory (torch.Tensor) – processed encoder outputs (B, T_in, attention_dim)
attention_weights_cat (torch.Tensor) – cumulative and previous attention weights (B, 2, max_time)

Returns:

alignment – the alignment energies (batch, max_time)

Return type:

torch.Tensor

forward(attention_hidden_state, memory, processed_memory, attention_weights_cat, mask)[source]

Computes the forward pass

Parameters:

attention_hidden_state (torch.Tensor) – the last output of the attention RNN
memory (torch.Tensor) – encoder outputs
processed_memory (torch.Tensor) – processed encoder outputs
attention_weights_cat (torch.Tensor) – previous and cumulative attention weights
mask (torch.Tensor) – binary mask for padded data

Returns:

result – an (attention_context, attention_weights) tuple

Return type:

tuple
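
The relationship between the alignment energies, the mask, and the two returned values can be sketched in plain Python for a single batch item. This is an illustration rather than the SpeechBrain implementation: it assumes that the mask is True at padded positions, that masked energies are set to negative infinity before a softmax, and that the context is the attention-weighted sum of the encoder outputs. The function name attend is hypothetical.

```python
import math


def attend(energies, memory, padded):
    """Masked softmax over encoder steps, then a weighted sum of memory.

    energies: list of T_in floats (alignment energies for one batch item)
    memory:   list of T_in vectors (encoder outputs)
    padded:   list of T_in bools, True where the step is padding
    """
    masked = [-math.inf if p else e for e, p in zip(energies, padded)]
    peak = max(masked)
    exps = [math.exp(e - peak) for e in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(memory[0])
    context = [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)]
    return context, weights


context, weights = attend(
    energies=[0.0, 0.0, 5.0],
    memory=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    padded=[False, False, True],
)
print(context, weights)
```

Although the third step has the largest energy, it is padding, so it receives zero weight and does not contribute to the context vector.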

class speechbrain.lobes.models.Tacotron2.Prenet(in_dim=80, sizes=[256, 256], dropout=0.5)[source]

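Bases: Module

The Tacotron pre-net module, consisting of a specified number of normalized (Xavier-initialized) linear layers

Parameters:

in_dim (int) – the input dimension
sizes (list) – the dimensions of the hidden layers/output, one entry per layer
dropout (float) – the dropout probability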

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Prenet
>>> layer = Prenet()
>>> x = torch.randn(862, 2, 80)
>>> output = layer(x)
>>> output.shape
torch.Size([862, 2, 256])

forward(x)[source]

Computes the forward pass for the prenet

Parameters: x (torch.Tensor) – the prenet inputs
Returns: output – the prenet output
Return type: torch.Tensor

class speechbrain.lobes.models.Tacotron2.Postnet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5)[source]

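Bases: Module

The Tacotron postnet consists of a number of 1-d convolutional layers with Xavier initialization and a tanh activation, with batch normalization. Depending on configuration, the postnet may either refine the MEL spectrogram or upsample it to a linear spectrogram.

Parameters:

n_mel_channels (int) – the number of MEL spectrogram channels
postnet_embedding_dim (int) – the postnet embedding dimension
postnet_kernel_size (int) – the kernel size of the convolutions within the postnet
postnet_n_convolutions (int) – the number of convolutions in the postnet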

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Postnet
>>> layer = Postnet()
>>> x = torch.randn(2, 80, 861)
>>> output = layer(x)
>>> output.shape
torch.Size([2, 80, 861])

forward(x)[source]

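Computes the forward pass of the postnet

Parameters: x (torch.Tensor) – the postnet input (usually a MEL spectrogram)
Returns: output – the postnet output (a refined MEL spectrogram or a linear spectrogram, depending on how the model is configured)
Return type: torch.Tensor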

class speechbrain.lobes.models.Tacotron2.Encoder(encoder_n_convolutions=3, encoder_embedding_dim=512, encoder_kernel_size=5)[source]

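Bases: Module

The Tacotron2 encoder module, consisting of a sequence of 1-d convolution banks (3 by default) and a bidirectional LSTM

Parameters:

encoder_n_convolutions (int) – the number of encoder convolutions
encoder_embedding_dim (int) – the dimension of the encoder embedding
encoder_kernel_size (int) – the kernel size of the 1-D convolutional layers within the encoder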

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Encoder
>>> layer = Encoder()
>>> x = torch.randn(2, 512, 128)
>>> input_lengths = torch.tensor([128, 83])
>>> outputs = layer(x, input_lengths)
>>> outputs.shape
torch.Size([2, 128, 512])

forward(x, input_lengths)[source]

Computes the encoder forward pass

Parameters:

x (torch.Tensor) – a batch of inputs (sequence embeddings)
input_lengths (torch.Tensor) – a tensor of input lengths

Returns:

outputs – the encoder output

Return type:

torch.Tensor

infer(x, input_lengths)[source]

Performs a forward step in the inference context

Parameters:

x (torch.Tensor) – a batch of inputs (sequence embeddings)
input_lengths (torch.Tensor) – a tensor of input lengths

Returns:

outputs – the encoder output

Return type:

torch.Tensor

class speechbrain.lobes.models.Tacotron2.Decoder(n_mel_channels=80, n_frames_per_step=1, encoder_embedding_dim=512, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31, attention_rnn_dim=1024, decoder_rnn_dim=1024, prenet_dim=256, max_decoder_steps=1000, gate_threshold=0.5, p_attention_dropout=0.1, p_decoder_dropout=0.1, early_stopping=True)[source]

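Bases: Module

The Tacotron decoder

Parameters:

n_mel_channels (int) – the number of channels in the MEL spectrogram
n_frames_per_step (int) – the number of spectrogram frames produced at each decoder time step
encoder_embedding_dim (int) – the dimension of the encoder embedding
attention_dim (int) – the size of the attention vector
attention_location_n_filters (int) – the number of filters in location-based attention
attention_location_kernel_size (int) – the kernel size of location-based attention
attention_rnn_dim (int) – the RNN dimension for the attention layer
decoder_rnn_dim (int) – the decoder RNN dimension
prenet_dim (int) – the dimension of the prenet (inner and output layers)
max_decoder_steps (int) – the maximum number of decoder steps for the longest utterance expected for the model
gate_threshold (float) – the fixed threshold to which the gate outputs of the decoder will be compared
p_attention_dropout (float) – the dropout probability for attention layers
p_decoder_dropout (float) – the dropout probability for decoder layers
early_stopping (bool) – whether to stop decoding early, in combination with gate_threshold, instead of always running for max_decoder_steps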

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Decoder
>>> layer = Decoder()
>>> memory = torch.randn(2, 173, 512)
>>> decoder_inputs = torch.randn(2, 80, 173)
>>> memory_lengths = torch.tensor([173, 91])
>>> mel_outputs, gate_outputs, alignments = layer(
...     memory, decoder_inputs, memory_lengths)
>>> mel_outputs.shape, gate_outputs.shape, alignments.shape
(torch.Size([2, 80, 173]), torch.Size([2, 173]), torch.Size([2, 173, 173]))

get_go_frame(memory)[source]

Gets an all-zeros frame to use as the first decoder input

Parameters: memory (torch.Tensor) – encoder outputs
Returns: decoder_input – an all-zeros frame
Return type: torch.Tensor

initialize_decoder_states(memory)[source]

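Initializes the attention RNN states, decoder RNN states, attention weights, cumulative attention weights and attention context, and stores the memory and the processed memory

Parameters:

memory (torch.Tensor) – encoder outputs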

Returns:

attention_hidden (torch.Tensor)
attention_cell (torch.Tensor)
decoder_hidden (torch.Tensor)
decoder_cell (torch.Tensor)
attention_weights (torch.Tensor)
attention_weights_cum (torch.Tensor)
attention_context (torch.Tensor)
processed_memory (torch.Tensor)

parse_decoder_inputs(decoder_inputs)[source]

Prepares the decoder inputs, i.e. the mel outputs

Parameters: decoder_inputs (torch.Tensor) – inputs used for teacher-forced training, i.e. mel spectrograms
Returns: decoder_inputs – the processed decoder inputs
Return type: torch.Tensor

parse_decoder_outputs(mel_outputs, gate_outputs, alignments)[source]

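Prepares the decoder outputs for output

Parameters:

mel_outputs (torch.Tensor) – MEL-scale spectrogram outputs
gate_outputs (torch.Tensor) – gate output energies
alignments (torch.Tensor) – the alignment tensor

Returns:

mel_outputs (torch.Tensor) – MEL-scale spectrogram outputs
gate_outputs (torch.Tensor) – gate output energies
alignments (torch.Tensor) – the alignment tensor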

decode(decoder_input, attention_hidden, attention_cell, decoder_hidden, decoder_cell, attention_weights, attention_weights_cum, attention_context, memory, processed_memory, mask)[source]

Decoder step using stored states, attention and memory

Parameters:

decoder_input (torch.Tensor) – the previous mel output
attention_hidden (torch.Tensor) – the hidden state of the attention module
attention_cell (torch.Tensor) – the attention cell state
decoder_hidden (torch.Tensor) – the decoder hidden state
decoder_cell (torch.Tensor) – the decoder cell state
attention_weights (torch.Tensor) – the attention weights
attention_weights_cum (torch.Tensor) – the cumulative attention weights
attention_context (torch.Tensor) – the attention context tensor
memory (torch.Tensor) – the memory tensor
processed_memory (torch.Tensor) – the processed memory tensor
mask (torch.Tensor)

Returns:

mel_output (torch.Tensor) – the MEL-scale output
gate_output (torch.Tensor) – gate output energies
attention_weights (torch.Tensor) – the attention weights

forward(memory, decoder_inputs, memory_lengths)[source]

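Decoder forward pass for training

Parameters:

memory (torch.Tensor) – encoder outputs
decoder_inputs (torch.Tensor) – decoder inputs for teacher forcing, i.e. mel spectrograms
memory_lengths (torch.Tensor) – encoder output lengths for attention masking

Returns:

mel_outputs (torch.Tensor) – mel outputs from the decoder
gate_outputs (torch.Tensor) – gate outputs from the decoder
alignments (torch.Tensor) – sequence of attention weights from the decoder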

infer(memory, memory_lengths)[source]

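Decoder inference

Parameters:

memory (torch.Tensor) – encoder outputs
memory_lengths (torch.Tensor) – the corresponding relative lengths of the inputs

Returns:

mel_outputs (torch.Tensor) – mel outputs from the decoder
gate_outputs (torch.Tensor) – gate outputs from the decoder
alignments (torch.Tensor) – sequence of attention weights from the decoder
mel_lengths (torch.Tensor) – the lengths of the MEL spectrograms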
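
How gate_threshold and max_decoder_steps bound the length of the generated spectrogram can be illustrated with a small stand-alone loop. This is a sketch under assumptions, not the decoder's actual code: it assumes the gate output is a logit passed through a sigmoid before comparison, and that the frame on which the gate fires is still emitted. The function name decode_until_gate and the list of logits are hypothetical.

```python
import math


def decode_until_gate(gate_logits, gate_threshold=0.5, max_decoder_steps=1000):
    """Count how many frames are emitted before the gate fires.

    gate_logits is a hypothetical stream of raw gate outputs, one per step.
    """
    n_frames = 0
    for step, logit in enumerate(gate_logits):
        if step >= max_decoder_steps:
            break  # hard cap on the number of decoder steps
        n_frames += 1
        stop_probability = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
        if stop_probability > gate_threshold:
            break  # the model signals the end of the utterance
    return n_frames


logits = [-4.0, -3.0, -1.0, 2.0, 3.0]
n_frames = decode_until_gate(logits)
n_frames_capped = decode_until_gate(logits, max_decoder_steps=2)
print(n_frames, n_frames_capped)
```

With the default threshold the loop stops at the fourth step, where the stop probability first exceeds 0.5; with max_decoder_steps=2 the cap is reached first.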

class speechbrain.lobes.models.Tacotron2.Tacotron2(mask_padding=True, n_mel_channels=80, n_symbols=148, symbols_embedding_dim=512, encoder_kernel_size=5, encoder_n_convolutions=3, encoder_embedding_dim=512, attention_rnn_dim=1024, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31, n_frames_per_step=1, decoder_rnn_dim=1024, prenet_dim=256, max_decoder_steps=1000, gate_threshold=0.5, p_attention_dropout=0.1, p_decoder_dropout=0.1, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, decoder_no_early_stopping=False)[source]

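Bases: Module

The Tacotron2 text-to-speech model, based on the NVIDIA implementation.

This class is the main entry point for the model. It is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers.

Simplified structure: input -> embedding -> encoder -> attention -> decoder (+ prenet) -> postnet -> output

The prenet takes the decoder output from the previous time step as its input. Its output is concatenated with the attention output and fed into the decoder.

Parameters:

mask_padding (bool) – whether or not to mask the padded outputs of Tacotron
n_mel_channels (int) – number of mel channels for constructing the spectrogram
n_symbols (int) – number of accepted character symbols, as defined in textToSequence
symbols_embedding_dim (int) – number of embedding dimensions for the symbols fed to nn.Embedding
encoder_kernel_size (int) – size of the kernel processing the embeddings
encoder_n_convolutions (int) – number of convolution layers in the encoder
encoder_embedding_dim (int) – number of kernels in the encoder; this is also the dimension of the bidirectional LSTM in the encoder
attention_rnn_dim (int) – dimension of the attention RNN
attention_dim (int) – number of hidden representations in attention
attention_location_n_filters (int) – number of 1-D convolution filters in attention
attention_location_kernel_size (int) – length of the 1-D convolution filters
n_frames_per_step (int=1) – only 1 generated mel frame per step is currently supported by the decoder
decoder_rnn_dim (int) – number of units in the 2 stacked unidirectional LSTMs
prenet_dim (int) – dimension of the linear prenet layers
max_decoder_steps (int) – maximum number of steps/frames the decoder generates before stopping
gate_threshold (float) – cut-off level; any gate output probability above it is treated as end of utterance and stops generation, which gives variable-length outputs
p_attention_dropout (float) – attention dropout probability
p_decoder_dropout (float) – decoder dropout probability
postnet_embedding_dim (int) – number of postnet filters
postnet_kernel_size (int) – 1-D size of the postnet kernel
postnet_n_convolutions (int) – number of convolution layers in the postnet
decoder_no_early_stopping (bool) – determines early stopping of the decoder, together with gate_threshold. The logical inverse of this value is passed to the decoder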

Example

>>> import torch
>>> _ = torch.manual_seed(213312)
>>> from speechbrain.lobes.models.Tacotron2 import Tacotron2
>>> model = Tacotron2(
...    mask_padding=True,
...    n_mel_channels=80,
...    n_symbols=148,
...    symbols_embedding_dim=512,
...    encoder_kernel_size=5,
...    encoder_n_convolutions=3,
...    encoder_embedding_dim=512,
...    attention_rnn_dim=1024,
...    attention_dim=128,
...    attention_location_n_filters=32,
...    attention_location_kernel_size=31,
...    n_frames_per_step=1,
...    decoder_rnn_dim=1024,
...    prenet_dim=256,
...    max_decoder_steps=32,
...    gate_threshold=0.5,
...    p_attention_dropout=0.1,
...    p_decoder_dropout=0.1,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    decoder_no_early_stopping=False
... )
>>> _ = model.eval()
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> outputs, output_lengths, alignments = model.infer(inputs, input_lengths)
>>> outputs.shape, output_lengths.shape, alignments.shape
(torch.Size([2, 80, 1]), torch.Size([2]), torch.Size([2, 1, 5]))

parse_output(outputs, output_lengths, alignments_dim=None)[source]

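Masks the padded part of the output

Parameters:

outputs (list) – a list of tensors - the raw outputs
output_lengths (torch.Tensor) – a tensor representing the lengths of all outputs
alignments_dim (int) – the desired dimension of the alignments along the last axis. Optional, but needed for data-parallel training

Returns:

mel_outputs (torch.Tensor)
mel_outputs_postnet (torch.Tensor)
gate_outputs (torch.Tensor)
alignments (torch.Tensor) – the original outputs - with the mask applied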

forward(inputs, alignments_dim=None)[source]

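Forward pass of the model for training

Parameters:

inputs (tuple) – the batch object
alignments_dim (int) – the desired dimension of the alignments along the last axis. Optional, but needed for data-parallel training

Returns:

mel_outputs (torch.Tensor) – mel outputs from the decoder
mel_outputs_postnet (torch.Tensor) – mel outputs from the postnet
gate_outputs (torch.Tensor) – gate outputs from the decoder
alignments (torch.Tensor) – sequence of attention weights from the decoder
output_lengths (torch.Tensor) – length of the output without padding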

infer(inputs, input_lengths)[source]

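Produces outputs

Parameters:

inputs (torch.Tensor) – text or phonemes converted to a tensor of symbol IDs
input_lengths (torch.Tensor) – the lengths of the input sequences

Returns:

mel_outputs_postnet (torch.Tensor) – the final mel output of Tacotron 2
mel_lengths (torch.Tensor) – the lengths of the mels
alignments (torch.Tensor) – sequence of attention weights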

speechbrain.lobes.models.Tacotron2.infer(model, text_sequences, input_lengths)[source]

An inference hook for pretrained synthesizers

Parameters:

model (Tacotron2) – the Tacotron model
text_sequences (torch.Tensor) – encoded text sequences
input_lengths (torch.Tensor) – input lengths

Returns:

result – (mel_outputs_postnet, mel_lengths, alignments) - the exact model output

Return type:

tuple

speechbrain.lobes.models.Tacotron2.LossStats: alias of TacotronLoss

class speechbrain.lobes.models.Tacotron2.Loss(guided_attention_sigma=None, gate_loss_weight=1.0, guided_attention_weight=1.0, guided_attention_scheduler=None, guided_attention_hard_stop=None)[source]

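Bases: Module

The Tacotron loss implementation

The loss consists of an MSE loss on the spectrogram, a BCE gate loss and a guided attention loss (if enabled) that attempts to make the attention matrix diagonal

The output of the module is a LossStats tuple, which includes both the total loss and the individual components

Parameters:

guided_attention_sigma (float) – the guided attention sigma factor, controlling the "width" of the mask
gate_loss_weight (float) – the constant by which the gate loss will be multiplied
guided_attention_weight (float) – the weight for the guided attention
guided_attention_scheduler (callable) – the scheduler class for the guided attention loss
guided_attention_hard_stop (int) – the number of epochs after which guided attention will be completely turned off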
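
The effect of guided_attention_sigma on the "width" of the mask can be seen in a small sketch. It assumes the commonly used guided attention penalty W[t][n] = 1 - exp(-(n/N - t/T)^2 / (2 * sigma^2)), where N is the input length and T the target length; whether SpeechBrain uses exactly this form is an assumption here, and the function name guided_attention_mask is hypothetical.

```python
import math


def guided_attention_mask(input_length, target_length, sigma=0.2):
    """Penalty matrix W[t][n] = 1 - exp(-(n/N - t/T)^2 / (2 * sigma^2))."""
    return [
        [
            1.0 - math.exp(
                -((n / input_length - t / target_length) ** 2) / (2.0 * sigma ** 2)
            )
            for n in range(input_length)
        ]
        for t in range(target_length)
    ]


mask = guided_attention_mask(input_length=4, target_length=8)
on_diagonal = mask[4][2]   # n/N == t/T, so no penalty
off_diagonal = mask[7][0]  # far from the diagonal, penalty close to 1
print(on_diagonal, round(off_diagonal, 4))
```

Multiplying the attention matrix by such a mask and averaging penalizes attention mass that lies far from the diagonal; a smaller sigma narrows the band that goes unpenalized.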

Example

>>> import torch
>>> _ = torch.manual_seed(42)
>>> from speechbrain.lobes.models.Tacotron2 import Loss
>>> loss = Loss(guided_attention_sigma=0.2)
>>> mel_target = torch.randn(2, 80, 861)
>>> gate_target = torch.randn(1722, 1)
>>> mel_out = torch.randn(2, 80, 861)
>>> mel_out_postnet = torch.randn(2, 80, 861)
>>> gate_out = torch.randn(2, 861)
>>> alignments = torch.randn(2, 861, 173)
>>> targets = mel_target, gate_target
>>> model_outputs = mel_out, mel_out_postnet, gate_out, alignments
>>> input_lengths = torch.tensor([173,  91])
>>> target_lengths = torch.tensor([861, 438])
>>> loss(model_outputs, targets, input_lengths, target_lengths, 1)
TacotronLoss(loss=tensor(4.8566), mel_loss=tensor(4.0097), gate_loss=tensor(0.8460), attn_loss=tensor(0.0010), attn_weight=tensor(1.))
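
As a quick sanity check on the example output above, the reported total appears to be the sum of the three reported components, up to rounding of the printed values. The snippet below only redoes that arithmetic with the printed numbers; it assumes the default gate_loss_weight of 1.0 and the reported attn_weight of 1, and makes no claim about how the components are computed internally.

```python
# Values copied from the TacotronLoss tuple printed in the example above.
mel_loss, gate_loss, attn_loss = 4.0097, 0.8460, 0.0010
reported_total = 4.8566

recomputed_total = mel_loss + gate_loss + attn_loss
# The components are printed to four decimals, so allow for rounding error.
difference = abs(recomputed_total - reported_total)
print(f"{recomputed_total:.4f} vs {reported_total:.4f}")
```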

forward(model_output, targets, input_lengths, target_lengths, epoch)[source]

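Computes the loss

Parameters:

model_output (tuple) – the output of the model's forward pass: (mel_outputs, mel_outputs_postnet, gate_outputs, alignments)
targets (tuple) – the targets
input_lengths (torch.Tensor) – a (batch, length) tensor of input lengths
target_lengths (torch.Tensor) – a (batch, length) tensor of target (spectrogram) lengths
epoch (int) – the current epoch number (used for the scheduling of the guided attention loss). A StepScheduler is typically used

Returns:

result – the total loss - and individual losses (mel and gate)

Return type:

LossStats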

get_attention_loss(alignments, input_lengths, target_lengths, epoch)[source]

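Computes the attention loss

Parameters:

alignments (torch.Tensor) – the alignment matrix from the model
input_lengths (torch.Tensor) – a (batch, length) tensor of input lengths
target_lengths (torch.Tensor) – a (batch, length) tensor of target (spectrogram) lengths
epoch (int) – the current epoch number (used for the scheduling of the guided attention loss). A StepScheduler is typically used

Returns:

attn_loss – the attention loss value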

Return type:

torch.Tensor

class speechbrain.lobes.models.Tacotron2.TextMelCollate(n_frames_per_step=1)[source]

Bases: object

Zero-pads model inputs and targets based on the number of frames per step

Parameters: n_frames_per_step (int) – the number of output frames per step

__call__(batch)[source]

Collates a training batch from normalized text and mel spectrograms

Parameters:

batch (list) – [text_normalized, mel_normalized]

Returns:

text_padded (torch.Tensor)
input_lengths (torch.Tensor)
mel_padded (torch.Tensor)
gate_padded (torch.Tensor)
output_lengths (torch.Tensor)
len_x (torch.Tensor)
labels (torch.Tensor)
wavs (torch.Tensor)
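
The padding of targets to a multiple of n_frames_per_step can be sketched without tensors. The following is an illustration under assumptions rather than the collate function itself: it assumes the padded target length is rounded up to the next multiple of n_frames_per_step, and that the gate target is 1 from the last real frame onwards, as in common Tacotron2 implementations. The function name pad_targets is hypothetical.

```python
def pad_targets(mel_lengths, n_frames_per_step=1):
    """Return the padded target length and per-item gate targets.

    mel_lengths: number of spectrogram frames for each item in the batch.
    """
    max_len = max(mel_lengths)
    remainder = max_len % n_frames_per_step
    if remainder != 0:
        # Round up to the next multiple of n_frames_per_step.
        max_len += n_frames_per_step - remainder
    gates = []
    for length in mel_lengths:
        gate = [0.0] * max_len
        # Mark the last real frame and all padding as "stop".
        for i in range(length - 1, max_len):
            gate[i] = 1.0
        gates.append(gate)
    return max_len, gates


max_len, gates = pad_targets([5, 3], n_frames_per_step=2)
print(max_len, gates)
```

Here the longest item has 5 frames, so the batch is padded to 6 frames, the next multiple of 2.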

speechbrain.lobes.models.Tacotron2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]: Dynamic range compression for audio signals
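
A plain-Python sketch of what such a compression typically computes is shown below. It assumes the form used in common Tacotron2 implementations, log(clamp(x, min=clip_val) * C), applied element-wise; that the SpeechBrain function matches this exactly is an assumption, and the list-based function here is only a stand-in for the tensor version.

```python
import math


def dynamic_range_compression(x, C=1, clip_val=1e-5):
    """Log-compress magnitudes, clamping small values to avoid log(0)."""
    return [math.log(max(value, clip_val) * C) for value in x]


compressed = dynamic_range_compression([0.0, 1.0, math.e])
print(compressed)
```

The clamp is what keeps silent frames finite: a magnitude of 0 maps to log(clip_val) instead of negative infinity.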

speechbrain.lobes.models.Tacotron2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

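Calculates the mel spectrogram for a raw audio signal

Parameters:

sample_rate (int) – Sample rate of the audio signal.
hop_length (int) – Length of the hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of the FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after the STFT.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: "htk" or "slaney".
compression (bool) – Whether to apply dynamic range compression.
audio (torch.Tensor) – Input audio signal.

Returns:

mel – The computed mel spectrogram features.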

Return type:

torch.Tensor
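
To illustrate how f_min, f_max, n_mels and the "htk" mel_scale relate, the sketch below converts between Hz and mels with the standard HTK formula and places the filter edges evenly on the mel axis. It is independent of SpeechBrain and torchaudio: the function names and the example values for f_min, f_max and n_mels are chosen for illustration only.

```python
import math


def hz_to_mel_htk(frequency_hz):
    """HTK mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


f_min, f_max, n_mels = 0.0, 8000.0, 80  # illustrative values only
mel_min, mel_max = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
# n_mels triangular filters need n_mels + 2 equally spaced mel points.
step = (mel_max - mel_min) / (n_mels + 1)
edges_hz = [mel_to_hz_htk(mel_min + i * step) for i in range(n_mels + 2)]
print(len(edges_hz), round(edges_hz[1], 2), round(edges_hz[-1], 2))
```

Because the points are evenly spaced in mels rather than in Hz, the filters are narrow at low frequencies and progressively wider towards f_max.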