speechbrain.lobes.models.HifiGAN module

Neural network modules for HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Authors

Jarod Duret 2021
Yingzhi Wang 2022

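Summary

Classes:

`DiscriminatorLoss`	Creates a summary of the discriminator losses.
`DiscriminatorP`	HiFiGAN periodic discriminator. Takes every P-th value from the input waveform and applies a stack of convolutions. Note: if the period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat.
`DiscriminatorS`	HiFiGAN scale discriminator.
`GeneratorLoss`	Creates a summary of the generator losses and applies weights to the different losses.
`HifiganDiscriminator`	HiFiGAN discriminator wrapping MPD and MSD.
`HifiganGenerator`	HiFiGAN generator with Multi-Receptive Field Fusion (MRF).
`L1SpecLoss`	L1 loss over spectrograms, as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf. Note: compared with L2 loss, L1 loss helps learn details.
`MSEDLoss`	Mean squared discriminator loss. The discriminator is trained to classify ground-truth samples as 1 and samples synthesized by the generator as 0.
`MSEGLoss`	Mean squared generator loss. The generator is trained to fool the discriminator by improving sample quality until its samples are classified with a value close to 1.
`MelganFeatureLoss`	Calculates the feature matching loss, a learned similarity metric measured by the difference between the discriminator features of a ground-truth sample and a generated sample (Larsen et al., 2016; Kumar et al., 2019).
`MultiPeriodDiscriminator`	HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies the `PeriodDiscriminator` (`DiscriminatorP` in this module) at different periods.
`MultiScaleDiscriminator`	HiFiGAN Multi-Scale Discriminator.
`MultiScaleSTFTLoss`	Multi-scale STFT loss.
`ResBlock1`	Residual block type 1, with 3 convolutional layers in each convolutional block.
`ResBlock2`	Residual block type 2, with 2 convolutional layers in each convolutional block.
`STFTLoss`	STFT loss.
`UnitHifiganGenerator`	UnitHiFiGAN generator, which takes discrete speech tokens as input.
`VariancePredictor`	Variance predictor inspired by FastSpeech2.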

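Functions:

`dynamic_range_compression`	Dynamic range compression for audio signals.
`mel_spectogram`	Calculates the mel spectrogram of a raw audio signal.
`process_duration`	Processes a batch of codes to extract consecutive unique elements and their associated features.
`stft`	Computes the Fourier transform of short overlapping windows of the input.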

Reference

speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]: Dynamic range compression for audio signals.
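
To illustrate what dynamic range compression with these arguments usually means in HiFi-GAN style pipelines, the sketch below applies log(max(x, clip_val) * C) to plain Python floats. The formula is an assumption based on the common formulation, not a copy of the SpeechBrain code, and `compress` is a hypothetical helper name.

```python
import math

def compress(values, C=1, clip_val=1e-5):
    """Log-compress magnitudes, clipping small values to avoid log(0).

    Assumed formula: log(max(x, clip_val) * C). Hypothetical helper that
    works on a list of floats instead of a tensor.
    """
    return [math.log(max(v, clip_val) * C) for v in values]

# A zero magnitude is clipped to clip_val before the logarithm is taken.
compressed = compress([0.0, 1.0, math.e])
print(compressed)
```

Clipping keeps silent frames finite, and the logarithm brings loud and quiet magnitudes into a comparable range.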

speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

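Calculates the mel spectrogram of a raw audio signal.

Parameters:

sample_rate (int) – Sample rate of the audio signal.
hop_length (int) – Length of the hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of the FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after the STFT.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: "htk" or "slaney".
compression (bool) – Whether to apply dynamic range compression.
audio (torch.tensor) – Input audio signal.

Return type:

Mel spectrogram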
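
The `mel_scale` argument selects how frequencies in Hz map onto the mel axis. As a small worked example, the widely used HTK formula is m = 2595 log10(1 + f / 700). The sketch below implements it, its inverse, and the usual placement of filter edges at equal mel spacing. All helper names are hypothetical, the numeric values are illustrative, and the "slaney" variant is not shown.

```python
import math

def hz_to_mel_htk(freq_hz):
    """HTK mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_mels):
    """Edge frequencies (Hz) of n_mels triangular filters.

    n_mels filters need n_mels + 2 points, equally spaced on the mel axis.
    """
    mel_min, mel_max = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz_htk(mel_min + i * step) for i in range(n_mels + 2)]

# Illustrative values only.
edges = mel_band_edges(f_min=0.0, f_max=8000.0, n_mels=80)
print(len(edges), round(edges[0], 3), round(edges[-1], 3))
```

Because the points are equally spaced in mel rather than in Hz, the filters are narrow at low frequencies and progressively wider toward `f_max`.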

speechbrain.lobes.models.HifiGAN.process_duration(code, code_feat)[source]

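Processes a batch of codes to extract consecutive unique elements and their associated features.

Parameters:

code (torch.Tensor (batch, time)) – Tensor of code indices.
code_feat (torch.Tensor (batch, time, channel)) – Tensor of code features.

Returns:

uniq_code_feat_filtered (torch.Tensor (batch, time)) – Features of the consecutive unique codes.
mask (torch.Tensor (batch, time)) – Padding mask for the unique codes.
uniq_code_count (torch.Tensor (n)) – Count of the unique codes.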

Example

>>> code = torch.IntTensor([[40, 18, 18, 10]])
>>> code_feat = torch.rand([1, 4, 128])
>>> out_tensor, mask, uniq_code = process_duration(code, code_feat)
>>> out_tensor.shape
torch.Size([1, 1, 128])
>>> mask.shape
torch.Size([1, 1])
>>> uniq_code.shape
torch.Size([1])
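
The core idea of "consecutive unique elements and their counts" is run-length encoding. The plain-Python sketch below shows only that step; `consecutive_runs` is a hypothetical helper. For the input used in the example above it finds three runs, whereas the library call returns a single entry, so `process_duration` evidently applies further filtering that this sketch does not attempt to reproduce.

```python
from itertools import groupby

def consecutive_runs(code):
    """Return (values, counts) for runs of equal consecutive items."""
    values, counts = [], []
    for value, run in groupby(code):
        values.append(value)
        counts.append(sum(1 for _ in run))
    return values, counts

# The repeated 18 collapses into one run of length 2.
values, counts = consecutive_runs([40, 18, 18, 10])
print(values, counts)
```

The counts act as per-unit durations: each unique code stands for that many consecutive input frames.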

class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))[source]

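Bases: Module

Residual block type 1, with 3 convolutional layers in each convolutional block.

Parameters:

channels (int) – Number of hidden channels for the convolutional layers.
kernel_size (int) – Size of the convolution filter in each layer.
dilation (list) – List of dilation values for each convolutional layer in the block.

forward(x)[source]

Returns the output of ResBlock1.

Parameters: x (torch.Tensor (batch, channel, time)) – Input tensor.
Return type: ResBlock output

remove_weight_norm()[source]: Removes weight normalization for inference.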

class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))[source]

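Bases: Module

Residual block type 2, with 2 convolutional layers in each convolutional block.

Parameters:

channels (int) – Number of hidden channels for the convolutional layers.
kernel_size (int) – Size of the convolution filter in each layer.
dilation (list) – List of dilation values for each convolutional layer in the block.

forward(x)[source]

Returns the output of ResBlock2.

Parameters: x (torch.Tensor (batch, channel, time)) – Input tensor.
Return type: ResBlock output

remove_weight_norm()[source]: Removes weight normalization for inference.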

class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)[source]

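Bases: Module

HiFiGAN generator with Multi-Receptive Field Fusion (MRF).

Parameters:

in_channels (int) – Number of input tensor channels.
out_channels (int) – Number of output tensor channels.
resblock_type (str) – Type of ResBlock: '1' or '2'.
resblock_dilation_sizes (List[List[int]]) – List of dilation values for each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – List of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – List of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – Number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – Upsampling factors (stride) for each upsampling layer.
inference_padding (int) – Constant padding applied to the input at inference time. Defaults to 5.
cond_channels (int) – If provided, adds a convolutional layer at the beginning of the forward pass.
conv_post_bias (bool) – Whether to add a bias term to the final convolution.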

Example

>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator= HifiganGenerator(
...    in_channels = 80,
...    out_channels = 1,
...    resblock_type = "1",
...    resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...    resblock_kernel_sizes = [3, 7, 11],
...    upsample_kernel_sizes = [16, 16, 4, 4],
...    upsample_initial_channel = 512,
...    upsample_factors = [8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])
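
The output length in the example follows from the upsampling factors: each input frame is expanded by the product of `upsample_factors`. The sketch below checks that arithmetic in plain Python. `generator_output_length` is a hypothetical helper, and it assumes `forward` is called without any extra padding.

```python
import math

def generator_output_length(n_frames, upsample_factors):
    """Number of waveform samples produced from n_frames input frames.

    Assumes each upsampling layer multiplies the time axis by its factor
    and that no padding is added (i.e. forward, not inference).
    """
    return n_frames * math.prod(upsample_factors)

# 33 frames with factors [8, 8, 2, 2]: 33 * 256 samples.
n_samples = generator_output_length(33, [8, 8, 2, 2])
print(n_samples)
```

The same rule is consistent with the `UnitHifiganGenerator` example in this module, where 10 input tokens and factors [5, 4, 4, 2, 2] give 10 * 320 = 3200 samples.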

forward(x, g=None)[source]

Parameters:

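x (torch.Tensor (batch, channel, time)) – Feature input tensor.
g (torch.Tensor (batch, 1, time)) – Global conditioning input tensor.

Return type:

The generator outputs

remove_weight_norm()[source]: Removes weight normalization for inference.

inference(c, padding=True)[source]

The inference function performs padding and runs the forward method.

Parameters:

c (torch.Tensor (batch, channel, time)) – Feature input tensor.
padding (bool) – Whether to pad the tensor before the forward pass.

Return type:

The generator outputs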

class speechbrain.lobes.models.HifiGAN.VariancePredictor(encoder_embed_dim, var_pred_hidden_dim, var_pred_kernel_size, var_pred_dropout)[source]

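Bases: Module

Variance predictor inspired by FastSpeech2.

Parameters:

encoder_embed_dim (int) – Number of input tensor channels.
var_pred_hidden_dim (int) – Size of the hidden channels for the convolutional layers.
var_pred_kernel_size (int) – Size of the convolution filter in each layer.
var_pred_dropout (float) – Dropout probability of each layer.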

Example

>>> inp_tensor = torch.rand([4, 80, 128])
>>> duration_predictor = VariancePredictor(
...    encoder_embed_dim = 128,
...    var_pred_hidden_dim = 128,
...    var_pred_kernel_size = 3,
...    var_pred_dropout = 0.5,
... )
>>> out_tensor = duration_predictor(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 80])

forward(x)[source]

Parameters: x (torch.Tensor (batch, channel, time)) – Feature input tensor.
Return type: Variance predictor output

class speechbrain.lobes.models.HifiGAN.UnitHifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True, vocab_size=100, embedding_dim=128, attn_dim=128, duration_predictor=False, var_pred_hidden_dim=128, var_pred_kernel_size=3, var_pred_dropout=0.5, multi_speaker=False, normalize_speaker_embeddings=False, skip_token_embedding=False, pooling_type='attention')[source]

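Bases: HifiganGenerator

The UnitHiFiGAN generator takes discrete speech tokens as input. The generator is adapted to support bitrate scalability training. For more details, refer to: https://arxiv.org/abs/2406.10735.

Parameters:

in_channels (int) – Number of input tensor channels.
out_channels (int) – Number of output tensor channels.
resblock_type (str) – Type of ResBlock: '1' or '2'.
resblock_dilation_sizes (List[List[int]]) – List of dilation values for each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – List of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – List of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – Number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – Upsampling factors (stride) for each upsampling layer.
inference_padding (int) – Constant padding applied to the input at inference time. Defaults to 5.
cond_channels (int) – If provided, adds a convolutional layer at the beginning of the forward pass.
conv_post_bias (bool) – Whether to add a bias term to the final convolution.
vocab_size (int) – Size of the dictionary of embeddings.
embedding_dim (int) – Size of each embedding vector.
attn_dim (int) – Size of the attention dimension.
duration_predictor (bool) – Enables the duration predictor module.
var_pred_hidden_dim (int) – Size of the hidden channels for the convolutional layers of the duration predictor.
var_pred_kernel_size (int) – Size of the convolution filter in each layer of the duration predictor.
var_pred_dropout (float) – Dropout probability of each layer in the duration predictor.
multi_speaker (bool) – Enables multi-speaker training.
normalize_speaker_embeddings (bool) – Enables normalization of the speaker embeddings.
skip_token_embedding (bool) – Whether to skip the embedding layer in the case of continuous input.
pooling_type (str, optional) – The type of pooling to use. Must be one of ["attention", "sum", "none"]. Defaults to "attention" for the scalable vocoder.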

Example

>>> inp_tensor = torch.randint(0, 100, (4, 10, 1))
>>> unit_hifigan_generator= UnitHifiganGenerator(
...    in_channels = 128,
...    out_channels = 1,
...    resblock_type = "1",
...    resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...    resblock_kernel_sizes = [3, 7, 11],
...    upsample_kernel_sizes = [11, 8, 8, 4, 4],
...    upsample_initial_channel = 512,
...    upsample_factors = [5, 4, 4, 2, 2],
...    vocab_size = 100,
...    embedding_dim = 128,
...    duration_predictor = True,
...    var_pred_hidden_dim = 128,
...    var_pred_kernel_size = 3,
...    var_pred_dropout = 0.5,
... )
>>> out_tensor, _ = unit_hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 3200])

forward(x, g=None, spk=None)[source]

Parameters:

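x (torch.Tensor (batch, time, channel)) – Feature input tensor.
g (torch.Tensor (batch, 1, time)) – Global conditioning input tensor.
spk (torch.Tensor) – Speaker embeddings.

Return type:

The generator outputs

inference(x, spk=None)[source]

The inference function performs duration prediction and runs the forward method.

Parameters:

x (torch.Tensor (batch, time, channel)) – Feature input tensor.
spk (torch.Tensor) – Speaker embeddings.

Return type:

The generator outputs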

class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)[source]

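Bases: Module

HiFiGAN periodic discriminator. Takes every P-th value from the input waveform and applies a stack of convolutions. Note:

If the period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat

Parameters:

period (int) – Takes a new value every `period` samples.
kernel_size (int) – Size of the 1-d kernel for the conv stack.
stride (int) – Stride of the conv stack.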
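
The period subsampling described in the note can be illustrated without any convolutions. The plain-Python sketch below groups waveform samples by their phase modulo the period, so that row 0 holds every `period`-th value starting from the first sample. `split_by_period` is a hypothetical helper; how the actual module pads and reshapes its input tensor is not shown here.

```python
def split_by_period(waveform, period):
    """Group samples by phase: row k holds samples k, k + period, k + 2 * period, ..."""
    return [waveform[phase::period] for phase in range(period)]

waveform = [1, 2, 3, 4, 5, 6]
phases = split_by_period(waveform, 2)
# Row 0 is the subsampled signal from the note; row 1 is the other phase.
print(phases)
```

Each row is an equally spaced view of the signal, which is what lets a periodic discriminator focus on structure that repeats at that period.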

forward(x)[source]

Parameters: x (torch.Tensor (batch, 1, time)) – Input waveform.
Return type: Scores and features

class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator[source]

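Bases: Module

HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies the PeriodDiscriminator at different periods. Periods are suggested to be prime numbers to reduce the overlap between the discriminators.

forward(x)[source]

Returns the Multi-Period Discriminator scores and features.

Parameters: x (torch.Tensor (batch, 1, time)) – Input waveform.
Return type: Scores and features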

class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)[source]

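Bases: Module

HiFiGAN scale discriminator. It is similar to MelganDiscriminator, but with a specific architecture explained in the paper. The SpeechBrain CNN wrappers are not used here because spectral_norm is not often used.

Parameters: use_spectral_norm (bool) – If True, switches to spectral norm instead of weight norm.

forward(x)[source]

Parameters: x (torch.Tensor (batch, 1, time)) – Input waveform.
Return type: Scores and features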

class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator[source]

Bases: Module

HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator, but specifically tailored for HiFiGAN as described in the paper.

forward(x)[source]

Parameters: x (torch.Tensor (batch, 1, time)) – Input waveform.
Return type: Scores and features

class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator[source]

Bases: Module

HiFiGAN discriminator wrapping MPD and MSD.

Example

>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator= HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8

forward(x)[source]

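Returns the scores and the list of per-layer features of each discriminator.

Parameters: x (torch.Tensor) – Input waveform.
Return type: Scores and features from each discriminator layer

speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')[source]: Computes the Fourier transform of short overlapping windows of the input signal.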

class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)[source]

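Bases: Module

STFT loss. The generated and ground-truth input waveforms are converted to spectrograms and compared using L1 and spectral convergence losses. It is from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf

Parameters:

n_fft (int) – Size of the Fourier transform.
hop_length (int) – Distance between neighboring sliding window frames.
win_length (int) – Size of the window frame and STFT filter.

forward(y_hat, y)[source]

Returns the magnitude loss and the spectral convergence loss.

Parameters:

y_hat (torch.tensor) – Generated waveform tensor.
y (torch.tensor) – Real waveform tensor.

Return type:

Magnitude loss and spectral convergence loss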
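
For readers unfamiliar with the two terms, the sketch below computes them on small hand-written magnitude spectrograms in plain Python. It follows the definitions in the ParallelWaveGAN paper: spectral convergence is the Frobenius norm of the magnitude difference divided by the Frobenius norm of the reference, and the magnitude loss is an L1 distance between log magnitudes. The helper names and the toy spectrograms are hypothetical, and the STFT step itself is omitted.

```python
import math

def _flatten(spec):
    """Flatten a list of frames (each a list of bin magnitudes)."""
    return [value for frame in spec for value in frame]

def spectral_convergence_loss(mag_hat, mag):
    """Frobenius norm of (mag - mag_hat) divided by Frobenius norm of mag."""
    diff = math.sqrt(sum((m - h) ** 2 for h, m in zip(_flatten(mag_hat), _flatten(mag))))
    ref = math.sqrt(sum(m ** 2 for m in _flatten(mag)))
    return diff / ref

def log_magnitude_loss(mag_hat, mag):
    """Mean absolute difference between log magnitudes (L1 loss)."""
    pairs = list(zip(_flatten(mag_hat), _flatten(mag)))
    return sum(abs(math.log(m) - math.log(h)) for h, m in pairs) / len(pairs)

# Toy magnitudes: the "generated" spectrogram is the reference scaled by 2.
mag = [[1.0, 2.0], [2.0, 4.0]]
mag_hat = [[2.0, 4.0], [4.0, 8.0]]
sc_loss = spectral_convergence_loss(mag_hat, mag)
mag_loss = log_magnitude_loss(mag_hat, mag)
print(sc_loss, mag_loss)
```

Spectral convergence is dominated by high-energy bins, while the log-magnitude term gives low-energy bins comparable weight, which is why the two are used together.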

class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))[source]

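Bases: Module

Multi-scale STFT loss. The generated and ground-truth input waveforms are converted to spectrograms and compared using L1 and spectral convergence losses. It is from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf

forward(y_hat, y)[source]

Returns the multi-scale magnitude loss and spectral convergence loss.

Parameters:

y_hat (torch.tensor) – Generated waveform tensor.
y (torch.tensor) – Real waveform tensor.

Return type:

Magnitude loss and spectral convergence loss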

class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)[source]

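Bases: Module

L1 loss over spectrograms, as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf. Note: compared with L2 loss, L1 loss helps learn details.

Parameters:

sample_rate (int) – Sample rate of the audio signal.
hop_length (int) – Length of the hop between STFT windows.
win_length (int) – Window size.
n_mel_channels (int) – Number of mel filterbanks.
n_fft (int) – Size of the FFT.
n_stft (int) – Size of the STFT.
mel_fmin (float) – Minimum frequency.
mel_fmax (float) – Maximum frequency.
mel_normalized (bool) – Whether to normalize by magnitude after the STFT.
power (float) – Exponent for the magnitude spectrogram.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: "htk" or "slaney".
dynamic_range_compression (bool) – Whether to apply dynamic range compression.

forward(y_hat, y)[source]

Returns the L1 loss over spectrograms.

Parameters:

y_hat (torch.tensor) – Generated waveform tensor.
y (torch.tensor) – Real waveform tensor.

Return type:

L1 loss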

class speechbrain.lobes.models.HifiGAN.MSEGLoss(*args, **kwargs)[source]

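Bases: Module

Mean squared generator loss. The generator is trained to fool the discriminator by improving sample quality until its samples are classified with a value close to 1.

forward(score_fake)[source]

Returns the generator GAN loss.

Parameters: score_fake (list) – Discriminator scores of generated waveforms, D(G(s)).
Return type: Generator loss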
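
As a worked illustration of the description above, the least-squares generator objective is the mean of (D(G(s)) - 1)^2. The plain-Python sketch below evaluates it for the scores of a single discriminator given as a flat list. `mse_generator_loss` is a hypothetical helper, and how scores from several sub-discriminators are combined is left out.

```python
def mse_generator_loss(score_fake):
    """Mean squared distance between fake scores and the target value 1."""
    return sum((s - 1.0) ** 2 for s in score_fake) / len(score_fake)

# Scores close to 1 mean the discriminator was fooled, so the loss is small.
fooled = mse_generator_loss([1.0, 1.0])
not_fooled = mse_generator_loss([0.0, 0.5])
print(fooled, not_fooled)
```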

class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss[source]

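Bases: Module

Calculates the feature matching loss, a learned similarity metric measured by the difference between the discriminator features of a ground-truth sample and a generated sample (Larsen et al., 2016; Kumar et al., 2019).

forward(fake_feats, real_feats)[source]

Returns the feature matching loss.

Parameters:

fake_feats (list) – Discriminator features of generated waveforms.
real_feats (list) – Discriminator features of ground-truth waveforms.

Return type:

Feature matching loss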
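
A minimal sketch of feature matching, assuming the features arrive as one list per discriminator, each holding one flat list of values per layer, and assuming the per-layer L1 distances are averaged over all layers. Both the nesting and the averaging are assumptions made for illustration, and `feature_matching_loss` is a hypothetical helper rather than the library code.

```python
def feature_matching_loss(fake_feats, real_feats):
    """Average, over all layers of all discriminators, of the mean L1 distance."""
    total, n_layers = 0.0, 0
    for disc_fake, disc_real in zip(fake_feats, real_feats):
        for layer_fake, layer_real in zip(disc_fake, disc_real):
            # Mean absolute difference between the two feature maps of this layer.
            l1 = sum(abs(f - r) for f, r in zip(layer_fake, layer_real)) / len(layer_fake)
            total += l1
            n_layers += 1
    return total / n_layers

# Two discriminators: the first with two layers, the second with one.
fake = [[[0.0, 0.0], [1.0]], [[2.0, 2.0]]]
real = [[[1.0, 1.0], [1.0]], [[2.0, 4.0]]]
fm_loss = feature_matching_loss(fake, real)
print(fm_loss)
```

The three layers contribute distances of 1.0, 0.0 and 1.0, so the average is 2/3. Matching intermediate features gives the generator a denser training signal than the final score alone.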

class speechbrain.lobes.models.HifiGAN.MSEDLoss[source]

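Bases: Module

Mean squared discriminator loss. The discriminator is trained to classify ground-truth samples as 1 and samples synthesized by the generator as 0.

forward(score_fake, score_real)[source]

Returns the discriminator GAN loss.

Parameters:

score_fake (list) – Discriminator scores of generated waveforms.
score_real (list) – Discriminator scores of ground-truth waveforms.

Return type:

Discriminator loss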
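
The targets described above translate into two squared-error terms: (D(x) - 1)^2 for real samples and D(G(s))^2 for generated ones. The plain-Python sketch below evaluates both for the scores of a single discriminator given as flat lists. `mse_discriminator_loss` is a hypothetical helper, and returning the two terms alongside their sum is a choice made for this illustration.

```python
def mse_discriminator_loss(score_fake, score_real):
    """Return (total, real_term, fake_term) of the least-squares discriminator loss."""
    loss_real = sum((s - 1.0) ** 2 for s in score_real) / len(score_real)
    loss_fake = sum(s ** 2 for s in score_fake) / len(score_fake)
    return loss_real + loss_fake, loss_real, loss_fake

# A perfect discriminator scores real samples as 1 and fake samples as 0.
perfect = mse_discriminator_loss(score_fake=[0.0, 0.0], score_real=[1.0, 1.0])
# An undecided discriminator scores everything as 0.5.
undecided = mse_discriminator_loss(score_fake=[0.5], score_real=[0.5])
print(perfect, undecided)
```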

class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0, mseg_dur_loss=None, mseg_dur_loss_weight=0)[source]

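Bases: Module

Creates a summary of the generator losses and applies weights to the different losses.

Parameters:

stft_loss (object) – Object of the STFT loss.
stft_loss_weight (float) – Weight of the STFT loss.
mseg_loss (object) – Object of the MSEG loss.
mseg_loss_weight (float) – Weight of the MSEG loss.
feat_match_loss (object) – Object of the feature matching loss.
feat_match_loss_weight (float) – Weight of the feature matching loss.
l1_spec_loss (object) – Object of the L1 spectrogram loss.
l1_spec_loss_weight (float) – Weight of the L1 spectrogram loss.
mseg_dur_loss (object) – Object of the MSEG duration loss.
mseg_dur_loss_weight (float) – Weight of the MSEG duration loss.

forward(stage, y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None, log_dur_pred=None, log_dur=None)[source]

Returns a dictionary of generator losses and applies the weights.

Parameters:

stage (speechbrain.Stage) – Training, validation or testing.
y_hat (torch.tensor) – Generated waveform tensor.
y (torch.tensor) – Real waveform tensor.
scores_fake (list) – Discriminator scores of generated waveforms.
feats_fake (list) – Discriminator features of generated waveforms.
feats_real (list) – Discriminator features of ground-truth waveforms.
log_dur_pred (torch.Tensor) – Predicted duration, used for the duration loss.
log_dur (torch.Tensor) – Real duration, used for the duration loss.

Return type:

Dictionary of generator losses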
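
Conceptually, the class combines several loss terms into one weighted sum while still reporting each term. The sketch below shows that pattern with plain floats. The dictionary keys, the `total` entry, the loss values, and the weights are all hypothetical and chosen for illustration, and the real class may name and organize its outputs differently.

```python
def weighted_generator_loss(losses, weights):
    """Return the individual losses plus their weighted sum under "total".

    A term whose weight is missing counts as weight 0, mirroring the
    default weights of 0 in the constructor signature above.
    """
    summary = dict(losses)
    summary["total"] = sum(weights.get(name, 0.0) * value for name, value in losses.items())
    return summary

# Hypothetical loss values and weights.
losses = {"stft": 0.8, "mseg": 0.5, "feat_match": 0.2, "l1_spec": 0.4}
weights = {"mseg": 1.0, "feat_match": 10.0, "l1_spec": 45.0}
summary = weighted_generator_loss(losses, weights)
print(summary)
```

Keeping the unweighted terms in the result makes it easy to log each component separately while backpropagating only through the total.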

class speechbrain.lobes.models.HifiGAN.DiscriminatorLoss(msed_loss=None)[source]

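Bases: Module

Creates a summary of the discriminator losses.

Parameters: msed_loss (object) – Object of the MSE discriminator loss.

forward(scores_fake, scores_real)[source]

Returns a dictionary of discriminator losses.

Parameters:

scores_fake (list) – Discriminator scores of generated waveforms.
scores_real (list) – Discriminator scores of ground-truth waveforms.

Return type:

Dictionary of discriminator losses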