speechbrain.lobes.models.DiffWave module

Neural network modules for DiffWave: A Versatile Diffusion Model for Audio Synthesis

Authors

Yingzhi Wang 2022

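Summary

Classes:

`DiffWave`	DiffWave model with dilated residual blocks
`DiffWaveDiffusion`	An enhanced diffusion implementation with DiffWave-specific inference
`DiffusionEmbedding`	Embeds the diffusion step into an input vector of DiffWave
`ResidualBlock`	Residual block with dilated convolution
`SpectrogramUpsampler`	Upsampler for spectrograms that uses transposed convolutions only. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels can be found in the residual block

Functions:

diffwave_mel_spectogram

Computes the mel spectrogram of a raw audio signal and preprocesses it for DiffWave training

Reference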

speechbrain.lobes.models.DiffWave.diffwave_mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, audio)[source]

Computes the mel spectrogram of a raw audio signal and preprocesses it for DiffWave training.

Parameters:

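sample_rate (int) – Sample rate of the audio signal.
hop_length (int) – Length of the hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of the FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after the STFT.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: "htk" or "slaney".
audio (torch.Tensor) – Input audio signal.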

Returns:

mel

Return type:

torch.Tensor

class speechbrain.lobes.models.DiffWave.DiffusionEmbedding(max_steps)[source]

Bases: Module

Embeds the diffusion step into an input vector of DiffWave.

Parameters: max_steps (int) – Total number of diffusion steps

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffusionEmbedding
>>> diffusion_embedding = DiffusionEmbedding(max_steps=50)
>>> time_step = torch.randint(50, (1,))
>>> step_embedding = diffusion_embedding(time_step)
>>> step_embedding.shape
torch.Size([1, 512])

forward(diffusion_step)[source]

Forward function of the diffusion step embedding.

Parameters: diffusion_step (torch.Tensor) – Which diffusion step to execute
Returns: The diffusion step embedding
Return type: torch.Tensor [bs, 512]
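
To make the idea of a step embedding concrete, the sketch below builds a sinusoidal lookup table in plain Python. The 128-dimensional table, the frequency range (1 to 10^4), and the helper name `sinusoidal_step_table` are assumptions taken from the reference DiffWave implementation, not details confirmed on this page; the module documented here additionally projects the result to 512 dimensions, which this sketch leaves out.

```python
import math


def sinusoidal_step_table(max_steps, half_dim=64):
    """Return a [max_steps, 2 * half_dim] table of sin/cos features.

    Hypothetical helper: the sizes and frequencies are assumptions,
    not the confirmed internals of DiffusionEmbedding.
    """
    table = []
    for step in range(max_steps):
        # Frequencies grow geometrically from 1 to 10**4 across the channels.
        angles = [
            step * 10.0 ** (d * 4.0 / (half_dim - 1)) for d in range(half_dim)
        ]
        # First half holds the sines, second half the cosines.
        table.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return table


table = sinusoidal_step_table(max_steps=50)
print(len(table), len(table[0]))  # 50 128
```

Each diffusion step indexes one row of this table, so every step gets a distinct, smoothly varying vector before any learned projection is applied.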

class speechbrain.lobes.models.DiffWave.SpectrogramUpsampler[source]

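Bases: Module

Upsampler for spectrograms that uses transposed convolutions only. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels can be found in the residual block.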

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import SpectrogramUpsampler
>>> spec_upsampler = SpectrogramUpsampler()
>>> mel_input = torch.rand(3, 80, 100)
>>> upsampled_mel = spec_upsampler(mel_input)
>>> upsampled_mel.shape
torch.Size([3, 80, 25600])

forward(x)[source]

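Upsamples the spectrogram 256 times to match the length of the audio. The hop length should be 256 when extracting the mel spectrogram.

Parameters: x (torch.Tensor) – Input mel spectrogram [bs, 80, mel_len]
Returns: Upsampled spectrogram [bs, 80, mel_len*256]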
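
The factor of 256 can be checked with the standard transposed-convolution length formula. The sketch below assumes two stacked transposed convolutions along the time axis, each with kernel 32, stride 16, and padding 8, as in the reference DiffWave implementation; these layer settings and the helper names are assumptions, not values stated on this page.

```python
def conv_transpose_out_len(length, kernel, stride, padding):
    """Output length of a transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel


def upsampled_length(mel_len, n_layers=2, kernel=32, stride=16, padding=8):
    """Apply the assumed transposed-convolution layers one after another."""
    length = mel_len
    for _ in range(n_layers):
        length = conv_transpose_out_len(length, kernel, stride, padding)
    return length


print(upsampled_length(100))  # 25600
```

With these settings each layer multiplies the length by exactly 16, so two layers give 16 × 16 = 256, matching the `[3, 80, 25600]` shape in the example above.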

class speechbrain.lobes.models.DiffWave.ResidualBlock(n_mels, residual_channels, dilation, uncond=False)[source]

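Bases: Module

Residual block with dilated convolution.

Parameters:

n_mels (int) – Number of input mel channels of the conv1x1 for the conditional vocoding task
residual_channels (int) – Number of channels of the audio convolution
dilation (int) – Dilation of the audio convolution
uncond (bool) – If True, unconditional generation; otherwise conditional generation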

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import ResidualBlock
>>> res_block = ResidualBlock(n_mels=80, residual_channels=64, dilation=3)
>>> noisy_audio = torch.randn(1, 1, 22050)
>>> timestep_embedding = torch.rand(1, 512)
>>> upsampled_mel = torch.rand(1, 80, 22050)
>>> output = res_block(noisy_audio, timestep_embedding, upsampled_mel)
>>> output[0].shape
torch.Size([1, 64, 22050])

forward(x, diffusion_step, conditioner=None)[source]

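Forward function of the residual block.

Parameters:

x (torch.Tensor) – Input sample [bs, 1, time]
diffusion_step (torch.Tensor) – Embedding of the diffusion step to execute
conditioner (torch.Tensor) – Condition used for conditional generation

Returns:

Residual output [bs, residual_channels, time]
Skip connection of the residual branch [bs, residual_channels, time]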

class speechbrain.lobes.models.DiffWave.DiffWave(input_channels, residual_layers, residual_channels, dilation_cycle_length, total_steps, unconditional=False)[source]

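Bases: Module

DiffWave model with dilated residual blocks.

Parameters:

input_channels (int) – Number of input mel channels of the conv1x1 for the conditional vocoding task
residual_layers (int) – Number of residual blocks
residual_channels (int) – Number of channels of the audio convolution
dilation_cycle_length (int) – Dilation cycle length of the audio convolution
total_steps (int) – Total number of diffusion steps
unconditional (bool) – If True, unconditional generation; otherwise conditional generation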

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> noisy_audio = torch.randn(1, 1, 25600)
>>> timestep = torch.randint(50, (1,))
>>> input_mel = torch.rand(1, 80, 100)
>>> predicted_noise = diffwave(noisy_audio, timestep, input_mel)
>>> predicted_noise.shape
torch.Size([1, 1, 25600])
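
How `residual_layers` and `dilation_cycle_length` interact can be illustrated with a short sketch. It assumes that layer `i` uses a dilation of `2 ** (i % dilation_cycle_length)`, which is the rule used in the reference DiffWave implementation; this page does not confirm the rule, and `dilation_schedule` is a hypothetical helper.

```python
def dilation_schedule(residual_layers, dilation_cycle_length):
    """Per-layer dilations, assuming powers of two that restart every cycle."""
    return [2 ** (i % dilation_cycle_length) for i in range(residual_layers)]


dilations = dilation_schedule(residual_layers=30, dilation_cycle_length=10)
print(dilations[:10])  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(len(dilations), max(dilations))  # 30 512
```

Under this assumption, the configuration in the example above runs through three identical cycles of ten layers each.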

forward(audio, diffusion_step, spectrogram=None, length=None)[source]

DiffWave forward function.

Parameters:

audio (torch.Tensor) – Input Gaussian sample [bs, 1, time]
diffusion_step (torch.Tensor) – Diffusion timestep to execute [bs, 1]
spectrogram (torch.Tensor) – Spectrogram data [bs, 80, mel_len]
length (torch.Tensor) – Sample lengths; not used, provided for compatibility only

Returns:

Predicted noise [bs, 1, time]

diffusion_forward(x, timesteps, cond_emb=None, length=None, out_mask_value=None, latent_mask_value=None)[source]: Forward function suitable for wrapping by diffusion. For this model, out_mask_value/latent_mask_value are unused and discarded. See forward() for details.

class speechbrain.lobes.models.DiffWave.DiffWaveDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)[source]

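Bases: DenoisingDiffusion

An enhanced diffusion implementation with DiffWave-specific inference.

Parameters:

model (nn.Module) – The underlying model
timesteps (int) – Total number of timesteps
noise (str|nn.Module) – The type of noise to use; "gaussian" produces standard Gaussian noise
beta_start (float) – Value of the "beta" parameter at the beginning of the process (see the DiffWave paper)
beta_end (float) – Value of the "beta" parameter at the end of the process
sample_min (float) – Lower bound used to clip the output
sample_max (float) – Upper bound used to clip the output
show_progress (bool) – Whether to show progress during inference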

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> from speechbrain.lobes.models.DiffWave import DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> diffusion = DiffWaveDiffusion(
...     model=diffwave,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> input_mel = torch.rand(1, 80, 100)
>>> output = diffusion.inference(
...     unconditional=False,
...     scale=256,
...     condition=input_mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )
>>> output.shape
torch.Size([1, 25600])
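
The roles of `beta_start`, `beta_end`, and `timesteps` can be sketched in plain Python. The example assumes a linearly spaced beta schedule, a common choice for denoising diffusion that this page does not confirm for this class; both helper names are hypothetical.

```python
def linear_beta_schedule(beta_start, beta_end, timesteps):
    """Linearly spaced betas from beta_start to beta_end (assumed schedule)."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t): the signal fraction kept up to step t."""
    alpha_cum, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_cum.append(running)
    return alpha_cum


betas = linear_beta_schedule(beta_start=0.0001, beta_end=0.05, timesteps=50)
alpha_cum = cumulative_alphas(betas)
print(len(betas), betas[0])  # 50 0.0001
```

The cumulative product shrinks monotonically toward zero, which reflects that later diffusion steps retain less of the original signal and more noise.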

inference(unconditional, scale, condition=None, fast_sampling=False, fast_sampling_noise_schedule=None, device=None)[source]

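Handles inference for DiffWave. A single inference function covers all locally/globally conditioned generation and unconditional generation tasks.

Parameters:

unconditional (bool) – If True, perform unconditional generation; otherwise perform conditional generation
scale (int) – Scale used to obtain the final output waveform length. For conditional generation, the output length is scale * condition.shape[-1]; for example, if the condition is a spectrogram (bs, n_mel, time), scale should be the hop length. For unconditional generation, scale should be the desired audio length
condition (torch.Tensor) – Input spectrogram for a vocoder or other conditional generation; should be None for unconditional generation
fast_sampling (bool) – Whether to perform fast sampling
fast_sampling_noise_schedule (list) – Noise schedule used for fast sampling
device (str|torch.device) – Inference device

Returns:

predicted_sample – Predicted audio (bs, 1, t)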

Return type:

torch.Tensor
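
The rule for `scale` described above can be written as a small helper. This is an illustrative sketch only; `output_num_samples` is a hypothetical name and not part of the SpeechBrain API.

```python
def output_num_samples(scale, condition_shape=None):
    """Number of audio samples produced per batch item.

    Conditional generation: scale * (last dimension of the condition).
    Unconditional generation (no condition): scale is the length itself.
    """
    if condition_shape is None:
        return scale
    return scale * condition_shape[-1]


# Vocoding from a (bs, n_mel, time) spectrogram with hop length 256.
print(output_num_samples(256, (1, 80, 100)))  # 25600
# Unconditional generation of 16000 samples.
print(output_num_samples(16000))  # 16000
```

The first call reproduces the 25600-sample output of the `DiffWaveDiffusion` example, where a 100-frame mel spectrogram is combined with `scale=256`.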