speechbrain.lobes.models.DiffWave module
Neural network modules for DiffWave: A Versatile Diffusion Model for Audio Synthesis
For more details: https://arxiv.org/pdf/2009.09761.pdf
- Authors
Yingzhi Wang 2022
Summary
Classes:
- DiffWave: DiffWave model with dilated residual blocks
- DiffWaveDiffusion: An enhanced diffusion implementation with DiffWave-specific inference
- DiffusionEmbedding: Embeds the diffusion step into an input vector for DiffWave
- ResidualBlock: Residual block with dilated convolution
- SpectrogramUpsampler: Upsampler for spectrograms with only transposed convolutions; the upsampling alone happens here, while the layer-specific convolutions that map mel bands to 2× residual channels live in the residual blocks
Functions:
- diffwave_mel_spectogram: Computes the mel spectrogram of a raw audio signal and preprocesses it for diffwave training
Reference
- speechbrain.lobes.models.DiffWave.diffwave_mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, audio)[source]
Computes the mel spectrogram of a raw audio signal and preprocesses it for diffwave training
- Parameters:
sample_rate (int) – Sample rate of the audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of the FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after the STFT.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: "htk" or "slaney".
audio (torch.Tensor) – Input audio signal.
- Returns:
mel
- Return type:
torch.Tensor
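A minimal usage sketch; the parameter values below (22.05 kHz sample rate, hop length 256, 80 mel bands, and so on) are illustrative assumptions rather than prescribed defaults of this function:
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import diffwave_mel_spectogram
>>> audio = torch.randn(1, 22050)  # one second of stand-in audio at 22.05 kHz
>>> mel = diffwave_mel_spectogram(
...     sample_rate=22050, hop_length=256, win_length=1024, n_fft=1024,
...     n_mels=80, f_min=0.0, f_max=8000.0, power=1.0, normalized=False,
...     norm="slaney", mel_scale="slaney", audio=audio,
... )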
- class speechbrain.lobes.models.DiffWave.DiffusionEmbedding(max_steps)[source]
Bases: Module
Embeds the diffusion step into an input vector for DiffWave
- Parameters:
max_steps (int) – Total number of diffusion steps
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffusionEmbedding
>>> diffusion_embedding = DiffusionEmbedding(max_steps=50)
>>> time_step = torch.randint(50, (1,))
>>> step_embedding = diffusion_embedding(time_step)
>>> step_embedding.shape
torch.Size([1, 512])
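The step embedding follows the sinusoidal scheme of the DiffWave paper: each step is encoded with sines and cosines at geometrically spaced frequencies and then projected up to 512 dimensions by fully connected layers. A sketch of the encoding table, assuming the 64-sine/64-cosine layout of the reference implementation (the fully connected projection is omitted):
>>> import torch
>>> dims = torch.arange(64).unsqueeze(0)         # (1, 64) frequency indices
>>> steps = torch.arange(50).unsqueeze(1)        # (50, 1) diffusion steps
>>> table = steps * 10.0 ** (dims * 4.0 / 63.0)  # geometrically spaced frequencies
>>> table = torch.cat([table.sin(), table.cos()], dim=1)
>>> table.shape
torch.Size([50, 128])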
- class speechbrain.lobes.models.DiffWave.SpectrogramUpsampler[source]
Bases: Module
Upsampler for spectrograms with transposed convolutions; only the upsampling happens here, while the layer-specific convolutions that map mel bands to 2× residual channels can be found in the residual blocks
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import SpectrogramUpsampler
>>> spec_upsampler = SpectrogramUpsampler()
>>> mel_input = torch.rand(3, 80, 100)
>>> upsampled_mel = spec_upsampler(mel_input)
>>> upsampled_mel.shape
torch.Size([3, 80, 25600])
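In the example, 100 mel frames become 25600 samples, an upsampling factor of 256 that matches a typical hop length. Two stride-16 transposed convolutions are enough to produce that factor; the kernel and padding values below are assumptions borrowed from the reference DiffWave implementation, not read from this module:
>>> import torch
>>> import torch.nn as nn
>>> # each layer upsamples time by 16, so the pair gives 16 * 16 = 256
>>> up1 = nn.ConvTranspose2d(1, 1, [3, 32], stride=[1, 16], padding=[1, 8])
>>> up2 = nn.ConvTranspose2d(1, 1, [3, 32], stride=[1, 16], padding=[1, 8])
>>> mel = torch.rand(3, 1, 80, 100)  # (batch, channel, n_mels, frames)
>>> up2(up1(mel)).shape
torch.Size([3, 1, 80, 25600])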
- class speechbrain.lobes.models.DiffWave.ResidualBlock(n_mels, residual_channels, dilation, uncond=False)[source]
Bases: Module
Residual block with dilated convolution
- Parameters:
n_mels (int) – Number of mel bands of the conditioning spectrogram.
residual_channels (int) – Number of channels of the residual path.
dilation (int) – Dilation factor of the dilated convolution.
uncond (bool) – Whether the block is used for unconditional generation.
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import ResidualBlock
>>> res_block = ResidualBlock(n_mels=80, residual_channels=64, dilation=3)
>>> noisy_audio = torch.randn(1, 1, 22050)
>>> timestep_embedding = torch.rand(1, 512)
>>> upsampled_mel = torch.rand(1, 80, 22050)
>>> output = res_block(noisy_audio, timestep_embedding, upsampled_mel)
>>> output[0].shape
torch.Size([1, 64, 22050])
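Inside such a block the dilated convolution produces 2× residual channels, which are combined through a WaveNet-style gate and then split into a residual path and a skip connection. A standalone sketch of the gating step (hypothetical code, not the module's internals):
>>> import torch
>>> residual_channels = 64
>>> y = torch.randn(1, 2 * residual_channels, 22050)  # dilated conv output
>>> gate, filt = y.chunk(2, dim=1)
>>> z = torch.sigmoid(gate) * torch.tanh(filt)        # gated activation
>>> z.shape
torch.Size([1, 64, 22050])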
- class speechbrain.lobes.models.DiffWave.DiffWave(input_channels, residual_layers, residual_channels, dilation_cycle_length, total_steps, unconditional=False)[source]
Bases: Module
DiffWave model with dilated residual blocks
- Parameters:
input_channels (int) – Number of input channels, i.e. the number of mel bands of the conditioning spectrogram.
residual_layers (int) – Number of residual blocks.
residual_channels (int) – Number of channels of the residual path.
dilation_cycle_length (int) – Length of one dilation cycle across the residual blocks.
total_steps (int) – Total number of diffusion steps.
unconditional (bool) – Whether to perform unconditional generation.
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> noisy_audio = torch.randn(1, 1, 25600)
>>> timestep = torch.randint(50, (1,))
>>> input_mel = torch.rand(1, 80, 100)
>>> predicted_noise = diffwave(noisy_audio, timestep, input_mel)
>>> predicted_noise.shape
torch.Size([1, 1, 25600])
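At training time the model predicts the noise injected into the audio at a random timestep. Reusing diffwave and input_mel from the example above, a generic denoising training step would look roughly like the sketch below; the linear beta schedule and the loss wiring are illustrative assumptions, as the actual schedule lives in DiffWaveDiffusion and the training recipe:
>>> betas = torch.linspace(0.0001, 0.05, 50)  # assumed linear schedule
>>> alpha_bar = torch.cumprod(1.0 - betas, dim=0)
>>> clean_audio = torch.randn(1, 1, 25600)    # stand-in for real audio
>>> t = torch.randint(50, (1,))
>>> noise = torch.randn_like(clean_audio)
>>> noisy = alpha_bar[t].sqrt() * clean_audio + (1 - alpha_bar[t]).sqrt() * noise
>>> predicted = diffwave(noisy, t, input_mel)
>>> loss = torch.nn.functional.mse_loss(predicted, noise)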
- forward(audio, diffusion_step, spectrogram=None, length=None)[source]
DiffWave forward function
- Parameters:
audio (torch.Tensor) – Input Gaussian sample [bs, 1, time]
diffusion_step (torch.Tensor) – Diffusion timestep to perform [bs, 1]
spectrogram (torch.Tensor) – Spectrogram data [bs, 80, mel_len]
length (torch.Tensor) – Sample lengths; not used, provided for compatibility only
- Returns:
Predicted noise [bs, 1, time]
- Return type:
torch.Tensor
- class speechbrain.lobes.models.DiffWave.DiffWaveDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)[source]
Bases: DenoisingDiffusion
An enhanced diffusion implementation with DiffWave-specific inference
- Parameters:
model (nn.Module) – The underlying DiffWave model.
timesteps (int) – Total number of diffusion timesteps.
noise (str or nn.Module) – The noise module to apply, e.g. GaussianNoise.
beta_start (float) – Value of beta at the beginning of the schedule.
beta_end (float) – Value of beta at the end of the schedule.
sample_min (float) – Lower bound to which samples are clipped.
sample_max (float) – Upper bound to which samples are clipped.
show_progress (bool) – Whether to show a progress bar during inference.
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> from speechbrain.lobes.models.DiffWave import DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> diffusion = DiffWaveDiffusion(
...     model=diffwave,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> input_mel = torch.rand(1, 80, 100)
>>> output = diffusion.inference(
...     unconditional=False,
...     scale=256,
...     condition=input_mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )
>>> output.shape
torch.Size([1, 25600])
- inference(unconditional, scale, condition=None, fast_sampling=False, fast_sampling_noise_schedule=None, device=None)[source]
Handles the inference of diffwave; one inference function for all local/global conditional generation and unconditional generation tasks
- Parameters:
unconditional (bool) – Do unconditional generation if True, else do conditional generation
scale (int) – Scale to get the final output wave length; for conditional generation, the output wave length is scale * condition.shape[-1], e.g. if the condition is a spectrogram (bs, n_mel, time), scale should be the hop length; for unconditional generation, scale should be the desired audio length
condition (torch.Tensor) – Input spectrogram for vocoding or other conditions for conditional generation; should be None for unconditional generation
fast_sampling (bool) – Whether to do fast sampling
fast_sampling_noise_schedule (list) – The noise schedule for fast sampling
device (str | torch.device) – Inference device
- Returns:
predicted_sample – The predicted audio (bs, 1, t)
- Return type:
torch.Tensor
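Putting the pieces together, a vocoding pass extracts a mel spectrogram with diffwave_mel_spectogram and hands it to inference with scale equal to the hop length. A rough end-to-end sketch, reusing the diffusion object from the example above; the mel parameters are illustrative assumptions, and with untrained weights the result is of course noise:
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import diffwave_mel_spectogram
>>> real_audio = torch.randn(1, 25600)  # stand-in for a real waveform
>>> mel = diffwave_mel_spectogram(
...     sample_rate=22050, hop_length=256, win_length=1024, n_fft=1024,
...     n_mels=80, f_min=0.0, f_max=8000.0, power=1.0, normalized=False,
...     norm="slaney", mel_scale="slaney", audio=real_audio,
... )
>>> waveform = diffusion.inference(
...     unconditional=False, scale=256, condition=mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )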