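speechbrain.augment.time_domain module

Time-Domain Sequential Data Augmentation Classes

This module contains classes designed for augmenting sequential data in the time domain. It is particularly useful during training, where it improves the robustness of neural models. Available distortions include adding noise, applying reverberation, and adjusting playback speed, among others. All classes are implemented as torch.nn.Module, which enables end-to-end differentiability and gradient backpropagation.

Authors: - Peter Plantinga (2020) - Mirco Ravanelli (2023)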

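Summary

Classes:

`AddNoise`	Additively combines a noise signal with the input signal.
`AddReverb`	Convolves an audio signal with an impulse response.
`ChannelDrop`	Randomly drops channels in a multi-channel input waveform.
`ChannelSwap`	Randomly swaps N channels.
`CutCat`	Combines segments (of equal length in time) of the time series contained in the batch.
`DoClip`	Mimics audio clipping by clamping the input tensor.
`DropBitResolution`	Converts a float32 tensor to a lower-resolution one (e.g., int16, int8, float16) and then converts it back to float32.
`DropChunk`	Drops portions of the input signal.
`DropFreq`	Drops a random frequency from the signal.
`FastDropChunk`	Drops portions of the input signal using precomputed masks.
`RandAmp`	Multiplies the signal by a random amplitude.
`Resample`	Resamples audio using the `torchaudio resampler` based on sinc interpolation.
`SignFlip`	Flips the sign of a signal.
`SpeedPerturb`	Slightly speeds up or slows down an audio signal.

Functions:

`pink_noise_like`	Creates a sequence of pink noise (also known as 1/f noise).

Reference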

class speechbrain.augment.time_domain.AddNoise(csv_file=None, csv_keys=None, sorting='random', num_workers=0, snr_low=0, snr_high=0, pad_noise=False, start_index=None, normalize=False, noise_funct=<built-in method randn_like of type object>, replacements={}, noise_sample_rate=16000, clean_sample_rate=16000)[source]

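Bases: Module

This class additively combines a noise signal with the input signal.

Parameters:

csv_file (str) – The name of a CSV file containing the locations of the noise audio files. If none is provided, white noise is used.
csv_keys (list, None, optional) – Default: None. One data entry for the noise data should be specified. If None, the CSV file is expected to contain only one data entry.
sorting (str) – The order in which to iterate over the CSV file, one of: random, original, ascending, descending.
num_workers (int) – Number of workers in the DataLoader (see the PyTorch DataLoader documentation).
snr_low (int) – The low end of the mixing ratios, in decibels.
snr_high (int) – The high end of the mixing ratios, in decibels.
pad_noise (bool) – If True, noise signals that are shorter than their corresponding clean signals are replicated to cover the whole clean signal. Otherwise, the noise is left unpadded.
start_index (int) – The index in the noise waveform to start from. By default, a random index in the range [0, len(noise) - len(waveforms)] is chosen.
normalize (bool) – If True, noisy output signals that exceed [-1, 1] are normalized to [-1, 1].
noise_funct (funct object) – The function used to draw a noise sample. It is enabled only if no CSV file containing noise sequences is provided. By default, torch.randn_like is used (to sample white noise). In general, it must be a function that takes the original waveform as input and returns a tensor with the corresponding noise to add (see, e.g., pink_noise_like).
replacements (dict) – A set of string replacements to carry out in the CSV file. Each time a key is found in the text, it is replaced with the corresponding value.
noise_sample_rate (int) – The sample rate of the noise audio signals, so that the noise can be resampled to the clean sample rate if necessary.
clean_sample_rate (int) – The sample rate of the clean audio signals, so that the noise can be resampled to the clean sample rate if necessary.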

Example

>>> import torch
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import AddNoise
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> clean = signal.unsqueeze(0) # [batch, time]
>>> noisifier = AddNoise('tests/samples/annotation/noise.csv',
...                     replacements={'noise_folder': 'tests/samples/noise'})
>>> noisy = noisifier(clean, torch.ones(1))

forward(waveforms, lengths)[source]

Parameters:

waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
lengths (torch.Tensor) – Shape should be a single dimension, [batch].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].
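
The role of snr_low and snr_high can be illustrated with a minimal sketch of mixing at a target signal-to-noise ratio. It assumes a plain power-based SNR definition and uses a hypothetical helper, mix_at_snr, that is not part of SpeechBrain; the library may scale the signals differently.

```python
import torch


def mix_at_snr(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add `noise` to `clean` so that the clean-to-noise power ratio is `snr_db`.

    Hypothetical helper for illustration only. Both tensors are [batch, time].
    """
    clean_power = clean.pow(2).mean(dim=1, keepdim=True)
    noise_power = noise.pow(2).mean(dim=1, keepdim=True).clamp_min(1e-12)
    # Power ratio that corresponds to the requested SNR in decibels.
    target_ratio = 10.0 ** (snr_db / 10.0)
    # Gain that brings the noise power to clean_power / target_ratio.
    gain = torch.sqrt(clean_power / (noise_power * target_ratio))
    return clean + gain * noise


torch.manual_seed(0)
clean = torch.randn(2, 16000)
noise = torch.randn(2, 16000)
noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Measure the SNR that was actually achieved, per batch item.
residual = noisy - clean
achieved_snr = 10.0 * torch.log10(
    clean.pow(2).mean(dim=1) / residual.pow(2).mean(dim=1)
)
```

In this sketch a higher snr_db means quieter noise; drawing snr_db at random from a range for each batch would mimic the effect of the two bounds.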

class speechbrain.augment.time_domain.AddReverb(csv_file, sorting='random', num_workers=0, rir_scale_factor=1.0, replacements={}, reverb_sample_rate=16000, clean_sample_rate=16000)[source]

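Bases: Module

This class convolves an audio signal with an impulse response.

Parameters:

csv_file (str) – The name of a CSV file containing the locations of the impulse response files.
sorting (str) – The order in which to iterate over the CSV file, one of: random, original, ascending, descending.
num_workers (int) – Number of workers in the DataLoader (see the PyTorch DataLoader documentation).
rir_scale_factor (float) – Compresses or dilates the given impulse response. If 0 < rir_scale_factor < 1, the impulse response is compressed (less reverb), while if rir_scale_factor > 1 it is dilated (more reverb).
replacements (dict) – A set of string replacements to carry out in the CSV file. Each time a key is found in the text, it is replaced with the corresponding value.
reverb_sample_rate (int) – The sample rate of the corruption signals (RIRs), so that they can be resampled to the clean sample rate if necessary.
clean_sample_rate (int) – The sample rate of the clean signals, so that the corruption signals can be resampled to the clean sample rate before convolution.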

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import AddReverb
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> clean = signal.unsqueeze(0) # [batch, time]
>>> reverb = AddReverb('tests/samples/annotation/RIRs.csv',
...                     replacements={'rir_folder': 'tests/samples/RIRs'})
>>> reverbed = reverb(clean)

forward(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: Tensor of shape [batch, time] or [batch, time, channels].
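
Convolving a signal with an impulse response can be sketched in plain PyTorch as follows. The helper convolve_with_rir is hypothetical and only illustrates the operation on a toy impulse response; it does not reproduce the file loading, rescaling, or resampling performed by the class.

```python
import torch
import torch.nn.functional as F


def convolve_with_rir(waveforms: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Convolve each waveform in [batch, time] with a 1-D impulse response.

    Hypothetical helper for illustration. The output keeps the input length.
    """
    # conv1d computes a cross-correlation, so the kernel is flipped first.
    kernel = rir.flip(0).view(1, 1, -1)
    # Left-pad so that every output sample only depends on past input samples.
    padded = F.pad(waveforms.unsqueeze(1), (rir.numel() - 1, 0))
    return F.conv1d(padded, kernel).squeeze(1)


# Direct path plus a single echo with gain 0.5 arriving 3 samples later.
rir = torch.tensor([1.0, 0.0, 0.0, 0.5])
clean = torch.tensor([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
reverbed = convolve_with_rir(clean, rir)
print(reverbed)  # values: 1, 2, 3, 4.5, 6, 7.5
```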

class speechbrain.augment.time_domain.SpeedPerturb(orig_freq, speeds=[90, 100, 110], device='cpu')[source]

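Bases: Module

Slightly speed up or slow down an audio signal.

Resamples the audio signal at a rate that is similar to the original rate, to obtain a slightly slower or slightly faster signal. This technique is outlined in the paper "Audio Augmentation for Speech Recognition".

Parameters:

orig_freq (int) – The sampling frequency of the original signal.
speeds (list) – The speeds that the signal should be changed to, as a percentage of the original signal (i.e., each entry of speeds is divided by 100 to get a ratio).
device (str) – The device to use for the resampling.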

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> perturbator = SpeedPerturb(orig_freq=16000, speeds=[90])
>>> clean = signal.unsqueeze(0)
>>> perturbed = perturbator(clean)
>>> clean.shape
torch.Size([1, 52173])
>>> perturbed.shape
torch.Size([1, 46956])

forward(waveform)[source]

Parameters: waveform (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: torch.Tensor of shape [batch, time] or [batch, time, channels].
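
The output length in the example above can be reproduced with simple integer arithmetic. The sketch below uses a hypothetical helper, perturbed_length, and rests on two assumptions chosen to match the printed shapes: the new frequency is orig_freq * speed // 100, and the resampler returns ceil(num_samples * new_freq / orig_freq) samples.

```python
def perturbed_length(num_samples: int, orig_freq: int, speed: int) -> int:
    """Estimate the number of samples after speed perturbation.

    Hypothetical helper; `speed` is a percentage, as in the `speeds` list.
    """
    new_freq = orig_freq * speed // 100
    # Integer ceiling division avoids floating-point rounding issues.
    return -(-num_samples * new_freq // orig_freq)


print(perturbed_length(52173, 16000, 90))   # 46956, as in the example above
print(perturbed_length(52173, 16000, 100))  # 52173, length unchanged
```

Under these assumptions, a value below 100 yields fewer samples and a value above 100 yields more.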

class speechbrain.augment.time_domain.Resample(orig_freq=16000, new_freq=16000, *args, **kwargs)[source]

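Bases: Module

This class resamples audio using the torchaudio resampler based on sinc interpolation.

Parameters:

orig_freq (int) – The sampling frequency of the input signal.
new_freq (int) – The new sampling frequency after this operation is performed.
*args – Additional positional arguments forwarded to the torchaudio.transforms.Resample constructor.
**kwargs – Additional keyword arguments forwarded to the torchaudio.transforms.Resample constructor.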

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> signal = signal.unsqueeze(0) # [batch, time]
>>> resampler = Resample(orig_freq=16000, new_freq=8000)
>>> resampled = resampler(signal)
>>> signal.shape
torch.Size([1, 52173])
>>> resampled.shape
torch.Size([1, 26087])

forward(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: Tensor of shape [batch, time] or [batch, time, channels].

class speechbrain.augment.time_domain.DropFreq(drop_freq_low=1e-14, drop_freq_high=1, drop_freq_count_low=1, drop_freq_count_high=3, drop_freq_width=0.05, epsilon=1e-12)[source]

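Bases: Module

This class drops a random frequency from the signal.

The purpose of this class is to teach models to rely on all parts of the signal, not just a few frequency bands.

Parameters:

drop_freq_low (float) – The low end of the frequencies that can be dropped, as a fraction of the sampling rate / 2.
drop_freq_high (float) – The high end of the frequencies that can be dropped, as a fraction of the sampling rate / 2.
drop_freq_count_low (int) – The low end of the number of frequencies that could be dropped.
drop_freq_count_high (int) – The high end of the number of frequencies that could be dropped.
drop_freq_width (float) – The width of the frequency band to drop, as a fraction of the sampling rate / 2.
epsilon (float) – A small positive value that prevents issues such as filtering 0 Hz, division by zero, or other numerical instabilities. This value sets the absolute minimum for the normalized frequency used in the filter. Default: 1e-12.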

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> dropper = DropFreq()
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> dropped_signal = dropper(signal.unsqueeze(0))

forward(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: Tensor of shape [batch, time] or [batch, time, channels].
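
Because the frequency parameters are expressed as fractions of the sampling rate / 2, it can help to convert them to hertz. The following sketch uses a hypothetical helper and an assumed sampling rate of 16 kHz.

```python
def nyquist_fraction_to_hz(fraction: float, sample_rate: int) -> float:
    """Convert a frequency given as a fraction of sample_rate / 2 to hertz."""
    return fraction * sample_rate / 2


sample_rate = 16000  # assumed sampling rate for this illustration

# Default drop_freq_width of 0.05: a band of about 400 Hz at 16 kHz.
band_width_hz = nyquist_fraction_to_hz(0.05, sample_rate)
# Default drop_freq_high of 1: the Nyquist frequency, about 8000 Hz.
highest_center_hz = nyquist_fraction_to_hz(1.0, sample_rate)
print(band_width_hz, highest_center_hz)
```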

class speechbrain.augment.time_domain.DropChunk(drop_length_low=100, drop_length_high=1000, drop_count_low=1, drop_count_high=3, drop_start=0, drop_end=None, noise_factor=0.0)[source]

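Bases: Module

This class drops portions of the input signal.

Using DropChunk as an augmentation strategy helps a model learn to rely on all parts of the signal, since it cannot expect any given part to be present.

Parameters:

drop_length_low (int) – The low end of the lengths for which the signal is set to zero, in samples.
drop_length_high (int) – The high end of the lengths for which the signal is set to zero, in samples.
drop_count_low (int) – The low end of the number of times that the signal can be dropped to zero.
drop_count_high (int) – The high end of the number of times that the signal can be dropped to zero.
drop_start (int) – The first index for which dropping is allowed.
drop_end (int) – The last index for which dropping is allowed.
noise_factor (float) – The factor, relative to the average amplitude of the utterance, used to scale the inserted white noise. A value of 1 keeps the average amplitude the same, while 0 inserts all zeros.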

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> dropper = DropChunk(drop_start=100, drop_end=200, noise_factor=0.)
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> signal = signal.unsqueeze(0) # [batch, time]
>>> length = torch.ones(1)
>>> dropped_signal = dropper(signal, length)
>>> float(dropped_signal[:, 150])
0.0

forward(waveforms, lengths)[source]

Parameters:

waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
lengths (torch.Tensor) – Shape should be a single dimension, [batch].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].
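
The dropping operation can be sketched as follows. This is a simplified illustration with a hypothetical helper, drop_chunk, that zeroes a single chunk of fixed length per batch item and treats drop_end as an exclusive bound; the class itself samples the chunk count and length from the configured ranges.

```python
import torch


def drop_chunk(waveforms, drop_length, drop_start, drop_end, generator=None):
    """Zero one random chunk of `drop_length` samples per batch item.

    Hypothetical helper for illustration. `waveforms` is [batch, time] and the
    chunk always lies inside [drop_start, drop_end).
    """
    dropped = waveforms.clone()
    # Latest start index that still keeps the chunk inside the allowed range.
    max_start = drop_end - drop_length
    for i in range(dropped.shape[0]):
        start = int(
            torch.randint(drop_start, max_start + 1, (1,), generator=generator)
        )
        dropped[i, start:start + drop_length] = 0.0
    return dropped


signal = torch.ones(2, 300)
dropped = drop_chunk(signal, drop_length=20, drop_start=100, drop_end=200)
print((dropped == 0).sum(dim=1))  # 20 zeroed samples per item
```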

class speechbrain.augment.time_domain.FastDropChunk(drop_length_low=100, drop_length_high=1000, drop_count_low=1, drop_count_high=10, drop_start=0, drop_end=None, n_masks=1000)[source]

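Bases: Module

This class drops portions of the input signal. The difference with DropChunk is that here the dropping masks are precomputed during the first call of the forward function. For all subsequent calls, the masks are simply shuffled and applied. This makes the code faster and better suited to data augmentation of large batches.

It can be used only for fixed-length sequences.

Parameters:

drop_length_low (int) – The low end of the lengths for which the signal is set to zero, in samples.
drop_length_high (int) – The high end of the lengths for which the signal is set to zero, in samples.
drop_count_low (int) – The low end of the number of times that the signal can be dropped to zero.
drop_count_high (int) – The high end of the number of times that the signal can be dropped to zero.
drop_start (int) – The first index for which dropping is allowed.
drop_end (int) – The last index for which dropping is allowed.
n_masks (int) – The number of precomputed masks.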

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> dropper = FastDropChunk(drop_start=100, drop_end=200)
>>> signal = torch.rand(10, 250, 22)
>>> dropped_signal = dropper(signal)

initialize_masks(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Returns: dropped_masks (torch.Tensor) – Tensor of size [n_masks, time] containing the dropped chunks. Dropped regions are assigned the value 0.

forward(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: Tensor of shape [batch, time] or [batch, time, channels].
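
The precompute-once, reuse-later idea can be sketched as below. PrecomputedMasks is a hypothetical class written for illustration, not the SpeechBrain implementation; it drops a single fixed-length chunk per mask. Because the mask pool is built for one specific time length, the sketch also shows why fixed-length inputs are required.

```python
import torch


class PrecomputedMasks:
    """Build a pool of drop masks once and reuse them on every call.

    Hypothetical class for illustration only.
    """

    def __init__(self, n_masks=8, drop_length=10):
        self.n_masks = n_masks
        self.drop_length = drop_length
        self.masks = None  # created lazily on the first call

    def __call__(self, waveforms):
        time = waveforms.shape[1]
        if self.masks is None:
            # One-off cost: each mask zeroes one randomly placed chunk.
            self.masks = torch.ones(self.n_masks, time)
            starts = torch.randint(0, time - self.drop_length + 1, (self.n_masks,))
            for i, start in enumerate(starts.tolist()):
                self.masks[i, start:start + self.drop_length] = 0.0
        # Later calls only pick masks at random and multiply.
        picked = self.masks[torch.randint(0, self.n_masks, (waveforms.shape[0],))]
        if waveforms.dim() == 3:
            picked = picked.unsqueeze(-1)  # broadcast over channels
        return waveforms * picked


dropper = PrecomputedMasks()
out = dropper(torch.ones(4, 100, 2))
```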

class speechbrain.augment.time_domain.DoClip(clip_low=0.5, clip_high=0.5)[source]

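Bases: Module

This class mimics audio clipping by clamping the input tensor. First, it normalizes the waveforms to the range -1 to 1. Then, clipping is applied. Finally, the original amplitude is restored.

Parameters:

clip_low (float) – The low end of the amplitudes at which to clip the signal.
clip_high (float) – The high end of the amplitudes at which to clip the signal.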

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> clipper = DoClip(clip_low=0.01, clip_high=0.01)
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> clipped_signal = clipper(signal.unsqueeze(0))

forward(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: Tensor of shape [batch, time] or [batch, time, channels].
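
The three steps (normalize, clamp, restore) can be sketched as follows. The helper clip_normalized is hypothetical and uses a fixed clipping value, whereas the class is configured with a range between clip_low and clip_high.

```python
import torch


def clip_normalized(waveforms: torch.Tensor, clip_value: float) -> torch.Tensor:
    """Normalize to [-1, 1], clamp at +/- clip_value, then restore the scale.

    Hypothetical helper for illustration. `waveforms` is [batch, time].
    """
    abs_max = waveforms.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    normalized = waveforms / abs_max
    clipped = normalized.clamp(-clip_value, clip_value)
    return clipped * abs_max


signal = torch.tensor([[0.0, 1.0, -2.0, 4.0]])
clipped_signal = clip_normalized(signal, clip_value=0.5)
print(clipped_signal)  # values: 0, 1, -2, 2 (only the peak at 4 is clipped)
```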

class speechbrain.augment.time_domain.RandAmp(amp_low=0.5, amp_high=1.5)[source]

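Bases: Module

This class multiplies the signal by a random amplitude. First, the signal is normalized so that its amplitude lies between -1 and 1. Then it is multiplied by a random number.

Parameters:

amp_low (float) – The minimum amplitude multiplication factor.
amp_high (float) – The maximum amplitude multiplication factor.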

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> rand_amp = RandAmp(amp_low=0.25, amp_high=1.75)
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> output_signal = rand_amp(signal.unsqueeze(0))

forward(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: Tensor of shape [batch, time] or [batch, time, channels].
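
A minimal sketch of the normalize-then-scale procedure is given below. The helper random_amplitude is hypothetical and assumes one uniformly drawn factor per batch item; the class may sample its factors differently.

```python
import torch


def random_amplitude(waveforms, amp_low=0.5, amp_high=1.5):
    """Normalize each item to a peak of 1, then scale it by a random factor.

    Hypothetical helper for illustration. `waveforms` is [batch, time].
    """
    abs_max = waveforms.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    normalized = waveforms / abs_max
    # One factor per batch item, drawn uniformly from [amp_low, amp_high).
    factor = torch.rand(waveforms.shape[0], 1) * (amp_high - amp_low) + amp_low
    return normalized * factor


signal = torch.randn(3, 1000)
scaled = random_amplitude(signal, amp_low=0.25, amp_high=1.75)
peaks = scaled.abs().amax(dim=1)  # each peak lies between amp_low and amp_high
```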

class speechbrain.augment.time_domain.ChannelDrop(drop_rate=0.1)[source]

Bases: Module

This class randomly drops channels in a multi-channel input waveform.

Parameters: drop_rate (float) – The channel dropout factor.

Example

>>> signal = torch.rand(4, 256, 8)
>>> ch_drop = ChannelDrop(drop_rate=0.5)
>>> output_signal = ch_drop(signal)

forward(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: Tensor of shape [batch, time] or [batch, time, channels].
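
Channel dropping amounts to multiplying the signal by a binary mask over the channel axis. The sketch below uses a hypothetical helper, drop_channels, and assumes an independent mask for every batch item, which may differ from how the class draws its mask.

```python
import torch


def drop_channels(waveforms: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """Zero each channel of a [batch, time, channels] tensor with probability drop_rate.

    Hypothetical helper for illustration only.
    """
    batch, _, channels = waveforms.shape
    # 1.0 keeps a channel, 0.0 drops it.
    keep = (torch.rand(batch, 1, channels) >= drop_rate).to(waveforms.dtype)
    return waveforms * keep  # the mask broadcasts over the time axis


signal = torch.rand(4, 256, 8) + 0.1  # strictly positive, so zeros mean "dropped"
output_signal = drop_channels(signal, drop_rate=0.5)
dropped_per_item = (output_signal == 0).all(dim=1).sum(dim=1)
print(dropped_per_item)  # number of silenced channels in each batch item
```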

class speechbrain.augment.time_domain.ChannelSwap(min_swap=0, max_swap=0)[source]

Bases: Module

This class randomly swaps N channels.

Parameters:

min_swap (int) – The minimum number of channels to swap.
max_swap (int) – The maximum number of channels to swap.

Example

>>> signal = torch.rand(4, 256, 8)
>>> ch_swap = ChannelSwap()
>>> output_signal = ch_swap(signal)

forward(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: Tensor of shape [batch, time] or [batch, time, channels].

class speechbrain.augment.time_domain.CutCat(min_num_segments=2, max_num_segments=10)[source]

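Bases: Module

This class combines segments (of equal length in time) of the time series contained in the batch. The technique was proposed for EEG signals in https://doi.org/10.1016/j.neunet.2021.05.032.

Parameters:

min_num_segments (int) – The minimum number of segments to combine.
max_num_segments (int) – The maximum number of segments to combine. Default: 10.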

Example

>>> signal = torch.ones((4, 256, 22)) * torch.arange(4).reshape((4, 1, 1,))
>>> cutcat =  CutCat()
>>> output_signal = cutcat(signal)

forward(waveforms)[source]

Parameters: waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
Return type: Tensor of shape [batch, time] or [batch, time, channels].
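
One way to combine equal-length segments across a batch is sketched below. The helper cut_and_concatenate is hypothetical: it uses a fixed number of segments and assumes that every second segment is taken from the previous batch item, which is one possible pairing rather than a statement about the class internals.

```python
import torch


def cut_and_concatenate(waveforms: torch.Tensor, num_segments: int) -> torch.Tensor:
    """Replace every second segment with the same segment of the previous item.

    Hypothetical helper for illustration. `waveforms` is [batch, time, ...].
    """
    rolled = torch.roll(waveforms, shifts=1, dims=0)  # item i is paired with item i - 1
    mixed = waveforms.clone()
    time = waveforms.shape[1]
    # Integer segment boundaries of (nearly) equal length.
    bounds = [k * time // num_segments for k in range(num_segments + 1)]
    for k in range(num_segments):
        if k % 2 == 1:
            mixed[:, bounds[k]:bounds[k + 1]] = rolled[:, bounds[k]:bounds[k + 1]]
    return mixed


# Each item is filled with its own batch index, as in the example above.
signal = torch.ones(4, 8) * torch.arange(4).reshape(4, 1)
output_signal = cut_and_concatenate(signal, num_segments=4)
print(output_signal[1])  # values: 1, 1, 0, 0, 1, 1, 0, 0
```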

speechbrain.augment.time_domain.pink_noise_like(waveforms, alpha_low=1.0, alpha_high=1.0, sample_rate=50)[source]

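Creates a sequence of pink noise (also known as 1/f noise). The pink noise is obtained by multiplying the spectrum of a white noise sequence by a factor 1/f^alpha. The alpha factor controls the decay in the frequency domain (alpha=0 yields white noise, alpha>>0 yields noise concentrated at low frequencies). It is randomly sampled between alpha_low and alpha_high. With a negative alpha, this function generates blue noise.

Parameters:

waveforms (torch.Tensor) – The original waveform. It is only used to infer the shape.
alpha_low (float) – The minimum value of the alpha spectral smoothing factor.
alpha_high (float) – The maximum value of the alpha spectral smoothing factor.
sample_rate (float) – The sample rate of the original signal.

Returns:

pink_noise – Pink noise with the shape of the input tensor.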

Return type:

torch.Tensor

Example

>>> waveforms = torch.randn(4,257,10)
>>> noise = pink_noise_like(waveforms)
>>> noise.shape
torch.Size([4, 257, 10])
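
The spectral shaping described above can be sketched with torch.fft. The helper colored_noise is hypothetical: it uses a fixed alpha, works on [batch, time] shapes only, and leaves the 0 Hz bin unscaled to avoid a division by zero, all of which are simplifications rather than the behavior of pink_noise_like.

```python
import torch


def colored_noise(shape, alpha: float, sample_rate: float = 50.0) -> torch.Tensor:
    """Shape white noise of size [batch, time] with a 1 / f**alpha spectral factor.

    Hypothetical helper for illustration only.
    """
    white = torch.randn(shape)
    spectrum = torch.fft.rfft(white, dim=1)
    freqs = torch.fft.rfftfreq(shape[1], d=1.0 / sample_rate)
    scale = torch.ones_like(freqs)
    # Skip the 0 Hz bin, where 1 / f**alpha is undefined for alpha > 0.
    scale[1:] = freqs[1:] ** (-alpha)
    return torch.fft.irfft(spectrum * scale, n=shape[1], dim=1)


noise = colored_noise((4, 257), alpha=1.0)
print(noise.shape)  # torch.Size([4, 257])
```

With alpha set to 0 the scale is 1 everywhere and the sketch returns the white noise unchanged, up to numerical error.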

class speechbrain.augment.time_domain.DropBitResolution(target_dtype='random')[source]

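Bases: Module

This class converts a float32 tensor to a lower-resolution one (e.g., int16, int8, float16) and then converts it back to float32. The process loses information and can be used for data augmentation.

Parameters:

target_dtype (str) – One of "int16", "int8", "float16". If "random", the bit resolution is randomly selected among the options listed above.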

Example:

>>> dropper = DropBitResolution()
>>> signal = torch.rand(4, 16000)
>>> signal_dropped = dropper(signal)

forward(float32_tensor)[source]

Parameters: float32_tensor (torch.Tensor) – Float32 tensor with shape [batch, time] or [batch, time, channels].
Returns: Tensor of shape [batch, time] or [batch, time, channels] (Float32).
Return type: torch.Tensor
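
The information loss of such a round trip can be sketched as follows. The helper through_int16 is hypothetical; in particular, the scaling by 32767 and the assumption that the input lies in [-1, 1] are choices made for this illustration.

```python
import torch


def through_int16(float32_tensor: torch.Tensor) -> torch.Tensor:
    """Quantize a float32 signal in [-1, 1] to int16 and convert it back.

    Hypothetical helper for illustration only.
    """
    scale = 32767.0  # largest int16 value; scaling choice assumed for this sketch
    as_int16 = (
        (float32_tensor * scale).round().clamp(-32768, 32767).to(torch.int16)
    )
    return as_int16.to(torch.float32) / scale


signal = torch.rand(4, 16000) * 2 - 1
restored = through_int16(signal)
max_error = (restored - signal).abs().max()  # bounded by the quantization step

# A float16 round trip needs no scaling, only two casts.
half_restored = signal.to(torch.float16).to(torch.float32)
```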

class speechbrain.augment.time_domain.SignFlip(flip_prob=0.5)[source]

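Bases: Module

Flip the sign of a signal.

This module negates all the values in a tensor with a given probability. If the sign is not flipped, the original signal is returned unchanged. This technique is outlined in the paper "CADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals", https://arxiv.org/pdf/2106.13695

Parameters: flip_prob (float) – The probability with which to flip the sign of the signal. Default: 0.5.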

Example

>>> import torch
>>> x = torch.tensor([1,2,3,4,5])
>>> flip = SignFlip(flip_prob=1) # 100% chance to flip sign
>>> flip(x)
tensor([-1, -2, -3, -4, -5])

forward(waveform)[source]

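Parameters: waveform (torch.Tensor) – Input tensor representing the waveform; its shape does not matter.
Returns: Output tensor with the same shape as the input, in which the sign of all values has been flipped with probability flip_prob.
Return type: torch.Tensor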
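
The behavior described above reduces to a single random draw. The function sign_flip below is a hypothetical stand-in written for illustration, not the module itself.

```python
import torch


def sign_flip(waveform: torch.Tensor, flip_prob: float = 0.5) -> torch.Tensor:
    """Negate the whole tensor with probability flip_prob, else return it as is.

    Hypothetical helper for illustration only.
    """
    # torch.rand draws from [0, 1), so flip_prob=1 always flips and 0 never does.
    if torch.rand(1).item() < flip_prob:
        return -waveform
    return waveform


x = torch.tensor([1.0, -2.0, 3.0])
always = sign_flip(x, flip_prob=1.0)  # every value negated
never = sign_flip(x, flip_prob=0.0)   # input returned unchanged
```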