Speech Enhancement from Scratch
So you want to do regression with speech? Look no further, you're in the right place. This tutorial will walk you through a basic speech enhancement template, showing all the components needed to make a new recipe with SpeechBrain.
Before diving into the code, let's briefly introduce the problem of speech enhancement. The goal of speech enhancement is to remove the noise from an input recording:
The problem is very challenging, because many different types of disturbance can corrupt the speech signal.
There are different approaches for solving this problem. One of the most popular techniques today is mask-based speech enhancement:
With masking approaches, instead of estimating the enhanced signal directly, we estimate a soft mask. The enhanced signal is then obtained by multiplying the noisy signal by the soft mask.
Depending on the type of input/output, we can have:
Waveform masking (as shown in the figure above)
Spectral masking (as shown in the figure below)
With spectral masking, the system maps the noisy spectrogram into a clean one. This mapping is generally considered easier than a waveform-to-waveform one. However, retrieving the signal in the time domain requires adding phase information. A common solution (reasonable, but not ideal) is to use the phase of the noisy signal. Waveform masking approaches do not suffer from this limitation and are gaining popularity in the community.
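To make the idea concrete, here is a minimal, self-contained sketch of spectral masking with noisy-phase resynthesis, written with plain torch.stft/istft (the template below uses SpeechBrain's own STFT wrappers instead, and the random mask here merely stands in for a model's prediction):
import torch

n_fft = 512
noisy = torch.randn(1, 16000)                # stand-in for a real noisy waveform
window = torch.hann_window(n_fft)

spec = torch.stft(noisy, n_fft, window=window, return_complex=True)
mag, phase = spec.abs(), spec.angle()

mask = torch.sigmoid(torch.randn_like(mag))  # stand-in for a predicted soft mask
enhanced_mag = mask * mag                    # apply the mask to the noisy magnitude

# Resynthesize using the *noisy* phase: the common, not-ideal solution
enhanced_spec = torch.polar(enhanced_mag, phase)
enhanced = torch.istft(enhanced_spec, n_fft, window=window)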
It is worth mentioning that SpeechBrain currently supports more advanced speech enhancement solutions, such as MetricGAN+ (which learns the PESQ metric within an adversarial training framework) and MimicLoss (which exploits information from a speech recognizer to achieve better enhancement).
In this tutorial, we will guide you through the creation of a simple speech enhancement system based on spectral masking.
In particular, we will refer to the example reported here:
https://github.com/speechbrain/speechbrain/blob/develop/templates/enhancement/
The README provides a nice introduction, so it is reproduced here:
==========================
This folder provides a working, well-documented example of training a speech enhancement model from scratch, based on a few hours of data. The data we use is from Mini Librispeech + OpenRIR.
There are four files here:
train.py: the main code file, which outlines the entire training process.
train.yaml: the hyperparameters file, which sets all of the execution parameters.
custom_model.py: a file containing the definition of a PyTorch module.
mini_librispeech_prepare.py: downloads and prepares the data manifests, if needed.
To train an enhancement model, just execute the following on the command line:
python train.py train.yaml --data_folder /path/to/save/mini_librispeech
This will automatically download and prepare the data manifests for mini librispeech, and then train a model with dynamically generated noisy samples, using noise, reverberation, and babble.
=========================
First, let's make sure we can run the template without any modifications.
%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH
# Clone SpeechBrain repository
!git clone https://github.com/speechbrain/speechbrain/
import speechbrain as sb
%cd speechbrain/templates/enhancement
!python train.py train.yaml --device='cpu' --debug
Overview of the recipe in train.py
Let's start at the highest level of the recipe and work our way down. To do this, we should look at the bottom of the recipe, where the if __name__ == "__main__": block defines the structure of the recipe. The basic process, sketched in code just after this list, is:
1. Load hyperparameters and command-line overrides.
2. Prepare the data manifests and data-loading objects.
3. Instantiate the SEBrain subclass as se_brain.
4. Call se_brain.fit() to perform training.
5. Call se_brain.evaluate() to check final performance.
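Condensed, that block looks roughly like the sketch below. It is assembled from the cells we run later in this tutorial (and from the template), so treat it as an overview rather than the authoritative file:
import sys
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from mini_librispeech_prepare import prepare_mini_librispeech

if __name__ == "__main__":
    # (1) Load hyperparameters, applying any command-line overrides
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)
    sb.create_experiment_directory(hparams["output_folder"])

    # (2) Prepare the data manifests and the dataset objects
    prepare_mini_librispeech(
        data_folder=hparams["data_folder"],
        save_json_train=hparams["train_annotation"],
        save_json_valid=hparams["valid_annotation"],
        save_json_test=hparams["test_annotation"],
    )
    datasets = dataio_prep(hparams)

    # (3) Instantiate the Brain subclass
    se_brain = SEBrain(
        modules=hparams["modules"],
        opt_class=hparams["opt_class"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
    )

    # (4) Train, then (5) evaluate with the best checkpoint
    se_brain.fit(
        epoch_counter=se_brain.hparams.epoch_counter,
        train_set=datasets["train"],
        valid_set=datasets["valid"],
        train_loader_kwargs=hparams["dataloader_options"],
        valid_loader_kwargs=hparams["dataloader_options"],
    )
    se_brain.evaluate(
        test_set=datasets["test"],
        max_key="stoi",
        test_loader_kwargs=hparams["dataloader_options"],
    )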
That's it! Before we actually run this code, let's manually define SEBrain as a subclass of the Brain class. If you'd like a more in-depth tutorial on how the Brain class works, check out the Brain tutorial.
For simplicity, we'll define just the subclass with the first method overridden, then add the other overrides one at a time. The first method is compute_forward, which simply defines how the model uses the data to make predictions. The return value should include whatever predictions the model makes. In this case, the method computes the relevant features, computes the predicted mask, then applies the mask and resynthesizes the time-domain signal.
import torch

class SEBrain(sb.Brain):
"""Class that manages the training loop. See speechbrain.core.Brain."""
def compute_forward(self, batch, stage):
"""Apply masking to convert from noisy waveforms to enhanced signals.
Arguments
---------
batch : PaddedBatch
This batch object contains all the relevant tensors for computation.
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
Returns
-------
predictions : dict
A dictionary with keys {"spec", "wav"} with predicted features.
"""
# We first move the batch to the appropriate device, and
# compute the features necessary for masking.
batch = batch.to(self.device)
self.clean_wavs, self.lens = batch.clean_sig
noisy_wavs, self.lens = self.hparams.wav_augment(
self.clean_wavs, self.lens
)
noisy_feats = self.compute_feats(noisy_wavs)
# Masking is done here with the "signal approximation (SA)" algorithm.
# The masked input is compared directly with clean speech targets.
mask = self.modules.model(noisy_feats)
predict_spec = torch.mul(mask, noisy_feats)
# Also return predicted wav, for evaluation. Note that this could
# also be used for a time-domain loss term.
predict_wav = self.hparams.resynth(
torch.expm1(predict_spec), noisy_wavs
)
# Return a dictionary so we don't have to remember the order
return {"spec": predict_spec, "wav": predict_wav}
If at this point you're wondering what the self.modules and self.hparams objects are, you're asking the right question. These objects get constructed when the SEBrain class is instantiated, and they come directly from the dict arguments of the initializer: modules and hparams. The keys of the dictionaries provide the names you use to refer to the objects; for example, passing {"model": model} for modules allows you to access the model via self.modules.model.
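As a quick illustration (with a toy module and hyperparameter that are not part of the template), here is how the dict arguments surface as attributes:
import torch
import speechbrain as sb

class DemoBrain(sb.Brain):
    def compute_forward(self, batch, stage):
        return self.modules.model(batch)  # from modules={"model": ...}

    def compute_objectives(self, predictions, batch, stage):
        # self.hparams.scale comes from hparams={"scale": ...}
        return predictions.pow(2).mean() * self.hparams.scale

demo = DemoBrain(
    modules={"model": torch.nn.Linear(4, 4)},
    hparams={"scale": 0.5},
    run_opts={"device": "cpu"},
)
print(demo.modules.model)   # the Linear module, accessed via its dict key
print(demo.hparams.scale)   # 0.5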
The other method that needs to be defined in a Brain subclass is the compute_objectives function. We subclass SEBrain itself simply as a convenient way to split up the class definition; don't use this technique in production code!
class SEBrain(SEBrain):
def compute_objectives(self, predictions, batch, stage):
"""Computes the loss given the predicted and targeted outputs.
Arguments
---------
predictions : dict
The output dict from `compute_forward`.
batch : PaddedBatch
This batch object contains all the relevant tensors for computation.
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
Returns
-------
loss : torch.Tensor
A one-element tensor used for backpropagating the gradient.
"""
# Prepare clean targets for comparison
clean_spec = self.compute_feats(self.clean_wavs)
# Directly compare the masked spectrograms with the clean targets
loss = sb.nnet.losses.mse_loss(
predictions["spec"], clean_spec, self.lens
)
        # Append this batch of losses to the loss metric for easy
        # summarization at the end of the stage
self.loss_metric.append(
batch.id,
predictions["spec"],
clean_spec,
self.lens,
reduction="batch",
)
# Some evaluations are slower, and we only want to perform them
# on the validation set.
if stage != sb.Stage.TRAIN:
# Evaluate speech intelligibility as an additional metric
self.stoi_metric.append(
batch.id,
predictions["wav"],
self.clean_wavs,
self.lens,
reduction="batch",
)
return loss
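As an aside, the lens argument makes the loss length-aware: it holds relative lengths in [0, 1], and frames beyond each item's true length are masked out of the loss. A tiny illustrative check:
import torch
import speechbrain as sb

pred = torch.zeros(2, 10, 5)
target = torch.ones(2, 10, 5)
lens = torch.tensor([1.0, 0.5])  # the second item is only half valid
# Mean squared error computed over valid frames only
print(sb.nnet.losses.mse_loss(pred, target, lens))  # tensor(1.)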
Both of those methods use a third method, called compute_feats, that is not an override; we quickly define it here:
class SEBrain(SEBrain):
def compute_feats(self, wavs):
"""Returns corresponding log-spectral features of the input waveforms.
Arguments
---------
wavs : torch.Tensor
The batch of waveforms to convert to log-spectral features.
"""
# Log-spectral features
feats = self.hparams.compute_STFT(wavs)
feats = sb.processing.features.spectral_magnitude(feats, power=0.5)
# Log1p reduces the emphasis on small differences
feats = torch.log1p(feats)
return feats
Only two more methods get defined, for the purposes of tracking statistics and saving checkpoints. These are the on_stage_start and on_stage_end methods, which get called by fit() before and after iterating each dataset, respectively. Before each stage, we set up the metric trackers:
class SEBrain(SEBrain):
def on_stage_start(self, stage, epoch=None):
"""Gets called at the beginning of each epoch.
Arguments
---------
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
epoch : int
The currently-starting epoch. This is passed
`None` during the test stage.
"""
# Set up statistics trackers for this stage
self.loss_metric = sb.utils.metric_stats.MetricStats(
metric=sb.nnet.losses.mse_loss
)
# Set up evaluation-only statistics trackers
if stage != sb.Stage.TRAIN:
self.stoi_metric = sb.utils.metric_stats.MetricStats(
metric=sb.nnet.loss.stoi_loss.stoi_loss
)
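If you'd like to see MetricStats in isolation, here is a small illustrative example: append() stores per-item scores (via reduction="batch", just as in the code above), and summarize() aggregates them at stage end:
import torch
import speechbrain as sb

metric = sb.utils.metric_stats.MetricStats(metric=sb.nnet.losses.mse_loss)
metric.append(
    ids=["utt1", "utt2"],
    predictions=torch.zeros(2, 10, 5),
    targets=torch.ones(2, 10, 5),
    length=torch.tensor([1.0, 0.5]),
    reduction="batch",
)
print(metric.summarize("average"))  # 1.0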
After the validation stage, we use the trackers to summarize the statistics, and we save a checkpoint.
class SEBrain(SEBrain):
def on_stage_end(self, stage, stage_loss, epoch=None):
"""Gets called at the end of an epoch.
Arguments
---------
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, sb.Stage.TEST
stage_loss : float
The average loss for all of the data processed in this stage.
epoch : int
The currently-starting epoch. This is passed
`None` during the test stage.
"""
# Store the train loss until the validation stage.
if stage == sb.Stage.TRAIN:
self.train_loss = stage_loss
# Summarize the statistics from the stage for record-keeping.
else:
stats = {
"loss": stage_loss,
"stoi": -self.stoi_metric.summarize("average"),
}
# At the end of validation, we can write stats and checkpoints
if stage == sb.Stage.VALID:
# The train_logger writes a summary to stdout and to the logfile.
self.hparams.train_logger.log_stats(
{"Epoch": epoch},
train_stats={"loss": self.train_loss},
valid_stats=stats,
)
# Save the current checkpoint and delete previous checkpoints,
# unless they have the current best STOI score.
self.checkpointer.save_and_keep_only(meta=stats, max_keys=["stoi"])
# We also write statistics about test data to stdout and to the logfile.
if stage == sb.Stage.TEST:
self.hparams.train_logger.log_stats(
{"Epoch loaded": self.hparams.epoch_counter.current},
test_stats=stats,
)
OK, that's everything needed to define the SEBrain class! The only thing left before we can actually run this is the data-loading functions. We'll be using DynamicItemDatasets, which you can read more about in the data loading tutorial. We only need to define a function for loading the audio data, and we can use it to create all of our datasets!
def dataio_prep(hparams):
"""This function prepares the datasets to be used in the brain class.
It also defines the data processing pipeline through user-defined functions.
We expect `prepare_mini_librispeech` to have been called before this,
so that the `train.json` and `valid.json` manifest files are available.
Arguments
---------
hparams : dict
This dictionary is loaded from the `train.yaml` file, and it includes
all the hyperparameters needed for dataset construction and loading.
Returns
-------
datasets : dict
        Contains three keys, "train", "valid", and "test", that
        correspond to the appropriate DynamicItemDataset object.
"""
    # Define the audio pipeline: we load only the clean signal here, since
    # noise, reverb, and babble are added on-the-fly during training.
    # Of course, for a real enhancement dataset you'd want a fixed valid set.
@sb.utils.data_pipeline.takes("wav")
@sb.utils.data_pipeline.provides("clean_sig")
def audio_pipeline(wav):
"""Load the signal, and pass it and its length to the corruption class.
This is done on the CPU in the `collate_fn`."""
clean_sig = sb.dataio.dataio.read_audio(wav)
return clean_sig
# Define datasets sorted by ascending lengths for efficiency
datasets = {}
data_info = {
"train": hparams["train_annotation"],
"valid": hparams["valid_annotation"],
"test": hparams["test_annotation"],
}
hparams["dataloader_options"]["shuffle"] = False
for dataset in data_info:
datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
json_path=data_info[dataset],
replacements={"data_root": hparams["data_folder"]},
dynamic_items=[audio_pipeline],
output_keys=["id", "clean_sig"],
).filtered_sorted(sort_key="length")
return datasets
Now that we've defined all of the code in train.py besides the __main__ block, we can start executing our recipe! This code has been lightly edited to simplify the parts that don't necessarily apply to running the code in Colab. The first step is loading the hyperparameters, which automatically creates many of the objects we need. You can find more about how HyperPyYAML works in our HyperPyYAML tutorial. In addition, we create the folder for storing experimental data, checkpoints, and statistics.
from hyperpyyaml import load_hyperpyyaml
with open("train.yaml") as fin:
hparams = load_hyperpyyaml(fin)
sb.create_experiment_directory(hparams["output_folder"])
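For a flavor of what load_hyperpyyaml does, here is a tiny illustrative snippet (not the template's actual train.yaml): !new: instantiates a class, and !ref reuses another entry:
from hyperpyyaml import load_hyperpyyaml

tiny_yaml = """
n_fft: 512
compute_STFT: !new:speechbrain.processing.features.STFT
    sample_rate: 16000
    n_fft: !ref <n_fft>
"""
tiny_hparams = load_hyperpyyaml(tiny_yaml)
print(tiny_hparams["compute_STFT"])  # an already-instantiated STFT object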
Just like that, we have easy access to our pytorch model, as well as a variety of other hyperparameters. Feel free to explore the hparams object, but here are a few examples:
# Already-applied random seed
hparams["seed"]
# STFT function
hparams["compute_STFT"]
# Masking model
hparams["model"]
Prepare the data manifests, and create the dataset objects using the function we defined earlier:
from mini_librispeech_prepare import prepare_mini_librispeech
prepare_mini_librispeech(
data_folder=hparams["data_folder"],
save_json_train=hparams["train_annotation"],
save_json_valid=hparams["valid_annotation"],
save_json_test=hparams["test_annotation"],
)
datasets = dataio_prep(hparams)
We can check that the data loaded correctly by looking at the first few items:
import torch
datasets["train"][0]
datasets["valid"][0]
Instantiate the SEBrain object in preparation for training:
se_brain = SEBrain(
modules=hparams["modules"],
opt_class=hparams["opt_class"],
hparams=hparams,
checkpointer=hparams["checkpointer"],
)
Then call fit() to train! The fit() method iterates the training loop, calling the methods necessary to update the parameters of the model. Since all objects with changing state are managed by the Checkpointer, training can be stopped at any point, and it will resume from where it left off on the next call.
se_brain.fit(
epoch_counter=se_brain.hparams.epoch_counter,
train_set=datasets["train"],
valid_set=datasets["valid"],
train_loader_kwargs=hparams["dataloader_options"],
valid_loader_kwargs=hparams["dataloader_options"],
)
Once training is complete, we can load the checkpoint that performed best on the validation data (as measured by STOI) and use it for evaluation.
se_brain.evaluate(
test_set=datasets["test"],
max_key="stoi",
test_loader_kwargs=hparams["dataloader_options"],
)
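As a final illustrative sketch (not part of the template), the trained model can enhance a new recording by mirroring the steps of compute_forward above; "noisy.wav" is a hypothetical path, to be replaced with a real recording at the training sample rate:
noisy_wavs = sb.dataio.dataio.read_audio("noisy.wav").unsqueeze(0)
noisy_wavs = noisy_wavs.to(se_brain.device)

se_brain.modules.eval()
with torch.no_grad():
    noisy_feats = se_brain.compute_feats(noisy_wavs)
    mask = se_brain.modules.model(noisy_feats)
    enhanced_wav = se_brain.hparams.resynth(
        torch.expm1(mask * noisy_feats), noisy_wavs
    )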
Citing SpeechBrain
If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:
@misc{speechbrainV1,
title={Open-Source Conversational AI with {SpeechBrain} 1.0},
author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
year={2024},
eprint={2407.00463},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}