What can I do with SpeechBrain?

SpeechBrain already covers a wide range of speech tasks. You can use it to address the following types of problems:

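Speech classification (many-to-one, e.g., speaker identification)
Speech regression (speech-to-speech mapping, e.g., speech enhancement)
Sequence-to-sequence (speech-to-text mapping, e.g., speech recognition)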

More precisely, SpeechBrain supports many conversational AI tasks (see our README). You can also browse the other tutorials.

For all of these tasks, we provide tutorials that let users train a model from scratch, as well as pretrained models and experiment logs.

The typical way to train a model from scratch with SpeechBrain is:

cd recipe/dataset_name/task_name
python train.py train.yaml --data_folder=/path/to/the/dataset
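
The `--data_folder=...` argument in the command above replaces a value that would otherwise come from `train.yaml`. The sketch below illustrates that override idea with plain dictionaries. It is not SpeechBrain's actual argument parser: the `apply_overrides` helper and the default values are hypothetical and only serve the illustration.

```python
def apply_overrides(defaults, argv):
    """Return a copy of `defaults` updated with `--key=value` arguments.

    Hypothetical helper for illustration only; SpeechBrain's real
    command-line handling is more elaborate.
    """
    config = dict(defaults)
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            config[key] = value
    return config


# Stand-in for values that would normally live in train.yaml (assumed).
yaml_defaults = {"data_folder": "!PLACEHOLDER", "number_of_epochs": "15"}

config = apply_overrides(yaml_defaults, ["--data_folder=/path/to/the/dataset"])
print(config["data_folder"])
```

Only the keys named on the command line change; everything else keeps the value from the file.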

Please refer to the tutorials mentioned above for more information about training.

In this short tutorial, we only show how to use some of the pretrained models available on HuggingFace. First, let's install SpeechBrain:

%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH

%%capture
%cd /content
!wget -O example_mandarin.wav "https://www.dropbox.com/scl/fi/7jn7jg9ea2u6d9d70657z/example_mandarin.wav?rlkey=eh220qallihxp9yppm2kx7a2i&dl=1"
!wget -O example_rw.mp3 "https://www.dropbox.com/scl/fi/iplkymn8c8mbc6oclxem3/example_rw.mp3?rlkey=yhmqfsn8q43pmvd1uvjo3yl0s&dl=1"
!wget -O example_whamr.wav "https://www.dropbox.com/scl/fi/gxbtbf3c3hxr0y9dbf0nw/example_whamr.wav?rlkey=1wt5d49kjl36h0zypwrmsy8nz&dl=1"
!wget -O example-fr.wav "https://www.dropbox.com/scl/fi/vjn98vu8e3i2mvsw17msh/example-fr.wav?rlkey=vabmu4fgqp60oken8aosg75i0&dl=1"
!wget -O example-it.wav "https://www.dropbox.com/scl/fi/o3t7j53s7czaob8yq73rz/example-it.wav?rlkey=x9u6bkbcp6lh3602fb9uai5h3&dl=1"
!wget -O example.wav "https://www.dropbox.com/scl/fi/uws97livpeta7rowb7q7g/example.wav?rlkey=swppq2so15jibmpmihenrktbt&dl=1"
!wget -O example1.wav "https://www.dropbox.com/scl/fi/mu1tdejny4cbgxczwm944/example1.wav?rlkey=8pi7hjz15syvav80u1xzfbfhn&dl=1"
!wget -O example2.flac "https://www.dropbox.com/scl/fi/k9ouk6ec1q1fkevamodrn/example2.flac?rlkey=vtbyc6bzp9hknzvn9rb63z3yf&dl=1"
!wget -O test_mixture.wav "https://www.dropbox.com/scl/fi/4327g66ajs8aq3dck0fzn/test_mixture.wav?rlkey=bjdcw3msxw3armpelxuayug5i&dl=1"

Once the installation has finished, you should be able to import the speechbrain project with Python:

import speechbrain as sb
from speechbrain.dataio.dataio import read_audio
from IPython.display import Audio

Speech recognition in different languages

English

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_models/asr-crdnn-rnnlm-librispeech")
asr_model.transcribe_file('/content/example.wav')

signal = read_audio("/content/example.wav").squeeze()
Audio(signal, rate=16000)

French

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-commonvoice-fr", savedir="pretrained_models/asr-crdnn-commonvoice-fr")
asr_model.transcribe_file("/content/example-fr.wav")

signal = read_audio("/content/example-fr.wav").squeeze()
Audio(signal, rate=44100)

Italian

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-commonvoice-it", savedir="pretrained_models/asr-crdnn-commonvoice-it")
asr_model.transcribe_file("/content/example-it.wav")

signal = read_audio("/content/example-it.wav").squeeze()
Audio(signal, rate=16000)

Mandarin

from speechbrain.inference.interfaces import foreign_class

asr_model = foreign_class(source="speechbrain/asr-wav2vec2-ctc-aishell",  pymodule_file="custom_interface.py", classname="CustomEncoderDecoderASR")
asr_model.transcribe_file("/content/example_mandarin.wav")

['她', '应', '该', '也', '是', '喜', '欢']
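
The Mandarin model returns the transcription as a list of characters, as in the output above. If a single string is more convenient, the characters can be joined. The snippet below is a minimal sketch that hard-codes the output shown above instead of calling the model.

```python
# Output copied from the transcription above (not produced by running the model here).
chars = ['她', '应', '该', '也', '是', '喜', '欢']

# Mandarin text is written without spaces, so join with an empty separator.
sentence = "".join(chars)
print(sentence)
```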

signal = read_audio("/content/example_mandarin.wav").squeeze()
Audio(signal, rate=16000)

Kinyarwanda

from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-rw", savedir="pretrained_models/asr-wav2vec2-commonvoice-rw")
asr_model.transcribe_file("/content/example_rw.mp3")

signal = read_audio("/content/example_rw.mp3").squeeze()
Audio(signal, rate=44100)
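
Each playback cell above passes a sampling rate to `Audio` explicitly (16000 or 44100 in these examples). That value needs to match the rate of the loaded signal, otherwise playback is slowed down or sped up. The arithmetic below illustrates the effect for a made-up three-second recording; `playback_seconds` is a hypothetical helper, not part of SpeechBrain or IPython.

```python
def playback_seconds(num_samples, playback_rate):
    """Duration heard when `num_samples` samples are played at `playback_rate` Hz."""
    return num_samples / playback_rate


# Three seconds of audio recorded at 44.1 kHz (made-up length for illustration).
num_samples = 3 * 44100

correct = playback_seconds(num_samples, 44100)   # rate matches the recording
too_slow = playback_seconds(num_samples, 16000)  # rate is lower than the recording

print(correct, too_slow)
```

With the lower rate, the same samples take almost three times as long to play, so the speech sounds slow and low-pitched.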

Speech Separation

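Here we show a mixture with two speakers, but we also provide a state-of-the-art system that separates mixtures of three speakers, as well as models that handle noise and reverberation. See our HuggingFace repository.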

from speechbrain.inference.separation import SepformerSeparation as separator

model = separator.from_hparams(source="speechbrain/sepformer-wsj02mix", savedir='pretrained_models/sepformer-wsj02mix')
est_sources = model.separate_file(path='/content/test_mixture.wav')

signal = read_audio("/content/test_mixture.wav").squeeze()
Audio(signal, rate=8000)

Audio(est_sources[:, :, 0].detach().cpu().squeeze(), rate=8000)

Audio(est_sources[:, :, 1].detach().cpu().squeeze(), rate=8000)
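
The indexing above relies on `est_sources` being laid out as [batch, time, source], so `est_sources[:, :, 0]` selects the first estimated speaker and `est_sources[:, :, 1]` the second. The sketch below mimics that selection on a tiny nested list used as a stand-in for the tensor; the `select_source` helper and the sample values are made up for illustration.

```python
def select_source(est_sources, index):
    """Mimic est_sources[:, :, index] on a nested list shaped [batch][time][source]."""
    return [[frame[index] for frame in item] for item in est_sources]


# One batch item, three time steps, two estimated sources (made-up values).
est_sources = [[[0.1, -0.4], [0.2, -0.5], [0.3, -0.6]]]

first_speaker = select_source(est_sources, 0)
second_speaker = select_source(est_sources, 1)
print(first_speaker, second_speaker)
```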

Speech Enhancement

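The goal of speech enhancement is to remove the noise that corrupts a recording. SpeechBrain provides several speech enhancement systems. Below is an example processed by SepFormer (a version trained to perform enhancement):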

from speechbrain.inference.separation import SepformerSeparation as separator
import torchaudio

model = separator.from_hparams(source="speechbrain/sepformer-whamr-enhancement", savedir='pretrained_models/sepformer-whamr-enhancement4')
enhanced_speech = model.separate_file(path='/content/example_whamr.wav')

signal = read_audio("/content/example_whamr.wav").squeeze()
Audio(signal, rate=8000)

Audio(enhanced_speech[:, :].detach().cpu().squeeze(), rate=8000)

Speaker Verification

The task here is to determine whether two utterances belong to the same speaker.

from speechbrain.inference.speaker import SpeakerRecognition
verification = SpeakerRecognition.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")
score, prediction = verification.verify_files("/content/example1.wav", "/content/example2.flac")

print(prediction, score)
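
`verify_files` returns a similarity score together with a yes/no decision. A common way to obtain such a pair is to compare the two speaker embeddings with cosine similarity and threshold the result. The sketch below shows that idea in plain Python and is not necessarily what the pretrained model does internally: the embedding values and the 0.25 threshold are assumptions for illustration, and `cosine_score` is a hypothetical helper.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embedding vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional embeddings; real speaker embeddings are much larger.
utterance_1 = [0.9, 0.1, 0.3, 0.2]
utterance_2 = [0.8, 0.2, 0.4, 0.1]    # similar direction to utterance_1
utterance_3 = [-0.7, 0.6, -0.2, 0.1]  # clearly different direction

threshold = 0.25  # assumed decision threshold, for illustration only

same_score = cosine_score(utterance_1, utterance_2)
diff_score = cosine_score(utterance_1, utterance_3)
print(same_score > threshold, diff_score > threshold)
```

A higher score means the two utterances are more likely to come from the same speaker; the threshold trades false acceptances against false rejections.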

signal = read_audio("/content/example1.wav").squeeze()
Audio(signal, rate=16000)

signal = read_audio("/content/example2.flac").squeeze()
Audio(signal, rate=16000)

Speech Synthesis (Text-to-Speech)

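The goal of speech synthesis is to generate a speech signal from input text. The following example uses the popular Tacotron2 model with HiFiGAN as the vocoder: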

import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

# Initialize TTS (tacotron2) and Vocoder (HiFIGAN)
tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="tmpdir_tts")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

# Running the TTS
mel_output, mel_length, alignment = tacotron2.encode_text("This is an open-source toolkit for the development of speech technologies.")

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

Audio(waveforms.detach().cpu().squeeze(), rate=22050)
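
To keep the synthesized audio, the waveform can be written to disk at the same 22050 Hz rate used for playback above. The sketch below does this with Python's standard `wave` module on a synthetic tone that stands in for the vocoder output. The `save_wav` helper and the output file name are hypothetical; in the notebook, the float samples would come from `waveforms` instead.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, rate):
    """Write float samples in [-1, 1] as 16-bit mono PCM (hypothetical helper)."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = [int(round(s * 32767)) for s in clipped]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(rate)
        wav_file.writeframes(struct.pack("<%dh" % len(pcm), *pcm))


# Half a second of a 220 Hz tone, standing in for the synthesized waveform.
rate = 22050
tone = [0.3 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate // 2)]

out_path = os.path.join(tempfile.mkdtemp(), "example_tts.wav")
save_wav(out_path, tone, rate)

# Read the header back to confirm what was written.
with wave.open(out_path, "rb") as wav_file:
    saved_rate = wav_file.getframerate()
    saved_frames = wav_file.getnframes()
print(saved_rate, saved_frames)
```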

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}