speechbrain.inference.TTS module

Specifies the inference interfaces for Text-To-Speech (TTS) modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

FastSpeech2

A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).

FastSpeech2InternalAlignment

A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).

MSTacotron2

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2.

Tacotron2

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Reference

class speechbrain.inference.TTS.Tacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
HPARAMS_NEEDED = ['model', 'text_to_sequence']
text_to_seq(txt)[source]

Encodes raw text into a tensor with a custom text-to-sequence function
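
For illustration, a hedged sketch of the call (assumes tacotron2 is loaded as in the class example above; the exact return format depends on the text_to_sequence function declared in the loaded hparams):

>>> # Hedged sketch: the underlying text_to_sequence function and its
>>> # cleaners come from the loaded hparams, so the encoding is model-specific.
>>> sequence, length = tacotron2.text_to_seq("Mary had a little lamb")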

encode_batch(texts)[source]

Computes mel-spectrogram for a list of texts

Texts must be sorted in decreasing order of length

Parameters:

texts (List[str]) – texts to be encoded into spectrogram

Return type:

tensors of output spectrograms, output lengths and alignments
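
Because encode_batch expects its inputs sorted by decreasing length, the preprocessing could look like this sketch (tacotron2 loaded as in the class example above):

>>> # Sort the batch by decreasing text length before encoding.
>>> items = [
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even",
...   "A quick brown fox jumped over the lazy dog",
... ]
>>> items = sorted(items, key=len, reverse=True)
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)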

encode_text(text)[source]

Runs inference for a single text string

forward(texts)[source]

Encodes the input texts.

class speechbrain.inference.TTS.MSTacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2. For voice cloning: (text, reference_audio) -> (mel_spec). For generating a random speaker voice: (text) -> (mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> mstacotron2 = MSTacotron2.from_hparams(source="speechbrain/tts-mstacotron2-libritts", savedir=tmpdir_tts) 
>>> # Sample rate of the reference audio must be greater or equal to the sample rate of the speaker embedding model
>>> reference_audio_path = "tests/samples/single-mic/example1.wav"
>>> input_text = "Mary had a little lamb."
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path) 
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output) 
>>> # For generating a random speaker voice, use the following
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(input_text) 
HPARAMS_NEEDED = ['model']
clone_voice(texts, audio_path)[source]

Generates mel-spectrogram using input text and reference audio

Parameters:
  • texts (str or list) – Input text

  • audio_path (str) – Reference audio

Return type:

tensors of output spectrograms, output lengths and alignments
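
Since texts may be a str or a list, cloning several utterances from one reference speaker could look like this sketch (mstacotron2 loaded as in the class example above; the sample path is the one used there):

>>> # Hedged sketch: clone one reference voice for several texts at once.
>>> texts = ["Mary had a little lamb.", "Its fleece was white as snow."]
>>> mel_outputs, mel_lengths, alignments = mstacotron2.clone_voice(
...     texts, "tests/samples/single-mic/example1.wav"
... )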

generate_random_voice(texts)[source]

Generates mel-spectrogram using input text and a random speaker voice

Parameters:

texts (str or list) – Input text

Return type:

tensors of output spectrograms, output lengths and alignments
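
To obtain audio, the generated spectrogram can be vocoded and written to disk; a hedged sketch assuming mstacotron2 and hifi_gan are loaded as in the class example above (22050 Hz matches the tts-hifigan-libritts-22050Hz vocoder):

>>> # Hedged sketch: random speaker voice -> waveform -> WAV file.
>>> import torchaudio
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(
...     "Mary had a little lamb."
... )
>>> waveforms = hifi_gan.decode_batch(mel_output)
>>> torchaudio.save("random_voice.wav", waveforms.squeeze(1).cpu(), 22050)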

class speechbrain.inference.TTS.FastSpeech2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir=tmpdir_tts) 
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items) 
>>>
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs) 
HPARAMS_NEEDED = ['spn_predictor', 'model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments
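
The three scaling factors can be combined to control prosody; a sketch assuming fastspeech2 is loaded as in the class example above (the chosen values are illustrative):

>>> # Hedged sketch: adjust pace, pitch and energy of the synthesized speech.
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(
...     ["Mary had a little lamb."],
...     pace=1.2,         # scale the pace of the synthesis
...     pitch_rate=0.9,   # scale the phoneme pitches
...     energy_rate=1.1,  # scale the phoneme energies
... )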

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments
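
A hedged sketch of phoneme input; the labels must match the model's input_encoder, and the ARPAbet-style segmentation below is an illustrative assumption:

>>> # Hedged sketch: each inner list is one utterance's phoneme sequence.
>>> phonemes = [["M", "EH", "R", "IY"]]  # hypothetical ARPAbet-style labels
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_phoneme(phonemes)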

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Returns:

  • post_mel_outputs (torch.Tensor)

  • durations (torch.Tensor)

  • pitch (torch.Tensor)

  • energy (torch.Tensor)

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • text (str) – A text to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

Encoded text

class speechbrain.inference.TTS.FastSpeech2InternalAlignment(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2InternalAlignment.from_hparams(source="speechbrain/tts-fastspeech2-internal-alignment-ljspeech", savedir=tmpdir_tts) 
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items) 
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs) 
HPARAMS_NEEDED = ['model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments
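
The interface mirrors FastSpeech2, so the prosody factors apply to batched input as well; a sketch assuming the internal-alignment model is loaded as in the class example above:

>>> # Hedged sketch: batched synthesis with a modified pace.
>>> items = ["Never odd or even", "How much wood would a woodchuck chuck?"]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(
...     items, pace=0.9, pitch_rate=1.0, energy_rate=1.0
... )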

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Returns:

  • post_mel_outputs (torch.Tensor)

  • durations (torch.Tensor)

  • pitch (torch.Tensor)

  • energy (torch.Tensor)

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • text (str) – A text to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

Encoded text