speechbrain.inference.TTS module

Specifies the inference interfaces for Text-To-Speech (TTS) modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

FastSpeech2

A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).

FastSpeech2InternalAlignment

A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).

MSTacotron2

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2.

Tacotron2

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Reference

class speechbrain.inference.TTS.Tacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
HPARAMS_NEEDED = ['model', 'text_to_sequence']
text_to_seq(txt)[source]

Encodes raw text into a tensor with a custom text-to-sequence function
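
For illustration, a hedged sketch of the call (assumes tacotron2 is loaded as in the class example above; the exact return format depends on the text_to_sequence function declared in the loaded hparams):

>>> # Hedged sketch: the underlying text_to_sequence function and its
>>> # cleaners come from the loaded hparams, so the encoding is model-specific.
>>> sequence, length = tacotron2.text_to_seq("Mary had a little lamb")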

encode_batch(texts)[source]

Computes mel-spectrogram for a list of texts

Texts must be sorted in decreasing order of length

Parameters:

texts (List[str]) – texts to be encoded into spectrogram

Return type:

tensors of output spectrograms, output lengths and alignments
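
Because encode_batch expects its inputs sorted by decreasing length, the preprocessing could look like this sketch (tacotron2 loaded as in the class example above):

>>> # Sort the batch by decreasing text length before encoding.
>>> items = [
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even",
...   "A quick brown fox jumped over the lazy dog",
... ]
>>> items = sorted(items, key=len, reverse=True)
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)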

encode_text(text)[source]

Runs inference for a single text string

forward(texts)[source]

Encodes the input texts.

class speechbrain.inference.TTS.MSTacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2. For voice cloning: (text, reference_audio) -> (mel_spec). For generating a random speaker voice: (text) -> (mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> mstacotron2 = MSTacotron2.from_hparams(source="speechbrain/tts-mstacotron2-libritts", savedir=tmpdir_tts) 
>>> # Sample rate of the reference audio must be greater or equal to the sample rate of the speaker embedding model
>>> reference_audio_path = "tests/samples/single-mic/example1.wav"
>>> input_text = "Mary had a little lamb."
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path) 
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output) 
>>> # For generating a random speaker voice, use the following
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(input_text) 
HPARAMS_NEEDED = ['model']
clone_voice(texts, audio_path)[source]

Generates mel-spectrogram using input text and reference audio

Parameters:
  • texts (str or list) – Input text

  • audio_path (str) – Reference audio

Return type:

tensors of output spectrograms, output lengths and alignments
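
Since texts may be a str or a list, cloning several utterances from one reference speaker could look like this sketch (mstacotron2 loaded as in the class example above; the sample path is the one used there):

>>> # Hedged sketch: clone one reference voice for several texts at once.
>>> texts = ["Mary had a little lamb.", "Its fleece was white as snow."]
>>> mel_outputs, mel_lengths, alignments = mstacotron2.clone_voice(
...     texts, "tests/samples/single-mic/example1.wav"
... )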

generate_random_voice(texts)[source]

Generates mel-spectrogram using input text and a random speaker voice

Parameters:

texts (str or list) – Input text

Return type:

tensors of output spectrograms, output lengths and alignments
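
To obtain audio, the generated spectrogram can be vocoded and written to disk; a hedged sketch assuming mstacotron2 and hifi_gan are loaded as in the class example above (22050 Hz matches the tts-hifigan-libritts-22050Hz vocoder):

>>> # Hedged sketch: random speaker voice -> waveform -> WAV file.
>>> import torchaudio
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(
...     "Mary had a little lamb."
... )
>>> waveforms = hifi_gan.decode_batch(mel_output)
>>> torchaudio.save("random_voice.wav", waveforms.squeeze(1).cpu(), 22050)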

class speechbrain.inference.TTS.FastSpeech2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir=tmpdir_tts) 
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items) 
>>>
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs) 
HPARAMS_NEEDED = ['spn_predictor', 'model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments
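
The three scaling factors can be combined to control prosody; a sketch assuming fastspeech2 is loaded as in the class example above (the chosen values are illustrative):

>>> # Hedged sketch: adjust pace, pitch and energy of the synthesized speech.
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(
...     ["Mary had a little lamb."],
...     pace=1.2,         # scale the pace of the synthesis
...     pitch_rate=0.9,   # scale the phoneme pitches
...     energy_rate=1.1,  # scale the phoneme energies
... )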

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments
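
A hedged sketch of phoneme input; the labels must match the model's input_encoder, and the ARPAbet-style segmentation below is an illustrative assumption:

>>> # Hedged sketch: each inner list is one utterance's phoneme sequence.
>>> phonemes = [["M", "EH", "R", "IY"]]  # hypothetical ARPAbet-style labels
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_phoneme(phonemes)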

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Returns:

  • post_mel_outputs (torch.Tensor)

  • durations (torch.Tensor)

  • pitch (torch.Tensor)

  • energy (torch.Tensor)

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • text (str) – A text to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

Encoded text

class speechbrain.inference.TTS.FastSpeech2InternalAlignment(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to the Pretrained parent class.

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2InternalAlignment.from_hparams(source="speechbrain/tts-fastspeech2-internal-alignment-ljspeech", savedir=tmpdir_tts) 
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items) 
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs) 
HPARAMS_NEEDED = ['model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments
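
The interface mirrors FastSpeech2, so the prosody factors apply to batched input as well; a sketch assuming the internal-alignment model is loaded as in the class example above:

>>> # Hedged sketch: batched synthesis with a modified pace.
>>> items = ["Never odd or even", "How much wood would a woodchuck chuck?"]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(
...     items, pace=0.9, pitch_rate=1.0, energy_rate=1.0
... )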

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, output lengths and alignments

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Returns:

  • post_mel_outputs (torch.Tensor)

  • durations (torch.Tensor)

  • pitch (torch.Tensor)

  • energy (torch.Tensor)

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • text (str) – A text to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

Encoded text