speechbrain.lobes.models.huggingface_transformers.mimi module

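This lobe enables the integration of the HuggingFace pretrained Mimi model.

Mimi is a state-of-the-art neural audio codec developed by Kyutai. It combines semantic and acoustic information into audio tokens that run at 12.5 Hz with a bitrate of 1.1 kbps.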

Note that transformers>=4.45.1 must be installed to use this module.
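The version requirement can be checked before the model is constructed. The following is a minimal sketch using only the standard library; the helper names `parse_release`, `meets_minimum`, and `transformers_is_recent_enough` are hypothetical and are not part of SpeechBrain or transformers. It compares only the leading numeric release components and ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = "4.45.1"  # minimum version stated in this page


def parse_release(v: str) -> tuple:
    """Turn '4.45.1' or '4.46.0.dev0' into a tuple of leading integers.

    Simplification (assumption): suffixes such as 'rc1' or 'dev0' are ignored.
    """
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


def transformers_is_recent_enough() -> bool:
    """Return False when transformers is missing or older than required."""
    try:
        return meets_minimum(version("transformers"))
    except PackageNotFoundError:
        return False
```

Calling `transformers_is_recent_enough()` before instantiating the class gives an early, readable failure instead of an import error deep inside the model code.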

Repository: https://huggingface.co/kyutai/mimi
Paper: https://kyutai.org/Moshi.pdf

Authors

Pooneh Mousavi 2024

Summary

Classes:

Mimi

This lobe enables the integration of the HuggingFace pretrained Mimi model.

Reference

class speechbrain.lobes.models.huggingface_transformers.mimi.Mimi(source, save_path=None, sample_rate=24000, freeze=True, num_codebooks=8)

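Bases: HFTransformersInterface

This lobe enables the integration of the HuggingFace pretrained Mimi model. Mimi is a state-of-the-art neural audio codec developed by Kyutai. It combines semantic and acoustic information into audio tokens that run at 12.5 Hz with a bitrate of 1.1 kbps.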

Source paper: https://kyutai.org/Moshi.pdf
Transformers>=4.45.1 from HuggingFace needs to be installed: https://huggingface.co/transformers/installation.html
The code is adapted from the official HF Kyutai repository: https://huggingface.co/kyutai/mimi

Parameters:

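source (str) – A HuggingFace hub identifier or a local path
save_path (str) – The location where the pretrained model will be saved
sample_rate (int (default: 24000)) – The audio sampling rate
freeze (bool) – Whether the model will be frozen (e.g. not trainable if used as part of training another model)
num_codebooks (int (default: 8)) – Number of codebooks. It could be one of [2,3,4,5,6,7,8]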
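The choice of num_codebooks determines the bitrate of the token stream. The arithmetic can be sketched as follows, assuming a frame rate of 12.5 Hz (the rate implied by the class example, where 48000 samples at 24000 Hz yield 25 frames) and a codebook size of 2048 entries. The codebook size is an assumption that is not stated on this page, and the function name `bitrate_bps` is hypothetical.

```python
import math

FRAME_RATE_HZ = 12.5   # 25 frames for 2 s of audio in the class example
CODEBOOK_SIZE = 2048   # assumption: not stated in this page


def bitrate_bps(num_codebooks: int,
                frame_rate_hz: float = FRAME_RATE_HZ,
                codebook_size: int = CODEBOOK_SIZE) -> float:
    """Bits per second = frames per second * codebooks * bits per codebook index."""
    if num_codebooks not in range(2, 9):
        raise ValueError("num_codebooks must be one of 2..8")
    bits_per_index = math.log2(codebook_size)  # 11 bits for 2048 entries
    return frame_rate_hz * num_codebooks * bits_per_index
```

Under these assumptions, `bitrate_bps(8)` gives 1100 bits per second, which matches the 1.1 kbps figure quoted for the codec; fewer codebooks lower the bitrate proportionally.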

Example

>>> import torch
>>> from speechbrain.lobes.models.huggingface_transformers.mimi import Mimi
>>> model_hub = "kyutai/mimi"
>>> save_path = "savedir"
>>> model = Mimi(model_hub, save_path)
>>> audio = torch.randn(4, 48000)
>>> length = torch.tensor([1.0, .5, .75, 1.0])
>>> tokens, emb = model.encode(audio, length)
>>> tokens.shape
torch.Size([4, 8, 25])
>>> emb.shape
torch.Size([4, 8, 25, 256])
>>> rec = model.decode(tokens, length)
>>> rec.shape
torch.Size([4, 1, 48000])
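The token length of 25 in the example follows from the audio duration and the codec frame rate. A rough sketch of that arithmetic is given here; the function name `expected_num_frames` is hypothetical, the 12.5 Hz frame rate is inferred from the shapes above, and rounding up for durations that are not a whole number of frames is an assumption rather than documented behavior.

```python
import math


def expected_num_frames(num_samples: int,
                        sample_rate: int = 24000,
                        frame_rate_hz: float = 12.5) -> int:
    """Estimate the number of token frames for a waveform of num_samples."""
    duration_s = num_samples / sample_rate
    # Assumption: a partial trailing frame is rounded up.
    return math.ceil(duration_s * frame_rate_hz)


def samples_per_frame(sample_rate: int = 24000,
                      frame_rate_hz: float = 12.5) -> float:
    """Number of waveform samples summarized by one token frame."""
    return sample_rate / frame_rate_hz
```

For the 48000-sample input used above, the estimate is 25 frames, each covering 1920 samples at 24000 Hz.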

forward(inputs, length)

Encodes the input audio as tokens and embeddings and decodes audio from tokens

Parameters:

inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A tensor of relative lengths

Returns:

tokens (torch.Tensor) – A (Batch x num_codebooks x Length) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model's quantizers
audio (torch.Tensor) – Reconstructed audio
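The length argument holds relative lengths, as in the class example where it is `[1.0, .5, .75, 1.0]`. A minimal sketch of how such values could be derived from absolute sample counts is shown here, assuming each relative length is the item's sample count divided by the longest item in the padded batch. The helper name `relative_lengths` is hypothetical and is not part of SpeechBrain.

```python
def relative_lengths(num_samples):
    """Convert per-item sample counts into lengths relative to the longest item."""
    longest = max(num_samples)
    if longest <= 0:
        raise ValueError("the batch must contain at least one non-empty item")
    return [n / longest for n in num_samples]
```

The resulting list can be wrapped in `torch.tensor(...)` before being passed as length.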

encode(inputs, length)

Encodes the input audio as tokens and embeddings

Parameters:

inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A tensor of relative lengths

Returns:

tokens (torch.Tensor) – A (Batch x num_codebooks x Length) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model's quantizers

decode(tokens, length=None)

Decodes audio from tokens

Parameters:

tokens (torch.Tensor) – A (Batch x num_codebooks x Length) tensor of audio tokens
length (torch.Tensor) – A 1D tensor of relative lengths

Returns:

audio – Reconstructed audio

Return type:

torch.Tensor
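Because the reconstructed batch is padded to a common length, a caller may want to trim each item back to its valid portion using the same relative lengths. The following is a sketch under the assumption that the valid sample count is the relative length times the padded length, rounded to the nearest integer; whether decode itself masks the padded region is not stated on this page. The helper names are hypothetical, and plain Python lists stand in for tensors.

```python
def valid_sample_counts(rel_lengths, max_samples):
    """Map relative lengths back to absolute sample counts."""
    return [int(round(r * max_samples)) for r in rel_lengths]


def trim_batch(waveforms, rel_lengths):
    """Cut each padded waveform down to its valid prefix."""
    max_samples = max(len(w) for w in waveforms)
    counts = valid_sample_counts(rel_lengths, max_samples)
    return [w[:n] for w, n in zip(waveforms, counts)]
```

With tensors, the same slicing can be applied per item along the last dimension of the decoded audio.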