speechbrain.alignment.aligner module

Alignment code

Authors

Elena Rastorgueva 2020
Loren Lugosch 2020

Summary

Classes:

HMMAligner

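This class calculates Viterbi alignments in the forward method.

Functions:

`batch_log_matvecmul`	For each 'matrix' and 'vector' pair in the batch, does matrix-vector multiplication in the log domain, i.e. logsumexp replaces addition and addition replaces multiplication.
`batch_log_maxvecmul`	Similar to batch_log_matvecmul, but takes a maximum instead of logsumexp.
`map_inds_to_intersect`	Converts two lists containing phoneme indices from different phoneme sets to a single phoneme set, so that comparing the resulting lists index by index yields the correct accuracy.

Reference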

class speechbrain.alignment.aligner.HMMAligner(states_per_phoneme=1, output_folder='', neg_inf=-100000.0, batch_reduction='none', input_len_norm=False, target_len_norm=False, lexicon_path=None)[source]

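Bases: Module

This class calculates Viterbi alignments in the forward method.

It also records alignments and creates batches of them for use in Viterbi training.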

Parameters:

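states_per_phoneme (int) – Number of hidden states to use per phoneme.
output_folder (str) – The folder in which alignments will be stored when saved to disk. Not yet implemented.
neg_inf (float) – The float used to represent a negative infinite log probability. Using -float("Inf") tends to give numerical instability. A number more negative than -1e5 also sometimes gave errors when the genbmm library was used (currently not in use). (default: -1e5)
batch_reduction (string) – One of "none", "sum" or "mean". The kind of batch-level reduction to apply to the loss calculated in the forward method.
input_len_norm (bool) – Whether to normalize the loss in the forward method by the length of the inputs.
target_len_norm (bool) – Whether to normalize the loss in the forward method by the length of the targets.
lexicon_path (string) – The location of the lexicon.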
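
To illustrate the note on neg_inf, here is a minimal pure-Python sketch (not library code) of one way a true negative infinity can turn into NaN in log-domain arithmetic, while a finite stand-in such as -1e5 stays well behaved. The helper naive_logsumexp is hypothetical and only mimics the usual max-shift formulation of logsumexp.

```python
import math

NEG_INF = -1e5  # finite stand-in for log(0); same value as the documented default


def naive_logsumexp(values):
    """Hypothetical helper: log(sum(exp(v))) computed with the max-shift trick."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


# With true -inf, shifting by the maximum computes (-inf) - (-inf), which is NaN.
with_true_inf = naive_logsumexp([float("-inf"), float("-inf")])

# With the finite stand-in, the same computation stays finite.
with_finite = naive_logsumexp([NEG_INF, NEG_INF])

print(math.isnan(with_true_inf))
print(with_finite)
```

Production logsumexp implementations may guard against this particular case, so the sketch only shows the kind of failure the parameter is meant to avoid.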

Example

>>> log_posteriors = torch.tensor([[[ -1., -10., -10.],
...                                 [-10.,  -1., -10.],
...                                 [-10., -10.,  -1.]],
...
...                                [[ -1., -10., -10.],
...                                 [-10.,  -1., -10.],
...                                 [-10., -10., -10.]]])
>>> lens = torch.tensor([1., 0.66])
>>> phns = torch.tensor([[0, 1, 2],
...                      [0, 1, 0]])
>>> phn_lens = torch.tensor([1., 0.66])
>>> aligner = HMMAligner()
>>> forward_scores = aligner(
...        log_posteriors, lens, phns, phn_lens, 'forward'
... )
>>> forward_scores.shape
torch.Size([2])
>>> viterbi_scores, alignments = aligner(
...        log_posteriors, lens, phns, phn_lens, 'viterbi'
... )
>>> alignments
[[0, 1, 2], [0, 1]]
>>> viterbi_scores.shape
torch.Size([2])

use_lexicon(words, interword_sils=True, sample_pron=False)[source]

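Processes the transcript with the lexicon and returns the sequence of possible phonemes, the transition and pi (initial) probabilities, and the possible final states. Processing is done utterance by utterance: each utterance in the batch is handled by the helper method _use_lexicon.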

Parameters:

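words (list) – List of the words in the transcript.
interword_sils (bool) – If True, optional silences will be inserted between every word. If False, optional silences will only be placed at the beginning and end of each utterance.
sample_pron (bool) – If True, samples a single possible sequence of phonemes. If False, returns statistics for all possible sequences of phonemes.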

Returns:

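poss_phns (torch.Tensor (batch, phoneme in possible phn sequence)) – The phonemes that are thought to be in each utterance.
poss_phn_lens (torch.Tensor (batch)) – The relative length of each possible phoneme sequence in the batch.
trans_prob (torch.Tensor (batch, from, to)) – Tensor containing transition (log) probabilities.
pi_prob (torch.Tensor (batch, state)) – Tensor containing initial (log) probabilities.
final_state (list of lists of ints) – A list of lists of possible final states for each utterance.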

Example

>>> aligner = HMMAligner()
>>> aligner.lexicon = {
...                     "a": {0: "a"},
...                     "b": {0: "b", 1: "c"}
...                   }
>>> words = [["a", "b"]]
>>> aligner.lex_lab2ind = {
...                   "sil": 0,
...                   "a":  1,
...                   "b":  2,
...                   "c":  3,
...                 }
>>> poss_phns, poss_phn_lens, trans_prob, pi_prob, final_states = aligner.use_lexicon(
...     words,
...     interword_sils = True
... )
>>> poss_phns
tensor([[0, 1, 0, 2, 3, 0]])
>>> poss_phn_lens
tensor([1.])
>>> trans_prob
tensor([[[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05,
          -1.0000e+05],
         [-1.0000e+05, -1.3863e+00, -1.3863e+00, -1.3863e+00, -1.3863e+00,
          -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -1.0986e+00, -1.0986e+00, -1.0986e+00,
          -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -1.0000e+05,
          -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01,
          -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05,
           0.0000e+00]]])
>>> pi_prob
tensor([[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05,
         -1.0000e+05]])
>>> final_states
[[3, 4, 5]]
>>> # With no optional silences between words
>>> poss_phns_, _, trans_prob_, pi_prob_, final_states_ = aligner.use_lexicon(
...     words,
...     interword_sils = False
... )
>>> poss_phns_
tensor([[0, 1, 2, 3, 0]])
>>> trans_prob_
tensor([[[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05],
         [-1.0000e+05, -1.0986e+00, -1.0986e+00, -1.0986e+00, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -6.9315e-01, -1.0000e+05, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05,  0.0000e+00]]])
>>> pi_prob_
tensor([[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05]])
>>> final_states_
[[2, 3, 4]]
>>> # With sampling of a single possible pronunciation
>>> import random
>>> random.seed(0)
>>> poss_phns_, _, trans_prob_, pi_prob_, final_states_ = aligner.use_lexicon(
...     words,
...     sample_pron = True
... )
>>> poss_phns_
tensor([[0, 1, 0, 2, 0]])
>>> trans_prob_
tensor([[[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05],
         [-1.0000e+05, -1.0986e+00, -1.0986e+00, -1.0986e+00, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05,  0.0000e+00]]])
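
The values in the first trans_prob output above (-0.69315 ≈ log(1/2), -1.0986 ≈ log(1/3), -1.3863 ≈ log(1/4)) are consistent with each state spreading its probability uniformly over its allowed next states, with neg_inf marking forbidden transitions. The sketch below rebuilds that 6-state matrix under this assumption. The helper uniform_log_transitions and the hand-written mask are illustrative only and are not part of the library.

```python
import math

NEG_INF = -1e5  # same finite stand-in for log(0) as in the example output


def uniform_log_transitions(allowed):
    """Hypothetical helper: allowed[i][j] is 1 if the transition i -> j is permitted.

    Assumes every row permits at least one transition.
    """
    log_probs = []
    for row in allowed:
        n_allowed = sum(1 for flag in row if flag)
        log_p = -math.log(n_allowed)  # log(1 / n_allowed)
        log_probs.append([log_p if flag else NEG_INF for flag in row])
    return log_probs


# States: sil, a, (optional) sil, b, c, sil -- read off the example output.
allowed = [
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1],
]
trans = uniform_log_transitions(allowed)
for row in trans:
    print([round(v, 4) for v in row])
```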

forward(emission_pred, lens, phns, phn_lens, dp_algorithm, prob_matrices=None)[source]

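Prepares the relevant (log) probability tensors and performs dynamic programming: either the forward algorithm or the Viterbi algorithm. Applies the reduction specified when the object was initialized.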

Parameters:

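emission_pred (torch.Tensor (batch, time, phoneme in vocabulary)) – Posterior probabilities from the acoustic model.
lens (torch.Tensor (batch)) – The relative duration of each utterance sound file.
phns (torch.Tensor (batch, phoneme in phn sequence)) – The phonemes that are known or thought to be in each utterance.
phn_lens (torch.Tensor (batch)) – The relative length of each phoneme sequence in the batch.
dp_algorithm (string) – Either "forward" or "viterbi".
prob_matrices (dict) – (Optional) Must contain the keys 'trans_prob', 'pi_prob' and 'final_states'. Used to override the default forward and Viterbi operations, which force traversal of all of the states in the phns sequence.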
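
The lens and phn_lens arguments are relative lengths, i.e. fractions of the longest item in the batch. As a minimal sketch, such a fraction can be turned back into an absolute count by multiplying by the maximum length and rounding. The helper name is hypothetical and the library's exact conversion may differ.

```python
def relative_to_absolute(rel_lens, max_len):
    """Hypothetical helper: convert relative lengths to absolute counts."""
    # Assumption: round to the nearest integer.
    return [int(round(r * max_len)) for r in rel_lens]


# Values taken from the class example: 3 frames, relative lengths 1.0 and 0.66.
abs_lens = relative_to_absolute([1.0, 0.66], 3)
print(abs_lens)
```

Under this assumption the second utterance keeps two of the three frames, which agrees with the two-element alignment shown for it in the class example.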

Returns:

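forward_scores (torch.Tensor (batch, or scalar)) – Returned if dp_algorithm == "forward". The (log) likelihood of each utterance in the batch, with reduction applied if specified.
viterbi_scores (torch.Tensor (batch, or scalar)) – Returned if dp_algorithm == "viterbi". The (log) likelihood of the Viterbi path for each utterance, with reduction applied if specified.
alignments (list of lists of int) – Returned together with viterbi_scores if dp_algorithm == "viterbi". The Viterbi alignments for the files in the batch.

Return type:

torch.Tensor, or a tuple (torch.Tensor, list) when dp_algorithm == "viterbi"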
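
To make the difference between the two dynamic-programming modes concrete, here is a single-utterance, pure-Python sketch. It rests on assumptions that are not taken from the library: a left-to-right topology in which every state either repeats or advances with probability 0.5, a path that must start in the first state, and scores read off the last state. The matrices HMMAligner builds by default may differ, and all names below are hypothetical.

```python
import math

NEG_INF = -1e5  # finite stand-in for log(0)


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_and_viterbi(log_post, states):
    """log_post[t][p]: log posterior of phoneme p at frame t.
    states: phoneme index of each HMM state, in order."""
    n = len(states)
    log_half = math.log(0.5)  # assumed probability of staying or advancing
    # alpha: forward scores; delta: best-path scores. Both start in state 0.
    alpha = [log_post[0][states[0]]] + [NEG_INF] * (n - 1)
    delta = list(alpha)
    backptrs = []
    for frame in log_post[1:]:
        new_alpha, new_delta, ptrs = [], [], []
        for j in range(n):
            preds = [j] if j == 0 else [j - 1, j]  # advance into j, or stay in j
            emit = frame[states[j]]
            # Forward: sum over predecessors (logsumexp in the log domain).
            total = logsumexp([alpha[i] + log_half for i in preds])
            new_alpha.append(emit + total)
            # Viterbi: keep only the best predecessor and remember it.
            best = max(preds, key=lambda i: delta[i])
            new_delta.append(emit + delta[best] + log_half)
            ptrs.append(best)
        alpha, delta = new_alpha, new_delta
        backptrs.append(ptrs)
    # Trace the best path back from the last state.
    path = [n - 1]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return alpha[-1], delta[-1], path


log_post = [[-1.0, -10.0, -10.0], [-10.0, -1.0, -10.0], [-10.0, -10.0, -1.0]]
fwd_score, vit_score, path = forward_and_viterbi(log_post, [0, 1, 2])
print(fwd_score, vit_score, path)
```

With three frames and three states only one path reaches the last state, so the forward and Viterbi scores coincide here; on longer inputs the forward score sums over all valid paths and is at least as large as the Viterbi score.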

expand_phns_by_states_per_phoneme(phns, phn_lens)[source]

Expands each phoneme in the phn sequence by the number of hidden states per phoneme defined in the HMM.

Parameters:

phns (torch.Tensor (batch, phoneme in phn sequence)) – The phonemes that are known or thought to be in each utterance.
phn_lens (torch.Tensor (batch)) – The relative length of each phoneme sequence in the batch.

Returns:

expanded_phns

Return type:

torch.Tensor (batch, phoneme in expanded phn sequence)

Example

>>> phns = torch.tensor([[0., 3., 5., 0.],
...                      [0., 2., 0., 0.]])
>>> phn_lens = torch.tensor([1., 0.75])
>>> aligner = HMMAligner(states_per_phoneme = 3)
>>> expanded_phns = aligner.expand_phns_by_states_per_phoneme(
...         phns, phn_lens
... )
>>> expanded_phns
tensor([[ 0.,  1.,  2.,  9., 10., 11., 15., 16., 17.,  0.,  1.,  2.],
        [ 0.,  1.,  2.,  6.,  7.,  8.,  0.,  1.,  2.,  0.,  0.,  0.]])
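
The output above suggests a simple index mapping: with S states per phoneme, phoneme p becomes the states p*S, p*S + 1, ..., p*S + S - 1. A minimal sketch for a single unpadded sequence follows; the helper name is hypothetical, and padding and batching are left out.

```python
def expand_states(phns, states_per_phoneme):
    """Hypothetical helper: expand one unpadded phoneme sequence into state indices."""
    return [
        p * states_per_phoneme + k
        for p in phns
        for k in range(states_per_phoneme)
    ]


# First row of the example above, with 3 states per phoneme.
expanded = expand_states([0, 3, 5, 0], 3)
print(expanded)
```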

store_alignments(ids, alignments)[source]

Records Viterbi alignments in self.align_dict.

Parameters:

ids (list of str) – IDs of the files in the batch.
alignments (list of lists of int) – Viterbi alignments for the files in the batch. Without padding.

Example

>>> aligner = HMMAligner()
>>> ids = ['id1', 'id2']
>>> alignments = [[0, 2, 4], [1, 2, 3, 4]]
>>> aligner.store_alignments(ids, alignments)
>>> aligner.align_dict.keys()
dict_keys(['id1', 'id2'])
>>> aligner.align_dict['id1']
tensor([0, 2, 4], dtype=torch.int16)

get_prev_alignments(ids, emission_pred, lens, phns, phn_lens)[source]

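Fetches previously recorded Viterbi alignments if they are available. If not, fetches flat start alignments. Currently assumes that if a Viterbi alignment is not available for the first utterance in the batch, it will not be available for the remaining utterances either.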

Parameters:

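ids (list of str) – IDs of the files in the batch.
emission_pred (torch.Tensor (batch, time, phoneme in vocabulary)) – Posterior probabilities from the acoustic model. Used to infer the duration of the longest utterance in the batch.
lens (torch.Tensor (batch)) – The relative duration of each utterance sound file.
phns (torch.Tensor (batch, phoneme in phn sequence)) – The phonemes that are known or thought to be in each utterance.
phn_lens (torch.Tensor (batch)) – The relative length of each phoneme sequence in the batch.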

Returns:

Zero-padded alignments.

Return type:

torch.Tensor (batch, time)

Example

>>> ids = ['id1', 'id2']
>>> emission_pred = torch.tensor([[[ -1., -10., -10.],
...                                [-10.,  -1., -10.],
...                                [-10., -10.,  -1.]],
...
...                               [[ -1., -10., -10.],
...                                [-10.,  -1., -10.],
...                                [-10., -10., -10.]]])
>>> lens = torch.tensor([1., 0.66])
>>> phns = torch.tensor([[0, 1, 2],
...                      [0, 1, 0]])
>>> phn_lens = torch.tensor([1., 0.66])
>>> aligner = HMMAligner()
>>> alignment_batch = aligner.get_prev_alignments(
...        ids, emission_pred, lens, phns, phn_lens
... )
>>> alignment_batch
tensor([[0, 1, 2],
        [0, 1, 0]])
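
A flat start alignment assigns frames to phonemes evenly when no better alignment exists yet. The sketch below is one simple way to build it and reproduces the output above. It assumes each phoneme is repeated for an equal number of frames, leftover frames go to the last phoneme, and shorter utterances are zero-padded; the helper names are hypothetical and the library's own procedure may differ in detail.

```python
def flat_start(phns, num_frames):
    """Hypothetical helper: spread num_frames evenly over the phonemes.

    Assumes num_frames >= len(phns).
    """
    repeat = num_frames // len(phns)
    frames = [p for p in phns for _ in range(repeat)]
    # Assumption: any leftover frames are assigned to the last phoneme.
    frames += [phns[-1]] * (num_frames - len(frames))
    return frames


def pad_batch(alignments, pad_value=0):
    """Zero-pad every alignment to the length of the longest one."""
    max_len = max(len(a) for a in alignments)
    return [a + [pad_value] * (max_len - len(a)) for a in alignments]


# Utterance 1: 3 frames, phonemes [0, 1, 2]. Utterance 2: 2 frames, phonemes [0, 1].
batch = pad_batch([flat_start([0, 1, 2], 3), flat_start([0, 1], 2)])
print(batch)
```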

calc_accuracy(alignments, ends, phns, ind2labs=None)[source]

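Calculates the mean accuracy between predicted alignments and ground truth alignments. Ground truth alignments are derived from the ground truth phonemes and their end positions in the audio sample.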

Parameters:

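alignments (list of lists of ints/floats) – The predicted alignments for each utterance in the batch.
ends (list of lists of ints) – A list of lists of sample indices where each ground truth phoneme ends, according to the transcription. Note: the current implementation assumes that 'ends' marks the index where the next phoneme begins.
phns (list of lists of ints/floats) – The unpadded list of lists of ground truth phonemes in the batch.
ind2labs (tuple) – (Optional) Contains the original index-to-label dicts for the first and second sequence of phonemes.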

Returns:

mean_acc – The mean percentage of positions at which the upsampled predicted alignment matches the ground truth alignment.

Return type:

float

Example

>>> aligner = HMMAligner()
>>> alignments = [[0., 0., 0., 1.]]
>>> phns = [[0., 1.]]
>>> ends = [[2, 4]]
>>> mean_acc = aligner.calc_accuracy(alignments, ends, phns)
>>> mean_acc.item()
75.0
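
The 75.0 above can be reproduced by hand: the ground truth phonemes are expanded to one label per position using ends as exclusive end indices, then compared position by position with the prediction. The sketch below handles a single utterance and assumes the predicted alignment already has the same resolution as ends, so no upsampling is needed; the helper name is hypothetical.

```python
def alignment_accuracy(alignment, ends, phns):
    """Hypothetical helper: percentage of positions where prediction equals truth."""
    # Expand the true phonemes to one label per position; each entry of `ends`
    # is treated as the index where the next phoneme begins.
    truth = []
    start = 0
    for phn, end in zip(phns, ends):
        truth.extend([phn] * (end - start))
        start = end
    matches = sum(1 for a, t in zip(alignment, truth) if a == t)
    return 100.0 * matches / len(truth)


# Truth expands to [0, 0, 1, 1]; the prediction [0, 0, 0, 1] matches 3 of 4 positions.
acc = alignment_accuracy([0, 0, 0, 1], [2, 4], [0, 1])
print(acc)
```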

collapse_alignments(alignments)[source]

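Converts alignments to a 1 state per phoneme style.

Parameters:

alignments (list of ints) – Predicted alignments for a single utterance.

Returns:

sequence – The predicted alignments converted to a 1 state per phoneme style.

Return type:

list of ints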

Example

>>> aligner = HMMAligner(states_per_phoneme = 3)
>>> alignments = [0, 1, 2, 3, 4, 5, 3, 4, 5, 0, 1, 2]
>>> sequence = aligner.collapse_alignments(alignments)
>>> sequence
[0, 1, 1, 0]
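
One procedure that reproduces the output above is: drop consecutive repeats of the same state, keep only the first state of each phoneme (state indices that are multiples of states_per_phoneme), and divide by states_per_phoneme to recover the phoneme index. This is a sketch consistent with the example, with a hypothetical helper name, and is not claimed to be the method's actual body.

```python
def collapse(alignment, states_per_phoneme):
    """Hypothetical helper: state-level alignment -> phoneme sequence."""
    # 1. Drop consecutive repeats of the same state.
    deduped = [
        s for i, s in enumerate(alignment) if i == 0 or s != alignment[i - 1]
    ]
    # 2. Keep only the first state of each phoneme.
    firsts = [s for s in deduped if s % states_per_phoneme == 0]
    # 3. Map the state index back to the phoneme index.
    return [s // states_per_phoneme for s in firsts]


sequence = collapse([0, 1, 2, 3, 4, 5, 3, 4, 5, 0, 1, 2], 3)
print(sequence)
```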

speechbrain.alignment.aligner.map_inds_to_intersect(lists1, lists2, ind2labs)[source]

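Converts two lists containing phoneme indices from different phoneme sets to a single phoneme set, so that comparing the resulting lists index by index yields the correct accuracy.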

Parameters:

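lists1 (list of lists of ints) – Contains the indices of the first sequence of phonemes.
lists2 (list of lists of ints) – Contains the indices of the second sequence of phonemes.
ind2labs (tuple (dict, dict)) – Contains the original index-to-label dicts for the first and second sequence of phonemes.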

Returns:

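lists1_new (list of lists of ints) – Contains the indices of the first sequence of phonemes, mapped to the new phoneme set.
lists2_new (list of lists of ints) – Contains the indices of the second sequence of phonemes, mapped to the new phoneme set.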

Example

>>> lists1 = [[0, 1]]
>>> lists2 = [[0, 1]]
>>> ind2lab1 = {
...        0: "a",
...        1: "b",
...        }
>>> ind2lab2 = {
...        0: "a",
...        1: "c",
...        }
>>> ind2labs = (ind2lab1, ind2lab2)
>>> out1, out2 = map_inds_to_intersect(lists1, lists2, ind2labs)
>>> out1
[[0, 1]]
>>> out2
[[0, 2]]
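
The idea can be sketched as building one shared label-to-index table and re-indexing both inputs through it. In the version below the shared labels come first, followed by the labels unique to each set, and every group is sorted so the result is deterministic. That ordering is an assumption chosen to match the example, and the helper name is hypothetical.

```python
def map_to_shared_indices(lists1, lists2, ind2labs):
    """Hypothetical helper: re-index both inputs against one merged label set."""
    ind2lab1, ind2lab2 = ind2labs
    labs1, labs2 = set(ind2lab1.values()), set(ind2lab2.values())
    # Assumed ordering: shared labels, then labels only in set 1, then only in set 2.
    ordered = sorted(labs1 & labs2) + sorted(labs1 - labs2) + sorted(labs2 - labs1)
    lab2new = {lab: i for i, lab in enumerate(ordered)}
    new1 = [[lab2new[ind2lab1[i]] for i in seq] for seq in lists1]
    new2 = [[lab2new[ind2lab2[i]] for i in seq] for seq in lists2]
    return new1, new2


ind2labs = ({0: "a", 1: "b"}, {0: "a", 1: "c"})
out1, out2 = map_to_shared_indices([[0, 1]], [[0, 1]], ind2labs)
print(out1, out2)
```

After the mapping, index 1 in the first list ("b") and index 1 in the second list ("c") no longer compare equal, which is what makes an element-wise accuracy meaningful.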

speechbrain.alignment.aligner.batch_log_matvecmul(A, b)[source]

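For each 'matrix' and 'vector' pair in the batch, does matrix-vector multiplication in the log domain, i.e. logsumexp replaces addition and addition replaces multiplication.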

Parameters:

A (torch.Tensor (batch, dim1, dim2)) – Batch of matrices in the log domain.
b (torch.Tensor (batch, dim1)) – Batch of vectors in the log domain.

Returns:

x

Return type:

torch.Tensor (batch, dim1)

Example

>>> A = torch.tensor([[[   0., 0.],
...                    [ -1e5, 0.]]])
>>> b = torch.tensor([[0., 0.,]])
>>> x = batch_log_matvecmul(A, b)
>>> x
tensor([[0.6931, 0.0000]])
>>>
>>> # non-log domain equivalent without batching functionality
>>> A_ = torch.tensor([[1., 1.],
...                    [0., 1.]])
>>> b_ = torch.tensor([1., 1.,])
>>> x_ = torch.matmul(A_, b_)
>>> x_
tensor([2., 1.])
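
For readers who want the operation spelled out, here is an unbatched pure-Python sketch of the same log-domain product: each output entry is the logsumexp over j of A[i][j] + b[j]. The helper name is hypothetical and the code is illustrative, not the library implementation.

```python
import math


def log_matvecmul(A, b):
    """Hypothetical helper: matrix-vector product in the log domain (no batching)."""
    out = []
    for row in A:
        # Addition in the log domain plays the role of multiplication.
        terms = [a + v for a, v in zip(row, b)]
        # logsumexp plays the role of addition; shift by the max for stability.
        m = max(terms)
        out.append(m + math.log(sum(math.exp(t - m) for t in terms)))
    return out


x = log_matvecmul([[0.0, 0.0], [-1e5, 0.0]], [0.0, 0.0])
print(x)
```

Exponentiating the result gives approximately [2.0, 1.0], the same values as the non-log example above.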

speechbrain.alignment.aligner.batch_log_maxvecmul(A, b)[source]

Similar to batch_log_matvecmul, but takes a maximum instead of logsumexp. Returns both the maximum and the argmax.

Parameters:

A (torch.Tensor (batch, dim1, dim2)) – Batch of matrices in the log domain.
b (torch.Tensor (batch, dim1)) – Batch of vectors in the log domain.

Returns:

x (torch.Tensor (batch, dim1)) – The maximum values.
argmax (torch.Tensor (batch, dim1)) – The positions of the maximum values.

Example

>>> A = torch.tensor([[[   0., -1.],
...                    [ -1e5,  0.]]])
>>> b = torch.tensor([[0., 0.,]])
>>> x, argmax = batch_log_maxvecmul(A, b)
>>> x
tensor([[0., 0.]])
>>> argmax
tensor([[0, 1]])
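
The max variant is the building block of the Viterbi recursion: the maximum gives the best score and the argmax records which predecessor produced it. An unbatched pure-Python sketch, with a hypothetical helper name, could look like this:

```python
def log_maxvecmul(A, b):
    """Hypothetical helper: like a log-domain matvec, but max replaces logsumexp."""
    values, argmaxes = [], []
    for row in A:
        terms = [a + v for a, v in zip(row, b)]
        # Index of the largest term; ties resolve to the first occurrence.
        best = max(range(len(terms)), key=terms.__getitem__)
        values.append(terms[best])
        argmaxes.append(best)
    return values, argmaxes


x, argmax = log_maxvecmul([[0.0, -1.0], [-1e5, 0.0]], [0.0, 0.0])
print(x, argmax)
```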