speechbrain.dataio.encoder module

Encoding categorical data as integers

Authors

Samuele Cornell 2020
Aku Rouhe 2020

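Summary

Classes:

`CTCTextEncoder`	Subclass of TextEncoder that also provides methods to handle the CTC blank token.
`CategoricalEncoder`	Encodes labels of a discrete set.
`TextEncoder`	Subclass of CategoricalEncoder that provides specific methods for encoding text and handling special tokens for training sequence-to-sequence models.

Functions:

`load_text_encoder_tokens`	Loads the encoder tokens from a pretrained model.

Reference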

class speechbrain.dataio.encoder.CategoricalEncoder(starting_index=0, **special_labels)[source]

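Bases: object

Encodes labels of a discrete set.

Used for encoding, e.g., speaker identities in speaker recognition. Given a collection of hashables (e.g. strings), it encodes every unique item to an integer value: ["spk0", "spk1"] -> [0, 1]. Internally, the correspondence between each label and its index is handled by two dictionaries: lab2ind and ind2lab.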
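To make the two-dictionary bookkeeping concrete, here is a minimal pure-Python sketch. ToyLabelMap and its add method are hypothetical names used only for illustration; this is not the CategoricalEncoder implementation, only the lab2ind / ind2lab idea described above.

```python
# Illustrative sketch only: a toy label map, not the SpeechBrain class.
class ToyLabelMap:
    def __init__(self, starting_index=0):
        self.starting_index = starting_index
        self.lab2ind = {}  # label -> integer index
        self.ind2lab = {}  # integer index -> label

    def add(self, label):
        """Return the index of label, assigning the lowest free index if new."""
        if label in self.lab2ind:
            return self.lab2ind[label]
        index = self.starting_index
        while index in self.ind2lab:
            index += 1
        self.lab2ind[label] = index
        self.ind2lab[index] = label
        return index


toy = ToyLabelMap()
encoded = [toy.add(spk) for spk in ["spk0", "spk1", "spk0"]]
decoded = [toy.ind2lab[i] for i in encoded]
```

Keeping both directions in sync means that encoding and decoding are each a single dictionary lookup.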

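The label integer encoding can be generated automatically from a SpeechBrain DynamicItemDataset by specifying the desired entry in the annotation (e.g., spkid) and calling the update_from_didataset method: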

>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = {"ex_{}".format(x) : {"spkid" : "spk{}".format(x)} for x in range(20)}
>>> dataset = DynamicItemDataset(dataset)
>>> encoder = CategoricalEncoder()
>>> encoder.update_from_didataset(dataset, "spkid")
>>> assert len(encoder) == len(dataset) # different speaker for each utterance

It can also be updated from an iterable:

>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = ["spk{}".format(x) for x in range(20)]
>>> encoder = CategoricalEncoder()
>>> encoder.update_from_iterable(dataset)
>>> assert len(encoder) == len(dataset)

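Note

With both methods, it is possible to specify whether the individual elements in the iterable or in the dataset should be treated as sequences (default False). If an element is a sequence, each item in that sequence is encoded.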

>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = [[x+1, x+2] for x in range(20)]
>>> encoder = CategoricalEncoder()
>>> encoder.ignore_len()
>>> encoder.update_from_iterable(dataset, sequence_input=True)
>>> assert len(encoder) == 21 # there are only 21 unique elements 1-21

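This class offers four different methods to explicitly add a label to the internal dicts: add_label, ensure_label, insert_label, enforce_label. add_label and insert_label raise an error if the label is already present in the internal dicts. insert_label and enforce_label also allow specifying the integer value to which the label is encoded.

Encoding can be performed using four different methods: encode_label, encode_sequence, encode_label_torch and encode_sequence_torch. encode_label operates on single labels and simply returns the corresponding integer encoding: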

>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = ["spk{}".format(x) for x in range(20)]
>>> # encoder is reused from the previous example and already holds 21 labels (indices 0-20)
>>> encoder.update_from_iterable(dataset)
>>> encoder.encode_label("spk1")
22
>>> # encode_sequence on sequences of labels:
>>> encoder.encode_sequence(["spk1", "spk19"])
[22, 40]
>>> # encode_label_torch and encode_sequence_torch return torch tensors
>>> encoder.encode_sequence_torch(["spk1", "spk19"])
tensor([22, 40])
>>> # Decoding can be performed using decode_torch and decode_ndim methods.
>>> encoded = encoder.encode_sequence_torch(["spk1", "spk19"])
>>> encoder.decode_torch(encoded)
['spk1', 'spk19']
>>> # decode_ndim is used for multidimensional lists or PyTorch tensors;
>>> # decode_torch also accepts tensors of any shape
>>> encoded = encoded.unsqueeze(0).repeat(3, 1)
>>> encoder.decode_torch(encoded)
[['spk1', 'spk19'], ['spk1', 'spk19'], ['spk1', 'spk19']]

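In some applications, labels that were never seen during training may be encountered at test time. To handle this out-of-vocabulary problem, add_unk can be used. Every out-of-vocabulary label is then mapped to this special <unk> label and its corresponding integer encoding.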

>>> import torch
>>> try:
...     encoder.encode_label("spk42")
... except KeyError:
...     print("spk42 is not in the encoder this raises an error!")
spk42 is not in the encoder this raises an error!
>>> encoder.add_unk()
41
>>> encoder.encode_label("spk42")
41
>>> # returns the <unk> encoding

This class also provides methods to save and load the internal mappings between labels and tokens: save, load, and load_or_create.

VALUE_SEPARATOR = ' => '

EXTRAS_SEPARATOR = '================\n'

handle_special_labels(special_labels)[source]: Handles special labels such as unk_label.

classmethod from_saved(path)[source]: Recreates a previously saved encoder directly.

update_from_iterable(iterable, sequence_input=False)[source]

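Update from an iterable.

Parameters:

iterable (iterable) – Input sequence on which to operate.
sequence_input (bool) – Whether the iterable yields sequences of labels or individual labels directly. (default False)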

update_from_didataset(didataset, output_key, sequence_input=False)[source]

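Update from a DynamicItemDataset.

Parameters:

didataset (DynamicItemDataset) – Dataset on which to operate.
output_key (str) – Key in the dataset (in data or a dynamic item) to encode.
sequence_input (bool) – Whether the data yielded with the specified key consists of sequences of labels or individual labels directly.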

limited_labelset_from_iterable(iterable, sequence_input=False, n_most_common=None, min_count=1)[source]

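Produce a label mapping from an iterable based on label counts.

Used to limit the size of the label set.

Parameters:

iterable (iterable) – Input sequence on which to operate.
sequence_input (bool) – Whether the iterable yields sequences of labels or individual labels directly. Default False.
n_most_common (int, None) – Take at most this many labels as the label set, keeping the most common ones. If None (the default), take all.
min_count (int) – Do not take labels that appear fewer than this many times.

Returns:

The counts of the different labels (unfiltered).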

Return type:

collections.Counter
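The interplay of n_most_common and min_count can be sketched with the standard library alone. The function below is a hypothetical stand-in (select_labels is not part of SpeechBrain) that follows the parameter descriptions above: count every label, keep at most n_most_common of the most frequent ones, drop those seen fewer than min_count times, and return the unfiltered counts alongside.

```python
import collections
import itertools


def select_labels(iterable, sequence_input=False, n_most_common=None, min_count=1):
    """Hypothetical helper: pick a limited label set based on counts."""
    if sequence_input:
        # Each element is itself a sequence of labels: flatten one level.
        labels = itertools.chain.from_iterable(iterable)
    else:
        labels = iter(iterable)
    counts = collections.Counter(labels)
    kept = []
    # most_common() yields labels in descending order of count.
    for label, count in counts.most_common(n_most_common):
        if count < min_count:
            break  # every remaining label is at least as rare
        kept.append(label)
    return kept, counts


kept, counts = select_labels(
    [["a", "b", "a"], ["c", "a", "b"]],
    sequence_input=True,
    n_most_common=2,
    min_count=2,
)
```

Here "c" occurs only once, so it is excluded from the kept labels while still appearing in the returned counts.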

load_or_create(path, from_iterables=[], from_didatasets=[], sequence_input=False, output_key=None, special_labels={})[source]

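Convenient syntax for creating the encoder conditionally.

This pattern would be repeated in so many experiments that a convenient shortcut for it was added here. The current version is multi-GPU (DDP) safe.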

add_label(label)[source]

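Add a new label to the encoder at the next free position.

Parameters: label (hashable) – Most often labels are str, but anything that can act as a dict key is supported. Note that default save/load only supports Python literals.
Returns: The index that was used to encode this label.
Return type: int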

ensure_label(label)[source]

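Add a label if it is not already present.

Parameters: label (hashable) – Most often labels are str, but anything that can act as a dict key is supported. Note that default save/load only supports Python literals.
Returns: The index that was used to encode this label.
Return type: int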

insert_label(label, index)[source]

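Add a new label, forcing its index to a specific value.

If another label already occupies the specified index, that label is moved to the end of the mapping.

Parameters:

label (hashable) – Most often labels are str, but anything that can act as a dict key is supported. Note that default save/load only supports Python literals.
index (int) – The specific index to use.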

enforce_label(label, index)[source]

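Make sure the label is present and encoded to a particular index.

If the label is present but encoded to some other index, it is moved to the given index.

If another label is already present at the given index, that label is moved to the next free position.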
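The three rules above can be traced on two plain dictionaries. This is a sketch under stated assumptions, not the library code: enforce is a hypothetical helper, and "next free position" is assumed to mean the lowest unused index at or above starting_index.

```python
def enforce(lab2ind, ind2lab, label, index, starting_index=0):
    """Hypothetical helper mirroring the enforce_label description."""
    if label in lab2ind:
        if lab2ind[label] == index:
            return  # already encoded to the requested index
        del ind2lab[lab2ind[label]]  # free the label's old index
    # Remember whichever other label currently occupies the target index.
    has_other = index in ind2lab
    displaced = ind2lab[index] if has_other else None
    lab2ind[label] = index
    ind2lab[index] = label
    if has_other:
        # Assumption: the displaced label takes the lowest unused index.
        free = starting_index
        while free in ind2lab:
            free += 1
        lab2ind[displaced] = free
        ind2lab[free] = displaced


lab2ind = {"a": 0, "b": 1, "c": 2}
ind2lab = {0: "a", 1: "b", 2: "c"}
enforce(lab2ind, ind2lab, "c", 0)  # "c" takes index 0, "a" moves to the freed slot
```

In this run "c" vacates index 2 before "a" is relocated, so "a" ends up at index 2 and the mapping stays gap-free.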

add_unk(unk_label='<unk>')[source]

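Add a label for unknown tokens (out-of-vocabulary).

When asked to encode unknown labels, they can be mapped to this.

Parameters: unk_label (hashable, optional) – Most often labels are str, but anything that can act as a dict key is supported. Note that default save/load only supports Python literals. Default: <unk>. This can also be None.
Returns: The index that was used to encode this.
Return type: int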

is_continuous()[source]

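Check that the set of indices doesn't have gaps.

For example, if starting index = 1:
Continuous: [1,2,3,4]
Continuous: [0,1,2]
Non-continuous: [2,3,4]
Non-continuous: [1,2,4]

Returns: True if continuous.
Return type: bool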
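One check that agrees with all four examples is: the starting index must be among the indices, and consecutive sorted indices must differ by exactly one. The function below is a hypothetical sketch of that rule (indices_are_continuous is an illustrative name), not necessarily how the method is implemented.

```python
def indices_are_continuous(indices, starting_index=0):
    """Hypothetical check: starting index present and no gaps between indices."""
    ordered = sorted(indices)
    no_gaps = all(b - a == 1 for a, b in zip(ordered, ordered[1:]))
    return starting_index in ordered and no_gaps


# The four examples from the description, with starting index = 1.
results = [
    indices_are_continuous([1, 2, 3, 4], starting_index=1),
    indices_are_continuous([0, 1, 2], starting_index=1),
    indices_are_continuous([2, 3, 4], starting_index=1),
    indices_are_continuous([1, 2, 4], starting_index=1),
]
```

Note that [0, 1, 2] counts as continuous even though it begins below the starting index, because the starting index is present and there are no gaps.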

encode_label(label, allow_unk=True)[source]

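Encode a label to an int.

Parameters:

label (hashable) – Label to encode, must exist in the mapping.
allow_unk (bool) – If the given label is not in the label set and unk_label has been added with add_unk(), allows encoding to unk_label's index.

Returns:

Corresponding encoded int value.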

Return type:

int

encode_label_torch(label, allow_unk=True)[source]

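Encode a label to a torch.LongTensor.

Parameters:

label (hashable) – Label to encode, must exist in the mapping.
allow_unk (bool) – If the given label is not in the label set and unk_label has been added with add_unk(), allows encoding to unk_label's index.

Returns:

Corresponding encoded int value. Tensor shape [1].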

Return type:

torch.LongTensor

encode_sequence(sequence, allow_unk=True)[source]

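Encode a sequence of labels to a list.

Parameters:

sequence (iterable) – Labels to encode, must exist in the mapping.
allow_unk (bool) – If a given label is not in the label set and unk_label has been added with add_unk(), allows encoding to unk_label's index.

Returns:

Corresponding integer labels.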

Return type:

list

encode_sequence_torch(sequence, allow_unk=True)[source]

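Encode a sequence of labels to a torch.LongTensor.

Parameters:

sequence (iterable) – Labels to encode, must exist in the mapping.
allow_unk (bool) – If a given label is not in the label set and unk_label has been added with add_unk(), allows encoding to unk_label's index.

Returns:

Corresponding integer labels. Tensor shape [len(sequence)].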

Return type:

torch.LongTensor

decode_torch(x)[source]

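Decodes an arbitrarily nested torch.Tensor to a list of labels.

Provided separately because Torch provides clearer introspection, and so does not require try-except.

Parameters: x (torch.Tensor) – Torch tensor of some integer dtype (Long, int) and any shape to decode.
Returns: List of original labels.
Return type: list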

decode_ndim(x)[source]

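Decodes an arbitrarily nested iterable to a list of labels.

This works for essentially any Python iterable (including torch tensors), and also for single elements.

Parameters: x (Any) – Python list or other iterable, torch.Tensor, or a single integer element.
Returns: ndim list of original labels, or if the input was a single element, the output will be a single element as well.
Return type: list, Any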
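Decoding "any nesting depth, or a single element" is naturally recursive. The sketch below shows one way to get that behavior with a plain dict; decode_nested is a hypothetical name, and treating a TypeError from iteration as "this is a leaf" is an assumption of the sketch.

```python
def decode_nested(x, ind2lab):
    """Hypothetical recursive decoder for nested iterables of indices."""
    try:
        # If x is iterable, decode each element one level down.
        return [decode_nested(item, ind2lab) for item in x]
    except TypeError:
        # Not iterable: x is a single index at the bottom level.
        return ind2lab[int(x)]


ind2lab = {0: "a", 1: "b", 2: "c", 3: "d"}
flat = decode_nested(range(4), ind2lab)
nested = decode_nested([[0, 1], [2, 3]], ind2lab)
single = decode_nested(2, ind2lab)
```

The output mirrors the shape of the input: a flat iterable gives a flat list, a nested one gives nested lists, and a bare integer gives a bare label.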

save(path)[source]

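Save the categorical encoding for later use and recovery.

Saving uses a Python literal format, which supports things like tuple labels, but is considered safe to load (unlike e.g. pickle).

Parameters: path (str, Path) – Where to save. Will overwrite.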
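To illustrate why a Python literal format can carry tuple labels and still be safe to load, here is a standard-library sketch built around ast.literal_eval, which evaluates only literals and never runs arbitrary code. The separator values are the VALUE_SEPARATOR and EXTRAS_SEPARATOR class attributes listed earlier in this reference; the file layout, the helper names save_mapping and load_mapping, and the starting_index extra are assumptions made for the example.

```python
import ast
import os
import tempfile

VALUE_SEPARATOR = " => "
EXTRAS_SEPARATOR = "================\n"


def save_mapping(path, lab2ind, extras):
    """Hypothetical writer: one 'repr(label) => index' entry per line."""
    with open(path, "w", encoding="utf-8") as f:
        for label, index in lab2ind.items():
            f.write(repr(label) + VALUE_SEPARATOR + str(index) + "\n")
        f.write(EXTRAS_SEPARATOR)
        for key, value in extras.items():
            f.write(repr(key) + VALUE_SEPARATOR + repr(value) + "\n")


def load_mapping(path):
    """Hypothetical reader: parses each side with ast.literal_eval."""
    lab2ind, extras = {}, {}
    target = lab2ind
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line == EXTRAS_SEPARATOR:
                target = extras  # everything after the separator is an extra
                continue
            # Sketch limitation: labels containing the separator text would break this split.
            key, value = line.rstrip("\n").split(VALUE_SEPARATOR, maxsplit=1)
            target[ast.literal_eval(key)] = ast.literal_eval(value)
    return lab2ind, extras


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoding.txt")
    save_mapping(path, {"spk0": 0, ("a", "b"): 1}, {"starting_index": 0})
    restored, restored_extras = load_mapping(path)
```

The tuple label ("a", "b") survives the round trip because its repr is itself a valid Python literal.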

load(path)[source]

Loads from the given path.

CategoricalEncoder uses a Python literal format, which supports things like tuple labels, but is considered safe to load (unlike e.g. pickle).

Parameters: path (str, Path) – Where to load from.

load_if_possible(path, end_of_epoch=False)[source]

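Loads if possible, returns a bool indicating whether it was loaded.

Parameters:

path (str, Path) – Where to load from.
end_of_epoch (bool) – Whether the checkpoint was end-of-epoch or not.

Returns:

True if the load was successful.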

Return type:

bool

Example

>>> encoding_file = getfixture('tmpdir') / "encoding.txt"
>>> encoder = CategoricalEncoder()
>>> # The idea is in an experiment script to have something like this:
>>> if not encoder.load_if_possible(encoding_file):
...     encoder.update_from_iterable("abcd")
...     encoder.save(encoding_file)
>>> # So the first time you run the experiment, the encoding is created.
>>> # However, later, the encoding exists:
>>> encoder = CategoricalEncoder()
>>> encoder.expect_len(4)
>>> if not encoder.load_if_possible(encoding_file):
...     assert False  # We won't get here!
>>> encoder.decode_ndim(range(4))
['a', 'b', 'c', 'd']

expect_len(expected_len)[source]

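Specify the expected number of categories. If the category count observed during encoding/decoding does not match this, an error is raised.

This is useful for catching mistakes when the encoder is built dynamically and downstream code expects a specific category count (and might otherwise fail silently).

This can be called at any time; the category count check is only performed during an actual encoding/decoding task.

Parameters: expected_len (int) – The expected final category count, i.e. len(encoder).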

Example

>>> encoder = CategoricalEncoder()
>>> encoder.update_from_iterable("abcd")
>>> encoder.expect_len(3)
>>> encoder.encode_label("a")
Traceback (most recent call last):
  ...
RuntimeError: .expect_len(3) was called, but 4 categories found
>>> encoder.expect_len(4)
>>> encoder.encode_label("a")
0

ignore_len()[source]

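Specifies that the category count should be ignored at encoding/decoding time.

Effectively suppresses the ".expect_len was never called" warning. Prefer expect_len() when the category count is known.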

class speechbrain.dataio.encoder.TextEncoder(starting_index=0, **special_labels)[source]

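Bases: CategoricalEncoder

CategoricalEncoder subclass that offers specific methods for encoding text and handling special tokens for training sequence-to-sequence models. In particular, in addition to the special <unk> token already present in CategoricalEncoder for handling out-of-vocabulary tokens, special methods for handling the beginning-of-sequence <bos> and end-of-sequence <eos> tokens are defined here.

Note: update_from_iterable and update_from_didataset here default to sequence_input=True, because this encoder is assumed to be used on iterables of strings, e.g.: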

>>> from speechbrain.dataio.encoder import TextEncoder
>>> dataset = [["encode", "this", "textencoder"], ["foo", "bar"]]
>>> encoder = TextEncoder()
>>> encoder.update_from_iterable(dataset)
>>> encoder.expect_len(5)
>>> encoder.encode_label("this")
1
>>> encoder.add_unk()
5
>>> encoder.expect_len(6)
>>> encoder.encode_sequence(["this", "out-of-vocab"])
[1, 5]

Two methods can be used to add <bos> and <eos> to the internal dicts: insert_bos_eos, add_bos_eos.

>>> encoder.add_bos_eos()
>>> encoder.expect_len(8)
>>> encoder.lab2ind[encoder.eos_label]
7
>>> # add_bos_eos adds the special tokens at the end of the dict indexes
>>> encoder = TextEncoder()
>>> encoder.update_from_iterable(dataset)
>>> encoder.insert_bos_eos(bos_index=0, eos_index=1)
>>> encoder.expect_len(7)
>>> encoder.lab2ind[encoder.eos_label]
1
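>>> # insert_bos_eos allows specifying the index that corresponds to each of them.
>>> # Note that you can also specify the same integer encoding for both.

Four methods can be used to prepend <bos> and append <eos> to a sequence. prepend_bos_label and append_eos_label add the <bos> and <eos> string tokens, respectively, to the input sequence.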

>>> words = ["foo", "bar"]
>>> encoder.prepend_bos_label(words)
['<bos>', 'foo', 'bar']
>>> encoder.append_eos_label(words)
['foo', 'bar', '<eos>']

prepend_bos_index and append_eos_index add the <bos> and <eos> indices, respectively, to the input encoded sequence.

>>> words = ["foo", "bar"]
>>> encoded = encoder.encode_sequence(words)
>>> encoder.prepend_bos_index(encoded)
[0, 3, 4]
>>> encoder.append_eos_index(encoded)
[3, 4, 1]

handle_special_labels(special_labels)[source]: Handles special labels such as bos and eos.

update_from_iterable(iterable, sequence_input=True)[source]: Changes the default for sequence_input to True.

update_from_didataset(didataset, output_key, sequence_input=True)[source]: Changes the default for sequence_input to True.

limited_labelset_from_iterable(iterable, sequence_input=True, n_most_common=None, min_count=1)[source]: Changes the default for sequence_input to True.

add_bos_eos(bos_label='<bos>', eos_label='<eos>')[source]

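Add sentence boundary markers in the label set.

If the beginning-of-sentence and end-of-sentence markers are the same, only one sentence-boundary label is used.

This method adds to the end of the index, rather than at the beginning, like insert_bos_eos.

Parameters:

bos_label (hashable) – Beginning-of-sentence label, any label.
eos_label (hashable) – End-of-sentence label, any label. If set to the same label as bos_label, only one sentence-boundary label is used.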

insert_bos_eos(bos_label='<bos>', eos_label='<eos>', bos_index=0, eos_index=None)[source]

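Insert sentence boundary markers in the label set.

If the beginning-of-sentence and end-of-sentence markers are the same, only one sentence-boundary label is used.

Parameters:

bos_label (hashable) – Beginning-of-sentence label, any label.
eos_label (hashable) – End-of-sentence label, any label. If set to the same label as bos_label, only one sentence-boundary label is used.
bos_index (int) – Where to insert bos_label.
eos_index (optional, int) – Where to insert eos_label. Default: eos_index = bos_index + 1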

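get_bos_index()[source]: Returns the index to which the BOS label encodes.

get_eos_index()[source]: Returns the index to which the EOS label encodes.

prepend_bos_label(x)[source]: Returns a list version of x, with BOS prepended.

prepend_bos_index(x)[source]: Returns a list version of x, with the BOS index prepended. If the input is a tensor, a tensor is returned.

append_eos_label(x)[source]: Returns a list version of x, with EOS appended.

append_eos_index(x)[source]: Returns a list version of x, with the EOS index appended. If the input is a tensor, a tensor is returned.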

class speechbrain.dataio.encoder.CTCTextEncoder(starting_index=0, **special_labels)[source]

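Bases: TextEncoder

Subclass of TextEncoder that also provides methods to handle the CTC blank token.

add_blank and insert_blank can be used to add the <blank> special token to the encoder state.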

>>> from speechbrain.dataio.encoder import CTCTextEncoder
>>> chars = ["a", "b", "c", "d"]
>>> encoder = CTCTextEncoder()
>>> encoder.update_from_iterable(chars)
>>> encoder.add_blank()
>>> encoder.expect_len(5)
>>> encoder.encode_sequence(chars)
[0, 1, 2, 3]
>>> encoder.get_blank_index()
4
>>> encoder.decode_ndim([0, 1, 2, 3, 4])
['a', 'b', 'c', 'd', '<blank>']

collapse_labels and collapse_indices_ndim can be used to apply CTC collapsing rules:

>>> encoder.collapse_labels(["a", "a", "b", "c", "d"])
['a', 'b', 'c', 'd']
>>> encoder.collapse_indices_ndim([4, 4, 0, 1, 2, 3, 4, 4]) # 4 is <blank>
[0, 1, 2, 3]

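handle_special_labels(special_labels)[source]: Handles special labels such as blanks.

add_blank(blank_label='<blank>')[source]: Add the blank symbol to the label set.

insert_blank(blank_label='<blank>', index=0)[source]: Insert the blank symbol at a given index in the label set.

get_blank_index()[source]: Returns the index to which blank encodes.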

collapse_labels(x, merge_repeats=True)[source]

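Applies the CTC collapsing rules on one label sequence.

Parameters:

x (iterable) – Label sequence on which to operate.
merge_repeats (bool) – Whether to merge repeated labels before removing blanks. In the basic CTC label topology, repeated labels are merged. However, in RNN-T, they are not.

Returns:

List of labels with collapsing rules applied.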

Return type:

list
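The collapsing rule itself is short enough to sketch with the standard library. The function below is a hypothetical stand-alone version (ctc_collapse is not the library method): it optionally merges runs of identical labels, then removes blanks, which is the order the merge_repeats description gives.

```python
from itertools import groupby


def ctc_collapse(labels, blank="<blank>", merge_repeats=True):
    """Hypothetical stand-alone CTC collapse: merge repeats, then drop blanks."""
    if merge_repeats:
        # groupby yields one key per run of consecutive equal labels.
        labels = [label for label, _ in groupby(labels)]
    return [label for label in labels if label != blank]


frames = ["a", "a", "<blank>", "a", "b", "<blank>", "<blank>", "c"]
merged = ctc_collapse(frames)
unmerged = ctc_collapse(frames, merge_repeats=False)
```

Because merging happens before blank removal, the two "a" runs separated by a blank stay distinct: merged is ['a', 'a', 'b', 'c'], while unmerged keeps all three "a" frames.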

collapse_indices_ndim(x, merge_repeats=True)[source]

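Applies the CTC collapsing rules on an arbitrarily nested label sequence.

Parameters:

x (iterable) – Label sequence on which to operate.
merge_repeats (bool) – Whether to merge repeated labels before removing blanks. In the basic CTC label topology, repeated labels are merged. However, in RNN-T, they are not.

Returns:

List of labels with collapsing rules applied.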

Return type:

list

speechbrain.dataio.encoder.load_text_encoder_tokens(model_path)[source]

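Loads the encoder tokens from a pretrained model.

This method is useful when working with a pretrained HF model. It loads the tokens into the YAML, so that any CTCBaseSearcher can then be instantiated directly in the YAML file.

Parameters: model_path (str, Path) – Path to the pretrained model.
Returns: List of tokens.
Return type: list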