speechbrain.lm.ngram module

N-gram language model query interface

Authors

Aku Rouhe 2020

Summary

Classes:

BackoffNgramLM

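Query interface for backoff N-gram language models

Functions:

`ngram_evaluation_details`	Evaluates the N-gram LM on each sentence in data
`ngram_perplexity`	Computes perplexity from a list of individual sentence evaluations.

Reference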

class speechbrain.lm.ngram.BackoffNgramLM(ngrams, backoffs)

Bases: object

Query interface for backoff N-gram language models

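The ngrams format is best explained by an example query: P( world | <s>, hello ), i.e. under a trigram model, the (log-)probability of "world" given "<s> hello" is: ngrams[3][("<s>", "hello")]["world"]

At the top level, ngrams is a dict keyed by N-gram order. Each order is itself a dict, with contexts (tuples) as keys and (log-)distributions (dicts) as values.

The backoffs format is a little simpler. At the top level, backoffs is a dict keyed by context order, and each order is a mapping (dict) from backoff context to backoff (log-)weight.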

Parameters:

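ngrams (dict) – The N-gram log probabilities. This is a triply nested dict. The first layer is indexed by N-gram order (integer). The second layer is indexed by the context (tuple of tokens). The third layer is indexed by token and maps to the log probability. Example: log(P(fox|a quick red)) = -5.3 is accessed by: ngrams[4][('a', 'quick', 'red')]['fox']
backoffs (dict) – The backoff log weights. This is a doubly nested dict. The first layer is indexed by N-gram order (integer). The second layer is indexed by the backoff history (tuple of tokens), i.e. the context on which the probability distribution is conditioned. This maps to the log weight. Example: if log(P(fox|a quick red)) is not listed, we find log(backoff(a quick red)) = -23.4, which is accessed by: backoffs[3][('a', 'quick', 'red')] This dict needs to have entries for orders up to at least N-1 (even if they are empty). It may also have entries for order N, though those can never be accessed.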
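
To make the two nested layouts concrete, the following small sketch builds them as plain Python dicts and reads back the two values quoted in the parameter descriptions. The numbers are only the illustrative ones from above, and the variable names `fox_logprob` and `red_backoff` are chosen for this example.

```python
# Illustrative values only, taken from the parameter descriptions above.
# ngrams: N-gram order -> context tuple -> token -> log probability
ngrams = {
    1: {},
    2: {},
    3: {},
    4: {("a", "quick", "red"): {"fox": -5.3}},
}
# backoffs: context order -> context tuple -> backoff log weight
backoffs = {
    1: {},
    2: {},
    3: {("a", "quick", "red"): -23.4},
}

fox_logprob = ngrams[4][("a", "quick", "red")]["fox"]
red_backoff = backoffs[3][("a", "quick", "red")]
print(fox_logprob, red_backoff)
```

Note that a context of length k sits under order k + 1 in ngrams, but under order k in backoffs.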

Example

>>> import math
>>> from speechbrain.lm.ngram import BackoffNgramLM
>>> ngrams = {1: {tuple(): {'a': -0.6931, 'b': -0.6931}},
...           2: {('a',): {'a': -0.6931, 'b': -0.6931},
...               ('b',): {'a': -0.6931}}}
>>> backoffs = {1: {('b',): 0.}}
>>> lm = BackoffNgramLM(ngrams, backoffs)
>>> round(math.exp(lm.logprob('a', ('b',))), 1)
0.5
>>> round(math.exp(lm.logprob('b', ('b',))), 1)
0.5

logprob(token, context=()): Computes the backoff log weights and applies them.
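
As a rough illustration of what a backoff lookup involves, here is a minimal sketch in plain Python. It is not the library's code: `backoff_logprob` is a hypothetical helper, and it assumes that a context with no stored backoff weight backs off at no cost (log weight 0) and that an unseen unigram has log probability negative infinity.

```python
import math

def backoff_logprob(ngrams, backoffs, token, context=()):
    """Hypothetical helper: log P(token | context) with backoff."""
    top_order = len(ngrams)
    # A context longer than the model can use is truncated from the left.
    if len(context) + 1 > top_order:
        context = context[len(context) + 1 - top_order:]
    order = len(context) + 1
    dist = ngrams[order].get(context)
    if dist is not None and token in dist:
        return dist[token]  # the probability is stored directly
    if order == 1:
        return -math.inf  # assumption: unseen unigrams get zero probability
    # Assumption: a missing backoff weight counts as log(1) = 0.
    weight = backoffs[order - 1].get(context, 0.0)
    # Drop the oldest context token and retry at the lower order.
    return weight + backoff_logprob(ngrams, backoffs, token, context[1:])

ngrams = {1: {(): {"a": -0.6931, "b": -0.6931}},
          2: {("a",): {"a": -0.6931, "b": -0.6931},
              ("b",): {"a": -0.6931}}}
backoffs = {1: {("b",): 0.0}}
direct = backoff_logprob(ngrams, backoffs, "a", ("b",))      # stored bigram
backed_off = backoff_logprob(ngrams, backoffs, "b", ("b",))  # falls back to the unigram
print(direct, backed_off)
```

With the data from the example above, the second query has no stored bigram, so it adds the backoff weight for ('b',) to the unigram log probability of 'b'.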

speechbrain.lm.ngram.ngram_evaluation_details(data, LM)

Evaluates the N-gram LM on each sentence in data

Call ngram_perplexity with the output of this function to compute the perplexity.

Parameters:

data (iterator) – An iterator over sentences, where each sentence should be an iterator as returned by speechbrain.lm.counting.ngrams_for_evaluation
LM (BackoffNgramLM) – The language model to evaluate

Returns:

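A list of `collections.Counter` objects, one per sentence and in the same order as data. Each has the keys "num_tokens" and "neglogprob", giving the sentence's number of tokens and its negative log probability.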

Return type:

list
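
The per-sentence bookkeeping described above can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the library's code: `evaluation_details_sketch` and `ConstantLM` are hypothetical names, and each sentence is assumed to be a sequence of (token, context) pairs.

```python
import collections

def evaluation_details_sketch(data, lm):
    """Hypothetical re-implementation, for illustration only."""
    details = []
    for sentence in data:
        counter = collections.Counter()
        for token, context in sentence:
            counter["num_tokens"] += 1
            # Subtracting the log probability accumulates its negation.
            counter["neglogprob"] -= lm.logprob(token, context)
        details.append(counter)
    return details

class ConstantLM:
    """Stand-in model that assigns log probability -1.0 to every token."""
    def logprob(self, token, context):
        return -1.0

data = [[("a", ("<s>",)), ("</s>", ("<s>", "a"))],
        [("b", ("<s>",)), ("c", ("<s>", "b")), ("</s>", ("b", "c"))]]
details = evaluation_details_sketch(data, ConstantLM())
print(details)
```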

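Note

Adding collections.Counter objects discards any entry whose sum is not positive. Thus it is important to use negative log probabilities (always >= 0).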
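
The following short demonstration uses only the standard library to show why the sign matters when Counters are summed. The key name "logprob" is made up for the contrast case.

```python
import collections

# Non-negative values survive Counter addition.
positive = collections.Counter(neglogprob=2.5) + collections.Counter(neglogprob=1.5)
# Negative values sum to a non-positive total, which Counter addition drops.
negative = collections.Counter(logprob=-2.5) + collections.Counter(logprob=-1.5)

print(positive)
print(negative)
```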

Example

>>> import collections
>>> from speechbrain.lm.ngram import ngram_evaluation_details
>>> class MockLM:
...     def __init__(self):
...         self.top_order = 3
...     def logprob(self, token, context):
...         return -1.0
>>> LM = MockLM()
>>> data = [[("S", ("<s>",)),
...          ("p", ("<s>", "S")),
...          ("e", ("S", "p")),
...          ("e", ("p", "e")),
...          ("c", ("e", "e")),
...          ("h", ("e", "c")),
...          ("</s>", ("c", "h"))],
...         [("B", ("<s>",)),
...          ("r", ("<s>", "B")),
...          ("a", ("B", "r")),
...          ("i", ("r", "a")),
...          ("n", ("a", "i")),
...          ("</s>", ("i", "n"))]]
>>> sum(ngram_evaluation_details(data, LM), collections.Counter())
Counter({'num_tokens': 13, 'neglogprob': 13.0})

speechbrain.lm.ngram.ngram_perplexity(eval_details, logbase=10.0)

Computes perplexity from a list of individual sentence evaluations.

Parameters:

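eval_details (list) – List of individual sentence evaluations, as returned by ngram_evaluation_details
logbase (float) – The logarithm base to use. It should match the base of the log probabilities in eval_details.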

Returns:

The computed perplexity.

Return type:

float

Example

>>> import collections
>>> from speechbrain.lm.ngram import ngram_perplexity
>>> eval_details = [
...     collections.Counter(neglogprob=5, num_tokens=5),
...     collections.Counter(neglogprob=15, num_tokens=15)]
>>> ngram_perplexity(eval_details)
10.0
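
The result in the example is consistent with the usual definition of perplexity: the log base raised to the average negative log probability per token. A minimal sketch under that assumption follows; `perplexity_sketch` is a hypothetical name and this is not necessarily how the library computes it.

```python
import collections

def perplexity_sketch(eval_details, logbase=10.0):
    """Hypothetical helper: logbase ** (total neglogprob / total tokens)."""
    # Summing the Counters pools the statistics over all sentences.
    total = sum(eval_details, collections.Counter())
    return logbase ** (total["neglogprob"] / total["num_tokens"])

eval_details = [collections.Counter(neglogprob=5, num_tokens=5),
                collections.Counter(neglogprob=15, num_tokens=15)]
ppl = perplexity_sketch(eval_details)
print(ppl)
```

Because the totals are pooled before dividing, longer sentences weigh more than shorter ones, which is what a per-token average requires.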