speechbrain.decoders.language_model module

Wrapper for kenlm n-gram language models.

This file is based on the implementation of the kenLM wrapper in PyCTCDecode (see: https://github.com/kensho-technologies/pyctcdecode) and is used by the CTC decoders.

See: speechbrain.decoders.ctc.py

Authors

Adel Moumen 2023

Summary

Classes:

`KenlmState`	Wrapper for a kenlm state.
`LanguageModel`	Language model container class to consolidate functionality.

Functions:

`load_unigram_set_from_arpa`	Read unigrams from an arpa file.

Reference

speechbrain.decoders.language_model.load_unigram_set_from_arpa(arpa_path: str) → Set[str]

Read unigrams from an arpa file.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters: arpa_path (str) – Path to the arpa file.
Returns: unigrams – Set of unigrams.
Return type: set
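
For orientation, the unigram section of an ARPA file sits between the `\1-grams:` marker and the next section marker, with tab-separated lines holding a log10 probability, a word, and an optional backoff weight. The following is a minimal sketch of collecting the words from that section. `read_unigrams_sketch` and the toy file contents are hypothetical, written here for illustration only, and should not be read as the library's implementation.

```python
import os
import tempfile
from typing import Set


def read_unigrams_sketch(arpa_path: str) -> Set[str]:
    """Collect the words listed in the 1-grams section of an ARPA file."""
    unigrams = set()
    in_unigrams = False
    with open(arpa_path, encoding="utf-8") as f:
        for raw_line in f:
            line = raw_line.strip()
            if line == "\\1-grams:":
                in_unigrams = True
                continue
            if in_unigrams and line.startswith("\\"):
                # Any later section marker (\2-grams:, \end\) ends the unigrams.
                break
            if in_unigrams and line:
                fields = line.split("\t")
                # Fields: log10 probability, word, optional backoff weight.
                if len(fields) >= 2:
                    unigrams.add(fields[1])
    return unigrams


# Tiny hand-written ARPA file used only to exercise the sketch.
TOY_ARPA = (
    "\\data\\\n"
    "ngram 1=3\n"
    "ngram 2=1\n"
    "\n"
    "\\1-grams:\n"
    "-1.0\t<s>\t-0.3\n"
    "-0.5\thello\t-0.3\n"
    "-0.7\t</s>\n"
    "\n"
    "\\2-grams:\n"
    "-0.2\t<s> hello\n"
    "\n"
    "\\end\\\n"
)

with tempfile.NamedTemporaryFile(
    "w", suffix=".arpa", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(TOY_ARPA)
    toy_path = tmp.name

toy_unigrams = read_unigrams_sketch(toy_path)
os.remove(toy_path)
```

With the real function, the returned set would typically be passed as the `unigrams` argument of `LanguageModel`.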

class speechbrain.decoders.language_model.KenlmState(state: State)

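Bases: object

Wrapper for a kenlm state.

This is a wrapper for the kenlm state object. It is used to make sure that the state is not modified outside of the language model class.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters: state (kenlm.State) – Kenlm state object.

property state: State: Get the raw state object.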
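
The read-only wrapper idea described above can be sketched as follows. This is an illustration of the pattern only: `ReadOnlyStateSketch` is a hypothetical class, and a plain dict stands in for a `kenlm.State` so the example runs without kenlm installed.

```python
class ReadOnlyStateSketch:
    """Hold a state object and expose it through a getter only."""

    def __init__(self, state):
        self._state = state

    @property
    def state(self):
        """Return the wrapped state object."""
        return self._state


# A dict is used as a stand-in for kenlm.State (assumption for illustration).
wrapped = ReadOnlyStateSketch(state={"context": ("hello",)})

try:
    wrapped.state = None  # no setter is defined, so rebinding fails
    rebind_failed = False
except AttributeError:
    rebind_failed = True
```

Note that a getter-only property prevents rebinding the attribute; it does not by itself stop callers from mutating the object it returns.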

class speechbrain.decoders.language_model.LanguageModel(kenlm_model: Model, unigrams: Collection[str] | None = None, alpha: float = 0.5, beta: float = 1.5, unk_score_offset: float = -10.0, score_boundary: bool = True)

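Bases: object

Language model container class to consolidate functionality.

This class is a wrapper around the kenlm language model. It provides functions to score tokens and to get the initial state.

Taken from: https://github.com/kensho-technologies/pyctcdecode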

Parameters:

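kenlm_model (kenlm.Model) – Kenlm model.
unigrams (list) – List of known word unigrams.
alpha (float) – Weight for the language model during shallow fusion.
beta (float) – Weight for the length score adjustment during scoring.
unk_score_offset (float) – Log score offset applied to unknown tokens.
score_boundary (bool) – Whether to have kenlm respect boundaries when scoring.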
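
To make the roles of `alpha`, `beta`, and `unk_score_offset` concrete, here is a minimal sketch of one way such weights can be combined. It assumes the scheme used in pyctcdecode, on which this module is based: unknown words first receive `unk_score_offset`, the kenlm log10 score is converted to natural log and scaled by `alpha`, and `beta` is added once per word. `combine_lm_score_sketch` is a hypothetical helper and is not part of SpeechBrain.

```python
import math

# kenlm reports log10 probabilities; multiplying by ln(10) converts to natural log.
LOG10_TO_LN = math.log(10.0)


def combine_lm_score_sketch(
    log10_score: float,
    is_unknown: bool,
    alpha: float = 0.5,
    beta: float = 1.5,
    unk_score_offset: float = -10.0,
) -> float:
    """Weight a raw LM word score with alpha, beta, and the unknown-word offset."""
    if is_unknown:
        # Penalize words that are outside the known vocabulary.
        log10_score += unk_score_offset
    return alpha * log10_score * LOG10_TO_LN + beta


known_score = combine_lm_score_sketch(-1.0, is_unknown=False)
unknown_score = combine_lm_score_sketch(-1.0, is_unknown=True)
```

Under this assumption, a larger `alpha` makes the language model more influential relative to the acoustic scores, while a larger `beta` rewards hypotheses that contain more words.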

property order: int: Get the order of the n-gram language model.

get_start_state() → KenlmState: Get the initial lm state.

score_partial_token(partial_token: str) → float: Get the partial token score.

score(prev_state, word: str, is_last_word: bool = False) → Tuple[float, KenlmState]: Score a word conditional on the start state.
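
Since `score` returns both a score and a new state, a caller would typically thread the returned state into the next call. The sketch below shows that pattern with a stand-in class so it runs without kenlm. `ToyLM`, `score_sentence`, the tuple-based state, and all score values are hypothetical; only the method names and argument order follow the signatures listed above.

```python
from typing import List, Tuple


class ToyLM:
    """Stand-in with the same method shapes as LanguageModel (not kenlm-backed)."""

    def __init__(self, word_scores, unk_score=-10.0):
        self._word_scores = word_scores
        self._unk_score = unk_score

    def get_start_state(self) -> Tuple[str, ...]:
        # The toy state is just the tuple of words seen so far.
        return ()

    def score(self, prev_state, word: str, is_last_word: bool = False):
        score = self._word_scores.get(word, self._unk_score)
        if is_last_word:
            score += -0.5  # toy end-of-sentence penalty
        return score, prev_state + (word,)


def score_sentence(lm, words: List[str]) -> float:
    """Sum word scores while threading the returned state into the next call."""
    state = lm.get_start_state()
    total = 0.0
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        word_score, state = lm.score(state, word, is_last_word=is_last)
        total += word_score
    return total


toy_lm = ToyLM({"hello": -1.0, "world": -2.0})
total_score = score_sentence(toy_lm, ["hello", "world"])
unknown_total = score_sentence(toy_lm, ["foo"])
```

In a beam search, each hypothesis would keep its own state so that extending one hypothesis does not affect the others.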