Lemmatizer

class, v3

String name: `lemmatizer`. Trainable: no

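Pipeline component for lemmatization

Component for assigning base forms to tokens using rules based on part-of-speech tags, or lookup tables. Different `Language` subclasses can implement their own lemmatizer components via language-specific factories. The default data is provided by the `spacy-lookups-data` extension package.

For a trainable lemmatizer, see `EditTreeLemmatizer`.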

Assigned Attributes

Lemmas generated by rules or predicted will be saved to `Token.lemma`.

Location	Value
`Token.lemma`	The lemma (hash). int
`Token.lemma_`	The lemma. str
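
The relationship between the two attributes can be sketched as follows. This is a minimal illustration, assuming a blank English pipeline and a lemma assigned by hand rather than by a lemmatizer component. It only shows that `Token.lemma` holds the hash of the string exposed as `Token.lemma_`.

```python
import spacy

# Blank pipeline: tokenizer only, no trained components or lookup data needed.
nlp = spacy.blank("en")
doc = nlp("dogs")
token = doc[0]

# Assign a lemma by hand (illustration only; normally a component does this).
token.lemma_ = "dog"

# Token.lemma is an integer hash; the shared StringStore maps it back to text.
lemma_hash = token.lemma
lemma_text = nlp.vocab.strings[lemma_hash]
print(lemma_hash, lemma_text)
```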

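Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the `config` argument on `nlp.add_pipe` or in your `config.cfg` for training. For examples of the lookups data format used by the lookup and rule-based lemmatizers, see `spacy-lookups-data`.

Setting	Description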
`mode`	The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `lookup` if no language-specific lemmatizer is available (see the following table). str
`overwrite`	Whether to overwrite existing lemmas. Defaults to `False`. bool
`model`	Not yet implemented: the model to use. Model
keyword-only
`scorer`	The scoring method. Defaults to `Scorer.score_token_attr` for the attribute `"lemma"`. Optional[Callable]
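
As a rough sketch of overriding these settings through `nlp.add_pipe`, the snippet below adds the component to a blank English pipeline. The choice of a blank pipeline and of the particular values is an assumption for illustration. No lookup tables are loaded at this point, so the component is configured but not yet ready to lemmatize.

```python
import spacy

nlp = spacy.blank("en")

# Override the factory defaults for this component instance.
config = {"mode": "lookup", "overwrite": True}
lemmatizer = nlp.add_pipe("lemmatizer", config=config)

print(nlp.pipe_names)
print(lemmatizer.mode, lemmatizer.overwrite)
```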

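Many languages specify a default lemmatizer mode other than `lookup` when a better lemmatizer is available. The lemmatizer modes `rule` and `pos_lookup` require `token.pos` from a previous pipeline component (see the example pipeline configurations in the pretrained pipeline design details), and other modes rely on third-party libraries (`pymorphy3`).

Language	Default Mode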
`bn`	`rule`
`ca`	`pos_lookup`
`el`	`rule`
`en`	`rule`
`es`	`rule`
`fa`	`rule`
`fr`	`rule`
`it`	`pos_lookup`
`mk`	`rule`
`nb`	`rule`
`nl`	`rule`
`pl`	`pos_lookup`
`ru`	`pymorphy3`
`sv`	`rule`
`uk`	`pymorphy3`
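
One way to check which default applies to a given language is to add the component without a `config` and read its `mode`. The helper `default_mode` below is a hypothetical name introduced for this sketch. German (`de`) is used as an example of a language that does not appear in the table and is therefore assumed to fall back to `lookup`.

```python
import spacy


def default_mode(lang_code: str) -> str:
    """Return the lemmatizer mode chosen by the factory defaults (hypothetical helper)."""
    nlp = spacy.blank(lang_code)
    # No config passed, so the language-specific default is used.
    return nlp.add_pipe("lemmatizer").mode


english_mode = default_mode("en")  # listed in the table above
german_mode = default_mode("de")   # not listed, assumed to use the fallback
print(english_mode, german_mode)
```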

explosion/spaCy/master/spacy/pipeline/lemmatizer.py

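Lemmatizer.__init__ method

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and `nlp.add_pipe`.

Name	Description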
`vocab`	The shared vocabulary. Vocab
`model`	Not yet implemented: The model to use. Model
`name`	String name of the component instance. Used to add entries to the `losses` during training. str
keyword-only
`mode`	The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. str
`overwrite`	Whether to overwrite existing lemmas. bool
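
For completeness, direct construction might look like the sketch below. It passes `None` for the not-yet-implemented `model` argument and does not register the component with the pipeline, which is why `nlp.add_pipe` is usually the more convenient route. Note that constructing the class directly bypasses any language-specific factory.

```python
import spacy
from spacy.pipeline import Lemmatizer

nlp = spacy.blank("en")

# Direct construction: vocab and model are positional, mode/overwrite are keyword-only.
lemmatizer = Lemmatizer(
    nlp.vocab,
    model=None,          # not yet implemented, so nothing to pass here
    name="lemmatizer",
    mode="lookup",
    overwrite=False,
)

# The instance exists but is not part of the pipeline.
print(lemmatizer.mode, nlp.pipe_names)
```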

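Lemmatizer.__call__ method

Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order.

Name	Description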
`doc`	The document to process. Doc
RETURNS	The processed document. Doc
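
A minimal sketch of calling the component directly is shown below. To stay self-contained it assumes a tiny hand-built `lemma_lookup` table instead of the data from `spacy-lookups-data`. The entries in that table are made up for the example.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Toy lookup table (assumption for this example, not real lookup data).
lookups = Lookups()
lookups.add_table("lemma_lookup", {"dogs": "dog", "ran": "run"})
lemmatizer.initialize(lookups=lookups)

# Tokenize only, then apply the component by hand.
doc = nlp.make_doc("dogs ran")
processed = lemmatizer(doc)

lemmas = [token.lemma_ for token in processed]
print(lemmas)
```

Because the document is modified in place, `processed` and `doc` refer to the same object.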

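Lemmatizer.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order.

Name	Description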
`stream`	A stream of documents. Iterable[Doc]
keyword-only
`batch_size`	The number of documents to buffer. Defaults to `128`. int
YIELDS	The processed documents in order. Doc
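
Streaming usage can be sketched in the same way, again assuming a small hand-built lookup table in place of real lookup data. Words missing from the table keep their original text as the lemma.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Toy lookup table (assumption for this example).
lookups = Lookups()
lookups.add_table("lemma_lookup", {"dogs": "dog", "ran": "run", "cats": "cat"})
lemmatizer.initialize(lookups=lookups)

texts = ["dogs ran", "cats sat"]
docs = (nlp.make_doc(text) for text in texts)

# pipe() yields the processed documents in their original order.
results = [
    [token.lemma_ for token in doc]
    for doc in lemmatizer.pipe(docs, batch_size=2)
]
print(results)
```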

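Lemmatizer.initialize method

Initialize the lemmatizer and load any data resources. This method is typically called by `Language.initialize` and lets you customize the arguments it receives via the `[initialize.components]` block in the config. The loading only happens during initialization, typically before training. At runtime, all data is loaded from disk.

Name	Description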
`get_examples`	Function that returns gold-standard annotations in the form of `Example` objects. Defaults to `None`. Optional[Callable[[], Iterable[Example]]]
keyword-only
`nlp`	The current `nlp` object. Defaults to `None`. Optional[Language]
`lookups`	The lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. If `None`, default tables are loaded from `spacy-lookups-data`. Defaults to `None`. Optional[Lookups]
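
Passing a custom `Lookups` object avoids the dependency on `spacy-lookups-data`. The sketch below assumes lookup mode and a single made-up table entry. After initialization, the tables are available on the component's `lookups` attribute.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Build the table that lookup mode requires (toy data, assumption).
lookups = Lookups()
lookups.add_table("lemma_lookup", {"mice": "mouse"})

# With lookups=None the default tables would be loaded instead.
lemmatizer.initialize(lookups=lookups)

loaded_tables = lemmatizer.lookups.tables
print(loaded_tables)
```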

Lemmatizer.lookup_lemmatize method

Lemmatize a token using a lookup-based approach. If no lemma is found, the original string is returned.

Name	Description
`token`	The token to lemmatize. Token
RETURNS	A list containing one or more lemmas. List[str]
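
The fallback behaviour described above can be checked with a short sketch. It assumes a toy `lemma_lookup` table with a single entry, so the second token is deliberately missing from the table.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

lookups = Lookups()
lookups.add_table("lemma_lookup", {"dogs": "dog"})  # toy data, assumption
lemmatizer.initialize(lookups=lookups)

doc = nlp.make_doc("dogs zebras")

known = lemmatizer.lookup_lemmatize(doc[0])    # entry exists in the table
unknown = lemmatizer.lookup_lemmatize(doc[1])  # no entry, original text is kept
print(known, unknown)
```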

Lemmatizer.rule_lemmatize method

Lemmatize a token using a rule-based approach. Typically relies on part-of-speech tags.

Name	Description
`token`	The token to lemmatize. Token
RETURNS	A list containing one or more lemmas. List[str]
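
A rough sketch of rule mode follows. It assumes a single made-up suffix rule for nouns in a hand-built `lemma_rules` table, and it sets the part-of-speech tag by hand because a blank pipeline has no tagger. Real rule tables are considerably larger and are usually combined with index and exception tables.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"})

# One toy rule: strip a trailing "s" from nouns (assumption for this example).
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer.initialize(lookups=lookups)

doc = nlp.make_doc("dogs")
# Normally a tagger or morphologizer earlier in the pipeline sets this.
doc[0].pos_ = "NOUN"

rule_lemmas = lemmatizer.rule_lemmatize(doc[0])
print(rule_lemmas)
```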

Lemmatizer.is_base_form method

Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely.

Name	Description
`token`	The token to analyze. Token
RETURNS	Whether the token's attributes (e.g., part-of-speech tag, morphological features) describe a base form. bool
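
The check depends on the language-specific implementation. The sketch below assumes the English lemmatizer, which is understood to treat a singular noun as a base form, and it sets the part-of-speech tag and morphological features by hand. Other languages may return different results for the same annotations.

```python
import spacy

nlp = spacy.blank("en")
# No tables are needed for this check, so the component is not initialized.
lemmatizer = nlp.add_pipe("lemmatizer")

doc = nlp.make_doc("dog dogs")

# Hand-set annotations that a tagger and morphologizer would normally provide.
doc[0].pos_ = "NOUN"
doc[0].set_morph("Number=Sing")
doc[1].pos_ = "NOUN"
doc[1].set_morph("Number=Plur")

singular_is_base = lemmatizer.is_base_form(doc[0])
plural_is_base = lemmatizer.is_base_form(doc[1])
print(singular_is_base, plural_is_base)
```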

Lemmatizer.get_lookups_config classmethod

Returns the lookups configuration settings for a given mode for use in `Lemmatizer.load_lookups`.

Name	Description
`mode`	The lemmatizer mode. str
RETURNS	The required table names and the optional table names. Tuple[List[str], List[str]]
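
Since this is a classmethod, it can be called without creating a component. The short sketch below queries the two modes named earlier in this page and unpacks the returned tuple.

```python
from spacy.pipeline import Lemmatizer

# Each call returns (required table names, optional table names).
rule_required, rule_optional = Lemmatizer.get_lookups_config("rule")
lookup_required, lookup_optional = Lemmatizer.get_lookups_config("lookup")

print("rule:", rule_required, rule_optional)
print("lookup:", lookup_required, lookup_optional)
```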

Lemmatizer.to_disk method

Serialize the pipe to disk.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]

Lemmatizer.from_disk method

Load the pipe from disk. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The modified `Lemmatizer` object. Lemmatizer
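
A round trip through `to_disk` and `from_disk` might look like the sketch below. It assumes a temporary directory and a toy lookup table, and it loads the saved data into a second, freshly created component to show that the tables are restored.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
lookups = Lookups()
lookups.add_table("lemma_lookup", {"dogs": "dog"})  # toy data, assumption
lemmatizer.initialize(lookups=lookups)

# Second pipeline whose lemmatizer starts without any tables.
nlp2 = spacy.blank("en")
restored = nlp2.add_pipe("lemmatizer", config={"mode": "lookup"})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "lemmatizer"
    lemmatizer.to_disk(path)   # the directory is created if it is missing
    restored.from_disk(path)   # modifies `restored` in place

restored_lemmas = [token.lemma_ for token in nlp2("dogs")]
print(restored_lemmas)
```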

Lemmatizer.to_bytes method

Serialize the pipe to a bytestring.

Name	Description
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The serialized form of the `Lemmatizer` object. bytes

Lemmatizer.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name	Description
`bytes_data`	The data to load from. bytes
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `Lemmatizer` object. Lemmatizer
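
The in-memory equivalent can be sketched in the same way, assuming a toy lookup table. The bytestring produced by `to_bytes` is loaded into a second component with `from_bytes`, which returns the object it modified.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
lookups = Lookups()
lookups.add_table("lemma_lookup", {"ran": "run"})  # toy data, assumption
lemmatizer.initialize(lookups=lookups)

lemmatizer_bytes = lemmatizer.to_bytes()

# Load into a fresh component; from_bytes returns the same object.
nlp2 = spacy.blank("en")
fresh = nlp2.add_pipe("lemmatizer", config={"mode": "lookup"})
restored = fresh.from_bytes(lemmatizer_bytes)

restored_lemmas = [token.lemma_ for token in nlp2("ran")]
print(type(lemmatizer_bytes).__name__, restored_lemmas)
```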

Attributes

Name	Description
`vocab`	The shared `Vocab`. Vocab
`lookups`	The lookups object. Lookups
`mode`	The lemmatizer mode. str

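Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the `exclude` argument.

Name	Description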
`vocab`	The shared `Vocab`.
`lookups`	The lookups. You usually don’t want to exclude this.
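
Excluding a field might look like the sketch below, which leaves out `vocab` and keeps `lookups`. It assumes a toy lookup table and compares the size of the two bytestrings only to show that a field was dropped.

```python
import spacy
from spacy.lookups import Lookups

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
lookups = Lookups()
lookups.add_table("lemma_lookup", {"dogs": "dog"})  # toy data, assumption
lemmatizer.initialize(lookups=lookups)

full_bytes = lemmatizer.to_bytes()
# Skip the shared vocab, keep the lookups tables.
without_vocab = lemmatizer.to_bytes(exclude=["vocab"])

nlp2 = spacy.blank("en")
restored = nlp2.add_pipe("lemmatizer", config={"mode": "lookup"})
restored.from_bytes(without_vocab, exclude=["vocab"])

print(len(full_bytes), len(without_vocab))
print(restored.lookups.tables)
```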
