Pipeline Functions · spaCy API Documentation

merge_noun_chunks function

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Name	Description	Type
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.	Doc
RETURNS	The modified `Doc` with merged noun chunks.	Doc
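
As a minimal sketch of the merge described above: the example builds a `Doc` by hand, because noun chunks depend on part-of-speech tags and a dependency parse, and no trained pipeline is assumed to be installed. The words, tags, heads and labels are illustrative values written out manually, not model output.

```python
import spacy
from spacy.tokens import Doc
from spacy.pipeline import merge_noun_chunks

nlp = spacy.blank("en")  # blank English pipeline; supplies the vocab and noun-chunk rules

# Hand-written annotation (assumption: a trained tagger and parser would normally set this).
words = ["I", "have", "a", "blue", "car"]
pos = ["PRON", "VERB", "DET", "ADJ", "NOUN"]
heads = [1, 1, 4, 4, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# Call the function directly; it retokenizes the Doc and returns it.
doc = merge_noun_chunks(doc)
texts = [token.text for token in doc]
print(texts)  # ['I', 'have', 'a blue car']
```

In a trained pipeline the same function would more typically be added by name, with `nlp.add_pipe("merge_noun_chunks")`, after the components that produce the parse.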

merge_entities function

Merge named entities into a single token. Also available via the string name "merge_entities".

Name	Description	Type
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.	Doc
RETURNS	The modified `Doc` with merged entities.	Doc
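
The following sketch shows the string-name form in a pipeline. It assumes no trained model is available, so a rule-based `entity_ruler` with one made-up pattern stands in for a statistical named entity recognizer.

```python
import spacy

nlp = spacy.blank("en")

# Rule-based entity recognizer standing in for a trained "ner" component.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "David Bowie"}])

# Add the function by its string name, after the component that sets doc.ents.
nlp.add_pipe("merge_entities")

doc = nlp("I like David Bowie")
texts = [token.text for token in doc]
print(texts)  # ['I', 'like', 'David Bowie']
```

The order matters here: the merge step can only see entities that an earlier component has already written to `doc.ents`.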

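merge_subtokens function

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict "subtokens" that should be merged into a single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the `Matcher` to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.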

Name	Description	Type
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.	Doc
`label`	The subtoken dependency label. Defaults to `"subtok"`.	str
RETURNS	The modified `Doc` with merged subtokens.	Doc
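
To illustrate the "subtok" mechanism, the sketch below writes the dependency annotation by hand instead of relying on a parser trained to predict subtokens. The word pieces and the parse are invented for the example.

```python
import spacy
from spacy.tokens import Doc
from spacy.pipeline import merge_subtokens

nlp = spacy.blank("en")

# Hand-written parse (assumption): the first two pieces carry the "subtok" label
# and attach to the token that follows them.
words = ["un", "believ", "able", "story"]
spaces = [False, False, True, False]  # no whitespace between the three pieces
heads = [1, 2, 3, 3]  # absolute index of each token's head
deps = ["subtok", "subtok", "amod", "ROOT"]
doc = Doc(nlp.vocab, words=words, spaces=spaces, heads=heads, deps=deps)

doc = merge_subtokens(doc)  # label defaults to "subtok"
texts = [token.text for token in doc]
print(texts)  # ['unbelievable', 'story']
```

The run of "subtok" tokens is merged together with the token that immediately follows it, which is why "able" ends up in the merged token even though its own label is "amod".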

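token_splitter function (v3.0)

Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines, where long spaCy tokens lead to input text that exceeds the transformer model's maximum length.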

Setting	Description	Type
`min_length`	The minimum length for a token to be split. Defaults to `25`.	int
`split_length`	The length of the split tokens. Defaults to `5`.	int
RETURNS	The modified `Doc` with the split tokens.	Doc
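
A short sketch of the two settings in use, on a blank English pipeline. The input string is an artificial 22-character token chosen so that the split is easy to read, and `min_length` is lowered to `20` so the token qualifies.

```python
import spacy

nlp = spacy.blank("en")

# Lower min_length from its default so the 22-character token below is split.
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)

doc = nlp("aaaaabbbbbcccccdddddee")  # tokenized as one 22-character token
texts = [token.text for token in doc]
print(texts)  # ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
```

The last piece is shorter than `split_length` because the token length is not a multiple of 5.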

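doc_cleaner function (v3.2.1)

Clean up `Doc` attributes. Intended for use at the end of pipelines with `tok2vec` or `transformer` components, which store tensors and other values that can require a lot of memory and are frequently not needed after the whole pipeline has run.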

Setting	Description	Type
`attrs`	A dict of the `Doc` attributes and the values to set them to. Defaults to `{"tensor": None, "_.trf_data": None}` to clean up after `tok2vec` and `transformer` components.	dict
`silent`	If `False`, show warnings if attributes aren’t found or can’t be set. Defaults to `True`.	bool
RETURNS	The modified `Doc` with the modified attributes.	Doc
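
The sketch below shows the cleaner resetting `doc.tensor`. Since no `tok2vec` or `transformer` component is assumed to be available, a small custom component named `set_dummy_tensor` (a hypothetical name, defined only for this example) fills the tensor with zeros first.

```python
import numpy
import spacy
from spacy.language import Language


@Language.component("set_dummy_tensor")  # hypothetical component, for illustration only
def set_dummy_tensor(doc):
    # Stand-in for tok2vec/transformer output: one 4-dimensional row per token.
    doc.tensor = numpy.zeros((len(doc), 4), dtype="float32")
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("set_dummy_tensor")
# Place the cleaner last so earlier components can still read the tensor.
nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})

doc = nlp("This is a sentence.")
print(doc.tensor)  # None
```

Only `tensor` is listed in `attrs` here because the example pipeline never sets `_.trf_data`.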

span_cleaner function (experimental)

Remove `SpanGroup`s from `doc.spans` based on a key prefix. This is used to clean up after the `CoreferenceResolver` when it is paired with a `SpanResolver`.

Setting	Description	Type
`prefix`	A prefix to check `SpanGroup` keys for. Any matching groups will be removed. Defaults to `"coref_head_clusters"`.	str
RETURNS	The modified `Doc` with any matching spans removed.	Doc
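
Because this component is marked experimental and may not be available in every installation, the sketch below does not call it. Instead it reproduces the described behaviour with a hand-written helper, `remove_span_groups` (a hypothetical name), so the effect of the prefix check is visible. The span group keys and contents are made up.

```python
import spacy


def remove_span_groups(doc, prefix="coref_head_clusters"):
    """Hypothetical helper: delete every doc.spans entry whose key starts with prefix."""
    for key in list(doc.spans.keys()):  # copy the keys, since entries are deleted in the loop
        if key.startswith(prefix):
            del doc.spans[key]
    return doc


nlp = spacy.blank("en")
doc = nlp("Sarah said she was tired")
doc.spans["coref_head_clusters_1"] = [doc[0:1], doc[2:3]]  # made-up cluster
doc.spans["keep_me"] = [doc[4:5]]  # unrelated group that should survive

doc = remove_span_groups(doc)
remaining = list(doc.spans.keys())
print(remaining)  # ['keep_me']
```

Matching on a prefix rather than an exact key removes a whole family of numbered groups in one pass while leaving unrelated span groups alone.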