Language

class
A text-processing pipeline

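Usually you'll load this once per process as nlp and pass the instance around your application. A Language instance is created when you call spacy.load. It contains the shared vocabulary and language data, optional binary weights (e.g. provided by a trained pipeline), and the processing pipeline, whose components (such as the tagger or parser) are called on a document in order. You can also add your own processing pipeline components that take a Doc object, modify it and return it.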

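Language.__init__ method

Initialize a Language object. Note that meta is only used for the meta information in Language.meta, not to configure the nlp object or to override the config. To initialize from a config, use Language.from_config instead.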

Name | Description | Type
vocab | A Vocab object. If True, a vocab is created using the default language data settings. | Vocab
keyword-only
max_length | Maximum number of characters allowed in a single text. Defaults to 10 ** 6. | int
meta | Meta data overrides. | Dict[str, Any]
create_tokenizer | Optional function that receives the nlp object and returns a tokenizer. | Callable[[Language], Callable[[str], Doc]]
batch_size | Default batch size for pipe and evaluate. Defaults to 1000. | int
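
The constructor arguments above can be exercised with a minimal sketch like the following. It assumes spaCy v3 is installed; the pipeline name "demo_pipeline" and the chosen max_length are illustrative values, not defaults.

```python
from spacy.lang.en import English
from spacy.language import Language
from spacy.vocab import Vocab

# Language-neutral pipeline built around an explicitly created vocab.
nlp_generic = Language(Vocab())

# Language-specific subclass. max_length and meta are keyword-only;
# "demo_pipeline" is a made-up name used only for this illustration.
nlp_en = English(max_length=2_000_000, meta={"name": "demo_pipeline"})

print(nlp_en.lang, nlp_en.max_length, nlp_en.meta["name"])
```

Note that meta only feeds Language.meta here; it does not change how the pipeline is configured.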

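Language.from_config classmethod v3.0

Create a Language object from a loaded config. This sets up the tokenizer and language data and adds the components listed in the config's pipeline, built from the component definitions specified in the config. If no config is provided, the default config of the given language is used. This is also how spaCy loads a model under the hood based on its config.cfg.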

Name | Description | Type
config | The loaded config. | Union[Dict[str, Any], Config]
keyword-only
vocab | A Vocab object. If True, a vocab is created using the default language data settings. | Vocab
disable | Name(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling nlp.enable_pipe. Is merged with the config entry nlp.disabled. | Union[str, Iterable[str]]
enable (v3.4) | Name(s) of pipeline component(s) to enable. All other pipes will be disabled, but can be enabled again using nlp.enable_pipe. | Union[str, Iterable[str]]
exclude | Name(s) of pipeline component(s) to exclude. Excluded components won't be loaded. | Union[str, Iterable[str]]
meta | Meta data overrides. | Dict[str, Any]
auto_fill | Whether to automatically fill in missing values in the config, based on defaults and function argument annotations. Defaults to True. | bool
validate | Whether to validate the component config and arguments against the types expected by the factory. Defaults to True. | bool
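
As a rough sketch of the behavior described above, the snippet below builds a pipeline from an in-memory config dictionary instead of a config.cfg file. The config contents are an assumption chosen for illustration (a single rule-based sentencizer), and the language subclass English is used so that its language code matches the "lang" entry.

```python
from spacy.lang.en import English

# Minimal config: missing sections are filled in because auto_fill defaults to True.
config = {
    "nlp": {"lang": "en", "pipeline": ["sentencizer"]},
    "components": {"sentencizer": {"factory": "sentencizer"}},
}

nlp = English.from_config(config)
print(nlp.pipe_names)

# Load the same pipeline, but keep the component disabled until it is enabled.
nlp_disabled = English.from_config(config, disable=["sentencizer"])
print(nlp_disabled.pipe_names, nlp_disabled.disabled)
```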

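Language.component classmethod v3.0

Register a custom pipeline component under a given name. This allows initializing the component by name using Language.add_pipe and referring to it in config files. This classmethod and decorator is intended for simple stateless functions that take a Doc and return it. For more complex stateful components that allow settings and need access to the shared nlp object, use the Language.factory decorator. For more details and examples, see the usage documentation.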

Name | Description | Type
name | The name of the component factory. | str
keyword-only
assigns | Doc or Token attributes assigned by this component, e.g. ["token.ent_id"]. Used for pipe analysis. | Iterable[str]
requires | Doc or Token attributes required by this component, e.g. ["token.ent_id"]. Used for pipe analysis. | Iterable[str]
retokenizes | Whether the component changes tokenization. Used for pipe analysis. | bool
func | Optional function if not used as a decorator. | Optional[Callable[[Doc], Doc]]
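
A minimal sketch of a stateless function component follows. The component name "token_counter_demo" and the user_data key are hypothetical names invented for this example.

```python
import spacy
from spacy.language import Language


# Hypothetical component: stores the token count on the Doc and returns it.
@Language.component("token_counter_demo")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("token_counter_demo")  # refer to the component by its registered name
doc = nlp("This is a sentence.")
print(doc.user_data["n_tokens"])
```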

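Language.factory classmethod

Register a custom pipeline component factory under a given name. This allows initializing the component by name using Language.add_pipe and referring to it in config files. The registered factory function needs to take at least two named arguments, which spaCy fills in automatically: nlp for the current nlp object and name for the component instance name. This makes it possible to distinguish multiple instances of the same component and allows trainable components to add custom losses using the component instance name. The default_config defines the default values of the remaining factory arguments. It's merged into nlp.config. For more details and examples, see the usage documentation.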

Name | Description | Type
name | The name of the component factory. | str
keyword-only
default_config | The default config, describing the default values of the factory arguments. | Dict[str, Any]
assigns | Doc or Token attributes assigned by this component, e.g. ["token.ent_id"]. Used for pipe analysis. | Iterable[str]
requires | Doc or Token attributes required by this component, e.g. ["token.ent_id"]. Used for pipe analysis. | Iterable[str]
retokenizes | Whether the component changes tokenization. Used for pipe analysis. | bool
default_score_weights | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to 1.0 per component and will be combined and normalized for the whole pipeline. If a weight is set to None, the score will not be logged or weighted. | Dict[str, Optional[float]]
func | Optional function if not used as a decorator. | Optional[Callable[[], Callable[[Doc], Doc]]]
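
The sketch below illustrates a stateful component registered through a factory. The class LongTokenFlagger, the factory name "long_token_flagger_demo" and the min_length setting are all hypothetical; only the nlp and name arguments are supplied by spaCy.

```python
import spacy
from spacy.language import Language


class LongTokenFlagger:
    """Hypothetical stateful component that records long tokens per instance."""

    def __init__(self, nlp, name, min_length):
        self.name = name
        self.min_length = min_length

    def __call__(self, doc):
        # Store results under the instance name so several instances can coexist.
        doc.user_data[self.name] = [t.text for t in doc if len(t.text) >= self.min_length]
        return doc


@Language.factory("long_token_flagger_demo", default_config={"min_length": 5})
def create_long_token_flagger(nlp: Language, name: str, min_length: int):
    return LongTokenFlagger(nlp, name, min_length)


nlp = spacy.blank("en")
nlp.add_pipe("long_token_flagger_demo")  # uses default_config
nlp.add_pipe("long_token_flagger_demo", name="very_long", config={"min_length": 10})
doc = nlp("A tiny but remarkable example")
print(doc.user_data)
```

Using the instance name as the storage key is one possible design; it shows why the factory receives name in addition to nlp.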

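Language.__call__ method

Apply the pipeline to some text. The text can span multiple sentences and can contain arbitrary whitespace. Alignment with the original string is preserved.

Instead of text, a Doc can be passed as input, in which case tokenization is skipped but the rest of the pipeline is run.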

Name | Description | Type
text | The text to be processed, or a Doc. | Union[str, Doc]
keyword-only
disable | Names of pipeline components to disable. | List[str]
component_cfg | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to None. | Optional[Dict[str, Dict[str, Any]]]
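
A short sketch of the three call styles described above, assuming a blank English pipeline with the rule-based sentencizer (chosen here only because it needs no trained weights):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. Here is another one."

# Regular call: tokenize, then run every enabled component.
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

# Skip a component for this call only.
doc_no_sents = nlp(text, disable=["sentencizer"])

# Pass an existing Doc: tokenization is skipped, the components still run.
pre_tokenized = nlp.make_doc(text)
processed = nlp(pre_tokenized)

print(sentences)
```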

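Language.pipe method

Process texts as a stream and yield Doc objects in order. This is usually more efficient than processing texts one by one.

Instead of text, Doc objects can be passed as input. In this case tokenization is skipped but the rest of the pipeline is run.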

Name | Description | Type
texts | A sequence of strings (or Doc objects). | Iterable[Union[str, Doc]]
keyword-only
as_tuples | If set to True, inputs should be a sequence of (text, context) tuples. Output will then be a sequence of (doc, context) tuples. Defaults to False. | bool
batch_size | The number of texts to buffer. | Optional[int]
disable | Names of pipeline components to disable. | List[str]
component_cfg | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to None. | Optional[Dict[str, Dict[str, Any]]]
n_process | Number of processors to use. Defaults to 1. | int
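
The as_tuples option can be sketched as follows. The record ids are made-up context values that travel alongside each text.

```python
import spacy

nlp = spacy.blank("en")

# (text, context) pairs; the context dictionaries are illustrative.
data = [
    ("First text.", {"id": 1}),
    ("Second text.", {"id": 2}),
    ("Third text.", {"id": 3}),
]

results = []
for doc, context in nlp.pipe(data, as_tuples=True, batch_size=2):
    # Docs are yielded in input order together with their context.
    results.append((doc.text, context["id"], len(doc)))

print(results)
```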

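Language.set_error_handler method v3.0

Define a callback that will be invoked when an error is thrown during processing of one or more documents. Specifically, this function will call set_error_handler on all pipeline components that define that method. The error handler is invoked with the original component's name, the component itself, the list of documents that was being processed, and the original error.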

Name | Description | Type
error_handler | A function that performs custom error handling. | Callable[[str, Callable[[Doc], Doc], List[Doc], Exception], None]
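
A hedged sketch of a custom error handler follows. The failing component "always_fails_demo" and the record_error function are hypothetical; the example assumes that documents whose processing failed are skipped once the handler returns without re-raising.

```python
import spacy
from spacy.language import Language


# Hypothetical component that fails on every document.
@Language.component("always_fails_demo")
def always_fails(doc):
    raise ValueError("simulated failure")


errors = []


def record_error(proc_name, proc, docs, e):
    # Collect the failure instead of raising, so the stream keeps going.
    errors.append((proc_name, [d.text for d in docs], str(e)))


nlp = spacy.blank("en")
nlp.add_pipe("always_fails_demo")
nlp.set_error_handler(record_error)

processed = list(nlp.pipe(["first text", "second text"]))
print(len(processed), errors)
```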

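Language.initialize method v3.0

Initialize the pipeline for training and return an Optimizer. Under the hood, it uses the settings defined in the [initialize] config block to set up the vocabulary, load in vectors and tok2vec weights, and pass optional arguments to the initialize methods implemented by pipeline components or the tokenizer. This method is typically called automatically when you run spacy train. See the usage guide on the config lifecycle and initialization for details.

get_examples should be a function that returns an iterable of Example objects. The data examples can either be the full training data or a representative sample. They are used to initialize the models of trainable pipeline components and are passed to each component's initialize method, if available. Initialization includes validating the network, inferring missing shapes and setting up the label scheme based on the data.

If no get_examples function is provided when calling nlp.initialize, the pipeline components will be initialized with generic data. In this case, the output dimension of each component must already be defined, either in the config or by calling pipe.add_label for each possible output label (e.g. for the tagger or text classifier).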

Name | Description | Type
get_examples | Optional function that returns gold-standard annotations in the form of Example objects. | Optional[Callable[[], Iterable[Example]]]
keyword-only
sgd | An optimizer. Will be created via create_optimizer if not set. | Optional[Optimizer]
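
To make the role of get_examples concrete, here is a small sketch that initializes a text classifier from two toy examples. The texts and the labels POSITIVE and NEGATIVE are invented for illustration, and the built-in "textcat" component is assumed to infer its label scheme from the examples.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")

# Toy training data; texts and labels are made up for this example.
train_data = [
    ("I love this", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hate this", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]


def get_examples():
    # Must return an iterable of Example objects.
    return [Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data]


# Labels and output dimensions are inferred from the examples.
optimizer = nlp.initialize(get_examples)
print(textcat.labels)
```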

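Language.resume_training method experimental v3.0

Continue training a trained pipeline. Create and return an optimizer, and initialize "rehearsal" for any pipeline component that has a rehearse method. Rehearsal is used to prevent models from "forgetting" their initialized "knowledge". To perform rehearsal, collect samples of text you want the models to retain performance on, and call nlp.rehearse with a batch of Example objects.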

Name | Description | Type
keyword-only
sgd | An optimizer. Will be created via create_optimizer if not set. | Optional[Optimizer]

Language.update method

Update the models in the pipeline.

Name | Description | Type
examples | A batch of Example objects to learn from. | Iterable[Example]
keyword-only
drop | The dropout rate. | float
sgd | An optimizer. Will be created via create_optimizer if not set. | Optional[Optimizer]
losses | Dictionary to update with the loss, keyed by pipeline component. | Optional[Dict[str, float]]
component_cfg | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to None. | Optional[Dict[str, Dict[str, Any]]]
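
A minimal training-loop sketch using update is shown below. The toy data, the number of passes and the dropout rate are arbitrary assumptions; a real setup would normally go through spacy train and a config file.

```python
import random

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("textcat")

# Toy data, invented for this example.
train_data = [
    ("I love this", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hate this", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data]

optimizer = nlp.initialize(lambda: examples)

losses = {}
for _ in range(3):
    random.shuffle(examples)
    # Losses are accumulated in the dictionary, keyed by component name.
    nlp.update(examples, drop=0.2, sgd=optimizer, losses=losses)

print(losses)
```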

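Language.rehearse method experimental v3.0

Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the current model to make predictions similar to an initial model, to try to address the "catastrophic forgetting" problem. This feature is experimental.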

Name | Description | Type
examples | A batch of Example objects to learn from. | Iterable[Example]
keyword-only
drop | The dropout rate. | float
sgd | An optimizer. Will be created via create_optimizer if not set. | Optional[Optimizer]
losses | Dictionary to update with the loss, keyed by pipeline component. | Optional[Dict[str, float]]

Language.evaluate method

Evaluate a pipeline's components.

Name | Description | Type
examples | A batch of Example objects to learn from. | Iterable[Example]
keyword-only
batch_size | The batch size to use. | Optional[int]
scorer | Optional Scorer to use. If not passed in, a new one will be created. | Optional[Scorer]
component_cfg | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to None. | Optional[Dict[str, Dict[str, Any]]]
scorer_cfg | Optional dictionary of keyword arguments for the Scorer. Defaults to None. | Optional[Dict[str, Any]]
per_component (v3.6) | Whether to return the scores keyed by component name. Defaults to False. | bool
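
As an illustration, the sketch below evaluates the rule-based sentencizer against one hand-annotated example. The text and its sentence boundaries are made up, and the score key "sents_f" is an assumption about what the sentencizer's scorer reports in spaCy v3.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Gold annotation: one boolean per token marking sentence starts.
# Tokens: Hello | world | . | Bye | now | .
gold = {"sent_starts": [True, False, False, True, False, False]}
examples = [Example.from_dict(nlp.make_doc("Hello world. Bye now."), gold)]

scores = nlp.evaluate(examples)
print(scores["sents_f"])
```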

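Language.use_params contextmanager method

Replace the weights of models in the pipeline with those provided in the params dictionary. Can be used as a context manager, in which case the models go back to their original weights after the block.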

Name | Description | Type
params | A dictionary of parameters keyed by model ID. | dict

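Language.add_pipe method

Add a component to the processing pipeline. Expects a name that maps to a component factory registered using @Language.component or @Language.factory. Components should be callables that take a Doc object, modify it and return it. Only one of before, after, first or last can be set. The default behavior is last=True.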

Name | Description | Type
factory_name | Name of the registered component factory. | str
name | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. | Optional[str]
keyword-only
before | Component name or index to insert component directly before. | Optional[Union[str, int]]
after | Component name or index to insert component directly after. | Optional[Union[str, int]]
first | Insert component first / not first in the pipeline. | Optional[bool]
last | Insert component last / not last in the pipeline. | Optional[bool]
config (v3.0) | Optional config parameters to use for this component. Will be merged with the default_config specified by the component factory. | Dict[str, Any]
source (v3.0) | Optional source pipeline to copy component from. If a source is provided, the factory_name is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source pipeline match the target pipeline. | Optional[Language]
validate (v3.0) | Whether to validate the component config and arguments against the types expected by the factory. Defaults to True. | bool
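
The positioning and sourcing arguments can be sketched like this. The no-op component "noop_demo_add_pipe" and all instance names are hypothetical; the built-in "sentencizer" is used because it needs no trained weights.

```python
import spacy
from spacy.language import Language


# Hypothetical no-op component, used only to show positioning.
@Language.component("noop_demo_add_pipe")
def noop(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # default: appended last
nlp.add_pipe("noop_demo_add_pipe", first=True)
nlp.add_pipe("noop_demo_add_pipe", name="noop_before_sents", before="sentencizer")
nlp.add_pipe("sentencizer", name="exclaim_sents", config={"punct_chars": ["!"]})

# Copy a component instance from another pipeline by its name in that pipeline.
source_nlp = spacy.blank("en")
source_nlp.add_pipe("sentencizer", name="source_sents")
nlp.add_pipe("source_sents", source=source_nlp, after="noop_demo_add_pipe")

print(nlp.pipe_names)
```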

Language.create_pipe method

Create a pipeline component from a factory.

Name | Description | Type
factory_name | Name of the registered component factory. | str
name | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. | Optional[str]
keyword-only
config (v3.0) | Optional config parameters to use for this component. Will be merged with the default_config specified by the component factory. | Dict[str, Any]
validate (v3.0) | Whether to validate the component config and arguments against the types expected by the factory. Defaults to True. | bool

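Language.has_factory classmethod v3.0

Check whether a factory name is registered on the Language class or subclass. This checks both language-specific factories registered on the subclass and general-purpose factories registered on the Language base class, which are available to all subclasses.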

Name | Description | Type
name | Name of the pipeline factory to check. | str

Language.has_pipe method

Check whether a component is present in the pipeline. Equivalent to name in nlp.pipe_names.

Name | Description | Type
name | Name of the pipeline component to check. | str

Language.get_pipe method

Get a pipeline component for a given component name.

Name | Description | Type
name | Name of the pipeline component to get. | str

Language.replace_pipe method

Replace a component in the pipeline and return the new component.

Name | Description | Type
name | Name of the component to replace. | str
component | The factory name of the component to insert. | str
keyword-only
config (v3.0) | Optional config parameters to use for the new component. Will be merged with the default_config specified by the component factory. | Optional[Dict[str, Any]]
validate (v3.0) | Whether to validate the component config and arguments against the types expected by the factory. Defaults to True. | bool

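Language.rename_pipe method

Rename a component in the pipeline. This is useful for creating custom names for pre-defined and pre-loaded components. To change the default name of a component added to the pipeline, you can also use the name argument on add_pipe.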

Name | Description | Type
old_name | Name of the component to rename. | str
new_name | New name of the component. | str

Language.remove_pipe method

Remove a component from the pipeline. Returns the removed component name and component function.

Name | Description | Type
name | Name of the component to remove. | str
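
The lookup, rename, replace and remove methods described in the preceding sections can be combined in a short sketch. The instance name "sentence_splitter" is a made-up example name.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

present_before = nlp.has_pipe("sentencizer")

# Give the instance a custom name, then look it up under that name.
nlp.rename_pipe("sentencizer", "sentence_splitter")
component = nlp.get_pipe("sentence_splitter")

# Swap in a freshly created component from the same factory with other settings.
new_component = nlp.replace_pipe(
    "sentence_splitter", "sentencizer", config={"punct_chars": ["!"]}
)

# remove_pipe returns the (name, component) pair that was removed.
removed_name, removed_component = nlp.remove_pipe("sentence_splitter")
print(removed_name, nlp.pipe_names)
```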

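Language.disable_pipe method v3.0

Temporarily disable a pipeline component so it's not run as part of the pipeline. Disabled components are listed in nlp.disabled and included in nlp.components, but not in nlp.pipeline, so they're not run when you process a Doc with the nlp object. If the component is already disabled, this method does nothing.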

Name | Description | Type
name | Name of the component to disable. | str

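Language.enable_pipe method v3.0

Enable a previously disabled component (e.g. via Language.disable_pipes) so it's run as part of the pipeline, nlp.pipeline. If the component is already enabled, this method does nothing.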

Name | Description | Type
name | Name of the component to enable. | str
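
A brief sketch of how disabling and re-enabling affects the pipeline attributes, assuming a blank English pipeline with a single sentencizer:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

nlp.disable_pipe("sentencizer")
# The component is still loaded (component_names) but no longer runs (pipe_names).
while_disabled = (list(nlp.pipe_names), list(nlp.component_names), list(nlp.disabled))

nlp.enable_pipe("sentencizer")
after_enable = (list(nlp.pipe_names), list(nlp.disabled))

print(while_disabled, after_enable)
```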

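Language.select_pipes contextmanager method v3.0

Disable one or more pipeline components. If used as a context manager, the pipeline will be restored to the initial state at the end of the block. Otherwise, a DisabledPipes object is returned, which has a .restore() method you can use to undo your changes. You can specify either disable (as a list or string) or enable. In the latter case, all components not in the enable list will be disabled. Under the hood, this method calls into disable_pipe and enable_pipe.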

Name | Description | Type
keyword-only
disable | Name(s) of pipeline component(s) to disable. | Optional[Union[str, Iterable[str]]]
enable | Name(s) of pipeline component(s) that will not be disabled. | Optional[Union[str, Iterable[str]]]
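
Both usage styles, context manager and explicit restore, are sketched below. The no-op component "noop_demo_select" is hypothetical and exists only to give the pipeline a second component.

```python
import spacy
from spacy.language import Language


# Hypothetical no-op component so the pipeline has two entries.
@Language.component("noop_demo_select")
def noop(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("noop_demo_select")

# As a context manager: the pipeline is restored when the block ends.
with nlp.select_pipes(disable="noop_demo_select"):
    inside_block = list(nlp.pipe_names)
after_block = list(nlp.pipe_names)

# Without a context manager: keep the returned object and restore manually.
disabled = nlp.select_pipes(enable="sentencizer")
while_selected = list(nlp.pipe_names)
disabled.restore()
after_restore = list(nlp.pipe_names)

print(inside_block, after_block, while_selected, after_restore)
```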

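Language.get_factory_meta classmethod v3.0

Get the factory meta information for a given pipeline component name. Expects the name of the component factory. The factory meta is an instance of the FactoryMeta dataclass and contains the information about the component and its defaults provided by the @Language.component or @Language.factory decorator.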

Name | Description | Type
name | The factory name. | str

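Language.get_pipe_meta method v3.0

Get the factory meta information for a given pipeline component name. Expects the name of the component instance in the pipeline. The factory meta is an instance of the FactoryMeta dataclass and contains the information about the component and its defaults provided by the @Language.component or @Language.factory decorator.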

Name | Description | Type
name | The pipeline component name. | str
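
The difference between the two lookups (factory name versus instance name) can be sketched as follows; "my_sents" is a made-up instance name.

```python
import spacy
from spacy.language import Language

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", name="my_sents")

# Class-level lookup by factory name.
factory_meta = Language.get_factory_meta("sentencizer")

# Instance-level lookup by the component's name in this pipeline.
pipe_meta = nlp.get_pipe_meta("my_sents")

print(factory_meta.factory, pipe_meta.factory, factory_meta.assigns)
```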

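Language.analyze_pipes method v3.0

Analyze the current pipeline components and show a summary of the attributes they assign and require, and the scores they set. The data is based on the information provided in the @Language.component and @Language.factory decorators. If requirements aren't met, e.g. if a component specifies a required property that is not set by a previous component, a warning is shown.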

Name | Description | Type
keyword-only
keys | The values to display in the table. Corresponds to attributes of the FactoryMeta. Defaults to ["assigns", "requires", "scores", "retokenizes"]. | List[str]
pretty | Pretty-print the results as a table. Defaults to False. | bool
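
A minimal sketch of the structured (non-pretty) analysis follows. It assumes the method returns a dictionary with "summary" and "problems" entries keyed by component name, as in current spaCy v3 releases; the return value is not documented in the table above.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Structured result; pass pretty=True to print a table instead.
analysis = nlp.analyze_pipes()

summary = analysis["summary"]["sentencizer"]
problems = analysis["problems"]["sentencizer"]
print(summary["assigns"], problems)
```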

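Language.replace_listeners method v3.0

Find listener layers (connecting to a shared token-to-vector embedding component) of a given pipeline component model and replace them with a standalone copy of the token-to-vector layer. The listener layer allows other components to connect to a shared token-to-vector embedding component like Tok2Vec or Transformer. Replacing listeners can be useful when training a pipeline with components sourced from an existing pipeline: if multiple components (e.g. tagger, parser, NER) listen to the same token-to-vector component, but some of them are frozen and not updated, their performance may degrade significantly as the token-to-vector component is updated with new data. To prevent this, listeners can be replaced with a standalone token-to-vector layer that is owned by the component and doesn't change if the component isn't updated.

This method is typically not called directly and is only executed under the hood when loading a config with sourced components that define replace_listeners.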

Name | Description | Type
tok2vec_name | Name of the token-to-vector component, typically "tok2vec" or "transformer". | str
pipe_name | Name of pipeline component to replace listeners for. | str
listeners | The paths to the listeners, relative to the component config, e.g. ["model.tok2vec"]. Typically, implementations will only connect to one tok2vec component, model.tok2vec, but in theory, custom models can use multiple listeners. The value here can either be an empty list to not replace any listeners, or a complete list of the paths to all listener layers used by the model that should be replaced. | Iterable[str]

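Language.memory_zone contextmanager v3.8

Begin a block where all resources allocated during the block will be freed at the end of it. If a resource was created within the memory zone block, accessing it outside the block is invalid, and the behavior of such an access is undefined. Memory zones should not be nested. Memory zones are helpful for services that need to process large volumes of text within a defined memory budget.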

Name | Description | Type
mem | Optional cymem.Pool object to own allocations (created if not provided). This argument is not required for ordinary usage. Defaults to None. | Optional[cymem.Pool]

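Language.meta property

Meta data for the Language class, including name, version, data sources, license, author information and more. If a trained pipeline is loaded, this contains the meta data of the pipeline. Language.meta is also what's serialized as meta.json when you save an nlp object to disk. See the meta data format for more details.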

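Language.config property v3.0

Export a trainable config.cfg for the current nlp object. It includes the current pipeline, all configs used to create the currently active pipeline components, and the default training config that can be used with spacy train. Language.config returns a Thinc Config object, which is a subclass of the built-in dict. It supports the additional methods to_disk (serialize the config to a file) and to_str (output the config as a string).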
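
The following sketch reads the exported config and serializes it both ways. The temporary directory is only a stand-in for a real output location.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

config = nlp.config  # a Thinc Config, usable like a dict
pipeline = config["nlp"]["pipeline"]
factory = config["components"]["sentencizer"]["factory"]

# Serialize to a string or to a file that spacy train could consume.
config_str = config.to_str()
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.cfg"
    config.to_disk(config_path)
    written = config_path.exists()

print(pipeline, factory, written)
```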

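Language.to_disk method

Save the current state to a directory. Under the hood, this method delegates to the to_disk methods of the individual pipeline components, if available. This means that if a trained pipeline is loaded, all components and their weights will be saved to disk.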

Name | Description | Type
path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. | Union[str, Path]
keyword-only
exclude | Names of pipeline components or serialization fields to exclude. | Iterable[str]

Language.from_disk method

Load state from a directory, including all data that was saved with the Language object. Modifies the object in place and returns it.

Name | Description | Type
path | A path to a directory. Paths may be either strings or Path-like objects. | Union[str, Path]
keyword-only
exclude | Names of pipeline components or serialization fields to exclude. | Iterable[str]
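
A round trip through to_disk and from_disk might look like the sketch below. The directory name "demo_pipeline" is arbitrary, and the second pipeline is assumed to be set up with the same components before loading, since from_disk restores state into an existing object.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo_pipeline"
    nlp.to_disk(path)  # the directory is created if needed
    saved_entries = sorted(p.name for p in path.iterdir())

    # Load into an object with a matching pipeline; the object itself is returned.
    nlp2 = spacy.blank("en")
    nlp2.add_pipe("sentencizer")
    returned = nlp2.from_disk(path)

    # spacy.load builds the pipeline from the saved config.cfg and then loads the data.
    reloaded = spacy.load(path)

print(saved_entries, reloaded.pipe_names)
```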

Language.to_bytes method

Serialize the current state to a binary string.

Name | Description | Type
keyword-only
exclude | Names of pipeline components or serialization fields to exclude. | Iterable[str]

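Language.from_bytes method

Load state from a binary string. Note that this method is commonly used via subclasses like English or German to make language-specific functionality, such as the lexical attribute getters, available to the loaded object.

Note that if you want to serialize and reload a whole pipeline, using this method alone is not enough: you also need to handle the config. See "Serializing the pipeline" for details.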

Name | Description | Type
bytes_data | The data to load from. | bytes
keyword-only
exclude | Names of pipeline components or serialization fields to exclude. | Iterable[str]
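
Because the binary data does not describe the pipeline structure, a full round trip needs the config as well. A sketch under that assumption, using a blank English pipeline with a sentencizer:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Keep both pieces: the config describes the pipeline, the bytes hold its state.
config = nlp.config
bytes_data = nlp.to_bytes()

# Recreate the pipeline from the config via the matching language subclass,
# then load the binary state into it.
lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
nlp2 = lang_cls.from_config(config)
nlp2.from_bytes(bytes_data)

doc = nlp2("One sentence. Two sentences.")
print(nlp2.pipe_names, len(list(doc.sents)))
```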

Attributes

Name | Description | Type
vocab | A container for the lexical types. | Vocab
tokenizer | The tokenizer. | Tokenizer
make_doc | Callable that takes a string and returns a Doc. | Callable[[str], Doc]
pipeline | List of (name, component) tuples describing the current processing pipeline, in order. | List[Tuple[str, Callable[[Doc], Doc]]]
pipe_names | List of pipeline component names, in order. | List[str]
pipe_labels | List of labels set by the pipeline components, if available, keyed by component name. | Dict[str, List[str]]
pipe_factories | Dictionary of pipeline component names, mapped to their factory names. | Dict[str, str]
factories | All available factory functions, keyed by name. | Dict[str, Callable[[], Callable[[Doc], Doc]]]
factory_names (v3.0) | List of all available factory names. | List[str]
components (v3.0) | List of all available (name, component) tuples, including components that are currently disabled. | List[Tuple[str, Callable[[Doc], Doc]]]
component_names (v3.0) | List of all available component names, including components that are currently disabled. | List[str]
disabled (v3.0) | Names of components that are currently disabled and don't run as part of the pipeline. | List[str]
path | Path to the pipeline data directory, if a pipeline is loaded from a path or package. Otherwise None. | Optional[Path]

Class attributes

Name | Description | Type
Defaults | Settings, data and factory methods for creating the nlp object and processing pipeline. | Defaults
lang | IETF language tag, such as 'en' for English. | str
default_config | Base config to use for Language.config. Defaults to default_config.cfg. | Config

Defaults

The following attributes can be set on the Language.Defaults class to customize the default language data:

Name | Description | Type
stop_words | List of stop words, used for Token.is_stop. Example: stop_words.py | Set[str]
tokenizer_exceptions | Tokenizer exception rules, string mapped to list of token attributes. Example: de/tokenizer_exceptions.py | Dict[str, List[dict]]
prefixes, suffixes, infixes | Prefix, suffix and infix rules for the default tokenizer. Example: punctuation.py | Optional[Sequence[Union[str, Pattern]]]
token_match | Optional regex for matching strings that should never be split, overriding the infix rules. Example: fr/tokenizer_exceptions.py | Optional[Callable]
url_match | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match. Example: tokenizer_exceptions.py | Optional[Callable]
lex_attr_getters | Custom functions for setting lexical attributes on tokens, e.g. like_num. Example: lex_attrs.py | Dict[int, Callable[[str], Any]]
syntax_iterators | Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks. Example: syntax_iterators.py | Dict[str, Callable[[Union[Doc, Span]], Iterator[Span]]]
writing_system | Information about the language's writing system, available via Vocab.writing_system. Defaults to: {"direction": "ltr", "has_case": True, "has_letters": True}. Example: zh/__init__.py | Dict[str, Any]
config | Default config added to nlp.config. This can include references to custom tokenizers or lemmatizers. Example: zh/__init__.py | Config
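
As a sketch of how these defaults are customized, the snippet below subclasses the English defaults and overrides the stop words. The class names and the language code "custom_en" are hypothetical.

```python
from spacy.lang.en import English


class CustomEnglishDefaults(English.Defaults):
    # Replace the default stop word list with two illustrative entries.
    stop_words = {"custom", "stop"}


class CustomEnglish(English):
    lang = "custom_en"  # hypothetical language code for this subclass
    Defaults = CustomEnglishDefaults


nlp_default = English()
nlp_custom = CustomEnglish()

default_flags = [token.is_stop for token in nlp_default("custom stop")]
custom_flags = [token.is_stop for token in nlp_custom("custom stop")]
print(nlp_default.lang, default_flags)
print(nlp_custom.lang, custom_flags)
```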

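Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.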

Name | Description
vocab | The shared Vocab.
tokenizer | Tokenization rules and exceptions.
meta | The meta data, available as Language.meta.
… | String names of pipeline components, e.g. "ner".

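FactoryMeta dataclass v3.0

The FactoryMeta contains the information about the component and its defaults provided by the @Language.component or @Language.factory decorator. It's created whenever a component is defined and stored on the Language class for each component instance and factory instance.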

Name | Description | Type
factory | The name of the registered component factory. | str
default_config | The default config, describing the default values of the factory arguments. | Dict[str, Any]
assigns | Doc or Token attributes assigned by this component, e.g. ["token.ent_id"]. Used for pipe analysis. | Iterable[str]
requires | Doc or Token attributes required by this component, e.g. ["token.ent_id"]. Used for pipe analysis. | Iterable[str]
retokenizes | Whether the component changes tokenization. Used for pipe analysis. | bool
default_score_weights | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to 1.0 per component and will be combined and normalized for the whole pipeline. If a weight is set to None, the score will not be logged or weighted. | Dict[str, Optional[float]]
scores | All scores set by the component if it's trainable, e.g. ["ents_f", "ents_r", "ents_p"]. Based on the default_score_weights and used for pipe analysis. | Iterable[str]