Pipeline

DependencyParser

class
String name: parser
Trainable: yes
Pipeline component for syntactic dependency parsing

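A transition-based dependency parser component. The dependency parser jointly learns sentence segmentation and labelled dependency parsing, and can optionally learn to merge tokens that were over-segmented by the tokenizer. The parser uses a variant of the non-monotonic arc-eager transition system described by Honnibal and Johnson (2014), with an added "break" transition that performs the sentence segmentation. The pseudo-projective dependency transformation of Nivre (2005) is used to allow the parser to predict non-projective parses.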

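The parser is trained using an imitation learning objective. It follows the actions predicted by the current weights and, at each state, determines which actions are compatible with the optimal parse that could be reached from the current state. The weights are updated so that the scores assigned to the set of optimal actions increase, while the scores assigned to other actions decrease. Note that more than one action may be optimal for a given state.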
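The update rule described above can be sketched for a single parser state. This is a minimal illustration of one common formulation (the softmax over all actions minus the softmax restricted to the optimal set), not necessarily the exact loss spaCy implements. The helper names, scores, and learning rate are assumptions made up for the example.

```python
import math

def softmax(scores, allowed=None):
    """Softmax over `scores`; actions outside `allowed` get probability 0."""
    if allowed is None:
        allowed = set(range(len(scores)))
    top = max(scores[i] for i in allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if i in allowed else 0.0 for i, s in enumerate(scores)]
    total = sum(exps)
    return [e / total for e in exps]

def imitation_gradient(scores, optimal_actions):
    """Gradient of the loss with respect to the scores of one state.

    Negative for optimal actions, positive for all other actions.
    """
    p_all = softmax(scores)
    p_optimal = softmax(scores, allowed=set(optimal_actions))
    return [a - o for a, o in zip(p_all, p_optimal)]

# Hypothetical scores for four actions; actions 1 and 3 are both optimal here.
scores = [1.0, 0.5, -0.2, 0.3]
optimal = {1, 3}
grad = imitation_gradient(scores, optimal)

# One gradient-descent step applied directly to the scores, for illustration.
learn_rate = 0.1
updated = [s - learn_rate * g for s, g in zip(scores, grad)]
for i, (old, new) in enumerate(zip(scores, updated)):
    print(i, "optimal" if i in optimal else "other", round(old, 3), "->", round(new, 3))
```

Because the target distribution is spread over the whole optimal set, the sketch does not force the model to prefer one optimal action over another.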

Assigned Attributes

Dependency predictions are assigned to the Token.dep and Token.head fields. Besides the dependencies themselves, the parser also determines sentence boundaries, which are saved in Token.is_sent_start and accessible via Doc.sents.

Location | Value | Type
Token.dep | The type of dependency relation (hash). | int
Token.dep_ | The type of dependency relation. | str
Token.head | The syntactic parent, or “governor”, of this token. | Token
Token.is_sent_start | A boolean value indicating whether the token starts a sentence. After the parser runs this will be True or False for all tokens. | bool
Doc.sents | An iterator over sentences in the Doc, determined by Token.is_sent_start values. | Iterator[Span]
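To show what these attributes look like when read back, the sketch below builds a Doc by hand with heads and dependency labels, standing in for parser output. The sentence and its annotations are illustrative assumptions; a real pipeline would fill these fields by running the parser.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-written annotations (assumed for illustration, not parser output).
words = ["She", "ate", "pizza"]
heads = [1, 1, 1]  # absolute index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "dobj"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

for token in doc:
    # dep is the integer hash, dep_ the string label, head the parent Token
    print(token.text, token.dep, token.dep_, token.head.text, token.is_sent_start)

# Sentence spans are derived from the sentence-start flags.
sentences = list(doc.sents)
print([sent.text for sent in sentences])
```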

Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training. See the model architectures documentation for details on the architectures and their arguments and hyperparameters.

Setting | Description | Type
moves | A list of transition names. Inferred from the data if not provided. Defaults to None. | Optional[TransitionSystem]
update_with_oracle_cut_size | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won’t need to change it. Defaults to 100. | int
learn_tokens | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. Defaults to False. | bool
min_action_freq | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to “dep”. While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. Defaults to 30. | int
model | The Model powering the pipeline component. Defaults to TransitionBasedParser. | Model[List[Doc], List[Floats2d]]
explosion/spaCy/master/spacy/pipeline/dep_parser.pyx
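As a minimal sketch of overriding these settings through nlp.add_pipe, the example below adds the parser to a blank English pipeline. The values simply repeat the documented defaults; the blank pipeline is an assumption chosen so that no trained package is needed.

```python
import spacy
from spacy.pipeline import DependencyParser

nlp = spacy.blank("en")

# Settings left out of the dict keep their defaults from the factory.
config = {
    "update_with_oracle_cut_size": 100,
    "learn_tokens": False,
    "min_action_freq": 30,
}
parser = nlp.add_pipe("parser", config=config)

print(nlp.pipe_names)
print(type(parser).__name__)
```

The same keys can be set under the component's block in config.cfg when training from a config file.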

DependencyParser.__init__ method

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and nlp.add_pipe.

Name | Description | Type
vocab | The shared vocabulary. | Vocab
model | The Model powering the pipeline component. | Model[List[Doc], List[Floats2d]]
name | String name of the component instance. Used to add entries to the losses during training. | str
moves | A list of transition names. Inferred from the data if not provided. | Optional[TransitionSystem]
(keyword-only)
update_with_oracle_cut_size | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won’t need to change it. Defaults to 100. | int
learn_tokens | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. Defaults to False. | bool
min_action_freq | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to “dep”. While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. | int
scorer | The scoring method. Defaults to Scorer.score_deps for the attribute "dep" ignoring the labels p and punct and Scorer.score_spans for the attribute "sents". | Optional[Callable]
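For cases where the class is constructed directly rather than through nlp.add_pipe, the following is a sketch under a few assumptions: the default model config is taken from DEFAULT_PARSER_MODEL in spacy.pipeline.dep_parser and resolved through the registry, and the keyword-only settings are passed explicitly with the documented default values.

```python
import spacy
from spacy import registry
from spacy.pipeline import DependencyParser
from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL

nlp = spacy.blank("en")

# Resolve the default model config block into a Thinc model instance.
model = registry.resolve({"model": DEFAULT_PARSER_MODEL})["model"]

# Construct the component from the class, sharing the pipeline's vocab.
parser = DependencyParser(
    nlp.vocab,
    model,
    name="parser",
    update_with_oracle_cut_size=100,
    learn_tokens=False,
    min_action_freq=30,
)
print(parser.name)
```

A component built this way is not registered in nlp.pipeline; using nlp.add_pipe("parser") remains the usual route.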

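DependencyParser.__call__ method

Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.

Name | Description | Type
doc | The document to process. | Doc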

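DependencyParser.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.

Name | Description | Type
docs | A stream of documents. | Iterable[Doc]
(keyword-only)
batch_size | The number of documents to buffer. Defaults to 128. | int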
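The sketch below calls the component directly on one Doc and then on a small stream. It assumes a blank English pipeline whose parser has only been initialized on a single hand-annotated sentence, so the resulting parses are not meaningful; the point is the calling convention. The sentence, annotations, and texts are made up for the example.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
parser = nlp.add_pipe("parser")

# One hand-annotated sentence, just enough to initialize the model.
gold = {
    "heads": [1, 1, 1, 2, 2, 1],
    "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
}
examples = [Example.from_dict(nlp.make_doc("I like London and Berlin."), gold)]
nlp.initialize(get_examples=lambda: examples)

# __call__: process a single Doc; it is modified in place and returned.
doc = parser(nlp.make_doc("I like Berlin."))
print([(t.text, t.dep_, t.head.text) for t in doc])

# pipe: process a stream of Docs in batches.
texts = ["I like London.", "They like Berlin."]
docs = list(parser.pipe((nlp.make_doc(text) for text in texts), batch_size=2))
print(len(docs))
```

In application code, calling nlp(text) or nlp.pipe(texts) runs the tokenizer and every component, including the parser, in order.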

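DependencyParser.initialize method v3.0

Initialize the component for training. get_examples should be a function that returns an iterable of Example objects. At least one example should be supplied. The data examples are used to initialize the model of the component and can either be the full training data or a representative sample. Initialization includes validating the network, inferring missing shapes and setting up the label scheme based on the data. This method is typically called by Language.initialize and lets you customize the arguments it receives via the [initialize.components] block in the config.

Name | Description | Type
get_examples | Function that returns gold-standard annotations in the form of Example objects. Must contain at least one Example. | Callable[[], Iterable[Example]]
(keyword-only)
nlp | The current nlp object. Defaults to None. | Optional[Language]
labels | The label information to add to the component, as provided by the label_data property after initialization. To generate a reusable JSON file from your data, you should run the init labels command. If no labels are provided, the get_examples callback is used to extract the labels from the data, which may be a lot slower. | Optional[Dict[str, Dict[str, int]]]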
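A minimal sketch of initializing the component from a get_examples callback follows. The single annotated sentence is an illustrative assumption, and min_action_freq is lowered to 1 here so that actions seen only once in this tiny sample meet the minimum frequency.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
parser = nlp.add_pipe("parser", config={"min_action_freq": 1})

def get_examples():
    # Hypothetical gold-standard data: one sentence with heads and labels.
    doc = nlp.make_doc("I like London and Berlin.")
    gold = {
        "heads": [1, 1, 1, 2, 2, 1],
        "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
    }
    return [Example.from_dict(doc, gold)]

# Sets up the label scheme from the data and initializes the model.
parser.initialize(get_examples, nlp=nlp)
print(parser.labels)
```

When training from a config file, Language.initialize makes this call for you.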

DependencyParser.predict method

Apply the component’s model to a batch of Doc objects, without modifying them.

Name | Description | Type
docs | The documents to predict. | Iterable[Doc]

DependencyParser.set_annotations method

Modify a batch of Doc objects, using pre-computed scores.

Name | Description | Type
docs | The documents to modify. | Iterable[Doc]
scores | The scores to set, produced by DependencyParser.predict. Returns an internal helper class for the parse state. | List[StateClass]
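The two-step flow of predict followed by set_annotations can be sketched as below. It assumes a parser initialized on one hand-annotated sentence in a blank English pipeline, so the predicted structure itself is arbitrary; the sample data is invented for the example.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
parser = nlp.add_pipe("parser")
gold = {
    "heads": [1, 1, 1, 2, 2, 1],
    "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
}
examples = [Example.from_dict(nlp.make_doc("I like London and Berlin."), gold)]
nlp.initialize(get_examples=lambda: examples)

docs = [nlp.make_doc("I like Berlin.")]

# Step 1: run the model; the docs are left untouched.
scores = parser.predict(docs)

# Step 2: write the predicted parse states onto the docs.
parser.set_annotations(docs, scores)
print([(t.text, t.dep_, t.head.text) for t in docs[0]])
```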

DependencyParser.update method

Learn from a batch of Example objects, updating the pipe’s model. Delegates to predict and get_loss.

Name | Description | Type
examples | A batch of Example objects to learn from. | Iterable[Example]
(keyword-only)
drop | The dropout rate. | float
sgd | An optimizer. Will be created via create_optimizer if not set. | Optional[Optimizer]
losses | Optional record of the loss during training. Updated using the component name as the key. | Optional[Dict[str, float]]
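A short training-loop sketch follows, assuming a blank English pipeline and a single made-up training sentence. The optimizer returned by nlp.initialize is passed in explicitly, and the number of steps is arbitrary.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
parser = nlp.add_pipe("parser", config={"min_action_freq": 1})
gold = {
    "heads": [1, 1, 1, 2, 2, 1],
    "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
}
examples = [Example.from_dict(nlp.make_doc("I like London and Berlin."), gold)]
optimizer = nlp.initialize(get_examples=lambda: examples)

losses = {}
for step in range(5):
    # The loss is accumulated under the component name, here "parser".
    parser.update(examples, drop=0.0, sgd=optimizer, losses=losses)
print(losses)
```

In a full pipeline you would normally call nlp.update, which forwards the batch to each trainable component.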

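DependencyParser.get_loss method

Find the loss and the gradient of the loss for the batch of documents and their predicted scores.

Name | Description | Type
examples | The batch of examples. | Iterable[Example]
scores | Scores representing the model’s predictions. | StateClass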

DependencyParser.create_optimizer method

Create an Optimizer for the pipeline component.
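A minimal sketch of creating an optimizer for the component, assuming a blank English pipeline:

```python
import spacy
from thinc.api import Optimizer

nlp = spacy.blank("en")
parser = nlp.add_pipe("parser")

# Returns a Thinc optimizer that can be passed as `sgd` to update.
optimizer = parser.create_optimizer()
print(type(optimizer).__name__)
```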

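DependencyParser.use_params method, contextmanager

Modify the pipe’s model to use the given parameter values. At the end of the context, the original parameters are restored.

Name | Description | Type
params | The parameter values to use in the model. | dict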

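DependencyParser.add_label method

Add a new label to the pipe. Note that you don’t have to call this method if you provide a representative data sample to the initialize method. In this case, all labels found in the sample will be automatically added to the model, and the output dimension will be inferred automatically.

Name | Description | Type
label | The label to add. | str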
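As a small sketch, a label can be added by hand before initialization and then read back through the labels property. The blank pipeline and the label name are assumptions for the example.

```python
import spacy

nlp = spacy.blank("en")
parser = nlp.add_pipe("parser")

# Register a dependency label manually.
parser.add_label("nsubj")
print(parser.labels)
```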

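DependencyParser.set_output method

Change the output dimension of the component’s model by calling the model’s resize_output attribute. This is a function that takes the original model and the new output dimension nO, and changes the model in place. When resizing an already trained model, care should be taken to avoid the “catastrophic forgetting” problem.

Name | Description | Type
nO | The new output dimension. | int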

DependencyParser.to_disk method

Serialize the pipe to disk.

Name | Description | Type
path | A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects. | Union[str, Path]
(keyword-only)
exclude | String names of serialization fields to exclude. | Iterable[str]

DependencyParser.from_disk method

Load the pipe from disk. Modifies the object in place and returns it.

Name | Description | Type
path | A path to a directory. Paths may be either strings or Path-like objects. | Union[str, Path]
(keyword-only)
exclude | String names of serialization fields to exclude. | Iterable[str]
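A round trip through to_disk and from_disk can be sketched as follows. The example assumes a parser initialized on one made-up sentence and writes to a temporary directory; the directory name "parser" is arbitrary.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
parser = nlp.add_pipe("parser", config={"min_action_freq": 1})
gold = {
    "heads": [1, 1, 1, 2, 2, 1],
    "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
}
examples = [Example.from_dict(nlp.make_doc("I like London and Berlin."), gold)]
nlp.initialize(get_examples=lambda: examples)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "parser"
    # Write the component's data to the directory (created if missing).
    parser.to_disk(path)

    # Load the data into a fresh component of a second pipeline.
    nlp2 = spacy.blank("en")
    parser2 = nlp2.add_pipe("parser")
    parser2.from_disk(path)

print(parser2.labels == parser.labels)
```

For whole pipelines, nlp.to_disk and spacy.load handle every component together, which is usually more convenient than serializing one component at a time.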

DependencyParser.to_bytes method

Serialize the pipe to a bytestring.

Name | Description | Type
(keyword-only)
exclude | String names of serialization fields to exclude. | Iterable[str]

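DependencyParser.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name | Description | Type
bytes_data | The data to load from. | bytes
(keyword-only)
exclude | String names of serialization fields to exclude. | Iterable[str]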
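The in-memory counterpart can be sketched the same way, again assuming a parser initialized on a single made-up sentence in a blank English pipeline.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
parser = nlp.add_pipe("parser", config={"min_action_freq": 1})
gold = {
    "heads": [1, 1, 1, 2, 2, 1],
    "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
}
examples = [Example.from_dict(nlp.make_doc("I like London and Berlin."), gold)]
nlp.initialize(get_examples=lambda: examples)

# Serialize the component to a bytestring.
parser_bytes = parser.to_bytes()

# Restore it into a fresh component of a second pipeline.
nlp2 = spacy.blank("en")
parser2 = nlp2.add_pipe("parser")
parser2.from_bytes(parser_bytes)
print(parser2.labels == parser.labels)
```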

DependencyParser.labels property

The labels currently added to the component.

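DependencyParser.label_data property v3.0

The labels currently added to the component and their internal meta information. This is the data generated by init labels and used by DependencyParser.initialize to initialize the model with a pre-defined label set.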

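Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Name | Description
vocab | The shared Vocab.
cfg | The config file. You usually don’t want to exclude this.
model | The binary model data. You usually don’t want to exclude this.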