EntityRecognizer

class

String name: ner, Trainable: yes

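Pipeline component for named entity recognition

A transition-based named entity recognition component. The entity recognizer identifies non-overlapping labelled spans of tokens. The transition-based algorithm encodes certain assumptions that work well for "traditional" named entity recognition tasks, but may not suit every span identification problem. Specifically, the loss function optimizes for whole-entity accuracy, so if inter-annotator agreement on boundary tokens is low, the component will likely perform poorly on your problem. The algorithm also assumes that the most decisive information about an entity is close to its initial tokens. If your entities are long and characterized by tokens in their middle, the component is likely not a good fit for your task.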

Assigned Attributes

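Predictions are saved to `Doc.ents` as a tuple. Each label is also reflected on the underlying tokens, in the `Token.ent_type` and `Token.ent_iob` fields. Note that, by definition, each token can have only one label.

When setting `Doc.ents` to create training data, all spans must be valid and non-overlapping; otherwise an error is raised.

Location	Value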
`Doc.ents`	The annotated spans. Tuple[Span]
`Token.ent_iob`	An enum encoding of the IOB part of the named entity tag. int
`Token.ent_iob_`	The IOB part of the named entity tag. str
`Token.ent_type`	The label part of the named entity tag (hash). int
`Token.ent_type_`	The label part of the named entity tag. str
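As a minimal sketch of how these attributes relate, the snippet below sets `Doc.ents` by hand on a blank English pipeline and reads the annotation back from the tokens. The sentence and labels are illustrative only, and catching `ValueError` for overlapping spans is an assumption based on spaCy v3 behaviour.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("Apple is opening an office in San Francisco")

# Entities are token-based spans: [start, end) token indices plus a label.
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 6, 8, label="GPE")]

# Each token mirrors the span annotation as an IOB code and a label.
rows = [(token.text, token.ent_iob_, token.ent_type_) for token in doc]
ents = [(ent.text, ent.label_) for ent in doc.ents]
for row in rows:
    print(row)

# Overlapping spans are rejected, since a token can carry only one label.
try:
    doc.ents = [Span(doc, 6, 8, label="GPE"), Span(doc, 7, 8, label="LOC")]
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
print("overlap rejected:", overlap_rejected)
```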

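Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the `config` argument on `nlp.add_pipe` or in your `config.cfg` for training. See the model architectures documentation for details on the architectures and their arguments and hyperparameters.

Setting	Description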
`moves`	A list of transition names. Inferred from the data if not provided. Defaults to `None`. Optional[TransitionSystem]
`update_with_oracle_cut_size`	During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won’t need to change it. Defaults to `100`. int
`model`	The `Model` powering the pipeline component. Defaults to TransitionBasedParser. Model[List[Doc], List[Floats2d]]
`incorrect_spans_key`	This key refers to a `SpanGroup` in `doc.spans` that specifies incorrect spans. The NER will learn not to predict (exactly) those spans. Defaults to `None`. Optional[str]
`scorer`	The scoring method. Defaults to `spacy.scorer.get_ner_prf`. Optional[Callable]
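The settings above can be overridden when the component is added. The following is a small sketch on a blank English pipeline; the values `50` and `"incorrect_spans"` are arbitrary choices for illustration, not recommendations.

```python
import spacy

nlp = spacy.blank("en")

# Only the settings listed here are overridden; the rest keep their defaults.
config = {
    "update_with_oracle_cut_size": 50,
    "incorrect_spans_key": "incorrect_spans",
}
ner = nlp.add_pipe("ner", config=config)
print(nlp.pipe_names)
```

The same keys can be set under `[components.ner]` in a training `config.cfg`.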

explosion/spaCy/master/spacy/pipeline/ner.pyx

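EntityRecognizer.__init__ method

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and `nlp.add_pipe`.

Name	Description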
`vocab`	The shared vocabulary. Vocab
`model`	The `Model` powering the pipeline component. Model[List[Doc], List[Floats2d]]
`name`	String name of the component instance. Used to add entries to the `losses` during training. str
`moves`	A list of transition names. Inferred from the data if set to `None`, which is the default. Optional[TransitionSystem]
keyword-only
`update_with_oracle_cut_size`	During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won’t need to change it. Defaults to `100`. int
`incorrect_spans_key`	Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group in `Doc.spans`, under this key. Defaults to `None`. Optional[str]
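For comparison with the `nlp.add_pipe` shortcut, here is a sketch of constructing the component directly from the class. It assumes the default model config exposed as `DEFAULT_NER_MODEL` in `spacy.pipeline.ner` and resolves it through the registry; a component built this way is not added to the pipeline automatically.

```python
import spacy
from spacy import registry
from spacy.pipeline import EntityRecognizer
from spacy.pipeline.ner import DEFAULT_NER_MODEL

nlp = spacy.blank("en")

# Resolve the default model config into a Thinc Model instance.
model = registry.resolve({"model": DEFAULT_NER_MODEL})["model"]

# Construct the component from the class, sharing the pipeline's vocab.
ner = EntityRecognizer(nlp.vocab, model, name="ner")
print(ner.name, ner.labels)
```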

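EntityRecognizer.__call__ method

Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order. Both `__call__` and `pipe` delegate to the `predict` and `set_annotations` methods.

Name	Description
`doc`	The document to process. Doc
RETURNS	The processed document. Doc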

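EntityRecognizer.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order. Both `__call__` and `pipe` delegate to the `predict` and `set_annotations` methods.

Name	Description
`docs`	A stream of documents. Iterable[Doc]
keyword-only
`batch_size`	The number of documents to buffer. Defaults to `128`. int
YIELDS	The processed documents in order. Doc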
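The sketch below calls the component directly, first on a single `Doc` and then on a stream. It assumes a one-sentence, made-up training set that is only there to initialize the model, so the predicted entities are not meaningful.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Hypothetical one-sentence data set, just enough to initialize the model.
text = "Apple is opening an office in San Francisco"
annotations = {"entities": [(0, 5, "ORG"), (30, 43, "GPE")]}
examples = [Example.from_dict(nlp.make_doc(text), annotations)]
nlp.initialize(lambda: examples)

# __call__: process one Doc; the same object is modified and returned.
doc = nlp.make_doc("Google opened an office in Berlin")
same_doc = ner(doc)

# pipe: process a stream of Docs lazily and yield them in order.
texts = ["Apple hired staff in Paris", "She moved to San Francisco"]
processed = list(ner.pipe((nlp.make_doc(t) for t in texts), batch_size=2))
print([d.text for d in processed])
```

In normal use, `nlp(text)` and `nlp.pipe(texts)` make these calls for every component in the pipeline.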

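EntityRecognizer.initialize method v3.0

Initialize the component for training. `get_examples` should be a function that returns an iterable of `Example` objects. At least one example should be supplied. The data examples are used to initialize the model of the component and can either be the full training data or a representative sample. Initialization includes validating the network, inferring missing shapes and setting up the label scheme based on the data. This method is typically called by `Language.initialize` and lets you customize the arguments it receives via the `[initialize.components]` block in the config.

Name	Description
`get_examples`	Function that returns gold-standard annotations in the form of `Example` objects. Must contain at least one `Example`. Callable[[], Iterable[Example]]
keyword-only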
`nlp`	The current `nlp` object. Defaults to `None`. Optional[Language]
`labels`	The label information to add to the component, as provided by the `label_data` property after initialization. To generate a reusable JSON file from your data, you should run the `init labels` command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. Optional[Dict[str, Dict[str, int]]]
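A minimal sketch of initializing the component from data follows. `TRAIN_DATA` and the `get_examples` helper are made up for illustration; the labels are inferred from the examples because no `labels` argument is passed.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Hypothetical training data: text plus character-offset entity annotations.
TRAIN_DATA = [
    ("Apple is opening an office in San Francisco",
     {"entities": [(0, 5, "ORG"), (30, 43, "GPE")]}),
    ("Berlin is lovely in spring", {"entities": [(0, 6, "GPE")]}),
]

def get_examples():
    # Called with no arguments; must return at least one Example.
    return [
        Example.from_dict(nlp.make_doc(text), annots)
        for text, annots in TRAIN_DATA
    ]

# Sets up the label scheme and infers missing model shapes from the data.
ner.initialize(get_examples, nlp=nlp)
print(ner.labels)
```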

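EntityRecognizer.predict method

Apply the component’s model to a batch of `Doc` objects, without modifying them.

Name	Description
`docs`	The documents to predict. Iterable[Doc]
RETURNS	A helper class for the parse state (internal). StateClass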

EntityRecognizer.set_annotations method

Modify a batch of `Doc` objects, using pre-computed scores.

Name	Description
`docs`	The documents to modify. Iterable[Doc]
`scores`	The scores to set, produced by `EntityRecognizer.predict`: internal helper objects for the parse state. List[StateClass]
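To illustrate the split between `predict` and `set_annotations`, the sketch below runs the two steps separately and checks whether the `Doc` carries entity annotation after each one. The training sentence is made up and only serves to initialize the model; using `Doc.has_annotation("ENT_IOB")` as the check is a choice made for this example.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Hypothetical one-sentence data set, just enough to initialize the model.
text = "Apple is opening an office in San Francisco"
annotations = {"entities": [(0, 5, "ORG"), (30, 43, "GPE")]}
examples = [Example.from_dict(nlp.make_doc(text), annotations)]
nlp.initialize(lambda: examples)

docs = [nlp.make_doc("Apple opened a store in Berlin")]

# Step 1: compute parse states; the Docs are left untouched.
states = ner.predict(docs)
annotated_before = docs[0].has_annotation("ENT_IOB")

# Step 2: write the predictions to the Docs.
ner.set_annotations(docs, states)
annotated_after = docs[0].has_annotation("ENT_IOB")

print(annotated_before, annotated_after)
```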

EntityRecognizer.update method

Learn from a batch of `Example` objects, updating the pipe’s model. Delegates to `predict` and `get_loss`.

Name	Description
`examples`	A batch of `Example` objects to learn from. Iterable[Example]
keyword-only
`drop`	The dropout rate. float
`sgd`	An optimizer. Will be created via `create_optimizer` if not set. Optional[Optimizer]
`losses`	Optional record of the loss during training. Updated using the component name as the key. Optional[Dict[str, float]]
RETURNS	The updated `losses` dictionary. Dict[str, float]
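A rough sketch of a manual training loop for this component alone is shown below. The single training sentence, the dropout rate and the number of steps are arbitrary, and the optimizer comes from the component’s own `create_optimizer`; a real project would normally train through `spacy train` or `nlp.update`.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Hypothetical one-sentence data set for illustration.
text = "Apple is opening an office in San Francisco"
annotations = {"entities": [(0, 5, "ORG"), (30, 43, "GPE")]}
examples = [Example.from_dict(nlp.make_doc(text), annotations)]

ner.initialize(lambda: examples, nlp=nlp)
optimizer = ner.create_optimizer()

# The loss is accumulated in the dict under the component name ("ner").
losses = {}
for _ in range(10):
    ner.update(examples, drop=0.1, sgd=optimizer, losses=losses)
print(losses)
```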

EntityRecognizer.get_loss method

Find the loss and gradient of loss for the batch of documents and their predicted scores.

Name	Description
`examples`	The batch of examples. Iterable[Example]
`scores`	Scores representing the model’s predictions. StateClass
RETURNS	The loss and the gradient, i.e. `(loss, gradient)`. Tuple[float, float]

EntityRecognizer.create_optimizer method

Create an optimizer for the pipeline component.

Name	Description
RETURNS	The optimizer. Optimizer

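EntityRecognizer.use_params method, context manager

Modify the pipe’s model to use the given parameter values. At the end of the context, the original parameters are restored.

Name	Description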
`params`	The parameter values to use in the model. dict

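EntityRecognizer.add_label method

Add a new label to the pipe. Note that you do not have to call this method if you provide a representative data sample to the `initialize` method. In that case, all labels found in the sample are added to the model automatically, and the output dimension is inferred automatically.

Name	Description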
`label`	The label to add. str
RETURNS	`0` if the label is already present, otherwise `1`. int
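A short sketch of the return value, using an arbitrary `PERSON` label on a freshly added component:

```python
import spacy

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

first = ner.add_label("PERSON")   # the label is new
second = ner.add_label("PERSON")  # the label is already present
print(first, second, ner.labels)
```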

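EntityRecognizer.set_output method

Change the output dimension of the component’s model by calling the model’s `resize_output` attribute. This is a function that takes the original model and the new output dimension `nO`, and changes the model in place. When resizing an already trained model, take care to avoid the "catastrophic forgetting" problem.

Name	Description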
`nO`	The new output dimension. int

EntityRecognizer.to_disk method

Serialize the pipe to disk.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]

EntityRecognizer.from_disk method

Load the pipe from disk. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The modified `EntityRecognizer` object. EntityRecognizer
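The following sketch round-trips an initialized component through a temporary directory. The training sentence and the directory name `ner` are placeholders chosen for this example.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Hypothetical one-sentence data set, just enough to initialize the model.
text = "Apple is opening an office in San Francisco"
annotations = {"entities": [(0, 5, "ORG"), (30, 43, "GPE")]}
examples = [Example.from_dict(nlp.make_doc(text), annotations)]
nlp.initialize(lambda: examples)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "ner"  # created by to_disk if it does not exist
    ner.to_disk(path)

    # Load into a fresh component of the same type.
    nlp2 = spacy.blank("en")
    ner2 = nlp2.add_pipe("ner")
    ner2.from_disk(path)

print(ner2.labels)
```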

EntityRecognizer.to_bytes method

Serialize the pipe to a bytestring.

Name	Description
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The serialized form of the `EntityRecognizer` object. bytes

EntityRecognizer.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name	Description
`bytes_data`	The data to load from. bytes
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `EntityRecognizer` object. EntityRecognizer
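The in-memory equivalent can be sketched the same way: serialize an initialized component to bytes and load it into a fresh one. The training sentence is again made up for illustration.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Hypothetical one-sentence data set, just enough to initialize the model.
text = "Apple is opening an office in San Francisco"
annotations = {"entities": [(0, 5, "ORG"), (30, 43, "GPE")]}
examples = [Example.from_dict(nlp.make_doc(text), annotations)]
nlp.initialize(lambda: examples)

ner_bytes = ner.to_bytes()

# Load into a fresh component; from_bytes returns the modified object.
nlp2 = spacy.blank("en")
ner2 = nlp2.add_pipe("ner")
loaded = ner2.from_bytes(ner_bytes)
print(loaded.labels)
```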

EntityRecognizer.labels property

The labels currently added to the component.

Name	Description
RETURNS	The labels added to the component. Tuple[str, …]

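EntityRecognizer.label_data property v3.0

The labels currently added to the component and their internal meta information. This is the data generated by `init labels` and used by `EntityRecognizer.initialize` to initialize the model with a pre-defined label set.

Name	Description
RETURNS	The label data added to the component. Dict[str, Dict[str, Dict[str, int]]]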
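As a sketch of how `labels` and `label_data` might be used together, the snippet below initializes one component from data and then passes its `label_data` to a second component through the `labels` argument of `initialize`. The `make_examples` helper and the training sentence are hypothetical; in a real project the label data would usually come from a JSON file written by `init labels`.

```python
import spacy
from spacy.training import Example

def make_examples(nlp):
    # Hypothetical helper: build Examples with the given pipeline's vocab.
    text = "Apple is opening an office in San Francisco"
    annotations = {"entities": [(0, 5, "ORG"), (30, 43, "GPE")]}
    return [Example.from_dict(nlp.make_doc(text), annotations)]

# First component: labels are extracted from the examples.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
examples = make_examples(nlp)
ner.initialize(lambda: examples, nlp=nlp)
label_data = ner.label_data

# Second component: reuse the pre-computed label data.
nlp2 = spacy.blank("en")
ner2 = nlp2.add_pipe("ner")
examples2 = make_examples(nlp2)
ner2.initialize(lambda: examples2, nlp=nlp2, labels=label_data)

print(ner.labels, ner2.labels)
```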

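Serialization fields

During serialization, spaCy exports several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing their string names via the `exclude` argument.

Name	Description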
`vocab`	The shared `Vocab`.
`cfg`	The config file. You usually don’t want to exclude this.
`model`	The binary model data. You usually don’t want to exclude this.
