EntityLinker

class

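String name: entity_linker	Trainable: yes

Pipeline component for named entity linking and disambiguation

An EntityLinker component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities in the "real world". It requires a KnowledgeBase, a function that generates plausible candidates from that KnowledgeBase for a given textual mention, and a machine learning model that picks the right candidate based on the local context of the mention. EntityLinker defaults to the InMemoryLookupKB implementation.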
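
The candidate-generation step described above can be sketched with a toy InMemoryLookupKB. This is a minimal illustration, assuming spaCy v3.5 or later; the entity IDs, frequencies, vectors and prior probabilities are made up for the example and do not come from a real knowledge base.

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")

# Toy knowledge base: all IDs, frequencies, vectors and priors are invented.
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 0.0, 0.0])
kb.add_entity(entity="Q5301561", freq=3, entity_vector=[0.0, 1.0, 0.0])

# One alias can point to several entities, each with a prior probability.
kb.add_alias(
    alias="Douglas",
    entities=["Q42", "Q5301561"],
    probabilities=[0.8, 0.2],
)

# Exact, case-dependent alias lookup; unknown aliases yield no candidates.
candidates = list(kb.get_alias_candidates("Douglas"))
unknown = list(kb.get_alias_candidates("douglas"))

for candidate in candidates:
    print(candidate.entity_, candidate.prior_prob)
```

The trained model then chooses among these candidates using the context of the mention, optionally combined with the prior probabilities stored in the KB.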

Assigned Attributes

Predictions, in the form of knowledge base IDs, are assigned to Token.ent_kb_id_.

Location	Value
`Token.ent_kb_id`	Knowledge base ID (hash). int
`Token.ent_kb_id_`	Knowledge base ID. str

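Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training. See the model architectures documentation for details on the architectures and their arguments and hyperparameters.

Setting	Description
`labels_discard`	NER labels that will automatically get a `"NIL"` prediction. Defaults to `[]`. Iterable[str]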
`n_sents`	The number of neighbouring sentences to take into account. Defaults to 0. int
`incl_prior`	Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. bool
`incl_context`	Whether or not to include the local context in the model. Defaults to `True`. bool
`model`	The `Model` powering the pipeline component. Defaults to EntityLinker. Model
`entity_vector_length`	Size of encoding vectors in the KB. Defaults to `64`. int
`use_gold_ents`	Whether to copy entities from the gold docs or not. Defaults to `True`. If `False`, entities must be set in the training data or by an annotating component in the pipeline. bool
`get_candidates`	Function that generates plausible candidates for a given `Span` object. Defaults to CandidateGenerator, a function looking up exact, case-dependent aliases in the KB. Callable[[KnowledgeBase,Span], Iterable[Candidate]]
`get_candidates_batch` v3.5	Function that generates plausible candidates for a given batch of `Span` objects. Defaults to CandidateBatchGenerator, a function looking up exact, case-dependent aliases in the KB. Callable[[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]]
`generate_empty_kb` v3.5.1	Function that generates an empty `KnowledgeBase` object. Defaults to `spacy.EmptyKB.v2`, which generates an empty `InMemoryLookupKB`. Callable[[Vocab, int],KnowledgeBase]
`overwrite` v3.2	Whether existing annotation is overwritten. Defaults to `True`. bool
`scorer` v3.2	The scoring method. Defaults to `Scorer.score_links`. Optional[Callable]
`threshold` v3.4	Confidence threshold for entity predictions. The default of `None` implies that all predictions are accepted, otherwise those with a score beneath the threshold are discarded. If there are no predictions with scores above the threshold, the linked entity is `NIL`. Optional[float]
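
As a minimal sketch of overriding these settings through nlp.add_pipe, the snippet below changes a few values from the table. The specific values (one neighbouring sentence, a `DATE` label to discard, a threshold of 0.5) are arbitrary choices for illustration, and the `threshold` setting assumes spaCy v3.4 or later.

```python
import spacy

nlp = spacy.blank("en")

# Override a few defaults; anything not listed keeps its default value.
config = {
    "n_sents": 1,                # also look at one neighbouring sentence
    "incl_prior": False,         # ignore prior probabilities from the KB
    "labels_discard": ["DATE"],  # entities with this label always get NIL
    "threshold": 0.5,            # discard predictions scoring below 0.5
}
entity_linker = nlp.add_pipe("entity_linker", config=config)

# The resolved settings, including untouched defaults, can be inspected.
resolved = nlp.get_pipe_config("entity_linker")
print(resolved["n_sents"], resolved["incl_prior"], resolved["threshold"])
```

The same keys can be set under the component's block in config.cfg when training from the command line.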

explosion/spaCy/master/spacy/pipeline/entity_linker.py

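EntityLinker.__init__ method

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and nlp.add_pipe.

When the entity linker component is constructed, an empty knowledge base is created with the provided entity_vector_length. If you want to use a custom knowledge base, either call set_kb or provide a kb_loader in the initialize call.

Name	Description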
`vocab`	The shared vocabulary. Vocab
`model`	The `Model` powering the pipeline component. Model
`name`	String name of the component instance. Used to add entries to the `losses` during training. str
keyword-only
`entity_vector_length`	Size of encoding vectors in the KB. int
`get_candidates`	Function that generates plausible candidates for a given `Span` object. Callable[[KnowledgeBase,Span], Iterable[Candidate]]
`labels_discard`	NER labels that will automatically get a `"NIL"` prediction. Iterable[str]
`n_sents`	The number of neighbouring sentences to take into account. int
`incl_prior`	Whether or not to include prior probabilities from the KB in the model. bool
`incl_context`	Whether or not to include the local context in the model. bool
`overwrite` v3.2	Whether existing annotation is overwritten. Defaults to `True`. bool
`scorer` v3.2	The scoring method. Defaults to `Scorer.score_links`. Optional[Callable]
`threshold` v3.4	Confidence threshold for entity predictions. The default of `None` implies that all predictions are accepted, otherwise those with a score beneath the threshold are discarded. If there are no predictions with scores above the threshold, the linked entity is `NIL`. Optional[float]

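EntityLinker.__call__ method

Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.

Name	Description
`doc`	The document to process. Doc
RETURNS	The processed document. Doc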

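EntityLinker.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.

Name	Description
`stream`	A stream of documents. Iterable[Doc]
keyword-only
`batch_size`	The number of documents to buffer. Defaults to `128`. int
YIELDS	The processed documents in order. Doc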

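EntityLinker.set_kb method v3.0

The kb_loader should be a function that takes a Vocab instance and creates the KnowledgeBase, ensuring that the strings of the knowledge base stay in sync with the current vocab.

Name	Description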
`kb_loader`	Function that creates a `KnowledgeBase` from a `Vocab` instance. Callable[[Vocab],KnowledgeBase]
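
A kb_loader of this shape might look like the sketch below. It assumes spaCy v3.5 or later (for InMemoryLookupKB); the function name `create_kb`, the entity ID, the alias and the vector values are hypothetical and chosen only for illustration.

```python
import spacy
from spacy.kb import InMemoryLookupKB

VECTOR_LENGTH = 3  # arbitrary size for this toy example

nlp = spacy.blank("en")
entity_linker = nlp.add_pipe(
    "entity_linker", config={"entity_vector_length": VECTOR_LENGTH}
)


def create_kb(vocab):
    """Hypothetical loader: builds the KB on the vocab that set_kb passes in."""
    kb = InMemoryLookupKB(vocab, entity_vector_length=VECTOR_LENGTH)
    kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 0.0, 0.0])
    kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[1.0])
    return kb


# Replace the empty default KB with the custom one.
entity_linker.set_kb(create_kb)
print(entity_linker.kb.get_entity_strings())
```

Because the loader receives the component's own Vocab, the entity and alias strings added to the KB end up in the same string store that the rest of the pipeline uses.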

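EntityLinker.initialize method v3.0

Initialize the component for training. get_examples should be a function that returns an iterable of Example objects. At least one example should be supplied. The data examples are used to initialize the model of the component and can either be the full training data or a representative sample. Initialization includes validating the network, inferring missing shapes and setting up the label scheme based on the data. This method is typically called by Language.initialize.

Optionally, a kb_loader argument may be specified to change the internal knowledge base. This argument should be a function that takes a Vocab instance and creates the KnowledgeBase, ensuring that the strings of the knowledge base stay in sync with the current vocab.

Name	Description
`get_examples`	Function that returns gold-standard annotations in the form of `Example` objects. Must contain at least one `Example`. Callable[[], Iterable[Example]]
keyword-only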
`nlp`	The current `nlp` object. Defaults to `None`. Optional[Language]
`kb_loader`	Function that creates a `KnowledgeBase` from a `Vocab` instance. Callable[[Vocab],KnowledgeBase]

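EntityLinker.predict method

Apply the component's model to a batch of Doc objects without modifying them. Returns the KB ID for each entity in each doc, including NIL where there is no prediction.

Name	Description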
`docs`	The documents to predict. Iterable[Doc]
RETURNS	The predicted KB identifiers for the entities in the `docs`. List[str]

EntityLinker.set_annotations method

Modify a batch of documents, using a list of pre-computed entity IDs for their named entities.

Name	Description
`docs`	The documents to modify. Iterable[Doc]
`kb_ids`	The knowledge base identifiers for the entities in the docs, predicted by `EntityLinker.predict`. List[str]
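
To illustrate how the IDs end up on the tokens, the sketch below calls set_annotations directly with a hand-written ID instead of the output of predict. The text, the manually set entity span and the ID "Q42" are assumptions made for the example; in normal use the IDs come from EntityLinker.predict.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
entity_linker = nlp.add_pipe("entity_linker")

# Tokenize only, so the untrained pipeline components are not run.
doc = nlp.make_doc("Douglas Adams wrote novels.")

# Mark the first two tokens as a named entity by hand.
doc.ents = [Span(doc, 0, 2, label="PERSON")]

# One KB ID per entity, in document order.
entity_linker.set_annotations([doc], ["Q42"])

for token in doc:
    print(token.text, repr(token.ent_kb_id_))
```

Every token inside the entity receives the ID, while tokens outside any entity keep an empty string in Token.ent_kb_id_.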

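EntityLinker.update method

Learn from a batch of Example objects, updating both the pipe's entity linking model and context encoder. Delegates to predict.

Name	Description
`examples`	A batch of `Example` objects to learn from. Iterable[Example]
keyword-only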
`drop`	The dropout rate. float
`sgd`	An optimizer. Will be created via `create_optimizer` if not set. Optional[Optimizer]
`losses`	Optional record of the loss during training. Updated using the component name as the key. Optional[Dict[str, float]]
RETURNS	The updated `losses` dictionary. Dict[str, float]

EntityLinker.create_optimizer method

Create an optimizer for the pipeline component.

Name	Description
RETURNS	The optimizer. Optimizer

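EntityLinker.use_params method contextmanager

Modify the pipe's model to use the given parameter values. At the end of the context, the original parameters are restored.

Name	Description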
`params`	The parameter values to use in the model. dict

EntityLinker.to_disk method

Serialize the pipe to disk.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]

EntityLinker.from_disk method

Load the pipe from disk. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The modified `EntityLinker` object. EntityLinker

EntityLinker.to_bytes method

Serialize the pipe to a bytestring, including the KnowledgeBase.

Name	Description
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The serialized form of the `EntityLinker` object. bytes

EntityLinker.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name	Description
`bytes_data`	The data to load from. bytes
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `EntityLinker` object. EntityLinker

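Serialization fields

During serialization, spaCy exports several data fields that are used to restore different aspects of the object. If needed, you can exclude them from serialization by passing their string names via the exclude argument.

Name	Description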
`vocab`	The shared `Vocab`.
`cfg`	The config file. You usually don’t want to exclude this.
`model`	The binary model data. You usually don’t want to exclude this.
`kb`	The knowledge base. You usually don’t want to exclude this.
