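Entity Linker
An EntityLinker component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities in the "real world". It requires a KnowledgeBase, a function that generates plausible candidates from that KnowledgeBase for a given textual mention, and a machine learning model that picks the right candidate given the local context of the mention. EntityLinker uses the InMemoryLookupKB implementation by default.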
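The selection step can be pictured with a small, self-contained sketch. This is not spaCy's implementation: the tuple layout, the `pick_candidate` helper, and the way the prior and the context similarity are combined (a probabilistic OR) are assumptions made for illustration only.

```python
import math

NIL = "NIL"  # placeholder ID meaning "no entity linked"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_candidate(candidates, context_vector, incl_prior=True,
                   incl_context=True, threshold=None):
    """Pick the best KB ID from (kb_id, prior_prob, entity_vector) tuples.

    The tuple layout is hypothetical; real candidates are richer objects.
    """
    best_id, best_score = NIL, None
    for kb_id, prior, vector in candidates:
        prior = prior if incl_prior else 0.0
        score = prior
        if incl_context:
            sim = cosine(vector, context_vector)
            # Assumed combination rule: high if either signal is high.
            score = prior + sim - prior * sim
        if best_score is None or score > best_score:
            best_id, best_score = kb_id, score
    if best_score is None:
        return NIL  # no candidates at all
    if threshold is not None and best_score < threshold:
        return NIL  # best candidate is not confident enough
    return best_id


candidates = [("Q42", 0.8, [1.0, 0.0]), ("Q463035", 0.2, [0.0, 1.0])]
print(pick_candidate(candidates, [0.0, 1.0]))
print(pick_candidate(candidates, [0.0, 1.0], incl_context=False, threshold=0.9))
```

In the first call the context vector outweighs the prior; in the second, only the prior is used and it falls below the threshold, so the mention stays unlinked.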
Assigned Attributes
Predictions are assigned to Token.ent_kb_id_ in the form of knowledge base IDs.
| Location | Value |
|---|---|
| Token.ent_kb_id | Knowledge base ID (hash). int |
| Token.ent_kb_id_ | Knowledge base ID. str |
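To see how the two attributes relate, the following sketch sets a knowledge base ID by hand on an entity span and reads it back from the tokens. The text and the ID "Q42" are illustrative; in a real pipeline the entity linker assigns the ID.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp.make_doc("Douglas Adams wrote novels.")

# Hand-set an entity with a KB ID (normally the entity linker does this).
doc.ents = [Span(doc, 0, 2, label="PERSON", kb_id="Q42")]

for token in doc[0:2]:
    # ent_kb_id_ is the string ID, ent_kb_id its hash in the shared StringStore.
    print(token.text, token.ent_kb_id_, token.ent_kb_id)
```

Tokens outside the entity keep an empty string as their `ent_kb_id_`.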
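Config and implementation
The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training. See the model architectures documentation for details on the architectures and their arguments and hyperparameters.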
| Setting | Description |
|---|---|
| labels_discard | NER labels that will automatically get a "NIL" prediction. Defaults to []. Iterable[str] |
| n_sents | The number of neighbouring sentences to take into account. Defaults to 0. int |
| incl_prior | Whether or not to include prior probabilities from the KB in the model. Defaults to True. bool |
| incl_context | Whether or not to include the local context in the model. Defaults to True. bool |
| model | The Model powering the pipeline component. Defaults to EntityLinker. Model |
| entity_vector_length | Size of encoding vectors in the KB. Defaults to 64. int |
| use_gold_ents | Whether to copy entities from the gold docs or not. Defaults to True. If False, entities must be set in the training data or by an annotating component in the pipeline. bool |
| get_candidates | Function that generates plausible candidates for a given Span object. Defaults to CandidateGenerator, a function looking up exact, case-dependent aliases in the KB. Callable[[KnowledgeBase, Span], Iterable[Candidate]] |
| get_candidates_batch (v3.5) | Function that generates plausible candidates for a given batch of Span objects. Defaults to CandidateBatchGenerator, a function looking up exact, case-dependent aliases in the KB. Callable[[KnowledgeBase, Iterable[Span]], Iterable[Iterable[Candidate]]] |
| generate_empty_kb (v3.5.1) | Function that generates an empty KnowledgeBase object. Defaults to spacy.EmptyKB.v2, which generates an empty InMemoryLookupKB. Callable[[Vocab, int], KnowledgeBase] |
| overwrite (v3.2) | Whether existing annotation is overwritten. Defaults to True. bool |
| scorer (v3.2) | The scoring method. Defaults to Scorer.score_links. Optional[Callable] |
| threshold (v3.4) | Confidence threshold for entity predictions. The default of None means that all predictions are accepted; otherwise, predictions with a score below the threshold are discarded. If no prediction scores above the threshold, the linked entity is NIL. Optional[float] |
explosion/spaCy/master/spacy/pipeline/entity_linker.py
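As a minimal sketch of overriding the defaults, the settings can be passed as a dict to nlp.add_pipe. The values below are arbitrary examples, not recommendations.

```python
import spacy

nlp = spacy.blank("en")

# Only the overridden settings need to be listed; the rest keep their defaults.
config = {
    "labels_discard": ["DATE"],  # DATE entities always get "NIL"
    "incl_prior": True,
    "incl_context": False,       # rely on KB prior probabilities only
    "threshold": 0.5,            # predictions scoring below 0.5 become "NIL"
}
entity_linker = nlp.add_pipe("entity_linker", config=config)
print(nlp.pipe_names)
```

The same keys can be set in the `[components.entity_linker]` block of a training config.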
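EntityLinker.__init__ method
Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and nlp.add_pipe.
When the entity linker component is constructed, an empty knowledge base is built with the provided entity_vector_length. If you want to use a custom knowledge base, either call set_kb or provide a kb_loader in the initialize call.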
| Name | Description |
|---|---|
| vocab | The shared vocabulary. Vocab |
| model | The Model powering the pipeline component. Model |
| name | String name of the component instance. Used to add entries to the losses during training. str |
| keyword-only | |
| entity_vector_length | Size of encoding vectors in the KB. int |
| get_candidates | Function that generates plausible candidates for a given Span object. Callable[[KnowledgeBase, Span], Iterable[Candidate]] |
| labels_discard | NER labels that will automatically get a "NIL" prediction. Iterable[str] |
| n_sents | The number of neighbouring sentences to take into account. int |
| incl_prior | Whether or not to include prior probabilities from the KB in the model. bool |
| incl_context | Whether or not to include the local context in the model. bool |
| overwrite (v3.2) | Whether existing annotation is overwritten. Defaults to True. bool |
| scorer (v3.2) | The scoring method. Defaults to Scorer.score_links. Optional[Callable] |
| threshold (v3.4) | Confidence threshold for entity predictions. The default of None means that all predictions are accepted; otherwise, predictions with a score below the threshold are discarded. If no prediction scores above the threshold, the linked entity is NIL. Optional[float] |
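A custom get_candidates function can be supplied through the registry. The sketch below assumes spaCy 3.5 or later; the registry names `demo.LowercaseCandidateGenerator.v1` and `demo.LowercaseCandidateBatchGenerator.v1`, the KB contents, and the entity IDs are all made up for illustration. It sets incl_context to False so that the untrained component can run on prior probabilities alone.

```python
import spacy
from spacy import registry
from spacy.kb import InMemoryLookupKB


def get_lowercase_candidates(kb, span):
    # Look up the lower-cased mention instead of the exact surface form.
    return kb.get_alias_candidates(span.text.lower())


def get_lowercase_candidates_batch(kb, spans):
    return [get_lowercase_candidates(kb, span) for span in spans]


@registry.misc("demo.LowercaseCandidateGenerator.v1")  # hypothetical name
def create_candidates():
    return get_lowercase_candidates


@registry.misc("demo.LowercaseCandidateBatchGenerator.v1")  # hypothetical name
def create_candidates_batch():
    return get_lowercase_candidates_batch


def create_kb(vocab):
    kb = InMemoryLookupKB(vocab, entity_vector_length=1)
    kb.add_entity(entity="Q42", freq=12, entity_vector=[2.0])
    kb.add_entity(entity="Q463035", freq=5, entity_vector=[3.0])
    kb.add_alias(alias="douglas", entities=["Q42", "Q463035"],
                 probabilities=[0.8, 0.1])
    return kb


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": [{"LOWER": "douglas"}]}])
entity_linker = nlp.add_pipe(
    "entity_linker",
    config={
        "incl_context": False,
        "get_candidates": {"@misc": "demo.LowercaseCandidateGenerator.v1"},
        "get_candidates_batch": {"@misc": "demo.LowercaseCandidateBatchGenerator.v1"},
    },
)
entity_linker.set_kb(create_kb)

doc = nlp("Douglas and douglas are not the same.")
print([(ent.text, ent.kb_id_) for ent in doc.ents])
```

With the default, case-dependent generator the capitalised mention would not match the lower-case alias; the custom function makes both mentions resolve to the same candidates.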
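EntityLinker.__call__ method
Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.
| Name | Description |
|---|---|
| doc | The document to process. Doc |
| RETURNS | The processed document. Doc |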
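EntityLinker.pipe method
Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.
| Name | Description |
|---|---|
| stream | A stream of documents. Iterable[Doc] |
| keyword-only | |
| batch_size | The number of documents to buffer. Defaults to 128. int |
| YIELDS | The processed documents in order. Doc |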
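The sketch below calls the component directly, first on a single Doc via __call__ and then on a list via pipe. It assumes an untrained pipeline with incl_context set to False, so only KB prior probabilities are used; the rule-based entity pattern, the KB contents, and the IDs are illustrative.

```python
import spacy
from spacy.kb import InMemoryLookupKB


def create_kb(vocab):
    kb = InMemoryLookupKB(vocab, entity_vector_length=1)
    kb.add_entity(entity="Q42", freq=12, entity_vector=[2.0])
    kb.add_entity(entity="Q463035", freq=5, entity_vector=[3.0])
    kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"],
                 probabilities=[0.8, 0.1])
    return kb


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # the linker needs sentence boundaries
ruler = nlp.add_pipe("entity_ruler")  # stands in for a trained NER component
ruler.add_patterns([{"label": "PERSON", "pattern": [{"LOWER": "douglas"}]}])
entity_linker = nlp.add_pipe("entity_linker", config={"incl_context": False})
entity_linker.set_kb(create_kb)

texts = ["Douglas wrote a famous book.", "Nobody mentions douglas here."]
# Run only the earlier components: sentences and entities, but no links yet.
docs = [nlp(text, disable=["entity_linker"]) for text in texts]

first = entity_linker(docs[0])  # __call__: one Doc, modified in place
linked = list(entity_linker.pipe(docs, batch_size=2))  # pipe: a stream of Docs

for doc in linked:
    print([(ent.text, ent.kb_id_) for ent in doc.ents])
```

The lower-case mention in the second text has no exact alias match in this KB, so it receives the NIL identifier.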
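EntityLinker.set_kb method (v3.0)
The kb_loader should be a function that takes a Vocab instance and creates the KnowledgeBase, ensuring that the strings of the knowledge base are synced with the current vocab.
| Name | Description |
|---|---|
| kb_loader | Function that creates a KnowledgeBase from a Vocab instance. Callable[[Vocab], KnowledgeBase] |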
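A minimal kb_loader might look like the following sketch. The entity ID, frequency, vector, and alias are placeholder values; a real knowledge base would usually be loaded from disk or built from an external resource.

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")
entity_linker = nlp.add_pipe("entity_linker")


def create_kb(vocab):
    # Build the KB on the vocab that is passed in, so strings stay in sync.
    kb = InMemoryLookupKB(vocab, entity_vector_length=3)
    kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 0.0, 0.0])
    kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
    return kb


entity_linker.set_kb(create_kb)
print(entity_linker.kb.get_size_entities(), entity_linker.kb.get_size_aliases())
```

Because the loader receives the component's own vocab, the entity ID string ends up in the pipeline's shared StringStore.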
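EntityLinker.initialize method (v3.0)
Initialize the component for training. get_examples should be a function that returns an iterable of Example objects. At least one example should be supplied. The data examples are used to initialize the model of the component and can either be the full training data or a representative sample. Initialization includes validating the network, inferring missing shapes and setting up the label scheme based on the data. This method is typically called by Language.initialize.
Optionally, a kb_loader argument may be specified to change the internal knowledge base. This argument should be a function that takes a Vocab instance and creates the KnowledgeBase, ensuring that the strings of the knowledge base are synced with the current vocab.
| Name | Description |
|---|---|
| get_examples | Function that returns gold-standard annotations in the form of Example objects. Must contain at least one Example. Callable[[], Iterable[Example]] |
| keyword-only | |
| nlp | The current nlp object. Defaults to None. Optional[Language] |
| kb_loader | Function that creates a KnowledgeBase from a Vocab instance. Callable[[Vocab], KnowledgeBase] |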
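The following sketch calls initialize directly with a single example and a kb_loader. The sentence, entity IDs, and vectors are toy values chosen for illustration; in practice Language.initialize usually makes this call for you.

```python
import spacy
from spacy.kb import InMemoryLookupKB
from spacy.training import Example

nlp = spacy.blank("en")
entity_linker = nlp.add_pipe("entity_linker")


def create_kb(vocab):
    kb = InMemoryLookupKB(vocab, entity_vector_length=3)
    kb.add_entity(entity="Q2146908", freq=12, entity_vector=[6.0, -4.0, 3.0])
    kb.add_entity(entity="Q7381115", freq=12, entity_vector=[9.0, 1.0, -7.0])
    kb.add_alias(alias="Russ Cochran", entities=["Q2146908", "Q7381115"],
                 probabilities=[0.5, 0.5])
    return kb


text = "Russ Cochran captured his first major title with his son as caddie."
annotation = {
    "entities": [(0, 12, "PERSON")],
    # Gold link: which KB ID the mention at characters 0-12 refers to.
    "links": {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}},
    # One value per token; 1 marks a sentence start.
    "sent_starts": [1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
train_examples = [Example.from_dict(nlp.make_doc(text), annotation)]

# kb_loader replaces the empty KB that was created at construction time.
entity_linker.initialize(lambda: train_examples, nlp=nlp, kb_loader=create_kb)
print(entity_linker.model.get_dim("nO"))
```

After initialization the model's output width matches the entity vector length of the loaded knowledge base.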
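EntityLinker.predict method
Apply the component's model to a batch of Doc objects without modifying them. Returns the KB ID for each entity in each doc, including NIL where there is no prediction.
| Name | Description |
|---|---|
| docs | The documents to predict. Iterable[Doc] |
| RETURNS | The predicted KB identifiers for the entities in the docs. List[str] |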
EntityLinker.set_annotations method
Modify a batch of documents, using a pre-computed list of entity IDs for their named entities.
| Name | Description |
|---|---|
| docs | The documents to modify. Iterable[Doc] |
| kb_ids | The knowledge base identifiers for the entities in the docs, predicted by EntityLinker.predict. List[str] |
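The two-step flow of predict followed by set_annotations can be sketched as below. As an assumption for brevity, the component is untrained and runs with incl_context set to False, so the prediction comes from KB prior probabilities only; the KB contents and IDs are illustrative.

```python
import spacy
from spacy.kb import InMemoryLookupKB


def create_kb(vocab):
    kb = InMemoryLookupKB(vocab, entity_vector_length=1)
    kb.add_entity(entity="Q42", freq=12, entity_vector=[2.0])
    kb.add_entity(entity="Q463035", freq=5, entity_vector=[3.0])
    kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"],
                 probabilities=[0.8, 0.1])
    return kb


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": [{"LOWER": "douglas"}]}])
entity_linker = nlp.add_pipe("entity_linker", config={"incl_context": False})
entity_linker.set_kb(create_kb)

# Sentences and entities are set, but the linker has not run yet.
doc = nlp("Douglas wrote a famous book.", disable=["entity_linker"])

kb_ids = entity_linker.predict([doc])  # one KB ID per entity, docs untouched
before = doc[0].ent_kb_id_             # still empty at this point
entity_linker.set_annotations([doc], kb_ids)
print(kb_ids, repr(before), doc[0].ent_kb_id_)
```

Keeping the two steps separate makes it possible to inspect or post-process the predicted IDs before they are written to the documents.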
EntityLinker.update method
Learn from a batch of Example objects, updating both the pipe's entity linking model and context encoder. Delegates to predict.
| Name | Description |
|---|---|
| examples | A batch of Example objects to learn from. Iterable[Example] |
| keyword-only | |
| drop | The dropout rate. float |
| sgd | An optimizer. Will be created via create_optimizer if not set. Optional[Optimizer] |
| losses | Optional record of the loss during training. Updated using the component name as the key. Optional[Dict[str, float]] |
| RETURNS | The updated losses dictionary. Dict[str, float] |
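A tiny training loop might look like the sketch below. The two sentences, the entity IDs, and the vectors are toy data; the number of iterations is arbitrary and far too small a setup to say anything about accuracy.

```python
import spacy
from spacy.kb import InMemoryLookupKB
from spacy.training import Example

TRAIN_DATA = [
    ("Russ Cochran captured his first major title with his son as caddie.",
     {"links": {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}},
      "entities": [(0, 12, "PERSON")],
      "sent_starts": [1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}),
    ("Russ Cochran his reprints include EC Comics.",
     {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}},
      "entities": [(0, 12, "PERSON"), (34, 43, "ART")],
      "sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]}),
]


def create_kb(vocab):
    kb = InMemoryLookupKB(vocab, entity_vector_length=3)
    kb.add_entity(entity="Q2146908", freq=12, entity_vector=[6.0, -4.0, 3.0])
    kb.add_entity(entity="Q7381115", freq=12, entity_vector=[9.0, 1.0, -7.0])
    kb.add_alias(alias="Russ Cochran", entities=["Q2146908", "Q7381115"],
                 probabilities=[0.5, 0.5])
    return kb


nlp = spacy.blank("en")
train_examples = [Example.from_dict(nlp.make_doc(text), annots)
                  for text, annots in TRAIN_DATA]

entity_linker = nlp.add_pipe("entity_linker")
entity_linker.set_kb(create_kb)
optimizer = nlp.initialize(get_examples=lambda: train_examples)

losses = {}
for _ in range(10):
    # The loss accumulates under the component name, here "entity_linker".
    entity_linker.update(train_examples, drop=0.0, sgd=optimizer, losses=losses)
print(losses)
```

In a full pipeline, Language.update would normally drive this call for every trainable component.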
EntityLinker.create_optimizer method
Create an optimizer for the pipeline component.
| Name | Description |
|---|---|
| RETURNS | The optimizer. Optimizer |
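EntityLinker.use_params method (context manager)
Modify the pipe's model to use the given parameter values. At the end of the context, the original parameters are restored.
| Name | Description |
|---|---|
| params | The parameter values to use in the model. dict |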
EntityLinker.to_disk method
Serialize the pipe to disk.
| Name | Description |
|---|---|
| path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. Union[str, Path] |
| keyword-only | |
| exclude | String names of serialization fields to exclude. Iterable[str] |
EntityLinker.from_disk method
Load the pipe from disk. Modifies the object in place and returns it.
| Name | Description |
|---|---|
| path | A path to a directory. Paths may be either strings or Path-like objects. Union[str, Path] |
| keyword-only | |
| exclude | String names of serialization fields to exclude. Iterable[str] |
| RETURNS | The modified EntityLinker object. EntityLinker |
EntityLinker.to_bytes method
Serialize the pipe to a bytestring, including the KnowledgeBase.
| Name | Description |
|---|---|
| keyword-only | |
| exclude | String names of serialization fields to exclude. Iterable[str] |
| RETURNS | The serialized form of the EntityLinker object. bytes |
EntityLinker.from_bytes method
Load the pipe from a bytestring. Modifies the object in place and returns it.
| Name | Description |
|---|---|
| bytes_data | The data to load from. bytes |
| keyword-only | |
| exclude | String names of serialization fields to exclude. Iterable[str] |
| RETURNS | The EntityLinker object. EntityLinker |
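A bytes round trip could be sketched as follows. The second component is added to the same pipeline under the hypothetical name "restored_linker" so that both share one vocab, which is why "vocab" is excluded; the KB contents are placeholder values.

```python
import spacy
from spacy.kb import InMemoryLookupKB


def create_kb(vocab):
    kb = InMemoryLookupKB(vocab, entity_vector_length=3)
    kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 0.0, 0.0])
    kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
    return kb


nlp = spacy.blank("en")
entity_linker = nlp.add_pipe("entity_linker")
entity_linker.set_kb(create_kb)

# The vocab is shared between both components, so it is left out here.
data = entity_linker.to_bytes(exclude=["vocab"])

restored = nlp.add_pipe("entity_linker", name="restored_linker")
restored.from_bytes(data, exclude=["vocab"])
print(restored.kb.get_size_entities(), restored.kb.contains_alias("Douglas Adams"))
```

The knowledge base travels with the component, so the restored linker does not need a separate set_kb call.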
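Serialization fields
During serialization, spaCy exports several data fields that are used to restore different aspects of the object. If needed, you can exclude them from serialization by passing their string names via the exclude argument.
| Name | Description |
|---|---|
| vocab | The shared Vocab. |
| cfg | The config file. You usually don't want to exclude this. |
| model | The binary model data. You usually don't want to exclude this. |
| kb | The knowledge base. You usually don't want to exclude this. |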