EntityRuler

class

String name: entity_ruler	Trainable: no

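Pipeline component for rule-based named entity recognition

The entity ruler lets you add spans to Doc.ents using token-based rules or exact phrase matches. It can be combined with the statistical EntityRecognizer to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. For usage examples, see the docs on rule-based entity recognition.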

Assigned Attributes

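This component assigns predictions in essentially the same way as the EntityRecognizer.

Predictions can be accessed under Doc.ents as a tuple. Each label is also reflected on each underlying token, where it is saved in the Token.ent_type and Token.ent_iob fields. Note that by definition, each token can only have one label.

When setting Doc.ents to create training data, all the spans must be valid and non-overlapping, or an error will be thrown.

Location	Value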
`Doc.ents`	The annotated spans. Tuple[Span]
`Token.ent_iob`	An enum encoding of the IOB part of the named entity tag. int
`Token.ent_iob_`	The IOB part of the named entity tag. str
`Token.ent_type`	The label part of the named entity tag (hash). int
`Token.ent_type_`	The label part of the named entity tag. str
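
As a minimal sketch of how these attributes look in practice, the snippet below runs a blank English pipeline with a single phrase pattern and reads the result at both span and token level. The sentence and the ORG pattern are illustrative assumptions, not part of the reference above.

```python
import spacy

# Blank English pipeline: tokenizer only, so no trained model is assumed
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])  # illustrative pattern

doc = nlp("Apple opened an office")

# Span-level view: Doc.ents is a tuple of Span objects
ents = [(ent.text, ent.label_) for ent in doc.ents]

# Token-level view of the first token: string and integer forms
first = doc[0]
token_view = (first.ent_iob_, first.ent_type_, first.ent_iob, first.ent_type)

print(ents)
print(token_view)
```

The underscore-suffixed attributes return strings, while the plain ones return the integer codes listed in the table.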

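Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training.

Setting	Description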
`phrase_matcher_attr`	Optional attribute name to match on for the internal `PhraseMatcher`, e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. Optional[Union[int, str]]
`matcher_fuzzy_compare` v3.5	The fuzzy comparison method, passed on to the internal `Matcher`. Defaults to `spacy.matcher.levenshtein.levenshtein_compare`. Callable
`validate`	Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. bool
`overwrite_ents`	If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. bool
`ent_id_sep`	Separator used internally for entity IDs. Defaults to `"||"`. str
`scorer`	The scoring method. Defaults to `spacy.scorer.get_ner_prf`. Optional[Callable]
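
The settings above can be overridden when the component is added. The following is a minimal sketch, assuming a blank English pipeline and an illustrative lowercase pattern, that sets `phrase_matcher_attr` to `LOWER` so phrase patterns match regardless of casing.

```python
import spacy

nlp = spacy.blank("en")

# Override two of the default settings via the config argument
config = {"phrase_matcher_attr": "LOWER", "overwrite_ents": True}
ruler = nlp.add_pipe("entity_ruler", config=config)

# Illustrative pattern; with LOWER it is compared on lowercase token text
ruler.add_patterns([{"label": "ORG", "pattern": "apple"}])

doc = nlp("APPLE and Apple are the same company")
matched = [ent.text for ent in doc.ents]
print(matched)
```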

explosion/spaCy/master/spacy/pipeline/entityruler.py

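EntityRuler.__init__ method

Initialize the entity ruler. If patterns are supplied here, they need to be a list of dictionaries with a "label" and "pattern" key. A pattern can either be a token pattern (list) or a phrase pattern (string). For example: {"label": "ORG", "pattern": "Apple"}.

Name	Description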
`nlp`	The shared nlp object to pass the vocab to the matchers and process phrase patterns. Language
`name` v3.0	Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. str
keyword-only
`phrase_matcher_attr`	Optional attribute name to match on for the internal `PhraseMatcher`, e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. Optional[Union[int, str]]
`matcher_fuzzy_compare` v3.5	The fuzzy comparison method, passed on to the internal `Matcher`. Defaults to `spacy.matcher.levenshtein.levenshtein_compare`. Callable
`validate`	Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. bool
`overwrite_ents`	If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. bool
`ent_id_sep`	Separator used internally for entity IDs. Defaults to `"||"`. str
`patterns`	Optional patterns to load in on initialization. Optional[List[Dict[str, Union[str, List[dict]]]]]
`scorer`	The scoring method. Defaults to `spacy.scorer.get_ner_prf`. Optional[Callable]
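
For illustration, the component can also be constructed directly instead of through nlp.add_pipe. This sketch assumes spaCy v3, where `EntityRuler` is importable from `spacy.pipeline`. The patterns and sentence are made up for the example.

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")

# One phrase pattern (string) and one token pattern (list of dicts)
patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]

# Patterns passed at construction time are added immediately
ruler = EntityRuler(nlp, overwrite_ents=True, patterns=patterns)

# The ruler is not part of nlp.pipeline here, so it is applied by hand
doc = ruler(nlp.make_doc("Apple has an office in San Francisco"))
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

In most pipelines, nlp.add_pipe("entity_ruler") is the more common route, since the factory passes `nlp` and `name` in automatically.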

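EntityRuler.initialize method v3.0

Initialize the component with data. For the entity ruler, this is used before training to load in rules from a pattern file. This method is typically called by Language.initialize and lets you customize the arguments it receives via the [initialize.components] block in the config.

Name	Description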
`get_examples`	Function that returns gold-standard annotations in the form of `Example` objects. Not used by the `EntityRuler`. Callable[[], Iterable[Example]]
keyword-only
`nlp`	The current `nlp` object. Defaults to `None`. Optional[Language]
`patterns`	The list of patterns. Defaults to `None`. Optional[Sequence[Dict[str, Union[str, List[Dict[str, Any]]]]]]
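
A minimal sketch of calling the method by hand, assuming a blank English pipeline. In a real training run, Language.initialize would normally make this call. The `lambda: []` stands in for `get_examples`, which the signature requires but the entity ruler does not use.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [{"label": "ORG", "pattern": "Apple"}]  # illustrative pattern

# get_examples is required positionally but ignored by the entity ruler
ruler.initialize(lambda: [], nlp=nlp, patterns=patterns)

n_patterns = len(ruler)
doc = nlp("Apple is hiring")
print(n_patterns, [(ent.text, ent.label_) for ent in doc.ents])
```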

EntityRuler.__len__ method

The number of all patterns added to the entity ruler.

Name	Description
RETURNS	The number of patterns. int

EntityRuler.__contains__ method

Whether a label is present in the patterns.

Name	Description
`label`	The label to check. str
RETURNS	Whether the entity ruler contains the label. bool
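
Both special methods are reached through ordinary Python syntax: `len(ruler)` and `label in ruler`. A short sketch, assuming a blank English pipeline and illustrative patterns without IDs:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

empty_len = len(ruler)  # no patterns added yet

ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

pattern_count = len(ruler)       # counts token and phrase patterns together
has_org = "ORG" in ruler         # label used by a pattern
has_person = "PERSON" in ruler   # label not used by any pattern
print(empty_len, pattern_count, has_org, has_person)
```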

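EntityRuler.__call__ method

Find matches in the Doc and add them to doc.ents. Typically, this happens automatically after the component has been added to the pipeline using nlp.add_pipe. If the entity ruler was initialized with overwrite_ents=True, existing entities will be replaced if they overlap with the matches. When matches overlap in a Doc, the entity ruler prioritizes longer patterns over shorter ones, and if they are of equal length, the match occurring first in the Doc is chosen.

Name	Description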
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. Doc
RETURNS	The modified `Doc` with added entities, if available. Doc
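
The overlap rules described above can be sketched as follows. The example assumes a blank English pipeline, two made-up overlapping patterns, and a hand-set PERSON entity standing in for a prediction from an upstream model.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
ruler.add_patterns([
    {"label": "GPE", "pattern": "San Francisco"},
    {"label": "ORG", "pattern": "San Francisco Giants"},
])

doc = nlp.make_doc("The San Francisco Giants won again")

# Pre-existing entity on "San", standing in for a model prediction
doc.ents = [Span(doc, 1, 2, label="PERSON")]

# Calling the component directly; nlp(text) would do this automatically
doc = ruler(doc)
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Both patterns match, but the longer ORG match wins over the GPE match, and because overwrite_ents is enabled it also replaces the overlapping PERSON entity.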

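EntityRuler.add_patterns method

Add patterns to the entity ruler. A pattern can either be a token pattern (list of dicts) or a phrase pattern (string). For more details, see the usage guide on rule-based matching.

Name	Description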
`patterns`	The patterns to add. List[Dict[str, Union[str, List[dict]]]]
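
As a rough illustration of the two pattern kinds, the sketch below adds one phrase pattern and one token pattern in separate calls. The labels, patterns, and sentence are assumptions chosen for the example.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Phrase pattern: a plain string, matched as an exact token sequence
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

# Token pattern: one dict of token attributes per token
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": [{"LOWER": "python"}, {"IS_DIGIT": True}]},
])

doc = nlp("We migrated to Python 3 at Apple")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Patterns accumulate across calls, so later calls extend the rule set rather than replacing it.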

EntityRuler.remove method v3.2.1

Remove a pattern from the entity ruler by its ID. A ValueError is raised if the ID does not exist.

Name	Description
`id`	The ID of the pattern rule. str
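
A small sketch of removing a pattern by ID, assuming spaCy v3.2.1 or later and two illustrative patterns that carry an "id" key. The ID "missing-id" is deliberately not defined, to show the error case.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": "Berlin", "id": "berlin"},
])

ruler.remove("apple")
remaining_ids = set(ruler.ent_ids)

doc = nlp("Apple opened a store in Berlin")
remaining = [(ent.text, ent.label_) for ent in doc.ents]

# Removing an ID that was never added raises ValueError
try:
    ruler.remove("missing-id")
    raised = False
except ValueError:
    raised = True

print(remaining_ids, remaining, raised)
```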

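EntityRuler.to_disk method

Save the entity ruler patterns to a directory. The patterns will be saved as newline-delimited JSON (JSONL). If a file with the suffix .jsonl is provided, only the patterns are saved as JSONL. If a directory name is provided, a patterns.jsonl file and a cfg file with the component configuration are exported.

Name	Description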
`path`	A path to a JSONL file or directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]

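EntityRuler.from_disk method

Load the entity ruler from a path. Expects either a file containing newline-delimited JSON (JSONL) with one entry per line, or a directory containing a patterns.jsonl file and a cfg file with the component configuration.

Name	Description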
`path`	A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
RETURNS	The modified `EntityRuler` object. EntityRuler
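
A minimal round trip through a JSONL file might look like the following sketch. It assumes a temporary directory and a file name chosen for the example, and it loads the patterns into a second, separately created pipeline.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

with tempfile.TemporaryDirectory() as tmp:
    # A .jsonl suffix means only the patterns are written, without the cfg
    jsonl_path = Path(tmp) / "patterns.jsonl"
    ruler.to_disk(jsonl_path)
    file_exists = jsonl_path.is_file()

    # Load the saved patterns into a fresh entity ruler
    other_nlp = spacy.blank("en")
    restored = other_nlp.add_pipe("entity_ruler").from_disk(jsonl_path)
    restored_patterns = restored.patterns

print(file_exists, restored_patterns)
```

Passing a directory instead of a .jsonl file would also store the component settings, such as overwrite_ents, alongside the patterns.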

EntityRuler.to_bytes method

Serialize the entity ruler patterns to a bytestring.

Name	Description
RETURNS	The serialized patterns. bytes

EntityRuler.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name	Description
`bytes_data`	The bytestring to load. bytes
RETURNS	The modified `EntityRuler` object. EntityRuler
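
The two byte-level methods can be combined into an in-memory round trip. This is a sketch under the assumption of a blank English pipeline and one illustrative pattern.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

# Serialize the patterns to a bytestring
data = ruler.to_bytes()

# from_bytes modifies the new ruler in place and returns it
restored = spacy.blank("en").add_pipe("entity_ruler").from_bytes(data)

print(type(data).__name__, restored.patterns)
```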

EntityRuler.labels property

All labels present in the match patterns.

Name	Description
RETURNS	The string labels. Tuple[str, …]

EntityRuler.ent_ids property

All entity IDs present in the id properties of the match patterns.

Name	Description
RETURNS	The string IDs. Tuple[str, …]
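
A short sketch of reading both properties, assuming three illustrative patterns of which only two carry an "id". The results are compared as sets because the example makes no assumption about tuple order.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "ORG", "pattern": "Microsoft", "id": "microsoft"},
    {"label": "GPE", "pattern": "Berlin"},  # no "id" key
])

labels = ruler.labels    # unique labels across all patterns
ent_ids = ruler.ent_ids  # only IDs that were actually set on patterns

print(set(labels), set(ent_ids))
```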

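EntityRuler.patterns property

Get all patterns that were added to the entity ruler.

Name	Description
RETURNS	The original patterns, one dictionary per pattern. List[Dict[str, Union[str, dict]]]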
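
As an illustration, the sketch below adds two made-up patterns and reads them back. It checks membership rather than list order, since the order in which the property returns the patterns is not assumed here.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)

# Each stored pattern comes back as a dictionary
stored = ruler.patterns
all_present = all(p in stored for p in patterns)
print(len(stored), all_present)
```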

Attributes

Name	Description
`matcher`	The underlying matcher used to process token patterns. Matcher
`phrase_matcher`	The underlying phrase matcher used to process phrase patterns. PhraseMatcher
`token_patterns`	The token patterns present in the entity ruler, keyed by label. Dict[str, List[Dict[str, Union[str, List[dict]]]]]
`phrase_patterns`	The phrase patterns present in the entity ruler, keyed by label. Dict[str, List[Doc]]
