SpanRuler
The span ruler lets you add spans to Doc.spans and/or Doc.ents using token-based rules or exact phrase matches. For usage examples, see the docs on rule-based span matching.
Assigned Attributes
Matches will be saved to Doc.spans[spans_key] as a SpanGroup and/or to Doc.ents, where the annotation is stored in the Token.ent_type and Token.ent_iob fields.
| Location | Value |
|---|---|
| `Doc.spans[spans_key]` | The annotated spans. `SpanGroup` |
| `Doc.ents` | The annotated spans. `Tuple[Span]` |
| `Token.ent_iob` | An enum encoding of the IOB part of the named entity tag. `int` |
| `Token.ent_iob_` | The IOB part of the named entity tag. `str` |
| `Token.ent_type` | The label part of the named entity tag (hash). `int` |
| `Token.ent_type_` | The label part of the named entity tag. `str` |
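The following is a minimal sketch of where the annotations end up, assuming a blank English pipeline created with `spacy.blank("en")` and `annotate_ents` switched on. The example sentence and patterns are illustrative only.

```python
import spacy

nlp = spacy.blank("en")
# annotate_ents=True additionally writes the matches to doc.ents
ruler = nlp.add_pipe("span_ruler", config={"annotate_ents": True})
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("Apple is opening its first big office in San Francisco.")

# Spans stored under the default spans key "ruler"
span_group = [(span.text, span.label_) for span in doc.spans["ruler"]]
# The same matches as entities
entities = [(ent.text, ent.label_) for ent in doc.ents]
# Token-level view of the entity annotation
token_tags = [(t.text, t.ent_iob_, t.ent_type_) for t in doc if t.ent_type_]
```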
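Config and implementation
The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg.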
| Setting | Description |
|---|---|
| `spans_key` | The spans key to save the spans under. If None, no spans are saved. Defaults to "ruler". `Optional[str]` |
| `spans_filter` | The optional method to filter spans before they are assigned to doc.spans. Defaults to None. `Optional[Callable[[Iterable[Span], Iterable[Span]], List[Span]]]` |
| `annotate_ents` | Whether to save spans to doc.ents. Defaults to False. `bool` |
| `ents_filter` | The method to filter spans before they are assigned to doc.ents. Defaults to util.filter_chain_spans. `Callable[[Iterable[Span], Iterable[Span]], List[Span]]` |
| `phrase_matcher_attr` | Token attribute to match on, passed to the internal PhraseMatcher as attr. Defaults to None. `Optional[Union[int, str]]` |
| `matcher_fuzzy_compare` (v3.5) | The fuzzy comparison method, passed on to the internal Matcher. Defaults to spacy.matcher.levenshtein.levenshtein_compare. `Callable` |
| `validate` | Whether patterns should be validated, passed to Matcher and PhraseMatcher as validate. Defaults to False. `bool` |
| `overwrite` | Whether to remove any existing spans under Doc.spans[spans_key] if spans_key is set, or to remove any ents under Doc.ents if annotate_ents is set. Defaults to True. `bool` |
| `scorer` | The scoring method. Defaults to Scorer.score_spans for Doc.spans[spans_key] with overlapping spans allowed. `Optional[Callable]` |
explosion/spaCy/master/spacy/pipeline/span_ruler.py
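As a rough illustration of overriding these settings through `nlp.add_pipe`, the sketch below uses a custom spans key. The key name `"my_spans"` and the sample text are assumptions made for the example.

```python
import spacy

nlp = spacy.blank("en")
# Override selected defaults; all other settings keep the factory defaults
config = {"spans_key": "my_spans", "validate": True, "overwrite": False}
ruler = nlp.add_pipe("span_ruler", config=config)
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

doc = nlp("Apple opened an office.")
matched = [(span.text, span.label_) for span in doc.spans["my_spans"]]
```

The same keys could be set in the component's block of a config.cfg instead of being passed in code.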
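SpanRuler.__init__ method
Initialize the span ruler. If patterns are supplied here, they need to be a list of dictionaries with a "label" and "pattern" key. A pattern can either be a token pattern (list) or a phrase pattern (string). For example:
{"label": "ORG", "pattern": "Apple"}.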
| Name | Description |
|---|---|
| `nlp` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. `Language` |
| `name` | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. Used to disable the current span ruler while creating phrase patterns with the nlp object. `str` |
| *keyword-only* | |
| `spans_key` | The spans key to save the spans under. If None, no spans are saved. Defaults to "ruler". `Optional[str]` |
| `spans_filter` | The optional method to filter spans before they are assigned to doc.spans. Defaults to None. `Optional[Callable[[Iterable[Span], Iterable[Span]], List[Span]]]` |
| `annotate_ents` | Whether to save spans to doc.ents. Defaults to False. `bool` |
| `ents_filter` | The method to filter spans before they are assigned to doc.ents. Defaults to util.filter_chain_spans. `Callable[[Iterable[Span], Iterable[Span]], List[Span]]` |
| `phrase_matcher_attr` | Token attribute to match on, passed to the internal PhraseMatcher as attr. Defaults to None. `Optional[Union[int, str]]` |
| `matcher_fuzzy_compare` (v3.5) | The fuzzy comparison method, passed on to the internal Matcher. Defaults to spacy.matcher.levenshtein.levenshtein_compare. `Callable` |
| `validate` | Whether patterns should be validated, passed to Matcher and PhraseMatcher as validate. Defaults to False. `bool` |
| `overwrite` | Whether to remove any existing spans under Doc.spans[spans_key] if spans_key is set, or to remove any ents under Doc.ents if annotate_ents is set. Defaults to True. `bool` |
| `scorer` | The scoring method. Defaults to Scorer.score_spans for Doc.spans[spans_key] with overlapping spans allowed. `Optional[Callable]` |
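A small sketch of constructing the component directly from the class rather than through `nlp.add_pipe("span_ruler")`, which is the more common route. It assumes a blank English pipeline; the variable name `standalone_ruler` and the sample text are illustrative.

```python
import spacy
from spacy.pipeline import SpanRuler

nlp = spacy.blank("en")

# Construct the ruler from the class; it is not part of nlp.pipeline here
standalone_ruler = SpanRuler(nlp, overwrite=True)
standalone_ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

# Apply it manually to a tokenized Doc
doc = standalone_ruler(nlp.make_doc("Apple is hiring."))
matched = [(span.text, span.label_) for span in doc.spans["ruler"]]
```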
SpanRuler.initialize method
Initialize the component with data. This method is used before training to load in rules from a pattern file. It is typically called by Language.initialize and lets you customize the arguments it receives via the [initialize.components] block in the config. Any existing patterns are removed on initialization.
| Name | Description |
|---|---|
| `get_examples` | Function that returns gold-standard annotations in the form of Example objects. Not used by the SpanRuler. `Callable[[], Iterable[Example]]` |
| *keyword-only* | |
| `nlp` | The current nlp object. Defaults to None. `Optional[Language]` |
| `patterns` | The list of patterns. Defaults to None. `Optional[Sequence[Dict[str, Union[str, List[Dict[str, Any]]]]]]` |
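The sketch below calls `initialize` by hand to show that existing patterns are replaced. In practice the call would usually come from Language.initialize; the placeholder pattern and the empty `get_examples` callback are assumptions made for the example.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([{"label": "OLD", "pattern": "placeholder"}])

patterns = [{"label": "ORG", "pattern": "Apple"}]
# get_examples is part of the signature but is not used by the SpanRuler,
# so a callback returning an empty list is enough here
ruler.initialize(lambda: [], patterns=patterns)

labels_after = ruler.labels
```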
SpanRuler.__len__ method
The number of all patterns added to the span ruler.
| Name | Description |
|---|---|
| RETURNS | The number of patterns. `int` |
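A brief sketch, assuming a blank English pipeline, of how the length changes as patterns are added:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
count_before = len(ruler)  # no patterns yet

ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
count_after = len(ruler)  # one pattern added
```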
SpanRuler.__contains__ method
Whether a label is present in the patterns.
| Name | Description |
|---|---|
| `label` | The label to check. `str` |
| RETURNS | Whether the span ruler contains the label. `bool` |
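A short sketch of the membership test, assuming a blank English pipeline and illustrative labels:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

has_org = "ORG" in ruler        # label used by a pattern
has_person = "PERSON" in ruler  # label never added
```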
SpanRuler.__call__ method
Find matches in the Doc and add them to doc.spans[spans_key] and/or doc.ents. Typically, this happens automatically after the component has been added to the pipeline using nlp.add_pipe. If the span ruler was initialized with overwrite=True, existing spans and entities will be removed.
| Name | Description |
|---|---|
| `doc` | The Doc object to process, e.g. the Doc in the pipeline. `Doc` |
| RETURNS | The modified Doc with added spans/entities. `Doc` |
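The sketch below shows both the automatic call through the pipeline and a manual call on a Doc that already carries a span under the same key. It assumes a blank English pipeline with `overwrite` set explicitly to True; the texts are illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler", config={"overwrite": True})
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

# Automatic: the component runs as part of the pipeline
doc = nlp("A text about Apple.")
pipeline_spans = [(span.text, span.label_) for span in doc.spans["ruler"]]

# Manual: a pre-existing span under the same key is dropped with overwrite=True
manual_doc = nlp.make_doc("A text about Apple.")
manual_doc.spans["ruler"] = [manual_doc[0:1]]
manual_doc = ruler(manual_doc)
manual_spans = [(span.text, span.label_) for span in manual_doc.spans["ruler"]]
```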
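SpanRuler.add_patterns method
Add patterns to the span ruler. A pattern can either be a token pattern (list of dicts) or a phrase pattern (string). For more details, see the usage guide on rule-based matching.
| Name | Description |
|---|---|
| `patterns` | The patterns to add. `List[Dict[str, Union[str, List[dict]]]]` |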
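A minimal sketch mixing the two pattern kinds, assuming a blank English pipeline. The labels and texts are illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")

patterns = [
    # Phrase pattern: a plain string, matched exactly
    {"label": "ORG", "pattern": "Apple"},
    # Token pattern: one dict per token, here case-insensitive
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Apple has an office in san francisco.")
matched = [(span.text, span.label_) for span in doc.spans["ruler"]]
```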
SpanRuler.remove method
Remove patterns from the span ruler by label. A ValueError is raised if the label does not exist in any patterns.
| Name | Description |
|---|---|
| `label` | The label of the pattern rule. `str` |
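A sketch of removing by label, including the error case for a label that was never added. It assumes a blank English pipeline; the labels are illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": "Berlin"},
])

ruler.remove("ORG")
remaining_labels = ruler.labels

# Removing a label that no pattern uses raises ValueError
try:
    ruler.remove("PERSON")
    raised = False
except ValueError:
    raised = True
```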
SpanRuler.remove_by_id method
Remove patterns from the span ruler by ID. A ValueError is raised if the ID does not exist in any patterns.
| Name | Description |
|---|---|
| `pattern_id` | The ID of the pattern rule. `str` |
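A sketch of removing by ID, assuming patterns that carry an optional `"id"` key. The ID values are illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": "Berlin", "id": "berlin"},
])

ruler.remove_by_id("apple")
remaining_ids = ruler.ids

# An unknown ID raises ValueError
try:
    ruler.remove_by_id("missing")
    raised = False
except ValueError:
    raised = True
```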
SpanRuler.clear method
Remove all patterns from the span ruler.
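A brief sketch, assuming a blank English pipeline, showing that the ruler is empty after clearing:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

ruler.clear()
count_after_clear = len(ruler)
labels_after_clear = ruler.labels
```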
SpanRuler.to_disk method
Save the span ruler patterns to a directory. The patterns will be saved as newline-delimited JSON (JSONL).
| Name | Description |
|---|---|
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. `Union[str, Path]` |
SpanRuler.from_disk method
Load the span ruler from a path.
| Name | Description |
|---|---|
| `path` | A path to a directory. Paths may be either strings or Path-like objects. `Union[str, Path]` |
| RETURNS | The modified SpanRuler object. `SpanRuler` |
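A round-trip sketch through a temporary directory, assuming a blank English pipeline. The directory name `span_ruler` is an arbitrary choice for the example.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "span_ruler"
    # Writes the patterns to the directory, creating it if needed
    ruler.to_disk(path)

    # Load the patterns into a fresh ruler in a separate pipeline
    new_nlp = spacy.blank("en")
    new_ruler = new_nlp.add_pipe("span_ruler")
    new_ruler.from_disk(path)

loaded_patterns = new_ruler.patterns
```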
SpanRuler.to_bytes method
Serialize the span ruler to a bytestring.
| Name | Description |
|---|---|
| RETURNS | The serialized patterns. `bytes` |
SpanRuler.from_bytes method
Load the pipe from a bytestring. Modifies the object in place and returns it.
| Name | Description |
|---|---|
| `bytes_data` | The bytestring to load. `bytes` |
| RETURNS | The modified SpanRuler object. `SpanRuler` |
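A round-trip sketch through a bytestring, assuming a blank English pipeline and an illustrative pattern:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

# Serialize the patterns
ruler_bytes = ruler.to_bytes()

# Restore them into a fresh ruler; from_bytes returns the modified object
new_ruler = spacy.blank("en").add_pipe("span_ruler")
restored = new_ruler.from_bytes(ruler_bytes)
```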
SpanRuler.labels property
All labels present in the match patterns.
| Name | Description |
|---|---|
| RETURNS | The string labels. `Tuple[str, ...]` |
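A short sketch of reading the labels, assuming a blank English pipeline. A label shared by several patterns is reported once.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "ORG", "pattern": "Google"},
    {"label": "GPE", "pattern": "Berlin"},
])

labels = ruler.labels
```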
SpanRuler.ids property
All IDs present in the id property of the match patterns.
| Name | Description |
|---|---|
| RETURNS | The string IDs. `Tuple[str, ...]` |
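A short sketch of reading the IDs, assuming a blank English pipeline. Patterns without an `"id"` key do not contribute an entry; the ID value is illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": "Berlin"},  # no ID on this pattern
])

ids = ruler.ids
```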
SpanRuler.patterns property
All patterns that were added to the span ruler.
| Name | Description |
|---|---|
| RETURNS | The original patterns, one dictionary per pattern. `List[Dict[str, Union[str, dict]]]` |
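A short sketch, assuming a blank English pipeline, showing that the property returns the pattern dictionaries that were added:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")

patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "berlin"}]},
]
ruler.add_patterns(patterns)

stored = ruler.patterns
```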
Attributes
| Name | Description |
|---|---|
| `key` | The spans key that spans are saved under. `Optional[str]` |
| `matcher` | The underlying matcher used to process token patterns. `Matcher` |
| `phrase_matcher` | The underlying phrase matcher used to process phrase patterns. `PhraseMatcher` |