Matcher

class
Match sequences of tokens, based on pattern rules

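The Matcher lets you find words and phrases using rules that describe their token attributes. Rules can refer to token annotations (such as the text or part-of-speech tags) as well as lexical attributes like Token.is_punct. Applying the matcher to a Doc gives you access to the matched tokens in context. For in-depth examples and workflows that combine rules and statistical models, see the usage guide on rule-based matching.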

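Pattern format

A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes. The available token pattern keys correspond to a number of Token attributes. The attributes supported for rule-based matching are: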

| Attribute | Description | Type |
| --- | --- | --- |
| ORTH | The exact verbatim text of a token. | str |
| TEXT | The exact verbatim text of a token. | str |
| NORM | The normalized form of the token text. | str |
| LOWER | The lowercase form of the token text. | str |
| LENGTH | The length of the token text. | int |
| IS_ALPHA, IS_ASCII, IS_DIGIT | Token text consists of alphabetic characters, ASCII characters, digits. | bool |
| IS_LOWER, IS_UPPER, IS_TITLE | Token text is in lowercase, uppercase, titlecase. | bool |
| IS_PUNCT, IS_SPACE, IS_STOP | Token is punctuation, whitespace, stop word. | bool |
| IS_SENT_START | Token is start of sentence. | bool |
| LIKE_NUM, LIKE_URL, LIKE_EMAIL | Token text resembles a number, URL, email. | bool |
| SPACY | Token has a trailing space. | bool |
| POS, TAG, MORPH, DEP, LEMMA, SHAPE | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. | str |
| ENT_TYPE | The token's entity label. | str |
| ENT_IOB | The IOB part of the token's entity tag. | str |
| ENT_ID | The token's entity ID (ent_id). | str |
| ENT_KB_ID | The token's entity knowledge base ID (ent_kb_id). | str |
| _ | Properties in custom extension attributes. | Dict[str, Any] |
| OP | Operator or quantifier to determine how often to match a token pattern. | str |
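
As a minimal sketch of how the attribute keys above combine into a pattern, the following uses a blank English pipeline (an assumption made so that no trained model is needed; only lexical attributes such as LOWER and IS_PUNCT are available this way). The rule name "HelloWorld" and the sample sentence are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline provides a tokenizer and a vocab without a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "hello" in any casing, a punctuation token,
# then "world" in any casing.
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

Only the first greeting should match here, because the second one has no punctuation token between the two words.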

Operators and quantifiers define how often a token pattern should be matched:

| OP | Description |
| --- | --- |
| ! | Negate the pattern, by requiring it to match exactly 0 times. |
| ? | Make the pattern optional, by allowing it to match 0 or 1 times. |
| + | Require the pattern to match 1 or more times. |
| * | Allow the pattern to match 0 or more times. |
| {n} | Require the pattern to match exactly n times. |
| {n,m} | Require the pattern to match at least n but not more than m times. |
| {n,} | Require the pattern to match at least n times. |
| {,m} | Require the pattern to match at most m times. |
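
The operators can be tried out with a short sketch like the one below. It assumes a blank English pipeline and uses only the "?" and "+" operators; the rule names and the sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "?" makes the punctuation token optional (0 or 1 times).
matcher.add(
    "OptionalPunct",
    [[{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]],
)
# "+" requires "very" to occur one or more times before "good".
matcher.add("Intensified", [[{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]])

doc = nlp("Hello, world! Hello world! That is very very good.")

# Collect (rule name, matched text) pairs; a set guards against duplicates.
results = sorted(
    {(nlp.vocab.strings[match_id], doc[start:end].text)
     for match_id, start, end in matcher(doc)}
)
for rule, text in results:
    print(rule, "->", text)
```

Without a greedy filter, the "+" rule reports every span that satisfies it, so both "very good" and "very very good" are returned.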

Token patterns can also map to a dictionary of properties instead of a single value, to indicate whether the expected value is a member of a list or how it compares to another value.

| Attribute | Description | Type |
| --- | --- | --- |
| REGEX | Attribute value matches the regular expression at any position in the string. | Any |
| FUZZY | Attribute value matches if the fuzzy_compare method matches for (value, pattern, -1). The default method allows a Levenshtein edit distance of at least 2 and up to 30% of the pattern string length. | Any |
| FUZZY1, FUZZY2, … FUZZY9 | Attribute value matches if the fuzzy_compare method matches for (value, pattern, N). The default method allows a Levenshtein edit distance of at most N (1-9). | Any |
| IN | Attribute value is member of a list. | Any |
| NOT_IN | Attribute value is not member of a list. | Any |
| IS_SUBSET | Attribute value (for MORPH or custom list attributes) is a subset of a list. | Any |
| IS_SUPERSET | Attribute value (for MORPH or custom list attributes) is a superset of a list. | Any |
| INTERSECTS | Attribute value (for MORPH or custom list attributes) has a non-empty intersection with a list. | Any |
| ==, >=, <=, >, < | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. | Union[int, float] |

As of spaCy v3.5, REGEX and FUZZY can be used in combination with IN and NOT_IN.
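
A rough illustration of the dictionary-valued attributes, limited to IN, REGEX and a numeric comparison on LENGTH. The rule names, the regular expression and the sentence are assumptions chosen for the example, and a blank English pipeline is assumed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# IN: the lowercase form must be one of the listed values.
matcher.add("Greeting", [[{"LOWER": {"IN": ["hello", "hi"]}}]])
# REGEX: the verbatim text must match the expression.
matcher.add("Colour", [[{"TEXT": {"REGEX": "^colou?r$"}}]])
# Comparison: the token text must be at least 10 characters long.
matcher.add("LongWord", [[{"LENGTH": {">=": 10}}]])

doc = nlp("Hi there, the colour is unbelievable")
results = sorted(
    {(nlp.vocab.strings[match_id], doc[start:end].text)
     for match_id, start, end in matcher(doc)}
)
print(results)
```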

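Matcher.__init__ method

Create the rule-based Matcher. If validate=True is set, all patterns added to the matcher are validated against a JSON schema, and a MatchPatternError is raised if problems are found. These problems can include incorrect types (for example, a string where an integer is expected) or unexpected property names.

| Name | Description | Type |
| --- | --- | --- |
| vocab | The vocabulary object, which must be shared with the documents the matcher will operate on. | Vocab |
| validate | Validate all patterns added to this matcher. | bool |
| fuzzy_compare | The comparison method used for the FUZZY operators. | Callable[[str, str, int], bool] |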
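
The validation behaviour described above might be exercised as follows. This is a sketch that assumes MatchPatternError can be imported from spacy.errors; the key "CASEINSENSITIVE" is a deliberately unsupported name invented for the example.

```python
import spacy
from spacy.errors import MatchPatternError
from spacy.matcher import Matcher

nlp = spacy.blank("en")
# validate=True checks every added pattern against the pattern schema.
matcher = Matcher(nlp.vocab, validate=True)

# "CASEINSENSITIVE" is not a supported pattern key (made up on purpose).
bad_pattern = [{"LOWER": "hello"}, {"CASEINSENSITIVE": "world"}]

try:
    matcher.add("HelloWorld", [bad_pattern])
    validation_failed = False
except MatchPatternError as err:
    validation_failed = True
    print(err)  # Describes which pattern and which key caused the problem.
```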

Matcher.__call__ method

Find all token sequences matching the supplied patterns on the Doc or Span.

Note that if a single label has multiple patterns associated with it, the returned matches do not indicate which pattern was responsible for the match.

| Name | Description | Type |
| --- | --- | --- |
| doclike | The Doc or Span to match over. | Union[Doc, Span] |
| as_spans (keyword-only, v3.0) | Instead of tuples, return a list of Span objects of the matches, with the match_id assigned as the span label. Defaults to False. | bool |
| allow_missing (keyword-only, v3.0) | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to False. | bool |
| with_alignments (keyword-only, v3.0.6) | Return match alignment information as part of the match tuple as List[int] with the same length as the matched span. Each entry denotes the corresponding index of the token in the pattern. If as_spans is set to True, this setting is ignored. Defaults to False. | bool |
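
To make the two return shapes concrete, here is a small sketch that calls the matcher on a Doc once with default settings and once with as_spans=True. The rule name and text are illustrative, and a blank English pipeline is assumed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Hello world! hello World")

# Default: a list of (match_id, start, end) tuples, where match_id is the
# hash of the rule name and start/end are token offsets.
matches = matcher(doc)
offsets = [(start, end) for _, start, end in matches]

# as_spans=True: Span objects whose label is the rule name.
spans = matcher(doc, as_spans=True)
labelled = [(span.text, span.label_) for span in spans]

print(offsets)
print(labelled)
```

The hash in each tuple can be turned back into the rule name with nlp.vocab.strings[match_id].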

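Matcher.__len__ method

Get the number of rules added to the matcher. Note that this only returns the number of rules (which is the same as the number of IDs), not the number of individual patterns.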
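
The difference between rules and patterns can be seen in a short sketch: two patterns registered under one ID still count as a single rule. The rule name "Greeting" is illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
empty_count = len(matcher)

# Two patterns, but both belong to the same match ID.
matcher.add("Greeting", [[{"LOWER": "hello"}], [{"LOWER": "hi"}]])
rule_count = len(matcher)

print(empty_count, rule_count)
```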

Matcher.__contains__ method

Check whether the matcher contains rules for a match ID.

| Name | Description | Type |
| --- | --- | --- |
| key | The match ID. | str |
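
A membership check might look like the sketch below; the rule names "Greeting" and "Farewell" are placeholders.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Greeting", [[{"LOWER": "hello"}]])

# The `in` operator calls Matcher.__contains__ with the match ID.
has_greeting = "Greeting" in matcher
has_farewell = "Farewell" in matcher

print(has_greeting, has_farewell)
```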

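Matcher.add method

Add a rule to the matcher, consisting of an ID key, one or more patterns, and an optional callback function to act on the matches. The callback function receives the arguments matcher, doc, i and matches. If patterns already exist for the given ID, the new patterns are appended to them. An existing on_match callback is overwritten.

| Name | Description | Type |
| --- | --- | --- |
| match_id | An ID for the thing you're matching. | str |
| patterns | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. | List[List[Dict[str, Any]]] |
| on_match (keyword-only) | Callback function to act on matches. Takes the arguments matcher, doc, i and matches. | Optional[Callable[[Matcher, Doc, int, List[tuple]], Any]] |
| greedy (keyword-only, v3.0) | Optional filter for greedy matches. Can either be "FIRST" or "LONGEST". | Optional[str] |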
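
The following sketch shows one way the on_match callback and the greedy filter could be used together. The callback name collect_match, the list seen_texts and both rule names are hypothetical helpers introduced for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

seen_texts = []

def collect_match(matcher, doc, i, matches):
    # `matches` is the full result list; `i` is the index of the current match.
    match_id, start, end = matches[i]
    seen_texts.append(doc[start:end].text)

# Rule with a callback that fires once per match of this rule.
matcher.add(
    "HelloWorld",
    [[{"LOWER": "hello"}, {"LOWER": "world"}]],
    on_match=collect_match,
)
# Rule with greedy="LONGEST": of overlapping matches, keep the longest one.
matcher.add(
    "Intensified",
    [[{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]],
    greedy="LONGEST",
)

doc = nlp("Hello world, this is very very good")
matches = matcher(doc)
intensified = [
    doc[start:end].text
    for match_id, start, end in matches
    if nlp.vocab.strings[match_id] == "Intensified"
]

print(seen_texts)
print(intensified)
```

Without the greedy filter, the second rule would also report the shorter span "very good".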

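Matcher.remove method

Remove a rule from the matcher. A KeyError is raised if the match ID does not exist.

| Name | Description | Type |
| --- | --- | --- |
| key | The ID of the match rule. | str |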
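
A removal round trip could be sketched as follows. The text above names KeyError for a missing ID; the example also catches ValueError as a defensive assumption, since the exact exception type may differ between spaCy versions.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Rule", [[{"ORTH": "test"}]])

present_before = "Rule" in matcher
matcher.remove("Rule")
present_after = "Rule" in matcher

# Removing an unknown ID should fail. ValueError is caught in addition to
# KeyError as an assumption about version differences, not documented behaviour.
try:
    matcher.remove("Rule")
    second_removal_failed = False
except (KeyError, ValueError):
    second_removal_failed = True

print(present_before, present_after, second_removal_failed)
```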

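Matcher.get method

Retrieve the patterns stored for a key. Returns the rule as an (on_match, patterns) tuple containing the callback and the available patterns.

| Name | Description | Type |
| --- | --- | --- |
| key | The ID of the match rule. | str |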
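
As a brief sketch, the returned tuple can be unpacked directly. The rule here is added without a callback, so the first element is expected to be None; the rule name is illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Rule", [[{"ORTH": "test"}]])

# Unpack the (on_match, patterns) tuple stored for the rule.
on_match, patterns = matcher.get("Rule")

print(on_match)
print(len(patterns))
```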