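Matcher
The Matcher lets you find words and phrases using rules that describe their token attributes. Rules can refer to token annotations (such as the text or part-of-speech tags) as well as lexical attributes like Token.is_punct. Applying the matcher to a Doc gives you access to the matched tokens in context. For in-depth examples and workflows that combine rules and statistical models, see the usage guide on rule-based matching.
Pattern format
A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes. The available token pattern keys correspond to a number of Token attributes. The supported attributes for rule-based matching are: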
| Attribute | Description | Type |
|---|---|---|
| ORTH | The exact verbatim text of a token. | str |
| TEXT | The exact verbatim text of a token. | str |
| NORM | The normalized form of the token text. | str |
| LOWER | The lowercase form of the token text. | str |
| LENGTH | The length of the token text. | int |
| IS_ALPHA, IS_ASCII, IS_DIGIT | Token text consists of alphabetic characters, ASCII characters, digits. | bool |
| IS_LOWER, IS_UPPER, IS_TITLE | Token text is in lowercase, uppercase, titlecase. | bool |
| IS_PUNCT, IS_SPACE, IS_STOP | Token is punctuation, whitespace, stop word. | bool |
| IS_SENT_START | Token is start of sentence. | bool |
| LIKE_NUM, LIKE_URL, LIKE_EMAIL | Token text resembles a number, URL, email. | bool |
| SPACY | Token has a trailing space. | bool |
| POS, TAG, MORPH, DEP, LEMMA, SHAPE | The token’s simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. | str |
| ENT_TYPE | The token’s entity label. | str |
| ENT_IOB | The IOB part of the token’s entity tag. | str |
| ENT_ID | The token’s entity ID (ent_id). | str |
| ENT_KB_ID | The token’s entity knowledge base ID (ent_kb_id). | str |
| _ | Properties in custom extension attributes. | Dict[str, Any] |
| OP | Operator or quantifier to determine how often to match a token pattern. | str |
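To make the pattern format concrete, here is a minimal sketch of a three-token pattern built from the attributes in the table. It assumes a blank English pipeline (`spacy.blank("en")`) so that no trained model is required; the rule name `HelloWorld` and the sample text are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: enough for lexical attributes such as LOWER and IS_PUNCT.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "hello" (any casing), one punctuation token, then "world".
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    # match_id is a hash; look the string ID up in the vocab's string store.
    print(nlp.vocab.strings[match_id], start, end, doc[start:end].text)
```

Only the first greeting should match here, because the second one has no punctuation token between the two words.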
Operators and quantifiers define how often a token pattern should be matched:
| OP | Description |
|---|---|
| ! | Negate the pattern, by requiring it to match exactly 0 times. |
| ? | Make the pattern optional, by allowing it to match 0 or 1 times. |
| + | Require the pattern to match 1 or more times. |
| * | Allow the pattern to match 0 or more times. |
| {n} | Require the pattern to match exactly n times. |
| {n,m} | Require the pattern to match at least n but not more than m times. |
| {n,} | Require the pattern to match at least n times. |
| {,m} | Require the pattern to match at most m times. |
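The following sketch shows two of the operators from the table in use, `?` and `+`. It assumes a blank English pipeline; the rule names and the sample sentence are made up for illustration. Note that without a `greedy` filter, a quantified pattern can yield several overlapping matches.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "?" makes the punctuation token optional (0 or 1 times).
matcher.add(
    "HELLO_WORLD",
    [[{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]],
)
# "+" requires "very" one or more times before "good".
matcher.add("VERY_GOOD", [[{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]])

doc = nlp("Hello, world! Hello world! A very very good idea.")

# Collect (rule name, matched text) pairs; a set is used because the
# order of returned matches is not relied upon in this sketch.
found = {
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
}
for rule, text in sorted(found):
    print(rule, "->", text)
```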
Token patterns can also map to a dictionary of properties instead of a single value, to indicate whether the expected value is a member of a list or how it compares to another value.
| Attribute | Description | Type |
|---|---|---|
| REGEX | Attribute value matches the regular expression at any position in the string. | Any |
| FUZZY | Attribute value matches if the fuzzy_compare method matches for (value, pattern, -1). The default method allows a Levenshtein edit distance of at least 2 and up to 30% of the pattern string length. | Any |
| FUZZY1, FUZZY2, … FUZZY9 | Attribute value matches if the fuzzy_compare method matches for (value, pattern, N). The default method allows a Levenshtein edit distance of at most N (1-9). | Any |
| IN | Attribute value is a member of a list. | Any |
| NOT_IN | Attribute value is not a member of a list. | Any |
| IS_SUBSET | Attribute value (for MORPH or custom list attributes) is a subset of a list. | Any |
| IS_SUPERSET | Attribute value (for MORPH or custom list attributes) is a superset of a list. | Any |
| INTERSECTS | Attribute value (for MORPH or custom list attributes) has a non-empty intersection with a list. | Any |
| ==, >=, <=, >, < | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. | Union[int, float] |
As of spaCy v3.5, REGEX and FUZZY can be used in combination with IN and NOT_IN.
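As a rough illustration of these dictionary-valued properties, the sketch below uses `IN`, `REGEX` and a `>=` comparison on single-token patterns. The blank English pipeline, the rule names (`GREETING`, `SUFFIX`, `LONG_TOKEN`) and the sample text are assumptions chosen for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# IN: the lowercase form must be one of the listed values.
matcher.add("GREETING", [[{"LOWER": {"IN": ["hello", "hi"]}}]])
# REGEX: the expression may match at any position, so "$" anchors it to the end.
matcher.add("SUFFIX", [[{"LOWER": {"REGEX": "ization$"}}]])
# Comparison: the token text must be at least 10 characters long.
matcher.add("LONG_TOKEN", [[{"LENGTH": {">=": 10}}]])

doc = nlp("hi there, internationalization is hard")

# Group the matched texts by rule name.
found = {}
for match_id, start, end in matcher(doc):
    found.setdefault(nlp.vocab.strings[match_id], []).append(doc[start:end].text)
print(found)
```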
Matcher.__init__ method
Create the rule-based Matcher. If validate=True is set, all patterns added to the matcher are validated against a JSON schema, and a MatchPatternError is raised if problems are found. These can include incorrect types (for example, a string where an integer is expected) or unexpected property names.
| Name | Description | Type |
|---|---|---|
| vocab | The vocabulary object, which must be shared with the documents the matcher will operate on. | Vocab |
| validate | Validate all patterns added to this matcher. | bool |
| fuzzy_compare | The comparison method used for the FUZZY operators. | Callable[[str, str, int], bool] |
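The effect of `validate=True` can be sketched as follows. The example assumes that `MatchPatternError` is importable from `spacy.errors`; the key `CASEINSENSITIVE` is a deliberately invalid, made-up attribute name used to trigger validation.

```python
import spacy
from spacy.errors import MatchPatternError
from spacy.matcher import Matcher

nlp = spacy.blank("en")
# The vocab is shared with the documents the matcher will run on.
matcher = Matcher(nlp.vocab, validate=True)

error = None
try:
    # "CASEINSENSITIVE" is not a supported token attribute, so validation
    # is expected to reject this pattern when it is added.
    matcher.add("HelloWorld", [[{"LOWER": "hello"}, {"CASEINSENSITIVE": "world"}]])
except MatchPatternError as err:
    error = err

print("pattern rejected:", error is not None)
```

Validation happens when the pattern is added, not when the matcher is called, which makes typos in attribute names surface early.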
Matcher.__call__ method
Find all token sequences matching the supplied patterns on the Doc or Span.
Note that if a single label has multiple patterns associated with it, the returned matches don't provide a way to tell which pattern was responsible for the match.
| Name | Description | Type |
|---|---|---|
| doclike | The Doc or Span to match over. | Union[Doc, Span] |
| keyword-only | | |
| as_spans (v3.0) | Instead of tuples, return a list of Span objects of the matches, with the match_id assigned as the span label. Defaults to False. | bool |
| allow_missing (v3.0) | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to False. | bool |
| with_alignments (v3.0.6) | Return match alignment information as part of the match tuple as List[int] with the same length as the matched span. Each entry denotes the corresponding index of the token in the pattern. If as_spans is set to True, this setting is ignored. Defaults to False. | bool |
| RETURNS | A list of (match_id, start, end) tuples, describing the matches. A match tuple describes a span doc[start:end]. The match_id is the ID of the added match pattern. If as_spans is set to True, a list of Span objects is returned instead. | Union[List[Tuple[int, int, int]], List[Span]] |
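A short sketch of the two return shapes, tuples versus `Span` objects. It assumes a blank English pipeline; the rule name `HELLO` and the text are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "hello" followed by any alphabetic token.
matcher.add("HELLO", [[{"LOWER": "hello"}, {"IS_ALPHA": True}]])

doc = nlp("Hello world and hello there")

# Default: a list of (match_id, start, end) tuples, where doc[start:end] is the match.
tuples = matcher(doc)
offsets = sorted((start, end) for _, start, end in tuples)

# as_spans=True: Span objects whose label is the match ID.
spans = matcher(doc, as_spans=True)
labelled = sorted((span.label_, span.text) for span in spans)

print(offsets)
print(labelled)
```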
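Matcher.__len__ method
Get the number of rules added to the matcher. Note that this only returns the number of rules (identical to the number of IDs), not the number of individual patterns.
| Name | Description | Type |
|---|---|---|
| RETURNS | The number of rules. | int |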
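Matcher.__contains__ method
Check whether the matcher contains rules for a match ID.
| Name | Description | Type |
|---|---|---|
| key | The match ID. | str |
| RETURNS | Whether the matcher contains rules for this match ID. | bool |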
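The sketch below illustrates how `len()` and the `in` operator behave at the level of rules rather than individual patterns. The rule names `GREETING` and `FAREWELL` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One rule with two patterns, plus a second rule with one pattern.
matcher.add("GREETING", [[{"LOWER": "hello"}], [{"LOWER": "hi"}]])
matcher.add("FAREWELL", [[{"LOWER": "bye"}]])

rule_count = len(matcher)             # counts rule IDs, not the three patterns
has_greeting = "GREETING" in matcher  # membership is checked by match ID
has_unknown = "UNKNOWN" in matcher

print(rule_count, has_greeting, has_unknown)
```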
Matcher.add method
Add a rule to the matcher, consisting of an ID key, one or more patterns, and an optional callback function to act on the matches. The callback function receives the arguments matcher, doc, i and matches. If patterns already exist for the given ID, the new patterns are appended to them. An existing on_match callback is overwritten.
| Name | Description | Type |
|---|---|---|
| match_id | An ID for the thing you’re matching. | str |
| patterns | Match patterns. A pattern consists of a list of dicts, where each dict describes a token. | List[List[Dict[str, Any]]] |
| keyword-only | | |
| on_match | Callback function to act on matches. Takes the arguments matcher, doc, i and matches. | Optional[Callable[[Matcher, Doc, int, List[tuple]], Any]] |
| greedy (v3.0) | Optional filter for greedy matches. Can either be "FIRST" or "LONGEST". | Optional[str] |
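A minimal sketch of `on_match` and `greedy`, using two separate matchers to keep the behaviours apart. The callback name `collect`, the list `seen` and the rule names are hypothetical; the blank English pipeline is an assumption of the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# 1) A callback that records the text of each match.
seen = []

def collect(matcher, doc, i, matches):
    # i is the index of the current match within the full list of matches.
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)

callback_matcher = Matcher(nlp.vocab)
callback_matcher.add(
    "HelloWorld",
    [[{"LOWER": "hello"}, {"LOWER": "world"}]],
    on_match=collect,
)
callback_matcher(nlp("Hello world, hello World"))

# 2) greedy="LONGEST" keeps the longest of a set of overlapping matches.
greedy_matcher = Matcher(nlp.vocab)
greedy_matcher.add("VERY", [[{"LOWER": "very", "OP": "+"}]], greedy="LONGEST")
doc = nlp("very very very")
longest = [(start, end) for _, start, end in greedy_matcher(doc)]

print(sorted(seen), longest)
```

Without the `greedy` filter, the second rule would also return every shorter run of "very" inside the longest one.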
Matcher.remove method
Remove a rule from the matcher. A KeyError is raised if the match ID does not exist.
| Name | Description | Type |
|---|---|---|
| key | The ID of the match rule. | str |
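Removing a rule can be sketched as below. The documented exception for an unknown ID is `KeyError`; as a precaution this sketch also catches `ValueError`, on the assumption that the exception type may differ between releases. The rule name `Rule` is illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Rule", [[{"ORTH": "test"}]])

present_before = "Rule" in matcher
matcher.remove("Rule")
present_after = "Rule" in matcher

# Removing the same ID a second time should fail, since the rule is gone.
second_removal_failed = False
try:
    matcher.remove("Rule")
except (KeyError, ValueError):
    # KeyError is the documented type; ValueError is caught defensively (assumption).
    second_removal_failed = True

print(present_before, present_after, second_removal_failed)
```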
Matcher.get method
Retrieve the patterns stored for a key. Returns the rule as an (on_match, patterns) tuple containing the callback and the available patterns.
| Name | Description | Type |
|---|---|---|
| key | The ID of the match rule. | str |
| RETURNS | The rule, as an (on_match, patterns) tuple. | Tuple[Optional[Callable], List[List[dict]]] |
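For illustration, the sketch below adds a rule without a callback and reads it back. The rule name `Rule` is hypothetical, and the blank English pipeline is an assumption of the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Rule", [[{"ORTH": "test"}]])

# The rule comes back as a (callback, patterns) pair.
on_match, patterns = matcher.get("Rule")

print(on_match)   # no callback was registered for this rule
print(patterns)   # the list of patterns stored under "Rule"
```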