Matcher

class

Match sequences of tokens, based on pattern rules

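The `Matcher` lets you find words and phrases using rules that describe their token attributes. Rules can refer to token annotations (like the text or part-of-speech tags), as well as lexical attributes like `Token.is_punct`. Applying the matcher to a `Doc` gives you access to the matched tokens in context. For in-depth examples and workflows that combine rules and statistical models, see the usage guide on rule-based matching.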
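As a minimal sketch of this workflow, the example below assumes a blank English pipeline (`spacy.blank("en")`) so that no trained model is needed. The rule name `HelloWorld` and the sample text are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which is enough for lexical attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "hello", then any punctuation token, then "world".
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_name = nlp.vocab.strings[match_id]  # look up the string ID from its hash
    print(rule_name, start, end, doc[start:end].text)
```

Only the first greeting matches here, because the pattern requires a punctuation token between the two words.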

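Pattern format

A pattern added to the `Matcher` consists of a list of dictionaries. Each dictionary describes one token and its attributes. The available token pattern keys correspond to a number of `Token` attributes. The supported attributes for rule-based matching are:

Attribute	Description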
`ORTH`	The exact verbatim text of a token. str
`TEXT`	The exact verbatim text of a token. str
`NORM`	The normalized form of the token text. str
`LOWER`	The lowercase form of the token text. str
`LENGTH`	The length of the token text. int
`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`	Token text consists of alphabetic characters, ASCII characters, digits. bool
`IS_LOWER`, `IS_UPPER`, `IS_TITLE`	Token text is in lowercase, uppercase, titlecase. bool
`IS_PUNCT`, `IS_SPACE`, `IS_STOP`	Token is punctuation, whitespace, stop word. bool
`IS_SENT_START`	Token is start of sentence. bool
`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`	Token text resembles a number, URL, email. bool
`SPACY`	Token has a trailing space. bool
`POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE`	The token’s simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. str
`ENT_TYPE`	The token’s entity label. str
`ENT_IOB`	The IOB part of the token’s entity tag. str
`ENT_ID`	The token’s entity ID (`ent_id`). str
`ENT_KB_ID`	The token’s entity knowledge base ID (`ent_kb_id`). str
`_`	Properties in custom extension attributes. Dict[str, Any]
`OP`	Operator or quantifier to determine how often to match a token pattern. str
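To illustrate how several attribute keys combine, here is a small sketch that assumes a blank English pipeline and uses only lexical attributes from the table above. The rule name `LABEL_NUMBER` and the sentence are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# All keys inside one dict must hold for the same token:
# a title-cased alphabetic word, followed by a token made only of digits.
pattern = [{"IS_TITLE": True, "IS_ALPHA": True}, {"IS_DIGIT": True}]
matcher.add("LABEL_NUMBER", [pattern])

doc = nlp("Order 66 shipped to Room 101.")
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```

Attributes such as `POS`, `TAG` or `DEP` need a pipeline that actually sets those annotations, which is why this sketch sticks to lexical flags.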

Operators and quantifiers define how often a token pattern should be matched:

OP	Description
`!`	Negate the pattern, by requiring it to match exactly 0 times.
`?`	Make the pattern optional, by allowing it to match 0 or 1 times.
`+`	Require the pattern to match 1 or more times.
`*`	Allow the pattern to match 0 or more times.
`{n}`	Require the pattern to match exactly n times.
`{n,m}`	Require the pattern to match at least n but not more than m times.
`{n,}`	Require the pattern to match at least n times.
`{,m}`	Require the pattern to match at most m times.
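The following sketch shows the `?` operator from the table above, assuming a blank English pipeline. The rule name and text are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# The punctuation token is optional, so both greetings below should match.
pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "world"},
]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
spans = sorted((start, end) for _, start, end in matcher(doc))
print([doc[start:end].text for start, end in spans])
```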

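Token patterns can also map to a dictionary of properties instead of a single value, to indicate whether the expected value is a member of a list or how it compares to another value.

Attribute	Description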
`REGEX`	Attribute value matches the regular expression at any position in the string. Any
`FUZZY`	Attribute value matches if the `fuzzy_compare` method matches for `(value, pattern, -1)`. The default method allows a maximum Levenshtein edit distance of 30% of the pattern string length, but never fewer than 2 edits. Any
`FUZZY1`, `FUZZY2`, … `FUZZY9`	Attribute value matches if the `fuzzy_compare` method matches for `(value, pattern, N)`. The default method allows a Levenshtein edit distance of at most N (1-9). Any
`IN`	Attribute value is member of a list. Any
`NOT_IN`	Attribute value is not member of a list. Any
`IS_SUBSET`	Attribute value (for `MORPH` or custom list attributes) is a subset of a list. Any
`IS_SUPERSET`	Attribute value (for `MORPH` or custom list attributes) is a superset of a list. Any
`INTERSECTS`	Attribute value (for `MORPH` or custom list attribute) has a non-empty intersection with a list. Any
`==`, `>=`, `<=`, `>`, `<`	Attribute value is equal, greater or equal, smaller or equal, greater or smaller. Union[int, float]

As of spaCy v3.5, `REGEX` and `FUZZY` can be combined with `IN` and `NOT_IN`.
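A short sketch of the dictionary-valued attributes above, assuming a blank English pipeline. The rule names `LIKE_LONG_WORD` and `STARTS_WITH_APP` are hypothetical and exist only for this example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "like" or "love", followed by a token of at least five characters.
matcher.add("LIKE_LONG_WORD", [[
    {"LOWER": {"IN": ["like", "love"]}},
    {"LENGTH": {">=": 5}},
]])

# A single token whose text matches the regular expression.
matcher.add("STARTS_WITH_APP", [[{"TEXT": {"REGEX": "^app"}}]])

doc = nlp("I love apples and like tea")
found = {
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
}
print(found)
```

"like tea" is not matched because "tea" is shorter than five characters.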

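Matcher.__init__ method

Create the rule-based `Matcher`. If `validate=True` is set, all patterns added to the matcher are validated against a JSON schema, and a `MatchPatternError` is raised if problems are found. These can include incorrect types (e.g. a string where an integer is expected) or unexpected property names.

Name	Description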
`vocab`	The vocabulary object, which must be shared with the documents the matcher will operate on. Vocab
`validate`	Validate all patterns added to this matcher. bool
`fuzzy_compare`	The comparison method used for the `FUZZY` operators. Callable[[str, str, int], bool]
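A minimal sketch of pattern validation, assuming a blank English pipeline and that `MatchPatternError` is imported from `spacy.errors`. The key `CASING` is deliberately not a supported attribute.

```python
import spacy
from spacy.errors import MatchPatternError
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab, validate=True)

# "CASING" is not a supported token attribute, so validation should reject it.
bad_pattern = [{"LOWER": "hello"}, {"CASING": "title"}]
try:
    matcher.add("HelloWorld", [bad_pattern])
    rejected = False
except MatchPatternError as err:
    rejected = True
    print(err)  # the message lists the offending pattern and key
```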

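Matcher.__call__ method

Find all token sequences matching the supplied patterns on the `Doc` or `Span`.

Note that if a single label (match ID) has multiple patterns associated with it, the returned matches don't provide a way to tell which pattern was responsible for the match.

Name	Description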
`doclike`	The `Doc` or `Span` to match over. Union[Doc,Span]
keyword-only
`as_spans` v3.0	Instead of tuples, return a list of `Span` objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. bool
`allow_missing` v3.0	Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. bool
`with_alignments` v3.0.6	Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the corresponding index of the token in the pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. bool
RETURNS	A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. Union[List[Tuple[int, int, int]], List[Span]]
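The two return shapes can be compared with a sketch like the following, which assumes a blank English pipeline and an illustrative rule named `HelloWorld`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add(
    "HelloWorld",
    [[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]],
)
doc = nlp("Hello, world! Hello world!")

# Default: a list of (match_id, start, end) tuples.
tuples = matcher(doc)

# With as_spans=True: Span objects labelled with the match ID.
spans = matcher(doc, as_spans=True)
labels = [(span.text, span.label_) for span in spans]
print(tuples, labels)
```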

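Matcher.__len__ method

Get the number of rules added to the matcher. Note that this only returns the number of rules (identical to the number of IDs), not the number of individual patterns.

Name	Description
RETURNS	The number of rules. int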
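As a brief sketch of the difference between rules and patterns (the rule name `Greeting` is illustrative, and a blank English pipeline is assumed):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
empty_count = len(matcher)

# Two patterns registered under one ID still count as a single rule.
matcher.add("Greeting", [[{"LOWER": "hello"}], [{"LOWER": "hi"}]])
rule_count = len(matcher)
print(empty_count, rule_count)
```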

Matcher.__contains__ method

Check whether the matcher contains rules for a match ID.

Name	Description
`key`	The match ID. str
RETURNS	Whether the matcher contains rules for this match ID. bool
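A membership check might look like this sketch, assuming a blank English pipeline and placeholder rule names:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Rule", [[{"ORTH": "test"}]])

has_rule = "Rule" in matcher    # an ID that was added
has_other = "Other" in matcher  # an ID that was never added
print(has_rule, has_other)
```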

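Matcher.add method

Add a rule to the matcher, consisting of an ID key, one or more patterns, and an optional callback function to act on the matches. The callback function receives the arguments `matcher`, `doc`, `i` and `matches`. If a pattern already exists for the given ID, the patterns are extended. An existing `on_match` callback is overwritten.

Changed in v3.0

As of spaCy v3.0, `Matcher.add` takes a list of patterns as the second argument (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument.

Name	Description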
`match_id`	An ID for the thing you’re matching. str
`patterns`	Match pattern. A pattern consists of a list of dicts, where each dict describes a token. List[List[Dict[str, Any]]]
keyword-only
`on_match`	Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. Optional[Callable[[Matcher, Doc, int, List[tuple]], Any]]
`greedy` v3.0	Optional filter for greedy matches. Can either be `"FIRST"` or `"LONGEST"`. Optional[str]
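The sketch below combines an `on_match` callback with `greedy="LONGEST"`, assuming a blank English pipeline. The callback name `collect`, the list `matched_texts` and the rule name `VERY_GOOD` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matched_texts = []

def collect(matcher, doc, i, matches):
    # `i` is the index of the current match within the `matches` list.
    match_id, start, end = matches[i]
    matched_texts.append(doc[start:end].text)

# One or more "very" followed by "good"; keep only the longest overlapping match.
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]
matcher.add("VERY_GOOD", [pattern], on_match=collect, greedy="LONGEST")

doc = nlp("This is very very good")
matches = matcher(doc)
print(matches, matched_texts)
```

Without the `greedy` filter, the shorter overlapping span "very good" would be returned as well.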

Matcher.remove method

Remove a rule from the matcher. A `KeyError` is raised if the match ID does not exist.

Name	Description
`key`	The ID of the match rule. str
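Removing a rule can be sketched as follows, assuming a blank English pipeline and a placeholder rule named `Rule`. Only the successful case is shown.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Rule", [[{"ORTH": "test"}]])
present_before = "Rule" in matcher

matcher.remove("Rule")
present_after = "Rule" in matcher
print(present_before, present_after)
```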

Matcher.get method

Retrieve the pattern stored for a key. Returns the rule as an `(on_match, patterns)` tuple containing the callback and available patterns.

Name	Description
`key`	The ID of the match rule. str
RETURNS	The rule, as an `(on_match, patterns)` tuple. Tuple[Optional[Callable], List[List[dict]]]
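A small sketch of retrieving a stored rule, assuming a blank English pipeline and a rule added without a callback (the name `Rule` is a placeholder):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Rule", [[{"ORTH": "test"}]])

# Unpack the stored callback and the list of patterns for this ID.
on_match, patterns = matcher.get("Rule")
print(on_match, patterns)
```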
