Tokenizer

class

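Segment text into words, punctuation marks, etc.

Segment text, and create `Doc` objects with the discovered segment boundaries. For a deeper understanding, see the docs on how spaCy's tokenizer works. The tokenizer is typically created automatically when a `Language` subclass is initialized, and it reads its settings, such as punctuation and special case rules, from the `Language.Defaults` provided by the language subclass.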

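Tokenizer.__init__ method

Create a `Tokenizer` that produces `Doc` objects from unicode text. For examples of how to construct a custom tokenizer with different tokenization rules, see the usage documentation.

Name	Description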
`vocab`	A storage container for lexical types. Vocab
`rules`	Exceptions and special-cases for the tokenizer. Optional[Dict[str, List[Dict[int, str]]]]
`prefix_search`	A function matching the signature of `re.compile(string).search` to match prefixes. Optional[Callable[[str], Optional[Match]]]
`suffix_search`	A function matching the signature of `re.compile(string).search` to match suffixes. Optional[Callable[[str], Optional[Match]]]
`infix_finditer`	A function matching the signature of `re.compile(string).finditer` to find infixes. Optional[Callable[[str], Iterator[Match]]]
`token_match`	A function matching the signature of `re.compile(string).match` to find token matches. Optional[Callable[[str], Optional[Match]]]
`url_match`	A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. Optional[Callable[[str], Optional[Match]]]
`faster_heuristics` v3.3.0	Whether to restrict the final `Matcher`-based pass for rules to those containing affixes or space. Defaults to `True`. bool
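
The two construction paths can be sketched as follows. This is a minimal sketch, assuming the `English` language class and the `compile_prefix_regex`, `compile_suffix_regex` and `compile_infix_regex` helpers from `spacy.util`; the sample sentence and variable names are illustrative only.

```python
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

nlp = English()

# A tokenizer with only a vocab: no rules and no affix patterns,
# so it splits on whitespace alone.
whitespace_tokenizer = Tokenizer(nlp.vocab)

# A tokenizer rebuilt from the English defaults.
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)
default_like_tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=nlp.Defaults.token_match,
    url_match=nlp.Defaults.url_match,
)

ws_tokens = [t.text for t in whitespace_tokenizer("Hello, world!")]
default_tokens = [t.text for t in default_like_tokenizer("Hello, world!")]
print(ws_tokens)
print(default_tokens)
```

With no affix patterns the punctuation stays attached to the neighboring word, while the second tokenizer splits it off.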

Tokenizer.__call__ method

Tokenize a string.

Name	Description
`string`	The string to tokenize. str
RETURNS	A container for linguistic annotations. Doc
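
A short usage sketch, assuming a blank English pipeline created with `spacy.blank("en")` so that no trained model is required:

```python
import spacy

nlp = spacy.blank("en")

# Calling the tokenizer directly returns a Doc with tokens only;
# no other pipeline components are run.
doc = nlp.tokenizer("This is a sentence")
tokens = [token.text for token in doc]
print(tokens)
```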

Tokenizer.pipe method

Tokenize a stream of texts.

Name	Description
`texts`	A sequence of unicode texts. Iterable[str]
`batch_size`	The number of texts to accumulate in an internal buffer. Defaults to `1000`. int
YIELDS	The tokenized `Doc` objects, in order. Doc
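
As an illustration, the sketch below streams a small list of texts through the tokenizer. The texts and the `batch_size` value are arbitrary choices for the example, and a blank English pipeline is assumed.

```python
import spacy

nlp = spacy.blank("en")
texts = ["One document.", "Another document.", "Lots of documents"]

# pipe() is a generator, so it is wrapped in list() to collect the Docs.
docs = list(nlp.tokenizer.pipe(texts, batch_size=50))
first_doc_tokens = [token.text for token in docs[0]]
print(len(docs), first_doc_tokens)
```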

Tokenizer.find_infix method

Find internal split points of the string.

Name	Description
`string`	The string to split. str
RETURNS	A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. List[Match]
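
A small sketch of inspecting the returned matches, assuming the default English infix rules, under which a hyphen between letters counts as an infix:

```python
import spacy

nlp = spacy.blank("en")

matches = nlp.tokenizer.find_infix("well-known")
# Each match exposes start() and end() offsets into the input string.
spans = [(m.start(), m.end()) for m in matches]
print(spans)
```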

Tokenizer.find_prefix method

Find the length of a prefix that should be segmented from the string, or `None` if no prefix rules match.

Name	Description
`string`	The string to segment. str
RETURNS	The length of the prefix if present, otherwise `None`. Optional[int]
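
For example, assuming the default English prefix rules, which include the opening parenthesis. The sketch only checks the no-match result for falsiness rather than for a specific value, since that value may differ between spaCy versions.

```python
import spacy

nlp = spacy.blank("en")

prefix_len = nlp.tokenizer.find_prefix("(hello")
no_prefix = nlp.tokenizer.find_prefix("hello")

print(prefix_len)
# Treat the no-match result as falsy instead of comparing to None.
print(bool(no_prefix))
```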

Tokenizer.find_suffix method

Find the length of a suffix that should be segmented from the string, or `None` if no suffix rules match.

Name	Description
`string`	The string to segment. str
RETURNS	The length of the suffix if present, otherwise `None`. Optional[int]
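
The suffix counterpart can be sketched the same way, again assuming the default English rules, where the closing parenthesis is a suffix:

```python
import spacy

nlp = spacy.blank("en")

suffix_len = nlp.tokenizer.find_suffix("hello)")
no_suffix = nlp.tokenizer.find_suffix("hello")

print(suffix_len)
# Treat the no-match result as falsy instead of comparing to None.
print(bool(no_suffix))
```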

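Tokenizer.add_special_case method

Add a special-case tokenization rule. This mechanism is also used to add custom tokenizer exceptions to the language data. See the usage guide on the language data and tokenizer special cases for more details and examples.

Name	Description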
`string`	The string to specially tokenize. str
`token_attrs`	A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. Iterable[Dict[int, str]]
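
A minimal sketch of adding a rule, using the informal word "gimme" as an illustrative case. Note that the `ORTH` values ("gim" + "me") concatenate back to the original string, as the table above requires.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

before = [t.text for t in nlp("gimme that")]

# Split "gimme" into two tokens whose texts join back to "gimme".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

after = [t.text for t in nlp("gimme that")]
print(before, after)
```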

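Tokenizer.explain method

Tokenize a string with a slow debugging tokenizer that reports which tokenizer rule or pattern was matched for each token. The tokens produced are identical to `Tokenizer.__call__` except for whitespace tokens.

Name	Description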
`string`	The string to tokenize with the debugging tokenizer. str
RETURNS	A list of `(pattern_string, token_string)` tuples. List[Tuple[str, str]]
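
As a sketch, the debugging output for a string that touches a prefix rule, a special case and a suffix rule might be inspected like this, assuming the default English rules:

```python
import spacy

nlp = spacy.blank("en")

explained = nlp.tokenizer.explain("(don't)")
patterns = [pattern for pattern, _ in explained]
token_texts = [text for _, text in explained]

for pattern, text in explained:
    print(f"{pattern}\t{text}")
```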

Tokenizer.to_disk method

Serialize the tokenizer to disk.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]

Tokenizer.from_disk method

Load the tokenizer from disk. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The modified `Tokenizer` object. Tokenizer
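
A round trip through `to_disk` and `from_disk` can be sketched as follows. The temporary directory, the `tokenizer` file name and the "gimme" special case are assumptions made for the example, not part of the API.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tokenizer"  # hypothetical location
    nlp.tokenizer.to_disk(path)

    # from_disk modifies the fresh tokenizer in place and returns it.
    restored = spacy.blank("en").tokenizer.from_disk(path)

restored_tokens = [t.text for t in restored("gimme that")]
print(restored_tokens)
```

The custom special case is expected to survive the round trip because the exception rules are among the serialized fields.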

Tokenizer.to_bytes method

Serialize the tokenizer to a bytestring.

Name	Description
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The serialized form of the `Tokenizer` object. bytes

Tokenizer.from_bytes method

Load the tokenizer from a bytestring. Modifies the object in place and returns it.

Name	Description
`bytes_data`	The data to load from. bytes
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `Tokenizer` object. Tokenizer
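
The in-memory equivalent, sketched under the same assumptions (a blank English pipeline and an illustrative "gimme" special case):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

tokenizer_bytes = nlp.tokenizer.to_bytes()

# Load the bytes into the tokenizer of a second, fresh pipeline.
restored = spacy.blank("en").tokenizer
restored.from_bytes(tokenizer_bytes)

restored_tokens = [t.text for t in restored("gimme that")]
print(type(tokenizer_bytes).__name__, restored_tokens)
```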

Attributes

Name	Description
`vocab`	The vocab object of the parent `Doc`. Vocab
`prefix_search`	A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. Optional[Callable[[str], Optional[Match]]]
`suffix_search`	A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. Optional[Callable[[str], Optional[Match]]]
`infix_finditer`	A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of `re.MatchObject` objects. Optional[Callable[[str], Iterator[Match]]]
`token_match`	A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. Optional[Callable[[str], Optional[Match]]]
`rules`	A dictionary of tokenizer exceptions and special cases. Optional[Dict[str, List[Dict[int, str]]]]
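
These attributes can also be reassigned on an existing tokenizer. The sketch below extends the suffix rules with a made-up marker, `QQQ`, chosen only because it is unlikely to collide with a default rule; it assumes `spacy.util.compile_suffix_regex`, which anchors each pattern at the end of the string itself.

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")

# Add a hypothetical suffix pattern to the language defaults.
suffixes = list(nlp.Defaults.suffixes) + [r"QQQ"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

tokens = [t.text for t in nlp("helloQQQ")]
has_rules_dict = isinstance(nlp.tokenizer.rules, dict)
print(tokens, has_rules_dict)
```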

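Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in their string names via the `exclude` argument.

Name	Description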
`vocab`	The shared `Vocab`.
`prefix_search`	The prefix rules.
`suffix_search`	The suffix rules.
`infix_finditer`	The infix rules.
`token_match`	The token match expression.
`exceptions`	The tokenizer exception rules.
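
For instance, the shared vocab and the exception rules could be left out of the serialized bytes like this. It is a sketch assuming a blank English pipeline; the size comparison is only there to show that the excluded fields were skipped.

```python
import spacy

nlp = spacy.blank("en")

full_bytes = nlp.tokenizer.to_bytes()
# Field names are taken from the table above.
partial_bytes = nlp.tokenizer.to_bytes(exclude=["vocab", "exceptions"])

print(len(full_bytes), len(partial_bytes))
```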
