Token class

An individual token, i.e. a word, punctuation symbol, whitespace, etc.

Token.__init__ method

Construct a Token object.

Name | Description
vocab | A storage container for lexical types. Vocab
doc | The parent document. Doc
offset | The index of the token within the document. int

Token.__len__ method

The number of Unicode characters in the token, i.e. len(token.text).
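A minimal usage sketch, assuming spaCy v3 is installed (the example sentence is illustrative): tokens are created by the Doc rather than constructed directly, and len(token) counts the Unicode characters of the token's text.

```python
import spacy

# A blank pipeline provides only the tokenizer, which is all we need here.
nlp = spacy.blank("en")
doc = nlp("Give it back")

# Tokens are created by the Doc; access them by index.
token = doc[0]
assert token.text == "Give"
assert len(token) == len(token.text) == 4
```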

Token.set_extension classmethod

Define a custom attribute on the Token which becomes available via Token._. For details, see the documentation on custom attributes.

Name | Description
name | Name of the attribute to set by the extension. For example, "my_attr" will be available as token._.my_attr. str
default | Optional default value of the attribute if no getter or method is defined. Optional[Any]
method | Set a custom method on the object, for example token._.compare(other_token). Optional[Callable[[Token, ...], Any]]
getter | Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute. Optional[Callable[[Token], Any]]
setter | Setter function that takes the Token and a value, and modifies the object. Is called when the user writes to the Token._ attribute. Optional[Callable[[Token, Any], None]]
force | Force overwriting existing attribute. bool
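As a sketch (the attribute name "is_fruit" and the word list are made up for illustration), a getter-backed extension can be registered once and then read on any token via `._`:

```python
import spacy
from spacy.tokens import Token

# Hypothetical extension: flag tokens whose text is in a small fruit list.
fruits = {"apple", "apples", "orange", "oranges"}
Token.set_extension(
    "is_fruit",
    getter=lambda token: token.text.lower() in fruits,
    force=True,  # overwrite if already registered, e.g. on re-run
)

nlp = spacy.blank("en")
doc = nlp("I like apples")
print(doc[2]._.is_fruit)  # True: "apples" is in the fruit set
```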

Token.get_extension classmethod

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Name | Description
name | Name of the extension. str

Token.has_extension classmethod

Check whether an extension has been registered on the Token class.

Name | Description
name | Name of the extension to check. str

Token.remove_extension classmethod

Remove a previously registered extension.

Name | Description
name | Name of the extension. str
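The three classmethods above can be combined into a small lifecycle sketch (the extension name "my_attr" is hypothetical):

```python
from spacy.tokens import Token

# Register an extension with only a default value.
Token.set_extension("my_attr", default=None, force=True)
assert Token.has_extension("my_attr")

# get_extension returns the (default, method, getter, setter) 4-tuple.
default, method, getter, setter = Token.get_extension("my_attr")
assert default is None and method is None and getter is None and setter is None

# After removal, has_extension reports False again.
Token.remove_extension("my_attr")
assert not Token.has_extension("my_attr")
```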

Token.check_flag method

Check the value of a boolean flag.

Name | Description
flag_id | The attribute ID of the flag to check. int
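A short sketch using built-in flag IDs from spacy.attrs (the sentence is illustrative; these lexical flags work without a trained model):

```python
import spacy
from spacy.attrs import IS_PUNCT, IS_TITLE

nlp = spacy.blank("en")
doc = nlp("Give it back!")

print(doc[0].check_flag(IS_TITLE))  # True: "Give" is titlecased
print(doc[3].check_flag(IS_PUNCT))  # True: "!" is punctuation
```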

Token.similarity method (needs model)

Compute a semantic similarity estimate. The default is cosine similarity over vectors.

Name | Description
other | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. Union[Doc, Span, Token, Lexeme]
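Normally this relies on a pipeline with pretrained vectors. As a self-contained sketch, the toy 3-dimensional vectors below are set by hand via Vocab.set_vector, so the numbers are illustrative rather than meaningful:

```python
import numpy
import spacy

nlp = spacy.blank("en")
# Hand-set toy vectors; a real pipeline would provide pretrained ones.
nlp.vocab.set_vector("apples", numpy.asarray([1.0, 0.0, 0.0], dtype="float32"))
nlp.vocab.set_vector("oranges", numpy.asarray([1.0, 1.0, 0.0], dtype="float32"))

doc = nlp("apples oranges")
sim = doc[0].similarity(doc[1])  # cosine of the two toy vectors, about 0.71
```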

Token.nbor method

Get a neighboring token.

Name | Description
i | The relative position of the token to get. Defaults to 1. int
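A quick sketch (illustrative sentence): positive offsets look ahead, negative ones look back, and the offset defaults to 1.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back")

assert doc[1].nbor().text == "back"    # defaults to i=1, the next token
assert doc[1].nbor(-1).text == "Give"  # the previous token
```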

Token.set_morph method

Set the morphological analysis from a UD FEATS string, a hash value of a UD FEATS string, a features dict or a MorphAnalysis. The value None can be used to reset the morph to an unset state.

Name | Description
features | The morphological features to set. Union[int, dict, str, MorphAnalysis, None]
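A sketch of setting and resetting morphology by hand with a UD FEATS string (no trained morphologizer required; it also previews Token.has_morph, described below):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like apples")
token = doc[2]

assert not token.has_morph()      # no morph annotation yet
token.set_morph("Number=Plur")    # UD FEATS string
assert token.has_morph()
assert str(token.morph) == "Number=Plur"

token.set_morph(None)             # reset to the unset state
assert not token.has_morph()
```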

Token.has_morph method

Check whether the token has annotated morph information. Returns False when the morph annotation is unset or missing.

Token.is_ancestor method (needs model)

Check whether this token is a parent, grandparent, etc. of another in the dependency tree.

Name | Description
descendant | Another token. Token
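Instead of running a parser, the sketch below builds a Doc with hand-annotated heads and dependency labels (the annotations are illustrative), which is enough to exercise is_ancestor and the ancestors property described below:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# heads are absolute token indices; the ROOT points at itself.
doc = Doc(
    nlp.vocab,
    words=["I", "like", "green", "apples"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "amod", "dobj"],
)

assert doc[1].is_ancestor(doc[2])  # "like" dominates "green" via "apples"
assert [t.text for t in doc[2].ancestors] == ["apples", "like"]
```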

Token.ancestors property (needs model)

A sequence of the token's syntactic ancestors (parents, grandparents, etc.).

Token.conjuncts property (needs model)

A tuple of coordinated tokens, not including the token itself.
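A sketch with hand-annotated coordination rather than parser output (the annotation follows the usual UD convention where the second conjunct attaches to the first):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# "oranges" is conjoined to "apples"; "and" attaches to "oranges" as cc.
doc = Doc(
    nlp.vocab,
    words=["apples", "and", "oranges"],
    heads=[0, 2, 0],
    deps=["ROOT", "cc", "conj"],
)

print([t.text for t in doc[0].conjuncts])  # coordinated tokens, excluding "apples" itself
```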

Token.children property (needs model)

A sequence of the token's immediate syntactic children.

Token.lefts property (needs model)

The leftward immediate children of the word in the syntactic dependency parse.

Token.rights property (needs model)

The rightward immediate children of the word in the syntactic dependency parse.

Token.n_lefts property (needs model)

The number of leftward immediate children of the word in the syntactic dependency parse.

Token.n_rights property (needs model)

The number of rightward immediate children of the word in the syntactic dependency parse.

Token.subtree property (needs model)

A sequence containing the token and all the token's syntactic descendants.
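The tree-navigation properties above (children, lefts, rights, n_lefts, n_rights, subtree) can be exercised together on a Doc with hand-annotated heads and deps (annotations illustrative, no parser needed):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=["I", "like", "green", "apples"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "amod", "dobj"],
)
root = doc[1]  # "like"

assert [t.text for t in root.children] == ["I", "apples"]
assert [t.text for t in root.lefts] == ["I"]       # children to the left
assert [t.text for t in root.rights] == ["apples"] # children to the right
assert root.n_lefts == 1 and root.n_rights == 1

# subtree yields the token plus all descendants, in document order.
assert [t.text for t in doc[3].subtree] == ["green", "apples"]
```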

Token.has_vector property (needs model)

A boolean value indicating whether a word vector is associated with the token.

Token.vector property (needs model)

A real-valued meaning representation.

Token.vector_norm property (needs model)

The L2 norm of the token's vector representation.
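A sketch tying the three vector properties together; as above, a toy vector is set by hand so no pretrained vectors are required (the values are illustrative):

```python
import numpy
import spacy

nlp = spacy.blank("en")
nlp.vocab.set_vector("apples", numpy.asarray([3.0, 4.0], dtype="float32"))

doc = nlp("apples and")
assert doc[0].has_vector                      # vector was set for "apples"
assert doc[0].vector.shape == (2,)            # the raw vector
assert abs(doc[0].vector_norm - 5.0) < 1e-5   # L2 norm of [3, 4] is 5
assert not doc[1].has_vector                  # no vector set for "and"
```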

Attributes

Name | Description
doc | The parent document. Doc
lex (v3.0) | The underlying lexeme. Lexeme
sent | The sentence span that this token is a part of. Span
text | Verbatim text content. str
text_with_ws | Text content, with trailing space character if present. str
whitespace_ | Trailing space character if present. str
orth | ID of the verbatim text content. int
orth_ | Verbatim text content (identical to Token.text). Exists mostly for consistency with the other attributes. str
vocab | The vocab object of the parent Doc. Vocab
tensor | The token's slice of the parent Doc's tensor. numpy.ndarray
head | The syntactic parent, or "governor", of this token. Token
left_edge | The leftmost token of this token's syntactic descendants. Token
right_edge | The rightmost token of this token's syntactic descendants. Token
i | The index of the token within the parent document. int
ent_type | Named entity type. int
ent_type_ | Named entity type. str
ent_iob | IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set. int
ent_iob_ | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. str
ent_kb_id | Knowledge base ID that refers to the named entity this token is a part of, if any. int
ent_kb_id_ | Knowledge base ID that refers to the named entity this token is a part of, if any. str
ent_id | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. int
ent_id_ | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. str
lemma | Base form of the token, with no inflectional suffixes. int
lemma_ | Base form of the token, with no inflectional suffixes. str
norm | The token's norm, i.e. a normalized form of the token text. Can be set in the language's tokenizer exceptions. int
norm_ | The token's norm, i.e. a normalized form of the token text. Can be set in the language's tokenizer exceptions. str
lower | Lowercase form of the token. int
lower_ | Lowercase form of the token text. Equivalent to Token.text.lower(). str
shape | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". int
shape_ | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". str
prefix | Hash value of a length-N substring from the start of the token. Defaults to N=1. int
prefix_ | A length-N substring from the start of the token. Defaults to N=1. str
suffix | Hash value of a length-N substring from the end of the token. Defaults to N=3. int
suffix_ | Length-N substring from the end of the token. Defaults to N=3. str
is_alpha | Does the token consist of alphabetic characters? Equivalent to token.text.isalpha(). bool
is_ascii | Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text). bool
is_digit | Does the token consist of digits? Equivalent to token.text.isdigit(). bool
is_lower | Is the token in lowercase? Equivalent to token.text.islower(). bool
is_upper | Is the token in uppercase? Equivalent to token.text.isupper(). bool
is_title | Is the token in titlecase? Equivalent to token.text.istitle(). bool
is_punct | Is the token punctuation? bool
is_left_punct | Is the token a left punctuation mark, e.g. "("? bool
is_right_punct | Is the token a right punctuation mark, e.g. ")"? bool
is_sent_start | Does the token start a sentence? bool or None if unknown. Defaults to True for the first token in the Doc.
is_sent_end | Does the token end a sentence? bool or None if unknown.
is_space | Does the token consist of whitespace characters? Equivalent to token.text.isspace(). bool
is_bracket | Is the token a bracket? bool
is_quote | Is the token a quotation mark? bool
is_currency | Is the token a currency symbol? bool
like_url | Does the token resemble a URL? bool
like_num | Does the token represent a number? e.g. "10.9", "10", "ten", etc. bool
like_email | Does the token resemble an email address? bool
is_oov | Is the token out-of-vocabulary (i.e. does it not have a word vector)? bool
is_stop | Is the token part of a "stop list"? bool
pos | Coarse-grained part-of-speech from the Universal POS tag set. int
pos_ | Coarse-grained part-of-speech from the Universal POS tag set. str
tag | Fine-grained part-of-speech. int
tag_ | Fine-grained part-of-speech. str
morph (v3.0) | Morphological analysis. MorphAnalysis
dep | Syntactic dependency relation. int
dep_ | Syntactic dependency relation. str
lang | Language of the parent document's vocabulary. int
lang_ | Language of the parent document's vocabulary. str
prob | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). float
idx | The character offset of the token within the parent document. int
sentiment | A scalar value indicating the positivity or negativity of the token. float
lex_id | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. int
rank | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. int
cluster | Brown cluster ID. int
_ | User space for adding custom attribute extensions. Underscore
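Many of the attributes above are purely lexical and available without any trained components, so a blank pipeline is enough to inspect them (the sentence is illustrative; attributes like pos_ and dep_ would additionally need a trained pipeline):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple sells 10 phones")

t = doc[0]
assert t.text == "Apple" and t.is_alpha and t.is_title
assert t.shape_ == "Xxxxx"            # X for uppercase, x for lowercase
assert t.whitespace_ == " "           # trailing space after "Apple"
assert doc[1].idx == 6                # character offset of "sells"
assert doc[2].like_num and not doc[2].is_alpha  # "10"
```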