Token class

An individual token, i.e. a word, punctuation symbol, whitespace, etc.

Token.__init__ method

Construct a Token object.

Name | Description
vocab | A storage container for lexical types. Vocab
doc | The parent document. Doc
offset | The index of the token within the document. int

Token.__len__ method

The number of Unicode characters in the token, i.e. len(token.text).
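A minimal usage sketch, assuming spaCy v3 is installed (the example sentence is illustrative): tokens are created by the Doc rather than constructed directly, and len(token) counts the Unicode characters of the token's text.

```python
import spacy

# A blank pipeline provides only the tokenizer, which is all we need here.
nlp = spacy.blank("en")
doc = nlp("Give it back")

# Tokens are created by the Doc; access them by index.
token = doc[0]
assert token.text == "Give"
assert len(token) == len(token.text) == 4
```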

Token.set_extension classmethod

Define a custom attribute on the Token which becomes available via Token._. For details, see the documentation on custom attributes.

Name | Description
name | Name of the attribute to set by the extension. For example, "my_attr" will be available as token._.my_attr. str
default | Optional default value of the attribute if no getter or method is defined. Optional[Any]
method | Set a custom method on the object, for example token._.compare(other_token). Optional[Callable[[Token, ...], Any]]
getter | Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute. Optional[Callable[[Token], Any]]
setter | Setter function that takes the Token and a value, and modifies the object. Is called when the user writes to the Token._ attribute. Optional[Callable[[Token, Any], None]]
force | Force overwriting existing attribute. bool
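As a sketch (the attribute name "is_fruit" and the word list are made up for illustration), a getter-backed extension can be registered once and then read on any token via `._`:

```python
import spacy
from spacy.tokens import Token

# Hypothetical extension: flag tokens whose text is in a small fruit list.
fruits = {"apple", "apples", "orange", "oranges"}
Token.set_extension(
    "is_fruit",
    getter=lambda token: token.text.lower() in fruits,
    force=True,  # overwrite if already registered, e.g. on re-run
)

nlp = spacy.blank("en")
doc = nlp("I like apples")
print(doc[2]._.is_fruit)  # True: "apples" is in the fruit set
```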

Token.get_extension classmethod

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Name | Description
name | Name of the extension. str

Token.has_extension classmethod

Check whether an extension has been registered on the Token class.

Name | Description
name | Name of the extension to check. str

Token.remove_extension classmethod

Remove a previously registered extension.

Name | Description
name | Name of the extension. str
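The three classmethods above can be combined into a small lifecycle sketch (the extension name "my_attr" is hypothetical):

```python
from spacy.tokens import Token

# Register an extension with only a default value.
Token.set_extension("my_attr", default=None, force=True)
assert Token.has_extension("my_attr")

# get_extension returns the (default, method, getter, setter) 4-tuple.
default, method, getter, setter = Token.get_extension("my_attr")
assert default is None and method is None and getter is None and setter is None

# After removal, has_extension reports False again.
Token.remove_extension("my_attr")
assert not Token.has_extension("my_attr")
```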

Token.check_flag method

Check the value of a boolean flag.

Name | Description
flag_id | The attribute ID of the flag to check. int
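A short sketch using built-in flag IDs from spacy.attrs (the sentence is illustrative; these lexical flags work without a trained model):

```python
import spacy
from spacy.attrs import IS_PUNCT, IS_TITLE

nlp = spacy.blank("en")
doc = nlp("Give it back!")

print(doc[0].check_flag(IS_TITLE))  # True: "Give" is titlecased
print(doc[3].check_flag(IS_PUNCT))  # True: "!" is punctuation
```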

Token.similarity method (needs model)

Compute a semantic similarity estimate. The default is cosine similarity over vectors.

Name | Description
other | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. Union[Doc, Span, Token, Lexeme]
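Normally this relies on a pipeline with pretrained vectors. As a self-contained sketch, the toy 3-dimensional vectors below are set by hand via Vocab.set_vector, so the numbers are illustrative rather than meaningful:

```python
import numpy
import spacy

nlp = spacy.blank("en")
# Hand-set toy vectors; a real pipeline would provide pretrained ones.
nlp.vocab.set_vector("apples", numpy.asarray([1.0, 0.0, 0.0], dtype="float32"))
nlp.vocab.set_vector("oranges", numpy.asarray([1.0, 1.0, 0.0], dtype="float32"))

doc = nlp("apples oranges")
sim = doc[0].similarity(doc[1])  # cosine of the two toy vectors, about 0.71
```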

Token.nbor method

Get a neighboring token.

Name | Description
i | The relative position of the token to get. Defaults to 1. int
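A quick sketch (illustrative sentence): positive offsets look ahead, negative ones look back, and the offset defaults to 1.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back")

assert doc[1].nbor().text == "back"    # defaults to i=1, the next token
assert doc[1].nbor(-1).text == "Give"  # the previous token
```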

Token.set_morph method

Set the morphological analysis from a UD FEATS string, a hash value of a UD FEATS string, a features dict or a MorphAnalysis. The value None can be used to reset the morph to an unset state.

Name | Description
features | The morphological features to set. Union[int, dict, str, MorphAnalysis, None]
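A sketch of setting and resetting morphology by hand with a UD FEATS string (no trained morphologizer required; it also previews Token.has_morph, described below):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like apples")
token = doc[2]

assert not token.has_morph()      # no morph annotation yet
token.set_morph("Number=Plur")    # UD FEATS string
assert token.has_morph()
assert str(token.morph) == "Number=Plur"

token.set_morph(None)             # reset to the unset state
assert not token.has_morph()
```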

Token.has_morph method

Check whether the token has annotated morph information. Returns False when the morph annotation is unset or missing.

Token.is_ancestor method (needs model)

Check whether this token is a parent, grandparent, etc. of another in the dependency tree.

Name | Description
descendant | Another token. Token
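Instead of running a parser, the sketch below builds a Doc with hand-annotated heads and dependency labels (the annotations are illustrative), which is enough to exercise is_ancestor and the ancestors property described below:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# heads are absolute token indices; the ROOT points at itself.
doc = Doc(
    nlp.vocab,
    words=["I", "like", "green", "apples"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "amod", "dobj"],
)

assert doc[1].is_ancestor(doc[2])  # "like" dominates "green" via "apples"
assert [t.text for t in doc[2].ancestors] == ["apples", "like"]
```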

Token.ancestors property (needs model)

A sequence of the token's syntactic ancestors (parents, grandparents, etc.).

Token.conjuncts property (needs model)

A tuple of coordinated tokens, not including the token itself.
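A sketch with hand-annotated coordination rather than parser output (the annotation follows the usual UD convention where the second conjunct attaches to the first):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# "oranges" is conjoined to "apples"; "and" attaches to "oranges" as cc.
doc = Doc(
    nlp.vocab,
    words=["apples", "and", "oranges"],
    heads=[0, 2, 0],
    deps=["ROOT", "cc", "conj"],
)

print([t.text for t in doc[0].conjuncts])  # coordinated tokens, excluding "apples" itself
```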

Token.children property (needs model)

A sequence of the token's immediate syntactic children.

Token.lefts property (needs model)

The leftward immediate children of the word in the syntactic dependency parse.

Token.rights property (needs model)

The rightward immediate children of the word in the syntactic dependency parse.

Token.n_lefts property (needs model)

The number of leftward immediate children of the word in the syntactic dependency parse.

Token.n_rights property (needs model)

The number of rightward immediate children of the word in the syntactic dependency parse.

Token.subtree property (needs model)

A sequence containing the token and all the token's syntactic descendants.
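The tree-navigation properties above (children, lefts, rights, n_lefts, n_rights, subtree) can be exercised together on a Doc with hand-annotated heads and deps (annotations illustrative, no parser needed):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=["I", "like", "green", "apples"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "amod", "dobj"],
)
root = doc[1]  # "like"

assert [t.text for t in root.children] == ["I", "apples"]
assert [t.text for t in root.lefts] == ["I"]       # children to the left
assert [t.text for t in root.rights] == ["apples"] # children to the right
assert root.n_lefts == 1 and root.n_rights == 1

# subtree yields the token plus all descendants, in document order.
assert [t.text for t in doc[3].subtree] == ["green", "apples"]
```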

Token.has_vector property (needs model)

A boolean value indicating whether a word vector is associated with the token.

Token.vector property (needs model)

A real-valued meaning representation.

Token.vector_norm property (needs model)

The L2 norm of the token's vector representation.
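A sketch tying the three vector properties together; as above, a toy vector is set by hand so no pretrained vectors are required (the values are illustrative):

```python
import numpy
import spacy

nlp = spacy.blank("en")
nlp.vocab.set_vector("apples", numpy.asarray([3.0, 4.0], dtype="float32"))

doc = nlp("apples and")
assert doc[0].has_vector                      # vector was set for "apples"
assert doc[0].vector.shape == (2,)            # the raw vector
assert abs(doc[0].vector_norm - 5.0) < 1e-5   # L2 norm of [3, 4] is 5
assert not doc[1].has_vector                  # no vector set for "and"
```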

Attributes

Name | Description
doc | The parent document. Doc
lex (v3.0) | The underlying lexeme. Lexeme
sent | The sentence span that this token is a part of. Span
text | Verbatim text content. str
text_with_ws | Text content, with trailing space character if present. str
whitespace_ | Trailing space character if present. str
orth | ID of the verbatim text content. int
orth_ | Verbatim text content (identical to Token.text). Exists mostly for consistency with the other attributes. str
vocab | The vocab object of the parent Doc. Vocab
tensor | The token's slice of the parent Doc's tensor. numpy.ndarray
head | The syntactic parent, or "governor", of this token. Token
left_edge | The leftmost token of this token's syntactic descendants. Token
right_edge | The rightmost token of this token's syntactic descendants. Token
i | The index of the token within the parent document. int
ent_type | Named entity type. int
ent_type_ | Named entity type. str
ent_iob | IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set. int
ent_iob_ | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. str
ent_kb_id | Knowledge base ID that refers to the named entity this token is a part of, if any. int
ent_kb_id_ | Knowledge base ID that refers to the named entity this token is a part of, if any. str
ent_id | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. int
ent_id_ | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. str
lemma | Base form of the token, with no inflectional suffixes. int
lemma_ | Base form of the token, with no inflectional suffixes. str
norm | The token's norm, i.e. a normalized form of the token text. Can be set in the language's tokenizer exceptions. int
norm_ | The token's norm, i.e. a normalized form of the token text. Can be set in the language's tokenizer exceptions. str
lower | Lowercase form of the token. int
lower_ | Lowercase form of the token text. Equivalent to Token.text.lower(). str
shape | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". int
shape_ | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". str
prefix | Hash value of a length-N substring from the start of the token. Defaults to N=1. int
prefix_ | A length-N substring from the start of the token. Defaults to N=1. str
suffix | Hash value of a length-N substring from the end of the token. Defaults to N=3. int
suffix_ | Length-N substring from the end of the token. Defaults to N=3. str
is_alpha | Does the token consist of alphabetic characters? Equivalent to token.text.isalpha(). bool
is_ascii | Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text). bool
is_digit | Does the token consist of digits? Equivalent to token.text.isdigit(). bool
is_lower | Is the token in lowercase? Equivalent to token.text.islower(). bool
is_upper | Is the token in uppercase? Equivalent to token.text.isupper(). bool
is_title | Is the token in titlecase? Equivalent to token.text.istitle(). bool
is_punct | Is the token punctuation? bool
is_left_punct | Is the token a left punctuation mark, e.g. "("? bool
is_right_punct | Is the token a right punctuation mark, e.g. ")"? bool
is_sent_start | Does the token start a sentence? bool or None if unknown. Defaults to True for the first token in the Doc.
is_sent_end | Does the token end a sentence? bool or None if unknown.
is_space | Does the token consist of whitespace characters? Equivalent to token.text.isspace(). bool
is_bracket | Is the token a bracket? bool
is_quote | Is the token a quotation mark? bool
is_currency | Is the token a currency symbol? bool
like_url | Does the token resemble a URL? bool
like_num | Does the token represent a number? e.g. "10.9", "10", "ten", etc. bool
like_email | Does the token resemble an email address? bool
is_oov | Is the token out-of-vocabulary (i.e. does it not have a word vector)? bool
is_stop | Is the token part of a "stop list"? bool
pos | Coarse-grained part-of-speech from the Universal POS tag set. int
pos_ | Coarse-grained part-of-speech from the Universal POS tag set. str
tag | Fine-grained part-of-speech. int
tag_ | Fine-grained part-of-speech. str
morph (v3.0) | Morphological analysis. MorphAnalysis
dep | Syntactic dependency relation. int
dep_ | Syntactic dependency relation. str
lang | Language of the parent document's vocabulary. int
lang_ | Language of the parent document's vocabulary. str
prob | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). float
idx | The character offset of the token within the parent document. int
sentiment | A scalar value indicating the positivity or negativity of the token. float
lex_id | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. int
rank | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. int
cluster | Brown cluster ID. int
_ | User space for adding custom attribute extensions. Underscore
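Many of the attributes above are purely lexical and available without any trained components, so a blank pipeline is enough to inspect them (the sentence is illustrative; attributes like pos_ and dep_ would additionally need a trained pipeline):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple sells 10 phones")

t = doc[0]
assert t.text == "Apple" and t.is_alpha and t.is_title
assert t.shape_ == "Xxxxx"            # X for uppercase, x for lowercase
assert t.whitespace_ == " "           # trailing space after "Apple"
assert doc[1].idx == 6                # character offset of "sells"
assert doc[2].like_num and not doc[2].is_alpha  # "10"
```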