Attributes
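Token attributes are specified using internal IDs in many places, including:

- Matcher patterns
- Doc.to_array and Doc.from_array
- Doc.has_annotation
- the attrs setting of the MultiHashEmbed Tok2Vec architecture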
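All of these methods automatically convert between the string version of an ID ("DEP") and the internal integer symbol (DEP). The internal IDs can be imported from spacy.attrs or retrieved from the StringStore. The mapping from string attribute names to internal attribute IDs is stored in spacy.attrs.IDS.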
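As a minimal sketch of this conversion, the snippet below passes the same attributes to Doc.to_array once as strings and once as integer IDs. It assumes spaCy v3 and uses a blank English pipeline; the sample text is made up for illustration.

```python
import numpy as np
import spacy
from spacy.attrs import IDS, LENGTH, ORTH

# A blank English pipeline is enough here: ORTH and LENGTH are lexical
# attributes and need no trained components.
nlp = spacy.blank("en")
doc = nlp("Hello world")

# IDS maps string attribute names to the internal integer IDs.
assert IDS["ORTH"] == ORTH
assert IDS["LENGTH"] == LENGTH

# Doc.to_array accepts either form and returns the same array.
by_name = doc.to_array(["ORTH", "LENGTH"])
by_id = doc.to_array([ORTH, LENGTH])
same = bool(np.array_equal(by_name, by_id))

# Each row is one token, each column one requested attribute.
lengths = [int(n) for n in by_id[:, 1]]
```

The ORTH column holds hash values rather than readable text; they can be resolved back to strings through nlp.vocab.strings.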
The corresponding Token object attributes can be accessed using the same names in lowercase, e.g. token.orth or token.length.
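For attributes that represent string values, the internal integer ID is accessed as Token.attr, e.g. token.dep, while the string value can be retrieved by appending an underscore (_), as in token.dep_.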
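The difference between the two accessors can be sketched as follows. To avoid depending on a trained pipeline, the Doc is built by hand, assuming the spaCy v3 Doc constructor; the words, heads and dependency labels are invented for illustration.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a Doc by hand so no trained pipeline is needed. The words, heads
# and dependency labels below are made up for illustration.
vocab = Vocab()
doc = Doc(
    vocab,
    words=["I", "like", "cats"],
    heads=[1, 1, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "dobj"],
)
token = doc[0]

dep_id = token.dep      # internal integer ID
dep_label = token.dep_  # string value

# The StringStore converts in both directions.
label_from_id = vocab.strings[dep_id]
id_from_label = vocab.strings[dep_label]
```

Comparing integer IDs is cheaper than comparing strings, which is one reason to prefer token.dep over token.dep_ in tight loops.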
| Attribute | Description | Type |
|---|---|---|
| DEP | The token’s dependency label. | str |
| ENT_ID | The token’s entity ID (ent_id). | str |
| ENT_IOB | The IOB part of the token’s entity tag. Uses custom integer values rather than the string store: unset is 0, I is 1, O is 2, and B is 3. | str |
| ENT_KB_ID | The token’s entity knowledge base ID. | str |
| ENT_TYPE | The token’s entity label. | str |
| IS_ALPHA | Token text consists of alphabetic characters. | bool |
| IS_ASCII | Token text consists of ASCII characters. | bool |
| IS_DIGIT | Token text consists of digits. | bool |
| IS_LOWER | Token text is in lowercase. | bool |
| IS_PUNCT | Token is punctuation. | bool |
| IS_SPACE | Token is whitespace. | bool |
| IS_STOP | Token is a stop word. | bool |
| IS_TITLE | Token text is in titlecase. | bool |
| IS_UPPER | Token text is in uppercase. | bool |
| LEMMA | The token’s lemma. | str |
| LENGTH | The length of the token text. | int |
| LIKE_EMAIL | Token text resembles an email address. | bool |
| LIKE_NUM | Token text resembles a number. | bool |
| LIKE_URL | Token text resembles a URL. | bool |
| LOWER | The lowercase form of the token text. | str |
| MORPH | The token’s morphological analysis. | MorphAnalysis |
| NORM | The normalized form of the token text. | str |
| ORTH | The exact verbatim text of a token. | str |
| POS | The token’s universal part of speech (UPOS). | str |
| SENT_START | Token is start of sentence. | bool |
| SHAPE | The token’s shape. | str |
| SPACY | Token has a trailing space. | bool |
| TAG | The token’s fine-grained part of speech. | str |
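As an illustration of how the attribute names in this table are used as Matcher pattern keys, here is a small sketch. It assumes spaCy v3 and a blank English pipeline; the pattern label NUMBER_THEN_WORD and the sample sentence are hypothetical choices for this example.

```python
import spacy
from spacy.matcher import Matcher

# Boolean flags such as LIKE_NUM and IS_ALPHA are lexical attributes,
# so a blank pipeline without trained components is sufficient.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Pattern keys are attribute names from the table, given as strings.
# "NUMBER_THEN_WORD" is an arbitrary label chosen for this example.
pattern = [{"LIKE_NUM": True}, {"IS_ALPHA": True}]
matcher.add("NUMBER_THEN_WORD", [pattern])

doc = nlp("I bought 3 apples")

# Each match is a (match_id, start, end) tuple of token offsets.
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
```

Attributes that depend on trained components, such as DEP, POS or ENT_TYPE, can be used in patterns the same way, but they only match once the corresponding annotation has been set on the Doc.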