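Vocab

class

A storage class for vocabulary and other data shared across a language

The Vocab object provides a lookup table that allows you to access Lexeme objects, as well as the StringStore. It also owns underlying C-data that is shared between Doc objects.

Vocab.__init__ method

Create the vocabulary.

Name	Description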
`lex_attr_getters`	A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. Optional[Dict[str, Callable[[str], Any]]]
`strings`	A `StringStore` that maps strings to hash values, and vice versa, or a list of strings. Union[List[str],StringStore]
`lookups`	A `Lookups` that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. Optional[Lookups]
`oov_prob`	The default OOV probability. Defaults to `-20.0`. float
`vectors_name`	A name to identify the vectors table. str
`writing_system`	A dictionary describing the language’s writing system. Typically provided by `Language.Defaults`. Dict[str, Any]
`get_noun_chunks`	A function that yields base noun phrases used for `Doc.noun_chunks`. Optional[Callable[[Union[Doc, Span]], Iterator[Tuple[int, int, int]]]]
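As a minimal sketch of the constructor, assuming spaCy v3 is installed, a standalone vocabulary can be created with a list of strings. The words used here are arbitrary examples, not part of the API.

```python
from spacy.vocab import Vocab

# Build a standalone vocabulary seeded with two example strings.
vocab = Vocab(strings=["hello", "world"])

# The seeded strings are registered in the vocabulary's StringStore,
# which maps each string to a hash value and back.
hello_hash = vocab.strings["hello"]
print(hello_hash, vocab.strings[hello_hash])
```

In a typical pipeline the vocabulary is created for you and is available as `nlp.vocab`, so constructing it by hand is mostly useful for tests and small experiments.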

Vocab.__len__ method

Get the current number of lexemes in the vocabulary.

Name	Description
RETURNS	The number of lexemes in the vocabulary. int

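Vocab.__getitem__ method

Retrieve a lexeme, given an integer ID or a string. If a previously unseen string is given, a new lexeme is created and stored.

Name	Description
`id_or_string`	The hash value of a word, or its string. Union[int, str]
RETURNS	The lexeme indicated by the given ID. Lexeme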
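The following sketch illustrates how a lookup interacts with the vocabulary size, assuming spaCy v3 and an initially empty `Vocab`. The word "apple" is only an example.

```python
from spacy.vocab import Vocab

vocab = Vocab()
size_before = len(vocab)

# Looking up an unseen string creates and stores a new lexeme.
apple = vocab["apple"]
size_after = len(vocab)

# The same lexeme can also be retrieved through its hash value.
apple_hash = vocab.strings["apple"]
same_apple = vocab[apple_hash]

print(apple.text, size_before, size_after)
```

Because a string lookup has the side effect of adding an entry, `in` is the safer choice when you only want to test for membership.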

Vocab.__iter__ method

Iterate over the lexemes in the vocabulary.

Name	Description
YIELDS	An entry in the vocabulary. Lexeme
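A short sketch of iteration, assuming spaCy v3. The three words are arbitrary, and no particular iteration order is assumed, so the texts are sorted before printing.

```python
from spacy.vocab import Vocab

vocab = Vocab()
for word in ("cat", "dog", "mouse"):
    vocab[word]  # create a lexeme for each example word

# Iterating over the vocabulary yields Lexeme objects.
texts = sorted(lexeme.text for lexeme in vocab)
print(texts)
```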

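Vocab.__contains__ method

Check whether the string has an entry in the vocabulary. To get the ID for a given string, look it up in vocab.strings.

Name	Description
`string`	The ID string. str
RETURNS	Whether the string has an entry in the vocabulary. bool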
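The membership check and the ID lookup can be sketched as follows, assuming spaCy v3 and an initially empty `Vocab`. The word "apple" is only an example.

```python
from spacy.vocab import Vocab

vocab = Vocab()

# A membership test does not create an entry.
seen_before = "apple" in vocab

vocab["apple"]  # create the lexeme
seen_after = "apple" in vocab

# The ID of a string comes from the StringStore, not from the Vocab itself.
apple_id = vocab.strings["apple"]
print(seen_before, seen_after, apple_id)
```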

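Vocab.add_flag method

Set a new boolean flag on words in the vocabulary. The flag_getter function is called over the words currently in the vocabulary, and is then applied to new words as they occur. Afterwards, you can access the flag value on each token using token.check_flag(flag_id).

Name	Description
`flag_getter`	A function that takes the lexeme text and returns the boolean flag value. Callable[[str], bool]
`flag_id`	An integer between `1` and `63` (inclusive), specifying the bit at which the flag will be stored. If `-1`, the lowest available bit will be chosen. int
RETURNS	The integer ID by which the flag value can be checked. int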
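A minimal sketch of a custom flag, assuming spaCy v3. The getter `is_my_product` and its product list are hypothetical, and the `Doc` is built directly from a word list so that no trained pipeline is needed.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def is_my_product(text: str) -> bool:
    # Hypothetical getter: flag a handful of product names.
    return text in {"spaCy", "Thinc", "displaCy"}


vocab = Vocab()

# With the default flag_id of -1, the lowest available bit is chosen.
MY_PRODUCT = vocab.add_flag(is_my_product)

# Words that enter the vocabulary later get the flag computed as well.
doc = Doc(vocab, words=["I", "like", "spaCy"])
flags = [token.check_flag(MY_PRODUCT) for token in doc]
print(flags)
```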

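Vocab.reset_vectors method

Drop the current vector table. Because all vectors must be the same width, you have to call this to change the size of the vectors. Only one of the width and shape keyword arguments can be specified.

Name	Description
keyword-only
`width`	The new width. int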
`shape`	The new shape, as `(rows, width)`. Tuple[int, int]
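The sketch below resets the table to a new width and shows that combining both keyword arguments is rejected. It assumes spaCy v3, that `shape` is given as a `(rows, width)` tuple, and that the conflict surfaces as a `ValueError`. The sizes are arbitrary.

```python
from spacy.vocab import Vocab

vocab = Vocab()

# Drop the current (empty) table and switch to 300-dimensional vectors.
vocab.reset_vectors(width=300)
width_after_reset = vocab.vectors_length

# Only one of width and shape may be given.
try:
    vocab.reset_vectors(width=300, shape=(10, 300))
    raised = False
except ValueError:
    raised = True

print(width_after_reset, raised)
```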

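Vocab.prune_vectors method

Reduce the current vector table to nr_row unique entries. Words mapped to the discarded vectors are remapped to the closest vector among those remaining. For example, suppose the original table had vectors for the words ['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to two rows, we discard the vectors for "feline" and "reclined". These words are then remapped to the closest remaining vector, so "feline" shares the same vector as "cat", and "reclined" shares the same vector as "sat". Similarity is judged by cosine. The original vectors may be large, so the cosines are calculated in minibatches to reduce memory usage.

Name	Description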
`nr_row`	The number of rows to keep in the vector table. int
`batch_size`	Batch of vectors for calculating the similarities. Larger batch sizes might be faster, while temporarily requiring more memory. int
RETURNS	A dictionary keyed by removed words mapped to `(string, score)` tuples, where `string` is the entry the removed word was mapped to, and `score` the similarity score between the two words. Dict[str, Tuple[str, float]]
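The example from the description can be sketched with toy data, assuming spaCy v3 and NumPy. The three-dimensional vectors are invented so that "feline" is close to "cat" and "reclined" is close to "sat". The sketch also assumes that, with no word frequency data loaded, the rows that were added first are the ones kept.

```python
import numpy
from spacy.vocab import Vocab

vocab = Vocab()

# Invented toy vectors, added in the order used in the description.
example_vectors = {
    "sat": [1.0, 0.1, 0.0],
    "cat": [0.0, 1.0, 0.1],
    "feline": [0.1, 0.9, 0.2],
    "reclined": [0.9, 0.2, 0.1],
}
for word, values in example_vectors.items():
    vocab.set_vector(word, numpy.asarray(values, dtype="float32"))

# Keep two rows and remap the discarded words to their nearest neighbour.
remap = vocab.prune_vectors(2, batch_size=2)
for word, (neighbour, score) in remap.items():
    print(word, "->", neighbour, round(float(score), 3))
```

A larger `batch_size` trades memory for speed. The value of 2 here only keeps the toy example small.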

Vocab.deduplicate_vectors method v3.3

Remove any duplicate rows from the current vector table, maintaining the mappings for all words in the vectors.
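A small sketch of deduplication, assuming spaCy v3.3 or later and NumPy. The words and vectors are invented: "cat" and "kitty" are given identical vectors so that one row becomes redundant.

```python
import numpy
from spacy.vocab import Vocab

vocab = Vocab()
shared = numpy.asarray([1.0, 2.0, 3.0], dtype="float32")
vocab.set_vector("cat", shared)
vocab.set_vector("kitty", shared.copy())  # duplicate row
vocab.set_vector("dog", numpy.asarray([3.0, 2.0, 1.0], dtype="float32"))

vocab.deduplicate_vectors()

# Only unique rows remain, but every word still resolves to its vector.
rows_after = vocab.vectors.shape[0]
print(rows_after, vocab.get_vector("kitty"))
```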

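Vocab.get_vector method

Retrieve a vector for a word in the vocabulary. Words can be looked up by string or hash value. If the current vectors do not contain an entry for the word, a 0-vector with the same number of dimensions as the current vectors (Vocab.vectors_length) is returned.

Name	Description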
`orth`	The hash value of a word, or its unicode string. Union[int, str]
RETURNS	A word vector. Size and shape are determined by the `Vocab.vectors` instance. numpy.ndarray[ndim=1, dtype=float32]

Vocab.set_vector method

Set a vector for a word in the vocabulary. Words can be referenced by string or hash value.

Name	Description
`orth`	The hash value of a word, or its unicode string. Union[int, str]
`vector`	The vector to set. numpy.ndarray[ndim=1, dtype=float32]
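Setting and retrieving a vector can be sketched as follows, assuming spaCy v3 and NumPy. The words and the three-dimensional vector are arbitrary examples.

```python
import numpy
from spacy.vocab import Vocab

vocab = Vocab()
vocab.set_vector("apple", numpy.asarray([1.0, 2.0, 3.0], dtype="float32"))

# The word can be looked up by string or by hash value.
by_string = vocab.get_vector("apple")
by_hash = vocab.get_vector(vocab.strings["apple"])

# A word without an entry yields a zero vector of the current width.
unknown = vocab.get_vector("pear")
print(by_string, unknown, vocab.vectors_length)
```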

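Vocab.has_vector method

Check whether a word has a vector. Returns False if no vectors are loaded. Words can be looked up by string or hash value.

Name	Description
`orth`	The hash value of a word, or its unicode string. Union[int, str]
RETURNS	Whether the word has a vector. bool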
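A brief sketch of the check, assuming spaCy v3 and NumPy. The words and the vector values are invented.

```python
import numpy
from spacy.vocab import Vocab

vocab = Vocab()

# No vectors are loaded yet, so the check is expected to be False.
before_loading = vocab.has_vector("apple")

vocab.set_vector("apple", numpy.asarray([1.0, 2.0, 3.0], dtype="float32"))
has_apple = vocab.has_vector("apple")
has_apple_by_hash = vocab.has_vector(vocab.strings["apple"])
has_pear = vocab.has_vector("pear")
print(before_loading, has_apple, has_apple_by_hash, has_pear)
```

Checking `has_vector` first distinguishes a genuine all-zero vector from the zero vector that `get_vector` returns for a missing entry.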

Vocab.to_disk method

Save the current state to a directory.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]

Vocab.from_disk method

Load state from a directory. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The modified `Vocab` object. Vocab
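A round trip through a directory might look like the sketch below, assuming spaCy v3. The temporary directory and the `vocab` subdirectory name are hypothetical choices for the example.

```python
import tempfile
from pathlib import Path

from spacy.vocab import Vocab

vocab = Vocab(strings=["apple", "orange"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "vocab"  # hypothetical directory name
    vocab.to_disk(path)  # the directory is created if it is missing
    saved_files = sorted(p.name for p in path.iterdir())

    # from_disk modifies the object in place and also returns it.
    restored = Vocab()
    returned = restored.from_disk(path)

print(saved_files, "apple" in restored.strings)
```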

Vocab.to_bytes method

Serialize the current state to a binary string.

Name	Description
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The serialized form of the `Vocab` object. bytes

Vocab.from_bytes method

Load state from a binary string.

Name	Description
`bytes_data`	The data to load from. bytes
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `Vocab` object. Vocab
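The in-memory round trip can be sketched like this, assuming spaCy v3. The word "apple" is only an example.

```python
from spacy.vocab import Vocab

vocab = Vocab(strings=["apple"])

# Serialize to a binary string, then load it into a fresh Vocab.
vocab_bytes = vocab.to_bytes()
restored = Vocab().from_bytes(vocab_bytes)

print(type(vocab_bytes).__name__, "apple" in restored.strings)
```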

Attributes

Name	Description
`strings`	A table managing the string-to-int mapping. StringStore
`vectors`	A table associating word IDs to word vectors. Vectors
`vectors_length`	Number of dimensions for each word vector. int
`lookups`	The available lookup tables in this vocab. Lookups
`writing_system`	A dict with information about the language’s writing system. Dict[str, Any]
`get_noun_chunks` v3.0	A function that yields base noun phrases used for `Doc.noun_chunks`. Optional[Callable[[Union[Doc, Span]], Iterator[Tuple[int, int, int]]]]

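Serialization fields

During serialization, spaCy exports several data fields that are used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in their string names via the exclude argument.

Name	Description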
`strings`	The strings in the `StringStore`.
`vectors`	The word vectors, if available.
`lookups`	The lookup tables, if available.
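As a hedged illustration of the `exclude` argument, the sketch below leaves the vectors out of the serialized data, assuming spaCy v3 and NumPy. The word and its vector are invented.

```python
import numpy
from spacy.vocab import Vocab

vocab = Vocab()
vocab.set_vector("apple", numpy.asarray([1.0, 2.0, 3.0], dtype="float32"))

# Skip the "vectors" field; the strings are still serialized.
without_vectors = vocab.to_bytes(exclude=["vectors"])
restored = Vocab().from_bytes(without_vectors)

print("apple" in restored.strings, restored.has_vector("apple"))
```

Excluding the vectors keeps the payload small when the vector table is stored or shipped separately.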
