Vectors

class

Store, save and load word vectors

Vector data is kept in the `Vectors.data` attribute, which should be an instance of `numpy.ndarray` (for CPU vectors) or `cupy.ndarray` (for GPU vectors).

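As of spaCy v3.2, `Vectors` supports two types of vector tables:

default: A standard vector table (as in spaCy v3.1 and earlier) where each key is mapped to one row in the vector table. Multiple keys can be mapped to the same vector, and not all of the rows in the table need to be assigned, so `vectors.n_keys` may be greater or smaller than `vectors.shape[0]`.
floret: Only supports vectors trained with floret, an extended version of fastText that produces compact vector tables by combining fastText's subword n-grams with Bloom embeddings. The compact tables are similar to the `HashEmbed` embeddings already used in many spaCy components. Each word is represented as the sum of one or more rows, as determined by the settings related to character n-grams and the hash table.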

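Vectors.__init__ method

Create a new vector store. In default mode, you can either set the vector values and keys directly on initialization, or provide a `shape` keyword argument to create an empty table that you can add vectors to later. In floret mode, the complete vector data and settings must be provided on initialization and cannot be modified afterwards.

Name	Description
keyword-only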
`strings`	The string store. A new string store is created if one is not provided. Defaults to `None`. Optional[StringStore]
`shape`	Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you’re initializing the object with `data` and `keys`. Tuple[int, int]
`data`	The vector data. numpy.ndarray[ndim=2, dtype=float32]
`keys`	A sequence of keys aligned with the data. Iterable[Union[str, int]]
`name`	A name to identify the vectors table. str
`mode` v3.2	Vectors mode: `"default"` or `"floret"` (default: `"default"`). str
`minn` v3.2	The floret char ngram minn (default: `0`). int
`maxn` v3.2	The floret char ngram maxn (default: `0`). int
`hash_count` v3.2	The floret hash count. Supported values: 1 to 4 (default: `1`). int
`hash_seed` v3.2	The floret hash seed (default: `0`). int
`bow` v3.2	The floret BOW string (default: `"<"`). str
`eow` v3.2	The floret EOW string (default: `">"`). str
`attr` v3.6	The token attribute for the vector keys (default: `"ORTH"`). Union[int, str]
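
The two default-mode initialization paths described above can be sketched as follows. The table sizes and the keys `cat`, `dog` and `rat` are illustrative assumptions, not values required by the API.

```python
import numpy
from spacy.vectors import Vectors

# Path 1: an empty table with room for 4 vectors of width 3.
# Vectors can be added to it later.
empty_vectors = Vectors(shape=(4, 3))

# Path 2: a table initialized directly from data and aligned keys.
# Row i of `data` becomes the vector for keys[i].
data = numpy.zeros((3, 3), dtype="f")
keys = ["cat", "dog", "rat"]
vectors = Vectors(data=data, keys=keys)
```

In the first case no keys exist yet, while in the second every row already has a key.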

Vectors.__getitem__ method

Get a vector by key. If the key is not found in the table, a `KeyError` is raised.

Name	Description
`key`	The key to get the vector for. Union[int, str]
RETURNS	The vector for the key. numpy.ndarray[ndim=1, dtype=float32]
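
A small sketch of a lookup and of the missing-key case. The toy data is an assumption, and a standalone `StringStore` is used here only to compute the integer hash IDs for the example words.

```python
import numpy
from spacy.strings import StringStore
from spacy.vectors import Vectors

data = numpy.asarray([[1.0, 0.0], [0.0, 1.0]], dtype="f")
vectors = Vectors(data=data, keys=["cat", "dog"])

# Compute the hash IDs that the table uses internally as keys.
strings = StringStore()
cat_id = strings.add("cat")
unicorn_id = strings.add("unicorn")

# Look up a vector by its integer key.
cat_vector = vectors[cat_id]

# A key that was never added raises KeyError.
try:
    vectors[unicorn_id]
    missing_raised = False
except KeyError:
    missing_raised = True
```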

Vectors.__setitem__ method

Set a vector for the given key. Not supported in floret mode.

Name	Description
`key`	The key to set the vector for. int
`vector`	The vector to set. numpy.ndarray[ndim=1, dtype=float32]
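
As a hedged illustration, the sketch below overwrites the vector of a key that is already in the table. The toy table and the use of a separate `StringStore` to obtain the integer key are assumptions made for the example.

```python
import numpy
from spacy.strings import StringStore
from spacy.vectors import Vectors

vectors = Vectors(data=numpy.zeros((2, 3), dtype="f"), keys=["cat", "dog"])

# The key is passed as an integer hash ID.
cat_id = StringStore().add("cat")

# Replace the vector stored for "cat"; the row for "dog" is untouched.
vectors[cat_id] = numpy.ones((3,), dtype="f")
```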

Vectors.__iter__ method

Iterate over the keys in the table. In floret mode, the keys table is not used.

Name	Description
YIELDS	A key in the table. int

Vectors.__len__ method

Return the number of vectors in the table.

Name	Description
RETURNS	The number of vectors in the table. int
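
Iteration and `len()` can be tried together on a small table, as in this sketch. The three-word table is an assumption for illustration.

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(data=numpy.zeros((3, 2), dtype="f"), keys=["cat", "dog", "rat"])

# Iterating over the table yields the integer hash keys, not the vectors.
key_ids = list(vectors)

# len() reports the number of vectors (rows) in the table.
n_vectors = len(vectors)
```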

Vectors.__contains__ method

Check whether a key has been mapped to a vector entry in the table. In floret mode, returns `True` for all keys.

Name	Description
`key`	The key to check. int
RETURNS	Whether the key has a vector entry. bool
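
Because the membership check takes an integer key, a sketch of it needs the hash ID of the word rather than the string itself. The words and the separate `StringStore` below are assumptions for the example.

```python
import numpy
from spacy.strings import StringStore
from spacy.vectors import Vectors

vectors = Vectors(data=numpy.zeros((2, 3), dtype="f"), keys=["cat", "dog"])

strings = StringStore()
cat_id = strings.add("cat")
unicorn_id = strings.add("unicorn")

has_cat = cat_id in vectors          # key was added at initialization
has_unicorn = unicorn_id in vectors  # key was never added
```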

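Vectors.add method

Add a key to the table, optionally setting a vector value as well. Keys can be mapped to an existing vector by setting `row`, or a new vector can be added. Not supported in floret mode.

Name	Description
`key`	The key to add. Union[str, int]
keyword-only
`vector`	An optional vector to add for the key. numpy.ndarray[ndim=1, dtype=float32]
`row`	An optional row number of a vector to map the key to. int
RETURNS	The row the vector was added to. int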
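
The two ways of adding a key, with a new vector or pointing at an existing row, might look like this. The words `cat` and `kitty` and the table size are illustrative assumptions.

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(shape=(2, 3))

# Add a key together with a new vector; the row it landed in is returned.
row_cat = vectors.add("cat", vector=numpy.ones((3,), dtype="f"))

# Map a second key to the same row instead of storing a new vector.
row_kitty = vectors.add("kitty", row=row_cat)
```

Both keys now resolve to the same stored vector, so the table holds two keys but uses only one row.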

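Vectors.resize method

Resize the underlying vectors array. If `inplace=True`, the memory is reallocated. This may cause other references to the data to become invalid, so only use `inplace=True` if you are certain that is what you want. If the number of vectors is reduced, keys mapped to the removed rows are dropped. These removed items are returned as a list of `(key, row)` tuples. Not supported in floret mode.

Name	Description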
`shape`	A `(rows, dims)` tuple describing the number of rows and dimensions. Tuple[int, int]
`inplace`	Reallocate the memory. bool
RETURNS	The removed items as a list of `(key, row)` tuples. List[Tuple[int, int]]
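
A sketch of shrinking a table and inspecting what was dropped. The three-row toy table is an assumption, and `inplace` is left at its default.

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(data=numpy.zeros((3, 4), dtype="f"), keys=["cat", "dog", "rat"])

# Shrink the table from 3 rows to 2. The key mapped to row 2 ("rat")
# no longer has a row, so it is removed and reported back.
removed = vectors.resize((2, 4))
```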

Vectors.keys method

A sequence of the keys in the table. In floret mode, the keys table is not used.

Name	Description
RETURNS	The keys. Iterable[int]

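Vectors.values method

Iterate over vectors that have been assigned to at least one key. Note that some vectors may be unassigned, so the number of vectors returned may be less than the length of the vectors table. In floret mode, the keys table is not used.

Name	Description
YIELDS	A vector in the table. numpy.ndarray[ndim=1, dtype=float32]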

Vectors.items method

Iterate over `(key, vector)` pairs, in order. In floret mode, the keys table is empty.

Name	Description
YIELDS	`(key, vector)` pairs, in order. Tuple[int,numpy.ndarray[ndim=1, dtype=float32]]
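
As a rough sketch of the dictionary-style accessors, the example below collects the keys and builds a key-to-vector mapping from `items()`. The toy data and the separate `StringStore` used to compute a hash ID are assumptions.

```python
import numpy
from spacy.strings import StringStore
from spacy.vectors import Vectors

data = numpy.asarray([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype="f")
vectors = Vectors(data=data, keys=["cat", "dog", "rat"])

# All integer keys currently stored in the table.
all_keys = list(vectors.keys())

# Build a plain dict from the (key, vector) pairs.
by_key = {key: vector for key, vector in vectors.items()}

dog_id = StringStore().add("dog")
dog_vector = by_key[dog_id]
```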

Vectors.find method

Look up one or more keys by row, or vice versa. Not supported in floret mode.

Name	Description
keyword-only
`key`	Find the row that the given key points to. Returns int, `-1` if missing. Union[str, int]
`keys`	Find rows that the keys point to. Returns `numpy.ndarray`. Iterable[Union[str, int]]
`row`	Find the first key that points to the row. Returns integer. int
`rows`	Find the keys that point to the rows. Returns `numpy.ndarray`. Iterable[int]
RETURNS	The requested key, keys, row or rows. Union[int,numpy.ndarray[ndim=1, dtype=float32]]
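
The four lookup directions can be sketched on a small table. Exactly one of the keyword arguments is passed per call, and the words used are assumptions for the example.

```python
import numpy
from spacy.strings import StringStore
from spacy.vectors import Vectors

vectors = Vectors(data=numpy.zeros((3, 2), dtype="f"), keys=["cat", "dog", "rat"])

# Key to row: "dog" was the second key, so it sits in row 1.
dog_row = vectors.find(key="dog")

# A key that is not in the table gives -1.
missing_row = vectors.find(key="unicorn")

# Several keys to several rows.
rows = vectors.find(keys=["cat", "rat"])

# Row to key, and several rows to several keys (integer hash IDs).
key_for_row0 = vectors.find(row=0)
keys_for_rows = vectors.find(rows=[0, 1])

cat_id = StringStore().add("cat")
```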

Vectors.shape property

Get `(rows, dims)` tuples of number of rows and number of dimensions in the vector table.

Name	Description
RETURNS	A `(rows, dims)` pair. Tuple[int, int]

Vectors.size property

The vector size, i.e. `rows * dims`.

Name	Description
RETURNS	The vector size. int

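Vectors.is_full property

Whether the vectors table is full and has no slots available for new keys. If a table is full, it can be resized using `Vectors.resize`. In floret mode, the table is always full and cannot be resized.

Name	Description
RETURNS	Whether the vectors table is full. bool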
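
A sketch of how `is_full` changes as a one-row table is filled and then resized. The table size and the word are assumptions for the example.

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(shape=(1, 3))
full_when_empty = vectors.is_full   # one free row is available

vectors.add("cat", vector=numpy.ones((3,), dtype="f"))
full_after_add = vectors.is_full    # the only row is now taken

# Grow the table by one row to make space for another key.
vectors.resize((2, 3))
full_after_resize = vectors.is_full
```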

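Vectors.n_keys property

Get the number of keys in the table. Note that this is the number of all keys, not just unique vectors. If several keys are mapped to the same vector, they are counted individually. In floret mode, the keys table is not used.

Name	Description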
RETURNS	The number of all keys in the table. Returns `-1` for floret vectors. int
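
To make the difference between the key count and the row count concrete, here is a sketch with one table that has more keys than rows and one that has fewer. All names and sizes are illustrative assumptions.

```python
from spacy.vectors import Vectors

# More keys than rows: three keys share the single row 0.
shared = Vectors(shape=(1, 2))
for word in ["cat", "kitty", "feline"]:
    shared.add(word, row=0)

# Fewer keys than rows: only one of the four rows has a key.
sparse = Vectors(shape=(4, 2))
sparse.add("cat")
```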

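Vectors.most_similar method

For each of the given vectors, find the `n` most similar entries to it by cosine similarity. Queries are by vector, and results are returned as a `(keys, best_rows, scores)` tuple. If `queries` is large, the calculations are performed in chunks to avoid consuming too much memory. You can set `batch_size` to control the space/performance trade-off during the calculations. Not supported in floret mode.

Name	Description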
`queries`	An array with one or more vectors. numpy.ndarray
keyword-only
`batch_size`	The batch size to use. Defaults to `1024`. int
`n`	The number of entries to return for each query. Defaults to `1`. int
`sort`	Whether to sort the entries returned by score. Defaults to `True`. bool
RETURNS	The most similar entries as a `(keys, best_rows, scores)` tuple. Tuple[numpy.ndarray,numpy.ndarray,numpy.ndarray]
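
A minimal sketch of a single-vector query against a toy two-dimensional table. The data, the words and the query values are assumptions chosen so that the nearest entry is easy to verify by hand.

```python
import numpy
from spacy.strings import StringStore
from spacy.vectors import Vectors

data = numpy.asarray([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype="f")
vectors = Vectors(data=data, keys=["cat", "dog", "rat"])

# Queries are a 2D float array with one row per query vector.
queries = numpy.asarray([[0.9, 0.1]], dtype="f")

# Ask for the single closest entry per query.
keys, best_rows, scores = vectors.most_similar(queries, n=1)

cat_id = StringStore().add("cat")
```

The query points almost along the first axis, so row 0 (the `cat` vector) is the closest by cosine similarity.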

Vectors.get_batch method v3.2

Get the vectors for the provided keys efficiently as a batch.

Name	Description
`keys`	The keys. Iterable[Union[int, str]]
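
A short sketch of a batched lookup, assuming spaCy v3.2 or later and a toy table. The result has one row per requested key, in the order the keys were given.

```python
import numpy
from spacy.vectors import Vectors

data = numpy.asarray([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype="f")
vectors = Vectors(data=data, keys=["cat", "dog", "rat"])

# Fetch two vectors in one call instead of two separate lookups.
batch = vectors.get_batch(["rat", "cat"])
```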

Vectors.to_ops method

Change the embedding matrix to use different Thinc ops.

Name	Description
`ops`	The Thinc ops to switch the embedding matrix to. Ops

Vectors.to_disk method

Save the current state to a directory.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]

Vectors.from_disk method

Load state from a directory. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
RETURNS	The modified `Vectors` object. Vectors
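
A round trip through `to_disk` and `from_disk` can be sketched with a temporary directory. The directory name `vectors` and the toy data are assumptions for the example.

```python
import tempfile
from pathlib import Path

import numpy
from spacy.vectors import Vectors

data = numpy.asarray([[1.0, 0.0], [0.0, 1.0]], dtype="f")
vectors = Vectors(data=data, keys=["cat", "dog"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "vectors"
    # The directory is created if it does not exist yet.
    vectors.to_disk(path)
    # from_disk fills a fresh object in place and returns it.
    restored = Vectors().from_disk(path)
```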

Vectors.to_bytes method

Serialize the current state to a binary string.

Name	Description
RETURNS	The serialized form of the `Vectors` object. bytes

Vectors.from_bytes method

Load state from a binary string.

Name	Description
`data`	The data to load from. bytes
RETURNS	The `Vectors` object. Vectors
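
The in-memory equivalent of the disk round trip, as a sketch with assumed toy data:

```python
import numpy
from spacy.vectors import Vectors

data = numpy.asarray([[1.0, 0.0], [0.0, 1.0]], dtype="f")
vectors = Vectors(data=data, keys=["cat", "dog"])

# Serialize to a bytes object, then load it into a fresh Vectors instance.
vectors_bytes = vectors.to_bytes()
restored = Vectors().from_bytes(vectors_bytes)
```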

Attributes

Name	Description
`data`	Stored vectors data. `numpy` is used for CPU vectors, `cupy` for GPU vectors. Union[numpy.ndarray[ndim=2, dtype=float32], cupy.ndarray[ndim=2, dtype=float32]]
`key2row`	Dictionary mapping word hashes to rows in the `Vectors.data` table. Dict[int, int]
`keys`	Array keeping the keys in order, such that `keys[vectors.key2row[key]] == key`. Union[numpy.ndarray[ndim=1, dtype=float32], cupy.ndarray[ndim=1, dtype=float32]]
`attr` v3.6	The token attribute for the vector keys. int
