cuVS Bench Datasets#

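A dataset usually consists of four binary files containing the database vectors, the query vectors, the ground truth neighbors, and their corresponding distances. For example, the Glove-100 dataset has the files base.fbin (database vectors), query.fbin (query vectors), groundtruth.neighbors.ibin (ground truth neighbors), and groundtruth.distances.fbin (ground truth distances). The first two files are used for index building and searching, while the other two are tied to a particular distance and are used for evaluation.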

The file suffixes fbin, f16bin, ibin, u8bin, and i8bin denote that the data type of the vectors stored in the file is float32, float16 (a.k.a. half), int, uint8, and int8, respectively. These binary files are little-endian, and the format is as follows: the first 8 bytes are num_vectors (uint32_t) and num_dimensions (uint32_t), and the following num_vectors * num_dimensions * sizeof(type) bytes are the vectors stored in row-major order.
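As a minimal sketch of the layout described above, the following NumPy helpers read and write such files. The names `read_bin`, `write_bin`, and `SUFFIX_TO_DTYPE` are illustrative and are not part of cuvs_bench; mapping `ibin` to 32-bit integers is an assumption.

```python
import numpy as np

# Suffix -> NumPy dtype, following the list above.
# Assumption: "int" is taken to mean a 32-bit integer.
SUFFIX_TO_DTYPE = {
    "fbin": np.float32,
    "f16bin": np.float16,
    "ibin": np.int32,
    "u8bin": np.uint8,
    "i8bin": np.int8,
}


def write_bin(path, vectors):
    """Write a 2-D array as: uint32 rows, uint32 cols, row-major payload."""
    vectors = np.ascontiguousarray(vectors)
    if vectors.ndim != 2:
        raise ValueError("expected a 2-D array of shape (num_vectors, num_dimensions)")
    little_endian = vectors.dtype.newbyteorder("<")
    with open(path, "wb") as f:
        np.array(vectors.shape, dtype="<u4").tofile(f)       # 8-byte header
        vectors.astype(little_endian, copy=False).tofile(f)  # payload


def read_bin(path):
    """Read a file in the layout above; the dtype is inferred from the suffix."""
    suffix = str(path).rsplit(".", 1)[-1]
    dtype = np.dtype(SUFFIX_TO_DTYPE[suffix]).newbyteorder("<")
    with open(path, "rb") as f:
        n_vectors, n_dims = (int(v) for v in np.fromfile(f, dtype="<u4", count=2))
        data = np.fromfile(f, dtype=dtype, count=n_vectors * n_dims)
    return data.reshape(n_vectors, n_dims)
```

A file written this way has a size of exactly 8 + num_vectors * num_dimensions * sizeof(type) bytes, which is a quick sanity check for any downloaded or converted file.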

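Some implementations can take float16 database and query vectors as input and achieve better performance with them. Use python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py to convert a dataset from float32 to float16.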
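To illustrate what such a conversion involves, here is a minimal sketch under the file layout described earlier. It is not the implementation of fbin_to_f16bin.py; the function name and signature are hypothetical, and the whole file is loaded into memory for simplicity.

```python
import numpy as np


def fbin_to_f16bin_sketch(src_path, dst_path):
    """Convert a float32 .fbin file to a float16 .f16bin file (illustrative only)."""
    with open(src_path, "rb") as f:
        # Header: num_vectors and num_dimensions as little-endian uint32.
        header = np.fromfile(f, dtype="<u4", count=2)
        data = np.fromfile(f, dtype="<f4", count=int(header[0]) * int(header[1]))
    with open(dst_path, "wb") as f:
        header.tofile(f)              # the header is unchanged
        data.astype("<f2").tofile(f)  # payload shrinks from 4 to 2 bytes per value
```

The payload is halved in size, at the cost of float16's reduced precision and range (the largest finite float16 value is 65504).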

Commonly used datasets can be downloaded from two websites. Million-scale datasets can be found in the Data sets section of ann-benchmarks.

However, these datasets are in HDF5 format. Use python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py to convert them. The usage of this script is:
$ python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py
usage: hdf5_to_fbin.py [-n] <input>.hdf5
   -n: normalize base/query set
 outputs: <input>.base.fbin
          <input>.query.fbin
          <input>.groundtruth.neighbors.ibin
          <input>.groundtruth.distances.fbin
So for one input .hdf5 file, four output binary files are produced. See the previous section for an example of preprocessing the GloVe dataset.

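Most datasets provided by ann-benchmarks use Angular or Euclidean distance. Angular denotes cosine distance. However, computing cosine distance reduces to computing an inner product if the vectors are normalized beforehand. In practice, the normalization can always be done up front to lower the computation cost, so it is better to measure the performance of inner product rather than cosine distance. The -n option of hdf5_to_fbin.py can be used to normalize the dataset.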
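The reduction from cosine to inner product can be checked numerically. The following sketch uses random data and a hypothetical `normalize` helper; it is independent of what the -n option does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 100))
queries = rng.standard_normal((5, 100))


def normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Cosine similarity computed directly from the raw vectors:
# every pairwise dot product is divided by both norms.
cosine = (queries @ base.T) / (
    np.linalg.norm(queries, axis=1)[:, None] * np.linalg.norm(base, axis=1)[None, :]
)

# Inner product of pre-normalized vectors: the norms are paid for once, up front.
inner = normalize(queries) @ normalize(base).T

# Both give the same similarity, so cosine distance is simply 1 - inner.
assert np.allclose(cosine, inner)
```

Since cosine distance is 1 minus this similarity, ranking neighbors by largest inner product on normalized vectors gives the same order as ranking by smallest cosine distance.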

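Billion-scale datasets can be found at big-ann-benchmarks. Each ground truth file there contains both neighbors and distances, so it has to be split into two files. A script is provided for this.
Taking the Deep-1B dataset as an example:
mkdir -p data/deep-1B && cd data/deep-1B
# manually download the "Ground Truth" file of "Yandex DEEP"
# suppose the file name is deep_new_groundtruth.public.10K.bin
python -m cuvs_bench.split_groundtruth deep_new_groundtruth.public.10K.bin groundtruth
# two files, 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin', should be produced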
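Conceptually, the split looks like the sketch below. It assumes a combined layout of two uint32 header fields (number of queries, then neighbors per query), followed by all neighbor ids as 32-bit integers and then all distances as float32. That layout is an assumption to verify against the big-ann-benchmarks documentation, and the function name is hypothetical rather than the code of cuvs_bench.split_groundtruth.

```python
import numpy as np


def split_groundtruth_sketch(src_path, out_prefix):
    """Split a combined ground truth file into neighbors and distances files.

    Assumed input layout (little-endian): uint32 n_queries, uint32 k,
    n_queries * k int32 neighbor ids, then n_queries * k float32 distances.
    """
    with open(src_path, "rb") as f:
        n_queries, k = (int(v) for v in np.fromfile(f, dtype="<u4", count=2))
        neighbors = np.fromfile(f, dtype="<i4", count=n_queries * k)
        distances = np.fromfile(f, dtype="<f4", count=n_queries * k)

    # Each output reuses the same 8-byte header so it is a valid ibin/fbin file.
    header = np.array([n_queries, k], dtype="<u4")
    outputs = (("neighbors.ibin", neighbors), ("distances.fbin", distances))
    for suffix, payload in outputs:
        with open(f"{out_prefix}.{suffix}", "wb") as f:
            header.tofile(f)
            payload.tofile(f)
```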
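Besides ground truth files for the whole billion-scale datasets, this site also provides ground truth files for the first 10M or 100M vectors of the base sets. This means these billion-scale datasets can also be used as million-scale datasets. To make this easier, the optional dataset parameter subset_size can be used. See the next step for further explanation.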
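Because vectors are stored row-major after an 8-byte header, the first N vectors of a base file are simply a prefix of its payload. The sketch below shows this with `np.memmap`; the helper name is hypothetical and it only illustrates what a subset means, not how cuvs_bench handles subset_size.

```python
import numpy as np


def load_first_rows(path, subset_size, dtype="<f4"):
    """Map only the first `subset_size` vectors of a bin file, without reading it all."""
    n_vectors, n_dims = (int(v) for v in np.fromfile(path, dtype="<u4", count=2))
    rows = min(subset_size, n_vectors)
    # offset=8 skips the two uint32 header fields; data is fetched lazily on access.
    return np.memmap(path, dtype=dtype, mode="r", offset=8, shape=(rows, n_dims))
```

For example, `load_first_rows("base.fbin", 10_000_000)` would expose the first 10M vectors, which is the portion that a 10M ground truth file refers to.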

Generate ground truth#

If you have a dataset but no corresponding ground truth file, you can generate the ground truth with the generate_groundtruth utility. Example usage:

# With existing query file
python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --output=groundtruth_dir --queries=/dataset/query.public.10K.fbin

# With randomly generated queries
python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --output=groundtruth_dir --queries=random --n_queries=10000

# Using only a subset of the dataset. Define queries by randomly
# selecting vectors from the (subset of the) dataset.
python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --nrows=2000000 --output=groundtruth_dir --queries=random-choice --n_queries=10000
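For intuition about what ground truth means here, the following is a minimal brute-force sketch in NumPy: it computes exact nearest neighbors by comparing every query against every base vector. It assumes squared Euclidean distance and data small enough to fit in memory; the function name is hypothetical, and this is not the implementation of generate_groundtruth.

```python
import numpy as np


def brute_force_groundtruth(base, queries, k):
    """Exact k nearest neighbors under squared Euclidean distance (assumed metric)."""
    base = np.asarray(base, dtype=np.float64)
    queries = np.asarray(queries, dtype=np.float64)
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, evaluated for all pairs at once.
    d2 = (
        (queries**2).sum(axis=1)[:, None]
        - 2.0 * (queries @ base.T)
        + (base**2).sum(axis=1)[None, :]
    )
    np.maximum(d2, 0.0, out=d2)  # clamp tiny negatives caused by rounding
    neighbors = np.argsort(d2, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(d2, neighbors, axis=1)
    return neighbors.astype(np.int32), distances.astype(np.float32)
```

The two returned arrays correspond to the contents of the neighbors and distances files: one row per query, k columns, sorted from nearest to farthest.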