Quickstart

import shutil

import lance
import numpy as np
import pandas as pd
import pyarrow as pa

Creating datasets

It's very easy to create a Lance dataset via pyarrow.

Create a DataFrame

df = pd.DataFrame({"a": [5]})
df
a
0 5

Write it to Lance

shutil.rmtree("/tmp/test.lance", ignore_errors=True)

dataset = lance.write_dataset(df, "/tmp/test.lance")
dataset.to_table().to_pandas()
a
0 5

Converting from Parquet

shutil.rmtree("/tmp/test.parquet", ignore_errors=True)
shutil.rmtree("/tmp/test.lance", ignore_errors=True)

tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, "/tmp/test.parquet", format='parquet')

parquet = pa.dataset.dataset("/tmp/test.parquet")
parquet.to_table().to_pandas()
a
0 5

Write to Lance in one line of code

dataset = lance.write_dataset(parquet, "/tmp/test.lance")
# make sure it's the same
dataset.to_table().to_pandas()
a
0 5

Versioning

We can append rows

df = pd.DataFrame({"a": [10]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="append")

dataset.to_table().to_pandas()
a
0 5
1 10

We can overwrite the data and create a new version

df = pd.DataFrame({"a": [50, 100]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="overwrite")
dataset.to_table().to_pandas()
a
0 50
1 100

The old versions are still there

dataset.versions()
[{'version': 1,
  'timestamp': datetime.datetime(2024, 8, 15, 21, 22, 31, 453453),
  'metadata': {}},
 {'version': 2,
  'timestamp': datetime.datetime(2024, 8, 15, 21, 22, 35, 475152),
  'metadata': {}},
 {'version': 3,
  'timestamp': datetime.datetime(2024, 8, 15, 21, 22, 45, 32922),
  'metadata': {}}]
lance.dataset('/tmp/test.lance', version=1).to_table().to_pandas()
a
0 5
lance.dataset('/tmp/test.lance', version=2).to_table().to_pandas()
a
0 5
1 10

We can create tags

dataset.tags.create("stable", 2)
dataset.tags.create("nightly", 3)
dataset.tags.list()
{'nightly': {'version': 3, 'manifest_size': 628},
 'stable': {'version': 2, 'manifest_size': 684}}

Tags can be checked out just like versions:

lance.dataset('/tmp/test.lance', version="stable").to_table().to_pandas()
a
0 5
1 10

Vectors

Data preparation

We'll use the Sift 1M dataset for this tutorial:

  • Download ANN_SIFT1M from http://corpus-texmex.irisa.fr/

  • The direct link should be ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz

  • Download and unpack the tarball

!rm -rf sift* vec_data.lance
!wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
!tar -xzf sift.tar.gz
--2023-02-13 16:54:50--  ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
           => ‘sift.tar.gz’
Resolving ftp.irisa.fr (ftp.irisa.fr)... 131.254.254.45
Connecting to ftp.irisa.fr (ftp.irisa.fr)|131.254.254.45|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /local/texmex/corpus ... done.
==> SIZE sift.tar.gz ... 168280445
==> PASV ... done.    ==> RETR sift.tar.gz ... done.
Length: 168280445 (160M) (unauthoritative)

sift.tar.gz         100%[===================>] 160.48M  6.85MB/s    in 36s

2023-02-13 16:55:29 (4.43 MB/s) - ‘sift.tar.gz’ saved [168280445]

Convert it to Lance

from lance.vector import vec_to_table
import struct

uri = "vec_data.lance"

with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    # Skip the leading 4-byte dimension header and unpack 1M x 128 little-endian float32s.
    # NB: the fvecs format actually repeats the header before every vector, so this
    # one-shot unpack leaves a stray header value inside each row after the first.
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * 1000000 * 128])).reshape((1000000, 128))
    dd = dict(zip(range(1000000), data))

table = vec_to_table(dd)
lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
<lance.dataset.LanceDataset at 0x13859fe20>
uri = "vec_data.lance"
sift1m = lance.dataset(uri)
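A caveat on the .fvecs format: it stores a little-endian int32 dimension header before *every* vector, not just the first, so the one-shot unpack in the cell above leaves a stray header inside each row after the first. Read back as float32, the bit pattern of the integer 128 is the denormal 1.79e-43 that shows up in some outputs later in this tutorial. A sketch of a reader that strips every header (`read_fvecs` is my own helper, not part of Lance), demonstrated on a tiny in-memory buffer:

```python
import struct

import numpy as np


def read_fvecs(buf: bytes) -> np.ndarray:
    """Parse an .fvecs buffer: each vector is a little-endian int32
    dimension header followed by that many float32 components."""
    # Dimension of the first vector; fvecs repeats this header per vector.
    dim = struct.unpack("<i", buf[:4])[0]
    row_bytes = 4 * (dim + 1)  # header + components
    n = len(buf) // row_bytes
    arr = np.frombuffer(buf, dtype=np.float32).reshape(n, dim + 1)
    return arr[:, 1:]  # drop the per-vector headers


# Toy buffer holding two 4-dimensional vectors.
toy = b"".join(
    struct.pack("<i", 4) + struct.pack("<4f", *v)
    for v in ([1, 2, 3, 4], [5, 6, 7, 8])
)
vecs = read_fvecs(toy)
```

For sift_base.fvecs this yields the same (1000000, 128) shape without the interleaved header values.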

KNN (no vector index)

Sample 100 vectors to use as query vectors

import duckdb
# if this segfaults make sure duckdb v0.7+ is installed
samples = duckdb.query("SELECT vector FROM sift1m USING SAMPLE 100").to_df().vector
samples
0     [29.0, 10.0, 1.0, 50.0, 7.0, 89.0, 95.0, 51.0,...
1     [7.0, 5.0, 39.0, 49.0, 17.0, 12.0, 83.0, 117.0...
2     [0.0, 0.0, 0.0, 10.0, 12.0, 31.0, 6.0, 0.0, 0....
3     [0.0, 2.0, 9.0, 1.793662034335766e-43, 30.0, 1...
4     [54.0, 112.0, 16.0, 0.0, 0.0, 7.0, 112.0, 44.0...
                            ...
95    [1.793662034335766e-43, 33.0, 47.0, 28.0, 0.0,...
96    [1.0, 4.0, 2.0, 32.0, 3.0, 7.0, 119.0, 116.0, ...
97    [17.0, 46.0, 12.0, 0.0, 0.0, 3.0, 23.0, 58.0, ...
98    [0.0, 11.0, 30.0, 14.0, 34.0, 7.0, 0.0, 0.0, 1...
99    [20.0, 8.0, 121.0, 98.0, 37.0, 77.0, 9.0, 18.0...
Name: vector, Length: 100, dtype: object
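If you'd rather not pull in duckdb, the sampling step can be sketched with plain numpy: draw distinct row indices and fetch those rows from the dataset (e.g. via its `take` method). A toy version using only the indices (the fetch is commented out since it needs the dataset on disk):

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_query = 1_000_000, 100

# Draw 100 distinct row indices, mimicking duckdb's "USING SAMPLE 100".
idx = rng.choice(n_rows, size=n_query, replace=False)

# With the dataset from above, you could then fetch the rows, e.g.:
# samples = sift1m.take(idx, columns=["vector"]).to_pandas().vector
```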

Call nearest-neighbor search (no ANN index is used here)

import time

start = time.time()
tbl = sift1m.to_table(columns=["id"], nearest={"column": "vector", "q": samples[0], "k": 10})
end = time.time()

print(f"Time(sec): {end-start}")
print(tbl.to_pandas())
Time(sec): 0.10735273361206055
       id                                             vector    score
0  144678  [29.0, 10.0, 1.0, 50.0, 7.0, 89.0, 95.0, 51.0,...      0.0
1  575538  [2.0, 0.0, 1.0, 42.0, 3.0, 38.0, 152.0, 27.0, ...  76908.0
2  241428  [11.0, 0.0, 2.0, 118.0, 11.0, 108.0, 116.0, 21...  92877.0
3  220788  [0.0, 0.0, 0.0, 95.0, 0.0, 8.0, 133.0, 67.0, 1...  93305.0
4  833796  [1.0, 1.0, 0.0, 23.0, 11.0, 26.0, 140.0, 115.0...  95721.0
5  919065  [1.0, 1.0, 1.0, 42.0, 96.0, 42.0, 126.0, 83.0,...  96632.0
6  741948  [36.0, 9.0, 15.0, 108.0, 17.0, 23.0, 25.0, 55....  96927.0
7  225303  [0.0, 0.0, 3.0, 41.0, 0.0, 2.0, 36.0, 84.0, 68...  97055.0
8  787098  [4.0, 5.0, 7.0, 29.0, 7.0, 1.0, 9.0, 91.0, 33....  97950.0
9  113073  [0.0, 0.0, 0.0, 64.0, 65.0, 30.0, 12.0, 33.0, ...  99572.0

Without an index, this has to scan the entire dataset to compute the distances.

For real-time serving, we can do much better with an ANN index.
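Under the hood, the un-indexed query is an exhaustive flat search: the squared L2 distance from the query to every vector, keeping the k smallest (which is why the exact match above scores 0.0). A toy numpy sketch of that scan, on synthetic data rather than Lance's actual kernel:

```python
import numpy as np


def flat_knn(vectors: np.ndarray, q: np.ndarray, k: int):
    """Exhaustive search: squared-L2 distance from q to every row,
    then take the k smallest -- the same work the un-indexed query does."""
    dist = ((vectors - q) ** 2).sum(axis=1)
    idx = np.argsort(dist)[:k]
    return idx, dist[idx]


rng = np.random.default_rng(0)
base = rng.random((1000, 8), dtype=np.float32)
q = base[7]  # query with a known exact match in the base set
idx, dist = flat_knn(base, q, k=3)
```

The cost is linear in the number of rows, which is what the ANN index below avoids.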

Building the index

Now let's build an index. Lance currently supports the IVF_PQ, IVF_HNSW_PQ, and IVF_HNSW_SQ index types.

Note: if you don't want to wait for the index to build, you can download a version with a pre-built index here and skip the next cell.

%%time

sift1m.create_index(
    "vector",
    index_type="IVF_PQ", # IVF_PQ, IVF_HNSW_PQ and IVF_HNSW_SQ are supported
    num_partitions=256,  # IVF
    num_sub_vectors=16,  # PQ
)
Building vector index: IVF256,PQ16
CPU times: user 2min 23s, sys: 2.77 s, total: 2min 26s
Wall time: 22.7 s
Sample 65536 out of 1000000 to train kmeans of 128 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
(the 8-dim line repeats 16 times, once per PQ sub-vector)

Note: if you're trying this on your own data, make sure your vectors satisfy (dimensions / num_sub_vectors) % 8 == 0, or index creation will take much longer than expected due to SIMD misalignment.
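The rule in that note can be checked before kicking off a build; a tiny helper (my own sketch, not a Lance API):

```python
def pq_simd_aligned(dim: int, num_sub_vectors: int) -> bool:
    """Check the rule of thumb above: each PQ sub-vector should be a
    multiple of 8 floats, i.e. (dim / num_sub_vectors) % 8 == 0."""
    if dim % num_sub_vectors != 0:
        return False  # sub-vectors must divide the dimension evenly
    return (dim // num_sub_vectors) % 8 == 0


# Sift1M: 128-dim vectors split into 16 sub-vectors of 8 floats each.
ok = pq_simd_aligned(128, 16)
bad = pq_simd_aligned(128, 12)  # 128/12 is not even an integer
```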

Nearest-neighbor search again, with the ANN index

Let's look for the nearest neighbors again

sift1m = lance.dataset(uri)
import time

tot = 0
for q in samples:
    start = time.time()
    tbl = sift1m.to_table(nearest={"column": "vector", "q": q, "k": 10})
    end = time.time()
    tot += (end - start)

print(f"Avg(sec): {tot / len(samples)}")
print(tbl.to_pandas())
Avg(sec): 0.0009334301948547364
       id                                             vector         score
0  378825  [20.0, 8.0, 121.0, 98.0, 37.0, 77.0, 9.0, 18.0...  16560.197266
1  143787  [11.0, 24.0, 122.0, 122.0, 53.0, 4.0, 0.0, 3.0...  61714.941406
2  356895  [0.0, 14.0, 67.0, 122.0, 83.0, 23.0, 1.0, 0.0,...  64147.218750
3  535431  [9.0, 22.0, 118.0, 118.0, 4.0, 5.0, 4.0, 4.0, ...  69092.593750
4  308778  [1.0, 7.0, 48.0, 123.0, 73.0, 36.0, 8.0, 4.0, ...  69131.812500
5  222477  [14.0, 73.0, 39.0, 4.0, 16.0, 94.0, 19.0, 8.0,...  69244.195312
6  672558  [2.0, 1.0, 0.0, 11.0, 36.0, 23.0, 7.0, 10.0, 0...  70264.828125
7  365538  [54.0, 43.0, 97.0, 59.0, 34.0, 17.0, 10.0, 15....  70273.710938
8  659787  [10.0, 9.0, 23.0, 121.0, 38.0, 26.0, 38.0, 9.0...  70374.703125
9  603930  [32.0, 32.0, 122.0, 122.0, 70.0, 4.0, 15.0, 12...  70583.375000

A note on performance: actual numbers will vary by storage medium. These were run against local disk on an M2 MacBook Air; if you query S3, an HDD, or a network drive directly, expect it to be slower.

Latency and recall can be tuned with two parameters:

  • nprobes: the number of IVF partitions to search

  • refine_factor: determines how many vectors are retrieved for re-ranking

%%time

sift1m.to_table(
    nearest={
        "column": "vector",
        "q": samples[0],
        "k": 10,
        "nprobes": 10,
        "refine_factor": 5,
    }
).to_pandas()
CPU times: user 2.53 ms, sys: 3.31 ms, total: 5.84 ms
Wall time: 4.18 ms
id vector score
0 144678 [29.0, 10.0, 1.0, 50.0, 7.0, 89.0, 95.0, 51.0,... 0.0
1 575538 [2.0, 0.0, 1.0, 42.0, 3.0, 38.0, 152.0, 27.0, ... 76908.0
2 241428 [11.0, 0.0, 2.0, 118.0, 11.0, 108.0, 116.0, 21... 92877.0
3 220788 [0.0, 0.0, 0.0, 95.0, 0.0, 8.0, 133.0, 67.0, 1... 93305.0
4 833796 [1.0, 1.0, 0.0, 23.0, 11.0, 26.0, 140.0, 115.0... 95721.0
5 919065 [1.0, 1.0, 1.0, 42.0, 96.0, 42.0, 126.0, 83.0,... 96632.0
6 741948 [36.0, 9.0, 15.0, 108.0, 17.0, 23.0, 25.0, 55.... 96927.0
7 225303 [0.0, 0.0, 3.0, 41.0, 0.0, 2.0, 36.0, 84.0, 68... 97055.0
8 787098 [4.0, 5.0, 7.0, 29.0, 7.0, 1.0, 9.0, 91.0, 33.... 97950.0
9 113073 [0.0, 0.0, 0.0, 64.0, 65.0, 30.0, 12.0, 33.0, ... 99572.0

q => the query vector

k => how many neighbors to return

nprobes => how many partitions to probe (in the coarse quantizer)

refine_factor => controls "re-ranking". If k=10 and refine_factor=5, the ANN stage retrieves the 50 nearest neighbors, re-ranks them by their exact distances, and returns the top 10. This improves recall without giving up much performance.
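That candidate-set arithmetic as a one-liner (my own helper, not a Lance API):

```python
def ann_candidates(k, refine_factor=None):
    """Number of vectors fetched by the ANN stage before re-ranking:
    k * refine_factor candidates are retrieved and re-scored with exact
    distances, and the top k are returned (no re-ranking if None)."""
    return k if refine_factor is None else k * refine_factor


fetched = ann_candidates(k=10, refine_factor=5)  # as in the query above
```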

Note: the latencies above include file I/O, since Lance does not currently keep any data resident in memory. Aside from index build speed, an in-memory variant of the dataset would have the biggest impact on performance.

Features and vectors can be retrieved together

Often there are other feature or metadata columns we want to store and fetch alongside the vectors. If the data and the index are managed separately, stitching the results together takes a lot of tedious plumbing. With Lance it's a single call.

tbl = sift1m.to_table()
tbl = tbl.append_column("item_id", pa.array(range(len(tbl))))
tbl = tbl.append_column("revenue", pa.array((np.random.randn(len(tbl))+5)*1000))
tbl.to_pandas()
id vector item_id revenue
0 0 [0.0, 16.0, 35.0, 5.0, 32.0, 31.0, 14.0, 10.0,... 0 5950.436925
1 1 [1.8e-43, 14.0, 35.0, 19.0, 20.0, 3.0, 1.0, 13... 1 4680.298627
2 2 [33.0, 1.8e-43, 0.0, 1.0, 5.0, 3.0, 44.0, 40.0... 2 5342.593212
3 3 [23.0, 10.0, 1.8e-43, 12.0, 47.0, 14.0, 25.0, ... 3 5080.994002
4 4 [27.0, 29.0, 21.0, 1.8e-43, 1.0, 1.0, 0.0, 0.0... 4 4977.299308
... ... ... ... ...
999995 999995 [8.0, 9.0, 5.0, 0.0, 10.0, 39.0, 72.0, 68.0, 3... 999995 4928.768010
999996 999996 [3.0, 28.0, 55.0, 29.0, 35.0, 12.0, 1.0, 2.0, ... 999996 5056.264199
999997 999997 [0.0, 13.0, 41.0, 72.0, 40.0, 9.0, 0.0, 0.0, 0... 999997 5930.547635
999998 999998 [41.0, 121.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 24... 999998 5985.139759
999999 999999 [2.0, 4.0, 8.0, 8.0, 26.0, 72.0, 63.0, 0.0, 0.... 999999 5008.962686

1000000 rows × 4 columns

sift1m = lance.write_dataset(tbl, uri, mode="overwrite")
sift1m.to_table(columns=["revenue"], nearest={"column": "vector", "q": samples[0], "k": 10}).to_pandas()
revenue vector score
0 2994.968781 [29.0, 10.0, 1.0, 50.0, 7.0, 89.0, 95.0, 51.0,... 0.0
1 4231.026305 [2.0, 0.0, 1.0, 42.0, 3.0, 38.0, 152.0, 27.0, ... 76908.0
2 3340.900287 [11.0, 0.0, 2.0, 118.0, 11.0, 108.0, 116.0, 21... 92877.0
3 4339.588996 [0.0, 0.0, 0.0, 95.0, 0.0, 8.0, 133.0, 67.0, 1... 93305.0
4 5141.730799 [1.0, 1.0, 0.0, 23.0, 11.0, 26.0, 140.0, 115.0... 95721.0
5 4518.194820 [1.0, 1.0, 1.0, 42.0, 96.0, 42.0, 126.0, 83.0,... 96632.0
6 3383.586889 [36.0, 9.0, 15.0, 108.0, 17.0, 23.0, 25.0, 55.... 96927.0
7 5496.905675 [0.0, 0.0, 3.0, 41.0, 0.0, 2.0, 36.0, 84.0, 68... 97055.0
8 5298.669719 [4.0, 5.0, 7.0, 29.0, 7.0, 1.0, 9.0, 91.0, 33.... 97950.0
9 6742.810395 [0.0, 0.0, 0.0, 64.0, 65.0, 30.0, 12.0, 33.0, ... 99572.0