February 16, 2023

Using Qdrant as a vector database for OpenAI embeddings

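This notebook walks you step by step through using Qdrant as a vector database for OpenAI embeddings. Qdrant is a high-performance vector search database written in Rust. It offers RESTful and gRPC APIs to manage your embeddings, and the official Python client, qdrant-client, eases integration with your applications.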

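This notebook presents an end-to-end process of:

Using precomputed embeddings created by the OpenAI API.
Storing the embeddings in a local instance of Qdrant.
Converting a raw text query to an embedding with the OpenAI API.
Using Qdrant to perform a nearest neighbour search in the created collection.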

What is Qdrant

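Qdrant is an open-source vector database that stores neural embeddings together with metadata, also known as the payload. Payloads can hold additional attributes of a point and can also be used for filtering. Qdrant applies these filters during the vector search phase itself rather than as a separate step, which keeps filtered search efficient.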
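To make the idea of filtering during the search phase concrete, here is a minimal brute-force sketch in plain Python. It is only an illustration of the concept, not how Qdrant is implemented: the point layout, the dot-product score, and the names filtered_search and payload_filter are all assumptions made for this example.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(points, query, payload_filter, top_k=3):
    """Return the top_k (score, id) pairs among points whose payload passes the filter."""
    candidates = (
        (dot(point["vector"], query), point["id"])
        for point in points
        if payload_filter(point["payload"])  # payload is checked while scanning, not afterwards
    )
    return heapq.nlargest(top_k, candidates)


# Hypothetical toy data: three points, each with a vector and a payload.
points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"lang": "en"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"lang": "de"}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"lang": "en"}},
]

hits = filtered_search(points, [1.0, 0.0], lambda p: p["lang"] == "en", top_k=2)
print(hits)
```

Point 2 is close to the query but never becomes a candidate, because its payload fails the filter before any ranking happens.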

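Deployment options

Qdrant can be launched in several ways, depending on the expected load of the application:

Locally or on premise, with Docker containers
On a Kubernetes cluster, with the Helm chart
Using Qdrant Cloud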

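Integration

Qdrant provides both RESTful and gRPC APIs, so it is easy to integrate regardless of the programming language you use. There are also official clients for the most popular languages, and if you use Python, the Python Qdrant client library is probably the best choice.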

! docker compose up -d

[+] Running 1/0
 ✔ Container qdrant-qdrant-1  Running                                      0.0s

We can verify that the server started successfully by running a simple curl command:

! curl http://localhost:6333

{"title":"qdrant - vector search engine","version":"1.3.0"}
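The same check can be done from Python using only the standard library. This is a hedged sketch: it assumes the root endpoint returns a JSON body with a version field, as in the output above, and the helper names parse_qdrant_banner and qdrant_version are made up for this example.

```python
import json
import urllib.error
import urllib.request


def parse_qdrant_banner(body):
    """Return the version string from the root endpoint's JSON body."""
    info = json.loads(body)
    return info["version"]


def qdrant_version(url="http://localhost:6333", timeout=2.0):
    """Return the server version, or None if the server cannot be reached or parsed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_qdrant_banner(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        return None


# Parsing the response shown above; no network access is needed for this line.
print(parse_qdrant_banner('{"title":"qdrant - vector search engine","version":"1.3.0"}'))
```

Calling qdrant_version() against a running server should return its version string, and None signals that nothing answered on that address.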

Install requirements

This notebook requires the openai and qdrant-client packages, along with a few additional libraries. The following command installs them all:

! pip install openai qdrant-client pandas wget

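Prepare your OpenAI API key

The OpenAI API key is used to vectorize the documents and the queries.

If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys.

Once you have your key, add it to your environment variables as OPENAI_API_KEY by running the following command: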

! export OPENAI_API_KEY="your API key"

# Test that your OpenAI API key is correctly set as an environment variable
# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.
import os

# Note. alternatively you can set a temporary env variable like this:
# os.environ["OPENAI_API_KEY"] = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")

OPENAI_API_KEY is ready

Connect to Qdrant

Connecting to a running Qdrant server instance with the official Python library is straightforward:

import qdrant_client

client = qdrant_client.QdrantClient(
    host="localhost",
    prefer_grpc=True,
)

We can test the connection by calling any available method:

client.get_collections()

CollectionsResponse(collections=[])

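Load data

In this section we load data that was prepared ahead of this session, so you don't have to spend your own credits recomputing the embeddings of the Wikipedia articles.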

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

100% [......................................................................] 698933052 / 698933052

'vector_database_wikipedia_articles_embedded (9).zip'

The downloaded file then has to be extracted:

import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../data")
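Since the archive is large, it can help to skip extraction when the CSV is already in place. The helper below is a minimal sketch of that idea; extract_if_needed and its parameters are hypothetical names, not part of the notebook. Also note that the download output above ends in "(9).zip": the wget package appears to add such a counter when a file with the same name already exists, so the name passed to ZipFile may need adjusting on a repeated run.

```python
import zipfile
from pathlib import Path


def extract_if_needed(zip_path, target_dir, expected_name):
    """Extract zip_path into target_dir unless expected_name already exists there."""
    target = Path(target_dir) / expected_name
    if target.exists():
        return target  # already extracted, so the archive is not opened again
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
    return target


# Example call with the file names used in this notebook (left commented out
# so that the sketch does not depend on the download having finished):
# extract_if_needed(
#     "vector_database_wikipedia_articles_embedded.zip",
#     "../data",
#     "vector_database_wikipedia_articles_embedded.csv",
# )
```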

Finally, we can load the data from the provided CSV file:

import pandas as pd

from ast import literal_eval

article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')
# Read vectors from strings back into a list
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()

	id	url	title	text	title_vector	content_vector	vector_id
0	1	https://simple.wikipedia.org/wiki/April	April	April is the fourth month of the year in the Gregorian calendar...	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	August (Aug.) is the eighth month of the year...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	Art	Art is a creative activity that expresses imagination...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A or a is the first letter of the English alphabet...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	Air refers to the Earth's atmosphere. Air is a...	[0.02224554680287838, -0.02044147066771984, -...	[0.021524671465158463, 0.018522677943110466, -...	4
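The vector columns arrive as strings and have to be parsed back into lists. If they are serialized as plain bracketed lists of numbers, as the preview suggests, json.loads is a possible alternative to literal_eval. The sketch below assumes exactly that format; parse_vector is a name invented for this example.

```python
import json
from ast import literal_eval


def parse_vector(raw):
    """Parse a string such as "[0.1, -0.2]" into a list of floats."""
    vector = json.loads(raw)
    if not isinstance(vector, list):
        raise ValueError("expected a JSON array of numbers")
    return [float(x) for x in vector]


raw = "[0.001009464613161981, -0.020700545981526375]"
# Both parsers agree on this input.
print(parse_vector(raw) == literal_eval(raw))
```

In the notebook this could replace literal_eval in the two apply calls, for example article_df.title_vector.apply(parse_vector).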

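Index data

Qdrant stores data in collections, where each object is described by at least one vector and may carry additional metadata called the payload. Our collection will be named Articles, and each object will be described by two vectors: title and content. Qdrant does not require you to define any kind of schema beforehand, so once the collection is created you can freely add points to it.

We will start by creating a collection and then fill it with our precomputed embeddings.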

from qdrant_client.http import models as rest

vector_size = len(article_df["content_vector"][0])

client.create_collection(
    collection_name="Articles",
    vectors_config={
        "title": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        "content": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)

True
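Each named vector in the collection is configured with a fixed size, so every title and content vector is expected to have exactly that many entries. A quick sanity check before uploading could look like the sketch below. It works on plain dictionaries rather than the DataFrame, and find_bad_rows is a hypothetical helper written for this example.

```python
def find_bad_rows(rows, vector_keys, expected_size):
    """Return the indices of rows in which any listed vector has the wrong length."""
    return [
        index
        for index, row in enumerate(rows)
        if any(len(row[key]) != expected_size for key in vector_keys)
    ]


# Toy rows with 3-dimensional vectors; the second row has a truncated content vector.
rows = [
    {"title_vector": [0.1, 0.2, 0.3], "content_vector": [0.4, 0.5, 0.6]},
    {"title_vector": [0.1, 0.2, 0.3], "content_vector": [0.4, 0.5]},
]
print(find_bad_rows(rows, ["title_vector", "content_vector"], expected_size=3))
```

With the notebook's data, the rows could be obtained with article_df.to_dict("records") and expected_size would be vector_size.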

client.upsert(
    collection_name="Articles",
    points=[
        rest.PointStruct(
            id=k,
            vector={
                "title": v["title_vector"],
                "content": v["content_vector"],
            },
            payload=v.to_dict(),
        )
        for k, v in article_df.iterrows()
    ],
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
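The call above sends all points in a single request. For larger datasets it may be preferable to upload in smaller batches. The following is a sketch under that assumption: batched and upsert_in_batches are hypothetical helpers, the batch size of 500 is an arbitrary choice, and the only client method used is upsert with the same keyword arguments as above.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upsert_in_batches(client, collection_name, points, batch_size=500):
    """Call client.upsert once per batch and return the number of calls made."""
    calls = 0
    for batch in batched(points, batch_size):
        client.upsert(collection_name=collection_name, points=batch)
        calls += 1
    return calls


print(list(batched(range(5), 2)))
```

Because points can be any iterable, a generator of PointStruct objects can be passed in, which avoids building the full list in memory first.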

# Check the collection size to make sure all the points have been stored
client.count(collection_name="Articles")

CountResult(count=25000)

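Search data

Once the data is in Qdrant, we can start querying the collection for the closest vectors. The additional vector_name parameter switches between title-based and content-based search. Because the precomputed embeddings were created with OpenAI's text-embedding-ada-002 model, the same model must be used at search time.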

from openai import OpenAI

openai_client = OpenAI()

def query_qdrant(query, collection_name, vector_name="title", top_k=20):
    # Creates embedding vector from user query
    embedded_query = openai_client.embeddings.create(
        input=query,
        model="text-embedding-ada-002",
    ).data[0].embedding

    query_results = client.search(
        collection_name=collection_name,
        query_vector=(
            vector_name, embedded_query
        ),
        limit=top_k,
    )

    return query_results

query_results = query_qdrant("modern art in Europe", "Articles")
for i, article in enumerate(query_results):
    print(f"{i + 1}. {article.payload['title']} (Score: {round(article.score, 3)})")

1. Museum of Modern Art (Score: 0.875)
2. Western Europe (Score: 0.867)
3. Renaissance art (Score: 0.864)
4. Pop art (Score: 0.86)
5. Northern Europe (Score: 0.855)
6. Hellenistic art (Score: 0.853)
7. Modernist literature (Score: 0.847)
8. Art film (Score: 0.843)
9. Central Europe (Score: 0.843)
10. European (Score: 0.841)
11. Art (Score: 0.841)
12. Byzantine art (Score: 0.841)
13. Postmodernism (Score: 0.84)
14. Eastern Europe (Score: 0.839)
15. Cubism (Score: 0.839)
16. Europe (Score: 0.839)
17. Impressionism (Score: 0.838)
18. Bauhaus (Score: 0.838)
19. Surrealism (Score: 0.837)
20. Expressionism (Score: 0.837)

# This time we'll query using content vector
query_results = query_qdrant("Famous battles in Scottish history", "Articles", "content")
for i, article in enumerate(query_results):
    print(f"{i + 1}. {article.payload['title']} (Score: {round(article.score, 3)})")

1. Battle of Bannockburn (Score: 0.869)
2. Wars of Scottish Independence (Score: 0.861)
3. 1651 (Score: 0.852)
4. First War of Scottish Independence (Score: 0.85)
5. Robert I of Scotland (Score: 0.846)
6. 841 (Score: 0.844)
7. 1716 (Score: 0.844)
8. 1314 (Score: 0.837)
9. 1263 (Score: 0.836)
10. William Wallace (Score: 0.835)
11. Stirling (Score: 0.831)
12. 1306 (Score: 0.831)
13. 1746 (Score: 0.83)
14. 1040s (Score: 0.828)
15. 1106 (Score: 0.827)
16. 1304 (Score: 0.826)
17. David II of Scotland (Score: 0.825)
18. Braveheart (Score: 0.824)
19. 1124 (Score: 0.824)
20. Second War of Scottish Independence (Score: 0.823)
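Both vectors in the collection were configured with the cosine distance, so the scores printed above can be read as cosine similarities between the query embedding and the stored vectors, with values closer to 1 meaning a closer match. As a rough illustration of what is being computed, here is a plain Python version of the formula. It is a sketch for small vectors only, not the code Qdrant runs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```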