June 28, 2023

Using Qdrant for Embeddings Search

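This notebook walks you through a simple flow: download some data, embed it, then index and search it using a chosen vector database. This is a common requirement for customers who want to store and search our embeddings in a secure environment to support production use cases such as chatbots, topic modelling and more.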

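What is a Vector Database

A vector database is a database designed to store, manage and search embedding vectors. In recent years, AI has become increasingly effective at solving use cases involving natural language, image recognition and other unstructured data, and the use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has grown rapidly as a result. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.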
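
To make the idea concrete, here is a minimal sketch of the core operation a vector database performs: ranking stored vectors by their similarity to a query vector. It is a brute-force illustration in plain Python with made-up three-dimensional vectors, and the names cosine_similarity and top_k are hypothetical. Production systems such as Qdrant typically rely on approximate nearest-neighbor indexes rather than scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the k (key, score) pairs most similar to the query."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings", invented for illustration only.
toy_vectors = {
    "art": [0.9, 0.1, 0.0],
    "europe": [0.1, 0.9, 0.0],
    "air": [0.0, 0.1, 0.9],
}

results = top_k([0.8, 0.3, 0.0], toy_vectors, k=2)
print(results)
```

The query vector points mostly along the first axis, so "art" ranks first and "europe" second.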

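Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases shared in this repo (question answering, chatbots and recommendation services, for example) and run them in a secure, scalable environment. Many customers have already used embeddings to solve their problems at small scale, but performance and security hold them back from going into production. We see vector databases as a key component in solving that. This guide walks through the basics of embedding text data, storing it in a vector database and using it for semantic search.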

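Demo Flow

The demo flow is:

  • Setup: Import packages and set any required variables
  • Load data: Load a dataset and embed it using OpenAI embeddings
  • Qdrant
    • Setup: Here we'll set up the Python client for Qdrant. For more details, see the Qdrant documentation
    • Index Data: We'll create a collection with vectors for titles and content
    • Search Data: We'll run a few searches to confirm it works

Once you've worked through this notebook you should have a basic understanding of how to set up and use a vector database, and can move on to more complex use cases that make use of our embeddings.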

Setup

Import the required libraries and set the embedding model that we'd like to use.

# We'll need to install Qdrant client
!pip install qdrant-client
import openai
import pandas as pd
from ast import literal_eval
import qdrant_client # Qdrant's client library for Python

# This can be changed to the embedding model of your choice. Make sure it's the same model that was used to generate the stored embeddings
EMBEDDING_MODEL = "text-embedding-ada-002"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 
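
The openai package reads its API key from the OPENAI_API_KEY environment variable by default, and the search section later in this notebook needs that key to embed queries. Below is a minimal sketch of an early check, assuming the key is supplied through that variable. The helper name has_api_key is hypothetical.

```python
import os

def has_api_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not has_api_key():
    print("OPENAI_API_KEY is not set; query embedding calls will fail.")
```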

Load data

In this section we'll load embedded data that we've prepared before this session.

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
100% [......................................................................] 698933052 / 698933052
'vector_database_wikipedia_articles_embedded (10).zip'
import zipfile
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../data")
article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')
article_df.head()
id url title text title_vector content_vector vector_id
0 1 https://simple.wikipedia.org/wiki/April April April is the fourth month of the year in the Gregorian calendar... [0.001009464613161981, -0.020700545981526375, ... [-0.011253940872848034, -0.013491976074874401,... 0
1 2 https://simple.wikipedia.org/wiki/August August August (Aug.) is the eighth month of the year... [0.0009286514250561595, 0.000820168002974242, ... [0.0003609954728744924, 0.007262262050062418, ... 1
2 6 https://simple.wikipedia.org/wiki/Art Art Art is a creative activity that expresses imagination... [0.003393713850528002, 0.0061537534929811954, ... [-0.004959689453244209, 0.015772193670272827, ... 2
3 8 https://simple.wikipedia.org/wiki/A A A or a is the first letter of the English alphabet... [0.0153952119871974, -0.013759135268628597, 0.... [0.024894846603274345, -0.022186409682035446, ... 3
4 9 https://simple.wikipedia.org/wiki/Air Air Air refers to the Earth's atmosphere. Air is a... [0.02224554680287838, -0.02044147066771984, -... [0.021524671465158463, 0.018522677943110466, -... 4
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
article_df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              25000 non-null  int64 
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB
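
The literal_eval step above is needed because the CSV stores each vector as a string. Here is a minimal sketch of that conversion on a single value. The sample string is made up for illustration and is not taken from the dataset.

```python
from ast import literal_eval

# A vector as it would appear in a CSV cell: one long string.
raw_cell = "[0.0010, -0.0207, 0.0153]"

# literal_eval parses Python literals (lists, numbers, strings)
# without executing arbitrary code, unlike eval().
vector = literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(vector).__name__, len(vector))
```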

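Qdrant

Qdrant is a high-performance vector search database written in Rust. It offers both on-premise and cloud versions, but for the purposes of this example we'll use the local deployment mode.

Setting everything up requires:

  • Spinning up a local instance of Qdrant
  • Configuring the collection and storing the data in it
  • Trying out a few queries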

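Setup

For the local deployment we'll use Docker, following the Qdrant documentation: https://qdrant.tech/documentation/quick_start/. Qdrant requires just a single container, but an example docker-compose.yaml file is available at ./qdrant/docker-compose.yaml in this repo.

You can start a Qdrant instance locally by navigating to this directory and running docker-compose up -d

You might need to increase Docker's memory limit to 8GB or more. Otherwise Qdrant may fail with an error message like 7 Killed.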

! docker compose up -d
[+] Running 1/0
 ✔ Container qdrant-qdrant-1  Running                                      0.0s 
qdrant = qdrant_client.QdrantClient(host="localhost", port=6333)
qdrant.get_collections()
CollectionsResponse(collections=[CollectionDescription(name='Articles')])

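Index data

Qdrant stores data in collections, where each object is described by at least one vector and may carry additional metadata called payload. Our collection will be called Articles, and each object will be described by both title and content vectors.

We'll use the official qdrant-client package, which already has all the utility methods built in.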

from qdrant_client.http import models as rest
# Get the vector size from the first row to set up the collection
vector_size = len(article_df['content_vector'][0])

# Set up the collection with the vector configuration. Both the vector size and the distance metric must be declared
# up front, since the database uses them to index and search vectors efficiently.
qdrant.recreate_collection(
    collection_name='Articles',
    vectors_config={
        'title': rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        'content': rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)
True
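
The collection above takes its vector size from the first content vector and reuses it for both named vectors. Every point must match the declared size, so it can be worth confirming that all rows agree before indexing. The check below is a minimal sketch on toy rows. The function name check_vector_sizes and the row layout are assumptions for illustration and are not part of qdrant-client.

```python
def check_vector_sizes(rows, vector_fields=("title_vector", "content_vector")):
    """Return the shared vector size, or raise ValueError on a mismatch."""
    expected = None
    for index, row in enumerate(rows):
        for field in vector_fields:
            size = len(row[field])
            if expected is None:
                expected = size  # the first vector seen sets the reference size
            elif size != expected:
                raise ValueError(
                    f"row {index}, field {field!r}: size {size}, expected {expected}"
                )
    if expected is None:
        raise ValueError("no rows to check")
    return expected

# Toy rows standing in for the DataFrame records.
toy_rows = [
    {"title_vector": [0.1, 0.2, 0.3], "content_vector": [0.4, 0.5, 0.6]},
    {"title_vector": [0.7, 0.8, 0.9], "content_vector": [0.0, 0.1, 0.2]},
]
vector_size = check_vector_sizes(toy_rows)
print(vector_size)
```

With the DataFrame from this notebook, the rows could be supplied as article_df.to_dict('records').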

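In addition to the vector configuration defined under vectors_config, we can also attach a payload to each point. Payload is an optional field that lets you store additional metadata alongside the vectors. In our case, we'll store the id, title, and url of each article. Because the title of the nearest article is returned from the payload in the search results, we can also give the user the article's URL (which is part of the metadata).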

from qdrant_client.models import PointStruct # Import the PointStruct to store the vector and payload
from tqdm import tqdm # Library to show the progress bar 

# Populate collection with vectors using tqdm to show progress
for k, v in tqdm(article_df.iterrows(), desc="Upserting articles", total=len(article_df)):
    try:
        qdrant.upsert(
            collection_name='Articles',
            points=[
                PointStruct(
                    id=k,
                    vector={'title': v['title_vector'], 
                            'content': v['content_vector']},
                    payload={
                        'id': v['id'],
                        'title': v['title'],
                        'url': v['url']
                    }
                )
            ]
        )
    except Exception as e:
        print(f"Failed to upsert row {k}: {v}")
        print(f"Exception: {e}")
Upserting articles: 100%|█████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [02:52<00:00, 144.82it/s]
# Check the collection size to make sure all the points have been stored
qdrant.count(collection_name='Articles')
CountResult(count=25000)
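
The loop above sends one request per article. Since upsert takes a list of points, grouping several rows into each request is a common way to reduce round trips. The sketch below shows only the batching logic in plain Python. The helper names (batched, build_point) and the batch size are assumptions, and the points are plain dictionaries standing in for PointStruct objects.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def build_point(index, row):
    """Mirror the PointStruct fields used above as a plain dict."""
    return {
        "id": index,
        "vector": {"title": row["title_vector"], "content": row["content_vector"]},
        "payload": {"id": row["id"], "title": row["title"], "url": row["url"]},
    }

# Toy rows standing in for article_df records.
toy_rows = [
    {"id": i + 1, "title": f"Article {i}", "url": f"https://example.com/{i}",
     "title_vector": [0.0, 0.1], "content_vector": [0.2, 0.3]}
    for i in range(5)
]

points = [build_point(i, row) for i, row in enumerate(toy_rows)]
batches = list(batched(points, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each batch could then be passed as the points argument of a single upsert call, after converting the dictionaries to PointStruct instances.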

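Search data

Once the data is in Qdrant, we can start querying the collection for the closest vectors. The additional vector_name parameter switches between title-based and content-based search. Make sure to use the text-embedding-ada-002 model, since the original embeddings in the file were created with it.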

def query_qdrant(query, collection_name, vector_name='title', top_k=20):

    # Creates embedding vector from user query
    embedded_query = openai.embeddings.create(
        input=query,
        model=EMBEDDING_MODEL,
    ).data[0].embedding # We take the first embedding from the list
    
    query_results = qdrant.search(
        collection_name=collection_name,
        query_vector=(
            vector_name, embedded_query
        ),
        limit=top_k, 
        query_filter=None
    )
    
    return query_results
query_results = query_qdrant('modern art in Europe', 'Articles', 'title')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]}, URL: {article.payload["url"]} (Score: {round(article.score, 3)})')
1. Museum of Modern Art, URL: https://simple.wikipedia.org/wiki/Museum%20of%20Modern%20Art (Score: 0.875)
2. Western Europe, URL: https://simple.wikipedia.org/wiki/Western%20Europe (Score: 0.867)
3. Renaissance art, URL: https://simple.wikipedia.org/wiki/Renaissance%20art (Score: 0.864)
4. Pop art, URL: https://simple.wikipedia.org/wiki/Pop%20art (Score: 0.86)
5. Northern Europe, URL: https://simple.wikipedia.org/wiki/Northern%20Europe (Score: 0.855)
6. Hellenistic art, URL: https://simple.wikipedia.org/wiki/Hellenistic%20art (Score: 0.853)
7. Modernist literature, URL: https://simple.wikipedia.org/wiki/Modernist%20literature (Score: 0.847)
8. Art film, URL: https://simple.wikipedia.org/wiki/Art%20film (Score: 0.843)
9. Central Europe, URL: https://simple.wikipedia.org/wiki/Central%20Europe (Score: 0.843)
10. European, URL: https://simple.wikipedia.org/wiki/European (Score: 0.841)
11. Art, URL: https://simple.wikipedia.org/wiki/Art (Score: 0.841)
12. Byzantine art, URL: https://simple.wikipedia.org/wiki/Byzantine%20art (Score: 0.841)
13. Postmodernism, URL: https://simple.wikipedia.org/wiki/Postmodernism (Score: 0.84)
14. Eastern Europe, URL: https://simple.wikipedia.org/wiki/Eastern%20Europe (Score: 0.839)
15. Cubism, URL: https://simple.wikipedia.org/wiki/Cubism (Score: 0.839)
16. Europe, URL: https://simple.wikipedia.org/wiki/Europe (Score: 0.839)
17. Impressionism, URL: https://simple.wikipedia.org/wiki/Impressionism (Score: 0.838)
18. Bauhaus, URL: https://simple.wikipedia.org/wiki/Bauhaus (Score: 0.838)
19. Surrealism, URL: https://simple.wikipedia.org/wiki/Surrealism (Score: 0.837)
20. Expressionism, URL: https://simple.wikipedia.org/wiki/Expressionism (Score: 0.837)
# This time we'll query using content vector
query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]}, URL: {article.payload["url"]} (Score: {round(article.score, 3)})')
1. Battle of Bannockburn, URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn (Score: 0.869)
2. Wars of Scottish Independence, URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence (Score: 0.861)
3. 1651, URL: https://simple.wikipedia.org/wiki/1651 (Score: 0.852)
4. First War of Scottish Independence, URL: https://simple.wikipedia.org/wiki/First%20War%20of%20Scottish%20Independence (Score: 0.85)
5. Robert I of Scotland, URL: https://simple.wikipedia.org/wiki/Robert%20I%20of%20Scotland (Score: 0.846)
6. 841, URL: https://simple.wikipedia.org/wiki/841 (Score: 0.844)
7. 1716, URL: https://simple.wikipedia.org/wiki/1716 (Score: 0.844)
8. 1314, URL: https://simple.wikipedia.org/wiki/1314 (Score: 0.837)
9. 1263, URL: https://simple.wikipedia.org/wiki/1263 (Score: 0.836)
10. William Wallace, URL: https://simple.wikipedia.org/wiki/William%20Wallace (Score: 0.835)
11. Stirling, URL: https://simple.wikipedia.org/wiki/Stirling (Score: 0.831)
12. 1306, URL: https://simple.wikipedia.org/wiki/1306 (Score: 0.831)
13. 1746, URL: https://simple.wikipedia.org/wiki/1746 (Score: 0.83)
14. 1040s, URL: https://simple.wikipedia.org/wiki/1040s (Score: 0.828)
15. 1106, URL: https://simple.wikipedia.org/wiki/1106 (Score: 0.827)
16. 1304, URL: https://simple.wikipedia.org/wiki/1304 (Score: 0.826)
17. David II of Scotland, URL: https://simple.wikipedia.org/wiki/David%20II%20of%20Scotland (Score: 0.825)
18. Braveheart, URL: https://simple.wikipedia.org/wiki/Braveheart (Score: 0.824)
19. 1124, URL: https://simple.wikipedia.org/wiki/1124 (Score: 0.824)
20. July 27, URL: https://simple.wikipedia.org/wiki/July%2027 (Score: 0.823)
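
Both searches print their results with the same loop, and a small helper can keep that formatting in one place. The sketch below assumes only that each result exposes payload and score attributes, as the loops above do. format_results is a hypothetical name, and SimpleNamespace objects with made-up values stand in for real search results.

```python
from types import SimpleNamespace

def format_results(results):
    """Return one formatted line per result, numbered from 1."""
    return [
        f'{rank}. {hit.payload["title"]}, URL: {hit.payload["url"]} '
        f'(Score: {round(hit.score, 3)})'
        for rank, hit in enumerate(results, start=1)
    ]

# Stand-in results with invented values for illustration.
fake_results = [
    SimpleNamespace(payload={"title": "Example A", "url": "https://example.com/a"}, score=0.87512),
    SimpleNamespace(payload={"title": "Example B", "url": "https://example.com/b"}, score=0.86),
]

for line in format_results(fake_results):
    print(line)
```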