Using Tair as a vector database for OpenAI embeddings

Sep 11, 2023

This notebook takes you step by step through using Tair as a vector database for OpenAI embeddings.

This notebook presents an end-to-end process of:

  1. Using precomputed embeddings created by the OpenAI API.
  2. Storing the embeddings in a cloud instance of Tair.
  3. Converting a raw text query to an embedding with the OpenAI API.
  4. Using Tair to perform a nearest-neighbor search in the created collection.

What is Tair

Tair is a cloud-native in-memory database service developed by Alibaba Cloud. Tair is compatible with open-source Redis and provides a variety of data models and enterprise-class capabilities to support your real-time online scenarios. Tair also introduces persistent memory-optimized instances based on the new non-volatile memory (NVM) storage medium. These instances can reduce costs by 30%, ensure data persistence, and deliver almost the same performance as an in-memory database. Tair has been widely used in areas such as government affairs, finance, manufacturing, healthcare, and pan-Internet to meet high-speed query and computing requirements.

TairVector is an in-house data structure that provides high-performance real-time storage and retrieval of vectors. TairVector offers two indexing algorithms: Hierarchical Navigable Small World (HNSW) and Flat Search. It also supports multiple distance functions, such as Euclidean distance, inner product, and Jaccard distance. Compared with traditional vector retrieval services, TairVector has the following advantages (see the sketch after this list):

  • Stores all data in memory and supports real-time index updates to reduce the latency of read and write operations.
  • Uses optimized in-memory data structures to make better use of storage capacity.
  • Works as an out-of-the-box data structure in a simple and efficient architecture, with no complex modules or dependencies.
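
To make these options concrete, here is a minimal sketch of creating one index of each type. The keyword arguments mirror the tvs_create_index call used later in this notebook, but the distance_type strings "IP" and "L2" and the HNSW tuning parameters ef_construct and M are assumptions based on TairVector's documentation, so verify them against your client version.

from tair import Tair as TairClient

# Placeholder connection URL, for illustration only
demo_client = TairClient.from_url("redis://localhost:6379/0")

# FLAT index: exact brute-force search with inner-product distance
demo_client.tvs_create_index(name="demo_flat", dim=4,
                             distance_type="IP", index_type="FLAT")

# HNSW index: approximate search with Euclidean distance; ef_construct and M
# (assumed parameter names) trade build time and memory for recall
demo_client.tvs_create_index(name="demo_hnsw", dim=4,
                             distance_type="L2", index_type="HNSW",
                             ef_construct=100, M=16)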

Deployment options

Prerequisites

For the purposes of this exercise we need to prepare a couple of things:

  1. A Tair cloud server instance.
  2. The 'tair' library to interact with the Tair database.
  3. An OpenAI API key.

Install requirements

This notebook obviously requires the openai and tair packages, but we will also use some other additional libraries. The following command installs them all:

! pip install openai redis tair pandas wget
Looking in indexes: http://sg.mirrors.cloud.aliyuncs.com/pypi/simple/
Requirement already satisfied: openai in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (0.28.0)
Requirement already satisfied: redis in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (5.0.0)
Requirement already satisfied: tair in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (1.3.6)
Requirement already satisfied: pandas in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (2.1.0)
Requirement already satisfied: wget in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (3.2)
Requirement already satisfied: requests>=2.20 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (2.31.0)
Requirement already satisfied: tqdm in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (4.66.1)
Requirement already satisfied: aiohttp in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (3.8.5)
Requirement already satisfied: async-timeout>=4.0.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from redis) (4.0.3)
Requirement already satisfied: numpy>=1.22.4 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (1.25.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2023.7.22)
Requirement already satisfied: attrs>=17.3.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (22.1.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (6.0.4)
Requirement already satisfied: yarl<2.0,>=1.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.4.0)
Requirement already satisfied: aiosignal>=1.1.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.3.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

import getpass
import openai

# Prompt for the OpenAI API key without echoing it
openai.api_key = getpass.getpass("Input your OpenAI API key:")
Input your OpenAI API key:········

Connect to Tair

First add the Tair connection URL to your environment variables.
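
For example, here is a minimal sketch of reading the connection URL back from the environment. It assumes you have exported TAIR_URL beforehand (e.g. export TAIR_URL="redis://user:password@host:6379/0"); the cell below uses an interactive prompt instead.

import os

# Read the Tair connection URL from the environment (assumed to be set beforehand)
TAIR_URL = os.environ.get("TAIR_URL")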

Connecting to a running Tair server instance is easy with the official Python library.

# The format of url: redis://[[username]:[password]]@localhost:6379/0
TAIR_URL = getpass.getpass("Input your tair url:")
Input your tair url:········
from tair import Tair as TairClient

# connect to tair from url and create a client

url = TAIR_URL
client = TairClient.from_url(url)

We can test the connection with a ping:

client.ping()
True
import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
100% [......................................................................] 698933052 / 698933052
'vector_database_wikipedia_articles_embedded (1).zip'

The downloaded file then needs to be extracted:

import zipfile
import os

current_directory = os.getcwd()
zip_file_path = os.path.join(current_directory, "vector_database_wikipedia_articles_embedded.zip")
output_directory = os.path.join(current_directory, "../../data")

with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(output_directory)


# check that the csv file exists
file_name = "vector_database_wikipedia_articles_embedded.csv"
data_directory = os.path.join(current_directory, "../../data")
file_path = os.path.join(data_directory, file_name)


if os.path.exists(file_path):
    print(f"The file {file_name} exists in the data directory.")
else:
    print(f"The file {file_name} does not exist in the data directory.")
The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.

Create indexes

Tair stores data in indexes, where each object is described by one key. Each key holds a vector and multiple attribute keys.

We will start by creating two indexes, one for title_vector and one for content_vector, and then fill them with our precomputed embeddings.

# set index parameters
index = "openai_test"
embedding_dim = 1536
distance_type = "L2"
index_type = "HNSW"
data_type = "FLOAT32"

# Create two indexes, one for title_vector and one for content_vector; skip if they already exist
index_names = [index + "_title_vector", index + "_content_vector"]
for index_name in index_names:
    index_info = client.tvs_get_index(index_name)
    if index_info is not None:
        print("Index already exists")
    else:
        client.tvs_create_index(name=index_name, dim=embedding_dim, distance_type=distance_type,
                                index_type=index_type, data_type=data_type)
Index already exists
Index already exists

Load data

In this section, we will load the data we prepared earlier, so you don't have to recompute the embeddings of the Wikipedia articles with your own credits.

import pandas as pd
from ast import literal_eval
# Path to your local CSV file
csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'
article_df = pd.read_csv(csv_file_path)

# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval).values
article_df['content_vector'] = article_df.content_vector.apply(literal_eval).values

# add/update data to indexes
for i in range(len(article_df)):
    # add data to index with title_vector
    client.tvs_hset(index=index_names[0], key=article_df.id[i].item(), vector=article_df.title_vector[i], is_binary=False,
                    **{"url": article_df.url[i], "title": article_df.title[i], "text": article_df.text[i]})
    # add data to index with content_vector
    client.tvs_hset(index=index_names[1], key=article_df.id[i].item(), vector=article_df.content_vector[i], is_binary=False,
                    **{"url": article_df.url[i], "title": article_df.title[i], "text": article_df.text[i]})
# Check the data count to make sure all the points have been stored
for index_name in index_names:
    stats = client.tvs_get_index(index_name)
    count = int(stats["current_record_count"]) - int(stats["delete_record_count"])
    print(f"Count in {index_name}:{count}")
Count in openai_test_title_vector:25000
Count in openai_test_content_vector:25000
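
As a quick sanity check (a sketch, not part of the original walkthrough), we can also read back the attributes stored with the first key, using the same tvs_hmget call that the search section relies on below; attribute values come back as bytes:

# Read back the url and title attributes stored with the first key
sample_key = article_df.id[0].item()
url, title = client.tvs_hmget(index_names[0], sample_key, "url", "title")
print(url.decode('utf-8'), title.decode('utf-8'))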

Search data

Once the data is stored in Tair, we can start querying the collection for the closest vectors. We can provide an additional parameter vector_name to switch from title-based to content-based search. Since the precomputed embeddings were created with the text-embedding-3-small OpenAI model, we also have to use it during search.

import numpy as np
import openai


def query_tair(client, query, vector_name="title_vector", top_k=5):

    # Create an embedding vector from the user query
    embedded_query = openai.Embedding.create(
        input=query,
        model="text-embedding-3-small",
    )["data"][0]["embedding"]
    embedded_query = np.array(embedded_query)

    # Search for the top k approximate nearest neighbors of the vector in an index
    query_result = client.tvs_knnsearch(index=index + "_" + vector_name, k=top_k, vector=embedded_query)

    return query_result

query_result = query_tair(client=client, query="modern art in Europe", vector_name="title_vector")
for i in range(len(query_result)):
    # Both indexes store the same attributes, so the title can be read from the content_vector index
    title = client.tvs_hmget(index + "_" + "content_vector", query_result[i][0].decode('utf-8'), "title")
    print(f"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1], 3)})")
1. Museum of Modern Art (Distance: 0.125)
2. Western Europe (Distance: 0.133)
3. Renaissance art (Distance: 0.136)
4. Pop art (Distance: 0.14)
5. Northern Europe (Distance: 0.145)
# This time we'll query using the content vector
query_result = query_tair(client=client, query="Famous battles in Scottish history", vector_name="content_vector")
for i in range(len(query_result)):
    title = client.tvs_hmget(index + "_" + "content_vector", query_result[i][0].decode('utf-8'), "title")
    print(f"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1], 3)})")
1. Battle of Bannockburn (Distance: 0.131)
2. Wars of Scottish Independence (Distance: 0.139)
3. 1651 (Distance: 0.147)
4. First War of Scottish Independence (Distance: 0.15)
5. Robert I of Scotland (Distance: 0.154)
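
When you are done experimenting, you may want to reclaim the memory by dropping both indexes. A hedged cleanup sketch: tvs_del_index is assumed to wrap Tair's TVS.DELINDEX command, and it removes an index together with all vectors and attributes stored in it, so verify against your client version before running.

# Optional cleanup: drop both indexes and everything stored in them
# (tvs_del_index is assumed to map to the TVS.DELINDEX command)
for index_name in index_names:
    client.tvs_del_index(index_name)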