2023年2月13日

使用Redis作为OpenAI的向量数据库

本笔记本介绍如何将Redis作为向量数据库与OpenAI嵌入结合使用。Redis是一个可扩展的实时数据库,在使用RediSearch模块时可作为向量数据库。RediSearch模块允许您对Redis中的向量进行索引和搜索。本笔记本将展示如何使用RediSearch模块来索引和搜索通过OpenAI API创建并存储在Redis中的向量。

什么是Redis?

大多数来自网络服务背景的开发者可能都熟悉Redis。Redis本质上是一个开源键值存储系统,可用作缓存、消息代理和数据库。开发者选择Redis是因为它速度快、拥有庞大的客户端库生态系统,并且多年来一直被大型企业广泛部署。

除了Redis的传统用途外,Redis还提供了Redis Modules,这是一种通过新数据类型和命令来扩展Redis的方式。示例模块包括RedisJSONRedisTimeSeriesRedisBloomRediSearch

什么是RediSearch?

RediSearch是一个Redis模块,它为Redis提供查询、二级索引、全文搜索和向量搜索功能。要使用RediSearch,您需要先在Redis数据上声明索引。然后就可以使用RediSearch客户端查询这些数据。有关RediSearch功能集的更多信息,请参阅READMERediSearch文档

部署选项

有多种方式可以部署Redis。对于本地开发而言,最快的方法是使用Redis Stack docker容器,这也是我们将在此采用的方式。Redis Stack包含多个Redis模块,这些模块可以协同工作,构建一个快速的多模型数据存储和查询引擎。

对于生产环境用例,最简单的入门方式是使用Redis Cloud服务。Redis Cloud是一项全托管的Redis服务。您也可以通过Redis Enterprise在自有基础设施上部署Redis。Redis Enterprise是支持部署在Kubernetes、本地或云端的全托管Redis服务。

此外,每个主要云服务提供商(AWS MarketplaceGoogle MarketplaceAzure Marketplace)都在其应用市场中提供Redis Enterprise服务。

先决条件

在开始这个项目之前,我们需要进行以下设置:

  • 启动一个带有RediSearch(redis-stack)的Redis数据库
  • install libraries
  • 获取您的OpenAI API key

===========================================================

启动Redis

为了使这个示例简单易懂,我们将使用Redis Stack的docker容器,可以按如下方式启动

$ docker-compose up -d

这还包括用于管理Redis数据库的RedisInsight图形界面,启动docker容器后可通过http://localhost:8001访问该界面。

您已完成所有设置,可以开始使用了!接下来,我们将导入并创建客户端,用于与刚刚创建的Redis数据库进行通信。

安装要求

Redis-Py 是用于与 Redis 通信的 Python 客户端。我们将使用它与我们的 Redis-stack 数据库进行交互。

! pip install redis wget pandas openai
! export OPENAI_API_KEY="your API key"
# Test that your OpenAI API key is correctly set as an environment variable
# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.
import os
import openai

# Note. alternatively you can set a temporary env variable like this:
# os.environ["OPENAI_API_KEY"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

if os.getenv("OPENAI_API_KEY") is not None:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print ("OPENAI_API_KEY is ready")
else:
    print ("OPENAI_API_KEY environment variable not found")
OPENAI_API_KEY is ready

加载数据

在本节中,我们将加载已转换为向量的嵌入数据。我们将使用这些数据在Redis中创建索引,然后搜索相似的向量。

import sys
import numpy as np
import pandas as pd
from typing import List

# use helper function in nbutils.py to download and read the data
# this should take from 5-10 min to run
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())
import nbutils

nbutils.download_wikipedia_data()
data = nbutils.read_wikipedia_data()

data.head()
File Downloaded
id url 标题 文本内容 标题向量 内容向量 向量id
0 1 https://simple.wikipedia.org/wiki/April 四月 四月是公历年中第四个月份... [0.001009464613161981, -0.020700545981526375, ... [-0.011253940872848034, -0.013491976074874401,... 0
1 2 https://simple.wikipedia.org/wiki/August August 八月(Aug.)是一年中的第八个月份... [0.0009286514250561595, 0.000820168002974242, ... [0.0003609954728744924, 0.007262262050062418, ... 1
2 6 https://simple.wikipedia.org/wiki/Art 艺术 艺术是一种表达想象力的创造性活动... [0.003393713850528002, 0.0061537534929811954, ... [-0.004959689453244209, 0.015772193670272827, ... 2
3 8 https://simple.wikipedia.org/wiki/A A A或a是英语字母表中的第一个字母... [0.0153952119871974, -0.013759135268628597, 0.... [0.024894846603274345, -0.022186409682035446, ... 3
4 9 https://simple.wikipedia.org/wiki/Air Air 空气指的是地球的大气层。空气是一种... [0.02224554680287838, -0.02044147066771984, -... [0.021524671465158463, 0.018522677943110466, -... 4

连接到Redis

现在我们的Redis数据库已经运行起来了,可以使用Redis-py客户端进行连接。我们将使用Redis数据库的默认主机和端口,即localhost:6379

import redis
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField
)

REDIS_HOST =  "localhost"
REDIS_PORT = 6379
REDIS_PASSWORD = "" # default for passwordless Redis

# Connect to Redis
redis_client = redis.Redis(
    host=REDIS_HOST,
    port=REDIS_PORT,
    password=REDIS_PASSWORD
)
redis_client.ping()
True

在Redis中创建搜索索引

以下单元格将展示如何在Redis中指定和创建搜索索引。我们将:

  1. 设置一些常量来定义我们的索引,比如距离度量和索引名称
  2. 使用RediSearch字段定义索引模式
  3. 创建索引
# Constants
VECTOR_DIM = len(data['title_vector'][0]) # length of the vectors
VECTOR_NUMBER = len(data)                 # initial number of vectors
INDEX_NAME = "embeddings-index"           # name of the search index
PREFIX = "doc"                            # prefix for the document keys
DISTANCE_METRIC = "COSINE"                # distance metric for the vectors (ex. COSINE, IP, L2)
# Define RediSearch fields for each of the columns in the dataset
title = TextField(name="title")
url = TextField(name="url")
text = TextField(name="text")
title_embedding = VectorField("title_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER,
    }
)
text_embedding = VectorField("content_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER,
    }
)
fields = [title, url, text, title_embedding, text_embedding]
# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except:
    # Create RediSearch Index
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
)

将文档加载到索引中

现在我们有了一个搜索索引,可以将文档加载到其中。我们将使用与之前示例相同的文档。在Redis中,可以使用HASH或JSON(如果除了RediSearch还使用RedisJSON)数据类型来存储文档。在本示例中,我们将使用HASH数据类型。下面的单元格将展示如何将文档加载到索引中。

def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):
    records = documents.to_dict("records")
    for doc in records:
        key = f"{prefix}:{str(doc['id'])}"

        # create byte vectors for title and content
        title_embedding = np.array(doc["title_vector"], dtype=np.float32).tobytes()
        content_embedding = np.array(doc["content_vector"], dtype=np.float32).tobytes()

        # replace list of floats with byte vectors
        doc["title_vector"] = title_embedding
        doc["content_vector"] = content_embedding

        client.hset(key, mapping = doc)
index_documents(redis_client, PREFIX, data)
print(f"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}")
Loaded 25000 documents in Redis search index with name: embeddings-index

使用OpenAI查询嵌入实现简单向量搜索查询

现在我们已经有了一个搜索索引并将文档加载其中,可以运行搜索查询了。下面我们将提供一个函数来执行搜索查询并返回结果。通过这个函数,我们将运行几个查询来展示如何利用Redis作为向量数据库。

def search_redis(
    redis_client: redis.Redis,
    user_query: str,
    index_name: str = "embeddings-index",
    vector_field: str = "title_vector",
    return_fields: list = ["title", "url", "text", "vector_score"],
    hybrid_fields = "*",
    k: int = 20,
    print_results: bool = True,
) -> List[dict]:

    # Creates embedding vector from user query
    embedded_query = openai.Embedding.create(input=user_query,
                                            model="text-embedding-3-small",
                                            )["data"][0]['embedding']

    # Prepare the Query
    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'
    query = (
        Query(base_query)
         .return_fields(*return_fields)
         .sort_by("vector_score")
         .paging(0, k)
         .dialect(2)
    )
    params_dict = {"vector": np.array(embedded_query).astype(dtype=np.float32).tobytes()}

    # perform vector search
    results = redis_client.ft(index_name).search(query, params_dict)
    if print_results:
        for i, article in enumerate(results.docs):
            score = 1 - float(article.vector_score)
            print(f"{i}. {article.title} (Score: {round(score ,3) })")
    return results.docs
# For using OpenAI to generate query embedding
results = search_redis(redis_client, 'modern art in Europe', k=10)
0. Museum of Modern Art (Score: 0.875)
1. Western Europe (Score: 0.868)
2. Renaissance art (Score: 0.864)
3. Pop art (Score: 0.86)
4. Northern Europe (Score: 0.855)
5. Hellenistic art (Score: 0.853)
6. Modernist literature (Score: 0.847)
7. Art film (Score: 0.843)
8. Central Europe (Score: 0.843)
9. European (Score: 0.841)
results = search_redis(redis_client, 'Famous battles in Scottish history', vector_field='content_vector', k=10)
0. Battle of Bannockburn (Score: 0.869)
1. Wars of Scottish Independence (Score: 0.861)
2. 1651 (Score: 0.853)
3. First War of Scottish Independence (Score: 0.85)
4. Robert I of Scotland (Score: 0.846)
5. 841 (Score: 0.844)
6. 1716 (Score: 0.844)
7. 1314 (Score: 0.837)
8. 1263 (Score: 0.836)
9. William Wallace (Score: 0.835)

基于Redis的混合查询

前面的示例展示了如何使用RediSearch运行向量搜索查询。在本节中,我们将展示如何将向量搜索与其他RediSearch字段结合进行混合搜索。在下面的示例中,我们将把向量搜索与全文搜索结合起来。

def create_hybrid_field(field_name: str, value: str) -> str:
    return f'@{field_name}:"{value}"'

# search the content vector for articles about famous battles in Scottish history and only include results with Scottish in the title
results = search_redis(redis_client,
                       "Famous battles in Scottish history",
                       vector_field="title_vector",
                       k=5,
                       hybrid_fields=create_hybrid_field("title", "Scottish")
                       )
0. First War of Scottish Independence (Score: 0.892)
1. Wars of Scottish Independence (Score: 0.889)
2. Second War of Scottish Independence (Score: 0.879)
3. List of Scottish monarchs (Score: 0.873)
4. Scottish Borders (Score: 0.863)
# run a hybrid query for articles about Art in the title vector and only include results with the phrase "Leonardo da Vinci" in the text
results = search_redis(redis_client,
                       "Art",
                       vector_field="title_vector",
                       k=5,
                       hybrid_fields=create_hybrid_field("text", "Leonardo da Vinci")
                       )

# find specific mention of Leonardo da Vinci in the text that our full-text-search query returned
mention = [sentence for sentence in results[0].text.split("\n") if "Leonardo da Vinci" in sentence][0]
mention
0. Art (Score: 1.0)
1. Paint (Score: 0.896)
2. Renaissance art (Score: 0.88)
3. Painting (Score: 0.874)
4. Renaissance (Score: 0.846)
'In Europe, after the Middle Ages, there was a "Renaissance" which means "rebirth". People rediscovered science and artists were allowed to paint subjects other than religious subjects. People like Michelangelo and Leonardo da Vinci still painted religious pictures, but they also now could paint mythological pictures too. These artists also invented perspective where things in the distance look smaller in the picture. This was new because in the Middle Ages people would paint all the figures close up and just overlapping each other. These artists used nudity regularly in their art.'

HNSW索引

到目前为止,我们一直在使用FLAT或"暴力"索引来运行查询。Redis还支持HNSW索引,这是一种快速的近似索引。HNSW索引是基于图的索引,它使用分层可导航小世界图来存储向量。对于想要运行近似查询的大型数据集,HNSW索引是一个不错的选择。

HNSW在大多数情况下会比FLAT花费更长的构建时间和消耗更多内存,但在查询运行时速度更快,特别是对于大型数据集。

以下单元格将展示如何创建HNSW索引并使用之前相同的数据运行查询。

# re-define RediSearch vector fields to use HNSW index
title_embedding = VectorField("title_vector",
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER
    }
)
text_embedding = VectorField("content_vector",
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": VECTOR_NUMBER
    }
)
fields = [title, url, text, title_embedding, text_embedding]
import time
# Check if index exists
HNSW_INDEX_NAME = INDEX_NAME+ "_HNSW"

try:
    redis_client.ft(HNSW_INDEX_NAME).info()
    print("Index already exists")
except:
    # Create RediSearch Index
    redis_client.ft(HNSW_INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )

# since RediSearch creates the index in the background for existing documents, we will wait until
# indexing is complete before running our queries. Although this is not necessary for the first query,
# some queries may take longer to run if the index is not fully built. In general, Redis will perform
# best when adding new documents to existing indices rather than new indices on existing documents.
while redis_client.ft(HNSW_INDEX_NAME).info()["indexing"] == "1":
    time.sleep(5)
results = search_redis(redis_client, 'modern art in Europe', index_name=HNSW_INDEX_NAME, k=10)
0. Western Europe (Score: 0.868)
1. Northern Europe (Score: 0.855)
2. Central Europe (Score: 0.843)
3. European (Score: 0.841)
4. Eastern Europe (Score: 0.839)
5. Europe (Score: 0.839)
6. Western European Union (Score: 0.837)
7. Southern Europe (Score: 0.831)
8. Western civilization (Score: 0.83)
9. Council of Europe (Score: 0.827)
# compare the results of the HNSW index to the FLAT index and time both queries
def time_queries(iterations: int = 10):
    print(" ----- Flat Index ----- ")
    t0 = time.time()
    for i in range(iterations):
        results_flat = search_redis(redis_client, 'modern art in Europe', k=10, print_results=False)
    t0 = (time.time() - t0) / iterations
    results_flat = search_redis(redis_client, 'modern art in Europe', k=10, print_results=True)
    print(f"Flat index query time: {round(t0, 3)} seconds\n")
    time.sleep(1)
    print(" ----- HNSW Index ------ ")
    t1 = time.time()
    for i in range(iterations):
        results_hnsw = search_redis(redis_client, 'modern art in Europe', index_name=HNSW_INDEX_NAME, k=10, print_results=False)
    t1 = (time.time() - t1) / iterations
    results_hnsw = search_redis(redis_client, 'modern art in Europe', index_name=HNSW_INDEX_NAME, k=10, print_results=True)
    print(f"HNSW index query time: {round(t1, 3)} seconds")
    print(" ------------------------ ")
time_queries()
 ----- Flat Index ----- 
0. Museum of Modern Art (Score: 0.875)
1. Western Europe (Score: 0.867)
2. Renaissance art (Score: 0.864)
3. Pop art (Score: 0.861)
4. Northern Europe (Score: 0.855)
5. Hellenistic art (Score: 0.853)
6. Modernist literature (Score: 0.847)
7. Art film (Score: 0.843)
8. Central Europe (Score: 0.843)
9. Art (Score: 0.842)
Flat index query time: 0.263 seconds

 ----- HNSW Index ------ 
0. Western Europe (Score: 0.867)
1. Northern Europe (Score: 0.855)
2. Central Europe (Score: 0.843)
3. European (Score: 0.841)
4. Eastern Europe (Score: 0.839)
5. Europe (Score: 0.839)
6. Western European Union (Score: 0.837)
7. Southern Europe (Score: 0.831)
8. Western civilization (Score: 0.83)
9. Council of Europe (Score: 0.827)
HNSW index query time: 0.129 seconds
 ------------------------