Jun 28, 2023

Using Pinecone for Embeddings Search

This notebook takes you through a simple flow: download some data, embed it, then index and search it with the chosen vector database. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.

What is a Vector Database

A vector database is a database designed for storing, managing and searching embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) into vectors for consumption by machine learning models has exploded in recent years, driven by the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.
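To make this concrete, below is a minimal sketch of how a piece of text is encoded into an embedding vector. It assumes the same pre-1.0 `openai` Python client used later in this notebook, a valid OPENAI_API_KEY in your environment, and the model name we set below as EMBEDDING_MODEL; the example sentence is purely illustrative.

# Minimal sketch: encode a short piece of text into an embedding vector
import openai

response = openai.Embedding.create(
    input="Vector databases store and search embedding vectors",
    model="text-embedding-3-small",
)
vector = response["data"][0]["embedding"]
print(len(vector))  # text-embedding-3-small returns 1536-dimensional vectors by default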

Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases shared in this repository (question answering, chatbots and recommendation services, for example) and make use of them in a secure, scalable environment. Many of our customers already solve problems with embeddings at small scale, but performance and security hold them back from going into production - we see vector databases as a key component in solving that. This guide walks you through the basics of embedding text data, storing it in a vector database and using it for semantic search.

Demo Flow

The demo flow is:

  • Setup: Import packages and set any required variables
  • Load data: Load a dataset and embed it using OpenAI embeddings
  • Pinecone
    • Setup: Here we'll set up the Python client for Pinecone. For more details go here
    • Index Data: We'll create an index with namespaces for article titles and content
    • Search Data: We'll test both namespaces with search queries to confirm it works

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings.

Setup

Import the required libraries and set the embedding model we'd like to use.

# We'll need to install the Pinecone client
!pip install pinecone-client

#Install wget to pull zip file
!pip install wget
Requirement already satisfied: pinecone-client in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (2.2.2)
Requirement already satisfied: requests>=2.19.0 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (2.31.0)
Requirement already satisfied: pyyaml>=5.4 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (6.0)
Requirement already satisfied: loguru>=0.5.0 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (0.7.0)
Requirement already satisfied: typing-extensions>=3.7.4 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (4.5.0)
Requirement already satisfied: dnspython>=2.0.0 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (2.3.0)
Requirement already satisfied: python-dateutil>=2.5.3 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (2.8.2)
Requirement already satisfied: urllib3>=1.21.1 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (1.26.16)
Requirement already satisfied: tqdm>=4.64.1 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (4.65.0)
Requirement already satisfied: numpy>=1.22.0 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from pinecone-client) (1.25.0)
Requirement already satisfied: six>=1.5 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from python-dateutil>=2.5.3->pinecone-client) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from requests>=2.19.0->pinecone-client) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from requests>=2.19.0->pinecone-client) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (from requests>=2.19.0->pinecone-client) (2023.5.7)
Requirement already satisfied: wget in /Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages (3.2)
import openai

from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval

# Pinecone's client library for Python
import pinecone

# I've set this to our new embeddings model; this can be changed to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-3-small"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 
/Users/colin.jarvis/Documents/dev/cookbook/openai-cookbook/vector_db/lib/python3.10/site-packages/pinecone/index.py:4: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  from tqdm.autonotebook import tqdm
embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
import zipfile
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../data")
article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')
article_df.head()
id url title text title_vector content_vector vector_id
0 1 https://simple.wikipedia.org/wiki/April April April is the fourth month of the year in the Gregorian calendar... [0.001009464613161981, -0.020700545981526375, ... [-0.011253940872848034, -0.013491976074874401,... 0
1 2 https://simple.wikipedia.org/wiki/August August August (Aug.) is the eighth month of the year... [0.0009286514250561595, 0.000820168002974242, ... [0.0003609954728744924, 0.007262262050062418, ... 1
2 6 https://simple.wikipedia.org/wiki/Art Art Art is a creative activity that expresses imagination... [0.003393713850528002, 0.0061537534929811954, ... [-0.004959689453244209, 0.015772193670272827, ... 2
3 8 https://simple.wikipedia.org/wiki/A A A or a is the first letter of the English alphabet... [0.0153952119871974, -0.013759135268628597, 0.... [0.024894846603274345, -0.022186409682035446, ... 3
4 9 https://simple.wikipedia.org/wiki/Air Air Air refers to the Earth's atmosphere. Air is a... [0.02224554680287838, -0.02044147066771984, -... [0.021524671465158463, 0.018522677943110466, -... 4
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
article_df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              25000 non-null  int64 
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB

Pinecone

The next option we'll look at is Pinecone, a managed vector database which offers a cloud-native option.

Before you proceed with this step you'll need to navigate to Pinecone, sign up and then save your API key as an environment variable named PINECONE_API_KEY.
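As an optional sanity check (a sketch, not part of the original flow), you can confirm the key is visible to the notebook before initialising the client:

import os

# Fail early with a clear message if the PINECONE_API_KEY environment variable is missing
if not os.getenv("PINECONE_API_KEY"):
    raise RuntimeError("PINECONE_API_KEY is not set - export it before starting the notebook")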

For this section we will:

  • Create an index with multiple namespaces for article titles and content
  • Store our data in the index with separate searchable "namespaces" for article titles and content
  • Fire some similarity search queries to verify our setup is working
api_key = os.getenv("PINECONE_API_KEY")
pinecone.init(api_key=api_key)

Create Index

First we will create an index named wikipedia-articles. Once we have an index, we can create multiple namespaces, which can make a single index usable for a variety of search tasks. For more details, consult the Pinecone documentation.

If you want to speed up index writes with parallel batch upserts, there is a useful guide on upserting in parallel in the Pinecone documentation; a rough sketch of the pattern follows below.
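As a sketch of that pattern (not used in the rest of this notebook), the client can send upserts asynchronously over a thread pool. The pool_threads and async_req parameters follow the pattern shown in Pinecone's parallel-upsert guide and may differ between client versions, so verify them against your installed pinecone-client; the sketch also reuses the df_batcher helper defined just below.

# Sketch: parallel batch upserts via the client's thread pool (verify against your client version)
with pinecone.Index(index_name='wikipedia-articles', pool_threads=30) as parallel_index:
    # Send each batch without blocking and collect the async results
    async_results = [
        parallel_index.upsert(
            vectors=list(zip(batch_df.vector_id, batch_df.content_vector)),
            namespace='content',
            async_req=True,
        )
        for batch_df in df_batcher(article_df)
    ]
    # Wait for all requests to complete
    [async_result.get() for async_result in async_results]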

# Models a simple batch generator that makes chunks out of an input DataFrame
class BatchGenerator:

    def __init__(self, batch_size: int = 10) -> None:
        self.batch_size = batch_size
    
    # Makes chunks out of an input DataFrame
    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:
        splits = self.splits_num(df.shape[0])
        if splits <= 1:
            yield df
        else:
            for chunk in np.array_split(df, splits):
                yield chunk

    # Determines how many chunks the DataFrame will be split into
    def splits_num(self, elements: int) -> int:
        return round(elements / self.batch_size)
    
    __call__ = to_batches

df_batcher = BatchGenerator(300)
# Pick a name for the new index
index_name = 'wikipedia-articles'

# Check whether the index with the same name already exists - if so, delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
    
# Creates new index
pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))
index = pinecone.Index(index_name=index_name)

# Confirm our index was created
pinecone.list_indexes()
['podcasts', 'wikipedia-articles']
# Upsert content vectors in content namespace - this can take a few minutes
print("Uploading vectors to content namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')
Uploading vectors to content namespace..
# Upsert title vectors in title namespace - this can also take a few minutes
print("Uploading vectors to title namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')
Uploading vectors to title namespace..
# Check index size for each namespace to confirm all of our docs have loaded
index.describe_index_stats()
{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'content': {'vector_count': 25000},
                'title': {'vector_count': 25000}},
 'total_vector_count': 50000}

Search Data

Now we'll enter some dummy searches and check we get decent results back.

# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results
titles_mapped = dict(zip(article_df.vector_id,article_df.title))
content_mapped = dict(zip(article_df.vector_id,article_df.text))
def query_article(query, namespace, top_k=5):
    '''Embeds the query, searches the specified namespace and prints the results.'''

    # Create an embedding vector for the query
    embedded_query = openai.Embedding.create(
                                            input=query,
                                            model=EMBEDDING_MODEL,
                                            )["data"][0]['embedding']

    # Query the namespace passed as a parameter with the embedded query
    query_result = index.query(embedded_query, 
                                      namespace=namespace, 
                                      top_k=top_k)

    # Print query results 
    print(f'\nMost similar results to {query} in "{namespace}" namespace:\n')
    if not query_result.matches:
        print('no query result')
    
    matches = query_result.matches
    ids = [res.id for res in matches]
    scores = [res.score for res in matches]
    df = pd.DataFrame({'id':ids, 
                       'score':scores,
                       'title': [titles_mapped[_id] for _id in ids],
                       'content': [content_mapped[_id] for _id in ids],
                       })
    
    for _, v in df.iterrows():
        print(f'{v.title} (score = {v.score})')
    
    print('\n')

    return df
query_output = query_article('modern art in Europe','title')
Most similar results to modern art in Europe in "title" namespace:

Museum of Modern Art (score = 0.875177085)
Western Europe (score = 0.867441177)
Renaissance art (score = 0.864156306)
Pop art (score = 0.860346854)
Northern Europe (score = 0.854658186)


content_query_output = query_article("Famous battles in Scottish history",'content')
Most similar results to Famous battles in Scottish history in "content" namespace:

Battle of Bannockburn (score = 0.869336188)
Wars of Scottish Independence (score = 0.861470938)
1651 (score = 0.852588475)
First War of Scottish Independence (score = 0.84962213)
Robert I of Scotland (score = 0.846214116)