This notebook guides you through using Neon Serverless Postgres as a vector database for OpenAI embeddings. It demonstrates how to:
- Use embeddings created by the OpenAI API.
- Store embeddings in a Neon Serverless Postgres database.
- Convert a raw text query to an embedding with the OpenAI API.
- Use Neon with the pgvector extension to perform vector similarity search.
Before you begin, ensure that you have the following:
- A neondb database. For instructions, see Sign up and Create your first project.
- The pgvector extension. Install it by running CREATE EXTENSION vector; in Neon. For instructions, see Enable the pgvector extension.
- pip. This notebook requires the openai, psycopg2, pandas, wget, and python-dotenv packages, which you can install with pip:
```python
! pip install openai psycopg2 pandas wget python-dotenv
```

Generating vectors for the documents and queries requires an OpenAI API key.
If you do not have an OpenAI API key, obtain one from https://platform.openai.com/account/api-keys.
Add the OpenAI API key as an operating system environment variable, or provide it for the session when prompted. If you define an environment variable, name the variable OPENAI_API_KEY.
For information about configuring your OpenAI API key as an environment variable, see Best Practices for API Key Safety.
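As an alternative to a shell-level environment variable, the key can also live in a local .env file, the same mechanism this notebook uses later for the database connection string. A minimal sketch (not part of the original notebook; it assumes a .env file in the working directory containing an OPENAI_API_KEY entry):

```python
# Optional sketch: load OPENAI_API_KEY from a local .env file.
# Assumes the .env file contains a line such as: OPENAI_API_KEY=<your key>
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("OPENAI_API_KEY found" if os.getenv("OPENAI_API_KEY") else "OPENAI_API_KEY not found")
```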
```python
# Test to ensure that your OpenAI API key is defined as an environment variable or provide it when prompted
# If you run this notebook locally, you may have to reload the terminal and the notebook to make the environment available

import os
from getpass import getpass

# Check if OPENAI_API_KEY is set as an environment variable
if os.getenv("OPENAI_API_KEY") is not None:
    print("Your OPENAI_API_KEY is ready")
else:
    # If not, prompt for it
    api_key = getpass("Enter your OPENAI_API_KEY: ")
    if api_key:
        print("Your OPENAI_API_KEY is now available for this session")
        # Optionally, you can set it as an environment variable for the current session
        os.environ["OPENAI_API_KEY"] = api_key
    else:
        print("You did not enter your OPENAI_API_KEY")
```

```
Your OPENAI_API_KEY is ready
```
Provide your Neon database connection string below, or define it in an .env file using a DATABASE_URL variable. For information about obtaining a Neon connection string, see Connect from any application.
```python
import os
import psycopg2
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# The connection string can be provided directly here.
# Replace the next line with your Neon connection string.
connection_string = "postgres://<user>:<password>@<hostname>/<dbname>"

# If connection_string is not directly provided above,
# then check if DATABASE_URL is set in the environment or .env.
if not connection_string:
    connection_string = os.environ.get("DATABASE_URL")

    # If neither method provides a connection string, raise an error.
    if not connection_string:
        raise ValueError("Please provide a valid connection string either in the code or in the .env file as DATABASE_URL.")

# Connect using the connection string
connection = psycopg2.connect(connection_string)

# Create a new cursor object
cursor = connection.cursor()
```

Test the connection to your database:
```python
# Execute this query to test the database connection
cursor.execute("SELECT 1;")
result = cursor.fetchone()

# Check the query result
if result == (1,):
    print("Your database connection was successful!")
else:
    print("Your connection failed.")
```

```
Your database connection was successful!
```
This guide uses pre-computed Wikipedia article embeddings from the OpenAI Cookbook examples directory so that you do not have to compute embeddings with your own OpenAI credits.

Import the zip file containing the pre-computed embeddings:
```python
import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB. Importing it will take several minutes.
wget.download(embeddings_url)
```

```
'vector_database_wikipedia_articles_embedded.zip'
```
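Because the archive is roughly 700 MB, you may want to avoid downloading it again on repeated runs. An optional sketch (not part of the original notebook; it reuses embeddings_url from the cell above, and the filename is the one reported by wget):

```python
# Optional sketch: skip the ~700 MB download if the archive is already present.
import os
import wget

zip_name = "vector_database_wikipedia_articles_embedded.zip"
if os.path.exists(zip_name):
    print(f"{zip_name} already present, skipping download")
else:
    wget.download(embeddings_url)
```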
Extract the downloaded zip file:
```python
import zipfile
import os
import re
import tempfile

current_directory = os.getcwd()
zip_file_path = os.path.join(current_directory, "vector_database_wikipedia_articles_embedded.zip")
output_directory = os.path.join(current_directory, "../../data")

with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(output_directory)

# Check to see if the csv file was extracted
file_name = "vector_database_wikipedia_articles_embedded.csv"
data_directory = os.path.join(current_directory, "../../data")
file_path = os.path.join(data_directory, file_name)

if os.path.exists(file_path):
    print(f"The csv file {file_name} exists in the data directory.")
else:
    print(f"The csv file {file_name} does not exist in the data directory.")
```

```
The csv file vector_database_wikipedia_articles_embedded.csv exists in the data directory.
```
The vector table created in your database is called articles. Each object has title and content vectors.

An index is defined on both the title and content vector columns.
```python
create_table_sql = '''
CREATE TABLE IF NOT EXISTS public.articles (
    id INTEGER NOT NULL,
    url TEXT,
    title TEXT,
    content TEXT,
    title_vector vector(1536),
    content_vector vector(1536),
    vector_id INTEGER
);

ALTER TABLE public.articles ADD PRIMARY KEY (id);
'''

# SQL statement for creating indexes
create_indexes_sql = '''
CREATE INDEX ON public.articles USING ivfflat (content_vector) WITH (lists = 1000);
CREATE INDEX ON public.articles USING ivfflat (title_vector) WITH (lists = 1000);
'''

# Execute the SQL statements
cursor.execute(create_table_sql)
cursor.execute(create_indexes_sql)

# Commit the changes
connection.commit()
```
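The ivfflat indexes created above are approximate: at query time, pgvector scans only a subset of the index lists, controlled by the ivfflat.probes session setting (default 1). Raising it improves recall at some cost in speed. An optional, illustrative sketch (the value 10 is arbitrary and not a recommendation from the original notebook):

```python
# Optional tuning sketch: scan more ivfflat lists per query for better recall.
# ivfflat.probes is a pgvector session setting; 10 is only an illustrative value.
cursor.execute("SET ivfflat.probes = 10;")
```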
Load the pre-computed vector data into your articles table from the .csv file. There are 25,000 records, so expect the operation to take a few minutes.

```python
import io

# Path to your local CSV file
csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'

# Define a generator function to process the csv file
def process_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield line

# Create a StringIO object to store the modified lines
modified_lines = io.StringIO(''.join(list(process_file(csv_file_path))))

# Create the COPY command for copy_expert
copy_command = '''
COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)
FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');
'''

# Execute the COPY command using copy_expert
cursor.copy_expert(copy_command, modified_lines)

# Commit the changes
connection.commit()
```

Check the number of records to ensure the data was loaded. There should be 25,000 records.
```python
# Check the size of the data
count_sql = """select count(*) from public.articles;"""
cursor.execute(count_sql)
result = cursor.fetchone()
print(f"Count:{result[0]}")
```

```
Count:25000
```
With the data stored in your Neon database, you can query the data for nearest neighbors.

Start by defining the query_neon function, which is executed when you run the vector similarity search. The function creates an embedding based on the user's query, prepares the SQL query, and runs the SQL query with the embedding. The pre-computed embeddings that you loaded into your database were created with the text-embedding-3-small OpenAI model, which means you must use the same model to create an embedding for the similarity search.

A vector_name parameter is provided that allows you to search based on "title" or "content".
```python
def query_neon(query, collection_name, vector_name="title_vector", top_k=20):
    # Create an embedding vector from the user query
    embedded_query = openai.Embedding.create(
        input=query,
        model="text-embedding-3-small",
    )["data"][0]["embedding"]

    # Convert the embedded_query to PostgreSQL compatible format
    embedded_query_pg = "[" + ",".join(map(str, embedded_query)) + "]"

    # Create the SQL query
    query_sql = f"""
    SELECT id, url, title, l2_distance({vector_name},'{embedded_query_pg}'::VECTOR(1536)) AS similarity
    FROM {collection_name}
    ORDER BY {vector_name} <-> '{embedded_query_pg}'::VECTOR(1536)
    LIMIT {top_k};
    """

    # Execute the query
    cursor.execute(query_sql)
    results = cursor.fetchall()

    return results
```

Run a similarity search based on the title_vector embeddings:
```python
# Query based on `title_vector` embeddings
import openai

query_results = query_neon("Greek mythology", "Articles")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")
```

```
1. Greek mythology (Score: 0.998)
2. Roman mythology (Score: 0.7)
3. Greek underworld (Score: 0.637)
4. Mythology (Score: 0.635)
5. Classical mythology (Score: 0.629)
6. Japanese mythology (Score: 0.615)
7. Norse mythology (Score: 0.569)
8. Greek language (Score: 0.566)
9. Zeus (Score: 0.534)
10. List of mythologies (Score: 0.531)
11. Jupiter (mythology) (Score: 0.53)
12. Greek (Score: 0.53)
13. Gaia (mythology) (Score: 0.526)
14. Titan (mythology) (Score: 0.522)
15. Mercury (mythology) (Score: 0.521)
16. Ancient Greece (Score: 0.52)
17. Greek alphabet (Score: 0.52)
18. Venus (mythology) (Score: 0.515)
19. Pluto (mythology) (Score: 0.515)
20. Athena (Score: 0.514)
```
Run a similarity search based on the content_vector embeddings:
```python
# Query based on `content_vector` embeddings
query_results = query_neon("Famous battles in Greek history", "Articles", "content_vector")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")
```

```
1. 222 BC (Score: 0.489)
2. Trojan War (Score: 0.458)
3. Peloponnesian War (Score: 0.456)
4. History of the Peloponnesian War (Score: 0.449)
5. 430 BC (Score: 0.441)
6. 168 BC (Score: 0.436)
7. Ancient Greece (Score: 0.429)
8. Classical Athens (Score: 0.428)
9. 499 BC (Score: 0.427)
10. Leonidas I (Score: 0.426)
11. Battle (Score: 0.421)
12. Greek War of Independence (Score: 0.421)
13. Menelaus (Score: 0.419)
14. Thebes, Greece (Score: 0.417)
15. Patroclus (Score: 0.417)
16. 427 BC (Score: 0.416)
17. 429 BC (Score: 0.413)
18. August 2 (Score: 0.412)
19. Ionia (Score: 0.411)
20. 323 (Score: 0.409)
```