# Postgres Vector Store

In this notebook we show how to perform vector search in LlamaIndex using PostgreSQL and pgvector.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
```python
%pip install llama-index-vector-stores-postgres
!pip install llama-index
```

Running the following cell will install Postgres with pgvector in Colab.

```python
!sudo apt update
!echo | sudo apt install -y postgresql-common
!echo | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
!echo | sudo apt install postgresql-15-pgvector
!sudo service postgresql start
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'password';"
!sudo -u postgres psql -c "CREATE DATABASE vector_db;"
```

```python
# import logging
# import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
```
```python
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
import textwrap
```

## Setup OpenAI

The first step is to configure the OpenAI key. It will be used to create embeddings for the documents loaded into the index.
```python
import os

os.environ["OPENAI_API_KEY"] = "sk-..."
```

## Download Data
```python
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
```

Load the documents stored in `data/paul_graham/` using the `SimpleDirectoryReader`.
```python
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
```

```
Document ID: 56e70c8c-0fb7-4250-99be-b953d0185a01
```

## Create the Database

Using an existing Postgres instance running on localhost, create the database we'll be using.
```python
import psycopg2

connection_string = "postgresql://postgres:password@localhost:5432"
db_name = "vector_db"
conn = psycopg2.connect(connection_string)
conn.autocommit = True

with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
```

## Create the index

Here we create an index backed by Postgres using the documents loaded previously. `PGVectorStore` takes a few arguments. The example below constructs a `PGVectorStore` with an HNSW index, using the `vector_cosine_ops` distance method with m = 16, ef_construction = 64, and ef_search = 40.
```python
from sqlalchemy import make_url

url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)
query_engine = index.as_query_engine()
```

```
Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]
Generating embeddings:   0%|          | 0/22 [00:00<?, ?it/s]
```

## Query the index

We can now ask questions using our index.
```python
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
```

```
The author worked on writing essays, programming, building microcomputers, predicting rocket
heights, developing a word processor, and giving talks on starting a startup.
```

```python
response = query_engine.query("What happened in the mid 1980s?")
print(textwrap.fill(str(response), 100))
```

```
AI was in the air in the mid 1980s, and two things that influenced the desire to work on it were a
novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called
Mike, and a PBS documentary that showed Terry Winograd using SHRDLU.
```

## Querying Existing Index

```python
vector_store = PGVectorStore.from_params(
    database="vector_db",
    host="localhost",
    password="password",
    port=5432,
    user="postgres",
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
```
```python
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
```

```
The author worked on writing essays, programming, creating microcomputers, developing software,
giving talks, and starting a startup.
```

## Hybrid Search

To enable hybrid search, you need to:

- pass in `hybrid_search=True` when constructing the `PGVectorStore` (and optionally configure `text_search_config` with the desired language)
- pass in `vector_store_query_mode="hybrid"` when constructing the query engine (this config is passed to the retriever under the hood). You can also optionally set `sparse_top_k` to configure how many results to obtain from sparse text search (by default it uses the same value as `similarity_top_k`).
```python
from sqlalchemy import make_url

url = make_url(connection_string)
hybrid_vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay_hybrid_search",
    embed_dim=1536,  # openai embedding dimension
    hybrid_search=True,
    text_search_config="english",
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)

storage_context = StorageContext.from_defaults(
    vector_store=hybrid_vector_store
)
hybrid_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
hybrid_query_engine = hybrid_index.as_query_engine(
    vector_store_query_mode="hybrid", sparse_top_k=2
)
hybrid_response = hybrid_query_engine.query(
    "Who does Paul Graham think of with the word schtick"
)
print(hybrid_response)
```

```
Roy Lichtenstein
```

## Improving hybrid search with QueryFusionRetriever
Because text search and vector search scores are calculated differently, the nodes that were found only by text search will have much lower scores.

You can often improve hybrid search performance by using the `QueryFusionRetriever`, which makes better use of the mutual information to rank the nodes.
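As an aside, the "relative score" fusion idea can be sketched in plain Python. This is an illustrative re-implementation, not LlamaIndex's actual code: min-max normalize each retriever's scores so dense and sparse results become comparable, then merge nodes by their best normalized score.

```python
# Illustrative sketch of relative-score fusion (not LlamaIndex's implementation):
# min-max normalize each result list's scores, then merge nodes by best score.


def relative_score_fusion(result_lists, top_k=5):
    merged = {}
    for results in result_lists:
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        for node_id, score in results:
            # Normalize into [0, 1]; if all scores are equal, fall back to 1.0
            norm = (score - lo) / (hi - lo) if hi > lo else 1.0
            merged[node_id] = max(merged.get(node_id, 0.0), norm)
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Dense scores (e.g. cosine similarity) and sparse scores (e.g. ts_rank) live
# on different scales, but normalization makes them comparable:
dense = [("a", 0.92), ("b", 0.85), ("c", 0.80)]
sparse = [("c", 13.1), ("d", 2.7)]
print(relative_score_fusion([dense, sparse], top_k=3))
```

Node `c` is only middling in the dense list but tops the sparse list, so after normalization it ranks alongside `a` instead of being drowned out by the raw score gap.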
```python
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

vector_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="default",
    similarity_top_k=5,
)
text_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="sparse",
    similarity_top_k=5,  # interchangeable with sparse_top_k in this context
)
retriever = QueryFusionRetriever(
    [vector_retriever, text_retriever],
    similarity_top_k=5,
    num_queries=1,  # set this to 1 to disable query generation
    mode="relative_score",
    use_async=False,
)

response_synthesizer = CompactAndRefine()
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

response = query_engine.query(
    "Who does Paul Graham think of with the word schtick, and why?"
)
print(response)
```

```
Paul Graham thinks of Roy Lichtenstein when using the word "schtick" because Lichtenstein's distinctive signature style in his paintings immediately identifies his work as his own.
```

## Metadata Filters

`PGVectorStore` supports storing metadata in nodes, and filtering based on that metadata during the retrieval step.
```python
!mkdir -p 'data/git_commits/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'
```

```python
import csv

with open("data/git_commits/commit_history.csv", "r") as f:
    commits = list(csv.DictReader(f))

print(commits[0])
print(len(commits))
```

```
{'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16', 'author': 'Lakshmi Narayanan Sreethar<lakshmi@timescale.com>', 'date': 'Tue Sep 5 21:03:21 2023 +0530', 'change summary': 'Fix segfault in set_integer_now_func', 'change details': 'When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037 '}
4167
```

## Add nodes with custom metadata
```python
# Create TextNode for each of the first 100 commits
from llama_index.core.schema import TextNode
from datetime import datetime
import re

nodes = []
dates = set()
authors = set()
for commit in commits[:100]:
    author_email = commit["author"].split("<")[1][:-1]
    commit_date = datetime.strptime(
        commit["date"], "%a %b %d %H:%M:%S %Y %z"
    ).strftime("%Y-%m-%d")
    commit_text = commit["change summary"]
    if commit["change details"]:
        commit_text += "\n\n" + commit["change details"]
    fixes = re.findall(r"#(\d+)", commit_text, re.IGNORECASE)
    nodes.append(
        TextNode(
            text=commit_text,
            metadata={
                "commit_date": commit_date,
                "author": author_email,
                "fixes": fixes,
            },
        )
    )
    dates.add(commit_date)
    authors.add(author_email)

print(nodes[0])
print(min(dates), "to", max(dates))
print(authors)
```

```
Node ID: 9c2c2f17-d763-4ce8-bb02-83cb176008e4
Text: Fix segfault in set_integer_now_func When an invalid function
oid is passed to set_integer_now_func, it finds out that the function
oid is invalid but before throwing the error, it calls ReleaseSysCache
on an invalid tuple causing a segfault. Fixed that by removing the
invalid call to ReleaseSysCache. Fixes #6037
2023-03-22 to 2023-09-05
{'konstantina@timescale.com', 'nikhil@timescale.com', 'satish.8483@gmail.com', 'mats@timescale.com', 'fabriziomello@gmail.com', 'erik@timescale.com', 'sven@timescale.com', 'lakshmi@timescale.com', 'dmitry@timescale.com', 'engel@sero-systems.de', 'rafia.sabih@gmail.com', '36882414+akuzm@users.noreply.github.com', 'jguthrie@timescale.com', 'jan@timescale.com', 'me@noctarius.com'}
```

```python
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="metadata_filter_demo3",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
```
```python
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
index.insert_nodes(nodes)
```

```python
print(index.as_query_engine().query("How did Lakshmi fix the segfault?"))
```

```
Lakshmi fixed the segfault by removing the invalid call to ReleaseSysCache that was causing the issue.
```

Now we can filter by commit author or by date when retrieving nodes.
```python
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="author", value="mats@timescale.com"),
        MetadataFilter(key="author", value="sven@timescale.com"),
    ],
    condition="or",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-27', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-07-13', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-30', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-07-25', 'author': 'mats@timescale.com', 'fixes': ['5892']}
{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}
```

```python
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-15", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-25", operator="<="),
    ],
    condition="and",
)
```
```python
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-23', 'author': 'erik@timescale.com', 'fixes': []}
{'commit_date': '2023-08-17', 'author': 'konstantina@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-24', 'author': 'lakshmi@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-20', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}
```

In the above examples, we combined multiple filters using AND or OR. We can also combine multiple sets of filters.
For example, in SQL:

```sql
WHERE (commit_date >= '2023-08-01' AND commit_date <= '2023-08-15') AND (author = 'mats@timescale.com' OR author = 'sven@timescale.com')
```

```python
filters = MetadataFilters(
    filters=[
        MetadataFilters(
            filters=[
                MetadataFilter(
                    key="commit_date", value="2023-08-01", operator=">="
                ),
                MetadataFilter(
                    key="commit_date", value="2023-08-15", operator="<="
                ),
            ],
            condition="and",
        ),
        MetadataFilters(
            filters=[
                MetadataFilter(key="author", value="mats@timescale.com"),
                MetadataFilter(key="author", value="sven@timescale.com"),
            ],
            condition="or",
        ),
    ],
    condition="and",
)
```
```python
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}
```

The above can be simplified by using the IN operator. `PGVectorStore` supports `in`, `nin`, and `contains` for comparing an element with a list.
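As a plain-Python analogy (illustrative only — `matches` is a hypothetical helper, not part of the PGVectorStore API), the three operators compare a metadata value against a list as follows:

```python
# Illustrative semantics of the "in", "nin", and "contains" operators,
# shown on plain metadata dicts (hypothetical helper, not PGVectorStore API):


def matches(metadata, key, value, operator):
    if operator == "in":  # scalar field equals one of the given values
        return metadata[key] in value
    if operator == "nin":  # scalar field equals none of the given values
        return metadata[key] not in value
    if operator == "contains":  # list field contains the given element
        return value in metadata[key]
    raise ValueError(f"unsupported operator: {operator}")


node_meta = {"author": "mats@timescale.com", "fixes": ["5680", "5923"]}

print(matches(node_meta, "author", ["mats@timescale.com", "sven@timescale.com"], "in"))  # True
print(matches(node_meta, "author", ["mats@timescale.com"], "nin"))  # False
print(matches(node_meta, "fixes", "5680", "contains"))  # True
```

Note the asymmetry: `in`/`nin` take a list of candidate values for a scalar field, while `contains` takes a single element to look for inside a list-valued field.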
```python
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["mats@timescale.com", "sven@timescale.com"],
            operator="in",
        ),
    ],
    condition="and",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}
```

```python
# Same thing, with NOT IN
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["mats@timescale.com", "sven@timescale.com"],
            operator="nin",
        ),
    ],
    condition="and",
)
```
```python
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-09', 'author': 'me@noctarius.com', 'fixes': ['5805']}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-11', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': []}
{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': ['5908']}
{'commit_date': '2023-08-01', 'author': 'nikhil@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'konstantina@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
```

```python
# CONTAINS
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="fixes", value="5680", operator="contains"),
    ]
)
```
```python
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("How did these commits fix the issue?")
for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
```

More complex queries, such as joining other tables, can also be constructed. This is achieved by setting the `customize_query_fn` parameter to your own function. First, let's create a user table and populate it.
```python
from sqlalchemy import (
    Table,
    MetaData,
    Column,
    String,
    Integer,
    create_engine,
    insert,
)

engine = create_engine(url=connection_string + "/" + db_name)

metadata = MetaData()

user_table = Table(
    "user",
    metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("name", String, nullable=False),
    Column("email", String, nullable=False),
)

user_table.drop(engine, checkfirst=True)
user_table.create(engine)

with engine.begin() as conn:
    stmt = insert(user_table)
    conn.execute(
        stmt, [{"name": "Konstantina", "email": "konstantina@timescale.com"}]
    )
```

We can then create a query customization function and instantiate the `PGVectorStore` with `customize_query_fn`.
```python
from typing import Any
from sqlalchemy import Select


def customize_query(query: Select, table_class: Any, **kwargs: Any) -> Select:
    # Join the user table on the email addresses and add the name column to the select statement
    return query.add_columns(user_table.c.name).join(
        user_table,
        user_table.c.email == table_class.metadata_["author"].astext,
    )


vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="metadata_filter_demo3",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
    customize_query_fn=customize_query,
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
```

We can then query the vector store and retrieve any extra fields added to the select statement in a dict called `custom_fields` in the node metadata.
```python
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="fixes", value="5680", operator="contains"),
    ]
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("How did these commits fix the issue?")
for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912'], 'custom_fields': {'name': 'Konstantina'}}
```

## PgVector Query Options
### IVFFlat Probes

Specify the number of IVFFlat probes (1 by default).

When retrieving from the index, you can specify an appropriate number of IVFFlat probes (higher is better for recall, lower is better for speed).

```python
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    vector_store_kwargs={"ivfflat_probes": 10},
)
```

### HNSW EF Search
Specify the size of the dynamic candidate list for search (40 by default).

```python
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    vector_store_kwargs={"hnsw_ef_search": 300},
)
```