# Postgres Vector Store

In this notebook we show how to perform vector search in LlamaIndex using PostgreSQL and pgvector.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
```python
%pip install llama-index-vector-stores-postgres
!pip install llama-index
```

Running the following cell will install Postgres with pgvector in Colab.

```python
!sudo apt update
!echo | sudo apt install -y postgresql-common
!echo | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
!echo | sudo apt install postgresql-15-pgvector
!sudo service postgresql start
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'password';"
!sudo -u postgres psql -c "CREATE DATABASE vector_db;"
```

```python
# import logging
# import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
```
```python
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
import textwrap
```

## Setup OpenAI

The first step is to configure the OpenAI key. It will be used to create embeddings for the documents loaded into the index.
```python
import os

os.environ["OPENAI_API_KEY"] = "sk-..."
```

## Download Data
```python
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
```

Load the documents stored in `data/paul_graham/` using the `SimpleDirectoryReader`.
```python
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
```

```
Document ID: 56e70c8c-0fb7-4250-99be-b953d0185a01
```

## Create the Database

Using an existing Postgres instance running on localhost, create the database we'll be using.
```python
import psycopg2

connection_string = "postgresql://postgres:password@localhost:5432"
db_name = "vector_db"
conn = psycopg2.connect(connection_string)
conn.autocommit = True

with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
```

## Create the index

Here we create an index backed by Postgres using the documents loaded previously. `PGVectorStore` takes a few arguments. The example below constructs a `PGVectorStore` with an HNSW index, using the `vector_cosine_ops` distance method with m = 16, ef_construction = 64, and ef_search = 40.
```python
from sqlalchemy import make_url

url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)
query_engine = index.as_query_engine()
```

```
Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]
Generating embeddings:   0%|          | 0/22 [00:00<?, ?it/s]
```

## Query the index

We can now ask questions using our index.
```python
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
```

```
The author worked on writing essays, programming, building microcomputers, predicting rocket
heights, developing a word processor, and giving talks on starting a startup.
```

```python
response = query_engine.query("What happened in the mid 1980s?")
print(textwrap.fill(str(response), 100))
```

```
AI was in the air in the mid 1980s, and two things that influenced the desire to work on it were a
novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called
Mike, and a PBS documentary that showed Terry Winograd using SHRDLU.
```

## Querying Existing Index

```python
vector_store = PGVectorStore.from_params(
    database="vector_db",
    host="localhost",
    password="password",
    port=5432,
    user="postgres",
    table_name="paul_graham_essay",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
```
```python
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
```

```
The author worked on writing essays, programming, creating microcomputers, developing software,
giving talks, and starting a startup.
```

## Hybrid Search

To enable hybrid search, you need to:

- pass in `hybrid_search=True` when constructing the `PGVectorStore` (and optionally configure `text_search_config` with the desired language)
- pass in `vector_store_query_mode="hybrid"` when constructing the query engine (this config is passed to the retriever under the hood). You can also optionally set `sparse_top_k` to configure how many results to obtain from sparse text search (by default it uses the same value as `similarity_top_k`).
```python
from sqlalchemy import make_url

url = make_url(connection_string)
hybrid_vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="paul_graham_essay_hybrid_search",
    embed_dim=1536,  # openai embedding dimension
    hybrid_search=True,
    text_search_config="english",
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)

storage_context = StorageContext.from_defaults(
    vector_store=hybrid_vector_store
)
hybrid_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
hybrid_query_engine = hybrid_index.as_query_engine(
    vector_store_query_mode="hybrid", sparse_top_k=2
)
hybrid_response = hybrid_query_engine.query(
    "Who does Paul Graham think of with the word schtick"
)
print(hybrid_response)
```

```
Roy Lichtenstein
```

## Improving hybrid search with QueryFusionRetriever
Because text search and vector search scores are calculated differently, the nodes that were found only by text search will have much lower scores.

You can often improve hybrid search performance by using the `QueryFusionRetriever`, which makes better use of the mutual information to rank the nodes.
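As an aside, the "relative score" fusion idea can be sketched in plain Python. This is an illustrative re-implementation, not LlamaIndex's actual code: min-max normalize each retriever's scores so dense and sparse results become comparable, then merge nodes by their best normalized score.

```python
# Illustrative sketch of relative-score fusion (not LlamaIndex's implementation):
# min-max normalize each result list's scores, then merge nodes by best score.


def relative_score_fusion(result_lists, top_k=5):
    merged = {}
    for results in result_lists:
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        for node_id, score in results:
            # Normalize into [0, 1]; if all scores are equal, fall back to 1.0
            norm = (score - lo) / (hi - lo) if hi > lo else 1.0
            merged[node_id] = max(merged.get(node_id, 0.0), norm)
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Dense scores (e.g. cosine similarity) and sparse scores (e.g. ts_rank) live
# on different scales, but normalization makes them comparable:
dense = [("a", 0.92), ("b", 0.85), ("c", 0.80)]
sparse = [("c", 13.1), ("d", 2.7)]
print(relative_score_fusion([dense, sparse], top_k=3))
```

Node `c` is only middling in the dense list but tops the sparse list, so after normalization it ranks alongside `a` instead of being drowned out by the raw score gap.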
```python
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

vector_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="default",
    similarity_top_k=5,
)
text_retriever = hybrid_index.as_retriever(
    vector_store_query_mode="sparse",
    similarity_top_k=5,  # interchangeable with sparse_top_k in this context
)
retriever = QueryFusionRetriever(
    [vector_retriever, text_retriever],
    similarity_top_k=5,
    num_queries=1,  # set this to 1 to disable query generation
    mode="relative_score",
    use_async=False,
)

response_synthesizer = CompactAndRefine()
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

response = query_engine.query(
    "Who does Paul Graham think of with the word schtick, and why?"
)
print(response)
```

```
Paul Graham thinks of Roy Lichtenstein when using the word "schtick" because Lichtenstein's distinctive signature style in his paintings immediately identifies his work as his own.
```

## Metadata Filters

`PGVectorStore` supports storing metadata in nodes, and filtering based on that metadata during the retrieval step.
```python
!mkdir -p 'data/git_commits/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/csv/commit_history.csv' -O 'data/git_commits/commit_history.csv'
```

```python
import csv

with open("data/git_commits/commit_history.csv", "r") as f:
    commits = list(csv.DictReader(f))

print(commits[0])
print(len(commits))
```

```
{'commit': '44e41c12ab25e36c202f58e068ced262eadc8d16', 'author': 'Lakshmi Narayanan Sreethar<lakshmi@timescale.com>', 'date': 'Tue Sep 5 21:03:21 2023 +0530', 'change summary': 'Fix segfault in set_integer_now_func', 'change details': 'When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037 '}
4167
```

## Add nodes with custom metadata
```python
# Create TextNode for each of the first 100 commits
from llama_index.core.schema import TextNode
from datetime import datetime
import re

nodes = []
dates = set()
authors = set()
for commit in commits[:100]:
    author_email = commit["author"].split("<")[1][:-1]
    commit_date = datetime.strptime(
        commit["date"], "%a %b %d %H:%M:%S %Y %z"
    ).strftime("%Y-%m-%d")
    commit_text = commit["change summary"]
    if commit["change details"]:
        commit_text += "\n\n" + commit["change details"]
    fixes = re.findall(r"#(\d+)", commit_text, re.IGNORECASE)
    nodes.append(
        TextNode(
            text=commit_text,
            metadata={
                "commit_date": commit_date,
                "author": author_email,
                "fixes": fixes,
            },
        )
    )
    dates.add(commit_date)
    authors.add(author_email)

print(nodes[0])
print(min(dates), "to", max(dates))
print(authors)
```

```
Node ID: 9c2c2f17-d763-4ce8-bb02-83cb176008e4
Text: Fix segfault in set_integer_now_func When an invalid function
oid is passed to set_integer_now_func, it finds out that the function
oid is invalid but before throwing the error, it calls ReleaseSysCache
on an invalid tuple causing a segfault. Fixed that by removing the
invalid call to ReleaseSysCache. Fixes #6037
2023-03-22 to 2023-09-05
{'konstantina@timescale.com', 'nikhil@timescale.com', 'satish.8483@gmail.com', 'mats@timescale.com', 'fabriziomello@gmail.com', 'erik@timescale.com', 'sven@timescale.com', 'lakshmi@timescale.com', 'dmitry@timescale.com', 'engel@sero-systems.de', 'rafia.sabih@gmail.com', '36882414+akuzm@users.noreply.github.com', 'jguthrie@timescale.com', 'jan@timescale.com', 'me@noctarius.com'}
```

```python
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="metadata_filter_demo3",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
)
```
```python
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
index.insert_nodes(nodes)
```

```python
print(index.as_query_engine().query("How did Lakshmi fix the segfault?"))
```

```
Lakshmi fixed the segfault by removing the invalid call to ReleaseSysCache that was causing the issue.
```

Now we can filter by commit author or by date when retrieving nodes.
```python
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="author", value="mats@timescale.com"),
        MetadataFilter(key="author", value="sven@timescale.com"),
    ],
    condition="or",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-27', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-07-13', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-30', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-07-25', 'author': 'mats@timescale.com', 'fixes': ['5892']}
{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}
```

```python
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-15", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-25", operator="<="),
    ],
    condition="and",
)
```
```python
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-23', 'author': 'erik@timescale.com', 'fixes': []}
{'commit_date': '2023-08-17', 'author': 'konstantina@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-24', 'author': 'lakshmi@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-23', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-20', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-21', 'author': 'sven@timescale.com', 'fixes': []}
```

In the above examples, we combined multiple filters using AND or OR. We can also combine multiple sets of filters.
For example, in SQL:

```sql
WHERE (commit_date >= '2023-08-01' AND commit_date <= '2023-08-15') AND (author = 'mats@timescale.com' OR author = 'sven@timescale.com')
```

```python
filters = MetadataFilters(
    filters=[
        MetadataFilters(
            filters=[
                MetadataFilter(
                    key="commit_date", value="2023-08-01", operator=">="
                ),
                MetadataFilter(
                    key="commit_date", value="2023-08-15", operator="<="
                ),
            ],
            condition="and",
        ),
        MetadataFilters(
            filters=[
                MetadataFilter(key="author", value="mats@timescale.com"),
                MetadataFilter(key="author", value="sven@timescale.com"),
            ],
            condition="or",
        ),
    ],
    condition="and",
)
```
```python
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}
```

The above can be simplified by using the IN operator. `PGVectorStore` supports `in`, `nin`, and `contains` for comparing an element with a list.
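As a plain-Python analogy (illustrative only — `matches` is a hypothetical helper, not part of the PGVectorStore API), the three operators compare a metadata value against a list as follows:

```python
# Illustrative semantics of the "in", "nin", and "contains" operators,
# shown on plain metadata dicts (hypothetical helper, not PGVectorStore API):


def matches(metadata, key, value, operator):
    if operator == "in":  # scalar field equals one of the given values
        return metadata[key] in value
    if operator == "nin":  # scalar field equals none of the given values
        return metadata[key] not in value
    if operator == "contains":  # list field contains the given element
        return value in metadata[key]
    raise ValueError(f"unsupported operator: {operator}")


node_meta = {"author": "mats@timescale.com", "fixes": ["5680", "5923"]}

print(matches(node_meta, "author", ["mats@timescale.com", "sven@timescale.com"], "in"))  # True
print(matches(node_meta, "author", ["mats@timescale.com"], "nin"))  # False
print(matches(node_meta, "fixes", "5680", "contains"))  # True
```

Note the asymmetry: `in`/`nin` take a list of candidate values for a scalar field, while `contains` takes a single element to look for inside a list-valued field.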
```python
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["mats@timescale.com", "sven@timescale.com"],
            operator="in",
        ),
    ],
    condition="and",
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-07', 'author': 'mats@timescale.com', 'fixes': []}
{'commit_date': '2023-08-07', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': 'sven@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'mats@timescale.com', 'fixes': []}
```

```python
# Same thing, with NOT IN
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-01", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-15", operator="<="),
        MetadataFilter(
            key="author",
            value=["mats@timescale.com", "sven@timescale.com"],
            operator="nin",
        ),
    ],
    condition="and",
)
```
```python
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-09', 'author': 'me@noctarius.com', 'fixes': ['5805']}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-15', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-11', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': []}
{'commit_date': '2023-08-03', 'author': 'dmitry@timescale.com', 'fixes': ['5908']}
{'commit_date': '2023-08-01', 'author': 'nikhil@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': 'konstantina@timescale.com', 'fixes': []}
{'commit_date': '2023-08-10', 'author': '36882414+akuzm@users.noreply.github.com', 'fixes': []}
```

```python
# CONTAINS
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="fixes", value="5680", operator="contains"),
    ]
)
```
```python
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("How did these commits fix the issue?")
for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912']}
```

More complex queries, such as joining other tables, can also be constructed. This is achieved by setting the `customize_query_fn` parameter to your own function. First, let's create a user table and populate it.
```python
from sqlalchemy import (
    Table,
    MetaData,
    Column,
    String,
    Integer,
    create_engine,
    insert,
)

engine = create_engine(url=connection_string + "/" + db_name)

metadata = MetaData()

user_table = Table(
    "user",
    metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("name", String, nullable=False),
    Column("email", String, nullable=False),
)

user_table.drop(engine, checkfirst=True)
user_table.create(engine)

with engine.begin() as conn:
    stmt = insert(user_table)
    conn.execute(
        stmt, [{"name": "Konstantina", "email": "konstantina@timescale.com"}]
    )
```

We can then create a query customization function and instantiate the `PGVectorStore` with `customize_query_fn`.
```python
from typing import Any
from sqlalchemy import Select


def customize_query(query: Select, table_class: Any, **kwargs: Any) -> Select:
    # Join the user table on the email addresses and add the name column to the select statement
    return query.add_columns(user_table.c.name).join(
        user_table,
        user_table.c.email == table_class.metadata_["author"].astext,
    )


vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name="metadata_filter_demo3",
    embed_dim=1536,  # openai embedding dimension
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    },
    customize_query_fn=customize_query,
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
```

We can then query the vector store and retrieve any extra fields added to the select statement in a dict called `custom_fields` in the node metadata.
```python
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="fixes", value="5680", operator="contains"),
    ]
)

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("How did these commits fix the issue?")
for node in retrieved_nodes:
    print(node.node.metadata)
```

```
{'commit_date': '2023-08-09', 'author': 'konstantina@timescale.com', 'fixes': ['5923', '5680', '5774', '5786', '5906', '5912'], 'custom_fields': {'name': 'Konstantina'}}
```

## PgVector Query Options
### IVFFlat Probes

Specify the number of IVFFlat probes (1 by default).

When retrieving from the index, you can specify an appropriate number of IVFFlat probes (higher is better for recall, lower is better for speed).

```python
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    vector_store_kwargs={"ivfflat_probes": 10},
)
```

### HNSW EF Search
Specify the size of the dynamic candidate list for search (40 by default).

```python
retriever = index.as_retriever(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    vector_store_kwargs={"hnsw_ef_search": 300},
)
```