
Google Cloud SQL for PostgreSQL - `PostgresVectorStore`

Cloud SQL is a fully managed relational database service that offers high performance, seamless integration, and impressive scalability. It supports MySQL, PostgreSQL, and SQL Server database engines. With Cloud SQL's LlamaIndex integration, you can extend your database applications to build AI-powered experiences.

This notebook goes over how to use Cloud SQL for PostgreSQL to store vector embeddings with the PostgresVectorStore class.

Learn more about the package on GitHub.


To run this notebook, you will need to do the following:

Install the integration library, llama-index-cloud-sql-pg, and the library for the embedding service, llama-index-embeddings-vertex.

%pip install --upgrade --quiet llama-index-cloud-sql-pg llama-index-embeddings-vertex llama-index-llms-vertex llama-index

Colab only: Uncomment the following cell to restart the kernel or use the button to restart the kernel. For Vertex AI Workbench you can restart the terminal using the button on top.

# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython
# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud project.

  • If you are using Colab to run this notebook, use the cell below and continue.
  • If you are using Vertex AI Workbench, check out the setup instructions here.

from google.colab import auth

auth.authenticate_user()

Set your Google Cloud project to leverage Google Cloud resources within this notebook.

If you don't know your project ID, try the following:

# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.
PROJECT_ID = "my-project-id" # @param {type:"string"}
# Set the project id
!gcloud config set project {PROJECT_ID}

Find your database values in the Cloud SQL Instances page.

# @title Set Your Values Here { display-mode: "form" }
REGION = "us-central1" # @param {type: "string"}
INSTANCE = "my-primary" # @param {type: "string"}
DATABASE = "my-database" # @param {type: "string"}
TABLE_NAME = "vector_store" # @param {type: "string"}
USER = "postgres" # @param {type: "string"}
PASSWORD = "my-password" # @param {type: "string"}

One of the requirements and arguments to establish Cloud SQL as a vector store is a PostgresEngine object. The PostgresEngine configures a connection pool to your Cloud SQL database, enabling successful connections from your application and following industry best practices.

To create a PostgresEngine using PostgresEngine.from_instance(), you need to provide only 4 things:

  1. project_id : Project ID of the Google Cloud project where the Cloud SQL instance is located.
  2. region : Region where the Cloud SQL instance is located.
  3. instance : The name of the Cloud SQL instance.
  4. database : The name of the database to connect to on the Cloud SQL instance.

By default, IAM database authentication is used as the method of database authentication. This library uses the IAM principal belonging to the Application Default Credentials (ADC) sourced from the environment.

For more information on IAM database authentication, see:
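With IAM database authentication, the user and password arguments are simply omitted; a sketch, using the same connection values defined above:

```python
from llama_index_cloud_sql_pg import PostgresEngine

# IAM database authentication (the default): credentials come from
# Application Default Credentials (ADC); no user/password is passed.
iam_engine = await PostgresEngine.afrom_instance(
    project_id=PROJECT_ID,
    region=REGION,
    instance=INSTANCE,
    database=DATABASE,
)
```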

Optionally, built-in database authentication using a username and password to access the Cloud SQL database can also be used. Just provide the optional user and password arguments to PostgresEngine.from_instance():

  • user : Database user to use for built-in database authentication and login.
  • password : Database password to use for built-in database authentication and login.

Note: This tutorial demonstrates the async interface. All async methods have corresponding sync methods.

from llama_index_cloud_sql_pg import PostgresEngine

engine = await PostgresEngine.afrom_instance(
    project_id=PROJECT_ID,
    region=REGION,
    instance=INSTANCE,
    database=DATABASE,
    user=USER,
    password=PASSWORD,
)
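Per the note above, each async method has a sync counterpart; as a sketch, the same engine can be created synchronously by dropping the a prefix:

```python
from llama_index_cloud_sql_pg import PostgresEngine

# Sync counterpart of afrom_instance(); same parameters as above.
engine = PostgresEngine.from_instance(
    project_id=PROJECT_ID,
    region=REGION,
    instance=INSTANCE,
    database=DATABASE,
    user=USER,
    password=PASSWORD,
)
```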

The PostgresVectorStore class requires a database table. The PostgresEngine engine has a helper method init_vector_store_table() that can be used to create a table with the proper schema for you.

await engine.ainit_vector_store_table(
    table_name=TABLE_NAME,
    vector_size=768,  # Vector size for VertexAI model (textembedding-gecko@latest)
)
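One constraint worth keeping in mind: vector_size must equal the dimensionality of the embeddings you plan to store. A minimal local sanity check (the values here are hypothetical stand-ins, not API calls):

```python
# The table's vector_size must match the embedding model's output dimension.
VECTOR_SIZE = 768  # same value passed to ainit_vector_store_table()

fake_embedding = [0.0] * 768  # stand-in for one textembedding-gecko output
assert len(fake_embedding) == VECTOR_SIZE
```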

You can also specify a schema name by passing schema_name wherever you pass table_name.

SCHEMA_NAME = "my_schema"

await engine.ainit_vector_store_table(
    table_name=TABLE_NAME,
    schema_name=SCHEMA_NAME,
    vector_size=768,
)

You can use any LlamaIndex embeddings model. You may need to enable the Vertex AI API to use VertexTextEmbedding. We recommend pinning the embedding model's version for production; learn more about the Text embeddings models.

# enable Vertex AI API
!gcloud services enable aiplatform.googleapis.com
from llama_index.core import Settings
from llama_index.embeddings.vertex import VertexTextEmbedding
from llama_index.llms.vertex import Vertex
import google.auth

credentials, project_id = google.auth.default()

Settings.embed_model = VertexTextEmbedding(
    model_name="textembedding-gecko@003",
    project=PROJECT_ID,
    credentials=credentials,
)
Settings.llm = Vertex(model="gemini-1.5-flash-002", project=PROJECT_ID)

Initialize a default PostgresVectorStore

from llama_index_cloud_sql_pg import PostgresVectorStore

vector_store = await PostgresVectorStore.create(
    engine=engine,
    table_name=TABLE_NAME,
    # schema_name=SCHEMA_NAME,
)
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)

Use with VectorStoreIndex

Create an index from the vector store by using VectorStoreIndex.

The simplest way to use a Vector Store is to load a set of documents and build an index from them using from_documents.

from llama_index.core import StorageContext, VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(response)

A vector store can take advantage of relational data to filter similarity searches.

Create a new table with custom metadata columns. You can also re-use an existing table that already has custom columns for document IDs, content, embeddings, and/or metadata.

from llama_index_cloud_sql_pg import Column

# Set table name
TABLE_NAME = "vectorstore_custom"
# SCHEMA_NAME = "my_schema"

await engine.ainit_vector_store_table(
    table_name=TABLE_NAME,
    # schema_name=SCHEMA_NAME,
    vector_size=768,  # VertexAI model: textembedding-gecko@003
    metadata_columns=[Column("len", "INTEGER")],
)

# Initialize PostgresVectorStore
custom_store = await PostgresVectorStore.create(
    engine=engine,
    table_name=TABLE_NAME,
    # schema_name=SCHEMA_NAME,
    metadata_columns=["len"],
)

Document metadata can provide the LLM and retrieval process with more information. Learn more about different approaches for extracting and adding metadata.

from llama_index.core import Document

fruits = ["apple", "pear", "orange", "strawberry", "banana", "kiwi"]
documents = [
    Document(text=fruit, metadata={"len": len(fruit)}) for fruit in fruits
]

storage_context = StorageContext.from_defaults(vector_store=custom_store)
custom_doc_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)

You can apply pre-filtering to the search results by specifying a filters argument.

from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="len", operator=FilterOperator.GT, value="5"),
    ],
)

query_engine = custom_doc_index.as_query_engine(filters=filters)
res = query_engine.query("List some fruits")
print(str(res.source_nodes[0].text))
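As a sanity check of what that pre-filter should match, this pure-Python sketch applies the same strict len > 5 predicate to the fruit list used above:

```python
# FilterOperator.GT is a strict greater-than, so 5-letter "apple" is excluded.
fruits = ["apple", "pear", "orange", "strawberry", "banana", "kiwi"]
matches = [f for f in fruits if len(f) > 5]
print(matches)  # ['orange', 'strawberry', 'banana']
```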

Speed up vector search queries by applying a vector index. Learn more about vector indexes.

from llama_index_cloud_sql_pg.indexes import IVFFlatIndex

index = IVFFlatIndex()
await vector_store.aapply_vector_index(index)  # Create the index
await vector_store.areindex()  # Re-index using default index name
await vector_store.adrop_vector_index()  # Delete index using default name
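Other index types may be exposed by the same module. Assuming your version of the package also provides an HNSWIndex (check llama_index_cloud_sql_pg.indexes for what is available), tuning parameters can be passed at construction; the parameter names below follow common pgvector conventions and are an assumption:

```python
from llama_index_cloud_sql_pg.indexes import HNSWIndex  # assumed to exist

# Hypothetical HNSW index with pgvector-style tuning parameters.
hnsw = HNSWIndex(m=16, ef_construction=64)
await vector_store.aapply_vector_index(hnsw)
```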