
Google Cloud SQL for PostgreSQL - `PostgresVectorStore`

Cloud SQL is a fully managed relational database service that offers high performance, seamless integration, and impressive scalability. It supports MySQL, PostgreSQL, and SQL Server database engines. With Cloud SQL's LlamaIndex integration, you can extend your database applications to build AI-powered experiences.

This notebook goes over how to use Cloud SQL for PostgreSQL to store vector embeddings with the PostgresVectorStore class.

Learn more about the package on GitHub.


To run this notebook, you will need to do the following:

Install the integration library, llama-index-cloud-sql-pg, and the library for the embedding service, llama-index-embeddings-vertex.

%pip install --upgrade --quiet llama-index-cloud-sql-pg llama-index-embeddings-vertex llama-index-llms-vertex llama-index

Colab only: Uncomment the following cell to restart the kernel or use the button to restart the kernel. For Vertex AI Workbench you can restart the terminal using the button on top.

# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython
# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud project.

  • If you are using Colab to run this notebook, use the cell below and continue.
  • If you are using Vertex AI Workbench, check out the setup instructions here.

from google.colab import auth

auth.authenticate_user()

Set your Google Cloud project to leverage Google Cloud resources within this notebook.

If you don't know your project ID, try the following:

# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.
PROJECT_ID = "my-project-id" # @param {type:"string"}
# Set the project id
!gcloud config set project {PROJECT_ID}

Find your database values in the Cloud SQL Instances page.

# @title Set Your Values Here { display-mode: "form" }
REGION = "us-central1" # @param {type: "string"}
INSTANCE = "my-primary" # @param {type: "string"}
DATABASE = "my-database" # @param {type: "string"}
TABLE_NAME = "vector_store" # @param {type: "string"}
USER = "postgres" # @param {type: "string"}
PASSWORD = "my-password" # @param {type: "string"}

One of the requirements and arguments to establish Cloud SQL as a vector store is a PostgresEngine object. The PostgresEngine configures a connection pool to your Cloud SQL database, enabling successful connections from your application and following industry best practices.

To create a PostgresEngine using PostgresEngine.from_instance(), you need to provide only 4 things:

  1. project_id : Project ID of the Google Cloud project where the Cloud SQL instance is located.
  2. region : Region where the Cloud SQL instance is located.
  3. instance : The name of the Cloud SQL instance.
  4. database : The name of the database to connect to on the Cloud SQL instance.

By default, IAM database authentication is used as the method of database authentication. This library uses the IAM principal belonging to the Application Default Credentials (ADC) sourced from the environment.

For more information on IAM database authentication, see:
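With IAM database authentication, the user and password arguments are simply omitted; a sketch, using the same connection values defined above:

```python
from llama_index_cloud_sql_pg import PostgresEngine

# IAM database authentication (the default): credentials come from
# Application Default Credentials (ADC); no user/password is passed.
iam_engine = await PostgresEngine.afrom_instance(
    project_id=PROJECT_ID,
    region=REGION,
    instance=INSTANCE,
    database=DATABASE,
)
```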

Optionally, built-in database authentication using a username and password to access the Cloud SQL database can also be used. Just provide the optional user and password arguments to PostgresEngine.from_instance():

  • user : Database user to use for built-in database authentication and login.
  • password : Database password to use for built-in database authentication and login.

Note: This tutorial demonstrates the async interface. All async methods have corresponding sync methods.

from llama_index_cloud_sql_pg import PostgresEngine

engine = await PostgresEngine.afrom_instance(
    project_id=PROJECT_ID,
    region=REGION,
    instance=INSTANCE,
    database=DATABASE,
    user=USER,
    password=PASSWORD,
)
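Per the note above, each async method has a sync counterpart; as a sketch, the same engine can be created synchronously by dropping the a prefix:

```python
from llama_index_cloud_sql_pg import PostgresEngine

# Sync counterpart of afrom_instance(); same parameters as above.
engine = PostgresEngine.from_instance(
    project_id=PROJECT_ID,
    region=REGION,
    instance=INSTANCE,
    database=DATABASE,
    user=USER,
    password=PASSWORD,
)
```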

The PostgresVectorStore class requires a database table. The PostgresEngine engine has a helper method init_vector_store_table() that can be used to create a table with the proper schema for you.

await engine.ainit_vector_store_table(
    table_name=TABLE_NAME,
    vector_size=768,  # Vector size for VertexAI model (textembedding-gecko@latest)
)
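One constraint worth keeping in mind: vector_size must equal the dimensionality of the embeddings you plan to store. A minimal local sanity check (the values here are hypothetical stand-ins, not API calls):

```python
# The table's vector_size must match the embedding model's output dimension.
VECTOR_SIZE = 768  # same value passed to ainit_vector_store_table()

fake_embedding = [0.0] * 768  # stand-in for one textembedding-gecko output
assert len(fake_embedding) == VECTOR_SIZE
```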

You can also specify a schema name by passing schema_name wherever you pass table_name.

SCHEMA_NAME = "my_schema"

await engine.ainit_vector_store_table(
    table_name=TABLE_NAME,
    schema_name=SCHEMA_NAME,
    vector_size=768,
)

You can use any LlamaIndex embeddings model. You may need to enable the Vertex AI API to use VertexTextEmbedding. We recommend pinning the embedding model's version for production; learn more about the Text embeddings models.

# enable Vertex AI API
!gcloud services enable aiplatform.googleapis.com
from llama_index.core import Settings
from llama_index.embeddings.vertex import VertexTextEmbedding
from llama_index.llms.vertex import Vertex
import google.auth

credentials, project_id = google.auth.default()

Settings.embed_model = VertexTextEmbedding(
    model_name="textembedding-gecko@003",
    project=PROJECT_ID,
    credentials=credentials,
)
Settings.llm = Vertex(model="gemini-1.5-flash-002", project=PROJECT_ID)

Initialize a default PostgresVectorStore

from llama_index_cloud_sql_pg import PostgresVectorStore

vector_store = await PostgresVectorStore.create(
    engine=engine,
    table_name=TABLE_NAME,
    # schema_name=SCHEMA_NAME,
)
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)

Use with VectorStoreIndex

Create an index from the vector store by using VectorStoreIndex.

The simplest way to use a Vector Store is to load a set of documents and build an index from them using from_documents.

from llama_index.core import StorageContext, VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(response)

A vector store can take advantage of relational data to filter similarity searches.

Create a new table with custom metadata columns. You can also re-use an existing table that already has custom columns for document IDs, content, embeddings, and/or metadata.

from llama_index_cloud_sql_pg import Column

# Set table name
TABLE_NAME = "vectorstore_custom"
# SCHEMA_NAME = "my_schema"

await engine.ainit_vector_store_table(
    table_name=TABLE_NAME,
    # schema_name=SCHEMA_NAME,
    vector_size=768,  # VertexAI model: textembedding-gecko@003
    metadata_columns=[Column("len", "INTEGER")],
)

# Initialize PostgresVectorStore
custom_store = await PostgresVectorStore.create(
    engine=engine,
    table_name=TABLE_NAME,
    # schema_name=SCHEMA_NAME,
    metadata_columns=["len"],
)

Document metadata can provide the LLM and retrieval process with more information. Learn more about different approaches for extracting and adding metadata.

from llama_index.core import Document

fruits = ["apple", "pear", "orange", "strawberry", "banana", "kiwi"]
documents = [
    Document(text=fruit, metadata={"len": len(fruit)}) for fruit in fruits
]

storage_context = StorageContext.from_defaults(vector_store=custom_store)
custom_doc_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)

You can apply pre-filtering to the search results by specifying a filters argument.

from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="len", operator=FilterOperator.GT, value="5"),
    ],
)

query_engine = custom_doc_index.as_query_engine(filters=filters)
res = query_engine.query("List some fruits")
print(str(res.source_nodes[0].text))
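As a sanity check of what that pre-filter should match, this pure-Python sketch applies the same strict len > 5 predicate to the fruit list used above:

```python
# FilterOperator.GT is a strict greater-than, so 5-letter "apple" is excluded.
fruits = ["apple", "pear", "orange", "strawberry", "banana", "kiwi"]
matches = [f for f in fruits if len(f) > 5]
print(matches)  # ['orange', 'strawberry', 'banana']
```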

Speed up vector search queries by applying a vector index. Learn more about vector indexes.

from llama_index_cloud_sql_pg.indexes import IVFFlatIndex

index = IVFFlatIndex()
await vector_store.aapply_vector_index(index)  # Create the index
await vector_store.areindex()  # Re-index using default index name
await vector_store.adrop_vector_index()  # Delete index using default name
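Other index types may be exposed by the same module. Assuming your version of the package also provides an HNSWIndex (check llama_index_cloud_sql_pg.indexes for what is available), tuning parameters can be passed at construction; the parameter names below follow common pgvector conventions and are an assumption:

```python
from llama_index_cloud_sql_pg.indexes import HNSWIndex  # assumed to exist

# Hypothetical HNSW index with pgvector-style tuning parameters.
hnsw = HNSWIndex(m=16, ef_construction=64)
await vector_store.aapply_vector_index(hnsw)
```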