agentchat.contrib.vectordb.pgvectordb

Collection

class Collection()

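A Collection object for PGVector.

Attributes:

client - The PGVector client.
collection_name str - The name of the collection. The constructor signature defaults to "autogen-docs".
embedding_function Callable - The embedding function used to generate the vector representation. Default is None, in which case SentenceTransformer("all-MiniLM-L6-v2").encode is used. Models can be chosen from: https://huggingface.co/models?library=sentence-transformers
metadata Optional[dict] - The metadata of the collection.
get_or_create Optional - A flag indicating whether to get or create the collection.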
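To illustrate the shape an embedding_function is expected to have, here is a minimal sketch. It assumes the callable receives a list of strings and returns one fixed-length vector per string, the way SentenceTransformer.encode does. The name toy_embedding and its hashing scheme are hypothetical stand-ins, not part of the library, and the vectors carry no semantic meaning.

```python
import hashlib
from typing import List


def toy_embedding(texts: List[str], dim: int = 8) -> List[List[float]]:
    """Hypothetical stand-in for an embedding function: one vector per text."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map the first `dim` bytes of the hash to floats in [0, 1].
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors


vectors = toy_embedding(["hello", "world"])
print(len(vectors), len(vectors[0]))
```

A real deployment would pass a model-backed callable instead; the sketch only shows the input and output shapes.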

__init__

def __init__(client=None,
             collection_name: str = "autogen-docs",
             embedding_function: Callable = None,
             metadata=None,
             get_or_create=None)

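Initialize the Collection object.

Arguments:

client - The PostgreSQL client.
collection_name - The name of the collection. Defaults to "autogen-docs".
embedding_function - The embedding function used to generate the vector representation.
metadata - The metadata of the collection.
get_or_create - A flag indicating whether to get or create the collection.

Returns:

None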

add

def add(ids: List[ItemID],
        documents: List,
        embeddings: List = None,
        metadatas: List = None) -> None

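Add documents to the collection.

Arguments:

ids List[ItemID] - A list of document IDs.
documents List - A list of documents.
embeddings List - A list of document embeddings. Optional.
metadatas List - A list of document metadata. Optional.

Returns:

None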

upsert

def upsert(ids: List[ItemID],
           documents: List,
           embeddings: List = None,
           metadatas: List = None) -> None

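Upsert documents into the collection.

Arguments:

ids List[ItemID] - A list of document IDs.
documents List - A list of documents.
embeddings List - A list of document embeddings.
metadatas List - A list of document metadata.

Returns:

None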
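The difference between add and upsert can be sketched with an in-memory dictionary: an upsert inserts a document when its ID is new and overwrites it when the ID already exists. The upsert_into helper and the dictionary store below are hypothetical and only model the semantics; the library itself writes to PostgreSQL.

```python
from typing import Dict, List


def upsert_into(store: Dict[str, str], ids: List[str], documents: List[str]) -> None:
    """Insert each document, or overwrite it when its ID is already present."""
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    for doc_id, document in zip(ids, documents):
        store[doc_id] = document


store = {"1": "old text"}
upsert_into(store, ["1", "2"], ["new text", "another doc"])
print(store)
```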

count

def count() -> int

Get the total number of documents in the collection.

Returns:

int - The total number of documents.

table_exists

def table_exists(table_name: str) -> bool

Check whether a table exists in the PostgreSQL database.

Arguments:

table_name str - The name of the table to check.

Returns:

bool - True if the table exists, False otherwise.

get

def get(ids: Optional[str] = None,
        include: Optional[str] = None,
        where: Optional[str] = None,
        limit: Optional[Union[int, str]] = None,
        offset: Optional[Union[int, str]] = None) -> List[Document]

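Retrieve documents from the collection.

Arguments:

ids Optional[List] - A list of document IDs.
include Optional - The fields to include.
where Optional - Additional filtering criteria.
limit Optional - The maximum number of documents to retrieve.
offset Optional - The offset for pagination.

Returns:

List - The retrieved documents.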
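How limit and offset combine for pagination can be sketched on a plain list. The paginate helper is hypothetical; it assumes offset skips rows first and limit then caps the page size, which mirrors SQL LIMIT/OFFSET. Values are cast with int() because the signature also accepts strings.

```python
from typing import List, Optional, Union


def paginate(rows: List[dict],
             limit: Optional[Union[int, str]] = None,
             offset: Optional[Union[int, str]] = None) -> List[dict]:
    """Return at most `limit` rows, skipping the first `offset` rows."""
    start = int(offset) if offset is not None else 0
    end = None if limit is None else start + int(limit)
    return rows[start:end]


rows = [{"id": str(i)} for i in range(5)]
page = paginate(rows, limit=2, offset=2)
print(page)
```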

update

def update(ids: List, embeddings: List, metadatas: List,
           documents: List) -> None

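Update documents in the collection.

Arguments:

ids List - A list of document IDs.
embeddings List - A list of document embeddings.
metadatas List - A list of document metadata.
documents List - A list of documents.

Returns:

None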

euclidean_distance

@staticmethod
def euclidean_distance(arr1: List[float], arr2: List[float]) -> float

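Calculate the Euclidean distance between two vectors.

Arguments:

arr1 List[float] - The first vector.
arr2 List[float] - The second vector.

Returns:

float - The Euclidean distance between arr1 and arr2.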
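For reference, the textbook definition of Euclidean (L2) distance can be sketched in pure Python. The l2_distance name is hypothetical, and this is not necessarily how the library computes the value internally.

```python
import math
from typing import List


def l2_distance(arr1: List[float], arr2: List[float]) -> float:
    """Square root of the summed squared component differences."""
    if len(arr1) != len(arr2):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(arr1, arr2)))


print(l2_distance([0.0, 0.0], [3.0, 4.0]))
```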

cosine_distance

@staticmethod
def cosine_distance(arr1: List[float], arr2: List[float]) -> float

Calculate the cosine distance between two vectors.

Arguments:

arr1 List[float] - The first vector.
arr2 List[float] - The second vector.

Returns:

float - The cosine distance between arr1 and arr2.
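Cosine distance is commonly defined as one minus the cosine similarity, so identical directions give 0 and orthogonal vectors give 1. The sketch below uses that definition; the cosine_dist name is hypothetical, and the library's own implementation may differ.

```python
import math
from typing import List


def cosine_dist(arr1: List[float], arr2: List[float]) -> float:
    """1 - (a . b) / (|a| * |b|); undefined for zero-length vectors."""
    dot = sum(a * b for a, b in zip(arr1, arr2))
    norm1 = math.sqrt(sum(a * a for a in arr1))
    norm2 = math.sqrt(sum(b * b for b in arr2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm1 * norm2)


print(cosine_dist([1.0, 0.0], [0.0, 1.0]))
```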

inner_product_distance

@staticmethod
def inner_product_distance(arr1: List[float], arr2: List[float]) -> float

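Calculate the inner product distance between two vectors.

Arguments:

arr1 List[float] - The first vector.
arr2 List[float] - The second vector.

Returns:

float - The inner product distance between arr1 and arr2.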
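pgvector's `<#>` operator returns the negative inner product, so that smaller values mean more similar vectors. The sketch below follows that convention; whether this method uses the same sign convention is an assumption, and the neg_inner_product name is hypothetical.

```python
from typing import List


def neg_inner_product(arr1: List[float], arr2: List[float]) -> float:
    """Negative dot product: smaller values mean more similar vectors."""
    if len(arr1) != len(arr2):
        raise ValueError("vectors must have the same length")
    return -sum(a * b for a, b in zip(arr1, arr2))


print(neg_inner_product([1.0, 2.0], [3.0, 4.0]))
```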

query

def query(query_texts: List[str],
          collection_name: Optional[str] = None,
          n_results: Optional[int] = 10,
          distance_type: Optional[str] = "euclidean",
          distance_threshold: Optional[float] = -1,
          include_embedding: Optional[bool] = False) -> QueryResults

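Query documents in the collection.

Arguments:

query_texts List[str] - A list of query texts.
collection_name Optional[str] - The name of the collection.
n_results int - The maximum number of results to return.
distance_type Optional[str] - The distance search type: euclidean or cosine.
distance_threshold Optional[float] - The distance threshold used to limit the search.
include_embedding Optional[bool] - Whether to include the embedding values in the QueryResults.

Returns:

QueryResults - The query results.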
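A distance_type string has to be translated into a pgvector comparison operator at some point. The sketch below shows one way to do that with the operators pgvector documents (`<->` for L2, `<=>` for cosine, `<#>` for negative inner product). The distance_operator helper and the "inner_product" key are hypothetical; this reference only lists euclidean and cosine.

```python
PGVECTOR_OPERATORS = {
    "euclidean": "<->",      # L2 distance
    "cosine": "<=>",         # cosine distance
    "inner_product": "<#>",  # negative inner product (hypothetical key)
}


def distance_operator(distance_type: str = "euclidean") -> str:
    """Look up the pgvector operator for a distance type name."""
    try:
        return PGVECTOR_OPERATORS[distance_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported distance_type: {distance_type!r}")


print(distance_operator("cosine"))
```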

convert_string_to_array

@staticmethod
def convert_string_to_array(array_string: str) -> List[float]

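Convert a string representation of an array to a list of floats.

Arguments:

array_string str - The string representation of the array.

Returns:

list - A list of floats parsed from the input string. If the input is not a string, the input itself is returned.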
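The behaviour described above can be sketched as follows, assuming the string uses pgvector's text form for a vector, such as "[1,2,3]". The parse_vector name is hypothetical and the parsing rules are an illustration, not the library's exact code.

```python
from typing import Any


def parse_vector(array_string: Any) -> Any:
    """Parse "[1,2,3]" into [1.0, 2.0, 3.0]; pass non-strings through."""
    if not isinstance(array_string, str):
        return array_string
    body = array_string.strip().strip("[]")
    if not body.strip():
        return []
    return [float(part) for part in body.split(",")]


print(parse_vector("[1, 2.5, -3]"))
```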

modify

def modify(metadata, collection_name: Optional[str] = None) -> None

Modify the metadata of the collection.

Arguments:

metadata - The new metadata.
collection_name - The name of the collection.

Returns:

None

delete

def delete(ids: List[ItemID], collection_name: Optional[str] = None) -> None

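Delete documents from the collection.

Arguments:

ids List[ItemID] - A list of IDs of the documents to delete.
collection_name str - The name of the collection to delete the documents from.

Returns:

None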

delete_collection

def delete_collection(collection_name: Optional[str] = None) -> None

Delete the entire collection.

Arguments:

collection_name Optional[str] - The name of the collection to delete.

Returns:

None

create_collection

def create_collection(collection_name: Optional[str] = None,
                      dimension: Optional[Union[str, int]] = None) -> None

Create a new collection.

Arguments:

collection_name Optional[str] - The name of the new collection.
dimension Optional[Union[str, int]] - The dimension size of the sentence embedding model.

Returns:

None

PGVectorDB

class PGVectorDB(VectorDB)

A vector database that uses PGVector as the backend.

__init__

def __init__(*,
             conn: Optional[psycopg.Connection] = None,
             connection_string: Optional[str] = None,
             host: Optional[str] = None,
             port: Optional[Union[int, str]] = None,
             dbname: Optional[str] = None,
             username: Optional[str] = None,
             password: Optional[str] = None,
             connect_timeout: Optional[int] = 10,
             embedding_function: Callable = None,
             metadata: Optional[dict] = None) -> None

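Initialize the vector database.

Note: either connection_string or host + port + dbname must be specified.

Arguments:

conn - psycopg.Connection | A custom connection object used to connect to the database. A connection object may include additional key/value pairs: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING
connection_string - "postgresql://username:password@hostname:port/database" | The PGVector connection string. Default is None.
host - str | The host to connect to. Default is None.
port - int | The port to connect to. Default is None.
dbname - str | The name of the database to connect to. Default is None.
username - str | The database username to use. Default is None.
password - str | The database user's password. Default is None.
connect_timeout - int | The timeout for the connection. Default is 10.
embedding_function - Callable | The embedding function used to generate the vector representation. Default is None, in which case SentenceTransformer("all-MiniLM-L6-v2").encode is used. Models can be chosen from: https://huggingface.co/models?library=sentence-transformers
metadata - dict | The metadata of the vector database. Default is None. If None, this setting is used: {"hnsw:space": "ip", "hnsw:construction_ef": 30, "hnsw:M": 16}. An index is created on the table using hnsw (embedding vector_l2_ops) WITH (m = hnsw:M) ef_construction = "hnsw:construction_ef". More information: https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw

Returns:

None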
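The way the hnsw:M and hnsw:construction_ef metadata keys feed into a pgvector index can be sketched by building the CREATE INDEX statement as a string. The hnsw_index_sql helper, the table name, and the embedding column name are hypothetical; the statement follows the HNSW syntax in the pgvector README, and the fallback values mirror the default metadata listed above.

```python
def hnsw_index_sql(table: str, metadata: dict) -> str:
    """Build a pgvector HNSW CREATE INDEX statement from metadata keys."""
    m = int(metadata.get("hnsw:M", 16))
    ef_construction = int(metadata.get("hnsw:construction_ef", 30))
    # Naive quoting for illustration only; real code should compose
    # identifiers safely (for example with psycopg.sql.Identifier).
    return (
        f'CREATE INDEX ON "{table}" USING hnsw (embedding vector_l2_ops) '
        f"WITH (m = {m}, ef_construction = {ef_construction})"
    )


print(hnsw_index_sql("documents", {"hnsw:M": 16, "hnsw:construction_ef": 30}))
```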

establish_connection

def establish_connection(
        conn: Optional[psycopg.Connection] = None,
        connection_string: Optional[str] = None,
        host: Optional[str] = None,
        port: Optional[Union[int, str]] = None,
        dbname: Optional[str] = None,
        username: Optional[str] = None,
        password: Optional[str] = None,
        connect_timeout: Optional[int] = 10) -> psycopg.Connection

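Establish a connection to a PostgreSQL database using psycopg.

Arguments:

conn - An existing psycopg connection object. If provided, this connection is used.
connection_string - A string containing the connection information. If provided, a new connection is established using this string.
host - The hostname of the PostgreSQL server. Used if connection_string is not provided.
port - The port number to connect to on the server host. Used if connection_string is not provided.
dbname - The database name. Used if connection_string is not provided.
username - The username to connect as. Used if connection_string is not provided.
password - The user's password. Used if connection_string is not provided.
connect_timeout - The maximum wait for the connection, in seconds. Default is 10 seconds.

Returns:

A psycopg.Connection object representing the established connection.

Raises:

PermissionError - If no credentials are supplied.
psycopg.Error - If an error occurs while attempting to connect to the database.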
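The precedence between connection_string and the individual host, port, and dbname parameters can be sketched as a small helper that produces a URL in the "postgresql://username:password@hostname:port/database" form. The build_connection_string name is hypothetical, and raising PermissionError when neither form is supplied is an assumption drawn from the Raises section above.

```python
from typing import Optional, Union
from urllib.parse import quote


def build_connection_string(connection_string: Optional[str] = None,
                            host: Optional[str] = None,
                            port: Optional[Union[int, str]] = None,
                            dbname: Optional[str] = None,
                            username: Optional[str] = None,
                            password: Optional[str] = None) -> str:
    """Prefer an explicit connection string; otherwise assemble one."""
    if connection_string:
        return connection_string
    if not (host and port and dbname):
        raise PermissionError("provide connection_string or host + port + dbname")
    auth = ""
    if username:
        # Percent-encode credentials so characters like "@" stay unambiguous.
        auth = quote(username, safe="")
        if password:
            auth += ":" + quote(password, safe="")
        auth += "@"
    return f"postgresql://{auth}{host}:{port}/{dbname}"


print(build_connection_string(host="localhost", port=5432, dbname="postgres",
                              username="user", password="p@ss"))
```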

create_collection

def create_collection(collection_name: str,
                      overwrite: bool = False,
                      get_or_create: bool = True) -> Collection

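Create a collection in the vector database.
Case 1. If the collection does not exist, it is created.
Case 2. If the collection exists and overwrite is True, it is overwritten.
Case 3. If the collection exists and overwrite is False, the collection is returned when get_or_create is True; otherwise a ValueError is raised.

Arguments:

collection_name - str | The name of the collection.
overwrite - bool | Whether to overwrite the collection if it exists. Default is False.
get_or_create - bool | Whether to get the collection if it exists. Default is True.

Returns:

Collection | The collection object.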
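The three cases can be modelled with an in-memory dictionary standing in for the database. The create_collection_sketch function and the dictionary are hypothetical; they only mirror the decision logic described above.

```python
from typing import Dict, List


def create_collection_sketch(collections: Dict[str, List[str]],
                             collection_name: str,
                             overwrite: bool = False,
                             get_or_create: bool = True) -> List[str]:
    """Mirror the create/overwrite/get decision on a plain dict."""
    if collection_name not in collections:
        collections[collection_name] = []   # Case 1: create
    elif overwrite:
        collections[collection_name] = []   # Case 2: overwrite
    elif not get_or_create:
        # Case 3 with get_or_create=False: refuse to reuse the collection.
        raise ValueError(f"collection {collection_name!r} already exists")
    return collections[collection_name]     # Case 3: return the existing one


db = {"docs": ["existing"]}
print(create_collection_sketch(db, "docs"))
print(create_collection_sketch(db, "docs", overwrite=True))
```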

get_collection

def get_collection(collection_name: str = None) -> Collection

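Get the collection from the vector database.

Arguments:

collection_name - str | The name of the collection. Default is None. If None, the current active collection is returned.

Returns:

Collection | The collection object.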

delete_collection

def delete_collection(collection_name: str) -> None

Delete the collection from the vector database.

Arguments:

collection_name - str | The name of the collection.

Returns:

None

insert_docs

def insert_docs(docs: List[Document],
                collection_name: str = None,
                upsert: bool = False) -> None

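Insert documents into a collection of the vector database.

Arguments:

docs - List[Document] | A list of documents. Each document is a TypedDict Document.
collection_name - str | The name of the collection. Default is None.
upsert - bool | Whether to update the document if it exists. Default is False.
kwargs - Dict | Additional keyword arguments.

Returns:

None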
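Because Collection.add and Collection.upsert take parallel lists, a list of documents has to be split into ids, documents, and metadatas before insertion. The sketch below shows that step under the assumption that each document carries id, content, and optional metadata keys; this reference does not spell out the Document fields, so the Doc type and split_docs helper are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple, TypedDict


class Doc(TypedDict, total=False):
    """Hypothetical stand-in for the Document TypedDict."""
    id: str
    content: str
    metadata: Optional[Dict[str, Any]]


def split_docs(docs: List[Doc]) -> Tuple[List[str], List[str], List[Optional[Dict[str, Any]]]]:
    """Turn a list of documents into parallel ids/documents/metadatas lists."""
    ids = [doc["id"] for doc in docs]
    documents = [doc["content"] for doc in docs]
    metadatas = [doc.get("metadata") for doc in docs]
    return ids, documents, metadatas


ids, documents, metadatas = split_docs([
    {"id": "1", "content": "first", "metadata": {"source": "a"}},
    {"id": "2", "content": "second"},
])
print(ids, documents, metadatas)
```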

update_docs

def update_docs(docs: List[Document], collection_name: str = None) -> None

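Update documents in a collection of the vector database.

Arguments:

docs - List[Document] | A list of documents.
collection_name - str | The name of the collection. Default is None.

Returns:

None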

delete_docs

def delete_docs(ids: List[ItemID], collection_name: str = None) -> None

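Delete documents from a collection of the vector database.

Arguments:

ids - List[ItemID] | A list of document IDs. Each ID is a typed ItemID.
collection_name - str | The name of the collection. Default is None.
kwargs - Dict | Additional keyword arguments.

Returns:

None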

retrieve_docs

def retrieve_docs(queries: List[str],
                  collection_name: str = None,
                  n_results: int = 10,
                  distance_threshold: float = -1) -> QueryResults

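Retrieve documents from a collection of the vector database based on the queries.

Arguments:

queries - List[str] | A list of queries. Each query is a string.
collection_name - str | The name of the collection. Default is None.
n_results - int | The number of relevant documents to return. Default is 10.
distance_threshold - float | The threshold for the distance score; only results with a distance smaller than it are returned. The filter is not applied if the value is less than 0. Default is -1.
kwargs - Dict | Additional keyword arguments.

Returns:

QueryResults | The query results. The result for each query is a list of tuples, each containing a document and its distance.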
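The interaction between n_results and distance_threshold can be sketched on a list of (document, distance) tuples: results are ranked by distance, filtered only when the threshold is non-negative, and then truncated. The filter_results helper is hypothetical, and applying the filter before truncation is an assumption.

```python
from typing import List, Tuple


def filter_results(scored: List[Tuple[str, float]],
                   n_results: int = 10,
                   distance_threshold: float = -1) -> List[Tuple[str, float]]:
    """Rank by distance, apply the optional threshold, keep the top n."""
    ranked = sorted(scored, key=lambda pair: pair[1])
    if distance_threshold >= 0:
        # Keep only results strictly closer than the threshold.
        ranked = [pair for pair in ranked if pair[1] < distance_threshold]
    return ranked[:n_results]


scored = [("a", 0.9), ("b", 0.2), ("c", 0.5)]
print(filter_results(scored, n_results=2))
print(filter_results(scored, distance_threshold=0.6))
```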

get_docs_by_ids

def get_docs_by_ids(ids: List[ItemID] = None,
                    collection_name: str = None,
                    include=None,
                    **kwargs) -> List[Document]

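Retrieve documents from a collection of the vector database by their IDs.

Arguments:

ids - List[ItemID] | A list of document IDs. If None, all documents are returned. Default is None.
collection_name - str | The name of the collection. Default is None.
include - List[str] | The fields to include. Default is None. If None, ["metadatas", "documents"] are included; ids are always included.
kwargs - dict | Additional keyword arguments.

Returns:

List[Document] | The results.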
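The include rule can be sketched as a field projection over a stored row: ids are always returned, and the default selection is ["metadatas", "documents"]. The select_fields helper and the row layout (keys named after the include values) are hypothetical.

```python
from typing import Any, Dict, List, Optional


def select_fields(row: Dict[str, Any],
                  include: Optional[List[str]] = None) -> Dict[str, Any]:
    """Keep "ids" plus the requested fields from a stored row."""
    fields = include if include is not None else ["metadatas", "documents"]
    selected = {"ids": row["ids"]}
    for name in fields:
        if name in row:
            selected[name] = row[name]
    return selected


row = {"ids": "1", "documents": "text", "metadatas": {"k": "v"}, "embeddings": [0.1, 0.2]}
print(select_fields(row))
print(select_fields(row, include=["embeddings"]))
```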
