agentchat.contrib.vectordb.mongodb

with_id_rename

def with_id_rename(docs: Iterable) -> List[Dict[str, Any]]

Utility that renames the _id field of each Collection record to id, as expected by Document.
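A minimal sketch of the renaming described above, assuming each input record is a plain dict that carries an `_id` key. The helper name `rename_id_sketch` is hypothetical and is not the library's implementation.

```python
from typing import Any, Dict, Iterable, List


def rename_id_sketch(docs: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of docs with the `_id` key exposed as `id` (illustrative only)."""
    renamed = []
    for doc in docs:
        # Copy every field except `_id`, then expose its value under `id`.
        out = {key: value for key, value in doc.items() if key != "_id"}
        out["id"] = doc["_id"]
        renamed.append(out)
    return renamed


raw = [{"_id": "a1", "content": "hello"}, {"_id": "b2", "content": "world"}]
print(rename_id_sketch(raw))
```

The sketch builds new dicts rather than editing the inputs, so the records read from the collection stay untouched.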

MongoDBAtlasVectorDB

class MongoDBAtlasVectorDB(VectorDB)

A Collection object for MongoDB.

__init__

def __init__(connection_string: str = "",
             database_name: str = "vector_db",
             embedding_function: Callable = SentenceTransformer(
                 "all-MiniLM-L6-v2").encode,
             collection_name: str = None,
             index_name: str = "vector_index",
             overwrite: bool = False,
             wait_until_index_ready: float = None,
             wait_until_document_ready: float = None)

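Initialize the vector database.

Arguments:

connection_string - str | The MongoDB connection string to connect to. Defaults to ''.
database_name - str | The name of the database. Defaults to 'vector_db'.
embedding_function - Callable | The embedding function used to generate the vector representation.
collection_name - str | The name of the collection to create for this vector database. Defaults to None.
index_name - str | The index name for the vector database. Defaults to 'vector_index'.
overwrite - bool | Whether to overwrite the collection if it already exists. Defaults to False.
wait_until_index_ready - float | None | Blocking call that waits until the database indexes are ready. None, the default, means no wait.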
wait_until_document_ready - float | None | Blocking call that waits until the inserted documents are ready. None, the default, means no wait.

list_collections

def list_collections()

List the collections in the vector database.

Returns:

List[str] | The list of collections.

create_collection

def create_collection(collection_name: str,
                      overwrite: bool = False,
                      get_or_create: bool = True) -> Collection

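Create a collection in the vector database and create a vector search index in the collection.

Arguments:

collection_name - str | The name of the collection.
overwrite - bool | Whether to overwrite the collection if it exists. Defaults to False.
get_or_create - bool | Whether to get the collection if it exists instead of creating it. Defaults to True.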
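The description above does not spell out how `overwrite` and `get_or_create` interact. One plausible reading is sketched below, with an in-memory dict standing in for the database. The function `create_collection_sketch`, the `FakeDB` type, the precedence of the two flags, and the `ValueError` are all assumptions made for illustration, not documented behavior.

```python
from typing import Dict, List

# In-memory stand-in for a database: collection name -> list of documents.
FakeDB = Dict[str, List[dict]]


def create_collection_sketch(db: FakeDB, name: str,
                             overwrite: bool = False,
                             get_or_create: bool = True) -> List[dict]:
    """Create, replace, or fetch a collection according to the two flags."""
    if name not in db:
        db[name] = []  # no conflict: create a fresh collection
    elif overwrite:
        db[name] = []  # drop the existing contents and recreate
    elif not get_or_create:
        # Assumed behavior: refuse to silently reuse an existing collection.
        raise ValueError(f"Collection {name} already exists.")
    return db[name]


db: FakeDB = {"docs": [{"id": 1}]}
print(create_collection_sketch(db, "docs"))                  # existing collection is returned
print(create_collection_sketch(db, "docs", overwrite=True))  # contents are discarded
```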

create_index_if_not_exists

def create_index_if_not_exists(index_name: str = "vector_index",
                               collection: Collection = None) -> None

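Create a vector search index on the specified collection in MongoDB.

Arguments:

index_name - str, optional | The name of the vector search index to create. Defaults to "vector_index".
collection - Collection, optional | The MongoDB collection on which to create the index. Defaults to None.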

get_collection

def get_collection(collection_name: str = None) -> Collection

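Get a collection from the vector database.

Arguments:

collection_name - str | The name of the collection. Defaults to None. If None, the currently active collection is returned.

Returns:

Collection | The collection object.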

delete_collection

def delete_collection(collection_name: str) -> None

Delete a collection from the vector database.

Arguments:

collection_name - str | The name of the collection.

create_vector_search_index

def create_vector_search_index(
    collection: Collection,
    index_name: Union[str, None] = "vector_index",
    similarity: Literal["euclidean", "cosine",
                        "dotProduct"] = "cosine") -> None

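Create a vector search index on a collection.

Arguments:

collection - An existing collection in the Atlas database.
index_name - The name of the vector search index.
similarity - The algorithm used to measure vector similarity.
kwargs - Additional keyword arguments.

Returns:

None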
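For orientation, the sketch below builds an index definition in the layout that Atlas Vector Search uses for vector fields. The helper name `vector_index_definition` and the field path `embedding` are assumptions; how this class actually assembles and submits its definition is not shown in this reference.

```python
from typing import Any, Dict

ALLOWED_SIMILARITIES = ("euclidean", "cosine", "dotProduct")


def vector_index_definition(num_dimensions: int,
                            similarity: str = "cosine",
                            path: str = "embedding") -> Dict[str, Any]:
    """Build an Atlas Vector Search index definition (illustrative only)."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"Unsupported similarity: {similarity!r}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                     # assumed name of the field holding the vectors
                "numDimensions": num_dimensions,  # must match the embedding size
                "similarity": similarity,
            }
        ]
    }


# all-MiniLM-L6-v2, the default embedding model above, produces 384-dimensional vectors.
definition = vector_index_definition(384)
print(definition)
```

The number of dimensions has to agree with the output of `embedding_function`, so changing the embedding model generally means rebuilding the index.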

insert_docs

def insert_docs(docs: List[Document],
                collection_name: str = None,
                upsert: bool = False,
                batch_size=DEFAULT_INSERT_BATCH_SIZE,
                **kwargs) -> None

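Insert documents and their vector embeddings into a collection of the vector database.

For large numbers of documents, insertion is performed in batches.

It is recommended that documents do not contain an ID field, because the method generates hashed IDs for them.

Arguments:

docs - List[Document] | A list of documents. Each document is a TypedDict Document, which may contain an ID. Documents without an ID will have one generated.
collection_name - str | The name of the collection. Defaults to None.
upsert - bool | Whether to update the document if it exists. Defaults to False.
batch_size - The number of documents to insert in each batch.
kwargs - Additional keyword arguments. Use hash_length to set the hash length of generated IDs, and overwrite_ids to replace existing IDs with hashes.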
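The two ideas above, generated hash IDs and batched insertion, can be sketched without a database. The hashing scheme (SHA-256 of the content, truncated to `hash_length`), the default length of 12, and the helper names `hash_id_sketch` and `batches` are assumptions for illustration; the library's actual scheme is not documented here.

```python
import hashlib
from typing import Dict, Iterator, List


def hash_id_sketch(content: str, hash_length: int = 12) -> str:
    """Derive a deterministic ID from document content (assumed scheme)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:hash_length]


def batches(items: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


docs = [{"content": f"document {i}"} for i in range(5)]
for doc in docs:
    # Only documents without an ID receive a generated one.
    doc.setdefault("id", hash_id_sketch(doc["content"]))

for batch in batches(docs, batch_size=2):
    print([d["id"] for d in batch])  # three batches of sizes 2, 2, and 1
```

Because the ID is derived from the content, inserting the same text twice yields the same ID, which is what makes an upsert-style insert idempotent in this sketch.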

update_docs

def update_docs(docs: List[Document],
                collection_name: str = None,
                **kwargs: Any) -> None

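Update documents in a collection, including their embeddings.

Optionally accepts upsert as a keyword argument.

Uses deepcopy to avoid modifying the input docs.

Arguments:

docs - List[Document] | A list of documents, each with an ID, so that the correct documents are updated.
collection_name - str | The name of the collection. Defaults to None.
kwargs - Any | Use `upsert=True` to insert documents that do not exist in the collection.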
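To illustrate why a deep copy matters here, the sketch below assumes the update step renames `id` to `_id` before writing. That renaming step and the helper name `prepare_for_update` are hypothetical; only the use of deepcopy comes from the description above.

```python
from copy import deepcopy
from typing import Any, Dict, List


def prepare_for_update(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of docs keyed by `_id`, leaving the inputs unchanged."""
    prepared = deepcopy(docs)
    for doc in prepared:
        # Rename on the copy only; the caller's dicts keep their `id` key.
        doc["_id"] = doc.pop("id")
    return prepared


original = [{"id": "a1", "content": "new text", "metadata": {"lang": "en"}}]
prepared = prepare_for_update(original)
print(original[0])  # still has "id"
print(prepared[0])  # has "_id" instead
```

A shallow copy would protect the top-level keys but still share nested values such as `metadata`, which is the case deepcopy guards against.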

delete_docs

def delete_docs(ids: List[ItemID], collection_name: str = None, **kwargs)

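Delete documents from a collection of the vector database.

Arguments:

ids - List[ItemID] | A list of document IDs. Each ID is of type ItemID.
collection_name - str | The name of the collection. Defaults to None.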

get_docs_by_ids

def get_docs_by_ids(ids: List[ItemID] = None,
                    collection_name: str = None,
                    include: List[str] = None,
                    **kwargs) -> List[Document]

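Retrieve documents from a collection of the vector database by ID.

Arguments:

ids - List[ItemID] | A list of document IDs. If None, all documents are returned. Defaults to None.
collection_name - str | The name of the collection. Defaults to None.
include - List[str] | The fields to include. If None, includes ["metadata", "content"]; IDs are always included. In practice, use include to choose whether embeddings and metadata are returned.
kwargs - dict | Additional keyword arguments.

Returns:

List[Document] | The results.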
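The effect of `include` can be sketched as a field filter over a stored record. The helper `select_fields` and the stored field name `embedding` are assumptions; the default field list and the rule that IDs are always present come from the description above.

```python
from typing import Any, Dict, List, Optional


def select_fields(doc: Dict[str, Any],
                  include: Optional[List[str]] = None) -> Dict[str, Any]:
    """Keep the ID plus the requested fields (illustrative only)."""
    fields = ["metadata", "content"] if include is None else include
    # The ID is always returned, whatever `include` says.
    return {"id": doc["id"], **{key: doc[key] for key in fields if key in doc}}


stored = {"id": "a1", "content": "hi", "metadata": {"lang": "en"}, "embedding": [0.1, 0.2]}
print(select_fields(stored))                 # default: id, metadata, content
print(select_fields(stored, ["embedding"]))  # id and embedding only
```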

retrieve_docs

def retrieve_docs(queries: List[str],
                  collection_name: str = None,
                  n_results: int = 10,
                  distance_threshold: float = -1,
                  **kwargs) -> QueryResults

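Retrieve documents from a collection of the vector database based on the queries.

Arguments:

queries - List[str] | A list of queries. Each query is a string.
collection_name - str | The name of the collection. Defaults to None.
n_results - int | The number of relevant documents to return. Defaults to 10.
distance_threshold - float | The threshold for the distance score; only documents with a distance smaller than it are returned. No filtering is applied if it is less than 0. Defaults to -1.
kwargs - Dict | Additional keyword arguments. The following is an important one:
oversampling_factor - int | This value multiplied by n_results gives 'ef' in the HNSW algorithm. It determines the number of nearest-neighbor candidates considered during the search phase. A higher value improves accuracy but is slower. Defaults to 10.

Returns:

QueryResults | For each query string, a list of the nearest documents and their scores.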
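The relationship between `n_results`, `oversampling_factor`, and `distance_threshold` can be sketched as an aggregation pipeline built around the `$vectorSearch` stage. The helper name `vector_search_pipeline`, the field path `embedding`, and the conversion of a distance threshold into a score filter via `distance = 1 - score` are assumptions, not confirmed details of this class.

```python
from typing import Any, Dict, List


def vector_search_pipeline(query_vector: List[float],
                           index_name: str = "vector_index",
                           n_results: int = 10,
                           oversampling_factor: int = 10,
                           distance_threshold: float = -1,
                           path: str = "embedding") -> List[Dict[str, Any]]:
    """Build a $vectorSearch aggregation pipeline (illustrative only)."""
    pipeline: List[Dict[str, Any]] = [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,  # assumed name of the field holding the vectors
                "queryVector": query_vector,
                # Candidate pool considered by the approximate search.
                "numCandidates": n_results * oversampling_factor,
                "limit": n_results,
            }
        },
        # Attach the similarity score reported by Atlas to each result.
        {"$set": {"score": {"$meta": "vectorSearchScore"}}},
    ]
    if distance_threshold >= 0:
        # Assumption: distance = 1 - score, so "distance < threshold"
        # becomes "score > 1 - threshold".
        pipeline.append({"$match": {"score": {"$gt": 1.0 - distance_threshold}}})
    return pipeline


pipeline = vector_search_pipeline([0.1, 0.2], n_results=5)
print(pipeline[0]["$vectorSearch"]["numCandidates"])
```

With the defaults, 10 results and an oversampling factor of 10 give a pool of 100 candidates; raising the factor widens the pool at the cost of query latency.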
