Retrievers

For more detailed usage information, please refer to our cookbook: RAG Cookbook

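1. Concept

The Retrievers module is essentially a search engine. It is designed to help you find specific pieces of information by searching through large amounts of text. Imagine you have a huge library of books and want to find where certain topics or keywords are mentioned: this module acts like a librarian who does that lookup for you.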

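2. Types

2.1. Vector Retriever

A Vector Retriever generally refers to a method or system used in information retrieval and machine learning that relies on vector representations of data. It works by converting text, images, or other kinds of data into numeric vectors in a high-dimensional space.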

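Here is a brief overview of how it works:

  • Embedding model: First, it uses an embedding model, which converts text into a mathematical form (a vector).

  • Storing information: The module takes large documents, breaks them into smaller chunks, and uses the embedding model to convert those chunks into vectors. The vectors are kept in a vector storage.

  • Retrieving information: When we ask a question or issue a query, the embedding model converts the question into a vector and searches the vector storage for the closest matching vectors. The closest matches are likely to be the most relevant information we are looking for.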
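The three steps above can be sketched with a toy example. This is only an illustration, not how CAMEL implements retrieval: the word-count `embed` function stands in for a real embedding model, the in-memory list stands in for a vector storage, and all names (`embed`, `cosine_similarity`, `retrieve`, the sample chunks) are hypothetical.

```python
import math
from collections import Counter


def embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy embedding (assumption): count each vocabulary word in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Storing information: embed each chunk and keep (vector, chunk) pairs.
chunks = [
    "camels store fat in their humps",
    "vector stores hold embeddings of text chunks",
    "librarians help readers find books",
]
vocabulary = sorted({word for chunk in chunks for word in chunk.split()})
vector_store = [(embed(chunk, vocabulary), chunk) for chunk in chunks]


def retrieve(query: str, top_k: int = 1) -> list[tuple[float, str]]:
    """Retrieving information: embed the query and rank chunks by similarity."""
    query_vector = embed(query, vocabulary)
    scored = [
        (cosine_similarity(query_vector, vector), chunk)
        for vector, chunk in vector_store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


best = retrieve("where are embeddings of text stored")
print(best[0][1])
```

A real embedding model captures meaning rather than exact word overlap, so it can match a query to a chunk even when they share no words; the ranking step by vector similarity is the same idea.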

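2.2. Keyword Retriever

A Keyword Retriever is designed to retrieve relevant information through keyword matching. Unlike the Vector Retriever, which works on vector representations of data, the Keyword Retriever finds matches by identifying keywords or specific terms in the documents and in the query.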

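Here is a brief overview of how it works:

  • Document preprocessing: Before the retriever is used, the documents are preprocessed so that their keywords are tokenized and indexed. Tokenization splits text into individual words or phrases, which makes keywords easy to identify.

  • Query parsing: When we enter a question or query, the retriever parses it to extract the relevant keywords. This involves breaking the query down into its component terms.

  • Keyword matching: Once the keywords in the query have been identified, the retriever searches the preprocessed documents for occurrences of those keywords. It checks whether the documents contain exact matches for them.

  • Ranking and retrieval: After finding the documents that contain the query keywords, the retriever ranks them by factors such as keyword match frequency, document relevance, or other scoring methods. The top-ranked documents are then returned as the most relevant results.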
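The four steps above can be sketched as a small inverted index. This is a minimal illustration under stated assumptions, not CAMEL's keyword retriever: the regex tokenizer, the sample documents, and the scoring by raw match frequency are all choices made for this example, and the names `tokenize`, `index`, and `keyword_search` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical sample documents for the illustration.
documents = {
    "doc1": "CAMEL is an open-source framework for communicative agents.",
    "doc2": "A camel can travel long distances without water.",
    "doc3": "Agents use retrievers to find relevant documents.",
}

# Document preprocessing: build an inverted index, term -> {doc_id: count}.
index: dict[str, Counter] = defaultdict(Counter)
for doc_id, text in documents.items():
    for term in tokenize(text):
        index[term][doc_id] += 1


def keyword_search(query: str, top_k: int = 2) -> list[tuple[str, int]]:
    """Parse the query, match exact terms, and rank by match frequency."""
    scores: Counter = Counter()
    for term in tokenize(query):  # query parsing
        for doc_id, count in index.get(term, {}).items():  # keyword matching
            scores[doc_id] += count
    return scores.most_common(top_k)  # ranking and retrieval


hits = keyword_search("communicative agents")
print(hits)
```

Production systems usually replace the raw frequency score with a weighting scheme such as TF-IDF or BM25, which discounts terms that appear in many documents.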

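3. Get Started

3.1. Using Vector Retriever

Initialize VectorRetriever: To get started, we need to initialize the VectorRetriever with an optional embedding model and storage. If we don't provide an embedding model, it uses the default OpenAIEmbedding. Here's how to do it: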

from camel.embeddings import OpenAIEmbedding
from camel.retrievers import VectorRetriever
from camel.storages.vectordb_storages import QdrantStorage

# Create the embedding model once so the storage and the retriever share it
embedding_model = OpenAIEmbedding()

# Create or initialize a vector storage (e.g., QdrantStorage)
vector_storage = QdrantStorage(
    vector_dim=embedding_model.get_output_dim(),
    collection_name="my first collection",
    path="storage_customized_run",
)

# Initialize the VectorRetriever with the embedding model and the storage
vr = VectorRetriever(embedding_model=embedding_model, storage=vector_storage)

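Embed and store data: Before we can retrieve information, we need to prepare the data and store it in the vector storage. The process method handles this for us. It processes content from a file or URL, splits it into chunks, and stores their embeddings in the specified vector storage.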

# Provide the path to our content input (can be a file or URL)
content_input_path = "https://www.camel-ai.org/"

# Embed and store chunks of data in the vector storage
vr.process(content=content_input_path)
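To make "splitting into chunks" concrete, here is a minimal sketch of fixed-size chunking with overlap. The source does not describe how process actually chunks content, so this is an assumption for illustration only; `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names, not CAMEL parameters.

```python
def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 10  # 100 characters of placeholder text
pieces = split_into_chunks(sample, chunk_size=40, overlap=10)
print(len(pieces), [len(piece) for piece in pieces])
```

Overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing some text twice.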

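Execute a query: Now that our data is stored, we can run a query to retrieve information based on a search string. The query method runs the query and returns the retrieved results; in the output shown here, each result carries a similarity score, the content path, metadata, and the matched text.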

# Specify our query string
query = "What is CAMEL"

# Execute the query and retrieve results
results = vr.query(query=query, similarity_threshold=0)
print(results)
>>>  [{'similarity score': '0.812884257383057', 'content path': 'https://www.camel-ai.org/', 'metadata': {'filetype': 'text/html', 'languages': ['eng'], 'page_number': 1, 'url': 'https://www.camel-ai.org/', 'link_urls': ['/home', '/home', '/research/agent-trust', '/agent', '/data_explorer', '/chat', 'https://www.google.com/url?q=https%3A%2F%2Fcamel-ai.github.io%2Fcamel&sa=D&sntz=1&usg=AOvVaw1ifGIva9n-a-0KpTrIG8Cv', 'https://www.google.com/url?q=https%3A%2F%2Fgithub.com%2Fcamel-ai%2Fcamel&sa=D&sntz=1&usg=AOvVaw03Z2OD0-plx_zugZZgBb8w', '/team', '/sponsors', '/home', '/home', '/research/agent-trust', '/agent', '/data_explorer', '/chat', 'https://www.google.com/url?q=https%3A%2F%2Fcamel-ai.github.io%2Fcamel&sa=D&sntz=1&usg=AOvVaw1ifGIva9n-a-0KpTrIG8Cv', 'https://www.google.com/url?q=https%3A%2F%2Fgithub.com%2Fcamel-ai%2Fcamel&sa=D&sntz=1&usg=AOvVaw03Z2OD0-plx_zugZZgBb8w', '/team', '/sponsors', '/home', '/research/agent-trust', '/agent', '/data_explorer', '/chat', 'https://www.google.com/url?q=https%3A%2F%2Fcamel-ai.github.io%2Fcamel&sa=D&sntz=1&usg=AOvVaw1ifGIva9n-a-0KpTrIG8Cv', 'https://www.google.com/url?q=https%3A%2F%2Fgithub.com%2Fcamel-ai%2Fcamel&sa=D&sntz=1&usg=AOvVaw03Z2OD0-plx_zugZZgBb8w', '/team', '/sponsors', 'https://github.com/camel-ai/camel'], 'link_texts': [None, 'Home', 'AgentTrust', 'Agent App', 'Data Explorer App', 'ChatBot', 'Docs', 'Github Repo', 'Team', 'Sponsors', None, 'Home', 'AgentTrust', 'Agent App', 'Data Explorer App', 'ChatBot', 'Docs', 'Github Repo', 'Team', 'Sponsors', 'Home', 'AgentTrust', 'Agent App', 'Data Explorer App', 'ChatBot', 'Docs', 'Github Repo', 'Team', 'Sponsors', None], 'emphasized_text_contents': ['Skip to main content', 'Skip to navigation', 'CAMEL-AI', 'CAMEL-AI', 'CAMEL:\xa0 Communicative Agents for "Mind" Exploration of Large Language Model Society', 'https://github.com/camel-ai/camel'], 'emphasized_text_tags': ['span', 'span', 'span', 'span', 'span', 'span']}, 'text': 'Search this site\n\nSkip to main content\n\nSkip to 
navigation\n\nCAMEL-AI\n\nHome\n\nResearchAgentTrust\n\nAgent App\n\nData Explorer App\n\nChatBot\n\nDocs\n\nGithub Repo\n\nTeam\n\nSponsors\n\nCAMEL-AI\n\nHome\n\nResearchAgentTrust\n\nAgent App\n\nData Explorer App\n\nChatBot\n\nDocs\n\nGithub Repo\n\nTeam\n\nSponsors\n\nMoreHomeResearchAgentTrustAgent AppData Explorer AppChatBotDocsGithub RepoTeamSponsors\n\nCAMEL:\xa0 Communicative Agents for "Mind" Exploration of Large Language Model Society\n\nhttps://github.com/camel-ai/camel'}]
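Judging only from the printed output above, each result is a dictionary whose 'similarity score' is a string. A small post-processing helper, sketched under that assumption, can convert the score to a number and keep only the stronger matches. The helper name `filter_results`, the `min_score` cutoff, and the second sample entry are hypothetical; the first sample entry is abridged from the output above.

```python
def filter_results(results: list[dict], min_score: float = 0.5) -> list[tuple[float, str, str]]:
    """Keep results scoring at least min_score, sorted from best to worst."""
    kept = []
    for item in results:
        score = float(item["similarity score"])  # scores are printed as strings
        if score >= min_score:
            kept.append((score, item["content path"], item["text"]))
    kept.sort(key=lambda entry: entry[0], reverse=True)
    return kept


sample_results = [
    {
        "similarity score": "0.812884257383057",
        "content path": "https://www.camel-ai.org/",
        "text": "CAMEL: Communicative Agents for \"Mind\" Exploration",
    },
    {
        # Made-up low-scoring entry, included only to show the filter working.
        "similarity score": "0.31",
        "content path": "https://www.camel-ai.org/",
        "text": "hypothetical weakly related chunk",
    },
]

strong_matches = filter_results(sample_results, min_score=0.5)
for score, path, text in strong_matches:
    print(f"{score:.3f}  {path}  {text[:40]}")
```

The similarity_threshold argument shown in the query call above appears to serve a similar purpose on the retriever side, so a helper like this is mainly useful when results need re-filtering after the fact.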

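3.2. Using Auto Retriever

To simplify the retrieval process further, we can use AutoRetriever. It handles both embedding and storing the data and executing the query, which is especially useful when working with multiple content input paths.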

from camel.retrievers import AutoRetriever
from camel.types import StorageType

# Set the vector storage local path and the storage type
ar = AutoRetriever(
    vector_storage_local_path="camel/retrievers",
    storage_type=StorageType.QDRANT,
)

# Run the auto vector retriever
retrieved_info = ar.run_vector_retriever(
    contents=[
        "https://www.camel-ai.org/",  # Example remote url
    ],
    query="What is CAMEL-AI",
    return_detailed_info=True,  # Set this to True if we want detailed info, including metadata
)

print(retrieved_info)
>>>  Original Query:
>>>  {What is CAMEL-AI}
>>>  Retrieved Context:
>>>  {'similarity score': '0.8380731206379989', 'content path': 'https://www.camel-ai.org/', 'metadata': {'emphasized_text_contents': ['Mission', 'CAMEL-AI.org', 'is an open-source community dedicated to the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. To facilitate research in this field, we provide, implement, and support various types of agents, tasks, prompts, models, datasets, and simulated environments.', 'Join us via', 'Slack', 'Discord', 'or'], 'emphasized_text_tags': ['span', 'span', 'span', 'span', 'span', 'span', 'span'], 'filetype': 'text/html', 'languages': ['eng'], 'link_texts': [None, None, None], 'link_urls': ['#h.3f4tphhd9pn8', 'https://join.slack.com/t/camel-ai/shared_invite/zt-1vy8u9lbo-ZQmhIAyWSEfSwLCl2r2eKA', 'https://discord.gg/CNcNpquyDc'], 'page_number': 1, 'url': 'https://www.camel-ai.org/'}, 'text': 'Mission\n\nCAMEL-AI.org is an open-source community dedicated to the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. To facilitate research in this field, we provide, implement, and support various types of agents, tasks, prompts, models, datasets, and simulated environments.\n\nJoin us via\n\nSlack\n\nDiscord\n\nor'}

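That's it! We have successfully set up the Retriever module and used it to retrieve information from our data collection based on a query.

Feel free to customize the code for your specific use case and data sources!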