Private RAG Information Extraction Engine
| Time: 90 min | Level: Advanced |
|---|---|
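Processing private documents is a common task in many industries. Enterprises hold large amounts of unstructured data, stored as huge files, that must be processed and analyzed. Industry reports, financial analyses, legal documents, and many other files are stored in PDF, Word, and other formats. A conversational chatbot built on top of a RAG pipeline is one viable way to find relevant answers in such documents. However, if we want to extract structured information from the documents and pass it to downstream systems, we need a different approach.
Information extraction is the process of converting unstructured data into a format that machines can process easily. In this tutorial, we will show you how to use DSPy to perform it on a set of documents. Assuming we cannot send our data to an external service, we will use Ollama to run our own LLM on our own infrastructure, with Vultr as the cloud provider. Qdrant acts as the knowledge base in this setup, providing relevant document chunks for a given query, and will also be hosted on Vultr in Hybrid Cloud mode. The last missing piece, the DSPy application, will run in the same environment. If you work in a regulated industry, or simply need to keep your data private, this tutorial is for you.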

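Deploying Qdrant Hybrid Cloud on Vultr
All the services used in this tutorial will run on Vultr Kubernetes Engine. That gives us a lot of flexibility in scaling and managing resources. Vultr manages the control plane and worker nodes and provides integration with other managed services such as Load Balancers, Block Storage, and DNS.
- To start using managed Kubernetes on Vultr, follow the platform-specific documentation.
- Once your Kubernetes clusters are up, you can begin deploying Qdrant Hybrid Cloud.
Installing the necessary packages
We need a couple of Python packages to run our application. All of them can be installed together with the dspy-ai package and its qdrant extra: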
pip install dspy-ai[qdrant]
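Qdrant Hybrid Cloud
Our documentation contains a comprehensive guide on how to set up Qdrant in Hybrid Cloud mode on Vultr. Please follow it carefully to get your Qdrant instance up and running. Once that is done, we need to store the Qdrant URL and API key in environment variables. You can do this either with shell exports or, equivalently, from Python: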
export QDRANT_URL="https://qdrant.example.com"
export QDRANT_API_KEY="your-api-key"
import os
os.environ["QDRANT_URL"] = "https://qdrant.example.com"
os.environ["QDRANT_API_KEY"] = "your-api-key"
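Because every later step depends on these variables, it can help to fail early when one is missing. The following is a minimal sketch of such a check, not part of DSPy or the Qdrant client: the helper name `load_settings` and the `REQUIRED_VARS` tuple are assumptions introduced here for illustration.

```python
import os

# Names of the variables set above; extend the tuple if your setup needs more.
REQUIRED_VARS = ("QDRANT_URL", "QDRANT_API_KEY")


def load_settings(environ=os.environ, required=REQUIRED_VARS):
    """Return the required variables as a dict, or raise if any is missing or empty."""
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in required}


# Example with an explicit mapping instead of the real process environment.
settings = load_settings(
    {"QDRANT_URL": "https://qdrant.example.com", "QDRANT_API_KEY": "your-api-key"}
)
```

Passing the mapping explicitly keeps the function easy to test; calling `load_settings()` with no arguments reads the real process environment.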
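DSPy is the framework we are going to use. It is integrated with Qdrant already, but it assumes you use FastEmbed to create the embeddings. DSPy does not provide a way to index the data and leaves that task to the user. We will create a collection on our own and fill it with the embeddings of our document chunks.
Data indexing
FastEmbed uses BAAI/bge-small-en as its default embedding model. We will use it as well. Our collection is created automatically if we call the .add method on an existing QdrantClient instance. In this tutorial we won't focus much on document parsing, as plenty of tools can help with that. The unstructured library is one option that you can run on your own infrastructure. In our simplified example, we will use a list of strings as our documents. They are descriptions of made-up tech events. Each should contain the event's name, location, and start and end dates.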
documents = [
"Taking place in San Francisco, USA, from the 10th to the 12th of June, 2024, the Global Developers Conference is the annual gathering spot for developers worldwide, offering insights into software engineering, web development, and mobile applications.",
"The AI Innovations Summit, scheduled for 15-17 September 2024 in London, UK, aims at professionals and researchers advancing artificial intelligence and machine learning.",
"Berlin, Germany will host the CyberSecurity World Conference between November 5th and 7th, 2024, serving as a key forum for cybersecurity professionals to exchange strategies and research on threat detection and mitigation.",
"Data Science Connect in New York City, USA, occurring from August 22nd to 24th, 2024, connects data scientists, analysts, and engineers to discuss data science's innovative methodologies, tools, and applications.",
"Set for July 14-16, 2024, in Tokyo, Japan, the Frontend Developers Fest invites developers to delve into the future of UI/UX design, web performance, and modern JavaScript frameworks.",
"The Blockchain Expo Global, happening May 20-22, 2024, in Dubai, UAE, focuses on blockchain technology's applications, opportunities, and challenges for entrepreneurs, developers, and investors.",
"Singapore's Cloud Computing Summit, scheduled for October 3-5, 2024, is where IT professionals and cloud experts will convene to discuss strategies, architectures, and cloud solutions.",
"The IoT World Forum, taking place in Barcelona, Spain from December 1st to 3rd, 2024, is the premier conference for those focused on the Internet of Things, from smart cities to IoT security.",
"Los Angeles, USA, will become the hub for game developers, designers, and enthusiasts at the Game Developers Arcade, running from April 18th to 20th, 2024, to showcase new games and discuss development tools.",
"The TechWomen Summit in Sydney, Australia, from March 8-10, 2024, aims to empower women in tech with workshops, keynotes, and networking opportunities.",
"Seoul, South Korea's Mobile Tech Conference, happening from September 29th to October 1st, 2024, will explore the future of mobile technology, including 5G networks and app development trends.",
"The Open Source Summit, to be held in Helsinki, Finland from August 11th to 13th, 2024, celebrates open source technologies and communities, offering insights into the latest software and collaboration techniques.",
"Vancouver, Canada will play host to the VR/AR Innovation Conference from June 20th to 22nd, 2024, focusing on the latest in virtual and augmented reality technologies.",
"Scheduled for May 5-7, 2024, in London, UK, the Fintech Leaders Forum brings together experts to discuss the future of finance, including innovations in blockchain, digital currencies, and payment technologies.",
"The Digital Marketing Summit, set for April 25-27, 2024, in New York City, USA, is designed for marketing professionals and strategists to discuss digital marketing and social media trends.",
"EcoTech Symposium in Paris, France, unfolds over 2024-10-09 to 2024-10-11, spotlighting sustainable technologies and green innovations for environmental scientists, tech entrepreneurs, and policy makers.",
"Set in Tokyo, Japan, from 16th to 18th May '24, the Robotic Innovations Conference showcases automation, robotics, and AI-driven solutions, appealing to enthusiasts and engineers.",
"The Software Architecture World Forum in Dublin, Ireland, occurring 22-24 Sept 2024, gathers software architects and IT managers to discuss modern architecture patterns.",
"Quantum Computing Summit, convening in Silicon Valley, USA from 2024/11/12 to 2024/11/14, is a rendezvous for exploring quantum computing advancements with physicists and technologists.",
"From March 3 to 5, 2024, the Global EdTech Conference in London, UK, discusses the intersection of education and technology, featuring e-learning and digital classrooms.",
"Bangalore, India's NextGen DevOps Days, from 28 to 30 August 2024, is a hotspot for IT professionals keen on the latest DevOps tools and innovations.",
"The UX/UI Design Conference, slated for April 21-23, 2024, in New York City, USA, invites discussions on the latest in user experience and interface design among designers and developers.",
"Big Data Analytics Summit, taking place 2024 July 10-12 in Amsterdam, Netherlands, brings together data professionals to delve into big data analysis and insights.",
"Toronto, Canada, will see the HealthTech Innovation Forum from June 8 to 10, '24, focusing on technology's impact on healthcare with professionals and innovators.",
"Blockchain for Business Summit, happening in Singapore from 2024-05-02 to 2024-05-04, focuses on blockchain's business applications, from finance to supply chain.",
"Las Vegas, USA hosts the Global Gaming Expo from October 18th to 20th, 2024, a premiere event for game developers, publishers, and enthusiasts.",
"The Renewable Energy Tech Conference in Copenhagen, Denmark, from 2024/09/05 to 2024/09/07, discusses renewable energy innovations and policies.",
"Set for 2024 Apr 9-11 in Boston, USA, the Artificial Intelligence in Healthcare Summit gathers healthcare professionals to discuss AI's healthcare applications.",
"Nordic Software Engineers Conference, happening in Stockholm, Sweden from June 15 to 17, 2024, focuses on software development in the Nordic region.",
"The International Space Exploration Symposium, scheduled in Houston, USA from 2024-08-05 to 2024-08-07, invites discussions on space exploration technologies and missions."
]
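We'll be able to ask general questions, for example about topics we are interested in or events taking place in a specific location, but expect the results to be returned in a structured format.

With the documents defined, indexing them in Qdrant takes a single call. Here client is the QdrantClient instance whose construction is shown later, in the pipeline section: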
client.add(
    collection_name="document-parts",
    documents=documents,
    metadata=[{"document": document} for document in documents],
)
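If the indexing script may run more than once, deterministic point IDs keep repeated runs from creating duplicate points. The sketch below derives a UUID from each document's text; `stable_ids` and the namespace URL are hypothetical names chosen for illustration, while `ids` is an optional parameter of `client.add`.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/document-parts")


def stable_ids(texts):
    """Derive one UUID string per text, so the same text always maps to the same ID."""
    return [str(uuid.uuid5(NAMESPACE, text)) for text in texts]


sample = ["first event description", "second event description"]
ids = stable_ids(sample)

# With a live client, the IDs would be passed alongside the documents:
# client.add(
#     collection_name="document-parts",
#     documents=documents,
#     metadata=[{"document": document} for document in documents],
#     ids=stable_ids(documents),
# )
```

With content-derived IDs, re-running the script overwrites existing points instead of appending new ones. The trade-off is that editing a document's text produces a new ID, so stale points have to be removed separately.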
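Our collection is ready to be queried. We can now move to the next step, which is setting up the Ollama model.
Ollama on Vultr
Ollama is a great tool for running LLMs on your own infrastructure. It is designed to be lightweight and easy to use, and an official Docker image is available. We can use that image to run Ollama in our Vultr Kubernetes cluster. LLMs may have special requirements, such as a GPU, and Vultr provides Vultr Kubernetes Engine for Cloud GPU, so the model can run on a dedicated machine. Please refer to the official documentation to get Ollama up and running in your environment. Once that is done, we need to store the Ollama URL in an environment variable, either in the shell or from Python: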
export OLLAMA_URL="https://ollama.example.com"
os.environ["OLLAMA_URL"] = "https://ollama.example.com"
We will refer to this URL later on, when configuring the Ollama model in our application.
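Before wiring the URL into the application, it may be worth confirming that the server is reachable. Ollama exposes a GET /api/tags endpoint that lists the locally available models. The sketch below uses only the standard library, and the helper names `installed_models` and `fetch_installed_models` are assumptions made for this example.

```python
import json
import urllib.request


def installed_models(tags_payload):
    """Extract model names from the decoded JSON body of GET /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_installed_models(base_url, timeout=5.0):
    """Query a running Ollama server and return the names of its local models."""
    url = base_url.rstrip("/") + "/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return installed_models(json.load(response))


# Parsing a sample payload of the documented shape (no network access needed).
names = installed_models({"models": [{"name": "gemma:2b"}]})

# Against a live server (requires network access):
# import os
# print(fetch_installed_models(os.environ["OLLAMA_URL"]))
```

A connection error at this point usually indicates a networking or ingress problem rather than anything in DSPy, which makes it easier to isolate early.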
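Setting up the Large Language Model
We are going to use gemma:2b, one of the lightweight LLMs available in Ollama. It was developed by the Google DeepMind team and has about 3B parameters. The Ollama version uses 4-bit quantization. Installing the model on the machine that runs Ollama takes a single command, which downloads the model on first use and then opens an interactive session: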
ollama run gemma:2b
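Ollama models are also integrated with DSPy, so we can use them directly in our application.
Implementing the information extraction pipeline
DSPy is a bit different from other LLM frameworks. It is designed to optimize the prompts and weights of LMs in a pipeline. It works a bit like a compiler for LMs: you write a pipeline in a high-level language, and DSPy generates the prompts and weights for you. This means you can build complex systems without worrying about the details of how to prompt your LMs, because DSPy handles that for you. It is somewhat like PyTorch, but for LLMs.
First, we define the language model we are going to use: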
import dspy
gemma_model = dspy.OllamaLocal(
    model="gemma:2b",
    base_url=os.environ.get("OLLAMA_URL"),
    max_tokens=500,
)
Similarly, we have to define the connection to our Qdrant Hybrid Cloud cluster:
from dspy.retrieve.qdrant_rm import QdrantRM
from qdrant_client import QdrantClient, models
client = QdrantClient(
    os.environ.get("QDRANT_URL"),
    api_key=os.environ.get("QDRANT_API_KEY"),
)
qdrant_retriever = QdrantRM(
    qdrant_collection_name="document-parts",
    qdrant_client=client,
)
Finally, both components have to be registered with DSPy, which takes a single call:
dspy.configure(lm=gemma_model, rm=qdrant_retriever)
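Application logic
DSPy has the concept of a signature, which defines the input and output format of a pipeline step. We will define a simple signature for an event: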
class Event(dspy.Signature):
    description = dspy.InputField(
        desc="Textual description of the event, including name, location and dates"
    )
    event_name = dspy.OutputField(desc="Name of the event")
    location = dspy.OutputField(desc="Location of the event")
    start_date = dspy.OutputField(desc="Start date of the event, YYYY-MM-DD")
    end_date = dspy.OutputField(desc="End date of the event, YYYY-MM-DD")
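It is meant to extract structured information from a textual description of an event. Now we can build our module, which will use it together with Qdrant and the Ollama model. Let's call it EventExtractor: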
class EventExtractor(dspy.Module):
    def __init__(self):
        super().__init__()
        # Retrieve module to get relevant documents
        self.retriever = dspy.Retrieve(k=3)
        # Predict module for the created signature
        self.predict = dspy.Predict(Event)

    def forward(self, query: str):
        # Retrieve the most relevant documents
        results = self.retriever.forward(query)
        # Try to extract events from the retrieved documents
        events = []
        for document in results.passages:
            event = self.predict(description=document)
            events.append(event)
        return events
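The logic is simple: we retrieve the most relevant documents from Qdrant, and then try to extract structured information from each of them using the Event signature. We can call the module and inspect the results: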
extractor = EventExtractor()
extractor.forward("Blockchain events close to Europe")
Output:
[
Prediction(
event_name='Event Name: Blockchain Expo Global',
location='Dubai, UAE',
start_date='2024-05-20',
end_date='2024-05-22'
),
Prediction(
event_name='Event Name: Blockchain for Business Summit',
location='Singapore',
start_date='2024-05-02',
end_date='2024-05-04'
),
Prediction(
event_name='Event Name: Open Source Summit',
location='Helsinki, Finland',
start_date='2024-08-11',
end_date='2024-08-13'
)
]
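The task was solved successfully, even without any optimization. However, each event name carries an "Event Name:" prefix that we may want to remove. DSPy allows modules to be optimized, so the results can be improved. Optimization can be done in several ways, which are covered in detail in the DSPy documentation.
We won't go through the optimization process in this tutorial. However, we encourage you to experiment with it, as it may significantly improve the performance of your pipeline.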
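Until the module is optimized, a small post-processing step can clean up the raw predictions before they reach a downstream system. The sketch below is an illustration and not a DSPy feature: `EventRecord`, `strip_label`, and `to_record` are hypothetical helpers, and a `SimpleNamespace` stands in for the `Prediction` objects shown in the output above.

```python
from dataclasses import dataclass
from datetime import date
from types import SimpleNamespace


@dataclass(frozen=True)
class EventRecord:
    name: str
    location: str
    start_date: date
    end_date: date


def strip_label(value, label="Event Name:"):
    """Remove a leading label such as 'Event Name:' if the model echoed it."""
    value = value.strip()
    if value.lower().startswith(label.lower()):
        value = value[len(label):].strip()
    return value


def to_record(prediction):
    """Convert a prediction-like object to an EventRecord, or None if the dates are unusable."""
    try:
        start = date.fromisoformat(prediction.start_date.strip())
        end = date.fromisoformat(prediction.end_date.strip())
    except ValueError:
        return None  # the model did not return YYYY-MM-DD
    if end < start:
        return None
    return EventRecord(strip_label(prediction.event_name), prediction.location.strip(), start, end)


# Stand-in for one Prediction from the output above.
raw = SimpleNamespace(
    event_name="Event Name: Blockchain Expo Global",
    location="Dubai, UAE",
    start_date="2024-05-20",
    end_date="2024-05-22",
)
record = to_record(raw)
```

Returning `None` for unparseable dates is one possible policy. Depending on the downstream system, logging the rejected prediction or retrying the extraction may be more appropriate.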
The created module can easily be stored at a specific path and loaded later on:
extractor.save("event_extractor")
To load it, just create an instance of the module and call the load method:
second_extractor = EventExtractor()
second_extractor.load("event_extractor")
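This is especially useful when you optimize the module, as the optimized version can be stored and loaded later without repeating the optimization process each time the application runs.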
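DSPy optimizers score candidate programs with a user-supplied metric: a function that takes a labelled example, a prediction, and an optional trace. A minimal sketch of such a metric for the Event signature could look like the following. The name `event_match` and the field-by-field scoring rule are assumptions made for illustration, and `SimpleNamespace` objects stand in for real example and prediction instances.

```python
from types import SimpleNamespace

FIELDS = ("event_name", "location", "start_date", "end_date")


def _normalize(value):
    """Collapse whitespace and ignore case so trivial differences do not count."""
    return " ".join(str(value).split()).casefold()


def event_match(example, pred, trace=None):
    """Score a prediction against a labelled example, field by field."""
    hits = sum(
        _normalize(getattr(example, field)) == _normalize(getattr(pred, field, ""))
        for field in FIELDS
    )
    if trace is not None:
        # While demonstrations are being bootstrapped, accept only perfect matches.
        return hits == len(FIELDS)
    return hits / len(FIELDS)


gold = SimpleNamespace(
    event_name="Blockchain Expo Global",
    location="Dubai, UAE",
    start_date="2024-05-20",
    end_date="2024-05-22",
)
raw = SimpleNamespace(
    event_name="Event Name: Blockchain Expo Global",
    location="Dubai, UAE",
    start_date="2024-05-20",
    end_date="2024-05-22",
)
score = event_match(gold, raw)  # three of the four fields match
```

A metric like this penalizes the stray prefix in the event name, so an optimizer guided by it has a reason to prefer prompts that return the bare name.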
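Deploying the extraction pipeline
Vultr gives us a lot of flexibility in how we deploy the application. Ideally, we would run it on the Kubernetes cluster we set up earlier. Deployment is as simple as running any other Python application. This time we don't need a GPU, because Ollama is already running on a separate machine and DSPy only interacts with it.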
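As one possible shape for that Python application, the sketch below wraps an extractor in a small command-line entry point that prints JSON for downstream consumers. Everything here is an assumption for illustration: `main` and `fake_extractor` are hypothetical names, and in a real deployment the stub would be replaced by an `EventExtractor()` instance after `dspy.configure` has been called.

```python
import argparse
import json
from types import SimpleNamespace

FIELDS = ("event_name", "location", "start_date", "end_date")


def main(argv, extractor):
    """Parse the query from argv, run the extractor, and return the events as JSON."""
    parser = argparse.ArgumentParser(description="Extract structured events for a query")
    parser.add_argument("query", help="free-text question about events")
    args = parser.parse_args(argv)
    rows = [{field: getattr(event, field) for field in FIELDS} for event in extractor(args.query)]
    return json.dumps(rows, indent=2)


def fake_extractor(query):
    """Stand-in for EventExtractor(); returns one fixed prediction-like object."""
    return [
        SimpleNamespace(
            event_name="Blockchain Expo Global",
            location="Dubai, UAE",
            start_date="2024-05-20",
            end_date="2024-05-22",
        )
    ]


output = main(["Blockchain events close to Europe"], fake_extractor)
```

Keeping the extractor as a parameter means the same entry point can be exercised locally with a stub and run in the cluster with the real module.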
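Wrapping up
In this tutorial, we showed you how to set up a private environment for information extraction using DSPy, Ollama, and Qdrant. All the components can be hosted securely on the Vultr cloud, giving you full control over your data.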
