In this post, we'll explore how to use Google's Gemini model with Instructor to analyze the Gemini 1.5 Pro paper and extract a structured summary.
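Processing PDFs programmatically has always been painful. The typical approaches all have significant drawbacks:
- PDF parsing libraries require complex rules and break easily
- OCR solutions are slow and error-prone
- Specialized PDF APIs are expensive and require additional integration work
- LLM solutions often require complex document chunking and embedding pipelines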
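What if we could simply hand a PDF to an LLM and get structured data back? With Gemini's multimodal capabilities and Instructor's structured output handling, we can do exactly that.
First, install the required package: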
pip install "instructor[google-generativeai]"
Then, here is all the code you need:
import time

import google.generativeai as genai
import instructor
from google.ai.generativelanguage_v1beta.types.file import File
from pydantic import BaseModel

# Initialize the client
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    )
)


# Define your output structure
class Summary(BaseModel):
    summary: str


# Upload the PDF
file = genai.upload_file("path/to/your.pdf")

# Wait for file to finish processing
while file.state != File.State.ACTIVE:
    time.sleep(1)
    file = genai.get_file(file.name)
    print(f"File is still uploading, state: {file.state}")

print(f"File is now active, state: {file.state}")
print(file)

resp = client.chat.completions.create(
    messages=[
        {"role": "user", "content": ["Summarize the following file", file]},
    ],
    response_model=Summary,
)

print(resp.summary)
Raw results:
summary="Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. It achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days long of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. It can recall information amidst distractor context, and it can learn to translate a new language from a single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ≈ 400 extra parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers, and therefore almost no online presence."
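The polling loop in the script above runs until the file becomes active, so it would spin forever if processing failed or stalled. Below is a minimal sketch of a bounded wait. The helper name `wait_until_active`, its parameters, and the use of plain state strings are assumptions made for illustration; they are not part of Instructor or the Gemini SDK.

```python
import time
from typing import Callable


def wait_until_active(
    fetch_state: Callable[[], str],
    timeout: float = 120.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_state() until it returns "ACTIVE".

    Hypothetical helper: raises RuntimeError if the state becomes
    "FAILED", and TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state == "ACTIVE":
            return state
        if state == "FAILED":
            raise RuntimeError("file processing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"file still in state {state!r} after {timeout}s")
        # Injectable sleep keeps the helper easy to test without real delays.
        sleep(poll_interval)
```

With the script above, this could be called as `wait_until_active(lambda: genai.get_file(file.name).state.name)`, assuming the file state behaves like a standard Python enum with a `.name` attribute.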
Combining Gemini with Instructor offers several key advantages over traditional PDF processing approaches:
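Simple integration - Unlike traditional approaches that require complex document processing pipelines, chunking strategies, and embedding databases, you can process PDFs directly with a few lines of code. This significantly reduces development time and maintenance overhead.
Structured output - Instructor's Pydantic integration ensures you get exactly the data structure you need. The model's output is automatically validated and typed, which makes it easier to build reliable applications. If extraction fails, Instructor automatically handles retries for you, and it supports custom retry logic using tenacity.
Multimodal support - Gemini's multimodal capabilities mean the same approach works for a variety of file types. You can process images, video, and audio files in the same API request. See our multimodal processing guide to learn how we extract structured data from travel videos.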
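To make the structured-output point above concrete, here is a small offline sketch of the kind of validation a Pydantic model performs on a model response. The `PaperSummary` schema, its field names, and the sample JSON payloads are hypothetical and chosen only for illustration; the sketch assumes Pydantic v2.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class PaperSummary(BaseModel):
    """Hypothetical richer output schema; field names are illustrative."""

    title: str
    summary: str = Field(description="Short overview of the paper")
    key_findings: list[str] = Field(min_length=1)

    @field_validator("summary")
    @classmethod
    def summary_not_blank(cls, value: str) -> str:
        # Reject whitespace-only summaries and normalize the rest.
        if not value.strip():
            raise ValueError("summary must not be empty")
        return value.strip()


# Made-up payloads standing in for raw LLM output.
good = (
    '{"title": "Gemini 1.5", '
    '"summary": " Long-context multimodal model. ", '
    '"key_findings": ["Handles very long contexts"]}'
)
bad = '{"title": "Gemini 1.5", "summary": "   ", "key_findings": []}'

parsed = PaperSummary.model_validate_json(good)

try:
    PaperSummary.model_validate_json(bad)
    error_count = 0
except ValidationError as exc:
    # One error for the blank summary, one for the empty findings list.
    error_count = exc.error_count()
```

Passing `response_model=PaperSummary` in place of `Summary` in the script above should apply the same checks to Gemini's response. Instructor's `create` call also accepts a `max_retries` argument, which is where retry behavior is typically configured.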
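Working with PDFs doesn't have to be complicated.
By combining Gemini's multimodal capabilities with Instructor's structured output handling, we can turn complex document processing into simple Python code.
No more wrestling with parsing rules, managing embeddings, or building complex pipelines. Just define your data model and let the LLM do the heavy lifting.
If you found this useful, give Instructor a try today and see how structured outputs make working with LLMs easier.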