March 11, 2025

Doing RAG on PDFs using File Search in the Responses API

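Although RAG can be overwhelming, searching across PDF files shouldn't be complicated. One of the most common approaches today is to parse the PDFs, define a chunking strategy, upload the chunks to a storage provider, run embeddings on those chunks, and store the embeddings in a vector database. And that is only the setup: retrieving content inside an LLM workflow takes several more steps.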
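To make the "chunking strategy" step of a hand-rolled pipeline concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_text`, the character-based windows, and the default sizes are all illustrative assumptions; this is not how OpenAI's hosted file search splits documents.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; sizes are arbitrary defaults.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, chunk_size=12, overlap=4)
print([len(c) for c in chunks])  # [12, 12, 12, 6]
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Every later step (embedding, storage, retrieval) has to be built on top of this, which is the work a hosted tool takes over.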

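This is where file search, a hosted tool available in the Responses API, comes in. It lets you search your knowledge base and generate an answer based on the retrieved content. In this cookbook, we upload a set of PDFs extracted from OpenAI's blog (openai.com/news) to a vector store on OpenAI and use file search to fetch additional context from that vector store. We then generate a small set of questions from those PDFs and use them to evaluate how well file search retrieves the right document.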

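File search was previously available only in the Assistants API. It is now part of the new Responses API, which can run statefully or statelessly and adds new features such as metadata filtering.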

Setup

!pip install PyPDF2 pandas tqdm openai -q
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
import concurrent.futures
import PyPDF2
import os
import pandas as pd
import base64

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
dir_pdfs = 'openai_blog_pdfs' # have those PDFs stored locally here
pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]

Creating a vector store with our PDFs

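We will create a vector store on the OpenAI API and upload our PDFs to it. OpenAI reads those PDFs, splits the content into chunks, embeds the chunks, and stores both the embeddings and the text in the vector store. We can then query the vector store and get back the content most relevant to a query.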

def upload_single_pdf(file_path: str, vector_store_id: str):
    file_name = os.path.basename(file_path)
    try:
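        # Open the file in a context manager so the handle is closed after the upload
        with open(file_path, 'rb') as f:
            file_response = client.files.create(file=f, purpose="assistants")
        # Attach the uploaded file to the vector store so it gets chunked and embedded
        client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_response.id
        )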
        return {"file": file_name, "status": "success"}
    except Exception as e:
        print(f"Error with {file_name}: {str(e)}")
        return {"file": file_name, "status": "failed", "error": str(e)}

def upload_pdf_files_to_vector_store(vector_store_id: str):
    pdf_files = [os.path.join(dir_pdfs, f) for f in os.listdir(dir_pdfs)]
    stats = {"total_files": len(pdf_files), "successful_uploads": 0, "failed_uploads": 0, "errors": []}
    
    print(f"{len(pdf_files)} PDF files to process. Uploading in parallel...")

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(upload_single_pdf, file_path, vector_store_id): file_path for file_path in pdf_files}
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(pdf_files)):
            result = future.result()
            if result["status"] == "success":
                stats["successful_uploads"] += 1
            else:
                stats["failed_uploads"] += 1
                stats["errors"].append(result)

    return stats

def create_vector_store(store_name: str) -> dict:
    try:
        vector_store = client.vector_stores.create(name=store_name)
        details = {
            "id": vector_store.id,
            "name": vector_store.name,
            "created_at": vector_store.created_at,
            "file_count": vector_store.file_counts.completed
        }
        print("Vector store created:", details)
        return details
    except Exception as e:
        print(f"Error creating vector store: {e}")
        return {}
store_name = "openai_blog_store"
vector_store_details = create_vector_store(store_name)
upload_pdf_files_to_vector_store(vector_store_details["id"])
Vector store created: {'id': 'vs_67d06b9b9a9c8191bafd456cf2364ce3', 'name': 'openai_blog_store', 'created_at': 1741712283, 'file_count': 0}
21 PDF files to process. Uploading in parallel...
100%|███████████████████████████████| 21/21 [00:09<00:00,  2.32it/s]
{'total_files': 21,
 'successful_uploads': 21,
 'failed_uploads': 0,
 'errors': []}

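Now that our vector store is ready, we can query it directly and retrieve the content relevant to a specific query. With the new vector search API, we can find relevant items in our knowledge base without having to wrap the search in an LLM query.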

query = "What's Deep Research?"
search_results = client.vector_stores.search(
    vector_store_id=vector_store_details['id'],
    query=query
)
for result in search_results.data:
    print(str(len(result.content[0].text)) + ' of character of content from ' + result.filename + ' with a relevant score of ' + str(result.score))
3502 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.9813588865322393
3493 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.9522476825143714
3634 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.9397930296526796
2774 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.9101975747303771
3474 of character of content from Deep research System Card _ OpenAI.pdf with a relevant score of 0.9036647613464299
3123 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.887120981288272
3343 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.8448454849432881
3262 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.791345286655509
3271 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.7485530025091963
2721 of character of content from Introducing deep research _ OpenAI.pdf with a relevant score of 0.734033360849088

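We can see that the search query returned results of different sizes (and, under the hood, different text). Each result has its own relevance score, computed by our ranker, which uses hybrid search.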
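Because several chunks can come from the same file, it can help to summarize search results per file. The sketch below keeps the best score for each filename and optionally drops low-scoring chunks. The helper name `best_score_per_file`, the `min_score` cutoff, and the mock objects are assumptions made for illustration; only the `filename` and `score` attributes come from the search results shown above.

```python
from types import SimpleNamespace


def best_score_per_file(results, min_score: float = 0.0) -> dict:
    """Map each filename to its highest chunk score, sorted best first."""
    best = {}
    for result in results:
        if result.score < min_score:
            continue  # ignore chunks below the (hypothetical) cutoff
        if result.filename not in best or result.score > best[result.filename]:
            best[result.filename] = result.score
    return dict(sorted(best.items(), key=lambda kv: kv[1], reverse=True))


# Mock results standing in for search_results.data; values are made up.
mock_results = [
    SimpleNamespace(filename="intro.pdf", score=0.98),
    SimpleNamespace(filename="card.pdf", score=0.90),
    SimpleNamespace(filename="intro.pdf", score=0.73),
]
summary = best_score_per_file(mock_results, min_score=0.75)
print(summary)  # {'intro.pdf': 0.98, 'card.pdf': 0.9}
```

With the real client, the same function could be called on `search_results.data` from the cell above.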

Integrating search results with the LLM in a single API call

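However, instead of querying the vector store and then passing the data into a Responses or Chat Completions API call, a more convenient way to use these search results in an LLM query is to plug in the file_search tool directly as part of the OpenAI Responses API.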

query = "What's Deep Research?"
response = client.responses.create(
    input= query,
    model="gpt-4o-mini",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store_details['id']],
    }]
)

# Extract annotations from the response
annotations = response.output[1].content[0].annotations
    
# Get top-k retrieved filenames
retrieved_files = set([result.filename for result in annotations])

print(f'Files used: {retrieved_files}')
print('Response:')
print(response.output[1].content[0].text) # output[0] is the file_search call, output[1] is the message
Files used: {'Introducing deep research _ OpenAI.pdf'}
Response:
Deep Research is a new capability introduced by OpenAI that allows users to conduct complex, multi-step research tasks on the internet efficiently. Key features include:

1. **Autonomous Research**: Deep Research acts as an independent agent that synthesizes vast amounts of information across the web, enabling users to receive comprehensive reports similar to those produced by a research analyst.

2. **Multi-Step Reasoning**: It performs deep analysis by finding, interpreting, and synthesizing data from various sources, including text, images, and PDFs.

3. **Application Areas**: Especially useful for professionals in fields such as finance, science, policy, and engineering, as well as for consumers seeking detailed information for purchases.

4. **Efficiency**: The output is fully documented with citations, making it easy to verify information, and it significantly speeds up research processes that would otherwise take hours for a human to complete.

5. **Limitations**: While Deep Research enhances research capabilities, it is still subject to limitations, such as potential inaccuracies in information retrieval and challenges in distinguishing authoritative data from unreliable sources.

Overall, Deep Research marks a significant advancement toward automated general intelligence (AGI) by improving access to thorough and precise research outputs.

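We can see that gpt-4o-mini was able to answer a question that requires recent, specialized knowledge about OpenAI's deep research. It used the most relevant chunks from the file Introducing deep research _ OpenAI.pdf. To analyze the retrieved chunks in more depth, we can also inspect the individual texts returned by the search engine by adding include=["output[*].file_search_call.search_results"] to the query.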
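The cell above reads `response.output[1]` by position, which breaks if the output items arrive in a different order. A slightly more defensive sketch is shown below. It assumes each output item exposes a `type` field whose value is "message" for the text item; the helper name `cited_filenames` and the mock response are hypothetical and exist only to make the example runnable without an API call.

```python
from types import SimpleNamespace


def cited_filenames(response) -> list[str]:
    """Return the filenames cited in message items, de-duplicated, in first-seen order."""
    seen = []
    for item in response.output:
        if getattr(item, "type", None) != "message":
            continue  # skip the file_search call and any other tool items
        for part in item.content or []:
            for annotation in getattr(part, "annotations", None) or []:
                name = getattr(annotation, "filename", None)
                if name and name not in seen:
                    seen.append(name)
    return seen


# Hypothetical stand-in for a Responses API result, built only for illustration.
fake_response = SimpleNamespace(output=[
    SimpleNamespace(type="file_search_call"),
    SimpleNamespace(type="message", content=[
        SimpleNamespace(text="...", annotations=[
            SimpleNamespace(filename="a.pdf"),
            SimpleNamespace(filename="b.pdf"),
            SimpleNamespace(filename="a.pdf"),
        ]),
    ]),
])
print(cited_filenames(fake_response))  # ['a.pdf', 'b.pdf']
```

Returning a list rather than a set keeps the citation order, which matters later when ranks are computed.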

Evaluating performance

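For these information retrieval systems, it is key to measure the relevance and quality of the files retrieved for each answer. The next steps of this cookbook generate an evaluation dataset and compute several metrics over it. This is an imperfect approach, and we always recommend a human-verified evaluation dataset for your own use case, but it shows the methodology. It is imperfect because some of the generated questions can be generic (for example: "What did the main stakeholder say in this document?"), and our retrieval test will have a hard time figuring out which document such a question was generated for.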

Generating questions

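We will create functions that read the locally stored PDFs and generate a question that can only be answered by that document. This gives us an evaluation dataset to use later on.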

def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        with open(pdf_path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
    return text

def generate_questions(pdf_path):
    text = extract_text_from_pdf(pdf_path)

    prompt = (
        "Can you generate a question that can only be answered from this document?:\n"
        f"{text}\n\n"
    )

    response = client.responses.create(
        input=prompt,
        model="gpt-4o",
    )

    question = response.output[0].content[0].text

    return question

If we run generate_questions on the first PDF file, we can see the kind of question it generates.

generate_questions(pdf_files[0])
'What new capabilities will ChatGPT have as a result of the partnership between OpenAI and Schibsted Media Group?'

We can now generate a question for every PDF stored locally.

# Generate questions for each PDF and store in a dictionary
questions_dict = {}
for pdf_path in pdf_files:
    questions = generate_questions(pdf_path)
    questions_dict[os.path.basename(pdf_path)] = questions
questions_dict
{'OpenAI partners with Schibsted Media Group _ OpenAI.pdf': 'What is the purpose of the partnership between Schibsted Media Group and OpenAI announced on February 10, 2025?',
 'OpenAI and the CSU system bring AI to 500,000 students & faculty _ OpenAI.pdf': 'What significant milestone did the California State University system achieve by partnering with OpenAI, making it the first of its kind in the United States?',
 '1,000 Scientist AI Jam Session _ OpenAI.pdf': 'What was the specific AI model used during the "1,000 Scientist AI Jam Session" event across the nine national labs?',
 'Announcing The Stargate Project _ OpenAI.pdf': 'What are the initial equity funders and lead partners in The Stargate Project announced by OpenAI, and who holds the financial and operational responsibilities?',
 'Introducing Operator _ OpenAI.pdf': 'What is the name of the new model that powers the Operator agent introduced by OpenAI?',
 'Introducing NextGenAI _ OpenAI.pdf': 'What major initiative did OpenAI launch on March 4, 2025, and which research institution from Europe is involved as a founding partner?',
 'Introducing the Intelligence Age _ OpenAI.pdf': "What is the name of the video generation tool used by OpenAI's creative team to help produce their Super Bowl ad?",
 'Operator System Card _ OpenAI.pdf': 'What is the preparedness score for the "Cybersecurity" category according to the Operator System Card?',
 'Strengthening America’s AI leadership with the U.S. National Laboratories _ OpenAI.pdf': "What is the purpose of OpenAI's agreement with the U.S. National Laboratories as described in the document?",
 'OpenAI GPT-4.5 System Card _ OpenAI.pdf': 'What is the Preparedness Framework rating for "Cybersecurity" for GPT-4.5 according to the system card?',
 'Partnering with Axios expands OpenAI’s work with the news industry _ OpenAI.pdf': "What is the goal of OpenAI's new content partnership with Axios as announced in the document?",
 'OpenAI and Guardian Media Group launch content partnership _ OpenAI.pdf': 'What is the main purpose of the partnership between OpenAI and Guardian Media Group announced on February 14, 2025?',
 'Introducing GPT-4.5 _ OpenAI.pdf': 'What is the release date of the GPT-4.5 research preview?',
 'Introducing data residency in Europe _ OpenAI.pdf': 'What are the benefits of data residency in Europe for new ChatGPT Enterprise and Edu customers according to the document?',
 'The power of personalized AI _ OpenAI.pdf': 'What is the purpose of the "Model Spec" document published by OpenAI for ChatGPT?',
 'Disrupting malicious uses of AI _ OpenAI.pdf': "What is OpenAI's mission as stated in the document?",
 'Sharing the latest Model Spec _ OpenAI.pdf': 'What is the release date of the latest Model Spec mentioned in the document?',
 'Deep research System Card _ OpenAI.pdf': "What specific publication date is mentioned in the Deep Research System Card for when the report on deep research's preparedness was released?",
 'Bertelsmann powers creativity and productivity with OpenAI _ OpenAI.pdf': 'What specific AI-powered solutions is Bertelsmann planning to implement for its divisions RTL Deutschland and Penguin Random House according to the document?',
 'OpenAI’s Economic Blueprint _ OpenAI.pdf': 'What date and location is scheduled for the kickoff event of OpenAI\'s "Innovating for America" initiative as mentioned in the Economic Blueprint document?',
 'Introducing deep research _ OpenAI.pdf': 'What specific model powers the "deep research" capability in ChatGPT that is discussed in this document, and what are its main features designed for?'}

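We now have a filename:question dictionary that we can loop through, asking gpt-4o(-mini) each question without providing the document. The model should be able to find the relevant document in the vector store.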

We convert the dictionary into a list of rows and process each one with gpt-4o-mini, keeping track of the expected file for each query.

rows = []
for filename, query in questions_dict.items():
    rows.append({"query": query, "_id": filename.replace(".pdf", "")})

# Metrics evaluation parameters
k = 5
total_queries = len(rows)
correct_retrievals_at_k = 0
reciprocal_ranks = []
average_precisions = []

def process_query(row):
    query = row['query']
    expected_filename = row['_id'] + '.pdf'
    # Call file_search via Responses API
    response = client.responses.create(
        input=query,
        model="gpt-4o-mini",
        tools=[{
            "type": "file_search",
            "vector_store_ids": [vector_store_details['id']],
            "max_num_results": k,
        }],
        tool_choice="required" # it will force the file_search, while not necessary, it's better to enforce it as this is what we're testing
    )
    # Extract annotations from the response
    annotations = None
    if hasattr(response.output[1], 'content') and response.output[1].content:
        annotations = response.output[1].content[0].annotations
    elif hasattr(response.output[1], 'annotations'):
        annotations = response.output[1].annotations

    if annotations is None:
        print(f"No annotations for query: {query}")
        return False, 0, 0

    # Get top-k retrieved filenames
    retrieved_files = [result.filename for result in annotations[:k]]
    if expected_filename in retrieved_files:
        rank = retrieved_files.index(expected_filename) + 1
        rr = 1 / rank
        correct = True
    else:
        rr = 0
        correct = False

    # Calculate Average Precision
    precisions = []
    num_relevant = 0
    for i, fname in enumerate(retrieved_files):
        if fname == expected_filename:
            num_relevant += 1
            precisions.append(num_relevant / (i + 1))
    avg_precision = sum(precisions) / len(precisions) if precisions else 0
    
    if expected_filename not in retrieved_files:
        print("Expected file NOT found in the retrieved files!")
        
    if retrieved_files and retrieved_files[0] != expected_filename:
        print(f"Query: {query}")
        print(f"Expected file: {expected_filename}")
        print(f"First retrieved file: {retrieved_files[0]}")
        print(f"Retrieved files: {retrieved_files}")
        print("-" * 50)
    
    
    return correct, rr, avg_precision
process_query(rows[0])
(True, 1.0, 1.0)

In this example, recall and precision are both 1 and our file is ranked first, so MRR and MAP are both 1 as well.
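To see how the per-query numbers behave when the expected file is not ranked first, here is a small standalone sketch of reciprocal rank and average precision with a single relevant file per query. It follows the same arithmetic as process_query above; the function names and the toy filenames are illustrative assumptions.

```python
def reciprocal_rank(retrieved: list[str], expected: str) -> float:
    """1 / rank of the first occurrence of the expected file, or 0 if absent."""
    for rank, name in enumerate(retrieved, start=1):
        if name == expected:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: list[str], expected: str) -> float:
    """Mean of precision values taken at each position where the expected file appears."""
    hits = 0
    precisions = []
    for rank, name in enumerate(retrieved, start=1):
        if name == expected:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


ranked = ["b.pdf", "a.pdf", "c.pdf"]  # toy ranking, expected file in second place
print(reciprocal_rank(ranked, "a.pdf"))    # 0.5
print(average_precision(ranked, "a.pdf"))  # 0.5
print(reciprocal_rank(ranked, "z.pdf"))    # 0.0
```

MRR and MAP are then just the means of these per-query values over the whole question set.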

We can now run this processing over our whole set of questions.

with ThreadPoolExecutor() as executor:
    results = list(tqdm(executor.map(process_query, rows), total=total_queries))

correct_retrievals_at_k = 0
reciprocal_ranks = []
average_precisions = []

for correct, rr, avg_precision in results:
    if correct:
        correct_retrievals_at_k += 1
    reciprocal_ranks.append(rr)
    average_precisions.append(avg_precision)

recall_at_k = correct_retrievals_at_k / total_queries
precision_at_k = recall_at_k  # Reported as the hit rate here (one expected file per query), so it equals recall
mrr = sum(reciprocal_ranks) / total_queries
map_score = sum(average_precisions) / total_queries
 62%|███████████████████▏           | 13/21 [00:07<00:03,  2.57it/s]
Expected file NOT found in the retrieved files!
Query: What is OpenAI's mission as stated in the document?
Expected file: Disrupting malicious uses of AI _ OpenAI.pdf
First retrieved file: Introducing the Intelligence Age _ OpenAI.pdf
Retrieved files: ['Introducing the Intelligence Age _ OpenAI.pdf']
--------------------------------------------------
 71%|██████████████████████▏        | 15/21 [00:14<00:06,  1.04s/it]
Expected file NOT found in the retrieved files!
Query: What is the purpose of the "Model Spec" document published by OpenAI for ChatGPT?
Expected file: The power of personalized AI _ OpenAI.pdf
First retrieved file: Sharing the latest Model Spec _ OpenAI.pdf
Retrieved files: ['Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf', 'Sharing the latest Model Spec _ OpenAI.pdf']
--------------------------------------------------
100%|███████████████████████████████| 21/21 [00:15<00:00,  1.38it/s]

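The logs above show cases where the file our evaluation dataset expected was not ranked first, or was not found at all. As this imperfect evaluation dataset illustrates, some questions are generic enough that another document answers them just as well, so the retrieval system returned that document instead of the one the dataset expected.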
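The second log entry above lists the same file five times, because the list is built from citations rather than from distinct documents. One option, sketched below, is to de-duplicate the filenames while preserving order before computing ranks. This is an assumption about how one might adapt the evaluation, not what the cells above do, and it would change the reported metrics. The helper name `unique_in_order` and the toy filenames are hypothetical.

```python
def unique_in_order(names: list[str]) -> list[str]:
    """Drop repeated filenames while keeping the first-seen order."""
    return list(dict.fromkeys(names))  # dicts preserve insertion order


cited = ["b.pdf", "b.pdf", "a.pdf", "b.pdf", "c.pdf"]  # toy citation list
documents = unique_in_order(cited)
print(documents)  # ['b.pdf', 'a.pdf', 'c.pdf']

# Rank of the expected file before and after de-duplication
print(cited.index("a.pdf") + 1)      # 3
print(documents.index("a.pdf") + 1)  # 2
```

After de-duplication the rank counts documents rather than citations, so a file is no longer pushed down by repeated citations of the file ahead of it.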

# Print the metrics with k
print(f"Metrics at k={k}:")
print(f"Recall@{k}: {recall_at_k:.4f}")
print(f"Precision@{k}: {precision_at_k:.4f}")
print(f"Mean Reciprocal Rank (MRR): {mrr:.4f}")
print(f"Mean Average Precision (MAP): {map_score:.4f}")
Metrics at k=5:
Recall@5: 0.9048
Precision@5: 0.9048
Mean Reciprocal Rank (MRR): 0.9048
Mean Average Precision (MAP): 0.8954

In this cookbook, we saw how to:

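  • Generate an evaluation dataset by extracting PDF text with a traditional PDF reader (PyPDF2) and stuffing it into a gpt-4o prompt
  • Create a vector store and populate it with PDFs
  • Get an LLM answer to a query with an out-of-the-box RAG system, using the file_search tool call in OpenAI's Responses API
  • Understand how chunks of text are retrieved, ranked, and used as part of the Responses API
  • Measure recall, precision, MRR, and MAP on the previously generated evaluation dataset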

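By combining file search with the new Responses API, you can simplify your RAG architecture and handle it in a single API call. File storage, embeddings, and retrieval are all integrated in one tool!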