November 12, 2024

Leveraging the GPT-4o Vision Modality to Enhance Retrieval-Augmented Generation

When implementing Retrieval-Augmented Generation (RAG), documents rich in images, charts, and tables present unique challenges. Traditional RAG pipelines handle text data well, but they often fall short when visual elements play a key role in conveying information. This guide bridges that gap by using the vision modality to extract and interpret visual content, ensuring that generated responses are as informative and accurate as possible.

Our approach involves parsing the document into images and using metadata tags to identify pages that contain pictures, charts, and tables. When semantic search retrieves such a page, we pass the page image to the vision model rather than relying on text alone. This enhances the model's ability to understand and answer user queries that relate to visual data.

In this guide, we will explore and demonstrate the following key concepts:

1. Setting up a vector store with Pinecone:
  • Learn how to initialize and configure Pinecone to store vector embeddings efficiently.
2. Parsing PDFs and extracting visual information:
  • Explore techniques for converting PDF pages into images.
  • Use the GPT-4o vision modality to extract textual information from pages that contain images, charts, or tables.
3. Generating embeddings:
  • Use an embedding model to create vector representations of the text data.
  • Flag pages that contain visual content so that we can set a metadata flag in the vector store and retrieve the images to pass to GPT-4o via the vision modality.
4. Uploading embeddings to Pinecone:
  • Upload these embeddings to Pinecone for storage and retrieval.
5. Running semantic search to find relevant pages:
  • Implement semantic search over the page text to find the pages that best match a user's query.
  • Provide the matching page text to GPT-4o as context to answer the user's query.
6. Processing pages with visual content (optional step):
  • Learn how to pass images via the GPT-4o vision modality for question answering with additional context.
  • Understand how this process improves the accuracy of responses involving visual data.

By the end of this guide, you will have a comprehensive understanding of how to implement a RAG system that can process and interpret documents containing complex visual elements. This knowledge will enable you to build AI solutions that provide richer, more accurate information, improving user satisfaction and engagement.

We will use the World Bank report, A Better Bank for a Better World: Annual Report 2024, to illustrate these concepts, as it contains a mix of images, tables, and chart data.

Note that using the vision modality is resource intensive, which results in higher latency and cost. It is recommended only when text-only extraction approaches underperform on your evaluation benchmarks. With that context, let's dive in.

Step 1: Set up the vector store with Pinecone

In this section, we set up a vector store with Pinecone to efficiently store and manage our embeddings. Pinecone is a vector database optimized for handling high-dimensional vector data, which is essential for tasks such as semantic search and similarity matching.

Prerequisites

  1. Sign up for Pinecone and follow the instructions in the Pinecone Database quickstart to obtain an API key.
  2. Install the Pinecone SDK with pip install "pinecone[grpc]". gRPC (gRPC Remote Procedure Calls) is a high-performance, open-source, general-purpose RPC framework that uses HTTP/2 for transport and Protocol Buffers (protobuf) as its interface definition language, and supports client-server communication in distributed systems. It is designed to make service-to-service communication more efficient, which makes it well suited to microservice architectures.

Store the API key securely

  1. For security, store your API key in a .env file in your project directory, in the following format:
    PINECONE_API_KEY=your-api-key-here
  2. Install python-dotenv with pip install python-dotenv to read the API key from the .env file.

Create the Pinecone index
We will use the create_index function to initialize our embeddings database on Pinecone. There are two key parameters to consider:

  1. Dimension: This must match the dimension of the embeddings produced by your chosen model. For example, OpenAI's text-embedding-ada-002 model produces 1536-dimensional embeddings, while text-embedding-3-large produces 3072-dimensional embeddings. We will use text-embedding-3-large, so we set the dimension to 3072.

  2. Metric: The distance metric determines how similarity between vectors is computed. Pinecone supports several metrics, including cosine, dot product, and euclidean. In this tutorial we will use cosine similarity. You can learn more in the Pinecone distance metrics documentation.
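If you want to double-check the dimension of your chosen embedding model before creating the index, a quick sanity check might look like the sketch below (it assumes your OPENAI_API_KEY environment variable is set; the sample text is arbitrary):

from openai import OpenAI

client = OpenAI()

# Create a single embedding and inspect its length; this should print 3072 for text-embedding-3-large
sample = client.embeddings.create(model="text-embedding-3-large", input="dimension check")
print(len(sample.data[0].embedding))

With the dimension confirmed, the index-creation code below can be run as-is.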

import os
import time
# Import the Pinecone library
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec

from dotenv import load_dotenv

load_dotenv()

api_key = os.getenv("PINECONE_API_KEY")

# Initialize a Pinecone client with your API key
pc = Pinecone(api_key)

# Create a serverless index
index_name = "my-test-index"

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=3072,
        metric="cosine",
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

# Wait for the index to be ready
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

Navigate to the index list in the Pinecone console; you should see my-test-index listed there.
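If you prefer to verify from code rather than the console, the same client methods used above can confirm that the index exists and is ready (a minimal sketch):

# Optionally verify the index from code instead of the console
print(pc.has_index(index_name))       # True once the index exists
print(pc.describe_index(index_name))  # Shows name, dimension, metric, and status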

Step 2: Parse the PDF and extract visual information

In this section, we parse the World Bank report PDF, A Better Bank for a Better World: Annual Report 2024, and extract both textual and visual information, such as descriptions of images, charts, and tables. The process involves three main steps:

  1. Parse the PDF into individual pages: We split the PDF into separate pages for easier processing.
  2. Convert PDF pages into images: This allows the GPT-4o vision capability to analyze each page as an image.
  3. Process images and tables: We instruct GPT-4o to extract the text while also describing any images, charts, or tables in the document.

Prerequisites

Before proceeding, make sure the following packages are installed, and ensure your OpenAI API key is set as an environment variable. You may also need to install Poppler for PDF rendering.

pip install PyPDF2 pdf2image pytesseract pandas tqdm
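Poppler is a system dependency used by pdf2image rather than a Python package; as an example, it can typically be installed with brew install poppler on macOS or sudo apt-get install poppler-utils on Debian/Ubuntu.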

Step-by-step breakdown:

1. Download and chunk the PDF file:

  • The chunk_document function downloads the PDF from the provided URL and splits it into individual pages using PyPDF2.
  • Each page is stored in a list as a separate PDF byte stream.

2. Convert PDF pages into images:

  • The convert_page_to_image function takes the bytes of a single-page PDF and converts it into an image using pdf2image.
  • The images are saved locally in an 'images' directory for further processing.

3. Extract text using the GPT-4o vision modality:

  • The get_vision_response function uses GPT-4o's vision capability to extract text from the page image.
  • This approach can extract textual information even from scanned documents.
  • Note that this mode is resource intensive, so it comes with higher latency and cost.

4. Process the entire document:

  • The process_document function orchestrates the processing of each page.
  • It uses a progress bar (tqdm) to show processing status.
  • The information extracted from each page is collected into a list and then converted into a Pandas DataFrame.

import base64
import requests
import os
import pandas as pd
from PyPDF2 import PdfReader, PdfWriter
from pdf2image import convert_from_bytes
from io import BytesIO
from openai import OpenAI
from tqdm import tqdm

# Link to the document we will use as the example 
document_to_parse = "https://documents1.worldbank.org/curated/en/099101824180532047/pdf/BOSIB13bdde89d07f1b3711dd8e86adb477.pdf"

# OpenAI client 
oai_client = OpenAI()


# Chunk the PDF document into single page chunks 
def chunk_document(document_url):
    # Download the PDF document
    response = requests.get(document_url)
    pdf_data = response.content

    # Read the PDF data using PyPDF2
    pdf_reader = PdfReader(BytesIO(pdf_data))
    page_chunks = []

    for page_number, page in enumerate(pdf_reader.pages, start=1):
        pdf_writer = PdfWriter()
        pdf_writer.add_page(page)
        pdf_bytes_io = BytesIO()
        pdf_writer.write(pdf_bytes_io)
        pdf_bytes_io.seek(0)
        pdf_bytes = pdf_bytes_io.read()
        page_chunk = {
            'pageNumber': page_number,
            'pdfBytes': pdf_bytes
        }
        page_chunks.append(page_chunk)

    return page_chunks


# Function to encode the image
def encode_image(local_image_path):
    with open(local_image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


# Function to convert page to image     
def convert_page_to_image(pdf_bytes, page_number):
    # Convert the PDF page to an image
    images = convert_from_bytes(pdf_bytes)
    image = images[0]  # There should be only one page

    # Define the directory to save images (relative to your script)
    images_dir = 'images'  # Use relative path here

    # Ensure the directory exists
    os.makedirs(images_dir, exist_ok=True)

    # Save the image to the images directory
    image_file_name = f"page_{page_number}.png"
    image_file_path = os.path.join(images_dir, image_file_name)
    image.save(image_file_path, 'PNG')

    # Return the relative image path
    return image_file_path


# Pass the image to the LLM for interpretation  
def get_vision_response(prompt, image_path):
    # Getting the base64 string
    base64_image = encode_image(image_path)

    response = oai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        },
                    },
                ],
            }
        ],
    )
    return response


# Process document function that brings it all together 
def process_document(document_url):
    try:
        # Update document status to 'Processing'
        print("Document processing started")

        # Get per-page chunks
        page_chunks = chunk_document(document_url)
        total_pages = len(page_chunks)

        # Prepare a list to collect page data
        page_data_list = []

        # Add progress bar here
        for page_chunk in tqdm(page_chunks, total=total_pages, desc='Processing Pages'):
            page_number = page_chunk['pageNumber']
            pdf_bytes = page_chunk['pdfBytes']

            # Convert page to image
            image_path = convert_page_to_image(pdf_bytes, page_number)

            # Prepare question for vision API
            system_prompt = (
                "The user will provide you an image of a document file. Perform the following actions: "
                "1. Transcribe the text on the page. **TRANSCRIPTION OF THE TEXT:**"
                "2. If there is a chart, describe the image and include the text **DESCRIPTION OF THE IMAGE OR CHART**"
                "3. If there is a table, transcribe the table and include the text **TRANSCRIPTION OF THE TABLE**"
            )

            # Get vision API response
            vision_response = get_vision_response(system_prompt, image_path)

            # Extract text from vision response
            text = vision_response.choices[0].message.content

            # Collect page data
            page_data = {
                'PageNumber': page_number,
                'ImagePath': image_path,
                'PageText': text
            }
            page_data_list.append(page_data)

        # Create DataFrame from page data
        pdf_df = pd.DataFrame(page_data_list)
        print("Document processing completed.")
        print("DataFrame created with page data.")

        # Return the DataFrame
        return pdf_df

    except Exception as err:
        print(f"Error processing document: {err}")
        # Update document status to 'Error'


df = process_document(document_to_parse)
Document processing started
Processing Pages: 100%|██████████| 49/49 [18:54<00:00, 23.14s/it]
Document processing completed.
DataFrame created with page data.
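Since the vision pass over all 49 pages takes roughly 19 minutes, you may want to persist the resulting DataFrame locally so you do not have to re-run it; a minimal sketch using pandas (the file name is arbitrary):

# Optional: cache the processed pages to avoid repeating the vision pass
df.to_pickle("processed_pages.pkl")

# Later, reload the cached DataFrame with:
# df = pd.read_pickle("processed_pages.pkl")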

Let's inspect the DataFrame to make sure the pages were processed correctly. For brevity, we will fetch and display only the first five rows. You should also see the generated page images in the 'images' directory.

from IPython.display import display, HTML

# Convert the DataFrame to an HTML table and display top 5 rows 
display(HTML(df.head().to_html()))
PageNumber ImagePath PageText
0 1 images/page_1.png **TRANSCRIPTION OF THE TEXT:**\n\nPublic Disclosure Authorized \nPublic Disclosure Authorized \nA BETTER BANK FOR A BETTER WORLD \nANNUAL REPORT 2024 \nWORLD BANK GROUP \nInternational Bank for Reconstruction and Development · International Development Association \n\n**DESCRIPTION OF THE IMAGE OR CHART:**\n\nThe image shows a night-time scene in which a makeshift shelter glows from the inside. The shelter appears to be made of fabric with a patterned surface. People can be seen inside through an opening, and items such as shoes are visible on the ground outside the shelter. The scene suggests a community or family setting under a starry sky. A circular graphic element overlaid on the image may allude to interconnectedness or global impact.
1 2 images/page_2.png **TRANSCRIPTION OF THE TEXT:**\n\nCONTENTS\n\nMessage from the President 6 Message from the Executive Directors 8 Becoming a Better Bank 10 Fiscal 2024 Financial Summary 12 Results by Region 14 Results by Theme 44 How We Work 68 Key Tables\nIBRD Key Financial Indicators, Fiscal 2020-24 84 IDA Key Financial Indicators, Fiscal 2020-24 88 This annual report, which covers the period from July 1, 2023, to June 30, 2024, has been prepared by the Executive Directors of both the International Bank for Reconstruction and Development (IBRD) and the International Development Association (IDA), collectively known as the World Bank, in accordance with the respective bylaws of the two institutions. Ajay Banga, President of the World Bank Group and Chairman of the Board of Executive Directors, has submitted this report, together with the accompanying administrative budgets and audited financial statements, to the Board of Governors. Annual reports for the other World Bank Group institutions, the International Finance Corporation (IFC), the Multilateral Investment Guarantee Agency (MIGA), and the International Centre for Settlement of Investment Disputes (ICSID), are published separately. Key highlights from each institution's annual report are available in the World Bank Group Annual Report Summary. Throughout this report, the term World Bank and the abbreviated Bank refer only to IBRD and IDA, while World Bank Group and the abbreviated Bank Group refer to all five institutions. Unless otherwise specified, all dollar amounts in this report are current U.S. dollars. For regional breakdowns, funding for multiregional operations is attributed to recipient countries in tables and text where possible. For sector and theme breakdowns, funding is listed by operation. Fiscal-year commitment and disbursement data are consistent with the audited figures reported in the fiscal 2024 IBRD and IDA financial statements and Management's Discussion and Analysis documents. Because of rounding, numbers in tables may not add up to totals, and percentages in figures may not sum to 100. **DESCRIPTION OF THE IMAGE OR CHART**\n\nThe image shows a close-up of a hand holding a bundle of rice plants, with golden grains clearly visible. The blurred background shows more of the rice field.
2 3 images/page_3.png **TRANSCRIPTION OF THE TEXT:**\n\nWHO WE ARE\n\nThe World Bank Group is one of the world's largest sources of funding and knowledge for developing countries. Our five institutions work together to reduce poverty, increase shared prosperity, and promote sustainable development.\n\nOUR VISION\nOur vision is to create a world free of poverty on a livable planet.\n\nOUR MISSION\nOur mission is to end extreme poverty and boost shared prosperity on a livable planet. This goal is threatened by multiple intertwined crises, and time is of the essence. We are building a better Bank to drive impactful development that is:\n• Inclusive of everyone, including women and youth;\n• Resilient to shocks, including climate and biodiversity crises, pandemics, and fragility;\n• Sustainable, through growth and job creation, human development, fiscal and debt management, food security, and access to clean air, water, and affordable energy.\n\nTo achieve this, we will work with all of our clients as one World Bank Group, in close partnership with other multilateral institutions, the private sector, and civil society.\n\nOUR CORE VALUES\nOur work is guided by our core values: impact, integrity, respect, teamwork, and innovation. These values guide everything we do around the world.
3 4 images/page_4.png **TRANSCRIPTION OF THE TEXT:**\n\nDRIVING ACTION, MEASURING RESULTS\n\nThe World Bank Group delivers impactful, meaningful development results around the world. In the first half of fiscal 2024*, we:\n\n- Helped feed 156 million people\n- Improved education for 280 million students\n- Provided effective social protection support to 287 million poor people†\n- Provided 59 million people with clean drinking water, sanitation, and/or hygiene services\n- Helped 77 million people gain access to sustainable transport\n- Delivered 17 gigawatts of renewable energy capacity\n- Committed to directing 45 percent of annual financing to climate action by 2025, split evenly between mitigation and adaptation\n\n*The new Scorecard was still under development at the time of printing, so this report includes results data only through December 31, 2023.\nComplete fiscal 2024 Scorecard data will be published at https://scorecard.worldbankgroup.org at the time of the 2024 IMF-World Bank Group Annual Meetings.\n\n†IBRD and IDA indicators only.\n\nIn fiscal 2024, the World Bank Group announced a new Scorecard that will track results through 22 indicators (down from 150), providing a concise and clear picture of the Bank Group's progress across its missions, from improving access to health care to building sustainable food systems to mobilizing private investment.\n\nFor the first time, the work of all Bank Group financing institutions will be tracked through a single set of indicators. The new Scorecard will track the Bank Group's overarching vision of ending poverty on a livable planet.\n\n2024 WORLD BANK ANNUAL REPORT\n\n**DESCRIPTION OF THE IMAGE OR CHART:**\n\nThe image shows a series of circular photographs connected to text highlights, depicting the World Bank Group's achievements. The photos include people and infrastructure related to food, education, social protection, water, transport, renewable energy, and environmental initiatives. Each photo is paired with text describing a specific achievement or commitment.
4 5 images/page_5.png **TRANSCRIPTION OF THE TEXT:**\n\nMESSAGE FROM THE PRESIDENT\n\nDELIVERING ON OUR COMMITMENTS REQUIRES US TO DEVELOP NEW AND BETTER WAYS OF WORKING. IN FISCAL 2024, WE DID JUST THAT.\n\nAJAY BANGA\n\nIn fiscal 2024, the World Bank Group adopted a bold new vision of a world free of poverty on a livable planet. To achieve this, the Bank Group is enacting reforms to become a better partner to governments, the private sector, and, ultimately, the people we serve. Rarely in our 80-year history has our work been more urgent: We face declining progress in our fight against poverty, an existential climate crisis, mounting public debt, food insecurity, an unequal pandemic recovery, and the effects of geopolitical conflict.\n\nResponding to these intertwined challenges requires a faster, simpler, and more efficient World Bank Group. We are refocusing to confront these challenges not just through funding, but with knowledge. Our Knowledge Compact for Action, published in fiscal 2024, details how we will empower all Bank Group clients, public and private, by making our wealth of development knowledge more accessible. And we have reorganized the World Bank’s global practices into five Vice Presidency units—People, Prosperity, Planet, Infrastructure, and Digital—for more flexible and faster engagements with clients. Each of these units reached important milestones in fiscal 2024.\n\nWe are supporting countries in delivering quality, affordable health services to 1.5 billion people by 2030 so our children and grandchildren will lead healthier, better lives. This is part of our larger global effort to address a basic standard of care through every stage of a person’s life—infancy, childhood, adolescence, and adulthood. To help people withstand food-affected shocks and crises, we are strengthening social protection services to support half a billion people by the end of 2030—aiming for half of these beneficiaries to be women.\n\nWe are helping developing countries create jobs and employment, the surest enablers of prosperity. In the next 10 years, 1.2 billion young people across the Global South will become working-age adults. Yet, in the same period and the same countries, only 424 million jobs are expected to be created. The cost of hundreds of millions of young people with no hope for a decent job or future is unimaginable, and we are working urgently to create opportunity for all.\n\nIn response to climate change—arguably the greatest challenge of our generation—we’re channeling 45 percent of annual financing to climate action by 2025, deployed equally between mitigation and adaptation. Among other efforts, we intend to launch at least 15 country-led methane-reduction programs by fiscal 2026, and our Forest Carbon Partnership Facility has helped strengthen high-integrity carbon markets.\n\nAccess to electricity is a fundamental human right and foundational to any successful development effort. It will accelerate the digital development of developing countries, strengthen public infrastructure, and prepare people for the jobs of tomorrow. But half the population of Africa—600 million people—lacks access to electricity. In response, we have committed to provide electricity to 300 million people in Sub-Saharan Africa by 2030 in partnership with the African Development Bank.\n\nRecognizing that digitalization is the transformational opportunity of our time, we are collaborating with governments in more than 100 developing countries to enable digital economies. 
Our digital lending portfolio totaled $6.5 billion in commitments as of June 2024, and our new Digital Vice Presidency unit will guide our efforts to establish the foundations of a digital economy. Key measures include building and enhancing digital and data infrastructure, ensuring cybersecurity and data privacy for institutions, businesses, and citizens, and advancing digital government services.\n\nDelivering on our commitments requires us to develop new and better ways of working. In fiscal 2024, we did just that. We are squeezing our balance sheet and finding new opportunities to take more risk and boost our lending. Our new crisis preparedness and response tools, Global Challenge Programs, and Livable Planet Fund demonstrate how we are modernizing our approach to better thrive and meet outcomes. Our new Scorecard radically changes how we track results.\n\nBut we cannot deliver alone; we depend on our own. We need partners from both the public and private sectors to join our efforts. That’s why we are working closely with other multilateral development banks to improve the lives of people in developing countries in tangible, measurable ways. Our deepening relationship with the private sector is evidenced by our Private Sector Investment Lab, which is working to address the barriers preventing private sector investment in emerging markets. The Lab’s core group of 15 Chief Executive Officers and Chairs meets regularly, and already has informed our work—most notably with the development of the World Bank Group Guarantee Platform.\n\nThe impact and innovations we delivered this year will allow us to move forward with a raised ambition and a greater sense of urgency to improve people’s lives. I would like to recognize the remarkable efforts of our staff and Executive Directors, as well as the unwavering support of our clients and partners. Together, we head into fiscal 2025 with a great sense of optimism—and determination to create a better Bank for a better world.\n\nAJAY BANGA \nPresident of the World Bank Group \nand Chairman of the Board of Executive Directors\n\n**DESCRIPTION OF THE IMAGE OR CHART:**\n\nThe image shows a group of people engaged in agriculture. One person is holding a tomato, and others are observing. It reflects collaboration or assistance in agricultural practices, possibly in a developing country.

Let's look at an example page, page 21, which contains embedded graphics and text. We can see that the vision modality effectively extracted and described the visual information. For example, the pie chart on that page is accurately described as:

"FIGURE 6: MIDDLE EAST AND NORTH AFRICA IBRD AND IDA LENDING BY SECTOR - FISCAL 2024 SHARE OF TOTAL OF $4.6 BILLION" is a circular chart, similar to a pie chart, showing the percentage distribution of funding across sectors. The sectors include:

# Filter and print rows where pageNumber is 21
filtered_rows = df[df['PageNumber'] == 21]
for text in filtered_rows.PageText:
    print(text)
**TRANSCRIPTION OF THE TEXT:**

We also committed $35 million in grants to support emergency relief in Gaza. Working with the World Food Programme, the World Health Organization, and the UN Children’s Fund, the grants supported the delivery of emergency food, water, and medical supplies. In the West Bank, we approved a $200 million program for the continuation of education for children, $22 million to support municipal services, and $45 million to strengthen healthcare and hospital services.

**Enabling green and resilient growth**
To help policymakers in the region advance their climate change and development goals, we published Country Climate and Development Reports for the West Bank and Gaza, Lebanon, and Tunisia. In Libya, the catastrophic flooding in September 2023 devastated eastern localities, particularly the city of Derna. The World Bank, together with the UN and the European Union, produced a Rapid Damage and Needs Assessment to inform recovery and reconstruction efforts.

We signed a new Memorandum of Understanding (MoU) with the Islamic Development Bank to promote further collaboration between our institutions. The MoU focuses on joint knowledge and operational engagements around the energy, food, and water nexus, climate impact, empowering women and youth to engage with the private sector, and advancing the digital transition and regional integration. The MoU aims to achieve a co-financing value of $6 billion through 2026, 45 percent of which has already been met.

**Expanding economic opportunities for women**
The World Bank has drawn on a variety of instruments to support Jordan’s commitment to increase female labor force participation, including through the recently approved Country Partnership Framework. Through operations, technical assistance (such as Mashreq Gender Facility; Women Entrepreneurs Finance Initiative; and the Women, Business and the Law report), and policy dialogue, we have contributed to legal reforms in Jordan that removed job restrictions on women, prohibited gender-based discrimination in the workplace, and criminalized sexual harassment in the workplace. In fiscal 2024, we approved the first women-focused Bank project in the region: the Enhancing Women’s Economic Opportunities Program for Results aims to improve workplace conditions, increase financial inclusion and entrepreneurship, make public transport safer, and increase access to affordable, quality childcare services.

**Analyzing critical infrastructure needs**
We published an Interim Damage Assessment for Gaza in partnership with the UN and with financial support from the EU. This found that a preliminary estimate of the cost of damages to critical infrastructure from the conflict in Gaza between October 2023 and the end of January 2024 was around $18.5 billion—equivalent to 97 percent of the 2022 GDP of the West Bank and Gaza combined. When the situation allows, a full-fledged Rapid Damage and Needs Assessment will be conducted.

**COUNTRY IMPACT**

Egypt: The Bank-supported Takaful and Karama social protection program has reached 4.7 million vulnerable households, benefitting approximately 20 million individuals, 75 percent of them women.

Lebanon: A roads project has rehabilitated over 500 km of roads in 25 districts across the country and generated 1.3 million labor days for Lebanese workers and Syrian refugees.

Morocco: Our programs have benefited more than 400,000 people directly and more than 33 million people indirectly, through more than 230 disaster risk reduction projects.

**DESCRIPTION OF THE IMAGE OR CHART:**

The image is a pie chart titled "FIGURE 6: MIDDLE EAST AND NORTH AFRICA IBRD AND IDA LENDING BY SECTOR - FISCAL 2024 SHARE OF TOTAL OF $4.6 BILLION." The chart breaks down the sectors as follows:
- Public Administration: 24%
- Social Protection: 13%
- Health: 13%
- Education: 17%
- Agriculture, Fishing, and Forestry: 8%
- Water, Sanitation, and Waste Management: 8%
- Transportation: 5%
- Energy and Extractives: 3%
- Financial Sector: 1%
- Industry, Trade, and Services: 2%
- Information and Communications Technologies: 6%

**TRANSCRIPTION OF THE TABLE:**

TABLE 13: MIDDLE EAST AND NORTH AFRICA REGIONAL SNAPSHOT

| INDICATOR                                                | 2000   | 2012     | CURRENT DATA* |
|----------------------------------------------------------|--------|----------|---------------|
| Total population (millions)                              | 283.9  | 356.2    | 430.9         |
| Population growth (annual %)                             | 2.0    | 1.8      | 1.5           |
| GNI per capita (Atlas method, current US$)               | 1,595.5| 4,600.4  | 3,968.1       |
| GDP per capita growth (annual %)                         | 4.0    | 1.7      | 1.2           |
| Population living below $2.15 a day (millions)           | 9.7    | 8.2      | 19.1          |
| Life expectancy at birth, females (years)                | 70.8   | 73.9     | 74.8          |
| Life expectancy at birth, males (years)                  | 66.5   | 69.6     | 69.9          |
| Carbon dioxide emissions (megatons)                      | 813.2  | 1,297.7  | 1,370.9       |
| Extreme poverty (% of population below $2.15 a day, 2017 PPP)| 3.4 | 2.3      | 4.7           |
| Debt service as a proportion of exports of goods, services, and primary income | 15.1   | 5.2      | 12.4   |
| Ratio of female to male labor force participation rate (%) (modeled ILO estimate) | 24.5   | 26.2     | 23.2   |
| Vulnerable employment, total (% of total employment) (modeled ILO estimate) | 35.4   | 31.7     | 31.4   |
| Under-5 mortality rate per 1,000 live births             | 46.7   | 29.0     | 20.9          |
| Primary completion rate (% of relevant age group)        | 81.4   | 88.9     | 86.7          |
| Individuals using the Internet (% of population)         | 0.9    | 26.0     | 73.4          |
| Access to electricity (% of population)                  | 91.4   | 94.7     | 96.9          |
| Renewable energy consumption (% of total final energy consumption) | 3.0 | 3.6      | 2.9    |
| People using at least basic drinking water services (% of population) | 86.5   | 90.6     | 93.7   |
| People using at least basic sanitation services (% of population) | 79.4   | 86.2     | 90.4   |

*Note: ILO = International Labour Organization. PPP = purchasing power parity. a. The most current data available between 2018 and 2023; visit [https://data.worldbank.org](https://data.worldbank.org) for data updates.

For more information, visit [www.worldbank.org/mena](http://www.worldbank.org/mena).

Step 3: Generate embeddings

In this section, we focus on converting the text extracted from each page of the document into vector embeddings. These embeddings capture the semantic meaning of the text, enabling efficient similarity search and various natural language processing (NLP) tasks. We also identify pages that contain visual elements such as images, charts, or tables, and flag them for special handling.

Step-by-step breakdown:

1. Add a visual-content flag

To handle pages that contain visual information, in Step 2 we used the vision modality to extract content from charts, tables, and images. By including specific instructions in the prompt, we ensured that the model adds markers such as DESCRIPTION OF THE IMAGE OR CHART or TRANSCRIPTION OF THE TABLE when it describes visual content. In this step, if such markers are detected, we set the Visual_Input_Processed flag to 'Y'; otherwise it remains 'N'.

Although the vision modality captures most visual information effectively, for complex visuals such as engineering drawings some details can be lost in the conversion to text. In Step 6, we will use this flag to determine when to pass the page image to GPT-4o as additional context. This is an optional enhancement that can significantly improve the effectiveness of the RAG solution.

2. Generate embeddings with OpenAI's embedding model

We use OpenAI's embedding model, text-embedding-3-large, to generate high-dimensional embeddings that represent the semantic content of each page.

Note: It is critical that the dimension of the embedding model you use matches the configuration of the Pinecone vector store. In this example, we set up the Pinecone database with 3072 dimensions to match the default dimension of text-embedding-3-large.

# Add a column to flag pages with visual content
df['Visual_Input_Processed'] = df['PageText'].apply(
    lambda x: 'Y' if 'DESCRIPTION OF THE IMAGE OR CHART' in x or 'TRANSCRIPTION OF THE TABLE' in x else 'N'
)


# Function to get embeddings
def get_embedding(text_input):
    response = oai_client.embeddings.create(
        input=text_input,
        model="text-embedding-3-large"
    )
    return response.data[0].embedding


# Generate embeddings with a progress bar
embeddings = []
for text in tqdm(df['PageText'], desc='Generating Embeddings'):
    embedding = get_embedding(text)
    embeddings.append(embedding)

# Add the embeddings to the DataFrame
df['Embeddings'] = embeddings
Generating Embeddings: 100%|██████████| 49/49 [00:18<00:00,  2.61it/s]
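As a quick sanity check that the generated embeddings match the index dimension configured in Step 1, you can inspect the length of any vector (a minimal sketch; it should print 3072):

# Verify the embedding dimension matches the Pinecone index (3072 for text-embedding-3-large)
print(len(df['Embeddings'].iloc[0]))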

We can verify that our logic correctly flagged pages that require visual input. For example, page 21, which we inspected earlier, has its Visual_Input_Processed flag set to "Y".

# Display the flag for page 21 
filtered_rows = df[df['PageNumber'] == 21]
print(filtered_rows.Visual_Input_Processed)
20    Y
Name: Visual_Input_Processed, dtype: object
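To get an overview of how many pages across the whole report were flagged as containing visual content, you can also count the flag values (a minimal sketch):

# Count pages with and without visual content
print(df['Visual_Input_Processed'].value_counts())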

Step 4: Upload the embeddings to Pinecone

In this section, we upload the embeddings generated for each page of the document to Pinecone. Along with the embeddings, we add metadata tags that describe each page, such as the page number, the text content, the image path, and whether the page contains graphics.

Step-by-step breakdown:

1. Create the metadata fields:
Metadata enhances our ability to perform more refined searches, to find the text or image associated with a vector, and to filter within the vector database.

  • pageId: Combines document_id and pageNumber to create a unique identifier for each page. We will use this as the unique ID for the embedding.
  • pageNumber: The page number within the document.
  • text: The text content extracted from the page.
  • ImagePath: The path to the image file associated with the page.
  • GraphicIncluded: A flag indicating whether the page contains graphic elements that may require visual processing.

2. Upload the embeddings:
We will use our upsert_vector helper, which calls Pinecone's upsert method, to "upsert" the following values:

  • the unique identifier
  • the embedding
  • the metadata defined above

Note: "Upsert" is a combination of "update" and "insert". In database operations, an upsert is an atomic operation that updates an existing record if it exists and inserts a new record if it does not. This is particularly useful when you want to ensure the database holds the latest data without separately checking whether to insert or update.

# reload the index from Pinecone 
index = pc.Index(index_name)

# Create a document ID prefix 
document_id = 'WB_Report'


# Helper function to upsert a single vector into the index
def upsert_vector(identifier, embedding, metadata):
    try:
        index.upsert([
            {
                'id': identifier,
                'values': embedding,
                'metadata': metadata
            }
        ])
    except Exception as e:
        print(f"Error upserting vector with ID {identifier}: {e}")
        raise


for idx, row in tqdm(df.iterrows(), total=df.shape[0], desc='Uploading to Pinecone'):
    pageNumber = row['PageNumber']

    # Create meta-data tags to be added to Pinecone 
    metadata = {
        'pageId': f"{document_id}-{pageNumber}",
        'pageNumber': pageNumber,
        'text': row['PageText'],
        'ImagePath': row['ImagePath'],
        'GraphicIncluded': row['Visual_Input_Processed']
    }

    upsert_vector(metadata['pageId'], row['Embeddings'], metadata)
Uploading to Pinecone: 100%|██████████| 49/49 [00:08<00:00,  5.93it/s]
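Upserting one vector per request is simple but slow for larger documents. Pinecone's upsert also accepts a list of vectors, so as an optional optimization you could batch the pages; a minimal sketch under the same DataFrame and metadata assumptions (the batch size of 20 is arbitrary, chosen to stay well under request size limits):

# Optional: batch upserts instead of one request per page
batch = []
for _, row in df.iterrows():
    batch.append({
        'id': f"{document_id}-{row['PageNumber']}",
        'values': row['Embeddings'],
        'metadata': {
            'pageId': f"{document_id}-{row['PageNumber']}",
            'pageNumber': row['PageNumber'],
            'text': row['PageText'],
            'ImagePath': row['ImagePath'],
            'GraphicIncluded': row['Visual_Input_Processed']
        }
    })

BATCH_SIZE = 20
for start in range(0, len(batch), BATCH_SIZE):
    index.upsert(vectors=batch[start:start + BATCH_SIZE])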

Navigate to your index in the Pinecone console; you should be able to view the vectors uploaded to the database along with their metadata.
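You can also confirm the upload programmatically; the index stats report the total vector count, which should match the number of processed pages (49 in this example):

# Check the total vector count in the index
print(index.describe_index_stats())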

Step 5: Run semantic search to find relevant pages

In this section, we implement a semantic search function that finds the pages in the document most relevant to a user's question. The approach uses the embeddings stored in the Pinecone vector database to retrieve pages based on the semantic similarity between their content and the user's query. This lets us efficiently search the text content and provide it as context to GPT-4o to answer the user's question.

Step-by-step breakdown:

1. Generate an embedding for the user's question

  • We use OpenAI's embedding model to generate a high-dimensional vector representation of the user's question.
  • This vector captures the semantic meaning of the question, allowing us to perform an efficient similarity search against the stored embeddings.
  • Embeddings are essential for ensuring that the search query aligns semantically with the document content, even when the wording does not match exactly.

2. Query the Pinecone index for relevant pages

  • Using the generated embedding, we query the Pinecone index to find the most relevant pages.
  • Pinecone performs the similarity search by comparing the question's embedding against the stored embeddings using cosine similarity. As you may recall, we set this as the metric parameter when creating the Pinecone database in Step 1.
  • We specify the number of top matches to retrieve, typically balancing coverage against relevance. For example, retrieving the top 3-5 pages is usually enough to provide a comprehensive answer without overwhelming the model with excess context.

3. Compile the metadata from the matching pages as context

  • Once the relevant embeddings are identified, we collect their associated metadata, including the extracted text and page numbers.
  • This metadata is essential for building the context passed to GPT-4o.
  • We also format the compiled information as JSON to make it easier for the LLM to parse.

4. Generate the answer with the GPT-4o model

  • Finally, we pass the compiled context to GPT-4o.
  • The model uses the context to generate an informative, coherent, and contextually relevant answer to the user's question.
  • The retrieved context helps the LLM answer more accurately because it has access to the relevant information from the document.

import json


# Function to get response to a user's question 
def get_response_to_question(user_question, pc_index):
    # Get embedding of the question to find the relevant page with the information 
    question_embedding = get_embedding(user_question)

    # get response vector embeddings 
    response = pc_index.query(
        vector=question_embedding,
        top_k=2,
        include_values=True,
        include_metadata=True
    )

    # Collect the metadata from the matches
    context_metadata = [match['metadata'] for match in response['matches']]

    # Convert the list of metadata dictionaries to a JSON string to include in the prompt
    context_json = json.dumps(context_metadata, indent=3)

    prompt = f"""You are a helpful assistant. Use the following context and images to answer the question. In the answer, include the reference to the document, and page number you found the information on between <source></source> tags. If you don't find the information, you can say "I couldn't find the information"

    question: {user_question}
    
    <SOURCES>
    {context_json}
    </SOURCES>
    """

    # Call completions end point with the prompt 
    completion = oai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt}
        ]
    )

    return completion.choices[0].message.content

Now, let's ask a question that requires information from a chart. In this case, the relevant detail can be found in a pie chart.

question = "What percentage was allocated to social protections in Western and Central Africa?"
answer = get_response_to_question(question, index)

print(answer)
Social protection was allocated 8% of the total lending in Western and Central Africa in fiscal 2024. <source>WB_Report-13, page 13</source>

Let's make it more challenging by asking a question that requires interpreting information in a table. In Step 2, we extracted this information using the GPT-4o vision modality.

question = "What was the increase in access to electricity between 2000 and 2012 in Western and Central Africa?"
answer = get_response_to_question(question, index)

print(answer)
The increase in access to electricity between 2000 and 2012 in Western and Central Africa was from 34.1% to 44.1%, which is an increase of 10 percentage points. 

<source>WB_Report-13, page 13</source>

This approach works well. However, in some cases the information is embedded in images or graphics that lose fidelity when converted to text, such as complex engineering drawings.

With GPT-4o's vision modality, we can pass the page image directly to the model as context. In the next section, we explore how to use image input to improve the accuracy of the model's responses.

Step 6: Process pages with visual content (optional step)

When the metadata indicates the presence of an image, graphic, or table, we can pass the page image to GPT-4o as context instead of the extracted text. This approach is useful when a text description of the visual information is not sufficient to convey the context, which can be the case for complex graphics such as engineering drawings or intricate diagrams.

Step-by-step breakdown:

The difference between this step and Step 5 is that we add extra logic to identify when the Visual_Input_Processed flag is set for an embedding. In that case, instead of passing the text as context, we pass the page image as context using the GPT-4o vision modality.

Note: This approach does increase latency and cost, because processing image input requires more resources and is more expensive. It should therefore be used only when the desired results cannot be achieved with the text-only approach outlined in Step 5 above.

import base64
import json


def get_response_to_question_with_images(user_question, pc_index):
    # Get embedding of the question to find the relevant page with the information 
    question_embedding = get_embedding(user_question)

    # Get response vector embeddings 
    response = pc_index.query(
        vector=question_embedding,
        top_k=3,
        include_values=True,
        include_metadata=True
    )

    # Collect the metadata from the matches
    context_metadata = [match['metadata'] for match in response['matches']]

    # Build the message content
    message_content = []

    # Add the initial prompt
    initial_prompt = f"""You are a helpful assistant. Use the text and images provided by the user to answer the question. You must include the reference to the page number or title of the section you the answer where you found the information. If you don't find the information, you can say "I couldn't find the information"

    question: {user_question}
    """
    
    message_content.append({"role": "system", "content": initial_prompt})
    
    context_messages = []

    # Process each metadata item to include text or images based on 'Visual_Input_Processed'
    for metadata in context_metadata:
        visual_flag = metadata.get('GraphicIncluded')
        page_number = metadata.get('pageNumber')
        page_text = metadata.get('text')

        if visual_flag == 'Y':
            # Include the page image as context
            print(f"Adding page number {page_number} as an image to context")
            image_path = metadata.get('ImagePath', None)
            try:
                base64_image = encode_image(image_path)
                image_type = 'png'  # pages are saved as PNG in convert_page_to_image
                context_messages.append({
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/{image_type};base64,{base64_image}"
                    },
                })
            except Exception as e:
                print(f"Error encoding image at {image_path}: {e}")
        else:
            # Include the extracted text as context
            print(f"Adding page number {page_number} as text to context")
            context_messages.append({
                "type": "text",
                "text": f"Page {page_number} - {page_text}",
            })

    # Prepare the user message containing all collected context
    message_content.append({
        "role": "user",
        "content": context_messages
    })

    completion = oai_client.chat.completions.create(
        model="gpt-4o",
        messages=message_content
    )

    return completion.choices[0].message.content

Let's check the same question we asked in Step 5 with text-only semantic search. We can see that GPT-4o identifies the chart that contains the information needed to answer the question.

question = "What percentage was allocated to social protections in Western and Central Africa?"
answer = get_response_to_question_with_images(question, index)

print(answer)
Adding page number 13.0 as an image to context
Adding page number 12.0 as an image to context
Adding page number 11.0 as an image to context
The percentage allocated to social protection in Western and Central Africa is 8% (Figure 2: Western and Central Africa; IBRD and IDA Lending by Sector).

Now let's ask a question that likely cannot be answered from text alone, such as finding the image in the document associated with a topic and describing what it shows.

question = "Can you find the image associated with digital improvements and describe what you see in the images?"
answer = get_response_to_question_with_images(question, index)

print(answer)
Adding page number 32.0 as an image to context
Adding page number 10.0 as an image to context
Adding page number 4.0 as an image to context
### Image Descriptions

1. **Page 60-61 (Digital Section)**:
   - **Left Side**: A person is sitting and working on a laptop, holding a smartphone. The setting seems informal, possibly in a small office or a cafe.
   - **Text**: Discussion on scaling digital development, thought leadership, partnerships, and establishment of a Digital Vice Presidency unit for digital transformation efforts.

2. **Page 16-17 (Eastern and Southern Africa Section)**:
   - **Right Side**: A group of people standing on a paved street, some using mobile phones. It seems to be a casual, evening setting.
   - **Text**: Information about improving access to electricity in Rwanda and efforts for education and other services in Eastern and Southern Africa.

3. **Page 4-5 (Driving Action, Measuring Results)**:
   - **Images**: Various circular images and icons accompany text highlights such as feeding people, providing schooling, access to clean water, transport, and energy.
   - **Text**: Summary of key development results achieved by the World Bank Group in fiscal 2024.

These images illustrate the initiatives and impacts of the World Bank's projects and activities in various sectors.

Conclusion

In this guide, we set out to enhance Retrieval-Augmented Generation (RAG) for documents rich in images, charts, and tables. While traditional RAG models excel at processing text data, they often miss the wealth of information conveyed through visual elements. By integrating vision models and leveraging metadata tagging, we bridged this gap, enabling the AI to effectively interpret and use visual content.

We began by setting up a vector store with Pinecone, laying the foundation for efficiently storing and retrieving vector embeddings. By parsing the PDF and using the GPT-4o vision modality to extract visual information, we converted document pages into relevant text. By generating embeddings and flagging pages that contain visual content, we created a robust metadata filtering system in the vector store.

Uploading these embeddings to Pinecone enabled seamless integration with our RAG workflow. Through semantic search, we retrieved pages relevant to user queries, ensuring that both textual and visual information were taken into account. Passing pages with visual content to the vision model significantly improved the accuracy and depth of responses, especially for queries that depend on images or tables.

Using the World Bank's A Better Bank for a Better World: Annual Report 2024 as our working example, we demonstrated how these techniques work together to process and interpret a complex document. This approach not only enriches the information provided to users but also improves user satisfaction and engagement by delivering more comprehensive and accurate responses.

By following the concepts outlined in this guide, you are now equipped to build RAG systems that can process and interpret documents with complex visual elements. This opens up new possibilities for AI applications in domains where visual data plays a key role.