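June 22, 2023

Financial Document Analysis with LlamaIndex

In this example notebook, we show how to perform financial analysis on 10-K filings with the LlamaIndex framework in just a few lines of code.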

Notebook Outline

Introduction
Setup
Data Loading and Indexing
Simple QA
Advanced QA - Compare and Contrast

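Introduction

LlamaIndex

LlamaIndex is a data framework for LLM applications. You can get started with just a few lines of code and build a retrieval-augmented generation (RAG) system in minutes. For more advanced users, LlamaIndex offers a rich toolkit for ingesting and indexing data, modules for retrieval and re-ranking, and composable components for building custom query engines.

See the full documentation for more details.

Financial Analysis over 10-K Documents

A key part of a financial analyst's job is to extract information and synthesize insights from long financial documents. A good example is the 10-K form, an annual report required by the U.S. Securities and Exchange Commission (SEC) that gives a comprehensive summary of a company's financial performance. These documents typically run to hundreds of pages and contain domain-specific terminology that makes them hard for a layperson to digest quickly.

We show how LlamaIndex can help a financial analyst quickly extract information and synthesize insights across multiple documents with very little coding.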

Setup

First, we need to install the llama-index library

!pip install llama-index pypdf

Now, we import all the modules used in this tutorial

from langchain import OpenAI

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index import set_global_service_context
from llama_index.response.pprint_utils import pprint_response
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

Before we start, we can configure the LLM provider and model that will power our RAG system.
Here, we pick gpt-3.5-turbo-instruct from OpenAI.

llm = OpenAI(temperature=0, model_name="gpt-3.5-turbo-instruct", max_tokens=-1)

We construct a ServiceContext and set it as the global default, so all subsequent operations that depend on LLM calls will use the model we configured here.

service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context=service_context)

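Data Loading and Indexing

Now, we load and parse two PDFs (one for the Uber 10-K in 2021 and one for the Lyft 10-K in 2021).
Under the hood, the PDFs are converted to plain-text Document objects, separated by page.

Note: this operation might take a while to run, since each document is more than a hundred pages long.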

lyft_docs = SimpleDirectoryReader(input_files=["../data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["../data/10k/uber_2021.pdf"]).load_data()

print(f'Loaded lyft 10-K with {len(lyft_docs)} pages')
print(f'Loaded Uber 10-K with {len(uber_docs)} pages')

Loaded lyft 10-K with 238 pages
Loaded Uber 10-K with 307 pages
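Because each page becomes its own document, the length of the loaded list equals the page count reported above. The following is a minimal sketch of that per-page wrapping, not LlamaIndex's actual loader: `PageDocument`, `pages_to_documents`, the `page_label` metadata key, and the sample page texts are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """Hypothetical stand-in for a loaded document: one page of plain text."""
    text: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts, file_name):
    """Wrap each page's text in its own document, tagged with a 1-based page label."""
    return [
        PageDocument(text=text, metadata={"file_name": file_name, "page_label": str(i)})
        for i, text in enumerate(page_texts, start=1)
    ]


# Made-up page texts standing in for text extracted from a PDF.
sample_pages = ["Cover page", "Item 1. Business", "Item 1A. Risk Factors"]
docs = pages_to_documents(sample_pages, "example_10k.pdf")

print(f"Loaded example 10-K with {len(docs)} pages")
print(docs[1].metadata)
```

Keeping the page label next to the text is presumably what allows an answer to cite the page it came from.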

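Now, we can build an (in-memory) VectorStoreIndex over the documents that we've loaded.

Note: this operation might take a while to run, since it calls the OpenAI API to compute vector embeddings over document chunks.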

lyft_index = VectorStoreIndex.from_documents(lyft_docs)
uber_index = VectorStoreIndex.from_documents(uber_docs)
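To give a rough idea of what building such an index involves, here is a toy sketch that splits text into overlapping chunks and stores one vector per chunk in memory. It makes several assumptions: `chunk_words`, `toy_embed`, and `build_index` are hypothetical helpers, the bag-of-words vector is only a placeholder for a real embedding model, and the chunk sizes are arbitrary rather than LlamaIndex's defaults.

```python
import math
from collections import Counter


def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping windows of words (a simplified chunker)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text):
    """Toy 'embedding': a unit-length bag-of-words vector stored as a dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def build_index(page_texts):
    """Return a list of (chunk_text, vector) pairs held in memory."""
    index = []
    for text in page_texts:
        for chunk in chunk_words(text):
            index.append((chunk, toy_embed(chunk)))
    return index


# Made-up sentence standing in for one page of a filing.
pages = ["revenue increased in 2021 compared to 2020 due to higher ride volume and pricing changes"]
index = build_index(pages)
print(len(index), "chunks indexed")
```

In the notebook, the embedding step is the part that calls the OpenAI API, which is why index construction takes noticeable time on documents of this size.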

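Simple QA

Now we are ready to run some queries against our indices!
To do so, we first configure a QueryEngine, which simply captures a set of configurations for how we want to query the underlying index.

For a VectorStoreIndex, the most commonly adjusted setting is similarity_top_k, which controls how many document chunks (which we call Node objects) are retrieved as context for answering the question.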

lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)

uber_engine = uber_index.as_query_engine(similarity_top_k=3)
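The effect of similarity_top_k can be illustrated with a small retrieval sketch. This is an assumption-laden illustration rather than the library's implementation: `cosine_similarity` and `retrieve_top_k` are hypothetical helpers, cosine similarity is assumed as the scoring function, and the three-dimensional vectors are hand-made placeholders for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vector, nodes, similarity_top_k=3):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in nodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:similarity_top_k]


# Hand-made 3-dimensional vectors; real embeddings have far more dimensions.
nodes = [
    ("Revenue discussion", [0.9, 0.1, 0.0]),
    ("Risk factors", [0.1, 0.9, 0.0]),
    ("Revenue by segment", [0.8, 0.2, 0.1]),
    ("Legal proceedings", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]

top = retrieve_top_k(query, nodes, similarity_top_k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

A larger k gives the LLM more context to draw on, at the cost of a longer prompt and possibly less relevant chunks.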

Let's see some queries in action!

response = await lyft_engine.aquery('What is the revenue of Lyft in 2021? Answer in millions with page reference')

print(response)

$3,208.3 million (page 63)

response = await uber_engine.aquery('What is the revenue of Uber in 2021? Answer in millions, with page reference')

print(response)

$17,455 (page 53)
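The top-level `await` used in these cells works because a notebook already runs an event loop. In a plain Python script, the same async calls need to be driven explicitly, for example with `asyncio.run`. The sketch below assumes a hypothetical `StubQueryEngine` that only mimics the shape of an `aquery` method and returns canned text; it does not call LlamaIndex or any API.

```python
import asyncio


class StubQueryEngine:
    """Hypothetical stand-in exposing an async `aquery` method."""

    def __init__(self, name):
        self.name = name

    async def aquery(self, question):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"[{self.name}] answer to: {question}"


async def main():
    lyft = StubQueryEngine("lyft_10k")
    uber = StubQueryEngine("uber_10k")
    # Run both queries concurrently and wait for both results, in order.
    return await asyncio.gather(
        lyft.aquery("What is the revenue of Lyft in 2021?"),
        uber.aquery("What is the revenue of Uber in 2021?"),
    )


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
answers = asyncio.run(main())
for answer in answers:
    print(answer)
```

Using `asyncio.gather` here is a design choice: the two queries are independent, so they can be in flight at the same time.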

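Advanced QA - Compare and Contrast

For more complex financial analysis, one often needs to reference multiple documents.

As an example, let's look at how to run compare-and-contrast queries over both Lyft and Uber financials.
For this, we build a SubQuestionQueryEngine, which breaks a complex compare-and-contrast query down into simpler sub-questions and executes each one on the corresponding sub query engine, each backed by its own index.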

query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine, 
        metadata=ToolMetadata(name='lyft_10k', description='Provides information about Lyft financials for year 2021')
    ),
    QueryEngineTool(
        query_engine=uber_engine, 
        metadata=ToolMetadata(name='uber_10k', description='Provides information about Uber financials for year 2021')
    ),
]

s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
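The decompose-route-combine flow described above can be sketched in a few lines of plain Python. This is only a conceptual sketch under stated assumptions: `SubQuestion`, `decompose`, and `answer_with_sub_questions` are hypothetical names, the decomposition is a fixed rule (in the notebook an LLM generates the sub-questions), and the "engines" are stub functions returning canned strings.

```python
from dataclasses import dataclass


@dataclass
class SubQuestion:
    tool_name: str
    question: str


def decompose(query, tool_names):
    """Hypothetical decomposition: ask the same question of each tool.
    In the notebook, an LLM generates the sub-questions instead."""
    return [SubQuestion(name, f"{query} (for {name})") for name in tool_names]


def answer_with_sub_questions(query, tools):
    """Route each sub-question to its tool, then join the answers."""
    sub_questions = decompose(query, list(tools))
    answers = []
    for sq in sub_questions:
        answer = tools[sq.tool_name](sq.question)
        answers.append(f"[{sq.tool_name}] {answer}")
    # A real engine would pass these answers to an LLM for a final synthesis.
    return "\n".join(answers)


# Stub 'query engines': plain functions returning canned strings.
tools = {
    "lyft_10k": lambda q: "stub answer from the Lyft index",
    "uber_10k": lambda q: "stub answer from the Uber index",
}

result = answer_with_sub_questions("Compare revenue growth", tools)
print(result)
```

The tool names and descriptions matter in the real engine because they are what the LLM sees when deciding which index each sub-question should go to.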

Let's see these queries in action!

response = await s_engine.aquery('Compare and contrast the customer segments and geographies that grew the fastest')

Generated 4 sub questions.
[uber_10k] Q: What customer segments grew the fastest for Uber
[uber_10k] A: in 2021?

The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth.
[uber_10k] Q: What geographies grew the fastest for Uber
[uber_10k] A: 
Based on the context information, it appears that Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.
[lyft_10k] Q: What customer segments grew the fastest for Lyft
[lyft_10k] A: 
The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them.
[lyft_10k] Q: What geographies grew the fastest for Lyft
[lyft_10k] A: 
It is not possible to answer this question with the given context information.

print(response)

The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth. Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.

The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

In summary, Uber and Lyft both experienced growth in customer segments related to mobility, couriers, riders, and eaters. Uber experienced the most growth in large metropolitan areas, as well as in suburban and rural areas, and in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain. Lyft experienced the most growth in ridesharing, light vehicles, and public transit. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.

response = await s_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')

Generated 2 sub questions.
[uber_10k] Q: What is the revenue growth of Uber from 2020 to 2021
[uber_10k] A: 
The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis.
[lyft_10k] Q: What is the revenue growth of Lyft from 2020 to 2021
[lyft_10k] A: 
The revenue growth of Lyft from 2020 to 2021 is 36%, increasing from $2,364,681 thousand to $3,208,323 thousand.

print(response)

The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis, while the revenue growth of Lyft from 2020 to 2021 was 36%. This means that Uber had a higher revenue growth than Lyft from 2020 to 2021.
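Since the Lyft sub-answer quotes both revenue figures, its growth percentage can be checked by hand. The short sketch below recomputes it from the two numbers in the output above; `growth_rate` is a hypothetical helper name. Uber's 2020 revenue does not appear in the output, so the 57% figure is not rechecked here.

```python
def growth_rate(previous, current):
    """Year-over-year growth as a fraction of the previous value."""
    return (current - previous) / previous


# Figures in thousands of USD, taken from the Lyft sub-answer above.
lyft_2020 = 2_364_681
lyft_2021 = 3_208_323

lyft_growth = growth_rate(lyft_2020, lyft_2021)
print(f"Lyft revenue growth 2020 -> 2021: {lyft_growth:.1%}")
```

The result is about 35.7%, which rounds to the 36% reported by the engine. The 2021 figure is also consistent with the $3,208.3 million returned by the simple query earlier.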