Query Engine with Pydantic Outputs¶
Every query engine has support for integrated structured responses using the following response_modes in RetrieverQueryEngine:
refine
compact
tree_summarize
accumulate (beta, requires extra parsing to convert to objects)
compact_accumulate (beta, requires extra parsing to convert to objects)
In this notebook, we walk through a small example demonstrating this usage.
Under the hood, every LLM response will be a pydantic object. If that response needs to be refined or summarized, it is converted into a JSON string for the next response. Then, the final response is returned as a pydantic object.
NOTE: Technically, this can work with any LLM, but non-OpenAI models are still under development and are considered beta.
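To make that refine/summarize round trip concrete, here is a minimal sketch using only pydantic (independent of LlamaIndex; the `Biography` model mirrors the one defined later in this notebook, and the field values are made up):

```python
from typing import List

from pydantic import BaseModel


class Biography(BaseModel):
    """Data model for a biography."""

    name: str
    best_known_for: List[str]
    extra_info: str


# An intermediate LLM response, already validated as a pydantic object
bio = Biography(
    name="Paul Graham",
    best_known_for=["co-founding Viaweb"],
    extra_info="Essayist and programmer.",
)

# Converted to a JSON string so it can be passed into the next LLM call
json_str = bio.json()

# The final response is parsed back into a pydantic object
restored = Biography.parse_raw(json_str)
assert restored == bio
```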
Setup¶
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
%pip install llama-index-llms-anthropic
%pip install llama-index-llms-openai
In [ ]:
!pip install llama-index
In [ ]:
import os
import openai

os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
Download Data
In [ ]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
In [ ]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
Create Our Pydantic Output Object¶
In [ ]:
from typing import List

from pydantic import BaseModel


class Biography(BaseModel):
    """Data model for a biography."""

    name: str
    best_known_for: List[str]
    extra_info: str
Create the Index + Query Engine (OpenAI)¶
When using OpenAI, the function calling API will be leveraged for reliable structured outputs.
In [ ]:
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

index = VectorStoreIndex.from_documents(
    documents,
)
In [ ]:
query_engine = index.as_query_engine(
    output_cls=Biography, response_mode="compact", llm=llm
)
In [ ]:
response = query_engine.query("Who is Paul Graham?")
In [ ]:
print(response.name)
print(response.best_known_for)
print(response.extra_info)
Paul Graham
['working on Bel', 'co-founding Viaweb', 'creating the programming language Arc']
Paul Graham is a computer scientist, entrepreneur, and writer. He is best known for his work on Bel, a programming language, and for co-founding Viaweb, an early web application company that was later acquired by Yahoo. Graham also created the programming language Arc. He has written numerous essays on topics such as startups, programming, and life.
In [ ]:
# get the full pydantic object
print(type(response.response))
<class '__main__.Biography'>
Create the Index + Query Engine (Non-OpenAI, Beta)¶
When using an LLM that does not support function calling, we rely on the LLM writing the JSON itself, which is then parsed into the proper pydantic object.
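Conceptually, this beta path amounts to prompting the LLM to emit JSON and then validating it with the output class. A rough sketch of that parsing step, where the raw completion string below is hypothetical (models often wrap the JSON in prose, so we trim to the first brace, as the accumulate example later in this notebook also does):

```python
from typing import List

from pydantic import BaseModel


class Biography(BaseModel):
    """Data model for a biography."""

    name: str
    best_known_for: List[str]
    extra_info: str


# Hypothetical raw completion from an LLM prompted to answer in JSON
raw_output = (
    "Sure, here is the biography as JSON: "
    '{"name": "Paul Graham", '
    '"best_known_for": ["co-founding Viaweb"], '
    '"extra_info": "Essayist and programmer."}'
)

# Trim any leading prose, then validate into the pydantic object
json_str = raw_output[raw_output.find("{") :]
bio = Biography.parse_raw(json_str)
```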
In [ ]:
import os

os.environ["ANTHROPIC_API_KEY"] = "sk-..."
In [ ]:
from llama_index.core import VectorStoreIndex
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-instant-1.2", temperature=0.1)

index = VectorStoreIndex.from_documents(
    documents,
)
In [ ]:
query_engine = index.as_query_engine(
    output_cls=Biography, response_mode="tree_summarize", llm=llm
)
In [ ]:
response = query_engine.query("Who is Paul Graham?")
In [ ]:
print(response.name)
print(response.best_known_for)
print(response.extra_info)
Paul Graham
['Co-founder of Y Combinator', 'Essayist and programmer']
He is known for creating Viaweb, one of the first web application builders, and for founding Y Combinator, one of the world's top startup accelerators. Graham has also written extensively about technology, investing, and philosophy.
In [ ]:
# get the full pydantic object
print(type(response.response))
<class '__main__.Biography'>
Accumulate Examples (Beta)¶
Accumulating pydantic objects requires some extra parsing. This is still a beta feature, but it is possible to accumulate pydantic objects.
In [ ]:
from typing import List

from pydantic import BaseModel


class Company(BaseModel):
    """Data model for companies mentioned."""

    company_name: str
    context_info: str
In [ ]:
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

index = VectorStoreIndex.from_documents(
    documents,
)
In [ ]:
query_engine = index.as_query_engine(
    output_cls=Company, response_mode="accumulate", llm=llm
)
In [ ]:
response = query_engine.query("What companies are mentioned in the text?")
In accumulate, responses are separated by a default separator and prepended with a prefix.
In [ ]:
companies = []

# split by the default separator
for response_str in str(response).split("\n---------------------\n"):
    # remove the prefix -- every response starts like `Response 1: {...}`
    # so, we find the first bracket and remove everything before it
    response_str = response_str[response_str.find("{") :]
    companies.append(Company.parse_raw(response_str))
In [ ]:
print(companies)
[Company(company_name='Yahoo', context_info='Yahoo bought us'), Company(company_name='Yahoo', context_info="I'd been meaning to since Yahoo bought us")]