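Introduction to Structured Data Extraction
LLMs excel at data understanding, which enables one of their most important use cases: turning regular human language (which we refer to as unstructured data) into specific, regular, expected formats that computer programs can consume. We call the output of this process structured data. Because a lot of superfluous data is usually discarded during the conversion, we call it extraction.
Structured data extraction in LlamaIndex is built around Pydantic classes: you define a data structure in Pydantic, and LlamaIndex works with Pydantic to coerce the output of the LLM into that structure.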
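What is Pydantic?
Pydantic is a widely used data validation and conversion library. It relies heavily on Python type declarations. The project's own documentation includes an extensive guide to Pydantic, but we'll cover the very basics here.
To create a Pydantic class, inherit from Pydantic's BaseModel class: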
from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = "Jane Doe"

In this example, you've created a User class with two fields, id and name. You've defined id as an integer, and name as a string that defaults to Jane Doe.
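To make the validation and conversion behavior concrete, here is a minimal sketch that instantiates the User class. It assumes Pydantic's default (non-strict) mode, in which a numeric string is coerced to an integer; the input values are made up for illustration.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str = "Jane Doe"


# The string "123" is coerced to the integer 123 (default, non-strict mode),
# and name falls back to its declared default.
user = User(id="123")
print(user.id, user.name)

# A value that cannot be interpreted as an integer is rejected.
try:
    User(id="not a number")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

This is the same mechanism that lets a model's fields act as a contract: data either fits the declared types or raises an error.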
You can create more complex structures by nesting these models:
from typing import List, Optional
from pydantic import BaseModel


class Foo(BaseModel):
    count: int
    size: Optional[float] = None


class Bar(BaseModel):
    apple: str = "x"
    banana: str = "y"


class Spam(BaseModel):
    foo: Foo
    bars: List[Bar]

Now Spam has a foo and a bars. Foo has a count and an optional size, and bars is a List of objects, each of which has an apple and a banana property.
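As a rough illustration of how nesting behaves, the sketch below builds a Spam instance from a plain dictionary. The dictionary contents are hypothetical; the point is that nested dictionaries are validated into Foo and Bar instances and that omitted fields take their defaults.

```python
from typing import List, Optional
from pydantic import BaseModel


class Foo(BaseModel):
    count: int
    size: Optional[float] = None


class Bar(BaseModel):
    apple: str = "x"
    banana: str = "y"


class Spam(BaseModel):
    foo: Foo
    bars: List[Bar]


# Hypothetical input: nested dicts, with some fields left out on purpose.
raw = {"foo": {"count": 4}, "bars": [{"apple": "x1"}, {"banana": "y2"}]}

spam = Spam(**raw)

# The nested dicts have become model instances with defaults filled in.
print(type(spam.foo).__name__, spam.foo.count, spam.foo.size)
for bar in spam.bars:
    print(bar.apple, bar.banana)
```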
Converting Pydantic objects to JSON schemas
Pydantic supports converting Pydantic classes into JSON-serialized schema objects that conform to popular standards. The User class above, for instance, serializes into this:
{
  "properties": {
    "id": {
      "title": "Id",
      "type": "integer"
    },
    "name": {
      "default": "Jane Doe",
      "title": "Name",
      "type": "string"
    }
  },
  "required": ["id"],
  "title": "User",
  "type": "object"
}

This property is crucial: these JSON-formatted schemas are often passed to LLMs, and the LLMs in turn use them as instructions on how to return data.
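A schema like the one above can be produced programmatically. The following sketch assumes Pydantic v2, where the class method is model_json_schema() (Pydantic v1 used a different method name); key ordering in the printed output may differ from the listing above.

```python
import json

from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = "Jane Doe"


# Assumes Pydantic v2: model_json_schema() returns the schema as a plain dict.
schema = User.model_json_schema()

# Serialize the dict to a JSON string, e.g. to include it in a prompt.
print(json.dumps(schema, indent=2))
```

Because the result is an ordinary dictionary, it can be inspected, modified, or embedded in a prompt like any other JSON-serializable value.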
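As mentioned, LLMs use the JSON schemas generated by Pydantic as instructions on how to return data. To assist them and improve the accuracy of your returned data, it helps to include natural-language descriptions of objects and fields and what they are used for. Pydantic supports this through docstrings and Fields.
We'll be using the following example Pydantic classes in all of our examples going forward: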
from datetime import datetime
from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """A line item in an invoice."""

    item_name: str = Field(description="The name of this item")
    price: float = Field(description="The price of this item")


class Invoice(BaseModel):
    """A representation of information from an invoice."""

    invoice_id: str = Field(
        description="A unique identifier for this invoice, often a number"
    )
    date: datetime = Field(description="The date this invoice was created")
    line_items: list[LineItem] = Field(
        description="A list of all the items in this invoice"
    )

This expands into a much more complex JSON schema:
{
  "$defs": {
    "LineItem": {
      "description": "A line item in an invoice.",
      "properties": {
        "item_name": {
          "description": "The name of this item",
          "title": "Item Name",
          "type": "string"
        },
        "price": {
          "description": "The price of this item",
          "title": "Price",
          "type": "number"
        }
      },
      "required": ["item_name", "price"],
      "title": "LineItem",
      "type": "object"
    }
  },
  "description": "A representation of information from an invoice.",
  "properties": {
    "invoice_id": {
      "description": "A unique identifier for this invoice, often a number",
      "title": "Invoice Id",
      "type": "string"
    },
    "date": {
      "description": "The date this invoice was created",
      "format": "date-time",
      "title": "Date",
      "type": "string"
    },
    "line_items": {
      "description": "A list of all the items in this invoice",
      "items": {
        "$ref": "#/$defs/LineItem"
      },
      "title": "Line Items",
      "type": "array"
    }
  },
  "required": ["invoice_id", "date", "line_items"],
  "title": "Invoice",
  "type": "object"
}

Now that you have a basic understanding of Pydantic and the schemas it generates, you can move on to using Pydantic classes for structured data extraction in LlamaIndex, starting with Structured LLMs.
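Before moving on, it may help to see the other half of the round trip: turning JSON that follows this schema back into typed objects. The sketch below is not LlamaIndex's API; it only uses Pydantic v2's model_validate_json(), and the JSON string is a hand-written stand-in for what an LLM might return.

```python
from datetime import datetime

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """A line item in an invoice."""

    item_name: str = Field(description="The name of this item")
    price: float = Field(description="The price of this item")


class Invoice(BaseModel):
    """A representation of information from an invoice."""

    invoice_id: str = Field(
        description="A unique identifier for this invoice, often a number"
    )
    date: datetime = Field(description="The date this invoice was created")
    line_items: list[LineItem] = Field(
        description="A list of all the items in this invoice"
    )


# Hypothetical JSON, standing in for an LLM response that follows the schema.
llm_output = """
{
  "invoice_id": "INV-001",
  "date": "2024-01-15T00:00:00",
  "line_items": [
    {"item_name": "Widget", "price": 9.99},
    {"item_name": "Gadget", "price": 20}
  ]
}
"""

# Assumes Pydantic v2: parse and validate the JSON string in one step.
invoice = Invoice.model_validate_json(llm_output)

# Fields now have their declared Python types (datetime, float, LineItem).
total = sum(item.price for item in invoice.line_items)
print(invoice.invoice_id, invoice.date.year, len(invoice.line_items), total)
```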