Introduction to Structured Data Extraction

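LLMs excel at understanding data, which enables one of their most important use cases: turning regular human language (which we call unstructured data) into specific, regular, expected formats that computer programs can consume. We call the output of this process structured data. Because the conversion usually discards a lot of superfluous data, we call the process extraction.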

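Structured data extraction in LlamaIndex is built around Pydantic classes: you define a data structure in Pydantic, and LlamaIndex works with Pydantic to coerce the output of the LLM into that structure.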

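Pydantic is a widely used data validation and conversion library. It relies heavily on Python type declarations. The Pydantic project's documentation includes an extensive guide, but we'll cover the very basics here.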

To create a Pydantic class, inherit from Pydantic's BaseModel class:

    from pydantic import BaseModel


    class User(BaseModel):
        id: int
        name: str = "Jane Doe"

In this example, you've created a User class with two fields, id and name. You've defined id as an integer, and name as a string that defaults to Jane Doe.
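To make the validation and conversion behavior concrete, here is a minimal sketch of what happens when you instantiate User. It assumes Pydantic's default (non-strict) mode, in which a numeric string is accepted for an integer field; the input values are made up for illustration.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str = "Jane Doe"


# In the default (lax) mode, a numeric string is converted to an int.
user = User(id="123")
print(user.id, type(user.id).__name__)

# name was not supplied, so the declared default is used.
print(user.name)

# A value that cannot be interpreted as an integer is rejected.
try:
    User(id="not a number")
except ValidationError as exc:
    print(f"validation failed with {len(exc.errors())} error(s)")
```

The same checks run on every instantiation, which is what lets a Pydantic class act as a gatekeeper for data coming from a less predictable source.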

You can create more complex structures by nesting these models:

    from typing import List, Optional
    from pydantic import BaseModel


    class Foo(BaseModel):
        count: int
        size: Optional[float] = None


    class Bar(BaseModel):
        apple: str = "x"
        banana: str = "y"


    class Spam(BaseModel):
        foo: Foo
        bars: List[Bar]

Now Spam has a foo and a bars. Foo has a count and an optional size, and bars is a list of objects, each of which has an apple and a banana property.
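As a rough illustration of how nesting behaves, the sketch below builds a Spam from a plain dictionary. It assumes Pydantic v2, where model_validate is the class method for validating Python objects; the dictionary contents are invented for the example.

```python
from typing import List, Optional
from pydantic import BaseModel


class Foo(BaseModel):
    count: int
    size: Optional[float] = None


class Bar(BaseModel):
    apple: str = "x"
    banana: str = "y"


class Spam(BaseModel):
    foo: Foo
    bars: List[Bar]


# Nested dictionaries and lists are validated into the nested models.
raw = {
    "foo": {"count": 4},
    "bars": [{"apple": "x1"}, {"banana": "y2"}],
}
spam = Spam.model_validate(raw)

print(spam.foo.count)      # taken from the input
print(spam.foo.size)       # optional field that was omitted
print(spam.bars[0])        # apple from the input, banana from its default
print(spam.bars[1].apple)  # default value, since apple was omitted
```

Each element of bars becomes a real Bar instance, so defaults and type checks apply at every level of the structure.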

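Pydantic can convert Pydantic classes into JSON-serialized schema objects that conform to a popular standard, JSON Schema. The User class above, for instance, serializes to this: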

    {
      "properties": {
        "id": {
          "title": "Id",
          "type": "integer"
        },
        "name": {
          "default": "Jane Doe",
          "title": "Name",
          "type": "string"
        }
      },
      "required": ["id"],
      "title": "User",
      "type": "object"
    }

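This capability is crucial: these JSON-formatted schemas are often passed to LLMs, and the LLMs in turn use them as instructions on how to return data.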
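If you want to inspect the schema for one of your own classes, a sketch along these lines should work. It assumes Pydantic v2, where the schema is produced by the model_json_schema class method (Pydantic v1 used a different method name).

```python
import json

from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = "Jane Doe"


# model_json_schema() returns the schema as a plain Python dict.
schema = User.model_json_schema()

# Serialize it to a JSON string, e.g. for inclusion in a prompt.
print(json.dumps(schema, indent=2))
```

Note that id is listed under required while name is not, because name has a default value.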

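As mentioned, LLMs use the JSON schemas generated by Pydantic as instructions on how to return data. To assist them and improve the accuracy of the returned data, it helps to include natural-language descriptions of objects and fields and what they're used for. Pydantic supports this through class docstrings and through the description argument of Field.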

We'll be using the following example Pydantic classes in all of our examples going forward:

    from datetime import datetime
    from pydantic import BaseModel, Field


    class LineItem(BaseModel):
        """A line item in an invoice."""

        item_name: str = Field(description="The name of this item")
        price: float = Field(description="The price of this item")


    class Invoice(BaseModel):
        """A representation of information from an invoice."""

        invoice_id: str = Field(
            description="A unique identifier for this invoice, often a number"
        )
        date: datetime = Field(description="The date this invoice was created")
        line_items: list[LineItem] = Field(
            description="A list of all the items in this invoice"
        )

This expands to a much more complex JSON schema:

    {
      "$defs": {
        "LineItem": {
          "description": "A line item in an invoice.",
          "properties": {
            "item_name": {
              "description": "The name of this item",
              "title": "Item Name",
              "type": "string"
            },
            "price": {
              "description": "The price of this item",
              "title": "Price",
              "type": "number"
            }
          },
          "required": ["item_name", "price"],
          "title": "LineItem",
          "type": "object"
        }
      },
      "description": "A representation of information from an invoice.",
      "properties": {
        "invoice_id": {
          "description": "A unique identifier for this invoice, often a number",
          "title": "Invoice Id",
          "type": "string"
        },
        "date": {
          "description": "The date this invoice was created",
          "format": "date-time",
          "title": "Date",
          "type": "string"
        },
        "line_items": {
          "description": "A list of all the items in this invoice",
          "items": {
            "$ref": "#/$defs/LineItem"
          },
          "title": "Line Items",
          "type": "array"
        }
      },
      "required": ["invoice_id", "date", "line_items"],
      "title": "Invoice",
      "type": "object"
    }
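The schema is only half of the round trip: once a model returns JSON that follows it, that JSON still has to be validated back into the Pydantic class. The sketch below shows that step in isolation, without any LLM call. It assumes Pydantic v2 (model_validate_json) and Python 3.9+ (for list[LineItem]); the llm_output string is a hypothetical stand-in for a model response, not real output.

```python
from datetime import datetime

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """A line item in an invoice."""

    item_name: str = Field(description="The name of this item")
    price: float = Field(description="The price of this item")


class Invoice(BaseModel):
    """A representation of information from an invoice."""

    invoice_id: str = Field(
        description="A unique identifier for this invoice, often a number"
    )
    date: datetime = Field(description="The date this invoice was created")
    line_items: list[LineItem] = Field(
        description="A list of all the items in this invoice"
    )


# Hypothetical JSON text standing in for an LLM response.
llm_output = """
{
  "invoice_id": "INV-001",
  "date": "2024-03-15T00:00:00",
  "line_items": [
    {"item_name": "Widget", "price": 9.99},
    {"item_name": "Gadget", "price": 20}
  ]
}
"""

# Parse and validate in one step; the ISO 8601 string becomes a datetime
# and each entry in line_items becomes a LineItem instance.
invoice = Invoice.model_validate_json(llm_output)

print(invoice.invoice_id)
print(invoice.date.year)
print([item.item_name for item in invoice.line_items])
```

If the JSON were missing a required field or had a value of the wrong type, model_validate_json would raise a ValidationError instead of returning a partially filled object.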

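Now that you have a basic understanding of Pydantic and the schemas it generates, you can move on to using Pydantic classes for structured data extraction in LlamaIndex, starting with Structured LLMs.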