Introduction to Structured Data Extraction

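LLMs excel at data understanding, which enables one of their most important use cases: turning regular human language (which we refer to as unstructured data) into specific, regular, expected formats that computer programs can consume. We call the output of this process structured data. Because a lot of superfluous data is usually discarded during the conversion, we call the process extraction.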

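Structured data extraction in LlamaIndex is built around Pydantic classes: you define a data structure in Pydantic, and LlamaIndex works with Pydantic to coerce the output of the LLM into that structure.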

What is Pydantic?

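Pydantic is a widely used data validation and conversion library. It relies heavily on Python type declarations. The project's own documentation includes an extensive guide to Pydantic, but we'll cover the very basics here.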

To create a Pydantic class, inherit from Pydantic's BaseModel class:

from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = "Jane Doe"

In this example, you've created a User class with two fields, id and name. You've defined id as an integer, and name as a string that defaults to Jane Doe.
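To make the validation and conversion behavior concrete, here is a minimal sketch of what happens when the User class is instantiated. It assumes Pydantic's default (non-strict) mode, in which a numeric string is coerced to an integer; the input values are made up for illustration.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str = "Jane Doe"


# The numeric string is coerced to an int, and name falls back to its default.
user = User(id="123")
print(user.id, user.name)

# A value that cannot be interpreted as an integer is rejected.
try:
    User(id="not a number")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

The same check is what guards LLM output later on: data that does not fit the declared types raises an error instead of silently passing through.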

You can create more complex structures by nesting these models:

from typing import List, Optional
from pydantic import BaseModel


class Foo(BaseModel):
    count: int
    size: Optional[float] = None


class Bar(BaseModel):
    apple: str = "x"
    banana: str = "y"


class Spam(BaseModel):
    foo: Foo
    bars: List[Bar]

Now Spam has a foo and a bars. Foo has a count and an optional size, and bars is a list of objects, each of which has an apple and a banana property.
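As a rough illustration of how nesting behaves, the sketch below builds a Spam instance from plain dictionaries, which is roughly the shape parsed LLM output would take. The sample values are invented for the example.

```python
from typing import List, Optional
from pydantic import BaseModel


class Foo(BaseModel):
    count: int
    size: Optional[float] = None


class Bar(BaseModel):
    apple: str = "x"
    banana: str = "y"


class Spam(BaseModel):
    foo: Foo
    bars: List[Bar]


# Nested dicts are validated and converted into Foo and Bar instances.
spam = Spam(foo={"count": 4}, bars=[{"apple": "x1"}, {"banana": "y2"}])

print(type(spam.foo).__name__)   # the dict became a Foo object
print(spam.foo.size)             # optional field left unset
print(spam.bars[0].banana)       # default filled in for the missing field
```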

Converting Pydantic objects to JSON schemas

Pydantic can convert Pydantic classes into JSON-serialized schema objects that conform to popular standards. The User class above, for instance, serializes to:

{
  "properties": {
    "id": {
      "title": "Id",
      "type": "integer"
    },
    "name": {
      "default": "Jane Doe",
      "title": "Name",
      "type": "string"
    }
  },
  "required": ["id"],
  "title": "User",
  "type": "object"
}

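This property is crucial: these JSON-formatted schemas are often passed to LLMs, and the LLMs in turn use them as instructions on how to return data.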
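For reference, a schema like the one above can be produced directly from the class. This sketch assumes Pydantic v2, where the method is called model_json_schema(); key order in the printed output may differ from the listing above.

```python
import json
from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = "Jane Doe"


# Build the JSON schema as a plain Python dict (Pydantic v2 API).
schema = User.model_json_schema()

# Serialize it to a string, e.g. for inclusion in a prompt.
print(json.dumps(schema, indent=2))
```

Only id appears under "required" because name has a default value.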

Using annotations

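As mentioned, LLMs use the JSON schemas from Pydantic as instructions on how to return data. To help them and improve the accuracy of the returned data, it's useful to include natural-language descriptions of objects and fields and what they're used for. Pydantic supports this with docstrings and Field.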

We'll be using the following example Pydantic classes in all of the examples going forward:

from datetime import datetime
from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """A line item in an invoice."""

    item_name: str = Field(description="The name of this item")
    price: float = Field(description="The price of this item")


class Invoice(BaseModel):
    """A representation of information from an invoice."""

    invoice_id: str = Field(
        description="A unique identifier for this invoice, often a number"
    )
    date: datetime = Field(description="The date this invoice was created")
    line_items: list[LineItem] = Field(
        description="A list of all the items in this invoice"
    )

This expands to a much more complex JSON schema:

{
  "$defs": {
    "LineItem": {
      "description": "A line item in an invoice.",
      "properties": {
        "item_name": {
          "description": "The name of this item",
          "title": "Item Name",
          "type": "string"
        },
        "price": {
          "description": "The price of this item",
          "title": "Price",
          "type": "number"
        }
      },
      "required": ["item_name", "price"],
      "title": "LineItem",
      "type": "object"
    }
  },
  "description": "A representation of information from an invoice.",
  "properties": {
    "invoice_id": {
      "description": "A unique identifier for this invoice, often a number",
      "title": "Invoice Id",
      "type": "string"
    },
    "date": {
      "description": "The date this invoice was created",
      "format": "date-time",
      "title": "Date",
      "type": "string"
    },
    "line_items": {
      "description": "A list of all the items in this invoice",
      "items": {
        "$ref": "#/$defs/LineItem"
      },
      "title": "Line Items",
      "type": "array"
    }
  },
  "required": ["invoice_id", "date", "line_items"],
  "title": "Invoice",
  "type": "object"
}
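The schema describes what the LLM should return; the same classes can then check what actually comes back. The sketch below is an illustration only, not how LlamaIndex does it internally: it assumes Pydantic v2 (model_validate_json and model_json_schema), and the llm_output string is a hand-written stand-in for a model response.

```python
from datetime import datetime
from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """A line item in an invoice."""

    item_name: str = Field(description="The name of this item")
    price: float = Field(description="The price of this item")


class Invoice(BaseModel):
    """A representation of information from an invoice."""

    invoice_id: str = Field(
        description="A unique identifier for this invoice, often a number"
    )
    date: datetime = Field(description="The date this invoice was created")
    line_items: list[LineItem] = Field(
        description="A list of all the items in this invoice"
    )


# Nested models are placed under "$defs" and referenced via "$ref".
schema = Invoice.model_json_schema()
print(list(schema["$defs"]))

# Hypothetical JSON text standing in for an LLM response.
llm_output = """
{
  "invoice_id": "INV-001",
  "date": "2024-03-15T00:00:00",
  "line_items": [
    {"item_name": "Widget", "price": 9.99},
    {"item_name": "Gadget", "price": 20}
  ]
}
"""

# Parse and validate in one step; the date string becomes a datetime.
invoice = Invoice.model_validate_json(llm_output)
print(invoice.date.year, len(invoice.line_items))
```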

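Now that you have a basic understanding of Pydantic and the schemas it generates, you can move on to using Pydantic classes for structured data extraction in LlamaIndex, starting with Structured LLMs.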