Classify Python SDK

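This guide shows how to classify documents with the Python SDK. You will:

Create classification rules
Upload files
Submit a classification job
Read the predictions (type, confidence, reasoning)

The SDK is available in the llama-cloud-services package.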

Setup

First, get an API key.

Put it in a .env file:

LLAMA_CLOUD_API_KEY=llx-xxxxxx

Install the dependencies:

pip install llama-cloud-services python-dotenv

Or with uv:

uv add llama-cloud-services python-dotenv
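A missing or empty key usually surfaces only later as an authentication error, so it can help to check it up front. The following is a minimal sketch, not part of the SDK: `require_api_key` is a hypothetical helper name, and it only assumes the variable name `LLAMA_CLOUD_API_KEY` shown above. In a real script you would call `load_dotenv()` first and then call the helper with no arguments so it reads `os.environ`.

```python
import os


def require_api_key(env=None, name="LLAMA_CLOUD_API_KEY"):
    """Return the API key from `env` (defaults to os.environ) or raise a clear error.

    Hypothetical helper for illustration; it is not provided by the SDK.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"LLAMA_CLOUD_API_KEY": " llx-xxxxxx "}
api_key = require_api_key(demo_env)
print(api_key)
```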

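Quick start

The snippet below uses LlamaClassify, a convenience wrapper from llama-cloud-services that uploads the files, creates a classification job, polls until it completes, and returns the results.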

import os
from dotenv import load_dotenv
from llama_cloud.client import AsyncLlamaCloud
from llama_cloud.types import ClassifierRule, ClassifyParsingConfiguration, ParserLanguages
from llama_cloud_services.beta.classifier.client import LlamaClassify  # helper wrapper

load_dotenv()

client = AsyncLlamaCloud(token=os.environ["LLAMA_CLOUD_API_KEY"])
project_id = "your-project-id"
classifier = LlamaClassify(client, project_id=project_id)

rules = [
    ClassifierRule(
        type="invoice",
        description="Documents that contain an invoice number, invoice date, bill-to section, and line items with totals."
    ),
    ClassifierRule(
        type="receipt",
        description="Short purchase receipts, typically from POS systems, with merchant, items and total, often a single page."
    ),
]

parsing = ClassifyParsingConfiguration(
    lang=ParserLanguages.EN,
    max_pages=5,            # optional, parse at most 5 pages
    # target_pages=[1]        # optional, parse only specific pages (1-indexed), can't be used with max_pages
)

# for async usage, use `await classifier.aclassify(...)`
results = classifier.classify(
    rules=rules,
    files=[
        "/path/to/doc1.pdf",
        "/path/to/doc2.pdf",
    ],
    parsing_configuration=parsing,
)

for item in results.items:
    # in cases of partial success, some of the items may not have a result
    if item.result is None:
        print(f"Classification job {item.classify_job_id} failed on file {item.file_id}")
        continue
    print(item.file_id, item.result.type, item.result.confidence)
    print(item.result.reasoning)

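Notes:

ClassifierRule requires a type and a descriptive description that the model can follow.
ClassifyParsingConfiguration is optional; set lang, max_pages, or target_pages to control parsing. max_pages and target_pages cannot be used together.
On partial failure, some items may have no result (that is, results.items[*].result may be None).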
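Because some items can come back without a result, it is often convenient to separate failures from usable predictions before acting on them. The sketch below is one possible approach, not an SDK feature: `split_results` and `min_confidence` are hypothetical names, and it assumes `confidence` is a number where higher means more certain. It relies only on the `result`, `result.confidence`, and `file_id` attributes used in the quick start. The demo objects are made-up stand-ins so the block runs without a network call.

```python
from types import SimpleNamespace


def split_results(items, min_confidence=0.0):
    """Partition classification items into (confident, uncertain, failed).

    Hypothetical helper: `failed` holds items whose result is None,
    `uncertain` holds items below `min_confidence`, the rest are `confident`.
    """
    confident, uncertain, failed = [], [], []
    for item in items:
        if item.result is None:
            failed.append(item)
        elif item.result.confidence < min_confidence:
            uncertain.append(item)
        else:
            confident.append(item)
    return confident, uncertain, failed


# Stand-in objects that mimic only the attributes read above (values are invented).
demo_items = [
    SimpleNamespace(file_id="f1", result=SimpleNamespace(type="invoice", confidence=0.93)),
    SimpleNamespace(file_id="f2", result=SimpleNamespace(type="receipt", confidence=0.41)),
    SimpleNamespace(file_id="f3", result=None),
]

confident, uncertain, failed = split_results(demo_items, min_confidence=0.6)
print("confident:", [i.file_id for i in confident])
print("uncertain:", [i.file_id for i in uncertain])
print("failed:", [i.file_id for i in failed])
```

With real output you would pass `results.items` instead of `demo_items`. Failed files can then be retried, and uncertain ones routed to manual review.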

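Tips for writing good rules

State explicitly which content features distinguish the type.
Include the key fields the document usually contains (for example, invoice number, total amount).
Add multiple rules when needed to cover different patterns.
Start simple, test on a small dataset, then refine step by step.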
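To make "test on a small dataset" concrete, one option is to hand-label a few files and compare the predicted types against those labels after each rule change. The following is a minimal sketch under stated assumptions: `accuracy_by_type` is a hypothetical helper, and the file names and labels are invented for illustration. How you map each classified file back to its name depends on your own bookkeeping and is not shown here.

```python
from collections import defaultdict


def accuracy_by_type(expected, predicted):
    """Return {type: (correct, total)} for a small hand-labeled set.

    expected:  {file name: true type}
    predicted: {file name: predicted type, or None / missing if the file failed}
    """
    counts = defaultdict(lambda: [0, 0])  # type -> [correct, total]
    for name, true_type in expected.items():
        counts[true_type][1] += 1
        if predicted.get(name) == true_type:
            counts[true_type][0] += 1
    return {t: (correct, total) for t, (correct, total) in counts.items()}


# Invented labels and predictions, for illustration only.
expected = {"a.pdf": "invoice", "b.pdf": "invoice", "c.pdf": "receipt"}
predicted = {"a.pdf": "invoice", "b.pdf": "receipt", "c.pdf": "receipt"}

scores = accuracy_by_type(expected, predicted)
for doc_type, (correct, total) in scores.items():
    print(f"{doc_type}: {correct}/{total} correct")
```

A type that scores poorly is a hint that its description needs sharper distinguishing features or an additional rule.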