OpenAI JSON Mode vs. Function Calling for Data Extraction

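OpenAI recently released JSON mode: a new configuration that constrains the LLM to generate only strings that parse into valid JSON (with no guarantee that the output validates against any particular schema).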

Before this, the best way to extract structured data from text was function calling.

In this notebook, we explore the tradeoffs between the new JSON mode and function calling for structured output and extraction.

Update: OpenAI has clarified that JSON mode is always enabled for function calling and is opt-in for regular messages (https://community.openai.com/t/json-mode-vs-function-calling/476994/4)
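To make the difference between "valid JSON" and "conforms to a schema" concrete, here is a minimal sketch using only the standard library. The `missing_keys` helper and the `raw_output` string are hypothetical and purely illustrative; the key names are taken from the CallSummary model defined later in this notebook.

```python
import json

# Field names borrowed from the CallSummary model used later in this notebook.
REQUIRED_KEYS = {"summary", "products", "rep_name", "prospect_name", "action_items"}


def missing_keys(raw: str) -> set:
    """Parse `raw` as a JSON object and return the expected keys it lacks."""
    data = json.loads(raw)  # raises json.JSONDecodeError if `raw` is not valid JSON
    return REQUIRED_KEYS - set(data)


# Hypothetical model output: it parses as JSON, but it is not the shape we want.
raw_output = '{"summary": "A short call.", "rep": "Sarah"}'
print(sorted(missing_keys(raw_output)))
# ['action_items', 'products', 'prospect_name', 'rep_name']
```

Parsing succeeds here even though most of the expected fields are absent, which is exactly the gap JSON mode leaves open.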

Generate synthetic data

We'll start by generating some synthetic data for our data extraction task. Let's ask our LLM for a hypothetical sales call transcript.

%pip install llama-index-llms-openai
%pip install llama-index-program-openai

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-1106")
response = llm.complete(
    "Generate a sales call transcript, use real names, talk about a product, discuss some action items"
)

transcript = response.text
print(transcript)

[Phone rings]

John: Hello, this is John.

Sarah: Hi John, this is Sarah from XYZ Company. I'm calling to discuss our new product, the XYZ Widget, and see if it might be a good fit for your business.

John: Hi Sarah, thanks for reaching out. I'm definitely interested in learning more about the XYZ Widget. Can you give me a quick overview of what it does?

Sarah: Of course! The XYZ Widget is a cutting-edge tool that helps businesses streamline their workflow and improve productivity. It's designed to automate repetitive tasks and provide real-time data analytics to help you make informed decisions.

John: That sounds really interesting. I can see how that could benefit our team. Do you have any case studies or success stories from other companies who have used the XYZ Widget?

Sarah: Absolutely, we have several case studies that I can share with you. I'll send those over along with some additional information about the product. I'd also love to schedule a demo for you and your team to see the XYZ Widget in action.

John: That would be great. I'll make sure to review the case studies and then we can set up a time for the demo. In the meantime, are there any specific action items or next steps we should take?

Sarah: Yes, I'll send over the information and then follow up with you to schedule the demo. In the meantime, feel free to reach out if you have any questions or need further information.

John: Sounds good, I appreciate your help Sarah. I'm looking forward to learning more about the XYZ Widget and seeing how it can benefit our business.

Sarah: Thank you, John. I'll be in touch soon. Have a great day!

John: You too, bye.

Set up our desired schema

Let's specify our desired output "shape" as a Pydantic model.

from pydantic import BaseModel, Field
from typing import List


class CallSummary(BaseModel):
    """Data model for a call summary."""

    summary: str = Field(
        description="High-level summary of the call transcript. Should not exceed 3 sentences."
    )
    products: List[str] = Field(
        description="List of products discussed in the call"
    )
    rep_name: str = Field(description="Name of the sales rep")
    prospect_name: str = Field(description="Name of the prospect")
    action_items: List[str] = Field(description="List of action items")

Data extraction with function calling

We can use the OpenAIPydanticProgram module in LlamaIndex to keep things simple: define a prompt template, then pass in the LLM and the Pydantic model we've already defined.

from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage

prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts."
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)
program = OpenAIPydanticProgram.from_defaults(
    output_cls=CallSummary,
    llm=llm,
    prompt=prompt,
    verbose=True,
)

output = program(transcript=transcript)

Function call: CallSummary with args: {"summary":"Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.","products":["XYZ Widget"],"rep_name":"Sarah","prospect_name":"John","action_items":["Review case studies","Schedule demo"]}

We now have the desired structured data as a Pydantic model. A quick inspection shows that the results are as we expected.

output.dict()

{'summary': 'Sarah from XYZ Company called to discuss the new product, the XYZ Widget, which John expressed interest in. Sarah offered to share case studies and schedule a demo. They agreed to review the case studies and set up a time for the demo. The next steps include Sarah sending over information and following up to schedule the demo.',
 'products': ['XYZ Widget'],
 'rep_name': 'Sarah',
 'prospect_name': 'John',
 'action_items': ['Review case studies', 'Schedule demo']}

Data extraction with JSON mode

Let's try to do the same with JSON mode instead of function calling.

prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON following the given schema below:\n"
                "{json_schema}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)

messages = prompt.format_messages(
    json_schema=CallSummary.schema_json(), transcript=transcript
)

output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content

We get valid JSON, but it only regurgitates the schema we specified and does not actually perform the extraction.

print(output)

{
  "title": "CallSummary",
  "description": "Data model for a call summary.",
  "type": "object",
  "properties": {
    "summary": {
      "title": "Summary",
      "description": "High-level summary of the call transcript. Should not exceed 3 sentences.",
      "type": "string"
    },
    "products": {
      "title": "Products",
      "description": "List of products discussed in the call",
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "rep_name": {
      "title": "Rep Name",
      "description": "Name of the sales rep",
      "type": "string"
    },
    "prospect_name": {
      "title": "Prospect Name",
      "description": "Name of the prospect",
      "type": "string"
    },
    "action_items": {
      "title": "Action Items",
      "description": "List of action items",
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": ["summary", "products", "rep_name", "prospect_name", "action_items"]
}
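A failure like this is easy to miss in a pipeline, because the response still parses cleanly. One possible guard is sketched below: a rough heuristic that flags output that looks like a JSON Schema rather than extracted data. The function name `looks_like_schema_echo` and both sample strings are hypothetical, and the check is only an assumption about what a schema echo tends to look like, not a general-purpose detector.

```python
import json


def looks_like_schema_echo(raw: str) -> bool:
    """Heuristic: True if `raw` parses to a JSON Schema object instead of data."""
    data = json.loads(raw)
    return (
        isinstance(data, dict)
        and data.get("type") == "object"
        and isinstance(data.get("properties"), dict)
    )


# Abbreviated, hypothetical stand-ins for the two kinds of response.
echoed = '{"title": "CallSummary", "type": "object", "properties": {"summary": {"type": "string"}}}'
extracted = '{"summary": "Sarah called John.", "products": ["XYZ Widget"]}'

print(looks_like_schema_echo(echoed))     # True
print(looks_like_schema_echo(extracted))  # False
```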

Let's try again by showing an example of the JSON format we want instead of specifying the schema.

import json

prompt = ChatPromptTemplate(
    message_templates=[
        ChatMessage(
            role="system",
            content=(
                "You are an expert assistant for summarizing and extracting insights from sales call transcripts.\n"
                "Generate a valid JSON in the following format:\n"
                "{json_example}"
            ),
        ),
        ChatMessage(
            role="user",
            content=(
                "Here is the transcript: \n"
                "------\n"
                "{transcript}\n"
                "------"
            ),
        ),
    ]
)

dict_example = {
    "summary": "High-level summary of the call transcript. Should not exceed 3 sentences.",
    "products": ["product 1", "product 2"],
    "rep_name": "Name of the sales rep",
    "prospect_name": "Name of the prospect",
    "action_items": ["action item 1", "action item 2"],
}

json_example = json.dumps(dict_example)

messages = prompt.format_messages(
    json_example=json_example, transcript=transcript
)

output = llm.chat(
    messages, response_format={"type": "json_object"}
).message.content

Now we are able to get the extracted structured data as expected.

print(output)

{
  "summary": "Sarah from XYZ Company called John to discuss the new product, the XYZ Widget, which is designed to streamline workflow and improve productivity. They discussed case studies and scheduling a demo for John and his team. The next steps include Sarah sending over information and following up to schedule the demo.",
  "products": ["XYZ Widget"],
  "rep_name": "Sarah",
  "prospect_name": "John",
  "action_items": ["Review case studies", "Schedule demo"]
}
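Because JSON mode returns a plain string, nothing has checked it against the schema yet. A minimal sketch of closing that gap is to load the string back into the Pydantic model. The `parse_call_summary` helper and the `good` and `bad` sample strings are hypothetical; the model is redeclared without field descriptions only so the block runs on its own.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class CallSummary(BaseModel):
    """Same fields as the CallSummary model defined earlier (descriptions omitted)."""

    summary: str
    products: List[str]
    rep_name: str
    prospect_name: str
    action_items: List[str]


def parse_call_summary(raw: str) -> Optional[CallSummary]:
    """Return a validated CallSummary, or None if the JSON does not fit the model."""
    try:
        return CallSummary(**json.loads(raw))
    except ValidationError as err:
        print(f"Schema validation failed: {err}")
        return None


# Hypothetical responses: one well-formed, one with a wrong type and missing fields.
good = (
    '{"summary": "s", "products": ["XYZ Widget"], "rep_name": "Sarah", '
    '"prospect_name": "John", "action_items": ["Schedule demo"]}'
)
bad = '{"summary": "s", "products": "XYZ Widget"}'

print(parse_call_summary(good) is not None)
print(parse_call_summary(bad) is None)
```

With function calling, OpenAIPydanticProgram hands back a model instance directly; with JSON mode, a step like this has to be added by hand.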

Quick takeaways

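For structured data extraction, function calling is still easier to use (especially if you have already specified your schema as, for example, a Pydantic model).
While JSON mode enforces the format of the output, it does not help with validation against a specified schema. Passing in a schema directly may not produce the expected JSON, and may require additional careful formatting and prompting.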