April 8, 2025

Evals API Use-case - Bulk model and prompt experimentation

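Evals are task-oriented and iterative. They are the best way to check how your LLM integration is doing and to improve it.

In the following eval, we focus on the task of testing many variants of models and prompts.

Our use-case is:

  1. I want to get the best possible performance out of my push notifications summarizer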

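Evals structure

Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". One Eval has many Runs, and each Run is evaluated against your testing criteria.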
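As a rough mental model only, the one-to-many relationship between an Eval and its Runs can be sketched with plain dataclasses. The class and field names below (EvalSketch, RunSketch) are hypothetical illustrations and are not the SDK's types.

```python
from dataclasses import dataclass, field


@dataclass
class RunSketch:
    # Hypothetical stand-in for a Run: one model/prompt configuration.
    name: str
    model: str


@dataclass
class EvalSketch:
    # Hypothetical stand-in for an Eval: shared criteria plus the Runs attached to it.
    name: str
    testing_criteria: list[str]
    runs: list[RunSketch] = field(default_factory=list)

    def add_run(self, run: RunSketch) -> None:
        # Every Run added here is judged by the same testing_criteria.
        self.runs.append(run)


sketch = EvalSketch(name="push-notifications", testing_criteria=["summary-grader"])
sketch.add_run(RunSketch(name="basic_gpt-4o", model="gpt-4o"))
sketch.add_run(RunSketch(name="basic_gpt-4o-mini", model="gpt-4o-mini"))
print(len(sketch.runs))
```

The point of the split is that the criteria are defined once on the Eval, so results from different Runs stay comparable.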

import pydantic
import openai
from openai.types.chat import ChatCompletion
import os

os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")

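Use-case

We're testing the following integration: a push notifications summarizer, which takes in multiple push notifications and collapses them into a single message.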

class PushNotifications(pydantic.BaseModel):
    notifications: str

print(PushNotifications.model_json_schema())
DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
"""

def summarize_push_notification(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

example_push_notifications_list = PushNotifications(notifications="""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""")
result = summarize_push_notification(example_push_notifications_list.notifications)
print(result.choices[0].message.content)

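Setting up your eval

An Eval holds the configuration that is shared across multiple Runs. It has two components:

  1. Data source configuration data_source_config - the schema (columns) that your future Runs conform to.
    • The data_source_config uses JSON Schema to define what variables are available in the Eval.
  2. Testing criteria testing_criteria - how you'll determine if your integration is working for each row of your data source.

For this use-case, we want to test whether the push notification summary completion is good, so we'll set up our eval with this in mind.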

# We want our input data to be available in our variables, so we set the item_schema to
# PushNotifications.model_json_schema()
data_source_config = {
    "type": "custom",
    "item_schema": PushNotifications.model_json_schema(),
    # We're going to be uploading completions from the API, so we tell the Eval to expect this
    "include_sample_schema": True,
}

This data_source_config defines what variables are available throughout the eval.

The item schema:

{
  "properties": {
    "notifications": {
      "title": "Notifications",
      "type": "string"
    }
  },
  "required": ["notifications"],
  "title": "PushNotifications",
  "type": "object"
}

This means that we'll have the variable {{item.notifications}} available in our eval.

"include_sample_schema": True means that we'll have the variable {{sample.output_text}} available in our eval.
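To make the variable syntax concrete, here is a minimal local sketch of how {{item.notifications}} and {{sample.output_text}} could be substituted into a template. The render_template helper is hypothetical and only approximates the idea; the actual templating is done by the Evals API and may behave differently.

```python
import re


def render_template(template: str, variables: dict) -> str:
    """Replace {{a.b}} placeholders with values looked up in a nested dict."""

    def lookup(match: re.Match) -> str:
        value = variables
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the variable is not defined
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# Illustrative values: "item" comes from your data, "sample" from the model output.
variables = {
    "item": {"notifications": "- Your package has been delivered!"},
    "sample": {"output_text": "Package delivered."},
}
rendered = render_template(
    "<push_notifications>{{item.notifications}}</push_notifications>\n"
    "<summary>{{sample.output_text}}</summary>",
    variables,
)
print(rendered)
```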

Now, we'll use those variables to set up our testing criteria.

GRADER_DEVELOPER_PROMPT = """
Categorize the following push notification summary into the following categories:
1. concise-and-snappy
2. drops-important-information
3. verbose
4. unclear
5. obscures-meaning
6. other 

You'll be given the original list of push notifications and the summary like this:

<push_notifications>
...notificationlist...
</push_notifications>
<summary>
...summary...
</summary>

You should only pick one of the categories above, pick the one which most closely matches and why.
"""
GRADER_TEMPLATE_PROMPT = """
<push_notifications>{{item.notifications}}</push_notifications>
<summary>{{sample.output_text}}</summary>
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["concise-and-snappy"],
    "labels": [
        "concise-and-snappy",
        "drops-important-information",
        "verbose",
        "unclear",
        "obscures-meaning",
        "other",
    ],
}

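The push_notification_grader is a model grader (LLM-as-a-judge) that looks at the input {{item.notifications}} and the generated summary {{sample.output_text}} and assigns the summary one of the labels listed in "labels", such as "concise-and-snappy" or "verbose". We then use "passing_labels" to define which labels count as a passing answer; here, only "concise-and-snappy" passes.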

Note: under the hood, this uses structured outputs so that labels are always valid.
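The effect of "passing_labels" can be illustrated with a small local sketch. The pass_rate helper and the example labels below are hypothetical sample data, not output from a real run, and the API's own aggregation may differ.

```python
PASSING_LABELS = {"concise-and-snappy"}


def pass_rate(labels: list[str], passing_labels: set[str]) -> float:
    """Return the fraction of graded samples whose label is in passing_labels."""
    if not labels:
        return 0.0
    passed = sum(1 for label in labels if label in passing_labels)
    return passed / len(labels)


# Hypothetical labels for four graded summaries (illustrative only).
example_labels = ["concise-and-snappy", "verbose", "concise-and-snappy", "unclear"]
print(pass_rate(example_labels, PASSING_LABELS))
```

Keeping the full label set while passing only one label means a failing run still tells you why it failed (too verbose, unclear, and so on).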

Now we'll create our eval and start adding data to it!

eval_create_result = openai.evals.create(
    name="Push Notification Bulk Experimentation Eval",
    metadata={
        "description": "This eval tests many prompts and models to find the best performing combination.",
    },
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)
eval_id = eval_create_result.id

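Creating runs

Now that we have our eval set up with our testing criteria, we can start adding a bunch of runs! We'll start with some push notification data.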

push_notification_data = [
        """
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
        """
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
        """
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
        """
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
        """
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
        """
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
        """
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
        """
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
        """
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
        """
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]

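Now we're going to set up a bunch of prompts to test.

We want to test a basic prompt, with a couple of variations:

  1. In one variation, we'll just have the basic prompt
  2. In the next one, we'll include some positive examples of what we want the summaries to look like
  3. In the final one, we'll include both positive and negative examples.

We'll also include a list of models to use.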

PROMPT_PREFIX = """
You are a helpful assistant that takes in an array of push notifications and returns a collapsed summary of them.
The push notification will be provided as follows:
<push_notifications>
...notificationlist...
</push_notifications>

You should return just the summary and nothing else.
"""

PROMPT_VARIATION_BASIC = f"""
{PROMPT_PREFIX}

You should return a summary that is concise and snappy.
"""

PROMPT_VARIATION_WITH_EXAMPLES = f"""
{PROMPT_VARIATION_BASIC}

Here is an example of a good summary:
<push_notifications>
- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.
</push_notifications>
<summary>
Traffic alert, package expected by 5pm, suggestion for new friend (Emma).
</summary>
"""

PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES = f"""
{PROMPT_VARIATION_WITH_EXAMPLES}

Here is an example of a bad summary:
<push_notifications>
- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.
</push_notifications>
<summary>
Traffic alert reported on main street. You have a package that will arrive by 5pm, Emma is a new friend suggested for you.
</summary>
"""

prompts = [
    ("basic", PROMPT_VARIATION_BASIC),
    ("with_examples", PROMPT_VARIATION_WITH_EXAMPLES),
    ("with_negative_examples", PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES),
]

models = ["gpt-4o", "gpt-4o-mini", "o3-mini"]

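Now we can just loop through all prompts and all models to test a bunch of configurations at once!

We'll use the "completions" run data source with template variables for our push notification list.

OpenAI will handle making the completions calls for you and populating "sample.output_text".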

for prompt_name, prompt in prompts:
    for model in models:
        run_data_source = {
            "type": "completions",
            "input_messages": {
                "type": "template",
                "template": [
                    {
                        "role": "developer",
                        "content": prompt,
                    },
                    {
                        "role": "user",
                        "content": "<push_notifications>{{item.notifications}}</push_notifications>",
                    },
                ],
            },
            "model": model,
            "source": {
                "type": "file_content",
                "content": [
                    {
                        "item": PushNotifications(notifications=notification).model_dump()
                    }
                    for notification in push_notification_data
                ],
            },
        }

        run_create_result = openai.evals.runs.create(
            eval_id=eval_id,
            name=f"bulk_{prompt_name}_{model}",
            data_source=run_data_source,
        )
        print(f"Report URL {model}, {prompt_name}:", run_create_result.report_url)
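The nested loop above creates one run per (prompt, model) pair. As a quick sanity check before spending API calls, the same grid can be previewed locally. This is a minimal sketch that repeats the prompt names and model names from above as plain lists and makes no API requests.

```python
from itertools import product

# Names copied from the prompts and models lists defined earlier.
prompt_names = ["basic", "with_examples", "with_negative_examples"]
model_names = ["gpt-4o", "gpt-4o-mini", "o3-mini"]

# One run name per (prompt, model) combination, in the same order as the nested loop.
run_names = [
    f"bulk_{prompt_name}_{model}"
    for prompt_name, model in product(prompt_names, model_names)
]
print(len(run_names))
for run_name in run_names:
    print(run_name)
```

With three prompts and three models this yields nine runs, each sending the ten notification lists to the model, so the number of completions grows multiplicatively as you add variants.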