April 8, 2025

Evals API Use Case - Detecting Prompt Regressions

Evals are task-oriented and iterative; they're the best way to check how your LLM integration is doing and improve it.

In the following eval, we will focus on the task of detecting whether changes to my prompt cause a regression.

Our use case is:

  1. I have an LLM integration that takes a list of push notifications and summarizes them into a single, concise statement.
  2. I want to detect whether a prompt change regresses this behavior.

Evals structure

Evals have two parts, the "Eval" and the "Runs". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can have many Runs, each of which is evaluated against your testing criteria.

import openai
from openai.types.chat import ChatCompletion
import pydantic
import os

# Set your API key here if it isn't already in the environment.
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")

Use case

We're testing the following integration: push notification summarization. It takes multiple push notifications and collapses them into a single one, in one chat completions call.

class PushNotifications(pydantic.BaseModel):
    notifications: str

print(PushNotifications.model_json_schema())

DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
"""

def summarize_push_notification(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

example_push_notifications_list = PushNotifications(notifications="""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""")
result = summarize_push_notification(example_push_notifications_list.notifications)
print(result.choices[0].message.content)

Setting up your eval

An Eval holds the configuration that is shared across multiple Runs. It has two components:

  1. Data source configuration data_source_config - the schema (columns) that your future Runs conform to.
    • The data_source_config uses JSON Schema to define what variables are available in the Eval.
  2. Testing criteria testing_criteria - how to determine if your integration is working for each row of your data source.

For this use case, we want to test whether the push notification summaries we generate are good, so we'll set up our eval around that goal.

# We want our input data to be available in our variables, so we set the item_schema to
# PushNotifications.model_json_schema()
data_source_config = {
    "type": "custom",
    "item_schema": PushNotifications.model_json_schema(),
    # We're going to be uploading completions from the API, so we tell the Eval to expect this
    "include_sample_schema": True,
}

This data_source_config defines what variables are available throughout the eval.

The item schema:

{
  "properties": {
    "notifications": {
      "title": "Notifications",
      "type": "string"
    }
  },
  "required": ["notifications"],
  "title": "PushNotifications",
  "type": "object"
}

This means that we'll have the variable {{item.notifications}} available in our eval.

"include_sample_schema": True 意味着我们将在评估中使用变量 {{sample.output_text}}

Next, we'll use those variables to set up our test criteria.

GRADER_DEVELOPER_PROMPT = """
Label the following push notification summary as either correct or incorrect.
The push notification and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct, if not, then incorrect.
"""
GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.notifications}}
Summary: {{sample.output_text}}
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["correct"],
    "labels": ["correct", "incorrect"],
}

The push_notification_grader is a model grader (llm-as-a-judge) that looks at the input {{item.notifications}} and the generated summary {{sample.output_text}} and labels it "correct" or "incorrect". We then instruct, via the "passing_labels" field, what constitutes a passing answer.

Note: under the hood, this uses structured outputs so that the labels are always valid.
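
Conceptually, that guarantee amounts to constraining the judge's verdict to the declared labels. A minimal, illustrative sketch of such a constraint (the actual internal schema isn't exposed by the API):

# Illustrative only: a JSON Schema that forces the verdict to be one of our labels.
label_schema = {
    "type": "object",
    "properties": {"label": {"type": "string", "enum": ["correct", "incorrect"]}},
    "required": ["label"],
}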

Now let's create our eval and start adding data to it!

eval_create_result = openai.evals.create(
    name="Push Notification Summary Workflow",
    metadata={
        "description": "This eval checks if the push notification summary is correct.",
    },
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)

eval_id = eval_create_result.id
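
Hold on to eval_id: every run is attached to it, and you can fetch the eval back later (for example, from a CI job). A quick sketch, following the same module-level client style used above:

# Retrieve the eval by ID, e.g., from a separate script or CI job.
fetched_eval = openai.evals.retrieve(eval_id)
print(fetched_eval.name)  # "Push Notification Summary Workflow"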

Creating runs

Now that we have our eval set up with its test_criteria, we can start adding runs! We'll start with some push notification data.

push_notification_data = [
        """
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
        """
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
        """
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
        """
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
        """
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
        """
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
        """
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
        """
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
        """
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
        """
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]

For our first run, we'll use the summarize_push_notification completions function defined above. We'll loop over the dataset, make the completion calls, and then submit them as a run to be graded.

run_data = []
for push_notifications in push_notification_data:
    result = summarize_push_notification(push_notifications)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": result.model_dump()
    })

eval_run_result = openai.evals.runs.create(
    eval_id=eval_id,
    name="baseline-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(eval_run_result)
# Check out the results in the UI
print(eval_run_result.report_url)
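
Runs are graded asynchronously, so besides checking the UI you can poll until grading finishes. A sketch, assuming the run object exposes the status and result_counts fields documented for the Evals API:

import time

# Poll the run until grading completes, then print the pass/fail tallies.
# result_counts (passed/failed/errored/total) is assumed per the Evals API run object.
while True:
    run = openai.evals.runs.retrieve(eval_run_result.id, eval_id=eval_id)
    if run.status in ("completed", "failed", "canceled"):
        break
    time.sleep(5)
print(run.status)
print(run.result_counts)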

Now, let's simulate a regression: below is the original prompt again for reference, followed by a version where a developer has "broken" it.

# Original prompt (for reference):
DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
"""
# "Broken" prompt: the summaries will now be long-winded instead of concise.
DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
You should make the summary longer than it needs to be and include more information than is necessary.
"""

def summarize_push_notification_bad(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

run_data = []
for push_notifications in push_notification_data:
    result = summarize_push_notification_bad(push_notifications)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": result.model_dump()
    })

eval_run_result = openai.evals.runs.create(
    eval_id=eval_id,
    name="regression-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(eval_run_result.report_url)

If you use the Responses API rather than Chat Completions, you can grade with the same eval: generate the response, then transform it into the chat-completion shape that the run's "sample" field expects.

def summarize_push_notification_responses(push_notifications: str):
    result = openai.responses.create(
        model="gpt-4o",
        input=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

def transform_response_to_completion(response):
    # Reshape a Responses API result into the chat-completion format
    # that the eval run's "sample" field expects.
    completion = {
        "model": response.model,
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": response.output_text,
            },
            "finish_reason": "stop",
        }],
    }
    return completion

run_data = []
for push_notifications in push_notification_data:
    response = summarize_push_notification_responses(push_notifications)
    completion = transform_response_to_completion(response)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": completion
    })

report_response = openai.evals.runs.create(
    eval_id=eval_id,
    name="responses-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(report_response.report_url)
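
To close the loop and actually catch the regression automatically, you can list the runs on this eval and compare pass rates: the "regression-run" should score noticeably below "baseline-run". A sketch, under the same result_counts assumption as the polling example above:

# Compare pass rates across all runs on this eval; a drop versus the
# baseline flags a prompt regression. Fields assumed as in the polling sketch.
for run in openai.evals.runs.list(eval_id=eval_id):
    counts = run.result_counts
    print(f"{run.name}: {counts.passed}/{counts.total} passed")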