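Evals structure

Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval has many Runs, each of which is evaluated against your testing criteria.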
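
The one-to-many relationship can be pictured with a small sketch. The class and field names below (EvalSketch, RunSketch, add_run) are hypothetical and are not the Evals API's actual types; they only illustrate that one Eval holds the shared configuration and owns many Runs.

```python
from dataclasses import dataclass, field


@dataclass
class RunSketch:
    """Hypothetical stand-in for a Run: one model/prompt configuration to evaluate."""
    name: str
    model: str


@dataclass
class EvalSketch:
    """Hypothetical stand-in for an Eval: shared config plus the Runs attached to it."""
    name: str
    data_source_config: dict
    testing_criteria: list
    runs: list = field(default_factory=list)

    def add_run(self, run: RunSketch) -> None:
        # Every Run attached here is judged against the same testing_criteria.
        self.runs.append(run)


sketch = EvalSketch(name="demo", data_source_config={"type": "custom"}, testing_criteria=[])
sketch.add_run(RunSketch(name="run-a", model="gpt-4o-mini"))
sketch.add_run(RunSketch(name="run-b", model="gpt-4o"))
print(len(sketch.runs))
```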

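Use case

We're testing the following integration: a push notification summarizer that takes in multiple push notifications and collapses them into a single message.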

import openai
import pydantic
from openai.types.chat import ChatCompletion


# Assumed definition, reconstructed from the item schema shown later in this guide.
class PushNotifications(pydantic.BaseModel):
    notifications: str


DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
"""


def summarize_push_notification(push_notifications: str) -> ChatCompletion:
    # Send the raw notification list to the model together with the summarizer prompt.
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result


example_push_notifications_list = PushNotifications(notifications="""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""")
result = summarize_push_notification(example_push_notifications_list.notifications)
print(result.choices[0].message.content)

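Setting up your eval

An Eval holds the configuration that is shared across multiple Runs. It has two components:

- Data source configuration (data_source_config): the schema (columns) that your future Runs conform to. The data_source_config uses JSON Schema to define which variables are available in the Eval.
- Testing criteria (testing_criteria): how you'll determine whether your integration is working for each row of your data source.

For this use case, we want to test whether the push notification summaries are any good, so we'll set up our eval with that goal in mind.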

# We want our input data to be available in our variables, so we set the item_schema to
# PushNotifications.model_json_schema()
data_source_config = {
    "type": "custom",
    "item_schema": PushNotifications.model_json_schema(),
    # We're going to be uploading completions from the API, so we tell the Eval to expect this
    "include_sample_schema": True,
}

This data_source_config defines which variables are available throughout the eval.

The item schema:

{
  "properties": {
    "notifications": {
      "title": "Notifications",
      "type": "string"
    }
  },
  "required": ["notifications"],
  "title": "PushNotifications",
  "type": "object"
}

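This means that we'll have the variable {{item.notifications}} available in our eval.

"include_sample_schema": True means that we'll have the variable {{sample.output_text}} available in our eval.

Next, we'll use those variables to set up our testing criteria.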
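
To make the variable substitution concrete, here is a minimal sketch of how placeholders such as {{item.notifications}} and {{sample.output_text}} could be filled in. The render_template helper and the example values are hypothetical; this guide does not describe how the Evals service renders templates internally, so treat this as an illustration only.

```python
import re


def render_template(template: str, variables: dict) -> str:
    """Replace {{dotted.path}} placeholders with values looked up in nested dicts."""

    def lookup(match: re.Match) -> str:
        value = variables
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the variable is not defined
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# Hypothetical row ("item") and model output ("sample") for illustration.
variables = {
    "item": {"notifications": "- Your package has been delivered!"},
    "sample": {"output_text": "Package delivered."},
}
rendered = render_template(
    "<push_notifications>{{item.notifications}}</push_notifications>"
    "<summary>{{sample.output_text}}</summary>",
    variables,
)
print(rendered)
```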

GRADER_DEVELOPER_PROMPT = """
Categorize the following push notification summary into the following categories:
1. concise-and-snappy
2. drops-important-information
3. verbose
4. unclear
5. obscures-meaning
6. other

You'll be given the original list of push notifications and the summary like this:

<push_notifications>
...notificationlist...
</push_notifications>
<summary>
...summary...
</summary>

You should only pick one of the categories above, pick the one which most closely matches and why.
"""

GRADER_TEMPLATE_PROMPT = """
<push_notifications>{{item.notifications}}</push_notifications>
<summary>{{sample.output_text}}</summary>
"""

push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["concise-and-snappy"],
    "labels": [
        "concise-and-snappy",
        "drops-important-information",
        "verbose",
        "unclear",
        "obscures-meaning",
        "other",
    ],
}

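The push_notification_grader is a model grader (LLM-as-a-judge). It looks at the input {{item.notifications}} and the generated summary {{sample.output_text}}, and assigns the summary one of the categories listed in "labels". We then use "passing_labels" to define which labels count as a passing answer; here, only "concise-and-snappy" passes.

Note: under the hood, this uses structured outputs so that the labels are always valid.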
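
As a rough illustration of how "labels" and "passing_labels" relate, the sketch below turns grader labels into a pass rate. The is_passing and pass_rate helpers and the example label list are hypothetical, not real results and not part of the Evals API, which performs this bookkeeping for you.

```python
LABELS = [
    "concise-and-snappy",
    "drops-important-information",
    "verbose",
    "unclear",
    "obscures-meaning",
    "other",
]
PASSING_LABELS = {"concise-and-snappy"}


def is_passing(label: str) -> bool:
    """Return True if the grader's label counts as a pass."""
    if label not in LABELS:
        # Structured outputs should prevent this, but guard anyway in the sketch.
        raise ValueError(f"unknown label: {label!r}")
    return label in PASSING_LABELS


def pass_rate(assigned_labels: list) -> float:
    """Fraction of graded samples whose label is in PASSING_LABELS."""
    return sum(is_passing(label) for label in assigned_labels) / len(assigned_labels)


# Hypothetical labels for four samples, for illustration only.
example_labels = ["concise-and-snappy", "verbose", "concise-and-snappy", "unclear"]
print(pass_rate(example_labels))
```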

Now we'll create our eval and start adding data to it!

eval_create_result = openai.evals.create(
    name="Push Notification Bulk Experimentation Eval",
    metadata={
        "description": "This eval tests many prompts and models to find the best performing combination.",
    },
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)
eval_id = eval_create_result.id

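Creating runs

Now that we have our eval set up with our testing criteria, we can start adding a bunch of runs! We'll start with some push notification data.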

push_notification_data = [
    """
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
    """
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
    """
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
    """
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
    """
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
    """
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
    """
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
    """
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
    """
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
    """
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
""",
]

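Now we'll set up a bunch of prompts to test.

We want to test a basic prompt with a couple of variations:

- In one variation, we use only the basic prompt.
- In the next, we include a positive example of what we want the summaries to look like.
- In the last, we include both a positive and a negative example.

We'll also include a list of models to use.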

PROMPT_PREFIX = """
You are a helpful assistant that takes in an array of push notifications and returns a collapsed summary of them.
The push notification will be provided as follows:
<push_notifications>
...notificationlist...
</push_notifications>

You should return just the summary and nothing else.
"""

PROMPT_VARIATION_BASIC = f"""
{PROMPT_PREFIX}

You should return a summary that is concise and snappy.
"""

# Builds on the basic prompt by adding one example of a good summary.
PROMPT_VARIATION_WITH_EXAMPLES = f"""
{PROMPT_VARIATION_BASIC}

Here is an example of a good summary:
<push_notifications>
- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.
</push_notifications>
<summary>
Traffic alert, package expected by 5pm, suggestion for new friend (Emma).
</summary>
"""

# Builds on the previous prompt by adding one example of a bad (too verbose) summary.
PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES = f"""
{PROMPT_VARIATION_WITH_EXAMPLES}

Here is an example of a bad summary:
<push_notifications>
- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.
</push_notifications>
<summary>
Traffic alert reported on main street. You have a package that will arrive by 5pm, Emma is a new friend suggested for you.
</summary>
"""

prompts = [
    ("basic", PROMPT_VARIATION_BASIC),
    ("with_examples", PROMPT_VARIATION_WITH_EXAMPLES),
    ("with_negative_examples", PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES),
]

models = ["gpt-4o", "gpt-4o-mini", "o3-mini"]

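Now we can simply loop through all prompts and all models to test a bunch of configurations in one go!

We'll use the "completions" run data source, with a template variable for our push notification list.

OpenAI will handle making the completions calls for you and populating "sample.output_text".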

for prompt_name, prompt in prompts:
    for model in models:
        run_data_source = {
            "type": "completions",
            "input_messages": {
                "type": "template",
                "template": [
                    {
                        "role": "developer",
                        "content": prompt,
                    },
                    {
                        "role": "user",
                        "content": "<push_notifications>{{item.notifications}}</push_notifications>",
                    },
                ],
            },
            "model": model,
            "source": {
                "type": "file_content",
                "content": [
                    {
                        "item": PushNotifications(notifications=notification).model_dump()
                    }
                    for notification in push_notification_data
                ],
            },
        }

        run_create_result = openai.evals.runs.create(
            eval_id=eval_id,
            name=f"bulk_{prompt_name}_{model}",
            data_source=run_data_source,
        )
        print(f"Report URL {model}, {prompt_name}:", run_create_result.report_url)

April 8, 2025

Evals API Use Case - Bulk Model and Prompt Experimentation

Congratulations, you just tested 9 different prompt and model variations across your dataset!
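
The count of 9 comes from pairing each of the 3 prompt variations with each of the 3 models. A minimal sketch of that enumeration, reusing the prompt names, model names, and run-name pattern from the loop above (no API calls are made here):

```python
from itertools import product

prompt_names = ["basic", "with_examples", "with_negative_examples"]
models = ["gpt-4o", "gpt-4o-mini", "o3-mini"]

# One run name per (prompt, model) pair, following the bulk_{prompt_name}_{model} pattern.
run_names = [f"bulk_{prompt_name}_{model}" for prompt_name, model in product(prompt_names, models)]

print(len(run_names))
for run_name in run_names:
    print(run_name)
```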