April 8, 2025

Evals API Use Case - Monitoring Stored Completions

Evals are task-oriented and iterative; they're the best way to check how your LLM integration is doing and to improve it.

In the following eval, we are going to focus on the task of detecting whether a prompt change has introduced a regression.

Our use case is:

  1. We have been logging chat completion requests in production by setting store=True on our chat completion calls. Note that you can also enable "on by default" logging in your admin panel (https://platform.openai.com/settings/organization/data-controls/data-retention).
  2. We want to see whether a prompt change has introduced a regression.

Evals structure

Evals have two parts: the "Eval" and the "Runs". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can have many Runs, each of which is evaluated against your testing criteria.

from openai import AsyncOpenAI
import os
import asyncio

os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")
client = AsyncOpenAI()

Use case

The integration we're testing is a push notification summarizer: it takes multiple push notifications and collapses them into a single one, in a single chat completion call.

Generating our test data

I'm going to simulate production chat completion requests with two different versions of a prompt to see how each performs. The first is a "good" prompt, the second is a "bad" prompt. These will carry different metadata, which we'll use later.

push_notification_data = [
        """
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
        """
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
        """
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
        """
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
        """
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
        """
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
        """
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
        """
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
        """
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
        """
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]
PROMPTS = [
    (
        """
        You are a helpful assistant that summarizes push notifications.
        You are given a list of push notifications and you need to collapse them into a single one.
        Output only the final summary, nothing else.
        """,
        "v1"
    ),
    (
        """
        You are a helpful assistant that summarizes push notifications.
        You are given a list of push notifications and you need to collapse them into a single one.
        The summary should be longer than it needs to be and include more information than is necessary.
        Output only the final summary, nothing else.
        """,
        "v2"
    )
]

tasks = []
for notifications in push_notification_data:
    for (prompt, version) in PROMPTS:
        tasks.append(client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "developer", "content": prompt},
                {"role": "user", "content": notifications},
            ],
            store=True,
            metadata={"prompt_version": version, "usecase": "push_notifications_summarizer"},
        ))
await asyncio.gather(*tasks)

You can view the completions you just created at https://platform.openai.com/logs.

Make sure the chat completions show up, as they are required for the next step.

completions = await client.chat.completions.list()
assert completions.data, "No completions found. You may need to enable logs in your admin panel."
completions.data[0]
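
Before wiring up the eval, it can help to sanity-check that the stored completions carry the metadata we'll filter on later. Here is a minimal sketch of that grouping logic, using mock records in place of real API objects (the record shape is illustrative; real completion objects expose a `metadata` attribute with the keys set at creation time):

```python
from collections import Counter

# Mock stand-ins for stored completion records; in practice you'd read
# the metadata off the objects returned by client.chat.completions.list().
mock_completions = [
    {"metadata": {"prompt_version": "v1", "usecase": "push_notifications_summarizer"}},
    {"metadata": {"prompt_version": "v2", "usecase": "push_notifications_summarizer"}},
    {"metadata": {"prompt_version": "v1", "usecase": "push_notifications_summarizer"}},
]

# Count how many stored completions exist per prompt version.
counts = Counter(c["metadata"]["prompt_version"] for c in mock_completions)
print(counts)
```

If either version is missing here, the corresponding run later on would have no data to grade.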

Setting up your eval

An Eval holds the configuration that is shared across multiple Runs. It has two components:

  1. Data source configuration data_source_config - the schema (columns) that your future Runs conform to.
    • The data_source_config uses JSON Schema to define the variables available in the Eval.
  2. Testing criteria testing_criteria - how to determine whether your integration is working for each row of your data source.

For this use case, we're using stored completions, so we'll set up the corresponding data source configuration.

Important: You will likely have many different stored-completions use cases; metadata is the best way to keep track of them so that your evals stay focused and task-oriented.

# For stored completions, the data source config is a metadata filter:
# it scopes the eval to completions from this use case.
data_source_config = {
    "type": "stored_completions",
    "metadata": {
        "usecase": "push_notifications_summarizer"
    }
}

This data_source_config defines the variables that are available throughout the eval.

The stored completions config provides two variables for you to use in your eval:

  1. {{item.input}} - the messages sent to the completions call
  2. {{sample.output_text}} - the text response from the assistant

Next, we'll use those variables to set up our testing criteria.
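
To make the templating concrete, here's a rough local illustration of how those {{...}} placeholders get filled in at grading time. This is a mock of the substitution, not the actual server-side implementation:

```python
import re

def render(template: str, variables: dict) -> str:
    """Replace {{dotted.name}} placeholders with values from `variables`."""
    return re.sub(
        r"\{\{\s*([\w.]+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )

template = "Push notifications: {{item.input}}\nSummary: {{sample.output_text}}"
variables = {
    "item.input": "- Your package has been delivered!",
    "sample.output_text": "Package delivered.",
}
print(render(template, variables))
```

The grader prompt below uses exactly these two placeholders.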

GRADER_DEVELOPER_PROMPT = """
Label the following push notification summary as either correct or incorrect.
The push notification and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct, if not, then incorrect.
"""
GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.input}}
Summary: {{sample.output_text}}
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["correct"],
    "labels": ["correct", "incorrect"],
}

The push_notification_grader is a model grader (llm-as-a-judge) that looks at the input {{item.input}} and the generated summary {{sample.output_text}} and labels it "correct" or "incorrect".

Note: under the hood, this uses structured outputs so that the labels are always valid.
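
The idea behind that note can be sketched locally: the grader's answer is constrained to a fixed set of labels, much like validating against an enum in a JSON Schema. The schema and validator below are illustrative, not the exact ones the API uses:

```python
# Illustrative schema: the grader may only emit one of the allowed labels.
label_schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": ["correct", "incorrect"]},
    },
    "required": ["label"],
}

def is_valid_label(response: dict) -> bool:
    """Hand-rolled check of a grader response against the allowed labels."""
    allowed = label_schema["properties"]["label"]["enum"]
    return response.get("label") in allowed

print(is_valid_label({"label": "correct"}))    # True
print(is_valid_label({"label": "excellent"}))  # False
```

Because the output is constrained this way, a run never produces a label outside passing_labels/labels.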

Now we'll create our eval and start adding data to it!

eval_create_result = await client.evals.create(
    name="Push Notification Completion Monitoring",
    metadata={"description": "This eval monitors completions"},
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)

eval_id = eval_create_result.id

Creating runs

Now that we have our eval set up with our test_criteria, we can start adding runs. I want to compare performance between the two prompt versions.

To do this, we just define our data source as "stored_completions" with a metadata filter for each prompt version.

# Grade prompt_version=v1
eval_run_result = await client.evals.runs.create(
    eval_id=eval_id,
    name="v1-run",
    data_source={
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {
                "prompt_version": "v1",
            }
        }
    }
)
print(eval_run_result.report_url)
# Grade prompt_version=v2
eval_run_result_v2 = await client.evals.runs.create(
    eval_id=eval_id,
    name="v2-run",
    data_source={
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {
                "prompt_version": "v2",
            }
        }
    }
)
print(eval_run_result_v2.report_url)

To be thorough, let's see how this prompt performs on 4o, rather than 4o-mini, using both versions of the prompt as a starting point.

All we need to do is reference the input messages ({{item.input}}) and set the model to 4o. Since we don't have any stored completions from 4o, this eval run will generate new completions.

tasks = []
for prompt_version in ["v1", "v2"]:
    tasks.append(client.evals.runs.create(
        eval_id=eval_id,
        name=f"post-fix-new-model-run-{prompt_version}",
        data_source={
            "type": "completions",
            "input_messages": {
                "type": "item_reference",
                "item_reference": "item.input",
            },
            "model": "gpt-4o",
            "source": {
                "type": "stored_completions",
                "metadata": {
                    "prompt_version": prompt_version,
                }
            }
        },
    ))
result = await asyncio.gather(*tasks)
for run in result:
    print(run.report_url)