Evals are task-oriented and iterative; they're the best way to check how your LLM integration is doing and then improve it.
In the following eval, we are going to focus on the task of detecting whether my prompt changes have caused a regression.
Our use case is:
- I have an LLM integration that takes a list of push notifications and summarizes them into a single concise statement.
- I want to detect whether prompt changes cause the behavior to regress.
Evals structure
Evals have two parts: the "Eval" and the "Runs". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can have many Runs, each of which is evaluated against your testing criteria.
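The one-to-many relationship between an Eval and its Runs can be sketched with plain dataclasses (hypothetical stand-ins for illustration only, not the actual API objects, which live server-side):

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    name: str
    rows: list  # each row pairs an input item with a sampled completion

@dataclass
class Eval:
    data_source_config: dict  # schema the rows of every Run must conform to
    testing_criteria: list    # graders applied to every Run
    runs: list = field(default_factory=list)

# One Eval, many Runs, all graded against the same criteria
ev = Eval(data_source_config={}, testing_criteria=["grader"])
ev.runs.append(Run(name="baseline-run", rows=[]))
ev.runs.append(Run(name="regression-run", rows=[]))
print(len(ev.runs))  # 2
```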
import openai
from openai.types.chat import ChatCompletion
import pydantic
import os
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")

The integration we are testing is push notification summarization: it takes multiple push notifications and collapses them into a single one, via one chat completions call.
class PushNotifications(pydantic.BaseModel):
    notifications: str
print(PushNotifications.model_json_schema())

DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
"""
def summarize_push_notification(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result
example_push_notifications_list = PushNotifications(notifications="""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""")
result = summarize_push_notification(example_push_notifications_list.notifications)
print(result.choices[0].message.content)

An Eval holds the configuration that is shared across multiple Runs. It has two components:
- data_source_config - the schema (columns) that your future Runs conform to. The data_source_config uses JSON Schema to define what variables are available in the Eval.
- testing_criteria - how we determine whether your integration is working for each row of your data source. For this use case, we want to test whether the push notification summary completion is good, so we'll set up our eval around that goal.
# We want our input data to be available in our variables, so we set the item_schema to
# PushNotifications.model_json_schema()
data_source_config = {
    "type": "custom",
    "item_schema": PushNotifications.model_json_schema(),
    # We're going to be uploading completions from the API, so we tell the Eval to expect this
    "include_sample_schema": True,
}

This data_source_config defines what variables are available throughout the eval.
The item_schema:
{
    "properties": {
        "notifications": {
            "title": "Notifications",
            "type": "string"
        }
    },
    "required": ["notifications"],
    "title": "PushNotifications",
    "type": "object"
}

This means that we'll have the variable {{item.notifications}} available in our eval.
"include_sample_schema": True
means that we'll have the variable {{sample.output_text}} available in our eval.
Next, we'll use those variables to set up our testing criteria.
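To build intuition for how these template variables resolve, here is a minimal mustache-style substitution sketch. This is my own helper for illustration; the Evals API performs this substitution server-side:

```python
import re

def render(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders with values; leave unknown names intact."""
    return re.sub(
        r"\{\{(.*?)\}\}",
        lambda m: str(variables.get(m.group(1).strip(), m.group(0))),
        template,
    )

rendered = render(
    "Push notifications: {{item.notifications}}\nSummary: {{sample.output_text}}",
    {
        "item.notifications": "- Your package has been delivered!",
        "sample.output_text": "Package delivered.",
    },
)
print(rendered)
```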
GRADER_DEVELOPER_PROMPT = """
Label the following push notification summary as either correct or incorrect.
The push notification and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct, if not, then incorrect.
"""
GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.notifications}}
Summary: {{sample.output_text}}
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["correct"],
    "labels": ["correct", "incorrect"],
}

The push_notification_grader is a model grader (llm-as-a-judge) which looks at the input {{item.notifications}} and the generated summary {{sample.output_text}} and labels it as "correct" or "incorrect".
We then instruct the grader, via "passing_labels", what constitutes a passing answer.
Note: under the hood, this uses structured outputs, so the labels are always valid.
Now let's create our eval and start adding data to it!
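To see why an off-list label can't occur, here is a sketch of the kind of constrained schema that structured outputs enforces. The grader's real internal schema isn't documented; this pydantic model is just an illustration of the mechanism:

```python
from typing import Literal

import pydantic

class GraderLabel(pydantic.BaseModel):
    # The model's output is constrained to exactly these labels
    label: Literal["correct", "incorrect"]

schema = GraderLabel.model_json_schema()
print(schema["properties"]["label"]["enum"])  # ['correct', 'incorrect']
```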
eval_create_result = openai.evals.create(
    name="Push Notification Summary Workflow",
    metadata={
        "description": "This eval checks if the push notification summary is correct.",
    },
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)
eval_id = eval_create_result.id

Now that we have our eval set up with our test_criteria, we can start adding a bunch of runs! We'll start with some push notification data.
push_notification_data = [
"""
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
"""
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
"""
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
"""
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
"""
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
"""
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
"""
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
"""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
"""
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
"""
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]

Our first run will use the summarize_push_notification completions function defined above as our baseline. We'll iterate over the dataset, make completions calls, and then submit them as a run to be graded.
run_data = []
for push_notifications in push_notification_data:
    result = summarize_push_notification(push_notifications)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": result.model_dump()
    })

eval_run_result = openai.evals.runs.create(
    eval_id=eval_id,
    name="baseline-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(eval_run_result)
# Check out the results in the UI
print(eval_run_result.report_url)

Now, let's simulate a regression. Here is our original prompt; let's simulate a developer breaking it.
DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
"""

DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
You should make the summary longer than it needs to be and include more information than is necessary.
"""
def summarize_push_notification_bad(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result

run_data = []
for push_notifications in push_notification_data:
    result = summarize_push_notification_bad(push_notifications)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": result.model_dump()
    })
eval_run_result = openai.evals.runs.create(
    eval_id=eval_id,
    name="regression-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(eval_run_result.report_url)

If you check out the report, you'll see that it scores much lower than the baseline run.
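You can also compare runs programmatically. Assuming each retrieved run exposes a result_counts field with total/passed/failed tallies, as the Evals API run object does, a minimal regression check might look like this (the counts below are made up for illustration):

```python
# Hypothetical result_counts for two runs; in practice you'd read these
# from the retrieved run objects after grading completes.
baseline_counts = {"total": 10, "passed": 9, "failed": 1}
regression_counts = {"total": 10, "passed": 4, "failed": 6}

def pass_rate(counts: dict) -> float:
    """Fraction of graded rows that passed."""
    return counts["passed"] / counts["total"]

def is_regression(baseline: dict, candidate: dict, tolerance: float = 0.05) -> bool:
    """Flag a regression when the candidate's pass rate drops more than `tolerance`."""
    return pass_rate(candidate) < pass_rate(baseline) - tolerance

print(is_regression(baseline_counts, regression_counts))  # True
```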
A quick note:
Evals does not natively support the Responses API yet, but you can convert a response to the completions format with the code below.
def summarize_push_notification_responses(push_notifications: str):
    result = openai.responses.create(
        model="gpt-4o",
        input=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result
def transform_response_to_completion(response):
    completion = {
        "model": response.model,
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": response.output_text
            },
            "finish_reason": "stop",
        }]
    }
    return completion
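You can sanity-check this conversion offline by pushing a stand-in response object through the same shape (SimpleNamespace here is just a stub with made-up values):

```python
from types import SimpleNamespace

# A stand-in for a Responses API result, carrying only the fields the
# transform reads; the values are invented for illustration.
fake_response = SimpleNamespace(model="gpt-4o", output_text="Package delivered; call Sarah later.")

completion = {
    "model": fake_response.model,
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": fake_response.output_text},
        "finish_reason": "stop",
    }]
}

# The grader template reads {{sample.output_text}}, which for an uploaded
# completions-format sample comes from choices[0].message.content.
print(completion["choices"][0]["message"]["content"])
```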
run_data = []
for push_notifications in push_notification_data:
    response = summarize_push_notification_responses(push_notifications)
    completion = transform_response_to_completion(response)
    run_data.append({
        "item": PushNotifications(notifications=push_notifications).model_dump(),
        "sample": completion
    })
report_response = openai.evals.runs.create(
    eval_id=eval_id,
    name="responses-run",
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(report_response.report_url)