Evals are task oriented and iterative; they're the best way to check how your LLM integration is doing and improve it.
In the following eval, we are going to focus on the task of testing many variants of models and prompts.
Our use-case is:
- I want to get the best possible performance out of my push notifications summarizer
Evals structure
Evals have two parts, the "Eval" and the "Runs". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval has_many Runs, which are evaluated against your testing criteria.
import pydantic
import openai
from openai.types.chat import ChatCompletion
import os
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")
The integration we are testing is a push notification summarizer, which takes in multiple push notifications and collapses them into a single message.
class PushNotifications(pydantic.BaseModel):
notifications: str
print(PushNotifications.model_json_schema())
DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
"""
def summarize_push_notification(push_notifications: str) -> ChatCompletion:
result = openai.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "developer", "content": DEVELOPER_PROMPT},
{"role": "user", "content": push_notifications},
],
)
return result
example_push_notifications_list = PushNotifications(notifications="""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""")
result = summarize_push_notification(example_push_notifications_list.notifications)
print(result.choices[0].message.content)
An Eval holds the configuration that is shared across multiple Runs. It has two components:
- data_source_config - the schema (columns) that your future Runs conform to. The data_source_config uses JSON Schema to define what variables are available in the Eval.
- testing_criteria - how you determine whether your integration is working for each row of your data source.
For this use-case, we want to test whether the push notification summary generation is good, so we'll set up our eval with that in mind.
# We want our input data to be available in our variables, so we set the item_schema to
# PushNotifications.model_json_schema()
data_source_config = {
"type": "custom",
"item_schema": PushNotifications.model_json_schema(),
# We're going to be uploading completions from the API, so we tell the Eval to expect this
"include_sample_schema": True,
}
This data_source_config defines what variables are available throughout the eval.
The item schema:
{
"properties": {
"notifications": {
"title": "Notifications",
"type": "string"
}
},
"required": ["notifications"],
"title": "PushNotifications",
"type": "object"
}
means that we'll have the variable {{item.notifications}} available in our eval.
"include_sample_schema": True
means that we'll have the variable {{sample.output_text}} available in our eval.
Next, we'll use those variables to set up our test criteria.
GRADER_DEVELOPER_PROMPT = """
Categorize the following push notification summary into the following categories:
1. concise-and-snappy
2. drops-important-information
3. verbose
4. unclear
5. obscures-meaning
6. other
You'll be given the original list of push notifications and the summary like this:
<push_notifications>
...notificationlist...
</push_notifications>
<summary>
...summary...
</summary>
You should only pick one of the categories above; pick the one which most closely matches.
"""
GRADER_TEMPLATE_PROMPT = """
<push_notifications>{{item.notifications}}</push_notifications>
<summary>{{sample.output_text}}</summary>
"""
push_notification_grader = {
"name": "Push Notification Summary Grader",
"type": "label_model",
"model": "o3-mini",
"input": [
{
"role": "developer",
"content": GRADER_DEVELOPER_PROMPT,
},
{
"role": "user",
"content": GRADER_TEMPLATE_PROMPT,
},
],
"passing_labels": ["concise-and-snappy"],
"labels": [
"concise-and-snappy",
"drops-important-information",
"verbose",
"unclear",
"obscures-meaning",
"other",
],
}
The push_notification_grader is a model grader (llm-as-a-judge) that looks at the input {{item.notifications}} and the generated summary {{sample.output_text}}, and labels it as "correct" or "incorrect".
We then instruct the eval, via the passing_labels, on what constitutes a passing answer.
Note: under the hood, this uses structured outputs so that the labels are always valid.
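Before wiring the grader into the eval, it can be useful to spot-check the grading prompt by hand. The following is a minimal sketch, not part of the eval API: it substitutes the template variables manually and sends them through a one-off chat completion (gpt-4o-mini is used here purely as a stand-in; the eval itself will run the grader with the o3-mini model configured above).
# Optional sanity check: fill in the grader template for one example and ask the
# model for a label directly, mimicking what the eval does for every row.
sample_result = summarize_push_notification(example_push_notifications_list.notifications)
grader_user_message = (
    GRADER_TEMPLATE_PROMPT
    .replace("{{item.notifications}}", example_push_notifications_list.notifications)
    .replace("{{sample.output_text}}", sample_result.choices[0].message.content)
)
grader_preview = openai.chat.completions.create(
    model="gpt-4o-mini",  # stand-in model for this local check
    messages=[
        {"role": "developer", "content": GRADER_DEVELOPER_PROMPT},
        {"role": "user", "content": grader_user_message},
    ],
)
print(grader_preview.choices[0].message.content)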
Now we'll create our eval and start adding data to it!
eval_create_result = openai.evals.create(
name="Push Notification Bulk Experimentation Eval",
metadata={
"description": "This eval tests many prompts and models to find the best performing combination.",
},
data_source_config=data_source_config,
testing_criteria=[push_notification_grader],
)
eval_id = eval_create_result.id
Now that we have our eval set up with our test criteria, we can start adding a bunch of runs! We'll start with some push notification data.
push_notification_data = [
"""
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
"""
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
"""
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
"""
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
"""
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
"""
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
"""
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
"""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
"""
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
"""
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]现在我们将设置一系列提示词进行测试。
We want to test a basic prompt, with a couple of variations: one with just the base instructions, one that adds a positive example, and one that adds both a positive and a negative example.
We'll also include a list of models to use.
PROMPT_PREFIX = """
You are a helpful assistant that takes in an array of push notifications and returns a collapsed summary of them.
The push notification will be provided as follows:
<push_notifications>
...notificationlist...
</push_notifications>
You should return just the summary and nothing else.
"""
PROMPT_VARIATION_BASIC = f"""
{PROMPT_PREFIX}
You should return a summary that is concise and snappy.
"""
PROMPT_VARIATION_WITH_EXAMPLES = f"""
{PROMPT_VARIATION_BASIC}
Here is an example of a good summary:
<push_notifications>
- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.
</push_notifications>
<summary>
Traffic alert, package expected by 5pm, suggestion for new friend (Emma).
</summary>
"""
PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES = f"""
{PROMPT_VARIATION_WITH_EXAMPLES}
Here is an example of a bad summary:
<push_notifications>
- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.
</push_notifications>
<summary>
Traffic alert reported on main street. You have a package that will arrive by 5pm, Emily is a new friend suggested for you.
</summary>
"""
prompts = [
("basic", PROMPT_VARIATION_BASIC),
("with_examples", PROMPT_VARIATION_WITH_EXAMPLES),
("with_negative_examples", PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES),
]
models = ["gpt-4o", "gpt-4o-mini", "o3-mini"]
Now we can simply loop over all prompts and all models to test a bunch of configurations at once!
We'll use the 'completions' run data source, with template variables for our push notification list.
OpenAI will handle making the completion calls for you and populate "sample.output_text".
for prompt_name, prompt in prompts:
for model in models:
run_data_source = {
"type": "completions",
"input_messages": {
"type": "template",
"template": [
{
"role": "developer",
"content": prompt,
},
{
"role": "user",
"content": "<push_notifications>{{item.notifications}}</push_notifications>",
},
],
},
"model": model,
"source": {
"type": "file_content",
"content": [
{
"item": PushNotifications(notifications=notification).model_dump()
}
for notification in push_notification_data
],
},
}
run_create_result = openai.evals.runs.create(
eval_id=eval_id,
name=f"bulk_{prompt_name}_{model}",
data_source=run_data_source,
)
print(f"Report URL {model}, {prompt_name}:", run_create_result.report_url)