In the evaluation that follows, we'll compare how a newer model (gpt-4.1-mini) performs against an older model (gpt-4o-mini) on a set of stored responses. The advantage of this approach is that most developers don't need to spend time building out a full evaluation setup: all of their data is already stored on the logs page.
import openai
import os
client = openai.OpenAI()

We want to see how gpt-4.1-mini compares with gpt-4o-mini at explaining a codebase. Since the responses data source can only be used when you already have user traffic, we'll use gpt-4o-mini to generate some example traffic and then compare its performance against gpt-4.1-mini.
We'll grab some example code files from the OpenAI SDK and ask gpt-4o-mini to explain them to us.
openai_sdk_file_path = os.path.dirname(openai.__file__)
# Get some example code files from the OpenAI SDK
file_paths = [
os.path.join(openai_sdk_file_path, "resources", "evals", "evals.py"),
os.path.join(openai_sdk_file_path, "resources", "responses", "responses.py"),
os.path.join(openai_sdk_file_path, "resources", "images.py"),
os.path.join(openai_sdk_file_path, "resources", "embeddings.py"),
os.path.join(openai_sdk_file_path, "resources", "files.py"),
]
print(file_paths[0])
Now, let's generate some responses.
for file_path in file_paths:
    response = client.responses.create(
        input=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": "What does this file do?",
                    },
                    {
                        "type": "input_text",
                        "text": open(file_path, "r").read(),
                    },
                ],
            },
        ],
        model="gpt-4o-mini",
    )
    print(response.output_text)

Note that for this to work, you must be operating in an organization where data logging is not disabled (for example, through ZDR). If you're not sure whether that applies to you, visit https://platform.openai.com/logs?api=responses and check whether you can see the responses you just generated.
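If you'd rather check programmatically, here's a minimal sketch that simply retrieves the last response created in the loop above by ID; retrieval only succeeds for responses that were stored, so a clean round trip is a reasonable sanity check.

# A minimal sketch: fetch the most recently created response by ID to
# confirm it was stored before relying on it as an eval data source.
stored = client.responses.retrieve(response.id)
print(stored.id, stored.model)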
grader_system_prompt = """
You are **Code-Explanation Grader**, an expert software engineer and technical writer.
Your job is to score how well *Model A* explained the purpose and behaviour of a given source-code file.
### What you receive
1. **File contents** – the full text of the code file (or a representative excerpt).
2. **Candidate explanation** – the answer produced by Model A that tries to describe what the file does.
### What to produce
Return a single JSON object that can be parsed by `json.loads`, containing:
```json
{
"steps": [
{ "description": "...", "result": "float" },
{ "description": "...", "result": "float" },
{ "description": "...", "result": "float" }
],
"result": "float"
}
```
• Each object in `steps` documents your reasoning for one category listed under “Scoring dimensions”.
• Place your final 1 – 7 quality score (inclusive) in the top-level `result` key as a **string** (e.g. `"5.5"`).
### Scoring dimensions (evaluate in this order)
1. **Correctness & Accuracy ≈ 45 %**
• Does the explanation match the actual code behaviour, interfaces, edge cases, and side effects?
• Fact-check every technical claim; penalise hallucinations or missed key functionality.
2. **Completeness & Depth ≈ 25 %**
• Are all major components, classes, functions, data flows, and external dependencies covered?
• Depth should be appropriate to the file’s size/complexity; superficial glosses lose points.
3. **Clarity & Organization ≈ 20 %**
• Is the explanation well-structured, logically ordered, and easy for a competent developer to follow?
• Good use of headings, bullet lists, and concise language is rewarded.
4. **Insight & Usefulness ≈ 10 %**
• Does the answer add valuable context (e.g., typical use cases, performance notes, risks) beyond line-by-line paraphrase?
• Highlighting **why** design choices matter is a plus.
### Error taxonomy
• **Major error** – Any statement that materially misrepresents the file (e.g., wrong API purpose, inventing non-existent behaviour).
• **Minor error** – Small omission or wording that slightly reduces clarity but doesn’t mislead.
List all found errors in your `steps` reasoning.
### Numeric rubric
1 Catastrophically wrong; mostly hallucination or irrelevant.
2 Many major errors, few correct points.
3 Several major errors OR pervasive minor mistakes; unreliable.
4 Mostly correct but with at least one major gap or multiple minors; usable only with caution.
5 Solid, generally correct; minor issues possible but no major flaws.
6 Comprehensive, accurate, and clear; only very small nit-picks.
7 Exceptional: precise, thorough, insightful, and elegantly presented; hard to improve.
Use the full scale. Reserve 6.5 – 7 only when you are almost certain the explanation is outstanding.
Then set `"result": "4.0"` (example).
Be rigorous and unbiased.
"""
user_input_message = """**User input**
{{item.input}}
**Response to evaluate**
{{sample.output_text}}
"""logs_eval = client.evals.create(
name="Code QA Eval",
data_source_config={
"type": "logs",
},
testing_criteria=[
{
"type": "score_model",
"name": "General Evaluator",
"model": "o3",
"input": [{
"role": "system",
"content": grader_system_prompt,
}, {
"role": "user",
"content": user_input_message,
},
],
"range": [1, 7],
"pass_threshold": 5.5,
}
]
)

First, let's kick off a run to grade the quality of the original responses. To do this, we just set a filter for which responses we want to evaluate.
gpt_4o_mini_run = client.evals.runs.create(
name="gpt-4o-mini",
eval_id=logs_eval.id,
data_source={
"type": "responses",
"source": {"type": "responses", "limit": len(file_paths)}, # just grab the most recent responses
},
)

Now, let's see how gpt-4.1-mini does!
gpt_41_mini_run = client.evals.runs.create(
name="gpt-4.1-mini",
eval_id=logs_eval.id,
data_source={
"type": "responses",
"source": {"type": "responses", "limit": len(file_paths)},
"input_messages": {
"type": "item_reference",
"item_reference": "item.input",
},
"model": "gpt-4.1-mini",
}
)

Now, let's head over to the dashboard and see how we did!
gpt_4o_mini_run.report_url
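You can also check on both runs from code instead of the dashboard. The sketch below assumes the SDK exposes `client.evals.runs.retrieve`, mirroring the `GET /v1/evals/{eval_id}/runs/{run_id}` endpoint, and simply prints each run's status and report URL.

# A minimal sketch: poll both runs and print their status and report URLs.
for run in (gpt_4o_mini_run, gpt_41_mini_run):
    latest = client.evals.runs.retrieve(run.id, eval_id=logs_eval.id)
    print(latest.name, latest.status, latest.report_url)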