May 13, 2025

Evals API Use Case - Responses Evaluation

In the following eval, we'll compare how a newer model (gpt-4.1-mini) performs against an older model (gpt-4o-mini) on some stored responses. The nice thing about this setup is that most developers won't need to spend time building out a full eval set: all of their data is already stored on the logs page.

import openai
import os


client = openai.OpenAI()

We want to see how gpt-4.1-mini compares to gpt-4o-mini at explaining a codebase. Since the responses data source can only be used when there is already user traffic, we'll generate some example traffic with gpt-4o-mini and then compare its performance against gpt-4.1-mini.

We'll grab some example code files from the OpenAI SDK and ask gpt-4o-mini to explain them to us.

openai_sdk_file_path = os.path.dirname(openai.__file__)

# Get some example code files from the OpenAI SDK 
file_paths = [
    os.path.join(openai_sdk_file_path, "resources", "evals", "evals.py"),
    os.path.join(openai_sdk_file_path, "resources", "responses", "responses.py"),
    os.path.join(openai_sdk_file_path, "resources", "images.py"),
    os.path.join(openai_sdk_file_path, "resources", "embeddings.py"),
    os.path.join(openai_sdk_file_path, "resources", "files.py"),
]

print(file_paths[0])

Now, let's generate some responses.

for file_path in file_paths:
    # Read each SDK source file and ask the model to explain it.
    with open(file_path, "r") as f:
        file_contents = f.read()

    response = client.responses.create(
        input=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": "What does this file do?",
                    },
                    {
                        "type": "input_text",
                        "text": file_contents,
                    },
                ],
            },
        ],
        model="gpt-4o-mini",
        # Responses are stored by default (store=True), which is what the
        # "responses" data source used later relies on.
    )
    print(response.output_text)

Note that for this to work, you must be part of an organization where data logging has not been disabled (e.g., via ZDR). If you're not sure whether that applies to you, visit https://platform.openai.com/logs?api=responses and check whether you can see the responses you just generated.
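
As a quick programmatic sanity check, you can also fetch a stored response back by its ID. This is a minimal sketch that assumes the `response` variable from the loop above is still in scope; if response storage is disabled for your org, nothing will have been persisted.

# Optional sanity check: retrieve the most recent response by ID.
# If response storage is disabled (e.g. via ZDR), the lookup will fail.
stored = client.responses.retrieve(response.id)
print(stored.id, stored.model)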

grader_system_prompt = """
You are **Code-Explanation Grader**, an expert software engineer and technical writer.  
Your job is to score how well *Model A* explained the purpose and behaviour of a given source-code file.

### What you receive
1. **File contents** – the full text of the code file (or a representative excerpt).  
2. **Candidate explanation** – the answer produced by Model A that tries to describe what the file does.

### What to produce
Return a single JSON object that can be parsed by `json.loads`, containing:
```json
{
  "steps": [
    { "description": "...", "result": "float" },
    { "description": "...", "result": "float" },
    { "description": "...", "result": "float" }
  ],
  "result": "float"
}
```
• Each object in `steps` documents your reasoning for one category listed under “Scoring dimensions”.  
• Place your final 1 – 7 quality score (inclusive) in the top-level `result` key as a **string** (e.g. `"5.5"`).

### Scoring dimensions (evaluate in this order)

1. **Correctness & Accuracy ≈ 45 %**  
   • Does the explanation match the actual code behaviour, interfaces, edge cases, and side effects?  
   • Fact-check every technical claim; penalise hallucinations or missed key functionality.

2. **Completeness & Depth ≈ 25 %**  
   • Are all major components, classes, functions, data flows, and external dependencies covered?  
   • Depth should be appropriate to the file’s size/complexity; superficial glosses lose points.

3. **Clarity & Organization ≈ 20 %**  
   • Is the explanation well-structured, logically ordered, and easy for a competent developer to follow?  
   • Good use of headings, bullet lists, and concise language is rewarded.

4. **Insight & Usefulness ≈ 10 %**  
   • Does the answer add valuable context (e.g., typical use cases, performance notes, risks) beyond line-by-line paraphrase?  
   • Highlighting **why** design choices matter is a plus.

### Error taxonomy
• **Major error** – Any statement that materially misrepresents the file (e.g., wrong API purpose, inventing non-existent behaviour).  
• **Minor error** – Small omission or wording that slightly reduces clarity but doesn’t mislead.  
List all found errors in your `steps` reasoning.

### Numeric rubric
1  Catastrophically wrong; mostly hallucination or irrelevant.  
2  Many major errors, few correct points.  
3  Several major errors OR pervasive minor mistakes; unreliable.  
4  Mostly correct but with at least one major gap or multiple minors; usable only with caution.  
5  Solid, generally correct; minor issues possible but no major flaws.  
6  Comprehensive, accurate, and clear; only very small nit-picks.  
7  Exceptional: precise, thorough, insightful, and elegantly presented; hard to improve.

Use the full scale. Reserve 6.5 – 7 only when you are almost certain the explanation is outstanding.

Then set `"result": "4.0"` (example).

Be rigorous and unbiased.
"""
user_input_message = """**User input**

{{item.input}}

**Response to evaluate**

{{sample.output_text}}
"""
logs_eval = client.evals.create(
    name="Code QA Eval",
    data_source_config={
        "type": "logs",
    },
    testing_criteria=[
        {
			"type": "score_model",
            "name": "General Evaluator",
            "model": "o3",
            "input": [{
                "role": "system",
                "content": grader_system_prompt,
            }, {
                "role": "user",
                "content": user_input_message,
            },
            ],
            "range": [1, 7],
            "pass_threshold": 5.5,
        }
    ]
)
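
If you want to double-check that the eval and its grader were registered as expected, you can fetch it back; a small sketch using the SDK's `evals.retrieve`:

# Fetch the eval we just created to confirm its configuration.
created_eval = client.evals.retrieve(logs_eval.id)
print(created_eval.id, created_eval.name)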

First, let's kick off a run to grade the quality of the original responses. To do that, we just set the filter for which responses we want to be evaluated.

gpt_4o_mini_run = client.evals.runs.create(
    name="gpt-4o-mini",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {"type": "responses", "limit": len(file_paths)}, # just grab the most recent responses
    },
)

Now, let's see how gpt-4.1-mini does!

gpt_41_mini_run = client.evals.runs.create(
    name="gpt-4.1-mini",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {"type": "responses", "limit": len(file_paths)},
        "input_messages": {
            "type": "item_reference",
            "item_reference": "item.input",
        },
        "model": "gpt-4.1-mini",
    }
)
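
Both runs execute asynchronously. If you'd rather wait for them in code instead of refreshing the dashboard, here is a minimal polling sketch; it assumes the run objects expose a `status` field that eventually settles on "completed" or "failed".

import time

# Poll each run until grading finishes.
for run in (gpt_4o_mini_run, gpt_41_mini_run):
    while True:
        current = client.evals.runs.retrieve(run_id=run.id, eval_id=logs_eval.id)
        if current.status in ("completed", "failed"):
            print(run.name, current.status)
            break
        time.sleep(5)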

Now, let's head over to the dashboard to see how we did!

gpt_4o_mini_run.report_url
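
Beyond the dashboard report, you can also pull the graded results back into Python. This is a small sketch using the SDK's output-items listing; the exact fields on each item (such as `status`) are assumptions, so inspect an item to see the full schema.

# List the graded output items for the gpt-4.1-mini run.
output_items = client.evals.runs.output_items.list(
    run_id=gpt_41_mini_run.id,
    eval_id=logs_eval.id,
)
for item in output_items:
    print(item.id, item.status)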