May 13, 2025

Evals API Use Case - Responses Evaluation

In the following eval, we'll compare how a newer model (gpt-4.1-mini) performs against an older model (gpt-4o-mini) on some stored responses. The nice thing about this setup is that most developers won't need to spend time building out a full eval set: all of their data is already stored on the logs page.

import openai
import os


client = openai.OpenAI()

We want to see how gpt-4.1-mini compares to gpt-4o-mini at explaining a codebase. Since the responses data source can only be used when there is already user traffic, we'll generate some example traffic with gpt-4o-mini and then compare its performance against gpt-4.1-mini.

We'll grab some example code files from the OpenAI SDK and ask gpt-4o-mini to explain them to us.

openai_sdk_file_path = os.path.dirname(openai.__file__)

# Get some example code files from the OpenAI SDK 
file_paths = [
    os.path.join(openai_sdk_file_path, "resources", "evals", "evals.py"),
    os.path.join(openai_sdk_file_path, "resources", "responses", "responses.py"),
    os.path.join(openai_sdk_file_path, "resources", "images.py"),
    os.path.join(openai_sdk_file_path, "resources", "embeddings.py"),
    os.path.join(openai_sdk_file_path, "resources", "files.py"),
]

print(file_paths[0])

Now, let's generate some responses.

for file_path in file_paths:
    # Read each SDK source file and ask the model to explain it.
    with open(file_path, "r") as f:
        file_contents = f.read()

    response = client.responses.create(
        input=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": "What does this file do?",
                    },
                    {
                        "type": "input_text",
                        "text": file_contents,
                    },
                ],
            },
        ],
        model="gpt-4o-mini",
        # Responses are stored by default (store=True), which is what the
        # "responses" data source used later relies on.
    )
    print(response.output_text)

Note that for this to work, you must be part of an organization where data logging has not been disabled (e.g., via ZDR). If you're not sure whether that applies to you, visit https://platform.openai.com/logs?api=responses and check whether you can see the responses you just generated.
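
As a quick programmatic sanity check, you can also fetch a stored response back by its ID. This is a minimal sketch that assumes the `response` variable from the loop above is still in scope; if response storage is disabled for your org, nothing will have been persisted.

# Optional sanity check: retrieve the most recent response by ID.
# If response storage is disabled (e.g. via ZDR), the lookup will fail.
stored = client.responses.retrieve(response.id)
print(stored.id, stored.model)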

grader_system_prompt = """
You are **Code-Explanation Grader**, an expert software engineer and technical writer.  
Your job is to score how well *Model A* explained the purpose and behaviour of a given source-code file.

### What you receive
1. **File contents** – the full text of the code file (or a representative excerpt).  
2. **Candidate explanation** – the answer produced by Model A that tries to describe what the file does.

### What to produce
Return a single JSON object that can be parsed by `json.loads`, containing:
```json
{
  "steps": [
    { "description": "...", "result": "float" },
    { "description": "...", "result": "float" },
    { "description": "...", "result": "float" }
  ],
  "result": "float"
}
```
• Each object in `steps` documents your reasoning for one category listed under “Scoring dimensions”.  
• Place your final 1 – 7 quality score (inclusive) in the top-level `result` key as a **string** (e.g. `"5.5"`).

### Scoring dimensions (evaluate in this order)

1. **Correctness & Accuracy ≈ 45 %**  
   • Does the explanation match the actual code behaviour, interfaces, edge cases, and side effects?  
   • Fact-check every technical claim; penalise hallucinations or missed key functionality.

2. **Completeness & Depth ≈ 25 %**  
   • Are all major components, classes, functions, data flows, and external dependencies covered?  
   • Depth should be appropriate to the file’s size/complexity; superficial glosses lose points.

3. **Clarity & Organization ≈ 20 %**  
   • Is the explanation well-structured, logically ordered, and easy for a competent developer to follow?  
   • Good use of headings, bullet lists, and concise language is rewarded.

4. **Insight & Usefulness ≈ 10 %**  
   • Does the answer add valuable context (e.g., typical use cases, performance notes, risks) beyond line-by-line paraphrase?  
   • Highlighting **why** design choices matter is a plus.

### Error taxonomy
• **Major error** – Any statement that materially misrepresents the file (e.g., wrong API purpose, inventing non-existent behaviour).  
• **Minor error** – Small omission or wording that slightly reduces clarity but doesn’t mislead.  
List all found errors in your `steps` reasoning.

### Numeric rubric
1  Catastrophically wrong; mostly hallucination or irrelevant.  
2  Many major errors, few correct points.  
3  Several major errors OR pervasive minor mistakes; unreliable.  
4  Mostly correct but with at least one major gap or multiple minors; usable only with caution.  
5  Solid, generally correct; minor issues possible but no major flaws.  
6  Comprehensive, accurate, and clear; only very small nit-picks.  
7  Exceptional: precise, thorough, insightful, and elegantly presented; hard to improve.

Use the full scale. Reserve 6.5 – 7 only when you are almost certain the explanation is outstanding.

Then set `"result": "4.0"` (example).

Be rigorous and unbiased.
"""
user_input_message = """**User input**

{{item.input}}

**Response to evaluate**

{{sample.output_text}}
"""
logs_eval = client.evals.create(
    name="Code QA Eval",
    data_source_config={
        "type": "logs",
    },
    testing_criteria=[
        {
			"type": "score_model",
            "name": "General Evaluator",
            "model": "o3",
            "input": [{
                "role": "system",
                "content": grader_system_prompt,
            }, {
                "role": "user",
                "content": user_input_message,
            },
            ],
            "range": [1, 7],
            "pass_threshold": 5.5,
        }
    ]
)
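
If you want to double-check that the eval and its grader were registered as expected, you can fetch it back; a small sketch using the SDK's `evals.retrieve`:

# Fetch the eval we just created to confirm its configuration.
created_eval = client.evals.retrieve(logs_eval.id)
print(created_eval.id, created_eval.name)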

First, let's kick off a run to grade the quality of the original responses. To do that, we just set the filter for which responses we want to be evaluated.

gpt_4o_mini_run = client.evals.runs.create(
    name="gpt-4o-mini",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {"type": "responses", "limit": len(file_paths)}, # just grab the most recent responses
    },
)

Now, let's see how gpt-4.1-mini does!

gpt_41_mini_run = client.evals.runs.create(
    name="gpt-4.1-mini",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {"type": "responses", "limit": len(file_paths)},
        "input_messages": {
            "type": "item_reference",
            "item_reference": "item.input",
        },
        "model": "gpt-4.1-mini",
    }
)
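
Both runs execute asynchronously. If you'd rather wait for them in code instead of refreshing the dashboard, here is a minimal polling sketch; it assumes the run objects expose a `status` field that eventually settles on "completed" or "failed".

import time

# Poll each run until grading finishes.
for run in (gpt_4o_mini_run, gpt_41_mini_run):
    while True:
        current = client.evals.runs.retrieve(run_id=run.id, eval_id=logs_eval.id)
        if current.status in ("completed", "failed"):
            print(run.name, current.status)
            break
        time.sleep(5)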

Now, let's head over to the dashboard to see how we did!

gpt_4o_mini_run.report_url
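
Beyond the dashboard report, you can also pull the graded results back into Python. This is a small sketch using the SDK's output-items listing; the exact fields on each item (such as `status`) are assumptions, so inspect an item to see the full schema.

# List the graded output items for the gpt-4.1-mini run.
output_items = client.evals.runs.output_items.list(
    run_id=gpt_41_mini_run.id,
    eval_id=logs_eval.id,
)
for item in output_items:
    print(item.id, item.status)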