注意：OpenAI现已推出托管评估产品并提供API接口！我们建议您直接使用该服务。详情请参阅Evals

OpenAI Evals 框架包含以下内容

一个用于评估LLM（大型语言模型）或基于LLM构建的系统的框架。
一个开源的挑战性评估注册表

本笔记本将涵盖：

评估简介与OpenAI Evals库
构建评估
运行评估

什么是评估/ `evals`？

评估是验证和测试您的LLM应用程序所产生输出的过程。拥有强大的评估（"evals"）意味着应用程序更加稳定可靠，能够适应代码和模型的变化。评估任务是用于衡量LLM或LLM系统输出质量的方法。给定一个输入提示后，系统会生成输出。我们通过一组理想答案来评估这个输出，从而判断LLM系统的质量。

评估的重要性

如果您正在使用像GPT-4这样的基础模型进行开发，创建高质量的评估是您能做的最具影响力的事情之一。开发AI解决方案涉及一个迭代的设计过程。如果没有评估，理解不同模型版本和提示如何影响您的用例可能会非常困难且耗时。

随着OpenAI的持续模型升级,评估功能让您能够以标准化的方式高效测试模型在您用例中的表现。开发一套针对您目标定制的评估方案,将帮助您快速有效地了解新模型在您用例中的可能表现。您还可以将评估纳入CI/CD流程,确保在部署前达到预期的准确度。

评估类型

评估/评分补全主要有两种方法：在代码中编写一些验证逻辑，或使用模型本身来检查答案。我们将通过一些示例来介绍每种方法。

编写答案检查逻辑

最简单和最常见的评估类型包含一个输入和一个理想的响应或答案。例如，我们可以设计一个评估样本，其中输入是"奥巴马第一次当选总统是哪一年？"，而理想答案是"2008"。我们将输入提供给模型并获取其补全结果。如果模型回答"2008"，则会被评为正确。我们可以编写字符串匹配来检查补全结果是否包含短语"2008"。如果包含，我们就认为它是正确的。

考虑另一个评估场景，输入要求生成有效的JSON：我们可以编写一些代码，尝试将完成内容解析为JSON，如果可解析则认为完成正确。

模型评分：一个两阶段过程，首先模型回答问题，然后我们让另一个模型检查回答是否正确。

考虑一个要求模型编写一个有趣笑话的输入。模型随后生成一个完成内容。接着我们创建一个新的模型输入，询问："以下笑话是否有趣？首先逐步推理，然后回答是或否"，并将之前的完成内容包含在内。如果新模型的完成内容以"是"结尾，我们最终认为原始完成内容是正确的。

模型评分在使用最新、最强大的模型（如GPT-4）时效果最佳，如果我们让它们在做出判断前具备推理能力。模型评分会有一定的错误率，因此在规模化运行评估前，通过人工评估验证性能非常重要。为了获得最佳结果，最好使用与生成内容不同的模型进行评分，例如用GPT-4来评估GPT-3.5的答案。

OpenAI 评估模板

在使用评估时，我们发现了几种能适配多种基准测试的"模板"。我们已在OpenAI Evals库中实现了这些模板，以简化新评估的开发工作。例如，我们定义了两类开箱即用的评估模板：

基础评估模板: 这些模板包含确定性函数，用于将输出与理想答案进行比较。当期望的模型响应变化很少时，例如回答选择题或具有直接答案的简单问题，我们发现以下模板很有用。
模型评分模板: 这些模板包含一些功能，其中大型语言模型会将输出与ideal_answers进行比较，并尝试判断准确性。在期望的模型响应可能存在显著变化的情况下，例如回答开放式问题时，我们发现使用模型自行评分是一种可行的自动化评估策略。

开始设置

首先，访问 github.com/openai/evals，使用 git clone git@github.com:openai/evals.git 克隆仓库，并按照安装说明进行操作。

要在本笔记本后续运行评估，您需要设置并指定您的OpenAI API密钥。获取API密钥后，请使用OPENAI_API_KEY环境变量进行指定。

请注意在运行评估时使用API的相关成本。

from openai import OpenAI
import pandas as pd

client = OpenAI()

为OpenAI Evals框架构建评估

本质上，一个评估是由YAML文件中定义的数据集和评估类组成的。要开始创建评估，我们需要

测试数据集为jsonl格式。
要使用的评估模板

创建评估数据集

让我们为一个用例创建数据集，评估模型生成语法正确SQL的能力。在这个用例中，我们有一系列与汽车制造相关的表格

首先我们需要创建一个待评估的系统提示。我们将传入模型的指令以及表格结构的概述：

"TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]"

对于这个提示，我们可以提出一个具体的问题：

"Q: how many car makers are their in germany?"

我们有一个预期的答案：

"A: SELECT count ( * )  FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country   =   T2.CountryId WHERE T2.CountryName   =   'germany'"

数据集需要采用以下格式：

"input": [{"role": "system", "content": "<input prompt>"}, {"role": "user", "content": <user input>}, "ideal": "correct answer"]

将所有内容整合起来，我们得到：

{"input": [{"role": "system", "content": "TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]\n"}, {"role": "system", "content": "Q: how many car makers are their in germany"}, "ideal": ["A: SELECT count ( * )  FROM CAR_MAKERS AS T1 JOIN COUNTRIES AS T2 ON T1.Country   =   T2.CountryId WHERE T2.CountryName   =   'germany'"]}

加速构建评估数据集的一种方法是使用GPT-4生成合成数据

## Use GPT-4 to generate synthetic data
# Define the system prompt and user input (these should be filled as per the specific use case)
system_prompt = """You are a helpful assistant that can ask questions about a database table and write SQL queries to answer the question.
    A user will pass in a table schema and your job is to return a question answer pairing. The question should relevant to the schema of the table,
    and you can speculate on its contents. You will then have to generate a SQL query to answer the question. Below are some examples of what this should look like.

    Example 1
    ```````````
    User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\n
    Assistant Response:
    Q: How many visitors have visited the museum with the most staff?
    A: SELECT count ( * )  FROM VISIT AS T1 JOIN MUSEUM AS T2 ON T1.Museum_ID   =   T2.Museum_ID WHERE T2.Num_of_Staff   =   ( SELECT max ( Num_of_Staff )  FROM MUSEUM ) 
    ```````````

    Example 2
    ```````````
    User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\n
    Assistant Response:
    Q: What are the names who have a membership level higher than 4?
    A: SELECT Name   FROM VISITOR AS T1 WHERE T1.Level_of_membership   >   4 
    ```````````

    Example 3
    ```````````
    User input: Table museum, columns = [*,Museum_ID,Name,Num_of_Staff,Open_Year]\nTable visit, columns = [*,Museum_ID,visitor_ID,Num_of_Ticket,Total_spent]\nTable visitor, columns = [*,ID,Name,Level_of_membership,Age]\nForeign_keys = [visit.visitor_ID = visitor.ID,visit.Museum_ID = museum.Museum_ID]\n
    Assistant Response:
    Q: How many tickets of customer id 5?
    A: SELECT count ( * )  FROM VISIT AS T1 JOIN VISITOR AS T2 ON T1.visitor_ID   =   T2.ID WHERE T2.ID   =   5 
    ```````````
    """

user_input = "Table car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]"

messages = [{
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": user_input
    }
]

completion = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=messages,
    temperature=0.7,
    n=5
)

for choice in completion.choices:
    print(choice.message.content + "\n")

Q: What is the average horsepower for cars made in Europe?
A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'

Q: What is the average horsepower for cars made in the USA?
A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'

Q: What is the average horsepower for cars produced in countries from the continent with the id '3'?
A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.ContId = '3'

Q: What is the average horsepower for cars made by makers from Europe?
A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'

Q: What is the average horsepower for cars made in the USA?

A: SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'

一旦我们有了合成数据，就需要将其转换为与评估数据集相匹配的格式。

eval_data = []
input_prompt = "TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]"

for choice in completion.choices:
    question = choice.message.content.split("Q: ")[1].split("\n")[0]  # Extracting the question
    answer = choice.message.content.split("\nA: ")[1].split("\n")[0]  # Extracting the answer
    eval_data.append({
        "input": [
            {"role": "system", "content": input_prompt},
            {"role": "user", "content": question},
        ],
        "ideal": answer
    })

for item in eval_data:
    print(item)

{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in Europe?'}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'"}
{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in the USA?'}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'"}
{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': "What is the average horsepower for cars produced in countries from the continent with the id '3'?"}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.ContId = '3'"}
{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made by makers from Europe?'}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN model_list ON car_names.Model = model_list.Model JOIN car_makers ON model_list.Maker = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId JOIN continents ON countries.Continent = continents.ContId WHERE continents.Continent = 'Europe'"}
{'input': [{'role': 'system', 'content': 'TASK: Answer the following question with syntactically correct SQLite SQL. The SQL should be correct and be in context of the previous question-answer pairs.\nTable car_makers, columns = [*,Id,Maker,FullName,Country]\nTable car_names, columns = [*,MakeId,Model,Make]\nTable cars_data, columns = [*,Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year]\nTable continents, columns = [*,ContId,Continent]\nTable countries, columns = [*,CountryId,CountryName,Continent]\nTable model_list, columns = [*,ModelId,Maker,Model]\nForeign_keys = [countries.Continent = continents.ContId,car_makers.Country = countries.CountryId,model_list.Maker = car_makers.Id,car_names.Model = model_list.Model,cars_data.Id = car_names.MakeId]'}, {'role': 'user', 'content': 'What is the average horsepower for cars made in the USA?'}], 'ideal': "SELECT AVG(cars_data.Horsepower) FROM cars_data JOIN car_names ON cars_data.Id = car_names.MakeId JOIN car_makers ON car_names.MakeId = car_makers.Id JOIN countries ON car_makers.Country = countries.CountryId WHERE countries.CountryName = 'USA'"}

接下来我们需要创建评估注册表以在框架中运行它。

evals框架需要一个结构包含以下属性的.yaml文件：

id - 评估的唯一标识符
description - 对你的评估的简短描述
disclaimer - 关于你的评估的附加说明
metrics - 我们可以选择三种评估指标类型：精确匹配、包含匹配、模糊匹配

在我们的评估中，我们将配置以下内容：

"""
spider-sql:
  id: spider-sql.dev.v0
  metrics: [accuracy]
  description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.
    Yu, Tao, et al. \"Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.\" Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.
  disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.
spider-sql.dev.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: sql/spider_sql.jsonl
    eval_type: cot_classify
    modelgraded_spec: sql
  """""

'\nspider-sql:\n  id: spider-sql.dev.v0\n  metrics: [accuracy]\n  description: Eval that scores SQL code from 194 examples in the Spider Text-to-SQL test dataset. The problems are selected by taking the first 10 problems for each database that appears in the test set.\n    Yu, Tao, et al. "Spider; A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1425.\n  disclaimer: Problems are solved zero-shot with no prompting other than the schema; performance may improve with training examples, fine tuning, or a different schema format. Evaluation is currently done through model-grading, where SQL code is not actually executed; the model may judge correct SQL to be incorrect, or vice-versa.\nspider-sql.dev.v0:\n  class: evals.elsuite.modelgraded.classify:ModelBasedClassify\n  args:\n    samples_jsonl: sql/spider_sql.jsonl\n    eval_type: cot_classify\n    modelgraded_spec: sql\n  '

运行评估

我们可以使用oaieval命令行工具运行此评估。要进行设置，请安装该库：pip install .（如果您在本地运行OpenAI Evals库）或pip install oaieval（如果您要运行现有评估）。

然后，使用CLI运行评估：oaieval gpt-3.5-turbo spider-sql

该命令需要一个模型名称和一个评估集名称。请注意我们提供了两个命令行接口(CLI)：oaieval用于运行单个评估，oaievalset用于运行一组评估。有效的评估名称在evals/registry/evals下的YAML文件中指定，其对应的实现可以在evals/elsuite中找到。

!pip install evals --quiet

oaieval 命令行工具可以接受各种参数来修改默认行为。您可以运行 oaieval --help 查看完整的命令行选项列表。

运行该命令后，您将在控制台看到打印的最终准确率报告，以及包含完整报告的临时文件路径。

oaieval 会在 evals/registry/evals 目录中查找 spider-sql 评估 YAML 文件，遵循上述单元格4中指定的格式。评估数据集的路径在评估 YAML 文件的 args: 参数下指定为 samples_jsonl: sql/spider_sql.jsonl，文件内容采用 JSONL 格式（如上述步骤3所生成）。

运行该命令后，您将在控制台看到打印的最终准确率报告，以及包含完整报告的临时文件路径。

!oaieval gpt-3.5-turbo spider-sql --max_samples 25

[2024-03-26 19:44:39,836] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/evals
[2024-03-26 19:44:43,623] [registry.py:257] Loading registry from /Users/shyamal/.evals/evals
[2024-03-26 19:44:43,635] [oaieval.py:189] [1;35mRun started: 240327024443FACXGMKA[0m
[2024-03-26 19:44:43,663] [registry.py:257] Loading registry from /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/modelgraded
[2024-03-26 19:44:43,851] [registry.py:257] Loading registry from /Users/shyamal/.evals/modelgraded
[2024-03-26 19:44:43,853] [data.py:90] Fetching /Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry/data/sql/spider_sql.jsonl
[2024-03-26 19:44:43,878] [eval.py:36] Evaluating 25 samples
[2024-03-26 19:44:43,952] [eval.py:144] Running in threaded mode with 10 threads!
  0%|                                                    | 0/25 [00:00<?, ?it/s][2024-03-26 19:44:44,810] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:44,829] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:44,991] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:45,090] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:45,145] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:45,971] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:46,040] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:46,069] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:46,378] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:46,587] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:47,412] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  4%|█▊                                          | 1/25 [00:03<01:23,  3.46s/it][2024-03-26 19:44:47,714] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  8%|███▌                                        | 2/25 [00:03<00:36,  1.60s/it][2024-03-26 19:44:47,947] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 12%|█████▎                                      | 3/25 [00:03<00:21,  1.02it/s][2024-03-26 19:44:48,413] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:48,643] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 16%|███████                                     | 4/25 [00:04<00:18,  1.15it/s][2024-03-26 19:44:48,909] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 20%|████████▊                                   | 5/25 [00:04<00:12,  1.54it/s][2024-03-26 19:44:49,131] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:49,500] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:49,530] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 24%|██████████▌                                 | 6/25 [00:05<00:12,  1.56it/s][2024-03-26 19:44:49,962] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:49,964] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:49,967] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 28%|████████████▎                               | 7/25 [00:06<00:10,  1.73it/s][2024-03-26 19:44:50,577] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:50,602] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:50,634] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:50,862] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:51,503] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:51,608] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 40%|█████████████████▏                         | 10/25 [00:07<00:08,  1.79it/s][2024-03-26 19:44:51,801] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 44%|██████████████████▉                        | 11/25 [00:07<00:06,  2.09it/s][2024-03-26 19:44:51,856] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:51,969] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:52,227] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 52%|██████████████████████▎                    | 13/25 [00:08<00:04,  2.65it/s][2024-03-26 19:44:52,450] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:52,526] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:52,615] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 56%|████████████████████████                   | 14/25 [00:08<00:04,  2.64it/s][2024-03-26 19:44:52,625] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:52,777] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:53,653] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 60%|█████████████████████████▊                 | 15/25 [00:09<00:05,  1.87it/s][2024-03-26 19:44:53,670] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:54,028] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 68%|█████████████████████████████▏             | 17/25 [00:10<00:03,  2.54it/s][2024-03-26 19:44:54,388] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:54,396] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 72%|██████████████████████████████▉            | 18/25 [00:10<00:02,  2.58it/s][2024-03-26 19:44:54,529] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
[2024-03-26 19:44:54,585] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 76%|████████████████████████████████▋          | 19/25 [00:10<00:02,  2.94it/s][2024-03-26 19:44:54,980] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 80%|██████████████████████████████████▍        | 20/25 [00:11<00:01,  2.82it/s][2024-03-26 19:44:55,152] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 84%|████████████████████████████████████       | 21/25 [00:11<00:01,  3.27it/s][2024-03-26 19:44:56,420] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 88%|█████████████████████████████████████▊     | 22/25 [00:12<00:01,  1.75it/s][2024-03-26 19:44:56,984] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 92%|███████████████████████████████████████▌   | 23/25 [00:13<00:01,  1.76it/s][2024-03-26 19:44:57,370] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 96%|█████████████████████████████████████████▎ | 24/25 [00:13<00:00,  1.94it/s][2024-03-26 19:44:59,589] [_client.py:1026] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
100%|███████████████████████████████████████████| 25/25 [00:15<00:00,  1.60it/s]
[2024-03-26 19:44:59,607] [record.py:360] Final report: {'counts/Correct': 20, 'counts/Incorrect': 5, 'score': 0.8}. Logged to /tmp/evallogs/240327024443FACXGMKA_gpt-3.5-turbo_spider-sql.jsonl
[2024-03-26 19:44:59,608] [oaieval.py:229] Final report:
[2024-03-26 19:44:59,608] [oaieval.py:231] counts/Correct: 20
[2024-03-26 19:44:59,608] [oaieval.py:231] counts/Incorrect: 5
[2024-03-26 19:44:59,608] [oaieval.py:231] score: 0.8
[2024-03-26 19:44:59,640] [record.py:349] Logged 75 rows of events to /tmp/evallogs/240327024443FACXGMKA_gpt-3.5-turbo_spider-sql.jsonl: insert_time=27.915ms

oaievalset 需要一个模型名称和一个评估集名称，有效选项在 evals/registry/eval_sets 目录下的 YAML 文件中指定。

查看评估日志

评估日志位于/tmp/evallogs，每次评估运行都会创建不同的日志文件。

log_name = '240327024443FACXGMKA_gpt-3.5-turbo_spider-sql.jsonl' # "EDIT THIS" - copy from above
events = f"/tmp/evallogs/{log_name}"
display(pd.read_json(events, lines=True).head(5))

	规格	最终报告	运行ID	事件ID	样本ID	类型	数据	创建者	创建时间
0	{'completion_fns': ['gpt-3.5-turbo'], 'eval_name': 'spider-sql.dev.v0', 'base_eval': 'spider-sql', 'split': 'dev', 'run_config': {'completion_fns': ['gpt-3.5-turbo'], 'eval_spec': {'cls': 'evals.elsuite.modelgraded.classify:ModelBasedClassify', 'registry_path': '/Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry', 'args': {'samples_jsonl': 'sql/spider_sql.jsonl', 'eval_type': 'cot_classify', 'modelgraded_spec': 'sql'}, 'key': 'spider-sql.dev.v0', 'group': 'sql'}, 'seed': 20220722, 'max_samples': 25, 'command': '/Users/shyamal/.virtualenvs/openai/bin/oaieval gpt-3.5-turbo spider-sql --max_samples 25', 'initial_settings': {'visible': False}}, 'created_by': '', 'run_id': '240327024443FACXGMKA', 'created_at': '2024-03-27 02:44:43.626043'}	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaT
1	NaN	{'counts/Correct': 20, 'counts/Incorrect': 5, 'score': 0.8}	NaN	NaN	NaN	NaN	NaN	NaN	NaT
2	NaN	NaN	240327024443FACXGMKA	0.0	spider-sql.dev.88	sampling	{'prompt': [{'content': '请用语法正确的SQLite SQL回答以下问题。要有创意但SQL必须正确。仅使用以下表和列：表：players。列：player_id（数字）、first_name（文本）、last_name（文本）、hand（文本）、birth_date（时间）、country_code（文本）表：matches。列：best_of（数字）、draw_size（数字）、loser_age（数字）、loser_entry（文本）、loser_hand（文本）、loser_ht（数字）、loser_id（数字）、loser_ioc（文本）、loser_name（文本）、loser_rank（数字）、loser_rank_points（数字）、loser_seed（数字）、match_num（数字）、minutes（数字）、round（文本）、score（文本）、surface（文本）、tourney_date（时间）、tourney_id（文本）、tourney_level（文本）、tourney_name（文本）、winner_age（数字）、winner_entry（文本）、winner_hand（文本）、winner_ht（数字）、winner_id（数字）、winner_ioc（文本）、winner_name（文本）、winner_rank（数字）、winner_rank_points（数字）、winner_seed（数字）、year（数字）表：rankings。列：ranking_date（时间）、ranking（数字）、player_id（数字）、ranking_points（数字）、tours（数字）问题：找出所有比赛中获胜者的平均排名。 ', 'role': 'system'}], 'sampled': ['SELECT AVG(winner_rank) AS average_rank_of_winners FROM matches;']}		2024-03-27 02:44:44.821110+00:00
3	NaN	NaN	240327024443FACXGMKA	1.0	spider-sql.dev.82	sampling	{'prompt': [{'content': '请用语法正确的SQLite SQL回答以下问题。要有创意但SQL必须正确。仅使用以下表和列：表：players。列：player_id（数字）、first_name（文本）、last_name（文本）、hand（文本）、birth_date（时间）、country_code（文本）表：matches。列：best_of（数字）、draw_size（数字）、loser_age（数字）、loser_entry（文本）、loser_hand（文本）、loser_ht（数字）、loser_id（数字）、loser_ioc（文本）、loser_name（文本）、loser_rank（数字）、loser_rank_points（数字）、loser_seed（数字）、match_num（数字）、minutes（数字）、round（文本）、score（文本）、surface（文本）、tourney_date（时间）、tourney_id（文本）、tourney_level（文本）、tourney_name（文本）、winner_age（数字）、winner_entry（文本）、winner_hand（文本）、winner_ht（数字）、winner_id（数字）、winner_ioc（文本）、winner_name（文本）、winner_rank（数字）、winner_rank_points（数字）、winner_seed（数字）、year（数字）表：rankings。列：ranking_date（时间）、ranking（数字）、player_id（数字）、ranking_points（数字）、tours（数字）问题：查找比赛总场数。 ', 'role': 'system'}], 'sampled': ['SELECT COUNT(*) AS total_matches FROM matches;']}		2024-03-27 02:44:44.831848+00:00
4	NaN	NaN	240327024443FACXGMKA	2.0	spider-sql.dev.25	sampling	{'prompt': [{'content': '请用语法正确的SQLite SQL回答以下问题。要有创意但SQL必须正确。仅使用以下表和列：表：continents。列：ContId（数字），Continent（文本）表：countries。列：CountryId（数字），CountryName（文本），Continent（数字）表：car_makers。列：Id（数字），Maker（文本），FullName（文本），Country（文本）表：model_list。列：ModelId（数字），Maker（数字），Model（文本）表：car_names。列：MakeId（数字），Model（文本），Make（文本）表：cars_data。列：Id（数字），MPG（文本），Cylinders（数字），Edispl（数字），Horsepower（文本），Weight（数字），Accelerate（数字），Year（数字）问题：有多少个国家存在？ ', 'role': 'system'}], 'sampled': ['SELECT COUNT(*) AS TotalCountries FROM countries;']}		2024-03-27 02:44:44.996647+00:00

# processing the log events generated by oaieval

with open(events, "r") as f:
    events_df = pd.read_json(f, lines=True)

该文件将包含评估的结构化日志。第一条记录提供了评估的详细规范，包括完成函数、评估名称、运行配置、创建者姓名、运行ID和创建时间戳。

display(events_df.iloc[0].spec)

{'completion_fns': ['gpt-3.5-turbo'],
 'eval_name': 'spider-sql.dev.v0',
 'base_eval': 'spider-sql',
 'split': 'dev',
 'run_config': {'completion_fns': ['gpt-3.5-turbo'],
  'eval_spec': {'cls': 'evals.elsuite.modelgraded.classify:ModelBasedClassify',
   'registry_path': '/Users/shyamal/.virtualenvs/openai/lib/python3.11/site-packages/evals/registry',
   'args': {'samples_jsonl': 'sql/spider_sql.jsonl',
    'eval_type': 'cot_classify',
    'modelgraded_spec': 'sql'},
   'key': 'spider-sql.dev.v0',
   'group': 'sql'},
  'seed': 20220722,
  'max_samples': 25,
  'command': '/Users/shyamal/.virtualenvs/openai/bin/oaieval gpt-3.5-turbo spider-sql --max_samples 25',
  'initial_settings': {'visible': False}},
 'created_by': '',
 'run_id': '240327024443FACXGMKA',
 'created_at': '2024-03-27 02:44:43.626043'}

让我们也看看提供评估最终报告的条目。

display(events_df.dropna(subset=['final_report']).iloc[0]['final_report'])

{'counts/Correct': 20, 'counts/Incorrect': 5, 'score': 0.8}

我们还可以查看提供具体样本(sample_id)、结果、事件类型和元数据的单个评估事件。

pd.set_option('display.max_colwidth', None)  # None means no truncation
display(events_df.iloc[2][['run_id', 'event_id', 'sample_id', 'type', 'data', 'created_at']])

run_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                240327024443FACXGMKA
event_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0.0
sample_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                spider-sql.dev.88
type                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              sampling
data          {'prompt': [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.
Use only the following tables and columns:
Table: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text)
Table: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number)
Table: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number)

Question: Find the average rank of winners in all matches.
', 'role': 'system'}], 'sampled': ['SELECT AVG(winner_rank) AS average_rank_of_winners
FROM matches;']}
created_at                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2024-03-27 02:44:44.821110+00:00
Name: 2, dtype: object

# Inspect samples
for i, row in events_df[events_df['type'] == 'sampling'].head(5).iterrows():
    data = pd.json_normalize(row['data'])
    print(f"Prompt: {data['prompt'].iloc[0]}")
    print(f"Sampled: {data['sampled'].iloc[0]}")
    print("-" * 10)

Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\nUse only the following tables and columns:\nTable: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text)\nTable: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number)\nTable: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number)\n\nQuestion: Find the average rank of winners in all matches.\n', 'role': 'system'}]
Sampled: ['SELECT AVG(winner_rank) AS average_rank_of_winners\nFROM matches;']
----------
Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\nUse only the following tables and columns:\nTable: players. Columns: player_id (number), first_name (text), last_name (text), hand (text), birth_date (time), country_code (text)\nTable: matches. Columns: best_of (number), draw_size (number), loser_age (number), loser_entry (text), loser_hand (text), loser_ht (number), loser_id (number), loser_ioc (text), loser_name (text), loser_rank (number), loser_rank_points (number), loser_seed (number), match_num (number), minutes (number), round (text), score (text), surface (text), tourney_date (time), tourney_id (text), tourney_level (text), tourney_name (text), winner_age (number), winner_entry (text), winner_hand (text), winner_ht (number), winner_id (number), winner_ioc (text), winner_name (text), winner_rank (number), winner_rank_points (number), winner_seed (number), year (number)\nTable: rankings. Columns: ranking_date (time), ranking (number), player_id (number), ranking_points (number), tours (number)\n\nQuestion: Find the total number of matches.\n', 'role': 'system'}]
Sampled: ['SELECT COUNT(*) AS total_matches\nFROM matches;']
----------
Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\nUse only the following tables and columns:\nTable: continents. Columns: ContId (number), Continent (text)\nTable: countries. Columns: CountryId (number), CountryName (text), Continent (number)\nTable: car_makers. Columns: Id (number), Maker (text), FullName (text), Country (text)\nTable: model_list. Columns: ModelId (number), Maker (number), Model (text)\nTable: car_names. Columns: MakeId (number), Model (text), Make (text)\nTable: cars_data. Columns: Id (number), MPG (text), Cylinders (number), Edispl (number), Horsepower (text), Weight (number), Accelerate (number), Year (number)\n\nQuestion: How many countries exist?\n', 'role': 'system'}]
Sampled: ['SELECT COUNT(*) AS TotalCountries\nFROM countries;']
----------
Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\nUse only the following tables and columns:\nTable: TV_Channel. Columns: id (text), series_name (text), Country (text), Language (text), Content (text), Pixel_aspect_ratio_PAR (text), Hight_definition_TV (text), Pay_per_view_PPV (text), Package_Option (text)\nTable: TV_series. Columns: id (number), Episode (text), Air_Date (text), Rating (text), Share (number), 18_49_Rating_Share (text), Viewers_m (text), Weekly_Rank (number), Channel (text)\nTable: Cartoon. Columns: id (number), Title (text), Directed_by (text), Written_by (text), Original_air_date (text), Production_code (number), Channel (text)\n\nQuestion: What is the name and directors of all the cartoons that are ordered by air date?\n', 'role': 'system'}]
Sampled: ['SELECT Title, Directed_by\nFROM Cartoon\nORDER BY Original_air_date;']
----------
Prompt: [{'content': 'Answer the following question with syntactically correct SQLite SQL. Be creative but the SQL must be correct.\nUse only the following tables and columns:\nTable: stadium. Columns: Stadium_ID (number), Location (text), Name (text), Capacity (number), Highest (number), Lowest (number), Average (number)\nTable: singer. Columns: Singer_ID (number), Name (text), Country (text), Song_Name (text), Song_release_year (text), Age (number), Is_male (others)\nTable: concert. Columns: concert_ID (number), concert_Name (text), Theme (text), Stadium_ID (text), Year (text)\nTable: singer_in_concert. Columns: concert_ID (number), Singer_ID (text)\n\nQuestion: Show the name and the release year of the song by the youngest singer.\n', 'role': 'system'}]
Sampled: ['```sql\nSELECT s.Name, s.Song_release_year\nFROM singer s\nWHERE s.Age = (SELECT MIN(Age) FROM singer)\n```']
----------

让我们回顾失败案例，了解哪些测试未通过。

def pretty_print_text(prompt):
    # Define markers for the sections
    markers = {
        "question": "[Question]:",
        "expert": "[Expert]:",
        "submission": "[Submission]:",
        "end": "[END DATA]"
    }
    
    # Function to extract text between markers
    def extract_text(start_marker, end_marker):
        start = prompt.find(start_marker) + len(start_marker)
        end = prompt.find(end_marker)
        text = prompt[start:end].strip()
        if start_marker == markers["question"]:
            text = text.split("\n\nQuestion:")[-1].strip() if "\n\nQuestion:" in text else text
        elif start_marker == markers["submission"]:
            text = text.replace("```sql", "").replace("```", "").strip()
        return text
    
    # Extracting text for each section
    question_text = extract_text(markers["question"], markers["expert"])
    expert_text = extract_text(markers["expert"], markers["submission"])
    submission_text = extract_text(markers["submission"], markers["end"])
    
    # HTML color codes and formatting
    colors = {
        "question": '<span style="color: #0000FF;">QUESTION:<br>', 
        "expert": '<span style="color: #008000;">EXPECTED:<br>',  
        "submission": '<span style="color: #FFA500;">SUBMISSION:<br>' 
    }
    color_end = '</span>'
    
    # Display each section with color
    from IPython.display import display, HTML
    display(HTML(f"{colors['question']}{question_text}{color_end}"))
    display(HTML(f"{colors['expert']}{expert_text}{color_end}"))
    display(HTML(f"{colors['submission']}{submission_text}{color_end}"))

# Inspect metrics where choice is made and print only the prompt, result, and expected result if the choice is incorrect
for i, row in events_df[events_df['type'] == 'metrics'].iterrows():
    if row['data']['choice'] == 'Incorrect':
        # Get the previous row's data, which contains the prompt and the expected result
        prev_row = events_df.iloc[i-1]
        prompt = prev_row['data']['prompt'][0]['content'] if 'prompt' in prev_row['data'] and len(prev_row['data']['prompt']) > 0 else "Prompt not available"
        expected_result = prev_row['data'].get('ideal', 'Expected result not provided')
        
        # Current row's data will be the actual result
        result = row['data'].get('result', 'Actual result not provided')
        
        pretty_print_text(prompt)
        print("-" * 40)

问题：
有多少国家采用共和制作为其政府形式？ ************

预期结果：
SELECT count(*) FROM country WHERE GovernmentForm = "Republic" ************

提交内容：
SELECT COUNT(*) FROM country WHERE GovernmentForm LIKE '%Republic%' ************

----------------------------------------

问题：
返回名为Robbin CV的文档的文档ID、模板ID和描述。 ************

预期结果：
SELECT document_id , template_id , Document_Description FROM Documents WHERE document_name = "Robbin CV" ************

提交内容：
SELECT Documents.Document_ID, Documents.Template_ID, Documents.Document_Description FROM Documents JOIN Templates ON Documents.Template_ID = Templates.Template_ID WHERE Documents.Document_Name = 'Robbin CV'; ************

----------------------------------------

问题：
哪些专业人士居住在印第安纳州或已执行超过2次治疗？列出他或她的ID、姓氏和手机号码。 ************

预期结果：
SELECT professional_id , last_name , cell_number FROM Professionals WHERE state = 'Indiana' UNION SELECT T1.professional_id , T1.last_name , T1.cell_number FROM Professionals AS T1 JOIN Treatments AS T2 ON T1.professional_id = T2.professional_id GROUP BY T1.professional_id HAVING count(*) > 2 ************

提交内容：
SELECT professional_id, last_name, cell_number FROM Professionals WHERE state = 'Indiana' OR professional_id IN ( SELECT professional_id FROM Treatments GROUP BY professional_id HAVING COUNT(*) > 2 ); ************

----------------------------------------

问题：
安圭拉属于哪个大洲？ ************

预期结果：
SELECT Continent FROM country WHERE Name = "Anguilla" ************

提交内容：
SELECT c.Continent FROM country c WHERE c.Code = 'AIA'; ************

----------------------------------------

问题：
我们有多少家航空公司？ ************

预期结果：
SELECT count(*) FROM AIRLINES ************

提交内容：
SELECT COUNT(DISTINCT Airline) AS TotalAirlines FROM airlines; ************

----------------------------------------

回顾我们观察到的一些失败案例，可以看到以下情况：

第二个错误答案中包含了与'Templates'表的不必要连接。我们的评估能够准确识别这一点并将其标记为错误。
Few other answers have minor syntax differences that caused the answers to get flagged.
- 在这种情况下，值得探讨的是我们应该继续优化提示词以确保特定的风格选择，还是应该调整评估套件来捕捉这种变化。
- 这类失败暗示可能需要模型评分评估作为一种确保结果评分准确性的方法

结论

构建有效的评估是基于大语言模型(LLM)应用开发周期的核心环节。OpenAI Evals框架提供了开箱即用的评估构建核心结构，使您能够快速为各种用例创建新测试。本指南逐步演示了如何创建评估、运行评估并分析结果。

本指南展示的示例代表了evals的一个简单用例。随着您继续探索这个框架，我们建议您尝试为实际生产用例创建更复杂的模型评分evals。祝您评估愉快！

2024年3月21日

OpenAI Evals 入门指南