Leveraging model distillation to fine-tune a model

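OpenAI recently released Distillation, which lets you use the outputs of a (large) model to fine-tune another (smaller) model. Moving to the smaller model can significantly reduce price and latency for specific tasks. In this cookbook, we'll take a dataset, distill the outputs of gpt-4o into gpt-4o-mini, and show how to get significantly better results than with a generic, non-distilled 4o-mini.

We'll also use Structured Outputs for a classification problem that relies on a list of enums. We'll see how a fine-tuned model can benefit from Structured Outputs and how this affects performance. We'll show that Structured Outputs work with all of these models, including the distilled one.

We'll first analyze the dataset, get the outputs of both 4o and 4o-mini and highlight the difference in performance between the two models, then distill and analyze the performance of the distilled model.

Prerequisites

Let's install and load the dependencies. Make sure your OpenAI API key is defined in your environment as "OPENAI_API_KEY"; the client will load it directly.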

! pip install openai tiktoken numpy pandas tqdm --quiet

import openai
import json
import tiktoken
from tqdm import tqdm
from openai import OpenAI
import numpy as np
import concurrent.futures
import pandas as pd

client = OpenAI()

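Load and understand the dataset

For this cookbook, we'll load data from the following Kaggle challenge: https://www.kaggle.com/datasets/zynicide/wine-reviews.

This dataset has a large number of rows, and you're free to run this cookbook on the whole data, but as a biased French wine lover, I'll narrow the dataset down to French wines only, to focus on fewer rows and grape varieties.

We're looking at a classification problem: we want to guess the grape variety from all the other available criteria, including the description, subregion and province, all of which we'll include in the prompt. This gives the model a lot of information; you're also free to remove information that could significantly help the model, such as the region where the wine was produced, to see whether it can still identify the grape variety.

Let's filter out the grape varieties that appear fewer than 5 times in the reviews.

Then let's take a random subset of 500 rows from this dataset.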

df = pd.read_csv('data/winemag/winemag-data-130k-v2.csv')
df_france = df[df['country'] == 'France']

# Also filter out grape varieties with fewer than 5 reviews. We'd still like to identify them,
# but they are outliers we don't want to optimize for: they would make our enum list too long
# and could add noise to the rest of the dataset, eventually reducing our accuracy.

variety_counts = df_france['variety'].value_counts()
varieties_less_than_five_list = variety_counts[variety_counts < 5].index.tolist()
df_france = df_france[~df_france['variety'].isin(varieties_less_than_five_list)]

df_france_subset = df_france.sample(n=500)
df_france_subset.head()

	Unnamed: 0	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
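95206	95206	France	A full-bodied, rich, ripe and perfumed wine...	Château de Mercey Premier Cru	91	35.0	Burgundy	Mercurey	NaN	Roger Voss	@vossroger	Antonin Rodet 2010 Château de Mercey Premier Cru...	Pinot Noir	Antonin Rodet
66403	66403	France	For a simple Chablis, this is impressive, rich...	Domaine	89	26.0	Burgundy	Chablis	NaN	Roger Voss	@vossroger	William Fèvre 2005 Domaine (Chablis)	Chardonnay	William Fèvre
71277	71277	France	This 50-50 blend of Marselan and Merlot opens with...	La Remise	84	13.0	France Other	Vin de France	NaN	Lauren Buzzeo	@laurbuzz	Domaine de la Mordorée 2014 La Remise Red (Vin de France...	Red Blend	Domaine de la Mordorée
27484	27484	France	This sturdy, easy-drinking wine offers medium-intense aromas...	Authentic & Chic	86	10.0	France Other	Vin de France	NaN	Lauren Buzzeo	@laurbuzz	Romantic 2014 Authentic & Chic Cabernet Sauvignon...	Cabernet Sauvignon	Romantic
124917	124917	France	Fresh, pure notes of Conference pear peel...	NaN	89	30.0	Alsace	Alsace	NaN	Anne Krebiehl MW	@AnneInVino	Domaine Vincent Stoeffler 2015 Pinot Gris (Alsace...	Pinot Gris	Domaine Vincent Stoeffler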
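The filtering and sampling steps can be tried on a small table first. The following is a minimal sketch on hypothetical toy data (not the Kaggle file); the `random_state` value is an arbitrary assumption, added only so that the sampled subset is the same on every run.

```python
import pandas as pd

# Toy stand-in for the wine dataframe (hypothetical data, not the Kaggle file)
toy = pd.DataFrame({
    "variety": ["Pinot Noir"] * 6 + ["Chardonnay"] * 5 + ["Duras"] * 2,
    "points": range(13),
})

min_reviews = 5  # same threshold as in the text above

# Count reviews per variety once, then keep rows whose variety is frequent enough
counts = toy["variety"].value_counts()
frequent = counts[counts >= min_reviews].index
toy_filtered = toy[toy["variety"].isin(frequent)]

# random_state makes the sample repeatable between runs (42 is an arbitrary choice)
toy_subset = toy_filtered.sample(n=4, random_state=42)

print(sorted(toy_filtered["variety"].unique()))
print(len(toy_subset))
```

Without a fixed `random_state`, each run draws a different subset, so the accuracy figures reported later in this cookbook will vary slightly from run to run.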

Let's retrieve all grape varieties so we can include them in the prompt and in the enum list of our Structured Outputs.

varieties = np.array(df_france['variety'].unique()).astype('str')
varieties

array(['Gewürztraminer', 'Pinot Gris', 'Gamay',
       'Bordeaux-style White Blend', 'Champagne Blend', 'Chardonnay',
       'Petit Manseng', 'Riesling', 'White Blend', 'Pinot Blanc',
       'Alsace white blend', 'Bordeaux-style Red Blend', 'Malbec',
       'Tannat-Cabernet', 'Rhône-style Red Blend', 'Ugni Blanc-Colombard',
       'Savagnin', 'Pinot Noir', 'Rosé', 'Melon',
       'Rhône-style White Blend', 'Pinot Noir-Gamay', 'Colombard',
       'Chenin Blanc', 'Sylvaner', 'Sauvignon Blanc', 'Red Blend',
       'Chenin Blanc-Chardonnay', 'Cabernet Sauvignon', 'Cabernet Franc',
       'Syrah', 'Sparkling Blend', 'Duras', 'Provence red blend',
       'Tannat', 'Merlot', 'Malbec-Merlot', 'Chardonnay-Viognier',
       'Cabernet Franc-Cabernet Sauvignon', 'Muscat', 'Viognier',
       'Picpoul', 'Altesse', 'Provence white blend', 'Mondeuse',
       'Grenache-Syrah', 'G-S-M', 'Pinot Meunier', 'Cabernet-Syrah',
       'Vermentino', 'Marsanne', 'Colombard-Sauvignon Blanc',
       'Gros and Petit Manseng', 'Jacquère', 'Negrette', 'Mauzac',
       'Pinot Auxerrois', 'Grenache', 'Roussanne', 'Gros Manseng',
       'Tannat-Merlot', 'Aligoté', 'Chasselas', "Loin de l'Oeil",
       'Malbec-Tannat', 'Carignan', 'Colombard-Ugni Blanc', 'Sémillon',
       'Syrah-Grenache', 'Sciaccerellu', 'Auxerrois', 'Mourvèdre',
       'Tannat-Cabernet Franc', 'Braucol', 'Trousseau',
       'Merlot-Cabernet Sauvignon'], dtype='<U33')

Generate the prompt

Let's build a function that generates the prompt, and try it on the first wine of the list.

def generate_prompt(row, varieties):
    # Format the varieties list as a comma-separated string
    variety_list = ', '.join(varieties)
    
    prompt = f"""
    Based on this wine review, guess the grape variety:
    This wine is produced by {row['winery']} in the {row['province']} region of {row['country']}.
    It was grown in {row['region_1']}. It is described as: "{row['description']}".
    The wine has been reviewed by {row['taster_name']} and received {row['points']} points.
    The price is {row['price']}.

    Here is a list of possible grape varieties to choose from: {variety_list}.
    
    What is the likely grape variety? Answer only with the grape variety name or blend from the list.
    """
    return prompt

# Example usage with a specific row
prompt = generate_prompt(df_france.iloc[0], varieties)
prompt

'\n    Based on this wine review, guess the grape variety:\n    This wine is produced by Trimbach in the Alsace region of France.\n    It was grown in Alsace. It is described as: "This dry and restrained wine offers spice in profusion. Balanced with acidity and a firm texture, it\'s very much for food.".\n    The wine has been reviewed by Roger Voss and received 87 points.\n    The price is 24.0.\n\n    Here is a list of possible grape varieties to choose from: Gewürztraminer, Pinot Gris, Gamay, Bordeaux-style White Blend, Champagne Blend, Chardonnay, Petit Manseng, Riesling, White Blend, Pinot Blanc, Alsace white blend, Bordeaux-style Red Blend, Malbec, Tannat-Cabernet, Rhône-style Red Blend, Ugni Blanc-Colombard, Savagnin, Pinot Noir, Rosé, Melon, Rhône-style White Blend, Pinot Noir-Gamay, Colombard, Chenin Blanc, Sylvaner, Sauvignon Blanc, Red Blend, Chenin Blanc-Chardonnay, Cabernet Sauvignon, Cabernet Franc, Syrah, Sparkling Blend, Duras, Provence red blend, Tannat, Merlot, Malbec-Merlot, Chardonnay-Viognier, Cabernet Franc-Cabernet Sauvignon, Muscat, Viognier, Picpoul, Altesse, Provence white blend, Mondeuse, Grenache-Syrah, G-S-M, Pinot Meunier, Cabernet-Syrah, Vermentino, Marsanne, Colombard-Sauvignon Blanc, Gros and Petit Manseng, Jacquère, Negrette, Mauzac, Pinot Auxerrois, Grenache, Roussanne, Gros Manseng, Tannat-Merlot, Aligoté, Chasselas, Loin de l\'Oeil, Malbec-Tannat, Carignan, Colombard-Ugni Blanc, Sémillon, Syrah-Grenache, Sciaccerellu, Auxerrois, Mourvèdre, Tannat-Cabernet Franc, Braucol, Trousseau, Merlot-Cabernet Sauvignon.\n    \n    What is the likely grape variety? Answer only with the grape variety name or blend from the list.\n    '
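Some columns in the sample table contain NaN values, and an f-string would render a missing value as the literal text "nan" in the prompt. If a field used by the prompt can be missing, one option is to substitute a placeholder. The helper below is a hypothetical sketch (the name `field_or_unknown` and the "unknown" fallback are assumptions, not part of the original notebook).

```python
import math

def field_or_unknown(value, fallback="unknown"):
    """Return the value as text, or a fallback when it is missing (None or NaN)."""
    if value is None:
        return fallback
    # NaN is a float that is not equal to itself; math.isnan detects it
    if isinstance(value, float) and math.isnan(value):
        return fallback
    return str(value)

# Hypothetical row with a missing region
row = {"winery": "Trimbach", "region_1": float("nan"), "price": 24.0}
sentence = f"It was grown in {field_or_unknown(row['region_1'])}. The price is {field_or_unknown(row['price'])}."
print(sentence)
```

Inside `generate_prompt`, each `{row[...]}` placeholder could be wrapped with this helper; whether that helps accuracy is something to measure rather than assume.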

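To get an idea of the cost before running the queries, you can use tiktoken to count the tokens we'll send and estimate the associated cost. The estimate below covers only the input tokens of the completions. It does not include output tokens or the fine-tuning process (used later in this cookbook when running the distillation), whose cost depends on other factors such as the number of epochs and the size of the training set.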

# Load the encoding used by the gpt-4o model
enc = tiktoken.encoding_for_model("gpt-4o")

# Initialize a variable to store the total number of tokens
total_tokens = 0

for index, row in df_france_subset.iterrows():
    prompt = generate_prompt(row, varieties)
    
    # Tokenize the input text and count tokens
    tokens = enc.encode(prompt)
    token_count = len(tokens)
    
    # Add the token count to the total
    total_tokens += token_count

print(f"Total number of tokens in the dataset: {total_tokens}")
print(f"Total number of prompts: {len(df_france_subset)}")

Total number of tokens in the dataset: 245439
Total number of prompts: 500

# Output the cost in $ (prices as of 2024/10/16)

gpt4o_token_price = 2.50 / 1_000_000  # $2.50 per 1M tokens
gpt4o_mini_token_price = 0.150 / 1_000_000  # $0.15 per 1M tokens

total_gpt4o_cost = gpt4o_token_price*total_tokens
total_gpt4o_mini_cost = gpt4o_mini_token_price*total_tokens

print(total_gpt4o_cost)
print(total_gpt4o_mini_cost)

0.6135975
0.03681585
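The figures above count input tokens only. As a rough sketch, output tokens can be folded into the same estimate. The output prices ($10.00 and $0.60 per 1M tokens) and the guess of about 10 output tokens per completion are assumptions made for illustration; check them against the current pricing page and your own completions.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimate the cost in $ from token counts and prices expressed per 1M tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

input_tokens = 245439          # total printed by the token-counting cell above
output_tokens = 500 * 10       # assumption: ~10 output tokens for each of the 500 prompts

# Assumed prices per 1M tokens: (input, output)
prices = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.150, 0.600),
}

for model_name, (input_price, output_price) in prices.items():
    cost = estimate_cost(input_tokens, output_tokens, input_price, output_price)
    print(f"{model_name}: ${cost:.4f}")
```

Because the structured output is a short JSON object, the input side dominates the total for this task.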

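Prepare functions to store completions

Since we're looking at a limited list of responses (the enum list of grape varieties), let's use Structured Outputs to make sure the model answers from this list. This also lets us compare the model's answer directly with the grape variety and get a deterministic answer format (as opposed to the model answering "I think the grape is Pinot Noir" instead of just "Pinot Noir"), and it improves performance by ruling out grape varieties that are not in our dataset.

If you want to know more about Structured Outputs, you can read this cookbook and this documentation guide.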

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "grape-variety",
        "schema": {
            "type": "object",
            "properties": {
                "variety": {
                    "type": "string",
                    "enum": varieties.tolist()
                }
            },
            "additionalProperties": False,
            "required": ["variety"],
        },
        "strict": True
    }
}
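With a strict schema the returned content should be a JSON object whose "variety" value comes from the enum. A defensive check on the client side can still be useful, for example when reusing the parsing code with a model or mode that does not enforce the schema. The function below is a hypothetical sketch (`parse_variety` is not part of the original notebook).

```python
import json

def parse_variety(raw_content, allowed_varieties):
    """Parse a JSON answer such as '{"variety": "Gamay"}' and check it against the allowed list."""
    data = json.loads(raw_content)
    variety = data["variety"]
    if variety not in allowed_varieties:
        raise ValueError(f"Unexpected variety: {variety!r}")
    return variety

# Short hypothetical list; in the notebook this would be varieties.tolist()
allowed = ["Pinot Noir", "Gamay", "Chardonnay"]
parsed = parse_variety('{"variety": "Pinot Noir"}', allowed)
print(parsed)
```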

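To distill a model, you need to store all of the model's completions so that they can serve as the reference for fine-tuning the smaller model. We therefore pass store=True to client.chat.completions.create so that these gpt-4o completions are stored.

We'll store all completions (including those from 4o-mini and from our future fine-tuned model) so that we can run Evals directly from the OpenAI platform.

When storing these completions, it's useful to attach a metadata tag, which lets you filter them on the OpenAI platform and run distillation and evals on the specific set of completions you want.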

# Metadata tag used to filter the stored completions on the OpenAI platform
metadata_value = "wine-distillation" # that's a funny metadata tag :-)

# Function to call the API and process the result for a single model (blocking call in this case)
def call_model(model, prompt):
    response = client.chat.completions.create(
        model=model,
        store=True,
        metadata={
            "distillation": metadata_value,
        },
        messages=[
            {
                "role": "system",
                "content": "You're a sommelier expert and you know everything about wine. You answer precisely with the name of the variety/blend."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        response_format=response_format
    )
    return json.loads(response.choices[0].message.content.strip())['variety']
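When hundreds of requests run back to back, a few may fail for transient reasons such as rate limits or timeouts. One common way to handle this is to retry with exponential backoff. The wrapper below is a generic sketch, not part of the original notebook: `with_retries`, its parameters and the delay values are all assumptions, and the demo uses a fake function instead of a real API call.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, then 2x, 4x... and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))

# Demo with a fake call that fails twice before succeeding
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "Pinot Noir"

delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky_call, max_attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

In the notebook, the call could be wrapped as `with_retries(lambda: call_model(model, prompt))`.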

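Parallel processing

Since we have many rows to process, let's run these completions in parallel using concurrent.futures. We'll iterate over the dataframe and track progress with a tqdm progress bar. After running a model's completions, we store them in the same dataframe under the column name {model}-variety.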

def process_example(index, row, model, df, progress_bar):
    try:
        # Generate the prompt using the row
        prompt = generate_prompt(row, varieties)

        # Store the model's answer in the "{model}-variety" column of the same row
        df.at[index, model + "-variety"] = call_model(model, prompt)

        # Update the progress bar
        progress_bar.update(1)
    except Exception as e:
        print(f"Error processing model {model}: {str(e)}")

def process_dataframe(df, model):
    # Create a tqdm progress bar
    with tqdm(total=len(df), desc="Processing rows") as progress_bar:
        # Process each example concurrently using ThreadPoolExecutor
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = {executor.submit(process_example, index, row, model, df, progress_bar): index for index, row in df.iterrows()}

            for future in concurrent.futures.as_completed(futures):
                try:
                    future.result()  # Wait for each example to be processed
                except Exception as e:
                    print(f"Error processing example: {str(e)}")

    return df
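A variation on the same pattern is to let the worker threads return their results and to assemble them in the main thread, instead of writing to the dataframe from inside each thread. The sketch below shows that idea on toy inputs; `run_concurrently` and the `max_workers=8` limit are hypothetical choices, not part of the original notebook.

```python
import concurrent.futures

def run_concurrently(items, fn, max_workers=8):
    """Apply fn to each value of the items dict in a thread pool and return {key: result}."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the key (e.g. the dataframe index) it belongs to
        futures = {executor.submit(fn, value): key for key, value in items.items()}
        for future in concurrent.futures.as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:
                results[key] = None  # keep the key so failures stay visible
                print(f"Error processing {key}: {exc}")
    return results

# Toy usage: the keys play the role of row indexes, str.upper stands in for the model call
demo = run_concurrently({10: "gamay", 11: "syrah"}, str.upper)
print(demo)
```

The returned dict could then be assigned to a column in one step, for example with `pd.Series(results)`. Capping `max_workers` is also one way to stay under API rate limits.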

Before processing the whole dataframe, let's try our call_model function and check the output.

answer = call_model('gpt-4o', generate_prompt(df_france_subset.iloc[0], varieties))
answer

'Pinot Noir'

Great! We've confirmed that we get a grape variety as output. Now let's process the dataset with both gpt-4o and gpt-4o-mini and compare the results.

df_france_subset = process_dataframe(df_france_subset, "gpt-4o")

Processing rows: 100%|███████████████████████████████████████████████| 500/500 [00:41<00:00, 12.09it/s]

df_france_subset = process_dataframe(df_france_subset, "gpt-4o-mini")

Processing rows: 100%|███████████████████████████████████████████████| 500/500 [01:31<00:00,  5.45it/s]

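Comparing gpt-4o and gpt-4o-mini

Now that we have all chat completions for these two models, let's compare them against the expected grape variety and assess how accurately each model finds it. We do this directly in Python since a simple string comparison is enough, but if your task needs more complex evaluations you can use OpenAI Evals or our open-source eval framework.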

models = ['gpt-4o', 'gpt-4o-mini']

def get_accuracy(model, df):
    return np.mean(df['variety'] == df[model + '-variety'])

for model in models:
    print(f"{model} accuracy: {get_accuracy(model, df_france_subset) * 100:.2f}%")

gpt-4o accuracy: 81.80%
gpt-4o-mini accuracy: 69.00%

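We can see that gpt-4o is better at finding the grape variety than 4o-mini: 12.80 percentage points higher, or roughly 19% better relative to 4o-mini! Now I'm wondering if we're making gpt-4o drink wine during training!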
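A single accuracy number hides which grape varieties each model struggles with. As a possible next step, accuracy can be broken down per expected variety. The sketch below uses hypothetical toy predictions, and `accuracy_by_variety` is an illustrative helper rather than part of the original notebook; it relies on the same {model}-variety column convention used above.

```python
import pandas as pd

# Hypothetical predictions for five wines
toy = pd.DataFrame({
    "variety": ["Pinot Noir", "Pinot Noir", "Gamay", "Gamay", "Chardonnay"],
    "gpt-4o-variety": ["Pinot Noir", "Gamay", "Gamay", "Gamay", "Chardonnay"],
})

def accuracy_by_variety(df, model):
    """Return the share of correct answers for each expected variety, worst first."""
    correct = df["variety"] == df[model + "-variety"]
    return correct.groupby(df["variety"]).mean().sort_values()

per_variety = accuracy_by_variety(toy, "gpt-4o")
print(per_variety)
```

Varieties with few rows in the sample give noisy per-variety figures, so it helps to read them alongside the row counts.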

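Distilling gpt-4o outputs to gpt-4o-mini

Let's assume we'd like to run this prediction often: we want completions to be faster and cheaper while keeping the same level of accuracy. It would be great to distill 4o's accuracy into 4o-mini, wouldn't it? Let's do it!

We'll now go to the stored completions page on the OpenAI platform: https://platform.openai.com/chat-completions.

Let's select the model gpt-4o (make sure to do this; you don't want to distill the 4o-mini outputs we also stored). Let's also select the metadata distillation: wine-distillation to get only the completions stored from this cookbook.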

Filtering out completions

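Once the completions are selected, click the "Distill" button in the top-right corner to fine-tune a model on them. This automatically creates the file needed to run the fine-tuning process. Then select gpt-4o-mini as the base model and keep the default parameters (you're free to change them or iterate on them to improve performance).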

Distilling modal

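Once the fine-tuning job has started, you can retrieve its ID from the fine-tuning page. We'll use it to monitor the job's status and to retrieve the fine-tuned model ID once it's done.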

Fine tuning job

# copy paste your fine-tune job ID below
finetune_job = client.fine_tuning.jobs.retrieve("ftjob-pRyNWzUItmHpxmJ1TX7FOaWe")

if finetune_job.status == 'succeeded':
    fine_tuned_model = finetune_job.fine_tuned_model
    print('finetuned model: ' + fine_tuned_model)
else:
    print('finetuned job status: ' + finetune_job.status)

finetuned model: ft:gpt-4o-mini-2024-07-18:distillation-test:wine-distillation:AIZntSyE
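The cell above checks the job status once. Since a fine-tuning job can take a while, it may be convenient to poll until the job reaches a terminal status. The helper below is a hedged sketch: `wait_for_job`, the polling interval and the poll limit are assumptions, and the demo uses a fake retrieve function so that it runs without calling the API.

```python
import time
from types import SimpleNamespace

# Statuses after which a fine-tuning job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=240, sleep=time.sleep):
    """Call retrieve(job_id) repeatedly until the job status is terminal, then return the job."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

# Demo with a fake retrieve function that reports "running" twice, then "succeeded"
statuses = iter(["running", "running", "succeeded"])

def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(statuses), fine_tuned_model="ft:example")

job = wait_for_job(fake_retrieve, "ftjob-example", poll_seconds=0, sleep=lambda seconds: None)
print(job.status, job.fine_tuned_model)
```

With the real client, the call would look like `wait_for_job(client.fine_tuning.jobs.retrieve, "<your job ID>")`.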

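Run completions for the distilled model

Now that the model is fine-tuned, we can run completions with it and compare its accuracy with gpt-4o and gpt-4o-mini. Let's pick another subset of French wines (since we restricted the outputs to French grape varieties and removed outliers, the validation dataset needs to be restricted the same way). Let's run 300 rows for each model.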

validation_dataset = df_france.sample(n=300)

models.append(fine_tuned_model)

for model in models:
    another_subset = process_dataframe(validation_dataset, model)

Processing rows: 100%|███████████████████████████████████████████████| 300/300 [00:20<00:00, 14.69it/s]
Processing rows: 100%|███████████████████████████████████████████████| 300/300 [00:27<00:00, 10.99it/s]
Processing rows: 100%|███████████████████████████████████████████████| 300/300 [00:37<00:00,  8.08it/s]
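Note that the validation sample is drawn from the same df_france dataframe as the 500 rows used for distillation, so some rows may appear in both. To measure accuracy only on wines the fine-tuned model has not seen, one option is to drop the training rows before sampling. The sketch below shows the idea on hypothetical toy data; the variable names and `random_state` values are assumptions.

```python
import pandas as pd

# Toy stand-in for df_france with a unique index (hypothetical data)
all_rows = pd.DataFrame({"variety": ["Gamay"] * 10}, index=range(100, 110))

# Rows used to produce the distillation data (df_france_subset in the notebook)
train_subset = all_rows.sample(n=4, random_state=0)

# Remove those rows by index, then sample the validation set from what is left
held_out = all_rows.drop(train_subset.index)
validation = held_out.sample(n=3, random_state=1)

overlap = set(validation.index) & set(train_subset.index)
print(len(held_out), len(overlap))
```

In the notebook this would correspond to `df_france.drop(df_france_subset.index).sample(n=300)`, which works as long as the dataframe index is unique.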

Let's compare the accuracy of the models.

for model in models:
    print(f"{model} accuracy: {get_accuracy(model, another_subset) * 100:.2f}%")

gpt-4o accuracy: 79.67%
gpt-4o-mini accuracy: 64.67%
ft:gpt-4o-mini-2024-07-18:distillation-test:wine-distillation:AIZntSyE accuracy: 79.33%

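That's a relative improvement of almost 23% over the non-distilled gpt-4o-mini! 🎉

Our fine-tuned model performs much better than gpt-4o-mini while using the same base model, and it comes within half a percentage point of gpt-4o (79.33% vs 79.67%). We'll be able to use this model for future grape variety predictions at a lower cost and with lower latency.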
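To make the comparison explicit, the gain can be expressed both in percentage points (absolute) and relative to the baseline. The small helper below is an illustrative sketch (`accuracy_gain` is a hypothetical name) applied to the accuracy figures printed in this cookbook.

```python
def accuracy_gain(baseline, improved):
    """Return (gain in percentage points, gain in % relative to the baseline)."""
    absolute_points = improved - baseline
    relative_percent = absolute_points / baseline * 100
    return absolute_points, relative_percent

# Accuracy figures (in %) printed earlier in this cookbook
comparisons = {
    "gpt-4o vs gpt-4o-mini (500 rows)": (69.00, 81.80),
    "distilled vs gpt-4o-mini (300 rows)": (64.67, 79.33),
}

for label, (baseline, improved) in comparisons.items():
    points, relative = accuracy_gain(baseline, improved)
    print(f"{label}: +{points:.2f} points, +{relative:.1f}% relative")
```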

Oct 16, 2024
