We will fine-tune a babbage-002 classifier (a replacement for the legacy ada models) to distinguish between the two sports: Baseball and Hockey.
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai
import os
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
categories = ['rec.sport.baseball', 'rec.sport.hockey']
sports_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)

Data Exploration
The newsgroup dataset can be loaded using sklearn. First we will look at the data itself:
print(sports_dataset['data'][0])

From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:

|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th. There are ALWAYS tickets available! (Though they are playing Toronto, and many Toronto fans make the trip to Cleveland as it is easier to get tickets in Cleveland than in Toronto. Either way, I seriously doubt they will sell out until the end of the season.)

--
Doug Bank                       Private Systems Division
dougb@ecs.comm.mot.com          Motorola Communications Sector
dougb@nwu.edu                   Schaumburg, Illinois
dougb@casbah.acns.nwu.edu       708-576-8207
sports_dataset.target_names[sports_dataset['target'][0]]
'rec.sport.baseball'
len_all, len_baseball, len_hockey = len(sports_dataset.data), len([e for e in sports_dataset.target if e == 0]), len([e for e in sports_dataset.target if e == 1])
print(f"Total examples: {len_all}, Baseball examples: {len_baseball}, Hockey examples: {len_hockey}")Total examples: 1197, Baseball examples: 597, Hockey examples: 600
One sample from the baseball category can be seen above. It is an email sent to a mailing list. We can observe that we have 1197 examples in total, which are evenly split between the two sports.
Data Preparation
We transform the dataset into a pandas dataframe with a column for prompt and completion. The prompt contains the email from the mailing list, and the completion is the name of the sport, either hockey or baseball. Purely for demonstration purposes and fine-tuning speed you could take only 300 examples (note the commented-out slice in the code below); here we keep the full dataset. In a real use case, the more examples the better the performance.
import pandas as pd
labels = [sports_dataset.target_names[x].split('.')[-1] for x in sports_dataset['target']]
texts = [text.strip() for text in sports_dataset['data']]
df = pd.DataFrame(zip(texts, labels), columns = ['prompt','completion']) #[:300]
df.head()

|   | prompt | completion |
|---|---|---|
| 0 | From: dougb@comm.mot.com (Doug Bank)\nSubject:... | baseball |
| 1 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | hockey |
| 2 | From: rudy@netcom.com (Rudy Wade)\nSubject: Re... | baseball |
| 3 | From: monack@helium.gas.uug.arizona.edu (david... | hockey |
| 4 | Subject: Let it be Known\nFrom:... | baseball |
Both baseball and hockey are single tokens.
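If you want to check this yourself, a minimal sketch using the tiktoken package (assuming babbage-002 uses the cl100k_base encoding, which is our assumption here) prints the token count for each label:

```python
import tiktoken

# Assumption: babbage-002 uses the cl100k_base encoding; adjust if your model differs.
enc = tiktoken.get_encoding("cl100k_base")

# Count the tokens for each label, with and without the leading space that the
# data preparation tool adds to completions later on.
for label in ["baseball", "hockey", " baseball", " hockey"]:
    print(repr(label), "->", len(enc.encode(label)), "token(s)")
```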
We save the dataset as a jsonl file.

df.to_json("sport2.jsonl", orient='records', lines=True)

Data Preparation tool
We can now use a data preparation tool which will suggest a few improvements to our dataset before fine-tuning. Before launching the tool we update the openai library to ensure we're using the latest data preparation tool. We additionally specify -q which auto-accepts all suggestions.
!openai tools fine_tunes.prepare_data -f sport2.jsonl -q

Analyzing...

- Your file contains 1197 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 11 examples that are very long. These are rows: [134, 200, 281, 320, 404, 595, 704, 838, 1113, 1139, 1174]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove 11 long examples [Y/n]: Y
- [Recommended] Add a suffix separator `\n\n###\n\n` to all prompts [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y

Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `sport2_prepared_train (1).jsonl` and `sport2_prepared_valid (1).jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "sport2_prepared_train (1).jsonl" -v "sport2_prepared_valid (1).jsonl" --compute_classification_metrics --classification_positive_class " baseball"

After you've fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt.

Once your model starts training, it'll approximately take 30.8 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
The tool suggests a few improvements to the dataset and splits it into a training and a validation set.
A suffix between the prompt and the completion is necessary to tell the model that the input text has ended and that it now needs to predict the class. Since we use the same separator in every example, the model learns to predict either baseball or hockey following the separator.

A whitespace prefix in the completion is useful because most word tokens are tokenized with a leading space.

The tool also recognized that this is likely a classification task, so it suggested splitting the dataset into training and validation sets. This will allow us to easily measure expected performance on new data.
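To see these transformations concretely, a small sketch (assuming the prepared training file is named sport2_prepared_train.jsonl, as in the fine-tuning step below) inspects the first prepared example:

```python
import json

# Peek at the first prepared training example: the prompt should now end with the
# '\n\n###\n\n' separator and the completion should start with a whitespace character.
# (File name assumed to match the one used in the fine-tuning step below.)
with open("sport2_prepared_train.jsonl") as f:
    first = json.loads(f.readline())

print(repr(first["prompt"][-20:]))   # expected to end with '\n\n###\n\n'
print(repr(first["completion"]))     # expected to start with a space, e.g. ' hockey'
```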
Fine-tuning
The tool suggests we run the following command to train the dataset. Since this is a classification task, we would like to know what the generalization performance on the provided validation set is for our classification use case.
The suggested command comes from the legacy CLI. Here we instead create the fine-tuning job through the current API and fine-tune babbage-002, a cheaper and faster model whose performance on classification use cases is usually comparable to that of slower and more expensive models.
train_file = client.files.create(file=open("sport2_prepared_train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(file=open("sport2_prepared_valid.jsonl", "rb"), purpose="fine-tune")
fine_tuning_job = client.fine_tuning.jobs.create(training_file=train_file.id, validation_file=valid_file.id, model="babbage-002")
print(fine_tuning_job)

FineTuningJob(id='ftjob-REo0uLpriEAm08CBRNDlPJZC', created_at=1704413736, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='babbage-002', object='fine_tuning.job', organization_id='org-9HXYFy8ux4r6aboFyec2OLRf', result_files=[], status='validating_files', trained_tokens=None, training_file='file-82XooA2AUDBAUbN5z2DuKRMs', validation_file='file-wTOcQF8vxQ0Z6fNY2GSm0z4P')
The model successfully trains in about ten minutes. You can watch the fine-tuning job progress at https://platform.openai.com/finetune/
You can also check on its status programmatically:
fine_tune_results = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
print(fine_tune_results.finished_at)

1704414393
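If you prefer to have a script block until the job completes, a simple polling sketch such as the following works (the 60-second interval is arbitrary):

```python
import time

# Poll the job until it reaches a terminal state.
while True:
    job = client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
    print(job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)
```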
[Advanced] Results and expected model performance
We can now download the results file to observe the expected performance on the held-out validation set.
fine_tune_results = client.fine_tuning.jobs.retrieve(fine_tuning_job.id).result_files
result_file = client.files.retrieve(fine_tune_results[0])
content = client.files.content(result_file.id)
# save content to file
with open("result.csv", "wb") as f:
f.write(content.text.encode("utf-8"))results = pd.read_csv('result.csv')
results[results['train_accuracy'].notnull()].tail(1)

|      | step | train_loss | train_accuracy | valid_loss | valid_mean_token_accuracy |
|---|---|---|---|---|---|
| 2843 | 2844 | 0.0 | 1.0 | NaN | NaN |
The accuracy reaches 99.6%. On the plot below we can see how the training accuracy improves during the training run.
results[results['train_accuracy'].notnull()]['train_accuracy'].plot()

Using the model
We can now call the model to get predictions.
test = pd.read_json('sport2_prepared_valid.jsonl', lines=True)
test.head()

|   | prompt | completion |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | hockey |
| 1 | From: smorris@venus.lerc.nasa.gov (Ron Morris ... | hockey |
| 2 | From: golchowy@alchemy.chem.utoronto.ca (Geral... | hockey |
| 3 | From: krattige@hpcc01.corp.hp.com (Kim Krattig... | baseball |
| 4 | From: warped@cs.montana.edu (Doug Dolven)\nSubj... | baseball |
We need to use the same separator following the prompt that we used during fine-tuning, in this case \n\n###\n\n. Since we care about classification, we want the temperature to be as low as possible, and we only need one token of the completion to determine the prediction of the model.
# retrieve the fine-tuned model name from the completed job
ft_model = client.fine_tuning.jobs.retrieve(fine_tuning_job.id).fine_tuned_model
# note that this calls the legacy completions api - https://platform.openai.com/docs/api-reference/completions
res = client.completions.create(model=ft_model, prompt=test['prompt'][0] + '\n\n###\n\n', max_tokens=1, temperature=0)
res.choices[0].text
' hockey'
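If we want an accuracy estimate over the whole held-out set directly from the model, a sketch like the following mirrors the single-example call above for every row of the test dataframe (it issues one completion request per row, so it takes a few minutes and incurs a small cost):

```python
# Sketch: score every example in the held-out validation set
# (assumes `ft_model` and `test` from the cells above).
correct = 0
for _, row in test.iterrows():
    resp = client.completions.create(
        model=ft_model,
        prompt=row["prompt"] + "\n\n###\n\n",
        max_tokens=1,
        temperature=0,
    )
    # compare the predicted label to the stored completion, ignoring whitespace
    if resp.choices[0].text.strip() == row["completion"].strip():
        correct += 1

print(f"Validation accuracy: {correct / len(test):.3f}")
```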
To get the log probabilities, we can specify the logprobs parameter on the completion request.
res = client.completions.create(model=ft_model, prompt=test['prompt'][0] + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res.choices[0].logprobs.top_logprobs

[{' hockey': 0.0, ' Hockey': -22.504879}]

We can see that the model predicts hockey as far more likely than baseball, which is the correct prediction. By requesting log_probs, we can see the prediction (log) probability for each class.
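Since these are log probabilities, a quick sketch converts them to regular probabilities for easier reading:

```python
import math

# Convert the returned log probabilities into regular probabilities.
top_logprobs = res.choices[0].logprobs.top_logprobs[0]  # dict for the single generated token
probs = {token: math.exp(logprob) for token, logprob in top_logprobs.items()}
print(probs)  # e.g. {' hockey': ~1.0, ' Hockey': ~1.7e-10}
```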
Generalization
Interestingly, our fine-tuned classifier is quite versatile. Despite being trained on emails to different mailing lists, it also successfully predicts tweets.
sample_hockey_tweet = """Thank you to the
@Canes
and all you amazing Caniacs that have been so supportive! You guys are some of the best fans in the NHL without a doubt! Really excited to start this new chapter in my career with the
@DetroitRedWings
!!"""
res = client.completions.create(model=ft_model, prompt=sample_hockey_tweet + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res.choices[0].text

' hockey'
sample_baseball_tweet="""BREAKING: The Tampa Bay Rays are finalizing a deal to acquire slugger Nelson Cruz from the Minnesota Twins, sources tell ESPN."""
res = client.completions.create(model=ft_model, prompt=sample_baseball_tweet + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res.choices[0].text