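Fine-Tuning OpenAI Models: A Guide with Wikipedia Data

Guide, October 29, 2024

Fine-tuning large language models (LLMs) is a powerful way to tailor AI to your specific needs. With OpenAI's new fine-tuning features, you can adapt their models using domain-specific data, instructions, or even custom formats. This means you can get more accurate, relevant responses without constantly tweaking prompts, which ultimately lowers costs and improves efficiency.

In this guide, we'll walk you through the full fine-tuning workflow on the OpenAI platform, from data preparation and cost estimation to deploying the fine-tuned model. If you've read our earlier article on fine-tuning Llama 3 models, this one follows a similar approach but focuses on OpenAI's tooling and works through a practical example using recent Wikipedia data.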

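Fine-Tuning and When to Use It

Fine-tuning a large language model (LLM) means continuing to train it on new, specific data so that you can shape its responses for a particular task or domain. This approach lets you build on the broad knowledge the model already has while teaching it to handle more specialized content.

Take OpenAI's GPT-4o model as an example: it is highly versatile and suited to a wide range of tasks. But like any general-purpose tool, it has limitations. For instance, its training data ends in September 2023, so it has no knowledge of anything after that. You may also find that the model needs long, complex prompts to meet your needs precisely, which drives up usage costs through token consumption.

Fine-tuning can help overcome these challenges, but it isn't always the right choice. Fine-tuning is a good fit when:

You have a dataset of specific prompts and desired outputs.
You need the model to follow a particular format consistently, whether that's a structured report or an API call.
You want to shorten your prompts and reduce costs.
You have domain-specific data to incorporate and want to avoid an expensive retrieval-augmented generation (RAG) workflow.

In this guide, we'll show you how to fine-tune a model using recent updates from Wikipedia. If you want to dig into the full code, you can find the notebook in our examples repository.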

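Data Curation and Preparation

To demonstrate the fine-tuning process, we'll use a real-world dataset: the latest hurricane updates on Wikipedia. This dataset is a good fit for two reasons. First, because the model's knowledge ends in September 2023, this time-sensitive information is something the model has never seen. Second, it gives us a chance to fine-tune the model's handling of new information on a very specific and important topic (hurricanes). Our goal is to ensure the model can generate accurate, up-to-date responses based on the latest data. Below, we walk through the data preparation process step by step.

Collect the data: Fetch the latest revisions from selected Wikipedia pages.
Generate Q&A pairs: Turn the raw hurricane data into a practical set of question-answer pairs.
Create the fine-tuning dataset: Format the dataset to meet OpenAI's fine-tuning requirements.

Collecting the Data

We start by defining a list of Wikipedia pages that contain hurricane-related information. This can include articles on specific storms, such as Hurricane Milton, or broader topics, such as the 2024 Atlantic hurricane season. By fetching the latest revisions of these pages, we ensure the data we feed the model is current and was not part of its original training set.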

# List of relevant topics
topics = [
   "List_of_United_States_hurricanes",
   "2024_Atlantic_hurricane_season",
   "Hurricane_Milton",
   "Hurricane_Beryl",
   "Hurricane_Francine",
   "Hurricane_Helene",
   "Hurricane_Isaac"
]

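A key part of this process is choosing a date range. Specifically, we focus on updates made after the model's September 2023 knowledge cutoff. This ensures the model doesn't already "know" the data we use for fine-tuning. Once we've collected all the necessary updates, we move on to creating the training set.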

def get_wikipedia_revisions(article_title, start_date):
   ...


def fetch_revisions_for_topics(topics, start_date):
   """Fetches revisions for all topics after a certain date and returns a combined dataset."""
   full_dataset = []  # List to hold data for all topics
   for topic in topics:
       try:
           print(f"Fetching revisions for {topic} starting from {start_date}...")
           topic_data = get_wikipedia_revisions(topic, start_date)
           full_dataset.extend(topic_data)  # Append the data for each topic to the full dataset
       except Exception as e:
           print(f"Error fetching revisions for {topic}: {str(e)}")
  
   return full_dataset  # Return the full dataset




# Specify the start date (ISO 8601 format)
start_date = "2023-09-01T00:00:00Z"


# Fetch the latest revisions for all topics and store them in a dataset
dataset = fetch_revisions_for_topics(topics, start_date)

If you're interested in the detailed code for fetching Wikipedia revisions, see the notebook linked here.
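For readers who want a feel for what a revision fetcher involves before opening the notebook, here is a minimal sketch. It is not the notebook's `get_wikipedia_revisions`; the helper names `build_revision_query` and `parse_revisions` and the record field names are our own assumptions for illustration. It builds query parameters for the MediaWiki revisions API and flattens a response of the default JSON shape into simple records.

```python
from typing import Any, Dict, List

# Assumed endpoint: the English Wikipedia MediaWiki API
WIKIPEDIA_API_URL = "https://en.wikipedia.org/w/api.php"


def build_revision_query(article_title: str, start_date: str, limit: int = 50) -> Dict[str, Any]:
    """Build MediaWiki query parameters for revisions made on or after start_date."""
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": article_title,
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",
        "rvstart": start_date,  # with rvdir=newer, enumeration starts at this timestamp
        "rvdir": "newer",       # list revisions oldest first, moving forward in time
        "rvlimit": limit,
    }


def parse_revisions(response_json: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten a MediaWiki revisions response into a list of simple records."""
    records = []
    pages = response_json.get("query", {}).get("pages", {})
    for page in pages.values():
        for revision in page.get("revisions", []):
            records.append({
                "article_title": page.get("title", ""),
                "timestamp": revision.get("timestamp", ""),
                # In the default JSON format, slot text is stored under the "*" key
                "content": revision.get("slots", {}).get("main", {}).get("*", ""),
            })
    return records
```

In practice the parameters would be sent with an HTTP client, for example `requests.get(WIKIPEDIA_API_URL, params=build_revision_query(topic, start_date)).json()`, and the result passed to `parse_revisions`. Handling of continuation tokens for long revision histories is left out of this sketch.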

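Generating Q&A Pairs

With the new data collected, the next step is to format it in a way the model can learn from effectively. One very effective approach is to generate question-answer (Q&A) pairs. Why? Because this format closely mirrors how the model is used in practice, whether that's answering customer queries, handling FAQs, or providing information in response to a prompt.

For example, given the new hurricane data, we can generate a question like "When did Hurricane Milton hit the US?" and then provide the corresponding answer based on the Wikipedia updates. Creating these Q&A pairs by hand would be time-consuming and error-prone, but we can use OpenAI's existing capabilities to generate them automatically, directly from the Wikipedia revisions.

One particularly useful feature is Structured Outputs, which ensures the generated answers follow a consistent format. This matters because it makes the data much easier to process in the next stage of the pipeline. If you'd like to learn more about using structured outputs effectively, we've written a detailed blog post, which you can find here.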

from openai import OpenAI
from pydantic import BaseModel
from typing import List, Literal
import json


# Define the Pydantic model for the output format
class QAItem(BaseModel):
   prompt: str
   completion: str


class QADataset(BaseModel):
   dataset: List[QAItem]




def generate_qa_pairs_from_changes(new_content, article_title):
   """
   Query OpenAI to analyze the new content and generate a set of question-answer pairs.
   If substantial information changes are detected (such as new sections, significant updates, or meaningful additions of facts),
   the function returns a list of question-answer pairs in the specified JSON format.
   """


   client = OpenAI()


   # Create a query prompt to ask OpenAI to generate question-answer pairs based on the content
   prompt = f"""
   The following is newly added content to the Wikipedia article titled '{article_title}'.
   Analyze the content and generate a set of specific question-answer pairs based on the new facts, updates, or meaningful changes.
   Focus on creating general questions that a person might ask and answer them comprehensively with the content provided.
   Do not ask questions that directly reference the date of the revision or the specific article title.
   If a hurricane is mentioned, it should be referred to by its full name.
   Ignore trivial changes such as typos or formatting.


   Example questions:
   - List the hurricanes that hit the US in 2024.
   - What was the most recent hurricane to hit the US?
   - What was the name of the hurricane that hit Florida in 2024?
   - What was the category of hurricane Beryl?
   - What was the path of hurricane Milton?


   New Content:
   \"\"\"{new_content[:3000]}\"\"\"


   Please return a set of question-answer pairs in the form of a JSON array where each item is an object containing
   'prompt' as the question and 'completion' as the direct answer from the content.
   """


   try:
       response = client.beta.chat.completions.parse(
           model="gpt-4o-2024-08-06",  # or "gpt-4o" if available
           messages=[
               {"role": "system", "content": "You are a helpful assistant that generates a set of specific question-answer pairs based on the new facts, updates, or meaningful changes in Wikipedia articles."},
               {"role": "user", "content": prompt}
           ],
           max_tokens=1000,
           temperature=0.7,
           response_format=QADataset
       )


       return response.choices[0].message.content


   except Exception as e:
       return f"Error in generating QA pairs: {str(e)}"
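The function above returns either a JSON string or, on failure, a plain error string, while the later steps expect a `master_qa_list` whose items also carry `article_title` and `new_content`. A minimal sketch of how the two might be bridged is shown below; `collect_qa_pairs` is a hypothetical helper of ours (the notebook may assemble the list differently), and it uses only the standard library.

```python
import json
from typing import Dict, List


def collect_qa_pairs(raw_response: str, article_title: str, new_content: str) -> List[Dict[str, str]]:
    """Parse the generator's JSON string and attach the source context to each pair."""
    try:
        parsed = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        # The generator returns a plain error message on failure, which is not valid JSON
        return []
    if not isinstance(parsed, dict):
        return []

    pairs = []
    for item in parsed.get("dataset") or []:
        if not isinstance(item, dict):
            continue
        prompt = str(item.get("prompt", "")).strip()
        completion = str(item.get("completion", "")).strip()
        # Skip pairs with an empty question or answer
        if prompt and completion:
            pairs.append({
                "prompt": prompt,
                "completion": completion,
                "article_title": article_title,
                "new_content": new_content,
            })
    return pairs
```

Calling `master_qa_list.extend(collect_qa_pairs(raw, title, content))` once per revision would then build up the combined list while silently dropping failed generations.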

Once we've assembled our Q&A dataset, we can look at a few of the generated examples.

Number of examples in master_qa_list:  10362
{'prompt': 'When were shelters opened on the Caribbean islands due to Hurricane Beryl?', 'completion': 'Shelters were opened on June 29 on the Caribbean islands in response to Hurricane Beryl.'}
{'prompt': 'What measures were taken in Tobago in response to Hurricane Beryl?', 'completion': 'A state of emergency was declared for Tobago. Ferry schedules were modified on June 30, and all ferries to Tobago for July 1 were cancelled. Schools across the nation were also closed for July 1. Additionally, 145 people were sheltered in 14 shelters across Tobago.'}
{'prompt': 'How did Hurricane Beryl affect ferry operations in Trinidad and Tobago?', 'completion': 'Ferry schedules were modified on June 30, and all ferries to Tobago for July 1 were cancelled due to Hurricane Beryl.'}
{'prompt': 'How many people sought shelter in Tobago during Hurricane Beryl?', 'completion': '145 people sought shelter in 14 shelters across Tobago during Hurricane Beryl.'}

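After building a substantial Q&A dataset, it's always a good idea to review the results carefully. At this point, we may want human reviewers to verify the accuracy of the Q&A pairs. We'll create a project in Label Studio to annotate this dataset.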

from label_studio_sdk.client import LabelStudio


# Connect to the Label Studio API and check the connection
ls = LabelStudio(base_url=LABEL_STUDIO_URL, api_key=API_KEY)


label_config = """
    Details in Notebook...
"""


# Create a new project
project = ls.projects.create(
   title='Hurricane Data Project',
   description='Label questions and completions about hurricanes with their respective contexts and titles.',
   label_config=label_config
)


from label_studio_sdk.label_interface.objects import PredictionValue


for qa_pair in master_qa_list:
   # Create task data
   task_data = {
       "data": {
           "question": qa_pair['prompt'],
           "article_title": qa_pair['article_title'],
           "context": qa_pair['new_content']
       }
   }


   # Create the task in Label Studio
   task = ls.tasks.create(project=project.id, **task_data)
   task_id = task.id
   print(f"Task created with ID: {task_id}")


   # Create prediction data
   prediction = PredictionValue(
       model_version="v1",
       result=[
           {
               "from_name": "completion",
               "to_name": "question",
               "type": "textarea",
               "value": {
                   "text": [qa_pair['completion']]
               }
           }
       ]
   )


   # Insert prediction into the task
   ls.predictions.create(task=task_id, **prediction.model_dump())
   print(f"Prediction added for task ID: {task_id}")

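With our data inserted into the project, we can view the dataset with the generated predictions, as shown in Figure 1.

For full details on creating and setting up the project with the Label Studio SDK, see the example notebook.

Creating the Fine-Tuning Dataset

With our Q&A dataset ready, the final step is to format the data for OpenAI's fine-tuning platform. OpenAI requires a specific format, the JSONL chat format, in which each interaction contains a system message, a user prompt, and the assistant's corresponding reply. This ensures the fine-tuned model learns to respond in a way that fits your use case.

For example, here is a simplified example from OpenAI's fine-tuning documentation showing how to train a model to respond with a slightly sarcastic tone: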

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

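In our case, we need to restructure our Q&A data to match this format, replacing the sarcastic chatbot context with a neutral, helpful tone better suited to answering hurricane-related questions. Once that's done, the dataset is ready for fine-tuning.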

# System message for all entries
system_message = {"role": "system", "content": "You are a helpful assistant that answers questions about current events in hurricanes. Provide detailed answers."}


# List to store the converted dataset
new_format_dataset = []


# Convert each prompt-completion pair to the new format
for entry in master_qa_list:
   new_entry = {
       "messages": [
           system_message,
           {"role": "user", "content": entry['prompt']},
           {"role": "assistant", "content": entry['completion']}
       ]
   }
   new_format_dataset.append(new_entry)
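Before uploading, it can be worth confirming that every converted entry really has the system, user, assistant shape described above. The following is a minimal sketch of such a check; `find_invalid_entries` and `EXPECTED_ROLES` are hypothetical names of ours, and the check is deliberately stricter than OpenAI's own validation because it assumes the single-turn layout used in this guide.

```python
from typing import Any, Dict, List

# The single-turn layout used in this guide (an assumption of this sketch)
EXPECTED_ROLES = ["system", "user", "assistant"]


def find_invalid_entries(dataset: List[Dict[str, Any]]) -> List[int]:
    """Return the indexes of entries that do not match the expected chat layout."""
    invalid = []
    for index, entry in enumerate(dataset):
        messages = entry.get("messages") if isinstance(entry, dict) else None
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            invalid.append(index)
            continue
        roles = [m.get("role") for m in messages]
        # Every message needs non-empty string content
        has_text = all(isinstance(m.get("content"), str) and m["content"].strip() for m in messages)
        if roles != EXPECTED_ROLES or not has_text:
            invalid.append(index)
    return invalid
```

Running `find_invalid_entries(new_format_dataset)` should return an empty list; any indexes it does return point at entries to fix or drop before writing the file.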

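Fine-Tuning

Now that our dataset is ready, the next step is to upload it to the OpenAI platform for fine-tuning. This involves writing the data to a file and then creating a fine-tuning job that trains the model on the formatted dataset.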
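Writing the data to a file means producing one JSON object per line (JSONL). A minimal sketch, assuming the converted entries live in a list such as `new_format_dataset`; the helper name `write_jsonl` is our own:

```python
import json
from typing import Any, Dict, Iterable


def write_jsonl(records: Iterable[Dict[str, Any]], path: str) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text (place names, accents) readable
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `write_jsonl(new_format_dataset, "qa_pairs_openai_wiki_hurricane_dataset_format.jsonl")` would produce the file that is uploaded in the next step.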

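Uploading the Dataset

The first thing we need to do is upload the dataset to OpenAI. This is straightforward with OpenAI's API: we point to the file containing our Q&A pairs in the correct format (`jsonl`) and state the file's purpose, which is fine-tuning.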

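from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL training file; the returned object carries the file ID
client.files.create(
 file=open("qa_pairs_openai_wiki_hurricane_dataset_format.jsonl", "rb"),
 purpose="fine-tune"
)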

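Once the file is uploaded, you'll receive a response containing the file's metadata. One key piece of information is the file ID, which we'll need to reference when starting the actual fine-tuning job. You should see output similar to the following: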

FileObject(id='file-qBVnzhGrEHvEPvZg6ZwvTfpK', bytes=4301148, created_at=1728916096, filename='qa_pairs_openai_wiki_hurricane_dataset_format.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

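This ID (`file-qBVnzhGrEHvEPvZg6ZwvTfpK` in this example) is the identifier we'll use when creating the fine-tuning job.

Starting the Fine-Tuning Job

With the file uploaded, we can start the fine-tuning job. This step tells OpenAI which model you want to fine-tune and which dataset to use; you can optionally provide a custom suffix to make the model easy to identify later.

In this example, we're fine-tuning the `gpt-4o-mini-2024-07-18` model on the hurricane data we just uploaded: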

client.fine_tuning.jobs.create(
 training_file="file-qBVnzhGrEHvEPvZg6ZwvTfpK",
 model="gpt-4o-mini-2024-07-18",
 suffix="wiki-hurricane-2024"
)

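Monitoring the Fine-Tuning Process

After creating the job, you can track its progress through the OpenAI dashboard or the API. Figure 1 shows what we'll see on the fine-tuning page. We can monitor the job's status there, or wait for an email telling us the job has finished or run into an error.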
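Tracking a job through the API usually comes down to polling its status until it reaches a terminal state. The sketch below shows one way to do that; `wait_for_job` is a hypothetical helper of ours, and it takes the retrieval function as an argument so the loop can be exercised without network access. The default polling interval and retry cap are arbitrary assumptions.

```python
import time
from typing import Any, Callable

# Statuses after which a fine-tuning job will not change any further
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(
    retrieve: Callable[[str], Any],
    job_id: str,
    poll_seconds: float = 60.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it reaches a terminal status and return that status."""
    for attempt in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job.status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

With the OpenAI client, the call would look like `wait_for_job(client.fine_tuning.jobs.retrieve, job.id)`, where `job` is the object returned by `client.fine_tuning.jobs.create`.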

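In this example, we did not include a validation dataset. A validation set can be used to check that the model isn't overfitting to the training data and that what it learns generalizes. An overfit model might, for example, only answer questions phrased exactly as they appeared in training and fail on similar ones.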
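If you do want a validation set, one simple option is to hold out a random slice of the examples. The sketch below is an illustration under our own assumptions: `split_dataset` is a hypothetical helper, and the 10% fraction and fixed seed are arbitrary choices rather than recommendations from OpenAI.

```python
import random
from typing import Any, List, Sequence, Tuple


def split_dataset(
    records: Sequence[Any],
    validation_fraction: float = 0.1,
    seed: int = 42,
) -> Tuple[List[Any], List[Any]]:
    """Shuffle the records reproducibly and return (training, validation) lists."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one example whenever there is more than one record
    n_validation = max(1, int(len(shuffled) * validation_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_validation:], shuffled[:n_validation]
```

Each part would be written to its own JSONL file and uploaded separately; the second file's ID can then be passed as the `validation_file` argument of `client.fine_tuning.jobs.create`. Because several generated questions can cover the same fact, a purely random split may place near-duplicates on both sides, so splitting by article or revision may give a more honest signal.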

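Fine-Tuning Time and Cost

Keep in mind that fine-tuning a model isn't instantaneous: depending on the size of the model and the complexity of the dataset, it can take some time. Larger models like GPT-4o naturally need more processing power and time than smaller ones. Likewise, a dataset with thousands of detailed entries will take longer to fine-tune on than a smaller, more focused set of examples.

Beyond time, fine-tuning also incurs costs, so it's important to budget accordingly. OpenAI charges based on factors such as the number of tokens used during fine-tuning and the type of model being trained. To help manage these costs, OpenAI provides tools for tracking token usage and spend throughout the process, so you can optimize your fine-tuning workflow and avoid unnecessary expense. We've included some of these tools in the example notebook; they produce statistics like the following: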

Dataset has ~736249 tokens that will be charged for during training
By default, you'll train for 2 epochs on this dataset
By default, you'll be charged for ~1472498 tokens

For exact prices and more details, see OpenAI's pricing page.
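The arithmetic behind these statistics is simple: billed tokens are the dataset's token count multiplied by the number of epochs, and the training cost is that total multiplied by the per-token price. A minimal sketch follows; `estimate_training_cost` is our own helper, and the price used here is a placeholder assumption that should be replaced with the current figure from the pricing page.

```python
from typing import Tuple


def estimate_training_cost(
    dataset_tokens: int,
    epochs: int,
    usd_per_million_tokens: float,
) -> Tuple[int, float]:
    """Return (billed tokens, estimated training cost in USD)."""
    billed_tokens = dataset_tokens * epochs
    cost = billed_tokens / 1_000_000 * usd_per_million_tokens
    return billed_tokens, cost


# Placeholder price, NOT an official figure: substitute the current rate for your model
PLACEHOLDER_USD_PER_MILLION = 3.00

billed, cost = estimate_training_cost(736_249, 2, PLACEHOLDER_USD_PER_MILLION)
print(f"{billed} billed tokens, estimated cost ${cost:.2f}")
```

With the numbers reported above (about 736,249 tokens over 2 epochs), the billed total comes to 1,472,498 tokens, which matches the notebook output.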

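Using the Fine-Tuned Model

Once the fine-tuning process is complete, you can start using your newly customized model just like any other OpenAI model. The key advantage is that the model is now tailored to your specific use case, which in our example is answering questions about recent hurricanes based on the latest Wikipedia updates.

To interact with the fine-tuned model, we use OpenAI's API and specify the unique identifier of the model we just trained. In this example, the fine-tuned model is named `ft:gpt-4o-mini-2024-07-18:personal:wiki-hurricane-2024:AIGr7s2N`. Here is a simple example showing how to use the fine-tuned model to get information about Hurricane Milton: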

from openai import OpenAI
client = OpenAI()

# A plain chat completion is enough here; no structured output is requested
completion = client.chat.completions.create(
 model="ft:gpt-4o-mini-2024-07-18:personal:wiki-hurricane-2024:AIGr7s2N",
 messages=[
   {"role": "system", "content": "You are a helpful assistant that answers questions about current events in hurricanes. Provide detailed answers."},
   {"role": "user", "content": "When did Hurricane Milton hit the US?"}
 ]
)
# Print only the text of the assistant's reply
print(completion.choices[0].message.content)

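In this example, the model draws on the hurricane data it was fine-tuned on to answer a specific question about Hurricane Milton. The result should reflect the recent information we included during fine-tuning: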

The most recent hurricane to hit the US in 2024 is Hurricane Milton, which struck Florida on October 9.

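This response shows how the model now "knows" recent information that it couldn't access before fine-tuning. You can now use this model for similar queries, or integrate it into applications, such as chatbots, that need up-to-date knowledge of current developments in your specific domain.

Keep in mind that although the model is now adapted to your needs, testing it with a variety of inputs is still essential. This helps ensure the model consistently produces the expected output and performs well across different scenarios. It is usually an iterative process of refining the dataset and repeating the fine-tuning until you get the results you want.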
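One lightweight way to test with a variety of inputs is a keyword spot check: ask a set of questions and flag any answer that is missing facts you expect to see. The sketch below is a rough illustration, not a full evaluation method; `run_spot_checks` is a hypothetical helper of ours, and it takes the question-answering function as an argument so it works with any model wrapper.

```python
from typing import Callable, Dict, List


def run_spot_checks(ask: Callable[[str], str], cases: Dict[str, List[str]]) -> List[str]:
    """Return the questions whose answers are missing any expected keyword."""
    failures = []
    for question, keywords in cases.items():
        answer = ask(question).lower()
        # Case-insensitive substring match against every expected keyword
        if not all(keyword.lower() in answer for keyword in keywords):
            failures.append(question)
    return failures
```

Here `ask` would wrap the `client.chat.completions.create` call shown earlier and return the reply text, and a case such as `{"When did Hurricane Milton hit the US?": ["October 9"]}` checks the answer against the date given in the response above. Keyword matching is crude, so failures are best treated as prompts for human review.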

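If you'd like to integrate this fine-tuned model into Label Studio, you can do so easily through the Label Studio ML Backend, specifically the LLM Interactive Example. Simply set the `OPENAI_MODEL` parameter to the name of your fine-tuned model to have it generate predictions from new prompts. This can streamline your labeling workflow by letting the model help regenerate predictions or support other annotation projects.

Conclusion

Fine-tuning is a powerful tool for customizing AI models so that they perform more accurately in specific scenarios with minimal manual intervention. The OpenAI platform simplifies the fine-tuning workflow and reduces configuration complexity, letting you focus on curating high-quality data. Whether you're improving a chatbot or generating custom content, fine-tuning can significantly improve how a model performs in real-world settings.

If you're interested in integrating large language models (LLMs) into your generative AI workflows, be sure to check out Prompts, an Enterprise feature. Prompts offers a flexible way to bring these advanced AI capabilities into your projects, streamlining tasks such as labeling and decision-making across your organization.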