2025年5月23日

探索用于强化微调的模型评分器

本指南面向已经熟悉OpenAI API、对强化微调(RFT)有基本了解,并希望将微调模型用于研究或其他适当用途的开发者和机器学习从业者。OpenAI的服务不适用于任何医疗状况的个性化治疗或诊断,并受我们适用条款的约束。

强化微调(RFT)的核心,是在推理模型的基础上运行强化学习:通过探索解决方案空间并强化能带来更高回报的策略,来提升模型的推理性能。RFT帮助模型做出更精准的决策,并更有效地理解上下文。

在本指南中,我们将逐步介绍如何将RFT应用于OpenAI o4-mini推理模型,使用生命科学研究领域的一个任务示例:根据医患对话记录和描述预测结果,这是许多健康研究中必需的评估环节。我们将使用medical-o1-verifiable-problem 数据集的子集。您将学习为您的用例成功运行RFT任务的关键步骤。

我们将介绍以下内容:

  1. 设置
  2. 收集数据集
  3. 基准测试基础模型
  4. 定义您的评分器
  5. 训练

1. 设置

即便是强大的推理模型,在需要专家级行为的领域——尤其是像医学这样讲究细微差别和精确性的领域——也可能出现偏差。想象一个模型试图从转录文本中提取ICD-10代码的情形:即使它理解大意,也可能无法使用医疗专业人员期望的精确术语。

RFT的其他优秀应用场景包括账本标准化或欺诈风险分级等主题——在这些场景中,您需要精确、可靠且可重复的推理。查看我们的RFT用例指南获取精彩案例。

在我们的案例中,我们将重点教导o4-mini更好地预测临床对话和描述的结局。具体来说,我们想看看RFT是否能提高预测的准确性。

在此过程中,我们将讨论如何编写有效的评分器(grader),它们如何指导模型学习,以及如何防范经典的奖励破解陷阱。


2. 收集数据集

首先,我们从Hugging Face加载数据集。我们关注的是那些以患者病例描述及相关问题为框架的样本,随后附有正确答案。这些样本代表了现实世界中的医疗记录,医生在其中总结病例并给出诊断结果。对于任何应用场景,验证标准答案的准确性都至关重要,需要仔细考量。在此,我们将信任该数据集的质量。

import re
from datasets import load_dataset
ds = load_dataset("FreedomIntelligence/medical-o1-verifiable-problem")

def is_age_question(sample):
    question = sample.get('Open-ended Verifiable Question', '')
    # Match "A 88-year-old", "An 8-year-old", "A 23-year-old", etc. at the start
    return re.match(r"^(A|An) \d{1,2}-year-old", question) is not None

filtered_samples = [s for s in ds["train"] if is_age_question(s)]
print(f"Filtered samples: {len(filtered_samples)}")
Filtered samples: 9169
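
在继续之前,可以快速查看一个过滤后的样本,确认后续代码依赖的两个字段('Open-ended Verifiable Question' 与 'Ground-True Answer')确实存在:

# Quick sanity check: inspect the fields used throughout the rest of the guide
example = filtered_samples[0]
print(example["Open-ended Verifiable Question"][:200])
print("Ground-True Answer:", example["Ground-True Answer"])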

RFT的优势之一在于它不需要成千上万的样本就能开始产生效果。得益于训练过程中的轨迹采样和反馈循环,该模型不仅能学习正确行为,还能识别需要避免的模式。这意味着即使使用小型数据集,我们也能看到显著的提升。

对于本次运行,我们将随机抽取100个训练样本和100个测试样本,并对它们进行轻微归一化处理。

import random

# Set a random seed for reproducibility
random.seed(42)

# Randomly select 100 training samples from filtered_samples
train_samples = random.sample(filtered_samples, min(100, len(filtered_samples)))

# Remove training samples from filtered_samples to avoid overlap
remaining_samples = [s for s in filtered_samples if s not in train_samples]

# Randomly select 100 test samples from the remaining samples (no overlap)
test_samples = random.sample(remaining_samples, min(100, len(remaining_samples)))

print(f"Number of training samples: {len(train_samples)}")
print(f"Number of test samples: {len(test_samples)}")
Number of training samples: 100
Number of test samples: 100
# Standardize the 'Ground-True Answer' fields to all lowercase in train and test samples
for sample in train_samples:
    if 'Ground-True Answer' in sample and isinstance(sample['Ground-True Answer'], str):
        sample['Ground-True Answer'] = sample['Ground-True Answer'].lower()

for sample in test_samples:
    if 'Ground-True Answer' in sample and isinstance(sample['Ground-True Answer'], str):
        sample['Ground-True Answer'] = sample['Ground-True Answer'].lower()

我们会将这些样本转换为jsonl格式,这是强化微调API所要求的格式。

import json

def convert_to_jsonl_format(samples, filename):
    with open(filename, "w") as f:
        for sample in samples:
            user_content = sample.get("Open-ended Verifiable Question", "")
            reference_answer = sample.get("Ground-True Answer", "")
            json_obj = {
                "messages": [
                    {"role": "user", "content": user_content}
                ],
                "reference_answer": reference_answer
            }
            f.write(json.dumps(json_obj) + "\n")

def load_jsonl(filename):
    samples = []
    with open(filename, "r") as f:
        for line in f:
            samples.append(json.loads(line))
    return samples

# Save the datasets to jsonl files
convert_to_jsonl_format(train_samples, "data/medical_01_verifiable_problem_train.jsonl")
convert_to_jsonl_format(test_samples, "data/medical_01_verifiable_problem_val.jsonl")

# Load the datasets back from jsonl files
train_samples_loaded = load_jsonl("data/medical_01_verifiable_problem_train.jsonl")
test_samples_loaded = load_jsonl("data/medical_01_verifiable_problem_val.jsonl")
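
为确认转换后的格式符合强化微调API的要求(一个 messages 列表加上顶层的 reference_answer 字段),可以打印一条加载回来的记录查看:

# Inspect one converted record to confirm the RFT-required structure
print(json.dumps(train_samples_loaded[0], ensure_ascii=False, indent=2))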

接下来:我们将观察基础模型开箱即用的表现——以及还有哪些改进空间。


3. 基准测试基础模型

在我们进行任何微调之前,需要先了解起点在哪里。基准测试能清晰展示模型的初始优势与短板——这样我们后续才能衡量它的进步程度。

我们将首先依赖两个简单但强大的评估器:

  1. clinical_phrase_binary_grader - 精确匹配检查器。
  2. clinical_phrase_grader - 一种基于词元的更宽松相似度评分器。
from rapidfuzz import fuzz, utils

def clinical_phrase_grader(sample: dict, item: dict) -> float:
    from rapidfuzz import fuzz, utils
    score = fuzz.token_set_ratio(sample["output_text"], item["reference_answer"], processor=utils.default_process)
    return score / 100.0

def clinical_phrase_binary_grader(sample: dict, item: dict) -> float:
    return 1.0 if sample["output_text"] == item["reference_answer"] else 0.0

def combined_grader(sample: dict, item: dict, weights: list[float] = [0.85, 0.15]) -> float:
    clinical_phrase_score = clinical_phrase_grader(sample, item)
    binary_score = clinical_phrase_binary_grader(sample, item)
    return weights[0] * clinical_phrase_score + weights[1] * binary_score

这种组合方式让我们能够同时追踪严格正确性和部分词汇重叠。二元评分器给出明确的0或1分:模型是否产生了完全匹配?而更柔和的评分器则提供更细微的差异——输出结果与标准答案有多接近?我们同时使用这两种方法,因为结果往往存在多种有效表达方式。例如,模型可能回答"痛风性关节炎"而非"痛风"。虽然人类评估者可能认为这个回答部分可接受,但严格的字符串匹配则不会。结合精确评分和模糊评分,能确保对模型输出进行更准确、更公平的评估。
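
下面是一个简单的本地示例(仅作示意,具体的模糊匹配分数取决于 rapidfuzz 的实现):完全匹配会同时拿到两个评分器的满分,而近义表达只能从模糊评分器获得部分分数。

# Illustrative check: an exact match maxes out both graders; a near-synonym
# only earns partial credit from the fuzzy grader.
item = {"reference_answer": "gout"}
exact = {"output_text": "gout"}
partial = {"output_text": "gouty arthritis"}

print(combined_grader(exact, item))    # 1.0
print(combined_grader(partial, item))  # partial credit, somewhere below 0.85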

我们构建了一个辅助函数,用于在示例前添加系统提示。

def prepend_system_prompt_to_first_user_message(samples, system_prompt, path=None):
    new_samples = []
    for sample in samples:
        # Deep copy to avoid mutating the original
        sample_copy = json.loads(json.dumps(sample))
        messages = sample_copy.get("messages", [])
        if messages and messages[0].get("role") == "user" and isinstance(messages[0].get("content"), str):
            if not messages[0]["content"].startswith(system_prompt):
                messages[0]["content"] = f"{system_prompt}\n\n{messages[0]['content']}"
        new_samples.append(sample_copy)
    if path is not None:
        with open(path, "w", encoding="utf-8") as f:
            for item in new_samples:
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
    return new_samples
simple_prompt = """You are an expert clinician. For each clinical vignette, respond with exactly one phrase: the single most likely outcome or phenomenon, all in lowercase. 
- Do not add punctuation, articles, explanations, or commentary - output only the term itself.
- Sometimes, the expected answer can be a synonym of what you think.
- Use the standard clinical name (e.g. “thought withdrawal”, “Toxoplasma encephalitis”)."""
train_samples_loaded_simple_sys_prompt = prepend_system_prompt_to_first_user_message(
    train_samples_loaded, simple_prompt, path="data/medical_01_verifiable_problem_train_simple_prompt.jsonl"
)
test_samples_loaded_simple_sys_prompt = prepend_system_prompt_to_first_user_message(
    test_samples_loaded, simple_prompt, path="data/medical_01_verifiable_problem_val_simple_prompt.jsonl"
)

然后构建一个辅助函数来生成并存储模型的预测结果。

from openai import OpenAI
import concurrent.futures
from tqdm import tqdm
import os

client = OpenAI()

def generate_model_predictions(
    subset,
    prompt_type,
    model_name="o4-mini-2025-04-16",
    reasoning_effort="medium",
    n_runs=1,
    verbose=False,
):
    if isinstance(subset, str):
        samples_path = f"data/medical_01_verifiable_problem_{subset}_{prompt_type}_prompt.jsonl"
        with open(samples_path, "r", encoding="utf-8") as f:
            test_samples = [json.loads(line) for line in f if line.strip()]
    else:
        test_samples = [subset]

    def run_inference(item):
        resp = client.responses.create(
            model=model_name,
            input=item["messages"],
            reasoning={"effort": reasoning_effort, "summary": "detailed"},
        )
        model_prediction = {'output_text': resp.output_text}
        reasoning_tokens_used = resp.usage.output_tokens_details.reasoning_tokens
        summaries = [seg.text for out in resp.output if out.type == "reasoning" for seg in out.summary]
        summaries_string = "\n".join(summaries)
        if verbose:
            print("Prompt: {}".format(item["messages"][0]["content"]))
            print(f"Model Sample: {model_prediction}\nSolution: {item['reference_answer']}\n")
        return {
            "model_prediction": model_prediction["output_text"],
            "input": item,
            "reasoning_tokens_used": reasoning_tokens_used,
            "reference_answer": item["reference_answer"],
            "summaries": summaries_string
        }

    # Ensure the predictions directory exists before any file operations
    predictions_dir = os.path.join("data", "rft", "predictions")
    os.makedirs(predictions_dir, exist_ok=True)

    # Check if results already exist for all runs
    results_per_run = []
    for run_idx in range(n_runs):
        run_save_path = os.path.join(
            predictions_dir,
            f"{subset}_{prompt_type}_{model_name}_{reasoning_effort}_predictions_run{run_idx+1}.json"
        )
        if os.path.exists(run_save_path):
            print(f"Results for run {run_idx+1} already exist at {run_save_path}. Loading results.")
            with open(run_save_path, "r", encoding="utf-8") as f:
                run_results = json.load(f)
            results_per_run.append(run_results)
        else:
            if len(test_samples) == 1:
                run_results = [run_inference(test_samples[0])]
            else:
                run_results = []
                with concurrent.futures.ThreadPoolExecutor() as executor:
                    futures = [executor.submit(run_inference, item) for item in test_samples]
                    for future in tqdm(futures, total=len(futures), desc=f"Generating predictions (run {run_idx+1})"):
                        result = future.result()
                        run_results.append(result)
                with open(run_save_path, "w", encoding="utf-8") as f:
                    json.dump(run_results, f, ensure_ascii=False, indent=2)
            results_per_run.append(run_results)

    # Return a flat list for backward compatibility
    if n_runs == 1:
        return results_per_run[0]
    else:
        return results_per_run

要生成预测结果,首先请确保您的API密钥已设置:

export OPENAI_API_KEY=...
# OpenAI o4-mini model
results_simple_o4mini = generate_model_predictions(
    subset="train",
    prompt_type="simple",
    model_name="o4-mini",
    reasoning_effort="medium",
    n_runs=3
)
# OpenAI o3 model
results_simple_o3 = generate_model_predictions(
    subset="train",
    prompt_type="simple",
    model_name="o3",
    reasoning_effort="medium",
    n_runs=3
)

我们现在已经准备好可以评估的预测结果。接下来,我们将构建一个辅助函数,方便我们轻松切换不同的评分方法。

import functools

def evaluate_predictions_with_grader(
    predictions,
    grader_func=combined_grader,
):
    results = []

    if isinstance(predictions, dict):
        predictions = [predictions]

    def run_grading(pred):
        model_prediction = {"output_text": pred["model_prediction"]}
        item = pred["input"]
        score = grader_func(model_prediction, item)
        result = pred.copy()
        result["score"] = score
        return result

    if len(predictions) == 1:
        result = run_grading(predictions[0])
        results.append(result)
    else:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [executor.submit(run_grading, pred) for pred in predictions]
            for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Grading predictions"):
                results.append(future.result())

    total = len(results)
    correct = sum(r["score"] for r in results)
    accuracy = correct / total if total else 0.0

    metrics = {
        "total_samples": total,
        "accuracy": accuracy,
    }
    print(metrics)
    return metrics, results

def run_prediction_evaluation(
    model_name="o4-mini",
    reasoning_effort="medium",
    prompt_type="simple",
    subset="train",
    grader_func=combined_grader,
    num_runs=3,
):
    if isinstance(grader_func, functools.partial):
        name = grader_func.func.__name__
        mg = grader_func.keywords["model_grader"]
        mg_name = mg["name"]
        name = f"{name}_{mg_name}"
    else:
        name = getattr(grader_func, "__name__", getattr(grader_func, "__class__", type(grader_func)).__name__)
    grader_func_name = name.replace(" ", "_").replace(":", "_").replace("/", "_").replace(",", "_")

    for i in range(num_runs):
        preds_path = f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_predictions_run{i+1}.json"
        with open(preds_path, "r") as f:
            preds = json.load(f)
        metrics, results_with_scores = evaluate_predictions_with_grader(preds, grader_func=grader_func)
        # Save the scored results
        with open(f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{i+1}_scored.json", "w") as f:
            json.dump(results_with_scores, f, indent=2)
        # Save the metrics
        with open(f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{i+1}_metrics.json", "w") as f:
            json.dump(metrics, f, indent=2)
        # Save the scores (if present in results_with_scores)
        scores = [item.get("score") for item in results_with_scores if "score" in item]
        with open(f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{i+1}_scores.json", "w") as f:
            json.dump(scores, f, indent=2)

def load_predictions(
    model_name="o4-mini",
    reasoning_effort="medium",
    prompt_type="simple",
    subset="train",
    grader_func_name="clinical_phrase_grader",
    num_runs=3
):
    all_predictions = []
    all_metrics = []
    for run in range(1, num_runs + 1):
        pred_path = f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{run}_scored.json"
        metrics_path = f"data/rft/predictions/{subset}_{prompt_type}_{model_name}_{reasoning_effort}_{grader_func_name}_predictions_run_{run}_metrics.json"
        try:
            with open(pred_path, "r") as f:
                predictions = json.load(f)
        except FileNotFoundError:
            predictions = None
        try:
            with open(metrics_path, "r") as f:
                metrics = json.load(f)
        except FileNotFoundError:
            metrics = None
        all_predictions.append(predictions)
        all_metrics.append(metrics)
    return all_predictions, all_metrics

然后运行评估。

model_name = "o4-mini"
reasoning_effort = "medium"
prompt_type = "simple"
subset = "train"
grader_func = combined_grader
grader_func_name = "combined_grader"
num_runs = 3
run_prediction_evaluation(
    model_name=model_name, 
    reasoning_effort=reasoning_effort, 
    prompt_type=prompt_type, 
    subset=subset, 
    grader_func=grader_func, 
    num_runs=num_runs
)
predictions_o4mini_medium_simple_prompt, metrics_o4mini_medium_simple_prompt = load_predictions(model_name=model_name, reasoning_effort=reasoning_effort, prompt_type=prompt_type, subset=subset, grader_func_name=grader_func_name, num_runs=num_runs)
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 329740.88it/s]
{'total_samples': 100, 'accuracy': 0.5716752010712578}
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 497544.96it/s]
{'total_samples': 100, 'accuracy': 0.5855097792577905}
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 414456.92it/s]
{'total_samples': 100, 'accuracy': 0.5702082734545793}

可视化结果能让我们发现趋势和故障模式。

# Print mistakes where the model did not get the correct answer (score < 1.0)
mistakes = [
    {"index": i, **res}
    for i, res in enumerate(predictions_o4mini_medium_simple_prompt[0])
    if res["score"] < 1.0
]

print(f"\nTotal mistakes: {len(mistakes)}")
for m in mistakes[15:20]:
    print(f"\n[Sample {m['index']}]")
    print(f"  Model prediction: {m['model_prediction']}")
    print(f"  Reference answer: {m['reference_answer']}")
    print(f"  Score: {m['score']}")
Total mistakes: 84

[Sample 16]
  Model prediction: enveloped double stranded linear dna virus
  Reference answer: double-stranded, enveloped dna virus
  Score: 0.85

[Sample 19]
  Model prediction: gallstone ileus
  Reference answer: gall stone ileus
  Score: 0.8225806451612904

[Sample 20]
  Model prediction: acute rheumatic fever
  Reference answer: postinfectious glomerulonephritis
  Score: 0.22037037037037036

[Sample 22]
  Model prediction: amygdala
  Reference answer: hippocampus
  Score: 0.17894736842105263

[Sample 23]
  Model prediction: hypopituitarism
  Reference answer: pituitary adenoma
  Score: 0.47812499999999997

如上所述,典型的故障模式可分为三类(下方给出一个简单的分桶统计示例):

  1. 细微差异和格式问题,得分 >= 0.8。
  2. 部分词汇匹配,得分在 0.3 到 0.8 之间。
  3. 词汇偏离参考答案,得分 < 0.3。
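
下面的小片段按这三个区间对上面的 mistakes 列表做一个简单的分桶统计:

# Bucket the mistakes into the three failure modes above
buckets = {"formatting (>= 0.8)": 0, "partial overlap (0.3-0.8)": 0, "off target (< 0.3)": 0}
for m in mistakes:
    if m["score"] >= 0.8:
        buckets["formatting (>= 0.8)"] += 1
    elif m["score"] > 0.3:
        buckets["partial overlap (0.3-0.8)"] += 1
    else:
        buckets["off target (< 0.3)"] += 1
print(buckets)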

我们可以在训练集上可视化完整的分数分布。

注意:在实际操作中,大规模分析模型错误通常需要结合人工审查和自动化方法——例如标记故障类型或根据分数和内容对预测进行聚类。该工作流程超出了本指南的范围,但一旦您识别出广泛模式后,这是非常有价值的下一步。

import matplotlib.pyplot as plt
scores_distribution = [m['score'] for m in predictions_o4mini_medium_simple_prompt[0]]
plt.hist(scores_distribution, alpha=0.6, label='o4-mini medium simple prompt')
plt.legend()
<matplotlib.legend.Legend at 0x132843e90>
image generated by notebook

让我们与其他模型和提示进行比较,并可视化评分。

# OpenAI o3 model
model_name = "o3"
reasoning_effort = "medium"
prompt_type = "simple"
subset = "train"
grader_func = combined_grader
grader_func_name = "combined_grader"
num_runs = 3
run_prediction_evaluation(model_name=model_name, reasoning_effort=reasoning_effort, prompt_type=prompt_type, subset=subset, grader_func=grader_func, num_runs=num_runs)
predictions_o3_medium_simple_prompt, metrics_o3_medium_simple_prompt = load_predictions(model_name=model_name, reasoning_effort=reasoning_effort, prompt_type=prompt_type, subset=subset, grader_func_name=grader_func_name, num_runs=num_runs)
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 489988.79it/s]
{'total_samples': 100, 'accuracy': 0.6150339441350683}
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 507170.98it/s]
{'total_samples': 100, 'accuracy': 0.5901906182115139}
Grading predictions: 100%|██████████| 100/100 [00:00<00:00, 543303.63it/s]
{'total_samples': 100, 'accuracy': 0.5927679005876193}
import numpy as np
import pandas as pd
import seaborn as sns

def average_and_std_metrics(metrics_list):
    """Returns dicts of mean and std for a list of metrics dicts."""
    if not metrics_list: return {}, {}
    keys = metrics_list[0].keys()
    arr = {k: np.array([m[k] for m in metrics_list]) for k in keys}
    mean = {k: float(np.mean(arr[k])) for k in keys}
    std = {k: float(np.std(arr[k])) for k in keys}
    return mean, std

def plot_model_accuracies(model_metrics_avg, model_metrics_std, grader_title="Combined Grader Accuracy", sharey: bool = True) -> None:
    """Plots model accuracies with standard deviation error bars."""
    # Convert the nested dicts into tidy DataFrames
    df_avg = pd.DataFrame(model_metrics_avg).T.reset_index().rename(columns={"index": "Model"})
    df_std = pd.DataFrame(model_metrics_std).T.reset_index().rename(columns={"index": "Model"})

    # Long-form for Seaborn
    long_df_avg = df_avg.melt(id_vars="Model", value_vars=["accuracy"], var_name="Metric", value_name="Accuracy")
    long_df_std = df_std.melt(id_vars="Model", value_vars=["accuracy"], var_name="Metric", value_name="Std")

    # Merge avg and std for error bars
    long_df = pd.merge(long_df_avg, long_df_std, on=["Model", "Metric"])

    pretty_names = {"accuracy": grader_title}

    # Create a separate figure for each metric
    for metric_key in ["accuracy"]:
        metric_df = long_df[long_df["Metric"] == metric_key].copy()
        plt.figure(figsize=(8, 5))
        # Plot bars with error bars
        ax = sns.barplot(data=metric_df, x="Model", y="Accuracy", hue="Model", palette="tab10", legend=False, errorbar=None)
        bars = ax.patches
        # Add error bars manually
        for i, row in enumerate(metric_df.itertuples()):
            bar = bars[i]
            x = bar.get_x() + bar.get_width() / 2
            y = row.Accuracy
            yerr = row.Std
            ax.errorbar(x=x, y=y, yerr=yerr, fmt='none', ecolor='black', capsize=5, elinewidth=2, capthick=2, zorder=10)
        plt.title(pretty_names[metric_key])
        plt.ylabel("Accuracy")
        plt.xlabel("")
        if sharey: plt.ylim(0, 1)
        # Annotate bars with exact values
        for bar in bars:
            height = bar.get_height()
            ax.annotate(f"{height:.2f}", xy=(bar.get_x() + bar.get_width() / 2, height), xytext=(0, 6), textcoords="offset points", ha='center', va='bottom', fontsize=10, fontweight='bold')
        plt.xticks(rotation=15, ha="right")
        plt.tight_layout()
        plt.show()
avg_metrics_o4mini_medium_simple_prompt, std_metrics_o4mini_medium_simple_prompt = average_and_std_metrics(metrics_o4mini_medium_simple_prompt)
avg_metrics_o3_medium_simple_prompt, std_metrics_o3_medium_simple_prompt = average_and_std_metrics(metrics_o3_medium_simple_prompt)
model_metrics_avg = {
    "o4-mini-medium-simple-prompt": avg_metrics_o4mini_medium_simple_prompt,
    "o3-medium-simple-prompt": avg_metrics_o3_medium_simple_prompt,
}
model_metrics_std = {
    "o4-mini-medium-simple-prompt": std_metrics_o4mini_medium_simple_prompt,
    "o3-medium-simple-prompt": std_metrics_o3_medium_simple_prompt,
}
plot_model_accuracies(model_metrics_avg, model_metrics_std, grader_title="Combined Grader Accuracy")
image generated by notebook

我们可以看到模型的性能存在明显局限。实际上,通过反复优化提示词通常有助于提升基线结果,从而更好地发挥基础模型的潜力。但在本案例中,我们的提示工程并未带来显著改进——因此我们将这些实验运行排除在分析范围之外。

RFT能够发挥作用的一个关键前提是,基础模型必须证明它能够从一开始就成功完成至少部分示例任务。初始准确率约0.6是一个强烈信号,表明RFT可以提升性能。如果模型在您的任务上从未成功过,就没有可供攀登的训练信号。

这一评估流程为我们下一步做好准备:通过评分者提供的结构化高质量反馈来引导模型。


4. 定义您的评分器

评分器定义了在RFT过程中塑造模型行为的奖励函数。它提供了期望输出的示例,并对不良输出进行惩罚。设计一个有效的评分器既需要原则性的结构,也需要深思熟虑的领域洞察,这可能是成功实施RFT最重要的任务。

在本节中,我们将介绍3个评分器,展示如何设置它们以适配API,并讨论它们产生的结果。然后我们将展示如何实际启动RFT任务。

基于字符串的评分器

我们最初利用前面的评估函数构建了一个组合(multi)评分器,因为它能给出与预测答案和参考答案之间词汇接近度相对应的分数分布。这提供了一个起点,但对o4-mini来说这个信号还不够丰富,不足以让模型真正学习和改进:首次实验显示,RFT运行期间奖励停滞不前。对于API调用,您应按照如下所示构建Python评分函数。

import inspect

# --- Utility functions ---
def build_python_grader_payload(grader_fn) :
    """Build a payload for a python grader."""
    grader_source = inspect.getsource(grader_fn)
    # Enforce function name to be `grade`
    grader_source = grader_source.replace(grader_fn.__name__, "grade", 1)
    return {
        "type": "python",
        "source": grader_source,
    }

multi_python_grader_tool_call = {
    "type": "multi",
    "graders": {
        "clinical_phrase": {
            "name": "clinical_phrase_grader",
            "image_tag": "2025-05-08",
            **build_python_grader_payload(clinical_phrase_grader),
        },
        "clinical_phrase_binary": {
            "name": "clinical_phrase_binary_grader",
            "image_tag": "2025-05-08",
            **build_python_grader_payload(clinical_phrase_binary_grader),
        },
    },
    "calculate_output": "0.85 * clinical_phrase + 0.15 * clinical_phrase_binary",
}

以下是其训练曲线的快照,其中绿色曲线代表训练集奖励,蓝色曲线代表测试集奖励:

RFT String Grader

模型评分器 1

为了解决这一限制,我们引入了一种更先进的方法:模型评分器。基于模型的评分器使我们能够将语义理解和细微差别融入反馈中。当涉及特定领域的同义词或模糊推理时,这种方法尤其强大。

我们采用gpt-4.1作为评分模型,并遵循强调语义保真度的评分标准:临床同义性、正确的疾病分类和概念一致性。评分者不是关注表面的措辞(例如"这是相同的字符串吗?"),而是旨在回答"这反映了正确的结果或现象吗?"

为确保评分器与专家预期保持一致,我们在一部分基础模型预测上对其进行了评估。对于任何生产环境用例,领域专家评审员应验证模型分配的分数是否反映了优选答案顺序并与领域判断一致。这通常需要确认模型评分器能根据预测的有效性正确排序。在本指南范围内,我们通过使用OpenAI o3来近似评估,检查更高质量的预测是否始终能获得比替代方案更高的评分。

通过对o3的这些讨论,我们迭代更新模型评分器,直到结果达成一致。

GRADER_PROMPT_1 = """
System:
  You are an expert medical grader. Compare the **Reference Answer** to the **Model's Answer** and produce **only** a JSON object with:
    • **result**: a float between 0.0 and 1.0  
    • **steps**: a list of reasoning steps (each with a `"description"` and a `"conclusion"`)

  Scoring rubric (start at 0.0, then add or subtract):
    1. Exact lexical match: **+0.15**  
    2. Clinical synonym (e.g. “withdrawal of thought” ↔ “thought withdrawal”): **+0.35**  
    3. Same disease family (e.g. two viral encephalitides): **+0.35**  
    4. Partial term overlap (e.g. “ulcer” in both phrases): **+0.15**  
    5. Completely unrelated: **-0.10**

  • If multiple criteria apply, sum their weights (max 1.0).  
  • Cap the final score to the [0.0, 1.0] range.  
  • In your **steps**, show which rule you applied and the running subtotal.
"""

要通过API提交,字典是这样构建的。

model_grader_1 = {
   "type": "score_model",
   "name": "gpt41_score_model_1",
   "input": [
        {
            "role": "system",
            "content": GRADER_PROMPT_1
        },
        {
            "role": "user",
            "content": "Reference Answer: {{item.reference_answer}}. Model's Answer: {{sample.output_text}}"
        }
   ],
   "pass_threshold": 0.75,
   "model": "gpt-4.1-2025-04-14",
   "range": [0, 1],
   "sampling_params": {
       "seed": 42,
       "temperature": 0,
   },
}

为此,我们在本地也实现了相同的模型评分器,用于检查接下来将要微调的模型的输出结果。


from pydantic import BaseModel
from typing import List

class GraderStep(BaseModel):
    description: str
    conclusion: str

class GraderResponse(BaseModel):
    result: float
    steps: List[GraderStep]

# Adapted python_model_grader to match the other graders' interface
def python_model_grader(sample, item, model_grader=model_grader_1):
    """
    Calls an OpenAI model to grade the model output against the reference answer.
    Expects sample to have "output_text", item to have "reference_answer".
    Returns a float score (parsed from the model's JSON response).
    """
    # Prepare the prompt as the grader expects
    system_prompt = model_grader["input"][0]["content"]
    user_prompt = model_grader["input"][1]["content"]
    user_prompt_filled = user_prompt.replace("{{item.reference_answer}}", item["reference_answer"]).replace("{{sample.output_text}}", sample["output_text"])
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_filled}
    ]
    # Call the OpenAI API with the grader's model
    response = client.beta.chat.completions.parse(
        model=model_grader["model"],
        messages=messages,
        seed=model_grader.get("sampling_params", {}).get("seed", None),
        temperature=model_grader.get("sampling_params", {}).get("temperature", 0),
        response_format=GraderResponse,
    )
    # Parse the float score from the model's JSON response
    parsed = response.choices[0].message.parsed
    if not isinstance(parsed, GraderResponse):
        raise RuntimeError(f"Grader returned invalid structured output: {parsed!r}")
    return float(parsed.result)
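
在把模型评分器用于RFT之前,可以先做一个最小的本地抽查(下面的预测输入为假设示例):在合理的评分标准下,临床上更接近参考答案的预测应当获得更高的分数。这类抽查也是与领域专家对齐评分器的起点。

# A minimal local spot-check with hypothetical predictions: under a sensible
# rubric, the clinically closer answer should receive the higher score.
item = {"reference_answer": "thought withdrawal"}
close_pred = {"output_text": "withdrawal of thought"}   # clinical synonym
far_pred = {"output_text": "auditory hallucination"}    # unrelated concept

score_close = python_model_grader(close_pred, item, model_grader=model_grader_1)
score_far = python_model_grader(far_pred, item, model_grader=model_grader_1)
print(f"close: {score_close:.2f}, far: {score_far:.2f}")  # we expect score_close > score_far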

虽然评分标准最初提供了合理的反馈,但模型很快发现了一个漏洞并开始奖励破解。分数急剧上升(有时高达20-30个百分点),并非因为临床准确性提高,而是因为模型用同义词、剂量和完整治疗方案填充其"一句话"答案。你可能会看到"开始华法林治疗并继续使用普通肝素≥5天,重叠使用直至INR达到治疗范围(2-3)"或"立即咀嚼阿司匹林325毫克,加硝酸甘油..."这样的回答,而不是按要求仅回答"继续使用普通肝素"或"阿司匹林"。尽管系统提示明确要求"用确切的一句话回答:最可能的结果或现象",这些冗长的输出仍会虚增lexical_similarity分数,却没有真正增加预测价值。这一经验凸显了持续检查模型输出、警惕可能悄悄扭曲评估指标的奖励破解行为的必要性。
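
为帮助在本地及早发现这类问题,可以用一个简单的启发式检查(仅作示意,ratio_threshold 阈值为假设值)标记比参考答案长很多的预测,供人工复查:

# Heuristic sketch: flag predictions much longer than the reference answer,
# a common symptom of the verbosity-based reward hacking described above.
def flag_verbose_predictions(preds, ratio_threshold=3.0):
    flagged = []
    for p in preds:
        pred_len = len(p["model_prediction"].split())
        ref_len = max(len(p["reference_answer"].split()), 1)
        if pred_len / ref_len > ratio_threshold:
            flagged.append(p)
    return flagged

# Example usage on previously generated predictions:
# print(len(flag_verbose_predictions(predictions_o4mini_medium_simple_prompt[0])))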

以下是其训练曲线快照(绿色代表训练奖励,蓝色代表测试奖励):

RFT Model Hacking

模型评分器 2

为了缓解这种奖励滥用问题,我们通过明确期望、实施更严格的输出约束以及提供正确与错误行为的对比示例,改进了评分提示。我们再次使用o3进行迭代,利用基础o4-mini的预测和之前微调模型的黑客示例,来设计和验证我们的评分器。这个更新版评分器的另一个重要点是降低了lexical_similarity的权重,以确保clinical_similarity占据主导地位。

GRADER_PROMPT_2 = """You are an expert medical grader.

Compare the reference_answer (gold standard) with the model_prediction
and return **exactly** this JSON object:

{
  "steps": [            // each: {"description": "...", "conclusion": "..."}

  ],
  "result": <float 0-1 rounded to 3 decimals>
}

──────────────── Input placeholders ───────────────
reference_answer:
model_prediction:

──────────── Normalisation steps ────────────
• lowercase, strip punctuation / excess whitespace  
• expand common abbreviations (e.g. cll → chronic lymphocytic leukemia)  
• map both strings to ICD-10 / SNOMED concepts when possible

──────────── Clinical layer rubric ───────────
L1  exact concept or universally accepted synonym  
L2  same concept but benign modifier differs (e.g. “acute”, “left”)  
L3  same disease / drug family but wrong subtype or variant  
L4  same organ system but entirely different disease / intervention  
L5  only partial mechanistic overlap (e.g. both vasodilators)  
L6  unrelated or nonsensical

──────────── Scoring parameters ─────────────
clinical_weight  = 0.90
lexical_weight   = 0.10
clinical_similarity = {1:1.00, 2:0.85, 3:0.45, 4:0.30, 5:0.10, 6:0.00}

lexical_similarity = normalized_levenshtein(reference_answer,
                                            model_prediction)

# Optional penalty if a clinically critical adjective is missing
critical_modifiers = [
  "wide", "narrow", "acute", "chronic", "posteromedial",
  "oxidized", "oxidised", "left", "right"
]
modifier_pen = -0.05 if any(
    w in reference_answer and w not in model_prediction
    for w in critical_modifiers
) else 0.0

# Determine layer L (1-6) per rubric above using ontology + judgment.
if L == 6:
    score = 0.0
else:
    score = (clinical_weight * clinical_similarity[L] +
             lexical_weight  * lexical_similarity) + modifier_pen

Clamp to [0,1] and round to 3 decimals.  
Output **only** the JSON.

──────────────── Worked examples ─────────────
reference_answer: beta-thalassemia major  
model_prediction: beta-thalassemia minor  
reasoning: Both involve β-globin chain synthesis, but “major” causes
          transfusion-dependent anemia while “minor” is largely benign;
          same family, wrong subtype → **L3**. Lexical ≈ 0.83.  
score = 0.90·0.45 + 0.10·0.83 = 0.488 → **0.488**

reference_answer: ACE inhibitor  
model_prediction: angiotensin-receptor blocker  
reasoning: Both act on the renin–angiotensin axis yet on different
          targets; only partial mechanistic overlap → **L5**.
          Lexical ≈ 0.31.  
score = 0.90·0.10 + 0.10·0.31 = 0.121 → **0.121**

reference_answer: acute pancreatitis  
model_prediction: pancreatitis  
reasoning: Same disorder but missing timing adjective “acute”;
          benign modifier difference → **L2**. Lexical ≈ 0.78.  
score = 0.90·0.85 + 0.10·0.78 = 0.843 → **0.843**

reference_answer: valproate  
model_prediction: valproic acid  
reasoning: Valproic acid is the active moiety of valproate; mechanisms
          and indications are identical → **L1**. Lexical ≈ 0.82.  
score = 0.90·1.00 + 0.10·0.82 = 0.982 → **0.982**

reference_answer: riboflavin  
model_prediction: riboflavin deficiency  
reasoning: Adds “deficiency” but refers to the same vitamin (B₂);
          benign modifier difference → **L2**. Lexical ≈ 0.60.  
score = 0.90·0.85 + 0.10·0.60 = 0.825 → **0.825**

reference_answer: splenectomy  
model_prediction: acetaminophen overdose  
reasoning: Surgical removal of the spleen has no mechanistic or anatomic
          relationship to toxic drug ingestion → **L6**.  
score = **0.000**

reference_answer: ulcerative colitis  
model_prediction: Crohn disease  
reasoning: Both are inflammatory-bowel diseases but differ in location,
          histology and management; same organ system, different disease
          → **L4**. Lexical ≈ 0.38.  
score = 0.90·0.30 + 0.10·0.38 = 0.308 → **0.308**"""
model_grader_2 = {
   "type": "score_model",
   "name": "gpt41_score_model_2",
   "input": [
        {
            "role": "system",
            "content": GRADER_PROMPT_2
        },
        {
            "role": "user",
            "content": "Reference Answer: {{item.reference_answer}}. Model's Answer: {{sample.output_text}}"
        }
   ],
   "pass_threshold": 0.75,
   "model": "gpt-4.1-2025-04-14",
   "range": [0, 1],
   "sampling_params": {
       "seed": 42,
       "temperature": 0,
   },
}

最终成果是一个高信号、领域敏感的评分器,它能引导模型做出更恰当和简洁的预测。

成本说明: LLM评分器除了训练计算外还会产生token使用费用。为了有效控制成本,我们建议:

  1. 在基础模型生成结果(可选地包括合成结果)上本地测试您的评分器,以确保其与您的评分标准或人类偏好一致。如果可用,可使用flex processing以更具成本效益的方式完成这些评估(见本列表后的示意)。
  2. 从小规模的RFT运行开始,以验证评分者一致性并在扩大规模前检测潜在的奖励滥用行为。
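
下面是第1点中提到的flex processing的一个调用示意(仅作草图:模型与提示内容均为示例,flex仅适用于受支持的模型,例如o3 / o4-mini):

# Sketch only: flex processing trades latency for lower cost, which helps when
# spot-checking a grader prompt over many samples. The prompt is illustrative.
flex_resp = client.responses.create(
    model="o4-mini",
    input=[{
        "role": "user",
        "content": "Reference Answer: gout. Model's Answer: gouty arthritis. "
                   "Reply with a similarity score between 0 and 1.",
    }],
    service_tier="flex",
)
print(flex_resp.output_text)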

让我们看看下一步如何启动训练!


5. 训练

一旦您的提示和评分器最终确定,您就可以开始训练。本节展示如何使用您的最终评分器启动RFT——当然,您可能已经在对早期评分器版本进行性能评估时运行过类似的命令。

我们确保评分器通过了API测试,

import requests

API_KEY = os.environ["OPENAI_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Validate a grader configuration for fine-tuning
payload = {"grader": model_grader_2}
try:
    response = requests.post(
        "https://api.openai.com/v1/fine_tuning/alpha/graders/validate",
        json=payload,
        headers=HEADERS,
    )
    response.raise_for_status()
    print("Grader validated")
except requests.exceptions.RequestException as e:
    print(f"Error validating grader: {e}")
    if 'response' in locals():
        print(f"Response: {response.text}")

并将训练集和测试集上传到OpenAI文件系统。

# Set your training and test file paths
train_file = "data/medical_01_verifiable_problem_train_simple_prompt.jsonl"
test_file = "data/medical_01_verifiable_problem_val_simple_prompt.jsonl"

def upload_file(file_path: str) -> str:
    """Upload a file to the OpenAI platform for fine-tuning."""
    print(f"Uploading file: {file_path}")
    with open(file_path, 'rb') as f:
        response = requests.post(
            "https://api.openai.com/v1/files",
            headers=HEADERS,
            files={"file": f},
            data={"purpose": "fine-tune"}
        )
        response.raise_for_status()
        file_id = response.json()["id"]
        print(f"File uploaded successfully. File ID: {file_id}")
        return file_id

train_file_id = train_file
if train_file.endswith("jsonl"):
    print(f"Training file detected: {train_file}")
    train_file_id = upload_file(train_file)
test_file_id = test_file
if test_file and test_file.endswith("jsonl"):
    print(f"test file detected: {test_file}")
    test_file_id = upload_file(test_file)

现在让我们定义本次运行的超参数。我们将对o4-mini进行微调,采用medium推理强度。该参数会通过限制模型用于推理的token数量来影响输出长度。我们采用中等计算乘数和合理的训练轮次进行调优,以优先保证效率和快速迭代。您需要根据预算、期望的泛化能力和数据集难度来调整这些参数。

# Set the model and other parameters
model = "o4-mini-2025-04-16"
suffix = "medical_01_verifiable_problem_gpt41_grader"
reasoning_effort = "medium"
n_epochs = 5
seed = 42
grader = model_grader_2
response_format = None
compute_multiplier = 1.0
eval_samples = 1
eval_interval = 5

我们现在可以启动运行了!

# Launch the RFT job
payload = dict(
    training_file=train_file_id,
    validation_file=test_file_id,
    model=model,
    suffix=suffix,
    method=dict(
        type="reinforcement",
        reinforcement=dict(
            grader=grader,
            response_format=response_format,
            hyperparameters=dict(
                compute_multiplier=compute_multiplier,
                eval_samples=eval_samples,
                eval_interval=eval_interval,
                n_epochs=n_epochs,
                reasoning_effort=reasoning_effort,
            )
        )
    ),
    seed=seed
)

try:
    response = requests.post(
        "https://api.openai.com/v1/fine_tuning/jobs",
        json=payload,
        headers=HEADERS,
    )
    response.raise_for_status()
    job_id = response.json().get("id")
    if job_id:
        print("Training job created with ID:", job_id)
        print(
            f"View the job details at: https://platform.openai.com/finetune/{job_id}")
    else:
        print("Failed to retrieve job ID from response.")
except requests.exceptions.RequestException as e:
    print(f"An error occurred while creating the training job: {e}")
    if 'response' in locals():
        print(f"Response: {response.text}")

在仪表盘上,您可以观察奖励曲线图:它们展示整体性能随训练步骤的提升情况;在使用multi_grader时,每个评分器的子图会把奖励分解到各个组成部分。推理token使用趋势(通常会随着模型信心增强而下降)和每步耗时指标提供了效率方面的洞察。评分器延迟和错误计数图表则有助于确保您的评分器在整个运行期间保持高性能且无错误。

这是我们训练曲线的快照,其中绿色和橙色曲线代表训练集,而蓝色和红色曲线代表测试子集:

RFT Dashboard Example

在训练过程中,测试集上的评估结果会直接记录到Evals API,并可在对应的评估页面中查看。您可以前往该页面追踪各样本的表现,了解预测结果随时间的变化趋势。
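
除了仪表盘,您也可以直接通过API拉取微调任务的事件流,在本地跟踪这些指标(一个简单示意,复用上文的HEADERS与job_id):

# Sketch: fetch the latest fine-tuning job events, which mirror what the
# dashboard shows (reward metrics, step progress, grader errors).
events_resp = requests.get(
    f"https://api.openai.com/v1/fine_tuning/jobs/{job_id}/events",
    headers=HEADERS,
    params={"limit": 10},
)
events_resp.raise_for_status()
for event in events_resp.json().get("data", []):
    print(event.get("created_at"), "-", event.get("message"))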


# To retrieve information about a fine-tuning job (including the fine-tuned model id), use the job_id:
response = requests.get(
    f"https://api.openai.com/v1/fine_tuning/jobs/{job_id}",
    headers=HEADERS,
)
if response.ok:
    data = response.json()
    if data.get("status") == "succeeded":
        fine_tuned_model_id = data.get("fine_tuned_model")
    else:
        fine_tuned_model_id = None
else:
    raise Exception(f"Request failed: {response.status_code} - {response.text}")
print("Fine-tuned model id:", fine_tuned_model_id)
from functools import partial
model_name = fine_tuned_model_id
reasoning_effort = "medium"
prompt_type = "simple"
subset = "val"
grader_func = partial(python_model_grader, model_grader=model_grader_2)
grader_func_name = "python_model_grader_gpt41_score_model_2"
num_runs = 3

results_ft_model_grader_2 = generate_model_predictions(
    subset=subset,
    prompt_type=prompt_type,
    model_name=model_name,
    reasoning_effort=reasoning_effort,
    n_runs=num_runs
)
run_prediction_evaluation(
    model_name=model_name, 
    reasoning_effort=reasoning_effort, 
    prompt_type=prompt_type, 
    subset=subset,
    grader_func=grader_func, 
    num_runs=num_runs
)
predictions_ftmodel_medium_simple_prompt_model_grader_2, metrics_ftmodel_medium_simple_prompt_model_grader_2 = load_predictions(
    model_name=model_name,
    reasoning_effort=reasoning_effort,
    prompt_type=prompt_type,
    subset=subset,
    grader_func_name=grader_func_name,
    num_runs=num_runs
)
Generating predictions (run 1):   0%|          | 0/100 [00:00<?, ?it/s]
Generating predictions (run 1): 100%|██████████| 100/100 [02:27<00:00,  1.47s/it]
Generating predictions (run 2): 100%|██████████| 100/100 [02:28<00:00,  1.49s/it]
Generating predictions (run 3): 100%|██████████| 100/100 [02:13<00:00,  1.33s/it]
Grading predictions: 100%|██████████| 100/100 [00:23<00:00,  4.30it/s]
{'total_samples': 100, 'accuracy': 0.7207700000000001}
Grading predictions: 100%|██████████| 100/100 [00:29<00:00,  3.43it/s]
{'total_samples': 100, 'accuracy': 0.7125700000000001}
Grading predictions: 100%|██████████| 100/100 [00:22<00:00,  4.39it/s]
{'total_samples': 100, 'accuracy': 0.7239800000000003}
model_name = "o4-mini"
reasoning_effort = "medium"
prompt_type = "simple"
subset = "val"
grader_func = partial(python_model_grader, model_grader=model_grader_2)
grader_func_name = "python_model_grader_gpt41_score_model_2"
num_runs = 3

results_o4mini_model_grader_2 = generate_model_predictions(
    subset=subset,
    prompt_type=prompt_type,
    model_name=model_name,
    reasoning_effort=reasoning_effort,
    n_runs=num_runs
)
run_prediction_evaluation(
    model_name=model_name, 
    reasoning_effort=reasoning_effort, 
    prompt_type=prompt_type, 
    subset=subset,
    grader_func=grader_func, 
    num_runs=num_runs
)
predictions_o4mini_medium_simple_prompt_model_grader_2, metrics_o4mini_medium_simple_prompt_model_grader_2 = load_predictions(
    model_name=model_name,
    reasoning_effort=reasoning_effort,
    prompt_type=prompt_type,
    subset=subset,
    grader_func_name=grader_func_name,
    num_runs=num_runs
)
Results for run 1 already exist at data/rft/predictions/val_simple_o4-mini_medium_predictions_run1.json. Loading results.
Results for run 2 already exist at data/rft/predictions/val_simple_o4-mini_medium_predictions_run2.json. Loading results.
Results for run 3 already exist at data/rft/predictions/val_simple_o4-mini_medium_predictions_run3.json. Loading results.
Grading predictions: 100%|██████████| 100/100 [00:21<00:00,  4.57it/s]
{'total_samples': 100, 'accuracy': 0.6749300000000003}
Grading predictions: 100%|██████████| 100/100 [00:20<00:00,  4.96it/s]
{'total_samples': 100, 'accuracy': 0.6755199999999999}
Grading predictions: 100%|██████████| 100/100 [00:24<00:00,  4.16it/s]
{'total_samples': 100, 'accuracy': 0.64916}
model_name = "o3"
reasoning_effort = "medium"
prompt_type = "simple"
subset = "val"
grader_func = partial(python_model_grader, model_grader=model_grader_2)
grader_func_name = "python_model_grader_gpt41_score_model_2"
num_runs = 3

results_o3_model_grader_2 = generate_model_predictions(
    subset=subset,
    prompt_type=prompt_type,
    model_name=model_name,
    reasoning_effort=reasoning_effort,
    n_runs=num_runs
)
run_prediction_evaluation(
    model_name=model_name, 
    reasoning_effort=reasoning_effort, 
    prompt_type=prompt_type, 
    subset=subset,
    grader_func=grader_func, 
    num_runs=num_runs
)
predictions_o3_medium_simple_prompt_model_grader_2, metrics_o3_medium_simple_prompt_model_grader_2 = load_predictions(
    model_name=model_name,
    reasoning_effort=reasoning_effort,
    prompt_type=prompt_type,
    subset=subset,
    grader_func_name=grader_func_name,
    num_runs=num_runs
)
Results for run 1 already exist at data/rft/predictions/val_simple_o3_medium_predictions_run1.json. Loading results.
Results for run 2 already exist at data/rft/predictions/val_simple_o3_medium_predictions_run2.json. Loading results.
Results for run 3 already exist at data/rft/predictions/val_simple_o3_medium_predictions_run3.json. Loading results.
Grading predictions: 100%|██████████| 100/100 [00:32<00:00,  3.10it/s]
{'total_samples': 100, 'accuracy': 0.6493800000000001}
Grading predictions: 100%|██████████| 100/100 [00:20<00:00,  4.89it/s]
{'total_samples': 100, 'accuracy': 0.6722}
Grading predictions: 100%|██████████| 100/100 [00:20<00:00,  4.80it/s]
{'total_samples': 100, 'accuracy': 0.7137200000000001}

我们现在可以将它们可视化!

avg_metrics_o4mini_medium_simple_prompt_model_grader_2, std_metrics_o4mini_medium_simple_prompt_model_grader_2 = average_and_std_metrics(metrics_o4mini_medium_simple_prompt_model_grader_2)
avg_metrics_o3_medium_simple_prompt_model_grader_2, std_metrics_o3_medium_simple_prompt_model_grader_2 = average_and_std_metrics(metrics_o3_medium_simple_prompt_model_grader_2)
avg_metrics_ftmodel_medium_simple_prompt_model_grader_2, std_metrics_ftmodel_medium_simple_prompt_model_grader_2 = average_and_std_metrics(metrics_ftmodel_medium_simple_prompt_model_grader_2)
model_metrics_avg = {
    "o4-mini-medium-simple-prompt": avg_metrics_o4mini_medium_simple_prompt_model_grader_2,
    "o3-medium-simple-prompt": avg_metrics_o3_medium_simple_prompt_model_grader_2,
    "ftmodel-medium-simple-prompt": avg_metrics_ftmodel_medium_simple_prompt_model_grader_2
}
model_metrics_std = {
    "o4-mini-medium-simple-prompt": std_metrics_o4mini_medium_simple_prompt_model_grader_2,
    "o3-medium-simple-prompt": std_metrics_o3_medium_simple_prompt_model_grader_2,
    "ftmodel-medium-simple-prompt": std_metrics_ftmodel_medium_simple_prompt_model_grader_2
}
plot_model_accuracies(model_metrics_avg, model_metrics_std, grader_title="Model Grader 2 Accuracy")
image generated by notebook
# Print mistakes where the model did not get the correct answer (score < 1.0)
mistakes = [
    {"index": i, **res}
    for i, res in enumerate(predictions_ftmodel_medium_simple_prompt_model_grader_2[0])
    if res["score"] < 1.0
]

print(f"\nTotal mistakes: {len(mistakes)}")
for m in mistakes[5:10]:
    print(f"\n[Sample {m['index']}]")
    print(f"  Model prediction: {m['model_prediction']}")
    print(f"  Reference answer: {m['reference_answer']}")
    print(f"  Score: {m['score']}")
Total mistakes: 80

[Sample 5]
  Model prediction: carotid duplex ultrasound
  Reference answer: carotid doppler
  Score: 0.5525

[Sample 6]
  Model prediction: under fixation due to insufficient fixation time
  Reference answer: incomplete fixation
  Score: 0.5037037037037037

[Sample 7]
  Model prediction: acute rheumatic fever due to group a streptococcal pharyngitis mediated by type ii hypersensitivity
  Reference answer: acute rheumatic fever
  Score: 0.85

[Sample 8]
  Model prediction: exposure (open) method of burn treatment
  Reference answer: heterograft application with sutures to secure it in place and daily washes, but no dressing
  Score: 0.3031007751937985

[Sample 9]
  Model prediction: beta-lactamase production leading to enzymatic inactivation of ampicillin
  Reference answer: production of beta-lactamase enzyme
  Score: 0.7555555555555555

我们发现经过微调后准确率提升了约5个百分点。查看前几个错误可以发现,评分器倾向于严格惩罚那些接近但不完全符合临床标准的答案,比如"颈动脉双功超声"与"颈动脉多普勒"之间的差异。对于较长的答案,即使内容正确(例如"β-内酰胺酶产生导致氨苄青霉素被酶解失活"),评分器也会扣分。

scores_o4 = [p['score'] for p in predictions_o4mini_medium_simple_prompt_model_grader_2[0]]
scores_ft = [p['score'] for p in predictions_ftmodel_medium_simple_prompt_model_grader_2[0]]

# Determine common bins for both histograms
all_scores = scores_o4 + scores_ft
bins = plt.hist(all_scores, bins=10, alpha=0)[1]

# Plot histograms and capture the counts
counts_o4, _, _ = plt.hist(
    scores_o4,
    bins=bins,
    alpha=0.6,
    label='o4-mini-medium-simple-prompt'
)
counts_ft, _, _ = plt.hist(
    scores_ft,
    bins=bins,
    alpha=0.6,
    label='ftmodel-medium-simple-prompt'
)

plt.title("Model Grader 2 Score Distribution by Model")
plt.xlabel("Score")
plt.ylabel("Count")
plt.ylim(top=25)
plt.legend()

# Print the bin counts
print("o4-mini-medium-simple-prompt bin counts:", counts_o4)
print("ftmodel-medium-simple-prompt bin counts:", counts_ft)
print("Max bin count (y-axis):", max(max(counts_o4), max(counts_ft)))
o4-mini-medium-simple-prompt bin counts: [ 4. 15.  9.  7.  7.  4.  3.  5. 22. 24.]
ftmodel-medium-simple-prompt bin counts: [ 8. 15.  7.  3.  9.  7.  8.  4. 19. 20.]
Max bin count (y-axis): 24.0
image generated by notebook

观察分数分布情况,我们发现RFT帮助模型的预测从低中分区间(0.4-0.5)转移到了中高分区间(0.5-0.6)。由于评分标准更注重临床相似性而非词汇匹配,根据我们的专家评分员判断,这种转变反映的是更强的医学推理能力,而不仅仅是更好的措辞。在0.9-1.0区间,尽管采取了缓解措施,仍观察到些许冗长表述,导致整体分数略有下降;但这些回答往往更完整、语义上也更契合。未来的评分流程可以更好地处理这类情况。

请注意,由于早期的combined_grader设计目标是奖励词汇准确性,其准确率提升不大——这符合预期。这种差距凸显了验证模型评分器的重要性,以及监控奖励黑客行为的必要性。在我们的案例中,我们使用o3进行评分行为抽查,但领域专家评审仍然至关重要。

模型的推理过程

在分析微调模型时,另一个重要点是推理摘要。模型可能会在这些摘要中提供关键信息,通过探索这些摘要来理解模型失败的原因,可以推动模型和评分系统提示的更新。下面,我们展示了模型生成的这类思维链摘要示例,以展示其回答问题的方式:

# Gather the scored predictions for each model
predictions_by_model = {
    "o4-mini": predictions_o4mini_medium_simple_prompt_model_grader_2,
    "o3": predictions_o3_medium_simple_prompt_model_grader_2,
    "ftmodel": predictions_ftmodel_medium_simple_prompt_model_grader_2,
}

for model_name, model_preds in predictions_by_model.items():
    # Flatten the list of runs into a single list of prediction dicts
    all_preds = [item for run in model_preds for item in run]
    reasoning_tokens = [p['reasoning_tokens_used'] for p in all_preds if 'reasoning_tokens_used' in p]
    mean_reasoning_tokens = np.mean(reasoning_tokens)
    print(f"Mean reasoning_tokens_used {model_name}: {mean_reasoning_tokens:.0f}")
Mean reasoning_tokens_used o4-mini: 424
Mean reasoning_tokens_used o3: 353
Mean reasoning_tokens_used ftmodel: 1820

经过微调的模型会消耗更多推理标记来深入思考问题。让我们借助推理摘要来可视化一个示例。

from IPython.display import Markdown, display
markdown_text = results_o4mini_model_grader_2[5]["summaries"]
display(Markdown(markdown_text))
Classifying staging type

The user provided a clinical scenario of a 35-year-old female with a 5 cm oral tumor and a 2 cm lymph node. They're asking how to stage it according to the TNM classification. This is a diagnosis query, so the correct answer type here is "diagnosis." Considering the tumor's size, it appears to be classified as T3 since it's greater than 4 cm. Thus, I think the staging might be Stage II, but I'll confirm that.
markdown_text = results_ft_model_grader_2[5]["summaries"]
display(Markdown(markdown_text))
Clarifying T staging for cancers

I’m digging into T staging for head and neck cancers in the oral cavity. So, T1 applies to tumors 2 cm or less, T2 for those over 2 cm but not more than 4 cm, and T3 is for tumors over 4 cm. T4a indicates invasion into adjacent structures. The patient's tumor measures 5 cm, which is over 4 cm. I’m not sure if it fits T3 or T4a, since T4a involves additional invasiveness, not just size.

Determining T and N staging

I’m looking at a 5 cm tumor in the oral cavity. It seems there’s no mention of invasion into adjacent structures, so I’m categorizing it as T3 due to its size. T4a usually means invasion into structures like bone or skin. According to the TNM classification, since I see no such invasion, T classification remains T3.

Moving on to N staging, I see there's a single lymph node of 2 cm on the same side; this fits the N1 classification for metastasis, as it’s less than 3 cm.

基础版o4-mini给出了一个快速答案,但没有展开其推理过程:它提到了肿瘤大小,却没有逐步套用实际的TNM分期规则,而且对结果似乎不太确定。相比之下,微调后的模型则更加深思熟虑,逐步分解T分期和N分期,并解释每一步判断的依据。后者显得更加谨慎,似乎学会了更细致地拆解病例描述。

进一步提升分数

基准模型o3和我们微调后的o4-mini有时会在相同样本上得零分——这个危险信号表明参考标签可能有误。在增加计算资源之前,应先投资数据质量:请领域专家重新标注噪声数据切片,分析模型的推理过程,然后优化评分提示。干净可靠的数据和系统化的更新,几乎总能比增加训练轮次带来更高的准确率。
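
作为起点,下面的小片段(示意)利用前面已加载的打分结果,找出在模型评分器2下o3与微调模型同时得零分的验证样本,作为交给领域专家复核标注的候选:

# Sketch: surface validation samples where both o3 and the fine-tuned model
# score exactly 0 under model grader 2: candidates for expert re-labeling.
def scores_by_question(scored_preds):
    return {p["input"]["messages"][0]["content"]: p["score"] for p in scored_preds}

ft_scores = scores_by_question(predictions_ftmodel_medium_simple_prompt_model_grader_2[0])
o3_scores = scores_by_question(predictions_o3_medium_simple_prompt_model_grader_2[0])

suspect_questions = [q for q, s in ft_scores.items() if s == 0.0 and o3_scores.get(q) == 0.0]
print(f"{len(suspect_questions)} samples scored 0 by both models")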


结论

我们已经探讨了如何设计评分器,为o4-mini在RFT期间提供所需的详细反馈。这些信号正是帮助模型真正学习并超越基准的关键。模型评分器在这方面可能非常强大——但前提是必须精心设计。草率的评分器或粗糙的数据会传递错误信号,将模型引导至错误方向。

你现在已准备好使用OpenAI API对自己的模型进行强化微调。我们期待看到你如何通过自定义评分器和更智能的模型行为来突破推理与工具使用的边界!

如需故障排除或后续步骤,请参阅OpenAI微调文档。