# Reward Modeling

## Introduction
Reinforcement learning from human feedback (RLHF) requires a reward function to guide the tuning of the generative model. In this example, we show how to use the LMFlow framework to train a reward model following the procedure in the InstructGPT paper: https://arxiv.org/abs/2203.02155 . We use the Dahoas/full-hh-rlhf dataset as an example, where each sample consists of a prompt and two responses from the assistant. In particular, the response labeled "chosen" is preferred over the response labeled "rejected". The dataset contains 112K training samples and 12.5K test samples. Here is an example sample from the dataset:
Prompt:

"Human: What kind of noises did dinosaurs make? Assistant: Humans and dinosaurs didn't live at the same time, so it's hard to say. The best way to find out what noises dinosaurs made would be Human: Yes, they did live at the same time Assistant: to guess, and that would probably require lots of reading and a certain amount of imagination, so we're not really prepared to do that. Human: You can't read Assistant:"

Chosen response: "You can read?"

Rejected response: "There's a lot of stuff humans don't know"
As an example, we have prepared 10K supervised fine-tuning samples and 12K reward modeling samples (with 10% held out for evaluation) under ./data/hh_rlhf.
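If you want to inspect the raw data yourself, the dataset can also be loaded directly from the Hugging Face Hub. The snippet below is only a minimal sketch, and it assumes the Dahoas/full-hh-rlhf columns are named "prompt", "chosen", and "rejected".

```python
from datasets import load_dataset

# Load the raw preference data (assumed columns: "prompt", "chosen", "rejected").
raw = load_dataset("Dahoas/full-hh-rlhf")
print(raw)  # expected splits: train (~112K samples) and test (~12.5K samples)

sample = raw["train"][0]
print(sample["prompt"])    # conversation context ending with "Assistant:"
print(sample["chosen"])    # preferred response
print(sample["rejected"])  # less preferred response
```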
## Step 1: Supervised Fine-Tuning (SFT)
We prepare the dataset for supervised fine-tuning by adding prefixes to the human and assistant inputs, which prompts model responses and simplifies post-processing. Here is an example two-sample dataset to illustrate the format.
```json
{
  "type": "conversation",
  "instances": [
    {"messages": [
      {"role": "user", "content": "Instruction: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know."},
      {"role": "assistant", "content": "If you have any questions about my rate or find it necessary to increase or decrease this project's scope, please let me know."}
    ]},
    {"messages": [
      {"role": "user", "content": "Instruction: Rewrite the given text and correct grammar, spelling, and punctuation errors. If you'd told me year ago that today I would finish a marathon, I would of laughed. Your support had a huge affect on me!"},
      {"role": "assistant", "content": "If you'd told me a year ago that today I would finish a marathon, I would have laughed. Your support had a huge effect on me!"}
    ]}
  ]
}
```
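If you are preparing your own SFT data rather than using the files under ./data/hh_rlhf, a dataset file in this format can be assembled with a few lines of Python. This is only a minimal sketch; the instruction/response pairs and the output file name are placeholders.

```python
import json

# Placeholder (instruction, response) pairs; replace with your own SFT data.
pairs = [
    ("Rewrite the sentence to be more concise: ...", "A more concise rewrite ..."),
    ("Correct the grammar of the following text: ...", "The corrected text ..."),
]

sft_dataset = {
    "type": "conversation",
    "instances": [
        {"messages": [
            {"role": "user", "content": f"Instruction: {instruction}"},
            {"role": "assistant", "content": response},
        ]}
        for instruction, response in pairs
    ],
}

# Write the dataset in the conversation format shown above.
with open("sft_dataset.json", "w") as f:
    json.dump(sft_dataset, f, ensure_ascii=False, indent=2)
```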
## Step 2: Reward Modeling
### Data Preparation

Reward modeling involves ranking responses to a given prompt, which usually requires human labeling. Fortunately, in this example each prompt already comes with two responses, one of which is preferred over the other. To create a new dataset, we modify the original dataset so that each instance contains two keys, "chosen" and "rejected", where the former holds the preferred conversation. Here is an example dataset with two samples.
{"instances": [
{
"chosen": [
{"content": "Please identify the top 5 rarest animals in the world.", "role": "user" },
{"content": "Do you mean animals that are really rare, or rare relative to the size of the human population?", "role": "assistant" },
{"content": "The ones that are really rare.", "role": "user" },
{"content": "I’m not sure I can answer this question very easily. I’ll try to give you a basic overview of animals that are rare, and then you can tell me if you think that’s good enough. \n\nIf we were to measure animal populations by the area of land they occupy, the number 1 animal would probably be the Giant Panda. Panda populations are generally estimated to be about 2.1 million Pandas in the wild, which is a pretty small number of animals when we’re thinking about Earth’s entire surface area. The number 2 animal would probably be the Great Auk. Great Auks used to live in North America, and their total population was probably in the low hundreds, but they are now extinct.", "role": "assistant"}
],
"rejected": [
{"content": "Please identify the top 5 rarest animals in the world.", "role": "user"},
{"content": "Do you mean animals that are really rare, or rare relative to the size of the human population?", "role": "assistant"},
{"content": "The ones that are really rare.", "role": "user"},
{"content": "Alright, here’s what I found:", "role": "assistant"}
]
},
{
"chosen": [
{"content": "How do I teach my kids to make forts at home?", "role": "user"},
{"content": "Do you mean you want to teach them to build a simple structure they can hide in? Or something more elaborate? I don’t know exactly what you mean by a “fort”, but maybe this gives you some ideas?", "role": "assistant"}
],
"rejected": [
{"content": "How do I teach my kids to make forts at home?", "role": "user"},
{"content": "There are many ways to make forts! The simplest might be to just pile up all the furniture in one room. This works if you have a single room in your house, although it might still be a bit challenging to build a taller and sturdier fort this way. The best way to build forts is to start with something easy, and then have fun exploring how you can improve it.", "role": "assistant"}
]
}
]
}
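Note that the build_dataset function used further below assumes a flattened layout in which each instance stores the full prompt-plus-response text under "positive" and "negative" keys. As a rough sketch, assuming the "prompt", "chosen", and "rejected" columns of Dahoas/full-hh-rlhf, such a file could be produced as follows; the output file name is a placeholder.

```python
import json

from datasets import load_dataset

raw = load_dataset("Dahoas/full-hh-rlhf", split="train")

# Flatten each comparison into {"positive": ..., "negative": ...},
# where "positive" holds the preferred prompt-plus-response text.
instances = [
    {
        "positive": example["prompt"] + example["chosen"],
        "negative": example["prompt"] + example["rejected"],
    }
    for example in raw
]

with open("rm_dataset.json", "w") as f:
    json.dump({"instances": instances}, f, ensure_ascii=False)
```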
To start from the model obtained in the previous SFT step, you can edit the run_reward_modeling.sh script and update "dataset_path" to the desired dataset. You can also modify the validation_split_percentage parameter to hold out the last portion of samples for evaluation. The build_dataset function in /examples/run_reward_modeling.py, shown below, splits the dataset into training and evaluation sets; you can edit this function if you want to prepare your own dataset when running the script.
```python
from datasets import load_dataset


def build_dataset(tokenizer, config):
    '''
    We assume that we have preprocessed the dataset appropriately such that the sample is organized as follows:
    {"positive": prompt + answer_positive, "negative": prompt + answer_negative}, where the positive response is preferred.
    '''
    def tokenize(sample):
        # Tokenize the preferred (positive) and less preferred (negative) texts separately.
        tokenized_pos = tokenizer(sample['positive'], truncation=True)
        tokenized_neg = tokenizer(sample['negative'], truncation=True)
        sample["chosen_input_ids"] = tokenized_pos["input_ids"]
        sample["chosen_attention_mask"] = tokenized_pos["attention_mask"]
        sample["rejected_input_ids"] = tokenized_neg["input_ids"]
        sample["rejected_attention_mask"] = tokenized_neg["attention_mask"]
        return sample

    ds = load_dataset("json", data_files=config.dataset_path, split="train", field="instances")
    ds = ds.map(tokenize, batched=False)
    # Drop pairs whose tokenized length exceeds the 512-token limit.
    ds = ds.filter(lambda x: len(x["chosen_input_ids"]) <= 512 and len(x["rejected_input_ids"]) <= 512)

    eval_dataset = None
    if config.validation_split_percentage > 0:
        # Hold out the last validation_split_percentage% of samples for evaluation.
        idx_gap = int((1 - config.validation_split_percentage / 100) * len(ds))
        train_dataset = ds.select(range(idx_gap))
        eval_dataset = ds.select(range(idx_gap, len(ds)))
    else:
        train_dataset = ds

    return train_dataset, eval_dataset
```
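As a usage sketch, the config object only needs the two fields referenced above; RewardDataConfig, the file path, and the tokenizer below are placeholders for whatever the run script actually passes in.

```python
from dataclasses import dataclass

from transformers import AutoTokenizer


@dataclass
class RewardDataConfig:
    # Placeholder path, e.g. the file produced by the conversion sketch above.
    dataset_path: str = "rm_dataset.json"
    # Hold out the last 10% of samples for evaluation.
    validation_split_percentage: int = 10


tokenizer = AutoTokenizer.from_pretrained("gpt2")  # use the tokenizer of your base model
train_dataset, eval_dataset = build_dataset(tokenizer, RewardDataConfig())
print(len(train_dataset), len(eval_dataset))
```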
We train the reward model with the following loss function, following the InstructGPT paper.

```python
loss = -nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
```
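In the notation of the InstructGPT paper, this corresponds (up to the normalization over comparisons per prompt) to loss(θ) = -E[log σ(r_θ(x, y_w) - r_θ(x, y_l))], where y_w is the preferred response. Below is a minimal sketch of how chosen_rewards and rejected_rewards could be produced, assuming a reward model that returns one scalar logit per sequence and a padded batch with the keys created by build_dataset; this is an illustration, not the LMFlow training loop.

```python
import torch.nn as nn


def pairwise_reward_loss(reward_model, batch):
    """Compute the pairwise ranking loss for one padded batch (illustrative sketch)."""
    # Score the preferred (chosen) and less preferred (rejected) sequences;
    # the model is assumed to output a single scalar logit per sequence.
    chosen_rewards = reward_model(
        input_ids=batch["chosen_input_ids"],
        attention_mask=batch["chosen_attention_mask"],
    ).logits.squeeze(-1)
    rejected_rewards = reward_model(
        input_ids=batch["rejected_input_ids"],
        attention_mask=batch["rejected_attention_mask"],
    ).logits.squeeze(-1)
    # The loss pushes the chosen response to receive a higher reward than the rejected one.
    return -nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
```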
The reward modeling script can be run with

```bash
./scripts/run_reward_modeling.sh
```
## Examples
We train four reward models based on LLaMA-13B, LLaMA-7B, GPT-NEO-2.7B, and GPT-NEO-1.3B using the hh-rlhf dataset. The models are first supervised fine-tuned on the training dataset. Reward modeling then uses the 112K training samples and is evaluated on the 12.5K test samples.
The SFT step appears to be critical, and the number of epochs used for SFT makes a difference. The most successful model we obtained was initialized from LLaMA-13B after two epochs of SFT on the training dataset. For reward modeling, we use LoRA with rank 16 (a configuration sketch is given after the table below). Surprisingly, increasing the LoRA rank to 32 or even 128 does not significantly improve evaluation accuracy. We also find that the choice of batch size has no significant effect on the training results. Moreover, we observe slight overfitting during the second epoch of reward modeling.
| Model | Eval Accuracy | Training Log | Remarks |
|---|---|---|---|
| LLaMA-13B | 84.55% | See https://wandb.ai/ianz2020/huggingface/runs/bg677mxa | RM based on LLaMA with 2 epochs of SFT |
| LLaMA-13B | 81.80% | See https://wandb.ai/ianz2020/huggingface/runs/ka9v1ywd | RM based on LLaMA with 1 epoch of SFT |
| LLaMA-13B | 71.64% | See https://wandb.ai/ianz2020/huggingface/runs/lntwmcyd | RM based on LLaMA without SFT |
| LLaMA-7B | 79.52% | See https://wandb.ai/weixiong5237/huggingface/runs/t3uwm8yp | - |
| LLaMA-7B | 71.64% | See https://wandb.ai/weixiong5237/huggingface/runs/p2ju3r1a | RM based on LLaMA without SFT |
| GPT-NEO-2.7B | 69.24% | See https://wandb.ai/weixiong5237/huggingface/runs/8fc1rcf8 | - |
| GPT-NEO-1.3B | 65.58% | See https://wandb.ai/weixiong5237/huggingface/runs/7oemwynu | Trained with only 10K samples |
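For reference, a rank-16 LoRA adapter for a scalar-output reward head could be configured as in the sketch below. This assumes the peft library; the gpt2 base model, lora_alpha, and lora_dropout values are illustrative placeholders rather than the exact settings used for the runs above.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Base model with a single scalar output used as the reward head
# (replace "gpt2" with the SFT checkpoint you want to start from).
base_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence-level scalar prediction
    r=16,                        # LoRA rank, as in the experiments above
    lora_alpha=32,               # placeholder value
    lora_dropout=0.05,           # placeholder value
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```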