Customize Conversation Template

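For beginners: why do we need a template?
Almost all large language models (LLMs) today perform one simple task: predicting the next "word". To make the interaction between user and model smoother, developers use a trick: they add special "words" to the input text (handled in the back end, so users of services such as ChatGPT never see them) that "tell" the model what the user said earlier and ask it to respond as an assistant. These "hidden words" are called a "template".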

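We provide the flexibility to customize the conversation template. You can define your own conversation template by following the steps below: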

1. Decompose your conversations

Suppose you want the conversation between the user and the assistant to look like this:

<bos>System:
You are a chatbot developed by LMFlow team.

User:
Who are you?

Assistant:
I am a chatbot developed by LMFlow team.<eos>

User:
How old are you?

Assistant:
I don't age like humans do. I exist as a piece of software, so I don't have a concept of age in the traditional sense.<eos>

It is easy to abstract the format of each message:

  • System message: System:\n{{content}}\n\n

  • User message: User:\n{{content}}\n\n

  • Assistant message: Assistant:\n{{content}}\n\n

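In addition, there is a bos token at the beginning of the conversation session, and an eos token marks the end of each assistant message.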
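The decomposition above can be sketched in plain Python, without any LMFlow imports. This is only an illustration: `render_conversation`, `ROLE_FORMATS`, and the literal `<bos>` / `<eos>` strings are hypothetical stand-ins, and a real tokenizer would supply its own special tokens. The sketch appends eos after the formatted assistant message, following the component order used in the step 3 template.

```python
# Placeholders standing in for the tokenizer's real special tokens (assumption).
BOS = "<bos>"
EOS = "<eos>"

# Per-role formats taken from the list above; {content} marks the message text.
ROLE_FORMATS = {
    "system": "System:\n{content}\n\n",
    "user": "User:\n{content}\n\n",
    "assistant": "Assistant:\n{content}\n\n",
}


def render_conversation(system, messages):
    """Concatenate bos, the optional system message, and each formatted turn."""
    parts = [BOS]
    if system is not None:
        parts.append(ROLE_FORMATS["system"].format(content=system))
    for message in messages:
        parts.append(ROLE_FORMATS[message["role"]].format(content=message["content"]))
        if message["role"] == "assistant":
            # eos closes every assistant message, as in the step 3 template.
            parts.append(EOS)
    return "".join(parts)


text = render_conversation(
    system="You are a chatbot developed by LMFlow team.",
    messages=[
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a chatbot developed by LMFlow team."},
    ],
)
print(text)
```

Keeping the per-role formats in one mapping makes it easy to see that only the role prefix differs between message types, which is why a single formatter class can serve all three roles.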

2. Choose a proper Formatter

Recall the requirements for a conversation dataset:

  • system: Optional[string].

  • tools: Optional[List[string]].

  • messages: List[Dict].

    • role: string.

    • content: string.

The system, user, and assistant messages are all strings, so we can use StringFormatter for each of them.
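As a concrete illustration of these requirements, here is a minimal sketch of one conversation instance written as a Python dictionary, together with a small type check. The helper `check_instance` is hypothetical and not part of LMFlow, and representing an absent `tools` field as `None` is an assumption made for this example.

```python
from typing import Any, Dict

# One conversation instance shaped after the requirements above.
instance: Dict[str, Any] = {
    "system": "You are a chatbot developed by LMFlow team.",  # optional
    "tools": None,  # optional; an absent value is modeled as None here (assumption)
    "messages": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a chatbot developed by LMFlow team."},
    ],
}


def check_instance(item: Dict[str, Any]) -> bool:
    """Hypothetical helper: True if item matches the field types listed above."""
    system = item.get("system")
    tools = item.get("tools")
    messages = item.get("messages")
    if system is not None and not isinstance(system, str):
        return False
    if tools is not None and not (
        isinstance(tools, list) and all(isinstance(t, str) for t in tools)
    ):
        return False
    if not isinstance(messages, list):
        return False
    # Every message needs a string role and a string content.
    return all(
        isinstance(m, dict)
        and isinstance(m.get("role"), str)
        and isinstance(m.get("content"), str)
        for m in messages
    )


print(check_instance(instance))
```

Because every `content` value is a plain string, a formatter only needs to substitute that string into a fixed pattern.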

3. Build the template

All preset templates are located in src/lmflow/utils/conversation_template.

Within the template file, define your own template as follows:

from .base import StringFormatter, TemplateComponent, ConversationTemplate


YOUR_TEMPLATE = ConversationTemplate(
    template_name='your_template_name',
    user_formatter=StringFormatter(
        template=[
            TemplateComponent(type='string', content='User:\n{{content}}\n\n')
        ]
    ),
    assistant_formatter=StringFormatter(
        template=[
            TemplateComponent(type='string', content='Assistant:\n{{content}}\n\n'),
            TemplateComponent(type='token', content='eos_token') # this will add the eos token at the end of every assistant message
            # please refer to the docstring of the `TemplateComponent` class to 
            # see the difference between different types of components.
        ]
    ),
    system_formatter=StringFormatter(
        template=[
            TemplateComponent(type='string', content='System:\n{{content}}\n\n')
        ]
    ),
    # For models that have ONLY ONE bos token at the beginning of
    # a conversation session (not a conversation pair), user can
    # specify a special starter to add that starter to the very
    # beginning of the conversation session. 
    # eg:
    #   llama-2: <s> and </s> at every pair of conversation 
    #   v.s.
    #   llama-3: <|begin_of_text|> only at the beginning of a session
    special_starter=TemplateComponent(type='token', content='bos_token'),

    # Similar to the special starter... (just for illustration, commented out
    # since it is not necessary for the template proposed above)
    # special_stopper=TemplateComponent(type='token', content='eos_token')
)

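Feel free to create your own template by inheriting from the ConversationTemplate class. The Llama-2 and Llama-3 templates, compared side by side, make a good reference.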

4. Register your template

After defining your own template, you need to register it in src/lmflow/utils/conversation_template/__init__.py.

# ...
from .your_template_file import YOUR_TEMPLATE


PRESET_TEMPLATES = {
    #...
    'your_template_name': YOUR_TEMPLATE,
}
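The registry is an ordinary dictionary keyed by template name, so the key chosen here must match the name passed on the command line later. The sketch below illustrates that lookup pattern with a stand-in dictionary. `get_template` is a hypothetical helper, and how LMFlow itself resolves the name internally is not shown in this document.

```python
# Stand-in registry: in LMFlow the values would be ConversationTemplate objects.
PRESET_TEMPLATES = {
    "your_template_name": "YOUR_TEMPLATE (placeholder object)",
}


def get_template(name):
    """Hypothetical lookup helper; not part of LMFlow's documented API."""
    try:
        return PRESET_TEMPLATES[name]
    except KeyError:
        # List the registered names so a typo is easy to spot.
        known = ", ".join(sorted(PRESET_TEMPLATES))
        raise ValueError(f"Unknown template {name!r}; registered: {known}") from None


template = get_template("your_template_name")
```

A mistyped name fails immediately with the list of registered keys, which is a cheap check to run before starting a long finetuning job.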

5. Use your template

You are all set! Specify the template name in your finetuning script, for example:

./scripts/run_finetune.sh \
    --model_name_or_path path_to_your_model \
    --dataset_path your_conversation_dataset \
    --conversation_template your_template_name \
    --output_model_path output_models/your_model