End-to-End Workflow with torchtune

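In this tutorial, we'll walk through an end-to-end example of how you can fine-tune, evaluate, optionally quantize, and then run generation with your favorite LLM using torchtune. We'll also go over how you can use some popular tools and libraries from the community seamlessly with torchtune.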

What this tutorial will cover:

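Different types of recipes available in torchtune beyond fine-tuning
An end-to-end example connecting all of these recipes
Different tools and libraries you can use with torchtune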

Prerequisites

Be familiar with the overview of torchtune
Make sure to install torchtune
Understand concepts such as configs and checkpoints

Fine-tune your model

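First, let's download a model using the tune CLI. The following command downloads the Llama3.2 3B Instruct model from the Hugging Face Hub and saves it to the local filesystem. The Hub repository contains both the original weights (consolidated.00.pth) and weights compatible with the from_pretrained() API (*.safetensors). We don't need both, so we'll ignore the original weights when downloading.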

$ tune download meta-llama/Llama-3.2-3B-Instruct --ignore-patterns "original/consolidated.00.pth"
Successfully downloaded model repo and wrote to the following locations:
/tmp/Llama-3.2-3B-Instruct/.cache
/tmp/Llama-3.2-3B-Instruct/.gitattributes
/tmp/Llama-3.2-3B-Instruct/LICENSE.txt
/tmp/Llama-3.2-3B-Instruct/README.md
/tmp/Llama-3.2-3B-Instruct/USE_POLICY.md
/tmp/Llama-3.2-3B-Instruct/config.json
/tmp/Llama-3.2-3B-Instruct/generation_config.json
/tmp/Llama-3.2-3B-Instruct/model-00001-of-00002.safetensors
...
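
The --ignore-patterns flag skips repository files whose names match a glob pattern. The sketch below illustrates that kind of filtering with Python's fnmatch module. The helper name filter_ignored and the file list are made up for illustration, and the fnmatch-style matching is an assumption rather than something taken from the tune CLI source.

```python
from fnmatch import fnmatch


def filter_ignored(filenames, ignore_patterns):
    """Return the filenames that match none of the glob-style ignore patterns."""
    return [
        name
        for name in filenames
        if not any(fnmatch(name, pattern) for pattern in ignore_patterns)
    ]


# Hypothetical repository listing, loosely based on the download output above.
repo_files = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "original/consolidated.00.pth",
    "original/tokenizer.model",
]

kept = filter_ignored(repo_files, ["original/consolidated.00.pth"])
print(kept)
```

Note that original/tokenizer.model is kept, which matters later because the evaluation and generation configs point the tokenizer at that file.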

Note

To see the full list of other models you can fine-tune out of the box with torchtune, check out our models page.

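In this tutorial, we'll fine-tune the model using LoRA. LoRA is a parameter-efficient fine-tuning technique that is especially helpful when you don't have much GPU memory to work with. LoRA freezes the base LLM and adds a very small percentage of learnable parameters, which keeps the memory associated with gradients and optimizer state low. With torchtune, you should be able to fine-tune the Llama-3.2-3B-Instruct model with LoRA in bfloat16 using less than 16GB of GPU memory on an RTX 3090/4090. For more information on how to use LoRA, take a look at our LoRA tutorial.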
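
To get a feel for why LoRA keeps gradient and optimizer memory low, the sketch below compares the trainable parameters a rank-r adapter adds to one linear layer against the size of the frozen weight. The layer size and rank are illustrative assumptions, not values read from the torchtune config, and the helper names are hypothetical.

```python
def lora_param_count(in_dim, out_dim, rank):
    """Trainable parameters LoRA adds to one linear layer.

    A has shape (rank, in_dim) and B has shape (out_dim, rank).
    """
    return rank * in_dim + out_dim * rank


def full_param_count(in_dim, out_dim):
    """Parameters in the frozen (out_dim, in_dim) weight matrix, ignoring bias."""
    return in_dim * out_dim


# Illustrative sizes only: a square 3072 x 3072 projection with LoRA rank 8.
in_dim, out_dim, rank = 3072, 3072, 8
trainable = lora_param_count(in_dim, out_dim, rank)
frozen = full_param_count(in_dim, out_dim)
print(f"trainable: {trainable}, frozen: {frozen}, ratio: {trainable / frozen:.4%}")
```

Only the adapter parameters need gradients and optimizer state, so under these assumptions the per-layer training state is well under one percent of what full fine-tuning would need.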

Let's look for the right config for this use case by using the tune CLI.

$ tune ls
RECIPE                                  CONFIG
full_finetune_single_device             llama2/7B_full_low_memory
                                        code_llama2/7B_full_low_memory
                                        llama3/8B_full_single_device
                                        llama3_1/8B_full_single_device
                                        llama3_2/1B_full_single_device
                                        llama3_2/3B_full_single_device
                                        mistral/7B_full_low_memory
                                        phi3/mini_full_low_memory
                                        qwen2/7B_full_single_device
                                        ...


full_finetune_distributed               llama2/7B_full
                                        llama2/13B_full
                                        llama3/8B_full
                                        llama3_1/8B_full
                                        llama3_2/1B_full
                                        llama3_2/3B_full
                                        mistral/7B_full
                                        gemma2/9B_full
                                        gemma2/27B_full
                                        phi3/mini_full
                                        qwen2/7B_full
                                        ...

lora_finetune_single_device             llama2/7B_lora_single_device
                                        llama2/7B_qlora_single_device
                                        llama3/8B_lora_single_device
...

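We'll fine-tune using our single-device LoRA recipe and use the standard settings from the default config.

This will fine-tune our model using batch_size=4 and dtype=bfloat16. With these settings, the model should have a peak memory usage of about 16GB and a total training time of around two to three hours per epoch.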

$ tune run lora_finetune_single_device --config llama3_2/3B_lora_single_device
Setting manual seed to local seed 3977464327. Local seed is seed + rank = 3977464327 + 0
Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
Writing logs to /tmp/torchtune/llama3_2_3B/lora_single_device/logs/log_1734708879.txt
Model is initialized with precision torch.bfloat16.
Memory stats after model init:
        GPU peak memory allocation: 6.21 GiB
        GPU peak memory reserved: 6.27 GiB
        GPU peak memory active: 6.21 GiB
Tokenizer is initialized from file.
Optimizer and loss are initialized.
Loss is initialized.
Dataset and Sampler are initialized.
Learning rate scheduler is initialized.
Profiling disabled.
Profiler config after instantiation: {'enabled': False}
1|3|Loss: 1.943998098373413:   0%|                    | 3/1617 [00:21<3:04:47,  6.87s/it]
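
The time estimate in the progress bar can be reproduced with simple arithmetic: remaining steps multiplied by seconds per step. The sketch below uses the numbers from the last log line. The helper names are hypothetical, and the estimate will drift as the measured step time changes during training.

```python
def eta_seconds(total_steps, completed_steps, seconds_per_step):
    """Estimate remaining time from the remaining step count and the current rate."""
    return (total_steps - completed_steps) * seconds_per_step


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS, truncating fractional seconds."""
    whole = int(seconds)
    hours, remainder = divmod(whole, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Values read from the progress bar above: 3 of 1617 steps done at 6.87 s/it.
remaining = eta_seconds(total_steps=1617, completed_steps=3, seconds_per_step=6.87)
print(format_hms(remaining))  # close to the 3:04:47 shown by the progress bar
```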

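Congratulations on training your model! Let's take a look at the artifacts produced by torchtune. A simple way of doing this is to run tree -a path/to/outputdir, which should show something like the tree below. There are three types of folders:

recipe_state: Holds recipe_state.pt with the information necessary to restart from the last intermediate epoch. For more information, please check our deep-dive on checkpointing in torchtune.
logs: Contains all the logging output from your training run: loss, memory, exceptions, etc.
epoch_{}: Contains your trained model weights plus model metadata. If you are running inference or pushing to a model hub, you should use this folder directly.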

$ tree -a /tmp/torchtune/llama3_2_3B/lora_single_device
/tmp/torchtune/llama3_2_3B/lora_single_device
├── epoch_0
│   ├── adapter_config.json
│   ├── adapter_model.pt
│   ├── adapter_model.safetensors
│   ├── config.json
│   ├── ft-model-00001-of-00002.safetensors
│   ├── ft-model-00002-of-00002.safetensors
│   ├── generation_config.json
│   ├── LICENSE.txt
│   ├── model.safetensors.index.json
│   ├── original
│   │   ├── orig_params.json
│   │   ├── params.json
│   │   └── tokenizer.model
│   ├── original_repo_id.json
│   ├── README.md
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.json
│   └── USE_POLICY.md
├── epoch_1
│   ├── adapter_config.json
│   ...
├── logs
│   └── log_1734652101.txt
└── recipe_state
    └── recipe_state.pt
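
Several later steps ask you to point a path at your chosen epoch folder. As a convenience, the sketch below picks the epoch_{n} folder with the highest n from an output directory. The helper latest_epoch_dir is hypothetical and not part of torchtune, and the demo runs on a throwaway directory that only mimics the layout above.

```python
import re
import tempfile
from pathlib import Path


def latest_epoch_dir(output_dir):
    """Return the epoch_<n> subfolder with the highest n, or None if there is none."""
    epoch_dirs = []
    for child in Path(output_dir).iterdir():
        match = re.fullmatch(r"epoch_(\d+)", child.name)
        if child.is_dir() and match:
            epoch_dirs.append((int(match.group(1)), child))
    if not epoch_dirs:
        return None
    return max(epoch_dirs, key=lambda item: item[0])[1]


# Demo on a temporary directory shaped like the tree above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("epoch_0", "epoch_1", "logs", "recipe_state"):
        (Path(tmp) / name).mkdir()
    latest_name = latest_epoch_dir(tmp).name
print(latest_name)
```

Comparing the parsed integer rather than the folder name keeps the ordering correct once there are ten or more epochs.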

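Let's understand the files:

adapter_model.safetensors and adapter_model.pt are your trained LoRA adapter weights. We save a duplicate .pt version to make it easier to resume from a checkpoint.
ft-model-{}-of-{}.safetensors are your trained full model weights (not adapters). When fine-tuning with LoRA, these are only present if save_adapter_weights_only=False. In that case, we merge the base model with the trained adapters, making inference easier.
adapter_config.json is used by Hugging Face PEFT when loading an adapter (more on that later).
model.safetensors.index.json is used by Hugging Face from_pretrained() when loading the model weights (more on that later).
All other files were originally in the checkpoint_dir. They are automatically copied during training. Files over 100MiB that end in .safetensors, .pth, .pt, or .bin are ignored, which keeps the folder lightweight.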
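
Merging an adapter into the base weights follows the standard LoRA formulation: the merged weight is W + (alpha / rank) * B A. The sketch below shows this on a toy matrix using plain Python lists. It illustrates the general technique under that assumption and is not torchtune's merging code; all numbers and helper names are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a) without modifying the inputs."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2 x 2 example with rank 1.
weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, shape (out_dim, in_dim)
lora_a = [[1.0, 2.0]]              # shape (rank, in_dim)
lora_b = [[0.5], [0.25]]           # shape (out_dim, rank)
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, inference needs only a single matrix per layer, which is why the merged ft-model files can be loaded like an ordinary checkpoint.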

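Evaluate your model

We've fine-tuned a model. But how well does this model really do? Let's determine this through structured evaluation and by playing around with it.

Run evals using EleutherAI's Eval Harness

torchtune integrates with EleutherAI's evaluation harness. An example of this is available through the eleuther_eval recipe. In this tutorial, we'll use this recipe directly by modifying its associated config, eleuther_evaluation.yaml.

Note

For this section of the tutorial, you should first run pip install "lm_eval>=0.4.5" to install the EleutherAI evaluation harness.

Since we plan to update all of the checkpoint files to point to our fine-tuned checkpoints, let's first copy the config to our local working directory so we can make changes.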

$ tune cp eleuther_evaluation ./custom_eval_config.yaml
Copied file to custom_eval_config.yaml

Notice that we are using the merged weights, and not the LoRA adapters.

# TODO: update to your desired epoch
output_dir: /tmp/torchtune/llama3_2_3B/lora_single_device/epoch_0

# Tokenizer
tokenizer:
    _component_: torchtune.models.llama3.llama3_tokenizer
    path: ${output_dir}/original/tokenizer.model

model:
    # Notice that we don't pass the lora model. We are using the merged weights.
    _component_: torchtune.models.llama3_2.llama3_2_3b

checkpointer:
    _component_: torchtune.training.FullModelHFCheckpointer
    checkpoint_dir: ${output_dir}
    checkpoint_files: [
        ft-model-00001-of-00002.safetensors,
        ft-model-00002-of-00002.safetensors,
    ]
    output_dir: ${output_dir}
    model_type: LLAMA3_2

### OTHER PARAMETERS -- NOT RELATED TO THIS CHECKPOINT

# Environment
device: cuda
dtype: bf16
seed: 1234 # It is not recommended to change this seed, b/c it matches EleutherAI's default seed

# EleutherAI specific eval args
tasks: ["truthfulqa_mc2"]
limit: null
max_seq_length: 4096
batch_size: 8
enable_kv_cache: True

# Quantization specific args
quantizer: null
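
The ${output_dir} references in the config are interpolations: they are replaced by the value of the top-level output_dir key, so changing that single field repoints the tokenizer and checkpointer paths. The sketch below imitates this with a regular expression on a flattened, hypothetical subset of the config. The real resolution is done by the config system; this stand-in only handles flat top-level string keys, and the key names tokenizer_path and checkpoint_dir are made up for the demo.

```python
import re


def resolve(config, key):
    """Resolve ${name} references in config[key] against top-level string values."""

    def substitute(match):
        return resolve(config, match.group(1))

    return re.sub(r"\$\{(\w+)\}", substitute, config[key])


# Flattened, hypothetical subset of the config above.
config = {
    "output_dir": "/tmp/torchtune/llama3_2_3B/lora_single_device/epoch_0",
    "tokenizer_path": "${output_dir}/original/tokenizer.model",
    "checkpoint_dir": "${output_dir}",
}
tokenizer_path = resolve(config, "tokenizer_path")
print(tokenizer_path)
```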

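For this tutorial, we'll use the truthfulqa_mc2 task from the harness.

This task measures a model's propensity to be truthful when answering questions. It measures the model's zero-shot accuracy on a question followed by one or more true responses and one or more false responses.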
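
As a rough illustration of how a score of this kind can be computed, the sketch below takes the log-likelihood the model assigns to each candidate answer and returns the share of probability mass that falls on the true answers. This follows the way the MC2 metric is commonly described and is an assumption, not code copied from the harness; the log-likelihoods are made up.

```python
import math


def mc2_score(true_loglikelihoods, false_loglikelihoods):
    """Normalized probability mass assigned to the true answers."""
    p_true = [math.exp(ll) for ll in true_loglikelihoods]
    p_false = [math.exp(ll) for ll in false_loglikelihoods]
    return sum(p_true) / (sum(p_true) + sum(p_false))


# Made-up log-likelihoods for one question with two true answers and one false answer.
score = mc2_score(
    true_loglikelihoods=[math.log(0.6), math.log(0.2)],
    false_loglikelihoods=[math.log(0.2)],
)
print(round(score, 3))
```

A model that cannot tell true from false answers apart would land near the fraction of true answers, so the score is most informative relative to that baseline.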

$ tune run eleuther_eval --config ./custom_eval_config.yaml
[evaluator.py:324] Running loglikelihood requests
...

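Generate some output

We've run some evaluations and the model seems to be doing well. But does it really generate meaningful text for the prompts you care about? Let's find out!

For this, we'll use our generate recipe and the associated config.

First, let's copy the config to our local working directory so we can make changes.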

$ tune cp generation ./custom_generation_config.yaml
Copied file to custom_generation_config.yaml

Let's modify custom_generation_config.yaml to include the following changes. Again, you only need to replace two fields: output_dir and checkpoint_files.

output_dir: /tmp/torchtune/llama3_2_3B/lora_single_device/epoch_0

# Tokenizer
tokenizer:
    _component_: torchtune.models.llama3.llama3_tokenizer
    path: ${output_dir}/original/tokenizer.model
    prompt_template: null

model:
    # Notice that we don't pass the lora model. We are using the merged weights.
    _component_: torchtune.models.llama3_2.llama3_2_3b

checkpointer:
    _component_: torchtune.training.FullModelHFCheckpointer
    checkpoint_dir: ${output_dir}
    checkpoint_files: [
        ft-model-00001-of-00002.safetensors,
        ft-model-00002-of-00002.safetensors,
    ]
    output_dir: ${output_dir}
    model_type: LLAMA3_2

### OTHER PARAMETERS -- NOT RELATED TO THIS CHECKPOINT

device: cuda
dtype: bf16

seed: 1234

# Generation arguments; defaults taken from gpt-fast
prompt:
    system: null
    user: "Tell me a joke. "
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300

enable_kv_cache: True

quantizer: null

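Once the config is updated, let's kick off generation! We'll use the sampling settings from the config above, top_k=300 and temperature=0.6. These parameters control how the sampling probabilities are computed. We recommend inspecting the model with these settings before playing around with them.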
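
To make the roles of top_k and temperature concrete, the sketch below samples one token index from a list of logits: it divides the logits by the temperature, keeps the top_k largest, and samples from a softmax over what remains. This is a generic illustration of the technique, not the code used by the generate recipe, and the logits are made up.

```python
import math
import random


def sample_top_k(logits, temperature, top_k, rng=random):
    """Sample a token index from the top_k highest logits after temperature scaling."""
    scaled = [logit / temperature for logit in logits]
    # Keep only the indices of the top_k largest scaled logits.
    top_indices = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax weights over the kept logits (subtract the max for numerical stability).
    max_logit = max(scaled[i] for i in top_indices)
    weights = [math.exp(scaled[i] - max_logit) for i in top_indices]
    return rng.choices(top_indices, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
greedy_like = sample_top_k(logits, temperature=0.6, top_k=1)
print(greedy_like)
```

Lower temperatures sharpen the distribution toward the highest-scoring tokens, and smaller top_k values remove unlikely tokens from consideration entirely; with top_k=1 the procedure reduces to greedy decoding.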

$ tune run generate --config ./custom_generation_config.yaml prompt.user="Tell me a joke. "
Tell me a joke. Here's a joke for you:

What do you call a fake noodle?

An impasta!

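Introduce some quantization

We rely on torchao for post-training quantization. To quantize the fine-tuned model after installing torchao, we can run the following command: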

# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())

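After quantization, we rely on torch.compile for speedups. For more details, please see this example usage.

torchao also provides this table listing performance and accuracy results for llama2 and llama3.

For Llama models, you can run generation directly in torchao on the quantized model using their generate.py script, as discussed in this readme. This way you can compare your own results to those in the previously linked table.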
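
For intuition about what weight-only 4-bit quantization does to a tensor, the sketch below maps one small group of floats to integer codes in the range 0 to 15 and back. It is a deliberately simplified illustration with a single scale and offset per group; it is not torchao's algorithm or kernel, and the weights and helper names are made up.

```python
def quantize_int4(values):
    """Map floats to integer codes in [0, 15] using one scale and offset for the group."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_int4(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + code * scale for code in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 0.7]  # made-up weights for one quantization group
codes, scale, lo = quantize_int4(weights)
restored = dequantize_int4(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

Each weight is stored in 4 bits instead of 16, at the cost of a rounding error of at most half a quantization step, which is why it is worth re-running evaluation after quantizing.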

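Using your model in the wild

Let's say we're happy with how our model is performing at this point and we want to do something with it: productionize it for serving, publish it on the Hugging Face Hub, and so on. As we mentioned above, one of the benefits of handling the checkpoint conversion is that you can work directly with standard formats. This helps with interoperability with other libraries, since torchtune doesn't add yet another format to the mix.

Use with Hugging Face `from_pretrained()`

Case 1: Hugging Face using base model + trained adapters

Here we load the base model from the Hugging Face model hub. Then we load the adapters on top of it using PeftModel. It will look for the file adapter_model.safetensors for the weights and adapter_config.json to determine where to insert them.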

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

#TODO: update it to your chosen epoch
trained_model_path = "/tmp/torchtune/llama3_2_3B/lora_single_device/epoch_0"

# Define the model and adapter paths
original_model_name = "meta-llama/Llama-3.2-3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(original_model_name)

# huggingface will look for adapter_model.safetensors and adapter_config.json
peft_model = PeftModel.from_pretrained(model, trained_model_path)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(original_model_name)

# Function to generate text
def generate_text(model, tokenizer, prompt, max_length=50):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

prompt = "tell me a joke: '"
print("PEFT model output:", generate_text(peft_model, tokenizer, prompt))

Case 2: Hugging Face using merged weights

In this case, Hugging Face will check model.safetensors.index.json to determine which files it should load.

from transformers import AutoModelForCausalLM, AutoTokenizer

#TODO: update it to your chosen epoch
trained_model_path = "/tmp/torchtune/llama3_2_3B/lora_single_device/epoch_0"

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=trained_model_path,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(trained_model_path)


# Function to generate text
def generate_text(model, tokenizer, prompt, max_length=50):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


prompt = "Complete the sentence: 'Once upon a time...'"
print("Merged model output:", generate_text(model, tokenizer, prompt))
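
If loading the merged weights fails, it can help to check which shard files the index refers to. The sketch below reads an index file and lists the distinct shard filenames. It assumes the usual sharded-checkpoint layout, in which a weight_map key maps parameter names to shard files. The helper shard_files is hypothetical, and the demo uses a tiny hand-written index with placeholder parameter names.

```python
import json
import tempfile
from pathlib import Path


def shard_files(index_path):
    """Return the sorted, de-duplicated shard filenames listed in a safetensors index."""
    index = json.loads(Path(index_path).read_text())
    return sorted(set(index["weight_map"].values()))


# Demo with a hand-written index; the parameter names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "model.safetensors.index.json"
    index_path.write_text(json.dumps({
        "metadata": {"total_size": 0},
        "weight_map": {
            "layer_a.weight": "ft-model-00001-of-00002.safetensors",
            "layer_b.weight": "ft-model-00001-of-00002.safetensors",
            "layer_c.weight": "ft-model-00002-of-00002.safetensors",
        },
    }))
    shards = shard_files(index_path)
print(shards)
```

Every filename returned this way should exist in the epoch folder next to the index file.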

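Use with vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. It includes many useful features, such as state-of-the-art serving throughput, continuous batching of incoming requests, quantization, and speculative decoding.

The library will load any .safetensors file. Since the epoch folder mixes the full model weights and the adapter weights, we have to delete the adapter weights for the model to load successfully.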

rm /tmp/torchtune/llama3_2_3B/lora_single_device/epoch_0/adapter_model.safetensors
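
If you would rather do this cleanup from a script, the sketch below deletes the adapter file from a given epoch folder and reports whether anything was removed. The helper remove_adapter_weights is hypothetical, and the demo runs on a temporary folder containing empty stand-in files.

```python
import tempfile
from pathlib import Path


def remove_adapter_weights(epoch_dir):
    """Delete adapter_model.safetensors from epoch_dir if present; return True if removed."""
    adapter_file = Path(epoch_dir) / "adapter_model.safetensors"
    if adapter_file.exists():
        adapter_file.unlink()
        return True
    return False


# Demo on a temporary folder with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_model.safetensors").touch()
    (Path(tmp) / "ft-model-00001-of-00002.safetensors").touch()
    removed_first = remove_adapter_weights(tmp)
    removed_second = remove_adapter_weights(tmp)
    remaining_files = sorted(p.name for p in Path(tmp).iterdir())
print(removed_first, removed_second, remaining_files)
```

Because the function checks for the file first, it is safe to call more than once on the same folder.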

Now we can run the following script:

from vllm import LLM, SamplingParams

def print_outputs(outputs):
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    print("-" * 80)

#TODO: update it to your chosen epoch
llm = LLM(
    model="/tmp/torchtune/llama3_2_3B/lora_single_device/epoch_0",
    load_format="safetensors",
    kv_cache_dtype="auto",
)
sampling_params = SamplingParams(max_tokens=16, temperature=0.5)

conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]
outputs = llm.chat(conversation, sampling_params=sampling_params, use_tqdm=False)
print_outputs(outputs)

Uploading your model to the Hugging Face Hub

Your new model is working great and you want to share it with the world. The easiest way to do this is to use huggingface_hub.

import huggingface_hub
api = huggingface_hub.HfApi()

#TODO: update it to your chosen epoch
trained_model_path = "/tmp/torchtune/llama3_2_3B/lora_single_device/epoch_0"

username = huggingface_hub.whoami()["name"]
repo_name = "my-model-trained-with-torchtune"

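# If the repo doesn't exist yet, create it (exist_ok=True avoids an error if it does)
repo_id = huggingface_hub.create_repo(repo_name, exist_ok=True).repo_id

# If it already exists, you can build the repo ID directly instead
# repo_id = f"{username}/{repo_name}"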

api.upload_folder(
    folder_path=trained_model_path,
    repo_id=repo_id,
    repo_type="model",
    create_pr=False
)

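If you prefer, you can also try the command-line version, huggingface-cli upload.

Hopefully this tutorial gave you some insight into how you can use torchtune in your own workflows. Happy tuning!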