Meta Llama3 in torchtune

You will learn how to:

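Download the Llama3-8B-Instruct weights and tokenizer
Fine-tune Llama3-8B-Instruct with LoRA and QLoRA
Evaluate your fine-tuned Llama3-8B-Instruct model
Generate text with your fine-tuned model
Quantize your model to speed up generation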

Prerequisites

Be familiar with torchtune
Make sure torchtune is installed

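Llama3-8B

Meta Llama 3 is a new family of models released by Meta AI that improves on the performance of the Llama2 family across a range of benchmarks. Meta Llama 3 currently comes in two sizes: 8B and 70B. In this tutorial we focus on the 8B model. There are a few main changes between the Llama2-7B and Llama3-8B models: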

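Llama3-8B uses grouped-query attention instead of the standard multi-head attention in Llama2-7B
Llama3-8B has a larger vocabulary (128,256 tokens instead of the 32,000 in Llama2 models)
Llama3-8B uses a different tokenizer than Llama2 models (tiktoken instead of sentencepiece)
Llama3-8B uses a larger intermediate dimension in its MLP layers than Llama2-7B
Llama3-8B uses a higher base value to calculate theta in its rotary positional embeddings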
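To make these differences concrete, the sketch below writes both architectures down as plain data and estimates their parameter counts. The layer counts, head counts, dimensions, and RoPE base values are assumptions taken from the publicly released model configurations rather than from this tutorial, and `ModelShape` and `count_params` are hypothetical helpers, not torchtune APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    vocab_size: int
    num_layers: int
    num_heads: int
    num_kv_heads: int      # fewer KV heads than query heads means grouped-query attention
    embed_dim: int
    intermediate_dim: int  # hidden size of the MLP
    rope_base: int         # base used to compute theta for rotary embeddings


# Assumed values from the published model configurations.
LLAMA2_7B = ModelShape(32_000, 32, 32, 32, 4096, 11_008, 10_000)
LLAMA3_8B = ModelShape(128_256, 32, 32, 8, 4096, 14_336, 500_000)


def count_params(m: ModelShape) -> int:
    """Estimate the parameter count of a Llama-style decoder."""
    head_dim = m.embed_dim // m.num_heads
    kv_dim = head_dim * m.num_kv_heads
    # Query and output projections are square; key and value shrink with GQA.
    attn = 2 * m.embed_dim * m.embed_dim + 2 * m.embed_dim * kv_dim
    mlp = 3 * m.embed_dim * m.intermediate_dim  # gate, up, and down projections
    norms = 2 * m.embed_dim                     # two RMSNorm scales per layer
    per_layer = attn + mlp + norms
    # Token embedding plus an untied output projection.
    embeddings = 2 * m.vocab_size * m.embed_dim
    return m.num_layers * per_layer + embeddings + m.embed_dim  # plus final norm


if __name__ == "__main__":
    for name, shape in [("Llama2-7B", LLAMA2_7B), ("Llama3-8B", LLAMA3_8B)]:
        print(f"{name}: {count_params(shape) / 1e9:.2f}B parameters")
```

Under these assumptions, grouped-query attention makes each attention block smaller in Llama3-8B, while the larger vocabulary and MLP dimension more than make up the difference.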

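Getting access to Llama3-8B-Instruct

For this tutorial, we will use the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions on the official Meta page to gain access to the model. Next, make sure you have your Hugging Face access token.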

tune download meta-llama/Meta-Llama-3-8B-Instruct \
    --output-dir <checkpoint_dir> \
    --hf-token <ACCESS TOKEN>
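If you would rather not paste the token into your shell history, one option is to read it from an environment variable and assemble the command in a script. This is only a sketch: the `HF_TOKEN` variable name and the `build_download_command` helper are assumptions made for illustration, and the flags are the ones shown in the command above.

```python
import os


def build_download_command(output_dir, token_env="HF_TOKEN"):
    """Assemble the `tune download` argument list, reading the token from the environment."""
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(f"Set {token_env} to your Hugging Face access token first")
    return [
        "tune", "download", "meta-llama/Meta-Llama-3-8B-Instruct",
        "--output-dir", output_dir,
        "--hf-token", token,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to start the download.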

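Fine-tuning Llama3-8B-Instruct in torchtune

torchtune provides LoRA, QLoRA, and full fine-tuning recipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our LoRA tutorial. For more on QLoRA in torchtune, see our QLoRA tutorial.

Let's look at how to fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, we fine-tune for one epoch on a common instruct dataset for illustrative purposes. The basic command for a single-device LoRA fine-tune is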

tune run lora_finetune_single_device --config llama3/8B_lora_single_device

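Note

To see the full list of recipes and their corresponding configs, run tune ls from the command line.

We can also add command-line overrides as needed, for example: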

tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    checkpointer.checkpoint_dir=<checkpoint_dir> \
    tokenizer.path=<checkpoint_dir>/tokenizer.model \
    checkpointer.output_dir=<checkpoint_dir>

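This loads the Llama3-8B-Instruct checkpoint and tokenizer from the <checkpoint_dir> used in the tune download command above, then saves the final checkpoint in the same directory following the original format. For more details on the checkpoint formats supported in torchtune, see our checkpointing deep-dive.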
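Conceptually, each `key.subkey=value` override replaces one leaf in the nested YAML config. The sketch below illustrates that idea on plain dictionaries. It is not torchtune's actual config parser, `apply_overrides` is a hypothetical helper, and the paths are placeholders. Unlike a real parser, it leaves every value as a string.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections that the base config lacks.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


if __name__ == "__main__":
    base = {"checkpointer": {"checkpoint_dir": "/old/dir", "model_type": "LLAMA3"}}
    merged = apply_overrides(base, [
        "checkpointer.checkpoint_dir=/new/dir",
        "tokenizer.path=/new/dir/tokenizer.model",
    ])
    print(merged)
```

Keys that are not overridden, such as `model_type` here, keep their values from the base config.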

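Note

To see the full set of configurable parameters for this (and other) configs, we can use tune cp to copy (and modify) the default config. tune cp can also be used with recipe scripts, in case you want to make more custom changes that cannot be achieved by directly modifying existing configurable parameters. For more on tune cp, see the section on modifying configs in our "Fine-Tune Your First LLM" tutorial.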

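Once training is complete, the model checkpoints are saved and their locations are logged. For LoRA fine-tuning, the final checkpoint contains the merged weights, and a copy of just the (much smaller) LoRA weights is saved separately.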
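"Merged weights" means the low-rank update has been folded back into the base weight matrix. A minimal sketch, assuming the standard LoRA formulation W' = W + (alpha / rank) * B A and using plain Python lists in place of tensors; `merge_lora` is a hypothetical helper, not torchtune's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight.

    weight: out_dim x in_dim, lora_a: rank x in_dim, lora_b: out_dim x rank.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # out_dim x in_dim low-rank update
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    weight = [[1.0, 0.0], [0.0, 1.0]]
    lora_a = [[1.0, 2.0]]    # rank 1, in_dim 2
    lora_b = [[1.0], [0.0]]  # out_dim 2, rank 1
    print(merge_lora(weight, lora_a, lora_b, alpha=2, rank=1))
```

After merging, the model has the same shapes as the original checkpoint, so it can be loaded without any LoRA-specific code.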

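In our experiments, we observed a peak memory usage of 18.5 GB. The default config can be trained on a consumer GPU with 24 GB of VRAM.

If you have multiple GPUs available, you can run the distributed version of the recipe. torchtune uses the FSDP APIs from PyTorch Distributed to shard the model, optimizer states, and gradients. This should let you increase your batch size, resulting in faster overall training. For example, on two devices: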

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora

Finally, if we want to use even less memory, we can use torchtune's QLoRA recipe via:

tune run lora_finetune_single_device --config llama3/8B_qlora_single_device

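Since our default configs enable full bfloat16 training, all of the above commands can be run on devices with at least 24 GB of VRAM, and in fact the QLoRA recipe should have peak allocated memory below 10 GB. You can also experiment with different LoRA and QLoRA configurations, or even run a full fine-tune. Try it out!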
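A rough back-of-the-envelope estimate helps explain these memory figures. The sketch below computes the memory needed for the model weights alone at different precisions. The parameter count is an assumed approximate value for Llama3-8B. The estimate ignores activations, LoRA parameters, optimizer state, and quantization metadata, and a real QLoRA setup may keep some tensors in higher precision.

```python
LLAMA3_8B_PARAMS = 8_030_261_248  # assumed parameter count for Llama3-8B


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    print(f"bf16 weights:  {weight_memory_gb(LLAMA3_8B_PARAMS, 16):.1f} GB")
    print(f"4-bit weights: {weight_memory_gb(LLAMA3_8B_PARAMS, 4):.1f} GB")
```

Under these assumptions the bf16 weights take about 16 GB, which accounts for most of the observed peak in the LoRA run, while 4-bit weights take about 4 GB, which leaves room for the QLoRA recipe to stay under 10 GB.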

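Evaluating fine-tuned Llama3-8B models with EleutherAI's Eval Harness

Now that we've fine-tuned our model, what's next? Let's take our LoRA-finetuned model from the preceding section and look at a couple of different ways to evaluate its performance on the tasks we care about.

First, torchtune provides an integration with EleutherAI's evaluation harness for model evaluation on common benchmark tasks.

Note

Make sure you've first installed the evaluation harness via pip install "lm_eval==0.4.*".

For this tutorial we'll use the truthfulqa_mc2 task from the harness. This task measures a model's propensity to be truthful when answering questions: it reports the model's zero-shot accuracy on a question followed by one or more true responses and one or more false responses. First, let's copy the config so we can point the YAML file to our fine-tuned checkpoint files.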

tune cp eleuther_evaluation ./custom_eval_config.yaml

Next, we modify custom_eval_config.yaml to include the fine-tuned checkpoints.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.training.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model
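Before launching a run, it can save time to confirm that the files named in the config actually exist. The helper below is a hypothetical convenience written for this tutorial, not part of torchtune. It takes the same three values that appear in the config above.

```python
from pathlib import Path


def missing_files(checkpoint_dir, checkpoint_files, tokenizer_path):
    """Return the configured paths that do not exist on disk."""
    root = Path(checkpoint_dir)
    expected = [root / name for name in checkpoint_files] + [Path(tokenizer_path)]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list means every configured checkpoint file and the tokenizer were found.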

Finally, we can run evaluation using our modified config.

tune run eleuther_eval --config ./custom_eval_config.yaml

Try it for yourself and see what accuracy your model gets!
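To make the reported number easier to interpret, here is a minimal sketch of an MC2-style score. It assumes the score for one question is the share of probability mass the model assigns to the true answers among all candidate answers. The harness's own implementation may differ in detail, and `mc2_score` is a hypothetical name.

```python
import math


def mc2_score(true_loglikelihoods, false_loglikelihoods):
    """Normalized probability mass on the true answers for one question."""
    p_true = sum(math.exp(ll) for ll in true_loglikelihoods)
    p_false = sum(math.exp(ll) for ll in false_loglikelihoods)
    return p_true / (p_true + p_false)


if __name__ == "__main__":
    # Two true answers and one false answer, given as log-likelihoods.
    score = mc2_score([math.log(0.6), math.log(0.2)], [math.log(0.2)])
    print(f"MC2-style score: {score:.2f}")
```

A model that cannot tell true from false answers scores around 0.5 on a question with equally likely candidates, and the task score is the average over all questions.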

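Generating text with our fine-tuned Llama3 model

Next, let's look at one other way to evaluate our model: generating text! torchtune provides a recipe for generation as well.

Similar to what we did for evaluation, let's copy and modify the default generation config.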

tune cp generation ./custom_generation_config.yaml

Now we modify custom_generation_config.yaml to point to our checkpoint and tokenizer.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.training.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Running generation with our LoRA-finetuned model, we see the following output:

tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"

[generate.py:122] Hello, my name is Sarah and I am a busy working mum of two young children, living in the North East of England.
...
[generate.py:135] Time for inference: 10.88 sec total, 18.94 tokens/sec
[generate.py:138] Bandwidth achieved: 346.09 GB/s
[generate.py:139] Memory used: 18.31 GB
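The numbers in this log can be related to each other with a little arithmetic. The sketch below derives the approximate number of generated tokens and the data moved per token from the logged values. The helper names are hypothetical. Reading the second figure as "weights read once per token" is an interpretation, not something the log states.

```python
def tokens_generated(seconds, tokens_per_sec):
    """Approximate token count from total time and throughput."""
    return seconds * tokens_per_sec


def gb_per_token(bandwidth_gb_s, tokens_per_sec):
    """Data moved per generated token, in GB."""
    return bandwidth_gb_s / tokens_per_sec


if __name__ == "__main__":
    print(f"tokens: {tokens_generated(10.88, 18.94):.0f}")
    print(f"GB per token: {gb_per_token(346.09, 18.94):.2f}")
```

With the values above this gives roughly 206 tokens and about 18.3 GB moved per token, close to the reported memory use. That is consistent with generation speed being limited by how fast the weights can be read, which is why shrinking the weights through quantization can speed up generation.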

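Faster generation via quantization

We rely on torchao for post-training quantization. After installing torchao, we can run the following snippet to quantize the fine-tuned model: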

# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())

After quantization, we rely on torch.compile for speedups. For more details, see the example usage in torchao.
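To give some intuition for what 4-bit weight-only quantization does, the sketch below quantizes a small group of weights to integer codes in the range 0 to 15 and then reconstructs them. This is a conceptual illustration in plain Python with hypothetical helper names. It is not torchao's implementation, which uses its own grouping, packing, and GPU kernels.

```python
def quantize_int4(values):
    """Affine-quantize one group of floats to integer codes in [0, 15]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when all values are equal
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_int4(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [code * scale + lo for code in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.4, 0.0, 0.3, 1.1, 2.0]
    codes, scale, lo = quantize_int4(weights)
    restored = dequantize_int4(codes, scale, lo)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(codes)
    print(f"largest reconstruction error: {worst:.4f}")
```

Each weight is stored in 4 bits instead of 16, at the cost of a reconstruction error of at most about half a quantization step per value, plus one scale and offset per group.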

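torchao also provides a table listing performance and accuracy results for llama2 and llama3.

For Llama models, you can run generation on the quantized model directly in torchao using its generate.py script, as described in the torchao README. This way you can compare your own results with those in the torchao table.

This is just the beginning of what you can do with Meta Llama3 using torchtune and the broader ecosystem. We look forward to seeing what you build!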