Fine-tune a Hugging Face Transformers Model

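This notebook is based on an official Hugging Face example, How to fine-tune a model on text classification. It shows how to convert vanilla HF code to Ray Train while changing the training logic only where necessary.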

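This notebook consists of the following steps:

Set up Ray
Load the dataset
Preprocess the dataset with Ray Data
Run the training with Ray Train
Optionally, share the model with the community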

Uncomment and run the following line to install all the necessary dependencies. (This notebook is tested with transformers==4.19.1.)

#! pip install "datasets" "transformers>=4.19.0" "torch>=1.10.0" "mlflow"

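Set up Ray

Use ray.init() to initialize a local cluster. By default, this cluster contains only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.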

from pprint import pprint
import ray

ray.init()

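Check the resources that make up the cluster. If you are running this notebook on your local machine or on Google Colab, you should see the number of CPU cores and GPUs available.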

pprint(ray.cluster_resources())

{'CPU': 48.0,
 'GPU': 4.0,
 'accelerator_type:None': 1.0,
 'memory': 206158430208.0,
 'node:10.0.27.125': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 59052625920.0}

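This notebook fine-tunes an HF Transformers model for one of the text classification tasks of the GLUE Benchmark. It runs the training using Ray Train.

You can change these two variables to control whether the training, which happens later, uses CPUs or GPUs, and how many workers to spawn. Each worker claims one CPU or GPU. Make sure not to request more resources than are available. By default, the training runs with one GPU worker.

use_gpu = True  # Set this to False to run on CPUs
num_workers = 1  # Set this to the number of GPUs or CPUs you want to use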
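
As a quick sanity check before training, the requested workers can be compared against the resource dictionary printed earlier. The following is a minimal sketch, not part of Ray: can_schedule is a hypothetical helper, and it only applies the one-CPU-or-GPU-per-worker rule described above, ignoring any overhead that the trainer itself may claim.

```python
def can_schedule(resources: dict, num_workers: int, use_gpu: bool) -> bool:
    """Return True if the reported resources cover one CPU or GPU per worker."""
    key = "GPU" if use_gpu else "CPU"
    # A missing key means the cluster reports none of that resource.
    available = resources.get(key, 0.0)
    return num_workers <= available


# Dictionary shaped like the ray.cluster_resources() output shown earlier.
example_resources = {"CPU": 48.0, "GPU": 4.0}
print(can_schedule(example_resources, num_workers=1, use_gpu=True))
print(can_schedule(example_resources, num_workers=8, use_gpu=True))
```

In a live session you could pass ray.cluster_resources() in place of example_resources.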

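Fine-tune a model on a text classification task

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. To learn more, see the original notebook.

Each task is named by its acronym. The exception is mnli-mm, which denotes the mismatched version of MNLI: it uses the same training set as mnli but different validation and test sets.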

GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]

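This notebook runs on any of the tasks in the preceding list, with any model checkpoint from the Model Hub, as long as that model has a version with a classification head. Depending on the model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set these three parameters, and the rest of the notebook should run smoothly: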

task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

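Load the dataset

Use the HF Datasets library to download the data and to get the metric used for evaluation, so that you can compare your model to the benchmark. The load_dataset and load_metric functions handle both steps.

Apart from mnli-mm, which needs the special handling shown in the following code, you can pass the task name to these functions directly.

Run the usual HF Datasets code to load the dataset from the Hub.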

from datasets import load_dataset

actual_task = "mnli" if task == "mnli-mm" else task
datasets = load_dataset("glue", actual_task)

Reusing dataset glue (/home/ray/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)

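The datasets object itself is a DatasetDict, which contains one key each for the training, validation, and test sets, plus additional keys for the mismatched validation and test sets in the special case of mnli.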
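
The mapping from a task name to the dataset configuration and to the validation split can be written as two small functions. This is a sketch with hypothetical helper names (dataset_config_name, validation_split); the split names mirror the ones this notebook uses later when it selects the validation key.

```python
def dataset_config_name(task: str) -> str:
    """mnli-mm shares the mnli dataset; every other task maps to itself."""
    return "mnli" if task == "mnli-mm" else task


def validation_split(task: str) -> str:
    """Pick the DatasetDict key that holds the validation data for a task."""
    if task == "mnli-mm":
        return "validation_mismatched"
    if task == "mnli":
        return "validation_matched"
    return "validation"


for name in ("cola", "mnli", "mnli-mm"):
    print(name, dataset_config_name(name), validation_split(name))
```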

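Preprocess the data with Ray Data

Before you can feed these texts to the model, you need to preprocess them. Preprocess them with an HF Transformers Tokenizer, which tokenizes the inputs, including converting the tokens to their corresponding IDs in the pretrained vocabulary, and puts them in the format the model expects. It also generates the other inputs that the model requires.

To do all of this, instantiate your tokenizer with the AutoTokenizer.from_pretrained method, which ensures that you:

Get a tokenizer that corresponds to the model architecture you want to use.
Download the vocabulary that was used when pretraining this specific checkpoint.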

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

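Pass use_fast=True to the preceding call to use one of the fast, Rust-backed tokenizers from the HF Tokenizers library. These fast tokenizers are available for almost all models, but if you get an error with the preceding call, remove the argument.

To preprocess the dataset, you need the names of the columns that contain the sentences. The following dictionary keeps track of the correspondence between tasks and column names: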

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

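Instead of using the HF Dataset objects directly, you can convert them to Ray Data. Both are backed by Arrow tables, so the conversion is straightforward. Use the built-in from_huggingface() function.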

import ray.data

ray_datasets = {
    "train": ray.data.from_huggingface(datasets["train"]),
    "validation": ray.data.from_huggingface(datasets["validation"]),
    "test": ray.data.from_huggingface(datasets["test"]),
}
ray_datasets

{'train': MaterializedDataset(
    num_blocks=1,
    num_rows=8551,
    schema={sentence: string, label: int64, idx: int32}
 ),
 'validation': MaterializedDataset(
    num_blocks=1,
    num_rows=1043,
    schema={sentence: string, label: int64, idx: int32}
 ),
 'test': MaterializedDataset(
    num_blocks=1,
    num_rows=1063,
    schema={sentence: string, label: int64, idx: int32}
 )}

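You can then write the function that preprocesses the samples. Pass them to the tokenizer with the arguments truncation=True and padding="longest". With this configuration, the tokenizer truncates any input longer than what the selected model can handle and pads every sequence to the length of the longest sequence in the batch.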

import numpy as np
import torch
from typing import Dict


# Tokenize the input sentences and build a batch of PyTorch tensors.
def collate_fn(examples: Dict[str, np.ndarray]):
    sentence1_key, sentence2_key = task_to_keys[task]
    if sentence2_key is None:
        outputs = tokenizer(
            list(examples[sentence1_key]),
            truncation=True,
            padding="longest",
            return_tensors="pt",
        )
    else:
        outputs = tokenizer(
            list(examples[sentence1_key]),
            list(examples[sentence2_key]),
            truncation=True,
            padding="longest",
            return_tensors="pt",
        )

    # STS-B is a regression task, so its labels stay floating point.
    label_dtype = torch.float if task == "stsb" else torch.long
    outputs["labels"] = torch.tensor(examples["label"], dtype=label_dtype)

    # Move all input tensors to the GPU, but only when training on GPUs.
    if use_gpu:
        for key, value in outputs.items():
            outputs[key] = value.cuda()

    return outputs
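
To make the effect of truncation and longest-sequence padding concrete, here is a simplified illustration on plain lists of token IDs. It is not the Hugging Face implementation: truncate_and_pad, max_length, and pad_id are hypothetical names, and real tokenizers also take care of special tokens and sentence pairs.

```python
def truncate_and_pad(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The mask marks real tokens with 1 and padding positions with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


example = truncate_and_pad([[101, 7, 8, 9, 102], [101, 7, 102]], max_length=4)
print(example["input_ids"])
print(example["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to a fixed length, keeps batches of short sentences small.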

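Fine-tune the model with Ray Train

Now that the data is ready, download the pretrained model and fine-tune it.

Because all of the tasks involve sentence classification, use the AutoModelForSequenceClassification class. For more specifics about each individual training component, see the original notebook. The original notebook uses the same tokenizer that encodes the dataset in the preceding example of this notebook.

The main difference when using Ray Train is that you need to define the training logic as a function (train_func). You pass this training function to the TorchTrainer, which runs it on every Ray worker. The training then proceeds using PyTorch DDP.

Note

Be sure to initialize the model, metric, and tokenizer inside the function. Otherwise, you may run into serialization errors.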
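
The reason for this advice is that the training function has to be serialized before it can run on a worker, and anything it captures from the driver is serialized along with it. The sketch below uses the standard-library pickle module as a stand-in for the serializer Ray uses, which is an assumption made purely for illustration. HeavyResource, can_pickle, and train_func_sketch are hypothetical names.

```python
import pickle
import threading


class HeavyResource:
    """Hypothetical object that holds state which cannot be pickled."""

    def __init__(self):
        self._lock = threading.Lock()


def can_pickle(obj) -> bool:
    """Report whether the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def train_func_sketch(config):
    # Built on the worker inside the function, so it is never serialized.
    resource = HeavyResource()
    return config["learning_rate"]


print(can_pickle(HeavyResource()))
print(can_pickle({"learning_rate": 2e-5}))
```

A plain configuration dictionary serializes without trouble, which is why hyperparameters travel through config while heavier objects are created inside the function.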

import torch
import numpy as np

from datasets import load_metric
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

import ray.train
from ray.train.huggingface.transformers import prepare_trainer, RayTrainReportCallback

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)
model_name = model_checkpoint.split("/")[-1]
validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)
name = f"{model_name}-finetuned-{task}"

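# Calculate the maximum steps per epoch based on the number of rows in the training dataset.
# Make sure to scale by the total number of training workers and the per-device batch size.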
max_steps_per_epoch = ray_datasets["train"].count() // (batch_size * num_workers)


def train_func(config):
    print(f"Is CUDA available: {torch.cuda.is_available()}")

    metric = load_metric("glue", actual_task)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, num_labels=num_labels
    )

    train_ds = ray.train.get_dataset_shard("train")
    eval_ds = ray.train.get_dataset_shard("eval")

    train_ds_iterable = train_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=collate_fn
    )
    eval_ds_iterable = eval_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=collate_fn
    )

    print("max_steps_per_epoch: ", max_steps_per_epoch)

    args = TrainingArguments(
        name,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=config.get("learning_rate", 2e-5),
        num_train_epochs=config.get("epochs", 2),
        weight_decay=config.get("weight_decay", 0.01),
        push_to_hub=False,
        max_steps=max_steps_per_epoch * config.get("epochs", 2),
        disable_tqdm=True,  # declutter the output a little
        no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
        report_to="none",
    )

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        if task != "stsb":
            predictions = np.argmax(predictions, axis=1)
        else:
            predictions = predictions[:, 0]
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model,
        args,
        train_dataset=train_ds_iterable,
        eval_dataset=eval_ds_iterable,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    trainer.add_callback(RayTrainReportCallback())

    trainer = prepare_trainer(trainer)

    print("Starting training")
    trainer.train()

2023-09-06 14:25:28.144428: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-06 14:25:28.284936: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-06 14:25:29.025734: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-09-06 14:25:29.025801: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-09-06 14:25:29.025807: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
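
The step budget computed before train_func can be checked by hand. The sketch below repeats the same integer division for the CoLA training split (8551 rows, as shown in the Ray Data output earlier) with the default batch size of 16 and one worker. steps_per_epoch is a hypothetical helper name, and the four-worker case is only an illustration, not a configuration that this notebook runs.

```python
def steps_per_epoch(num_rows: int, batch_size: int, num_workers: int) -> int:
    """Optimizer steps in one pass over the data, split across all workers."""
    return num_rows // (batch_size * num_workers)


per_epoch = steps_per_epoch(8551, batch_size=16, num_workers=1)
total_steps = per_epoch * 2  # two epochs, the default in train_func
print(per_epoch, total_steps)

# With four workers, each worker would see roughly a quarter of the batches.
print(steps_per_epoch(8551, batch_size=16, num_workers=4))
```

Because the Trainer receives an iterable rather than a sized dataset, max_steps is what tells it when to stop.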

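With your train_func complete, you can now instantiate the TorchTrainer. Aside from the function, set the scaling_config, which controls the number of workers and the resources they use, and the datasets to use for training and evaluation.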

from ray.train.torch import TorchTrainer
from ray.train import RunConfig, ScalingConfig, CheckpointConfig

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": ray_datasets["train"],
        "eval": ray_datasets["validation"],
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)

Finally, call the fit method to start training with Ray Train. Save the Result object to a variable so you can access metrics and checkpoints.

result = trainer.fit()

Tune Status

Current time:	2023-09-06 14:27:12
Running for:	00:01:40.12
Memory:	18.4/186.6 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:None)

Trial Status

Trial name	status	loc	iter	total time (s)	loss	learning_rate	epoch
TorchTrainer_e8bd4_00000	TERMINATED	10.0.27.125:43821	2	76.6259	0.3866	0	1.5

(TrainTrainable pid=43821) 2023-09-06 14:25:35.638885: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(TrainTrainable pid=43821) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=43821) 2023-09-06 14:25:35.782950: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(TrainTrainable pid=43821) 2023-09-06 14:25:36.501583: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=43821) 2023-09-06 14:25:36.501653: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=43821) 2023-09-06 14:25:36.501660: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(TrainTrainable pid=43821) comet_ml is installed but `COMET_API_KEY` is not set.
(TorchTrainer pid=43821) Starting distributed worker processes: ['43946 (10.0.27.125)']
(RayTrainWorker pid=43946) Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=43946) 2023-09-06 14:25:42.756510: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(RayTrainWorker pid=43946) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=43946) 2023-09-06 14:25:42.903398: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(SplitCoordinator pid=44017) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']
(RayTrainWorker pid=43946) 2023-09-06 14:25:43.737476: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=43946) 2023-09-06 14:25:43.737544: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=43946) 2023-09-06 14:25:43.737554: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(RayTrainWorker pid=43946) comet_ml is installed but `COMET_API_KEY` is not set.

(RayTrainWorker pid=43946) Is CUDA available: True

(RayTrainWorker pid=43946) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight']
(RayTrainWorker pid=43946) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=43946) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=43946) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight']
(RayTrainWorker pid=43946) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(SplitCoordinator pid=44016) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']

(RayTrainWorker pid=43946) max_steps_per_epoch:  534

(RayTrainWorker pid=43946) max_steps is given, it will override any value given in num_train_epochs
(RayTrainWorker pid=43946) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=43946)   warnings.warn(

(RayTrainWorker pid=43946) Starting training

(RayTrainWorker pid=43946) ***** Running training *****
(RayTrainWorker pid=43946)   Num examples = 17088
(RayTrainWorker pid=43946)   Num Epochs = 9223372036854775807
(RayTrainWorker pid=43946)   Instantaneous batch size per device = 16
(RayTrainWorker pid=43946)   Total train batch size (w. parallel, distributed & accumulation) = 16
(RayTrainWorker pid=43946)   Gradient Accumulation steps = 1
(RayTrainWorker pid=43946)   Total optimization steps = 1068
(RayTrainWorker pid=43946) /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(SplitCoordinator pid=44016) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44016) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44016) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=43946) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

(RayTrainWorker pid=43946) {'loss': 0.5414, 'learning_rate': 9.9812734082397e-06, 'epoch': 0.5}

(RayTrainWorker pid=43946) ***** Running Evaluation *****
(RayTrainWorker pid=43946)   Num examples: Unknown
(RayTrainWorker pid=43946)   Batch size = 16
(SplitCoordinator pid=44017) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44017) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44017) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=43946) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=43946) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json

(RayTrainWorker pid=43946) {'eval_loss': 0.5018134117126465, 'eval_matthews_correlation': 0.4145623770066859, 'eval_runtime': 0.6595, 'eval_samples_per_second': 1581.584, 'eval_steps_per_second': 100.081, 'epoch': 0.5}

(RayTrainWorker pid=43946) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=43946) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=43946) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
(RayTrainWorker pid=43946) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32/checkpoint_000000)
(SplitCoordinator pid=44016) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44016) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44016) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=43946) {'loss': 0.3866, 'learning_rate': 0.0, 'epoch': 1.5}

(RayTrainWorker pid=43946) ***** Running Evaluation *****
(RayTrainWorker pid=43946)   Num examples: Unknown
(RayTrainWorker pid=43946)   Batch size = 16
(SplitCoordinator pid=44017) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44017) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44017) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=43946) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1068
(RayTrainWorker pid=43946) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/config.json

(RayTrainWorker pid=43946) {'eval_loss': 0.5527923107147217, 'eval_matthews_correlation': 0.44860917123689154, 'eval_runtime': 0.6646, 'eval_samples_per_second': 1569.42, 'eval_steps_per_second': 99.311, 'epoch': 1.5}

(RayTrainWorker pid=43946) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/pytorch_model.bin
(RayTrainWorker pid=43946) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/tokenizer_config.json
(RayTrainWorker pid=43946) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/special_tokens_map.json
(RayTrainWorker pid=43946) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32/checkpoint_000001)
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) 

(RayTrainWorker pid=43946) {'train_runtime': 66.0485, 'train_samples_per_second': 258.719, 'train_steps_per_second': 16.17, 'train_loss': 0.46413421630859375, 'epoch': 1.5}

2023-09-06 14:27:12,180	WARNING experiment_state.py:371 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-09-06 14:27:12,184	INFO tune.py:1141 -- Total run time: 100.17 seconds (85.12 seconds for the tuning loop).

You can use the returned Result object to access metrics and the Ray Train Checkpoint associated with the last iteration.

result

Result(
  metrics={'loss': 0.3866, 'learning_rate': 0.0, 'epoch': 1.5, 'step': 1068, 'eval_loss': 0.5527923107147217, 'eval_matthews_correlation': 0.44860917123689154, 'eval_runtime': 0.6646, 'eval_samples_per_second': 1569.42, 'eval_steps_per_second': 99.311},
  path='/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32',
  filesystem='local',
  checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32/checkpoint_000001)
)
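
The eval_matthews_correlation value in these metrics is the Matthews correlation coefficient, the metric this notebook selects for CoLA. As a rough sketch of what it measures for binary labels, the following pure-Python function computes it from the confusion counts. It is a stand-in for illustration, not the code that the GLUE metric runs, and returning 0.0 for a zero denominator is a convention assumed here.

```python
import math


def matthews_corrcoef(predictions, references):
    """Matthews correlation coefficient for binary labels 0 and 1."""
    tp = tn = fp = fn = 0
    for pred, ref in zip(predictions, references):
        if pred == 1 and ref == 1:
            tp += 1
        elif pred == 0 and ref == 0:
            tn += 1
        elif pred == 1 and ref == 0:
            fp += 1
        else:
            fn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the coefficient is undefined.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


print(matthews_corrcoef([1, 0, 1, 0], [1, 0, 1, 0]))
print(matthews_corrcoef([1, 1, 1, 1], [1, 0, 1, 0]))
```

The coefficient ranges from -1 to 1, and a model that always predicts the same class scores 0 no matter how imbalanced the labels are.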

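Tune hyperparameters with Ray Tune

To tune any hyperparameters of the model, pass your TorchTrainer into a Tuner and define the search space.

You can also take advantage of the advanced search algorithms and schedulers from Ray Tune. This example uses an ASHAScheduler to aggressively terminate underperforming trials.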
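
The early-stopping idea behind ASHA can be sketched in plain Python. This is a simplified, synchronous illustration rather than Ray Tune's implementation: rung_milestones and survives are hypothetical helpers, and grace_period=1 and reduction_factor=4 are values assumed here, not settings taken from this notebook. Trials are compared at a few checkpoints, called rungs, and only the best fraction at each rung continues.

```python
def rung_milestones(max_t, grace_period=1, reduction_factor=4):
    """Iterations at which trials are compared, up to max_t."""
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def survives(loss, rung_losses, reduction_factor=4):
    """True if loss ranks in the best 1/reduction_factor at this rung."""
    ranked = sorted(rung_losses + [loss])
    keep = max(1, len(ranked) // reduction_factor)
    return loss <= ranked[keep - 1]


print(rung_milestones(max_t=4))
print(survives(0.59, [0.62, 0.67, 0.61]))
print(survives(0.67, [0.59, 0.62, 0.61]))
```

Under these assumptions, max_t=4 gives rungs at iterations 1 and 4. The real scheduler makes its decisions asynchronously as results arrive, so the set of trials it stops can differ from this synchronous picture.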

from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers.async_hyperband import ASHAScheduler

tune_epochs = 4
tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "learning_rate": tune.grid_search([2e-5, 2e-4, 2e-3, 2e-2]),
            "epochs": tune_epochs,
        }
    },
    tune_config=tune.TuneConfig(
        metric="eval_loss",
        mode="min",
        num_samples=1,
        scheduler=ASHAScheduler(
            max_t=tune_epochs,
        ),
    ),
    run_config=RunConfig(
        name="tune_transformers",
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)

2023-09-06 14:46:47,821	INFO tuner_internal.py:508 -- A `RunConfig` was passed to both the `Tuner` and the `TorchTrainer`. The run config passed to the `Tuner` is the one that will be used.
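
A grid search produces one trial for every combination of grid values, while constant entries are copied into each trial. The following sketch shows that expansion with itertools.product. expand_grid is a hypothetical helper for illustration and is not how Ray Tune resolves a search space internally.

```python
from itertools import product


def expand_grid(grid: dict, constants: dict) -> list:
    """Return one config dict per combination of the grid values."""
    keys = list(grid)
    configs = []
    for values in product(*(grid[key] for key in keys)):
        config = dict(constants)
        config.update(zip(keys, values))
        configs.append(config)
    return configs


configs = expand_grid(
    {"learning_rate": [2e-5, 2e-4, 2e-3, 2e-2]},
    {"epochs": 4},
)
print(len(configs))
print(configs[0])
```

With four learning rates and num_samples=1, this corresponds to the four trials listed in the trial status table.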

tune_results = tuner.fit()

Tune Status

Current time:	2023-09-06 14:49:04
Running for:	00:02:16.18
Memory:	19.6/186.6 GiB

System Info

Using AsyncHyperBand: num_stopped=4
Bracket: Iter 4.000: -0.6517604142427444 | Iter 1.000: -0.5936744660139084
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:None)

Trial Status

Trial name	status	loc	train_loop_config/learning_rate	iter	total time (s)	loss	learning_rate	epoch
TorchTrainer_e1825_00000	TERMINATED	10.0.27.125:57496	2e-05	4	128.443	0.1934	0	3.25
TorchTrainer_e1825_00001	TERMINATED	10.0.27.125:57497	0.0002	1	41.2486	0.616	0.000149906	0.25
TorchTrainer_e1825_00002	TERMINATED	10.0.27.125:57498	0.002	1	41.1336	0.6699	0.00149906	0.25
TorchTrainer_e1825_00003	TERMINATED	10.0.27.125:57499	0.02	4	126.699	0.6073	0	3.25

(TrainTrainable pid=57498) 2023-09-06 14:46:52.049839: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(TrainTrainable pid=57498) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=57498) 2023-09-06 14:46:52.195780: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(TrainTrainable pid=57498) 2023-09-06 14:46:52.944517: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=57498) 2023-09-06 14:46:52.944590: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=57498) 2023-09-06 14:46:52.944597: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(TrainTrainable pid=57498) comet_ml is installed but `COMET_API_KEY` is not set.
(TorchTrainer pid=57498) Starting distributed worker processes: ['57731 (10.0.27.125)']
(TrainTrainable pid=57499) 2023-09-06 14:46:52.229406: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA [repeated 3x across cluster]
(TrainTrainable pid=57499) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 3x across cluster]
(TrainTrainable pid=57499) 2023-09-06 14:46:52.378805: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. [repeated 3x across cluster]
(RayTrainWorker pid=57741) Setting up process group for: env:// [rank=0, world_size=1]
(TrainTrainable pid=57499) 2023-09-06 14:46:53.174151: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 6x across cluster]
(TrainTrainable pid=57499) 2023-09-06 14:46:53.174160: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [repeated 3x across cluster]
(TrainTrainable pid=57499) comet_ml is installed but `COMET_API_KEY` is not set. [repeated 3x across cluster]
(SplitCoordinator pid=57927) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']

(RayTrainWorker pid=57741) Is CUDA available: True
(RayTrainWorker pid=57741) max_steps_per_epoch:  534

(RayTrainWorker pid=57741) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
(RayTrainWorker pid=57741) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=57741) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=57741) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
(RayTrainWorker pid=57741) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(TorchTrainer pid=57499) Starting distributed worker processes: ['57746 (10.0.27.125)'] [repeated 3x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:00.036649: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA [repeated 4x across cluster]
(RayTrainWorker pid=57740) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 4x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:00.198894: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. [repeated 4x across cluster]
(RayTrainWorker pid=57746) Setting up process group for: env:// [rank=0, world_size=1] [repeated 3x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:01.085704: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 8x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:01.085711: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [repeated 4x across cluster]
(RayTrainWorker pid=57740) comet_ml is installed but `COMET_API_KEY` is not set. [repeated 4x across cluster]
(SplitCoordinator pid=57965) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'] [repeated 7x across cluster]

(RayTrainWorker pid=57741) Starting training

(RayTrainWorker pid=57741) max_steps is given, it will override any value given in num_train_epochs
(RayTrainWorker pid=57741) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=57741)   warnings.warn(
(RayTrainWorker pid=57746) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias']
(RayTrainWorker pid=57746) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
(RayTrainWorker pid=57731) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
(RayTrainWorker pid=57740) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
(RayTrainWorker pid=57740) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
(RayTrainWorker pid=57741) ***** Running training *****
(RayTrainWorker pid=57741)   Num examples = 34176
(RayTrainWorker pid=57741)   Num Epochs = 9223372036854775807
(RayTrainWorker pid=57741)   Instantaneous batch size per device = 16
(RayTrainWorker pid=57741)   Total train batch size (w. parallel, distributed & accumulation) = 16
(RayTrainWorker pid=57741)   Gradient Accumulation steps = 1
(RayTrainWorker pid=57741)   Total optimization steps = 2136
(RayTrainWorker pid=57741) /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(SplitCoordinator pid=57927) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=57927) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=57927) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(RayTrainWorker pid=57741) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

(RayTrainWorker pid=57741) {'loss': 0.5481, 'learning_rate': 1.4990636704119851e-05, 'epoch': 0.25}
(RayTrainWorker pid=57740) Is CUDA available: True [repeated 3x across cluster]
(RayTrainWorker pid=57740) max_steps_per_epoch:  534 [repeated 3x across cluster]
(RayTrainWorker pid=57740) Starting training [repeated 3x across cluster]

(RayTrainWorker pid=57741) ***** Running Evaluation *****
(RayTrainWorker pid=57741)   Num examples: Unknown
(RayTrainWorker pid=57741)   Batch size = 16
(RayTrainWorker pid=57740) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). [repeated 3x across cluster]
(RayTrainWorker pid=57740) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). [repeated 3x across cluster]
(RayTrainWorker pid=57731) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
(RayTrainWorker pid=57740) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [repeated 3x across cluster]
(RayTrainWorker pid=57740) max_steps is given, it will override any value given in num_train_epochs [repeated 3x across cluster]
(RayTrainWorker pid=57740) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning [repeated 3x across cluster]
(RayTrainWorker pid=57740)   warnings.warn( [repeated 3x across cluster]
(RayTrainWorker pid=57740) ***** Running training ***** [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Num examples = 34176 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Num Epochs = 9223372036854775807 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Instantaneous batch size per device = 16 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Total train batch size (w. parallel, distributed & accumulation) = 16 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Gradient Accumulation steps = 1 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Total optimization steps = 2136 [repeated 3x across cluster]
(RayTrainWorker pid=57740) /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.) [repeated 3x across cluster]
(SplitCoordinator pid=57965) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 3x across cluster]
(SplitCoordinator pid=57965) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 3x across cluster]
(SplitCoordinator pid=57965) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 3x across cluster]
(RayTrainWorker pid=57740) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [repeated 3x across cluster]

(RayTrainWorker pid=57741) {'eval_loss': 0.5202918648719788, 'eval_matthews_correlation': 0.37321205597032797, 'eval_runtime': 0.7255, 'eval_samples_per_second': 1437.704, 'eval_steps_per_second': 90.976, 'epoch': 0.25}

(RayTrainWorker pid=57741) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=57741) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=57741) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=57741) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=57741) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
(RayTrainWorker pid=57741) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000000)

(RayTrainWorker pid=57746) {'loss': 0.6064, 'learning_rate': 0.009981273408239701, 'epoch': 1.25} [repeated 4x across cluster]
(RayTrainWorker pid=57740) {'eval_loss': 0.6181353330612183, 'eval_matthews_correlation': 0.0, 'eval_runtime': 0.7543, 'eval_samples_per_second': 1382.828, 'eval_steps_per_second': 87.504, 'epoch': 0.25} [repeated 3x across cluster]

(RayTrainWorker pid=57746) ***** Running Evaluation ***** [repeated 4x across cluster]
(RayTrainWorker pid=57746)   Num examples: Unknown [repeated 4x across cluster]
(RayTrainWorker pid=57746)   Batch size = 16 [repeated 4x across cluster]
(SplitCoordinator pid=57954) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 6x across cluster]
(SplitCoordinator pid=57954) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 6x across cluster]
(SplitCoordinator pid=57954) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 6x across cluster]
(RayTrainWorker pid=57740) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535 [repeated 3x across cluster]
(RayTrainWorker pid=57740) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json [repeated 3x across cluster]
(RayTrainWorker pid=57740) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin [repeated 3x across cluster]
(RayTrainWorker pid=57740) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json [repeated 3x across cluster]
(RayTrainWorker pid=57740) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json [repeated 3x across cluster]
(RayTrainWorker pid=57740) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00001_1_learning_rate=0.0002_2023-09-06_14-46-48/checkpoint_000000) [repeated 3x across cluster]

(RayTrainWorker pid=57746) {'loss': 0.6061, 'learning_rate': 0.004971910112359551, 'epoch': 2.25} [repeated 2x across cluster]
(RayTrainWorker pid=57741) {'eval_loss': 0.5246258974075317, 'eval_matthews_correlation': 0.489934557943789, 'eval_runtime': 0.6462, 'eval_samples_per_second': 1614.032, 'eval_steps_per_second': 102.134, 'epoch': 1.25} [repeated 2x across cluster]

(RayTrainWorker pid=57746) ***** Running Evaluation ***** [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Num examples: Unknown [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Batch size = 16 [repeated 2x across cluster]
(SplitCoordinator pid=57927) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 4x across cluster]
(SplitCoordinator pid=57927) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 4x across cluster]
(SplitCoordinator pid=57927) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 4x across cluster]
(RayTrainWorker pid=57741) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070 [repeated 2x across cluster]
(RayTrainWorker pid=57741) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin [repeated 2x across cluster]
(RayTrainWorker pid=57741) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000001) [repeated 2x across cluster]

(RayTrainWorker pid=57746) {'loss': 0.6073, 'learning_rate': 0.0, 'epoch': 3.25} [repeated 2x across cluster]
(RayTrainWorker pid=57741) {'eval_loss': 0.6450843811035156, 'eval_matthews_correlation': 0.5259674254268325, 'eval_runtime': 0.6474, 'eval_samples_per_second': 1611.106, 'eval_steps_per_second': 101.949, 'epoch': 2.25} [repeated 2x across cluster]

(RayTrainWorker pid=57746) ***** Running Evaluation ***** [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Num examples: Unknown [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Batch size = 16 [repeated 2x across cluster]
(SplitCoordinator pid=57927) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 4x across cluster]
(SplitCoordinator pid=57927) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 4x across cluster]
(SplitCoordinator pid=57927) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 4x across cluster]
(RayTrainWorker pid=57741) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1605 [repeated 2x across cluster]
(RayTrainWorker pid=57741) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/pytorch_model.bin [repeated 2x across cluster]
(RayTrainWorker pid=57741) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/tokenizer_config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/special_tokens_map.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000002) [repeated 2x across cluster]

(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) 

(RayTrainWorker pid=57746) {'train_runtime': 115.5377, 'train_samples_per_second': 295.8, 'train_steps_per_second': 18.487, 'train_loss': 0.6787891173630618, 'epoch': 3.25}

2023-09-06 14:49:04,574	INFO tune.py:1141 -- Total run time: 136.19 seconds (136.17 seconds for the tuning loop).
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) 

(RayTrainWorker pid=57741) {'train_runtime': 117.6791, 'train_samples_per_second': 290.417, 'train_steps_per_second': 18.151, 'train_loss': 0.3468295286657212, 'epoch': 3.25}

View the results of the tuning run as a dataframe, sorted by evaluation loss, to find the best result.

tune_results.get_dataframe().sort_values("eval_loss")

	loss	learning_rate	epoch	step	eval_loss	eval_matthews_correlation	eval_runtime	eval_samples_per_second	eval_steps_per_second	timestamp	...	time_total_s	pid	hostname	node_ip	time_since_restore	iterations_since_restore	checkpoint_dir_name	config/train_loop_config/learning_rate	config/train_loop_config/epochs	logdir
1	0.6160	0.000150	0.25	535	0.618135	0.000000	0.7543	1382.828	87.504	1694036857	...	41.248600	57497	ip-10-0-27-125	10.0.27.125	41.248600	1	checkpoint_000000	0.00020	4	e1825_00001
2	0.6699	0.001499	0.25	535	0.619657	0.000000	0.7449	1400.202	88.603	1694036856	...	41.133609	57498	ip-10-0-27-125	10.0.27.125	41.133609	1	checkpoint_000000	0.00200	4	e1825_00002
3	0.6073	0.000000	3.25	2136	0.619694	0.000000	0.6329	1648.039	104.286	1694036942	...	126.699238	57499	ip-10-0-27-125	10.0.27.125	126.699238	4	checkpoint_000003	0.02000	4	e1825_00003
0	0.1934	0.000000	3.25	2136	0.747960	0.520756	0.6530	1597.187	101.068	1694036944	...	128.443495	57496	ip-10-0-27-125	10.0.27.125	128.443495	4	checkpoint_000003	0.00002	4	e1825_00000

4 rows × 26 columns
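
Note that the `learning_rate` column reports the scheduler's rate at the last logged step, not the initial value shown in `config/train_loop_config/learning_rate`. The logged values are consistent with a linear decay to zero, without warmup, over the 2136 optimization steps. The sketch below reproduces them under that assumption; the helper name `linear_decay_lr` is hypothetical and not part of Transformers or Ray.

```python
def linear_decay_lr(initial_lr: float, step: int, total_steps: int) -> float:
    """Learning rate after `step` steps of linear decay to zero (assumes no warmup)."""
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / total_steps


TOTAL_STEPS = 2136  # "Total optimization steps" in the training log

# (initial rate from the config column, step at which the row was reported)
table_rows = [(2e-4, 535), (2e-3, 535), (2e-2, 2136), (2e-5, 2136)]

for initial_lr, step in table_rows:
    current = linear_decay_lr(initial_lr, step, TOTAL_STEPS)
    print(f"initial={initial_lr:g} step={step} current={current:.6f}")
```

Trials stopped early at step 535 therefore show roughly three quarters of their initial rate, while trials that ran all 2136 steps show a rate of zero.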

best_result = tune_results.get_best_result()
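
`get_best_result()` also accepts optional `metric` and `mode` arguments. That matters here because, in the table above, the lowest `eval_loss` and the highest `eval_matthews_correlation` belong to different trials. A minimal sketch of that comparison in plain Python, using values copied from the table (the `trials` list and the `best_trial` helper are illustrative and not Ray APIs):

```python
# Last reported metrics per trial, copied from the dataframe above.
trials = [
    {"trial": "e1825_00000", "eval_loss": 0.747960, "eval_matthews_correlation": 0.520756},
    {"trial": "e1825_00001", "eval_loss": 0.618135, "eval_matthews_correlation": 0.000000},
    {"trial": "e1825_00002", "eval_loss": 0.619657, "eval_matthews_correlation": 0.000000},
    {"trial": "e1825_00003", "eval_loss": 0.619694, "eval_matthews_correlation": 0.000000},
]


def best_trial(rows, metric, mode):
    """Return the row with the smallest (mode="min") or largest (mode="max") metric."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    pick = min if mode == "min" else max
    return pick(rows, key=lambda row: row[metric])


by_loss = best_trial(trials, "eval_loss", "min")
by_mcc = best_trial(trials, "eval_matthews_correlation", "max")
print("lowest eval_loss:", by_loss["trial"])
print("highest Matthews correlation:", by_mcc["trial"])
```

With Ray Tune, the second selection would correspond to `tune_results.get_best_result(metric="eval_matthews_correlation", mode="max")`. Which criterion is appropriate depends on your goal; the trials with the lowest loss here report a Matthews correlation of zero.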

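Share the model#

To share the model with the community, a few more steps are needed.

You ran the training on the Ray cluster, but you share the model from your local environment. This setup makes authentication straightforward.

First, store your authentication token from the Hugging Face website. If you don't have an account yet, sign up there first. Then execute the following cell and enter your credentials when prompted: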

from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS. Uncomment the following instruction:

# !apt install git-lfs

Load the model from the best-performing checkpoint:

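import os

from ray.train import Checkpoint
from transformers import AutoModelForSequenceClassification

checkpoint: Checkpoint = best_result.checkpoint

# Materialize the checkpoint as a local directory, then load the saved model from it.
with checkpoint.as_directory() as checkpoint_dir:
    # The Hugging Face model files are in the "checkpoint" subdirectory.
    checkpoint_path = os.path.join(checkpoint_dir, "checkpoint")
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path)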
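
If `from_pretrained` fails, it can help to check that the directory contains the files the training log reports saving. The sketch below uses a hypothetical helper, `missing_checkpoint_files`, which is not part of Ray or Transformers. It assumes the file names `config.json` and `pytorch_model.bin` from the log; other Transformers versions may write the weights under a different name, so adjust the list to your setup.

```python
import os
import tempfile

# File names taken from the "saved in ..." lines of the training log
# (assumption: your Transformers version writes the same names).
EXPECTED_FILES = ("config.json", "pytorch_model.bin")


def missing_checkpoint_files(path, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `path`."""
    return [name for name in expected if not os.path.isfile(os.path.join(path, name))]


# Demonstration on a throwaway directory that contains only a config file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "config.json"), "w").close()
    print("missing:", missing_checkpoint_files(demo_dir))
```

In the notebook you would call `missing_checkpoint_files(checkpoint_path)` inside the `with checkpoint.as_directory()` block, before loading the model.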

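You can now upload the result of the training to the Hub. Execute the following instruction; if your Transformers version requires a repository name, pass it as the first argument: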

model.push_to_hub()

You can now share this model. Others can load it with the identifier "your-username/the-name-you-picked". For example:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model")

See also#

Ray Train Examples for more use cases
Ray Train User Guides for how-to guides