Tutorial: AI Training

This example uses SkyPilot to train a GPT-like model (inspired by Karpathy's minGPT) in PyTorch with Distributed Data Parallel (DDP).

We define a task YAML that specifies the resource requirements, the setup commands, and the commands to run:

# train.yaml

name: minGPT-ddp

resources:
    cpus: 4+
    accelerators: L4:4  # Or A100:8, H100:8

# Optional: upload a working directory to remote ~/sky_workdir.
# Commands in "setup" and "run" will be executed under it.
#
# workdir: .

# Optional: upload local files.
# Format:
#   /remote/path: /local/path
#
# file_mounts:
#   ~/.vimrc: ~/.vimrc
#   ~/.netrc: ~/.netrc

setup: |
    git clone --depth 1 https://github.com/pytorch/examples || true
    cd examples
    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
    uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
    cd examples/mingpt
    export LOGLEVEL=INFO

    echo "Starting minGPT-ddp training"

    torchrun \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    main.py

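Tip

In the YAML, the workdir and file_mounts fields are commented out. To learn how to use them to sync a local directory or local files to the cluster, or to mount object store buckets (S3, GCS, R2), see Syncing Code and Artifacts.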
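
As a rough sketch of what enabling these fields might look like, the fragment below uploads the current directory as the working directory, copies one local file, and mounts a bucket on the cluster. The bucket name `my-mingpt-checkpoints` and the mount path `/checkpoints` are hypothetical placeholders, not values from this tutorial, and the training script would still need to be pointed at that path on its own. Treat Syncing Code and Artifacts as the authoritative reference for these fields.

```yaml
# Fragment of train.yaml (illustrative only).

# Upload the current local directory to ~/sky_workdir on the cluster.
workdir: .

file_mounts:
    # Format: /remote/path: /local/path
    ~/.netrc: ~/.netrc

    # Assumption: a bucket mounted at a hypothetical path for checkpoints.
    /checkpoints:
        name: my-mingpt-checkpoints  # hypothetical bucket name
        store: s3
        mode: MOUNT
```

Note that once workdir is set, the setup and run commands execute under ~/sky_workdir, so relative paths such as `cd examples` resolve from there.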

Tip

SkyPilot automatically sets the SKYPILOT_NUM_GPUS_PER_NODE environment variable to the number of GPUs on each node. See Secrets and Environment Variables for more information.
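
To illustrate how this variable combines with SkyPilot's other per-node variables, here is a hedged sketch of a multi-node variant of the run section. It is not part of this tutorial's single-node task. The sketch assumes two nodes, assumes port 8008 is free on the head node (an arbitrary choice), and assumes main.py runs unmodified under a multi-node torchrun launch.

```yaml
# Fragment of a multi-node task YAML (illustrative only).
num_nodes: 2  # assumption: two nodes; the tutorial above uses one

run: |
    cd examples/mingpt
    export LOGLEVEL=INFO

    # SKYPILOT_NODE_IPS lists one IP per line; the first is the head node.
    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)

    torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    main.py
```

Because the run commands execute on every node, each node evaluates the same script with its own SKYPILOT_NODE_RANK, so no per-node YAML is needed.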

Then, launch training:

$ sky launch -c mingpt train.yaml

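This provisions the cheapest cluster that satisfies the required resources, executes the setup commands, and then executes the run commands.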

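Once the training job starts running, you can safely press Ctrl-C to detach from the log stream; the job keeps running remotely on the cluster. To stop the job, use the sky cancel command (see the CLI reference).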

After training, transfer artifacts such as logs and checkpoints using familiar tools.

Tip

Feel free to copy and paste the YAML above and customize it for your own projects.