Distributed Multi-Node Jobs

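SkyPilot supports provisioning multi-node clusters and running distributed jobs across their nodes.

For example, here is a simple task that trains a GPT-like model (inspired by Karpathy's minGPT) across two nodes using Distributed Data Parallel (DDP) in PyTorch.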

name: minGPT-ddp

resources:
    accelerators: A100:8

num_nodes: 2

setup: |
    git clone --depth 1 https://github.com/pytorch/examples || true
    cd examples
    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
    uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
    cd examples/mingpt
    export LOGLEVEL=INFO

    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    echo "Starting distributed training, head node: $MASTER_ADDR"

    torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --node_rank=${SKYPILOT_NODE_RANK} \
    --master_port=8008 \
    main.py

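In the above,

num_nodes: 2 specifies that this task runs on 2 nodes, each with 8 A100 GPUs;
the run section uses common environment variables (the SKYPILOT_* variables) to launch distributed training; they are explained below.

Note

If you encounter the error [Errno 24] Too many open files, your process has exceeded the maximum number of open file descriptors allowed by the system. This typically happens in high-load scenarios, such as launching a large number of nodes (for example, 100).

To resolve this, run the following command and then retry: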

ulimit -n 65535
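
As a minimal sketch, the command above can be wrapped so that the limit is raised only when the current value is lower. The function name raise_nofile_limit is illustrative and not part of SkyPilot; it assumes it is run in the same shell session in which the error occurred.

```bash
# Illustrative helper (not part of SkyPilot): raise the soft open-file limit
# to the target only when the current limit is lower.
raise_nofile_limit() {
  local target="$1" current
  current=$(ulimit -n)
  if [ "$current" = "unlimited" ] || [ "$current" -ge "$target" ]; then
    return 0  # Already high enough; nothing to do.
  fi
  # Fails (non-zero status) if the hard limit is below the target.
  ulimit -n "$target" 2>/dev/null
}

if raise_nofile_limit 65535; then
  echo "open-file limit: $(ulimit -n)"
else
  echo "could not raise the limit to 65535 (hard limit: $(ulimit -Hn))" >&2
fi
```

If the hard limit is lower than the target, the soft limit cannot be raised from an unprivileged shell, and the sketch reports that instead of failing silently.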

You can find more distributed training examples (including one that uses the rdzv backend for PyTorch) in our GitHub repository.

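Environment variables

SkyPilot exposes these environment variables, which can be accessed in a task's run commands:

SKYPILOT_NODE_RANK: Rank of the node executing the task (an integer ID from 0 to num_nodes-1).
SKYPILOT_NODE_IPS: A string containing the IP addresses of the nodes reserved to execute the task, one IP address per line.
SKYPILOT_NUM_NODES: Number of nodes reserved for the task, which can be specified with num_nodes: <n>. Same value as echo "$SKYPILOT_NODE_IPS" | wc -l.
SKYPILOT_NUM_GPUS_PER_NODE: Number of GPUs reserved on each node to execute the task; the same as the count in accelerators: <name>:<count> (rounded up if it is a fraction).

See SkyPilot environment variables for more details.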
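
As a minimal sketch of how these variables are typically combined inside run commands, the helpers below derive the head node's IP and the total number of processes when running one process per GPU. The function names head_ip and world_size are illustrative and not part of SkyPilot; the fallback values are assumptions that only let the snippet run outside a SkyPilot task.

```bash
# Illustrative helpers (not part of SkyPilot).

# Print the first IP of a newline-separated list, i.e. the head node.
head_ip() {
  echo "$1" | head -n1
}

# Print nodes * GPUs-per-node: the total process count with one process per GPU.
world_size() {
  echo $(( $1 * $2 ))
}

# Fallbacks are assumptions for running this sketch outside a SkyPilot task.
NODE_IPS="${SKYPILOT_NODE_IPS:-$(printf '10.0.0.1\n10.0.0.2')}"
NUM_NODES="${SKYPILOT_NUM_NODES:-2}"
GPUS_PER_NODE="${SKYPILOT_NUM_GPUS_PER_NODE:-8}"

echo "head node: $(head_ip "$NODE_IPS")"
echo "world size: $(world_size "$NUM_NODES" "$GPUS_PER_NODE")"
```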

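Launching a multi-node task (new cluster)

When sky launch is used to launch a multi-node task on a new cluster, the following steps happen in sequence:

1. Nodes are provisioned. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. setup commands are executed on all nodes. (barrier)
4. run commands are executed on all nodes.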

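Launching a multi-node task (existing cluster)

When sky launch is used to launch a multi-node task on an existing cluster, the cluster may have more nodes than the current task's num_nodes requirement.

The following steps happen in sequence:

1. SkyPilot checks that the runtime on all nodes is up to date. (barrier)
2. Workdir/file_mounts are synced to all nodes. (barrier)
3. setup commands are executed on all nodes of the cluster. (barrier)
4. run commands are executed on the subset of nodes scheduled to execute the task, which may be fewer than the cluster size.

Tip

To skip rerunning the setup commands, use either sky launch --no-setup ... (performs steps 1, 2, and 4 above) or sky exec (performs step 2, workdir only, and step 4).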

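Executing a task on the head node only

To execute a task on the head node only (a common scenario for tools like mpirun), use the SKYPILOT_NODE_RANK environment variable as follows: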

...

num_nodes: <n>

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
      # Launch the head-only command here.
      :  # No-op placeholder so that the if-body is valid shell.
  fi
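
Tools such as mpirun usually also need the list of participating hosts on the head node. The following is a minimal sketch, assuming Open MPI's hostfile syntax (<host> slots=<n>); the function name make_hostfile and the program ./my_mpi_program are hypothetical, and the fallback values only let the snippet run outside a SkyPilot task.

```bash
# Illustrative helper (not part of SkyPilot): turn a newline-separated IP list
# into Open MPI hostfile lines of the form "<ip> slots=<n>".
make_hostfile() {
  echo "$1" | while read -r ip; do
    [ -n "$ip" ] && echo "$ip slots=$2"
  done
  return 0
}

# Fallbacks are assumptions for running this sketch outside a SkyPilot task.
NODE_IPS="${SKYPILOT_NODE_IPS:-$(printf '10.0.0.1\n10.0.0.2')}"
SLOTS="${SKYPILOT_NUM_GPUS_PER_NODE:-8}"

# Only the head node (rank 0) writes the hostfile and launches the program.
if [ "${SKYPILOT_NODE_RANK:-0}" = "0" ]; then
  HOSTFILE=$(mktemp)
  make_hostfile "$NODE_IPS" "$SLOTS" > "$HOSTFILE"
  cat "$HOSTFILE"
  # Hypothetical launch; uncomment on a cluster where mpirun is installed:
  # mpirun --hostfile "$HOSTFILE" ./my_mpi_program
fi
```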

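SSH into worker nodes

In addition to the head node, the SSH configurations for the worker nodes of a multi-node cluster are also added to ~/.ssh/config, as <cluster_name>-worker<n>. This allows you to SSH directly into the worker nodes when needed.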

# Assuming 3 nodes in a cluster named mycluster

# Head node.
$ ssh mycluster

# Worker nodes.
$ ssh mycluster-worker1
$ ssh mycluster-worker2
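
Because the aliases follow a fixed pattern, they can be generated in a loop, for example to run a quick check on every node. This is a minimal sketch; the function name cluster_hosts is hypothetical, and the cluster name and node count mirror the example above.

```bash
# Illustrative helper (not part of SkyPilot): print the SSH aliases of a cluster,
# given its name and its total number of nodes (head node included).
cluster_hosts() {
  local name="$1" num_nodes="$2" i=1
  echo "$name"                      # Head node.
  while [ "$i" -lt "$num_nodes" ]; do
    echo "$name-worker$i"           # Worker nodes: worker1 .. worker(n-1).
    i=$((i + 1))
  done
}

for host in $(cluster_hosts mycluster 3); do
  echo "would run: ssh $host hostname"
  # ssh "$host" hostname   # Uncomment to run against a real cluster.
done
```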

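Executing a distributed Ray program

To execute a distributed Ray program on multiple nodes, you can download the training script and launch the task YAML: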

$ wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/distributed_ray_train/train.py
$ sky launch ray_train.yaml

resources:
  accelerators: L4:2
  memory: 64+

num_nodes: 2

workdir: .

setup: |
  conda activate ray
  if [ $? -ne 0 ]; then
    conda create -n ray python=3.10 -y
    conda activate ray
  fi

  pip install "ray[train]"
  pip install tqdm
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

run: |
  sudo chmod 777 -R /var/tmp
  HEAD_IP=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    ps aux | grep ray | grep 6379 &> /dev/null || ray start --head  --disable-usage-stats --port 6379
    sleep 5
    python train.py --num-workers $SKYPILOT_NUM_NODES
  else
    sleep 5
    ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
  fi
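
The fixed sleep 5 above assumes the Ray head starts within five seconds. As a hedged alternative, worker nodes could poll the head's port before joining. This is a minimal sketch, not part of the original example: the function name wait_for_port is hypothetical, and it assumes bash (for /dev/tcp) and the GNU timeout command are available on the nodes.

```bash
# Illustrative helper (not part of SkyPilot or Ray): poll until host:port accepts
# TCP connections, giving up after the given number of attempts (about 1s apart).
wait_for_port() {
  local host="$1" port="$2" attempts="$3" i=0
  while [ "$i" -lt "$attempts" ]; do
    # bash opens a TCP connection when redirecting to /dev/tcp/<host>/<port>.
    if timeout 1 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example with assumed values; inside the task, HEAD_IP and port 6379 come from
# the YAML above.
if wait_for_port "${HEAD_IP:-127.0.0.1}" 6379 2; then
  echo "Ray head port is reachable"
else
  echo "Ray head port is not reachable yet"
fi
```

In the worker branch, such a check would replace sleep 5 before ray start --address is called.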

Warning

When using Ray, avoid calling ray stop, because it also stops the SkyPilot runtime.