Distributed and Parallel Training Tutorials¶

Created On: Oct 04, 2022 | Last Updated: Oct 31, 2024 | Last Verified: Nov 05, 2024

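Distributed training is a model training paradigm that spreads the training workload across multiple worker nodes, which can significantly improve training speed and, in some cases, model accuracy. While distributed training can be used for any type of ML model training, it is most beneficial for large models and compute-intensive tasks such as deep learning.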

There are several ways to perform distributed training in PyTorch, each with its own advantages in certain use cases:

DistributedDataParallel (DDP)
Fully Sharded Data Parallel (FSDP)
Tensor Parallel (TP)
Device Mesh
Remote Procedure Call (RPC) distributed training
Custom Extensions

Read more about these options in the Distributed Overview.

Learn DDP¶

DDP Intro Video Tutorials

A step-by-step video series on how to get started with DistributedDataParallel and advance to more complex topics

Code, Video

https://pytorch.org/tutorials/beginner/ddp_series_intro.html?utm_source=distr_landing&utm_medium=ddp_series_intro

Getting Started with Distributed Data Parallel

This tutorial provides a short and gentle introduction to PyTorch DistributedDataParallel.

Code

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html?utm_source=distr_landing&utm_medium=intermediate_ddp_tutorial
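
As a rough illustration of what the DistributedDataParallel tutorial covers, the sketch below wraps a small model in DDP and runs a few optimizer steps. It is a minimal sketch, not the tutorial's code: it assumes a CPU-only machine, the `gloo` backend, and a single process (`world_size=1`) so that it can be smoke-tested without GPUs. The names `find_free_port` and `train_ddp` are hypothetical helpers introduced only for this example.

```python
import socket

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (hypothetical helper for this sketch)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def train_ddp(rank: int, world_size: int, init_method: str, steps: int = 5) -> list:
    """Run a few SGD steps on a DDP-wrapped linear model and return the losses."""
    dist.init_process_group(
        backend="gloo", init_method=init_method, rank=rank, world_size=world_size
    )
    try:
        torch.manual_seed(0)  # same synthetic data on every rank in this sketch
        model = DDP(nn.Linear(10, 1))  # CPU module, so device_ids is left unset
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        inputs, targets = torch.randn(16, 10), torch.randn(16, 1)
        losses = []
        for _ in range(steps):
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(inputs), targets)
            loss.backward()  # DDP averages gradients across ranks during backward
            optimizer.step()
            losses.append(loss.item())
        return losses
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    # Single-process smoke test; real jobs start one process per device.
    print(train_ddp(0, 1, f"tcp://127.0.0.1:{find_free_port()}"))
```

In a real job each rank would run in its own process, typically started by a launcher such as `torchrun`, and each rank would load a different shard of the data.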

Distributed Training with Uneven Inputs Using the Join Context Manager

This tutorial describes the Join context manager and demonstrates its use with DistributedDataParallel.

Code

https://pytorch.org/tutorials/advanced/generic_join.html?utm_source=distr_landing&utm_medium=generic_join
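
The pattern that tutorial describes can be sketched roughly as follows: each rank gets a different number of batches, and the training loop runs inside `Join([model])` so that ranks which finish early keep shadowing the collective calls of the ranks still training. This is a minimal sketch under stated assumptions: CPU, the `gloo` backend, and hypothetical helper names (`find_free_port`, `train_with_join`). With a single process the inputs are trivially even, so it only exercises the API; the uneven case appears once several ranks are launched.

```python
import socket

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (hypothetical helper for this sketch)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def train_with_join(rank: int, world_size: int, init_method: str,
                    base_batches: int = 3) -> int:
    """Train on (base_batches + rank) batches and return how many were processed."""
    dist.init_process_group(
        backend="gloo", init_method=init_method, rank=rank, world_size=world_size
    )
    try:
        torch.manual_seed(0)
        model = DDP(nn.Linear(4, 1))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        # Higher ranks receive more batches, which makes the inputs uneven.
        batches = [torch.randn(2, 4) for _ in range(base_batches + rank)]
        processed = 0
        with Join([model]):
            for batch in batches:
                optimizer.zero_grad()
                model(batch).sum().backward()
                optimizer.step()
                processed += 1
        return processed
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(train_with_join(0, 1, f"tcp://127.0.0.1:{find_free_port()}"))
```

Without the context manager, a rank that runs out of data would stop participating in gradient synchronization, and the remaining ranks would hang or error while waiting for it.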

Learn FSDP¶

Getting Started with FSDP

This tutorial demonstrates how to perform distributed training with FSDP on the MNIST dataset.

Code

https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_getting_started

FSDP Advanced

In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5 model with FSDP for text summarization.

Code

https://pytorch.org/tutorials/intermediate/FSDP_advanced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced

Learn Tensor Parallel (TP)¶

Large Scale Transformer model training with Tensor Parallel (TP)

This tutorial demonstrates how to train a large Transformer-like model across hundreds to thousands of GPUs using Tensor Parallel and Fully Sharded Data Parallel.

Code

https://pytorch.org/tutorials/intermediate/TP_tutorial.html

Learn DeviceMesh¶

Getting Started with DeviceMesh

In this tutorial, you will learn about DeviceMesh and how it can help with distributed training.

Code

https://pytorch.org/tutorials/recipes/distributed_device_mesh.html?highlight=devicemesh

Learn RPC¶

Getting Started with Distributed RPC Framework

This tutorial demonstrates how to get started with RPC-based distributed training.

Code

https://pytorch.org/tutorials/intermediate/rpc_tutorial.html?utm_source=distr_landing&utm_medium=rpc_getting_started
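
For orientation, the basic shape of an RPC call can be sketched as below: a worker joins the RPC group with `rpc.init_rpc`, runs a function on a named worker with `rpc.rpc_sync` (blocking) or `rpc.rpc_async` (returns a future), and then calls `rpc.shutdown`. This is a minimal sketch under stated assumptions: a single worker named `"worker0"` that calls itself, the TensorPipe backend, and a platform where `torch.distributed.rpc` is available. `find_free_port` and `add_over_rpc` are hypothetical helpers, not part of the tutorial.

```python
import socket

import torch
import torch.distributed.rpc as rpc


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (hypothetical helper for this sketch)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def add_over_rpc() -> tuple:
    """Start a one-worker RPC group, run two remote additions, and shut down."""
    options = rpc.TensorPipeRpcBackendOptions(
        init_method=f"tcp://127.0.0.1:{find_free_port()}"
    )
    rpc.init_rpc("worker0", rank=0, world_size=1, rpc_backend_options=options)
    try:
        # Blocking call: returns the result of torch.add run on "worker0".
        sync_result = rpc.rpc_sync("worker0", torch.add, args=(torch.ones(2), 1))
        # Non-blocking call: returns a future that is resolved with wait().
        future = rpc.rpc_async("worker0", torch.add, args=(torch.ones(2), 2))
        async_result = future.wait()
    finally:
        rpc.shutdown()
    return sync_result, async_result
```

Calling `add_over_rpc()` returns the two result tensors. In a multi-worker job each process would call `rpc.init_rpc` with its own name and rank, and the destination name passed to `rpc_sync` or `rpc_async` would refer to a different process.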

Implementing a Parameter Server Using Distributed RPC Framework

This tutorial walks you through a simple example of implementing a parameter server using PyTorch's Distributed RPC framework.

Code

https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html?utm_source=distr_landing&utm_medium=rpc_param_server_tutorial

Implementing Batch RPC Processing Using Asynchronous Executions

In this tutorial, you will build batch-processing RPC applications with the @rpc.functions.async_execution decorator.

Code

https://pytorch.org/tutorials/intermediate/rpc_async_execution.html?utm_source=distr_landing&utm_medium=rpc_async_execution

Combining Distributed DataParallel with Distributed RPC Framework

In this tutorial, you will learn how to combine distributed data parallelism with distributed model parallelism.

Code

https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html?utm_source=distr_landing&utm_medium=rpc_plus_ddp

Custom Extensions¶

Customize Process Group Backends Using Cpp Extensions

In this tutorial, you will learn how to implement a custom ProcessGroup backend and plug it into the PyTorch distributed package using cpp extensions.

Code

https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html?utm_source=distr_landing&utm_medium=custom_extensions_cpp