(beta) Running the compiled optimizer with an LR Scheduler

Created On: May 21, 2024 | Last Updated: May 21, 2024 | Last Verified: Nov 05, 2024

Author: Michael Lazos

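The optimizer is a key algorithm for training any deep learning model. In this example, we will show how to pair an optimizer that has been compiled using torch.compile with an LR scheduler to accelerate training convergence.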

Note

This tutorial requires PyTorch 2.3.0 or later.

Model Setup

For this example, we will use a simple sequence of linear layers.

import torch

# Create simple model
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
input = torch.rand(1024, device="cuda")

# run forward pass
output = model(input)

# run backward to populate the grads for our optimizer below
output.sum().backward()

Setting up and running the compiled optimizer with an LR Scheduler

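In this section, we will use the Adam optimizer with the LinearLR scheduler and create a helper function that wraps the step() calls of both so that torch.compile() handles them together.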

Note

torch.compile is only supported on CUDA devices with compute capability 7.0 or higher.

# exit cleanly if we are on a device that doesn't support ``torch.compile``
if torch.cuda.get_device_capability() < (7, 0):
    print("Exiting because torch.compile is not supported on this device.")
    import sys
    sys.exit(0)

# !!! IMPORTANT !!! Wrap the lr in a Tensor if we are pairing
# the optimizer with an LR Scheduler.
# Without this, torch.compile will recompile as the value of the LR
# changes.
opt = torch.optim.Adam(model.parameters(), lr=torch.tensor(0.01))
sched = torch.optim.lr_scheduler.LinearLR(opt, total_iters=5)

@torch.compile(fullgraph=False)
def fn():
    opt.step()
    sched.step()


# Warmup runs to compile the function
for _ in range(5):
    fn()
    print(opt.param_groups[0]["lr"])

tensor(0.0047)
tensor(0.0060)
tensor(0.0073)
tensor(0.0087)
tensor(0.0100)
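
The printed values follow a linear ramp up to the base learning rate. The sketch below reproduces that ramp in plain Python, without PyTorch. It assumes the scheduler's defaults of start_factor=1/3 and end_factor=1.0, which the tutorial does not state explicitly, and `linear_lr` is a hypothetical helper written for illustration, not part of the PyTorch API.

```python
def linear_lr(base_lr, step, start_factor=1.0 / 3, end_factor=1.0, total_iters=5):
    """Closed-form linear LR ramp (illustrative helper, not a PyTorch function).

    Assumes start_factor=1/3 and end_factor=1.0; the factor is interpolated
    linearly over total_iters steps and then held constant.
    """
    progress = min(step, total_iters) / total_iters
    factor = start_factor + (end_factor - start_factor) * progress
    return base_lr * factor


# Step 0 is the LR before any scheduler step; steps 1..5 correspond to the
# values printed after each call to fn() above.
schedule = [linear_lr(0.01, t) for t in range(6)]
print([round(lr, 4) for lr in schedule])
# [0.0033, 0.0047, 0.006, 0.0073, 0.0087, 0.01]
```

Because each print happens after sched.step() has already run, the first printed value is the step-1 learning rate (about 0.0047) rather than the initial 0.0033.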

Extension: What happens with a non-tensor LR?

For the curious, we will show how to peek into what happens with torch.compile when we do not wrap the LR in a tensor.

# No longer wrap the LR in a tensor here
opt = torch.optim.Adam(model.parameters(), lr=0.01)
sched = torch.optim.lr_scheduler.LinearLR(opt, total_iters=5)

@torch.compile(fullgraph=False)
def fn():
    opt.step()
    sched.step()

# Setup logging to view recompiles
torch._logging.set_logs(recompiles=True)

# Warmup runs to compile the function
# We will now recompile on each iteration
# as the value of the lr is mutated.
for _ in range(5):
    fn()

[rank0]:V0102 22:22:41.447000 624 torch/_dynamo/guards.py:2813] [7/2] [__recompiles] Recompiling function step in /usr/local/lib/python3.10/dist-packages/torch/optim/adam.py:189
[rank0]:V0102 22:22:41.447000 624 torch/_dynamo/guards.py:2813] [7/2] [__recompiles]     triggered by the following guard failure(s):
[rank0]:V0102 22:22:41.447000 624 torch/_dynamo/guards.py:2813] [7/2] [__recompiles]     - 7/1: L['self'].param_groups[0]['lr'] == 0.003333333333333333
[rank0]:V0102 22:22:43.829000 624 torch/_dynamo/guards.py:2813] [7/3] [__recompiles] Recompiling function step in /usr/local/lib/python3.10/dist-packages/torch/optim/adam.py:189
[rank0]:V0102 22:22:43.829000 624 torch/_dynamo/guards.py:2813] [7/3] [__recompiles]     triggered by the following guard failure(s):
[rank0]:V0102 22:22:43.829000 624 torch/_dynamo/guards.py:2813] [7/3] [__recompiles]     - 7/2: L['self'].param_groups[0]['lr'] == 0.004666666666666667
[rank0]:V0102 22:22:43.829000 624 torch/_dynamo/guards.py:2813] [7/3] [__recompiles]     - 7/1: L['self'].param_groups[0]['lr'] == 0.003333333333333333
[rank0]:V0102 22:22:46.219000 624 torch/_dynamo/guards.py:2813] [7/4] [__recompiles] Recompiling function step in /usr/local/lib/python3.10/dist-packages/torch/optim/adam.py:189
[rank0]:V0102 22:22:46.219000 624 torch/_dynamo/guards.py:2813] [7/4] [__recompiles]     triggered by the following guard failure(s):
[rank0]:V0102 22:22:46.219000 624 torch/_dynamo/guards.py:2813] [7/4] [__recompiles]     - 7/3: L['self'].param_groups[0]['lr'] == 0.006000000000000001
[rank0]:V0102 22:22:46.219000 624 torch/_dynamo/guards.py:2813] [7/4] [__recompiles]     - 7/2: L['self'].param_groups[0]['lr'] == 0.004666666666666667
[rank0]:V0102 22:22:46.219000 624 torch/_dynamo/guards.py:2813] [7/4] [__recompiles]     - 7/1: L['self'].param_groups[0]['lr'] == 0.003333333333333333
[rank0]:V0102 22:22:48.611000 624 torch/_dynamo/guards.py:2813] [7/5] [__recompiles] Recompiling function step in /usr/local/lib/python3.10/dist-packages/torch/optim/adam.py:189
[rank0]:V0102 22:22:48.611000 624 torch/_dynamo/guards.py:2813] [7/5] [__recompiles]     triggered by the following guard failure(s):
[rank0]:V0102 22:22:48.611000 624 torch/_dynamo/guards.py:2813] [7/5] [__recompiles]     - 7/4: L['self'].param_groups[0]['lr'] == 0.007333333333333335
[rank0]:V0102 22:22:48.611000 624 torch/_dynamo/guards.py:2813] [7/5] [__recompiles]     - 7/3: L['self'].param_groups[0]['lr'] == 0.006000000000000001
[rank0]:V0102 22:22:48.611000 624 torch/_dynamo/guards.py:2813] [7/5] [__recompiles]     - 7/2: L['self'].param_groups[0]['lr'] == 0.004666666666666667
[rank0]:V0102 22:22:48.611000 624 torch/_dynamo/guards.py:2813] [7/5] [__recompiles]     - 7/1: L['self'].param_groups[0]['lr'] == 0.003333333333333333

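With this example, we can see that the optimizer is recompiled several times because the guard on the lr in param_groups[0] fails: each guard pins the exact float value of the learning rate, so every new value the scheduler writes invalidates the previously compiled code.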
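
To make the guard behavior more concrete, here is a toy model in plain Python. It is only a sketch of the idea and not how TorchDynamo is implemented: `ToyGuardCache` is a hypothetical class that keys compiled code on the exact value of a float LR, but only on fixed metadata (a stand-in for dtype and shape) when the LR is a tensor.

```python
class ToyGuardCache:
    """Toy stand-in for a guard-based compile cache (not Dynamo's real code)."""

    def __init__(self):
        self.compiled = {}
        self.compile_count = 0

    def call(self, lr, lr_is_tensor):
        # Assumption for illustration: a Python float is specialized on its
        # exact value, while a tensor is guarded only on its metadata.
        if lr_is_tensor:
            key = ("tensor", "float32", ())
        else:
            key = ("float", lr)
        if key not in self.compiled:
            # Cache miss: no existing guard matches, so we "compile" again.
            self.compile_count += 1
            self.compiled[key] = self.compile_count
        return self.compiled[key]


# Five distinct learning rates, one per call, as a linear ramp would produce.
lrs = [0.01 * (1 / 3 + (2 / 3) * t / 5) for t in range(5)]

float_cache = ToyGuardCache()
tensor_cache = ToyGuardCache()
for lr in lrs:
    float_cache.call(lr, lr_is_tensor=False)
    tensor_cache.call(lr, lr_is_tensor=True)

print(float_cache.compile_count, tensor_cache.compile_count)  # 5 1
```

In this toy model the float LR triggers one compilation per distinct value, while the tensor LR compiles once and is reused, which mirrors why wrapping the LR in a tensor avoids the recompiles shown in the log.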

Conclusion

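In this tutorial, we showed how to pair an optimizer compiled with torch.compile with an LR Scheduler to accelerate training convergence. We used a model consisting of a simple sequence of linear layers, with the Adam optimizer paired with a LinearLR scheduler, to demonstrate the LR changing across iterations.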

See also:

Compiled optimizer tutorial - an introduction to the compiled optimizer.
Compiling the optimizer with PT2 - deeper technical details on the compiled optimizer.

Total running time of the script: ( 0 minutes 15.547 seconds)
