Source code for torch.amp.autocast_mode

```python
import functools
import warnings

from typing import Any, Optional

import torch
from torch.types import _dtype

__all__ = ["autocast_decorator", "autocast"]


def autocast_decorator(autocast_instance, func):
    @functools.wraps(func)
    def decorate_autocast(*args, **kwargs):
        with autocast_instance:
            return func(*args, **kwargs)

    decorate_autocast.__script_unsupported = "@autocast() decorator is not supported in script mode"  # type: ignore[attr-defined]
    return decorate_autocast


class autocast:
    r"""
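    Instances of :class:`autocast` serve as context managers or decorators that
    allow regions of your script to run in mixed precision.

    In these regions, ops run in an op-specific dtype chosen by autocast
    to improve performance while maintaining accuracy.
    See the :ref:`Autocast Op Reference` for details.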

    When entering an autocast-enabled region, tensors may be any type.
    Do not call ``half()`` or ``bfloat16()`` on your model or inputs when using autocasting.

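    :class:`autocast` should wrap only the forward pass of your network, including the loss computation.
    Running the backward pass under autocast is not recommended.
    Backward ops run in the same type that autocast used for the corresponding forward ops.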

    Example for CUDA devices::

        # Create the model and optimizer in default precision
        model = Net().cuda()
        optimizer = optim.SGD(model.parameters(), ...)

        for input, target in data:
            optimizer.zero_grad()

            # Enable autocasting for the forward pass (model + loss)
            with torch.autocast(device_type="cuda"):
                output = model(input)
                loss = loss_fn(output, target)

            # Exit the context manager before the backward pass
            loss.backward()
            optimizer.step()

    See the :ref:`CUDA Automatic Mixed Precision examples` for usage (along with gradient scaling)
    in more complex scenarios (e.g., gradient penalty, multiple models/losses, custom autograd functions).

    :class:`autocast` can also be used as a decorator, e.g., on the ``forward`` method of your model::

        class AutocastModel(nn.Module):
            ...
            @torch.autocast(device_type="cuda")
            def forward(self, input):
                ...

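    Floating-point tensors produced in an autocast-enabled region may be ``float16``.
    After returning to an autocast-disabled region, using them with floating-point
    tensors of different dtypes may cause type mismatch errors.
    If so, cast the tensors produced in the autocast region back to ``float32`` (or another dtype if desired).
    If a tensor from the autocast region is already ``float32``, the cast is a no-op
    and incurs no additional overhead.
    CUDA example::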

        # Create some tensors in the default dtype (assumed here to be float32)
        a_float32 = torch.rand((8, 8), device="cuda")
        b_float32 = torch.rand((8, 8), device="cuda")
        c_float32 = torch.rand((8, 8), device="cuda")
        d_float32 = torch.rand((8, 8), device="cuda")

        with torch.autocast(device_type="cuda"):
            # torch.mm is on autocast's list of ops that should run in float16.
            # The inputs are float32, but the op runs in float16 and produces float16 output.
            # No manual casts are required.
            e_float16 = torch.mm(a_float32, b_float32)
            # Mixed input types are handled as well
            f_float16 = torch.mm(d_float32, e_float16)

        # After exiting autocast, call f_float16.float() so the result can be used with d_float32
        g_float32 = torch.mm(d_float32, f_float16.float())

    CPU training example::

        # Create the model and optimizer in default precision
        model = Net()
        optimizer = optim.SGD(model.parameters(), ...)

        for epoch in epochs:
            for input, target in data:
                optimizer.zero_grad()

                # Run the forward pass with autocasting.
                with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
                    output = model(input)
                    loss = loss_fn(output, target)

                loss.backward()
                optimizer.step()


    CPU inference example::

        # Create the model in default precision
        model = Net().eval()

        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            for input in data:
                # Run the forward pass with autocasting.
                output = model(input)

    CPU inference example with JIT trace::

        class TestModel(nn.Module):
            def __init__(self, input_size, num_classes):
                super().__init__()
                self.fc1 = nn.Linear(input_size, num_classes)
            def forward(self, x):
                return self.fc1(x)

        input_size = 2
        num_classes = 2
        model = TestModel(input_size, num_classes).eval()

        # For now, we suggest disabling the JIT autocast pass
        # because of this issue: https://github.com/pytorch/pytorch/issues/75956
        torch._C._jit_set_autocast_mode(False)

        with torch.cpu.amp.autocast(cache_enabled=False):
            model = torch.jit.trace(model, torch.randn(1, input_size))
        model = torch.jit.freeze(model)
        # Run the model
        for _ in range(3):
            model(torch.randn(1, input_size))

    Type mismatch errors *in* an autocast-enabled region are a bug; if this is what you observe,
    please file an issue.

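    ``autocast(enabled=False)`` subregions can be nested in autocast-enabled regions.
    Locally disabling autocast can be useful, for example, if you want to force a subregion
    to run in a particular ``dtype``.
    Disabling autocast gives you explicit control over the execution type.
    In the subregion, inputs from the surrounding region should be cast to ``dtype`` before use::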

        # Create some tensors in the default dtype (assumed here to be float32)
        a_float32 = torch.rand((8, 8), device="cuda")
        b_float32 = torch.rand((8, 8), device="cuda")
        c_float32 = torch.rand((8, 8), device="cuda")
        d_float32 = torch.rand((8, 8), device="cuda")

        with torch.autocast(device_type="cuda"):
            e_float16 = torch.mm(a_float32, b_float32)
            with torch.autocast(device_type="cuda", enabled=False):
                # Call e_float16.float() to ensure float32 execution
                # (necessary because e_float16 was created in an autocast region)
                f_float32 = torch.mm(c_float32, e_float16.float())

            # No manual casts are required when re-entering the autocast-enabled region.
            # torch.mm again runs in float16 and produces float16 output, regardless of input types.
            g_float16 = torch.mm(d_float32, f_float32)

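    The autocast state is thread-local.
    If you want it enabled in a new thread, the context manager or decorator must be invoked in that thread.
    This affects :class:`torch.nn.DataParallel` and :class:`torch.nn.parallel.DistributedDataParallel`
    when used with more than one GPU per process (see :ref:`Working with Multiple GPUs`).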
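
    A minimal CPU sketch of the thread-local behavior, assuming ``torch.mm`` is autocast-eligible
    on CPU. The worker function names and the ``results`` dict are hypothetical and exist only
    for illustration. A thread started inside an autocast-enabled region does not inherit the
    autocast state, so it has to enter its own context::

        import threading
        import torch

        a_float32 = torch.rand((8, 8))
        b_float32 = torch.rand((8, 8))
        results = {}

        def worker_without_autocast():
            # No autocast context is entered in this thread, so torch.mm
            # is expected to run in float32 here.
            results["plain"] = torch.mm(a_float32, b_float32).dtype

        def worker_with_autocast():
            # The context manager is invoked inside the thread, so torch.mm
            # is expected to run in bfloat16 here.
            with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
                results["autocast"] = torch.mm(a_float32, b_float32).dtype

        # Autocast is enabled in the main thread only.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            for target in (worker_without_autocast, worker_with_autocast):
                thread = threading.Thread(target=target)
                thread.start()
                thread.join()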

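    Args:
        device_type(str, required):  Device type to use. Possible values are: 'cuda', 'cpu', 'xpu' and 'hpu'.
                                     The type is the same as the `type` attribute of a :class:`torch.device`.
                                     Thus, you may obtain the device type of a tensor using `Tensor.device.type`.
        enabled(bool, optional):  Whether autocasting should be enabled in the region.
            Default: ``True``
        dtype(torch_dtype, optional):  Whether to use torch.float16 or torch.bfloat16.
        cache_enabled(bool, optional):  Whether the weight cache inside autocast should be enabled.
            Default: ``True``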
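
    A minimal CPU sketch of taking ``device_type`` from an existing tensor rather than
    hard-coding it. The tensor names are hypothetical, and ``torch.bfloat16`` is chosen
    here only because it is the dtype used in the CPU examples above::

        import torch

        # x is created on the CPU, so x.device.type is "cpu".
        x = torch.rand((8, 8))

        with torch.autocast(device_type=x.device.type, dtype=torch.bfloat16):
            # torch.mm is autocast-eligible, so y is expected to be bfloat16.
            y = torch.mm(x, x)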
    """

    def __init__(
        self,
        device_type: str,
        dtype: Optional[_dtype] = None,
        enabled: bool <span