
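(Prototype) MaskedTensor Overview

Created On: Oct 28, 2022 | Last Updated: Oct 28, 2022 | Last Verified: Not Verified

This tutorial is a starting point for using MaskedTensors and discusses their masking semantics.

MaskedTensor is an extension to torch.Tensor that gives the user the ability to:

use any masked semantics (for example, variable-length tensors, nan* operators, etc.)
differentiate between 0 and NaN gradients
support various sparse applications (see the sparsity tutorial under Further Reading)

For a more detailed introduction to MaskedTensors, see the torch.masked documentation.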

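Using MaskedTensor

In this section we discuss how to use MaskedTensor: how to construct one, how to access its data and mask, and how to index and slice it.

Preparation

We begin with the setup needed for the tutorial: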

import torch
from torch.masked import masked_tensor, as_masked_tensor
import warnings

# Disable prototype warnings and such
warnings.filterwarnings(action='ignore', category=UserWarning)

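Construction

There are a few different ways to construct a MaskedTensor:

The first way is to invoke the MaskedTensor class directly
The second (and our recommended way) is to use the masked.masked_tensor() and masked.as_masked_tensor() factory functions, which are analogous to torch.tensor() and torch.as_tensor()

Throughout this tutorial, we assume the import line: from torch.masked import masked_tensor.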
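As a minimal sketch of the recommended factory functions, the snippet below builds a MaskedTensor from a data tensor and a boolean mask of the same shape. The values of data and mask are illustrative assumptions and do not come from the tutorial.

```python
import warnings

import torch
from torch.masked import masked_tensor, as_masked_tensor

# Silence the prototype warnings, as in the setup section
warnings.filterwarnings(action="ignore", category=UserWarning)

# Illustrative values (an assumption, not taken from the tutorial)
data = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
mask = torch.tensor([[True, False, True], [False, True, True]])  # True = specified

mt = masked_tensor(data, mask)      # factory analogous to torch.tensor()
mt2 = as_masked_tensor(data, mask)  # factory analogous to torch.as_tensor()

print(mt)
print(mt2)
```

Both calls take the data first and the mask second; the mask is expected to be a boolean tensor with the same shape as the data.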

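Accessing the data and mask

The underlying fields of a MaskedTensor can be accessed through:

the MaskedTensor.get_data() function
the MaskedTensor.get_mask() function. Recall that True indicates "specified" or "valid", while False indicates "unspecified" or "invalid".

In general, the underlying data that is returned may not be valid in the unspecified entries, so when users need a Tensor without any masked entries, we recommend MaskedTensor.to_tensor(), which returns a Tensor with the masked entries filled with a given value.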
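The accessors can be sketched as follows. The input values and the fill value 0.0 are illustrative assumptions.

```python
import warnings

import torch
from torch.masked import masked_tensor

warnings.filterwarnings(action="ignore", category=UserWarning)

# Illustrative values (an assumption, not taken from the tutorial)
data = torch.tensor([1.0, 2.0, 3.0, 4.0])
mask = torch.tensor([True, False, True, False])
mt = masked_tensor(data, mask)

raw = mt.get_data()         # underlying values; masked-out entries are not meaningful
valid = mt.get_mask()       # True = specified, False = unspecified
filled = mt.to_tensor(0.0)  # plain Tensor with masked-out entries replaced by 0.0

print("raw:\n", raw)
print("valid:\n", valid)
print("filled:\n", filled)
```

Choosing the fill value explicitly keeps the decision about what an unspecified entry should become visible at the call site.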

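Indexing and slicing

MaskedTensor is a Tensor subclass, which means that it inherits the same indexing and slicing semantics as torch.Tensor. Below are some examples of common indexing and slicing patterns: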

data = torch.arange(24).reshape(2, 3, 4)
mask = data % 2 == 0

print("data:\n", data)
print("mask:\n", mask)

data:
 tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])
mask:
 tensor([[[ True, False,  True, False],
         [ True, False,  True, False],
         [ True, False,  True, False]],

        [[ True, False,  True, False],
         [ True, False,  True, False],
         [ True, False,  True, False]]])

# float is used for cleaner visualization when being printed
mt = masked_tensor(data.float(), mask)

print("mt[0]:\n", mt[0])
print("mt[:, :, 2:4]:\n", mt[:, :, 2:4])

mt[0]:
 MaskedTensor(
  [
    [  0.0000,       --,   2.0000,       --],
    [  4.0000,       --,   6.0000,       --],
    [  8.0000,       --,  10.0000,       --]
  ]
)
mt[:, :, 2:4]:
 MaskedTensor(
  [
    [
      [  2.0000,       --],
      [  6.0000,       --],
      [ 10.0000,       --]
    ],
    [
      [ 14.0000,       --],
      [ 18.0000,       --],
      [ 22.0000,       --]
    ]
  ]
)

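Why is MaskedTensor useful?

Because MaskedTensor treats specified and unspecified values as first-class citizens instead of an afterthought (filled values, NaNs, and so on), it can address several shortcomings that regular Tensors cannot; indeed, MaskedTensor was born in large part out of these recurring issues.

Below, we discuss some of the most common issues that remain unresolved in PyTorch today and illustrate how MaskedTensor addresses them.

Distinguishing between 0 and NaN gradient

One issue that torch.Tensor runs into is the inability to distinguish between gradients that are undefined (NaN) and gradients that are actually 0. Because PyTorch does not have a way of marking a value as specified/valid versus unspecified/invalid, it is forced to rely on NaN or 0 (depending on the use case), which leads to unreliable semantics since many operations are not meant to handle NaN values properly. Even more confusing, gradients can vary depending on the order of operations (for example, depending on how early in the chain of operations a NaN value appears).

MaskedTensor addresses this directly by carrying an explicit mask alongside the data.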

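torch.where

In Issue 10729, we see a case where the order of operations can matter when using torch.where(), because it is hard to tell whether a 0 is a real 0 or one that comes from an undefined gradient. Therefore, we remain consistent and mask out the results:

Current result: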

x = torch.tensor([-10., -5, 0, 5, 10, 50, 60, 70, 80, 90, 100], requires_grad=True, dtype=torch.float)
y = torch.where(x < 0, torch.exp(x), torch.ones_like(x))
y.sum().backward()
x.grad

tensor([4.5400e-05, 6.7379e-03, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00,        nan,        nan])

MaskedTensor result:

x = torch.tensor([-10., -5, 0, 5, 10, 50, 60, 70, 80, 90, 100])
mask = x < 0
mx = masked_tensor(x, mask, requires_grad=True)
my = masked_tensor(torch.ones_like(x), ~mask, requires_grad=True)
y = torch.where(mask, torch.exp(mx), my)
y.sum().backward()
mx.grad

MaskedTensor(
  [  0.0000,   0.0067,       --,       --,       --,       --,       --,       --,       --,       --,       --]
)

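The gradient here is provided only to the selected subset. Effectively, this changes the gradient of where so that it masks out elements instead of setting them to zero.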
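The nan entries in the plain-Tensor result come from floating-point arithmetic in the backward pass. The following sketch, using only regular tensors and a shortened two-element input chosen for illustration, shows the mechanism: the unselected exp branch receives a gradient of 0, which is then multiplied by exp(x), and exp(100) overflows to inf in float32.

```python
import torch

# Shortened, illustrative input: one selected entry and one unselected entry
x = torch.tensor([-5.0, 100.0], requires_grad=True)
y = torch.where(x < 0, torch.exp(x), torch.ones_like(x))
y.sum().backward()

# where() passes gradient 0 to the exp branch at x = 100, and the backward
# of exp multiplies that 0 by exp(100), which is inf in float32.
print(torch.exp(torch.tensor(100.0)))
print(torch.tensor(0.0) * torch.tensor(float("inf")))

# The first entry is a regular gradient, the second one is nan
print(x.grad)
```

With a mask, the second entry is reported as unspecified instead of being computed at all, which is why the MaskedTensor result contains no nan.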

Another torch.where

Issue 52248 is another example.

Current result:

a = torch.randn((), requires_grad=True)
b = torch.tensor(False)
c = torch.ones(())
print("torch.where(b, a/0, c):\n", torch.where(b, a/0, c))
print("torch.autograd.grad(torch.where(b, a/0, c), a):\n", torch.autograd.grad(torch.where(b, a/0, c), a))

torch.where(b, a/0, c):
 tensor(1., grad_fn=<WhereBackward0>)
torch.autograd.grad(torch.where(b, a/0, c), a):
 (tensor(nan),)

MaskedTensor result:

a = masked_tensor(torch.randn(()), torch.tensor(True), requires_grad=True)
b = torch.tensor(False)
c = torch.ones(())
print("torch.where(b, a/0, c):\n", torch.where(b, a/0, c))
print("torch.autograd.grad(torch.where(b, a/0, c), a):\n", torch.autograd.grad(torch.where(b, a/0, c), a))

torch.where(b, a/0, c):
 MaskedTensor(  1.0000, True)
torch.autograd.grad(torch.where(b, a/0, c), a):
 (MaskedTensor(--, False),)

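This issue is similar to the next one (and even links to it): it expresses frustration with unexpected behavior caused by the inability to differentiate "no gradient" from "zero gradient", which in turn makes interactions with other ops hard to reason about.

When using mask, x/0 yields NaN grad

In Issue 4132, the user proposes that x.grad should be [0, 1] instead of [nan, 1], whereas MaskedTensor makes this very clear by masking out the gradient altogether.

Current result: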

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
y = x/div # => y is [inf, 1]
mask = (div != 0)  # => mask is [0, 1]
y[mask].backward()
x.grad

tensor([nan, 1.])

MaskedTensor result:

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
y = x/div # => y is [inf, 1]
mask = (div != 0) # => mask is [0, 1]
loss = as_masked_tensor(y, mask)
loss.sum().backward()
x.grad

MaskedTensor(
  [      --,   1.0000]
)

`torch.nansum()` and `torch.nanmean()`

In Issue 67180, the gradient is not calculated properly (a longstanding issue), whereas MaskedTensor handles it correctly.

Current result:

a = torch.tensor([1., 2., float('nan')])
b = torch.tensor(1.0, requires_grad=True)
c = a * b
c1 = torch.nansum(c)
bgrad1, = torch.autograd.grad(c1, b, retain_graph=True)
bgrad1

tensor(nan)

MaskedTensor result:

a = torch.tensor([1., 2., float('nan')])
b = torch.tensor(1.0, requires_grad=True)
mt = masked_tensor(a, ~torch.isnan(a))
c = mt * b
c1 = torch.sum(c)
bgrad1, = torch.autograd.grad(c1, b, retain_graph=True)
bgrad1

MaskedTensor(  3.0000, True)

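Safe Softmax

Safe softmax is another good example of an issue that arises frequently. In a nutshell, if an entire batch is "masked out" or consists entirely of padding (which, in the softmax case, means being set to -inf), the result contains NaNs, which can lead to training divergence.

MaskedTensor handles this case. Consider this setup: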

data = torch.randn(3, 3)
mask = torch.tensor([[True, False, False], [True, False, True], [False, False, False]])
x = data.masked_fill(~mask, float('-inf'))
mt = masked_tensor(data, mask)
print("x:\n", x)
print("mt:\n", mt)

x:
 tensor([[ 0.2345,    -inf,    -inf],
        [-0.1863,    -inf, -0.6380],
        [   -inf,    -inf,    -inf]])
mt:
 MaskedTensor(
  [
    [  0.2345,       --,       --],
    [ -0.1863,       --,  -0.6380],
    [      --,       --,       --]
  ]
)

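Suppose we want to calculate the softmax along dim=0. Note that the second column is "unsafe" (that is, entirely masked out), so when the softmax is calculated, the result is 0/0 = nan since exp(-inf) = 0. However, what we would really like is for the gradients to be masked out, since they are unspecified and would be invalid for training.

PyTorch result: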

x.softmax(0)

tensor([[0.6037,    nan, 0.0000],
        [0.3963,    nan, 1.0000],
        [0.0000,    nan, 0.0000]])

MaskedTensor result:

mt.softmax(0)

MaskedTensor(
  [
    [  0.6037,       --,       --],
    [  0.3963,       --,   1.0000],
    [      --,       --,       --]
  ]
)
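The 0/0 = nan arithmetic behind the unsafe column can be checked by hand with regular tensors. This is a small sketch that writes out the numerator and denominator of softmax for a single column filled with -inf; the variable names are illustrative.

```python
import torch

# One fully masked-out column, filled with -inf as in the setup above
col = torch.full((3,), float("-inf"))

numerator = torch.exp(col)        # exp(-inf) = 0 for every entry
denominator = numerator.sum()     # 0 + 0 + 0 = 0
manual = numerator / denominator  # 0 / 0 = nan

print("manual softmax:\n", manual)
print("torch.softmax:\n", torch.softmax(col, dim=0))  # nan as well
```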

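Implementing missing torch.nan* operators

In Issue 61474, there is a request to add additional operators to cover the various torch.nan* applications, such as torch.nanmax, torch.nanmin, etc.

In general, these problems lend themselves more naturally to masked semantics, so instead of introducing additional operators, we propose using MaskedTensors. Since nanmean has already landed, we can use it as a comparison point: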

x = torch.arange(16).float()
y = x * x.fmod(4)
z = y.masked_fill(y == 0, float('nan'))  # we want to get the mean of y when ignoring the zeros

print("y:\n", y)
# z is just y with the zeros replaced with nan's
print("z:\n", z)

y:
 tensor([ 0.,  1.,  4.,  9.,  0.,  5., 12., 21.,  0.,  9., 20., 33.,  0., 13.,
        28., 45.])
z:
 tensor([nan,  1.,  4.,  9., nan,  5., 12., 21., nan,  9., 20., 33., nan, 13.,
        28., 45.])

print("y.mean():\n", y.mean())
print("z.nanmean():\n", z.nanmean())
# MaskedTensor successfully ignores the 0's
print("torch.mean(masked_tensor(y, y != 0)):\n", torch.mean(masked_tensor(y, y != 0)))

y.mean():
 tensor(12.5000)
z.nanmean():
 tensor(16.6667)
torch.mean(masked_tensor(y, y != 0)):
 MaskedTensor( 16.6667, True)

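In the example above, we constructed y and wanted the mean of the series while ignoring the zeros. torch.nanmean can be used for this, but the rest of the torch.nan* operations are not implemented. MaskedTensor solves this by letting us use the base operation, and the other operations listed in the issue are already supported. For example: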

torch.argmin(masked_tensor(y, y != 0))

MaskedTensor(  1.0000, True)

Indeed, when the zeros are ignored, the minimum value is the 1 at index 1, so argmin returns 1.
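Other reductions can be used in the same way. The sketch below emulates the nanmin and nanmax requested in the issue by applying torch.amin and torch.amax to a MaskedTensor, assuming these reductions are supported in the installed prototype (the torch.masked documentation lists amin and amax among the available reductions). The input values are illustrative.

```python
import warnings

import torch
from torch.masked import masked_tensor

warnings.filterwarnings(action="ignore", category=UserWarning)

# Illustrative values (an assumption, not taken from the tutorial)
y = torch.tensor([0.0, 1.0, 4.0, 9.0, 0.0, 5.0])
mt = masked_tensor(y, y != 0)  # treat the zeros as unspecified

smallest = torch.amin(mt)  # minimum over the specified entries only
largest = torch.amax(mt)   # maximum over the specified entries only

print("y.amin():\n", y.amin())
print("torch.amin(mt):\n", smallest)
print("torch.amax(mt):\n", largest)
```

The plain-Tensor minimum is 0, whereas the masked minimum skips the zeros, which is the behavior a dedicated nanmin-style operator would otherwise have to provide.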

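MaskedTensor also supports reductions when the data is fully masked out, which is equivalent to the case above when the data Tensor is entirely nan. nanmean would return nan (an ambiguous return value), while MaskedTensor more accurately reports a masked-out result.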

x = torch.empty(16).fill_(float('nan'))
print("x:\n", x)
print("torch.nanmean(x):\n", torch.nanmean(x))
print("torch.nanmean via maskedtensor:\n", torch.mean(masked_tensor(x, ~torch.isnan(x))))

x:
 tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
torch.nanmean(x):
 tensor(nan)
torch.nanmean via maskedtensor:
 MaskedTensor(--, False)

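This is a similar problem to safe softmax, where 0/0 = nan when what we really want is an undefined value.

Conclusion

In this tutorial, we introduced what MaskedTensors are, demonstrated how to use them, and motivated their value through a series of examples and the issues they help resolve.

Further Reading

To continue learning, see our MaskedTensor Sparsity tutorial, which shows how MaskedTensor enables sparsity and describes the storage formats we currently support.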

