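speechbrain.nnet.unet module

A UNet model implementation for use with diffusion models

Adapted from OpenAI guided diffusion, with slight modifications and additional features https://github.com/openai/guided-diffusion

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.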

Authors

Artem Ploujnikov 2022

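Summary

Classes:

`AttentionBlock`	An attention block that allows spatial positions to attend to each other.
`AttentionPool2d`	Two-dimensional attentional pooling.
`DecoderUNetModel`	The half UNet model with attention and timestep embedding.
`Downsample`	A downsampling layer with an optional convolution.
`DownsamplingPadding`	A wrapper module that applies the necessary padding for the downsampling factor.
`EmbeddingProjection`	A simple module that computes the projection of an embedding vector onto the specified number of dimensions.
`EncoderUNetModel`	The half UNet model with attention and timestep embedding.
`QKVAttention`	A module that performs QKV attention and splits in a different order.
`ResBlock`	A residual block that can optionally change the number of channels.
`TimestepBlock`	Any module where forward() takes timestep embeddings as a second argument.
`TimestepEmbedSequential`	A sequential module that passes timestep embeddings as an extra input to the children that support it.
`UNetModel`	The full UNet model with attention and timestep embedding.
`UNetNormalizingAutoencoder`	A convenience class for a UNet-based Variational Autoencoder (VAE) - useful in constructing Latent Diffusion models.
`Upsample`	An upsampling layer with an optional convolution.

Functions:

`avg_pool_nd`	Create a 1D, 2D, or 3D average pooling module.
`build_emb_proj`	Builds a dictionary of embedding modules for embedding projections.
`conv_nd`	Create a 1D, 2D, or 3D convolution module.
`fixup`	Zero out the parameters of a module and return it.
`timestep_embedding`	Create sinusoidal timestep embeddings.

Reference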

speechbrain.nnet.unet.fixup(module, use_fixup_init=True)[source]

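Zero out the parameters of a module and return it.

Parameters:

module (torch.nn.Module) – a module
use_fixup_init (bool) – whether to zero out the parameters. If set to False, the function does nothing

Return type:

The fixed module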
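As a rough illustration of what zeroing a module's parameters means in practice, the sketch below re-implements the idea with plain PyTorch. The helper name `zero_module_sketch` is hypothetical and is not part of SpeechBrain; it only assumes the behaviour described above (zero everything when the flag is set, otherwise leave the module alone).

```python
import torch
from torch import nn


def zero_module_sketch(module: nn.Module, use_fixup_init: bool = True) -> nn.Module:
    """Hypothetical stand-in for fixup(): zero every parameter in place."""
    if use_fixup_init:
        with torch.no_grad():
            for param in module.parameters():
                param.zero_()
    return module


# A convolution whose weights and bias are all zero maps any input to zeros.
conv = zero_module_sketch(nn.Conv2d(4, 4, kernel_size=3, padding=1))
out = conv(torch.randn(2, 4, 8, 8))
total = out.abs().sum().item()

# With the flag switched off, the default initialization is left untouched.
untouched = zero_module_sketch(nn.Linear(3, 3), use_fixup_init=False)
```

When the zeroed layer is the last layer of a residual branch, the block typically starts out as an identity mapping, which is the usual motivation for this kind of initialization.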

speechbrain.nnet.unet.conv_nd(dims, *args, **kwargs)[source]

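Create a 1D, 2D, or 3D convolution module.

Parameters:

dims (int) – the number of dimensions
*args (tuple)
**kwargs (dict) – any remaining arguments are passed to the constructor

Return type:

The constructed Conv layer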

speechbrain.nnet.unet.avg_pool_nd(dims, *args, **kwargs)[source]: Create a 1D, 2D, or 3D average pooling module.
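Both `conv_nd` and `avg_pool_nd` choose a layer class from the number of dimensions and forward the remaining arguments to its constructor. The sketch below shows one way such a dispatch could look; the `_sketch` function names and the lookup tables are illustrative assumptions, not the library's code.

```python
from torch import nn

# Lookup tables from dimensionality to layer class (assumed dispatch scheme).
_CONV_LAYERS = {1: nn.Conv1d, 2: nn.Conv2d, 3: nn.Conv3d}
_POOL_LAYERS = {1: nn.AvgPool1d, 2: nn.AvgPool2d, 3: nn.AvgPool3d}


def conv_nd_sketch(dims, *args, **kwargs):
    """Build a convolution for 1D, 2D, or 3D signals."""
    if dims not in _CONV_LAYERS:
        raise ValueError(f"unsupported dimensions: {dims}")
    return _CONV_LAYERS[dims](*args, **kwargs)


def avg_pool_nd_sketch(dims, *args, **kwargs):
    """Build an average pooling layer for 1D, 2D, or 3D signals."""
    if dims not in _POOL_LAYERS:
        raise ValueError(f"unsupported dimensions: {dims}")
    return _POOL_LAYERS[dims](*args, **kwargs)


conv = conv_nd_sketch(1, 4, 8, kernel_size=3, padding=1)
pool = avg_pool_nd_sketch(2, kernel_size=2, stride=2)
```

Dispatching on `dims` lets the rest of the model be written once and reused for waveforms, spectrogram-like inputs, and volumetric data.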

speechbrain.nnet.unet.timestep_embedding(timesteps, dim, max_period=10000)[source]

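Create sinusoidal timestep embeddings.

Parameters:

timesteps (torch.Tensor) – a 1-D tensor of N indices, one per batch element. These may be fractional.
dim (int) – the dimension of the output.
max_period (int) – controls the minimum frequency of the embeddings.

Returns:

result – an [N x dim] tensor of positional embeddings.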

Return type:

torch.Tensor
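To make the role of `dim` and `max_period` concrete, here is a dependency-free sketch of a sinusoidal embedding. The cosine-then-sine layout and the zero padding for odd `dim` follow the guided-diffusion code this module was adapted from; treating them as SpeechBrain's exact layout is an assumption, and `sinusoidal_embedding` is a hypothetical name.

```python
import math


def sinusoidal_embedding(timesteps, dim, max_period=10000):
    """Return one embedding row (a list of `dim` floats) per timestep."""
    half = dim // 2
    # Frequencies decay geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    rows = []
    for t in timesteps:
        args = [t * f for f in freqs]
        row = [math.cos(a) for a in args] + [math.sin(a) for a in args]
        if dim % 2:  # pad odd dimensions with a trailing zero
            row.append(0.0)
        rows.append(row)
    return rows


# Fractional timesteps are allowed, as noted in the parameter description.
emb = sinusoidal_embedding([0, 10.5, 100], dim=8)
```

A larger `max_period` lowers the smallest frequency, so the embedding keeps distinguishing timesteps over a longer range.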

class speechbrain.nnet.unet.AttentionPool2d(spatial_dim: int, embed_dim: int, num_heads_channels: int, output_dim: int = None)[source]

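Bases: Module

Two-dimensional attentional pooling

Adapted from CLIP: https://github.com/openai/CLIP/blob/main/clip/model.py

Parameters:

spatial_dim (int) – the size of the spatial dimension
embed_dim (int) – the embedding dimension
num_heads_channels (int) – the number of attention heads
output_dim (int) – the output dimension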

Example

>>> attn_pool = AttentionPool2d(
...     spatial_dim=64,
...     embed_dim=16,
...     num_heads_channels=2,
...     output_dim=4
... )
>>> x = torch.randn(4, 1, 64, 64)
>>> x_pool = attn_pool(x)
>>> x_pool.shape
torch.Size([4, 4])

forward(x)[source]

Computes the attention forward pass

Parameters: x (torch.Tensor) – the tensor to be attended to
Returns: result – the attention output
Return type: torch.Tensor

class speechbrain.nnet.unet.TimestepBlock(*args, **kwargs)[source]

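Bases: Module

Any module where forward() takes timestep embeddings as a second argument.

abstract forward(x, emb=None)[source]

Apply the module to x given emb timestep embeddings.

Parameters:

x (torch.Tensor) – the data tensor
emb (torch.Tensor) – the embedding tensor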

class speechbrain.nnet.unet.TimestepEmbedSequential(*args: Module)[source]

class speechbrain.nnet.unet.TimestepEmbedSequential(arg: OrderedDict[str, Module])

Bases: Sequential, TimestepBlock

A sequential module that passes timestep embeddings as an extra input to the children that support it.

Example

>>> from speechbrain.nnet.linear import Linear
>>> class MyBlock(TimestepBlock):
...     def __init__(self, input_size, output_size, emb_size):
...         super().__init__()
...         self.lin = Linear(
...             n_neurons=output_size,
...             input_size=input_size
...         )
...         self.emb_proj = Linear(
...             n_neurons=output_size,
...             input_size=emb_size,
...         )
...     def forward(self, x, emb):
...         return self.lin(x) + self.emb_proj(emb)
>>> tes = TimestepEmbedSequential(
...     MyBlock(128, 64, 16),
...     Linear(
...         n_neurons=32,
...         input_size=64
...     )
... )
>>> x = torch.randn(4, 10, 128)
>>> emb = torch.randn(4, 10, 16)
>>> out = tes(x, emb)
>>> out.shape
torch.Size([4, 10, 32])

forward(x, emb=None)[source]

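Computes a sequential pass, passing the timestep embeddings to child modules where applicable

Parameters:

x (torch.Tensor) – the data tensor
emb (torch.Tensor) – timestep embeddings

Return type:

The processed input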

class speechbrain.nnet.unet.Upsample(channels, use_conv, dims=2, out_channels=None)[source]

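Bases: Module

An upsampling layer with an optional convolution.

Parameters:

channels (int) – channels in the inputs and outputs.
use_conv (bool) – a bool determining if a convolution is applied.
dims (int) – determines if the signal is 1D, 2D, or 3D. If 3D, then upsampling occurs in the inner-two dimensions.
out_channels (int) – number of output channels. If None, same as input channels.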

Example

>>> ups = Upsample(channels=4, use_conv=True, dims=2, out_channels=8)
>>> x = torch.randn(8, 4, 32, 32)
>>> x_up = ups(x)
>>> x_up.shape
torch.Size([8, 8, 64, 64])

forward(x)[source]

Computes the upsampling pass

Parameters: x (torch.Tensor) – layer inputs
Returns: result – upsampled outputs
Return type: torch.Tensor

class speechbrain.nnet.unet.Downsample(channels, use_conv, dims=2, out_channels=None)[source]

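Bases: Module

A downsampling layer with an optional convolution.

Parameters:

channels (int) – channels in the inputs and outputs.
use_conv (bool) – a bool determining if a convolution is applied.
dims (int) – determines if the signal is 1D, 2D, or 3D. If 3D, then downsampling occurs in the inner-two dimensions.
out_channels (int) – number of output channels. If None, same as input channels.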

Example

>>> down = Downsample(channels=4, use_conv=True, dims=2, out_channels=8)
>>> x = torch.randn(8, 4, 32, 32)
>>> x_down = down(x)
>>> x_down.shape
torch.Size([8, 8, 16, 16])

forward(x)[source]

Computes the downsampling pass

Parameters: x (torch.Tensor) – layer inputs
Returns: result – downsampled outputs
Return type: torch.Tensor

class speechbrain.nnet.unet.ResBlock(channels, emb_channels, dropout, out_channels=None, use_conv=False, dims=2, up=False, down=False, norm_num_groups=32, use_fixup_init=True)[source]

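Bases: TimestepBlock

A residual block that can optionally change the number of channels.

Parameters:

channels (int) – the number of input channels.
emb_channels (int) – the number of timestep embedding channels.
dropout (float) – the rate of dropout.
out_channels (int) – if specified, the number of output channels.
use_conv (bool) – if True and out_channels is specified, use a spatial convolution instead of a smaller 1x1 convolution to change the channels in the skip connection.
dims (int) – determines if the signal is 1D, 2D, or 3D.
up (bool) – if True, use this block for upsampling.
down (bool) – if True, use this block for downsampling.
norm_num_groups (int) – the number of groups for group normalization
use_fixup_init (bool) – whether to use FixUp initialization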

Example

>>> res = ResBlock(
...     channels=4,
...     emb_channels=8,
...     dropout=0.1,
...     norm_num_groups=2,
...     use_conv=True,
... )
>>> x = torch.randn(2, 4, 32, 32)
>>> emb = torch.randn(2, 8)
>>> res_out = res(x, emb)
>>> res_out.shape
torch.Size([2, 4, 32, 32])

forward(x, emb=None)[source]

Apply the block to a torch.Tensor, conditioned on a timestep embedding.

Parameters:

x (torch.Tensor) – an [N x C x …] tensor of features.
emb (torch.Tensor) – an [N x emb_channels] tensor of timestep embeddings.

Returns:

result – an [N x C x …] tensor of outputs.

Return type:

torch.Tensor

class speechbrain.nnet.unet.AttentionBlock(channels, num_heads=1, num_head_channels=-1, norm_num_groups=32, use_fixup_init=True)[source]

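Bases: Module

An attention block that allows spatial positions to attend to each other. Originally ported from the following source, but adapted to the N-d case: https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66

Parameters:

channels (int) – the number of channels
num_heads (int) – the number of attention heads
num_head_channels (int) – the number of channels in each attention head
norm_num_groups (int) – the number of groups used for group normalization
use_fixup_init (bool) – whether to use FixUp initialization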
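The two head-related parameters overlap, so it may help to spell out how they could be reconciled. The sketch below assumes the convention used in guided-diffusion, where `num_head_channels=-1` means "not specified" and any other value overrides `num_heads`; the helper `resolve_num_heads` is hypothetical.

```python
def resolve_num_heads(channels, num_heads=1, num_head_channels=-1):
    """Return the effective number of attention heads (assumed convention)."""
    if num_head_channels == -1:
        # No fixed head width requested: use num_heads as given.
        return num_heads
    if channels % num_head_channels != 0:
        raise ValueError(
            f"channels ({channels}) must be divisible by "
            f"num_head_channels ({num_head_channels})"
        )
    # A fixed head width takes precedence over num_heads.
    return channels // num_head_channels


heads_fixed_width = resolve_num_heads(channels=8, num_heads=4, num_head_channels=4)
heads_default = resolve_num_heads(channels=8, num_heads=4)
```

Under this assumption, a block built with channels=8, num_heads=4, and num_head_channels=4 would run with two heads of four channels each rather than four heads.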

Example

>>> attn = AttentionBlock(
...     channels=8,
...     num_heads=4,
...     num_head_channels=4,
...     norm_num_groups=2
... )
>>> x = torch.randn(4, 8, 16, 16)
>>> out = attn(x)
>>> out.shape
torch.Size([4, 8, 16, 16])

forward(x)[source]

Completes the forward pass

Parameters: x (torch.Tensor) – the data to be attended to
Returns: result – the data, with attention applied
Return type: torch.Tensor

class speechbrain.nnet.unet.QKVAttention(n_heads)[source]

Bases: Module

A module that performs QKV attention and splits in a different order.

Parameters: n_heads (int) – number of attention heads.

Example

>>> attn = QKVAttention(4)
>>> n = 4
>>> c = 8
>>> h = 64
>>> w = 16
>>> qkv = torch.randn(4, (3 * h * c), w)
>>> out = attn(qkv)
>>> out.shape
torch.Size([4, 512, 16])

forward(qkv)[source]

Apply QKV attention.

Parameters: qkv (torch.Tensor) – an [N x (3 * H * C) x T] tensor of Qs, Ks, and Vs.
Returns: result – an [N x (H * C) x T] tensor after attention.
Return type: torch.Tensor
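The shape contract above can be illustrated with a small scaled dot-product attention written directly in PyTorch. This is a sketch under stated assumptions: the packed tensor is taken to hold all queries, then all keys, then all values along the channel axis, and `qkv_attention_sketch` is a hypothetical function rather than the module's actual code.

```python
import math

import torch


def qkv_attention_sketch(qkv: torch.Tensor, n_heads: int) -> torch.Tensor:
    """Scaled dot-product attention over a packed [N x (3 * H * C) x T] tensor."""
    bs, width, length = qkv.shape
    assert width % (3 * n_heads) == 0
    ch = width // (3 * n_heads)
    # Assumed packing: all queries first, then all keys, then all values.
    q, k, v = qkv.chunk(3, dim=1)
    q = q.reshape(bs * n_heads, ch, length)
    k = k.reshape(bs * n_heads, ch, length)
    v = v.reshape(bs * n_heads, ch, length)
    # weight[b, t, s]: how strongly position t attends to position s.
    weight = torch.einsum("bct,bcs->bts", q, k) / math.sqrt(ch)
    weight = torch.softmax(weight, dim=-1)
    out = torch.einsum("bts,bcs->bct", weight, v)
    return out.reshape(bs, n_heads * ch, length)


qkv = torch.randn(4, 3 * 4 * 8, 16)  # N=4, H=4 heads, C=8 channels per head, T=16
out = qkv_attention_sketch(qkv, n_heads=4)
```

The output keeps the batch and length axes and drops the factor of three, giving an [N x (H * C) x T] tensor.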

speechbrain.nnet.unet.build_emb_proj(emb_config, proj_dim=None, use_emb=None)[source]

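Builds a dictionary of embedding modules for embedding projections

Parameters:

emb_config (dict) – a configuration dictionary
proj_dim (int) – the target projection dimension
use_emb (dict) – an optional dictionary of "switches" to turn embeddings on and off

Returns:

result – a ModuleDict with a module for each embedding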

Return type:

torch.nn.ModuleDict
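The "switch" dictionary is easiest to understand as a filter over the configuration. The sketch below shows only that selection step, using dictionaries shaped like the cond_emb and use_cond_emb examples of UNetModel. The helper `select_embeddings` is hypothetical, and treating a missing key as switched off is an assumption about the library's behaviour.

```python
def select_embeddings(emb_config, use_emb=None):
    """Return the config entries that remain enabled (hypothetical helper)."""
    if emb_config is None:
        return {}
    return {
        key: cfg
        for key, cfg in emb_config.items()
        # Assumption: no switch dictionary means "use everything", and a key
        # missing from the switch dictionary counts as switched off.
        if use_emb is None or use_emb.get(key, False)
    }


emb_config = {"speaker": {"emb_dim": 256}, "label": {"emb_dim": 12}}
enabled = select_embeddings(emb_config, use_emb={"speaker": False, "label": True})
```

Each surviving entry would then get its own projection module from its `emb_dim` to `proj_dim`, collected in the returned ModuleDict.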

class speechbrain.nnet.unet.UNetModel(in_channels, model_channels, out_channels, num_res_blocks, attention_resolutions, dropout=0, channel_mult=(1, 2, 4, 8), conv_resample=True, dims=2, emb_dim=None, cond_emb=None, use_cond_emb=None, num_heads=1, num_head_channels=-1, num_heads_upsample=-1, norm_num_groups=32, resblock_updown=False, use_fixup_init=True)[source]

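Bases: Module

The full UNet model with attention and timestep embedding.

Parameters:

in_channels (int) – channels in the input torch.Tensor.
model_channels (int) – base channel count for the model.
out_channels (int) – channels in the output torch.Tensor.
num_res_blocks (int) – number of residual blocks per downsample.
attention_resolutions (int) – a collection of downsample rates at which attention will take place. May be a set, list, or tuple. For example, if this contains 4, then attention will be used at 4x downsampling.
dropout (float) – the dropout probability.
channel_mult (int) – channel multiplier for each level of the UNet.
conv_resample (bool) – if True, use learned convolutions for upsampling and downsampling
dims (int) – determines if the signal is 1D, 2D, or 3D.
emb_dim (int) – time embedding dimension (defaults to model_channels * 4)
cond_emb (dict) – embeddings on which the model will be conditioned. Example: {"speaker": {"emb_dim": 256}, "label": {"emb_dim": 12}}
use_cond_emb (dict) – a dictionary with keys corresponding to keys in cond_emb and values corresponding to Booleans that turn embeddings on and off. This is useful in combination with hparams files to turn embeddings on and off with simple switches. Example: {"speaker": False, "label": True}
num_heads (int) – the number of attention heads in each attention layer.
num_head_channels (int) – if specified, ignore num_heads and instead use a fixed channel width per attention head.
num_heads_upsample (int) – works with num_heads to set a different number of heads for upsampling. Deprecated.
norm_num_groups (int) – number of groups in the norm, default is 32
resblock_updown (bool) – use residual blocks for up/downsampling.
use_fixup_init (bool) – whether to use FixUp initialization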
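The meaning of attention_resolutions is easier to see by walking through the levels. The sketch below assumes, as in guided-diffusion, that the downsampling rate starts at 1 and doubles between consecutive channel_mult levels; `attention_levels` is a hypothetical helper used only for illustration.

```python
def attention_levels(channel_mult, attention_resolutions):
    """List (level, downsample rate, has_attention) for each UNet level."""
    rates = set(attention_resolutions)
    levels = []
    ds = 1  # cumulative downsampling rate at the current level
    for level in range(len(channel_mult)):
        levels.append((level, ds, ds in rates))
        if level != len(channel_mult) - 1:
            ds *= 2  # assumed: one 2x downsampling step between levels
    return levels


plan = attention_levels(channel_mult=(1, 2, 4, 8), attention_resolutions=[1, 4])
```

With these inputs, attention would be applied at the full-resolution level (rate 1) and at the level reached after two downsampling steps (rate 4), and skipped at rates 2 and 8.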

Example

>>> model = UNetModel(
...    in_channels=3,
...    model_channels=32,
...    out_channels=1,
...    num_res_blocks=1,
...    attention_resolutions=[1]
... )
>>> x = torch.randn(4, 3, 16, 32)
>>> ts = torch.tensor([10, 100, 50, 25])
>>> out = model(x, ts)
>>> out.shape
torch.Size([4, 1, 16, 32])

forward(x, timesteps, cond_emb=None)[source]

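Apply the model to an input batch.

Parameters:

x (torch.Tensor) – an [N x C x …] tensor of inputs.
timesteps (torch.Tensor) – a 1-D batch of timesteps.
cond_emb (dict) – a string -> tensor dictionary of conditional embeddings (multiple embeddings are supported)

Returns:

result – an [N x C x …] tensor of outputs.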

Return type:

torch.Tensor

diffusion_forward(x, timesteps, cond_emb=None, length=None, out_mask_value=None, latent_mask_value=None)[source]: A forward function suitable for wrapping by diffusion. For this model, length/out_mask_value/latent_mask_value are unused and discarded. See forward() for details.

class speechbrain.nnet.unet.EncoderUNetModel(in_channels, model_channels, out_channels, num_res_blocks, attention_resolutions, dropout=0, channel_mult=(1, 2, 4, 8), conv_resample=True, dims=2, num_heads=1, num_head_channels=-1, num_heads_upsample=-1, norm_num_groups=32, resblock_updown=False, pool=None, attention_pool_dim=None, out_kernel_size=3, use_fixup_init=True)[source]

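Bases: Module

The half UNet model with attention and timestep embedding. For usage, see UNetModel.

Parameters:

in_channels (int) – channels in the input torch.Tensor.
model_channels (int) – base channel count for the model.
out_channels (int) – channels in the output torch.Tensor.
num_res_blocks (int) – number of residual blocks per downsample.
attention_resolutions (int) – a collection of downsample rates at which attention will take place. May be a set, list, or tuple. For example, if this contains 4, then attention will be used at 4x downsampling.
dropout (float) – the dropout probability.
channel_mult (int) – channel multiplier for each level of the UNet.
conv_resample (bool) – if True, use learned convolutions for upsampling and downsampling
dims (int) – determines if the signal is 1D, 2D, or 3D.
num_heads (int) – the number of attention heads in each attention layer.
num_head_channels (int) – if specified, ignore num_heads and instead use a fixed channel width per attention head.
num_heads_upsample (int) – works with num_heads to set a different number of heads for upsampling. Deprecated.
norm_num_groups (int) – number of groups in the norm, default is 32.
resblock_updown (bool) – use residual blocks for up/downsampling.
pool (str) – the type of pooling to use, one of: ["adaptive", "attention", "spatial", "spatial_v2"].
attention_pool_dim (int) – the dimension on which to apply attention pooling.
out_kernel_size (int) – the kernel size of the output convolution
use_fixup_init (bool) – whether to use FixUp initialization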

Example

>>> model = EncoderUNetModel(
...    in_channels=3,
...    model_channels=32,
...    out_channels=1,
...    num_res_blocks=1,
...    attention_resolutions=[1]
... )
>>> x = torch.randn(4, 3, 16, 32)
>>> ts = torch.tensor([10, 100, 50, 25])
>>> out = model(x, ts)
>>> out.shape
torch.Size([4, 1, 2, 4])

forward(x, timesteps=None)[source]

Apply the model to an input batch.

Parameters:

x (torch.Tensor) – an [N x C x …] tensor of inputs.
timesteps (torch.Tensor) – a 1-D batch of timesteps.

Returns:

result – an [N x K] tensor of outputs.

Return type:

torch.Tensor

class speechbrain.nnet.unet.EmbeddingProjection(emb_dim, proj_dim)[source]

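Bases: Module

A simple module that computes the projection of an embedding vector onto the specified number of dimensions

Parameters:

emb_dim (int) – the original embedding dimensionality
proj_dim (int) – the dimensionality of the target projection space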

Example

>>> mod_emb_proj = EmbeddingProjection(emb_dim=16, proj_dim=64)
>>> emb = torch.randn(4, 16)
>>> emb_proj = mod_emb_proj(emb)
>>> emb_proj.shape
torch.Size([4, 64])

forward(emb)[source]

Computes the forward pass

Parameters: emb (torch.Tensor) – the original embedding tensor
Returns: result – the embedding projected into the target space
Return type: torch.Tensor

class speechbrain.nnet.unet.DecoderUNetModel(in_channels, model_channels, out_channels, num_res_blocks, attention_resolutions, dropout=0, channel_mult=(1, 2, 4, 8), conv_resample=True, dims=2, num_heads=1, num_head_channels=-1, num_heads_upsample=-1, resblock_updown=False, norm_num_groups=32, out_kernel_size=3, use_fixup_init=True)[source]

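Bases: Module

The half UNet model with attention and timestep embedding. For usage, see UNetModel.

Parameters:

in_channels (int) – channels in the input torch.Tensor.
model_channels (int) – base channel count for the model.
out_channels (int) – channels in the output torch.Tensor.
num_res_blocks (int) – number of residual blocks per downsample.
attention_resolutions (int) – a collection of downsample rates at which attention will take place. May be a set, list, or tuple. For example, if this contains 4, then attention will be used at 4x downsampling.
dropout (float) – the dropout probability.
channel_mult (int) – channel multiplier for each level of the UNet.
conv_resample (bool) – if True, use learned convolutions for upsampling and downsampling
dims (int) – determines if the signal is 1D, 2D, or 3D.
num_heads (int) – the number of attention heads in each attention layer.
num_head_channels (int) – if specified, ignore num_heads and instead use a fixed channel width per attention head.
num_heads_upsample (int) – works with num_heads to set a different number of heads for upsampling. Deprecated.
resblock_updown (bool) – use residual blocks for up/downsampling.
norm_num_groups (int) – the number of groups to use in the norm, default is 32
out_kernel_size (int) – the output kernel size, default is 3
use_fixup_init (bool) – whether to use FixUp initialization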

Example

>>> model = DecoderUNetModel(
...    in_channels=1,
...    model_channels=32,
...    out_channels=3,
...    num_res_blocks=1,
...    attention_resolutions=[1]
... )
>>> x = torch.randn(4, 1, 2, 4)
>>> ts = torch.tensor([10, 100, 50, 25])
>>> out = model(x, ts)
>>> out.shape
torch.Size([4, 3, 16, 32])

forward(x, timesteps=None)[source]

Apply the model to an input batch.

Parameters:

x (torch.Tensor) – an [N x C x …] tensor of inputs.
timesteps (torch.Tensor) – a 1-D batch of timesteps.

Returns:

result – an [N x K] tensor of outputs.

Return type:

torch.Tensor

class speechbrain.nnet.unet.DownsamplingPadding(factor, len_dim=2, dims=None)[source]

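Bases: Module

A wrapper module that applies the necessary padding for the downsampling factor

Parameters:

factor (int) – the downsampling / divisibility factor
len_dim (int) – the index of the dimension in which the length will vary
dims (list) – the list of dimensions to be included in padding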

Example

>>> padding = DownsamplingPadding(factor=4, dims=[1, 2], len_dim=1)
>>> x = torch.randn(4, 7, 14)
>>> length = torch.tensor([1., 0.8, 1., 0.7])
>>> x, length_new = padding(x, length)
>>> x.shape
torch.Size([4, 8, 16])
>>> length_new
tensor([0.8750, 0.7000, 0.8750, 0.6125])

forward(x, length=None)[source]

Applies the padding

Parameters:

x (torch.Tensor) – the sample
length (torch.Tensor) – the length tensor

Returns:

x_pad (torch.Tensor) – the padded tensor
lens (torch.Tensor) – the new, adjusted lengths, if applicable
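The numbers in the example above follow from simple arithmetic: each padded dimension is rounded up to the next multiple of the factor, and relative lengths shrink by the ratio of the old size to the new size along `len_dim`. The sketch below reproduces that arithmetic without PyTorch; both helper names are hypothetical, and the rescaling rule is inferred from the example output rather than from the implementation.

```python
import math


def padded_size(size, factor):
    """Round `size` up to the nearest multiple of `factor`."""
    return math.ceil(size / factor) * factor


def rescale_lengths(rel_lengths, old_size, new_size):
    """Rescale relative lengths so they still mark the same absolute positions."""
    return [length * old_size / new_size for length in rel_lengths]


old_size = 7  # size along len_dim before padding
new_size = padded_size(old_size, factor=4)
new_lengths = rescale_lengths([1.0, 0.8, 1.0, 0.7], old_size, new_size)
```

Here a dimension of size 7 grows to 8 and one of size 14 would grow to 16, while a relative length of 1.0 becomes 7 / 8 = 0.875, in line with the example.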

class speechbrain.nnet.unet.UNetNormalizingAutoencoder(in_channels, model_channels, encoder_out_channels, latent_channels, encoder_num_res_blocks, encoder_attention_resolutions, decoder_num_res_blocks, decoder_attention_resolutions, dropout=0, channel_mult=(1, 2, 4, 8), dims=2, num_heads=1, num_head_channels=-1, num_heads_upsample=-1, norm_num_groups=32, resblock_updown=False, out_kernel_size=3, len_dim=2, out_mask_value=0.0, latent_mask_value=0.0, use_fixup_norm=False, downsampling_padding=None)[source]

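Bases: NormalizingAutoencoder

A convenience class for a UNet-based Variational Autoencoder (VAE) - useful in constructing Latent Diffusion models

Parameters:

in_channels (int) – the number of input channels
model_channels (int) – the number of channels in the convolutional layers of the UNet encoder and decoder
encoder_out_channels (int) – the number of channels the encoder will output
latent_channels (int) – the number of channels in the latent space
encoder_num_res_blocks (int) – the number of residual blocks in the encoder
encoder_attention_resolutions (list) – the resolutions at which to apply attention layers in the encoder
decoder_num_res_blocks (int) – the number of residual blocks in the decoder
decoder_attention_resolutions (list) – the resolutions at which to apply attention layers in the decoder
dropout (float) – the dropout probability
channel_mult (tuple) – channel multipliers for each layer
dims (int) – the convolution dimension to use (1, 2 or 3)
num_heads (int) – the number of attention heads
num_head_channels (int) – the number of channels in attention heads
num_heads_upsample (int) – the number of upsampling heads
norm_num_groups (int) – the number of norm groups, default is 32
resblock_updown (bool) – whether to use residual blocks for upsampling and downsampling
out_kernel_size (int) – the kernel size for output convolution layers (if applicable)
len_dim (int) – size of the output.
out_mask_value (float) – the value to fill in when masking the output.
latent_mask_value (float) – the value to fill in when masking the latent variable.
use_fixup_norm (bool) – whether to use FixUp normalization
downsampling_padding (int) – the amount of padding to apply in downsampling, default is 2 ** len(channel_mult)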

Example

>>> unet_ae = UNetNormalizingAutoencoder(
...     in_channels=1,
...     model_channels=4,
...     encoder_out_channels=16,
...     latent_channels=3,
...     encoder_num_res_blocks=1,
...     encoder_attention_resolutions=[],
...     decoder_num_res_blocks=1,
...     decoder_attention_resolutions=[],
...     norm_num_groups=2,
... )
>>> x = torch.randn(4, 1, 32, 32)
>>> x_enc = unet_ae.encode(x)
>>> x_enc.shape
torch.Size([4, 3, 4, 4])
>>> x_dec = unet_ae.decode(x_enc)
>>> x_dec.shape
torch.Size([4, 1, 32, 32])
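The shapes in the example above (a 32 x 32 input and a 4 x 4 latent) are consistent with the encoder halving each spatial dimension once between consecutive channel_mult levels. The sketch below makes that arithmetic explicit. The halving rule is an assumption carried over from guided-diffusion, and `latent_spatial_size` is a hypothetical helper.

```python
def latent_spatial_size(size, channel_mult=(1, 2, 4, 8)):
    """Spatial size after the encoder, assuming a 2x downsample between levels."""
    num_downsamples = len(channel_mult) - 1
    return size // (2 ** num_downsamples)


# Default channel_mult has four levels, hence three halvings (a factor of 8).
latent_hw = (latent_spatial_size(32), latent_spatial_size(32))
encoder_hw = (latent_spatial_size(16), latent_spatial_size(32))
```

Under the same assumption, the 16 x 32 input of the EncoderUNetModel example maps to a 2 x 4 output, which agrees with the shape shown there.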