TransformerDecoder

class torchtune.modules.TransformerDecoder(*, tok_embeddings: Embedding, layers: Union[Module, List[Module], ModuleList], max_seq_len: int, num_heads: int, head_dim: int, norm: Module, output: Union[Linear, Callable], num_layers: Optional[int] = None, output_hidden_states: Optional[List[int]] = None)

Transformer decoder derived from the Llama2 architecture.

Parameters:

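tok_embeddings (nn.Embedding) – PyTorch embedding layer, used to map tokens into an embedding space.
layers (Union[nn.Module, List[nn.Module], nn.ModuleList]) – A single transformer decoder layer, an nn.ModuleList of layers, or a list of layers. Using an nn.ModuleList is recommended.
max_seq_len (int) – Maximum sequence length the model will be run with, as used by KVCache().
num_heads (int) – Number of query heads. For MHA this is also the number of heads for key and value. This is used to set up the KVCache().
head_dim (int) – Embedding dimension of each head in self-attention. This is used to set up the KVCache().
norm (nn.Module) – Callable that applies normalization to the output of the decoder, before the final MLP.
output (Union[nn.Linear, Callable]) – Callable that applies a linear transformation to the output of the decoder.
num_layers (Optional[int]) – Number of transformer decoder layers. Define this only when layers is not a list.
output_hidden_states (Optional[List[int]]) – List of layers (indices) to include in the output.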

Raises:

AssertionError – num_layers is set and layers is a list
AssertionError – num_layers is not set and layers is an nn.Module
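The layers / num_layers rules above can be illustrated with a minimal sketch in plain PyTorch. The helper name build_layer_stack is hypothetical and is not part of torchtune. The sketch also assumes that a single module is expanded by deep-copying it num_layers times, which the text above does not state.

```python
import copy
from typing import List, Optional, Union

from torch import nn


def build_layer_stack(
    layers: Union[nn.Module, List[nn.Module], nn.ModuleList],
    num_layers: Optional[int] = None,
) -> nn.ModuleList:
    """Hypothetical helper mirroring the documented layers/num_layers checks."""
    # nn.ModuleList is itself an nn.Module, so test for list-like inputs first.
    if isinstance(layers, (list, nn.ModuleList)):
        assert num_layers is None, "num_layers is set and layers is a list"
        return layers if isinstance(layers, nn.ModuleList) else nn.ModuleList(layers)
    # A single module needs num_layers so it can be repeated.
    assert num_layers is not None, "num_layers is not set and layers is an nn.Module"
    # Assumption: each repeated layer gets its own independent copy of the weights.
    return nn.ModuleList([copy.deepcopy(layers) for _ in range(num_layers)])


# A stand-in layer; a real decoder layer would be an attention block.
stack = build_layer_stack(nn.Linear(8, 8), num_layers=3)
```

Passing a list together with num_layers, or a single module without it, trips the corresponding assertion.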

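Note

Argument values are checked for correctness (e.g. attn_dropout belongs to [0,1]) in the module where they are used. This reduces the number of raise statements in the code and improves readability.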

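caches_are_enabled() → bool

Checks whether the key-value caches are enabled. Once KV caches have been set up, the relevant attention modules are "enabled" and every forward pass updates the caches. This behaviour can be turned off without altering the state of the KV caches by "disabling" them with torchtune.modules.common_utils.disable_kv_cache(), after which caches_are_enabled returns False.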

caches_are_setup() → bool

Checks whether the key-value caches are set up. This means setup_caches has been called and the relevant attention modules in the model have created their KVCache.

chunked_output(last_hidden_state: Tensor) → List[Tensor]

Applies the output projection in chunks. This should be used together with CEWithChunkedOutputLoss, because the upcast to fp32 is done there.

To use this method, first call set_num_output_chunks().

Parameters:

last_hidden_state (torch.Tensor) – Last hidden state of the decoder, with shape [b, seq_len, embed_dim].

Returns:

List of num_chunks output tensors, each with shape [b, seq_len/num_chunks, out_dim], where out_dim is usually the vocabulary size.

Return type:

List[torch.Tensor]
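The chunking idea can be sketched in plain PyTorch without torchtune. This is an illustration under assumptions, not the library's implementation: the hidden state is split along the sequence dimension with Tensor.chunk, and a bias-free nn.Linear stands in for the output projection. All sizes and the name num_output_chunks are illustrative.

```python
import torch
from torch import nn

b, seq_len, embed_dim, vocab_size = 2, 8, 16, 32
num_output_chunks = 4  # illustrative; plays the role of the value given to set_num_output_chunks()

output_proj = nn.Linear(embed_dim, vocab_size, bias=False)  # stand-in output projection
last_hidden_state = torch.randn(b, seq_len, embed_dim)

# Split along the sequence dimension and project each chunk separately,
# so no single [b, seq_len, vocab_size] logits tensor has to be materialized.
chunks = last_hidden_state.chunk(num_output_chunks, dim=1)
logits_chunks = [output_proj(chunk) for chunk in chunks]

# For comparison only: the unchunked projection gives the same values.
full_logits = output_proj(last_hidden_state)
```

Each element of logits_chunks has shape [b, seq_len/num_chunks, vocab_size], and concatenating them along dim=1 recovers the unchunked logits.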

forward(tokens: Tensor, *, mask: Optional[Tensor] = None, encoder_input: Optional[Tensor] = None, encoder_mask: Optional[Tensor] = None, input_pos: Optional[Tensor] = None) → Union[Tensor, List[Tensor]]

Parameters:

tokens (torch.Tensor) – Input tensor with shape [b x s]
mask (Optional[_MaskType]) –
Used to mask the scores after the query-key multiplication and before the softmax. This parameter is required during inference if caches have been set up. Either:

A boolean tensor with shape [b x s x s], or [b x s x self.encoder_max_cache_seq_len] if using KV caching with encoder/decoder layers. A value of True in row i and column j means token i attends to token j. A value of False means token i does not attend to token j. If no mask is specified, a causal mask is used by default.

A BlockMask for document masking in a packed sequence, created via create_block_mask. flex_attention() is used when computing attention with block masks. Default is None.
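encoder_input (Optional[torch.Tensor]) – Optional input embeddings from the encoder. Shape [b x s_e x d_e]
encoder_mask (Optional[torch.Tensor]) – Boolean tensor defining a relational matrix between tokens and encoder embeddings. A True value at position i,j means token i can attend to embedding j in the decoder. The mask has shape [b x s x s_e]. Default is None, but this is required during inference if the model has been set up with any layers that use encoder embeddings and caches have been set up.
input_pos (Optional[torch.Tensor]) – Optional tensor containing the position IDs of each token. During training, this indicates the position of each token relative to its sample when packed, with shape [b x s]. During inference, this indicates the position of the current token. This parameter is required during inference if caches have been set up. Default is None.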
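To make the boolean mask convention concrete (True in row i, column j means token i attends to token j), here is a minimal sketch in plain PyTorch. It builds a causal mask of shape [b x s x s] and applies it to random attention scores before the softmax. The sizes and variable names are illustrative, and the masking step is a generic formulation rather than torchtune's attention code.

```python
import torch

b, s = 2, 4

# Lower-triangular boolean matrix: token i may attend to tokens 0..i.
causal_mask = torch.tril(torch.ones(s, s, dtype=torch.bool))
causal_mask = causal_mask.unsqueeze(0).expand(b, s, s)  # shape [b x s x s]

# Apply the mask to (random, illustrative) query-key scores before the softmax.
scores = torch.randn(b, s, s)
weights = scores.masked_fill(~causal_mask, float("-inf")).softmax(dim=-1)
```

Positions where the mask is False receive zero attention weight, while each row of weights still sums to one.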

Returns:

Output tensor with shape [b x s x v], or a list of the layer output tensors selected by output_hidden_states with the final output tensor appended to the list.

Return type:

Union[torch.Tensor, List[torch.Tensor]]
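The two return shapes can be illustrated with a toy layer loop in plain PyTorch. This is a sketch under assumptions: nn.Linear modules stand in for decoder layers, and the hidden state is collected before the indexed layer is applied, which is an assumption about ordering that the text above does not specify.

```python
import torch
from torch import nn

b, s, d, v = 2, 5, 16, 32
layers = nn.ModuleList([nn.Linear(d, d) for _ in range(4)])  # stand-in decoder layers
output_proj = nn.Linear(d, v)                                # stand-in output projection
output_hidden_states = [1, 3]                                # illustrative layer indices

h = torch.randn(b, s, d)
hidden = []
for i, layer in enumerate(layers):
    # Assumption: the state entering layer i is what gets recorded for index i.
    if i in output_hidden_states:
        hidden.append(h)
    h = layer(h)

logits = output_proj(h)  # shape [b x s x v]

# A plain tensor when no hidden states are requested, otherwise a list
# of the selected hidden states with the final output appended.
result = hidden + [logits] if hidden else logits
```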

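Note

At the first step of inference, when the model is given a prompt, input_pos should contain the positions of all tokens in the prompt. For a single prompt, or a batch of prompts of identical length, this is torch.arange(prompt_length). For a batch of prompts of varying length, shorter prompts are left-padded and position IDs are shifted right accordingly, so the position IDs should have shape [b, padded_prompt_length]. This is because position embeddings must be retrieved for each input ID. In subsequent steps, if the model has been set up with KV caches, input_pos contains the position of the current token, torch.tensor([padded_prompt_length]). Otherwise, input_pos contains all position IDs up to the current token.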
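One plausible way to derive position IDs for a left-padded batch is a cumulative sum over the non-padding mask, sketched below in plain PyTorch. The function name position_ids_from_padding_mask and the pad_id value are hypothetical, and assigning position 0 to padding slots is an assumption rather than something the note above specifies.

```python
import torch


def position_ids_from_padding_mask(padding_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper. padding_mask is [b, s] boolean, True for real tokens."""
    # Count real tokens seen so far, make it zero-based, and zero out padding slots.
    return ((padding_mask.cumsum(dim=-1) - 1) * padding_mask).to(torch.int)


pad_id = 0  # illustrative padding token ID
tokens = torch.tensor(
    [
        [0, 0, 5, 6, 7],  # shorter prompt, left-padded
        [3, 4, 5, 6, 7],  # full-length prompt
    ]
)
padding_mask = tokens != pad_id

# First inference step: one position ID per prompt token, shape [b, padded_prompt_length].
input_pos = position_ids_from_padding_mask(padding_mask)

# Later steps with KV caches, following the note above: only the current position.
next_input_pos = torch.tensor([tokens.shape[1]])
```

For the batch above, the first row gets positions 0, 0, 0, 1, 2 and the second row gets 0, 1, 2, 3, 4.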

Shape notation:

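b: batch size
s: token sequence length
s_e: encoder sequence length
v: vocabulary size
d: token embedding dimension
d_e: encoder embedding dimension
m_s: maximum sequence length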

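reset_caches()

Resets the KV cache buffers on the relevant attention modules to zero and resets the cache positions to zero, without deleting or reallocating the cache tensors.

Raises:

RuntimeError – if KV caches are not set up. Use setup_caches() to set up the caches first.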
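The difference between resetting and reallocating can be shown with a toy cache in plain PyTorch. ToyKVCache is a hypothetical class written for this illustration, not torchtune's KVCache. The buffer layout [b, num_kv_heads, max_seq_len, head_dim] and the scalar position counter are assumptions.

```python
import torch
from torch import nn


class ToyKVCache(nn.Module):
    """Hypothetical, simplified KV cache used only to illustrate reset semantics."""

    def __init__(self, batch_size, num_kv_heads, max_seq_len, head_dim, dtype):
        super().__init__()
        shape = (batch_size, num_kv_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape, dtype=dtype), persistent=False)
        self.register_buffer("v_cache", torch.zeros(shape, dtype=dtype), persistent=False)
        self.register_buffer("cache_pos", torch.zeros((), dtype=torch.long), persistent=False)

    def update(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Write the new keys/values at the current position, then advance it.
        start = int(self.cache_pos)
        end = start + k.shape[2]
        self.k_cache[:, :, start:end] = k
        self.v_cache[:, :, start:end] = v
        self.cache_pos.add_(k.shape[2])

    def reset(self) -> None:
        # In-place zeroing: the same storage is reused, nothing is reallocated.
        self.k_cache.zero_()
        self.v_cache.zero_()
        self.cache_pos.zero_()


cache = ToyKVCache(batch_size=1, num_kv_heads=2, max_seq_len=8, head_dim=4, dtype=torch.float32)
storage_before = cache.k_cache.data_ptr()
cache.update(torch.ones(1, 2, 3, 4), torch.ones(1, 2, 3, 4))
pos_after_update = int(cache.cache_pos)
cache.reset()
```

After reset(), the buffers hold zeros and the position is back at zero, while data_ptr() still points at the original storage.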

set_num_output_chunks(num_output_chunks: int) → None

Used to save memory in combination with CEWithChunkedOutputLoss. This should be called in the recipe, before the first forward pass.

setup_caches(batch_size: int, dtype: dtype, *, encoder_max_seq_len: Optional[int] = None, decoder_max_seq_len: Optional[int] = None)

Sets up key-value attention caches for inference. For each layer in self.layers:

TransformerSelfAttentionLayer will use decoder_max_seq_len.
TransformerCrossAttentionLayer will use encoder_max_seq_len.
FusionLayer will use both decoder_max_seq_len and encoder_max_seq_len.

Parameters:

batch_size (int) – Batch size for the caches.
dtype (torch.dtype) – dtype for the caches.
encoder_max_seq_len (Optional[int]) – Maximum encoder cache sequence length.
decoder_max_seq_len (Optional[int]) – Maximum decoder cache sequence length.
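Because the caches are sized by batch_size, dtype, and the maximum sequence length, a rough memory estimate can help when choosing these arguments. The sketch below is a back-of-the-envelope calculation, not a torchtune API. The function kv_cache_bytes is hypothetical, and it assumes one key buffer and one value buffer of shape [b, num_kv_heads, max_seq_len, head_dim] per self-attention layer.

```python
import torch


def kv_cache_bytes(
    batch_size: int,
    num_kv_heads: int,
    max_seq_len: int,
    head_dim: int,
    num_layers: int,
    dtype: torch.dtype,
) -> int:
    """Hypothetical estimate of total KV cache size in bytes."""
    element_size = torch.empty((), dtype=dtype).element_size()
    per_buffer = batch_size * num_kv_heads * max_seq_len * head_dim * element_size
    # Assumption: one key buffer plus one value buffer per layer.
    return 2 * per_buffer * num_layers


# Illustrative configuration, not taken from any specific model.
total_bytes = kv_cache_bytes(
    batch_size=1,
    num_kv_heads=8,
    max_seq_len=4096,
    head_dim=128,
    num_layers=32,
    dtype=torch.bfloat16,
)
total_mib = total_bytes / 2**20
```

With these illustrative numbers the estimate comes to 512 MiB. Halving decoder_max_seq_len or batch_size would halve it under the same assumptions.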