TritonAMDGPUOps

`amdgpu.buffer_atomic_rmw` (triton::amdgpu::BufferAtomicRMWOp)

Atomic read-modify-write op on a scalar base pointer and a tensor of offsets

Syntax:

operation ::= `amdgpu.buffer_atomic_rmw` $atomic_rmw_op `,` $sem `,` $scope `,` $value `,` $ptr `[` $offsets `]` (`,` $mask^)?
              (`stride` `=` $stride^)?
              attr-dict `:` type($result)

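AMD buffer atomic RMW operation. Buffer atomics are similar to normal atomics, but they access global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers. As with other buffer ops, mask is a boolean vector that determines whether the atomic RMW is performed for a given element. Elements with mask[i] == 0 are skipped (no atomic is executed). Similar to TT_AtomicRMWOp: the buffer atomic RMW op loads data at $ptr, applies $atomic_rmw_op with $value, and stores the result to $ptr with the specified memory semantics and scope. If the result is used, the op returns the pre-operation value; otherwise the value is implicitly dropped.

The stride is the distance between the starting addresses of contiguous memory chunks. When performing an RMW, stride is the address difference, in bytes, between the first elements of each row. The compiler tries to obtain the stride when it converts to buffer ops because it is important for optimizing cache memory access.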

Traits: AttrSizedOperandSegments, SameLoadStoreOperandsAndResultEncoding

Attributes:

Attribute	MLIR Type	Description
`atomic_rmw_op`	::mlir::triton::RMWOpAttr	allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
`sem`	::mlir::triton::MemSemanticAttr	allowed 32-bit signless integer cases: 1, 2, 3, 4
`scope`	::mlir::triton::MemSyncScopeAttr	allowed 32-bit signless integer cases: 1, 2, 3

Operands:

Operand	Description
`ptr`	ptr
`offsets`	tensor of 32-bit signless integer values
`value`	ranked tensor of floating-point or integer or ptr values
`stride`	32-bit signless integer
`mask`	ranked tensor of 1-bit signless integer values

Results:

Result	Description
`result`	ranked tensor of floating-point or integer or ptr values
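
The masking and return-value behavior described above can be illustrated with a small reference model. This is only a sketch under stated assumptions, not the Triton lowering: `buffer_atomic_rmw` here is a hypothetical Python function, memory is a flat list indexed in elements, lanes run sequentially to stand in for atomicity, and masked-off lanes report `None` because the description does not say what they return.

```python
def buffer_atomic_rmw(memory, base, offsets, values, rmw, mask=None):
    """Hypothetical reference model of a masked buffer atomic RMW.

    memory  : flat list standing in for global memory (element-indexed)
    base    : scalar base index (stands in for the scalar base pointer)
    offsets : per-lane offsets from the base
    values  : per-lane operands of the RMW function
    rmw     : binary function, e.g. addition or max
    mask    : optional per-lane booleans; False lanes are skipped
    """
    old_values = []
    for lane, (off, val) in enumerate(zip(offsets, values)):
        if mask is not None and not mask[lane]:
            # Assumption: skipped lanes report None in this model.
            old_values.append(None)
            continue
        addr = base + off
        old = memory[addr]              # read
        memory[addr] = rmw(old, val)    # modify and write back
        old_values.append(old)          # the pre-operation value is returned
    return old_values


memory = [10, 20, 30, 40]
old = buffer_atomic_rmw(
    memory,
    base=0,
    offsets=[0, 1, 1, 3],
    values=[1, 2, 3, 4],
    rmw=lambda a, b: a + b,
    mask=[True, True, True, False],
)
print(old)     # pre-operation values per lane
print(memory)  # memory after the updates
```

Two lanes target the same address here, so the second of them observes the value written by the first; the masked lane leaves its address untouched.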

`amdgpu.buffer_load` (triton::amdgpu::BufferLoadOp)

Load from a scalar base pointer and a tensor of offsets

Syntax:

operation ::= `amdgpu.buffer_load` $ptr `[` $offsets `]` (`,` $mask^)? (`,` $other^)?
              oilist(`cacheModifier` `=` $cache)
              (`stride` `=` $stride^)?
              attr-dict `:` type($result)

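AMD buffer load operation. Buffer load is similar to a normal load, but it accesses global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers. The other fields are similar to a normal load: mask is a boolean vector that determines whether a given element is read from memory, and other is the element returned on lane i when mask[i] == 0.

The stride is the distance between the starting addresses of contiguous memory chunks. When performing a block load, stride is the address difference, in bytes, between the first elements of each row. The compiler tries to obtain the stride when it converts to buffer ops because it is important for optimizing cache memory access.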

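Traits: AttrSizedOperandSegments, SameLoadStoreOperandsAndResultEncoding

Attributes:

Attribute	MLIR Type	Description
`cache`	::mlir::triton::CacheModifierAttr	allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7

Operands:

Operand	Description
`ptr`	ptr
`offsets`	tensor of 32-bit signless integer values
`stride`	32-bit signless integer
`mask`	ranked tensor of 1-bit signless integer values
`other`	ranked tensor of floating-point or integer or ptr values

Results:

Result	Description
`result`	ranked tensor of floating-point or integer or ptr values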
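
As a rough illustration of how mask and other interact, the following sketch models the load on a flat Python list. The function name `buffer_load`, the element-indexed memory, and the `None` placeholder for a masked lane with no other value are all assumptions of this model rather than documented behavior of the op.

```python
def buffer_load(memory, base, offsets, mask=None, other=None):
    """Hypothetical reference model of a masked buffer load.

    Each lane reads memory[base + offsets[lane]] when its mask bit is set;
    otherwise it returns other[lane].
    """
    result = []
    for lane, off in enumerate(offsets):
        active = True if mask is None else mask[lane]
        if active:
            result.append(memory[base + off])
        else:
            # Assumption: without `other`, a masked lane yields None here.
            result.append(None if other is None else other[lane])
    return result


memory = [1.0, 2.0, 3.0, 4.0, 5.0]
loaded = buffer_load(
    memory,
    base=1,
    offsets=[0, 2, 3],
    mask=[True, False, True],
    other=[0.0, 0.0, 0.0],
)
print(loaded)
```

The middle lane is masked off, so it takes its value from other instead of touching memory.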

`amdgpu.buffer_load_to_local` (triton::amdgpu::BufferLoadToLocalOp)

Load from a scalar base pointer and a tensor of offsets into shared memory

Syntax:

operation ::= `amdgpu.buffer_load_to_local` $ptr `[` $offsets `]` (`mask` `=` $mask^)? (`other` `=` $other^)? (`stride` `=` $stride^)?
              oilist(`cacheModifier` `=` $cache) `into` $dest
              attr-dict `:` type($ptr) `[` type($offsets) `]` type($other) `->` type($dest)

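AMD buffer load operation. Similar to the amdgpu.buffer_load op, but it writes directly to shared memory instead of into registers.

Traits: AttrSizedOperandSegments

Interfaces: InferTypeOpInterface

Attributes:

Attribute	MLIR Type	Description
`cache`	::mlir::triton::CacheModifierAttr	allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7

Operands:

Operand	Description
`dest`	memory descriptor type (`::mlir::triton::gpu::MemDescType`) in the Triton IR type system
`ptr`	ptr
`offsets`	tensor of 32-bit signless integer values
`mask`	ranked tensor of 1-bit signless integer values
`other`	ranked tensor of floating-point or integer or ptr values
`stride`	32-bit signless integer

Results:

Result	Description
`token`	async token type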

`amdgpu.buffer_store` (triton::amdgpu::BufferStoreOp)

Store into a scalar base pointer and a tensor of offsets

Syntax:

operation ::= `amdgpu.buffer_store` $value `,` $ptr `[` $offsets `]` (`,` $mask^)?
              oilist(`cacheModifier` `=` $cache)
              (`stride` `=` $stride^)?
              attr-dict `:` type($value)

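AMD buffer store operation. Buffer store is similar to a normal store, but it accesses global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers. The other fields are similar to a normal store: mask is a boolean vector that determines whether a given element is written to memory, and value is the tensor of elements written on lane i when mask[i] == 1.

The stride is the distance between the starting addresses of contiguous memory chunks. When performing a block store, stride is the address difference, in bytes, between the first elements of each row. The compiler tries to obtain the stride when it converts to buffer ops because it is important for optimizing cache memory access.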

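Traits: AttrSizedOperandSegments, SameLoadStoreOperandsEncoding

Attributes:

Attribute	MLIR Type	Description
`cache`	::mlir::triton::CacheModifierAttr	allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7

Operands:

Operand	Description
`value`	ranked tensor of floating-point or integer or ptr values
`ptr`	ptr
`offsets`	tensor of 32-bit signless integer values
`stride`	32-bit signless integer
`mask`	ranked tensor of 1-bit signless integer values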
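
The store semantics can be sketched in the same style. This is a minimal model assuming a hypothetical `buffer_store` function over a flat, element-indexed Python list; it does not reflect cache modifiers or the stride hint.

```python
def buffer_store(memory, base, offsets, values, mask=None):
    """Hypothetical reference model of a masked buffer store.

    Lane i writes values[i] to memory[base + offsets[i]] only when
    mask[i] is set (or when no mask is given).
    """
    for lane, (off, val) in enumerate(zip(offsets, values)):
        if mask is None or mask[lane]:
            memory[base + off] = val


memory = [0] * 6
buffer_store(
    memory,
    base=2,
    offsets=[0, 1, 3],
    values=[7, 8, 9],
    mask=[True, False, True],
)
print(memory)
```

Only the lanes whose mask bit is set modify memory; the masked lane's target location keeps its previous value.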

`amdgpu.cond_barrier` (triton::amdgpu::CondBarrierOp)

Conditionally set barriers to synchronize a subset of the threads in a block

Syntax:

operation ::= `amdgpu.cond_barrier` $pred attr-dict

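condBarrierOp sets a barrier instruction only when the given argument is true. This provides a way to synchronize a subset of the threads in a block, deliberately diverging the execution sequences. However, the user must guarantee that all threads converge at the end by calling condBarrierOp(true) with the remaining threads. Conceptually, this is similar to having an execution barrier inside an if statement. This op lets us avoid blocking the whole block where doing so helps scheduling. Note: this op does not set any memory fence.

Operands:

Operand	Description
`pred`	1-bit signless integer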

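`amdgpu.extract_slice` (triton::amdgpu::ExtractSliceOp)

Extract slice operation

Syntax:

operation ::= `amdgpu.extract_slice` $source $static_offsets attr-dict `:` type($source) `to` type($result)

The "extract_slice" operation extracts a slice of a tensor held in registers.

The "extract_slice" operation supports the following arguments:

source: the base tensor on which to create a view tensor
offsets: offsets into the base tensor at which to create the view

Example 1: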

#blocked = #ttg.blocked<{sizePerThread = [1, 8],
    threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [0, 1]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 8],
    threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [0, 1]}>
%1 = ttg.convert_layout %0 : tensor<128x128xf16, #blocked>
    -> tensor<128x128xf16, #blocked1>
// create a slice of base tensor %1 with static offsets
%2 = amdgpu.extract_slice %1 [0, 0] :
  tensor<128x128xf16, #blocked1> to tensor<128x32xf16, #blocked1>

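Example 1 shows how the "extract_slice" operation may be used. In this example, a new 128x32 slice is created. "extract_slice" works on tensor layouts where the desired slice has the same layout as the source tensor. "%0" cannot be sliced directly, because the resulting slice could not keep the same layout as "%0". It therefore has to be converted to a layout suitable for slicing. The "#blocked1" layout is appropriate here because it keeps sizePerThread unchanged and thus preserves the coalescing properties. To use all threads in a warp, "threadsPerWarp" is set to [16,4] in this new layout. Performing this layout conversion before "extract_slice" ensures that slicing still uses all threads efficiently. The size of the slice is determined by the result type.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute	MLIR Type	Description
`static_offsets`	::mlir::DenseI64ArrayAttr	i64 dense array attribute

Operands:

Operand	Description
`source`	ranked tensor of any type values

Results:

Result	Description
`result`	ranked tensor of any type values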
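
The relationship between the static offsets and the result type can be pictured with a plain Python model of the data movement. This sketch assumes a hypothetical `extract_slice` helper over nested lists and ignores layouts entirely, so it says nothing about which layouts the real op accepts.

```python
def extract_slice(source, static_offsets, result_shape):
    """Hypothetical 2D model: take a result_shape window of `source`
    starting at static_offsets. The window size comes from the result
    shape, mirroring how the op derives it from the result type."""
    row0, col0 = static_offsets
    rows, cols = result_shape
    if row0 + rows > len(source) or col0 + cols > len(source[0]):
        raise ValueError("slice exceeds source bounds")
    return [row[col0:col0 + cols] for row in source[row0:row0 + rows]]


# A 128x128 source whose entries encode their own (row, col) position.
source = [[r * 128 + c for c in range(128)] for r in range(128)]
tile = extract_slice(source, [0, 0], (128, 32))
print(len(tile), len(tile[0]))
```

With offsets [0, 0] and a 128x32 result, the slice covers all rows and the first 32 columns of the source, matching the shapes in Example 1.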

`amdgpu.in_thread_transpose` (triton::amdgpu::InThreadTransposeOp)

Perform a transpose of the register values belonging to each thread

Syntax:

operation ::= `amdgpu.in_thread_transpose` $src attr-dict `:` type($src) `->` type($result)

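This operation performs a layout transpose over the values held in each thread's registers. Specifically, given a blocked input layout, it transposes the last two dimensions (rank-1 and rank-2) along the register dimension of the underlying linear layout.

Example of conversion:

input layout: blocked layout with sizePerThread = [2, 2] and order = [0, 1]. Its linear layout register bases = [[1, 0], [2, 0], [0, 1], [0, 2]]
output layout: same thread and warp bases as the input, register bases = [[0, 1], [0, 2], [1, 0], [2, 0]]

This operation enables efficient coalesced loads from HBM followed by vectorized writes to shared memory in cases where the HBM and shared memory orders differ and the target AMD hardware does not natively support this transposition. It is a specific variant of ttg.convert_layout and is converted to ttg.convert_layout when lowering to LLVM. We do not want this conversion to be optimized away, because instructions must be explicitly materialized to transpose within each thread after loading from HBM and before writing to shared memory.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

Operand	Description
`src`	ranked tensor of floating-point or integer or ptr values

Results:

Result	Description
`result`	ranked tensor of floating-point or integer or ptr values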
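
To see what swapping the register bases does, the sketch below expands the two basis lists from the conversion example into per-register coordinates. It assumes the usual linear-layout rule that the coordinate of register index r is the XOR of the bases selected by the set bits of r; `register_coords` is a hypothetical helper, not part of Triton.

```python
def register_coords(bases):
    """Expand linear-layout register bases into one (dim0, dim1)
    coordinate per register index by XOR-combining the bases whose
    bit is set in the index."""
    coords = []
    for reg in range(1 << len(bases)):
        d0, d1 = 0, 0
        for bit, (b0, b1) in enumerate(bases):
            if (reg >> bit) & 1:
                d0 ^= b0
                d1 ^= b1
        coords.append((d0, d1))
    return coords


input_bases = [[1, 0], [2, 0], [0, 1], [0, 2]]
output_bases = [[0, 1], [0, 2], [1, 0], [2, 0]]

before = register_coords(input_bases)
after = register_coords(output_bases)
print(before[:4])  # consecutive registers walk along dim 0
print(after[:4])   # consecutive registers walk along dim 1
```

Both layouts cover the same set of elements per thread; only the order in which consecutive registers enumerate them changes, which is what makes the subsequent writes vectorizable along the other dimension.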

`amdgpu.instruction_sched_hint` (triton::amdgpu::InstructionSchedHint)

A placeholder op for instruction scheduling hints within a basic block

Syntax:

operation ::= `amdgpu.instruction_sched_hint` attr-dict

A placeholder op for instruction scheduling hints, applied to the instructions of the basic block in which the placeholder op is located. This op is primarily intended for adjusting instruction scheduling inside the main loop generated from a tt.dot operation. Dot ops are easier to identify at a high level, which makes it possible to mark the intended scheduling regions. The hint ops are eventually lowered into LLVM AMDGPU instruction scheduling primitives, which control how different kinds of instructions (valu/mfma, global/shared memory, etc.) should interleave for better instruction-level parallelism.

Attributes:

Attribute	MLIR Type	Description
`variant`	::mlir::triton::amdgpu::SchedHintAttr	Instruction Scheduling Hints for AMD GPUs
`numDsReadsA`	::mlir::triton::amdgpu::InstCounterAttr	An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type.
`numDsReadsB`	::mlir::triton::amdgpu::InstCounterAttr	An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type.
`numDsWritesA`	::mlir::triton::amdgpu::InstCounterAttr	An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type.
`numDsWritesB`	::mlir::triton::amdgpu::InstCounterAttr	An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type.
`numGlobalLoadsA`	::mlir::triton::amdgpu::InstCounterAttr	An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type.
`numGlobalLoadsB`	::mlir::triton::amdgpu::InstCounterAttr	An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type.
`isBufferLoadsAEnabled`	::mlir::BoolAttr	bool attribute
`isBufferLoadsBEnabled`	::mlir::BoolAttr	bool attribute
`numMMAs`	::mlir::triton::amdgpu::InstCounterAttr	An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type.

`amdgpu.upcast_mxfp` (triton::amdgpu::UpcastMXFPOp)

Convert an mxfp tensor to bf16/fp16

Syntax:

operation ::= `amdgpu.upcast_mxfp` $src `,` $scale  `fp_type` `=` $fp_type attr-dict `:` type($src) `,` type($scale) `->` type($result)

Compute the bf16 value encoded in the given mxfp number, as per https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute	MLIR Type	Description
`fp_type`	::mlir::triton::ScaleDotElemTypeAttr	allowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6
`fastMath`	::mlir::BoolAttr	bool attribute

Operands:

Operand	Description
`src`	ranked tensor of floating-point or integer or ptr values
`scale`	ranked tensor of floating-point or integer or ptr values

Results:

Result	Description
`result`	ranked tensor of floating-point or integer or ptr values
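
For intuition about what decoding a microscaled value involves, the sketch below handles one case from the OCP MX specification: FP4 (E2M1) elements sharing an E8M0 scale. It is an illustrative model only. The helper names are hypothetical, the choice of E2M1 is an assumption (the op supports several fp_type cases), results are plain Python floats rather than bf16/fp16, and how the op packs elements into its `src` tensor is not modeled.

```python
# Magnitudes representable by the 3 low bits (2 exponent, 1 mantissa) of E2M1.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 2..0 the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def decode_e8m0(scale_byte):
    """Decode an E8M0 shared scale: 2**(byte - 127), with 0xFF meaning NaN."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def upcast_block(codes, scale_byte):
    """Scale every element of a block by the block's shared scale."""
    scale = decode_e8m0(scale_byte)
    return [decode_e2m1(code) * scale for code in codes]


values = upcast_block([0b0001, 0b0111, 0b1010], scale_byte=128)
print(values)
```

A scale byte of 128 decodes to a factor of 2, so each element's E2M1 value is doubled in the output.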
