cuSPARSELt Data Types#

Opaque Data Structures#

`cusparseLtHandle_t`#

The structure holds the cuSPARSELt library context (device properties, system information, etc.).

The handle must be initialized and destroyed with the cusparseLtInit() and cusparseLtDestroy() functions, respectively.

`cusparseLtMatDescriptor_t`#

The structure captures the shape and characteristics of a matrix.

It is initialized with the cusparseLtDenseDescriptorInit() or cusparseLtStructuredDescriptorInit() function and destroyed with cusparseLtMatDescriptorDestroy().

`cusparseLtMatmulDescriptor_t`#

The structure describes the matrix multiplication operation.

It is initialized with the cusparseLtMatmulDescriptorInit() function.

`cusparseLtMatmulAlgSelection_t`#

The structure describes the matrix multiplication algorithm.

It is initialized with the cusparseLtMatmulAlgSelectionInit() function.

`cusparseLtMatmulPlan_t`#

The structure holds the matrix multiplication execution plan, namely all the information necessary to execute the cusparseLtMatmul() operation.

It is initialized and destroyed with the cusparseLtMatmulPlanInit() and cusparseLtMatmulPlanDestroy() functions, respectively.

Enumerators#

`cusparseLtSparsity_t`#

The enumerator specifies the sparsity ratio of the structured matrix as

$sparsity\ ratio = \frac{nnz}{num\_rows * num\_cols}$

Value	Description
`CUSPARSELT_SPARSITY_50_PERCENT`	50% sparsity ratio: 4:8 paired for `e2m1`; 2:4 for `half`, `bfloat16`, `int`, `int8`, `e4m3`, `e5m2`; 1:2 for `float`

The sparsity property is used in the cusparseLtStructuredDescriptorInit() function.
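The ratio above and the 2:4 pattern can be illustrated on the host with a few lines of C. This is only a sketch: `sparsity_ratio` and `has_2_4_pattern` are hypothetical helper names, not cuSPARSELt functions, and the sketch assumes a dense row-major `float` buffer whose groups of four run along each row (in the library, the direction depends on the matrix layout and operation).

```c
#include <stddef.h>

/* Hypothetical helper: nnz / (num_rows * num_cols) for a dense matrix. */
double sparsity_ratio(const float *m, size_t num_rows, size_t num_cols) {
    size_t total = num_rows * num_cols;
    size_t nnz = 0;
    for (size_t i = 0; i < total; ++i)
        if (m[i] != 0.0f) ++nnz;
    return total ? (double)nnz / (double)total : 0.0;
}

/* Hypothetical helper: returns 1 if every group of four consecutive
   elements in each row holds at most two nonzeros, 0 otherwise.
   Assumes num_cols is a multiple of 4. */
int has_2_4_pattern(const float *m, size_t num_rows, size_t num_cols) {
    for (size_t r = 0; r < num_rows; ++r)
        for (size_t c = 0; c + 4 <= num_cols; c += 4) {
            int nonzeros = 0;
            for (size_t i = 0; i < 4; ++i)
                if (m[r * num_cols + c + i] != 0.0f) ++nonzeros;
            if (nonzeros > 2) return 0;
        }
    return 1;
}
```

A matrix that keeps exactly two values in every group of four yields a ratio of 0.5, which corresponds to `CUSPARSELT_SPARSITY_50_PERCENT`.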

`cusparseComputeType`#

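The enumerator specifies the compute precision modes of the matrix

Value	Description
`CUSPARSE_COMPUTE_32I`	Element-wise multiplication of matrices A and B and accumulation of the intermediate values are performed with 32-bit integer precision. Alpha and beta coefficients and the epilogue are performed with single-precision floating point. Tensor Cores are used whenever possible.
`CUSPARSE_COMPUTE_32F`	Element-wise multiplication of matrices A and B and accumulation of the intermediate values are performed with single-precision floating point. Alpha and beta coefficients and the epilogue are performed with single-precision floating point. Tensor Cores are used whenever possible.
`CUSPARSE_COMPUTE_16F`	Element-wise multiplication of matrices A and B and accumulation of the intermediate values are performed with half-precision floating point. Alpha and beta coefficients and the epilogue are performed with single-precision floating point. Tensor Cores are used whenever possible.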

The compute precision is used in the cusparseLtMatmulDescriptorInit() function.
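To make the `CUSPARSE_COMPUTE_32I` description concrete, the following scalar C sketch models the arithmetic for a single output element: products and accumulation in 32-bit integers, then alpha and beta applied in single precision. The function name `dot_compute_32i` is hypothetical, and the sketch is a CPU model of the precision rules only, not of how the library or Tensor Cores execute the operation; conversion to the output data type is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one output element under 32-bit integer compute:
   d = alpha * sum_i(a[i] * b[i]) + beta * c */
float dot_compute_32i(const int8_t *a, const int8_t *b, size_t k,
                      float alpha, float beta, float c) {
    int32_t acc = 0;                       /* 32-bit integer accumulator */
    for (size_t i = 0; i < k; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i];
    /* alpha/beta scaling happens in single-precision floating point */
    return alpha * (float)acc + beta * c;
}
```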

`cusparseLtMatDescAttribute_t`#

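The enumerator specifies the additional attributes of a matrix descriptor

Value	Description
`CUSPARSELT_MAT_NUM_BATCHES`	Number of matrices in a batch
`CUSPARSELT_MAT_BATCH_STRIDE`	Stride between consecutive matrices in a batch, expressed in matrix elements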

The attribute enumerator is used in the cusparseLtMatDescSetAttribute() and cusparseLtMatDescGetAttribute() functions.
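Because the batch stride is expressed in matrix elements rather than bytes, locating a matrix inside a batched buffer is plain pointer arithmetic on the element type. The sketch below assumes `float` elements and uses a hypothetical helper name, `batch_matrix`; it is not part of the cuSPARSELt API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pointer to the first element of matrix
   `batch_index` in a batched buffer whose stride is given in elements. */
const float *batch_matrix(const float *base, int64_t batch_stride,
                          int batch_index) {
    return base + (int64_t)batch_index * batch_stride;
}
```

For example, two 2x3 matrices stored back to back would use a stride of 6 elements, so the second matrix starts at `base + 6`.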

`cusparseLtMatmulDescAttribute_t`#

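The enumerator specifies the additional attributes of a matrix multiplication descriptor

Value	Type	Default Value	Description
`CUSPARSELT_MATMUL_ACTIVATION_RELU`	`int` (0: false, true otherwise)	`false`	ReLU activation function
`CUSPARSELT_MATMUL_ACTIVATION_RELU_UPPERBOUND`	`float`	`inf`	Upper bound of the ReLU activation function
`CUSPARSELT_MATMUL_ACTIVATION_RELU_THRESHOLD`	`float`	`0.0f`	Lower threshold of the ReLU activation function
`CUSPARSELT_MATMUL_ACTIVATION_GELU`	`int` (0: false, true otherwise)	`false`	Enable/disable the GeLU activation function. The GeLU activation function is available only with: `INT8` input, `INT8` output, `INT32` Tensor Core compute kernels; `E4M3` input, `E4M3/BF16` output, `FP32` Tensor Core compute kernels; `E5M2` input, `E5M2/BF16` output, `FP32` Tensor Core compute kernels; `E2M1` input, `E2M1/BF16` output, `FP32` Tensor Core compute kernels
`CUSPARSELT_MATMUL_ACTIVATION_GELU_SCALING`	`float`	`1.0f`	Scaling coefficient for the GeLU activation function. It implies `CUSPARSELT_MATMUL_ACTIVATION_GELU`
`CUSPARSELT_MATMUL_ALPHA_VECTOR_SCALING`	`int` (0: false, true otherwise)	`false`	Enable/disable alpha vector (per-channel) scaling
`CUSPARSELT_MATMUL_BETA_VECTOR_SCALING`	`int` (0: false, true otherwise)	`false`	Enable/disable beta vector (per-channel) scaling. `CUSPARSELT_MATMUL_BETA_VECTOR_SCALING` implies `CUSPARSELT_MATMUL_ALPHA_VECTOR_SCALING`
`CUSPARSELT_MATMUL_BIAS_POINTER`	`void*`	`NULL` (disabled)	Bias pointer. The size of the bias vector must equal the number of rows of the output matrix (D). The bias vector has the same data type as matrix C, except in the following cases, where the bias data type is `FP32`: `INT8` input, `INT8/INT32` output, `INT32` Tensor Core compute kernels; `INT8` input, `FP16/BF16` output, `INT32` Tensor Core compute kernels on pre-`SM 9.0` architectures
`CUSPARSELT_MATMUL_BIAS_STRIDE`	`int64_t`	`0` (disabled)	Bias stride between consecutive bias vectors. `0` means the first bias vector is broadcast
`CUSPARSELT_MATMUL_SPARSE_MAT_POINTER`	`void*`	`NULL` (disabled)	Pointer to the pruned sparse matrix.
`CUSPARSELT_MATMUL_A_SCALE_MODE`	`cusparseLtMatmulMatrixScale_t`	`CUSPARSELT_MATMUL_SCALE_NONE`	Scaling mode that defines how the matrix scaling factors for matrix A are interpreted.
`CUSPARSELT_MATMUL_B_SCALE_MODE`	`cusparseLtMatmulMatrixScale_t`	`CUSPARSELT_MATMUL_SCALE_NONE`	Scaling mode that defines how the matrix scaling factors for matrix B are interpreted.
`CUSPARSELT_MATMUL_C_SCALE_MODE`	`cusparseLtMatmulMatrixScale_t`	`CUSPARSELT_MATMUL_SCALE_NONE`	Scaling mode that defines how the matrix scaling factors for matrix C are interpreted.
`CUSPARSELT_MATMUL_D_SCALE_MODE`	`cusparseLtMatmulMatrixScale_t`	`CUSPARSELT_MATMUL_SCALE_NONE`	Scaling mode that defines how the matrix scaling factors for matrix D are interpreted.
`CUSPARSELT_MATMUL_D_OUT_SCALE_MODE`	`cusparseLtMatmulMatrixScale_t`	`CUSPARSELT_MATMUL_SCALE_NONE`	Scaling mode that defines how the output matrix scaling factors for matrix D are interpreted.
`CUSPARSELT_MATMUL_A_SCALE_POINTER`	`void*`	`NULL`	Pointer to the scaling factor value that converts the data in matrix A to the compute data type range. The scaling factor must have the same type as the compute type. If not specified, the scaling factor is assumed to be 1.
`CUSPARSELT_MATMUL_B_SCALE_POINTER`	`void*`	`NULL`	Equivalent to `CUSPARSELT_MATMUL_A_SCALE_POINTER` for matrix B.
`CUSPARSELT_MATMUL_C_SCALE_POINTER`	`void*`	`NULL`	Equivalent to `CUSPARSELT_MATMUL_A_SCALE_POINTER` for matrix C. Currently unused.
`CUSPARSELT_MATMUL_D_SCALE_POINTER`	`void*`	`NULL`	Equivalent to `CUSPARSELT_MATMUL_A_SCALE_POINTER` for matrix D.
`CUSPARSELT_MATMUL_D_OUT_SCALE_POINTER`	`void*`	`NULL`	Device pointer to the scaling factors used to convert the data in matrix D to the compute data type range. The scaling factor value type is defined by the scaling mode (see `CUSPARSELT_MATMUL_D_OUT_SCALE_MODE`).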

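where the ReLU activation function is defined as: [figure: ReLU definition, not reproduced]

CUSPARSELT_MATMUL_SPARSE_MAT_POINTER gives cusparseLtMatmulSearch() more flexibility to select the best algorithm. The referenced memory must not be modified before the call.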

The attribute enumerator is used in the cusparseLtMatmulDescSetAttribute() and cusparseLtMatmulDescGetAttribute() functions.

`cusparseLtMatmulAlg_t`#

The enumerator specifies the algorithm for matrix-matrix multiplication

Value	Description
`CUSPARSELT_MATMUL_ALG_DEFAULT`	Default algorithm

The algorithm enumerator is used in the cusparseLtMatmulAlgSelectionInit() function.

`cusparseLtMatmulAlgAttribute_t`#

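The enumerator specifies the attributes of the matrix multiplication algorithm

Value	Description	Possible Values
`CUSPARSELT_MATMUL_ALG_CONFIG_ID`	Algorithm ID	[0, MAX) (see `CUSPARSELT_MATMUL_ALG_CONFIG_MAX_ID`)
`CUSPARSELT_MATMUL_ALG_CONFIG_MAX_ID`	Algorithm ID limit (query only)
`CUSPARSELT_MATMUL_SEARCH_ITERATIONS`	Number of iterations (kernel launches per algorithm) for cusparseLtMatmulSearch()	> 0 (default=5)
`CUSPARSELT_MATMUL_SPLIT_K`	Split-K factor (number of slices)	On pre-`SM 9.0`: [1, K], where 1 means Split-K is disabled (default=not set); on `SM 9.0`: -1 (segment-K enabled) or 1 (segment-K disabled)
`CUSPARSELT_MATMUL_SPLIT_K_MODE`	Number of kernels used by the Split-K algorithm	`CUSPARSELT_SPLIT_K_MODE_ONE_KERNEL`, `CUSPARSELT_SPLIT_K_MODE_TWO_KERNELS`
`CUSPARSELT_MATMUL_SPLIT_K_BUFFERS`	Device memory buffers that store the partial results for the reduction	On pre-`SM 9.0`: [0, SplitK - 1]; on `SM 9.0`: 0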

The algorithm attribute enumerator is used in the cusparseLtMatmulAlgGetAttribute() and cusparseLtMatmulAlgSetAttribute() functions.

Split-K parameters allow users to split the GEMM computation along the K dimension so that more CTAs are created, which improves SM utilization when the N or M dimensions are small. However, this comes at the cost of reducing the partial results of the K slices into the final result. The cusparseLtMatmulSearch() function can be used to find the optimal combination of Split-K parameters.

Segment-K is a split-K method on SM 9.0 that uses warp-specialized persistent CTAs for enhanced efficiency and replaces the traditional split-K method.
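The trade-off described above can be sketched as a sequential CPU model in C: each K slice produces an independent partial product (the extra parallel work), and a second phase reduces the partials into the result (the extra cost). `gemm_split_k` is a hypothetical name, the matrices are assumed dense and row-major, and the sketch uses one partial buffer per slice for clarity; it does not reflect how cuSPARSELt allocates buffers or schedules kernels.

```c
#include <stddef.h>

/* Hypothetical model: C (m x n) = A (m x k) * B (k x n) with the K
   dimension split into `split_k` slices (split_k >= 1).
   `partial` must hold split_k * m * n floats. */
void gemm_split_k(const float *A, const float *B, float *C, float *partial,
                  size_t m, size_t n, size_t k, size_t split_k) {
    size_t slice = (k + split_k - 1) / split_k;   /* ceil(k / split_k) */

    /* Phase 1: every slice computes its own partial product. */
    for (size_t s = 0; s < split_k; ++s) {
        size_t k0 = s * slice;
        size_t k1 = (k0 + slice < k) ? k0 + slice : k;
        float *P = partial + s * m * n;
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (size_t p = k0; p < k1; ++p)
                    acc += A[i * k + p] * B[p * n + j];
                P[i * n + j] = acc;
            }
    }

    /* Phase 2: reduce the partial results into the final output. */
    for (size_t e = 0; e < m * n; ++e) {
        float sum = 0.0f;
        for (size_t s = 0; s < split_k; ++s)
            sum += partial[s * m * n + e];
        C[e] = sum;
    }
}
```

With `split_k` equal to 1 the model degenerates to an ordinary GEMM followed by a copy, which mirrors the table entry stating that a factor of 1 disables Split-K.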

`cusparseLtSplitKMode_t`#

The enumerator specifies the Split-K mode values corresponding to the CUSPARSELT_MATMUL_SPLIT_K_MODE attribute in cusparseLtMatmulAlgAttribute_t.

Value	Description
`CUSPARSELT_SPLIT_K_MODE_ONE_KERNEL`	Use a single kernel for Split-K
`CUSPARSELT_SPLIT_K_MODE_TWO_KERNELS`	Use two kernels for Split-K: one GPU kernel performs the GEMM and another performs the final reduction
`CUSPARSELT_SPLITK`	Use split-K decomposition
`CUSPARSELT_DATAPARALLEL`	Do not split along the K dimension
`CUSPARSELT_STREAMK`	Use stream-K decomposition
`CUSPARSELT_HEURISTIC`	Use a heuristic to determine the decomposition mode

`cusparseLtPruneAlg_t`#

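The enumerator specifies the pruning algorithm to apply to the structured matrix before compression

Value	Description
`CUSPARSELT_PRUNE_SPMMA_TILE`	`e2m1`: zero out 16 paired values in an 8x4 (row-major) or 4x8 (column-major) tile to maximize the L1 norm of the resulting tile, under the constraint of selecting exactly two elements or two paired elements for each row and column. `half`, `bfloat16`, `int8`, `e4m3`, `e5m2`: zero out eight values in a 4x4 tile to maximize the L1 norm of the resulting tile, under the constraint of selecting exactly two elements for each row and column. `float`: zero out two values in a 2x2 tile to maximize the L1 norm of the resulting tile, under the constraint of selecting exactly one element for each row and column
`CUSPARSELT_PRUNE_SPMMA_STRIP`	`e2m1`: zero out four paired values in a 1x8 strip to maximize the L1 norm of the resulting strip. `half`, `bfloat16`, `int8`, `e4m3`, `e5m2`: zero out two values in a 1x4 strip to maximize the L1 norm of the resulting strip. `float`: zero out one value in a 1x2 strip to maximize the L1 norm of the resulting strip. The strip direction is chosen according to the operation `op` and the matrix layout applied to the structured (sparse) matrix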

The pruning algorithm is used in the cusparseLtSpMMAPrune() function.
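The strip rule for the 2:4 data types can be modeled on the host: maximizing the L1 norm of a 1x4 strip after zeroing two values is the same as keeping the two values with the largest magnitude. The C sketch below works on `float` data for readability and assumes the strip runs along contiguous elements; `prune_strip_2_4` is a hypothetical name, and the tie-breaking order (lowest index wins) is an assumption, not documented library behavior.

```c
#include <stddef.h>

static float abs_value(float x) { return x < 0.0f ? -x : x; }

/* Hypothetical model of strip pruning: in every group of four
   consecutive values, keep the two largest magnitudes and zero the rest.
   Trailing elements that do not fill a group of four are left untouched. */
void prune_strip_2_4(float *v, size_t n) {
    for (size_t g = 0; g + 4 <= n; g += 4) {
        size_t first = 0;                       /* largest magnitude */
        for (size_t i = 1; i < 4; ++i)
            if (abs_value(v[g + i]) > abs_value(v[g + first])) first = i;

        size_t second = (first == 0) ? 1 : 0;   /* largest of the rest */
        for (size_t i = 0; i < 4; ++i)
            if (i != first &&
                abs_value(v[g + i]) > abs_value(v[g + second])) second = i;

        for (size_t i = 0; i < 4; ++i)
            if (i != first && i != second) v[g + i] = 0.0f;
    }
}
```

The tile variant is harder to model this briefly, because the row and column constraints must be satisfied jointly rather than strip by strip.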

`cusparseLtMatmulMatrixScale_t`#

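The enumerator specifies the scaling mode that defines how the scaling factor pointers are interpreted.

Value	Description
`CUSPARSELT_MATMUL_SCALE_NONE`	Scaling is disabled. This is the default, and the only valid value for matrices that do not use narrow-precision data types.
`CUSPARSELT_MATMUL_MATRIX_SCALE_SCALAR_32F`	The scaling factor is a single-precision scalar applied to the whole matrix. This is the only valid value for `CUSPARSELT_MATMUL_D_SCALE_MODE` when the D matrix uses a narrow-precision data type.
`CUSPARSELT_MATMUL_MATRIX_SCALE_VEC32_UE4M3`	The scaling factors are tensors that contain a dedicated scaling factor, stored as an 8-bit `CUDA_R_8F_UE4M3` value, for each 32-element block in the innermost dimension of the corresponding data matrix.
`CUSPARSELT_MATMUL_MATRIX_SCALE_VEC64_UE8M0`	The scaling factors are tensors that contain a dedicated scaling factor, stored as an 8-bit `CUDA_R_8F_UE8M0` value, for each 64-element block in the innermost dimension of the corresponding data matrix.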

Note: cusparseLtMatmulMatrixScale_t is introduced for narrow precisions (E4M3 and E2M1), which are scaled or dequantized before the computation and potentially quantized after it. See 1D Block Scaling for FP8 and FP4 Data Types for more details. The translation from row and column indices to a linear offset is the same as in cuBLASLt, as is the way multiple blocks are arranged. The only difference from cuBLASLt is the block size: in cuSPARSELt a single tile of scaling factors is applied to a 128x128 block when the scaling mode is CUSPARSELT_MATMUL_MATRIX_SCALE_VEC32_UE4M3 and to a 128x256 block when it is CUSPARSELT_MATMUL_MATRIX_SCALE_VEC64_UE8M0.
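The idea of one scaling factor per fixed-size block of the innermost dimension can be sketched in C for a single row. This is a simplified model under stated assumptions: scaling factors are held as plain `float` values instead of 8-bit `CUDA_R_8F_UE4M3` or `CUDA_R_8F_UE8M0` encodings, they are stored linearly rather than in the tiled layout mentioned in the note, and both function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: number of scaling factors needed for `n`
   elements when each factor covers `block` consecutive elements. */
size_t num_block_scales(size_t n, size_t block) {
    return (n + block - 1) / block;
}

/* Hypothetical helper: multiply each element by the scaling factor of
   the block it belongs to (block = 32 or 64 in the modes above). */
void apply_block_scales(float *values, size_t n,
                        const float *scales, size_t block) {
    for (size_t i = 0; i < n; ++i)
        values[i] *= scales[i / block];
}
```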
