Getting Started
In this section, we show how to use cuSPARSELt to perform sparse matrix-matrix multiplication. We first introduce the workflow by outlining the main steps needed to set up the computation, then explain how to install and compile against the library, and finally walk through the implementation with a detailed, commented code example.
cuSPARSELt Workflow
cuSPARSELt follows an approach and adopts concepts similar to cuBLASLt and cuTENSOR. The library's programming model requires organizing the computation in such a way that the same setup can be reused for different inputs.
Concretely, the model relies on the following high-level stages:
A. Problem definition: specify matrix shapes, data types, operations, etc.
B. User preferences/constraints: select an algorithm, or limit the search space of viable implementations (candidates)
C. Plan: gather the descriptors for the execution and "find" the best implementation if needed
D. Execution: perform the actual computation
In more detail, the common workflow consists of the following steps:
1. Initialize the library handle: cusparseLtInit().
2. Specify the input/output matrix characteristics: cusparseLtDenseDescriptorInit(), cusparseLtStructuredDescriptorInit().
3. Initialize the matrix multiplication descriptor and its properties (e.g. operations, compute type, etc.): cusparseLtMatmulDescriptorInit().
4. Initialize the algorithm selection descriptor: cusparseLtMatmulAlgSelectionInit().
5. Initialize the matrix multiplication plan: cusparseLtMatmulPlanInit().
6. Prune the A matrix: cusparseLtSpMMAPrune(). This step is not needed if the user provides a matrix that is already pruned.
7. Compress the pruned matrix: cusparseLtSpMMACompress().
8. Compute the required workspace size: cusparseLtMatmulGetWorkspace(). Allocate a device buffer of this size.
9. Execute the matrix multiplication: cusparseLtMatmul(). This step can be repeated multiple times with different input values.
10. Destroy the matrix descriptors, the matrix multiplication plan, and the library handle: cusparseLtMatDescriptorDestroy(), cusparseLtMatmulPlanDestroy(), cusparseLtDestroy().
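For intuition about step 6, the 2:4 pattern that pruning enforces can be sketched on the host. The sketch below is illustrative only (a magnitude-based, strip-wise strategy; it is not the library's device implementation, and cusparseLtSpMMAPrune() also offers other strategies such as tile-based pruning): in every group of four consecutive values, only the two entries of largest magnitude survive.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative host-side sketch (not the library implementation) of 2:4
// structured pruning: in every group of 4 consecutive values along a row,
// keep the 2 largest-magnitude entries and zero out the other 2.
std::vector<float> prune_2_4(const std::vector<float>& row) {
    std::vector<float> out(row.size(), 0.0f);
    for (std::size_t g = 0; g + 4 <= row.size(); g += 4) {
        // Order the 4 indices of this group by descending magnitude
        std::size_t idx[4] = {g, g + 1, g + 2, g + 3};
        std::sort(idx, idx + 4, [&](std::size_t a, std::size_t b) {
            return std::fabs(row[a]) > std::fabs(row[b]);
        });
        out[idx[0]] = row[idx[0]]; // keep the two largest entries
        out[idx[1]] = row[idx[1]];
    }
    return out;
}
```

Compression (step 7) then stores only the two surviving values per group plus a small amount of metadata encoding their positions, which is what allows the sparse Tensor Core path to skip the zeroed entries.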
Installation and Compilation
Download the cuSPARSELt package from developer.nvidia.com/cusparselt/downloads.
Prerequisites
Hardware Requirements
A GPU with a minimum compute capability of 8.0.
For the list of compute capabilities of NVIDIA GPUs, see: https://developer.nvidia.com/cuda-gpus
For a detailed list of supported features by GPU compute capability, see Key Features.
Software Requirements
CUDA 12.8 Toolkit or later, and a compatible CUDA driver (see the CUDA Driver Release Notes).
Linux
Assuming cuSPARSELt has been extracted into the directory CUSPARSELT_DIR, update the library path accordingly:
export LD_LIBRARY_PATH=${CUSPARSELT_DIR}/lib64:${LD_LIBRARY_PATH}
To compile the sample code discussed below (matmul_example.cpp):
nvcc matmul_example.cpp -I ${CUSPARSELT_DIR}/include -L ${CUSPARSELT_DIR}/lib64 -lcusparse -lcusparseLt -o matmul
Note that the previous command links cusparseLt as a shared library. Linking the code with the static version of the library requires additional flags:
nvcc matmul_example.cpp -I${CUSPARSELT_DIR}/include \
-Xlinker=${CUSPARSELT_DIR}/lib64/libcusparseLt_static.a \
-o matmul_static -ldl -lcuda
Windows
Assuming cuSPARSELt has been extracted into the directory CUSPARSELT_DIR, update the library path accordingly:
setx PATH "%CUSPARSELT_DIR%\lib;%PATH%"
To compile the sample code discussed below (matmul_example.cpp):
nvcc.exe matmul_example.cpp -I "%CUSPARSELT_DIR%\include" -L "%CUSPARSELT_DIR%\lib" -lcusparseLt -lcuda -o matmul.exe
Note that the previous command links cusparseLt as a shared library. Linking the code with the static version of the library requires additional flags:
nvcc.exe matmul_example.cpp -I %CUSPARSELT_DIR%\include \
-Xlinker=/WHOLEARCHIVE:"%CUSPARSELT_DIR%\lib\cusparseLt_static.lib" \
-Xlinker=/FORCE -lcuda -o matmul.exe
Code Example
The following code example shows the common steps for using cuSPARSELt and performs a matrix-matrix multiplication.
The full code can be found in cuSPARSELt Example 1.
A more advanced example that demonstrates batched sparse GEMM, activation functions, and bias can be found in cuSPARSELt Example 2.
#include <cusparseLt.h> // cusparseLt header
// Device pointers and scalar coefficients
float alpha = 1.0f;
float beta  = 0.0f;
__half* dA = ...
__half* dB = ...
__half* dC = ...
__half* dD = dC; // D uses the same descriptor as C here, and reuses its storage
//--------------------------------------------------------------------------
// cusparseLt data structures and handle initialization
cusparseLtHandle_t handle;
cusparseLtMatDescriptor_t matA, matB, matC;
cusparseLtMatmulDescriptor_t matmul;
cusparseLtMatmulAlgSelection_t alg_sel;
cusparseLtMatmulPlan_t plan;
cudaStream_t stream = nullptr;
cusparseLtInit(&handle);
//--------------------------------------------------------------------------
// matrix descriptor initialization
cusparseLtStructuredDescriptorInit(&handle, &matA, num_A_rows, num_A_cols,
lda, alignment, type, order,
CUSPARSELT_SPARSITY_50_PERCENT);
cusparseLtDenseDescriptorInit(&handle, &matB, num_B_rows, num_B_cols, ldb,
alignment, type, order);
cusparseLtDenseDescriptorInit(&handle, &matC, num_C_rows, num_C_cols, ldc,
alignment, type, order);
//--------------------------------------------------------------------------
// matmul, algorithm selection, and plan initialization
cusparseLtMatmulDescriptorInit(&handle, &matmul, opA, opB, &matA, &matB,
&matC, &matC, compute_type);
cusparseLtMatmulAlgSelectionInit(&handle, &alg_sel, &matmul,
CUSPARSELT_MATMUL_ALG_DEFAULT);
cusparseLtMatmulPlanInit(&handle, &plan, &matmul, &alg_sel);
//--------------------------------------------------------------------------
// Prune the A matrix (in-place) and check the correctness
cusparseLtSpMMAPrune(&handle, &matmul, dA, dA, CUSPARSELT_PRUNE_SPMMA_TILE,
stream);
int* d_valid;
cudaMalloc((void**) &d_valid, sizeof(int));
cusparseLtSpMMAPruneCheck(&handle, &matmul, dA, d_valid, stream);
int is_valid;
cudaMemcpyAsync(&is_valid, d_valid, sizeof(int), cudaMemcpyDeviceToHost,
stream);
cudaStreamSynchronize(stream);
if (is_valid != 0) {
std::printf("!!!! The matrix has been pruned in a wrong way. "
"cusparseLtMatmul will not provide correct results\n");
return EXIT_FAILURE;
}
//--------------------------------------------------------------------------
// Matrix A compression
size_t compressed_size;
__half* dA_compressed;
cusparseLtSpMMACompressedSize(&handle, &plan, &compressed_size);
cudaMalloc((void**) &dA_compressed, compressed_size);
cusparseLtSpMMACompress(&handle, &plan, dA, dA_compressed, stream);
//--------------------------------------------------------------------------
// Allocate workspace
size_t workspace_size;
void* d_workspace = nullptr;
cusparseLtMatmulGetWorkspace(&handle, &plan, &workspace_size);
cudaMalloc((void**) &d_workspace, workspace_size);
//--------------------------------------------------------------------------
// Perform the matrix multiplication
int num_streams = 0;
cudaStream_t* streams = nullptr;
cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB, &beta, dC, dD,
d_workspace, streams, num_streams);
//--------------------------------------------------------------------------
// Destroy descriptors, plan, and handle
cusparseLtMatDescriptorDestroy(&matA);
cusparseLtMatDescriptorDestroy(&matB);
cusparseLtMatDescriptorDestroy(&matC);
cusparseLtMatmulPlanDestroy(&plan);
cusparseLtDestroy(&handle);
// Free the device buffers allocated above
cudaFree(dA_compressed);
cudaFree(d_valid);
cudaFree(d_workspace);
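For reference, the validity condition that cusparseLtSpMMAPruneCheck() verifies on the device (at most two nonzeros in every group of four consecutive values) can be mirrored by a simple host-side check. The following is an illustrative sketch, not the library's implementation:

```cpp
#include <cstddef>
#include <vector>

// Illustrative host-side sketch (not the library implementation) of the
// 2:4 validity condition checked by cusparseLtSpMMAPruneCheck(): every
// group of 4 consecutive values may contain at most 2 nonzeros.
bool is_valid_2_4(const std::vector<float>& row) {
    for (std::size_t g = 0; g + 4 <= row.size(); g += 4) {
        int nnz = 0;
        for (std::size_t i = g; i < g + 4; ++i) {
            if (row[i] != 0.0f) ++nnz;
        }
        if (nnz > 2) return false; // more than 2 nonzeros: pattern violated
    }
    return true;
}
```

A matrix that fails this condition cannot be compressed into the 2:4 format, which is why the example above aborts when the device-side check reports an invalid pattern.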