6.30. Tensor Map Object Management

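This section describes the tensor map object management functions of the low-level CUDA driver application programming interface. The tensor core API is only supported on devices of compute capability 9.0 or higher.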

Functions

CUresult cuTensorMapEncodeIm2col ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, const int* pixelBoxLowerCorner, const int* pixelBoxUpperCorner, cuuint32_t channelsPerPixel, cuuint32_t pixelsPerColumn, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
Create a tensor map descriptor object representing im2col memory region.
CUresult cuTensorMapEncodeIm2colWide ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, int  pixelBoxLowerCornerWidth, int  pixelBoxUpperCornerWidth, cuuint32_t channelsPerPixel, cuuint32_t pixelsPerColumn, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapIm2ColWideMode mode, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
Create a tensor map descriptor object representing im2col memory region, but where the elements are exclusively loaded along the W dimension.
CUresult cuTensorMapEncodeTiled ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, const cuuint32_t* boxDim, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
Create a tensor map descriptor object representing tiled memory region.
CUresult cuTensorMapReplaceAddress ( CUtensorMap* tensorMap, void* globalAddress )
Modify an existing tensor map descriptor with an updated global address.

Functions

CUresult cuTensorMapEncodeIm2col ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, const int* pixelBoxLowerCorner, const int* pixelBoxUpperCorner, cuuint32_t channelsPerPixel, cuuint32_t pixelsPerColumn, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
Create a tensor map descriptor object representing im2col memory region.
Parameters
tensorMap
- Tensor map object to create
tensorDataType
- Tensor data type
tensorRank
- Dimensionality of tensor; must be at least 3
globalAddress
- Starting address of memory region described by tensor
globalDim
- Array containing tensor size (number of elements) along each of the tensorRank dimensions
globalStrides
- Array containing stride size (in bytes) along each of the tensorRank - 1 dimensions
pixelBoxLowerCorner
- Array containing DHW dimensions of lower box corner
pixelBoxUpperCorner
- Array containing DHW dimensions of upper box corner
channelsPerPixel
- Number of channels per pixel
pixelsPerColumn
- Number of pixels per column
elementStrides
- Array containing traversal stride in each of the tensorRank dimensions
interleave
- Type of interleaved layout the tensor addresses
swizzle
- Bank swizzling pattern inside shared memory
l2Promotion
- L2 promotion size
oobFill
- Indicate whether zero or special NaN constant will be used to fill out-of-bound elements
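Description

Creates a descriptor for a Tensor Memory Access (TMA) object specified by the parameters describing an im2col memory layout and returns it in tensorMap.

Tensor map objects are only supported on devices of compute capability 9.0 or higher. Additionally, a tensor map object is an opaque value and, as such, should only be accessed through CUDA APIs and PTX.

The parameters passed are bound by the following requirements:

  • tensorMap address must be aligned to 64 bytes.

  • tensorDataType has to be an enum from CUtensorMapDataType, which is defined as:
        typedef enum CUtensorMapDataType_enum {
                  CU_TENSOR_MAP_DATA_TYPE_UINT8 = 0,       // 1 byte
                  CU_TENSOR_MAP_DATA_TYPE_UINT16,          // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT32,          // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT32,           // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT64,          // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT64,           // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT16,         // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32,         // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT64,         // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,        // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32_FTZ,     // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32,        // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32_FTZ,    // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B,    // 4 bits
                  CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B,   // 4 bits
                  CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B    // 6 bits
              } CUtensorMapDataType;
    CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B copies '16 x U4' packed values to memory aligned to 8 bytes. There are no gaps between packed values. CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B copies '16 x U4' packed values to memory aligned to 16 bytes. There are 8-byte gaps between every 8-byte chunk of packed values. CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B copies '16 x U6' packed values to memory aligned to 16 bytes. There are 4-byte gaps between every 12-byte chunk of packed values.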
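
The packing rules for the three packed data types can be turned into a byte count. The following is a minimal sketch and not part of the CUDA API: the PackedKind enum and the packed_footprint_bytes helper are hypothetical names, and the sketch assumes the element count is a multiple of 16 and that the gap also follows the last chunk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three packed CUtensorMapDataType values. */
typedef enum {
    PACKED_16U4_ALIGN8B,   /* 16 x U4 = 8 bytes, no gap        */
    PACKED_16U4_ALIGN16B,  /* 16 x U4 = 8 bytes + 8-byte gap   */
    PACKED_16U6_ALIGN16B   /* 16 x U6 = 12 bytes + 4-byte gap  */
} PackedKind;

/* Bytes occupied in the aligned memory by `count` packed values. */
uint64_t packed_footprint_bytes(PackedKind kind, uint64_t count)
{
    assert(count % 16 == 0);          /* assumption: whole 16-value groups only */
    uint64_t groups = count / 16;
    switch (kind) {
    case PACKED_16U4_ALIGN8B:  return groups * 8;   /* 8 data bytes            */
    case PACKED_16U4_ALIGN16B: return groups * 16;  /* 8 data + 8 gap bytes    */
    case PACKED_16U6_ALIGN16B: return groups * 16;  /* 12 data + 4 gap bytes   */
    }
    return 0;
}
```

Under these assumptions, both 16-byte-aligned variants occupy twice the space of the gapless 8-byte-aligned U4 layout for the same number of values.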

  • tensorRank, which specifies the number of tensor dimensions, must be 3, 4, or 5.

  • globalAddress, which specifies the starting address of the memory region described, must be 16 byte aligned. The following requirements need to also be met:
    • When interleave is CU_TENSOR_MAP_INTERLEAVE_32B, globalAddress must be 32 byte aligned.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, globalAddress must be 32 byte aligned.

  • globalDim array, which specifies tensor size of each of the tensorRank dimensions, must be non-zero and less than or equal to 2^32. Additionally, the following requirements need to be met for the packed data types:
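    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, globalDim[0] must be a multiple of 128.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B, globalDim[0] must be a multiple of 2.

    • Dimension for the packed data types must reflect the number of individual U# values.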

  • globalStrides array, which specifies tensor stride of each of the lower tensorRank - 1 dimensions in bytes, must be a multiple of 16 and less than 2^40. Additionally, the following requirements need to be met:
    • When interleave is CU_TENSOR_MAP_INTERLEAVE_32B, the strides must be a multiple of 32.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, the strides must be a multiple of 32.

    Each following dimension stride must be an integer multiple of the preceding dimension stride:
          globalStrides[0] = globalDim[0] * elementSizeInBytes(tensorDataType) + padding[0];
          for (i = 1; i < tensorRank - 1; i++) {
              globalStrides[i] = globalStrides[i - 1] * (globalDim[i] + padding[i]);
              assert(globalStrides[i] >= globalDim[i]);
          }
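
As an illustration of the stride rules, the sketch below fills a globalStrides-style array for a tensor that is padded only at the end of each innermost row, then checks the multiple-of-16 and below-2^40 limits. It is not part of the CUDA API: dense_global_strides and its parameters are hypothetical, and the extra multiple-of-32 cases are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Fill strides[0 .. rank-2] (in bytes) from the sizes in dim[0 .. rank-1].
 * Assumption: padding exists only after each innermost row.
 * Returns 1 if every stride is a multiple of 16 and below 2^40, else 0. */
int dense_global_strides(const uint64_t *dim, uint32_t rank,
                         uint64_t elem_bytes, uint64_t row_padding_bytes,
                         uint64_t *strides)
{
    assert(rank >= 2 && rank <= 5);
    strides[0] = dim[0] * elem_bytes + row_padding_bytes;
    for (uint32_t i = 1; i + 1 < rank; i++)
        strides[i] = strides[i - 1] * dim[i];   /* multiple of the previous stride */
    for (uint32_t i = 0; i + 1 < rank; i++)
        if (strides[i] % 16 != 0 || strides[i] >= (1ULL << 40))
            return 0;
    return 1;
}
```

Because every stride is the previous one times a dimension size, the "integer multiple of the preceding stride" rule holds by construction in this sketch.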

  • pixelBoxLowerCorner array specifies the coordinate offsets {D, H, W} of the bounding box from top/left/front corner. The number of offsets and their precision depend on the tensor dimensionality:
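    • When tensorRank is 3, one signed offset within range [-32768, 32767] is supported.

    • When tensorRank is 4, two signed offsets, each within range [-128, 127], are supported.

    • When tensorRank is 5, three offsets, each within range [-16, 15], are supported.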

  • pixelBoxUpperCorner array specifies the coordinate offsets {D, H, W} of the bounding box from bottom/right/back corner. The number of offsets and their precision depend on the tensor dimensionality:
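    • When tensorRank is 3, one signed offset within range [-32768, 32767] is supported.

    • When tensorRank is 4, two signed offsets, each within range [-128, 127], are supported.

    • When tensorRank is 5, three offsets, each within range [-16, 15], are supported.

    The bounding box specified by pixelBoxLowerCorner and pixelBoxUpperCorner must have non-zero area.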
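
The rank-dependent offset ranges for both corner arrays can be checked before encoding. The helper below is a minimal sketch with a hypothetical name (pixel_box_corner_in_range); it assumes the corner array holds tensorRank - 2 offsets, as the list above implies, and it does not check the non-zero-area condition.

```c
#include <stdint.h>

/* Returns 1 if all (rank - 2) offsets fit the range permitted for `rank`. */
int pixel_box_corner_in_range(const int *corner, uint32_t rank)
{
    int lo, hi;
    switch (rank) {
    case 3: lo = -32768; hi = 32767; break;   /* one offset     */
    case 4: lo = -128;   hi = 127;   break;   /* two offsets    */
    case 5: lo = -16;    hi = 15;    break;   /* three offsets  */
    default: return 0;                        /* rank not supported for im2col */
    }
    for (uint32_t i = 0; i + 2 < rank; i++)
        if (corner[i] < lo || corner[i] > hi)
            return 0;
    return 1;
}
```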

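  • channelsPerPixel, which specifies the number of elements that must be accessed along the C dimension, must be less than or equal to 256. Additionally, when tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, channelsPerPixel must be 128.

  • pixelsPerColumn, which specifies the number of elements that must be accessed along the {N, D, H, W} dimensions, must be less than or equal to 1024.

  • elementStrides array, which specifies the iteration step along each of the tensorRank dimensions, must be non-zero and less than or equal to 8. Note that when interleave is CU_TENSOR_MAP_INTERLEAVE_NONE, the first element of this array is ignored since TMA doesn't support the stride for dimension zero. When all elements of the elementStrides array are one, boxDim specifies the number of elements to load. However, if elementStrides[i] is not equal to one for some i, then TMA loads ceil(boxDim[i] / elementStrides[i]) elements along the i-th dimension. To load N elements along the i-th dimension, boxDim[i] must be set to N * elementStrides[i].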

  • interleave specifies the interleaved layout of type CUtensorMapInterleave, which is defined as:
    ‎    typedef enum CUtensorMapInterleave_enum {
                  CU_TENSOR_MAP_INTERLEAVE_NONE = 0,
                  CU_TENSOR_MAP_INTERLEAVE_16B,
                  CU_TENSOR_MAP_INTERLEAVE_32B
              } CUtensorMapInterleave;
    TMA supports interleaved layouts like NC/8HWC8 where C8 utilizes 16 bytes in memory assuming 2 byte per channel or NC/16HWC16 where C16 uses 32 bytes. When interleave is CU_TENSOR_MAP_INTERLEAVE_NONE and swizzle is not CU_TENSOR_MAP_SWIZZLE_NONE, the bounding box inner dimension (computed as channelsPerPixel multiplied by element size in bytes derived from tensorDataType) must be less than or equal to the swizzle size.
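    • CU_TENSOR_MAP_SWIZZLE_32B requires the bounding box inner dimension to be <= 32.

    • CU_TENSOR_MAP_SWIZZLE_64B requires the bounding box inner dimension to be <= 64.

    • CU_TENSOR_MAP_SWIZZLE_128B* require the bounding box inner dimension to be <= 128.

    Additionally, tensorDataType of CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B requires interleave to be CU_TENSOR_MAP_INTERLEAVE_NONE.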
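
The inner-dimension limit is a simple product comparison. The following sketch assumes interleave is CU_TENSOR_MAP_INTERLEAVE_NONE and a whole-byte element size; inner_dim_fits_swizzle is a hypothetical helper, and passing the swizzle span as a byte count (0 meaning no swizzle) is an encoding chosen only for this example.

```c
#include <stdint.h>

/* swizzle_span_bytes: 32, 64 or 128 for the swizzle modes, 0 for "no swizzle".
 * Returns 1 if channelsPerPixel * element size fits inside the swizzle span. */
int inner_dim_fits_swizzle(uint32_t channels_per_pixel,
                           uint32_t elem_bytes,
                           uint32_t swizzle_span_bytes)
{
    if (swizzle_span_bytes == 0)
        return 1;   /* the limit applies only when a swizzle mode is selected */
    uint64_t inner_bytes = (uint64_t)channels_per_pixel * elem_bytes;
    return inner_bytes <= swizzle_span_bytes;
}
```

For example, 64 channels of a 2-byte type fill a 128-byte span exactly, while 64 channels of a 4-byte type exceed it.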

  • swizzle, which specifies the shared memory bank swizzling pattern, has to be of type CUtensorMapSwizzle which is defined as:
    ‎    typedef enum CUtensorMapSwizzle_enum {
                  CU_TENSOR_MAP_SWIZZLE_NONE = 0,
                  CU_TENSOR_MAP_SWIZZLE_32B,                   // Swizzle 16B chunks within 32B  span
                  CU_TENSOR_MAP_SWIZZLE_64B,                   // Swizzle 16B chunks within 64B  span
                  CU_TENSOR_MAP_SWIZZLE_128B,                  // Swizzle 16B chunks within 128B span
                  CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B,         // Swizzle 32B chunks within 128B span
                  CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B_FLIP_8B, // Swizzle 32B chunks within 128B span, additionally swap lower 8B with upper 8B within each 16B for every alternate row
                  CU_TENSOR_MAP_SWIZZLE_128B_ATOM_64B          // Swizzle 64B chunks within 128B span
              } CUtensorMapSwizzle;
    Data are organized in a specific order in global memory; however, this may not match the order in which the application accesses data in shared memory. This difference in data organization may cause bank conflicts when shared memory is accessed. In order to avoid this problem, data can be loaded to shared memory with shuffling across shared memory banks. When interleave is CU_TENSOR_MAP_INTERLEAVE_32B, swizzle must be CU_TENSOR_MAP_SWIZZLE_32B. Other interleave modes can have any swizzling pattern. When the tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B, only the following swizzle modes are supported:
    • CU_TENSOR_MAP_SWIZZLE_NONE (Load & Store)

    • CU_TENSOR_MAP_SWIZZLE_128B (Load & Store)

    • CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B (Load & Store)

    • CU_TENSOR_MAP_SWIZZLE_128B_ATOM_64B (Store only)

    When the tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, only the following swizzle modes are supported:

    • CU_TENSOR_MAP_SWIZZLE_NONE (Load only)

    • CU_TENSOR_MAP_SWIZZLE_128B (Load only)

    • CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B (Load only)

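  • l2Promotion specifies the L2 fetch size, which indicates the byte granularity at which L2 requests are filled from DRAM. It must be of type CUtensorMapL2promotion, which is defined as: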
    ‎    typedef enum CUtensorMapL2promotion_enum {
                  CU_TENSOR_MAP_L2_PROMOTION_NONE = 0,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_64B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_256B
              } CUtensorMapL2promotion;

  • oobFill, which indicates whether zero or a special NaN constant should be used to fill out-of-bound elements, must be of type CUtensorMapFloatOOBfill, which is defined as:
    ‎    typedef enum CUtensorMapFloatOOBfill_enum {
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE = 0,
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA
              } CUtensorMapFloatOOBfill;
    Note that CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA can only be used when tensorDataType represents a floating-point data type, and when tensorDataType is not CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B, CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, or CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B.

See also:

cuTensorMapEncodeTiled, cuTensorMapEncodeIm2colWide, cuTensorMapReplaceAddress

CUresult cuTensorMapEncodeIm2colWide ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, int  pixelBoxLowerCornerWidth, int  pixelBoxUpperCornerWidth, cuuint32_t channelsPerPixel, cuuint32_t pixelsPerColumn, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapIm2ColWideMode mode, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
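Create a tensor map descriptor object representing im2col memory region, but where the elements are exclusively loaded along the W dimension.
Parameters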
tensorMap
- Tensor map object to create
tensorDataType
- Tensor data type
tensorRank
- Dimensionality of tensor; must be at least 3
globalAddress
- Starting address of memory region described by tensor
globalDim
- Array containing tensor size (number of elements) along each of the tensorRank dimensions
globalStrides
- Array containing stride size (in bytes) along each of the tensorRank - 1 dimensions
pixelBoxLowerCornerWidth
- Width offset of left box corner
pixelBoxUpperCornerWidth
- Width offset of right box corner
channelsPerPixel
- Number of channels per pixel
pixelsPerColumn
- Number of pixels per column
elementStrides
- Array containing traversal stride in each of the tensorRank dimensions
interleave
- Type of interleaved layout the tensor addresses
mode
- W or W128 mode
swizzle
- Bank swizzling pattern inside shared memory
l2Promotion
- L2 promotion size
oobFill
- Indicate whether zero or special NaN constant will be used to fill out-of-bound elements
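Description

Creates a descriptor for a Tensor Memory Access (TMA) object specified by the parameters describing an im2col memory layout in which the row is always loaded along the W dimension, and returns it in tensorMap. This assumes the tensor layout in memory is either NDHWC, NHWC, or NWC.

This API is only supported on devices of compute capability 10.0 or higher. Additionally, a tensor map object is an opaque value and, as such, should only be accessed through CUDA APIs and PTX.

The parameters passed are bound by the following requirements:

  • tensorMap address must be aligned to 64 bytes.

  • tensorDataType has to be an enum from CUtensorMapDataType, which is defined as:
        typedef enum CUtensorMapDataType_enum {
                  CU_TENSOR_MAP_DATA_TYPE_UINT8 = 0,       // 1 byte
                  CU_TENSOR_MAP_DATA_TYPE_UINT16,          // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT32,          // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT32,           // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT64,          // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT64,           // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT16,         // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32,         // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT64,         // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,        // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32_FTZ,     // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32,        // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32_FTZ,    // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B,    // 4 bits
                  CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B,   // 4 bits
                  CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B    // 6 bits
              } CUtensorMapDataType;
    CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B copies '16 x U4' packed values to memory aligned to 8 bytes. There are no gaps between packed values. CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B copies '16 x U4' packed values to memory aligned to 16 bytes. There are 8-byte gaps between every 8-byte chunk of packed values. CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B copies '16 x U6' packed values to memory aligned to 16 bytes. There are 4-byte gaps between every 12-byte chunk of packed values.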

  • tensorRank, which specifies the number of tensor dimensions, must be 3, 4, or 5.

  • globalAddress, which specifies the starting address of the memory region described, must be 16 byte aligned. The following requirements need to also be met:
    • When interleave is CU_TENSOR_MAP_INTERLEAVE_32B, globalAddress must be 32 byte aligned.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, globalAddress must be 32 byte aligned.
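
The alignment rule for globalAddress reduces to choosing between 16 and 32 bytes. The two helpers below are a minimal sketch with hypothetical names; the boolean flags stand in for the interleave and tensorDataType comparisons a caller would make against the real enum values.

```c
#include <stdint.h>

/* Required globalAddress alignment in bytes.
 * interleave_is_32b:        interleave == CU_TENSOR_MAP_INTERLEAVE_32B
 * type_is_align16b_packed:  tensorDataType is 16U6_ALIGN16B or 16U4_ALIGN16B */
unsigned required_address_alignment(int interleave_is_32b,
                                    int type_is_align16b_packed)
{
    return (interleave_is_32b || type_is_align16b_packed) ? 32u : 16u;
}

/* Returns 1 if pointer p is a multiple of `alignment` bytes. */
int address_is_aligned(const void *p, unsigned alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```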

  • globalDim array, which specifies tensor size of each of the tensorRank dimensions, must be non-zero and less than or equal to 2^32. Additionally, the following requirements need to be met for the packed data types:
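    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, globalDim[0] must be a multiple of 128.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B, globalDim[0] must be a multiple of 2.

    • Dimension for the packed data types must reflect the number of individual U# values.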

  • globalStrides array, which specifies tensor stride of each of the lower tensorRank - 1 dimensions in bytes, must be a multiple of 16 and less than 2^40. Additionally, the following requirements need to be met:
    • When interleave is CU_TENSOR_MAP_INTERLEAVE_32B, the strides must be a multiple of 32.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, the strides must be a multiple of 32.

    Each following dimension stride must be an integer multiple of the preceding dimension stride:
          globalStrides[0] = globalDim[0] * elementSizeInBytes(tensorDataType) + padding[0];
          for (i = 1; i < tensorRank - 1; i++) {
              globalStrides[i] = globalStrides[i - 1] * (globalDim[i] + padding[i]);
              assert(globalStrides[i] >= globalDim[i]);
          }

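  • pixelBoxLowerCornerWidth specifies the coordinate offset W of the bounding box from the left corner. The offset must be within range [-32768, 32767].

  • pixelBoxUpperCornerWidth specifies the coordinate offset W of the bounding box from the right corner. The offset must be within range [-32768, 32767].

The bounding box specified by pixelBoxLowerCornerWidth and pixelBoxUpperCornerWidth must have non-zero area. Note that the size of the box along the D and H dimensions is always equal to one.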

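  • channelsPerPixel, which specifies the number of elements that must be accessed along the C dimension, must be less than or equal to 256. Additionally, when tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, channelsPerPixel must be 128.

  • pixelsPerColumn, which specifies the number of elements that must be accessed along the W dimension, must be less than or equal to 1024. This field is ignored when mode is CU_TENSOR_MAP_IM2COL_WIDE_MODE_W128.

  • elementStrides array, which specifies the iteration step along each of the tensorRank dimensions, must be non-zero and less than or equal to 8. Note that when interleave is CU_TENSOR_MAP_INTERLEAVE_NONE, the first element of this array is ignored since TMA doesn't support the stride for dimension zero. When all elements of the elementStrides array are one, boxDim specifies the number of elements to load. However, if elementStrides[i] is not equal to one for some i, then TMA loads ceil(boxDim[i] / elementStrides[i]) elements along the i-th dimension. To load N elements along the i-th dimension, boxDim[i] must be set to N * elementStrides[i].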

  • interleave specifies the interleaved layout of type CUtensorMapInterleave, which is defined as:
    ‎    typedef enum CUtensorMapInterleave_enum {
                  CU_TENSOR_MAP_INTERLEAVE_NONE = 0,
                  CU_TENSOR_MAP_INTERLEAVE_16B,
                  CU_TENSOR_MAP_INTERLEAVE_32B
              } CUtensorMapInterleave;
    TMA supports interleaved layouts like NC/8HWC8 where C8 utilizes 16 bytes in memory assuming 2 byte per channel or NC/16HWC16 where C16 uses 32 bytes. When interleave is CU_TENSOR_MAP_INTERLEAVE_NONE, the bounding box inner dimension (computed as channelsPerPixel multiplied by element size in bytes derived from tensorDataType) must be less than or equal to the swizzle size.
    • CU_TENSOR_MAP_SWIZZLE_64B requires the bounding box inner dimension to be <= 64.

    • CU_TENSOR_MAP_SWIZZLE_128B* require the bounding box inner dimension to be <= 128.

    Additionally, tensorDataType of CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B requires interleave to be CU_TENSOR_MAP_INTERLEAVE_NONE.

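  • mode, which describes how elements are loaded along the W dimension, has to be one of the following CUtensorMapIm2ColWideMode types:
              CU_TENSOR_MAP_IM2COL_WIDE_MODE_W,
              CU_TENSOR_MAP_IM2COL_WIDE_MODE_W128
    CU_TENSOR_MAP_IM2COL_WIDE_MODE_W allows the number of elements loaded along the W dimension to be specified via the pixelsPerColumn field.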

  • swizzle, which specifies the shared memory bank swizzling pattern, must be one of the following CUtensorMapSwizzle modes (other swizzle modes are not supported):
    ‎    typedef enum CUtensorMapSwizzle_enum {
                  CU_TENSOR_MAP_SWIZZLE_64B,                   // Swizzle 16B chunks within 64B  span
                  CU_TENSOR_MAP_SWIZZLE_128B,                  // Swizzle 16B chunks within 128B span
                  CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B,         // Swizzle 32B chunks within 128B span
              } CUtensorMapSwizzle;
    Data are organized in a specific order in global memory; however, this may not match the order in which the application accesses data in shared memory. This difference in data organization may cause bank conflicts when shared memory is accessed. In order to avoid this problem, data can be loaded to shared memory with shuffling across shared memory banks. When the tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B, only the following swizzle modes are supported:
    • CU_TENSOR_MAP_SWIZZLE_128B (Load & Store)

    • CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B (Load & Store)

    When the tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, only the following swizzle modes are supported:

    • CU_TENSOR_MAP_SWIZZLE_128B (Load only)

    • CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B (Load only)

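  • l2Promotion specifies the L2 fetch size, which indicates the byte granularity at which L2 requests are filled from DRAM. It must be of type CUtensorMapL2promotion, which is defined as: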
    ‎    typedef enum CUtensorMapL2promotion_enum {
                  CU_TENSOR_MAP_L2_PROMOTION_NONE = 0,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_64B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_256B
              } CUtensorMapL2promotion;

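  • oobFill, which indicates whether zero or a special NaN constant should be used to fill out-of-bound elements, must be of type CUtensorMapFloatOOBfill, which is defined as: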
    ‎    typedef enum CUtensorMapFloatOOBfill_enum {
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE = 0,
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA
              } CUtensorMapFloatOOBfill;
    Note that CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA can only be used when tensorDataType represents a floating-point data type, and when tensorDataType is not CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B, CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, or CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B.

See also:

cuTensorMapEncodeTiled, cuTensorMapEncodeIm2col, cuTensorMapReplaceAddress

CUresult cuTensorMapEncodeTiled ( CUtensorMap* tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void* globalAddress, const cuuint64_t* globalDim, const cuuint64_t* globalStrides, const cuuint32_t* boxDim, const cuuint32_t* elementStrides, CUtensorMapInterleave interleave, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill )
Create a tensor map descriptor object representing tiled memory region.
Parameters
tensorMap
- Tensor map object to create
tensorDataType
- Tensor data type
tensorRank
- Dimensionality of tensor
globalAddress
- Starting address of memory region described by tensor
globalDim
- Array containing tensor size (number of elements) along each of the tensorRank dimensions
globalStrides
- Array containing stride size (in bytes) along each of the tensorRank - 1 dimensions
boxDim
- Array containing traversal box size (number of elements) along each of the tensorRank dimensions. Specifies how many elements are to be traversed along each tensor dimension.
elementStrides
- Array containing traversal stride in each of the tensorRank dimensions
interleave
- Type of interleaved layout the tensor addresses
swizzle
- Bank swizzling pattern inside shared memory
l2Promotion
- L2 promotion size
oobFill
- Indicate whether zero or special NaN constant must be used to fill out-of-bound elements
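Description

Creates a descriptor for a Tensor Memory Access (TMA) object specified by the parameters describing a tiled region and returns it in tensorMap.

Tensor map objects are only supported on devices of compute capability 9.0 or higher. Additionally, a tensor map object is an opaque value and, as such, should only be accessed through CUDA APIs and PTX.

The parameters passed are bound by the following requirements:

  • tensorMap address must be aligned to 64 bytes.

  • tensorDataType has to be an enum from CUtensorMapDataType, which is defined as:
        typedef enum CUtensorMapDataType_enum {
                  CU_TENSOR_MAP_DATA_TYPE_UINT8 = 0,       // 1 byte
                  CU_TENSOR_MAP_DATA_TYPE_UINT16,          // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT32,          // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT32,           // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_UINT64,          // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_INT64,           // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT16,         // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32,         // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT64,         // 8 bytes
                  CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,        // 2 bytes
                  CU_TENSOR_MAP_DATA_TYPE_FLOAT32_FTZ,     // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32,        // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_TFLOAT32_FTZ,    // 4 bytes
                  CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B,    // 4 bits
                  CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B,   // 4 bits
                  CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B    // 6 bits
              } CUtensorMapDataType;
    CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B copies '16 x U4' packed values to memory aligned to 8 bytes. There are no gaps between packed values. CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B copies '16 x U4' packed values to memory aligned to 16 bytes. There are 8-byte gaps between every 8-byte chunk of packed values. CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B copies '16 x U6' packed values to memory aligned to 16 bytes. There are 4-byte gaps between every 12-byte chunk of packed values.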

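  • tensorRank must be non-zero and less than or equal to the maximum supported dimensionality of 5. If interleave is not CU_TENSOR_MAP_INTERLEAVE_NONE, then tensorRank must additionally be greater than or equal to 3.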
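
The tensorRank rule for tiled maps combines a range check with an interleave-dependent lower bound. A minimal sketch follows; tiled_rank_is_valid is a hypothetical helper, and the flag replaces a comparison against CU_TENSOR_MAP_INTERLEAVE_NONE.

```c
#include <stdint.h>

/* Returns 1 if `rank` is acceptable for a tiled tensor map.
 * interleave_is_none: interleave == CU_TENSOR_MAP_INTERLEAVE_NONE */
int tiled_rank_is_valid(uint32_t rank, int interleave_is_none)
{
    if (rank == 0 || rank > 5)
        return 0;                 /* must be in 1..5 */
    if (!interleave_is_none && rank < 3)
        return 0;                 /* interleaved layouts need at least 3 dimensions */
    return 1;
}
```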

  • globalAddress, which specifies the starting address of the memory region described, must be 16 byte aligned. The following requirements need to also be met:
    • When interleave is CU_TENSOR_MAP_INTERLEAVE_32B, globalAddress must be 32 byte aligned.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, globalAddress must be 32 byte aligned.

  • globalDim array, which specifies tensor size of each of the tensorRank dimensions, must be non-zero and less than or equal to 2^32. Additionally, the following requirements need to be met for the packed data types:
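    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, globalDim[0] must be a multiple of 128.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B, globalDim[0] must be a multiple of 2.

    • Dimension for the packed data types must reflect the number of individual U# values.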

  • globalStrides array, which specifies tensor stride of each of the lower tensorRank - 1 dimensions in bytes, must be a multiple of 16 and less than 2^40. Additionally, the following requirements need to be met:
    • When interleave is CU_TENSOR_MAP_INTERLEAVE_32B, the strides must be a multiple of 32.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, the strides must be a multiple of 32.

    Each following dimension stride must be an integer multiple of the preceding dimension stride:
          globalStrides[0] = globalDim[0] * elementSizeInBytes(tensorDataType) + padding[0];
          for (i = 1; i < tensorRank - 1; i++) {
              globalStrides[i] = globalStrides[i - 1] * (globalDim[i] + padding[i]);
              assert(globalStrides[i] >= globalDim[i]);
          }

  • boxDim array, which specifies number of elements to be traversed along each of the tensorRank dimensions, must be non-zero and less than or equal to 256. Additionally, the following requirements need to be met:
    • When interleave is CU_TENSOR_MAP_INTERLEAVE_NONE, { boxDim[0] * elementSizeInBytes( tensorDataType ) } must be a multiple of 16 bytes.

    • When tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B or CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, boxDim[0] must be 128.
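
The boxDim limits can be checked in a few lines. The sketch below is a hypothetical helper (box_dim_is_valid), not a CUDA API function; it handles only data types with a whole-byte element size and therefore skips the packed-type rule that boxDim[0] must be 128.

```c
#include <stdint.h>

/* Returns 1 if box_dim[0 .. rank-1] satisfies the documented limits.
 * interleave_is_none: interleave == CU_TENSOR_MAP_INTERLEAVE_NONE
 * elem_bytes: element size in bytes (packed sub-byte types are not covered) */
int box_dim_is_valid(const uint32_t *box_dim, uint32_t rank,
                     uint32_t elem_bytes, int interleave_is_none)
{
    for (uint32_t i = 0; i < rank; i++)
        if (box_dim[i] == 0 || box_dim[i] > 256)
            return 0;                       /* each entry must be in 1..256 */
    if (interleave_is_none &&
        ((uint64_t)box_dim[0] * elem_bytes) % 16 != 0)
        return 0;                           /* inner box row must be a multiple of 16 bytes */
    return 1;
}
```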

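  • elementStrides array, which specifies the iteration step along each of the tensorRank dimensions, must be non-zero and less than or equal to 8. Note that when interleave is CU_TENSOR_MAP_INTERLEAVE_NONE, the first element of this array is ignored since TMA doesn't support the stride for dimension zero. When all elements of the elementStrides array are one, boxDim specifies the number of elements to load. However, if elementStrides[i] is not equal to one for some i, then TMA loads ceil(boxDim[i] / elementStrides[i]) elements along the i-th dimension. To load N elements along the i-th dimension, boxDim[i] must be set to N * elementStrides[i].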
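
The relationship between boxDim, elementStrides, and the number of elements actually loaded is plain integer arithmetic. The two helpers below are a minimal sketch with hypothetical names; they model only the ceil formula quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Elements loaded along one dimension: ceil(box_dim / element_stride). */
uint32_t elements_loaded(uint32_t box_dim, uint32_t element_stride)
{
    assert(element_stride >= 1 && element_stride <= 8);
    return (box_dim + element_stride - 1) / element_stride;
}

/* boxDim value needed to load exactly n elements with the given stride. */
uint32_t box_dim_for(uint32_t n, uint32_t element_stride)
{
    assert(element_stride >= 1 && element_stride <= 8);
    return n * element_stride;
}
```

With a stride of 2, a box size of 8 loads 4 elements, and so does a box size of 7, because the division rounds up.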

  • interleave specifies the interleaved layout of type CUtensorMapInterleave, which is defined as:
    ‎    typedef enum CUtensorMapInterleave_enum {
                  CU_TENSOR_MAP_INTERLEAVE_NONE = 0,
                  CU_TENSOR_MAP_INTERLEAVE_16B,
                  CU_TENSOR_MAP_INTERLEAVE_32B
              } CUtensorMapInterleave;
    TMA supports interleaved layouts like NC/8HWC8 where C8 utilizes 16 bytes in memory assuming 2 byte per channel or NC/16HWC16 where C16 uses 32 bytes. When interleave is CU_TENSOR_MAP_INTERLEAVE_NONE and swizzle is not CU_TENSOR_MAP_SWIZZLE_NONE, the bounding box inner dimension (computed as boxDim[0] multiplied by element size derived from tensorDataType) must be less than or equal to the swizzle size.
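    • CU_TENSOR_MAP_SWIZZLE_32B requires the bounding box inner dimension to be <= 32.

    • CU_TENSOR_MAP_SWIZZLE_64B requires the bounding box inner dimension to be <= 64.

    • CU_TENSOR_MAP_SWIZZLE_128B* require the bounding box inner dimension to be <= 128.

    Additionally, tensorDataType of CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B requires interleave to be CU_TENSOR_MAP_INTERLEAVE_NONE.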

  • swizzle, which specifies the shared memory bank swizzling pattern, has to be of type CUtensorMapSwizzle which is defined as:
    ‎    typedef enum CUtensorMapSwizzle_enum {
                  CU_TENSOR_MAP_SWIZZLE_NONE = 0,
                  CU_TENSOR_MAP_SWIZZLE_32B,                   // Swizzle 16B chunks within 32B  span
                  CU_TENSOR_MAP_SWIZZLE_64B,                   // Swizzle 16B chunks within 64B  span
                  CU_TENSOR_MAP_SWIZZLE_128B,                  // Swizzle 16B chunks within 128B span
                  CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B,         // Swizzle 32B chunks within 128B span
                  CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B_FLIP_8B, // Swizzle 32B chunks within 128B span, additionally swap lower 8B with upper 8B within each 16B for every alternate row
                  CU_TENSOR_MAP_SWIZZLE_128B_ATOM_64B          // Swizzle 64B chunks within 128B span
              } CUtensorMapSwizzle;
    Data are organized in a specific order in global memory; however, this may not match the order in which the application accesses data in shared memory. This difference in data organization may cause bank conflicts when shared memory is accessed. In order to avoid this problem, data can be loaded to shared memory with shuffling across shared memory banks. When interleave is CU_TENSOR_MAP_INTERLEAVE_32B, swizzle must be CU_TENSOR_MAP_SWIZZLE_32B. Other interleave modes can have any swizzling pattern. When the tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B, only the following swizzle modes are supported:
    • CU_TENSOR_MAP_SWIZZLE_NONE (Load & Store)

    • CU_TENSOR_MAP_SWIZZLE_128B (Load & Store)

    • CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B (Load & Store)

    • CU_TENSOR_MAP_SWIZZLE_128B_ATOM_64B (Store only)

    When the tensorDataType is CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, only the following swizzle modes are supported:

    • CU_TENSOR_MAP_SWIZZLE_NONE (Load only)

    • CU_TENSOR_MAP_SWIZZLE_128B (Load only)

    • CU_TENSOR_MAP_SWIZZLE_128B_ATOM_32B (Load only)
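
The two support tables for the 16-byte-aligned packed types can be expressed as lookup functions. This is a sketch under stated assumptions: the Swizzle enum below is a local, hypothetical mirror of CUtensorMapSwizzle (same order, shortened names), and "Load" and "Store" are read as the direction of the TMA transfer the tensor map will be used for.

```c
/* Local mirror of CUtensorMapSwizzle, for illustration only. */
typedef enum {
    SW_NONE, SW_32B, SW_64B, SW_128B,
    SW_128B_ATOM_32B, SW_128B_ATOM_32B_FLIP_8B, SW_128B_ATOM_64B
} Swizzle;

/* 16U6_ALIGN16B: three modes for load and store, ATOM_64B for store only. */
int swizzle_ok_for_16u6(Swizzle s, int is_store)
{
    switch (s) {
    case SW_NONE:
    case SW_128B:
    case SW_128B_ATOM_32B: return 1;
    case SW_128B_ATOM_64B: return is_store ? 1 : 0;
    default:               return 0;
    }
}

/* 16U4_ALIGN16B: three modes, loads only. */
int swizzle_ok_for_16u4_align16b(Swizzle s, int is_store)
{
    if (is_store)
        return 0;
    return s == SW_NONE || s == SW_128B || s == SW_128B_ATOM_32B;
}
```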

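  • l2Promotion specifies the L2 fetch size, which indicates the byte granularity at which L2 requests are filled from DRAM. It must be of type CUtensorMapL2promotion, which is defined as: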
    ‎    typedef enum CUtensorMapL2promotion_enum {
                  CU_TENSOR_MAP_L2_PROMOTION_NONE = 0,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_64B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
                  CU_TENSOR_MAP_L2_PROMOTION_L2_256B
              } CUtensorMapL2promotion;

  • oobFill, which indicates whether zero or a special NaN constant should be used to fill out-of-bound elements, must be of type CUtensorMapFloatOOBfill, which is defined as:
    ‎    typedef enum CUtensorMapFloatOOBfill_enum {
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE = 0,
                  CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA
              } CUtensorMapFloatOOBfill;
    Note that CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA can only be used when tensorDataType represents a floating-point data type, and when tensorDataType is not CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN8B, CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B, or CU_TENSOR_MAP_DATA_TYPE_16U6_ALIGN16B.
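
Whether the NaN fill mode may be requested depends only on the data type. The sketch below uses a local, hypothetical DataType enum that mirrors the order of CUtensorMapDataType; treating exactly the FLOAT16, FLOAT32, FLOAT64, BFLOAT16, and TFLOAT32 variants as the floating-point types is an assumption based on their names.

```c
/* Local mirror of CUtensorMapDataType, for illustration only. */
typedef enum {
    DT_UINT8, DT_UINT16, DT_UINT32, DT_INT32, DT_UINT64, DT_INT64,
    DT_FLOAT16, DT_FLOAT32, DT_FLOAT64, DT_BFLOAT16, DT_FLOAT32_FTZ,
    DT_TFLOAT32, DT_TFLOAT32_FTZ,
    DT_16U4_ALIGN8B, DT_16U4_ALIGN16B, DT_16U6_ALIGN16B
} DataType;

/* Returns 1 if the NaN out-of-bound fill mode may be used with type t. */
int nan_fill_allowed(DataType t)
{
    switch (t) {
    case DT_FLOAT16:
    case DT_FLOAT32:
    case DT_FLOAT64:
    case DT_BFLOAT16:
    case DT_FLOAT32_FTZ:
    case DT_TFLOAT32:
    case DT_TFLOAT32_FTZ:
        return 1;        /* assumed floating-point types */
    default:
        return 0;        /* integer and packed U4/U6 types */
    }
}
```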

See also:

cuTensorMapEncodeIm2col, cuTensorMapEncodeIm2colWide, cuTensorMapReplaceAddress

CUresult cuTensorMapReplaceAddress ( CUtensorMap* tensorMap, void* globalAddress )
Modify an existing tensor map descriptor with an updated global address.
Parameters
tensorMap
- Tensor map object to modify
globalAddress
- Starting address of memory region described by tensor, must follow previous alignment requirements
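Description

Modifies the descriptor for the Tensor Memory Access (TMA) object passed in tensorMap with an updated globalAddress.

Tensor map objects are only supported on devices of compute capability 9.0 or higher. Additionally, a tensor map object is an opaque value and, as such, should only be accessed through CUDA API calls.

See also:

cuTensorMapEncodeTiled, cuTensorMapEncodeIm2col, cuTensorMapEncodeIm2colWide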