您的下一代自定义FFT内核¶

在实际应用场景中，我们通常需要不止一个内核。为了在多种架构上实现最佳性能，单个用例可能需要多种不同的实现方案。cuFFTDx的设计初衷就是自动处理这种复杂性，同时让用户能完全掌控实现细节。

使用最优参数¶

cuFFTDx允许用户将实现中的某些细节定义（例如每个线程计算的FFT元素数量或每个块的FFT数量）交由库来决定。ElementsPerThread的默认值是建议值，它基于一个启发式算法，该算法取决于FFT本身（大小、精度、GPU架构等）。然而，对于选定的ElementsPerThread（无论是默认值还是手动设置的值），可以通过FFT::suggested_ffts_per_block静态字段获取建议的FFTsPerBlock。FFTsPerBlock的默认值为1。

#include <cufftdx.hpp>
using namespace cufftdx;

template<class FFT>
__global__ void block_fft_kernel(FFT::value_type* data, typename FFT::workspace_type workspace) {
  using complex_type = typename FFT::value_type;

  // Registers
  complex_type thread_data[FFT::storage_size];

  // Local batch id of this FFT in CUDA block, in range [0; FFT::ffts_per_block)
  const unsigned int local_fft_id = threadIdx.y;
  // Global batch id of this FFT in CUDA grid is equal to number of batches per CUDA block (ffts_per_block)
  // times CUDA block id, plus local batch id.
  const unsigned int global_fft_id = (blockIdx.x * FFT::ffts_per_block) + local_fft_id;

  // Load data from global memory to registers
  const unsigned int offset = cufftdx::size_of<FFT>::value * global_fft_id;
  const unsigned int stride = FFT::stride;
  unsigned int       index  = offset + threadIdx.x;
  for (unsigned int i = 0; i < FFT::elements_per_thread; i++) {
      // Make sure not to go out-of-bounds
      if ((i * stride + threadIdx.x) < cufftdx::size_of<FFT>::value) {
          thread_data[i] = data[index];
          index += stride;
      }
  }

  // FFT::shared_memory_size bytes of shared memory
  extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];

  // Execute FFT
  FFT().execute(thread_data, shared_mem, workspace);

  // Save results
  index = offset + threadIdx.x;
  for (unsigned int i = 0; i < FFT::elements_per_thread; i++) {
      if ((i * stride + threadIdx.x) < cufftdx::size_of<FFT>::value) {
          data[index] = thread_data[i];
          index += stride;
      }
  }
}

void introduction_example(cudaStream_t stream = 0) {
  // Base of the FFT description
  using FFT_base = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                            + Direction<fft_direction::forward>()
                            /* Notice lack of ElementsPerThread and FFTsPerBlock operators */
                            + SM<700>() + Block());
  // FFT description with suggested FFTs per CUDA block for the default (optimal) elements per thread
  using FFT = decltype(FFT_base() + FFTsPerBlock<FFT_base::suggested_ffts_per_block>());

  // Allocate managed memory for input/output
  complex_type* data;
  auto          size       = FFT::ffts_per_block * cufftdx::size_of<FFT>::value;
  auto          size_bytes = size * sizeof(complex_type);
  cudaMallocManaged(&data, size_bytes);
  // Generate data
  for (size_t i = 0; i < size; i++) {
      data[i] = complex_type {float(i), -float(i)};
  }

  cudaError_t error_code = cudaSuccess;


  auto workspace = make_workspace<FFT>(error_code, stream);

  // Invokes kernel with FFT::block_dim threads in CUDA block
  block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size, stream>>>(data, workspace);
  cudaDeviceSynchronize();

  cudaFree(data);
}

要获取最优的FFTsPerBlock值，我们需要一个完整的执行描述符（如cufftdx::is_complete_fft_execution所示）。这是因为某些细节只有在FFT操作被完整描述、且目标架构和执行模式选定后才会确定。SM Operator在主机端编译时，允许用户查询特定架构的启动参数。

额外共享内存¶

如共享内存使用章节所述，某些FFT运算可能要求每个CUDA块分配比默认值更多的共享内存。这种情况下，我们需要通过cudaFuncSetAttribute()函数主动申请更大的共享内存分配，将带有FFT的内核的cudaFuncAttributeMaxDynamicSharedMemorySize属性设置为FFT::shared_memory_size。

#include <cufftdx.hpp>
using namespace cufftdx;

template<class FFT>
__global__ void block_fft_kernel(FFT::value_type* data, typename FFT::workspace_type workspace) {
  using complex_type = typename FFT::value_type;

  (...)

  // FFT::shared_memory_size bytes of shared memory
  extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];

  (...)
}

void introduction_example() {
  // Base of the FFT description
  using FFT_base = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                            + Direction<fft_direction::forward>()
                            /* Notice lack of ElementsPerThread and FFTsPerBlock operators */
                            + SM<700>() + Block());
  // FFT description with suggested FFTs per CUDA block for the default (optimal) elements per thread
  using FFT = decltype(FFT_base() + FFTsPerBlock<FFT_base::suggested_ffts_per_block>());

  (...)

  // Increases the max dynamic shared memory size to match FFT requirements
  cudaFuncSetAttribute(block_fft_kernel<FFT>,
      cudaFuncAttributeMaxDynamicSharedMemorySize,
      FFT::shared_memory_size);

  // Invokes kernel with FFT::block_dim threads in CUDA block
  block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size>>>(data, workspace);

  (...)
}

幕后原理¶

Expression templates: cuFFTDx API采用了一种称为表达式模板的C++技术变体。我们利用表达式模板让用户能够构建描述待计算FFT的编译时对象。通过编译时C++机制，cuFFTDx可以将优化的FFT例程附加到该对象上，并将其作为用户可调用的计算方法公开。
Header only: cuFFTDx FFT例程以优化的内联PTX形式提供。

为什么？¶

要使一个库真正实用，它需要以面向未来的方式抽象功能。所谓面向未来，是指现有用户代码在未来不应需要修改，而新功能应作为现有代码的简单扩展来实现。在CUDA平台上，这要求适应快速演进的GPU硬件。

cuFFTDx通过两种方式实现面向未来的设计。一方面，该API作为源代码级抽象层，使库与ABI变更解耦。结合头文件中的PTX代码，cuFFTDx可向前兼容任何支持其目标硬件的CUDA工具包、驱动程序和编译器。CUDA编译器可重新编译PTX代码以在未来GPU架构上运行。

另一方面，API的组织方式可以保留描述计算内容和计算方式的运算符。如果代码将实现选择委托给库，那么依赖于类型的新特性可以自动获取；否则，就需要向现有表达式添加运算符。