使用 CuPy 的快速傅里叶变换#

CuPy 涵盖了 NumPy 中提供的全套快速傅里叶变换（FFT）功能（cupy.fft）以及 SciPy 中的一部分功能（cupyx.scipy.fft）。除了可以直接使用的高级 API 外，CuPy 还提供了额外的功能来

访问 cuFFT 为 NVIDIA GPU 提供的先进例程，
更好地控制 FFT 例程的性能和行为。

这些功能中的一些是 实验性的 （可能会更改、弃用或移除，请参阅 API 兼容性政策），或者在针对 AMD GPU 的 hipFFT/rocFFT 中可能不存在。

SciPy FFT 后端#

自 SciPy v1.4 起，提供了一个后端机制，以便用户可以注册不同的 FFT 后端，并使用 SciPy 的 API 通过目标后端执行实际的变换，例如 CuPy 的 cupyx.scipy.fft 模块。对于一次性使用，可以使用上下文管理器 scipy.fft.set_backend()：

import cupy as cp
import cupyx.scipy.fft as cufft
import scipy.fft

a = cp.random.random(100).astype(cp.complex64)
with scipy.fft.set_backend(cufft):
    b = scipy.fft.fft(a)  # equivalent to cufft.fft(a)

然而，这种用法可能会很繁琐。或者，用户可以通过 scipy.fft.register_backend() 或 scipy.fft.set_global_backend() 注册一个后端，以避免使用上下文管理器：

import cupy as cp
import cupyx.scipy.fft as cufft
import scipy.fft
scipy.fft.set_global_backend(cufft)

a = cp.random.random(100).astype(cp.complex64)
b = scipy.fft.fft(a)  # equivalent to cufft.fft(a)

备注

更多信息请参阅 SciPy FFT 文档。

备注

要与显式的 plan 参数一起使用后端，需要 SciPy 1.5.0 或更高版本。请参见下文了解如何创建 FFT 计划。

用户管理的FFT计划#

出于性能考虑，用户可能希望自己创建、重用和管理FFT计划。CuPy提供了一个高层次的 实验性 API get_fft_plan() 来满足这一需求。用户可以像使用大多数高层次FFT API一样指定要执行的变换，并根据输入生成计划。

import cupy as cp
from cupyx.scipy.fft import get_fft_plan

a = cp.random.random((4, 64, 64)).astype(cp.complex64)
plan = get_fft_plan(a, axes=(1, 2), value_type='C2C')  # for batched, C2C, 2D transform

返回的计划可以显式地作为参数与 cupyx.scipy.fft API 一起使用：

import cupyx.scipy.fft

# the rest of the arguments must match those used when generating the plan
out = cupyx.scipy.fft.fft2(a, axes=(1, 2), plan=plan)

或者作为 cupy.fft API 的上下文管理器：

with plan:
    # the arguments must match those used when generating the plan
    out = cp.fft.fft2(a, axes=(1, 2))

FFT 计划缓存#

然而，有时用户可能*不*希望自己管理FFT计划。此外，计划也可以在CuPy的内部例程中重复使用，而用户管理的计划则不适用。因此，从CuPy v8开始，我们提供了一个内置的计划缓存，默认情况下启用。计划缓存是基于*每个设备，每个线程*的，并且可以通过 get_plan_cache() API 获取。

>>> import cupy as cp
>>>
>>> cache = cp.fft.config.get_plan_cache()
>>> cache.show_info()
------------------- cuFFT plan cache (device 0) -------------------
cache enabled? True
current / max size   : 0 / 16 (counts)
current / max memsize: 0 / (unlimited) (bytes)
hits / misses: 0 / 0 (counts)

cached plans (most recently used first):

>>> # perform a transform, which would generate a plan and cache it
>>> a = cp.random.random((4, 64, 64))
>>> out = cp.fft.fftn(a, axes=(1, 2))
>>> cache.show_info()  # hit = 0
------------------- cuFFT plan cache (device 0) -------------------
cache enabled? True
current / max size   : 1 / 16 (counts)
current / max memsize: 262144 / (unlimited) (bytes)
hits / misses: 0 / 1 (counts)

cached plans (most recently used first):
key: ((64, 64), (64, 64), 1, 4096, (64, 64), 1, 4096, 105, 4, 'C', 2, None), plan type: PlanNd, memory usage: 262144

>>> # perform the same transform again, the plan is looked up from cache and reused
>>> out = cp.fft.fftn(a, axes=(1, 2))
>>> cache.show_info()  # hit = 1
------------------- cuFFT plan cache (device 0) -------------------
cache enabled? True
current / max size   : 1 / 16 (counts)
current / max memsize: 262144 / (unlimited) (bytes)
hits / misses: 1 / 1 (counts)

cached plans (most recently used first):
key: ((64, 64), (64, 64), 1, 4096, (64, 64), 1, 4096, 105, 4, 'C', 2, None), plan type: PlanNd, memory usage: 262144

>>> # clear the cache
>>> cache.clear()
>>> cp.fft.config.show_plan_cache_info()  # = cache.show_info(), for all devices
=============== cuFFT plan cache info (all devices) ===============
------------------- cuFFT plan cache (device 0) -------------------
cache enabled? True
current / max size   : 0 / 16 (counts)
current / max memsize: 0 / (unlimited) (bytes)
hits / misses: 0 / 0 (counts)

cached plans (most recently used first):

返回的 PlanCache 对象有其他方法用于更精细的控制，例如设置缓存大小（可以通过计数或内存使用量）。如果大小设置为 0，则缓存被禁用。请参阅其文档以获取更多详细信息。

备注

如上所示，每个FFT计划都有一个分配的关联工作区。如果发生内存不足错误，可能需要检查、清除或限制计划缓存。

备注

由 get_fft_plan() 返回的计划不会被缓存。

FFT 回调#

cuFFT 提供了 FFT 回调，用于将预处理和/或后处理内核与 FFT 例程合并，从而减少对全局内存的访问。此功能由 CuPy 以 实验性 支持。用户需要提供自定义的加载和/或存储内核作为字符串，并通过 set_cufft_callbacks() 设置上下文管理器。请注意，加载（存储）内核指针必须命名为 d_loadCallbackPtr``（``d_storeCallbackPtr）。

import cupy as cp

# a load callback that overwrites the input array to 1
code = r'''
__device__ cufftComplex CB_ConvertInputC(
    void *dataIn,
    size_t offset,
    void *callerInfo,
    void *sharedPtr)
{
    cufftComplex x;
    x.x = 1.;
    x.y = 0.;
    return x;
}
__device__ cufftCallbackLoadC d_loadCallbackPtr = CB_ConvertInputC;
'''

a = cp.random.random((64, 128, 128)).astype(cp.complex64)

# this fftn call uses callback
with cp.fft.config.set_cufft_callbacks(cb_load=code):
    b = cp.fft.fftn(a, axes=(1,2))

# this does not use
c = cp.fft.fftn(cp.ones(shape=a.shape, dtype=cp.complex64), axes=(1,2))

# result agrees
assert cp.allclose(b, c)

# "static" plans are also cached, but are distinct from their no-callback counterparts
cp.fft.config.get_plan_cache().show_info()

备注

在内部，此功能需要为 每一对 加载和存储内核重新编译一个 Python 模块。因此，第一次调用将会非常慢，但如果回调可以在后续计算中重用，则此成本会被分摊。编译的模块会被缓存到磁盘上，默认位置为 $HOME/.cupy/callback_cache，可以通过环境变量 CUPY_CACHE_DIR 更改。

多GPU快速傅里叶变换#

CuPy 目前为多GPU FFT提供了两种*实验性*支持。

警告

使用多个GPU来执行FFT并不保证会更高效。经验法则是，如果变换适合1个GPU，你应该避免使用多个。

第一种支持是通过高级 fft() 和 ifft() API 实现的，这要求输入数组位于参与计算的 GPU 之一上。多 GPU 计算在幕后进行，计算结束时，结果再次位于其开始的设备上。目前仅支持 1D 复数到复数 (C2C) 变换；复数到实数 (C2R) 或实数到复数 (R2C) 变换（如 rfft() 及其相关函数）不支持。变换可以是批量的（批量大小 > 1）或非批量的（批量大小 = 1）。

import cupy as cp

cp.fft.config.use_multi_gpus = True
cp.fft.config.set_cufft_gpus([0, 1])  # use GPU 0 & 1

shape = (64, 64)  # batch size = 64
dtype = cp.complex64
a = cp.random.random(shape).astype(dtype)  # reside on GPU 0

b = cp.fft.fft(a)  # computed on GPU 0 & 1, reside on GPU 0

如果你需要执行2D/3D变换（例如：fftn()）而不是1D（例如：fft()），它可能仍然有效，但在这种特定使用情况下，它在底层对变换的轴进行循环（这正是NumPy中也做的），这可能导致性能不佳。

第二种用法是使用低级别的、*私有*的 CuPy API。你需要构造一个 Plan1d 对象，并像在 C/C++ 中使用 cuFFT 编程一样使用它。使用这种方法，你的输入数组可以作为 numpy.ndarray 驻留在主机上，这样它的尺寸可以比单个 GPU 所能容纳的大得多，这是运行多 GPU FFT 的主要原因之一。

import numpy as np
import cupy as cp

# no need to touch cp.fft.config, as we are using low-level API

shape = (64, 64)
dtype = np.complex64
a = np.random.random(shape).astype(dtype)  # reside on CPU

if len(shape) == 1:
    batch = 1
    nx = shape[0]
elif len(shape) == 2:
    batch = shape[0]
    nx = shape[1]

# compute via cuFFT
cufft_type = cp.cuda.cufft.CUFFT_C2C  # single-precision c2c
plan = cp.cuda.cufft.Plan1d(nx, cufft_type, batch, devices=[0,1])
out_cp = np.empty_like(a)  # output on CPU
plan.fft(a, out_cp, cufft.CUFFT_FORWARD)

out_np = numpy.fft.fft(a)  # use NumPy's fft
# np.fft.fft always returns np.complex128
if dtype is numpy.complex64:
    out_np = out_np.astype(dtype)

# check result
assert np.allclose(out_cp, out_np, rtol=1e-4, atol=1e-7)

对于这种情况，请参阅 cuFFT 文档中关于多GPU变换的详细信息。

备注

如果通过高级API自动生成，多GPU计划会被缓存，但如果通过低级API手动生成则不会。

半精度快速傅里叶变换#

cuFFT 提供了 cufftXtMakePlanMany 和 cufftXtExec 例程来支持广泛的 FFT 需求，包括 64 位索引和半精度 FFT。CuPy 通过新的（尽管是私有的） XtPlanNd API 提供了对此功能的 实验性 支持。对于半精度 FFT，在支持的硬件上，它的速度可以是单精度 FFT 的两倍。然而，NumPy 尚未提供半精度复数（即 numpy.complex32）所需的必要基础设施，因此，此功能的步骤目前比常见情况稍微复杂一些。

import cupy as cp
import numpy as np


shape = (1024, 256, 256)  # input array shape
idtype = odtype = edtype = 'E'  # = numpy.complex32 in the future

# store the input/output arrays as fp16 arrays twice as long, as complex32 is not yet available
a = cp.random.random((shape[0], shape[1], 2*shape[2])).astype(cp.float16)
out = cp.empty_like(a)

# FFT with cuFFT
plan = cp.cuda.cufft.XtPlanNd(shape[1:],
                              shape[1:], 1, shape[1]*shape[2], idtype,
                              shape[1:], 1, shape[1]*shape[2], odtype,
                              shape[0], edtype,
                              order='C', last_axis=-1, last_size=None)

plan.fft(a, out, cp.cuda.cufft.CUFFT_FORWARD)

# FFT with NumPy
a_np = cp.asnumpy(a).astype(np.float32)  # upcast
a_np = a_np.view(np.complex64)
out_np = np.fft.fftn(a_np, axes=(-2,-1))
out_np = np.ascontiguousarray(out_np).astype(np.complex64)  # downcast
out_np = out_np.view(np.float32)
out_np = out_np.astype(np.float16)

# don't worry about accuracy for now, as we probably lost a lot during casting
print('ok' if cp.mean(cp.abs(out - cp.asarray(out_np))) < 0.1 else 'not ok')

所有高级FFT API的64位索引支持计划在未来的CuPy版本中实现。