11.2.6. CUDA

Error

Todo: This section needs to be converted from FAQ style to regular documentation style.

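11.2.6.1. How do I build Open MPI with CUDA-aware support?

CUDA-aware support means that the MPI library can send and receive GPU buffers directly. CUDA support is updated continuously, so different releases offer different levels of support. We recommend using the latest version of Open MPI for the best support.

Open MPI offers two types of CUDA support:

Via UCX.

This is the preferred mechanism. Since UCX will be providing the CUDA support, it is important to ensure that UCX itself is built with CUDA support.

To check whether your UCX was built with CUDA support, run the following command:

# Check if UCX is built with CUDA support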
shell$ ucx_info -v

# configured with: --build=powerpc64le-redhat-linux-gnu --host=powerpc64le-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --without-java

If you need to build UCX yourself to include CUDA support, see the UCX documentation on building UCX with Open MPI.

The configuration should look something like this:

# Configure UCX this way
shell$ ./configure --prefix=/path/to/ucx-cuda-install --with-cuda=/usr/local/cuda --with-gdrcopy=/usr

# Configure Open MPI this way
shell$ ./configure --with-cuda=/usr/local/cuda --with-ucx=/path/to/ucx-cuda-install <other configure params>

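Via internal Open MPI CUDA support

Regardless of which type of CUDA support you plan to use (or both), configure Open MPI with the --with-cuda= option to build its CUDA support. The configure script automatically searches the given path for libcuda.so. If the file cannot be found, also pass --with-cuda-libdir. For example: --with-cuda= --with-cuda-libdir=/usr/local/cuda/lib64/stubs.

Open MPI supports building with the CUDA libraries and then running on systems that have no CUDA libraries or hardware.

In v5.0.2 and later, no special steps are needed to get this behavior.

To get this behavior in v5.0.0 and v5.0.1, you must configure Open MPI with the --enable-mca-dso= option so that the CUDA-related components are built as dynamic shared objects (DSOs).

This affects the smcuda shared memory and uct BTLs, as well as the rgpusm and gpusm rcache components.

An example configure command looks like this: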

# Configure Open MPI this way
shell$ ./configure --with-cuda=/usr/local/cuda \
       --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda <other configure params>

11.2.6.2. How do I verify that Open MPI was built with CUDA support?

Use ompi_info with either of the following commands to verify that Open MPI was built with CUDA support.

# Use ompi_info to verify cuda support in Open MPI
shell$ ompi_info | grep "MPI extensions"
       MPI extensions: affinity, cuda, pcollreq
shell$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
       mca:mpi:base:param:mpi_built_with_cuda_support:value:true

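11.2.6.3. How do I run Open MPI with applications that pass CUDA buffers to MPI?

Open MPI automatically detects and enables CUDA-aware components at run time; no additional mpirun parameters are needed.

11.2.6.4. How do I build Open MPI with CUDA-aware support using PGI?

With CUDA 6.5, you can build all versions of CUDA-aware Open MPI without doing anything special. With CUDA 7.0 and CUDA 7.5, however, you need to pass specific compiler flags for things to work correctly. Add the following to your configure line.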

# For PGI 15.9 and later (Also called NVCC):
shell$ ./configure --with-wrapper-cflags=-ta:tesla

# For earlier versions of PGI:
shell$ ./configure CFLAGS=-D__LP64__ --with-wrapper-cflags="-D__LP64__ -ta:tesla"

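11.2.6.5. What kinds of CUDA support exist in Open MPI?

CUDA-aware support is defined as Open MPI automatically detecting whether the argument pointer passed to an MPI routine is a CUDA device memory pointer.

For more details on which APIs are CUDA-aware, see the FAQ entry "Which MPI APIs work with CUDA-aware?" (11.2.6.11).

Error

CUDA 4.0 is very old! End users do not care about the differences between cuda-aware, cuda-ipc, gpu-direct, and gpu-direct-rdma.

Open MPI depends on various features of CUDA 4.0, so at a minimum the CUDA 4.0 driver and toolkit must be installed. The key new feature is Unified Virtual Addressing (UVA), which gives every pointer in a program a unique address. In addition, a new API makes it possible to determine whether a pointer is a CUDA device pointer or a host memory pointer. The library uses this API to decide how each buffer is handled. CUDA 4.1 also provides the ability to register host memory with the CUDA driver, which can improve performance, and adds CUDA IPC support for fast communication between GPUs on the same node.

Note that derived datatypes, both contiguous and non-contiguous, are supported. However, non-contiguous datatypes currently have high overhead because of the many calls to the CUDA function cuMemcpy() that are needed to copy all the pieces of the buffer into an intermediate buffer.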

CUDA-aware support is available in:

The UCX (ucx) PML
The PSM2 (psm2) MTL with the CM (cm) PML.
The OFI (ofi) MTL with the CM (cm) PML.
Both the CUDA-optimized shared memory (smcuda) and TCP (tcp) BTLs with the OB1 (ob1) PML.
The HCOLL (hcoll) collective component

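11.2.6.6. PSM2 support for CUDA

CUDA-aware support is present in the PSM2 MTL. When running CUDA-aware Open MPI on Cornelis Networks Omni-Path, the PSM2 MTL automatically sets the PSM2_CUDA environment variable, which enables PSM2 to handle GPU buffers. If you want to use host buffers with a CUDA-aware Open MPI, it is recommended to set PSM2_CUDA to 0 in the execution environment. PSM2 also supports the NVIDIA GPUDirect feature. To enable it, set PSM2_GPUDIRECT to 1 in the execution environment.

Note: The PSM2 library and the hfi1 driver with CUDA support are required to use GPUDirect support on Cornelis Networks Omni-Path. The minimum required PSM2 build version is PSM2 10.2.175.

For more information, refer to the Cornelis Networks Customer Center.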

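11.2.6.7. OFI support for CUDA

CUDA-aware support is present in the OFI MTL. When running CUDA-aware Open MPI over Libfabric, the OFI MTL checks whether any provider is capable of handling GPU (or other accelerator) memory through the hmem-related flags. If a CUDA-capable provider is available, the OFI MTL sends GPU buffers directly through Libfabric's APIs after registering the memory. If no CUDA-capable provider is available, the buffers are automatically copied to host buffers before being transferred through Libfabric's APIs.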

11.2.6.8. Can I get additional CUDA debug-level information at run time?

Yes, by enabling some verbose flags.

The opal_cuda_verbose parameter has only one level of verbosity:
```
shell$ mpirun --mca opal_cuda_verbose 10 ...
```

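The mpi_common_cuda_verbose parameter provides additional information about CUDA-aware related activities. It can be set to several different values. There is usually no need to use these options unless you run into unusual problems:

# A bunch of CUDA debug information
shell$ mpirun --mca mpi_common_cuda_verbose 10 ...
# Even more CUDA debug information
shell$ mpirun --mca mpi_common_cuda_verbose 20 ...
# Yet more CUDA debug information
shell$ mpirun --mca mpi_common_cuda_verbose 100 ...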

The smcuda BTL has three MCA parameters related to the use of CUDA IPC. By default, CUDA IPC is used where possible, but you can turn it off if you prefer.
```
shell$ mpirun --mca btl_smcuda_use_cuda_ipc 0 ...
```
In addition, it is assumed that CUDA IPC can be used when running on the same GPU, which is typically true. You can also turn this off.
```
shell$ mpirun --mca btl_smcuda_use_cuda_ipc_same_gpu 0 ...
```
Finally, to find out whether CUDA IPC is being used, you can turn on some verbosity that shows whether CUDA IPC gets enabled between two GPUs.
```
shell$ mpirun --mca btl_smcuda_cuda_ipc_verbose 100 ...
```

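11.2.6.9. NUMA node issues

When running on a node with multiple GPUs, you may want to select the GPU that is closest to the NUMA node on which your process is running. One way to do this is with the hwloc library. The following C code snippet can be used in your application to select a nearby GPU. It determines which CPU the process is running on and then looks for the closest GPU. There may be more than one GPU at the same distance. This approach depends on hwloc being installed on your system.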

/**
 * Test program to show the use of hwloc to select the GPU closest to the CPU
 * that the MPI program is running on.  Note that this works even without
 * any libpciaccess or libpci support as it keys off the NVIDIA vendor ID.
 * There may be other ways to implement this but this is one way.
 * January 10, 2014
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>   /* abort() */
#include <string.h>   /* strcpy() */
#include "cuda.h"
#include "mpi.h"
#include "hwloc.h"

#define ABORT_ON_ERROR(func) \
  { CUresult res; \
    res = func; \
    if (CUDA_SUCCESS != res) { \
        printf("%s returned error=%d\n", #func, res); \
        abort(); \
    } \
  }
static hwloc_topology_t topology = NULL;
static int gpuIndex = 0;
static hwloc_obj_t gpus[16] = {0};

/**
 * This function searches for all the GPUs that are hanging off a NUMA
 * node.  It walks through each of the PCI devices and looks for ones
 * with the NVIDIA vendor ID.  It then stores them into an array.
 * Note that there can be more than one GPU on the NUMA node.
 */
static void find_gpus(hwloc_topology_t topology, hwloc_obj_t parent, hwloc_obj_t child) {
    hwloc_obj_t pcidev;
    pcidev = hwloc_get_next_child(topology, parent, child);
    if (NULL == pcidev) {
        return;
    } else if (0 != pcidev->arity) {
        /* This device has children so need to look recursively at them */
        find_gpus(topology, pcidev, NULL);
        find_gpus(topology, parent, pcidev);
    } else {
        if (pcidev->attr->pcidev.vendor_id == 0x10de) {
            gpus[gpuIndex++] = pcidev;
        }
        find_gpus(topology, parent, pcidev);
    }
}

int main(int argc, char *argv[])
{
    int rank, retval, length;
    char procname[MPI_MAX_PROCESSOR_NAME+1];
    const unsigned long flags = HWLOC_TOPOLOGY_FLAG_IO_DEVICES | HWLOC_TOPOLOGY_FLAG_IO_BRIDGES;
    hwloc_cpuset_t newset;
    hwloc_obj_t node, bridge;
    char pciBusId[16];
    CUdevice dev;
    char devName[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (MPI_SUCCESS != MPI_Get_processor_name(procname, &length)) {
        strcpy(procname, "unknown");
    }

    /* Now decide which GPU to pick.  This requires hwloc to work properly.
     * We first see which CPU we are bound to, then try and find a GPU nearby.
     */
    retval = hwloc_topology_init(&topology);
    assert(retval == 0);
    retval = hwloc_topology_set_flags(topology, flags);
    assert(retval == 0);
    retval = hwloc_topology_load(topology);
    assert(retval == 0);
    newset = hwloc_bitmap_alloc();
    retval = hwloc_get_last_cpu_location(topology, newset, 0);
    assert(retval == 0);

    /* Get the object that contains the cpuset */
    node = hwloc_get_first_largest_obj_inside_cpuset(topology, newset);

    /* Climb up from that object until we find the HWLOC_OBJ_NODE */
    while (node->type != HWLOC_OBJ_NODE) {
        node = node->parent;
    }

    /* Now look for the HWLOC_OBJ_BRIDGE.  All PCI busses hanging off the
     * node will have one of these */
    bridge = hwloc_get_next_child(topology, node, NULL);
    while (bridge->type != HWLOC_OBJ_BRIDGE) {
        bridge = hwloc_get_next_child(topology, node, bridge);
    }

    /* Now find all the GPUs on this NUMA node and put them into an array */
    find_gpus(topology, bridge, NULL);

    ABORT_ON_ERROR(cuInit(0));
    /* Now select the first GPU that we find */
    if (gpus[0] == 0) {
        printf("No GPU found\n");
    } else {
        sprintf(pciBusId, "%.2x:%.2x:%.2x.%x", gpus[0]->attr->pcidev.domain, gpus[0]->attr->pcidev.bus,
        gpus[0]->attr->pcidev.dev, gpus[0]->attr->pcidev.func);
        ABORT_ON_ERROR(cuDeviceGetByPCIBusId(&dev, pciBusId));
        ABORT_ON_ERROR(cuDeviceGetName(devName, 256, dev));
        printf("rank=%d (%s): Selected GPU=%s, name=%s\n", rank, procname, pciBusId, devName);
    }

    MPI_Finalize();
    return 0;
}

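11.2.6.10. How do I develop CUDA-aware Open MPI applications?

Developing CUDA-aware applications is a complex topic and is beyond the scope of this document. CUDA-aware applications often need to take machine-specific considerations into account, including the number of GPUs installed on each node and how the GPUs are connected to the CPUs and to each other. Often, when using a particular transport layer (such as OPA/PSM2), run-time decisions must be made about which CPU cores will be used with which GPUs.

A good place to start is the NVIDIA CUDA Toolkit documentation, including the Programming Guide and the Best Practices Guide. The NVIDIA developer blog offers examples of how to write CUDA-aware MPI applications, and the OSU Micro-Benchmarks are an excellent example of such applications.

11.2.6.11. Which MPI APIs work with CUDA-aware?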

MPI_Allgather
MPI_Allgatherv
MPI_Allreduce
MPI_Alltoall
MPI_Alltoallv
MPI_Alltoallw
MPI_Bcast
MPI_Bsend
MPI_Bsend_init
MPI_Exscan
MPI_Ibsend
MPI_Irecv
MPI_Isend
MPI_Irsend
MPI_Issend
MPI_Gather
MPI_Gatherv
MPI_Get
MPI_Put
MPI_Rsend
MPI_Rsend_init
MPI_Recv
MPI_Recv_init
MPI_Reduce
MPI_Reduce_scatter
MPI_Reduce_scatter_block
MPI_Scan
MPI_Scatter
MPI_Scatterv
MPI_Send
MPI_Send_init
MPI_Sendrecv
MPI_Ssend
MPI_Ssend_init
MPI_Win_create

11.2.6.12. Which MPI APIs do NOT work with CUDA-aware?

MPI_Accumulate
MPI_Compare_and_swap
MPI_Fetch_and_op
MPI_Get_accumulate
MPI_Iallgather
MPI_Iallgatherv
MPI_Iallreduce
MPI_Ialltoall
MPI_Ialltoallv
MPI_Ialltoallw
MPI_Ibcast
MPI_Iexscan
MPI_Rget
MPI_Rput

11.2.6.13. How do I use CUDA-aware UCX for Open MPI?

Example of running osu_latency from the OSU benchmarks with CUDA buffers, using Open MPI and UCX CUDA support:

shell$ mpirun -n 2 --mca pml ucx \
    -x UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc ./osu_latency D D

11.2.6.14. Which MPI APIs work with CUDA-aware UCX?

MPI_Send
MPI_Bsend
MPI_Ssend
MPI_Rsend
MPI_Isend
MPI_Ibsend
MPI_Issend
MPI_Irsend
MPI_Send_init
MPI_Bsend_init
MPI_Ssend_init
MPI_Rsend_init
MPI_Recv
MPI_Irecv
MPI_Recv_init
MPI_Sendrecv
MPI_Bcast
MPI_Gather
MPI_Gatherv
MPI_Allgather
MPI_Reduce
MPI_Reduce_scatter
MPI_Reduce_scatter_block
MPI_Allreduce
MPI_Scan
MPI_Exscan
MPI_Allgatherv
MPI_Alltoall
MPI_Alltoallv
MPI_Alltoallw
MPI_Scatter
MPI_Scatterv
MPI_Iallgather
MPI_Iallgatherv
MPI_Ialltoall
MPI_Ialltoallv
MPI_Ialltoallw
MPI_Ibcast
MPI_Iexscan

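11.2.6.15. Which MPI APIs do NOT work with CUDA-aware UCX?

All one-sided operations, such as MPI_Put, MPI_Get, MPI_Accumulate, MPI_Rget, MPI_Rput, MPI_Get_accumulate, MPI_Fetch_and_op, MPI_Compare_and_swap, etc.
All window creation calls, such as MPI_Win_create
All non-blocking reduction collectives, such as MPI_Ireduce, MPI_Iallreduce, etc.

11.2.6.16. Can I tell at compile time or run time whether I have CUDA-aware support?

Both a compile-time check and a run-time check are available. Use whichever is more convenient for your program. To access them, you need to include mpi-ext.h. Note that mpi-ext.h is specific to Open MPI. The following program shows an example of using the CUDA-aware macro and the run-time check.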

/*
 * Program that shows the use of CUDA-aware macro and runtime check.
 */
#include <stdio.h>
#include "mpi.h"

#if !defined(OPEN_MPI) || !OPEN_MPI
#error This source code uses an Open MPI-specific extension
#endif

/* Needed for MPIX_Query_cuda_support(), below */
#include "mpi-ext.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    printf("Compile time check:\n");
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("This MPI library has CUDA-aware support.\n");
#elif defined(MPIX_CUDA_AWARE_SUPPORT) && !MPIX_CUDA_AWARE_SUPPORT
    printf("This MPI library does not have CUDA-aware support.\n");
#else
    printf("This MPI library cannot determine if there is CUDA-aware support.\n");
#endif /* MPIX_CUDA_AWARE_SUPPORT */

    printf("Run time check:\n");
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    if (1 == MPIX_Query_cuda_support()) {
        printf("This MPI library has CUDA-aware support.\n");
    } else {
        printf("This MPI library does not have CUDA-aware support.\n");
    }
#else /* !defined(MPIX_CUDA_AWARE_SUPPORT) */
    printf("This MPI library cannot determine if there is CUDA-aware support.\n");
#endif /* MPIX_CUDA_AWARE_SUPPORT */

    MPI_Finalize();

    return 0;
}

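11.2.6.17. How do I limit how much CUDA IPC memory is held in the registration cache?

As mentioned earlier, the Open MPI library uses CUDA IPC support where possible to move GPU data quickly between GPUs that are on the same node and share the same PCI root complex. The library holds on to registrations even after the data transfer is complete, because some CUDA IPC registration calls are expensive. To limit how much memory is registered, use the mpool_rgpusm_rcache_size_limit MCA parameter. For example, this sets the limit to 1000000 bytes: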

shell$ mpirun --mca mpool_rgpusm_rcache_size_limit 1000000 ...

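When the cache reaches this size, it removes the least recently used entries until the new registration fits.

The cache can also be emptied automatically when the limit is reached: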

shell$ mpirun --mca mpool_rgpusm_rcache_empty_cache 1 ...

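11.2.6.18. What are some guidelines for using CUDA and Open MPI with Omni-Path?

When developing CUDA-aware Open MPI applications for OPA-based fabrics, the PSM2 transport is the preferred choice, and a CUDA-aware version of PSM2 is provided with all versions of the Cornelis Networks Omni-Path OPXS software suite.

Error

Todo: Are the Intel/OPA references still correct?

The PSM2 library provides a number of settings that control how it interacts with CUDA, including PSM2_CUDA and PSM2_GPUDIRECT. These environment variables should be set before MPI_Init() is called. For example: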

shell$ mpirun -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 --mca mtl psm2 mpi_hello

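In addition, each process of the application should select a specific GPU card to use before calling MPI_Init(), using cudaChooseDevice(), cudaSetDevice(), and similar functions. The selected GPU should be within the same NUMA node as the CPU on which the MPI process is running. You will also want to use the mpirun --bind-to-core or --bind-to-socket option to ensure that MPI processes do not move between NUMA nodes. See the section on NUMA node issues for more information.

For more information, see the Cornelis Networks Performance Scaled Messaging 2 (PSM2) Programmer's Guide and the Cornelis Networks Omni-Path Performance Tuning Guide, which can be found in the Cornelis Networks Customer Center.

Error

Todo: Are the Intel/OPA references still correct?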

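11.2.6.19. When do I need to select a CUDA device?

Open MPI needs to allocate CUDA resources for internal use. These are allocated lazily, the first time they are needed; for example, CUDA IPC memory handles are created when a communication routine first needs them during a transfer. Therefore, the CUDA device must be selected before the first MPI call that needs CUDA resources. MPI_Init and most communicator-related operations do not create any CUDA resources (this is guaranteed for MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Comm_split_type, and MPI_Comm_free). It is therefore possible to use those routines to query rank information and use it to select a GPU, for example: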

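int rank = 0, local_rank = -1;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
{
    MPI_Comm local_comm;
    /* Group the ranks that share a node; using rank as the key keeps their order. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &local_comm);
    MPI_Comm_rank(local_comm, &local_rank);
    MPI_Comm_free(&local_comm);
}
int num_devices = 0;
cudaGetDeviceCount(&num_devices);
/* Map node-local ranks onto the visible devices round-robin. */
cudaSetDevice(local_rank % num_devices);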

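MPI-internal CUDA resources are released during MPI_Finalize. Consequently, it is an application error to call cudaDeviceReset before MPI_Finalize is called.

11.2.6.20. How do I enable CUDA support in the HCOLL collective component?

The HCOLL component supports CUDA GPU buffers for the following collectives:

MPI_Allreduce
MPI_Bcast
MPI_Allgather
MPI_Ibarrier
MPI_Ibcast
MPI_Iallgather
MPI_Iallreduce

To enable CUDA GPU buffer support in these collectives, pass the following environment variables via mpirun: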

shell$ mpirun -x HCOLL_GPU_ENABLE=1 -x HCOLL_ENABLE_NBC=1 ..

For more information, see the NVIDIA HCOLL documentation.