CUDA Runtime API :: CUDA Toolkit Documentation

6.1. Device Management

This section describes the device management functions of the CUDA runtime application programming interface.

Functions

__host__ cudaError_t cudaChooseDevice ( int* device, const cudaDeviceProp* prop ): Select compute-device which best matches criteria.
__host__ cudaError_t cudaDeviceFlushGPUDirectRDMAWrites ( cudaFlushGPUDirectRDMAWritesTarget target, cudaFlushGPUDirectRDMAWritesScope scope ): Blocks until remote writes are visible to the specified scope.
__host__ __device__ cudaError_t cudaDeviceGetAttribute ( int* value, cudaDeviceAttr attr, int device ): Returns information about the device.
__host__ cudaError_t cudaDeviceGetByPCIBusId ( int* device, const char* pciBusId ): Returns a handle to a compute device.
__host__ __device__ cudaError_t cudaDeviceGetCacheConfig ( cudaFuncCache* pCacheConfig ): Returns the preferred cache configuration for the current device.
__host__ cudaError_t cudaDeviceGetDefaultMemPool ( cudaMemPool_t* memPool, int device ): Returns the default mempool of a device.
__host__ __device__ cudaError_t cudaDeviceGetLimit ( size_t* pValue, cudaLimit limit ): Return resource limits.
__host__ cudaError_t cudaDeviceGetMemPool ( cudaMemPool_t* memPool, int device ): Gets the current mempool for a device.
__host__ cudaError_t cudaDeviceGetNvSciSyncAttributes ( void* nvSciSyncAttrList, int device, int flags ): Return NvSciSync attributes that this device can support.
__host__ cudaError_t cudaDeviceGetP2PAttribute ( int* value, cudaDeviceP2PAttr attr, int srcDevice, int dstDevice ): Queries attributes of the link between two devices.
__host__ cudaError_t cudaDeviceGetPCIBusId ( char* pciBusId, int len, int device ): Returns a PCI Bus Id string for the device.
__host__ cudaError_t cudaDeviceGetStreamPriorityRange ( int* leastPriority, int* greatestPriority ): Returns numerical values that correspond to the least and greatest stream priorities.
__host__ cudaError_t cudaDeviceGetTexture1DLinearMaxWidth ( size_t* maxWidthInElements, const cudaChannelFormatDesc* fmtDesc, int device ): Returns the maximum number of elements allocatable in a 1D linear texture for a given element size.
__host__ cudaError_t cudaDeviceRegisterAsyncNotification ( int device, cudaAsyncCallback callbackFunc, void* userData, cudaAsyncCallbackHandle_t* callback ): Registers a callback function to receive async notifications.
__host__ cudaError_t cudaDeviceReset ( void ): Destroy all allocations and reset all state on the current device in the current process.
__host__ cudaError_t cudaDeviceSetCacheConfig ( cudaFuncCache cacheConfig ): Sets the preferred cache configuration for the current device.
__host__ cudaError_t cudaDeviceSetLimit ( cudaLimit limit, size_t value ): Set resource limits.
__host__ cudaError_t cudaDeviceSetMemPool ( int device, cudaMemPool_t memPool ): Sets the current memory pool of a device.
__host__ __device__ cudaError_t cudaDeviceSynchronize ( void ): Wait for compute device to finish.
__host__ cudaError_t cudaDeviceUnregisterAsyncNotification ( int device, cudaAsyncCallbackHandle_t callback ): Unregisters an async notification callback.
__host__ __device__ cudaError_t cudaGetDevice ( int* device ): Returns which device is currently being used.
__host__ __device__ cudaError_t cudaGetDeviceCount ( int* count ): Returns the number of compute-capable devices.
__host__ cudaError_t cudaGetDeviceFlags ( unsigned int* flags ): Gets the flags for the current device.
__host__ cudaError_t cudaGetDeviceProperties ( cudaDeviceProp* prop, int device ): Returns information about the compute-device.
__host__ cudaError_t cudaInitDevice ( int device, unsigned int deviceFlags, unsigned int flags ): Initialize device to be used for GPU executions.
__host__ cudaError_t cudaIpcCloseMemHandle ( void* devPtr ): Attempts to close memory mapped with cudaIpcOpenMemHandle.
__host__ cudaError_t cudaIpcGetEventHandle ( cudaIpcEventHandle_t* handle, cudaEvent_t event ): Gets an interprocess handle for a previously allocated event.
__host__ cudaError_t cudaIpcGetMemHandle ( cudaIpcMemHandle_t* handle, void* devPtr ): Gets an interprocess memory handle for an existing device memory allocation.
__host__ cudaError_t cudaIpcOpenEventHandle ( cudaEvent_t* event, cudaIpcEventHandle_t handle ): Opens an interprocess event handle for use in the current process.
__host__ cudaError_t cudaIpcOpenMemHandle ( void** devPtr, cudaIpcMemHandle_t handle, unsigned int flags ): Opens an interprocess memory handle exported from another process and returns a device pointer usable in the local process.
__host__ cudaError_t cudaSetDevice ( int device ): Set device to be used for GPU executions.
__host__ cudaError_t cudaSetDeviceFlags ( unsigned int flags ): Sets flags to be used for device executions.
__host__ cudaError_t cudaSetValidDevices ( int* device_arr, int len ): Set a list of devices that can be used for CUDA.

Functions

__host__ cudaError_t cudaChooseDevice ( int* device, const cudaDeviceProp* prop )

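Select compute-device which best matches criteria.

Parameters

device: - Device with best match
prop: - Desired device properties

Returns

cudaSuccess, cudaErrorInvalidValue

Description

Returns in *device the device whose properties best match *prop.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: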

cudaGetDeviceCount, cudaGetDevice, cudaSetDevice, cudaGetDeviceProperties, cudaInitDevice

__host__ cudaError_t cudaDeviceFlushGPUDirectRDMAWrites ( cudaFlushGPUDirectRDMAWritesTarget target, cudaFlushGPUDirectRDMAWritesScope scope )

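Blocks until remote writes are visible to the specified scope.

Parameters

target: - The target of the operation, see cudaFlushGPUDirectRDMAWritesTarget
scope: - The scope of the operation, see cudaFlushGPUDirectRDMAWritesScope

Returns

cudaSuccess, cudaErrorNotSupported

Description

Blocks until remote writes to the target context via mappings created through GPUDirect RDMA APIs, such as nvidia_p2p_get_pages, are visible to the specified scope (see https://docs.nvidia.com/cuda/gpudirect-rdma for more information).

If the scope equals or lies within the scope indicated by cudaDevAttrGPUDirectRDMAWritesOrdering, the call is a no-op and can be safely omitted for performance. This can be determined by comparing the numerical values of the two enums: smaller scopes have smaller values.

Users may query support for this API via cudaDevAttrGPUDirectRDMAFlushWritesOptions.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: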

cuFlushGPUDirectRDMAWrites

__host__ __device__ cudaError_t cudaDeviceGetAttribute ( int* value, cudaDeviceAttr attr, int device )

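Returns information about the device.

Parameters

value: - Returned device attribute value
attr: - Device attribute to query
device: - Device number to query

Returns

cudaSuccess, cudaErrorInvalidDevice, cudaErrorInvalidValue

Description

Returns in *value the integer value of the attribute attr on device device. The supported attributes are: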

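cudaDevAttrMaxThreadsPerBlock: Maximum number of threads per block
cudaDevAttrMaxBlockDimX: Maximum x-dimension of a block
cudaDevAttrMaxBlockDimY: Maximum y-dimension of a block
cudaDevAttrMaxBlockDimZ: Maximum z-dimension of a block
cudaDevAttrMaxGridDimX: Maximum x-dimension of a grid
cudaDevAttrMaxGridDimY: Maximum y-dimension of a grid
cudaDevAttrMaxGridDimZ: Maximum z-dimension of a grid
cudaDevAttrMaxSharedMemoryPerBlock: Maximum amount of shared memory available to a thread block in bytes
cudaDevAttrTotalConstantMemory: Memory available on device for __constant__ variables in a CUDA C kernel in bytes
cudaDevAttrWarpSize: Warp size in threads
cudaDevAttrMaxPitch: Maximum pitch in bytes allowed by the memory copy functions that involve memory regions allocated through cudaMallocPitch()
cudaDevAttrMaxTexture1DWidth: Maximum 1D texture width
cudaDevAttrMaxTexture1DLinearWidth: Maximum width for a 1D texture bound to linear memory
cudaDevAttrMaxTexture1DMipmappedWidth: Maximum mipmapped 1D texture width
cudaDevAttrMaxTexture2DWidth: Maximum 2D texture width
cudaDevAttrMaxTexture2DHeight: Maximum 2D texture height
cudaDevAttrMaxTexture2DLinearWidth: Maximum width for a 2D texture bound to linear memory
cudaDevAttrMaxTexture2DLinearHeight: Maximum height for a 2D texture bound to linear memory
cudaDevAttrMaxTexture2DLinearPitch: Maximum pitch in bytes for a 2D texture bound to linear memory
cudaDevAttrMaxTexture2DMipmappedWidth: Maximum mipmapped 2D texture width
cudaDevAttrMaxTexture2DMipmappedHeight: Maximum mipmapped 2D texture height
cudaDevAttrMaxTexture3DWidth: Maximum 3D texture width
cudaDevAttrMaxTexture3DHeight: Maximum 3D texture height
cudaDevAttrMaxTexture3DDepth: Maximum 3D texture depth
cudaDevAttrMaxTexture3DWidthAlt: Alternate maximum 3D texture width, 0 if no alternate maximum 3D texture size is supported
cudaDevAttrMaxTexture3DHeightAlt: Alternate maximum 3D texture height, 0 if no alternate maximum 3D texture size is supported
cudaDevAttrMaxTexture3DDepthAlt: Alternate maximum 3D texture depth, 0 if no alternate maximum 3D texture size is supported
cudaDevAttrMaxTextureCubemapWidth: Maximum cubemap texture width or height
cudaDevAttrMaxTexture1DLayeredWidth: Maximum 1D layered texture width
cudaDevAttrMaxTexture1DLayeredLayers: Maximum layers in a 1D layered texture
cudaDevAttrMaxTexture2DLayeredWidth: Maximum 2D layered texture width
cudaDevAttrMaxTexture2DLayeredHeight: Maximum 2D layered texture height
cudaDevAttrMaxTexture2DLayeredLayers: Maximum layers in a 2D layered texture
cudaDevAttrMaxTextureCubemapLayeredWidth: Maximum cubemap layered texture width or height
cudaDevAttrMaxTextureCubemapLayeredLayers: Maximum layers in a cubemap layered texture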
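cudaDevAttrMaxSurface1DWidth: Maximum 1D surface width
cudaDevAttrMaxSurface2DWidth: Maximum 2D surface width
cudaDevAttrMaxSurface2DHeight: Maximum 2D surface height
cudaDevAttrMaxSurface3DWidth: Maximum 3D surface width
cudaDevAttrMaxSurface3DHeight: Maximum 3D surface height
cudaDevAttrMaxSurface3DDepth: Maximum 3D surface depth
cudaDevAttrMaxSurface1DLayeredWidth: Maximum 1D layered surface width
cudaDevAttrMaxSurface1DLayeredLayers: Maximum layers in a 1D layered surface
cudaDevAttrMaxSurface2DLayeredWidth: Maximum 2D layered surface width
cudaDevAttrMaxSurface2DLayeredHeight: Maximum 2D layered surface height
cudaDevAttrMaxSurface2DLayeredLayers: Maximum layers in a 2D layered surface
cudaDevAttrMaxSurfaceCubemapWidth: Maximum cubemap surface width
cudaDevAttrMaxSurfaceCubemapLayeredWidth: Maximum cubemap layered surface width
cudaDevAttrMaxSurfaceCubemapLayeredLayers: Maximum layers in a cubemap layered surface
cudaDevAttrMaxRegistersPerBlock: Maximum number of 32-bit registers available to a thread block
cudaDevAttrClockRate: Peak clock frequency in kilohertz
cudaDevAttrTextureAlignment: Alignment requirement; texture base addresses aligned to textureAlign bytes do not need an offset applied to texture fetches
cudaDevAttrTexturePitchAlignment: Pitch alignment requirement for 2D texture references bound to pitched memory
cudaDevAttrGpuOverlap: 1 if the device can concurrently copy memory between host and device while executing a kernel, or 0 if not
cudaDevAttrMultiProcessorCount: Number of multiprocessors on the device
cudaDevAttrKernelExecTimeout: 1 if there is a run time limit for kernels executed on the device, or 0 if not
cudaDevAttrIntegrated: 1 if the device is integrated with the memory subsystem, or 0 if not
cudaDevAttrCanMapHostMemory: 1 if the device can map host memory into the CUDA address space, or 0 if not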
cudaDevAttrComputeMode: Compute mode is the compute mode that the device is currently in. Available modes are as follows:
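- cudaComputeModeDefault: Default mode - Device is not restricted and multiple threads can use cudaSetDevice() with this device.
- cudaComputeModeProhibited: Compute-prohibited mode - No threads can use cudaSetDevice() with this device.
- cudaComputeModeExclusiveProcess: Compute-exclusive-process mode - Many threads in one process will be able to use cudaSetDevice() with this device.
cudaDevAttrConcurrentKernels: 1 if the device supports executing multiple kernels within the same context simultaneously, or 0 if not. It is not guaranteed that multiple kernels will be resident on the device concurrently, so this feature should not be relied upon for correctness.
cudaDevAttrEccEnabled: 1 if error correction is enabled on the device, 0 if error correction is disabled or not supported by the device
cudaDevAttrPciBusId: PCI bus identifier of the device
cudaDevAttrPciDeviceId: PCI device (also known as slot) identifier of the device
cudaDevAttrTccDriver: 1 if the device is using a TCC driver. TCC is only available on Tesla hardware running Windows Vista or later.
cudaDevAttrMemoryClockRate: Peak memory clock frequency in kilohertz
cudaDevAttrGlobalMemoryBusWidth: Global memory bus width in bits
cudaDevAttrL2CacheSize: Size of L2 cache in bytes. 0 if the device does not have an L2 cache.
cudaDevAttrMaxThreadsPerMultiProcessor: Maximum resident threads per multiprocessor
cudaDevAttrUnifiedAddressing: 1 if the device shares a unified address space with the host, or 0 if not
cudaDevAttrComputeCapabilityMajor: Major compute capability version number
cudaDevAttrComputeCapabilityMinor: Minor compute capability version number
cudaDevAttrStreamPrioritiesSupported: 1 if the device supports stream priorities, or 0 if not
cudaDevAttrGlobalL1CacheSupported: 1 if the device supports caching globals in L1 cache, 0 if not
cudaDevAttrLocalL1CacheSupported: 1 if the device supports caching locals in L1 cache, 0 if not
cudaDevAttrMaxSharedMemoryPerMultiprocessor: Maximum amount of shared memory available to a multiprocessor in bytes; this amount is shared by all thread blocks simultaneously resident on a multiprocessor
cudaDevAttrMaxRegistersPerMultiprocessor: Maximum number of 32-bit registers available to a multiprocessor; this number is shared by all thread blocks simultaneously resident on a multiprocessor
cudaDevAttrManagedMemory: 1 if the device supports allocating managed memory, 0 if not
cudaDevAttrIsMultiGpuBoard: 1 if the device is on a multi-GPU board, 0 if not
cudaDevAttrMultiGpuBoardGroupID: Unique identifier for a group of devices on the same multi-GPU board
cudaDevAttrHostNativeAtomicSupported: 1 if the link between the device and the host supports native atomic operations
cudaDevAttrSingleToDoublePrecisionPerfRatio: Ratio of single precision performance (in floating-point operations per second) to double precision performance
cudaDevAttrPageableMemoryAccess: 1 if the device supports coherently accessing pageable memory without calling cudaHostRegister on it, and 0 otherwise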
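cudaDevAttrConcurrentManagedAccess: 1 if the device can coherently access managed memory concurrently with the CPU, and 0 otherwise
cudaDevAttrComputePreemptionSupported: 1 if the device supports Compute Preemption, 0 if not
cudaDevAttrCanUseHostPointerForRegisteredMem: 1 if the device can access host registered memory at the same virtual address as the CPU, and 0 otherwise
cudaDevAttrCooperativeLaunch: 1 if the device supports launching cooperative kernels via cudaLaunchCooperativeKernel, and 0 otherwise
cudaDevAttrCooperativeMultiDeviceLaunch: 1 if the device supports launching cooperative kernels via cudaLaunchCooperativeKernelMultiDevice, and 0 otherwise
cudaDevAttrCanFlushRemoteWrites: 1 if the device supports flushing of outstanding remote writes, and 0 otherwise
cudaDevAttrHostRegisterSupported: 1 if the device supports host memory registration via cudaHostRegister, and 0 otherwise
cudaDevAttrPageableMemoryAccessUsesHostPageTables: 1 if the device accesses pageable memory via the host's page tables, and 0 otherwise
cudaDevAttrDirectManagedMemAccessFromHost: 1 if the host can directly access managed memory on the device without migration, and 0 otherwise
cudaDevAttrMaxSharedMemoryPerBlockOptin: Maximum per-block shared memory size on the device. This value can be opted into when using cudaFuncSetAttribute
cudaDevAttrMaxBlocksPerMultiprocessor: Maximum number of thread blocks that can reside on a multiprocessor
cudaDevAttrMaxPersistingL2CacheSize: Maximum L2 persisting lines capacity setting in bytes
cudaDevAttrMaxAccessPolicyWindowSize: Maximum value of cudaAccessPolicyWindow::num_bytes
cudaDevAttrReservedSharedMemoryPerBlock: Shared memory reserved by the CUDA driver per block in bytes
cudaDevAttrSparseCudaArraySupported: 1 if the device supports sparse CUDA arrays and sparse CUDA mipmapped arrays.
cudaDevAttrHostRegisterReadOnlySupported: Device supports using the cudaHostRegister flag cudaHostRegisterReadOnly to register memory that must be mapped as read-only to the GPU
cudaDevAttrMemoryPoolsSupported: 1 if the device supports using the cudaMallocAsync and cudaMemPool family of APIs, and 0 otherwise
cudaDevAttrGPUDirectRDMASupported: 1 if the device supports GPUDirect RDMA APIs, and 0 otherwise
cudaDevAttrGPUDirectRDMAFlushWritesOptions: Bitmask to be interpreted according to the cudaFlushGPUDirectRDMAWritesOptions enum
cudaDevAttrGPUDirectRDMAWritesOrdering: See the cudaGPUDirectRDMAWritesOrdering enum for numerical values
cudaDevAttrMemoryPoolSupportedHandleTypes: Bitmask of handle types supported with mempool-based IPC
cudaDevAttrDeferredMappingCudaArraySupported: 1 if the device supports deferred mapping CUDA arrays and CUDA mipmapped arrays.
cudaDevAttrIpcEventSupport: 1 if the device supports IPC events.
cudaDevAttrNumaConfig: NUMA configuration of a device: value is of type cudaDeviceNumaConfig enum
cudaDevAttrNumaId: NUMA node ID of the GPU memory

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: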

cudaGetDeviceCount, cudaGetDevice, cudaSetDevice, cudaChooseDevice, cudaGetDeviceProperties, cudaInitDevice, cuDeviceGetAttribute

__host__ cudaError_t cudaDeviceGetByPCIBusId ( int* device, const char* pciBusId )

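Returns a handle to a compute device.

Parameters

device: - Returned device ordinal
pciBusId: - String in one of the following forms: [domain]:[bus]:[device].[function], [domain]:[bus]:[device], or [bus]:[device].[function], where domain, bus, device, and function are all hexadecimal values

Returns

cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice

Description

Returns in *device a device ordinal given a PCI bus ID string.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: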

cudaDeviceGetPCIBusId, cuDeviceGetByPCIBusId

__host__ __device__ cudaError_t cudaDeviceGetCacheConfig ( cudaFuncCache* pCacheConfig )

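Returns the preferred cache configuration for the current device.

Parameters

pCacheConfig: - Returned cache configuration

Returns

cudaSuccess

Description

On devices where the L1 cache and shared memory use the same hardware resources, this returns through pCacheConfig the preferred cache configuration for the current device. This is only a preference. The runtime will use the requested configuration if possible, but it is free to choose a different configuration if required to execute functions.

This will return a pCacheConfig of cudaFuncCachePreferNone on devices where the size of the L1 cache and shared memory are fixed.

The supported cache configurations are:

cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
cudaFuncCachePreferEqual: prefer equal size L1 cache and shared memory

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: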

cudaDeviceSetCacheConfig, cudaFuncSetCacheConfig ( C API), cudaFuncSetCacheConfig ( C++ API), cuCtxGetCacheConfig

__host__ cudaError_t cudaDeviceGetDefaultMemPool ( cudaMemPool_t* memPool, int device )

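Returns the default mempool of a device.

Returns

cudaSuccess, cudaErrorInvalidDevice, cudaErrorInvalidValue, cudaErrorNotSupported

Description

The default mempool of a device contains device memory from that device.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: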

cuDeviceGetDefaultMemPool, cudaMallocAsync, cudaMemPoolTrimTo, cudaMemPoolGetAttribute, cudaDeviceSetMemPool, cudaMemPoolSetAttribute, cudaMemPoolSetAccess

__host__ __device__ cudaError_t cudaDeviceGetLimit ( size_t* pValue, cudaLimit limit )

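Return resource limits.

Parameters

pValue: - Returned size of the limit
limit: - Limit to query

Returns

cudaSuccess, cudaErrorUnsupportedLimit, cudaErrorInvalidValue

Description

Returns in *pValue the current size of limit. The following cudaLimit values are supported.

cudaLimitStackSize is the stack size in bytes of each GPU thread.
cudaLimitPrintfFifoSize is the size in bytes of the shared FIFO used by the printf() device system call.
cudaLimitMallocHeapSize is the size in bytes of the heap used by the malloc() and free() device system calls.
cudaLimitDevRuntimeSyncDepth is the maximum grid depth at which a thread can issue the device runtime call cudaDeviceSynchronize() to wait on child grid launches to complete. This functionality is removed for devices of compute capability >= 9.0, and hence cudaErrorUnsupportedLimit is returned on such devices.
cudaLimitDevRuntimePendingLaunchCount is the maximum number of outstanding device runtime launches.
cudaLimitMaxL2FetchGranularity is the L2 cache fetch granularity.
cudaLimitPersistingL2CacheSize is the persisting L2 cache size in bytes.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: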

cudaDeviceSetLimit, cuCtxGetLimit

__host__ cudaError_t cudaDeviceGetMemPool ( cudaMemPool_t* memPool, int device )

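Gets the current mempool for a device.

Returns

cudaSuccess, cudaErrorInvalidValue, cudaErrorNotSupported

Description

Returns the last pool provided to cudaDeviceSetMemPool for this device, or the device's default memory pool if cudaDeviceSetMemPool has never been called. By default the current mempool is the default mempool for a device; otherwise the returned pool must have been set with cuDeviceSetMemPool or cudaDeviceSetMemPool.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: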

cuDeviceGetMemPool, cudaDeviceGetDefaultMemPool, cudaDeviceSetMemPool

__host__ cudaError_t cudaDeviceGetNvSciSyncAttributes ( void* nvSciSyncAttrList, int device, int flags )

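Return NvSciSync attributes that this device can support.

Parameters

nvSciSyncAttrList: - Return NvSciSync attributes supported.
device: - Valid CUDA device to get NvSciSync attributes for.
flags: - Flags describing NvSciSync usage.

Description

Returns in nvSciSyncAttrList the properties of NvSciSync that this CUDA device, device, can support. The returned nvSciSyncAttrList can be used to create an NvSciSync that matches this device's capabilities.

If the NvSciSyncAttrKey_RequiredPerm field in nvSciSyncAttrList is already set, this API will return cudaErrorInvalidValue.

The application should set nvSciSyncAttrList to a valid NvSciSyncAttrList; otherwise this API will return cudaErrorInvalidHandle.

flags controls how the application intends to use the NvSciSync created from nvSciSyncAttrList. The valid flags are:

cudaNvSciSyncAttrSignal, specifies that the application intends to signal an NvSciSync on this CUDA device.
cudaNvSciSyncAttrWait, specifies that the application intends to wait on an NvSciSync on this CUDA device.

At least one of these flags must be set, otherwise the API returns cudaErrorInvalidValue. The two flags are independent of one another: a developer may set both, which allows both wait-specific and signal-specific attributes to be set in the same nvSciSyncAttrList.

Note that this API updates the input nvSciSyncAttrList with values equivalent to the following public attribute key-values:

NvSciSyncAttrKey_RequiredPerm is set to
- NvSciSyncAccessPerm_SignalOnly if cudaNvSciSyncAttrSignal is set in flags.
- NvSciSyncAccessPerm_WaitOnly if cudaNvSciSyncAttrWait is set in flags.
- NvSciSyncAccessPerm_WaitSignal if both cudaNvSciSyncAttrWait and cudaNvSciSyncAttrSignal are set in flags.
NvSciSyncAttrKey_PrimitiveInfo is set to
- NvSciSyncAttrValPrimitiveType_SysmemSemaphore on any valid device.
- NvSciSyncAttrValPrimitiveType_Syncpoint if device is a Tegra device.
- NvSciSyncAttrValPrimitiveType_SysmemSemaphorePayload64b if device is GA10X+.
NvSciSyncAttrKey_GpuId is set to the same UUID that is returned in cudaDeviceProp.uuid from cudaDeviceGetProperties for this device.

Returns

cudaSuccess, cudaErrorDeviceUninitialized, cudaErrorInvalidValue, cudaErrorInvalidHandle, cudaErrorInvalidDevice, cudaErrorNotSupported, cudaErrorMemoryAllocation

See also: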

cudaImportExternalSemaphore, cudaDestroyExternalSemaphore, cudaSignalExternalSemaphoresAsync, cudaWaitExternalSemaphoresAsync

__host__ cudaError_t cudaDeviceGetP2PAttribute ( int* value, cudaDeviceP2PAttr attr, int srcDevice, int dstDevice )

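Queries attributes of the link between two devices.

Parameters

value: - Returned value of the requested attribute
attr: - The requested attribute of the link between srcDevice and dstDevice.
srcDevice: - The source device of the target link.
dstDevice: - The destination device of the target link.

Returns

cudaSuccess, cudaErrorInvalidDevice, cudaErrorInvalidValue

Description

Returns in *value the value of the requested attribute attr of the link between srcDevice and dstDevice. The supported attributes are:

cudaDevP2PAttrPerformanceRank: A relative value indicating the performance of the link between two devices. Lower values mean better performance (0 being the value used for the most performant link).
cudaDevP2PAttrAccessSupported: 1 if peer access is enabled.
cudaDevP2PAttrNativeAtomicSupported: 1 if native atomic operations over the link are supported.
cudaDevP2PAttrCudaArrayAccessSupported: 1 if accessing CUDA arrays over the link is supported.

Returns cudaErrorInvalidDevice if srcDevice or dstDevice are not valid or if they represent the same device.

Returns cudaErrorInvalidValue if attr is not valid or if value is a null pointer.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: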

cudaDeviceEnablePeerAccess, cudaDeviceDisablePeerAccess, cudaDeviceCanAccessPeer, cuDeviceGetP2PAttribute

__host__ cudaError_t cudaDeviceGetPCIBusId ( char* pciBusId, int len, int device )

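Returns a PCI Bus Id string for the device.

Parameters

pciBusId: - Returned identifier string for the device in the following format [domain]:[bus]:[device].[function] where domain, bus, device, and function are all hexadecimal values. pciBusId should be large enough to store 13 characters including the NULL-terminator.
len: - Maximum length of string to store in pciBusId
device: - Device to get identifier string for

Returns

cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice

Description

Returns an ASCII string identifying the device device in the NULL-terminated string pointed to by pciBusId. len specifies the maximum length of the string that may be returned.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: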

cudaDeviceGetByPCIBusId, cuDeviceGetPCIBusId

__host__ cudaError_t cudaDeviceGetStreamPriorityRange ( int* leastPriority, int* greatestPriority )

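Returns numerical values that correspond to the least and greatest stream priorities.

Parameters

leastPriority: - Pointer to an int in which the numerical value for least stream priority is returned
greatestPriority: - Pointer to an int in which the numerical value for greatest stream priority is returned

Returns

cudaSuccess

Description

Returns in *leastPriority and *greatestPriority the numerical values that correspond to the least and greatest stream priorities respectively. Stream priorities follow a convention where lower numbers imply greater priorities. The range of meaningful stream priorities is given by [*greatestPriority, *leastPriority]. If the user attempts to create a stream with a priority value outside the meaningful range specified by this API, the priority is automatically clamped to *leastPriority or *greatestPriority, whichever bound was exceeded. See cudaStreamCreateWithPriority for details on creating a priority stream. NULL may be passed for leastPriority or greatestPriority if the corresponding value is not needed.

This function will return '0' in both *leastPriority and *greatestPriority if the current context's device does not support stream priorities (see cudaDeviceGetAttribute).

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: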

cudaStreamCreateWithPriority, cudaStreamGetPriority, cuCtxGetStreamPriorityRange

__host__ cudaError_t cudaDeviceGetTexture1DLinearMaxWidth ( size_t* maxWidthInElements, const cudaChannelFormatDesc* fmtDesc, int device )

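Returns the maximum number of elements allocatable in a 1D linear texture for a given element size.

Parameters

maxWidthInElements: - Returns maximum number of texture elements allocatable for given fmtDesc.
fmtDesc: - Texture format description.
device

Returns

cudaSuccess, cudaErrorUnsupportedLimit, cudaErrorInvalidValue

Description

Returns in maxWidthInElements the maximum number of elements allocatable in a 1D linear texture for the given format descriptor fmtDesc.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: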

cuDeviceGetTexture1DLinearMaxWidth

__host__ cudaError_t cudaDeviceRegisterAsyncNotification ( int device, cudaAsyncCallback callbackFunc, void* userData, cudaAsyncCallbackHandle_t* callback )

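Registers a callback function to receive async notifications.

Parameters

device: - The device on which to register the callback
callbackFunc: - The function to register as a callback
userData: - A generic pointer to user data. This is passed into the callback function.
callback: - A handle representing the registered callback instance

Returns

cudaSuccess, cudaErrorNotSupported, cudaErrorInvalidDevice, cudaErrorInvalidValue, cudaErrorNotPermitted, cudaErrorUnknown

Description

The userData parameter is passed to the callback function at async notification time. Likewise, callback is also passed to the callback function to distinguish between multiple registered callbacks.

Callback functions being registered should be designed to return quickly (~10 ms). Any long-running tasks should be queued for execution on an application thread.

Callbacks may not call cudaDeviceRegisterAsyncNotification or cudaDeviceUnregisterAsyncNotification. Doing so will result in cudaErrorNotPermitted. Async notification callbacks execute in an undefined order and may be serialized.

Returns in *callback a handle representing the registered callback instance.

Note:

Note that this function may also return error codes from previous, asynchronous launches.

See also: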

cudaDeviceUnregisterAsyncNotification

__host__ cudaError_t cudaDeviceReset ( void )

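Destroy all allocations and reset all state on the current device in the current process.

Returns

cudaSuccess

Description

Explicitly destroys and cleans up all resources associated with the current device in the current process. It is the caller's responsibility to ensure that the resources are not accessed or passed in subsequent API calls; doing so will result in undefined behavior. These resources include CUDA types cudaStream_t, cudaEvent_t, cudaArray_t, cudaMipmappedArray_t, cudaPitchedPtr, cudaTextureObject_t, cudaSurfaceObject_t, textureReference, surfaceReference, cudaExternalMemory_t, cudaExternalSemaphore_t and cudaGraphicsResource_t. These resources also include memory allocations by cudaMalloc, cudaMallocHost, cudaMallocManaged and cudaMallocPitch. Any subsequent API call to this device will reinitialize the device.

Note that this function will reset the device immediately. It is the caller's responsibility to ensure that the device is not being accessed by any other host threads from the process when this function is called.

Note:

cudaDeviceReset() will not destroy memory allocations by cudaMallocAsync() and cudaMallocFromPoolAsync(). These memory allocations need to be destroyed explicitly.
If a non-primary CUcontext is current to the thread, cudaDeviceReset() will destroy only the internal CUDA RT state for that CUcontext.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: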

cudaDeviceSynchronize

__host__ cudaError_t cudaDeviceSetCacheConfig ( cudaFuncCache cacheConfig )

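Sets the preferred cache configuration for the current device.

Parameters

cacheConfig: - Requested cache configuration

Returns

cudaSuccess

Description

On devices where the L1 cache and shared memory use the same hardware resources, this sets through cacheConfig the preferred cache configuration for the current device. This is only a preference. The runtime will use the requested configuration if possible, but it is free to choose a different configuration if required to execute the function. Any function preference set via cudaFuncSetCacheConfig (C API) or cudaFuncSetCacheConfig (C++ API) will be preferred over this device-wide setting. Setting the device-wide cache configuration to cudaFuncCachePreferNone will cause subsequent kernel launches to prefer not to change the cache configuration unless required to launch the kernel.

This setting does nothing on devices where the size of the L1 cache and shared memory are fixed.

Launching a kernel with a different preference than the most recent preference setting may insert a device-side synchronization point.

The supported cache configurations are:

cudaFuncCachePreferNone: no preference for shared memory or L1 (default)
cudaFuncCachePreferShared: prefer larger shared memory and smaller L1 cache
cudaFuncCachePreferL1: prefer larger L1 cache and smaller shared memory
cudaFuncCachePreferEqual: prefer equal size L1 cache and shared memory

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: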

cudaDeviceGetCacheConfig, cudaFuncSetCacheConfig ( C API), cudaFuncSetCacheConfig ( C++ API), cuCtxSetCacheConfig

__host__ cudaError_t cudaDeviceSetLimit ( cudaLimit limit, size_t value )

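Set resource limits.

Parameters

limit: - Limit to set
value: - Size of limit

Returns

cudaSuccess, cudaErrorUnsupportedLimit, cudaErrorInvalidValue, cudaErrorMemoryAllocation

Description

Setting limit to value is a request by the application to update the current limit maintained by the device. The driver is free to modify the requested value to meet hardware requirements (for example, clamping to minimum or maximum values, or rounding up to the nearest element size). The application can use cudaDeviceGetLimit() to find out exactly what the limit has been set to.

Setting each cudaLimit has its own specific restrictions, so each is discussed here.

cudaLimitStackSize controls the stack size in bytes of each GPU thread.

cudaLimitPrintfFifoSize controls the size in bytes of the shared FIFO used by the printf() device system call. Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned.

cudaLimitMallocHeapSize controls the size in bytes of the heap used by the malloc() and free() device system calls. Setting cudaLimitMallocHeapSize must not be performed after launching any kernel that uses the malloc() or free() device system calls - in such case cudaErrorInvalidValue will be returned.

cudaLimitDevRuntimeSyncDepth controls the maximum nesting depth of a grid at which a thread can safely call cudaDeviceSynchronize(). Setting this limit must be performed before any launch of a kernel that uses the device runtime and calls cudaDeviceSynchronize() above the default sync depth, two levels of grids. Calls to cudaDeviceSynchronize() will fail with error code cudaErrorSyncDepthExceeded if the limitation is violated. This limit can be set smaller than the default or up to the maximum launch depth of 24. When setting this limit, keep in mind that additional levels of sync depth require the runtime to reserve large amounts of device memory which can no longer be used for user allocations. If these reservations of device memory fail, cudaDeviceSetLimit will return cudaErrorMemoryAllocation, and the limit can be reset to a lower value. This limit is only applicable to devices of compute capability < 9.0. Attempting to set this limit on devices of other compute capability will result in error cudaErrorUnsupportedLimit being returned.

cudaLimitDevRuntimePendingLaunchCount controls the maximum number of outstanding device runtime launches that can be made from the current device. A grid is outstanding from the point of launch up until the grid is known to have been completed. Device runtime launches which violate this limitation fail and return cudaErrorLaunchPendingCountExceeded when cudaGetLastError() is called after launch. If more pending launches than the default (2048 launches) are needed for a module using the device runtime, this limit can be increased. Keep in mind that being able to sustain additional pending launches will require the runtime to reserve larger amounts of device memory upfront which can no longer be used for allocations. If these reservations fail, cudaDeviceSetLimit will return cudaErrorMemoryAllocation, and the limit can be reset to a lower value. This limit is only applicable to devices of compute capability 3.5 and higher. Attempting to set this limit on devices of compute capability less than 3.5 will result in the error cudaErrorUnsupportedLimit being returned.

cudaLimitMaxL2FetchGranularity controls the L2 cache fetch granularity. Values can range from 0B to 128B. This is purely a performance hint and it can be ignored or clamped depending on the platform.

cudaLimitPersistingL2CacheSize controls the size in bytes available for persisting L2 cache. This is purely a performance hint and it can be ignored or clamped depending on the platform.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: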

cudaDeviceGetLimit, cuCtxSetLimit

__host__ cudaError_t cudaDeviceSetMemPool ( int device, cudaMemPool_t memPool )

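Sets the current memory pool of a device.

Returns

cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice, cudaErrorNotSupported

Description

The memory pool must be local to the specified device. Unless a mempool is specified in the cudaMallocAsync call, cudaMallocAsync allocates from the current mempool of the provided stream's device. By default, a device's current memory pool is its default memory pool.

Note:

Use cudaMallocFromPoolAsync to specify asynchronous allocations from a device different than the one the stream runs on.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: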

cuDeviceSetMemPool, cudaDeviceGetMemPool, cudaDeviceGetDefaultMemPool, cudaMemPoolCreate, cudaMemPoolDestroy, cudaMallocFromPoolAsync

__host__ __device__ cudaError_t cudaDeviceSynchronize ( void )

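Wait for compute device to finish.

Returns

cudaSuccess

Description

Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.

Note:

Use of cudaDeviceSynchronize in device code was deprecated in CUDA 11.6 and removed for compute_90+ compilation. For compute capability < 9.0, compile-time opt-in by specifying -D CUDA_FORCE_CDP1_IF_SUPPORTED is required to continue using cudaDeviceSynchronize() in device code for now. Note that this is different from host-side cudaDeviceSynchronize, which is still supported.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: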

cudaDeviceReset, cuCtxSynchronize

__host__ cudaError_t cudaDeviceUnregisterAsyncNotification ( int device, cudaAsyncCallbackHandle_t callback )

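Unregisters an async notification callback.

Parameters

device: - The device from which to remove callback.
callback: - The callback instance to unregister from receiving async notifications.

Returns

cudaSuccess, cudaErrorNotSupported, cudaErrorInvalidDevice, cudaErrorInvalidValue, cudaErrorNotPermitted, cudaErrorUnknown

Description

Unregisters callback so that the corresponding callback function will stop receiving async notifications.

Note:

Note that this function may also return error codes from previous, asynchronous launches.

See also: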

cudaDeviceRegisterAsyncNotification

__host__ __device__ cudaError_t cudaGetDevice ( int* device )

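Returns which device is currently being used.

Parameters

device: - Returns the device on which the active host thread executes the device code.

Returns

cudaSuccess, cudaErrorInvalidValue, cudaErrorDeviceUnavailable

Description

Returns in *device the current device for the calling host thread.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: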

cudaGetDeviceCount, cudaSetDevice, cudaGetDeviceProperties, cudaChooseDevice, cuCtxGetCurrent

__host__ __device__ cudaError_t cudaGetDeviceCount ( int* count )

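Returns the number of compute-capable devices.

Parameters

count: - Returns the number of devices with compute capability greater or equal to 2.0

Returns

cudaSuccess

Description

Returns in *count the number of devices with compute capability greater or equal to 2.0 that are available for execution.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: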

cudaGetDevice, cudaSetDevice, cudaGetDeviceProperties, cudaChooseDevice, cudaInitDevice, cuDeviceGetCount

__host__ cudaError_t cudaGetDeviceFlags ( unsigned int* flags )

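Gets the flags for the current device.

Parameters

flags: - Pointer to store the device flags

Returns

cudaSuccess, cudaErrorInvalidDevice

Description

Returns in flags the flags for the current device. If there is a current device for the calling thread, the flags for the device are returned. If there is no current device, the flags for the first device are returned, which may be the default flags. Compare to the behavior of cudaSetDeviceFlags.

Typically, the flags returned should match the behavior that will be seen if the calling thread uses a device after this call, provided that neither this thread nor any other thread changes the flags or the current device in between. Note that if the device is not initialized, it is possible for another thread to change the flags for the current device before it is initialized. Additionally, when using exclusive mode, if this thread has not requested a specific device, it may use a device other than the first device, contrary to the assumption made by this function.

If a context has been created via the driver API and is current to the calling thread, the flags for that context are always returned.

Flags returned by this function may specifically include cudaDeviceMapHost even though it is not accepted by cudaSetDeviceFlags because it is implicit in runtime API flags. The reason for this is that the current context may have been created via the driver API, in which case the flag is not implicit and may be unset.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. In such a case, cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic.

See also: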

cudaGetDevice, cudaGetDeviceProperties, cudaSetDevice, cudaSetDeviceFlags, cudaInitDevice, cuCtxGetFlags, cuDevicePrimaryCtxGetState

__host__ cudaError_t cudaGetDeviceProperties ( cudaDeviceProp* prop, int device )

Returns information about the compute device.

Parameters

prop: - Properties for the specified device
device: - Device number to get properties for

cudaSuccess, cudaErrorInvalidDevice

Description

Returns in *prop the properties of device device. The cudaDeviceProp structure is defined as:

    struct cudaDeviceProp {
              char name[256];
              cudaUUID_t uuid;
              size_t totalGlobalMem;
              size_t sharedMemPerBlock;
              int regsPerBlock;
              int warpSize;
              size_t memPitch;
              int maxThreadsPerBlock;
              int maxThreadsDim[3];
              int maxGridSize[3];
              int clockRate;
              size_t totalConstMem;
              int major;
              int minor;
              size_t textureAlignment;
              size_t texturePitchAlignment;
              int deviceOverlap;
              int multiProcessorCount;
              int kernelExecTimeoutEnabled;
              int integrated;
              int canMapHostMemory;
              int computeMode;
              int maxTexture1D;
              int maxTexture1DMipmap;
              int maxTexture1DLinear;
              int maxTexture2D[2];
              int maxTexture2DMipmap[2];
              int maxTexture2DLinear[3];
              int maxTexture2DGather[2];
              int maxTexture3D[3];
              int maxTexture3DAlt[3];
              int maxTextureCubemap;
              int maxTexture1DLayered[2];
              int maxTexture2DLayered[3];
              int maxTextureCubemapLayered[2];
              int maxSurface1D;
              int maxSurface2D[2];
              int maxSurface3D[3];
              int maxSurface1DLayered[2];
              int maxSurface2DLayered[3];
              int maxSurfaceCubemap;
              int maxSurfaceCubemapLayered[2];
              size_t surfaceAlignment;
              int concurrentKernels;
              int ECCEnabled;
              int pciBusID;
              int pciDeviceID;
              int pciDomainID;
              int tccDriver;
              int asyncEngineCount;
              int unifiedAddressing;
              int memoryClockRate;
              int memoryBusWidth;
              int l2CacheSize;
              int persistingL2CacheMaxSize;
              int maxThreadsPerMultiProcessor;
              int streamPrioritiesSupported;
              int globalL1CacheSupported;
              int localL1CacheSupported;
              size_t sharedMemPerMultiprocessor;
              int regsPerMultiprocessor;
              int managedMemory;
              int isMultiGpuBoard;
              int multiGpuBoardGroupID;
              int singleToDoublePrecisionPerfRatio;
              int pageableMemoryAccess;
              int concurrentManagedAccess;
              int computePreemptionSupported;
              int canUseHostPointerForRegisteredMem;
              int cooperativeLaunch;
              int cooperativeMultiDeviceLaunch;
              int pageableMemoryAccessUsesHostPageTables;
              int directManagedMemAccessFromHost;
              int accessPolicyMaxWindowSize;
          };

where:

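name[256] is an ASCII string identifying the device.
uuid is a 16-byte unique identifier.
totalGlobalMem is the total amount of global memory available on the device in bytes.
sharedMemPerBlock is the maximum amount of shared memory available to a thread block in bytes.
regsPerBlock is the maximum number of 32-bit registers available to a thread block.
warpSize is the warp size in threads.
memPitch is the maximum pitch in bytes allowed by the memory copy functions that involve memory regions allocated through cudaMallocPitch().
maxThreadsPerBlock is the maximum number of threads per block.
maxThreadsDim[3] contains the maximum size of each dimension of a block.
maxGridSize[3] contains the maximum size of each dimension of a grid.
clockRate is the clock frequency in kilohertz.
totalConstMem is the total amount of constant memory available on the device in bytes.
major, minor are the major and minor revision numbers defining the device's compute capability.
textureAlignment is the alignment requirement; texture base addresses that are aligned to textureAlignment bytes do not need an offset applied to texture fetches.
texturePitchAlignment is the pitch alignment requirement for 2D texture references that are bound to pitched memory.
deviceOverlap is 1 if the device can concurrently copy memory between host and device while executing a kernel, or 0 if not. Deprecated; use asyncEngineCount instead.
multiProcessorCount is the number of multiprocessors on the device.
kernelExecTimeoutEnabled is 1 if there is a run time limit for kernels executed on the device, or 0 if not.
integrated is 1 if the device is an integrated (motherboard) GPU and 0 if it is a discrete (card) component.
canMapHostMemory is 1 if the device can map host memory into the CUDA address space for use with cudaHostAlloc()/cudaHostGetDevicePointer(), or 0 if not.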
computeMode is the compute mode that the device is currently in. Available modes are as follows:
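- cudaComputeModeDefault: Default mode - the device is not restricted and multiple threads can use cudaSetDevice() with this device.
- cudaComputeModeProhibited: Compute-prohibited mode - no threads can use cudaSetDevice() with this device.
- cudaComputeModeExclusiveProcess: Compute-exclusive-process mode - many threads in one process will be able to use cudaSetDevice() with this device.
  
  When an occupied exclusive mode device is chosen with cudaSetDevice, all subsequent non-device management runtime functions will return cudaErrorDevicesUnavailable.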
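maxTexture1D is the maximum 1D texture size.
maxTexture1DMipmap is the maximum 1D mipmapped texture size.
maxTexture1DLinear is the maximum 1D texture size for textures bound to linear memory.
maxTexture2D[2] contains the maximum 2D texture dimensions.
maxTexture2DMipmap[2] contains the maximum 2D mipmapped texture dimensions.
maxTexture2DLinear[3] contains the maximum 2D texture dimensions for 2D textures bound to pitch linear memory.
maxTexture2DGather[2] contains the maximum 2D texture dimensions if texture gather operations have to be performed.
maxTexture3D[3] contains the maximum 3D texture dimensions.
maxTexture3DAlt[3] contains the maximum alternate 3D texture dimensions.
maxTextureCubemap is the maximum cubemap texture width or height.
maxTexture1DLayered[2] contains the maximum 1D layered texture dimensions.
maxTexture2DLayered[3] contains the maximum 2D layered texture dimensions.
maxTextureCubemapLayered[2] contains the maximum cubemap layered texture dimensions.
maxSurface1D is the maximum 1D surface size.
maxSurface2D[2] contains the maximum 2D surface dimensions.
maxSurface3D[3] contains the maximum 3D surface dimensions.
maxSurface1DLayered[2] contains the maximum 1D layered surface dimensions.
maxSurface2DLayered[3] contains the maximum 2D layered surface dimensions.
maxSurfaceCubemap is the maximum cubemap surface width or height.
maxSurfaceCubemapLayered[2] contains the maximum cubemap layered surface dimensions.
surfaceAlignment specifies the alignment requirements for surfaces.
concurrentKernels is 1 if the device supports executing multiple kernels within the same context simultaneously, or 0 if not. It is not guaranteed that multiple kernels will be resident on the device concurrently, so this feature should not be relied upon for correctness.
ECCEnabled is 1 if the device has ECC support turned on, or 0 if not.
pciBusID is the PCI bus identifier of the device.
pciDeviceID is the PCI device (sometimes called slot) identifier of the device.
pciDomainID is the PCI domain identifier of the device.
tccDriver is 1 if the device is using a TCC driver, or 0 if not.
asyncEngineCount is 1 when the device can concurrently copy memory between host and device while executing a kernel. It is 2 when the device can concurrently copy memory between host and device in both directions and execute a kernel at the same time. It is 0 if neither of these is supported.
unifiedAddressing is 1 if the device shares a unified address space with the host, and 0 otherwise.
memoryClockRate is the peak memory clock frequency in kilohertz.
memoryBusWidth is the memory bus width in bits.
l2CacheSize is the L2 cache size in bytes.
persistingL2CacheMaxSize is the L2 cache's maximum persisting lines size in bytes.
maxThreadsPerMultiProcessor is the maximum number of resident threads per multiprocessor.
streamPrioritiesSupported is 1 if the device supports stream priorities, or 0 if not.
globalL1CacheSupported is 1 if the device supports caching of globals in L1 cache, or 0 if not.
localL1CacheSupported is 1 if the device supports caching of locals in L1 cache, or 0 if not.
sharedMemPerMultiprocessor is the maximum amount of shared memory available to a multiprocessor in bytes; this amount is shared by all thread blocks simultaneously resident on a multiprocessor.
regsPerMultiprocessor is the maximum number of 32-bit registers available to a multiprocessor; this number is shared by all thread blocks simultaneously resident on a multiprocessor.
managedMemory is 1 if the device supports allocating managed memory on this system, or 0 if not.
isMultiGpuBoard is 1 if the device is on a multi-GPU board (e.g. Gemini cards), and 0 if not.
multiGpuBoardGroupID is a unique identifier for a group of devices associated with the same board. Devices on the same multi-GPU board share the same identifier.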
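hostNativeAtomicSupported is 1 if the link between the device and the host supports native atomic operations, or 0 if not.
singleToDoublePrecisionPerfRatio is the ratio of single precision performance (in floating-point operations per second) to double precision performance.
pageableMemoryAccess is 1 if the device supports coherently accessing pageable memory without calling cudaHostRegister on it, and 0 otherwise.
concurrentManagedAccess is 1 if the device can coherently access managed memory concurrently with the CPU, and 0 otherwise.
computePreemptionSupported is 1 if the device supports Compute Preemption, and 0 otherwise.
canUseHostPointerForRegisteredMem is 1 if the device can access host registered memory at the same virtual address as the CPU, and 0 otherwise.
cooperativeLaunch is 1 if the device supports launching cooperative kernels via cudaLaunchCooperativeKernel, and 0 otherwise.
cooperativeMultiDeviceLaunch is 1 if the device supports launching cooperative kernels via cudaLaunchCooperativeKernelMultiDevice, and 0 otherwise.
sharedMemPerBlockOptin is the per-device maximum shared memory per block usable by special opt-in.
pageableMemoryAccessUsesHostPageTables is 1 if the device accesses pageable memory via the host's page tables, and 0 otherwise.
directManagedMemAccessFromHost is 1 if the host can directly access managed memory on the device without migration, and 0 otherwise.
maxBlocksPerMultiProcessor is the maximum number of thread blocks that can reside on a multiprocessor.
accessPolicyMaxWindowSize is the maximum value of cudaAccessPolicyWindow::num_bytes.
reservedSharedMemPerBlock is the shared memory reserved by the CUDA driver per block in bytes.
hostRegisterSupported is 1 if the device supports host memory registration via cudaHostRegister, and 0 otherwise.
sparseCudaArraySupported is 1 if the device supports sparse CUDA arrays and sparse CUDA mipmapped arrays, and 0 otherwise.
hostRegisterReadOnlySupported is 1 if the device supports using the cudaHostRegister flag cudaHostRegisterReadOnly to register memory that must be mapped as read-only to the GPU.
timelineSemaphoreInteropSupported is 1 if external timeline semaphore interop is supported on the device, and 0 otherwise.
memoryPoolsSupported is 1 if the device supports using the cudaMallocAsync and cudaMemPool family of APIs, and 0 otherwise.
gpuDirectRDMASupported is 1 if the device supports GPUDirect RDMA APIs, and 0 otherwise.
gpuDirectRDMAFlushWritesOptions is a bitmask to be interpreted according to the cudaFlushGPUDirectRDMAWritesOptions enum.
gpuDirectRDMAWritesOrdering: see the cudaGPUDirectRDMAWritesOrdering enum for numerical values.
memoryPoolSupportedHandleTypes is a bitmask of handle types supported with mempool-based IPC.
deferredMappingCudaArraySupported is 1 if the device supports deferred mapping CUDA arrays and CUDA mipmapped arrays.
ipcEventSupported is 1 if the device supports IPC events, and 0 otherwise.
unifiedFunctionPointers is 1 if the device supports unified function pointers, and 0 otherwise.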
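
As an illustration of how two of the fields above are commonly combined, the following sketch estimates theoretical peak memory bandwidth from memoryClockRate (kilohertz) and memoryBusWidth (bits). It is a minimal sketch under stated assumptions: the factor of 2 assumes a double-data-rate memory interface, which cudaDeviceProp does not report, and MemoryProps and peakBandwidthGBps are hypothetical names. In a real program the two values would be read from the cudaDeviceProp filled in by cudaGetDeviceProperties.

```cpp
#include <cassert>

// Hypothetical stand-in for the two cudaDeviceProp fields used here.
struct MemoryProps {
    int memoryClockRate;  // peak memory clock frequency in kilohertz
    int memoryBusWidth;   // memory bus width in bits
};

// Estimates theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).
// Assumption: double-data-rate memory, hence the factor of 2.
double peakBandwidthGBps(const MemoryProps& p) {
    assert(p.memoryClockRate >= 0 && p.memoryBusWidth >= 0);
    const double clockHz = p.memoryClockRate * 1.0e3;        // kHz -> Hz
    const double bytesPerTransfer = p.memoryBusWidth / 8.0;  // bits -> bytes
    return 2.0 * clockHz * bytesPerTransfer / 1.0e9;
}
```

For example, a memory clock of 877000 kHz on a 4096-bit bus gives roughly 898 GB/s under this assumption. The result is an upper bound, not a measured figure.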

Note:

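Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.

See also: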

cudaGetDeviceCount, cudaGetDevice, cudaSetDevice, cudaChooseDevice, cudaDeviceGetAttribute, cudaInitDevice, cuDeviceGetAttribute, cuDeviceGetName

__host__ cudaError_t cudaInitDevice ( int device, unsigned int deviceFlags, unsigned int flags )

Initializes the device to be used for GPU executions.

Parameters

device: - Device on which the runtime will initialize itself.
deviceFlags: - Parameters for device operation.
flags: - Flags for controlling the device initialization.

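cudaSuccess, cudaErrorInvalidDevice

Description

This function initializes the CUDA runtime structures and the primary context on device when called, but the context is not made current to device.

When cudaInitDeviceFlagsAreValid is set in flags, deviceFlags are applied to the requested device. The values of deviceFlags match those of the flags parameter in cudaSetDeviceFlags. The effect may be verified with cudaGetDeviceFlags.

This function returns an error if the device is in cudaComputeModeExclusiveProcess and is occupied by another process, or if the device is in cudaComputeModeProhibited.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.

See also:

cudaGetDeviceCount, cudaGetDevice, cudaGetDeviceProperties, cudaChooseDevice, cudaSetDevice, cuCtxSetCurrent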

__host__ cudaError_t cudaIpcCloseMemHandle ( void* devPtr )

Attempts to close memory mapped with cudaIpcOpenMemHandle.

Parameters

devPtr: - Device pointer returned by cudaIpcOpenMemHandle

cudaSuccess, cudaErrorMapBufferObjectFailed, cudaErrorNotSupported, cudaErrorInvalidValue

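Description

Decrements the reference count of the memory returned by cudaIpcOpenMemHandle by 1. When the reference count reaches 0, this API unmaps the memory. The original allocation in the exporting process, as well as imported mappings in other processes, are unaffected.

Any resources used to enable peer access are freed if this is the last mapping using them.

IPC functionality is restricted to devices with support for unified addressing on Linux and Windows operating systems. IPC functionality on Windows is supported for compatibility purposes but is not recommended because it comes with a performance cost. Users can test their device for IPC functionality by calling cudaDeviceGetAttribute with cudaDevAttrIpcEventSupport.

Note:

Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.

See also: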

cudaMalloc, cudaFree, cudaIpcGetEventHandle, cudaIpcOpenEventHandle, cudaIpcGetMemHandle, cudaIpcOpenMemHandle, cuIpcCloseMemHandle

__host__ cudaError_t cudaIpcGetEventHandle ( cudaIpcEventHandle_t* handle, cudaEvent_t event )

Gets an interprocess handle for a previously allocated event.

Parameters

handle: - Pointer to a user allocated cudaIpcEventHandle in which to return the opaque event handle
event: - Event allocated with cudaEventInterprocess and cudaEventDisableTiming flags.

cudaSuccess, cudaErrorInvalidResourceHandle, cudaErrorMemoryAllocation, cudaErrorMapBufferObjectFailed, cudaErrorNotSupported, cudaErrorInvalidValue

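Description

Takes as input a previously allocated event. This event must have been created with the cudaEventInterprocess and cudaEventDisableTiming flags set. The resulting opaque handle may be copied into other processes and opened with cudaIpcOpenEventHandle to allow efficient hardware synchronization between GPU work in different processes.

After the event has been opened in the importing process, cudaEventRecord, cudaEventSynchronize, cudaStreamWaitEvent and cudaEventQuery may be used in either process. Performing operations on the imported event after the exported event has been freed with cudaEventDestroy results in undefined behavior.

IPC functionality is restricted to devices with support for unified addressing on Linux and Windows operating systems. IPC functionality on Windows is supported for compatibility purposes but is not recommended because it comes with a performance cost. Users can test their device for IPC functionality by calling cudaDeviceGetAttribute with cudaDevAttrIpcEventSupport.

Note:

Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.

See also: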

cudaEventCreate, cudaEventDestroy, cudaEventSynchronize, cudaEventQuery, cudaStreamWaitEvent, cudaIpcOpenEventHandle, cudaIpcGetMemHandle, cudaIpcOpenMemHandle, cudaIpcCloseMemHandle, cuIpcGetEventHandle

__host__ cudaError_t cudaIpcGetMemHandle ( cudaIpcMemHandle_t* handle, void* devPtr )

Gets an interprocess memory handle for an existing device memory allocation.

Parameters

handle: - Pointer to user allocated cudaIpcMemHandle to return the handle in.
devPtr: - Base pointer to previously allocated device memory

cudaSuccess, cudaErrorMemoryAllocation, cudaErrorMapBufferObjectFailed, cudaErrorNotSupported, cudaErrorInvalidValue

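Description

Takes a pointer to the base of an existing device memory allocation created with cudaMalloc and exports it for use in another process. This is a lightweight operation and may be called multiple times on the same allocation without adverse effects.

If a region of memory is freed with cudaFree and a subsequent call to cudaMalloc returns memory with the same device address, cudaIpcGetMemHandle returns a unique handle for the new memory.

Note:

Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.

See also: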

cudaMalloc, cudaFree, cudaIpcGetEventHandle, cudaIpcOpenEventHandle, cudaIpcOpenMemHandle, cudaIpcCloseMemHandle, cuIpcGetMemHandle

__host__ cudaError_t cudaIpcOpenEventHandle ( cudaEvent_t* event, cudaIpcEventHandle_t handle )

Opens an interprocess event handle for use in the current process.

Parameters

event: - Returns the imported event
handle: - Interprocess handle to open

cudaSuccess, cudaErrorMapBufferObjectFailed, cudaErrorNotSupported, cudaErrorInvalidValue, cudaErrorDeviceUninitialized

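Description

Opens an interprocess event handle exported from another process with cudaIpcGetEventHandle. This function returns a cudaEvent_t that behaves like a locally created event with the cudaEventDisableTiming flag specified. This event must be freed with cudaEventDestroy.

Performing operations on the imported event after the exported event has been freed with cudaEventDestroy results in undefined behavior.

IPC functionality is restricted to devices with support for unified addressing on Linux and Windows operating systems. IPC functionality on Windows is supported for compatibility purposes but is not recommended because it comes with a performance cost. Users can test their device for IPC functionality by calling cudaDeviceGetAttribute with cudaDevAttrIpcEventSupport.

Note:

Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.

See also: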

cudaEventCreate, cudaEventDestroy, cudaEventSynchronize, cudaEventQuery, cudaStreamWaitEvent, cudaIpcGetEventHandle, cudaIpcGetMemHandle, cudaIpcOpenMemHandle, cudaIpcCloseMemHandle, cuIpcOpenEventHandle

__host__ cudaError_t cudaIpcOpenMemHandle ( void** devPtr, cudaIpcMemHandle_t handle, unsigned int flags )

Opens an interprocess memory handle exported from another process and returns a device pointer usable in the local process.

Parameters

devPtr: - Returned device pointer
handle: - cudaIpcMemHandle to open
flags: - Flags for this operation. Must be specified as cudaIpcMemLazyEnablePeerAccess

cudaSuccess, cudaErrorMapBufferObjectFailed, cudaErrorInvalidResourceHandle, cudaErrorDeviceUninitialized, cudaErrorTooManyPeers, cudaErrorNotSupported, cudaErrorInvalidValue

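Description

Maps memory exported from another process with cudaIpcGetMemHandle into the current device address space. For contexts on different devices, cudaIpcOpenMemHandle can attempt to enable peer access between the devices as if the user had called cudaDeviceEnablePeerAccess. This behavior is controlled by the cudaIpcMemLazyEnablePeerAccess flag. cudaDeviceCanAccessPeer can determine whether a mapping is possible.

cudaIpcOpenMemHandle can open handles to devices that may not be visible in the process calling the API.

Contexts that may open cudaIpcMemHandles are restricted in the following way: cudaIpcMemHandles from each device in a given process may only be opened by one context per device per other process.

If the memory handle has already been opened by the current context, the reference count on the handle is incremented by 1 and the existing device pointer is returned.

Memory returned from cudaIpcOpenMemHandle must be freed with cudaIpcCloseMemHandle.

Calling cudaFree on an exported memory region before calling cudaIpcCloseMemHandle in the importing context results in undefined behavior.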
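
The reference-counting rule described above (reopening a handle returns the existing pointer, and the mapping goes away only when the count returns to 0) can be pictured with a small host-only model. This is an illustrative sketch, not the runtime's implementation: IpcMappingModel is a hypothetical class, and plain integers stand in for cudaIpcMemHandle_t values and device pointers.

```cpp
#include <cassert>
#include <map>

// Toy model of the open/close bookkeeping for one importing context.
// Handles and mapped addresses are plain ints (an assumption made for
// illustration; the real types are opaque).
class IpcMappingModel {
public:
    // Opening an already-open handle bumps its reference count and returns
    // the existing address; otherwise a new mapping is created.
    int open(int handle) {
        auto it = mappings_.find(handle);
        if (it != mappings_.end()) {
            ++it->second.refCount;
            return it->second.address;
        }
        Mapping m{nextAddress_++, 1};
        mappings_[handle] = m;
        return m.address;
    }

    // Decrements the count; the mapping is removed when it reaches 0.
    // Returns true if this call removed the mapping.
    bool close(int handle) {
        auto it = mappings_.find(handle);
        assert(it != mappings_.end() && "close without a matching open");
        if (--it->second.refCount == 0) {
            mappings_.erase(it);
            return true;
        }
        return false;
    }

    bool isMapped(int handle) const { return mappings_.count(handle) != 0; }

private:
    struct Mapping {
        int address;
        int refCount;
    };
    std::map<int, Mapping> mappings_;
    int nextAddress_ = 1000;  // arbitrary starting value for fake addresses
};
```

In this model, two calls to open with the same handle must be balanced by two calls to close before isMapped reports false, which mirrors the pairing of cudaIpcOpenMemHandle and cudaIpcCloseMemHandle.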

Note:

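Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.
No guarantees are made about the address returned in *devPtr. In particular, multiple processes may not receive the same address for the same handle.

See also: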

cudaMalloc, cudaFree, cudaIpcGetEventHandle, cudaIpcOpenEventHandle, cudaIpcGetMemHandle, cudaIpcCloseMemHandle, cudaDeviceEnablePeerAccess, cudaDeviceCanAccessPeer, cuIpcOpenMemHandle

__host__ cudaError_t cudaSetDevice ( int device )

Sets the device to be used for GPU executions.

Parameters

device: - Device on which the active host thread should execute the device code.

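cudaSuccess, cudaErrorInvalidDevice, cudaErrorDeviceUnavailable

Description

Sets device as the current device for the calling host thread. Valid device IDs are 0 to (cudaGetDeviceCount() - 1).

Any device memory subsequently allocated from this host thread using cudaMalloc(), cudaMallocPitch() or cudaMallocArray() will be physically resident on device. Any host memory allocated from this host thread using cudaMallocHost(), cudaHostAlloc() or cudaHostRegister() will have its lifetime associated with device. Any streams or events created from this host thread will be associated with device. Any kernels launched from this host thread using the <<<>>> operator or cudaLaunchKernel() will be executed on device.

This call may be made from any host thread, to any device, and at any time. This function does no synchronization with the previous or new device, and should only take significant time when it initializes the runtime's context state. This call binds the primary context of the specified device to the calling thread, and all subsequent memory allocations, stream and event creations, and kernel launches are associated with that primary context. This function also immediately initializes the runtime state on the primary context, and the context becomes current on device immediately. This function returns an error if the device is in cudaComputeModeExclusiveProcess and is occupied by another process, or if the device is in cudaComputeModeProhibited.

It is not required to call cudaInitDevice before using this function.

Note:

Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.

See also: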

cudaGetDeviceCount, cudaGetDevice, cudaGetDeviceProperties, cudaChooseDevice, cudaInitDevice, cuCtxSetCurrent

__host__ cudaError_t cudaSetDeviceFlags ( unsigned int flags )

Sets flags to be used for device executions.

Parameters

flags: - Parameters for device operation

cudaSuccess, cudaErrorInvalidValue

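Description

Records flags as the flags for the current device. If the current device has been set and that device has already been initialized, the previous flags are overwritten. If the current device has not been initialized, it is initialized with the provided flags. If no device has been made current to the calling thread, a default device is selected and initialized with the provided flags.

The three least significant bits of the flags parameter can be used to control how the CPU thread interacts with the OS scheduler when waiting for results from the device.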

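cudaDeviceScheduleAuto: The default value if the flags parameter is zero. Uses a heuristic based on the number of active CUDA contexts in the process, C, and the number of logical processors in the system, P. If C > P, CUDA yields to other OS threads when waiting for the device; otherwise CUDA does not yield while waiting for results and actively spins on the processor. Additionally, on Tegra devices, cudaDeviceScheduleAuto uses a heuristic based on the power profile of the platform and may choose cudaDeviceScheduleBlockingSync for low-powered devices.
cudaDeviceScheduleSpin: Instructs CUDA to actively spin when waiting for results from the device. This can decrease latency when waiting for the device, but may lower the performance of CPU threads if they are performing work in parallel with the CUDA thread.
cudaDeviceScheduleYield: Instructs CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.
cudaDeviceScheduleBlockingSync: Instructs CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.
cudaDeviceBlockingSync: Instructs CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.
  Deprecated: This flag was deprecated as of CUDA 4.0 and replaced with cudaDeviceScheduleBlockingSync.
cudaDeviceMapHost: This flag enables allocating pinned host memory that is accessible to the device. It is implicit for the runtime but may be absent if a context is created using the driver API. If this flag is not set, cudaHostGetDevicePointer() always returns a failure code.
cudaDeviceLmemResizeToMax: Instructs CUDA not to reduce local memory after resizing local memory for a kernel. This can prevent thrashing by local memory allocations when launching many kernels with high local memory usage, at the cost of potentially increased memory usage.
  Deprecated: This flag is deprecated; the behavior it enabled is now the default and cannot be disabled.
cudaDeviceSyncMemops: Ensures that synchronous memory operations initiated on this context always synchronize. See the section titled "API Synchronization behavior" to learn more about cases in which synchronous memory operations can exhibit asynchronous behavior.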
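
The way the three least significant bits select a scheduling policy, and the C > P rule used by cudaDeviceScheduleAuto, can be sketched in host-only code. This is a minimal illustration, not the runtime's logic: the k-prefixed constants are local mirrors of the values in the CUDA headers (repeated here, as an assumption that they match the installed toolkit, so the sketch builds without it), and schedulingPolicy and autoPolicyYields are hypothetical helper names.

```cpp
#include <cassert>
#include <string>

// Local mirrors of the scheduling flag values. In a real program these
// constants come from <cuda_runtime.h>.
constexpr unsigned int kScheduleAuto         = 0x00;  // cudaDeviceScheduleAuto
constexpr unsigned int kScheduleSpin         = 0x01;  // cudaDeviceScheduleSpin
constexpr unsigned int kScheduleYield        = 0x02;  // cudaDeviceScheduleYield
constexpr unsigned int kScheduleBlockingSync = 0x04;  // cudaDeviceScheduleBlockingSync
constexpr unsigned int kScheduleMask         = 0x07;  // the three LSBs of flags

// Names the scheduling policy encoded in the three LSBs of a flags word.
// Any other bit pattern is reported as not one of the documented values.
std::string schedulingPolicy(unsigned int flags) {
    switch (flags & kScheduleMask) {
        case kScheduleAuto:         return "auto";
        case kScheduleSpin:         return "spin";
        case kScheduleYield:        return "yield";
        case kScheduleBlockingSync: return "blocking-sync";
        default:                    return "invalid";
    }
}

// Models only the documented C > P rule: yield when there are more active
// CUDA contexts than logical processors, spin otherwise. The Tegra
// power-profile heuristic is not modeled.
bool autoPolicyYields(int activeContexts, int logicalProcessors) {
    assert(activeContexts >= 0 && logicalProcessors > 0);
    return activeContexts > logicalProcessors;
}
```

Masking before comparing matters because higher bits, such as the one used by cudaDeviceMapHost, may be set in the same word without changing the scheduling policy.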

Note:

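Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.

See also: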

cudaGetDeviceFlags, cudaGetDeviceCount, cudaGetDevice, cudaGetDeviceProperties, cudaSetDevice, cudaSetValidDevices, cudaInitDevice, cudaChooseDevice, cuDevicePrimaryCtxSetFlags

__host__ cudaError_t cudaSetValidDevices ( int* device_arr, int len )

Sets a list of devices that can be used for CUDA.

Parameters

device_arr: - List of devices to try
len: - Number of devices in specified list

cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevice

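Description

Sets a list of devices for CUDA execution in priority order using device_arr. The parameter len specifies the number of elements in the list. CUDA tries devices from the list sequentially until it finds one that works. If this function is not called, or if it is called with a len of 0, CUDA goes back to its default behavior of trying devices sequentially from a default list containing all of the available CUDA devices in the system. If a specified device ID in the list does not exist, this function returns cudaErrorInvalidDevice. If len is not 0 and device_arr is NULL, or if len exceeds the number of devices in the system, cudaErrorInvalidValue is returned.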
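
The argument rules in the paragraph above can be summarized as a small host-only check. This sketch models only what the description states and is not the runtime's implementation: ValidDevicesResult and checkValidDevicesArgs are hypothetical names, deviceCount is supplied by the caller (in practice it would come from cudaGetDeviceCount), and both the order of the checks and the treatment of a negative len are assumptions.

```cpp
#include <cassert>

// Hypothetical result type standing in for the three documented outcomes.
enum class ValidDevicesResult { Success, InvalidValue, InvalidDevice };

// Applies the documented argument rules to a candidate device list.
ValidDevicesResult checkValidDevicesArgs(const int* device_arr, int len,
                                         int deviceCount) {
    assert(deviceCount >= 0);
    if (len == 0) {
        // A len of 0 restores the default device list.
        return ValidDevicesResult::Success;
    }
    // Assumption: a negative len is treated like any other invalid value.
    if (len < 0 || device_arr == nullptr || len > deviceCount) {
        return ValidDevicesResult::InvalidValue;
    }
    for (int i = 0; i < len; ++i) {
        // A device ID outside 0..deviceCount-1 does not exist.
        if (device_arr[i] < 0 || device_arr[i] >= deviceCount) {
            return ValidDevicesResult::InvalidDevice;
        }
    }
    return ValidDevicesResult::Success;
}
```

With two devices in the system, the list {1, 0} passes, a NULL list with a nonzero len is an invalid value, and the list {5} names a device that does not exist.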

Note:

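Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also return cudaErrorInitializationError, cudaErrorInsufficientDriver or cudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that, as specified by cudaStreamAddCallback, no CUDA function may be called from a callback. cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such a case.

See also: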

cudaGetDeviceCount, cudaSetDevice, cudaGetDeviceProperties, cudaSetDeviceFlags, cudaChooseDevice