nvrtc

NVRTC库的用户指南。

1. 简介

NVRTC是CUDA C++的运行时编译库。它接收字符串形式的CUDA C++源代码，并创建可用于获取PTX的句柄。由NVRTC生成的PTX字符串可通过cuModuleLoadData和cuModuleLoadDataEx加载，并能使用nvJitLink库或CUDA驱动API的cuLinkAddData与其他模块链接。这一功能通常能提供纯离线静态编译无法实现的优化和性能提升。

在没有NVRTC（或CUDA中任何运行时编译支持）的情况下，如果用户希望在其应用程序或库中实现运行时编译，他们需要生成一个单独的进程来在运行时执行nvcc。不幸的是，这种方法存在以下缺点：

编译开销往往比必要的要高。
终端用户需要安装nvcc及相关工具，这使得分发使用运行时编译的应用程序变得复杂。

NVRTC通过提供一个库接口来解决这些问题，该接口消除了与启动独立进程、磁盘I/O等相关联的开销，同时保持应用程序部署的简单性。

2. 快速入门

2.1. 系统要求

NVRTC 支持以下平台：Linux x86_64、Linux ppc64le、Linux aarch64、Windows x86_64。

注意: NVRTC不依赖于CUDA工具包中的任何其他库或头文件，可以在没有GPU的系统上运行。

2.2. 安装

NVRTC是CUDA工具包发布的一部分，其组件在CUDA工具包安装目录中的组织结构如下：

在Windows上：
- include\nvrtc.h
- bin\nvrtc64_Major Release Version_0.dll
- bin\nvrtc-builtins64_Major Release VersionMinor Release Version.dll
- lib\x64\nvrtc.lib
- lib\x64\nvrtc_static.lib
- lib\x64\nvrtc-builtins_static.lib
- doc\pdf\NVRTC_User_Guide.pdf
在Linux系统上：
- include/nvrtc.h
- lib64/libnvrtc.so
- lib64/libnvrtc.so.Major Release Version
- lib64/libnvrtc.so.Major Release Version.Minor Release Version. version>
- lib64/libnvrtc-builtins.so
- lib64/libnvrtc-builtins.so.Major Release Version.Minor Release Version
- lib64/libnvrtc-builtins.so.Major Release Version.Minor Release Version. version>
- lib64/libnvrtc_static.a
- lib64/libnvrtc-builtins_static.a
- doc/pdf/NVRTC_User_Guide.pdf

3. 用户界面

本章介绍NVRTC的API接口。基本用法部分会讲解API的基础使用方法。

错误处理

通用信息查询

编译

支持的编译选项

预编译头文件(PCH) (CUDA 12.8+)

主机助手

3.1. 错误处理

NVRTC定义了以下枚举类型和函数用于API调用错误处理。

枚举

nvrtcResult: 枚举类型nvrtcResult定义了API调用的结果代码。

Functions

const char * nvrtcGetErrorString(nvrtcResult result): nvrtcGetErrorString 是一个辅助函数，用于返回描述给定 nvrtcResult 代码的字符串，例如将 NVRTC_SUCCESS 转换为 "NVRTC_SUCCESS"。

3.1.1. 枚举

enum nvrtcResult

枚举类型nvrtcResult定义了API调用的结果代码。

NVRTC API函数返回nvrtcResult以指示调用结果。

取值：

enumerator NVRTC_SUCCESS

enumerator NVRTC_ERROR_OUT_OF_MEMORY

enumerator NVRTC_ERROR_PROGRAM_CREATION_FAILURE

enumerator NVRTC_ERROR_INVALID_INPUT

enumerator NVRTC_ERROR_INVALID_PROGRAM

enumerator NVRTC_ERROR_INVALID_OPTION

enumerator NVRTC_ERROR_COMPILATION

enumerator NVRTC_ERROR_BUILTIN_OPERATION_FAILURE

enumerator NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION

enumerator NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION

enumerator NVRTC_ERROR_NAME_EXPRESSION_NOT_VALID

enumerator NVRTC_ERROR_INTERNAL_ERROR

enumerator NVRTC_ERROR_TIME_FILE_WRITE_FAILED

enumerator NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED

enumerator NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED

enumerator NVRTC_ERROR_PCH_CREATE

enumerator NVRTC_ERROR_CANCELLED

3.1.2. 功能

const char *nvrtcGetErrorString(nvrtcResult result)

nvrtcGetErrorString 是一个辅助函数，用于返回描述给定 nvrtcResult 代码的字符串，例如将 NVRTC_SUCCESS 转换为 "NVRTC_SUCCESS"。

对于无法识别的枚举值，返回"NVRTC_ERROR unknown"。

Parameters: result – [in] CUDA运行时编译API的结果代码。
Returns: 给定nvrtcResult代码对应的消息字符串。

3.2. 通用信息查询

NVRTC定义了以下函数用于通用信息查询。

Functions

nvrtcResult nvrtcGetNumSupportedArchs(int *numArchs): nvrtcGetNumSupportedArchs 将输出参数 numArchs 设置为 NVRTC 支持的架构数量。
nvrtcResult nvrtcGetSupportedArchs(int *supportedArchs): nvrtcGetSupportedArchs 通过输出参数 supportedArchs 填充数组，该数组包含 NVRTC 支持的架构。
nvrtcResult nvrtcVersion(int *major, int *minor): nvrtcVersion 将使用 CUDA 运行时编译版本号设置输出参数 major 和 minor。

3.2.1. 功能

nvrtcResult nvrtcGetNumSupportedArchs(int *numArchs)

nvrtcGetNumSupportedArchs 将输出参数 numArchs 设置为 NVRTC 支持的架构数量。

这可以用于向nvrtcGetSupportedArchs传递数组以获取支持的架构。

参见 nvrtcGetSupportedArchs

Parameters

numArchs – [输出] 支持的架构数量。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT

nvrtcResult nvrtcGetSupportedArchs(int *supportedArchs)

nvrtcGetSupportedArchs 通过输出参数 supportedArchs 填充数组，该数组包含 NVRTC 支持的架构。

数组按升序排列。要传递的数组大小可以使用nvrtcGetNumSupportedArchs确定。

参见 nvrtcGetNumSupportedArchs

Parameters

supportedArchs – [out] 已排序的支持架构数组。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT

nvrtcResult nvrtcVersion(int *major, int *minor)

nvrtcVersion 将使用CUDA运行时编译版本号设置输出参数 major 和 minor。

Parameters

major – [out] CUDA运行时编译的主版本号。
minor – [out] CUDA运行时编译的次版本号。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT

3.3. 编译

NVRTC定义了以下类型和函数用于实际编译。

Functions

nvrtcResult nvrtcAddNameExpression(nvrtcProgram prog, const char *const name_expression): nvrtcAddNameExpression 记录给定的名称表达式，该表达式表示全局函数或设备/__constant__变量的地址。
nvrtcResult nvrtcCompileProgram(nvrtcProgram prog, int numOptions, const char *const *options): nvrtcCompileProgram 编译给定的程序。
nvrtcResult nvrtcCreateProgram(nvrtcProgram *prog, const char *src, const char *name, int numHeaders, const char *const *headers, const char *const *includeNames): nvrtcCreateProgram 使用给定的输入参数创建一个 nvrtcProgram 实例，并通过输出参数 prog 返回该实例。
nvrtcResult nvrtcDestroyProgram(nvrtcProgram *prog): nvrtcDestroyProgram 销毁给定的程序。
nvrtcResult nvrtcGetCUBIN(nvrtcProgram prog, char *cubin): nvrtcGetCUBIN 将之前编译 prog 生成的 cubin 存储在 cubin 所指向的内存中。
nvrtcResult nvrtcGetCUBINSize(nvrtcProgram prog, size_t *cubinSizeRet): nvrtcGetCUBINSize 将 cubinSizeRet 的值设置为由先前编译 prog 生成的 cubin 的大小。
nvrtcResult nvrtcGetLTOIR(nvrtcProgram prog, char *LTOIR): nvrtcGetLTOIR 将之前编译 prog 生成的 LTO IR 存储在 LTOIR 所指向的内存中。
nvrtcResult nvrtcGetLTOIRSize(nvrtcProgram prog, size_t *LTOIRSizeRet): nvrtcGetLTOIRSize 将 LTOIRSizeRet 的值设置为由先前编译 prog 生成的 LTO IR 的大小。
nvrtcResult nvrtcGetLoweredName(nvrtcProgram prog, const char *const name_expression, const char **lowered_name): nvrtcGetLoweredName 用于提取全局函数或设备/__constant__变量的降级（混淆）名称，并将*lowered_name更新为指向该名称。
nvrtcResult nvrtcGetNVVM(nvrtcProgram prog, char *nvvm): 弃用通知：此函数将在未来版本中移除。
nvrtcResult nvrtcGetNVVMSize(nvrtcProgram prog, size_t *nvvmSizeRet): 弃用通知：此函数将在未来版本中移除。
nvrtcResult nvrtcGetOptiXIR(nvrtcProgram prog, char *optixir): nvrtcGetOptiXIR 将之前编译 prog 生成的 OptiX IR 存储在 optixir 所指向的内存中。
nvrtcResult nvrtcGetOptiXIRSize(nvrtcProgram prog, size_t *optixirSizeRet): nvrtcGetOptiXIRSize 将 optixirSizeRet 的值设置为由先前编译 prog 生成的 OptiX IR 的大小。
nvrtcResult nvrtcGetPTX(nvrtcProgram prog, char *ptx): nvrtcGetPTX 将之前由 prog 编译生成的 PTX 存储在 ptx 所指向的内存中。
nvrtcResult nvrtcGetPTXSize(nvrtcProgram prog, size_t *ptxSizeRet): nvrtcGetPTXSize 将 ptxSizeRet 的值设置为由先前编译 prog 生成的 PTX 的大小（包括结尾的 NULL）。
nvrtcResult nvrtcGetProgramLog(nvrtcProgram prog, char *log): nvrtcGetProgramLog 将之前编译 prog 生成的日志存储在 log 所指向的内存中。
nvrtcResult nvrtcGetProgramLogSize(nvrtcProgram prog, size_t *logSizeRet): nvrtcGetProgramLogSize 将 logSizeRet 设置为由先前编译 prog 生成的日志大小（包括结尾的 NULL）。
nvrtcResult nvrtcSetFlowCallback(nvrtcProgram prog, int(*callback)(void *, void *), void *payload): nvrtcSetFlowCallback 注册一个回调函数，编译器在调用 nvrtcCompileProgram 期间会在不同节点触发该回调，回调函数可以通过返回特定值来决定是否取消编译。

类型定义

nvrtcProgram: nvrtcProgram是编译的基本单位，也是一个程序的不透明句柄。

3.3.1. 功能

nvrtcResult nvrtcAddNameExpression(nvrtcProgram prog, const char *const name_expression)

nvrtcAddNameExpression 记录给定的名称表达式，该表达式表示全局函数或设备/__constant__变量的地址。

在后续调用nvrtcGetLoweredName以提取降级名称时，必须提供相同的名称表达式字符串。

另请参阅

nvrtcGetLoweredName

Parameters

prog – [输入] CUDA运行时编译程序。
name_expression – [输入] 表示全局函数或设备/__constant__变量地址的常量表达式。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION

nvrtcResult nvrtcCompileProgram(nvrtcProgram prog, int numOptions, const char *const *options)

nvrtcCompileProgram 编译给定的程序。

它支持支持的编译选项中列出的编译选项。

Parameters

prog – [输入] CUDA运行时编译程序。
numOptions – [in] 传入的编译器选项数量。
options – [输入] 以C字符串数组形式表示的编译器选项。当numOptions为0时，options可以为NULL。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_OUT_OF_MEMORY
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_OPTION
NVRTC_ERROR_COMPILATION
NVRTC_ERROR_BUILTIN_OPERATION_FAILURE
NVRTC_ERROR_TIME_FILE_WRITE_FAILED
NVRTC_ERROR_CANCELLED

nvrtcResult nvrtcCreateProgram(nvrtcProgram *prog, const char *src, const char *name, int numHeaders, const char *const *headers, const char *const *includeNames)

nvrtcCreateProgram 使用给定的输入参数创建一个 nvrtcProgram 实例，并通过输出参数 prog 返回该实例。

另请参阅

nvrtcDestroyProgram

Parameters

prog – [out] CUDA运行时编译程序。
src – [in] CUDA程序源代码。
name – [in] CUDA程序名称。name可以是NULL；当name为NULL或空字符串时，将使用"default_program"。
numHeaders – [输入] 使用的头部数量。numHeaders 必须大于或等于0。
headers – [输入] 头文件的来源。当numHeaders为0时，headers可以为NULL。
includeNames – [输入] 每个头文件的名称，通过这些名称可以在CUDA程序源代码中包含它们。当numHeaders为0时，includeNames可以为NULL。这些头文件必须使用此处指定的确切名称包含。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_OUT_OF_MEMORY
NVRTC_ERROR_PROGRAM_CREATION_FAILURE
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcDestroyProgram(nvrtcProgram *prog)

nvrtcDestroyProgram 销毁给定的程序。

另请参阅

nvrtcCreateProgram

Parameters

prog – [in] CUDA运行时编译程序。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetCUBIN(nvrtcProgram prog, char *cubin)

nvrtcGetCUBIN 将之前由 prog 编译生成的 cubin 存储在 cubin 所指向的内存中。

如果为-arch指定的值是虚拟架构而非实际架构，则无法获取cubin文件。

另请参阅

nvrtcGetCUBINSize

Parameters

prog – [输入] CUDA运行时编译程序。
cubin – [out] 编译和汇编后的结果。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetCUBINSize(nvrtcProgram prog, size_t *cubinSizeRet)

nvrtcGetCUBINSize 将 cubinSizeRet 的值设置为由先前编译 prog 生成的 cubin 的大小。

如果指定给-arch的值是虚拟架构而非实际架构，则cubinSizeRet的值将被设为0。

另请参阅

nvrtcGetCUBIN

Parameters

prog – [输入] CUDA运行时编译程序。
cubinSizeRet – [out] 生成的cubin文件大小。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetLTOIR(nvrtcProgram prog, char *LTOIR)

nvrtcGetLTOIR 将之前编译 prog 生成的 LTO IR 存储在 LTOIR 所指向的内存中。

如果程序编译时未使用-dlto选项，则无法获取LTO IR。

另请参阅

nvrtcGetLTOIRSize

Parameters

prog – [输入] CUDA运行时编译程序。
LTOIR – [输出] 编译结果。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetLTOIRSize(nvrtcProgram prog, size_t *LTOIRSizeRet)

nvrtcGetLTOIRSize 将 LTOIRSizeRet 的值设置为由先前编译 prog 生成的 LTO IR 的大小。

如果程序未使用-dlto编译，则LTOIRSizeRet的值将被设为0。

另请参阅

nvrtcGetLTOIR

Parameters

prog – [输入] CUDA运行时编译程序。
LTOIRSizeRet – [out] 生成的LTO IR的大小。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetLoweredName(nvrtcProgram prog, const char *const name_expression, const char **lowered_name)

nvrtcGetLoweredName用于提取全局函数或设备/__constant__变量的降级（混淆）名称，并将*lowered_name更新为指向该名称。

当NVRTC程序被nvrtcDestroyProgram销毁时，包含名称的内存会被释放。相同的名称表达式必须事先通过nvrtcAddNameExpression提供。

另请参阅

nvrtcAddNameExpression

Parameters

prog – [输入] CUDA运行时编译程序。
name_expression – [输入] 表示全局函数或设备/__constant__变量地址的常量表达式。
lowered_name – [out] 由函数初始化，指向一个包含与提供的名称表达式对应的降级（重整）名称的C字符串。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_NAME_EXPRESSION_NOT_VALID

nvrtcResult nvrtcGetNVVM(nvrtcProgram prog, char *nvvm)

弃用通知：此函数将在未来版本中移除。

请改用 nvrtcGetLTOIR（以及 nvrtcGetLTOIRSize）。

nvrtcResult nvrtcGetNVVMSize(nvrtcProgram prog, size_t *nvvmSizeRet)

弃用通知：此函数将在未来版本中移除。

请改用 nvrtcGetLTOIRSize（以及 nvrtcGetLTOIR）。

nvrtcResult nvrtcGetOptiXIR(nvrtcProgram prog, char *optixir)

nvrtcGetOptiXIR 将之前编译 prog 生成的 OptiX IR 存储在 optixir 指针所指向的内存中。

如果程序编译时使用了与OptiX IR生成不兼容的选项，则无法获得OptiX IR。

另请参阅

nvrtcGetOptiXIRSize

Parameters

prog – [输入] CUDA运行时编译程序。
optixir – [out] Optix IR 编译结果。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetOptiXIRSize(nvrtcProgram prog, size_t *optixirSizeRet)

nvrtcGetOptiXIRSize 将 optixirSizeRet 的值设置为由先前编译 prog 生成的 OptiX IR 的大小。

如果程序编译时使用了与OptiX IR生成不兼容的选项，则nvrtcGetOptiXIRSize的值将被设为0。

另请参阅

nvrtcGetOptiXIR

Parameters

prog – [输入] CUDA运行时编译程序。
optixirSizeRet – [out] 生成的LTO IR的大小。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetPTX(nvrtcProgram prog, char *ptx)

nvrtcGetPTX 将之前由 prog 编译生成的 PTX 存储在 ptx 所指向的内存中。

另请参阅

nvrtcGetPTXSize

Parameters

prog – [输入] CUDA运行时编译程序。
ptx – [out] 编译结果。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetPTXSize(nvrtcProgram prog, size_t *ptxSizeRet)

nvrtcGetPTXSize 将 ptxSizeRet 的值设置为由先前编译 prog 生成的 PTX 的大小（包括结尾的 NULL）。

另请参阅

nvrtcGetPTX

Parameters

prog – [输入] CUDA运行时编译程序。
ptxSizeRet – [输出] 生成的PTX大小（包含结尾的NULL）。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetProgramLog(nvrtcProgram prog, char *log)

nvrtcGetProgramLog 将之前编译 prog 生成的日志存储在 log 所指向的内存中。

另请参阅

nvrtcGetProgramLogSize

Parameters

prog – [输入] CUDA运行时编译程序。
log – [out] 编译日志。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetProgramLogSize(nvrtcProgram prog, size_t *logSizeRet)

nvrtcGetProgramLogSize 将 logSizeRet 设置为由先前编译 prog 生成的日志大小（包括结尾的 NULL）。

请注意，即使prog编译成功，编译日志仍可能包含警告和信息性消息。

另请参阅

nvrtcGetProgramLog

Parameters

prog – [输入] CUDA运行时编译程序。
logSizeRet – [out] 编译日志的大小（包括结尾的NULL）。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcSetFlowCallback(nvrtcProgram prog, int (*callback)(void*, void*), void *payload)

nvrtcSetFlowCallback 注册一个回调函数，编译器在调用 nvrtcCompileProgram 期间会在不同节点触发该回调，回调函数可以通过返回特定值来决定是否取消编译。

回调函数必须满足以下约束条件：

(1) 其函数签名应为：

int callback(void* param1, void* param2);

When invoking the callback, the compiler will always pass payload to param1 so that the callback may make decisions based on payload . It’ll always pass NULL to param2 for now which is reserved for future extensions.

(2) 它必须返回1以取消编译，或返回0以继续。其他返回值保留供未来使用。

(3) 它必须返回一致的值。一旦在某一点返回1，在当前正在进行的nvrtcCompileProgram调用期间，后续所有调用都必须返回1。

(4) 它必须是线程安全的。

(5) 它不得调用任何nvrtc/libnvvm/ptx API。

Parameters

prog – [输入] CUDA运行时编译程序。
callback – [in] 发出取消信号的回调函数。
payload – [in] 调用回调函数时作为参数传递的数据。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_INPUT

3.3.2. 类型定义

typedef struct _nvrtcProgram *nvrtcProgram

nvrtcProgram是编译的基本单位，也是程序的不透明句柄。

要编译一个CUDA程序字符串，首先需要使用nvrtcCreateProgram创建一个nvrtcProgram实例，然后通过nvrtcCompileProgram进行编译。

3.4. 支持的编译选项

NVRTC支持以下编译选项。

带有两个前导短横线(--)的选项名称是长选项名，带有一个前导短横线(-)的是短选项名。短选项名可以替代长选项名使用。当编译选项需要参数时，使用赋值运算符(=)将编译选项参数与选项名称分隔开，例如"--gpu-architecture=compute_60"。或者，编译选项名称和参数可以在不使用赋值运算符的情况下分别指定为单独的字符串，例如"--gpu-architecture" "compute_60"。单字符的短选项名，如-D、-U和-I，不需要赋值运算符，编译选项名称和参数可以出现在同一个字符串中，无论它们之间是否有空格。例如"-D="、"-D"和"-D "都是支持的格式。

有效的编译器选项包括：

编译目标
- --gpu-architecture= (-arch)
  
  指定输入必须编译的GPU架构类别的名称。
  - 有效的架构：
    - compute_50
    - compute_52
    - compute_53
    - compute_60
    - compute_61
    - compute_62
    - compute_70
    - compute_72
    - compute_75
    - compute_80
    - compute_87
    - compute_89
    - compute_90
    - compute_90a
    - compute_100
    - compute_100a
    - sm_50
    - sm_52
    - sm_53
    - sm_60
    - sm_61
    - sm_62
    - sm_70
    - sm_72
    - sm_75
    - sm_80
    - sm_87
    - sm_89
    - sm_90
    - sm_90a
    - sm_100
    - sm_100a
  - 默认值: compute_52
独立编译/全程序编译
- --device-c (-dc)
  
  生成可重定位代码，可与其他可重定位设备代码链接。等同于--relocatable-device-code=true。
- --device-w (-dw)
  
  生成不可重定位的代码。等同于 --relocatable-device-code=false。
- --relocatable-device-code={true|false} (-rdc)
  
  启用（禁用）可重定位设备代码的生成。
  - 默认值: false
- --extensible-whole-program (-ewp)
  
  对设备代码进行可扩展的全局程序编译。
  - 默认值: false
调试支持
- --device-debug (-G)
  
  生成调试信息。如果未指定--dopt，则会关闭所有优化。
- --generate-line-info (-lineinfo)
  
  生成行号信息。
代码生成
- --dopt on (-dopt)
- --dopt=on
  
  启用设备代码优化。当与-G一起指定时，可为优化后的设备代码生成有限的调试信息（目前仅包含行号信息）。当未指定-G时，-dopt=on是默认启用的。
- --ptxas-options (-Xptxas)
- --ptxas-options=
  
  直接向PTX优化汇编器ptxas指定选项。
- --maxrregcount= (-maxrregcount)
  
  指定GPU函数可使用的最大寄存器数量。在达到函数特定限制之前，较高的值通常会提升执行该函数的单个GPU线程的性能。但由于线程寄存器是从GPU全局寄存器池中分配的，此选项值越高，也会降低最大线程块大小，从而减少线程并行度。因此，最佳的maxrregcount值需要权衡取舍。若未指定此选项，则不设上限。低于ABI所需最小寄存器数的值将被编译器提升至ABI最低限制。
- --ftz={true|false} (-ftz)
  
  在执行单精度浮点运算时，将非规格化值刷新为零或保留非规格化值。
  
  --use_fast_math 意味着 --ftz=true。
  - 默认值: false
- --prec-sqrt={true|false} (-prec-sqrt)
  
  对于单精度浮点数平方根计算，可使用IEEE四舍五入模式或采用更快速的近似算法。--use_fast_math选项隐含--prec-sqrt=false参数设置。
  - 默认值: true
- --prec-div={true|false} (-prec-div) 对于单精度浮点除法和倒数运算，可选择使用IEEE四舍五入模式或更快速的近似计算方式。--use_fast_math 隐含设置 --prec-div=false。
  - 默认值: true
- --fmad={true|false} (-fmad)
  
  启用（禁用）将浮点乘法和加法/减法运算合并为浮点乘加运算（FMAD、FFMA或DFMA）。--use_fast_math隐含--fmad=true。
  - 默认值: true
- --use_fast_math (-use_fast_math)
  
  启用快速数学运算。--use_fast_math 隐含设置了 --ftz=true --prec-div=false --prec-sqrt=false --fmad=true。
- --extra-device-vectorization (-extra-device-vectorization)
  
  在NVVM优化器中启用更激进的设备代码向量化。
- --modify-stack-limit={true|false} (-modify-stack-limit)
  
  在Linux系统上，编译期间使用setrlimit()将堆栈大小增加到允许的最大值。编译完成后，限制会被重置为之前的值。注意：setrlimit()会改变整个进程的值。
  - 默认值: true
- --dlink-time-opt (-dlto)
  
  生成用于后续链接时优化的中间代码。该选项隐含-rdc=true。注意：使用此选项时应调用nvrtcGetLTOIR API，因为此时不会生成PTX或Cubin。
- --gen-opt-lto (-gen-opt-lto)
  
  在生成LTO IR之前运行优化器传递。
- --optix-ir (-optix-ir)
  
  生成OptiX中间表示(IR)。OptiX IR仅适用于通过适当的API被OptiX使用。此功能不支持链接时优化(-dlto)。
  
  注意：使用此选项时应调用nvrtcGetOptiX API，因为不会生成PTX或Cubin。
- --jump-table-density=[0-101] (-jtd)
  
  指定switch语句中的分支密度百分比，并以此作为判断是否使用跳转表(brx.idx指令)来实现switch语句的最小阈值。默认值为101。该百分比范围包含0到101。
- --device-stack-protector={true|false} (-device-stack-protector)
  
  启用（禁用）在设备代码中生成堆栈保护区的功能。
  - 默认值: false
预处理
- --define-macro= (-D)
  
  可以是或。
  - 将预定义为值为1的宏。
  - =
    
    The contents of are tokenized and preprocessed as if they appeared during translation phase three in a #define directive. In particular, the definition will be truncated by embedded new line characters.
- --undefine-macro= (-U)
  
  取消之前对的任何定义。
- --include-path=
  (-I)
  
  将目录
  添加到头文件搜索路径列表中。这些路径会在nvrtcCreateProgram提供的头文件列表之后进行搜索。
- --pre-include=
  (-include)
  
  在预处理阶段预先包含
  文件。
- --no-source-include (-no-source-include)
  
  默认情况下，预处理器会将每个输入源文件的目录添加到包含路径中。此选项禁用此功能，仅考虑显式指定的路径。
语言方言
- --std={c++03|c++11|c++14|c++17|c++20} (-std)
  
  将语言方言设置为C++03、C++11、C++14、C++17或C++20
  - 默认值: c++17
- --builtin-move-forward={true|false} (-builtin-move-forward)
  
  当选择C++11或更新的语言标准时，提供std::move和std::forward的内置定义。
  - 默认值: true
- --builtin-initializer-list={true|false} (-builtin-initializer-list)
  
  当选择C++11或更新的语言标准时，提供std::initializer_list类及其成员函数的内置定义。
  - 默认值: true
预编译头文件支持 (CUDA 12.8+)
- --pch (-pch)
  
  启用自动PCH处理。
- --create-pch= (-create-pch)
  
  创建一个PCH文件。
- --use-pch= (-use-pch)
  
  使用指定的PCH文件。
- --pch-dir= (-pch-dir)
  
  当使用自动PCH(-pch)时，在指定目录中查找并创建PCH文件。当使用显式PCH(-create-pch或-use-pch)时，目录名会被添加在指定文件名前，除非文件名是绝对路径。
- --pch-verbose={true|false} (-pch-verbose)
  
  在自动PCH模式下，对于当前编译中无法使用的每个PCH文件，在编译日志中打印原因。
  - 默认值: true
- --pch-messages={true|false} (-pch-messages)
  
  如果在当前编译过程中创建或使用了PCH文件，则在编译日志中打印一条消息。
  - 默认值: true
- --instantiate-templates-in-pch={true|false} (-instantiate-templates-in-pch)
  
  在创建PCH之前启用或禁用模板实例化。实例化模板可能会增加PCH文件的大小，但在使用PCH文件时能降低编译成本（因为可以跳过某些模板实例化）。
  - 默认值: true
杂项
- --disable-warnings (-w)
  
  禁止显示所有警告信息。
- --restrict (-restrict)
  
  程序员断言所有内核指针参数都是限制指针。
- --device-as-default-execution-space (-default-device)
  
  将没有执行空间注解的实体视为__device__实体。
- --device-int128 (-device-int128)
  
  允许在设备代码中使用__int128类型。同时会定义宏__CUDACC_RTC_INT128__。
- --device-float128 (-device-float128)
  
  允许在设备代码中使用__float128和_Float128类型。同时会定义宏D__CUDACC_RTC_FLOAT128__。
- --optimization-info= (-opt-info)
  
  为指定类型的优化提供优化报告。支持以下类型标签：
  - inline : 当函数被内联时发出提示。
- --display-error-number (-err-no)
  
  显示警告消息的诊断编号。(默认)
- --no-display-error-number (-no-err-no)
  
  禁用警告消息中诊断编号的显示。
- --diag-error=,… (-diag-error)
  
  为指定的诊断消息编号发出错误。消息编号可以用逗号分隔。
- --diag-suppress=,… (-diag-suppress)
  
  抑制指定的诊断消息编号。消息编号可以用逗号分隔。
- --diag-warn=,… (-diag-warn)
  
  对指定的诊断消息编号发出警告。消息编号可以用逗号分隔。
- --brief-diagnostics={true|false} (-brief-diag)
  
  此选项用于禁用或启用显示诊断信息中的源代码行和列信息。设置--brief-diagnostics=true时将不会显示源代码行和列信息。
  - 默认值: false
- --time= (-time)
  
  生成一个逗号分隔值表格，记录每个编译阶段所花费的时间，并将该表格追加到作为选项参数给出的文件末尾。如果文件不存在，表格的第一行将生成列标题。如果文件名为'-'，则计时数据将写入编译日志。
- --split-compile= (-split-compile=)
  
  并行执行编译器优化。拆分编译尝试通过使编译器能够并发运行某些优化过程来减少编译时间。此选项接受一个数值，用于指定编译器可以使用的最大线程数。也可以通过设置--split-compile=0允许编译器使用系统上可用的最大线程数。设置--split-compile=1将导致此选项被忽略。
- --fdevice-syntax-only (-fdevice-syntax-only)
  
  在前端语法检查后结束设备编译。此选项不会生成有效的设备代码。
- --minimal (-minimal)
  
  省略某些语言特性以减少小型程序的编译时间。具体来说，以下内容被省略：
  - 纹理和表面函数及相关类型，例如 cudaTextureObject_t。
  - 由cudadevrt设备代码库提供的CUDA运行时函数，通常以"cuda"作为前缀命名，例如cudaMalloc。
  - 从设备代码启动内核。
  - 与CUDA运行时和驱动程序API相关的类型和宏定义，由cuda/tools/cudart/driver_types.h提供，通常以"cuda"作为前缀命名，例如cudaError_t。
- --device-stack-protector (-device-stack-protector)
  
  在设备代码中启用栈保护机制。栈保护机制使得利用涉及栈局部变量的某些内存安全漏洞变得更加困难。编译器会使用启发式方法来评估每个函数中存在此类漏洞的风险。只有被判定为高风险的函数才会使用栈保护机制。
- --fdevice-time-trace= (-fdevice-time-trace=) 启用时间分析器，根据给定的输出JSON文件。结果可在chrome://tracing上进行分析，以获得火焰图可视化。

3.5. 预编译头文件(PCH) (CUDA 12.8+)

另请参阅传递给nvrtcCompileProgram的PCH相关标志。

Functions

nvrtcResult nvrtcGetPCHCreateStatus(nvrtcProgram prog): 返回PCH创建状态。
nvrtcResult nvrtcGetPCHHeapSize(size_t *ret): 获取PCH堆的当前大小。
nvrtcResult nvrtcGetPCHHeapSizeRequired(nvrtcProgram prog, size_t *size): 获取编译给定程序所需的PCH堆大小。
nvrtcResult nvrtcSetPCHHeapSize(size_t size): 设置PCH堆的大小。

3.5.1. 函数

nvrtcResult nvrtcGetPCHCreateStatus(nvrtcProgram prog)

返回PCH创建状态。

NVRTC_SUCCESS 表示PCH文件创建成功。NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED 表示未尝试创建PCH文件，可能是因为在前一次nvrtcCompileProgram调用中未请求PCH功能，或者请求了自动PCH处理但编译器选择不创建PCH文件。NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED 表示虽然有可能创建PCH文件，但编译器耗尽了PCH堆空间。在这种情况下，可以使用nvrtcGetPCHHeapSizeRequired()查询所需的堆大小，通过nvrtcSetPCHHeapSize()重新分配该大小的堆空间，然后使用新的NVRTC程序实例再次调用nvrtcCompileProgram()重新尝试PCH创建。NVRTC_ERROR_PCH_CREATE 表示存在错误条件导致无法创建PCH文件。

Parameters

prog – [in] CUDA运行时编译程序。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED
NVRTC_ERROR_PCH_CREATE
NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetPCHHeapSize(size_t *ret)

获取PCH堆的当前大小。

Parameters

ret – [输出] 指向存储PCH堆大小的内存位置的指针

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT

nvrtcResult nvrtcGetPCHHeapSizeRequired(nvrtcProgram prog, size_t *size)

检索编译给定程序所需的PCH堆大小。

Parameters

prog – [输入] CUDA运行时编译程序。
size – [out] 指向存储PCH堆所需大小的位置的指针

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_INPUT 通过此函数获取的大小仅在nvrtcGetPCHCreateStatus()返回NVRTC_SUCCESS或NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED时有效

nvrtcResult nvrtcSetPCHHeapSize(size_t size)

设置PCH堆的大小。

请求的大小可能会向上舍入到与平台相关的对齐边界（例如页面大小）。如果PCH堆已经分配，堆内存将被释放并重新分配一个新的PCH堆。

Parameters

size – [in] 请求的PCH堆大小，以字节为单位

Returns

NVRTC_SUCCESS

3.6. 主机辅助工具

NVRTC定义了以下函数，以便更轻松地与主机代码交互。

Functions

nvrtcResult nvrtcGetTypeName(const std::type_info &tinfo, std::string *result): nvrtcGetTypeName 将类型的源级名称存储在给定的 std::string 位置中。
nvrtcResult nvrtcGetTypeName(std::string *result): nvrtcGetTypeName 将模板类型参数 T 的源码级名称存储到给定的 std::string 位置中。

3.6.1. 函数

inline nvrtcResult nvrtcGetTypeName(const std::type_info &tinfo, std::string *result)

nvrtcGetTypeName 将类型的源级名称存储在给定的 std::string 位置中。

此函数仅在宏NVRTC_GET_TYPE_NAME被定义为非零值时提供。它分别在使用gcc/clang或cl.exe编译器时，通过调用abi::__cxa_demangle或UnDecorateSymbolName函数来提取类型名称。如果名称提取失败，将返回NVRTC_INTERNAL_ERROR，否则会用提取的名称初始化*result。

Windows系统特别说明：

nvrtcGetTypeName() 不是线程安全的，因为它调用了非线程安全的 UnDecorateSymbolName()。
返回的字符串可能包含微软特定的关键字，例如__ptr64和__cdecl。

Parameters

tinfo – [in] 对给定类型的std::type_info对象的引用。
result – [in] 指向std::string的指针，用于存储类型名称。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INTERNAL_ERROR

template<typename T> nvrtcResult nvrtcGetTypeName(std::string *result)

nvrtcGetTypeName 将模板类型参数 T 的源级名称存储在给定的 std::string 位置中。

此函数仅在宏NVRTC_GET_TYPE_NAME被定义为非零值时提供。它分别在使用gcc/clang或cl.exe编译器时，通过调用abi::__cxa_demangle或UnDecorateSymbolName函数来提取类型名称。如果名称提取失败，将返回NVRTC_INTERNAL_ERROR，否则*result会被初始化为提取的名称。

Windows系统特别说明：

nvrtcGetTypeName() 不是线程安全的，因为它调用了非线程安全的 UnDecorateSymbolName() 函数。
返回的字符串可能包含微软特定的关键字，例如__ptr64和__cdecl。

Parameters

result – [in] 指向std::string的指针，用于存储类型名称。

Returns

NVRTC_SUCCESS
NVRTC_ERROR_INTERNAL_ERROR

4. 语言

与离线版nvcc编译器不同，NVRTC专用于编译设备端CUDA C++代码。除非特别说明，它不接受输入代码中的主机端代码或主机编译器扩展。

4.1. 执行空间

NVRTC默认使用__host__作为执行空间，当在输入中遇到任何主机端代码时会生成错误。也就是说，如果输入包含带有显式__host__注解或没有执行空间注解的实体，NVRTC将发出错误。__host__ __device__函数会被视为设备端函数。

NVRTC提供了一个编译选项--device-as-default-execution-space（参见支持的编译选项），该选项启用了一种替代编译模式，在此模式下，没有执行空间注解的实体将被视为__device__ entities。

4.2. 独立编译

NVRTC本身不提供任何链接器。不过，用户可以使用nvJitLink库或CUDA Driver API中的cuLinkAddData将生成的可重定位PTX代码与其他可重定位代码进行链接。要生成可重定位PTX代码，需要编译选项--relocatable-device-code=true或--device-c。

4.3. 动态并行

NVRTC在以下条件下支持动态并行：

编译目标必须为compute 35或更高版本。
必须启用单独编译（--relocatable-device-code=true 或 --device-c）或可扩展的整个程序编译（--extensible-whole-program）。
生成的PTX必须链接到CUDA设备运行时(cudadevrt)库(参考Separate Compilation)。

示例：Dynamic Parallelism提供了一个简单的例子。

4.4. 整数大小

不同的操作系统对整数类型大小的定义有所不同。 Linux x86_64系统采用LP64标准，而Windows x86_64系统则采用LLP64标准。

表1. LLP64与LP64架构下的整数位宽
	`short`	`int`	`long`	`long long`	指针和`size_t`
LLP64	16	32	32	64	64
LP64	16	32	64	64	64

NVRTC在Linux上实现LP64，在Windows上实现LLP64。

NVRTC通过__int128类型支持128位整数类型。可以使用--device-int128标志启用此功能。Windows系统不支持128位整数。

4.5. 包含语法

当调用nvrtcCompileProgram()时，当前工作目录会被添加到头文件搜索路径中，用于定位使用引号语法包含的文件（例如#include "foo.h"），这一操作发生在代码编译之前。

4.6. 预定义宏

__CUDACC_RTC__: 用于在用户代码中区分运行时和离线nvcc编译。
__CUDACC__: 定义语义与离线nvcc编译相同。
__CUDACC_RDC__: 定义语义与离线nvcc编译相同。
__CUDACC_EWP__: 定义语义与离线nvcc编译相同。
__CUDACC_DEBUG__: 定义与离线nvcc编译具有相同的语义。
__CUDA_ARCH__: 定义语义与离线nvcc编译时相同。
__CUDA_ARCH_LIST__: 定义语义与离线nvcc编译相同。
__CUDACC_VER_MAJOR__: 定义为由nvrtcVersion返回的主版本号。
__CUDACC_VER_MINOR__: 定义为nvrtcVersion返回的次版本号。
__CUDACC_VER_BUILD__: 定义为构建版本号。
__NVCC_DIAG_PRAGMA_SUPPORT__: 定义语义与离线nvcc编译相同。
__CUDACC_RTC_INT128__: 当编译时指定了-device-int128标志时定义，表示支持__int128类型。
NULL: 空指针常量。
va_start
va_end
va_arg
va_copy : 当选用的语言方言为C++11或更新版本时定义。
__cplusplus
_WIN64 : 在Windows平台上定义。
__LP64__ : 在非Windows平台上定义，其中long int和指针类型为64位。
__cdecl : 在所有平台上定义为空。
__ptr64 : 在Windows平台上定义为空。
__CUDACC_RTC_MINIMAL__: 在编译时指定-minimal标志时定义（自CUDA 12.4起）。
在nv/target头文件中定义的宏是隐式提供的，例如NV_IF_TARGET。
__CUDACC_DEVICE_ATOMIC_BUILTINS__: 当支持设备原子编译器内置函数时定义。更多详情请参阅CUDA C++编程指南。

4.7. 预定义类型

clock_t
size_t
ptrdiff_t
va_list: 请注意，此类型的定义可能与nvcc编译CUDA代码时选择的定义不同。
预定义类型如dim3、char4等在离线使用nvcc编译时可从CUDA Runtime头文件中获取的类型也同样可用，除非另有说明。
std::initializer_list: 在C++11及后续版本中隐式提供，除非指定了-builtin-initializer-list=false。
std::move, std::forward: 在C++11及后续版本中隐式提供，除非指定了-builtin-move-forward=false。

4.8. 内置函数

除非另有说明，否则在离线编译时使用nvcc提供的CUDA运行时头文件中的内置函数均可使用。

4.9. 默认C++方言

默认的C++方言是C++17。可以使用-std标志选择其他方言。

5. 基础用法

本文档的这一部分使用一个简单示例——单精度α⋅X加Y（SAXPY），如图1所示，来解释使用NVRTC进行运行时编译所涉及的内容。为简洁和可读性起见，未展示对API返回值的错误检查。完整代码清单见示例：SAXPY。

图1. SAXPY的CUDA源代码字符串

const char *saxpy = "                                          \n\
extern \"C\" __global__                                        \n\
void saxpy(float a, float *x, float *y, float *out, size_t n)  \n\
{                                                              \n\
   size_t tid = blockIdx.x * blockDim.x + threadIdx.x;         \n\
   if (tid < n) {                                              \n\
      out[tid] = a * x[tid] + y[tid];                          \n\
   }                                                           \n\
}                                                              \n";

首先，需要创建一个nvrtcProgram实例。图2展示了为SAXPY创建nvrtcProgram的过程。由于SAXPY不需要任何头文件，因此将0作为numHeaders传递，并将NULL作为headers和includeNames传递。

图2. SAXPY的nvrtcProgram创建

nvrtcProgram prog;
nvrtcCreateProgram(&prog, // prog
        saxpy,         // buffer
        "saxpy.cu",    // name
        0,             // numHeaders
        NULL,          // headers
        NULL);         // includeNames

如果SAXPY包含任何#include指令，被包含文件的内容可以作为headers参数的元素传递，文件名则作为includeNames参数的元素。例如，#include 和#include 需要将numHeaders设为2，{ " of foo.h>", " of bar.h>" }作为headers，{ "foo.h", "bar.h" }作为includeNames（其中 of foo.h>和 of bar.h>需要替换为实际的foo.h和bar.h文件内容）。或者，如果确保头文件在运行时存在于文件系统中，也可以使用编译选项-I。

一旦创建了用于编译的nvrtcProgram实例，就可以通过nvrtcCompileProgram进行编译，如图3所示。本示例中使用了两个编译选项：--gpu-architecture=compute_80和--fmad=false，用于为compute_80架构生成代码，并禁用将浮点乘法和加法/减法收缩为浮点乘加运算。根据需求可以使用其他编译选项组合，"支持的编译选项"中列出了有效的编译选项。

图3. 启用FMAD时针对compute_80的SAXPY编译

const char *opts[] = {"--gpu-architecture=compute_80",
         "--fmad=false"};
nvrtcCompileProgram(prog,     // prog
         2,        // numOptions
         opts);    // options

编译完成后，用户可获取程序编译日志及生成的PTX文件，如图4所示。若编译失败，NVRTC不会生成有效的PTX；即使编译成功，必要时仍可能生成程序编译日志。

一个nvrtcProgram可以通过nvrtcCompileProgram使用不同的编译选项进行多次编译，用户只能获取最后一次编译生成的PTX和日志。

图4. 获取生成的PTX及程序编译日志

// Obtain compilation log from the program.

size_t logSize;

nvrtcGetProgramLogSize(prog, &logSize);
char *log = new char[logSize];
nvrtcGetProgramLog(prog, log);
// Obtain PTX from the program.
size_t ptxSize;
nvrtcGetPTXSize(prog, &ptxSize);
char *ptx = new char[ptxSize];
nvrtcGetPTX(prog, ptx);

当不再需要nvrtcProgram实例时，可以通过nvrtcDestroyProgram销毁它，如图5所示。

图5. nvrtcProgram的销毁

nvrtcDestroyProgram(&prog);

生成的PTX可以通过CUDA Driver API进一步操作以执行或链接。图6展示了一个执行生成PTX的示例代码序列。

图6. 使用NVRTC生成的PTX执行SAXPY

CUdevice cuDevice;
CUcontext context;
CUmodule module;
CUfunction kernel;
cuInit(0);
cuDeviceGet(&cuDevice, 0);
cuCtxCreate(&context, 0, cuDevice);
cuModuleLoadDataEx(&module, ptx, 0, 0, 0);
cuModuleGetFunction(&kernel, module, "saxpy");
size_t n = size_t n = NUM_THREADS * NUM_BLOCKS;
size_t bufferSize = n * sizeof(float);
float a = ...;
float *hX = ..., *hY = ..., *hOut = ...;
CUdeviceptr dX, dY, dOut;
cuMemAlloc(&dX, bufferSize);
cuMemAlloc(&dY, bufferSize);
cuMemAlloc(&dOut, bufferSize);
cuMemcpyHtoD(dX, hX, bufferSize);
cuMemcpyHtoD(dY, hY, bufferSize);
void *args[] = { &a, &dX, &dY, &dOut, &n };
cuLaunchKernel(kernel,
            NUM_THREADS, 1, 1,   // grid dim
            NUM_BLOCKS, 1, 1,    // block dim
            0, NULL,             // shared mem and stream
            args,                // arguments
            0);
cuCtxSynchronize();
cuMemcpyDtoH(hOut, dOut, bufferSize);

6. 预编译头文件(CUDA 12.8+)

6.1. 概述

Precompiled Headers (PCH) is a compile time optimization feature for use when the same set of ‘prefix’ header files is compiled in successive compiler invocations. For example, consider two translation units a.cu and b.cu that include the same set of header files:

//a.cu
#include "foo.h"
#include "bar.h"

//<-- 'header stop' point
int xxx;

//b.cu
#include "foo.h"
#include "bar.h"

//<-- 'header stop' point
double ddd;

假设a.cu通过NVRTC编译后，接着编译b.cu。如果启用了PCH功能，在编译a.cu时，编译器会识别头文件停止点（通常是主源文件中不属于预处理指令的第一个标记）3，随后将内部状态保存到PCH文件中。之后在编译b.cu时，编译器会确定预处理指令前缀直至头文件停止点，检查是否存在兼容的PCH文件，并通过从PCH文件重新加载内部状态来跳过头文件解析，继续完成编译。

如果头文件很大，这可以显著节省编译时间。编译器支持自动和显式两种PCH模式。自动模式通过-pch标志指定；在此模式下，编译器会自动创建并使用PCH文件。在显式模式下，需要使用--create-pch=filename标志显式创建PCH文件，并在后续编译时通过--use-pch=filename标志指定使用该文件。

3: 有关头部停止点确定的详细信息，请参阅文档后面的部分。

6.2. 实现概述

PCH编译器实现会保存和恢复编译器的内部状态。该内部状态包括指向数据结构的指针所在内存缓冲区的内容。遗憾的是，在现代操作系统中，一项名为地址空间布局随机化(ASLR)的安全特性会导致动态内存分配(例如malloc/mmap)返回的地址在每次程序调用时都不同。因此，在一次程序调用期间创建的PCH文件通常与程序下一次运行时不兼容，因为动态分配返回的内存地址不再与PCH文件中保存的编译器状态里的对象地址相匹配。

因此，PCH文件必须在NVRTC库的同一个动态实例中创建和使用。图1展示了编译器实现的概览。连续进行了2次NVRTC调用，第一次编译a.cu，接着编译b.cu。在内部，编译器有两个不同的堆 - PCH堆和临时堆。当请求PCH处理时，PCH堆会被延迟分配。一旦分配，为PCH堆分配的地址空间在a.cu的NVRTC调用结束时不会返回给操作系统（不过，后备内存会被'释放'，以便操作系统可以重用）。在下一个带有PCH处理的NVRTC调用中（针对b.cu），内存对象会从PCH堆中分配。如果内存分配的顺序与之前的NVRTC调用相同，内存分配器返回的地址现在会匹配之前NVRTC调用返回的值（因为PCH堆的地址空间被保留了）。这使得编译器的保留状态可以从编译a.cu时创建的PCH文件中成功恢复。

Once the PCH heap is exhausted, or if PCH processing is not active, the compiler will allocate from the transient heap. The transient heap is freed after the current NVRTC compilation call is finished. The compiler will report an error (NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED) if it runs out of PCH heap space when PCH creation is requested; the amount of memory required to create the PCH file can be queried using nvrtcGetPCHHeapSizeRequired(). PCH creation can be attempted again after adjusting the PCH heap size with nvrtcSetPCHHeapSize().

创建PCH文件时，编译器会保存一个用于检查PCH文件是否兼容的元数据前缀。该元数据前缀包含以下信息：

主源文件中预处理指令的初始序列，直到头文件停止点。
命令行选项。
编译器版本。
PCH堆的基础地址。

当考虑使用PCH文件时，会检查元数据前缀中的信息以确保兼容性。

6.3. 自动PCH

自动PCH模式通过在NVRTC编译调用中传递-pch来激活。在自动PCH模式下，编译器将首先检测头文件停止点。然后它会从文件系统中查找扩展名为.pch的兼容PCH文件。搜索PCH文件的目录位置也可以通过-pch-dir标志显式指定。如果找到合适的PCH文件，将会使用它，编译器将跳过解析到头文件停止点之前的一系列头文件。编译器会在编译日志中为每个被认为不兼容的PCH文件打印消息，并提供不兼容的原因。此外，编译器也可能选择创建新的PCH文件。如果无法创建PCH文件，编译仍会成功；函数nvrtcGetPCHCreateStatus()可用于检索PCH创建状态，并会报告以下状态之一：NVRTC_SUCCESS（成功）、NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED、NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED或NVRTC_ERROR_PCH_CREATE。

编译器在自动PCH处理过程中创建的PCH文件会在卸载NVRTC库时被删除。

示例：自动PCH（CUDA 12.8+）展示了自动PCH的使用方法。

6.4. 显式PCH创建与使用

另外，也可以使用--create-pch=filename显式创建PCH文件，并通过--use-pch=filename使用。与自动PCH类似，在创建PCH文件时，可以使用nvrtcGetPCHCreateStatus()来检查PCH创建状态。

示例：显式PCH创建/使用（CUDA 12.8+）展示了如何显式创建和使用PCH文件。

6.5. 确定头部停止点

PCH文件包含编译器解析至头文件停止点的状态。头文件停止点通常是主源文件中不属于预处理指令的第一个标记。例如：

#include "foo.h"
#include "bar.h"
int qqq;

这里的头文件停止点是'int'。或者也可以使用#pragma nv_hdrstop来指定头文件停止点：

#include "foo.h"
#pragma nv_hdrstop

#include "bar.h"
int qqq;

如果预期的头文件停止点或#pragma nv_hdrstop位于#if内部，则头文件停止点是最外层的封闭if：

#include "aaa.h"
#ifndef FOO_H
#define FOO_H 1
#include "bbb.h"
#endif
#if MYMACRO
int qqq;
#endif

这里，第一个非预处理标记是int，但头部停止位置是外层#if MYMACRO块的起始处。

6.6. PCH故障条件

由于创建该文件的编译器调用与当前调用在以下任何方面不匹配，PCH文件可能被视为不兼容而无法使用：

NVRTC命令行参数。
主源文件的预处理指令初始序列（例如 #include）。
PCH堆基地址。如果PCH文件是由NVRTC库的不同动态实例创建的，或者在创建PCH文件后使用nvrtcSetPCHHeapSize()调整了PCH堆大小，则可能出现这种情况。
编译器版本。

注意：编译器不会存储PCH前缀中引用的头文件序列的修改时间。用户需自行确保自PCH文件创建后头文件内容4未发生更改。

PCH文件创建可能因以下原因失败：

在标题停止之前的代码中存在错误。
遇到了__DATE__或__TIME__宏。
遇到了编译指令 #pragma nv_no_pch。
头部停止位置不在顶层声明之间。示例：

// foo.h
static

// foo.cu
#include "foo.h"
int qqq;

在达到头文件停止点之前，PCH堆已耗尽。如果发生这种情况，nvrtcGetPCHCreateStatus()将报告NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED。可以通过调用nvrtcGetPCHHeapSizeRequired()获取所需的堆大小，并通过调用nvrtcSetPCHHeapSize()增加堆大小，然后可以再次尝试PCH创建。示例：PCH堆大小调整（CUDA 12.8+）演示了这种模式。

4: 既包括从文件系统读取的头文件，也包括直接以字符串形式指定给nvrtcCreateProgram()的头文件。

6.7. PCH堆控制

PCH堆在首次请求PCH处理时延迟分配。在该编译调用结束时，PCH堆的后备内存会返还给操作系统（'解除提交'），但地址空间不会被释放。在下次需要PCH处理的NVRTC调用期间，会从操作系统重新获取PCH堆的后备内存。

默认的PCH堆大小为256 MB。环境变量NVRTC_PCH_HEAP_SIZE会在NVRTC库初始化时被读取，可用于修改默认的PCH堆分配大小（以字节为单位）。PCH堆大小（字节）可通过nvrtcGetPCHHeapSize()获取，并通过nvrtcSetPCHHeapSize()设置。用户指定的PCH堆大小会被向上取整至一个与平台相关的值5。

注意: 将PCH堆大小设置为0将释放PCH堆并禁用PCH处理。

PCH堆基址会被编码到生成的PCH文件中。nvrtcSetPCHHeapSize()会释放当前分配的PCH堆并重新分配新堆。因此在调用nvrtcSetPCHHeapSize()之前创建的PCH文件很可能与后续编译不兼容，因为PCH堆基址几乎肯定会发生变化。

示例：PCH堆大小调整（CUDA 12.8+）列出了一个完整的可运行示例，演示了PCH堆大小调整功能。

5: 例如，平台上的页面大小。

6.8. 其他控制项

6.8.1. 环境变量

这些变量在NVRTC初始化期间被读取：

NVRTC_PCH_HEAP_SIZE : 设置默认的PCH堆大小（以字节为单位）。该堆在首次请求PCH处理时才会被延迟分配。
NVRTC_DISABLE_PCH : 为所有NVRTC调用禁用PCH处理。

6.8.2. 编译指示

支持以下编译指令：

#pragma nv_hdrstop : 表示预编译头文件停止位置。
#pragma nv_no_pch : 禁用当前源文件的PCH文件创建。

6.8.3. 标志

请参阅支持的编译选项了解nvrtCompileProgram支持的PCH相关标志。

6.8.4. 在创建PCH之前的模板实例化

标志 -instantiate-templates-in-pch={true|false} 可用于控制在创建PCH文件之前是否实例化模板。这可能会增加PCH文件的大小，同时加快使用PCH文件的编译速度（因为不需要再次进行模板实例化）。该标志默认开启。

7. 访问降级名称

NVRTC会按照IA64 ABI规范对__global__函数名以及__device__和__constant__变量名进行名称修饰。如果使用CUDA Driver API加载生成的PTX代码，需要通过名称查找内核函数或__device__/__constant__变量，但当名称被修饰后这将变得困难。为解决此问题，NVRTC提供了API函数，可将源代码级别的__global__函数或__device__/__constant__变量名映射到生成PTX中的修饰名称。

两个API函数nvrtcAddNameExpression和nvrtcGetLoweredName共同提供这一功能。首先，将表示__global__函数或__device__/__constant__变量地址的"名称表达式"字符串提供给nvrtcAddNameExpression。然后，使用nvrtcCompileProgram编译程序。在编译过程中，NVRTC会将名称表达式字符串作为C++常量表达式解析到用户程序末尾。该常量表达式必须提供__global__函数或__device__/__constant__变量的地址。最后，调用nvrtcGetLoweredName函数并传入原始名称表达式，它将返回指向降级名称的指针。这个降级名称可用于在CUDA Driver API中引用内核或变量。

NVRTC确保在调用nvrtcAddNameExpression时引用的任何__global__函数或__device__/__constant__变量都会出现在生成的PTX中（前提是输入源代码中包含该定义）。

7.1. 示例

示例：使用小写名称列出了一个完整的可运行示例。以下是相关代码片段：

GPU源代码('gpu_program')包含各种__global__函数/函数模板以及__device__/__constant__变量的定义：

const char *gpu_program = "                                     \n\
__device__ int V1; // 从主机代码设置                        \n\
static __global__ void f1(int *result) { *result = V1 + 10; }   \n\
namespace N1 {                                                  \n\
   namespace N2 {                                               \n\
      __constant__ int V2; // 从主机代码设置                \n\
      __global__ void f2(int *result) { *result = V2 + 20; }    \n\
   }                                                            \n\
}                                                               \n\
template                                            \n\
__global__ void f3(int *result) { *result = sizeof(T); }        \n\

主机源代码调用nvrtcAddNameExpression，使用各种名称表达式来引用__global__函数和__device__/__constant__变量的地址：

kernel_name_vec.push_back("&f1");
..
kernel_name_vec.push_back("N1::N2::f2");
..
kernel_name_vec.push_back("f3");
..
kernel_name_vec.push_back("f3");

// 将名称表达式添加到NVRTC。注意这必须在程序编译前完成
for (size_t i = 0; i < name_vec.size(); ++i)
NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, kernel_name_vec[i].c_str()));
..
// 将__device__/__constant__变量的表达式添加到NVRTC
variable_name_vec.push_back("&V1");
..
variable_name_vec.push_back("&N1::N2::V2");
..
for (size_t i = 0; i < variable_name_vec.size(); ++i)
NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog,
variable_name_vec[i].c_str()));

然后使用nvrtcCompileProgram编译GPU程序。生成的PTX被加载到GPU上。查找__device__/__constant__变量和__global__函数的修饰名称：

// 注意：此调用必须在NVRTC程序编译完成后且销毁前执行
NVRTC_SAFE_CALL(nvrtcGetLoweredName(
prog,
variable_name_vec[i].c_str(), // 名称表达式
&name                         // 修饰后的名称
));
..
NVRTC_SAFE_CALL(nvrtcGetLoweredName(
prog,
kernel_name_vec[i].c_str(), // 名称表达式
&name // 修饰后的名称
));

然后使用__device__/__constant__变量的混淆名称在模块中查找该变量，并通过CUDA Driver API更新其值：

CUdeviceptr variable_addr;
CUDA_SAFE_CALL(cuModuleGetGlobal(&variable_addr, NULL, module, name));
CUDA_SAFE_CALL(cuMemcpyHtoD(variable_addr,
&initial_value, sizeof(initial_value)));

然后使用这个经过名称修饰的内核，通过CUDA驱动API来启动它：

CUfunction kernel;
CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, name));
...
CUDA_SAFE_CALL(
cuLaunchKernel(kernel,
1, 1, 1, // 网格维度
1, 1, 1, // 块维度
0, NULL, // 共享内存和流
args, 0));

7.2. 备注

调用顺序：所有名称表达式必须在通过nvrtcCompileProgram编译NVRTC程序之前，使用nvrtcAddNameExpression添加。这是必需的，因为名称表达式会在用户程序结束时被解析，并可能触发模板实例化。降低后的名称必须仅在NVRTC程序编译完成后、销毁前通过调用nvrtcGetLoweredName来查找。nvrtcGetLoweredName返回的指针指向NVRTC拥有的内存，该内存在NVRTC程序销毁时(nvrtcDestroyProgram)会被释放。因此正确的调用顺序是：nvrtcAddNameExpression、nvrtcCompileProgram、nvrtcGetLoweredName、nvrtcDestroyProgram。
相同名称表达式：传递给nvrtcAddNameExpression和nvrtcGetLoweredName的名称表达式字符串必须具有完全相同的字符。例如，"foo"和"foo "不是相同的字符串，尽管在语义上它们指向同一个实体(foo)，因为第二个字符串多了一个空格字符。
常量表达式：名称表达式字符串中的字符在用户程序末尾被解析为C++常量表达式。解析过程中的任何错误都会导致编译失败，并在编译日志中生成编译器诊断信息。该常量表达式必须引用__global__函数或__device__/__constant__变量的地址。
重载函数地址：如果NVRTC源代码中有多个重载的 __global__函数，那么名称表达式必须使用强制转换操作来消除歧义。但在C++11之前的C++方言中，常量表达式不允许使用强制转换。如果使用此类名称表达式，请使用-std命令行标志以C++11或更高版本的方言编译代码。示例：假设GPU代码字符串包含：
```
__global__ void foo(int) { }
__global__ void foo(char) { }
```
名称表达式(void(*)(int))foo能正确消除foo(int)的歧义，但程序必须以C++11或更高版本的方言（如-std=c++11）编译，因为在C++11之前的常量表达式中不允许使用强制转换。

8. 与模板主机代码的接口

在某些场景下，基于主机代码中的模板参数在设备代码中实例化__global__函数模板非常有用。NVRTC辅助函数nvrtcGetTypeName可用于提取主机代码中类型的源码级名称，该字符串可用于实例化__global__函数模板，并通过nvrtcAddNameExpression和nvrtcGetLoweredName函数获取实例化的修饰名称。

nvrtcGetTypeName 函数在NVRTC头文件中以内联方式定义，当宏 NVRTC_GET_TYPE_NAME 被定义为非零值时可用。该函数在使用gcc/clang和cl.exe编译器时，会分别调用主机代码函数 abi::__cxa_demangle 和 UnDecorateSymbolName。用户可能需要指定额外的头文件路径和库来定位这些主机函数（abi::__cxa_demangle / UnDecorateSymbolName）。具体构建方法可参考下方示例的构建说明（nvrtcGetTypeName Build Instructions）。

8.1. 模板主机代码示例

示例：使用nvrtcGetTypeName 列出了一个完整的可运行示例。以下是相关代码片段：

GPU源代码 (gpu_program) 包含一个 __global__ 函数模板的定义：

const char *gpu_program = " \n\
namespace N1 { struct S1_t { int i; double d; }; } \n\
template \n\
__global__ void f3(int *result) { *result = sizeof(T); } \n\
\n";

主机代码函数 getKernelNameForType 会根据主机模板类型 T 为 __global__ 函数模板实例化创建名称表达式。类型 T 的名称是通过 nvrtcGetTypeName 提取的：

template <typename T>
std::string getKernelNameForType(void)
{
// 使用 nvrtcGetTypeName() 查找类型 "T" 的源代码级名称字符串
// 并用它来创建内核名称
std::string type_name;
NVRTC_SAFE_CALL(nvrtcGetTypeName<T>(&type_name));
return std::string("f3<") + type_name + ">";
}

名称表达式通过nvrtcAddNameExpression函数提供给NVRTC：

name_vec.push_back(getKernelNameForType<int>());
..
name_vec.push_back(getKernelNameForType<double>());
..
name_vec.push_back(getKernelNameForType<N1::S1_t>());
..
for (size_t i = 0; i < name_vec.size(); ++i)
NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, name_vec[i].c_str()));

随后使用nvrtcCompileProgram编译GPU程序。生成的PTX代码会被加载到GPU上。查找__global__函数模板实例化的修饰名称：

// 注意：此调用必须在NVRTC程序编译完成后、销毁前执行
NVRTC_SAFE_CALL(nvrtcGetLoweredName(
prog,
name_vec[i].c_str(), // 名称表达式
&name // 修饰后的名称
));

然后使用这个重整后的名称通过CUDA驱动API来启动内核：

CUfunction kernel;
CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, name));
...
CUDA_SAFE_CALL(
cuLaunchKernel(kernel,
1, 1, 1, // 网格维度
1, 1, 1, // 块维度
0, NULL, // 共享内存和流
args, 0));

9. 版本控制方案

9.1. NVRTC共享库版本控制

在以下内容中，MAJOR和MINOR分别表示CUDA工具包的主版本号和次版本号。例如，对于CUDA 11.2，MAJOR是"11"，MINOR是"2"。

Linux:
- 在CUDA 11.3之前的工具包中，soname被设置为"MAJOR.MINOR"。
- 在CUDA 11.3及后续11.x工具包中，soname字段被设置为"11.2"。
- 在主要版本大于11的CUDA工具包中（例如CUDA 12.x），soname字段被设置为"MAJOR"。
Windows:
- 在cuda 11.3之前的CUDA工具包中，DLL文件名格式为"nvrtc64_XY_0.dll"，其中X表示主版本号，Y表示次版本号。
- 在CUDA 11.3及后续11.x工具包中，DLL文件名为"nvrtc64_112_0.dll"。
- 在主要版本大于11的CUDA工具包中（例如CUDA 12.x），DLL文件名格式为"nvrtc64_X0_0.dll"，其中X代表主版本号。

考虑一个主版本号大于11的CUDA工具包。该CUDA工具包中的NVRTC共享库将与同一CUDA工具包先前次要版本中的NVRTC共享库具有相同的soname（Linux）或DLL名称（Windows）。同样，CUDA 11.3及后续11.x版本中的NVRTC共享库将与CUDA 11.2中的NVRTC共享库具有相同的soname（Linux）或DLL名称（Windows）。

根据上述版本控制方案，链接特定NVRTC共享库的NVRTC客户端将继续与具有匹配soname（Linux）或DLL名称（Windows）的未来NVRTC共享库兼容。这使得NVRTC客户端能够利用较新NVRTC共享库中提供的错误修复和功能增强1。然而，如最佳实践指南所述，较新的NVRTC共享库生成的PTX版本可能不被旧版CUDA驱动程序中的CUDA Driver API函数所接受。

解决此问题的一些方法：

安装一个更新的CUDA驱动程序，该驱动程序需与包含正在使用的NVRTC库的CUDA工具包兼容。
使用NVRTC直接编译为SASS而非PTX（参见最佳实践指南）。

或者，NVRTC客户端也可以选择链接静态NVRTC库，或者重新分发特定版本的NVRTC共享库，并在运行时使用dlopen（Linux）或LoadLibrary（Windows）函数来加载该库。这两种方法都能让NVRTC客户端在部署过程中控制所使用的NVRTC版本，从而确保功能与性能的可预测性。

9.2. NVRTC内置函数库

NVRTC-builtins库包含作为NVRTC包组成部分的辅助代码。它仅由NVRTC库内部使用。每个NVRTC库仅与来自相同CUDA工具包的NVRTC-builtins库兼容。

10. 其他注意事项

10.1. 线程安全

多个线程可以并发调用NVRTC API函数，只要不存在竞争条件。在此上下文中，如果多个线程并发调用具有相同nvrtcProgram参数的NVRTC API函数，且至少有一个线程正在调用nvrtcCompileProgram或nvrtcAddNameExpression 2，则定义为发生了竞争条件。

自CUDA 12.3起，NVRTC允许nvrtcCompileProgram的并发调用，也可能同时触发内置的NVVM优化器/代码生成阶段。设置环境变量NVRTC_DISABLE_CONCURRENT_NVVM可禁用此行为，即内置NVVM优化器/代码生成阶段的调用将被串行化。

10.2. 堆栈大小

在Linux系统上，NVRTC会在编译期间使用setrlimit()函数将堆栈大小增加到允许的最大值。这降低了编译器在处理复杂输入源时耗尽堆栈的可能性。编译完成后，堆栈大小会重置为之前的值。

由于setrlimit()会改变整个进程的堆栈大小，它也会影响其他可能正在并发执行的应用程序线程。命令行标志-modify-stack-limit=false将阻止NVRTC修改堆栈限制。

10.3. NVRTC静态库

NVRTC静态库引用了在NVRTC内置静态库和PTX编译器静态库中定义的函数。请参阅构建说明以获取示例。

11. 示例：SAXPY

11.1. 代码 (saxpy.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <iostream>

#define NUM_THREADS 128
#define NUM_BLOCKS 32
#define NVRTC_SAFE_CALL(x)                                        \
  do {                                                            \
    nvrtcResult result = x;                                       \
    if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
    }                                                             \
} while(0)
#define CUDA_SAFE_CALL(x)                                         \
  do {                                                            \
    CUresult result = x;                                          \
    if (result != CUDA_SUCCESS) {                                 \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
                << msg << '\n';                                   \
      exit(1);                                                    \
    }                                                             \
} while(0)

const char *saxpy = "                                           \n\
extern \"C\" __global__                                         \n\
void saxpy(float a, float *x, float *y, float *out, size_t n)   \n\
{                                                               \n\
  size_t tid = blockIdx.x * blockDim.x + threadIdx.x;           \n\
  if (tid < n) {                                                \n\
    out[tid] = a * x[tid] + y[tid];                             \n\
  }                                                             \n\
}                                                               \n";

int main()
{
   // Create an instance of nvrtcProgram with the SAXPY code string.
   nvrtcProgram prog;
   NVRTC_SAFE_CALL(
      nvrtcCreateProgram(&prog,         // prog
                        saxpy,         // buffer
                        "saxpy.cu",    // name
                        0,             // numHeaders
                        NULL,          // headers
                        NULL));        // includeNames
   // Compile the program with fmad disabled.
   // Note: Can specify GPU target architecture explicitly with '-arch' flag.
   const char *opts[] = {"--fmad=false"};
   nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                   1,     // numOptions
                                                   opts); // options
   // Obtain compilation log from the program.
   size_t logSize;
   NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
   char *log = new char[logSize];
   NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
   std::cout << log << '\n';
   delete[] log;
   if (compileResult != NVRTC_SUCCESS) {
      exit(1);
   }
   // Obtain PTX from the program.
   size_t ptxSize;
   NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
   char *ptx = new char[ptxSize];
   NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
   // Destroy the program.
   NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));
   // Load the generated PTX and get a handle to the SAXPY kernel.
   CUdevice cuDevice;
   CUcontext context;
   CUmodule module;
   CUfunction kernel;
   CUDA_SAFE_CALL(cuInit(0));
   CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
   CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
   CUDA_SAFE_CALL(cuModuleLoadDataEx(&module, ptx, 0, 0, 0));
   CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, "saxpy"));
   // Generate input for execution, and create output buffers.
   size_t n = NUM_THREADS * NUM_BLOCKS;
   size_t bufferSize = n * sizeof(float);
   float a = 5.1f;
   float *hX = new float[n], *hY = new float[n], *hOut = new float[n];
   for (size_t i = 0; i < n; ++i) {
      hX[i] = static_cast<float>(i);
      hY[i] = static_cast<float>(i * 2);
   }
   CUdeviceptr dX, dY, dOut;
   CUDA_SAFE_CALL(cuMemAlloc(&dX, bufferSize));
   CUDA_SAFE_CALL(cuMemAlloc(&dY, bufferSize));
   CUDA_SAFE_CALL(cuMemAlloc(&dOut, bufferSize));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dX, hX, bufferSize));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dY, hY, bufferSize));
   // Execute SAXPY.
   void *args[] = { &a, &dX, &dY, &dOut, &n };
   CUDA_SAFE_CALL(
      cuLaunchKernel(kernel,
                     NUM_BLOCKS, 1, 1,    // grid dim
                     NUM_THREADS, 1, 1,   // block dim
                     0, NULL,             // shared mem and stream
                     args, 0));           // arguments
   CUDA_SAFE_CALL(cuCtxSynchronize());
   // Retrieve and print output.
   CUDA_SAFE_CALL(cuMemcpyDtoH(hOut, dOut, bufferSize));
   for (size_t i = 0; i < n; ++i) {
      std::cout << a << " * " << hX[i] << " + " << hY[i]
               << " = " << hOut[i] << '\n';
   }
   // Release resources.
   CUDA_SAFE_CALL(cuMemFree(dX));
   CUDA_SAFE_CALL(cuMemFree(dY));
   CUDA_SAFE_CALL(cuMemFree(dOut));
   CUDA_SAFE_CALL(cuModuleUnload(module));
   CUDA_SAFE_CALL(cuCtxDestroy(context));
   delete[] hX;
   delete[] hY;
   delete[] hOut;
   delete[] ptx;
   return 0;
}

11.2. Saxpy 构建指南

假设环境变量 CUDA_PATH 指向 CUDA 工具包的安装目录，按以下方式构建此示例：

使用NVRTC共享库：

Windows:

cl.exe saxpy.cpp /Fesaxpy ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux系统:

g++ saxpy.cpp -o saxpy \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用NVRTC静态库：

Windows:

cl.exe saxpy.cpp /Fesaxpy  ^
    /I "%CUDA_PATH%"\include ^
    "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
    "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux系统:

g++ saxpy.cpp -o saxpy \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

12. 示例：使用Lowered Name

12.1. 代码 (lowered-name.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <iostream>
#include <vector>
#include <string>

#define NVRTC_SAFE_CALL(x)                                        \
do {                                                              \
   nvrtcResult result = x;                                        \
   if (result != NVRTC_SUCCESS) {                                 \
      std::cerr << "\nerror: " #x " failed with error "           \
               << nvrtcGetErrorString(result) << '\n';            \
      exit(1);                                                    \
   }                                                              \
} while(0)
#define CUDA_SAFE_CALL(x)                                         \
do {                                                              \
   CUresult result = x;                                           \
   if (result != CUDA_SUCCESS) {                                  \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
               << msg << '\n';                                    \
      exit(1);                                                    \
   }                                                              \
} while(0)

const char *gpu_program = "
  __device__ int V1; // set from host code                        \n\
  static __global__ void f1(int *result) { *result = V1 + 10; }   \n\
  namespace N1 {                                                  \n\
    namespace N2 {                                                \n\
       __constant__ int V2; // set from host code                 \n\
       __global__ void f2(int *result) { *result = V2 + 20; }     \n\
      }                                                           \n\
   }                                                              \n\
   template<typename T>                                           \n\
   __global__ void f3(int *result) { *result = sizeof(T); }       \n\
                                                                  \n";

int main()
{
   // Create an instance of nvrtcProgram
   nvrtcProgram prog;
   NVRTC_SAFE_CALL(nvrtcCreateProgram(&prog,         // prog
                                       gpu_program,   // buffer
                                       "prog.cu",     // name
                                       0,             // numHeaders
                                       NULL,          // headers
                                       NULL));        // includeNames

   // add all name expressions for kernels
   std::vector<std::string> kernel_name_vec;
   std::vector<std::string> variable_name_vec;
   std::vector<int> variable_initial_value;

   std::vector<int> expected_result;

   // note the name expressions are parsed as constant expressions
   kernel_name_vec.push_back("&f1");
   expected_result.push_back(10 + 100);

   kernel_name_vec.push_back("N1::N2::f2");
   expected_result.push_back(20 + 200);

   kernel_name_vec.push_back("f3<int>");
   expected_result.push_back(sizeof(int));

   kernel_name_vec.push_back("f3<double>");
   expected_result.push_back(sizeof(double));

   // add kernel name expressions to NVRTC. Note this must be done before
   // the program is compiled.
   for (size_t i = 0; i < kernel_name_vec.size(); ++i)
      NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, kernel_name_vec[i].c_str()));

   // add expressions for  __device__ / __constant__ variables to NVRTC
   variable_name_vec.push_back("&V1");
   variable_initial_value.push_back(100);

   variable_name_vec.push_back("&N1::N2::V2");
   variable_initial_value.push_back(200);

   for (size_t i = 0; i < variable_name_vec.size(); ++i)
      NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, variable_name_vec[i].c_str()));

   nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                   0,     // numOptions
                                                   NULL); // options
   // Obtain compilation log from the program.
   size_t logSize;
   NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
   char *log = new char[logSize];
   NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
   std::cout << log << '\n';
   delete[] log;
   if (compileResult != NVRTC_SUCCESS) {
      exit(1);
   }
   // Obtain PTX from the program.
   size_t ptxSize;
   NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
   char *ptx = new char[ptxSize];
   NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
   // Load the generated PTX
   CUdevice cuDevice;
   CUcontext context;
   CUmodule module;

   CUDA_SAFE_CALL(cuInit(0));
   CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
   CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
   CUDA_SAFE_CALL(cuModuleLoadDataEx(&module, ptx, 0, 0, 0));

   CUdeviceptr dResult;
   int hResult = 0;
   CUDA_SAFE_CALL(cuMemAlloc(&dResult, sizeof(hResult)));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dResult, &hResult, sizeof(hResult)));

   // for each of the __device__/__constant__ variable address
   // expressions provided to NVRTC, extract the lowered name for the
   // corresponding variable, and set its value
   for (size_t i = 0; i < variable_name_vec.size(); ++i) {
      const char *name;

      // note: this call must be made after NVRTC program has been
      // compiled and before it has been destroyed.
      NVRTC_SAFE_CALL(nvrtcGetLoweredName(
                           prog,
            variable_name_vec[i].c_str(), // name expression
            &name                         // lowered name
                                          ));
      int initial_value = variable_initial_value[i];

      // get pointer to variable using lowered name, and set its
      // initial value
      CUdeviceptr variable_addr;
      CUDA_SAFE_CALL(cuModuleGetGlobal(&variable_addr, NULL, module, name));
      CUDA_SAFE_CALL(cuMemcpyHtoD(variable_addr, &initial_value, sizeof(initial_value)));
   }


   // for each of the kernel name expressions previously provided to NVRTC,
   // extract the lowered name for corresponding __global__ function,
   // and launch it.

   for (size_t i = 0; i < kernel_name_vec.size(); ++i) {
      const char *name;

      // note: this call must be made after NVRTC program has been
      // compiled and before it has been destroyed.
      NVRTC_SAFE_CALL(nvrtcGetLoweredName(
                           prog,
            kernel_name_vec[i].c_str(), // name expression
            &name                // lowered name
                                          ));

      // get pointer to kernel from loaded PTX
      CUfunction kernel;
      CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, name));

      // launch the kernel
      std::cout << "\nlaunching " << name << " ("
            << kernel_name_vec[i] << ")" << std::endl;

      void *args[] = { &dResult };
      CUDA_SAFE_CALL(
         cuLaunchKernel(kernel,
            1, 1, 1,             // grid dim
            1, 1, 1,             // block dim
            0, NULL,             // shared mem and stream
            args, 0));           // arguments
      CUDA_SAFE_CALL(cuCtxSynchronize());

      // Retrieve the result
      CUDA_SAFE_CALL(cuMemcpyDtoH(&hResult, dResult, sizeof(hResult)));

      // check against expected value
      if (expected_result[i] != hResult) {
         std::cout << "\n Error: expected result = " << expected_result[i]
                  << " , actual result = " << hResult << std::endl;
         exit(1);
      }
   }  // for

   // Release resources.
   CUDA_SAFE_CALL(cuMemFree(dResult));
   CUDA_SAFE_CALL(cuModuleUnload(module));
   CUDA_SAFE_CALL(cuCtxDestroy(context));
   delete[] ptx;

   // Destroy the program.
   NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

   return 0;
}

12.2. Lowered Name 构建指南

假设环境变量 CUDA_PATH 指向 CUDA 工具包的安装目录，按以下方式构建此示例：

使用NVRTC共享库：

Windows:

cl.exe lowered-name.cpp /Felowered-name ^
/I "%CUDA_PATH%"\include ^
"%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux系统:

g++ lowered-name.cpp -o lowered-name \
-I $CUDA_PATH/include \
-L $CUDA_PATH/lib64 \
-lnvrtc -lcuda \
-Wl,-rpath,$CUDA_PATH/lib64

使用NVRTC静态库：

Windows:

cl.exe lowered-name.cpp /Felowered-name  ^
/I "%CUDA_PATH%"\include ^
"%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
"%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
"%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
"%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux系统:

g++ lowered-name.cpp -o lowered-name \
-I $CUDA_PATH/include \
-L $CUDA_PATH/lib64 \
-lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static \
-lcuda -lpthread

13. 示例：使用nvrtcGetTypeName

13.1. 代码 (host-type-name.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <iostream>
#include <vector>
#include <string>

#define NVRTC_SAFE_CALL(x)                                        \
do {                                                            \
   nvrtcResult result = x;                                       \
   if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
               << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
   }                                                             \
} while(0)
#define CUDA_SAFE_CALL(x)                                         \
do {                                                            \
   CUresult result = x;                                          \
   if (result != CUDA_SUCCESS) {                                 \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
               << msg << '\n';                                   \
      exit(1);                                                    \
   }                                                             \
} while(0)

const char *gpu_program = "                                     \n\
namespace N1 { struct S1_t { int i; double d; }; }              \n\
template<typename T>                                            \n\
__global__ void f3(int *result) { *result = sizeof(T); }        \n\
                                                               \n";


// note: this structure is also defined in GPU code string. Should ideally
// be in a header file included by both GPU code string and by CPU code.
namespace N1 { struct S1_t { int i; double d; }; };
template <typename T>
std::string getKernelNameForType(void)
{
   // Look up the source level name string for the type "T" using
   // nvrtcGetTypeName() and use it to create the kernel name
   std::string type_name;
   NVRTC_SAFE_CALL(nvrtcGetTypeName<T>(&type_name));
   return std::string("f3<") + type_name + ">";
}

int main()
{
// Create an instance of nvrtcProgram
nvrtcProgram prog;
NVRTC_SAFE_CALL(
   nvrtcCreateProgram(&prog,         // prog
                     gpu_program,   // buffer
                     "gpu_program.cu",    // name
                     0,             // numHeaders
                     NULL,          // headers
                     NULL));        // includeNames

// add all name expressions for kernels
std::vector<std::string> name_vec;
std::vector<int> expected_result;

// note the name expressions are parsed as constant expressions
name_vec.push_back(getKernelNameForType<int>());
expected_result.push_back(sizeof(int));

name_vec.push_back(getKernelNameForType<double>());
expected_result.push_back(sizeof(double));

name_vec.push_back(getKernelNameForType<N1::S1_t>());
expected_result.push_back(sizeof(N1::S1_t));


// add name expressions to NVRTC. Note this must be done before
// the program is compiled.
for (size_t i = 0; i < name_vec.size(); ++i)
   NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, name_vec[i].c_str()));

nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                0,     // numOptions
                                                NULL); // options
// Obtain compilation log from the program.
size_t logSize;
NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
char *log = new char[logSize];
NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
std::cout << log << '\n';
delete[] log;
if (compileResult != NVRTC_SUCCESS) {
   exit(1);
}
// Obtain PTX from the program.
size_t ptxSize;
NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
char *ptx = new char[ptxSize];
NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));


// Load the generated PTX
CUdevice cuDevice;
CUcontext context;
CUmodule module;

CUDA_SAFE_CALL(cuInit(0));
CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
CUDA_SAFE_CALL(cuModuleLoadDataEx(&module, ptx, 0, 0, 0));

CUdeviceptr dResult;
int hResult = 0;
CUDA_SAFE_CALL(cuMemAlloc(&dResult, sizeof(hResult)));
CUDA_SAFE_CALL(cuMemcpyHtoD(dResult, &hResult, sizeof(hResult)));

// for each of the name expressions previously provided to NVRTC,
// extract the lowered name for corresponding __global__ function,
// and launch it.

for (size_t i = 0; i < name_vec.size(); ++i) {
   const char *name;

   // note: this call must be made after NVRTC program has been
   // compiled and before it has been destroyed.
   NVRTC_SAFE_CALL(nvrtcGetLoweredName(
                        prog,
         name_vec[i].c_str(), // name expression
         &name                // lowered name
                                       ));

   // get pointer to kernel from loaded PTX
   CUfunction kernel;
   CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, name));

   // launch the kernel
   std::cout << "\nlaunching " << name << " ("
         << name_vec[i] << ")" << std::endl;

   void *args[] = { &dResult };
   CUDA_SAFE_CALL(
      cuLaunchKernel(kernel,
         1, 1, 1,             // grid dim
         1, 1, 1,             // block dim
         0, NULL,             // shared mem and stream
         args, 0));           // arguments
   CUDA_SAFE_CALL(cuCtxSynchronize());

   // Retrieve the result
   CUDA_SAFE_CALL(cuMemcpyDtoH(&hResult, dResult, sizeof(hResult)));

   // check against expected value
   if (expected_result[i] != hResult) {
      std::cout << "\n Error: expected result = " << expected_result[i]
      << " , actual result = " << hResult << std::endl;
      exit(1);
   }
}  // for

// Release resources.
CUDA_SAFE_CALL(cuMemFree(dResult));
CUDA_SAFE_CALL(cuModuleUnload(module));
CUDA_SAFE_CALL(cuCtxDestroy(context));
delete[] ptx;

// Destroy the program.
NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

return 0;
}

13.2. nvrtcGetTypeName 构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包的安装目录，按以下方式构建此示例：

使用NVRTC共享库：

Windows:

cl.exe -DNVRTC_GET_TYPE_NAME=1 host-type-name.cpp /Fehost-type-name ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib DbgHelp.lib

Linux系统:

g++ -DNVRTC_GET_TYPE_NAME=1 host-type-name.cpp -o host-type-name \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用NVRTC静态库：

Windows:

cl.exe -DNVRTC_GET_TYPE_NAME=1 host-type-name.cpp /Fehost-type-name  ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
   "%CUDA_PATH%"\lib\x64\cuda.lib DbgHelp.lib user32.lib Ws2_32.lib

Linux系统:

g++ -DNVRTC_GET_TYPE_NAME=1 host-type-name.cpp -o host-type-name \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static \
   -lcuda -lpthread

14. 示例：动态并行

代码 (dynamic-parallelism.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <iostream>

#define NVRTC_SAFE_CALL(x)                                        \
do {                                                              \
   nvrtcResult result = x;                                        \
   if (result != NVRTC_SUCCESS) {                                 \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
   }                                                              \
} while(0)
#define CUDA_SAFE_CALL(x)                                         \
do {                                                              \
   CUresult result = x;                                           \
   if (result != CUDA_SUCCESS) {                                  \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
                << msg << '\n';                                   \
      exit(1);                                                    \
   }                                                              \
} while(0)

const char *dynamic_parallelism = "                             \n\
extern \"C\" __global__                                         \n\
void child(float *out, size_t n)                                \n\
{                                                               \n\
   size_t tid = blockIdx.x * blockDim.x + threadIdx.x;          \n\
   if (tid < n) {                                               \n\
      out[tid] = tid;                                           \n\
   }                                                            \n\
}                                                               \n\
                                                                \n\
extern \"C\" __global__                                         \n\
void parent(float *out, size_t n,                               \n\
            size_t numBlocks, size_t numThreads)                \n\
{                                                               \n\
   child<<<numBlocks, numThreads>>>(out, n);                    \n\
   cudaDeviceSynchronize();                                     \n\
}                                                               \n";
int main(int argc, char *argv[])
{
if (argc < 2) {
   std::cout << "Usage: dynamic-parallelism <path to cudadevrt library>\n\n"
             << "<path to cudadevrt library> must include the cudadevrt\n"
             << "library name itself, e.g., Z:\\path\\to\\cudadevrt.lib on \n"
             << "Windows and /path/to/libcudadevrt.a on Linux.\n";
   exit(1);
}
size_t numBlocks = 32;
size_t numThreads = 128;
// Create an instance of nvrtcProgram with the code string.
nvrtcProgram prog;
NVRTC_SAFE_CALL(
   nvrtcCreateProgram(&prog,                       // prog
                     dynamic_parallelism,          // buffer
                     "dynamic_parallelism.cu",     // name
                     0,                            // numHeaders
                     NULL,                         // headers
                     NULL));                       // includeNames
// Compile the program for compute_35 with rdc enabled.
const char *opts[] = {"--gpu-architecture=compute_35",
                        "--relocatable-device-code=true"};
nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                2,     // numOptions
                                                opts); // options
// Obtain compilation log from the program.
size_t logSize;
NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
char *log = new char[logSize];
NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
std::cout << log << '\n';
delete[] log;
if (compileResult != NVRTC_SUCCESS) {
   exit(1);
}
// Obtain PTX from the program.
size_t ptxSize;
NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
char *ptx = new char[ptxSize];
NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
// Destroy the program.
NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));
// Load the generated PTX and get a handle to the parent kernel.
CUdevice cuDevice;
CUcontext context;
CUlinkState linkState;
CUmodule module;
CUfunction kernel;
CUDA_SAFE_CALL(cuInit(0));
CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
CUDA_SAFE_CALL(cuLinkCreate(0, 0, 0, &linkState));
CUDA_SAFE_CALL(cuLinkAddFile(linkState, CU_JIT_INPUT_LIBRARY, argv[1],
                              0, 0, 0));
CUDA_SAFE_CALL(cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                              (void *)ptx, ptxSize, "dynamic_parallelism.ptx",
                              0, 0, 0));
size_t cubinSize;
void *cubin;
CUDA_SAFE_CALL(cuLinkComplete(linkState, &cubin, &cubinSize));
CUDA_SAFE_CALL(cuModuleLoadData(&module, cubin));
CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, "parent"));
// Generate input for execution, and create output buffers.
size_t n = numBlocks * numThreads;
size_t bufferSize = n * sizeof(float);
float *hOut = new float[n];
CUdeviceptr dX, dY, dOut;
CUDA_SAFE_CALL(cuMemAlloc(&dOut, bufferSize));
// Execute parent kernel.
void *args[] = { &dOut, &n, &numBlocks, &numThreads };
CUDA_SAFE_CALL(
   cuLaunchKernel(kernel,
                  1, 1, 1,    // grid dim
                  1, 1, 1,    // block dim
                  0, NULL,    // shared mem and stream
                  args, 0));  // arguments
CUDA_SAFE_CALL(cuCtxSynchronize());
// Retrieve and print output.
CUDA_SAFE_CALL(cuMemcpyDtoH(hOut, dOut, bufferSize));

for (size_t i = 0; i < n; ++i) {
   std::cout << hOut[i] << '\n';
}
// Release resources.
CUDA_SAFE_CALL(cuMemFree(dOut));
CUDA_SAFE_CALL(cuModuleUnload(module));
CUDA_SAFE_CALL(cuLinkDestroy(linkState));
CUDA_SAFE_CALL(cuCtxDestroy(context));
delete[] hOut;
delete[] ptx;
return 0;
}

14.1. 动态并行构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包的安装目录，按以下方式构建此示例：

使用NVRTC共享库：

Windows:

cl.exe dynamic-parallelism.cpp /Fedynamic-parallelism ^
   /I "%CUDA_PATH%\include" ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux系统:

g++ dynamic-parallelism.cpp -o dynamic-parallelism \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用NVRTC静态库：

Windows:

cl.exe dynamic-parallelism.cpp /Fedynamic-parallelism  ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
  "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux系统:

g++ dynamic-parallelism.cpp -o dynamic-parallelism \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

15. 示例：设备LTO（链接时优化）

本节展示设备链接时优化(LTO)功能。 LTO IR包含两个单元。第一个单元通过nvcc离线生成，需指定架构为-arch lto_XX（参阅Code (offline.cu)）。生成的LTO IR会被打包到fatbinary中。

第二个单元是通过使用NVRTC在线生成的，通过指定标志-dlto（参见Code (online.cpp)）。

这两个单元随后被传递到libnvJitLink* API函数，该函数将LTO IR链接在一起，在链接后的IR上运行优化器并生成cubin（参见Code (online.cpp)）。然后cubin被加载到GPU上并执行。

15.1. 代码 (offline.cu)

__device__  float compute(float a, float x, float y) {
return a * x + y;
}

15.2. 代码 (online.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <nvJitLink.h>
#include <iostream>

#define NUM_THREADS 128
#define NUM_BLOCKS 32

#define NVRTC_SAFE_CALL(x)                                        \
do {                                                              \
   nvrtcResult result = x;                                        \
   if (result != NVRTC_SUCCESS) {                                 \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
   }                                                              \
} while(0)

#define CUDA_SAFE_CALL(x)                                         \
do {                                                              \
   CUresult result = x;                                           \
   if (result != CUDA_SUCCESS) {                                  \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
                << msg << '\n';                                   \
      exit(1);                                                    \
   }                                                              \
} while(0)

#define NVJITLINK_SAFE_CALL(h,x)                                  \
do {                                                              \
   nvJitLinkResult result = x;                                    \
   if (result != NVJITLINK_SUCCESS) {                             \
      std::cerr << "\nerror: " #x " failed with error "           \
                << result << '\n';                                \
      size_t lsize;                                               \
      result = nvJitLinkGetErrorLogSize(h, &lsize);               \
      if (result == NVJITLINK_SUCCESS && lsize > 0) {             \
         char *log = (char*)malloc(lsize);                        \
         result = nvJitLinkGetErrorLog(h, log);                   \
         if (result == NVJITLINK_SUCCESS) {                       \
            std::cerr << "error: " << log << '\n';                \
            free(log);                                            \
         }                                                        \
      }                                                           \
      exit(1);                                                    \
   }                                                              \
} while(0)

const char *lto_saxpy = "                                       \n\
extern __device__ float compute(float a, float x, float y);     \n\
                                                                \n\
extern \"C\" __global__                                         \n\
void saxpy(float a, float *x, float *y, float *out, size_t n)   \n\
{                                                               \n\
   size_t tid = blockIdx.x * blockDim.x + threadIdx.x;          \n\
   if (tid < n) {                                               \n\
      out[tid] = compute(a, x[tid], y[tid]);                    \n\
   }                                                            \n\
} \n";

int main(int argc, char *argv[])
{
   size_t numBlocks = 32;
   size_t numThreads = 128;
   // Create an instance of nvrtcProgram with the code string.
   nvrtcProgram prog;
   NVRTC_SAFE_CALL(
      nvrtcCreateProgram(&prog,                      // prog
                        lto_saxpy,                   // buffer
                        "lto_saxpy.cu",              // name
                        0,                           // numHeaders
                        NULL,                        // headers
                        NULL));                      // includeNames

   // specify that LTO IR should be generated for LTO operation
   const char *opts[] = {"-dlto",
                           "--relocatable-device-code=true"};
   nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                   2,     // numOptions
                                                   opts); // options
   // Obtain compilation log from the program.
   size_t logSize;
   NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
   char *log = new char[logSize];
   NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
   std::cout << log << '\n';
   delete[] log;
   if (compileResult != NVRTC_SUCCESS) {
      exit(1);
   }
   // Obtain generated LTO IR from the program.
   size_t LTOIRSize;
   NVRTC_SAFE_CALL(nvrtcGetLTOIRSize(prog, &LTOIRSize));
   char *LTOIR = new char[LTOIRSize];
   NVRTC_SAFE_CALL(nvrtcGetLTOIR(prog, LTOIR));
   // Destroy the program.
   NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

   CUdevice cuDevice;
   CUcontext context;
   CUmodule module;
   CUfunction kernel;
   CUDA_SAFE_CALL(cuInit(0));
   CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
   CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));

   // Load the generated LTO IR and the LTO IR generated offline
   // and link them together.
   nvJitLinkHandle handle;
   // Dynamically determine the arch to link for
   int major = 0;
   int minor = 0;
   CUDA_SAFE_CALL(cuDeviceGetAttribute(&major,
                     CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice));
   CUDA_SAFE_CALL(cuDeviceGetAttribute(&minor,
                     CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice));
   int arch = major*10 + minor;
   char smbuf[16];
   sprintf(smbuf, "-arch=sm_%d", arch);
   const char *lopts[] = {"-dlto", smbuf};
   NVJITLINK_SAFE_CALL(handle, nvJitLinkCreate(&handle, 2, lopts));

   // NOTE: assumes "offline.fatbin" is in the current directory
   // The fatbinary contains LTO IR generated offline using nvcc
   NVJITLINK_SAFE_CALL(handle, nvJitLinkAddFile(handle, NVJITLINK_INPUT_FATBIN,
                                 "offline.fatbin"));
   NVJITLINK_SAFE_CALL(handle, nvJitLinkAddData(handle, NVJITLINK_INPUT_LTOIR,
                              (void *)LTOIR, LTOIRSize, "lto_online"));

   // The call to nvJitLinkComplete causes linker to link together the two
   // LTO IR modules (offline and online), do optimization on the linked LTO IR,
   // and generate cubin from it.
   NVJITLINK_SAFE_CALL(handle, nvJitLinkComplete(handle));
   size_t cubinSize;
   NVJITLINK_SAFE_CALL(handle, nvJitLinkGetLinkedCubinSize(handle, &cubinSize));
   void *cubin = malloc(cubinSize);
   NVJITLINK_SAFE_CALL(handle, nvJitLinkGetLinkedCubin(handle, cubin));
   NVJITLINK_SAFE_CALL(handle, nvJitLinkDestroy(&handle));


   CUDA_SAFE_CALL(cuModuleLoadData(&module, cubin));
   CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, "saxpy"));

   // Generate input for execution, and create output buffers.
   size_t n = NUM_THREADS * NUM_BLOCKS;
   size_t bufferSize = n * sizeof(float);
   float a = 5.1f;
   float *hX = new float[n], *hY = new float[n], *hOut = new float[n];
   for (size_t i = 0; i < n; ++i) {
      hX[i] = static_cast<float>(i);
      hY[i] = static_cast<float>(i * 2);
   }
   CUdeviceptr dX, dY, dOut;
   CUDA_SAFE_CALL(cuMemAlloc(&dX, bufferSize));
   CUDA_SAFE_CALL(cuMemAlloc(&dY, bufferSize));
   CUDA_SAFE_CALL(cuMemAlloc(&dOut, bufferSize));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dX, hX, bufferSize));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dY, hY, bufferSize));
   // Execute SAXPY.
   void *args[] = { &a, &dX, &dY, &dOut, &n };
   CUDA_SAFE_CALL(
      cuLaunchKernel(kernel,
                     NUM_BLOCKS, 1, 1,    // grid dim
                     NUM_THREADS, 1, 1,   // block dim
                     0, NULL,             // shared mem and stream
                     args, 0));           // arguments
   CUDA_SAFE_CALL(cuCtxSynchronize());
   // Retrieve and print output.
   CUDA_SAFE_CALL(cuMemcpyDtoH(hOut, dOut, bufferSize));

   for (size_t i = 0; i < n; ++i) {
      std::cout << a << " * " << hX[i] << " + " << hY[i]
               << " = " << hOut[i] << '\n';
   }
   // Release resources.
   CUDA_SAFE_CALL(cuMemFree(dX));
   CUDA_SAFE_CALL(cuMemFree(dY));
   CUDA_SAFE_CALL(cuMemFree(dOut));
   CUDA_SAFE_CALL(cuModuleUnload(module));
   CUDA_SAFE_CALL(cuCtxDestroy(context));
   free(cubin);
   delete[] hX;
   delete[] hY;
   delete[] hOut;
   delete[] LTOIR;
   return 0;
}

15.3. 设备LTO构建指南

假设环境变量 CUDA_PATH 指向 CUDA 工具包的安装目录，按以下方式构建此示例：

将offline.cu编译为包含LTO IR的fatbinary（根据实际情况将lto_52更改为其他lto_XX架构）。
```
nvcc -arch lto_52 -rdc=true -fatbin offline.cu
```

使用NVRTC共享库：

Windows:

cl.exe online.cpp /Feonline ^
   /I "%CUDA_PATH%\include" ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib ^
   "%CUDA_PATH%"\lib\x64\nvJitLink.lib ^
   "%CUDA_PATH%"\lib\x64\cuda.lib

Linux系统:

g++ online.cpp -o online \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lnvJitLink -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用NVRTC静态库：

Windows:

cl.exe online.cpp /Feonline  ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvJitLink_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
   "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux系统:

g++ online.cpp -o online \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvJitLink_static -lnvptxcompiler_static -lcuda \
   -lpthread

16. 示例：自动PCH（CUDA 12.8+）

本示例演示自动PCH模式，该模式通过在调用nvrtcCompileProgram时传入-pch参数启用。这里有两个不同程序first和second，它们都包含相同的头文件：

const char *first  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"auto_pch_common.h\" \n"
                      "__global__ void other(double *a, double *b) {\n"
                      "*a = *b + doit();\n}\n";

当使用NVRTC编译first并带有-pch参数时，编译器将为包含auto_pch_common.h头文件创建一个PCH文件。当使用-pch编译second时，编译器将透明地使用先前创建的PCH文件。

程序运行时的输出如下：

compiling first program
"default_program": creating precompiled header file "default_program.pch"

 nvrtcGetPCHCreateStatus returned : NVRTC_SUCCESS

 compiling second program (expect to use PCH)
"default_program": using precompiled header file "default_program.pch"

 nvrtcGetPCHCreateStatus returned : NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED

在编译first程序时，PCH文件成功创建。在编译second程序时， PCH文件被成功使用，但编译器选择不再创建另一个PCH文件（因此 nvrtcGetPCHCreateStatus()返回了NVRC_ERROR_NO_PCH_CREATE_ATTEMPTED - 这是预期行为）。

16.1. 代码 (auto_pch_common.h)

__device__ double qqq = 10;

template <typename T>
__device__ double get(T in) { return sin(in) + qqq; }

__device__  double doit() { return get<double>(0.5); }

16.2. 代码 (auto_pch.cpp)

#include <iostream>
#include <nvrtc.h>

#define NVRTC_SAFE_CALL(x)                                        \
  do {                                                            \
    nvrtcResult result = x;                                       \
    if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
    }                                                             \
  } while(0)


const char *docompile(const char *progstr)
{
  nvrtcProgram prog;

  NVRTC_SAFE_CALL(
    nvrtcCreateProgram(&prog,
      progstr,         // buffer
      "",              // name
      0,               // numHeaders
      NULL,            // headers
      NULL));          // includeNames

  const char *opts[] = { "-pch" /* automatic PCH */
                       };

  nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
    sizeof(opts) / sizeof(opts[0]),     // numOptions
    opts); // options


  // Obtain compilation log from the program.
  size_t logSize;
  NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
  char* log = new char[logSize];
  NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
  std::cout << log;
  delete[] log;
  if (compileResult != NVRTC_SUCCESS) {
    exit(1);
  }
  nvrtcResult err = nvrtcGetPCHCreateStatus(prog);
  std::cout << "\n nvrtcGetPCHCreateStatus returned : "
            << nvrtcGetErrorString(err) << std::endl;

  // Obtain PTX from the program.
  size_t ptxSize;
  NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
  char* ptx = new char[ptxSize];
  NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
  // Destroy the program.
  NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

  return ptx;
}

int main()
{
const char *first  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void other(double *a, double *b) {\n"
                     "*a = *b + doit();\n}\n";

std::cout << "\n compiling first program\n";
const char *ptr1 = docompile(first);

std::cout << "\n compiling second program (expect to use PCH)\n";
const char *ptr2 = docompile(second);

delete [] ptr1;
delete [] ptr2;

}

16.3. 自动PCH构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包的安装目录，按以下方式构建此示例：

使用NVRTC共享库：

Windows:

cl.exe auto_pch.cpp /Feauto_pch ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux系统:

g++ auto_pch.cpp -o auto_pch \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用NVRTC静态库：

Windows系统:

cl.exe auto_pch.cpp /Feauto_pch  ^
    /I "%CUDA_PATH%"\include ^
    "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
    "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux系统:

g++ auto_pch.cpp -o auto_pch \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

17. 示例：显式PCH创建与使用（CUDA 12.8+）

本示例演示显式PCH模式，用户代码可以分别通过"-create-pch=filename"和"-use-pch=filename"这两个NVRTC标志显式创建和使用PCH文件。这里有两个不同的程序first和second，它们都包含相同的头文件：

const char *first  = "#include \"common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"common.h\" \n"
                      "__global__ void other(double *a, double *b) {\n"
                      "*a = *b + doit();\n}\n";

当使用NVRTC编译first并指定-create-pch=foo.pch参数时，编译器将为包含common.h头文件创建一个PCH文件("foo.pch")。当使用-use-pch=foo.pch参数编译second时，编译器将使用指定的PCH文件"foo.pch"。

程序运行时的输出如下：

compiling first program
"default_program": creating precompiled header file "foo.pch"

 nvrtcGetPCHCreateStatus returned : NVRTC_SUCCESS

 compiling second program (expect to use PCH)
"default_program": using precompiled header file "foo.pch"

 nvrtcGetPCHCreateStatus returned : NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED

在编译first程序时，成功创建了PCH文件"foo.pch"。在编译second程序时，成功使用了PCH文件"foo.pch"；但没有创建新的PCH文件(因此 nvrtcGetPCHCreateStatus()返回了NVRC_ERROR_NO_PCH_CREATE_ATTEMPTED)。

17.1. 代码 (common.h)

__device__ double qqq = 10;

template <typename T>
__device__ double get(T in) { return sin(in) + qqq; }

__device__  double doit() { return get<double>(0.5); }

17.2. 代码 (explicit_pch.cpp)

#include <iostream>
#include <nvrtc.h>

#define NVRTC_SAFE_CALL(x)                                        \
  do {                                                            \
    nvrtcResult result = x;                                       \
    if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
    }                                                             \
  } while(0)


const char *docompile(const char *progstr, const char *pch_mode)
{
  nvrtcProgram prog;

  NVRTC_SAFE_CALL(
    nvrtcCreateProgram(&prog,
      progstr,         // buffer
      "",              // name
      0,               // numHeaders
      NULL,            // headers
      NULL));          // includeNames

  const char *opts[] = { pch_mode /*create/use PCH */
                       };

  nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
    sizeof(opts) / sizeof(opts[0]),     // numOptions
    opts); // options


  // Obtain compilation log from the program.
  size_t logSize;
  NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
  char* log = new char[logSize];
  NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
  std::cout << log;
  delete[] log;
  if (compileResult != NVRTC_SUCCESS) {
    exit(1);
  }
  nvrtcResult err = nvrtcGetPCHCreateStatus(prog);
  std::cout << "\n nvrtcGetPCHCreateStatus returned : " << nvrtcGetErrorString(err) << std::endl;

  // Obtain PTX from the program.
  size_t ptxSize;
  NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
  char* ptx = new char[ptxSize];
  NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
  // Destroy the program.
  NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

  return ptx;
}

int main()
{
const char *first  = "#include \"common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"common.h\" \n"
                     "__global__ void other(double *a, double *b) {\n"
                     "*a = *b + doit();\n}\n";

std::cout << "\n compiling first program\n";
const char *ptr1 = docompile(first, "-create-pch=foo.pch");

std::cout << "\n compiling second program (expect to use PCH)\n";
const char *ptr2 = docompile(second, "-use-pch=foo.pch");

delete [] ptr1;
delete [] ptr2;

}

17.3. 显式PCH构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包的安装目录，按以下方式构建此示例：

使用NVRTC共享库：

Windows:

cl.exe explicit_pch.cpp /Feexplicit_pch ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux系统:

g++ explicit_pch.cpp -o explicit_pch \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用NVRTC静态库：

Windows:

cl.exe explicit_pch.cpp /Feexplicit_pch  ^
    /I "%CUDA_PATH%"\include ^
    "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
    "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux系统:

g++ explicit_pch.cpp -o explicit_pch \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

18. 示例：PCH堆大小调整（CUDA 12.8+）

PCH堆在nvrtcCompileProgram()调用之间是持久化的。在内存受限的环境中，可能需要将PCH堆的大小设置为比默认值更小的值。此示例展示了如何调整PCH堆的大小。首先，将堆大小设置为较低的值（8 KB）6：

NVRTC_SAFE_CALL(nvrtcSetPCHHeapSize(8*1024));

然后，调用nvrtcCompileProgram()并传入-pch参数。预计PCH创建会失败，此时nvrtcGetPCHCreateStatus()将返回 NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED错误。随后可通过调用nvrtcGetPCHHeapSizeRequired()获取所需的PCH堆大小，并通过调用nvrtcSetPCHHeapSize()来调整PCH堆大小：

nvrtcResult err = nvrtcGetPCHCreateStatus(prog);
std::cout << "\nnvrtcGetPCHCreateStatus returned : " << nvrtcGetErrorString(err) << std::endl;

if (err == NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED) {
    size_t size;
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSize(&size));
    ...
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSizeRequired(prog, &size));
    ..
    NVRTC_SAFE_CALL(nvrtcSetPCHHeapSize(size));
}

下一次使用相同文件请求PCH创建的NVRTC编译现在预计会成功。以下是程序的输出：

compiling first program
auto_pch_common.h(7): warning #639-D: insufficient preallocated memory for generation of precompiled header file (4481024 bytes required)
  __device__  double doit() { return get<double>(0.5); }
                                                       ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"


nvrtcGetPCHCreateStatus returned : NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED
nvrtcGetPCHHeapSize() before: 8192
nvrtcGetPCHHeapSizeRequired() reports: 4481024
nvrtcGetPCHHeapSize() after: 4481024

 compiling second program (expect to use PCH)
"default_program": creating precompiled header file "default_program.pch"

nvrtcGetPCHCreateStatus returned : NVRTC_SUCCESS

6: 但不能是0，因为这会禁用PCH操作。

18.1. 代码 (auto_pch_common.h)

__device__ double qqq = 10;

template <typename T>
__device__ double get(T in) { return sin(in) + qqq; }

__device__  double doit() { return get<double>(0.5); }

18.2. 代码 (pch_resize.cpp)

#include <iostream>
#include <nvrtc.h>

#define NVRTC_SAFE_CALL(x)                                        \
  do {                                                            \
    nvrtcResult result = x;                                       \
    if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
    }                                                             \
  } while(0)


const char *docompile(const char *progstr)
{
  nvrtcProgram prog;

  NVRTC_SAFE_CALL(
    nvrtcCreateProgram(&prog,
      progstr,         // buffer
      "",              // name
      0,               // numHeaders
      NULL,            // headers
      NULL));          // includeNames

  const char *opts[] = { "-pch"  /* automatic PCH */
                       };

  nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
    sizeof(opts) / sizeof(opts[0]),     // numOptions
    opts); // options


  // Obtain compilation log from the program.
  size_t logSize;
  NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
  char* log = new char[logSize];
  NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
  std::cout << log;
  delete[] log;
  if (compileResult != NVRTC_SUCCESS) {
    exit(1);
  }
  nvrtcResult err = nvrtcGetPCHCreateStatus(prog);
  std::cout << "\nnvrtcGetPCHCreateStatus returned : " << nvrtcGetErrorString(err) << std::endl;

  if (err == NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED) {
    size_t size;
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSize(&size));
    std::cout << "nvrtcGetPCHHeapSize() before: " << size << std::endl;
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSizeRequired(prog, &size));
    std::cout << "nvrtcGetPCHHeapSizeRequired() reports: " << size << std::endl;
    NVRTC_SAFE_CALL(nvrtcSetPCHHeapSize(size));
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSize(&size));
    std::cout << "nvrtcGetPCHHeapSize() after: " << size << std::endl;
  }

  // Obtain PTX from the program.
  size_t ptxSize;
  NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
  char* ptx = new char[ptxSize];
  NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
  // Destroy the program.
  NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

  return ptx;
}

int main()
{
const char *first  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void other(double *a, double *b) {\n"
                     "*a = *b + doit();\n}\n";

//set NVRTC PCH heap to a low initial value (8 KB) (note: don't use 0)
NVRTC_SAFE_CALL(nvrtcSetPCHHeapSize(8*1024));

std::cout << "\n compiling first program\n";
const char *ptr1 = docompile(first);

std::cout << "\n compiling second program (expect to use PCH)\n";
const char *ptr2 = docompile(second);

delete [] ptr1;
delete [] ptr2;

}

18.3. PCH堆大小调整构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包的安装目录，按以下方式构建此示例：

使用NVRTC共享库：

Windows:

cl.exe pch_resize.cpp /Fepch_resize ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux:

g++ pch_resize.cpp -o pch_resize \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用NVRTC静态库：

Windows:

cl.exe pch_resize.cpp /Fepch_resize  ^
    /I "%CUDA_PATH%"\include ^
    "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
    "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux系统:

g++ pch_resize.cpp -o pch_resize \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

18.4. 通知

18.4.1. 注意事项

本文档仅供信息参考之用，不应视为对产品功能、状态或质量的保证。NVIDIA公司（“NVIDIA”）对本文件所含信息的准确性或完整性不作任何明示或暗示的陈述或保证，并对其中可能存在的错误不承担任何责任。NVIDIA对于因使用此类信息而产生的后果、或因使用该信息导致的第三方专利或其他权利侵权概不负责。本文件不构成对开发、发布或交付任何材料（定义见下文）、代码或功能的承诺。

NVIDIA保留随时对本文件进行更正、修改、增强、改进以及任何其他变更的权利，恕不另行通知。

客户在下单前应获取最新的相关信息，并确认这些信息是最新且完整的。

除非NVIDIA与客户授权代表签署的单独销售协议中另有约定，否则NVIDIA产品的销售均以订单确认时提供的NVIDIA标准销售条款和条件为准（以下简称"销售条款"）。NVIDIA特此明确反对将任何客户通用条款适用于本文件所述NVIDIA产品的采购。本文件不直接或间接构成任何合同义务。

NVIDIA产品并非设计、授权或保证适用于医疗、军事、航空、航天或生命支持设备，也不适用于那些可以合理预期NVIDIA产品故障或失灵会导致人身伤害、死亡、财产或环境损害的应用场景。NVIDIA对于在此类设备或应用中使用和/或包含NVIDIA产品不承担任何责任，因此客户需自行承担相关风险。

NVIDIA不声明或保证基于本文档的产品适用于任何特定用途。NVIDIA未必会对每个产品的所有参数进行测试。客户应全权负责评估和确定本文档所含信息的适用性，确保产品适合并满足客户计划的应用需求，并执行必要的应用测试以避免应用或产品出现故障。客户产品设计中的缺陷可能会影响NVIDIA产品的质量和可靠性，并可能导致超出本文档范围的其他或不同的条件和/或要求。对于任何因以下原因导致的故障、损坏、成本或问题，NVIDIA不承担任何责任：(i) 以违反本文档的任何方式使用NVIDIA产品或(ii) 客户产品设计。

本文档不授予任何NVIDIA专利权、版权或其他NVIDIA知识产权的明示或暗示许可。NVIDIA发布的关于第三方产品或服务的信息，不构成NVIDIA对这些产品或服务的使用许可或担保认可。使用此类信息可能需要获得第三方基于其专利或其他知识产权的许可，或需要获得NVIDIA基于其专利或其他知识产权的许可。

本文件中的信息仅可在获得NVIDIA事先书面批准、未经改动完整复制且完全符合所有适用的出口法律法规，并附带所有相关条件、限制和声明的情况下进行复制。

本文件及所有NVIDIA设计规格、参考板、文件、图纸、诊断工具、清单和其他文档（统称及单独称为"材料"）均以"现状"提供。NVIDIA不对材料作出任何明示或默示的保证，包括但不限于对不侵权、适销性和特定用途适用性的默示保证免责。在法律允许的最大范围内，NVIDIA不就因使用本文件导致的任何损害承担责任，包括但不限于任何直接、间接、特殊、附带、惩罚性或后果性损害，无论损害成因如何，也无论责任理论为何，即使NVIDIA已被告知发生此类损害的可能性。不论客户因任何原因可能遭受的任何损害，NVIDIA对客户就本文所述产品的全部及累计责任应受产品销售条款的限制。

18.4.2. OpenCL

OpenCL是苹果公司的商标，经Khronos Group Inc.授权使用。

18.4.3. 商标

NVIDIA和NVIDIA标识是美国及其他国家NVIDIA公司的商标或注册商标。其他公司及产品名称可能是其各自关联公司的商标。

1: 新版本NVRTC共享库中对编译器优化器启发式算法的更改也可能对生成代码的性能造成潜在影响。
2: 这些API函数会修改关联的nvrtcProgram状态。