3. 设备API概述

要使用设备API,请在定义使用cuRAND设备函数的核函数文件中包含curand_kernel.h文件。该设备API包含伪随机生成准随机生成功能函数。

3.1. 伪随机序列

伪随机序列生成函数支持位生成和从分布中生成。

3.1.1. 使用XORWOW和MRG32k3a生成器进行位生成

__device__ unsigned int
curand(curandStateXORWOW_t *state)

__device__ unsigned int
curand(curandStateMRG32k3a_t *state)

在调用curand_init()之后,curand()会返回一个周期大于 2 190 的伪随机数序列。如果每次调用curand()时使用相同的初始状态,并且在两次调用curand()之间没有修改状态,那么总是会生成相同的序列。

__device__ void
curand_init(unsigned long long seed,
            unsigned long long sequence,
            unsigned long long offset,
            curandStateXORWOW_t *state)

__device__ void
curand_init(unsigned long long seed,
            unsigned long long sequence,
            unsigned long long offset,
            curandStateMRG32k3a_t  *state)

curand_init() 函数使用给定的种子、序列号和序列内偏移量来初始化由调用者分配的状态。 不同的种子保证会产生不同的初始状态和不同序列。相同的种子总是产生 相同的状态和相同的序列。设置的状态将是种子状态经过 2 67 sequence + offsetcurand() 调用后的状态。

使用不同种子生成的序列通常不具有统计相关性,但某些种子选择可能会导致序列在统计上相关。使用相同种子但不同序列号生成的序列将不会具有统计相关性。

为了获得最高质量的并行伪随机数生成,每个实验应分配一个唯一的种子。在实验内部,每个计算线程应分配一个唯一的序列号。如果实验跨越多个内核启动,建议在内核启动之间为线程分配相同的种子,并以单调递增的方式分配序列号。如果启动相同的线程配置,可以在启动之间将随机状态保存在全局内存中,以避免状态设置时间。

3.1.2. 使用MTGP32生成器进行比特生成

MTGP32生成器是基于广岛大学开发的代码改编而成(参见[1])。在该算法中,会为多个序列生成样本,每个序列基于一组计算参数。cuRAND使用了为32位生成器预生成的200组参数集,其周期为211214。如[1]所述,也可以生成其他参数集并替代使用。每个参数集(序列)对应一个状态结构,该算法允许每个序列在最多256个并发线程(单个块内)的情况下进行线程安全的生成和状态更新。

请注意,两个不同的块无法安全地对同一状态进行操作。 另外请注意,在一个块内,最多256个线程可以对给定状态进行操作。

对于MTGP32生成器,提供了两个主机函数来帮助在设备内存中为不同序列设置参数,并初始化状态。

__host__ curandStatust curandMakeMTGP32Constants(mtgp32paramsfastt params[],
                                                 mtgp32kernelparamst *p)

此函数将预生成格式(mtgp32_params_fast_t)的参数集数据重新组织为内核函数使用的格式(mtgp32_kernel_params_t),并将其复制到设备内存中。

__host__ curandStatus_t
curandMakeMTGP32KernelState(curandStateMtgp32_t *s,
                            mtgp32_params_fast_t params[],
                            mtgp32_kernel_params_t *k,
                            int n,
                            unsigned long long seed)

该函数根据指定的参数集和种子初始化n个状态,并将它们复制到s指向的设备内存中。请注意,如果使用预生成的状态,n的最大值为200。

cuRAND MTGP32生成器提供了两个内核函数来生成随机位。

__device__ unsigned int
curand(curandStateMtgp32_t *state)

该函数计算线程索引,并根据该索引生成结果并更新状态。线程索引 t 的计算方式为:

t = (blockDim.x * blockDim.y * threadIdx.z) + (blockDim.x * threadIdx.y) + threadIdx.x

该函数可以从单个内核启动中重复调用,但需遵守以下约束条件:

  • 仅可从线程数不超过256的块中安全调用。
  • 一个给定的状态不能被多个块同时使用。
  • 一个给定的块可以使用多个状态生成随机数。
  • 在代码的某个特定点,块中的所有线程要么都必须调用此函数,要么都不调用。
__device__ unsigned int
curandmtgp32specific(curandStateMtgp32_t *state, unsigned char index,
                     unsigned char n)

此函数会生成结果并更新由线程特定index指定的位置状态,同时将状态中的偏移量前进n个位置。curand_mtgp32_specific可以在内核启动期间被多次调用,但需遵守以下限制条件:

  • 对于给定的状态,最多256个线程可以调用此函数。
  • 在一个块中,对于给定状态,如果有n个线程调用该函数,索引必须从0...n-1运行。索引不必与线程编号匹配,可以根据调用程序的要求在多个线程之间分配。在代码的某个特定点,必须使用从0...n-1的所有索引,或者完全不使用。
  • 一个给定的状态不能被多个块同时使用。
  • 一个给定的块可以使用多个状态生成随机数。
Figure 1. MTGP32 Block and Thread Operation
Illustration of MTGP32 progressing through its state array

图1展示了MTGP32中块和线程如何操作生成器状态。每行代表一个由32位整数s(n)组成的循环状态数组。操作该数组的线程标识为T(m)。所示特定情况与主机API的内部实现相匹配,该API启动64个块,每个块包含256个线程。每个块根据一组唯一参数P(n)操作不同的序列。MTGP32序列的一个完整状态由351个32位整数定义。每个线程T(m)操作其中一个整数s(n+m),将其与s(n+m+1)和拾取元素s(n+m+p)(其中p <= 95)结合。它将新状态存储在状态数组的s(n+m+351)位置。线程同步后,基础索引n会根据已更新状态的线程数前进。为避免被覆盖,数组本身的长度必须至少为256 + 351个整数。实际上,为了提高索引效率,其大小设置为1024个整数。

对块中线程数量的限制(这些线程可操作给定状态数组)源于需要确保状态 s(n+351)在被用作拾取状态前已完成更新。如果存在线程 T(256),它可能会在线程零更新 s(n+351)之前使用 s(n+256+95)s(n+351)。若应用程序要求块中超过256个线程调用MTGP32生成器函数,则必须使用多个MTGP32状态,可通过使用多组参数或不同种子的多个生成器实现。另请注意,生成器函数在每次调用结束时会对线程进行同步,因此最有效的方式是让块中的256个线程调用该生成器。

3.1.3. 使用Philox_4x32_10生成器进行位生成

__device__ unsigned int
curand(curandStatePhilox4_32_10_t *state)

在调用curand_init()之后,curand()会返回一个周期为 2 128 的伪随机数序列。如果每次使用相同的初始状态调用curand(),并且在两次调用curand()之间没有修改状态,那么总是会生成相同的序列。

__device__ void
curand_init(unsigned long long seed,
            unsigned long long subsequence,
            unsigned long long offset,
            curandStatePhilox4_32_10_t *state)

curand_init() 函数使用给定的种子、子序列和偏移量来初始化由调用者分配的状态。不同的种子保证会产生不同的初始状态和不同序列。子序列和偏移量共同定义了周期为2128的序列中的偏移位置。偏移量定义了长度为264的子序列中的偏移量。当子序列的最后一个元素生成后,下一个随机数将是连续子序列的第一个元素。相同的种子总是产生相同的状态和相同的序列。

使用不同种子生成的序列通常不具有统计相关性,但某些种子选择可能会导致序列在统计上相关。

为了获得最高质量的并行伪随机数生成,每个实验应分配一个唯一的种子值。在实验内部,每个计算线程应分配一个唯一的ID号。如果实验跨越多个内核启动,建议在内核启动之间为线程分配相同的种子,并以单调递增的方式分配ID号。如果启动相同的线程配置,可以在启动之间将随机状态保存在全局内存中,以避免状态设置时间。

3.1.4. 分布

__device__ float
curand_uniform (curandState_t *state)

该函数返回一个均匀分布在0.0到1.0之间的伪随机浮点数序列。返回值可能从0.0到1.0,其中包含1.0但不包含0.0。分布函数可以使用基础生成器产生的任意数量的无符号整数值。消耗的数值数量不保证是固定的。

__device__ float
curand_normal (curandState_t *state)

该函数返回一个均值为0.0、标准差为1.0的服从正态分布的浮点数。此结果可通过缩放和偏移来生成具有任意均值和标准差的正态分布值。

__device__ float
curand_log_normal (curandState_t *state, float mean, float stddev)

该函数基于给定均值和标准差的正态分布,返回一个对数正态分布的浮点数。

__device__ unsigned int
curand_poisson (curandState_t *state, double lambda)

该函数基于给定lambda值的泊松分布返回一个无符号整数。根据lambda值和生成器类型的不同,从均匀分布结果推导出泊松结果的算法会有所变化。某些算法在生成单个输出时会抽取多个样本。

__device__ double
curand_uniform_double (curandState_t *state)
__device__ double
curand_normal_double (curandState_t *state)
__device__ double
curand_log_normal_double (curandState_t *state, double mean, double stddev)

上述三个函数是curand_uniform()curand_normal()curand_log_normal()的双精度版本。

对于伪随机数生成器,双精度函数会多次调用curand()来生成53位随机数。

__device__ float2
curand_normal2 (curandState_t *state)
__device__ float2
curand_log_normal2 (curandState_t *state)
__device__ double2
curand_normal2_double (curandState_t *state)
__device__ double2
curand_log_normal2_double (curandState_t *state)

上述函数每次调用会生成两个服从正态或对数正态分布的伪随机结果。由于底层实现采用了Box-Muller变换方法,这通常比每次调用只生成单个结果效率更高。

__device__ uint4
curand4 (curandStatePhilox4_32_10_t *state)
__device__ float4
curand_uniform4 (curandStatePhilox4_32_10_t *state)
__device__ float4
curand_normal4 (curandStatePhilox4_32_10_t *state)
__device__ float4
curand_log_normal4 (curandStatePhilox4_32_10_t *state, float mean, float stddev)
__device__ uint4
curand_poisson4 (curandStatePhilox4_32_10_t *state, double lambda)
__device__ uint4
curand_discrete4 (curandStatePhilox4_32_10_t *state, curandDiscreteDistribution_t discrete_distribution)
__device__ double2
curand_uniform2_double (curandStatePhilox4_32_10_t *state)
__device__ double2
curand_normal2_double (curandStatePhilox4_32_10_t *state)
__device__ double2
curand_log_normal2_double (curandStatePhilox4_32_10_t *state, double mean, double stddev)

上述函数每次调用会生成四个单精度或两个双精度结果。由于底层实现采用了Philox生成器,这通常比每次调用生成单个结果更为高效。

3.2. 准随机序列

虽然默认生成器类型是基于XORWOW的伪随机数,但也可以使用以下函数生成基于Sobol 32位整数的Sobol序列:

__device__ void
curand_init (
    unsigned int *direction_vectors, 
    unsigned int offset,
    curandStateSobol32_t *state)
__device__ void
curand_init (
    unsigned int *direction_vectors,
    unsigned int scramble_c, 
    unsigned int offset,
    curandStateScrambledSobol32_t *state)
__device__ unsigned int
curand (curandStateSobol32_t *state)
__device__ float
curand_uniform (curandStateSobol32_t *state)
__device__ float 
curand_normal (curandStateSobol32_t *state)
__device__ float 
curand_log_normal (
    curandStateSobol32_t *state,  
    float mean, 
    float stddev)
__device__ unsigned int 
curand_poisson (curandStateSobol32_t *state, double lambda)
__device__ double
curand_uniform_double (curandStateSobol32_t *state)
__device__ double
curand_normal_double (curandStateSobol32_t *state)
__device__ double
curand_log_normal_double (
    curandStateSobol32_t *state, 
    double mean, 
    double stddev)

curand_init() 函数用于初始化准随机数生成器状态。该函数没有种子参数,仅包含方向向量和偏移量参数。 对于加扰Sobol生成器,还有一个额外参数scramble_c,表示加扰序列的初始值。对于curandStateSobol32_t类型和curandStateScrambledSobol32_t类型,方向向量是由32个无符号整数值组成的数组。对于curandStateSobol64_t类型和curandStateScrambledSobol64_t类型,方向向量是由64个无符号长整型值组成的数组。32位Sobol生成器的偏移量和加扰序列初始常量为无符号整型,64位Sobol生成器的这些参数则为无符号长整型。对于curandStateSobol32_t类型和curandStateScrambledSobol32_t类型,序列长度精确为 2 32 个元素,每个元素32位。对于curandStateSobol64_t类型和curandStateScrambledSobol64_t类型,序列长度精确为 2 64 个元素,每个元素64位。每次调用curand()会返回下一个准随机元素。调用curand_uniform()会返回0.0到1.0之间的准随机浮点数或双精度数(包含1.0但不包含0.0)。类似地,调用curand_normal()会返回均值为0.0、标准差为1.0的正态分布浮点数或双精度数。调用curand_log_normal()会返回对数正态分布的浮点数或双精度数,这些值由具有指定均值和标准差的正态分布派生而来。所有生成函数都可以与任何类型的Sobol生成器配合使用。

例如,生成填充单位立方体的拟随机坐标需要跟踪三个拟随机生成器。这三个生成器的初始offset均为0,对应的维度分别为0、1和2。对每个生成器状态调用一次curand_uniform()即可生成xyz坐标。方向向量表可通过curandGetDirectionVectors32()curandGetDirectionVectors64()函数在主机端访问。所需的方向向量应在使用前复制到设备内存中。

用于准随机生成的正态分布函数使用逆累积密度函数来保持准随机序列的维度。因此,与伪随机生成器不同,没有函数能一次生成多个结果。

双精度Sobol32函数返回使用底层生成器32位内部精度的双精度结果。

双精度Sobol64函数返回使用底层生成器53位内部精度的双精度结果。这些位取自64位样本的高53位。

3.3. 跳过前置步骤

有多个函数可以从生成器状态向前跳过。

__device__ void
skipahead(unsigned long long n, curandState_t *state)
__device__ void
skipahead(unsigned int n, curandStateSobol32_t *state)

使用此函数相当于调用curand() n 次而不使用返回值,但速度要快得多。

__device__ void
skipahead_sequence(unsigned long long n, curandState_t *state)

此函数等效于调用curand() n 2 67 次而不使用返回值,且速度更快。

3.4. 离散分布的设备API

离散分布(如泊松分布)需要额外的API在主机端进行预处理,以生成特定分布的直方图。对于泊松分布而言,不同lambda值对应的直方图会有所不同。这些分布在具有至少48KB L1缓存的GPU上能获得最佳性能。

curandStatus_t 
curandCreatePoissonDistribution(
    double lambda, 
    curandDiscreteDistribution_t *discrete_distribution)

curandCreatePoissonDistribution() 函数用于为给定lambda值的泊松分布创建直方图。

__device__ unsigned int 
curand_discrete (
    curandState_t *state,
    curandDiscreteDistribution_t discrete_distribution)

该函数基于给定的离散分布直方图返回一个离散分布的无符号整数。

curandStatus_t 
curandDestroyDistribution(
    curandDiscreteDistribution_t discrete_distribution)

curandDestroyDistribution() 函数用于清理与直方图相关的结构。

3.5. 性能说明

调用curand_init()比调用curand()curand_uniform()更慢。对curand_init()使用大的偏移量比较小的偏移量需要更多时间。保存和恢复随机数生成器状态比重复重新计算初始状态要快得多。

如下图所示,生成器状态可以在内核启动之间存储在全局内存中,在本地内存中用于快速生成,然后再存储回全局内存。

__global__ void example(curandState *global_state)
{
    curandState local_state;
    local_state = global_state[threadIdx.x];
    for(int i = 0; i < 10000; i++) {
        unsigned int x = curand(&local_state);
        ...
    }
    global_state[threadIdx.x] = local_state;
}

随机数生成器状态的初始化通常比随机数生成需要更多的寄存器和本地内存。 为了获得最佳性能,将curand_init()curand()的调用分离到不同的内核中可能是有益的。

状态设置可能是一项开销较大的操作。一种加速设置的方法是让每个线程使用不同的种子值,并将序列号固定为0。当需要创建大量生成器时,这种方法尤为有效。虽然设置速度更快,但该方法对生成序列的数学特性提供的保证较少。如果用于从种子初始化生成器状态的哈希函数与生成器的周期性之间产生不良相互作用,某些种子值可能导致线程输出高度相关的结果。目前尚未发现具体的问题数值;即使存在,其出现概率可能极低。

3.6. 设备API示例

此示例使用cuRAND设备API,通过XORWOW或MRG32k3a生成器生成伪随机数。对于整数,它计算低位设置为1的比例。对于均匀分布的实数,它计算大于0.5的比例。对于正态分布的实数,它计算在均值一个标准差范围内的比例。

/*
 * This program uses the device CURAND API to calculate what
 * proportion of pseudo-random ints have low bit set.
 * It then generates uniform results to calculate how many
 * are greater than .5.
 * It then generates  normal results to calculate how many
 * are within one standard deviation of the mean.
 */
#include <stdio.h>
#include <stdlib.h>

#include <cuda_runtime.h>
#include <curand_kernel.h>

#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

__global__ void setup_kernel(curandState *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets same seed, a different sequence
       number, no offset */
    curand_init(1234, id, 0, &state[id]);
}

__global__ void setup_kernel(curandStatePhilox4_32_10_t *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets same seed, a different sequence
       number, no offset */
    curand_init(1234, id, 0, &state[id]);
}

__global__ void setup_kernel(curandStateMRG32k3a *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets same seed, a different sequence
       number, no offset */
    curand_init(0, id, 0, &state[id]);
}

__global__ void generate_kernel(curandState *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    int count = 0;
    unsigned int x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Generate pseudo-random unsigned ints */
    for(int i = 0; i < n; i++) {
        x = curand(&localState);
        /* Check if low bit set */
        if(x & 1) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_kernel(curandStatePhilox4_32_10_t *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    int count = 0;
    unsigned int x;
    /* Copy state to local memory for efficiency */
    curandStatePhilox4_32_10_t localState = state[id];
    /* Generate pseudo-random unsigned ints */
    for(int i = 0; i < n; i++) {
        x = curand(&localState);
        /* Check if low bit set */
        if(x & 1) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_uniform_kernel(curandState *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    float x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Generate pseudo-random uniforms */
    for(int i = 0; i < n; i++) {
        x = curand_uniform(&localState);
        /* Check if > .5 */
        if(x > .5) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_uniform_kernel(curandStatePhilox4_32_10_t *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    float x;
    /* Copy state to local memory for efficiency */
    curandStatePhilox4_32_10_t localState = state[id];
    /* Generate pseudo-random uniforms */
    for(int i = 0; i < n; i++) {
        x = curand_uniform(&localState);
        /* Check if > .5 */
        if(x > .5) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_normal_kernel(curandState *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    float2 x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Generate pseudo-random normals */
    for(int i = 0; i < n/2; i++) {
        x = curand_normal2(&localState);
        /* Check if within one standard deviaton */
        if((x.x > -1.0) && (x.x < 1.0)) {
            count++;
        }
        if((x.y > -1.0) && (x.y < 1.0)) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_normal_kernel(curandStatePhilox4_32_10_t *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    float2 x;
    /* Copy state to local memory for efficiency */
    curandStatePhilox4_32_10_t localState = state[id];
    /* Generate pseudo-random normals */
    for(int i = 0; i < n/2; i++) {
        x = curand_normal2(&localState);
        /* Check if within one standard deviaton */
        if((x.x > -1.0) && (x.x < 1.0)) {
            count++;
        }
        if((x.y > -1.0) && (x.y < 1.0)) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_kernel(curandStateMRG32k3a *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    unsigned int x;
    /* Copy state to local memory for efficiency */
    curandStateMRG32k3a localState = state[id];
    /* Generate pseudo-random unsigned ints */
    for(int i = 0; i < n; i++) {
        x = curand(&localState);
        /* Check if low bit set */
        if(x & 1) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_uniform_kernel(curandStateMRG32k3a *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    double x;
    /* Copy state to local memory for efficiency */
    curandStateMRG32k3a localState = state[id];
    /* Generate pseudo-random uniforms */
    for(int i = 0; i < n; i++) {
        x = curand_uniform_double(&localState);
        /* Check if > .5 */
        if(x > .5) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_normal_kernel(curandStateMRG32k3a *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    double2 x;
    /* Copy state to local memory for efficiency */
    curandStateMRG32k3a localState = state[id];
    /* Generate pseudo-random normals */
    for(int i = 0; i < n/2; i++) {
        x = curand_normal2_double(&localState);
        /* Check if within one standard deviaton */
        if((x.x > -1.0) && (x.x < 1.0)) {
            count++;
        }
        if((x.y > -1.0) && (x.y < 1.0)) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

int main(int argc, char *argv[])
{
    const unsigned int threadsPerBlock = 64;
    const unsigned int blockCount = 64;
    const unsigned int totalThreads = threadsPerBlock * blockCount;

    unsigned int i;
    unsigned int total;
    curandState *devStates;
    curandStateMRG32k3a *devMRGStates;
    curandStatePhilox4_32_10_t *devPHILOXStates;
    unsigned int *devResults, *hostResults;
    bool useMRG = 0;
    bool usePHILOX = 0;
    int sampleCount = 10000;
    bool doubleSupported = 0;
    int device;
    struct cudaDeviceProp properties;

    /* check for double precision support */
    CUDA_CALL(cudaGetDevice(&device));
    CUDA_CALL(cudaGetDeviceProperties(&properties,device));
    if ( properties.major >= 2 || (properties.major == 1 && properties.minor >= 3) ) {
        doubleSupported = 1;
    }

    /* Check for MRG32k3a option (default is XORWOW) */
    if (argc >= 2)  {
        if (strcmp(argv[1],"-m") == 0) {
            useMRG = 1;
            if (!doubleSupported){
                printf("MRG32k3a requires double precision\n");
                printf("^^^^ test WAIVED due to lack of double precision\n");
                return EXIT_SUCCESS;
            }
        }else if (strcmp(argv[1],"-p") == 0) {
		usePHILOX = 1;
	}
        /* Allow over-ride of sample count */
        sscanf(argv[argc-1],"%d",&sampleCount);
    }

    /* Allocate space for results on host */
    hostResults = (unsigned int *)calloc(totalThreads, sizeof(int));

    /* Allocate space for results on device */
    CUDA_CALL(cudaMalloc((void **)&devResults, totalThreads *
              sizeof(unsigned int)));

    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0, totalThreads *
              sizeof(unsigned int)));

    /* Allocate space for prng states on device */
    if (useMRG) {
        CUDA_CALL(cudaMalloc((void **)&devMRGStates, totalThreads *
                  sizeof(curandStateMRG32k3a)));
    }else if(usePHILOX) {
        CUDA_CALL(cudaMalloc((void **)&devPHILOXStates, totalThreads *
                  sizeof(curandStatePhilox4_32_10_t)));
    }else {
        CUDA_CALL(cudaMalloc((void **)&devStates, totalThreads *
                  sizeof(curandState)));
    }

    /* Setup prng states */
    if (useMRG) {
        setup_kernel<<<64, 64>>>(devMRGStates);
    }else if(usePHILOX)
    {
        setup_kernel<<<64, 64>>>(devPHILOXStates);
    }else {
        setup_kernel<<<64, 64>>>(devStates);
    }

    /* Generate and use pseudo-random  */
    for(i = 0; i < 50; i++) {
        if (useMRG) {
            generate_kernel<<<64, 64>>>(devMRGStates, sampleCount, devResults);
        }else if (usePHILOX){
            generate_kernel<<<64, 64>>>(devPHILOXStates, sampleCount, devResults);
	}else {
            generate_kernel<<<64, 64>>>(devStates, sampleCount, devResults);
        }
    }

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostResults, devResults, totalThreads *
        sizeof(unsigned int), cudaMemcpyDeviceToHost));

    /* Show result */
    total = 0;
    for(i = 0; i < totalThreads; i++) {
        total += hostResults[i];
    }
    printf("Fraction with low bit set was %10.13f\n",
        (float)total / (totalThreads * sampleCount * 50.0f));

    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0, totalThreads *
              sizeof(unsigned int)));

    /* Generate and use uniform pseudo-random  */
    for(i = 0; i < 50; i++) {
        if (useMRG) {
            generate_uniform_kernel<<<64, 64>>>(devMRGStates, sampleCount, devResults);
        }else if(usePHILOX) {
            generate_uniform_kernel<<<64, 64>>>(devPHILOXStates, sampleCount, devResults);
	}else {
            generate_uniform_kernel<<<64, 64>>>(devStates, sampleCount, devResults);
        }
    }

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostResults, devResults, totalThreads *
        sizeof(unsigned int), cudaMemcpyDeviceToHost));

    /* Show result */
    total = 0;
    for(i = 0; i < totalThreads; i++) {
        total += hostResults[i];
    }
    printf("Fraction of uniforms > 0.5 was %10.13f\n",
        (float)total / (totalThreads * sampleCount * 50.0f));
    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0, totalThreads *
              sizeof(unsigned int)));

    /* Generate and use normal pseudo-random  */
    for(i = 0; i < 50; i++) {
        if (useMRG) {
            generate_normal_kernel<<<64, 64>>>(devMRGStates, sampleCount, devResults);
        }else if(usePHILOX) {
            generate_normal_kernel<<<64, 64>>>(devPHILOXStates, sampleCount, devResults);
	}else {
            generate_normal_kernel<<<64, 64>>>(devStates, sampleCount, devResults);
        }
    }

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostResults, devResults, totalThreads *
        sizeof(unsigned int), cudaMemcpyDeviceToHost));

    /* Show result */
    total = 0;
    for(i = 0; i < totalThreads; i++) {
        total += hostResults[i];
    }
    printf("Fraction of normals within 1 standard deviation was %10.13f\n",
        (float)total / (totalThreads * sampleCount * 50.0f));

    /* Cleanup */
    if (useMRG) {
        CUDA_CALL(cudaFree(devMRGStates));
    }else if(usePHILOX)
    {
        CUDA_CALL(cudaFree(devPHILOXStates));
    }else {
        CUDA_CALL(cudaFree(devStates));
    }
    CUDA_CALL(cudaFree(devResults));
    free(hostResults);
    printf("^^^^ kernel_example PASSED\n");
    return EXIT_SUCCESS;
}

以下示例使用cuRAND主机MTGP设置API和cuRAND设备API,通过MTGP32生成器生成整数,并计算设置了低位比特的比例。

/*
 * This program uses the device CURAND API to calculate what
 * proportion of pseudo-random ints have low bit set.
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
/* include MTGP host helper functions */
#include <curand_mtgp32_host.h>
/* include MTGP pre-computed parameter sets */
#include <curand_mtgp32dc_p_11213.h>


#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

#define CURAND_CALL(x) do { if((x) != CURAND_STATUS_SUCCESS) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

__global__ void generate_kernel(curandStateMtgp32 *state,
                                int n,
                                int *result)
{
    int id = threadIdx.x + blockIdx.x * 256;
    int count = 0;
    unsigned int x;
    /* Generate pseudo-random unsigned ints */
    for(int i = 0; i < n; i++) {
        x = curand(&state[blockIdx.x]);
        /* Check if low bit set */
        if(x & 1) {
            count++;
        }
    }
    /* Store results */
    result[id] += count;
}

int main(int argc, char *argv[])
{
    int i;
    long long total;
    curandStateMtgp32 *devMTGPStates;
    mtgp32_kernel_params *devKernelParams;
    int *devResults, *hostResults;
    int sampleCount = 10000;

    /* Allow over-ride of sample count */
    if (argc == 2) {
        sscanf(argv[1],"%d",&sampleCount);
    }

    /* Allocate space for results on host */
    hostResults = (int *)calloc(64 * 256, sizeof(int));

    /* Allocate space for results on device */
    CUDA_CALL(cudaMalloc((void **)&devResults, 64 * 256 *
              sizeof(int)));

    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0, 64 * 256 *
              sizeof(int)));

    /* Allocate space for prng states on device */
    CUDA_CALL(cudaMalloc((void **)&devMTGPStates, 64 *
              sizeof(curandStateMtgp32)));

    /* Setup MTGP prng states */

    /* Allocate space for MTGP kernel parameters */
    CUDA_CALL(cudaMalloc((void**)&devKernelParams, sizeof(mtgp32_kernel_params)));

    /* Reformat from predefined parameter sets to kernel format, */
    /* and copy kernel parameters to device memory               */
    CURAND_CALL(curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams));

    /* Initialize one state per thread block */
    CURAND_CALL(curandMakeMTGP32KernelState(devMTGPStates,
                mtgp32dc_params_fast_11213, devKernelParams, 64, 1234));

    /* State setup is complete */

    /* Generate and use pseudo-random  */
    for(i = 0; i < 10; i++) {
        generate_kernel<<<64, 256>>>(devMTGPStates, sampleCount, devResults);
    }

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostResults, devResults, 64 * 256 *
        sizeof(int), cudaMemcpyDeviceToHost));

    /* Show result */
    total = 0;
    for(i = 0; i < 64 * 256; i++) {
        total += hostResults[i];
    }


    printf("Fraction with low bit set was %10.13g\n",
        (double)total / (64.0f * 256.0f * sampleCount * 10.0f));

    /* Cleanup */
    CUDA_CALL(cudaFree(devKernelParams));
    CUDA_CALL(cudaFree(devMTGPStates));
    CUDA_CALL(cudaFree(devResults));
    free(hostResults);
    printf("^^^^ kernel_mtgp_example PASSED\n");
    return EXIT_SUCCESS;
}

以下示例使用cuRAND设备API生成均匀双精度数,采用64位加扰Sobol生成器。它利用这些结果推导出球体体积的近似值。

/*
 * This program uses the device CURAND API to calculate what
 * proportion of quasi-random 3D points fall within a sphere
 * of radius 1, and to derive the volume of the sphere.
 *
 * In particular it uses 64 bit scrambled Sobol direction
 * vectors returned by curandGetDirectionVectors64, to
 * generate double precision uniform samples.
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>

#define THREADS_PER_BLOCK 64
#define BLOCK_COUNT 64
#define TOTAL_THREADS (THREADS_PER_BLOCK * BLOCK_COUNT)

/* Number of 64-bit vectors per dimension */
#define VECTOR_SIZE 64


#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

#define CURAND_CALL(x) do { if((x) != CURAND_STATUS_SUCCESS) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

/* This kernel initializes state per thread for each of x, y, and z */

__global__ void setup_kernel(unsigned long long * sobolDirectionVectors,
                             unsigned long long *sobolScrambleConstants,
                             curandStateScrambledSobol64 *state)
{
    int id = threadIdx.x + blockIdx.x * THREADS_PER_BLOCK;
    int dim = 3*id;
    /* Each thread uses 3 different dimensions */
    curand_init(sobolDirectionVectors + VECTOR_SIZE*dim,
                sobolScrambleConstants[dim],
                1234,
                &state[dim]);

    curand_init(sobolDirectionVectors + VECTOR_SIZE*(dim + 1),
                sobolScrambleConstants[dim + 1],
                1234,
                &state[dim + 1]);

    curand_init(sobolDirectionVectors + VECTOR_SIZE*(dim + 2),
                sobolScrambleConstants[dim + 2],
                1234,
                &state[dim + 2]);
}

/* This kernel generates random 3D points and increments a counter if
 * a point is within a unit sphere
 */
__global__ void generate_kernel(curandStateScrambledSobol64 *state,
                                int n,
                                long long int *result)
{
    int id = threadIdx.x + blockIdx.x * THREADS_PER_BLOCK;
    int baseDim = 3 * id;
    long long int count = 0;
    double x, y, z;

    /* Generate quasi-random double precision coordinates */
    for(int i = 0; i < n; i++) {
        x = curand_uniform_double(&state[baseDim]);
        y = curand_uniform_double(&state[baseDim + 1]);
        z = curand_uniform_double(&state[baseDim + 2]);

        /* Check if within sphere of radius 1 */
        if( (x*x + y*y + z*z) < 1.0) {
            count++;
        }
    }
    /* Store results */
    result[id] += count;
}

int main(int argc, char *argv[])
{
    int i;
    long long total;
    curandStateScrambledSobol64 *devSobol64States;
    curandDirectionVectors64_t *hostVectors64;
    unsigned long long int * hostScrambleConstants64;
    unsigned long long int * devDirectionVectors64;
    unsigned long long int * devScrambleConstants64;
    long long int *devResults, *hostResults;
    int sampleCount = 10000;
    int iterations = 100;
    double fraction;
    double pi = 3.1415926535897932;

    /* Allow over-ride of sample count */
    if (argc == 2) {
        sscanf(argv[1],"%d",&sampleCount);
    }

    /* Allocate space for results on host */
    hostResults = (long long int*)calloc(TOTAL_THREADS,
                                      sizeof(long long int));

    /* Allocate space for results on device */
    CUDA_CALL(cudaMalloc((void **)&devResults,
                         TOTAL_THREADS * sizeof(long long int)));

    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0,
                         TOTAL_THREADS * sizeof(long long int)));

    /* Get pointers to the 64 bit scrambled direction vectors and constants*/
    CURAND_CALL(curandGetDirectionVectors64( &hostVectors64,
                                             CURAND_SCRAMBLED_DIRECTION_VECTORS_64_JOEKUO6));

    CURAND_CALL(curandGetScrambleConstants64( &hostScrambleConstants64));


    /* Allocate memory for 3 states per thread (x, y, z), each state to get a unique dimension */
    CUDA_CALL(cudaMalloc((void **)&devSobol64States,
              TOTAL_THREADS * 3 * sizeof(curandStateScrambledSobol64)));

    /* Allocate memory and copy 3 sets of vectors per thread to the device */

    CUDA_CALL(cudaMalloc((void **)&(devDirectionVectors64),
                         3 * TOTAL_THREADS * VECTOR_SIZE * sizeof(long long int)));

    CUDA_CALL(cudaMemcpy(devDirectionVectors64, hostVectors64,
                         3 * TOTAL_THREADS * VECTOR_SIZE * sizeof(long long int),
                         cudaMemcpyHostToDevice));

    /* Allocate memory and copy 3 scramble constants (one costant per dimension)
       per thread to the device */

    CUDA_CALL(cudaMalloc((void **)&(devScrambleConstants64),
                         3 * TOTAL_THREADS * sizeof(long long int)));

    CUDA_CALL(cudaMemcpy(devScrambleConstants64, hostScrambleConstants64,
                         3 * TOTAL_THREADS * sizeof(long long int),
                         cudaMemcpyHostToDevice));

    /* Initialize the states */

    setup_kernel<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>(devDirectionVectors64,
                                                     devScrambleConstants64,
                                                     devSobol64States);

    /* Generate and count quasi-random points  */

    for(i = 0; i < iterations; i++) {
        generate_kernel<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>(devSobol64States, sampleCount, devResults);
    }

    /* Copy device memory to host */

    CUDA_CALL(cudaMemcpy(hostResults,
              devResults,
              TOTAL_THREADS * sizeof(long long int),
              cudaMemcpyDeviceToHost));

    /* Tally and show result */

    total = 0;
    for(i = 0; i < TOTAL_THREADS; i++) {
        total += hostResults[i];
    }

    fraction = (double)total / ((double)TOTAL_THREADS * (double)sampleCount * (double)iterations);
    printf("Fraction inside sphere was %g\n", fraction);
    printf("(4/3) pi = %g, sampled volume = %g\n",(4.0*pi/3.0),8.0 * fraction);

    /* Cleanup */

    CUDA_CALL(cudaFree(devSobol64States));
    CUDA_CALL(cudaFree(devDirectionVectors64));
    CUDA_CALL(cudaFree(devScrambleConstants64));
    CUDA_CALL(cudaFree(devResults));
    free(hostResults);
    printf("^^^^ kernel_sobol_example PASSED\n");


    return EXIT_SUCCESS;
}

3.7. Thrust与cuRAND示例

以下示例演示了如何混合使用cuRAND和Thrust。这是标准Thrust示例monte_carlo.cu的一个最小修改版本。该示例通过随机选取单位正方形中的点,并计算到原点的距离来判断这些点是否位于四分之一单位圆内,从而估算 π 的值。

#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <curand_kernel.h>

#include <iostream>
#include <iomanip>

// we could vary M & N to find the perf sweet spot

struct estimate_pi : 
    public thrust::unary_function<unsigned int, float>
{
  __device__
  float operator()(unsigned int thread_id)
  {
    float sum = 0;
    unsigned int N = 10000; // samples per thread

    unsigned int seed = thread_id;

    curandState s;

    // seed a random number generator
    curand_init(seed, 0, 0, &s);

    // take N samples in a quarter circle
    for(unsigned int i = 0; i < N; ++i)
    {
      // draw a sample from the unit square
      float x = curand_uniform(&s);
      float y = curand_uniform(&s);

      // measure distance from the origin
      float dist = sqrtf(x*x + y*y);

      // add 1.0f if (u0,u1) is inside the quarter circle
      if(dist <= 1.0f)
        sum += 1.0f;
    }

    // multiply by 4 to get the area of the whole circle
    sum *= 4.0f;

    // divide by N
    return sum / N;
  }
};

int main(void)
{
  // use 30K independent seeds
  int M = 30000;

  float estimate = thrust::transform_reduce(
        thrust::counting_iterator<int>(0),
        thrust::counting_iterator<int>(M),
        estimate_pi(),
        0.0f,
        thrust::plus<float>());
  estimate /= M;

  std::cout << std::setprecision(3);
  std::cout << "pi is approximately ";
  std::cout << estimate << std::endl;
  return 0;
}

3.8. Poisson API 示例

这个示例展示了泊松分布的三种API类型之间的差异。它模拟了商店中的排队场景。主机API在生成大量泊松分布随机数向量时最为稳健(即在所有lambda值范围内具有最佳统计特性)。离散设备API几乎与主机API一样稳健,并允许在内核中生成泊松分布随机数。简单设备API的稳健性最差,但在为多个不同lambda值生成泊松分布随机数时效率更高。

/*
 * This program uses CURAND library for Poisson distribution
 * to simulate queues in store for 16 hours. It shows the
 * difference of using 3 different APIs:
 * - HOST API -arrival of customers is described by Poisson(4)
 * - SIMPLE DEVICE API -arrival of customers is described by
 *     Poisson(4*(sin(x/100)+1)), where x is number of minutes
 *     from store opening time.
 * - ROBUST DEVICE API -arrival of customers is described by:
 *     - Poisson(2) for first 3 hours.
 *     - Poisson(1) for second 3 hours.
 *     - Poisson(3) after 6 hours.
 */

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>

#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)
#define CURAND_CALL(x) do { if((x)!=CURAND_STATUS_SUCCESS) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__);\
    return EXIT_FAILURE;}} while(0)


#define HOURS 16
#define OPENING_HOUR 7
#define CLOSING_HOUR (OPENING_HOUR + HOURS)

#define access_2D(type, ptr, row, column, pitch)\
    *((type*)((char*)ptr + (row) * pitch) + column)

enum API_TYPE {
    HOST_API = 0,
    SIMPLE_DEVICE_API = 1,
    ROBUST_DEVICE_API = 2,
};

/* global variables */
API_TYPE api;
int report_break;
int cashiers_load_h[HOURS];
__constant__ int cashiers_load[HOURS];

__global__ void setup_kernel(curandState *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets same seed, a different sequence
       number, no offset */
    curand_init(1234, id, 0, &state[id]);
}

__inline__ __device__
void update_queue(int id, int min, unsigned int new_customers,
                  unsigned int &queue_length,
                  unsigned int *queue_lengths, size_t pitch)
{
    int balance;
    balance = new_customers - 2 * cashiers_load[(min-1)/60];
    if (balance + (int)queue_length <= 0){
        queue_length = 0;
    }else{
        queue_length += balance;
    }
    /* Store results */
    access_2D(unsigned int, queue_lengths, min-1, id, pitch)
        = queue_length;
}


__global__ void simple_device_API_kernel(curandState *state,
                    unsigned int *queue_lengths, size_t pitch)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int new_customers;
    unsigned int queue_length = 0;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Simulate queue in time */
    for(int min = 1; min <= 60 * HOURS; min++) {
        /* Draw number of new customers depending on API */
        new_customers = curand_poisson(&localState,
                                4*(sin((float)min/100.0)+1));
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                     queue_lengths, pitch);
    }
    /* Copy state back to global memory */
    state[id] = localState;
}


__global__ void host_API_kernel(unsigned int *poisson_numbers,
                    unsigned int *queue_lengths, size_t pitch)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int new_customers;
    unsigned int queue_length = 0;
    /* Simulate queue in time */
    for(int min = 1; min <= 60 * HOURS; min++) {
        /* Get random number from global memory */
        new_customers = poisson_numbers
                    [blockDim.x * gridDim.x * (min -1) + id];
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                     queue_lengths, pitch);
    }
}

__global__ void robust_device_API_kernel(curandState *state,
                   curandDiscreteDistribution_t poisson_1,
                   curandDiscreteDistribution_t poisson_2,
                   curandDiscreteDistribution_t poisson_3,
                   unsigned int *queue_lengths, size_t pitch)
{
    int id = threadIdx.x + blockIdx.x * 64;
    unsigned int new_customers;
    unsigned int queue_length = 0;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Simulate queue in time */
    /* first 3 hours */
    for(int min = 1; min <= 60 * 3; min++) {
        /* draw number of new customers depending on API */
        new_customers =
                    curand_discrete(&localState, poisson_2);
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                                    queue_lengths, pitch);
    }
    /* second 3 hours */
    for(int min = 60 * 3 + 1; min <= 60 * 6; min++) {
        /* draw number of new customers depending on API */
        new_customers =
                    curand_discrete(&localState, poisson_1);
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                                    queue_lengths, pitch);
    }
    /* after 6 hours */
    for(int min = 60 * 6 + 1; min <= 60 * HOURS; min++) {
        /* draw number of new customers depending on API */
        new_customers =
                    curand_discrete(&localState, poisson_3);
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                                    queue_lengths, pitch);
    }
    /* Copy state back to global memory */
    state[id] = localState;
}

/* Set time intervals between reports */
void report_settings()
{
    do{
        printf("Set time intervals between queue reports");
        printf("(in minutes > 0)\n");
        if (scanf("%d", &report_break) == 0) continue;
    }while(report_break <= 0);
}


/* Set number of cashiers each hour */
void add_cachiers(int *cashiers_load)
{
    int i, min, max, begin, end;
    printf("Cashier serves 2 customers per minute...\n");
    for (i = 0; i < HOURS; i++){
        cashiers_load_h[i] = 0;
    }
    while (true){
        printf("Adding cashier...\n");
        min = OPENING_HOUR;
        max = CLOSING_HOUR-1;
        do{
            printf("Set hour that cahier comes (%d-%d)",
                                                min, max);
            printf(" [type 0 to finish adding cashiers]\n");
            if (scanf("%d", &begin) == 0) continue;
        }while (begin > max || (begin < min && begin != 0));
        if (begin == 0) break;
        min = begin+1;
        max = CLOSING_HOUR;
        do{
            printf("Set hour that cahier leaves (%d-%d)",
                                                min, max);
            printf(" [type 0 to finish adding cashiers]\n");
            if (scanf("%d", &end) == 0) continue;
        }while (end > max || (end < min && end != 0));
        if (end == 0) break;
        for (i = begin - OPENING_HOUR;
             i < end   - OPENING_HOUR; i++){
            cashiers_load_h[i]++;
        }
    }
    for (i = OPENING_HOUR; i < CLOSING_HOUR; i++){
        printf("\n%2d:00 - %2d:00     %d cashier",
                i, i+1, cashiers_load_h[i-OPENING_HOUR]);
        if (cashiers_load[i-OPENING_HOUR] != 1) printf("s");
    }
    printf("\n");
}

/* Set API type */
API_TYPE set_API_type()
{
    printf("Choose API type:\n");
    int choose;
    do{
        printf("type 1 for HOST API\n");
        printf("type 2 for SIMPLE DEVICE API\n");
        printf("type 3 for ROBUST DEVICE API\n");
        if (scanf("%d", &choose) == 0) continue;
    }while( choose < 1 || choose > 3);
    switch(choose){
        case 1: return HOST_API;
        case 2: return SIMPLE_DEVICE_API;
        case 3: return ROBUST_DEVICE_API;
        default:
            fprintf(stderr, "wrong API\n");
            return HOST_API;
    }
}

void settings()
{
    add_cachiers(cashiers_load);
    cudaMemcpyToSymbol("cashiers_load", cashiers_load_h,
            HOURS * sizeof(int), 0, cudaMemcpyHostToDevice);
    report_settings();
    api = set_API_type();
}

void print_statistics(unsigned int *hostResults, size_t pitch)
{
    int min, i, hour, minute;
    unsigned int sum;
    for(min = report_break; min <= 60 * HOURS;
                            min += report_break) {
        sum = 0;
        for(i = 0; i < 64 * 64; i++) {
            sum += access_2D(unsigned int, hostResults,
                                        min-1, i, pitch);
        }
        hour = OPENING_HOUR + min/60;
        minute = min%60;
        printf("%2d:%02d   # of waiting customers = %10.4g |",
                    hour, minute, (float)sum/(64.0 * 64.0));
        printf("  # of cashiers = %d  |  ",
                    cashiers_load_h[(min-1)/60]);
        printf("# of new customers/min ~= ");
        switch (api){
            case HOST_API:
                printf("%2.2f\n", 4.0);
                break;
            case SIMPLE_DEVICE_API:
                printf("%2.2f\n",
                            4*(sin((float)min/100.0)+1));
                break;
            case ROBUST_DEVICE_API:
                if (min <= 3 * 60){
                    printf("%2.2f\n", 2.0);
                }else{
                    if (min <= 6 * 60){
                        printf("%2.2f\n", 1.0);
                    }else{
                        printf("%2.2f\n", 3.0);
                    }
                }
                break;
            default:
                fprintf(stderr, "Wrong API\n");
        }
    }
}


int main(int argc, char *argv[])
{
    int n;
    size_t pitch;
    curandState *devStates;
    unsigned int *devResults, *hostResults;
    unsigned int *poisson_numbers_d;
    curandDiscreteDistribution_t poisson_1, poisson_2;
    curandDiscreteDistribution_t poisson_3;
    curandGenerator_t gen;

    /* Setting cashiers, report and API */
    settings();

    /* Allocate space for results on device */
    CUDA_CALL(cudaMallocPitch((void **)&devResults, &pitch,
                64 * 64 * sizeof(unsigned int), 60 * HOURS));

    /* Allocate space for results on host */
    hostResults = (unsigned int *)calloc(pitch * 60 * HOURS,
                sizeof(unsigned int));

    /* Allocate space for prng states on device */
    CUDA_CALL(cudaMalloc((void **)&devStates, 64 * 64 *
              sizeof(curandState)));

    /* Setup prng states */
    if (api != HOST_API){
        setup_kernel<<<64, 64>>>(devStates);
    }
    /* Simulate queue  */
    switch (api){
        case HOST_API:
            /* Create pseudo-random number generator */
            CURAND_CALL(curandCreateGenerator(&gen,
                                CURAND_RNG_PSEUDO_DEFAULT));
            /* Set seed */
            CURAND_CALL(curandSetPseudoRandomGeneratorSeed(
                                            gen, 1234ULL));
            /* compute n */
            n = 64 * 64 * HOURS * 60;
            /* Allocate n unsigned ints on device */
            CUDA_CALL(cudaMalloc((void **)&poisson_numbers_d,
                                n * sizeof(unsigned int)));
            /* Generate n unsigned ints on device */
            CURAND_CALL(curandGeneratePoisson(gen,
                                poisson_numbers_d, n, 4.0));
            host_API_kernel<<<64, 64>>>(poisson_numbers_d,
                                        devResults, pitch);
            /* Cleanup */
            CURAND_CALL(curandDestroyGenerator(gen));
            break;
        case SIMPLE_DEVICE_API:
            simple_device_API_kernel<<<64, 64>>>(devStates,
                                        devResults, pitch);
            break;
        case ROBUST_DEVICE_API:
            /* Create histograms for Poisson(1) */
            CURAND_CALL(curandCreatePoissonDistribution(1.0,
                                                &poisson_1));
            /* Create histograms for Poisson(2) */
            CURAND_CALL(curandCreatePoissonDistribution(2.0,
                                                &poisson_2));
            /* Create histograms for Poisson(3) */
            CURAND_CALL(curandCreatePoissonDistribution(3.0,
                                                &poisson_3));
            robust_device_API_kernel<<<64, 64>>>(devStates,
                            poisson_1, poisson_2, poisson_3,
                            devResults, pitch);
            /* Cleanup */
            CURAND_CALL(curandDestroyDistribution(poisson_1));
            CURAND_CALL(curandDestroyDistribution(poisson_2));
            CURAND_CALL(curandDestroyDistribution(poisson_3));
            break;
        default:
            fprintf(stderr, "Wrong API\n");
    }
    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy2D(hostResults, pitch, devResults,
                pitch, 64 * 64 * sizeof(unsigned int),
                60 * HOURS, cudaMemcpyDeviceToHost));
    /* Show result */
    print_statistics(hostResults, pitch);
    /* Cleanup */
    CUDA_CALL(cudaFree(devStates));
    CUDA_CALL(cudaFree(devResults));
    free(hostResults);
    return EXIT_SUCCESS;
}