I don't know CULA. However, after a brief look at the reference guide (which I suggest consulting before asking on SO), you can use CULA device functions just like host functions; you only have to pass them device memory pointers.
__global__ void kernel( double * A, double * B, curandState * globalState, int Asize, int Bsize )
{
    // generate random numbers into A and B, e.g. via curand_uniform_double() on globalState
    ...
}
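(The setup_kernel launched below isn't shown in the question; a minimal sketch of the usual curand_init pattern, assuming one generator state per thread, would be:)
// requires #include <curand_kernel.h>
__global__ void setup_kernel( curandState * state, unsigned long seed )
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // same seed for all threads, different subsequence per thread, no offset
    curand_init( seed, id, 0, &state[id] );
}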
void kernel_wrapper(
double * const A,
double * const B,
const int Asize ,
const int Bsize )
{
...
// create random states; N is the number of generator states (one per thread of setup_kernel)
curandState * devStates;
gpuErrchk( cudaMalloc( &devStates, N * sizeof(curandState) ) );
// seed the states on the device
setup_kernel<<<1,N>>>( devStates, unsigned( time(NULL) ) );
...
// generate random numbers
kernel<<<1,1>>>( A, B, devStates, Asize, Bsize );
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );
// clean up device memory
gpuErrchk( cudaFree( devStates ) );
return;
}
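gpuErrchk above is the usual CUDA error-checking helper, not part of CULA; if you don't already have it, a minimal version of the common pattern is:
#include <cstdio>
#include <cstdlib>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert( cudaError_t code, const char * file, int line )
{
    if ( code != cudaSuccess )
    {
        // print the CUDA error string together with the call site, then abort
        fprintf( stderr, "GPUassert: %s %s %d\n", cudaGetErrorString( code ), file, line );
        exit( code );
    }
}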
And in your cpp:
extern void kernel_wrapper( double * A, double * B, int Asize, int Bsize );
...
culaDouble* A;   // culaDouble is just a typedef for double, so it can be passed to kernel_wrapper
culaDouble* B;
gpuErrchk( cudaMalloc( (void**) &A, Asize * sizeof(double) ) );
gpuErrchk( cudaMalloc( (void**) &B, Bsize * sizeof(double) ) );
kernel_wrapper( A, B, Asize, Bsize );
...
// least-squares solve; A (N x N) and B (N x NRHS) stay in device memory, B is overwritten with the solution
status = culaDeviceDgels('N', N, N, NRHS, A, N, B, N);
gpuErrchk( cudaFree( A ) );
gpuErrchk( cudaFree( B ) );
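One caveat: the CULA runtime has to be initialized before the first cula* call, and every call returns a culaStatus worth checking. A minimal sketch (culaInitialize, culaGetStatusString, and culaShutdown are part of the standard CULA host API, as far as I can tell from the reference guide):
culaStatus status = culaInitialize();   // once, before any other CULA call
if ( status != culaNoError )
    printf( "%s\n", culaGetStatusString( status ) );
...
// ... device allocations, kernel_wrapper, culaDeviceDgels as above ...
...
culaShutdown();                         // once, after the last CULA call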
That's it! You don't even need host memory, as long as everything is supposed to stay in device memory.
Finally, may I suggest that you take a look at the CUDA Programming Guide? I think it will help you understand the differences between host and device memory and how "memory transfers" to and from a CUDA device work.
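(Purely for illustration: if you ever do need a result on the host, one cudaMemcpy after the CULA call is enough; h_B here is a hypothetical host buffer.)
double * h_B = (double*) malloc( Bsize * sizeof(double) );
// copy the solution from device memory B back to the host
gpuErrchk( cudaMemcpy( h_B, B, Bsize * sizeof(double), cudaMemcpyDeviceToHost ) );
// ... use h_B on the host ...
free( h_B );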