[Solved] What is option -O3 for g++ and nvcc?

It’s optimization level 3: essentially a shortcut that turns on a number of other options related to speed optimization (see the links below). It is one of the best-known compiler options: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#options-for-altering-compiler-linker-behavior

[Solved] Can I parallelize my program?

Your code is fairly straightforward, with lots of independent parallel loops. These parallel loops appear to be wrapped in an outer convergence do-while loop, so as long as you keep the data on the device for all iterations of the convergence loop, you won’t be bottlenecked by transfers. I would recommend starting with compiler … Read more

[Solved] Cuda: Compact and result size

You can do this using thrust, as @RobertCrovella already pointed out. The following example uses thrust::copy_if to copy the indices of all elements for which the condition (“equals 7”) is fulfilled. thrust::counting_iterator is used to avoid creating the sequence of indices explicitly.

#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <iostream>

using namespace thrust::placeholders;

int main() { … Read more

[Solved] how to use the cula device

I don’t know cula. However, after a brief look at the reference guide (which I suggest consulting before asking on SO), you can use cula device functions just as you would host functions; you just have to pass device memory pointers to them. __global__ void kernel( double * A, double * B, curandState * globalState, int Asize, … Read more

[Solved] vector addition in CUDA using streams

One problem is how you are handling h_A, h_B, and h_C:

h_A = (float *) wbImport(wbArg_getInputFile(args, 0), &inputLength);
h_B = (float *) wbImport(wbArg_getInputFile(args, 1), &inputLength);

The above lines of code create an allocation for h_A and h_B and import some data (presumably). These lines of code:

cudaHostAlloc((void **) &h_A, size, cudaHostAllocDefault);
cudaHostAlloc((void **) &h_B, … Read more

[Solved] I have CUDA installed on Win10, but Anaconda makes me reinstall it in the environment

Anaconda can only detect and manage packages within its own environments. It cannot and will not detect or use an existing CUDA installation when installing packages with a CUDA dependency. Note, however, that the cudatoolkit package which conda installs is not a complete CUDA toolkit distribution. It only contains the necessary libraries … Read more

[Solved] pgi cuda fortran compiling error

You’re calling this a “CUDA Fortran” code, but it is syntactically incorrect whether you ultimately want to run the subroutine on the host (CPU) or the device (GPU). You may wish to refer to this blog post as a quick-start guide. If you want to run the subroutine increment on the GPU, you have not … Read more

[Solved] Pytorch crashes cuda on wrong line

I found the answer in a completely unrelated thread in the forums. I couldn’t find a Googleable answer, so I’m posting it here for future users’ sake. Since CUDA calls are executed asynchronously, you should run your code with CUDA_LAUNCH_BLOCKING=1 python script.py This makes sure the right line of code throws the error message. … Read more