Your code is fairly straightforward, with lots of independent parallel loops. These loops appear to be wrapped in an outer do while convergence loop, so as long as you keep the data on the device for all iterations of that loop, you won’t be bottlenecked by host–device transfers.
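For concreteness, here is a minimal sketch of the kind of structure being described. Everything in it is hypothetical: the arrays u and unew, the Jacobi-style update, and the convergence test are stand-ins for whatever your actual code does.

```fortran
program jacobi_sketch
   implicit none
   integer, parameter :: nx = 128, ny = 128
   real, parameter    :: tol = 1.0e-4
   real    :: u(nx,ny), unew(nx,ny), err
   integer :: i, j

   u = 0.0
   u(1,:) = 1.0          ! arbitrary boundary condition, just for the sketch
   err = 1.0

   ! Outer convergence loop wrapping independent, parallelizable loops.
   do while (err > tol)
      err = 0.0
      do j = 2, ny - 1
         do i = 2, nx - 1
            unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j) &
                              + u(i,j-1) + u(i,j+1))
            err = max(err, abs(unew(i,j) - u(i,j)))
         end do
      end do
      u(2:nx-1,2:ny-1) = unew(2:nx-1,2:ny-1)
   end do

   print *, 'converged, max change = ', err
end program jacobi_sketch
```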
I would recommend starting with compiler directives for this code rather than diving into CUDA Fortran. Compiler directives work well for simple independent loops like these: they are hints that you place in code comments telling the compiler which loops to parallelize, which data to copy, and so on.
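To show what such a hint looks like, here is the inner loop nest from the sketch above marked up with an OpenMP directive (more on OpenMP below), reusing the same variable declarations. The directive line is just a Fortran comment, so a compiler that isn’t asked to honor it builds the loop exactly as before.

```fortran
! Same loop nest as above; the !$omp line is a hint in a comment.
!$omp parallel do private(i) reduction(max:err)
do j = 2, ny - 1
   do i = 2, nx - 1
      unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j) &
                        + u(i,j-1) + u(i,j+1))
      err = max(err, abs(unew(i,j) - u(i,j)))
   end do
end do
!$omp end parallel do
```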
You can first try OpenMP to parallelize across multiple CPU cores. Then you can move to GPU directives such as OpenACC, which will soon be available in compilers from PGI, Cray, and CAPS. To get a head start, you could download a free trial of the PGI compiler and use its “Accelerator” directives; the Accelerator syntax is very similar to OpenACC.
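And here is roughly what the GPU-directive version could look like, written in OpenACC syntax (the PGI Accelerator syntax is similar but not identical, so check the PGI docs for the exact spelling). The variable names again come from the hypothetical sketch above. The data region wrapped around the convergence loop is what keeps u and unew resident on the device between iterations, which is the point made at the top about avoiding transfer bottlenecks; only the scalar err comes back each iteration for the convergence test.

```fortran
! Keep u and unew on the GPU for the whole convergence loop.
!$acc data copy(u) create(unew)
do while (err > tol)
   err = 0.0
!$acc parallel loop collapse(2) reduction(max:err)
   do j = 2, ny - 1
      do i = 2, nx - 1
         unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j) &
                           + u(i,j-1) + u(i,j+1))
         err = max(err, abs(unew(i,j) - u(i,j)))
      end do
   end do
!$acc parallel loop collapse(2)
   do j = 2, ny - 1
      do i = 2, nx - 1
         u(i,j) = unew(i,j)
      end do
   end do
end do
!$acc end data
```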