OpenACC¶
Warning
This page is currently under active development. Check back soon for more content.
OpenACC provides compiler directives, library routines, and environment variables that allow identified regions of code to execute in parallel on multicore CPUs or on attached accelerators (e.g., GPUs). It defines a parallel programming model that is portable across operating systems and across different types of multicore CPUs and accelerators. As with OpenMP, parallel constructs are expressed through compiler directives and runtime functions, and the marked regions are either parallelized for a multicore host or offloaded to a GPU.
A device, either an attached accelerator or the multicore CPU itself, executes parallel regions, which typically contain work-sharing loops; kernels regions, which typically contain one or more loops that may be executed as kernels; or serial regions, which are blocks of sequential code.
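For illustration, here is a minimal C sketch (not taken from the NERSC examples; the arrays and values are arbitrary) showing the three kinds of compute regions:

#include <stdio.h>

#define N 1000

int main(void)
{
    double x[N], y[N], sum = 0.0;

    /* parallel region: the enclosed work-sharing loop runs in parallel on the device */
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
        x[i] = (double) i;

    /* kernels region: the compiler decides how to turn the enclosed loop into a kernel */
    #pragma acc kernels
    {
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[i];
    }

    /* serial region: a block of sequential code executed on the device */
    #pragma acc serial
    {
        y[0] = y[0] + 1.0;
    }

    for (int i = 0; i < N; i++)
        sum += y[i];
    printf("sum = %f\n", sum);
    return 0;
}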
Most current accelerators and many multicore CPUs support two or three levels of parallelism.
- Most accelerators and multicore CPUs support coarse-grain parallelism, which is fully parallel execution across execution units.
- Many accelerators and some CPUs also support fine-grain parallelism, often implemented as multiple threads of execution within a single execution unit, which are typically rapidly switched on the execution unit to tolerate long latency memory operations.
- Finally, most accelerators and CPUs also support SIMD or vector operations within each execution unit.
OpenACC exposes these three levels of parallelism via gang, worker, and vector parallelism.
- Gang parallelism is coarse-grain. A number of gangs will be launched on the accelerator.
- Worker parallelism is fine-grain. Each gang will have one or more workers.
- Vector parallelism is for SIMD or vector operations within a worker.
Gang parallelism corresponds to the grid of thread blocks in CUDA on GPUs, while worker and vector parallelism correspond to parallelism within a thread block, as in the sketch below.
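These clauses can be written explicitly on a loop nest, as in the following minimal C sketch (the worker and vector sizes shown are arbitrary; by default the compiler picks them):

#define NI 512
#define NJ 512

float a[NI][NJ];

void scale(float s)
{
    /* One gang loop over rows; worker and vector parallelism over columns.
       The num_workers/vector_length values are arbitrary; the compiler
       chooses reasonable defaults when they are omitted. */
    #pragma acc parallel loop gang num_workers(4) vector_length(32)
    for (int i = 0; i < NI; i++) {
        #pragma acc loop worker vector
        for (int j = 0; j < NJ; j++) {
            a[i][j] *= s;
        }
    }
}

int main(void)
{
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            a[i][j] = 1.0f;
    scale(2.0f);
    return 0;
}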
C/C++¶
A 2-D Jacobi solver written in C and accelerated with OpenACC is provided as the "laplace2d.c" source code file and is discussed in detail on the NVIDIA Developer Blog. It can be compiled as follows (it also requires the timer.h header, provided as the "timer.h" source code file). Below we use the NVIDIA HPC SDK compiler, available via the nvidia module on Perlmutter:
$ nvc -acc -Minfo -o laplace2d.ex laplace2d.c
GetTimer:
74, FMA (fused multiply-add) instruction(s) generated
main:
87, Generating copy(A[:][:]) [if not already present]
Generating create(Anew[:][:]) [if not already present]
89, Generating implicit copy(error) [if not already present]
93, Loop is parallelizable
96, Loop is parallelizable
Generating Tesla code
93, #pragma acc loop gang(32), vector(16) /* blockIdx.y threadIdx.y */
Generating implicit reduction(max:error)
96, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
106, Loop is parallelizable
109, Loop is parallelizable
Generating Tesla code
106, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
109, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
109, Memory copy idiom, loop replaced by call to __c_mcopy4
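The feedback above corresponds to data and loop directives along the lines of this abbreviated sketch (an illustration of the pattern, not the actual laplace2d.c source; array dimensions and iteration counts are arbitrary):

#include <math.h>
#include <stdio.h>

#define NX 4096
#define NY 4096

float A[NY][NX], Anew[NY][NX];

float jacobi(int iter_max, float tol)
{
    float error = 1.0f;

    /* Keep A resident on the device and allocate Anew there for the whole
       iteration ("Generating copy(A)" / "Generating create(Anew)" above). */
    #pragma acc data copy(A) create(Anew)
    for (int iter = 0; iter < iter_max && error > tol; iter++) {
        error = 0.0f;

        /* Stencil sweep; error is combined with a max reduction
           (reported as an implicit reduction in the feedback above). */
        #pragma acc parallel loop reduction(max:error)
        for (int j = 1; j < NY - 1; j++) {
            #pragma acc loop
            for (int i = 1; i < NX - 1; i++) {
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }
        }

        /* Copy the updated values back into A; the compiler may recognize
           this as a memory-copy idiom, as noted in the feedback above. */
        #pragma acc parallel loop
        for (int j = 1; j < NY - 1; j++) {
            #pragma acc loop
            for (int i = 1; i < NX - 1; i++) {
                A[j][i] = Anew[j][i];
            }
        }
    }
    return error;
}

int main(void)
{
    for (int i = 0; i < NX; i++)
        A[0][i] = 1.0f;          /* fixed boundary condition on one edge */
    printf("final error: %f\n", jacobi(1000, 1.0e-3f));
    return 0;
}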
Fortran¶
The same solver described above is also provided in a Fortran version as the "laplace2d.f90" source code file and can be compiled as follows:
$ nvfortran -acc -Minfo -o laplace2d.ex laplace2d.F90
laplace:
43, Memory zero idiom, array assignment replaced by call to pgf90_mzero4
50, Memory copy idiom, loop replaced by call to __c_mcopy4
77, Generating copy(a(:,:)) [if not already present]
Generating create(anew(:,:)) [if not already present]
82, Generating implicit copy(error) [if not already present]
83, Loop is parallelizable
85, Loop is parallelizable
Generating Tesla code
83, !$acc loop gang(32), vector(16) ! blockidx%y threadidx%y
Generating implicit reduction(max:error)
85, !$acc loop gang(16), vector(32) ! blockidx%x threadidx%x
100, Loop is parallelizable
102, Loop is parallelizable
Generating Tesla code
100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
102, !$acc loop gang(16), vector(32) ! blockidx%x threadidx%x
102, Memory copy idiom, loop replaced by call to __c_mcopy4
Using OpenACC on Perlmutter¶
The PrgEnv-nvidia module should be used to compile OpenACC code:
Vendor | PrgEnv | Language(s) | OpenACC flag |
---|---|---|---|
NVIDIA | PrgEnv-nvidia | C/C++/Fortran | -acc |
GPU Offload¶
GPU target must be set
Either load the cudatoolkit and craype-accel-nvidia80 modules or use the -acc=gpu option.
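For example (a sketch of a typical invocation; the module names come from the note above, and exact versions may differ):

$ module load PrgEnv-nvidia cudatoolkit craype-accel-nvidia80
$ cc -acc -Minfo -o laplace2d.ex laplace2d.c

Alternatively, request the GPU target explicitly with the -acc=gpu option:

$ cc -acc=gpu -Minfo -o laplace2d.ex laplace2d.c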
For more information on OpenACC on Perlmutter, see the OpenACC and OpenACC in Fortran sections of the Perlmutter Readiness page.
References¶
- The OpenACC specification
- OpenACC resources for guides, tutorials, code samples, etc.
- NERSC OpenACC Training Series, 2020