OpenACC

Warning

This page is currently under active development. Check back soon for more content.

OpenACC provides compiler directives, library routines, and environment variables that allow identified regions of code to be executed in parallel on multicore CPUs or attached accelerators (e.g., GPUs). It defines a parallel programming model that is portable across operating systems and across a variety of multicore CPUs and accelerators. As with OpenMP, parallelism is expressed through compiler directives and runtime routines, and the resulting parallel constructs can either run across the cores of a multicore host or be offloaded to a GPU.

A device, either an attached accelerator or the multicore CPU itself, executes parallel regions, which typically contain work-sharing loops; kernels regions, which typically contain one or more loops that may be executed as kernels; and serial regions, which are blocks of sequential code.
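
As a minimal sketch of how these three kinds of regions are expressed in C (this toy program is not part of the laplace2d example discussed below; it simply assumes an OpenACC-capable compiler such as nvc -acc):

#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    float sum = 0.0f;

    /* Serial region: a block of sequential code executed on the device. */
    #pragma acc serial
    {
        x[0] = 1.0f;
    }

    /* Parallel region containing a work-sharing loop. */
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
        x[i] = (float)i;

    /* Kernels region: the compiler decides how to map the enclosed loop. */
    #pragma acc kernels
    {
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i];
    }

    /* Parallel loop with an explicit sum reduction. */
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += y[i];

    printf("sum = %f\n", sum);
    return 0;
}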

Most current accelerators and many multicore CPUs support two or three levels of parallelism.

  • Most accelerators and multicore CPUs support coarse-grain parallelism, which is fully parallel execution across execution units.
  • Many accelerators and some CPUs also support fine-grain parallelism, often implemented as multiple threads of execution within a single execution unit, which are typically rapidly switched on the execution unit to tolerate long latency memory operations.
  • Finally, most accelerators and CPUs also support SIMD or vector operations within each execution unit.

OpenACC exposes these three levels of parallelism via gang, worker, and vector parallelism.

  • Gang parallelism is coarse-grain. A number of gangs will be launched on the accelerator.
  • Worker parallelism is fine-grain. Each gang will have one or more workers.
  • Vector parallelism is for SIMD or vector operations within a worker.

On NVIDIA GPUs, gang parallelism corresponds to the grid of thread blocks in CUDA, while worker and vector parallelism map onto the threads within a thread block.
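
The following sketch shows how the gang, worker, and vector clauses can be placed on nested loops (the array and problem sizes here are hypothetical, chosen only for illustration). On a GPU, each iteration of the outer gang loop typically becomes a CUDA thread block, and the worker/vector iterations of the inner loop become the threads within that block:

#include <stdio.h>

#define NI 1024   /* hypothetical problem sizes, for illustration only */
#define NJ 1024

static float a[NI][NJ];

int main(void)
{
    /* Gang parallelism across the outer loop. */
    #pragma acc parallel loop gang
    for (int i = 0; i < NI; i++) {
        /* Worker and vector parallelism within each gang. */
        #pragma acc loop worker vector
        for (int j = 0; j < NJ; j++) {
            a[i][j] = 2.0f * (float)(i + j);
        }
    }

    printf("a[1][1] = %f\n", a[1][1]);
    return 0;
}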

C/C++

A 2-D Jacobi solver written in C and accelerated with OpenACC is provided as a "laplace2d.c" source code file and is discussed in detail on the NVIDIA Developer Blog.

It can be compiled as follows (the "timer.h" source file, provided separately, is also required). Below we use the NVIDIA HPC SDK compiler, available via the nvidia module on Perlmutter:

$ nvc -acc -Minfo -o laplace2d.ex laplace2d.c
GetTimer:
     74, FMA (fused multiply-add) instruction(s) generated
main:
     87, Generating copy(A[:][:]) [if not already present]
         Generating create(Anew[:][:]) [if not already present]
     89, Generating implicit copy(error) [if not already present]
     93, Loop is parallelizable
     96, Loop is parallelizable
         Generating Tesla code
         93, #pragma acc loop gang(32), vector(16) /* blockIdx.y threadIdx.y */
             Generating implicit reduction(max:error)
         96, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
    106, Loop is parallelizable
    109, Loop is parallelizable
         Generating Tesla code
        106, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
        109, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
    109, Memory copy idiom, loop replaced by call to __c_mcopy4
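
The compiler feedback above refers to a data region that copies A and creates Anew on the device, parallelizable loop nests, and an implicit max reduction on the error. As a rough sketch of that structure (placeholder array sizes and boundary setup; see the linked laplace2d.c and the NVIDIA Developer Blog post for the actual code):

#include <stdio.h>
#include <math.h>

#define NN 512   /* placeholder grid dimensions */
#define MM 512

static float A[NN][MM], Anew[NN][MM];

int main(void)
{
    const float tol = 1.0e-5f;
    const int   iter_max = 100;
    float error = 1.0f;
    int   iter = 0;

    /* Simple boundary condition so the iteration has work to do. */
    for (int i = 0; i < MM; i++) A[0][i] = 1.0f;

    /* Copy A to/from the device and create Anew on the device only,
       matching the "Generating copy(...)"/"create(...)" messages above. */
    #pragma acc data copy(A) create(Anew)
    while (error > tol && iter < iter_max) {
        error = 0.0f;

        /* The compiler detects an implicit max reduction on 'error' here. */
        #pragma acc kernels
        {
            for (int j = 1; j < NN - 1; j++) {
                for (int i = 1; i < MM - 1; i++) {
                    Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                        + A[j-1][i] + A[j+1][i]);
                    error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
                }
            }
            /* Copy the updated interior back into A for the next sweep. */
            for (int j = 1; j < NN - 1; j++) {
                for (int i = 1; i < MM - 1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
        }
        iter++;
    }

    printf("iterations: %d, final error: %g\n", iter, error);
    return 0;
}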

Fortran

A Fortran version of the same solver is provided as the "laplace2d.f90" source file and can be compiled as follows:

$ nvfortran -acc -Minfo -o laplace2d.ex laplace2d.F90
laplace:
     43, Memory zero idiom, array assignment replaced by call to pgf90_mzero4
     50, Memory copy idiom, loop replaced by call to __c_mcopy4
     77, Generating copy(a(:,:)) [if not already present]
         Generating create(anew(:,:)) [if not already present]
     82, Generating implicit copy(error) [if not already present]
     83, Loop is parallelizable
     85, Loop is parallelizable
         Generating Tesla code
         83, !$acc loop gang(32), vector(16) ! blockidx%y threadidx%y
             Generating implicit reduction(max:error)
         85, !$acc loop gang(16), vector(32) ! blockidx%x threadidx%x
    100, Loop is parallelizable
    102, Loop is parallelizable
         Generating Tesla code
        100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
        102, !$acc loop gang(16), vector(32) ! blockidx%x threadidx%x
    102, Memory copy idiom, loop replaced by call to __c_mcopy4

Preparing for Perlmutter

For information on OpenACC for Perlmutter, please see the OpenACC and OpenACC in Fortran sections of the Perlmutter Readiness page.
