
Native compilers on NERSC systems

Introduction

On Perlmutter, code performance can be accelerated by executing part of the code on GPUs, and there are several ways to do this. One way is to write CUDA code, in which the functions to be executed on GPUs ("CUDA kernels") are written in the CUDA programming model syntax; data transfers between the host CPU and GPU devices may also have to be specified explicitly. In this case, NVIDIA's CUDA compilers are used to compile the code. Another way is to offload compute-intensive portions of a regular C, C++ or Fortran code to GPUs using OpenMP or OpenACC constructs (or other methods explained below). In this case, special flags have to be passed to the compiler to enable the programming model used (OpenMP, OpenACC, etc.) and to have the compiler generate an offload binary for the GPU architecture. In some cases, the compiler generates both host and GPU binaries, with the ability to fall back to the host binary if there is a problem using the GPU binary. For the specific behavior of each compiler, however, please check the compiler's manual.
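
As a minimal illustration of the directive-based approach (a sketch only, with made-up file and variable names), a compute loop can be marked for offload with an OpenMP target construct; the compiler flags described in the sections below then produce the GPU binary:

#include <cstddef>
#include <vector>

// Offload a simple vector-scaling loop to the GPU with an OpenMP target construct.
// The map clause describes the data transfer between host and device memory.
void scale(std::vector<double>& a, double factor) {
  double* p = a.data();
  std::size_t n = a.size();
  #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
  for (std::size_t i = 0; i < n; ++i)
    p[i] *= factor;
}

int main() {
  std::vector<double> a(1 << 20, 1.0);
  scale(a, 2.0);
  return 0;
}

With the Cray compiler wrappers on Perlmutter, such a file could be compiled with, for example, CC -fopenmp -target-accel=nvidia80 (see the sections below).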

There are several HPE-Cray-provided base compilers available on Perlmutter, with varying levels of support for GPU code generation: Cray, GNU, AOCC (AMD Optimizing C/C++ Compiler), and NVIDIA. All suites provide compilers for C, C++, and Fortran. Each compiler has different characteristics - different compilers generate faster code under different circumstances. All compilers provide support for OpenMP.

Cori provides three HPE-Cray-provided compiler suites: Intel, GNU, and Cray. Each suite provides compilers for C, C++, and Fortran. All three compilers provide support for OpenMP.

Additionally, NERSC provides the LLVM compilers (clang and clang++) on Cori. The LLVM compilers will be available on Perlmutter. These are not supported by HPE Cray and therefore are not compatible with all of the same software and libraries that the HPE-Cray-provided compiler suites are, but are nevertheless useful for users who require an open-source LLVM-based compiler toolchain.

Below is a table listing the available compilers on Perlmutter and Cori, with the default compilers marked as such.

Compilers   Perlmutter                  Cori
Intel       -                           Available (Default)
GNU         Available                   Available
Cray        Available                   Available
NVIDIA      Available (Default)         -
AOCC        Available                   -
LLVM        TBP (Provided by NERSC)     Available (Provided by NERSC)

TBP: To be provided

All compilers supplied by HPE Cray are provided via the "programming environments" that are accessed via the module utility. Each programming environment contains the full set of compatible compilers and libraries. To change from one compiler to another, change the programming environment via the module swap command. For example, the following changes from the Cray programming environment to the GNU environment. Since Perlmutter uses Lmod, the second command works there, too.

module swap PrgEnv-cray PrgEnv-gnu      # On Cori and Perlmutter
module load PrgEnv-gnu                  # On Perlmutter

Programming environment for using GPUs on Perlmutter

To compile a CUDA source code in any of the supported programming environments, the cuda module is required to make the CUDA Toolkit accessible. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to build and deploy your application. For information about the CUDA Toolkit, see the documentation. Note that this module is loaded by default.

To set the NVIDIA GPUs as the OpenMP and OpenACC offloading target while using the Cray compiler wrappers, use the compiler flag -target-accel=nvidia80 or set the environment variable CRAY_ACCEL_TARGET to nvidia80. To set the acceleration target to host CPUs instead, use the -target-accel=host flag, set the environment variable to host, or load the craype-accel-host module.
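
For example, either of the following compiles a C source file for GPU offload with the Cray wrappers (the file name is illustrative):

cc -fopenmp -target-accel=nvidia80 -o my_code.ex my_code.c

or

export CRAY_ACCEL_TARGET=nvidia80
cc -fopenmp -o my_code.ex my_code.c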

Do not use native compiler's target flag with the Cray compiler wrappers

The native compiler's target flag (e.g., NVIDIA's -target=gpu) will not work with the Cray compiler wrappers.

cudatoolkit, craype-accel-nvidia80 and cuda modules

Normally, when compiling a CUDA code, the HPE Cray-provided cudatoolkit module would be needed, and setting GPUs as the OpenMP and OpenACC offloading target would be done by loading the craype-accel-nvidia80 module, which is also provided by HPE Cray. However, a version conflict among NVIDIA compiler drivers is observed when using the cudatoolkit module, and craype-accel-nvidia80 cannot be used because of its dependency on this module. Do not use these modules, especially in the PrgEnv-nvidia environment. The cuda module is a temporary replacement for the cudatoolkit module.
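
For example, the following makes the CUDA Toolkit available without pulling in the conflicting modules (the cuda module is loaded by default, so this is usually already the case):

module load cuda        # temporary replacement for cudatoolkit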

Using compatible gcc for CUDA compiler drivers with PrgEnv-gnu

If the default gcc provided by PrgEnv-gnu is not a compatible host compiler for NVIDIA's nvcc provided by the cudatoolkit or cuda module on Perlmutter, the cpe-cuda module comes in handy: if an incompatibility between the gcc versions is detected when either module is loaded, it is resolved by loading the compatible gcc version. At the time of writing (August 2021), there is such an incompatibility between the default gcc (10.3.0) and the host gcc compiler (9.3.0) used by cudatoolkit and cuda, and below is how the cpe-cuda module changes to the proper version.

$ module load PrgEnv-gnu        # Swapping PrgEnv-nvidia with PrgEnv-gnu

Lmod is automatically replacing "nvidia/20.9" with "gcc/10.3.0".
...

$ gcc --version
gcc (GCC) 10.3.0 20210408 (Cray Inc.)
...

$ module load cpe-cuda

The following have been reloaded with a version change:
  1) gcc/10.3.0 => gcc/9.3.0

$ gcc --version
gcc (GCC) 9.3.0 20200312 (Cray Inc.)
...

Note that if the cpe-cuda module is loaded before the PrgEnv-gnu module, there will be no change:

$ module load cpe-cuda

$ module load PrgEnv-gnu        # Swapping PrgEnv-nvidia with PrgEnv-gnu
...

$ gcc --version
gcc (GCC) 10.3.0 20210408 (Cray Inc.)
...

Compilers

Intel

The Intel compiler suite is available via the PrgEnv-intel module, which will load the intel module for the actual native compilers. This compiler suite is loaded by default on Cori. The native compilers in this suite are:

  • C: icc
  • C++: icpc
  • Fortran: ifort

See the full documentation of the Intel compilers. Additionally, compiler documentation is provided through man pages (e.g., man icpc) and through the -help flag to each compiler (e.g., ifort -help).

OpenMP and OpenACC

To enable OpenMP, use the -qopenmp flag.
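
For example (the source file names are illustrative):

icpc -qopenmp -o my_openmp_code.ex my_openmp_code.cpp
ifort -qopenmp -o my_openmp_code.ex my_openmp_code.f90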

The Intel compilers do not support OpenACC.

GNU

The GCC compiler suite is available via the PrgEnv-gnu module, which will load the gcc module for the actual native compilers. The native compilers in this suite are:

  • C: gcc
  • C++: g++
  • Fortran: gfortran

See the full documentation of the GCC compilers. Additionally, compiler documentation is provided through man pages (e.g., man g++) and through the --help flag to each compiler (e.g., gfortran --help).

OpenMP and OpenACC

OpenMP/OpenACC offloading to GPUs not supported yet

Offloading to GPUs with OpenMP/OpenACC is not supported in the PrgEnv-gnu environment on Perlmutter at the moment. The offloading-related information below is provided for future reference only and may be updated.

GCC has support for OpenMP and OpenACC offloading to GPUs. OpenMP offloading with gcc looks something like:

gcc -fopenmp -foffload=nvptx-none="-Ofast -lm -misa=sm_80" base.c -c

where -misa=sm_80 targets the NVIDIA A100 GPU. The extra flags -Ofast -lm are passed to the offload compiler when building the device binary.

Note that, if the Cray compiler wrapper cc is used, the -target-accel=nvidia80 flag should be used instead:

cc -fopenmp -target-accel=nvidia80 base.c -c

OpenMP/OpenACC GPU offload support in GCC is limited

The GCC compiler's offload capabilities for GPU code generation may be limited in terms of both functionality and performance. Users are advised to try the other compilers for C/C++ codes; the other suites also include Fortran compilers with OpenMP offload capability.

Mixture of C/C++/Fortran and CUDA codes

The programming environment supports a mixture of C/C++/Fortran and CUDA codes. CUDA and CPU codes should be in separate files, and the Cray compiler wrapper commands must be used at link time:

CC -c main.cxx
nvcc -c cuda_code.cu
CC -o main.ex main.o cuda_code.o

Compatibility between nvcc host compiler and gcc compiler

To make the above work, the GCC version needs to be 9.x due to compatibility issues between the compilers.
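
As described in the cpe-cuda section above, loading that module is one way to switch to a gcc version compatible with nvcc:

module load cpe-cuda        # switches gcc to the version compatible with nvcc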

Cray

The HPE Cray compiler suite is available via the PrgEnv-cray module, which will load the cce module for the actual native compilers. The native compilers in this suite are:

  • C: cc
  • C++: CC
  • Fortran: ftn

Full documentation of the Cray compilers is provided in the HPE Cray Clang C and C++ Quick Reference for the C/C++ compilers, and the HPE Cray Fortran Reference Manual for the Fortran compiler. Additionally, compiler documentation is provided through man pages (e.g., man clang or man crayftn) or the help page (cc -help, etc.).

Major changes to Cray compilers starting in version 9.0

Version 8.7.9 of the Cray compiler (CCE) is the last version based on the old compiler environment and default settings. Starting in version 9.0, Cray made major changes to the C/C++ compilers, and smaller changes to the Fortran compiler. In particular:

  • The C/C++ compilers have been replaced with LLVM and clang, with some additional Cray enhancements. This means that nearly all of the compiler flags have changed, and some capabilities available in CCE 8 and previous versions are no longer available in CCE 9. It may also result in performance differences in code generated using CCE 8 vs CCE 9, due to the two versions using different optimizers.
  • OpenMP has been disabled by default in the C, C++, and Fortran compilers. This behavior is more consistent with other compilers. To enable OpenMP, one can use the following flags:
    • C/C++: -fopenmp
    • Fortran: -h omp

Cray provides a migration guide for users switching from CCE 8 to CCE 9.

For users who are unable to migrate their workflows to the clang/LLVM-based CCE 9 C/C++ compilers, Cray has simultaneously released a CCE 9 "classic" version, which continues to use the same compiler technology as CCE 8 and older versions. This version of CCE is available as the module cce/<version>-classic. However, users should be aware that "classic" CCE is now considered "legacy," and that all future versions of CCE will be based on clang/LLVM. See the Cray documentation for CCE "classic".

OpenMP and OpenACC

OpenMP/OpenACC offloading to GPUs not supported yet

Offloading to GPUs with OpenMP/OpenACC is not supported in the PrgEnv-cray environment on Perlmutter at the moment. The offloading-related information below is provided for future reference only and may be updated.

The Cray compilers have a mature OpenMP offloading implementation.

Compiling codes using OpenMP offload capabilities on Perlmutter requires different flags for C and C++ codes than for Fortran codes. The C and C++ compilers are based on clang, and thus use similar flags that one would use for clang to generate OpenMP offload code:

cc -fopenmp -target-accel=nvidia80 -o my_openmp_code.ex my_openmp_code.c

CC -fopenmp -target-accel=nvidia80 -o my_openmp_code.ex my_openmp_code.cpp

For Fortran codes the flag is different; in addition, either set the environment variable CRAY_ACCEL_TARGET to nvidia80 at compile time or use the -target-accel=nvidia80 compiler flag. Then, build as follows:

ftn -h omp -target-accel=nvidia80 -o my_openmp_code.ex my_openmp_code.f90

Only the Fortran compiler supports OpenACC.

The compiler flag for enabling OpenACC in Fortran codes is -h acc. To offload to GPUs, use the -target-accel=nvidia80 compiler flag, or set the CRAY_ACCEL_TARGET environment variable to nvidia80.
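
For example (the source file name is illustrative):

ftn -h acc -target-accel=nvidia80 -o my_openacc_code.ex my_openacc_code.f90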

Explicitly set the target to host CPUs when compiling OpenMP/OpenACC code for the host on Perlmutter

Due to an issue with the PrgEnv-cray compiler wrappers, you must add the -target-accel=host compiler option or load the craype-accel-host module in order to successfully compile any OpenMP/OpenACC code for the host.

ftn -h omp -target-accel=host -o my_openmp_code.ex my_openmp_code.f90

Mixture of C/C++/Fortran and CUDA codes

The programming environment allows a mixture of C/C++/Fortran and CUDA codes. In this case, CUDA and CPU codes should be in separate files, the Cray compiler wrapper commands must be used at link time, and the CUDA runtime library must be linked:

CC -c main.cxx
nvcc -c cuda_code.cu
CC -o main.ex main.o cuda_code.o -lcudart

NVIDIA

The NVIDIA compiler suite is available via the PrgEnv-nvidia module, which will load the nvidia module for the actual native compilers. The native compilers in this suite are:

  • CUDA compiler drivers
    • CUDA C/C++: nvcc
    • CUDA Fortran: nvfortran
  • HPC compilers: for host multithreading and GPU offloading with OpenMP, OpenACC, C++17 Parallel Algorithms and Fortran's DO-CONCURRENT; part of the NVIDIA HPC SDK:
    • C: nvc
    • C++: nvc++
    • Fortran: nvfortran

The CUDA compiler drivers are used to compile CUDA codes. The following compiles a hello-world CUDA code, helloworld.cu, into an executable helloworld:

$ cat helloworld.cu
#include <stdio.h>

__global__ void helloworld() {
  printf("Hello, World!\n");
}

int main() {
  helloworld<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}

$ nvcc -o helloworld helloworld.cu

OpenMP, OpenACC and CUDA

If OpenMP and CUDA code coexist in the same program, the OpenMP runtime and the CUDA runtime use the same CUDA context on each GPU. To enable this coexistence, use the compilation and linking option -cuda, as shown below.

$ cat cuda_interop.cpp      # offload code calling a function in a CUDA code
...
#pragma omp target data map(from:array2D[0:M][0:N])
{
  ...
#pragma omp target data use_device_ptr(p)
  {
    add_i_slice(p, i, N);
  }
  ...
}
...

$ cat interop_kernel.cu     # CUDA code where the called function is defined
...
__global__ void add_kernel(int *slice, int t, int n)
{
  ...
}

void add_i_slice(int *slice, int i, int n)
{
  add_kernel<<<n/128, 128>>>(slice, i, n);
}
...

$ nvc++ -Minfo -mp -target=gpu -c cuda_interop.cpp
$ nvcc -c interop_kernel.cu

$ nvc++ -mp -target=gpu -cuda interop_kernel.o cuda_interop.o

where -mp is to enable OpenMP and -target=gpu is to offload the OpenMP construct to GPUs.

Note that, in the above non-MPI code example, the HPC compiler nvc++ is used, but the Cray compiler wrapper, CC, can be used instead. In that case, drop the -target=gpu flag from the CC commands as the offload target is correctly set by the craype-accel-nvidia80 module. MPI codes must be compiled with the Cray compiler wrapper if Cray MPI is to be used.

The HPC compilers support OpenMP and OpenACC offloading. Invoking OpenACC in the HPC compilers, for example, looks like:

nvfortran -acc=gpu -Minfo=acc -o main.ex main.f90

or

nvfortran -acc -target=gpu -Minfo=acc -o main.ex main.f90

where the -acc flag enables OpenACC for GPU execution only, and -Minfo=acc prints diagnostic information to STDERR about whether the compiler was able to generate GPU code successfully.

Note that, when the HPE Cray compiler wrappers are used, replace the -target=gpu flag with -target-accel=nvidia80.

C++17 introduced parallel STL algorithms ("pSTL"), such that standard C++ code can express parallelism when using many of the STL algorithms. The NVIDIA HPC compilers support GPU-accelerated pSTL algorithms, which can be activated by invoking nvc++ with the flag -stdpar=gpu. See the documentation regarding pSTL for the HPC SDK.
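
Below is a minimal sketch (not an official NVIDIA sample; the file and variable names are made up) of a SAXPY-style update written with a parallel STL algorithm, which nvc++ -stdpar=gpu can offload to the GPU:

#include <algorithm>
#include <execution>
#include <vector>

int main() {
  const int n = 1 << 20;
  const float a = 2.0f;
  std::vector<float> x(n, 1.0f), y(n, 2.0f);

  // With nvc++ -stdpar=gpu, this parallel algorithm may execute on the GPU;
  // data movement is handled automatically by CUDA Unified Memory.
  std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(), y.begin(),
                 [=](float xi, float yi) { return a * xi + yi; });
  return 0;
}

$ nvc++ -stdpar=gpu -o saxpy.ex saxpy.cpp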

GPU acceleration of Fortran's DO CONCURRENT is also enabled with the -stdpar option. If the flag is specified, the compiler parallelizes the DO CONCURRENT loops and offloads them to the GPU. All data movement between host memory and GPU device memory is performed implicitly and automatically under the control of CUDA Unified Memory. It is also possible to target a multi-core CPU with -stdpar=multicore. For more info, check the NVIDIA blog, Fortran Standard Parallelism.
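
For example, a Fortran file containing DO CONCURRENT loops could be compiled for GPU execution with (the file name is illustrative):

nvfortran -stdpar=gpu -Minfo -o my_dc_code.ex my_dc_code.f90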

The NVIDIA HPC SDK provides cuTENSOR extensions so that some Fortran intrinsic math functions can be accelerated on GPUs. Accelerated functions include MATMUL, TRANSPOSE, and several others. The nvfortran compiler provides access to these GPU-accelerated functions via the module cutensorEx. See the documentation about the cutensorEx module in nvfortran.

CUDA Math libraries (cuBLAS, cuFFT, cuFFTW, cuSOLVER, etc.) can be linked easily by specifying the name of the library with the -cudalib flag:

nvfortran -Minfo -mp -target=gpu -cudalib=cublas mp_cublas.f90

Note again that, when the HPE Cray compiler wrapper ftn is used, replace the -target=gpu flag with -target-accel=nvidia80.

Full documentation of the NVIDIA compilers can be found in the NVIDIA HPC Compilers, User's Guide and the CUDA C++ Programming Guide.

Please check the NVIDIA HPC SDK - OpenMP Target Offload Training, December 2020 for useful information on the HPC compilers.

AOCC

The AOCC (AMD Optimizing C/C++ Compiler) compiler suite is based on LLVM and includes many optimizations for the AMD processors. It supports Flang as the Fortran front-end compiler. The AOCC suite is available via the PrgEnv-aocc module, which will load the aocc module for the actual native compilers. The native compilers in this suite are:

  • C: clang
  • C++: clang++
  • Fortran: flang

Full documentation of the AOCC compilers is provided at the AOCC webpage, where you can find user manuals and a quick reference guide: AOCC User Guide, Clang – the C, C++ Compiler, Flang – the Fortran Compiler, and Compiler Options Reference Guide for AMD EPYC 7xx3 Series Processors.

OpenMP and OpenACC

The compilers can generate OpenMP parallel code for the host CPU only and do not support offloading to NVIDIA GPUs. To enable OpenMP, add the compiler flag -fopenmp for C and C++, or -mp for Fortran:

clang -fopenmp -o my_openmp_code.ex my_openmp_code.c

clang++ -fopenmp -o my_openmp_code.ex my_openmp_code.cpp

flang -mp -o my_openmp_code.ex my_openmp_code.f90

When using the HPE Cray compiler wrappers, add the target flag -target-accel=nvidia80 for offloading to GPUs.

OpenACC is not supported.

Mixture of C/C++/Fortran and CUDA codes

The programming environment allows a mixture of C/C++/Fortran and CUDA codes. In this case, CUDA and CPU codes should be in separate files, the Cray compiler wrapper commands must be used at link time, and the CUDA runtime library must be linked:

CC -c main.cxx
nvcc -c cuda_code.cu
CC -o main.ex main.o cuda_code.o -lcudart

LLVM

Note

The information below is about the LLVM compilers on Cori. Information for Perlmutter will be provided when it becomes more available.

The LLVM core libraries along with the compilers are locally built by NERSC, not HPE Cray. It is compiled against the GCC compiler suite and thus cannot be used with the Intel or HPE Cray programming environments.

The native compilers in this suite are:

  • C: clang
  • C++: clang++

To enable the clang compilers, first make sure to load the GNU programming environment (the gcc module), then load the llvm module:

module load gcc
module load llvm/<version>

where module avail llvm displays which versions are currently installed.

Using the clang++ compiler

The clang++ compiler will fail unless you add a compiler option to use an official C++ standard, e.g., -std=c++11. The issue seems to be related to GPU-offload support for GCC extensions, e.g., __float128 type.
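
For example (the file name is illustrative):

clang++ -std=c++11 -o my_code.ex my_code.cpp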

The LLVM/clang compiler is also a valid CUDA compiler. One can replace NVIDIA's nvcc command with clang --cuda-gpu-arch=<arch>, where <arch> on the Cori GPU nodes is sm_80. If using clang as a CUDA compiler, one usually will also need to add the -I/path/to/cuda/include and -L/path/to/cuda/lib64 flags manually, since nvcc includes them implicitly.
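
A sketch of such an invocation is shown below; the include and library paths come from the CUDA installation provided by the cuda module, and CUDA_HOME is used here only as a placeholder for that path:

clang++ --cuda-gpu-arch=sm_80 -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart -o helloworld helloworld.cu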

For documentation of the LLVM compilers, see LLVM, Clang, and Flang websites. Additionally, compiler documentation is provided through man pages (e.g., man clang) and through the -help flag to each compiler (e.g., clang -help).

Common compiler options

Below is a summary of common flags for each of the compilers.

Overall optimization
  • Intel: -O<n>, -Ofast
  • GNU: -O<n>, -Ofast
  • Cray: -O<n>, -Ofast
  • NVIDIA: -O<n>
  • AOCC: -O<n>, -Ofast
  • LLVM: -O<n>, -Ofast
  • Note: replace <n> with 1, 2, 3, etc.

Enable OpenMP
  • Intel: -qopenmp
  • GNU: -fopenmp
  • Cray: -fopenmp for C/C++ with CCE 9.0 or later; -h omp otherwise
  • NVIDIA: -mp[=multicore*|[no]align] (*: default)
  • AOCC: -fopenmp for C/C++; -mp for Fortran
  • LLVM: -fopenmp
  • Note: OpenMP enabled by default in Cray.

Enable OpenACC
  • Intel: not supported
  • GNU: -fopenacc
  • Cray: -h acc (Fortran only)
  • NVIDIA: -acc
  • AOCC: not supported
  • LLVM: not supported (OpenACC not supported by clang/clang++)

Free-form Fortran
  • Intel: -free
  • GNU: -ffree-form
  • Cray: -f free
  • NVIDIA: -Mfree
  • AOCC: -Mfreeform
  • Note: also determined by the file suffix (.f, .F, .f90, etc.)

Fixed-form Fortran
  • Intel: -fixed
  • GNU: -ffixed-form
  • Cray: -f fixed
  • NVIDIA: -Mfixed
  • AOCC: -Mfixed
  • Note: also determined by the file suffix (.f, .F, .f90, etc.)

Debug symbols
  • Intel: -g
  • GNU: -g
  • Cray: N/A (debug symbols are enabled by default in Cray)
  • NVIDIA: HPC compilers: -g, -gopt; CUDA: -g (or --debug) for host code and -G (or --device-debug) for device code
  • AOCC: -g
  • LLVM: -g