Skip to content

Perlmutter Readiness


This page is currently under active development. Check back soon for more content.

This page contains recommendations for application developers and users to "hit the ground running" on Perlmutter which unlike previous NERSC systems features GPUs (NVIDIA A100).

Testing of performance on relevant hardware and compatibility with the software environment are both important.

Getting Started with GPUs

Memory Management

Unlike multicore architectures like Intel's Knight Landing and Haswell processors on Cori, GPU nodes on Perlmutter have two distinct memory spaces: one for the CPUs, known as the host memory and one for the GPUs called as the device memory. Similar to CPUs, GPU memory spaces have their own hierarchies. The hierarchy with the largest storage space is called the global memory and is similar to the RAM on CPUs. A100 GPUs on Perlmutter have 40GBs of global memory which is shared between all the threads of a GPU. L2 caches form the next level of memory hierarchy with 40MB shared among all the GPU threads. L2 cache has a lower access latency than the global memory. Additionally, each SM has 192KB of memory that is divided between shared memory and L1 cache. Each SM has 65536 registers which have the least latency from the GPU threads.

For application kernels to run on a GPU, memory must be transfered from CPU to GPU, either via explicit memory transfers made by the programmer or by taking advantage of a Unified Virtual Memory (UVM) subsystem, where either the software or the hardware performs the data transfer seamlesly without any input from the user.

Unified Virtual Memory (UVM)

Most of the programming frameworks that target NVIDIA GPUs provide a mechanism to take advantage of the UVM subsytem. OpenMP5.0 provides directives that allow users to request memory to be allocated via a UVM mechanism. NVIDIA implementation of OpenACC allows UVM implementation via compiler flags while Kokkos provides a UVM backend called CUDAUVM which uses the UVM calls from the Cuda framework. Shown below is an example of an array addition performed using the UVM calls available in the Cuda programming language.

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel_add(int *a, int *b, int *c) {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    c[threadId] = a[threadId] + b[threadId];

int main(int argc, char **argv) {
    int N = (argc > 1) ? atoi(argv[1]) : 64;
    if (N % 32 != 0) {
            "Error: Please provide the number of blocks as a multiple of 32 "
            "for simplicity.\n");

    size_t size = N * sizeof(int);

    // Allocate the memory using `cudaMallocManaged` call from CUDA.
    int *a, *b, *c;
    cudaMallocManaged(&a, size);
    cudaMallocManaged(&b, size);
    cudaMallocManaged(&c, size);

    for (int i = 0; i < N; ++i) {
        a[i] = i + 1;
        b[i] = i + 2;

    // Generate a kernel with N/32 blocks and 32 threads per block.
    kernel_add<<<N / 32, 32>>>(a, b, c);

    // Wait for all the kernels to finish their work.

    // Correctness.
    for (int i = 0; i < N; ++i) {
        if (c[i] != i * 2 + 3)
            printf("Failure: expected = %d, c[%d] = %d\n", i + 3, i, c[i]);

        "If no failure message is printed before this then the program ran "
        "successfully!! \n");

    // Delete memory

    return 0;

In the above example, cudaMallocManaged call from Cuda is used to allocate memory for a,b and c. This allows the Cuda framework to manage data via the UVM subsystem and transfers data from host-to-device or from device-to-host as and when needed. Data transfers happen at the hierarchy of global memory when transfered from host-to-device. Use of UVM for data management relieves a programmer from the responsibility of memory management. It can be beneficial to rely on UVM at the start of a project where CPU kernels are ported to GPU since memory management would be one less issue to consider in the intial stages of code transition. UVM is also beneficial in applications with complicated data structures where tracking the movement of data might be difficult.

Prefetch Optimization

Data allocated via the above method, i.e., using cudaMallocManaged, if requested and not present on the requested device, is moved at the granularity of a page(4KB). However, this might not be efficient for data larger than a page or if all the data requested is not be present on a single page. In order to optimize on the data movement, UVM allows users to adjust the granularity of the data transfers via the "prefetch" optimizations. For example, in the above code the following optimizations can be made to reduce the number of memory transfers:

    // Set the default GPU 0.
    int gpuDeviceId = 0;

    // Prefetch the data needed on the GPU before the kernel launch.
    cudaMemPrefetchAsync(a, size, gpuDeviceId, 0);
    cudaMemPrefetchAsync(b, size, gpuDeviceId, 0);
    cudaMemPrefetchAsync(c, size, gpuDeviceId, 0);

    // Generate a kernel with N/32 blocks and 32 threads per block.
    kernel_add<<<N / 32, 32>>>(a, b, c);
    // Wait for all the kernels to finish their work.

    // Prefecth the data needed on the CPU after the kernel execution.
    cudaMemPrefetchAsync(c, size, cudaCpuDeviceId, 0);

The cudaMemPrefetchAsync call migrates the data to the destination device, the third parameter in the call, as soon as it is encountered if there is no previously launched work on the same stream which is the final parameter of the call. Similarly, the prefetch optimization is used to migrate c back onto CPU at the end of the kernel execution. cudaCpuDeviceId is the device id for the host memory. Similar to cudaMemPrefetchAsync, cudaMemAdvise can also be used to inform the UVM framework about the intended use of the memory allocated via cudaMallocManaged.

Manual Memory Management

Although UVM makes it easier to manage memory transfers from host-to-device and vice vesra, manual memory management is generally accepted to be the performant method to handle data movement. However, this method puts the responsibility of data transfer on the programmer. Shown below is a modified version of the array addition example using manually managed data transfers.

    // Allocate memory on the host.
    int *h_a = (int *)malloc(size);
    int *h_b = (int *)malloc(size);
    int *h_c = (int *)malloc(size);

    // Allocate corresponding memory on the device.
    int *a,*b,*c;
    cudaMalloc(&a, size);
    cudaMalloc(&b, size);
    cudaMalloc(&c, size);

    // Initialize data on the host data structures.
    for (int i = 0; i < N; ++i) {
        h_a[i] = i + 1;
        h_b[i] = i + 2;

    // Copy data from host to device.
    cudaMemcpy(a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b, h_b, size, cudaMemcpyHostToDevice);

    // Pass device data structures to the kernel.
    kernel_add<<<N / 32, 32>>>(a, b, c);

    // Wait for all the kernels to finish their work.

    // After the kernel execution copy the results back from device to host.
    cudaMemcpy(h_c, c, size, cudaMemcpyDeviceToHost);

Manual memory management requires users to maintain distinct data structures for host and device. In the above example h_a, h_b and h_c are the host mirrors for a, b and c and are allocated using the regular host memory allocation calls. cudaMalloc is used to allocate memory on the device. In order to transfer memory from host-to-device and vice versa cudaMemcpy is used which is a host side call to transfer memory to the destination pointer(first parameter) from the source pointer(second parameter) and transfers the data in bytes (thrid parameter). The direction of the memory transfer is provided by the final parameter in the cudaMemcpy call. A detailed description of the available Cuda calls related to memory management is provided here cuda-memory-management

ATS is another UVM implementation that is available on Summit supercomputer but will not be available on Perlmutter. In the ATS implementation GPUs are able to read CPUs page tables at the granularity of a cacheline. Details and analysis of the differences between ATS and using managed memory from Cuda is discussed here UVM-implementations.

Introduction to GPUs

Additional Resources

Slides / Video from training events

Training Events

Prebuilt applications

High-level considerations for using GPUs

Unlike CPUs, accelerator program requires additional care with memory management. Transfers from Host-to-Device and Device-to-Host are comparatively much slower than the memory bandwidth and care should be taken to minimize them.

Tensor cores

The "Ampere" architecture on the NVIDIA A100 GPUs on Perlmutter feature tensor cores, which are specialized accelerators which can significantly accelerate certain types of computational kernels, typically those involving operations on matrices. An innovation of the tensor cores on A100 GPUs which may be of interest to HPC applications running at NERSC is support for double precision floating point (FP64) data types, in contrast to previous generations of tensor cores which were limited to reduced precision data types. A second improvement in A100 tensor cores is support for and acceleration of operations on sparse data structures.

NVIDIA provides substantial documentation regarding programming models which can target the A100 tensor cores:

Programming Models

The choice of programming model depends on multiple factors including: number of performance critical kernels, source language, existing programming model, and portability of algorithm. A 20K line C++ code with 5 main kernels will have different priorities and choices vs a 10M line Fortran code.

NERSC supports a wide variety of programming models on Perlmutter:

Native CUDA C/C++

CUDA is NVIDIA's native GPU programming language. CUDA is well established, mature and its design is the basis and backend for most other languages and frameworks, such as HIP and Kokkos. NVIDIA works directly with the scientific community to develop CUDA's performance and features and has a variety of teaching material and community outreach to help users achieve their computational goals. NVIDIA also develops a variety of performant mathematical, parallel algorithm, communication and other libraries to help users achieve optimal performance on NVIDIA hardware.

Given CUDA's ubiquitous nature, it may be the best language to study and follow in order to understand GPU programming. Its highly developed programming model is the standard for discussion throughout the community. Many other programming models are compatible with CUDA. This can allow highly optimized CUDA code to be implemented in critical code paths while using more portable frameworks elsewhere.

CUDA is a good option for smaller codes, codes that are primiarily looking to run on NVIDIA systems and codes that continuously develop and maximize their performance. It will often have the newest features and high performance on NVIDIA systems and writing in CUDA will get you the newest features, performance improvements and bug fixes quickly.

However, CUDA only runs on NVIDIA hardware, so writing in CUDA will require additional work to achieve portability to other vendor's GPUs. That process may be simple, such as using a tool to convert, or quite complex if specialized, performant algorithms have been created. CUDA also typically requires more tuning and understanding than pragma-based languages, such as OpenMP or portability layers, such as Kokkos, so developers looking for a quick port may wish to explore other options.




Codes at NERSC that use CUDA:


With C++ it is possible to encapsulate such code through template specialization in order to preserve portability.


NVIDIA compilers are recommended for obtaining good performance with directive based approaches.



OpenACC is a programming framework that allows users to expose parallelism in their code via compiler directives. Compilers that support the framework will interpret the directives and schedule the parallel code accordingly. On Perlmutter, NVHPC/HPC-SDK (formerly PGI), GCC and CCE compilers will support OpenACC directives where, the degree of the support will vary based on the compiler. Here is the OpenACC specification OpenACC spec.

The benefit of OpenACC is its ease of use. But the downside of the approach is the inability to perform low level optimizations. Following are few of the highlights and recommendations for OpenACC:

  • OpenACC provides kernels construct for users to identify code segments that may contain parallelism. It allows users to rely on compilers to identify and safely parallelize the marked code sections. This is a good starting place for OpenACC.
  • Take advantage of OpenACC clauses such as gang, vector and worker to map hierarchical parallelism onto the GPUs.
  • collapse clause can be used to expose more parallelism in nested loop structures.
  • Optimize the data movement in applications. Due to high cost of data transfer between CPU and GPU memories, it is beneficial to perform computation on the GPU even when the code lacks parallelism if the trade-off is a reduction in memory transfers.
  • Reorder loops and data structures to take advantage of memory coalescing.
  • OpenACC allows interoperability with other programming languages via the use of clauses such as host_data and deviceptr.
  • OpenACC implementation of NVHPC compiler allows interoperability with CUDA. This implies that kernels that need low level optimizations via architecture specific programming can be intermixed with parts of code that are implemented in OpenACC.
  • Some of the advanced OpenACC features include asynchronus operations and multi-device programming.

The following link provides a detailed description of the best practices to be followed in OpenACC OpenACC best practices.

C++ language based




Kokkos is a C++ based programming model that allows mapping abstract hierarchies of parallelism onto diverse architectures. It provides a unified front-end for users to write portable parallel code via its constructs while the framework provides architecture specific backends to map the constructs onto the intended hardware. A few of the framework's highlights are :

  • The View abstraction for multidimensional arrays. It allows users to chose the optimal order to traverse the data structures based on the loop ordering and target architecture via its semantics.
  • The framework allows users to expose additional parallelism by taking advantage of its MDRange constructs.
  • Kokkos also allows the users to exploit multiple levels of parallelism and take advantage of hierarchical parallelism available in most of the modern high performance computers.

Kokkos framework is availiable at Github. Here is a link to the tutorial lecture series on use of the Kokkos framework kokkos lecture series.




HIP is AMD's GPU portability layer. HIP is compiled with either a CUDA or ROCm backend to run on NVIDIA or AMD GPUs, respectively. ROCm is AMD's native programming language, although its a lower-end language than HIP and it is expected that users write their codes in HIP and rarely use ROCm directly.

HIP is designed to directly mimic CUDA. HIP uses an extremely similar programming model and it designed so that most of its calls are direct copies of cuda, in most cases simply replacing "cuda" with "hip". HIP uses a subset of CUDA calls (roughly based on a target version of CUDA) and additional AMD specific features, so porting between HIP and CUDA is typically a very simple process.

HIP is a good option for those code teams with knowledge of the CUDA programming model and who also want to achieve maximum performance on AMD systems. For teams targeting both Perlmutter and Frontier (NVIDIA and AMD architechtures), HIP may be the ideal programming model.

However, HIP may not work well for the same codes that should avoid CUDA: codes that want full portability, minimial tuning, minimal editing or don't have the time to maintain the code base. OpenMP, Kokkos, or another abstraction layer is likely the better bet for such coding efforts.



Codes at NERSC that use CUDA:





The Fortran standard avoids language and features which target particular architectures. This generic approach provides Fortran compilers the freedom to interpret features in different ways, including how to offload kernels to GPUs. Consequently, different compilers may implement certain features of the Fortran language differently in terms of how they are accelerated on GPUs.

On Perlmutter, perhaps the most effective resource for accelerating Fortran code on GPUs is the compiler included in the NVIDIA HPC SDK. The HPC SDK provides three separate mechanisms by which Fortran programs may be accelerated on A100 GPUs. These are described below.

GPU acceleration with standard Fortran

The HPC SDK enables GPU acceleration of several different Fortran features. One is DO CONCURRENT loops which can be offloaded to GPUs by adding the -stdpar compiler flag. The HPC SDK can also offload several other Fortran features to GPUs by mapping them to cuTENSOR functions; these features include RESHAPE, TRANSPOSE, SPREAD, certain element-wise expressions, MATMUL, and several element-wise functions.

Please provide feedback to NERSC

GPU acceleration of standard Fortran intrinsics as described above is a new feature in the NVIDIA HPC SDK compiler. Users are encouraged to experiment with this feature in the compiler and provide feedback to NERSC by filing a ticket at the online help desk.


The HPC SDK also implements a subset of OpenMP offload capabilities which can be used in Fortran programs too. Features of OpenMP which are not implemented in the HPC SDK are parsed but ignored in order to maintain portability across OpenMP-enabled Fortran compilers.

A feature in OpenMP implemented in the HPC SDK which may be of special interest to Fortran programmers is the loop construct, which is similar to the kernels construct in OpenACC. The loop construct is a descriptive directive which allows the compiler the freedom to analyze the contained loop for dependencies and apply optimizations for GPU offload as it sees fit; see here for details regarding its implementation in the HPC SDK.


The HPC SDK continues to support OpenACC for GPU acceleration; see here for details.



For more information about preparing Python code for Perlmutter GPUs, please see this page.


Some modern MPI implementations support a feature known as "CUDA-awareness." This term remains somewhat loosely defined, but is generally understood to mean that pointers to data residing in GPU device memory can be used as arguments in MPI function calls invoked on the CPU.

CUDA-awareness in MPI has several potential benefits but also some potential drawbacks. One significant benefit is that it can be a convenience to the programmer. For example, when executing an MPI function call, a CUDA-aware MPI implementation may inspect the pointer to determine whether the data it points to resides in CPU or GPU memory, and if the latter, the MPI implementation may execute a cudaMemcpy() to copy the data from the GPU to the CPU, and then execute the MPI function on the CPU memory. Such an implementation would save the programmer the work of writing the cudaMemcpy() in their own code, but would not offer any performance benefit. An alternative implementation of this behavior might be for CUDA to allocate the memory using cudaMallocManaged(), in which case the CUDA runtime itself may copy the data back to CPU memory before executing the MPI function.

There are also potential performance benefits to CUDA-aware implementations of MPI, namely that the library can take advantage of hardware features like GPUDirect, such that if the user executes an MPI function with a pointer to data in device memory, the MPI library may copy the data directly from GPU device memory to the network device on the node, without copying the data to CPU memory. This behavior would provide both a convenience and a performance benefit.

However, there are also potential drawbacks to writing programs which target CUDA-aware MPI implementations. Chiefly, CUDA-aware MPI is not standardized, and so its behavior may vary among different MPI implementations which each claim to be "CUDA-aware." For example, one MPI implementation may support CUDA-awareness only in point-to-point communication routines, whereas another may support it only in collective routines. This may result in programs executing correctly on one system, but crashing on another.

Details regarding CUDA-awareness support in MPI on Perlmutter will be forthcoming. However, some open-source MPI implementations already support this behavior, e.g., OpenMPI and MVAPICH2-GDR. Users are encouraged to review those pages to understand how CUDA-awareness is implemented in those libraries.

Machine Learning Applications

We support distributed PyTorch and TensorFlow on Perlmutter. Currently, we recommend to use the relevant Shifter images which are based on optimized NVIDIA NGC ones, details can be found in our Machine Learning documentation.

Performance and IO optimization

Deep learning pipelines often need performance tuning to make best use of powerful accelerators. Special considerations should be given to IO bottlenecks. Our NERSC-NVIDIA Deep Learning at Scale Tutorial (Supercomputing 2020) provides guidance on how to analyze and address some of these performance challenges.

Case studies

  1. Optimizing Data Processing Pipelines 3D CNN (Cosmo-3D)


Alternatives to Intel Math Kernel Library (MKL) for CPU math functions

Several generations of NERSC systems have provided the Intel Math Kernel Library (MKL), which provides a large, diverse collection of functions and routines for commonly used numerical calculations, which are summarized here. MKL is optimized for Intel architectures, such as the Haswell and Knights Landing processors on the Cori system.

While it is possible that Intel MKL may work on the AMD CPUs on Perlmutter, it is not guaranteed, nor is it certain that if it does work that will achieve similar levels of performance that its users have enjoyed on Intel architectures. Consequently, NERSC users are encouraged to explore other software offerings which may fulfill many of the same roles as MKL.

Currently, the AMD AOCL library provides implementations of several math functions and routines provided in MKL, including BLAS, LAPACK, ScaLAPACK, sparse linear algebra, FFTW, optimized math functions and random number generators.

GPU implementations of math libraries are often much faster

While CPU-based implementations of math functions and routines included in MKL are available and provide below, users should first consider GPU implementations, which are often much faster.

Math libraries for Nvidia GPUs

One way to port existing applications to GPUs without rewriting the entire application in a GPU-aware programming model is to use GPU-accelerated libraries to execute performance-critical calculations. Below is an incomplete list of GPU-accelerated math libraries which are expected to run effectively on Perlmutter.

Linear algebra

Dense linear algebra
  • BLAS: Nvidia cuBLAS
  • LAPACK: Nvidia cuSOLVER (not all LAPACK subroutines have been implemented)
  • BLAS/LAPACK: MAGMA (not all LAPACK subroutines have been implemented)
  • ScaLAPACK replacement: SLATE
Sparse linear algebra
Matrix operations/solvers
Algebraic multigrid

Solver libraries


  • Single GPU:

    • Nvidia cuFFT/cuFFTW (cuFFTW library provides the FFTW3 API to facilitate porting of existing FFTW applications), supports multiple GPUs on a single compute node
  • Distributed FFT (multi-node):

Ray tracing

IO Considerations

All file systems can be access from Perlmutter except Cori Scratch. Details about the Perlmutter Scratch File System can be found here.

AMD CPU Considerations


Profiling tools

Perlmutter includes several tools available for profiling CPU and GPU applications.

  • Nsight Systems: low-overhead sampling-based tool for collecting "timelines" of CPU and GPU activity.
  • Nsight Compute: higher-overhead profiling tool which provides a large amount of detail about GPU kernels; works best with short-running kernels.
  • HPCToolkit: comprehensive profiling tool supporting CPU, GPU, and MPI activity.

When profiling an application using multiple tasks, Nsight Systems and Nsight Compute can generate a separate profile per task. One may also use these tools to profile a subset of all running tasks by using the Slurm flag --multi-prog to run the application using MPMD (see here for details).

Running Applications

Controlling task and GPU binding

By default, in a Slurm job allocation, all GPUs allocated to the job on a particular node are accessible by all tasks in the job which are also on that same node. In some cases, this is precisely the desired behavior for a job, while in others, it is preferable for each task on a node to be bound only to a subset of the GPUs on that node.

There are several ways to control which GPUs are visible to which tasks in a job. One way is via the CUDA Runtime itself; another is to use Slurm.

Using the CUDA Runtime

The CUDA Runtime provides different ways for the programmer to control which GPUs are accessible to which tasks. One way is to use the CUDA_VISIBLE_DEVICES environment variable, which is a variable assigned per-task. No source code changes are required in order to control task and GPU affinity using this approach. More information about using the CUDA_VISIBLE_DEVICES environment variable in a Slurm job is provided here. Additional documentation regarding how to use CUDA_VISIBLE_DEVICES is provided in the intro_mpi man page.

A second way is to use the CUDA Runtime API itself, as described here. This approach requires making changes to one's source code in order to implement.

Using Slurm

An alternative approach to controlling task and GPU affinity is to use Slurm itself. The sbatch, salloc, and srun commands each support several flags specific to GPU allocation, including --gpus-per-task, --gpus-per-node, and several others, which are described here and also in the man pages for the various Slurm commands.

Additionally, Slurm provides a flag --gpu-bind which controls which GPUs are accessible by which tasks on a given node. Documentation regarding this flag is provided here or via the man srun command on Perlmutter. For example, if one wishes to assign to run 4 MPI tasks on a Perlmutter compute node, with each task assigned to a single, unique GPU on the node, this could be accomplished using the following #SBATCH header in a job script:

#SBATCH --gpu-bind=map_gpu:0,1,2,3

Like the environment variable approach described above, this requires no source code changes, but does rely on Slurm implementation details and so may not be portable to other systems with different workload managers.

General recommendations

Write a validation test

Performance doesn't matter if you get the wrong answer!

  • Define benchmarks for performance. These should represent the science cases you want to run on Perlmutter.
  • Use optimized libraries when possible (FFT, BLAS, etc).
  • Start with 1 MPI rank per GPU.
  • Start with UVM and add explicit data movement control as needed.
  • Minimize data movement (Host to device and device to host transfers).
  • Avoid device allocations (Use a pool allocator)


The ability for applications to achieve both portability and high performance across computer architectures remains an open challenge.

However there are some general trends in current and emerging HPC hardware: increased thread parallelism; wider vector units; and deep, complex, memory hierarchies.

In some cases a performance portable algorithm can realized by considering generic "wide vectors" which could map to either GPU SIMT threads or CPU SIMD lanes.

Case Studies and Examples

Many examples of kernels and applications being ported to NVIDIA GPUs are available in literature, conference proceedings, and online at GitHub, GitLab, etc. Below is a small selection of examples and case studies which may be useful to NERSC users as they port their applications to GPUs.

NERSC proxy app suite

NERSC maintains a collection of proxy applications, which are typically small codes with only a handful of computationally intensive algorithms included. These proxy apps are often written in multiple languages and programming models, in order to evaluate performance of those languages and models, as well as to illustrate how one might port an existing code (e.g., one which does not offload anything to a GPU) to a language or model which enables efficient GPU acceleration.

References and Events