Skip to content

Transitioning Applications to Perlmutter

This page contains recommendations for application developers and users to "hit the ground running" on Perlmutter which unlike previous NERSC systems features GPUs (NVIDIA A100).

Testing of performance on relevant hardware and compatibility with the software environment are both important.

Introduction and Basic GPU Concepts

The transition of computational science applications to Perlmutter represents the second leg in NERSC’s transition from traditional server CPU HPC architectures to energy efficient exascale-like architectures that allow the center to increase compute capability as demand increases. NERSC began the transition to exascale-like architectures with Cori KNL (Intel Knights Landing) system deployed in 2016.

Perlmutter gets the majority of its FLOPs from its NVIDIA "Ampere" generation A100 GPUs. Each GPU has the following characteristics:

  • 108 Streaming Multiprocessors (SMs) per device
  • 32 Double Precision wide "Warps" (2 active at once per SM)
  • 64 Warps per SM
  • On device pool of 40GB High Bandwidth Memory with bandwidth of over 1.5 TB/s

Comparing some of the Ivy Bridge and KNL properties to A100 properties we see the following trends:

  • Number of processors (cores, SMs) increasing from 12 → 68 → 108.
  • Vector width (VPU size, Warp Size) increasing from 4 → 8 → 32
  • Oversubscription/Fast-Context-Switching (HyperThreads, Warps per SM) increasing from 2 → 4 → 64
  • HBM availability increasing from 0 → 16GB → 40 GB
  • Fastest memory bandwidth increasing from 50 GB/s → 500 GB/s → 1500 GB/s

The first three trends represent a significant increase in the amount of parallelism available on the device/socket at the core/SM level, "vector" level, and context switching/latency hiding level. Multiplying them together we have 96 way parallelism on Ivy Bridge to 2000+ way parallelism on KNL to 200,000+ way parallelism on A100.

Thus, in just two generations, NERSC systems have grown from 96 → 2,000+ → 200,000+ way parallelism per socket/device.

While there are many details in this transition guide that can and do affect your performance on Perlmutter - the single most important thing to keep in mind is that to effectively use the Perlmutter system you need to find and express as much parallelism as you can in your code or algorithm.

For example, even if a code is latency bound or non-vectorizable, it may still benefit from GPU acceleration if it can expose enough parallelism by keeping a large number of Warps active.

Up to this point, we’ve described a Perlmutter node in a similar way as a Cori KNL node - with simply more parallelism at each level of the CPU/Device. However, there are other important differences. Namely that a Perlmutter node is heterogeneous and actually contains both CPU and GPU compute units. While the memory bandwidth of the HBM on the GPU is 1.5TB/s as described above, the bandwidth for moving data between the CPU and the GPU is two orders of magnitude smaller! So, the second most important thing to keep in mind is that moving data back and forth between the CPU and GPU is very costly and should be avoided or hidden whenever possible. This guide contains multiple strategies for going about this and considers this issue in case-studies.

To maximize performance on the A100 GPUs in Perlmutter, a number of subtle considerations often need to be taken into account - choosing an appropriate programming language/model, utilizing cache and shared memory, minimizing divergence, increasing occupancy while reducing register spills. We dive into a number of these issues in this transition guide. But, as you are considering a transformation for your application, it is important to not get lost in the weeds. The two most important factors to consider remain:

  1. Exposing as much parallelism as possible, and
  2. Minimizing data motion as much as possible.

With that said, let’s go ahead and dive into the exciting task of transitioning to Perlmutter!

Additional High-Level GPU Considerations

NVIDIA Ampere (A100) in depth

Tensor cores

The "Ampere" architecture on the NVIDIA A100 GPUs on Perlmutter feature tensor cores, which are specialized accelerators which can significantly accelerate certain types of computational kernels, typically those involving operations on matrices. An innovation of the tensor cores on A100 GPUs which may be of interest to HPC applications running at NERSC is support for double precision floating point (FP64) data types, in contrast to previous generations of tensor cores which were limited to reduced precision data types. A second improvement in A100 tensor cores is support for and acceleration of operations on sparse data structures.

NVIDIA provides substantial documentation regarding programming models which can target the A100 tensor cores:

Memory Management

Unlike CPUs, accelerator program requires additional care with memory management. Transfers from Host-to-Device and Device-to-Host are comparatively much slower than the memory bandwidth and care should be taken to minimize them.

Unlike multicore architectures like Intel's Knight Landing and Haswell processors on Cori, GPU nodes on Perlmutter have two distinct memory spaces: one for the CPUs, known as the host memory and one for the GPUs called as the device memory. Similar to CPUs, GPU memory spaces have their own hierarchies. The hierarchy with the largest storage space is called the global memory and is similar to the RAM on CPUs. A100 GPUs on Perlmutter have 40GBs of global memory which is shared between all the threads of a GPU. L2 caches form the next level of memory hierarchy with 40MB shared among all the GPU threads. L2 cache has a lower access latency than the global memory. Additionally, each SM has 192KB of memory that is divided between shared memory and L1 cache. Each SM has 65536 registers which have the least latency from the GPU threads.

For application kernels to run on a GPU, memory must be transfered from CPU to GPU, either via explicit memory transfers made by the programmer or by taking advantage of a Unified Virtual Memory (UVM) subsystem, where either the software or the hardware performs the data transfer seamlesly without any input from the user.

Unified Virtual Memory (UVM)

Most of the programming frameworks that target NVIDIA GPUs provide a mechanism to take advantage of the UVM subsytem. OpenMP5.0 provides directives that allow users to request memory to be allocated via a UVM mechanism. NVIDIA implementation of OpenACC allows UVM implementation via compiler flags while Kokkos provides a UVM backend called CUDAUVM which uses the UVM calls from the Cuda framework. Shown below is an example of an array addition performed using the UVM calls available in the Cuda programming language.

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel_add(int *a, int *b, int *c) {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    c[threadId] = a[threadId] + b[threadId];

int main(int argc, char **argv) {
    int N = (argc > 1) ? atoi(argv[1]) : 64;
    if (N % 32 != 0) {
            "Error: Please provide the number of blocks as a multiple of 32 "
            "for simplicity.\n");

    size_t size = N * sizeof(int);

    // Allocate the memory using `cudaMallocManaged` call from CUDA.
    int *a, *b, *c;
    cudaMallocManaged(&a, size);
    cudaMallocManaged(&b, size);
    cudaMallocManaged(&c, size);

    for (int i = 0; i < N; ++i) {
        a[i] = i + 1;
        b[i] = i + 2;

    // Generate a kernel with N/32 blocks and 32 threads per block.
    kernel_add<<<N / 32, 32>>>(a, b, c);

    // Wait for all the kernels to finish their work.

    // Correctness.
    for (int i = 0; i < N; ++i) {
        if (c[i] != i * 2 + 3)
            printf("Failure: expected = %d, c[%d] = %d\n", i + 3, i, c[i]);

        "If no failure message is printed before this then the program ran "
        "successfully!! \n");

    // Delete memory

    return 0;

In the above example, cudaMallocManaged call from Cuda is used to allocate memory for a,b and c. This allows the Cuda framework to manage data via the UVM subsystem and transfers data from host-to-device or from device-to-host as and when needed. Data transfers happen at the hierarchy of global memory when transfered from host-to-device. Use of UVM for data management relieves a programmer from the responsibility of memory management. It can be beneficial to rely on UVM at the start of a project where CPU kernels are ported to GPU since memory management would be one less issue to consider in the intial stages of code transition. UVM is also beneficial in applications with complicated data structures where tracking the movement of data might be difficult.

Prefetch Optimization

Data allocated via the above method, i.e., using cudaMallocManaged, if requested and not present on the requested device, is moved at the granularity of a page(4KB). However, this might not be efficient for data larger than a page or if all the data requested is not be present on a single page. In order to optimize on the data movement, UVM allows users to adjust the granularity of the data transfers via the "prefetch" optimizations. For example, in the above code the following optimizations can be made to reduce the number of memory transfers:

    // Set the default GPU 0.
    int gpuDeviceId = 0;

    // Prefetch the data needed on the GPU before the kernel launch.
    cudaMemPrefetchAsync(a, size, gpuDeviceId, 0);
    cudaMemPrefetchAsync(b, size, gpuDeviceId, 0);
    cudaMemPrefetchAsync(c, size, gpuDeviceId, 0);

    // Generate a kernel with N/32 blocks and 32 threads per block.
    kernel_add<<<N / 32, 32>>>(a, b, c);
    // Wait for all the kernels to finish their work.

    // Prefecth the data needed on the CPU after the kernel execution.
    cudaMemPrefetchAsync(c, size, cudaCpuDeviceId, 0);

The cudaMemPrefetchAsync call migrates the data to the destination device, the third parameter in the call, as soon as it is encountered if there is no previously launched work on the same stream which is the final parameter of the call. Similarly, the prefetch optimization is used to migrate c back onto CPU at the end of the kernel execution. cudaCpuDeviceId is the device id for the host memory. Similar to cudaMemPrefetchAsync, cudaMemAdvise can also be used to inform the UVM framework about the intended use of the memory allocated via cudaMallocManaged.

Manual Memory Management

Although UVM makes it easier to manage memory transfers from host-to-device and vice vesra, manual memory management is generally accepted to be the performant method to handle data movement. However, this method puts the responsibility of data transfer on the programmer. Shown below is a modified version of the array addition example using manually managed data transfers.

    // Allocate memory on the host.
    int *h_a = (int *)malloc(size);
    int *h_b = (int *)malloc(size);
    int *h_c = (int *)malloc(size);

    // Allocate corresponding memory on the device.
    int *a,*b,*c;
    cudaMalloc(&a, size);
    cudaMalloc(&b, size);
    cudaMalloc(&c, size);

    // Initialize data on the host data structures.
    for (int i = 0; i < N; ++i) {
        h_a[i] = i + 1;
        h_b[i] = i + 2;

    // Copy data from host to device.
    cudaMemcpy(a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b, h_b, size, cudaMemcpyHostToDevice);

    // Pass device data structures to the kernel.
    kernel_add<<<N / 32, 32>>>(a, b, c);

    // Wait for all the kernels to finish their work.

    // After the kernel execution copy the results back from device to host.
    cudaMemcpy(h_c, c, size, cudaMemcpyDeviceToHost);

Manual memory management requires users to maintain distinct data structures for host and device. In the above example h_a, h_b and h_c are the host mirrors for a, b and c and are allocated using the regular host memory allocation calls. cudaMalloc is used to allocate memory on the device. In order to transfer memory from host-to-device and vice versa cudaMemcpy is used which is a host side call to transfer memory to the destination pointer(first parameter) from the source pointer(second parameter) and transfers the data in bytes (thrid parameter). The direction of the memory transfer is provided by the final parameter in the cudaMemcpy call. A detailed description of the available Cuda calls related to memory management is provided here cuda-memory-management

ATS is another UVM implementation that is available on Summit supercomputer but will not be available on Perlmutter. In the ATS implementation GPUs are able to read CPUs page tables at the granularity of a cacheline. Details and analysis of the differences between ATS and using managed memory from Cuda is discussed here UVM-implementations.

Programming Models

The choice of programming model depends, primarily, on either the choice of source language or the willingness of developers to utilize multiple languages. For example, a Fortran code has OpenMP, OpenACC, MPI, Do-Concurrent, and Co-Arrays natively available in the language but can utilize C++ programming models through the interoperability of Fortran with C in conjunction with the interoperability of C with C++, i.e. Fortran ⇔ C ⇔ C++. Additional considerations include the number of performance critical kernels, Amdahl's Law, and the portability of algorithm: a project with 20K lines of code could theoretically be entirely rewritten in another language supporting the targeted programming model, while a project with 10M lines of code would probably have to selectively migrate key algorithms/pipelines which are both capable of parallel execution and constitute significant percentages of the overall runtime.

NERSC supports a wide variety of programming models on Perlmutter. Programming models vary in terms of ease-of-use, portability, and performance portability. For example, directive based approaches such as OpenMP and OpenACC are supported at the compiler-level and are generally accepted as the easiest to use and are relatively portable: a developer adds additional directives to their code and when the compiler supports the programming model for the target architecture, the low-level implementation (e.g. vectorization, threading, CUDA, etc.) is generated by the compiler; when the compiler does not support the directive-based model, these directives are ignored. Alternatively, some programming models require modifying the source code to explicitly utilize the programming model. Consider the following example in C++:

using data_t = std::vector<double>;

void copy(const data_t& foo, data_t& bar)
    for(size_t i = 0; i < bar.size(); ++i)
        bar[i] = foo[i];

A directive-based approach with OpenMP would not require fundamentally changing the loop and assignment:

void copy(const data_t& foo, data_t& bar)
#pragma omp parallel for
    for(size_t i = 0; i < bar.size(); ++i)
        bar[i] = foo[i];

However, an implementation using C++17 executors would require explicitly modifying the code in order to support for the programming model and will fail to compile if the programming model is not supported:

void copy(const data_t& foo, data_t& bar)
    std::copy(std::execution::par, foo.begin(), foo.end(), bar.begin());

NERSC does not explicitly recommend one programming model, the decision for whether to use a directive-based programming model or an explicit programming model (and, once that decision is made, which model in particular to use), should be based on the priorities of each individual project. The following sub-sections are provided to assist users in this decision.


General Recommendations

Many developers are hesitant to break C++11 compatibility due to the inconsistent compiler support in the years following the publication of the 2011 ISO C++ standard. The primary reason behind this inconsistency was the length of time between the C++11 standard and the previous standard: C++98 (13 years). The new model adopted by the ISO C++ standards committee of releasing a new standard every 3 years has drastically improved the communication between the standards committee and the compiler developers and streamlined compiler implementation support. Thus, users are encouraged to upgrade to C++141 as the minimum standard. C++14 has universal compiler support going back several versions2 and provides many valuable additions such as variable templates, generic lambdas, new/delete elision, relaxed constexpr restrictions, return type deductions for functions, and many additions to the standard library. Furthermore, users whose code-bases are not significantly relied upon by other codes are highly encouraged to adopt C++173 as the minimum language standard. C++17 has widespread support4 and introduces numerous new language and library features which can significantly improve performance, portability, and readability. The most notable features are: fold-expressions, compile-time if statements (if constexpr), execution policies, guaranteed copy-elision, inline variables, constexpr lambdas, __has_include directives, string-views, filesystem support, constexpr lambdas, structured bindings, memory resource classes, std::aligned_alloc, std::any, and std::variant.

Memory Usage

The order that you declare member variables in classes can unnecessarily increase your memory usage! Consider the two classes Foo and Bar below. In the Foo implementation, we declare a 1-byte boolean, 8-byte integer, and 4-byte float. In the Bar implementation, we re-order the 4-byte float before the 8-byte integer.

#include <cstdint>

class Foo
    bool    a;
    int64_t b;
    float   c;

class Bar
    bool    a;
    float   c;
    int64_t b;

Using the sizeof operator, we can print out the number of bytes of memory each class requires when it is allocated:

#include <cstdio>

int main()
    printf("sizeof(Foo): %lu\n", sizeof(Foo));
    printf("sizeof(Bar): %lu\n", sizeof(Bar));

The result is:

$ g++ -O3 class-size.cpp -o class-size && ./class-size
sizeof(Foo): 24
sizeof(Bar): 16

As you can see, Bar requires 8 fewer bytes of memory. This can make a significant difference if you have a large array of Foo data! The reason behind the size difference is because the code was compiled on a 64-bit architecture, which means memory addresses are loaded in 8-byte widths. In the Foo class, the a and b variables require 9 bytes total and because the CPU loads addresses in 8-byte intervals and a and b must be sequential in memory, the compiler pads Foo with 7 bytes between a and b. In the Bar class, a and c require 5 bytes total and therefore both variables can fit inside a single 8-byte address.

Template Meta-Programming

Compile-time if statements in C++17 make template meta-programming far more accessible to the average user. Template meta-programming can improve performance by enabling logical decisions and computations at compile-time. Consider the following implementation of the fibonacci algorithm which can compute part or all of the fibonacci value at compile-time.

template <long N>
long fibonacci()
    if constexpr(N < 2) { return N; }
    else { return fibonacci<N - 1>() + fibonacci<N - 2>(); }

template <long N>
long fibonacci(long n)
    if(n == N) return fibonacci<N>();
    return (n < 2) ? n : (fibonacci<N>(n - 1) + fibonacci<N>(n - 2));

Running fibonacci<45>(45) requires ~0 seconds whereas a standard fibonacci(45) would require ~3 seconds.

C++ based Programming Models

  • CUDA (docs)
  • HIP
    • Target architecture: NVIDIA and AMD GPUs
    • Generalized API for targeting NVIDIA and AMD GPUs
    • HIP compiler generates CUDA when targeting NVIDIA GPUs, ROCm when targeting AMD GPUs
    • ROCm Libraries
    • HIPify
    • HIP Documentation
  • Kokkos
    • Target architecture: GPU, CPU, others (depending on backend)
    • Provides abstractions for both parallel execution of code and data management
    • Supports multiple programming model backends, e.g. Serial, CUDA, HIP, pthreads, OpenMP
  • HPX
    • Target architecture: GPU, CPU, and others.
    • Implements all of C++ Standard features and proposed and ongoing C++ standard proposals.
    • Extends the C++ Standard APIs to the distributed case.
    • Supports multiple programming model backends, e.g. Serial, CUDA, MPI, Kokkos.
  • SYCL / DPC++
    • Target architecture: GPU, CPU, FPGA
    • Cross-platform abstraction layer for targeting heterogeneous architectures
  • Parallel STL (pSTL) to achieve parallelism in many STL functions
  • Executors
    • ISO C++ support for heterogeneous programming and code execution

Native CUDA C/C++

CUDA is NVIDIA's native GPU programming language. CUDA is well established, mature and its design is the basis and backend for most other languages and frameworks, such as HIP and Kokkos. NVIDIA works directly with the scientific community to develop CUDA's performance and features and has a variety of teaching material and community outreach to help users achieve their computational goals. NVIDIA also develops a variety of performant mathematical, parallel algorithm, communication and other libraries to help users achieve optimal performance on NVIDIA hardware.

Given CUDA's ubiquitous nature, it may be the best language to study and follow in order to understand GPU programming. Its highly developed programming model is the standard for discussion throughout the community. Many other programming models are compatible with CUDA. This can allow highly optimized CUDA code to be implemented in critical code paths while using more portable frameworks elsewhere.

CUDA is a good option for smaller codes, codes that are primiarily looking to run on NVIDIA systems and codes that continuously develop and maximize their performance. It will often have the newest features and high performance on NVIDIA systems and writing in CUDA will get you the newest features, performance improvements and bug fixes quickly.

However, CUDA only runs on NVIDIA hardware, so writing in CUDA will require additional work to achieve portability to other vendor's GPUs. That process may be simple, such as using a tool to convert, or quite complex if specialized, performant algorithms have been created. CUDA also typically requires more tuning and understanding than pragma-based languages, such as OpenMP or portability layers, such as Kokkos, so developers looking for a quick port may wish to explore other options.




Codes at NERSC that use CUDA:


With C++ it is possible to encapsulate such code through template specialization in order to preserve portability.


NVIDIA compilers are recommended for obtaining best performance with OpenMP and OpenACC directive-based approaches.



NERSC recommends the NVIDIA compiler for OpenMP GPU-offload. You may compile a C, C++ and Fortran OpenMP code for Perlmutter using the flags -fast -mp=gpu -gpu=cc80.

Best practices

OpenMP applications can execute on GPUs with high performance when following a set of simple rules. Our rules below tell you what OpenMP directives to use and how to use these directives. We sometimes use the word construct instead of directive to refer to a subset of OpenMP directives known as executable directives.

Rule 1: Always use the target and teams directives

The target directive is used to offload a region of code to a target device, i.e. a GPU, and the teams directive is used to create teams parallelism. Teams parallelism is the coarse-level parallelism that is required to make use of multiple SMs on a single GPU. A code written without a teams directive, such as target parallel for, will only use 1 SM. It will therefore perform very poorly on the A100 GPUs in Perlmutter which have 108 SMs.

Rule 2: Strive to use combined directives

The single best piece of advice is to use combined OpenMP directives that parallelize and workshare loop iterations over OpenMP teams and threads. Below we show OpenMP-4.0 and OpenMP-5.0 combined directives on a Stream Triad kernel. We assume that the pointers a, b and c point to data already present on the GPU.

// Recommended OpenMP-4.0 combined constructs
// More portable to today's compilers
#pragma omp target teams distribute parallel for
for (int i = 0; i < N; i++)
  a[i] = b[i] + scalar * c[i];
// Recommended OpenMP-5.0 combined constructs
// Less portable to today's compilers
#pragma omp target teams loop
for (int i = 0; i < N; i++)
  a[i] = b[i] + scalar * c[i];

We recommend the OpenMP-4.0 directives for best portability across today's compilers. We recommend the OpenMP-5.0 loop directive for highest performance with the NVIDIA compiler; not all compilers implement the loop directive yet.

Rule 3: Collapse strictly nested loops

The OpenMP language provides an extremely powerful capability known as the collapse clause to collapse multiple loops into a single iteration space. This is often just what is needed to create a parallelized iteration space large enough to obtain the O(10K)-O(100K) parallelism required to fully utilize 1 GPU.

We have used this approach successfully in the SU3 mini-app. This is a proxy application for the MILC QCD application which performs a large number of 3x3 matrix-matrix multiplies. On A100 GPUs, we obtain 97% of tuned CUDA performance by using the NVIDIA HPC SDK 21.1 compilers with OpenMP GPU-offload. Our implementation uses the collapse clause to create a large iteration space and the loop construct to workshare the collapsed iteration space across teams and threads.

// The OpenMP version of SU3
#pragma omp target teams loop collapse(4)
for (int i=0; i<total_sites; ++i)
  for (int j=0; j<4; ++j)
    for (int k=0; k<3; ++k)
      for (int l=0; l<3; ++l) {
        Complx cc = {0.0, 0.0};
        for (int m=0; m<3; ++m)
          cc += a[i].link[j].e[k][m] * b[j].e[m][l];
        c[i].link[j].e[k][l] = cc;

It is instructive to compare against an equivalent CUDA implementation. This highlights two CUDA pain points:

  1. Loop iterations must be manually workshared across thread blocks and threads based on block ID and thread ID. In OpenMP, you can workshare loop iterations using loop, distribute or for/do directives.

  2. Multiple loops cannot be collapsed into a larger iteration space automatically. The user must create a new loop that spans the iteration space of multiple loops and then use integer arithmetic to manually calculate the indices of the original loops. In OpenMP, you can collapse loops simply by using the collapse clause. This allows you to preserve your original loop structure and enables you to quickly experiment with different parallelization approaches by simply changing the collapse clause value.

// The CUDA version of SU3
for (int id = blockIdx.x * blockDim.x + threadIdx.x;
     id < total_sites * 36;
     id += blockDim.x * gridDim.x)
  int i = id/36;
  int j = (id%36)/9;
  int k = (id%9)/3;
  int l = id%3;
  Complx cc = {0.0, 0.0};
  for (int m=0;m<3;m++)
    cc += a[i].link[j].e[k][m] * b[j].e[m][l];
  c[i].link[j].e[k][l] = cc;
Rule 4: Try to avoid code between target and parallel constructs

The rule 2 text showed the OpenMP combined constructs that obtain highest performance on GPUs. Portability and performance can suffer when straying from these combined constructs. The examples below show a problematic code pattern for OpenMP compilers. The issue with both code fragments is that there is user code, in this case a for loop, in between target and parallel constructs. The first fragment often has performance issues and the second fragment often has both portability and performance issues.

// Problematic code #1. Avoid if you can!
// This often has performance issues
#pragma omp target teams distribute
for (int i=0; i<N; i++) {
#pragma omp parallel for
  for (int j=0; j<N; j++) {
    // ... code
// Problematic code #2. Avoid!!!
// This often has portability and performance issues
//   Portability issues likely if `my_function` defined
//   in a different compilation unit
#pragma omp target teams distribute
for (int i=0; i<N; i++) {
   y[i] = my_function(x[i]);

double my_function(double a) {
// This is termed an "orphaned" parallel construct
#pragma omp parallel for
  for (int j=0; j<N; j++) {
    // ... code
  return b;

It is OK to use the pattern shown in the first code fragment if this code region is not performance critical. It is portable across compilers and the performance may be acceptable if the parallel construct is infrequently encountered. However, we never recommend the pattern shown in the second code fragment. It introduces additional challenges for compilers and does not yet work with the NVIDIA compiler when my_function is defined in a different compilation unit.

The code pattern is difficult for the compiler to optimize because the parallel directive creates parallelism. Dynamic parallelism is inefficient on the GPU. As such, OpenMP compilers implement various optimizations to avoid dynamic parallelism and still meet OpenMP program requirements. The quality of the compiler optimizations varies across compilers. For highest performance and performance-portability we recommend one of the two approaches below. The solution we advocate at NERSC is to use the loop construct. This works well with the NVIDIA compiler. The downside is that the loop construct has not been implemented by all compilers yet. The fragments below show the inner loop in the same function and the inner loop in a different function.

// Avoid inner `parallel` by using the `loop` construct
// It may be desirable to add `bind(parallel)` on the
// inner `loop` to force worksharing across threads only
#pragma omp target teams loop
for (int i=0; i<N; i++) {
#pragma omp loop
  for (int j=0; j<N; j++) {
    // ... code
// Avoid inner `parallel` by using the `loop` construct
// It is necessary to add the `bind` clause on the inner `loop`
// because it is in a function encountered in the target region
#pragma omp target teams loop
for (int i=0; i<N; i++) {
   y[i] = my_function(x[i]);

double my_function(double a) {
// This currently only works with the NVIDIA compiler if
// `my_function` is in the same compilation unit.
#pragma omp loop bind(parallel)
  for (int j=0; j<N; j++) {
    // ... code
  return b;

The alternative is more portable, based on the compiler capabilities of today, but is less desirable. It is effectively CUDA written using OpenMP directives and API calls. It can perform well across all compilers. However, it may be necessary to add the num_teams and thread_limit clauses to get the best performance when using this pattern because a compiler cannot select good values based on loop bounds.

// Avoid inner `parallel` by strictly nesting the `parallel` construct
// inside the `teams` region. The downside of this approach is that it
// requires manual worksharing of data, similar to CUDA.
#pragma omp target teams num_teams(10000) thread_limit(128)
#pragma omp parallel
  const int threadIdx = omp_get_thread_num();
  const int blockDim = omp_get_num_threads();
  const int blockIdx = omp_get_team_num();
  const int gridDim = omp_get_num_teams();

  // Manually distribute the i loop over teams
  // We cannot use the `distribute` construct because of OpenMP restrictions
  for (int i=blockIdx; i<N; i+=gridDim) {

    // Manually distribute the j loop over threads
    //   Alternatively, we could automatically workshare using
    //   the `for` construct if we wished.
    for (int j=threadIdx; j<N; j+=blockDim) {
      // ... code
Rule 5: Minimize data movement using the family of target data directives

A major performance limiter is mapping unnecessary data between CPU and GPU. This can happen if you only ever map data at the same time as you launch an OpenMP target region. OpenMP provides a family of target data directives to map data separately from target region execution. This can allow GPU data to be used and updated in multiple target region GPU kernels. For example, the code below shows how a target data region can be used to move the x pointer variable data to/from the GPU once rather than twice.

// Data is moved to/from the GPU twice: once per target region
#pragma omp target teams distribute parallel for map(tofrom: x[:N])
for (int i=0; i<N; ++i) // ... do work on x[i]

#pragma omp target teams distribute parallel for map(tofrom: x[:N])
for (int i=0; i<N; ++i) // ... do work on x[i]
// Data is moved to/from the GPU once
#pragma omp target data map(tofrom: x[:N])
#pragma omp target teams distribute parallel for
for (int i=0; i<N; ++i) // ... do work on x[i]

#pragma omp target teams distribute parallel for
for (int i=0; i<N; ++i) // ... do work on x[i]
Rule 6: Pay attention to any implicit data movement

The OpenMP specification defines the implicit data sharing and data mapping that happens in an OpenMP target region. For example, arrays are implicitly mapped tofrom and scalars variables are implicitly firstprivate with the device value initialized from the host value. This can result in more data movement than is necessary for the correct execution of your code. You can use compiler diagnostics via the NVIDIA HPC SDK -Minfo=mp compiler flag to find out what data is implictly and explicitly mapped in each target region. You can get this information and more by using runtime tracing via the NVIDIA HPC SDK NVCOMPILER_ACC_NOTIFY=3 environment variable. This will show you exactly how much data is moving to/from each target region, including any data movement from initializing scalar variables.

You can avoid data movement by explicitly using a map clause with a more restrictive map-type value, e.g. to, from, or alloc rather than tofrom. You can avoid firstprivate initiated data movement by specifying that variables should be uninitialized private variables in the target region. If you need initialized scalar variables on the GPU you can allocate and initialize them once using a declare target directive.

Rule 7: Profile your code

You can profile your OpenMP GPU-offload application with the family of NVIDIA profiling tools as well as HPCToolkit. If you want to line-level profiling information we recommend you compile with -gopt rather than -g when using the NVIDIA compiler.

Commonly Used Compiler Flags

This section documents some commonly used compiler flags of the NVidia HPCSDK compilers.

  • -fast Common optimizations, which includes -O2 -Munroll=c:1 -Mnoframe -Mlre
  • -Minfo Generate informational messages about the compilation process, such as which loops are optimized using unrolling, SIMD vectorization, parallelization, GPU offloading, interprocedural optimization, inline functions, and various miscellaneous optimizations.
  • -mp
    • -mp=gpu GPU and multicore code generation
    • -mp=multicore Multicore code generation (default)
  • -gpu
    • -gpu=ccXY Use cc70 for Volta GPU, cc80 for Ampere GPU, cc70,cc80 for both Volta and Ampere GPUs
    • -gpu=managed Enable CUDA managed memory
  • -acc
    • -acc=gpu GPU code generation (default)
    • -acc=host Host (sequential) code generation (default)
    • -acc=multicore Multicore code generation
  • -stdpar Enable parallelization and offloading of C++17 and Fortran do-concurrent parallel algorithms
  • -cuda Interoperability with CUDA
  • -cudalib=[...]
    • cublas, cufft, cusolve, cusparse, etc. Using CUDA math libraries
    • -nvlamath Enable porting of LAPACK calls to the GPU using cuSolver. Need to compile and link along with -gpu=managed
Runtime Environment Variables
  • OMP_TARGET_OFFLOAD for all compilers
    • DEFAULT Offload to the GPU, fall back to the host if device is not available
    • MANDATORY Offload to the GPU, abort if device is not available
    • DISABLED Run on the host
  • OMP_STACKSIZE NVidia HPCSDK compilers extend this environment variable to set stack size on the GPU in addition to the CPU
    • [acc:size|sizeB|sizeK|sizeM|sizeG[,]]size|sizeB|sizeK|sizeM|sizeG Examples are:
      • acc:4K 4 KB on GPU, default on CPU
      • acc:1024,64M 1024 byte on GPU, 16 MB on CPU
      • 1G default on GPU, 1 GB on CPU
    • 1 Show kernel launches information
    • 2 Show data transfers between CPU and Devices
    • 3 Show both above
  • LIBOMPTARGET_INFO for LLVM/Clang compilers
    • -1 Show all information about data-mappings and kernel execution
  • CRAY_ACC_DEBUG for HPE/Cray CCE compiler
    • 1 Show high level overview of kernels executed and data transferred
    • 2 Break down data transfer by each variable
    • 3 Show both above

OpenMP offload code is generally interoperable with other GPU programming models, such as OpenACC and CUDA. The interoperability support depends on the compiler. The NVIDIA HPC SDK compilers provides interoperability between OpenMP, OpenACC and CUDA and enables you to mix all three in the same source file. The remainder of the section describes the interoperability support provided by NVIDIA HPC SDK. Below is a simple example code "test.F90" for illustration:

% cat test.F90
!$  print *, "Compiled for OpenMP"

!! Notice !@acc is an NVHPC extension, not part of OpenACC standard
!@acc  print *, "Compiled for OpenACC"

!@cuf  print *, "Compiled for CUDA Fortran"


# compile and run with various scenarios of interoperability
# first get onto a GPU node

% cat 
for op1 in "" -mp 
      for op2 in "" -acc
     for op3 in "" -cuda 
               echo "nvfortran $op1 $op2 $op3 test.F90" 
               nvfortran $op1 $op2 $op3 test.F90 
           echo " "

% chmod u+x
% ./ 
nvfortran    test.F90

nvfortran   -cuda test.F90
 Compiled for CUDA Fortran

nvfortran  -acc  test.F90
 Compiled for OpenACC

nvfortran  -acc -cuda test.F90
 Compiled for OpenACC
 Compiled for CUDA Fortran

nvfortran -mp   test.F90
 Compiled for OpenMP

nvfortran -mp  -cuda test.F90
 Compiled for OpenMP
 Compiled for CUDA Fortran

nvfortran -mp -acc  test.F90
 Compiled for OpenMP
 Compiled for OpenACC

nvfortran -mp -acc -cuda test.F90
 Compiled for OpenMP
 Compiled for OpenACC
 Compiled for CUDA Fortran

The equivalent of the above "test.F90" is "test2.F90" below using CPP directives:

#ifdef _OPENMP
!$omp  print *, "Compiled for OpenMP"
#ifdef _OPENACC
!@acc  print *, "Compiled for OpenACC"
#ifdef _CUDA
!@cuf  print *, "Compiled for CUDA Fortran"

If OpenMP, OpenACC, and CUDA code coexist in the same program, their runtimes use the same CUDA context on each GPU. CUDA-allocated data is available for use inside OpenMP target regions with is_device_ptr clause or inside OpenACC regions with deviceptr() clause. OpenMP-allocated data is available for use inside CUDA kernels directly if the data was allocated with the omp_target_alloc() API call. If the OpenMP data was created with a target data map clause, it is available for use inside CUDA kernels with the target data use_device_addr() clause. Present table is shared between OpenACC and OpenMP. For example, data can be managed by acc data and used by all acc kernel or omp target regions.

OpenMP CPU-parallel object files compiled by the HPCSDK compilers are compatible with the CPU-parallel objects by compilers also using the KMPC OpenMP runtime interface, including Intel and Clang. The NVidia HPC SDK compilers support a GNU OpenMP interface layer as well which provides OpenMP CPU-parallel interoperability with the GNU compilers. There are no interoperability for GPU device offload operations among different compilers.

Study Materials


OpenACC is a programming framework that allows users to expose parallelism in their code via compiler directives. Compilers that support the framework will interpret the directives and schedule the parallel code accordingly. On Perlmutter, NVHPC/HPC-SDK (formerly PGI), GCC and CCE compilers will support OpenACC directives where, the degree of the support will vary based on the compiler. Here is the OpenACC specification OpenACC spec.

The benefit of OpenACC is its ease of use. But the downside of the approach is the inability to perform low level optimizations. Following are few of the highlights and recommendations for OpenACC:

  • OpenACC provides kernels construct for users to identify code segments that may contain parallelism. It allows users to rely on compilers to identify and safely parallelize the marked code sections. This is a good starting place for OpenACC.
  • Take advantage of OpenACC clauses such as gang, vector and worker to map hierarchical parallelism onto the GPUs.
  • collapse clause can be used to expose more parallelism in nested loop structures.
  • Optimize the data movement in applications. Due to high cost of data transfer between CPU and GPU memories, it is beneficial to perform computation on the GPU even when the code lacks parallelism if the trade-off is a reduction in memory transfers.
  • Reorder loops and data structures to take advantage of memory coalescing.
  • OpenACC allows interoperability with other programming languages via the use of clauses such as host_data and deviceptr.
  • OpenACC implementation of NVHPC compiler allows interoperability with CUDA. This implies that kernels that need low level optimizations via architecture specific programming can be intermixed with parts of code that are implemented in OpenACC.
  • Some of the advanced OpenACC features include asynchronus operations and multi-device programming.

The following link provides a detailed description of the best practices to be followed in OpenACC OpenACC best practices.

C++ language based




Kokkos is a C++ based programming model that allows mapping abstract hierarchies of parallelism onto diverse architectures. It provides a unified front-end for users to write portable parallel code via its constructs while the framework provides architecture specific backends to map the constructs onto the intended hardware. A few of the framework's highlights are :

  • The View abstraction for multidimensional arrays. It allows users to chose the optimal order to traverse the data structures based on the loop ordering and target architecture via its semantics.
  • The framework allows users to expose additional parallelism by taking advantage of its MDRange constructs.
  • Kokkos also allows the users to exploit multiple levels of parallelism and take advantage of hierarchical parallelism available in most of the modern high performance computers.

Kokkos framework is availiable at Github. Here is a link to the tutorial lecture series on use of the Kokkos framework kokkos lecture series.


HPX is a C++ Standard Library for Concurrency and Parallelism. It implements all of the corresponding facilities as defined by the C++ Standard. Additionally, HPX implements functionalities proposed as part of the ongoing C++ standardization process. HPX also extends the C++ Standard APIs to the distributed case.

The API exposed by HPX is not only modeled after the interfaces defined by the C++11/14/17/20/23 ISO standard, it also adheres to the programming guidelines used by the Boost collection of C++ libraries. HPX aims to improve the scalability of today's applications and to expose new levels of parallelism which are necessary to take advantage of the exascale systems of the future.

HPX is availiable at Github.




HIP is AMD's GPU portability layer. HIP is compiled with either a CUDA or ROCm backend to run on NVIDIA or AMD GPUs, respectively. ROCm is AMD's native programming language, although its a lower-end language than HIP and it is expected that users write their codes in HIP and rarely use ROCm directly.

HIP is designed to directly mimic CUDA. HIP uses an extremely similar programming model and it designed so that most of its calls are direct copies of cuda, in most cases simply replacing "cuda" with "hip". HIP uses a subset of CUDA calls (roughly based on a target version of CUDA) and additional AMD specific features, so porting between HIP and CUDA is typically a very simple process.

HIP is a good option for those code teams with knowledge of the CUDA programming model and who also want to achieve maximum performance on AMD systems. For teams targeting both Perlmutter and Frontier (NVIDIA and AMD architechtures), HIP may be the ideal programming model.

However, HIP may not work well for the same codes that should avoid CUDA: codes that want full portability, minimial tuning, minimal editing or don't have the time to maintain the code base. OpenMP, Kokkos, or another abstraction layer is likely the better bet for such coding efforts.


The Fortran standard avoids language and features which target particular architectures. This generic approach provides Fortran compilers the freedom to interpret features in different ways, including how to offload kernels to GPUs. Consequently, different compilers may implement certain features of the Fortran language differently in terms of how they are accelerated on GPUs.

On Perlmutter, perhaps the most effective resource for accelerating Fortran code on GPUs is the compiler included in the NVIDIA HPC SDK. The HPC SDK provides three separate mechanisms by which Fortran programs may be accelerated on A100 GPUs. These are described below.

GPU acceleration with standard Fortran

The HPC SDK enables GPU acceleration of several different Fortran features. One is DO CONCURRENT loops which can be offloaded to GPUs by adding the -stdpar compiler flag. The HPC SDK can also offload several other Fortran features to GPUs by mapping them to cuTENSOR functions; these features include RESHAPE, TRANSPOSE, SPREAD, certain element-wise expressions, MATMUL, and several element-wise functions.

Please provide feedback to NERSC

GPU acceleration of standard Fortran intrinsics as described above is a new feature in the NVIDIA HPC SDK compiler. Users are encouraged to experiment with this feature in the compiler and provide feedback to NERSC by filing a ticket at the online help desk.


The HPC SDK also implements a subset of OpenMP offload capabilities which can be used in Fortran programs too. Features of OpenMP which are not implemented in the HPC SDK are parsed but ignored in order to maintain portability across OpenMP-enabled Fortran compilers.

A feature in OpenMP implemented in the HPC SDK which may be of special interest to Fortran programmers is the loop construct, which is similar to the kernels construct in OpenACC. The loop construct is a descriptive directive which allows the compiler the freedom to analyze the contained loop for dependencies and apply optimizations for GPU offload as it sees fit; see the NVIDIA HPC SDK OpenMP Documentation Section for details regarding its implementation in the HPC SDK.


The HPC SDK continues to support OpenACC for GPU acceleration; see the NVIDIA HPC SDK OpenACC Documentation Section for details.


Most Julia applications do not need any changes to run on Perlmutter. Users should be aware of the following: 1) We will only support version 1.6.0 or higher on Perlmutter; 2) Perlmutter has 4 A100 GPUs per node, your code needs to manage these.

We recommend developing your GPU-enabled Julia code using CUDA.jl. Also, if you are interested in reading up on the other Julia GPU resources, please take a look at the Julia GPU website.

Using CUDA.jl

We provide CUDA.jl in our central installation -- even though there are no known adverse affects of installing it locally to your home directory. However, if you do encounter issues, please first run:

using CUDA

to see if there are any issued with your local CUDA installation.

Do get started, please look at the CUDA.jl overview. This will introduce you to the CuArray data type, as as running kernel functions. Some more examples are available in the CUDA.jl GitHub.

Using Multiple GPUS

One GPU per Thread is Not Recommended

Multi-threading is a fairly recent addition to the language and many packages, including those for Julia GPU programming, have not been made thread-safe yet.

An overview of using multiple GPUs is available at the CUDA.jl multi-GPU page. At the moment, the recommended strategy for using multiple GPUs is to assign one GPU per process, e.g.:

# assign devices
asyncmap((zip(workers(), devices()))) do (p, d)
    remotecall_wait(p) do
        @info "Worker $p uses $d"

In this case the CuArray objects can manage all host-to-device memory transfers. The CUDA IPC APIs are available as CUDA.cuIpcOpenMemHandle and friends, but not available through high-level wrappers.

Alternatively, one can use MPI.jl together with an CUDA-aware MPI implementation. In that case, CuArray objects can be passed as send and receive buffers to point-to-point and collective operations to avoid going through the CPU. This feature will be provided on Perlmutter the NERSC install of MPI.jl

Multiple GPUs per process are possible, but the user will need to manage these manullay using the device!, 'copyto!', and synchronize functions.

Profiling Julia GPU Code

You can use nsys profile julia /path/to/your/program.jl to profile your entire application. To profile your code from the REPL, or Jupyter, please use the CUDA.@profile macro.


For more information about preparing Python code for Perlmutter GPUs, please see this page.


Some modern MPI implementations support a feature known as "CUDA-awareness." This term remains somewhat loosely defined, but is generally understood to mean that pointers to data residing in GPU device memory can be used as arguments in MPI function calls invoked on the CPU.

CUDA-awareness in MPI has several potential benefits but also some potential drawbacks. One significant benefit is that it can be a convenience to the programmer. For example, when executing an MPI function call, a CUDA-aware MPI implementation may inspect the pointer to determine whether the data it points to resides in CPU or GPU memory, and if the latter, the MPI implementation may execute a cudaMemcpy() to copy the data from the GPU to the CPU, and then execute the MPI function on the CPU memory. Such an implementation would save the programmer the work of writing the cudaMemcpy() in their own code, but would not offer any performance benefit. An alternative implementation of this behavior might be for CUDA to allocate the memory using cudaMallocManaged(), in which case the CUDA runtime itself may copy the data back to CPU memory before executing the MPI function.

There are also potential performance benefits to CUDA-aware implementations of MPI, namely that the library can take advantage of hardware features like GPUDirect, such that if the user executes an MPI function with a pointer to data in device memory, the MPI library may copy the data directly from GPU device memory to the network device on the node, without copying the data to CPU memory. This behavior would provide both a convenience and a performance benefit.

However, there are also potential drawbacks to writing programs which target CUDA-aware MPI implementations. Chiefly, CUDA-aware MPI is not standardized, and so its behavior may vary among different MPI implementations which each claim to be "CUDA-aware." For example, one MPI implementation may support CUDA-awareness only in point-to-point communication routines, whereas another may support it only in collective routines. This may result in programs executing correctly on one system, but crashing on another.

Details regarding CUDA-awareness support in HPE Cray MPI on Perlmutter can be found on the MPI programming models documentation page. Additionally, some open-source MPI implementations already support this behavior, e.g., Open MPI and MVAPICH2-GDR. Users are encouraged to review those pages to understand how CUDA-awareness is implemented in those libraries.

Machine Learning Applications

We support distributed PyTorch and TensorFlow on Perlmutter. Currently, we recommend to use the relevant Shifter images which are based on optimized NVIDIA NGC ones, details can be found in our Machine Learning documentation.

Performance and I/O optimization

Deep learning pipelines often need performance tuning to make best use of powerful accelerators. Special considerations should be given to I/O bottlenecks. Our NERSC-NVIDIA Deep Learning at Scale Tutorial (Supercomputing 2020) provides guidance on how to analyze and address some of these performance challenges.

Case studies

  1. Optimizing Data Processing Pipelines 3D CNN (Cosmo-3D)


Alternatives to Intel Math Kernel Library (MKL) for CPU math functions

Several generations of NERSC systems have provided the Intel Math Kernel Library (MKL), which provides a large, diverse collection of functions and routines for commonly used numerical calculations, which are summarized in the Intel Developer Guide for the oneAPI MKL.

MKL is optimized for Intel architectures. While it is possible that Intel MKL may work on the AMD CPUs on Perlmutter, it is not guaranteed, nor is it certain that if it does work that will achieve similar levels of performance that its users have enjoyed on Intel architectures. Consequently, NERSC users are encouraged to explore other software offerings which may fulfill many of the same roles as MKL.

Currently, the AMD AOCL library provides implementations of several math functions and routines provided in MKL, including BLAS, LAPACK, ScaLAPACK, sparse linear algebra, FFTW, optimized math functions and random number generators.

AOCL currently unavailable on Perlmutter

AOCL is not currently available on Perlmutter.

GPU implementations of math libraries are often much faster

While CPU-based implementations of math functions and routines included in MKL are available and provide below, users should first consider GPU implementations, which are often much faster.

Math libraries for Nvidia GPUs

One way to port existing applications to GPUs without rewriting the entire application in a GPU-aware programming model is to use GPU-accelerated libraries to execute performance-critical calculations. Below is an incomplete list of GPU-accelerated math libraries which are expected to run effectively on Perlmutter.

Linear algebra

Dense linear algebra
  • BLAS: Nvidia cuBLAS
  • LAPACK: Nvidia cuSOLVER (not all LAPACK subroutines have been implemented)
  • BLAS/LAPACK: MAGMA (not all LAPACK subroutines have been implemented)
  • ScaLAPACK replacement: SLATE
Sparse linear algebra
Matrix operations/solvers
Algebraic multigrid

Solver libraries


  • Single GPU:

    • Nvidia cuFFT/cuFFTW (cuFFTW library provides the FFTW3 API to facilitate porting of existing FFTW applications), supports multiple GPUs on a single compute node
  • Distributed FFT (multi-node):

Ray tracing

I/O Considerations

All file systems can be accessed from Perlmutter. Details about the Perlmutter Scratch File System can be found in the Perlmutter scratch file system page.

AMD CPU Considerations

Profiling tools

Perlmutter includes several tools available for profiling CPU and GPU applications.

NVIDIA Nsight tools for performance optimization

The NVIDIA Nsight profiling suite is made available to users as a part of the cudatoolkit module. Loading cudatoolkit will give you access to both CLI and GUI for Nsight Systems (nsys / nsys-ui) and Nsight Compute (ncu / ncu-ui) tools, and their versions will be consistent with the release of nvhpc sdk.

nvprof is disabled on A100 GPUs

nvprof was the original profiling tool provided in the CUDA Toolkit. Starting in CUDA 10, it has been replaced by Nsight Systems and Nsight Compute, and nvprof is disabled on new NVIDIA GPUs, starting with A100. NVIDIA provides a migration guide for nvprof users transitioning to Nsight Systems and Nsight Compute.

Nsight Systems

Nsight Systems is a low-overhead profiling tool which provides a broad characterization of an application's performance on GPUs. It is often the best tool to start with when profiling a new application, because it provides a time-series of data regarding activity that a process spends in various broadly-defined categories, like CPU activity, GPU activity, and data transfers between CPU and GPU memory.

One can profile a GPU-accelerated application using Nsight Systems from the Perlmutter Scratch File System as follows:

srun <Slurm flags> nsys profile <program> <program arguments>

Please note that Nsight System does not create a .nsys-rep output file when running the above command from $HOME or $CFS. You need to profile from the Perlmutter Scratch File System to generate a .nsys-rep file which you can load into the user interface.

A few commonly used optional flags to nsys include:

  • -t nvtx,cuda: instructs Nsight Systems to trace only NVTX and CUDA activity, and to ignore other activity, reducing the overhead of the profiling collection and reducing the size of the resulting profile database (if the application has no NVTX markers, one can use simply -t cuda);
  • --stats=true: instructs Nsight to print a summary of the application's performance to STDOUT, along with saving the result to a profiling database on disk;
  • -s none: instructs Nsight Systems to ignore all CPU activity, reducing the overhead of the profiling collection and reducing the size of the resulting profile database.
  • -b dwarf: instructs Nsight Systems to use DWARF's CFI (Call Frame Information) while sampling. The option 'lbr' instead of 'dwarf' is available only with Intel© CPUs codenamed Haswell and later. Note that setting the value to 'none' can reduce collection overhead.

NVIDIA documentation about Nsight Systems contains further details about these flags and others.

Nsight Compute

The Nsight Compute tool provides highly detailed information regarding GPU kernel performance. It is often useful after an initial analysis has been done with Nsight Systems, and key GPU kernels have been identified as being critical to achieving high overall application performance. NERSC recommends for users to start profiling their GPU applications with Nsight Systems before using Nsight Compute.

A typical Nsight Compute performance collection invocation has a similar form to Nsight Systems; one simply prefixes the application name with ncu, along with the desired Nsight Compute flags:

srun <Slurm flags> ncu -o <filename> <other Nsight Compute flags> <program> <program arguments>

Nsight Compute adds large overhead during application profiling

Unlike Nsight Systems, which generally adds relatively low overhead to an application's runtime, Nsight Compute can increase an application's runtime by orders of magnitude, due to the large amount of performance data it collects from GPU kernels. It is therefore strongly recommended to limit the scope of an Nsight Compute performance collection using the tips described in this document.

A few flags to Nsight Compute can reduce the overhead added to an application's runtime, which are summarized below. All of these flags are documented in more detail in the Nsight Compute documentation.

  • -k <expr>: this instructs Nsight Compute to profile kernels whose names are matched by the regular expression <expr>; all non-matching kernels are ignored;
  • -s <num1> -c <num2>: these flags instruct Nsight Compute to skip the first <num1> GPU kernel launches, and to only profile <num2> kernels after that. These flags are useful when an application launches the same GPU kernels many times, e.g., in an iterative solver or in a time-stepping routine.

Nsight Compute also provides an automated GPU roofline tool. Nsight Compute generates a GPU roofline plot in the GUI automatically, but only if all performance metrics are collected on the kernel; this can be accomplished by adding --set full to the list of Nsight Compute arguments during application profiling. Users should note that this flag increases profiling overheard significantly, so it is recommended to limit the number of kernels profiled by using the flags summarized above.

If your application has very long kernel names (i.e. if it uses something like Kokkos) and you want to profile an individual kernel you see in Nsight systems then right click on the kernel you want to profile and select “Analyse the Selected Kernel with Nvidia Nsight Compute”. In the pop up select “Display the command line to use Nvidia Nsight Compute CLI” and it will give you the appropriate command to use to target the selected Kernel using the --launch-skip parameter.


HPCToolkit is comprehensive profiling tool supporting CPU, GPU, and MPI activity.

Running Applications

Controlling task and GPU binding

By default, in a Slurm job allocation, all GPUs allocated to the job on a particular node are accessible by all tasks in the job which are also on that same node. In some cases, this is precisely the desired behavior for a job, while in others, it is preferable for each task on a node to be bound only to a subset of the GPUs on that node.

There are several ways to control which GPUs are visible to which tasks in a job. One way is via the CUDA Runtime itself; another is to use Slurm.

Using the CUDA Runtime

The CUDA Runtime provides different ways for the programmer to control which GPUs are accessible to which tasks. One way is to use the CUDA_VISIBLE_DEVICES environment variable, which is a variable assigned per-task. No source code changes are required in order to control task and GPU affinity using this approach. More information about using the CUDA_VISIBLE_DEVICES environment variable in a Slurm job is provided at the Slurm Generic Resource Scheduling page. Additional documentation regarding how to use CUDA_VISIBLE_DEVICES is provided in the intro_mpi man page.

A second way is to use the CUDA Runtime API itself, as described in the NVIDIA CUDA programming guied. This approach requires making changes to one's source code in order to implement.

Using Slurm

An alternative approach to controlling task and GPU affinity is to use Slurm itself. The sbatch, salloc, and srun commands each support several flags specific to GPU allocation, including --gpus-per-task, --gpus-per-node, and several others, which are described on the Slurm Generic Resource Scheduling page and also in the man pages for the various Slurm commands.

Additionally, Slurm provides a flag --gpu-bind which controls which GPUs are accessible by which tasks on a given node. Documentation regarding this flag is provided in the Slurm srun documentation or via the man srun command on Perlmutter. For example, if one wishes to assign to run 4 MPI tasks on a Perlmutter compute node, with each task assigned to a single, unique GPU on the node, this could be accomplished using the following #SBATCH header in a job script:

#SBATCH --gpu-bind=map_gpu:0,1,2,3

Like the environment variable approach described above, this requires no source code changes, but does rely on Slurm implementation details and so may not be portable to other systems with different workload managers.

General recommendations

Write a validation test

Performance doesn't matter if you get the wrong answer!

  • Define benchmarks for performance. These should represent the science cases you want to run on Perlmutter.
  • Use optimized libraries when possible (FFT, BLAS, etc).
  • Start with 1 MPI rank per GPU.
  • Start with UVM and add explicit data movement control as needed.
  • Minimize data movement (Host to device and device to host transfers).
  • Avoid device allocations (Use a pool allocator)


The ability for applications to achieve both portability and high performance across computer architectures remains an open challenge.

However there are some general trends in current and emerging HPC hardware: increased thread parallelism; wider vector units; and deep, complex, memory hierarchies.

In some cases a performance portable algorithm can realized by considering generic "wide vectors" which could map to either GPU SIMT threads or CPU SIMD lanes.

Case Studies and Examples

Many examples of kernels and applications being ported to NVIDIA GPUs are available in the literature, conference proceedings, and online at GitHub, GitLab, etc. Below is a small selection of examples and case studies which may be useful to NERSC users as they port their applications to GPUs.

  1. AMReX
  2. DESI
  3. Meta-HipMER
  4. Tomopy
  5. SNAP
  6. CPP to Python
  7. Lulesh with C++ Programming Models Case Study

NERSC proxy app suite

NERSC maintains a collection of proxy applications, which are typically small codes with only a handful of computationally intensive algorithms included. These proxy apps are often written in multiple languages and programming models, in order to evaluate performance of those languages and models, as well as to illustrate how one might port an existing code (e.g., one which does not offload anything to a GPU) to a language or model which enables efficient GPU acceleration.

Programming model examples running at NERSC

References and Events