Perlmutter Readiness

This page contains recommendations for application developers to "hit the ground running" with upcoming system architectures such as Perlmutter.

Testing both performance on relevant hardware and compatibility with the software environment is important.

Perlmutter GPU information

Accelerator porting recommendations

Write a validation test

Performance doesn't matter if you get the wrong answer!

  • Define benchmarks for performance. These should represent the science cases you want to run on Perlmutter.
  • Use optimized libraries when possible (FFT, BLAS, etc.).
  • Start with 1 MPI rank per GPU.
  • Start with UVM and add explicit data movement control as needed.
  • Minimize data movement (host-to-device and device-to-host transfers).
  • Avoid frequent device allocations (use a pool allocator instead).


Fortran

  • Ensure the code compiles with PGI compilers.
  • Port to OpenACC
  • (optional for portability) Port OpenACC to OpenMP offload as compiler support for OpenMP 5.0+ matures.


With `use cutensorEx`, PGI compilers map Fortran intrinsics such as MATMUL, RESHAPE, TRANSPOSE, and SPREAD to cuTENSOR library calls.



C++

The C++ standard is continually evolving.


Several libraries, frameworks, and language extensions provide a path toward fully standards-based accelerated code:

Experimental (for CUDA devices)


While the standards-based approach offers high portability, maximum performance can be obtained by specializing for specific hardware.


With C++ it is possible to encapsulate such code through template specialization in order to preserve portability.


It is possible to use directives (OpenMP, OpenACC) with C++ code, but this is not recommended. Directives are more suited to "C with classes" style code than "modern C++".

As with Fortran, it is recommended to start with OpenACC and the PGI compilers and then translate to OpenMP if desired.

Proxy hardware platforms

Perlmutter will feature NVIDIA GPUs and AMD CPUs.

Cloud providers

HPC systems

Summit and Sierra feature NVIDIA GPUs and IBM CPUs. Piz Daint features NVIDIA GPUs and Intel CPUs, but only has 1 GPU per node.


Current-generation AMD CPUs are a good place to start.

Software environment

Compilers and programming models play key roles in the software environment.

Programming Models

The choice of programming model depends on multiple factors, including the number of performance-critical kernels, the source language, the existing programming model, and the portability of the algorithm. A 20K-line C++ code with 5 main kernels will have different priorities and choices than a 10M-line Fortran code.





Python

There are many options for using Python on GPUs, each with their own set of pros and cons. We have tried to provide a brief overview of several frameworks here. The Python GPU landscape is changing quickly, so please check back periodically for more information.


CuPy

CuPy is a drop-in replacement for NumPy. Where NumPy's internals are often written in C and Fortran, CuPy's internals are written in CUDA.

Pros: Very easy to use, minimal code changes required.

Cons: Will only work on NVIDIA GPUs, performance may not be fully optimized.
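
The drop-in pattern might look like the following sketch, which falls back to NumPy when CuPy (or a GPU) is unavailable; the `xp` alias is a common convention, not part of either library:

```python
# CuPy mirrors the NumPy API, so the backend can be chosen with one alias.
# The GPU path assumes CuPy and a CUDA device are present; otherwise the
# identical code runs on the CPU through NumPy.
try:
    import cupy as xp    # GPU arrays
except ImportError:
    import numpy as xp   # same interface on the CPU

a = xp.arange(6, dtype=xp.float64).reshape(2, 3)
b = xp.ones((3, 2))
c = a @ b                # runs on the GPU when xp is CuPy
print(float(c.sum()))    # 30.0
```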

Numba CUDA

Numba is a JIT compiler for Python. It can translate Python into optimized CPU or GPU code. It is best suited for porting specific kernels to the GPU. Numba supports both NVIDIA and AMD GPUs, although we'll only discuss Numba CUDA here.

Pros: Write a mix of Python/CUDA that can easily run on a GPU.

Cons: Numba CUDA kernels look almost like CUDA C, and only a limited subset of Python is permitted inside kernels.
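
A minimal Numba CUDA kernel might look like the sketch below; the guard around the import keeps the example runnable on machines without Numba or a GPU (the `scale` helper and kernel names are hypothetical):

```python
import numpy as np

try:
    from numba import cuda           # Numba's CUDA dialect (assumes numba is installed)
    HAVE_GPU = cuda.is_available()
except ImportError:
    HAVE_GPU = False

if HAVE_GPU:
    @cuda.jit
    def _scale_kernel(x, factor, out):
        i = cuda.grid(1)             # global thread index, as in CUDA C
        if i < x.size:
            out[i] = factor * x[i]

    def scale(x, factor):
        out = np.empty_like(x)
        threads = 128
        blocks = (x.size + threads - 1) // threads
        _scale_kernel[blocks, threads](x, factor, out)  # Numba handles the transfers
        return out
else:
    def scale(x, factor):            # CPU fallback so the sketch runs anywhere
        return factor * x

print(scale(np.arange(4.0), 2.0))    # [0. 2. 4. 6.]
```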


PyOpenCL

PyOpenCL provides a framework for directly accessing OpenCL via Python. PyOpenCL is best suited for porting specific kernels to the GPU.

Pros: OpenCL can run on many architectures (CPU and GPU) and is extremely portable.

Cons: Harder to install, limited profiling tool support. You have to actually know and understand OpenCL or its close relatives, C or C++.
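
A small PyOpenCL sketch, with the kernel written as an OpenCL C source string; the NumPy fallback keeps it runnable where no OpenCL platform is installed (kernel and variable names are illustrative):

```python
import numpy as np

# OpenCL kernels are C source strings compiled at run time.
KERNEL_SRC = """
__kernel void double_it(__global const float *in, __global float *out) {
    int gid = get_global_id(0);   /* one work-item per element */
    out[gid] = 2.0f * in[gid];
}
"""

a = np.arange(4, dtype=np.float32)
try:
    import pyopencl as cl
    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)
    prog = cl.Program(ctx, KERNEL_SRC).build()
    prog.double_it(queue, a.shape, None, a_buf, out_buf)
    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, out_buf)
except Exception:                 # no PyOpenCL / no OpenCL platform: CPU fallback
    result = 2.0 * a
print(result)                     # [0. 2. 4. 6.]
```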


PyCUDA

PyCUDA, like PyOpenCL, provides a framework for directly accessing CUDA via Python. PyCUDA is best suited for porting specific kernels to the GPU.

Pros: Gives you full access to CUDA. Much more powerful than Numba CUDA.

Cons: You have to actually know and understand CUDA or its close relatives, C or C++.
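
A PyCUDA sketch that embeds CUDA C source directly, with a NumPy fallback so the example also runs without a GPU (the kernel name is illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
try:
    import pycuda.autoinit               # creates a CUDA context on import
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void double_it(float *x) {
        int i = threadIdx.x;             /* one thread per element */
        x[i] *= 2.0f;
    }
    """)
    double_it = mod.get_function("double_it")
    double_it(drv.InOut(a), block=(4, 1, 1), grid=(1, 1))
    result = a
except Exception:                        # no CUDA toolkit/GPU: CPU fallback
    result = 2.0 * a
print(result)                            # [2. 4. 6. 8.]
```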


JAX

JAX, designed with machine learning in mind, is several things: it is a drop-in replacement for NumPy and also includes a Python JIT compiler. It uses the XLA compiler, which generates portable code.

Pros: Portable replacement for NumPy. JIT compiler transforms Python code into portable (CPU/GPU) code that can run on many architectures.

Cons: JIT compiler is not as powerful as Numba. Your code might need major changes, including non-intuitive loop and indexing syntax. Interface is not as friendly as CuPy.
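
A small sketch of the JAX workflow; when JAX is not installed the example falls back to plain NumPy with a no-op `jit`, which is an assumption of this sketch rather than part of either library:

```python
# jax.jit traces a Python function and compiles it with XLA, which targets
# CPUs, GPUs, and TPUs from the same source.
try:
    import jax.numpy as jnp
    from jax import jit
except ImportError:
    import numpy as jnp          # same array API for this example

    def jit(f):                  # no-op stand-in decorator when JAX is absent
        return f

@jit
def normalize(x):
    return (x - jnp.mean(x)) / jnp.std(x)

v = normalize(jnp.arange(5.0))
print(abs(float(v.sum())) < 1e-6)   # True: normalized values sum to ~0
```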


RAPIDS

NVIDIA RAPIDS is a set of tools that allows familiar Python libraries like scikit-learn, pandas, and Dask to easily run on GPUs.

Pros: Very easy to use, minimal code changes required.

Cons: Will only work on NVIDIA GPUs.
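
Because RAPIDS' cuDF mirrors the pandas API, the same DataFrame code can target either backend. A hedged sketch (assuming RAPIDS/cuDF for the GPU path, with pandas as the CPU fallback):

```python
# Drop-in GPU DataFrames: cuDF follows the pandas API, so identical code
# runs on either backend. The GPU path assumes cuDF is installed.
try:
    import cudf as xdf    # GPU DataFrames
except ImportError:
    import pandas as xdf  # same interface on the CPU

df = xdf.DataFrame({"species": ["a", "b", "a", "b"],
                    "count": [1, 2, 3, 4]})
totals = df.groupby("species")["count"].sum()
print(totals)
```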


Performance portability

The ability for applications to achieve both portability and high performance across computer architectures remains an open challenge.

However, there are some general trends in current and emerging HPC hardware: increased thread parallelism; wider vector units; and deep, complex memory hierarchies.

In some cases a performance-portable algorithm can be realized by considering generic "wide vectors" which could map to either GPU SIMT threads or CPU SIMD lanes.

References and Events