This page contains recommendations for application developers to "hit the ground running" with upcoming system architectures such as Perlmutter.
Testing performance on relevant hardware and verifying compatibility with the software environment are both important.
## Perlmutter GPU information
## Accelerator porting recommendations
- Write a validation test. Performance doesn't matter if you get the wrong answer!
- Define benchmarks for performance. These should represent the science cases you want to run on Perlmutter.
- Use optimized libraries when possible (FFT, BLAS, etc).
- Start with 1 MPI rank per GPU.
- Start with UVM and add explicit data movement control as needed.
- Minimize data movement (Host to device and device to host transfers).
- Avoid device allocations (use a pool allocator).
- Ensure the code compiles with PGI compilers.
- Port to OpenACC.
- (Optional, for portability) Port OpenACC to OpenMP offload as compiler support for OpenMP 5.0+ matures.
The C++ standard is continually evolving.
Several libraries, frameworks, and language extensions provide a path toward fully standards-based accelerated code:
Experimental (for CUDA devices)
While the standards-based approach offers high portability, maximum performance can be obtained by specializing for specific hardware.
With C++, it is possible to encapsulate such code through template specialization in order to preserve portability.
It is possible to use directives (OpenMP, OpenACC) with C++ code, but this is not recommended: directives are better suited to "C with classes" style code than to modern C++.
As with Fortran, it is recommended to start with OpenACC and PGI compilers and then translate to OpenMP if desired.
## Proxy hardware platforms
Perlmutter will feature NVIDIA GPUs and AMD CPUs.
Summit and Sierra feature NVIDIA GPUs and IBM CPUs. Piz Daint features NVIDIA GPUs and Intel CPUs, but only has 1 GPU per node.
Current-generation AMD CPUs are a good place to start.
Compilers and programming models play key roles in the software environment.
PGI compilers are available on Cori via `module load pgi`.
- PGI Compilers on AWS
The choice of programming model depends on multiple factors including: number of performance critical kernels, source language, existing programming model, and portability of algorithm. A 20K line C++ code with 5 main kernels will have different priorities and choices vs a 10M line Fortran code.
For more information about preparing Python code for Perlmutter GPUs, please see this page.
The ability for applications to achieve both portability and high performance across computer architectures remains an open challenge.
However there are some general trends in current and emerging HPC hardware: increased thread parallelism; wider vector units; and deep, complex, memory hierarchies.
In some cases a performance-portable algorithm can be realized by considering generic "wide vectors", which could map either to GPU SIMT threads or to CPU SIMD lanes.