Running Python on AMD Hardware

One important difference about Perlmutter is that its CPUs will be AMD rather than Intel. On this page we'll cover the relevant details for Python users.

Should you still use MKL?

Many computationally expensive functions (like those in numpy.linalg) use optimized libraries like Intel's Math Kernel Library (MKL) or OpenBLAS under the hood. In the past, our advice to NERSC users was generally to use MKL since it was well-adapted to our Intel hardware. On Perlmutter, however, the CPUs are AMD, so does this recommendation still hold? The answer is yes, but with some important caveats. Spoiler: the best way to know for sure is to benchmark your code.

Intel MKL checks the CPU manufacturer at runtime and chooses a code path accordingly. This is discussed in detail here and here. Without any intervention, MKL may use a less-optimized code path on AMD hardware. How much does this matter on Perlmutter?
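Before benchmarking, it is worth confirming which BLAS/LAPACK library your NumPy build is actually linked against. A quick way to check is NumPy's own build-configuration report (the exact fields printed vary between NumPy versions and installs):

```python
import numpy as np

# Print NumPy's build configuration, including which BLAS/LAPACK
# implementation (MKL, OpenBLAS, etc.) it was linked against.
# Output format differs between NumPy versions.
np.show_config()
```

Look for "mkl" or "openblas" in the library names reported; a conda-installed NumPy from the defaults channel typically links MKL, while pip wheels typically bundle OpenBLAS.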

Benchmarking study

We performed a small benchmarking study on AMD Rome hardware (AMD EPYC 7702 64-Core Processor) to try to estimate Python performance on Perlmutter. You will find information about our benchmarking study here, including all the materials you need should you wish to reproduce it or run it elsewhere.

We tested the following:

  1. 4 NumPy functions (numpy.linalg.eigh, numpy.linalg.cholesky, numpy.linalg.svd, and @) at
  2. 3 square matrix sizes for
  3. 3 library configurations (mkl, mkl-workaround, OpenBLAS) for
  4. 2 thread/process binding configurations (high arithmetic intensity, highAI, and high memory bandwidth, highMB)
  5. Using 1 AMD EPYC 7702 (Rome) with 64 physical cores and 2 hyperthreads per core
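A minimal version of such a benchmark takes only a few lines of NumPy. The functions below are from our study, but the matrix sizes and repeat count are illustrative placeholders, not the exact parameters we used:

```python
import time
import numpy as np

rng = np.random.default_rng(42)

def bench(func, n, repeats=3):
    """Return the minimum wall time of func on an n x n SPD matrix."""
    a = rng.standard_normal((n, n))
    spd = a @ a.T + n * np.eye(n)  # symmetric positive definite input
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(spd)
        times.append(time.perf_counter() - t0)
    return min(times)

for n in (100, 500, 1000):  # illustrative matrix sizes
    for name, func in (("eigh", np.linalg.eigh),
                       ("cholesky", np.linalg.cholesky),
                       ("svd", np.linalg.svd)):
        print(f"{name:8s} n={n:5d}  {bench(func, n):.4f} s")
```

Reporting the minimum over several repeats (rather than the mean) reduces noise from OS jitter and one-time library initialization.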

We determined the high arithmetic intensity (highAI) configuration using the standard DGEMM matrix-multiplication benchmark; you can find more information about this benchmark here. We adjusted the thread and process binding on the AMD Rome CPU to maximize DGEMM GFLOP/s using the following configuration:

OMP_NUM_THREADS=128 OMP_PLACES=threads OMP_PROC_BIND=close srun -u -n 1 -c 128 --cpu_bind=sockets ./dgemm.exe
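If you do not have the compiled DGEMM benchmark at hand, a rough Python analogue can estimate GFLOP/s through NumPy's matmul, which calls the same underlying BLAS routine. The matrix size here is an illustrative choice:

```python
import time
import numpy as np

n = 2000  # illustrative matrix size
a = np.random.default_rng(0).standard_normal((n, n))
b = np.random.default_rng(1).standard_normal((n, n))

a @ b  # warm-up so thread-pool and library initialization is not timed

t0 = time.perf_counter()
a @ b
elapsed = time.perf_counter() - t0

# A dense n x n matrix multiply costs ~2*n^3 floating-point operations.
gflops = 2 * n**3 / elapsed / 1e9
print(f"~{gflops:.1f} GFLOP/s")
```

The result is sensitive to OMP_NUM_THREADS and the binding settings shown above, which is exactly the effect the highAI configuration is meant to capture.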

We determined the high memory bandwidth (highMB) configuration using the standard Stream-Triad benchmark; you can find more information about this benchmark here. We adjusted the thread and process binding on the AMD Rome CPU to maximize Stream-Triad bandwidth in GB/s using the following configuration:

OMP_NUM_THREADS=64 OMP_PLACES=cores OMP_PROC_BIND=spread srun -u -n 1 -c 128 --cpu_bind=sockets ./stream-triad.exe
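A NumPy sketch of the Stream-Triad kernel (a = b + scalar*c) gives a rough memory-bandwidth estimate. The array length and repeat count are illustrative, and the Python version carries some temporary-allocation overhead the compiled benchmark does not:

```python
import time
import numpy as np

n = 20_000_000  # illustrative array length (~160 MB per double array)
b = np.ones(n)
c = np.full(n, 2.0)
scalar = 3.0

best = float("inf")
for _ in range(5):
    t0 = time.perf_counter()
    a = b + scalar * c  # the triad kernel
    best = min(best, time.perf_counter() - t0)

# Triad touches 3 arrays of 8-byte doubles: read b and c, write a.
# (The NumPy expression also allocates a temporary, so real traffic
# is somewhat higher than this idealized count.)
gbs = 3 * n * 8 / best / 1e9
print(f"~{gbs:.1f} GB/s")
```

As with DGEMM, the measured bandwidth depends strongly on the thread binding environment variables shown above.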

This is not a perfect test, nor is it exhaustive. Our goal is to understand general NumPy performance on AMD Rome so that our users can make informed decisions about which libraries to use on Perlmutter.

Benchmarking study results

  1. For the functions we studied here, the MKL workaround can improve performance, but the effect depends on the matrix size and function type. Notably, in some cases it also increased runtime, which underlines the importance of doing some basic benchmarking. We do suggest that users who need performance try the MKL workaround, since it is straightforward to use.
  2. OpenBLAS beat MKL in np.linalg.cholesky in the highAI configuration and in @ in both configurations. MKL may not always be faster; again, it is important to benchmark your application.
     1. np.linalg.eigh, np.linalg.cholesky, and np.linalg.svd performed better in the highMB configuration, and @ generally performed better in the highAI configuration. Python users should be aware that their choice of process and thread binding can also have a large impact on performance.
  3. Other libraries like BLIS exist. BLIS was generally slow in our testing, so we did not include it here. BLIS is under active development, so we'll keep an eye on this effort and update our recommendations as the landscape evolves.
  4. This benchmark was performed on our /global/common/software filesystem. Python users are advised to remember that other factors influence performance, especially choice of filesystems and containers. Please see this page for tips on improving general Python performance at NERSC.

We summarize the results of the benchmarking study grouped according to highMB and highAI configurations. For each configuration, we show the runtime normalized to mkl on top and the minimum runtime in seconds on the bottom.
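The normalization used in those summaries is simply each configuration's minimum runtime divided by the corresponding MKL runtime for the same function and matrix size. As a small illustrative computation with made-up numbers (not measured results):

```python
# Hypothetical minimum runtimes in seconds for one function/size.
runtimes = {"mkl": 0.120, "mkl-workaround": 0.095, "openblas": 0.110}

# Normalize to mkl: values below 1.0 are faster than stock MKL.
normalized = {lib: t / runtimes["mkl"] for lib, t in runtimes.items()}
for lib, ratio in normalized.items():
    print(f"{lib:15s} {ratio:.2f}x")
```

By construction the mkl entry is always 1.00x, so the other bars read directly as speedups or slowdowns relative to stock MKL.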