# Running Python on AMD Hardware¶

One important difference about Perlmutter is that its CPUs are AMD rather than Intel. On this page we'll cover the relevant details for Python users.

tl;dr: Try our fast-mkl-amd module:

module load fast-mkl-amd


## Should you still use MKL?¶

Many computationally expensive functions (like those in numpy.linalg) use optimized libraries like Intel's Math Kernel Library (MKL) or OpenBLAS under the hood. In the past, our advice to NERSC users was generally to use MKL, since it was well-adapted to our Intel hardware. On Perlmutter, however, the CPUs are AMD, so does this recommendation still hold? The answer is yes, but with some important caveats. Spoiler: the best way to know for sure is to benchmark your code.
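If you are unsure which of these libraries your NumPy installation uses, a quick check is NumPy's own build-configuration report. This is a minimal sketch; the exact output format varies between NumPy versions, so look for library names such as "mkl" or "openblas" in what it prints:

```python
# Sketch: inspect which BLAS/LAPACK implementation NumPy was built against.
# numpy.show_config() is part of NumPy's public API; its output format
# differs between NumPy versions, but the linked library names appear in it.
import numpy as np

np.show_config()  # look for "mkl" or "openblas" among the listed libraries
```

For runtime inspection (including the number of threads each library is actually using), the third-party threadpoolctl package is another option, assuming it is installed in your environment.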

Intel MKL will check the CPU manufacturer and choose a code path accordingly. This is discussed in detail in a blog post by Daniel de Kok and in another blog post by Donald Kinghorn. Without any intervention, MKL may use a less-optimized code path on AMD hardware.

To use Daniel de Kok's suggested workaround on Perlmutter, run `module load fast-mkl-amd`. For more information about why this matters, please see the results of our benchmarking study below.

## Benchmarking study¶

We performed a small benchmarking study on AMD Rome hardware (AMD EPYC 7702 64-Core Processor) to try to estimate Python performance on Perlmutter. You will find information about our benchmarking study, including all of the materials you need should you wish to reproduce it or run it elsewhere, at the NERSC Python AMD Benchmarking Gitlab repository.

We tested the following:

1. 4 NumPy functions (numpy.linalg.eigh, numpy.linalg.cholesky, numpy.linalg.svd, numpy.dot), at
2. 3 square matrix sizes, for
3. 3 library configurations (mkl, mkl-workaround, OpenBLAS), in
4. 2 thread/process binding configurations (high arithmetic intensity, highAI, and high memory bandwidth, highMB),
5. using 1 AMD EPYC 7702 (Rome) CPU with 64 physical cores and 2 hyperthreads per core
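The measurements above boil down to timing each function repeatedly and keeping the minimum runtime. The sketch below illustrates that pattern; the matrix size and repeat count are small placeholders, not the values used in the actual study:

```python
# Sketch of a NumPy linear-algebra micro-benchmark: time each function
# several times on a square matrix and report the minimum wall-clock time.
# The size n below is a placeholder, not a size from the study.
import time

import numpy as np

rng = np.random.default_rng(seed=0)


def benchmark(func, *args, repeats=3):
    """Return the minimum wall-clock runtime over several repeats."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - start)
    return min(times)


n = 500  # placeholder matrix size
a = rng.standard_normal((n, n))
# Symmetric positive-definite input so that cholesky and eigh are valid:
sym = a @ a.T + n * np.eye(n)

for name, func, args in [
    ("eigh", np.linalg.eigh, (sym,)),
    ("cholesky", np.linalg.cholesky, (sym,)),
    ("svd", np.linalg.svd, (a,)),
    ("dot", np.dot, (a, a)),
]:
    print(f"{name:10s} {benchmark(func, *args):.4f} s")
```

Taking the minimum over repeats (rather than the mean) reduces noise from other processes and from one-time library initialization costs.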

We determined the high arithmetic intensity (highAI) configuration using the standard DGEMM matrix multiplication benchmark; you can find more information about this benchmark at the Intel GEMM benchmarking page. We adjusted the thread and process binding on the AMD Rome to optimize DGEMM GFLOPS/s using the following configuration:

OMP_NUM_THREADS=128 OMP_PLACES=threads OMP_PROC_BIND=close srun -u -n 1 -c 128 --cpu_bind=sockets ./dgemm.exe


We determined the high memory bandwidth (highMB) configuration using the STREAM Triad benchmark, optimizing memory bandwidth with the following configuration:

OMP_NUM_THREADS=64 OMP_PLACES=cores OMP_PROC_BIND=spread srun -u -n 1 -c 128 --cpu_bind=sockets ./stream-triad.exe


This is not a perfect test, nor is it exhaustive. Our goal is to understand general NumPy performance on AMD Rome so we can provide this information to our users to make informed decisions about which libraries to use on Perlmutter.
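One practical detail when experimenting with these settings from Python: threading variables such as OMP_NUM_THREADS must be in the environment before NumPy's BLAS library is first loaded. A minimal sketch (the values here are illustrative, not a recommendation):

```python
# Sketch: set OpenMP threading variables before NumPy is imported, since
# BLAS libraries typically read them only once, at load time. The values
# below are illustrative placeholders, not tuned recommendations.
import os

os.environ.setdefault("OMP_NUM_THREADS", "64")
os.environ.setdefault("OMP_PLACES", "cores")
os.environ.setdefault("OMP_PROC_BIND", "spread")

import numpy as np  # imported only after the environment is configured

print(np.dot(np.ones(3), np.ones(3)))
```

In a batch job it is usually simpler to export these variables in the job script (as in the srun lines above) rather than setting them in Python.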

## Benchmarking study results¶

1. For the functions we studied here, the MKL workaround can add some performance improvement, but the effect depends on the matrix size and function type. Notably, it added runtime to the np.dot function, which underlines the importance of doing some basic benchmarking. We do suggest that users who need performance try the MKL workaround, since it is straightforward to use.
2. OpenBLAS beat MKL in np.linalg.cholesky in the highAI configuration and in np.dot in both configurations. MKL may not always be faster; again, it is important to benchmark your application.
3. np.linalg.eigh, np.linalg.cholesky, and np.linalg.svd performed better in the highMB configuration, and np.dot generally performed better in the highAI configuration. Python users should be aware that their choice of process and thread binding can also have a large impact on performance.
4. Other libraries like BLIS exist. BLIS was generally slow in our testing, so we did not include it here. BLIS is under active development, so we'll keep an eye on this effort and update our recommendations as the landscape evolves.
5. This benchmark was performed on our /global/common/software filesystem. Python users are advised to remember that other factors influence performance, especially the choice of filesystem and containers. Please see this page for tips on improving general Python performance at NERSC.

We summarize the results of the benchmarking study grouped according to the highMB and highAI configurations. For each configuration, we show the runtime normalized to mkl (top) and the minimum runtime in seconds (bottom).