Preparing your Python code for Perlmutter's GPUs

NERSC's next system is Perlmutter. The Perlmutter GPU partition will have approximately 1500 GPU nodes, each with 4 NVIDIA A100 GPUs, and the CPU partition will have approximately 3000 CPU nodes, each with 2 AMD Milan CPUs. We offer general Perlmutter readiness advice in our documentation on software performance.

The majority of the Perlmutter FLOPS will come from the GPU partition. Python users who wish to take advantage of this will need to adjust their code to run on the GPUs.

For more information about running Python on Perlmutter's AMD CPUs, please see our Python on AMD page. This page is focused on considerations and GPU porting advice for potential Python GPU users.

Here are some questions to ask yourself to help determine if your code would benefit from running on a GPU and if so, which options may be a good fit for your application.

Question 0 -- Is my code a good fit for a GPU?

This is an important question to consider. Moving code from a CPU to a GPU can be a lot of work. Expending this developer time and energy only makes sense if you can be reasonably sure that a GPU will provide some speedup.

GPUs are best at the kind of large matrix operations, like those in graphics processing, that they were originally designed for. If a scientific Python application can leverage large matrix operations, GPUs are likely a good fit. Applications that cannot be cast as massively parallel operations (for example, Monte Carlo calculations with long sequential chains) will likely struggle to use the GPU effectively.

If you aren't sure what kinds of operations you are doing or where your code spends most of its time, the best solution is to profile your code. Even basic profiling will give you some sense of whether moving to the GPU is worth your time. It may not be-- and that's ok!
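For example, Python's built-in cProfile module gives a quick, rough picture of where time is spent. The sketch below is a minimal illustration; `my_simulation.run()` is a hypothetical stand-in for your own application code.

```python
# A minimal profiling sketch using Python's built-in cProfile and pstats modules.
# "my_simulation" and "run" are hypothetical placeholders for your own code.
import cProfile
import pstats

import my_simulation

profiler = cProfile.Profile()
profiler.enable()
my_simulation.run()
profiler.disable()

# Show the 10 functions with the largest cumulative time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```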

Question 1 -- How much data do I need to move between the CPU and GPU?

Moving data between a CPU and GPU is expensive. In many cases it can be much more expensive than the GPU computation itself. A Python developer should ask whether the computation they expect to perform on the GPU is large enough to justify the data they have to move. A general rule of thumb is that the GPU should perform on the order of 10 operations on a dataset to make the cost of moving it worthwhile.

If you only have one matrix to solve, a GPU may not be that helpful. If you want to operate on the same matrix 100 times, then it may be much more useful.
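As a rough illustration of this trade-off, the sketch below times a one-time host-to-device transfer against repeated work on the same matrix. It uses CuPy (described later on this page) and assumes it is installed; the exact numbers will depend on your hardware.

```python
# A rough sketch of the transfer-vs-compute trade-off, assuming CuPy is installed.
import time
import numpy as np
import cupy as cp

a_cpu = np.random.random((4000, 4000)).astype(np.float32)

# One-time cost: move the matrix from the CPU to the GPU.
start = time.perf_counter()
a_gpu = cp.asarray(a_cpu)
cp.cuda.Stream.null.synchronize()
print("transfer time:", time.perf_counter() - start)

# Repeated work on the same data amortizes that transfer cost.
start = time.perf_counter()
for _ in range(100):
    b_gpu = a_gpu @ a_gpu
cp.cuda.Stream.null.synchronize()
print("100 matrix multiplies:", time.perf_counter() - start)
```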

Basic profiling may also help shed some light on the data needs and usage patterns in your application.

Question 2 -- What if my code doesn't meet these requirements?

This is actually a common question. It is very unlikely your code is a great fit for GPUs without doing any work. Many of the codes that will use the Perlmutter GPU partition have been actively redesigned with GPUs in mind.

If you are willing to redesign your code, you'll want to focus on the areas we discussed in Question 0 and Question 1-- massive parallelism and minimizing eventual data movement between the CPU and GPU. You may find that these changes also improve the CPU version of your code. Once your code meets these requirements, you can move forward.

Question 3 -- Okay, my code is suitable for a GPU, now what?

If your code meets the first two criteria we listed-- contains massive parallelism and has a reasonably high GPU compute to data movement ratio-- the next question is how best to start porting your code to the GPU.

```python
import gpu  # this will not work :)
```

Python users have become somewhat spoiled due to well-established libraries like NumPy and SciPy that make performing sophisticated calculations easy on a wide variety of CPU hardware.

On GPUs, however, Python users cannot use familiar libraries like NumPy directly and instead will have to choose from newer GPU-enabled libraries and frameworks. Each of these options has its own pros and cons. You should select a library (or libraries) depending on the needs of your application. We provide a brief overview of several current options below.

The only constant is change

It is important to note that the Python GPU landscape is changing quickly. Most likely by the time you are reading this page some information will be out of date. We encourage anyone interested in the current Python GPU landscape to continue checking project repositories and reading the documentation for the framework(s) in which they are interested.

Question 4 -- How should I replace NumPy/SciPy?


CuPy, developed by Preferred Networks, is a drop-in replacement for NumPy. Many NumPy and SciPy functions have been implemented in CuPy with largely the same interface as their original counterparts.
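As a small sketch (assuming CuPy is installed), the same array code can often be moved to the GPU just by swapping the array module:

```python
# A minimal sketch of CuPy as a drop-in NumPy replacement, assuming CuPy is installed.
import numpy as np
import cupy as cp

x_cpu = np.random.random((1000, 1000))

x_gpu = cp.asarray(x_cpu)          # copy the array to the GPU
y_gpu = cp.fft.fft2(x_gpu) * 2.0   # same syntax as NumPy, runs on the GPU
y_cpu = cp.asnumpy(y_gpu)          # copy the result back to the CPU

print(np.allclose(y_cpu, np.fft.fft2(x_cpu) * 2.0))
```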

CuPy does not come with its own JIT compiler. It does, however, interface well with Numba, which is described below. Users who require a function that is not currently implemented in CuPy are advised to write a custom kernel in a framework like Numba.

At present CuPy is predominantly supported on NVIDIA GPUs, although AMD ROCm support is also in development.


JAX, developed by Google, is also a NumPy/SciPy-like library, although it cannot be used as a drop-in replacement due to syntax differences. The JAX framework includes many implemented versions of NumPy and SciPy functions, a JIT compiler, and some other deep learning specific functionality.

JAX has its own JIT compiler, in contrast to CuPy. While this native JIT functionality is nice, the syntax required by the JIT compiler can be somewhat alien to NumPy users. For example, in-place updates via NumPy-style index assignment are not supported, since JAX arrays are immutable. These small changes can make it difficult to translate existing code into code suitable for JAX's JIT compiler.
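The sketch below shows the flavor of JAX's jit and its functional-style array updates; it is a minimal example assuming a recent JAX release is installed.

```python
# A minimal sketch of JAX's jit and immutable, functional-style array updates,
# assuming a recent JAX release is installed.
import jax
import jax.numpy as jnp

@jax.jit
def scale_and_update(x):
    y = x * 2.0
    # NumPy-style in-place assignment (y[0] = 0.0) is not allowed here;
    # JAX arrays are immutable, so updates return a new array instead.
    return y.at[0].set(0.0)

x = jnp.arange(10.0)
print(scale_and_update(x))
```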

JAX uses Google's XLA compiler. At the moment XLA supports many CPUs, Google TPUs, NVIDIA GPUs, and recently AMD ROCm. However, JAX does not yet support AMD GPUs.

Deep Learning Libraries like PyTorch and TensorFlow

PyTorch and TensorFlow aren't just for deep learning! They have both implemented much of the most common NumPy functionality (and even some SciPy functionality) in their tensor arrays. PyTorch and TensorFlow have been supported by more development effort than smaller Python frameworks like JAX, and as a result, they are very well-optimized. They also have very large user communities, making it easier to find answers to problems.
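For example, PyTorch tensors can be used much like NumPy arrays on the GPU; the sketch below assumes PyTorch with CUDA support is installed.

```python
# A minimal sketch of NumPy-style array operations on the GPU with PyTorch,
# assuming a CUDA-enabled PyTorch installation.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.rand((1000, 1000), device=device)
y = torch.matmul(x, x.T) + 1.0      # familiar array-style operations, on the GPU

# Move the result back to the CPU and convert to NumPy if needed.
y_cpu = y.cpu().numpy()
print(y_cpu.shape)
```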

Question 5 -- How should I replace pandas/scikit-learn?


NVIDIA RAPIDS provides GPU-accelerated counterparts to libraries like pandas (RAPIDS cuDF) and scikit-learn (RAPIDS cuML), implemented with CUDA library backends.
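For example, a pandas-style workflow carries over largely unchanged; the sketch below assumes RAPIDS cuDF is installed.

```python
# A minimal sketch of a pandas-style workflow on the GPU with RAPIDS cuDF,
# assuming cuDF is installed.
import cudf

df = cudf.DataFrame({
    "sensor": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Familiar pandas-style operations, executed on the GPU.
print(df.groupby("sensor").mean())
```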

Question 6 -- How can I write my own custom GPU kernels?

If you are using other domain-specific libraries-- for example, AstroPy-- that do not yet provide any GPU support, you will likely need to write some GPU kernels yourself. You may also find that CuPy, JAX, or RAPIDS have not implemented the function that you require. In both of these situations, your only option is to write your own GPU kernel.

There are several options for this. Your choice should depend on your skill level and your needs.


Numba, part of the Python ecosystem, is a JIT compiler for Python. It can compile Python for CPUs and also generate GPU code for both CUDA and ROCm architectures.

Numba CUDA is more Pythonic, less powerful, and less complex than writing kernels in PyCUDA. For users who would prefer not to deal with raw CUDA, Numba offers a more friendly alternative. It does however still require users to be aware of the basics of CUDA-style GPU programming, including threads, blocks, and kernels.

We have found that CuPy and Numba complement each other and have used both together-- CuPy for the drop-in NumPy replacements, and Numba in situations where the functions are not available in CuPy. Numba and CuPy objects can interface directly.
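As a small sketch of this pattern (assuming both Numba and CuPy are installed), a simple element-wise kernel written with Numba can be launched directly on a CuPy array:

```python
# A minimal sketch of a custom Numba CUDA kernel operating on a CuPy array,
# assuming Numba and CuPy are both installed.
import cupy as cp
from numba import cuda

@cuda.jit
def scale(x, a):
    i = cuda.grid(1)          # global thread index
    if i < x.size:
        x[i] *= a

x = cp.arange(1_000_000, dtype=cp.float64)

threads_per_block = 128
blocks = (x.size + threads_per_block - 1) // threads_per_block

# Launch the kernel directly on the CuPy array (no explicit copy needed).
scale[blocks, threads_per_block](x, 2.0)
print(x[:5])
```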


PyCUDA was written by Andreas Kloeckner. PyCUDA is much more powerful than Numba: it provides a wrapper for CUDA that is cleanly accessible within Python, and it is straightforward to install and run. However, for most Python users, PyCUDA and PyOpenCL are likely the most challenging frameworks to use. Where Numba abstracts away some of the more complex parts of writing CUDA code (using pointers, for example), PyCUDA users will need to write and understand enough C/C++/CUDA to write the kernel they require. The same is true for PyOpenCL.
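The sketch below gives a feel for the PyCUDA workflow (assuming PyCUDA is installed); note that the kernel itself is written in CUDA C.

```python
# A minimal PyCUDA sketch, assuming PyCUDA and a CUDA toolkit are installed.
# The kernel body is CUDA C, compiled at runtime by SourceModule.
import numpy as np
import pycuda.autoinit               # creates a context on the default GPU
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}
""")
scale = mod.get_function("scale")

x = gpuarray.to_gpu(np.arange(1024, dtype=np.float32))
scale(x.gpudata, np.float32(2.0), np.int32(x.size),
      block=(128, 1, 1), grid=(8, 1))
print(x.get()[:5])
```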


PyOpenCL was also written by Andreas Kloeckner. It is similar to PyCUDA in that it wraps OpenCL, another C/C++-like language that is likely unfamiliar to most Python users. Installing and using PyOpenCL is more challenging than using PyCUDA (at least on our Cori GPU NVIDIA testbed). It is supported by fewer profiling tools and it is generally harder to find resources for. However, the true strength of OpenCL is its portability: it should run on all CPUs and all current GPUs.
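For comparison, the equivalent PyOpenCL sketch looks like this (assuming PyOpenCL and an OpenCL runtime are installed); the kernel is written in OpenCL C.

```python
# A minimal PyOpenCL sketch, assuming PyOpenCL and an OpenCL runtime are installed.
# The kernel body is OpenCL C, compiled at runtime.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()       # may prompt you to pick a device
queue = cl.CommandQueue(ctx)

x = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=x)

prg = cl.Program(ctx, """
__kernel void scale(__global float *x, const float a)
{
    int i = get_global_id(0);
    x[i] *= a;
}
""").build()

prg.scale(queue, x.shape, None, x_buf, np.float32(2.0))
cl.enqueue_copy(queue, x, x_buf)     # copy the result back to the host
print(x[:5])
```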

Question 7 -- How do I scale up?

We anticipate that many NERSC users would like to run on more than one GPU and maybe even more than one GPU node. We will briefly summarize several options for scaling up with Python GPU frameworks:

  1. Familiar options like mpi4py will still work (the MPI calls themselves run on the CPU) and can be used to coordinate work across multiple nodes and multiple GPUs; a small sketch combining mpi4py with CuPy follows this list.
  2. CuPy can direct work to more than one GPU on a node via its device and stream interfaces. It can also scale to multiple GPUs and many nodes via Dask.
  3. JAX can extend from a single GPU to multiple GPUs (on the same node) via its pmap parallelization interface. Based on the JAX docs, it appears that these features are still under active development.
  4. Libraries in NVIDIA RAPIDS also use Dask to scale to multiple GPUs and multiple nodes.
  5. Other libraries like Legate that are currently in development may also provide a user-friendly way to scale NumPy operations to many nodes.
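
The sketch below is a rough illustration of option 1. It assumes mpi4py and CuPy are installed, that each MPI rank gets its own GPU, and that ranks are launched one per GPU (4 per Perlmutter GPU node); the exact launch configuration will depend on your job script.

```python
# A minimal, hypothetical sketch of coordinating multiple GPUs with mpi4py + CuPy.
# Assumes one MPI rank per GPU and 4 GPUs per node, as on Perlmutter.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Bind this rank to one of the node's GPUs.
cp.cuda.Device(rank % 4).use()

# Each rank does its own GPU work on its own chunk of data.
local = cp.random.random((2000, 2000))
local_sum = float(cp.sum(local))

# Combine the per-rank results with MPI (on the CPU).
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("global sum across all GPUs:", total)
```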

If you have questions about porting your Python code to Perlmutter, please open a ticket at the NERSC help desk. We can provide some guidance to help you decide if GPU porting is a good fit for your application, give you some advice to get started, and help you choose a framework (or combination of frameworks) that will suit your needs.