
Preparing your Python code for Perlmutter's GPUs

The Perlmutter GPU partition includes nearly 1800 GPU nodes, each with 4 NVIDIA A100 GPUs, and the CPU partition includes over 3000 CPU nodes, each with 2 AMD Milan CPUs. The majority of Perlmutter's computational capability is in the GPU partition. Python users who wish to take advantage of this will need to adjust their code to run on GPUs.

This page is a guide for Python users who may be new to using GPUs for computation and offers advice on how to get started. For more information about running Python on Perlmutter's AMD CPUs, please see our Python on AMD page. See the Perlmutter readiness page for general advice on software performance on Perlmutter.

Here are some questions to ask yourself to help determine if your code would benefit from running on a GPU and if so, which options may be a good fit for your application.

Question 0 -- Is my code a good fit for a GPU?

This is an important question to consider. Moving code from a CPU to a GPU can be a lot of work. Expending this developer time and energy only makes sense if you can be reasonably sure that a GPU will provide some speedup.

GPUs are best at the kind of large matrix operations, like those in graphics processing, that they were originally created to perform. If a scientific Python application can leverage large matrix operations, GPUs are likely a good fit. Other applications that can't employ massively parallel operations, like some Monte Carlo calculations, will likely struggle to use the GPU effectively.

If you aren't sure what kinds of operations you are doing or where your code spends most of its time, the best solution is to profile your code. Even basic profiling will give you some sense of whether moving to the GPU is worth your time. It may not be-- and that's ok!

Question 1 -- How much data do I need to move between the CPU and GPU?

Moving data between a CPU and GPU is expensive. In many cases it can be much more expensive than the GPU computation itself. A Python developer should ask whether the number of calculations they expect to perform on the GPU outweighs the cost of the data they have to move. A general rule of thumb is that the GPU should perform on the order of 10 operations on a dataset to make the cost of moving that data worthwhile.

If you only have one matrix to solve, a GPU may not be that helpful. If you want to operate on the same matrix 100 times, then it may be much more useful.

Basic profiling may also help shed some light on the data needs and usage patterns in your application.
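To make the rule of thumb concrete, here is a back-of-the-envelope sketch. The function name (`worth_offloading`) and the numbers are purely illustrative assumptions, not measurements: roughly 25 GB/s for a PCIe 4.0 host-to-device transfer and roughly 9.7 TFLOP/s for FP64 on an A100.

```python
def worth_offloading(bytes_moved, flops_per_pass, reuse=1,
                     bandwidth=25e9, flop_rate=9.7e12):
    """Crude model: offloading pays off when the estimated GPU compute
    time exceeds the estimated transfer time. Illustrative numbers only:
    ~25 GB/s PCIe transfer, ~9.7 TFLOP/s FP64 on an A100."""
    transfer_s = bytes_moved / bandwidth
    compute_s = reuse * flops_per_pass / flop_rate
    return compute_s > transfer_s

# One 1000x1000 float64 matrix is 8 MB, and one matmul with it is ~2e9 flops.
n = 1000
once = worth_offloading(8 * n * n, 2 * n ** 3, reuse=1)    # a single pass: not worth it
many = worth_offloading(8 * n * n, 2 * n ** 3, reuse=10)   # 10 passes amortize the transfer
```

Under these assumed numbers, one pass over the matrix does not cover the transfer cost, but reusing the data roughly 10 times does, which matches the rule of thumb above.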

Question 2 -- What if my code doesn't meet these requirements?

This is actually a common question. It is very unlikely your code is a great fit for GPUs without doing any work. Many of the codes that will use the Perlmutter GPU partition have been actively redesigned with GPUs in mind.

If you are willing to redesign your code, you'll want to focus on the areas we discussed in Question 0 and Question 1-- massive parallelism and minimizing eventual data movement between the CPU and GPU. You may find that these changes also improve the CPU version of your code. Once your code meets these requirements, you can move forward.

Question 3 -- Okay, my code is suitable for a GPU, now what?

If your code meets the first two criteria we listed-- contains massive parallelism and has a reasonably high GPU compute to data movement ratio-- the next question is how best to start porting your code to the GPU.

import gpu #this will not work :)

Python users have become somewhat spoiled due to well-established libraries like NumPy and SciPy that make performing sophisticated calculations easy on a wide variety of CPU hardware.

On GPUs, however, Python users cannot use familiar libraries like NumPy and will instead have to choose from newer GPU-enabled libraries and frameworks. Each of these options has its own pros and cons. You should select a library (or libraries) depending on the needs of your application. We provide a brief overview of several current options below.

The only constant is change

It is important to note that the Python GPU landscape is changing quickly. Most likely by the time you are reading this page some information will be out of date. We encourage anyone interested in the current Python GPU landscape to continue checking project repositories and reading the documentation for the framework(s) in which they are interested.

Question 4 -- How should I replace NumPy/SciPy?


CuPy

CuPy, developed by Preferred Networks, is a drop-in replacement for NumPy. Many NumPy and SciPy functions have been implemented in CuPy with largely the same interface as their original counterparts.

CuPy interfaces well with Numba CUDA JIT compilation for custom kernel implementation and recently added an API for implementing custom kernels that is similar to Numba CUDA.

At present CuPy is predominantly supported on NVIDIA GPUs, although AMD ROCm support is also in development.
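A common way to exploit CuPy's drop-in nature is to pass the array module itself as an argument. The sketch below (the `power_iterate` helper is a hypothetical example, not a CuPy API) runs with NumPy as written; on a GPU node with CuPy installed, the same function can be called with `cupy` instead.

```python
import numpy as np
# On a GPU node you would also: import cupy as cp

def power_iterate(xp, n=256, reps=50):
    """xp is the array module: numpy on the CPU, cupy on the GPU.
    Because CuPy mirrors the NumPy API, identical code runs on both."""
    a = xp.eye(n) * 2.0
    v = xp.ones(n)
    for _ in range(reps):
        v = a @ v
        v = v / xp.linalg.norm(v)   # keep the vector normalized
    return v

v = power_iterate(np)       # CPU
# v = power_iterate(cp)     # GPU; cp.asnumpy(v) copies the result back to the host
```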


JAX

JAX is also a NumPy/SciPy-like library, but it cannot be used as a direct replacement due to differences in syntax. The JAX framework includes implementations of most NumPy and SciPy operations, a JIT compiler, and some deep learning-specific functionality.

While the native JIT compiler is useful, the syntax required by the JIT compiler may feel somewhat alien to NumPy users. For instance, JAX prohibits in-place updates, a constraint that can complicate the process of translating existing code into code suitable for JAX's JIT compiler.

JAX utilizes Google's XLA compiler, which currently supports a wide range of platforms including many CPUs, Google TPUs, NVIDIA GPUs, AMD GPUs (experimental), as well as Apple GPUs (experimental).
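The in-place-update constraint mentioned above is often the first surprise for NumPy users. A minimal sketch (the function name is a hypothetical example): where NumPy would allow `x[0] = 0.0`, JAX requires the functional `.at[].set()` API, which returns a new array.

```python
import jax
import jax.numpy as jnp

@jax.jit
def zero_first_and_normalize(x):
    # JAX arrays are immutable: `x[0] = 0.0` raises an error inside jit.
    # The functional update API returns a modified copy instead.
    x = x.at[0].set(0.0)
    return x / jnp.linalg.norm(x)

v = zero_first_and_normalize(jnp.arange(1.0, 5.0))
```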

Deep Learning Libraries like PyTorch and TensorFlow

PyTorch and TensorFlow aren't just for deep learning! Both implement much of the most common NumPy functionality (and even some SciPy functionality) in their tensor APIs. PyTorch and TensorFlow have received more development effort than smaller Python frameworks like JAX, and as a result they are very well optimized. They also have very large user communities, making it easier to find answers to problems.
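As a brief sketch of NumPy-style array work in PyTorch: the same tensor code runs on CPU or GPU, and only the device selection changes. (This example falls back to the CPU when no GPU is visible.)

```python
import torch

# Pick a device: the rest of the code is identical either way.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.linspace(0.0, 1.0, steps=6, device=device).reshape(2, 3)
row_means = a.mean(dim=1)        # NumPy-style reduction along an axis
total = float(row_means.sum())   # float()/.item() copies the scalar back to the host
```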

Question 5 -- How should I replace pandas/scikit-learn?


RAPIDS

NVIDIA RAPIDS provides GPU-accelerated implementations of libraries like pandas (RAPIDS cuDF) and scikit-learn (RAPIDS cuML), built on CUDA library backends.
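Because cuDF mirrors much of the pandas API, porting can be close to an import swap. The sketch below runs with pandas as written; on a GPU node with RAPIDS installed, swapping the import (as indicated in the comment) runs the same dataframe operations on the GPU.

```python
import pandas as pd
# On a GPU node with RAPIDS: import cudf as pd
# (cuDF implements much of the pandas API on the GPU)

df = pd.DataFrame({"group": ["a", "b", "a", "b"],
                   "val":   [1, 2, 3, 4]})
means = df.groupby("group")["val"].mean()   # same call in pandas and cuDF
```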

Question 6 -- How can I write my own custom GPU kernels?

If you are using other domain-specific libraries-- for example, AstroPy-- that do not yet provide any GPU support, you will likely need to write some GPU kernels yourself. You may also find that CuPy or RAPIDS have not implemented the function you require. In both of these situations your only option is to write your own GPU kernel.

There are several options for this. Your choice should depend on your skill level and your needs.


Numba

Numba is a JIT compiler for Python. It can compile Python for CPUs and can also generate GPU code for both CUDA and ROCm architectures.

Numba CUDA is more Pythonic, less powerful, and less complex than writing kernels in PyCUDA. For users who would prefer not to deal with raw CUDA, Numba offers a friendlier alternative. It does, however, still require users to understand the basics of CUDA-style GPU programming, including threads, blocks, and kernels.

We have found that CuPy and Numba complement each other and have used both together-- CuPy for the drop-in NumPy replacements, and Numba in situations where the functions are not available in CuPy. Numba and CuPy objects can interface directly.


CuPy

CuPy recently added an API for implementing custom kernels that is similar to Numba CUDA. As of this writing, this feature is still marked experimental, but it is worth considering if you are already using CuPy for NumPy-style array programming.


JAX

While JAX offers ready-made operations similar to CuPy, it also enables the writing of arbitrary operations. With JAX, you can write these operations in Python, rather than dropping to a low-level language, and then use its just-in-time (JIT) compiler to optimize them for your GPU (optimizations include merging kernels, reducing data movement, etc.).
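For example, a custom elementwise operation can be written as plain Python over JAX arrays (the function name here is a hypothetical example), and the JIT compiler can fuse the chain of operations rather than launching one kernel per step:

```python
import jax
import jax.numpy as jnp

@jax.jit
def gaussian_window(x):
    # Several elementwise operations written in plain Python; XLA can
    # fuse them into a small number of kernels instead of one per op.
    return jnp.exp(-x * x) * jnp.cos(x)

y = gaussian_window(jnp.linspace(-1.0, 1.0, 101))
```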


PyCUDA

PyCUDA was written by Andreas Kloeckner. PyCUDA is much more powerful than Numba: it provides a wrapper for CUDA that is cleanly accessible within Python, and it is straightforward to install and run. However, for most Python users, PyCUDA and PyOpenCL are likely the most challenging frameworks to use. Where Numba abstracts away some of the more complex parts of writing CUDA code (using pointers, for example), PyCUDA users must write and understand enough C/C++/CUDA to implement the kernels they require. The same is true for PyOpenCL.


PyOpenCL

PyOpenCL was also written by Andreas Kloeckner. It is similar to PyCUDA in that it wraps OpenCL, another C/C++-like language that is likely unfamiliar to most Python users. Installing and using PyOpenCL is more challenging than using PyCUDA; it is supported by fewer profiling tools, and it is generally harder to find resources for. However, the true strength of OpenCL is its portability: it should run on most CPUs and all current GPUs.

CUDA Python

NVIDIA's CUDA Python provides Cython/Python wrappers for the CUDA driver and runtime APIs. Many of the libraries on this page predate this official Python interface to CUDA. This option may be a good fit if you are already familiar with CUDA programming in lower-level languages. It currently lacks some of the higher-level APIs found in the other options on this page, which may be more familiar to Python users.

Question 7 -- How do I scale up?

We anticipate that many NERSC users would like to run on more than one GPU and maybe even more than one GPU node. We will briefly summarize several options for scaling up with Python GPU frameworks:

  1. Familiar options like mpi4py will still work on CPUs and can be used to coordinate work on multiple nodes and multiple GPUs.
  2. CuPy can address more than one GPU via its device and stream APIs. It can also scale to multiple GPUs and multiple nodes via Dask.
  3. JAX has native support for multi-GPU as well as multi-node computation. See our documentation for further details.
  4. Libraries in NVIDIA RAPIDS also use Dask to scale to multiple GPUs and multiple nodes.
  5. Other libraries like Legate that are currently in development may also provide a user-friendly way to scale NumPy operations to many nodes.
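Whichever framework you scale with, each process typically needs to be bound to one of the node's 4 GPUs. One common pattern is sketched below; it assumes the job is launched under Slurm (e.g. with srun), which sets SLURM_LOCALID for each task.

```python
import os

# Bind each process (e.g. each MPI rank) to one of the node's 4 A100s
# using the local rank that Slurm provides. Falls back to rank 0 when
# not running under Slurm.
local_rank = int(os.environ.get("SLURM_LOCALID", "0"))
gpu_id = local_rank % 4   # 4 GPUs per Perlmutter GPU node
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
# A GPU framework imported after this point (CuPy, PyTorch, ...) will
# see only the selected device.
```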

If you have questions about porting your Python code to Perlmutter, please open a support ticket. We can provide some guidance to help you decide if GPU porting is a good fit for your application, give you some advice to get started, and help you choose a framework (or combination of frameworks) that will suit your needs.