Guide to Using Python on Perlmutter¶

We aim to provide important information and tips about using Python on Perlmutter. The programming environment on Perlmutter changes quickly, so it may be difficult to keep this page fully up to date. We will do our best, but please contact us if you find anything that appears incorrect or deprecated. Please also review our Current Python on Perlmutter Known Issues.

Python modules¶

NERSC provides customized Anaconda Python installations. You can use them via module load python.

The default programming environment provided by Cray/HPE includes a cray-python module with numpy and scipy configured with Cray LibSci and mpi4py configured with Cray MPICH.

Please note that Python 2.7 was retired on January 1, 2020. NERSC will not provide Python 2 on Perlmutter.

Customizing your Python environment¶

We strongly encourage the use of conda environments at NERSC for Python users to install and customize their own software environment. We also encourage Python users to customize their software stacks via Shifter. If you are interested in installing or using Python in other ways, please contact us so we can help you.

`cudatoolkit` dependency¶

Many python packages that use GPUs depend on cudatoolkit. NERSC provides a cudatoolkit module that can satisfy that dependency in many cases. The conda-forge channel also provides a cudatoolkit package which conda users can install into their environment.

There are similar modules on Perlmutter and packages on conda-forge for other common dependencies such as nccl, cutensor, and cudnn.

module load vs conda install¶

You can use either the cudatoolkit module or the cudatoolkit package installed from conda-forge. We suggest that you avoid using both. Do not module load cudatoolkit if you have cudatoolkit installed in your conda environment.

Some packages available on conda-forge do not assume you already have cudatoolkit installed and will install cudatoolkit into your conda environment.

missing nvcc

The cudatoolkit module provides the nvcc compiler, which the cudatoolkit package from conda-forge does not. If you are using cudatoolkit from conda-forge and your Python application needs to JIT compile CUDA code, you may need to install cuda-nvcc from the nvidia conda channel.

using cudatoolkit module in jupyter¶

To use the cudatoolkit module in a conda environment kernel from Jupyter, you will need to modify the kernel's kernel.json file to use a helper shell script to load the cudatoolkit module and activate the environment.

For example, if you've created a kernel for a conda environment named "mygpuenv", your kernel-helper.sh script might look like this:

#!/bin/bash
module load cudatoolkit
module load python
source activate mygpuenv
exec "$@"
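The kernel.json change described above can be sketched in Python as follows. This is a minimal sketch rather than an official NERSC procedure: the function names and the helper script path are hypothetical, and it assumes the kernel spec follows the standard Jupyter layout in which `argv` lists the command used to launch the kernel.

```python
import json
from pathlib import Path


def wrap_kernel_argv(kernel_spec: dict, helper_path: str) -> dict:
    """Return a copy of a kernel spec whose argv starts with the helper script."""
    updated = dict(kernel_spec)
    argv = list(kernel_spec.get("argv", []))
    # Avoid prepending the helper twice if the spec was already modified.
    if not argv or argv[0] != helper_path:
        argv.insert(0, helper_path)
    updated["argv"] = argv
    return updated


def update_kernel_json(kernel_json: Path, helper_path: str) -> None:
    """Rewrite kernel.json in place so the kernel is launched via the helper."""
    spec = json.loads(kernel_json.read_text())
    kernel_json.write_text(json.dumps(wrap_kernel_argv(spec, helper_path), indent=1))


# Illustrative spec only; the contents of a real kernel.json may differ.
example_spec = {
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "mygpuenv",
    "language": "python",
}
# Hypothetical location of the helper script shown above.
helper = "/path/to/kernel-helper.sh"
print(json.dumps(wrap_kernel_argv(example_spec, helper), indent=1))
```

Because the helper script ends with `exec "$@"`, it runs the original kernel command after loading the modules and activating the environment. The helper script must be executable for this to work.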

mpi4py on Perlmutter¶

Using mpi4py on Perlmutter is similar to using mpi4py on previous NERSC systems such as Cori. For the most part, the same recommendations apply on Perlmutter, especially if you are only using CPU nodes.

This section provides some additional recommendations for using mpi4py on Perlmutter GPU nodes. Please see the mpi4py documentation for details about GPU-aware MPI support in mpi4py.

Installing mpi4py with GPU-aware Cray MPICH¶

We recommend installing mpi4py with GPU-aware Cray MPICH. The following examples demonstrate how to install mpi4py with GPU-aware Cray MPICH. The examples explicitly load required modules which may already be loaded by default.

Using the GNU programming environment:

module load PrgEnv-gnu cray-mpich cudatoolkit craype-accel-nvidia80 python
conda create -n gpu-aware-mpi python -y
conda activate gpu-aware-mpi
MPICC="cc -shared" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

Using the NVIDIA programming environment:

module load PrgEnv-nvidia cray-mpich cudatoolkit craype-accel-nvidia80 python
conda create -n gpu-aware-mpi python=3.9 -y
conda activate gpu-aware-mpi
MPICC="cc -shared" CC=nvc CFLAGS="-noswitcherror" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

Note

If mpi4py is installed with GPU-aware Cray MPICH, you must have the CUDA runtime in your environment at runtime, even for CPU-only programs.

Using mpi4py with GPU-aware Cray MPICH¶

The following example demonstrates how to use an mpi4py built with GPU-aware Cray MPICH together with CuPy. See the CuPy section below for instructions on adding CuPy to your conda environment.

Here is a simple example using mpi4py with CuPy arrays:

from mpi4py import MPI
import cupy as cp
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
sendbuf = cp.arange(10, dtype='i')
recvbuf = cp.empty_like(sendbuf)
print(f"{rank=} before {sendbuf=} {recvbuf=}")
comm.Allreduce(sendbuf, recvbuf)
print(f"{rank=} after {sendbuf=} {recvbuf=}")
assert cp.allclose(recvbuf, sendbuf*size)

The following Slurm batch script can be used to run this program. MPICH_GPU_SUPPORT_ENABLED=1 should already be set by default (via the gpu module). It is included here explicitly to illustrate that it is required for GPU-aware Cray MPICH at runtime.

#!/bin/bash
#SBATCH --account=<account>
#SBATCH --constraint=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4

module load PrgEnv-gnu cray-mpich cudatoolkit craype-accel-nvidia80 python
conda activate gpu-aware-mpi
export MPICH_GPU_SUPPORT_ENABLED=1 

srun ./select_gpu_device python test-gpu-aware-mpi.py

Note

select_gpu_device is a wrapper script that maps each MPI task to a single GPU device:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"

This script may not always be necessary as some applications or libraries will handle mapping of MPI tasks or processes at runtime. See the Cray MPICH documentation by running man intro_mpi on Perlmutter for more information about using GPU-aware Cray MPICH.
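As an alternative to a wrapper script, the same mapping can be done inside the Python program itself, provided it happens before the CUDA runtime is initialized. The following is a minimal sketch, not a NERSC-provided utility: the function name `bind_gpu_to_local_rank` is hypothetical, and it assumes Slurm sets SLURM_LOCALID for each task launched by srun.

```python
import os


def bind_gpu_to_local_rank(env=os.environ):
    """Restrict this process to the GPU matching its node-local Slurm task ID.

    Returns the device index as a string, or None when SLURM_LOCALID is not
    set (for example, when running outside of srun).
    """
    local_id = env.get("SLURM_LOCALID")
    if local_id is None:
        return None
    # Equivalent to `export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID` in the wrapper.
    env["CUDA_VISIBLE_DEVICES"] = local_id
    return local_id


# Demonstration with a plain dict standing in for the process environment.
fake_env = {"SLURM_LOCALID": "2"}
print(bind_gpu_to_local_rank(fake_env), fake_env["CUDA_VISIBLE_DEVICES"])
```

In a real program, call `bind_gpu_to_local_rank()` with no arguments before the first CUDA call. The simplest way to guarantee that is to call it before importing cupy.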

CuPy¶

CuPy requires CUDA which is provided by the cudatoolkit module on Perlmutter.

The following instructions demonstrate how to set up a custom conda environment to use CuPy on Perlmutter. They are adapted from the CuPy installation instructions.

Installing with pip¶

You can install cupy in your conda environment with pip. Make sure to load the cudatoolkit module and specify a cupy wheel that corresponds to the version of cudatoolkit from the module.

# Note the CUDA version from cudatoolkit (12.2)
module load cudatoolkit/12.2
module load conda
# Create a new conda environment
conda create -n cupy-demo python=3.12 pip numpy scipy
# Activate the environment before using pip to install CuPy
conda activate cupy-demo
# Install the wheel compatible with CUDA 12.2
pip install cupy-cuda12X

When you use cupy with this environment you should make sure to load the corresponding cudatoolkit module.
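The correspondence between the cudatoolkit version and the CuPy wheel name can be sketched as a small lookup. This is an illustration only: the function name `cupy_wheel_for` is hypothetical, and the mapping assumes the `cupy-cuda11x` and `cupy-cuda12x` wheel names that CuPy publishes for CUDA 11.x and 12.x. Other CUDA versions are deliberately rejected, so check the CuPy installation instructions for the current list.

```python
def cupy_wheel_for(cuda_version: str) -> str:
    """Map a CUDA version such as '12.2' to a CuPy wheel name."""
    # Accept either a bare version or a module name like 'cudatoolkit/12.2'.
    version = cuda_version.rsplit("/", 1)[-1]
    major = version.split(".", 1)[0]
    # Assumed wheel names; only CUDA 11.x and 12.x are covered in this sketch.
    wheels = {"11": "cupy-cuda11x", "12": "cupy-cuda12x"}
    if major not in wheels:
        raise ValueError(f"no known CuPy wheel for CUDA version {version!r}")
    return wheels[major]


print(cupy_wheel_for("cudatoolkit/12.2"))  # prints: cupy-cuda12x
```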

Installing from conda-forge¶

You can install both cudatoolkit and cupy from conda-forge.

module load python
conda create -c conda-forge -n cupy-demo python=3.9 pip numpy scipy cudatoolkit cupy

In this case, you should avoid loading the cudatoolkit module in your environment which could lead to conflicts with the cudatoolkit installed in your conda environment.

Building CuPy from Source using pip¶

You can also build CuPy from source on Perlmutter. The build instructions depend slightly on whether you're using PrgEnv-nvidia or PrgEnv-gnu.

Compiling with PrgEnv-gnu:

module load PrgEnv-gnu
module load cudatoolkit
LDFLAGS="$CRAY_CUDATOOLKIT_POST_LINK_OPTS" NVCC="nvcc $CRAY_CUDATOOLKIT_INCLUDE_OPTS" CC=cc CXX=CC pip install cupy

Compiling with PrgEnv-nvidia:

module load PrgEnv-nvidia
module load cudatoolkit
LDFLAGS="$CRAY_CUDATOOLKIT_POST_LINK_OPTS -L$CUDATOOLKIT_HOME/targets/x86_64-linux/lib" CFLAGS="-I$CUDATOOLKIT_HOME/targets/x86_64-linux/include" NVCC="nvcc $CRAY_CUDATOOLKIT_INCLUDE_OPTS" CC=gcc CXX=g++ pip install cupy

CuPy builds can be customized in many ways

We recommend that you check out the list of customizations. For example, CUPY_NUM_BUILD_JOBS and CUPY_NUM_NVCC_THREADS can be used to increase the parallelism of your CuPy builds, and CUPY_CACHE_DIR can be used to change the location of CUDA code generated by CuPy.
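If you drive the build from Python, these settings can be collected into an environment dictionary and passed to the build command. The following is a minimal sketch: the function name `cupy_env_overrides`, the numeric values, and the cache path are all hypothetical placeholders, and only the three variable names come from the text above.

```python
def cupy_env_overrides(build_jobs: int, nvcc_threads: int, cache_dir: str) -> dict:
    """Collect CuPy-related environment variables as strings."""
    return {
        # Number of parallel build jobs used when compiling CuPy.
        "CUPY_NUM_BUILD_JOBS": str(build_jobs),
        # Number of threads nvcc may use per compilation.
        "CUPY_NUM_NVCC_THREADS": str(nvcc_threads),
        # Directory where CuPy stores the CUDA code it generates.
        "CUPY_CACHE_DIR": cache_dir,
    }


# Placeholder values chosen for illustration, not tuned recommendations.
overrides = cupy_env_overrides(8, 2, "/path/to/cupy-cache")
for name in sorted(overrides):
    print(f"{name}={overrides[name]}")
```

One way to use the result is to merge it into a copy of `os.environ` and pass that as the `env` argument of `subprocess.run` when invoking `pip install cupy`.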

JAX¶

Setting up JAX¶

Pip Installation¶

The following instructions, adapted from the JAX installation instructions, demonstrate how to set up a custom Conda environment and use pip to set up JAX on Perlmutter:

module load python
# Create a new conda environment
conda create -n jax-demo python=3.11 pip numpy scipy rich
# Activate the environment and install a pinned JAX release with CUDA 12 support
conda activate jax-demo
pip install --upgrade "jax[cuda12]==0.4.37"

Note that with this installation method JAX brings its own versions of CUDA and cuDNN. These may become incompatible with Perlmutter as the system environment or JAX dependencies evolve, in which case you will need to change the JAX version you install.

NVIDIA Containers¶

We recommend that you use NVIDIA's JAX-Toolbox containers to install JAX. The process is relatively straightforward and will reliably give you a working installation of JAX.

You can start a GPU job using the ghcr.io/nvidia/jax:jax image via Shifter with the following Slurm script:

#!/bin/bash
#SBATCH --image=ghcr.io/nvidia/jax:jax
#SBATCH --nodes=1
#SBATCH --qos=regular
#SBATCH --constraint=gpu
#SBATCH --gpus-per-node=4
#SBATCH --module=gpu,nccl-plugin

srun shifter python3 ./my_script.py

You could also run a script locally as follows:

shifter --module=gpu,nccl-plugin --image=ghcr.io/nvidia/jax:jax python3 ./my_script.py

See the Shifter documentation for further details on how to use Shifter containers.

Distributing JAX Computation¶

JAX offers robust support for multi-GPU computing within a single process through its Parallel Operators, particularly using pmap. For more information, refer to the Parallel Evaluation in JAX documentation.

To distribute computation across multiple nodes, you can use the jax.distributed module. The official multi-host and multi-process documentation, although very TPU-focused, also provides valuable insights. For a detailed guide on distributing JAX computation over a Slurm cluster, refer to Wassim Kabalan's tutorial.

For an example of an idiomatic multi-node and multi-GPU JAX job running at NERSC with Slurm and containers, visit this GitHub repository.

When looking for efficient distributed operations, we recommend exploring existing libraries before implementing your own. For instance, jaxDecomp provides a multinode differentiable approach to Fast Fourier Transforms (FFT).

Additionally, you can use mpi4jax to incorporate MPI operations within JAX JIT-compiled sections.

Note

For JAX support (besides tickets and the NERSC Help Desk) and to connect with other JAX users at NERSC who might share similar problems and solutions, you can join the #jax-users channel on the NERSC Users Slack.

cuNumeric¶

cuNumeric is a Legate library that aims to provide a distributed and accelerated drop-in replacement for the NumPy API on top of the Legion runtime.

The following instructions demonstrate how to install cuNumeric using conda and run an example on a single Perlmutter GPU node:

# install cunumeric using conda
module load conda
conda create -n cunumeric -c nvidia -c conda-forge -c legate cunumeric 
conda activate cunumeric
# download cunumeric repo with examples
git clone https://github.com/nv-legate/cunumeric.git
cd cunumeric
# run an example program using 4 GPUs
# note use of the legate driver to launch the program
legate --gpus 4 examples/gemm.py -n 8000

Multi-node installations currently require building legate-core and cuNumeric from source; see the nv-legate/quickstart repo for details.

cuNumeric is currently under active development and should be considered experimental or "beta" software. If you have any issues, we recommend opening an issue on the cuNumeric GitHub project.

Known issues¶

General issues¶

Our Known Issues page includes more general issues that may also impact Python users.

MPI issues¶

Issues with `fork()` in MPI processes¶

Several Python users have encountered errors that look like this:

mlx5: nid003244: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 20009232 00000000 00000300
00003c40 92083204 000180b8 0085a0e0
MPICH ERROR [Rank 256] [job id 126699.1] [Wed Oct 20 12:32:36 2021] [nid003244] - Abort(70891919) (rank 256 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(415)..............: MPI_Gatherv failed(sbuf=0x55ee4a4ebea0, scount=88, MPI_BYTE, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_BYTE, root=0, comm=MPI_COMM_WORLD) failed
MPIR_CRAY_Gatherv(353).........: 
MPIC_Recv(197).................: 
MPIC_Wait(71)..................: 
MPIR_Wait_impl(41).............: 
MPID_Progress_wait(186)........: 
MPIDI_Progress_test(80)........: 
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - local protection error)

or this:

File "mpi4py/MPI/Comm.pyx", line 1595, in mpi4py.MPI.Comm.allgather
File "mpi4py/MPI/msgpickle.pxi", line 873, in mpi4py.MPI.PyMPI_allgather
File "mpi4py/MPI/msgpickle.pxi", line 177, in mpi4py.MPI.pickle_loadv
File "mpi4py/MPI/msgpickle.pxi", line 152, in mpi4py.MPI.pickle_load
File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: invalid load key, '\x00'.

File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: pickle data was truncated

File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: unpickling stack underflow

These error messages seem to be related to the use of fork() within an MPI process. For example, using the subprocess module to spawn processes, or calling a library function such as os.uname, indirectly uses the fork() system call in a Python application.

This is considered undefined behavior in the MPI standard, and the outcome may vary depending on the MPI implementation, MPI middleware, and network hardware.

On Slingshot 11 (SS11), setting these environment variables may help circumvent the error:

export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1

However, it's possible these variables will not help. The most robust mitigation is to avoid spawning forks/subprocesses in MPI applications.
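As one illustration of this mitigation, information that is often obtained by spawning a subprocess can usually be read in-process from the standard library instead. The sketch below uses the hostname as an example. The function names are hypothetical, and whether a given library call spawns a child process internally can depend on the Python version and platform.

```python
import socket
import subprocess


def hostname_via_subprocess() -> str:
    """Spawns a child process to run `hostname`; the pattern to avoid under MPI."""
    return subprocess.run(
        ["hostname"], capture_output=True, text=True, check=True
    ).stdout.strip()


def hostname_without_fork() -> str:
    """Queries the hostname in-process; no child process is created."""
    return socket.gethostname()


# Only the in-process variant is called here.
print(hostname_without_fork())
```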

If you see this error and/or have questions about this, please open a ticket.

Using AMD CPUs on Perlmutter¶

Python users should be aware that using the Intel MKL library may be slow on Perlmutter's AMD CPUs, although it is often still faster than OpenBLAS.

We advise users to try our MKL workaround via

module load fast-mkl-amd
