
Guide to Using Python on Perlmutter

We aim to provide important information and tips about using Python on Perlmutter. Please be aware that the programming environment on Perlmutter changes quickly, and it may be difficult to keep this page fully up to date. We will do our best, but we welcome you to contact us if you find anything that appears incorrect or deprecated. Please review our Current Python on Perlmutter Known Issues.

Python modules

NERSC provides customized Anaconda Python installations. You can use them via module load python.

The default programming environment provided by Cray/HPE includes a cray-python module with numpy and scipy configured with Cray LibSci and mpi4py configured with Cray MPICH.

Please note that Python 2.7 was retired on January 1, 2020. NERSC will not provide Python 2 on Perlmutter.

Customizing your Python environment

We strongly encourage the use of conda environments at NERSC for Python users to install and customize their own software environment. We also encourage Python users to customize their software stacks via Shifter. If you are interested in installing or using Python in other ways, please contact us so we can help you.

cudatoolkit dependency

Many Python packages that use GPUs depend on cudatoolkit. NERSC provides a cudatoolkit module that can satisfy that dependency in many cases. The conda-forge channel also provides a cudatoolkit package, which conda users can install into their environment.

There are similar modules on Perlmutter and packages on conda-forge for other common dependencies such as nccl, cutensor, and cudnn.

module load vs conda install

You can use either the cudatoolkit module or the cudatoolkit package installed from conda-forge. We suggest that you avoid using both. Do not module load cudatoolkit if you have cudatoolkit installed in your conda environment.

Some packages available on conda-forge do not assume you already have cudatoolkit installed and will install cudatoolkit into your conda environment.
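A quick way to notice that both sources of cudatoolkit are active is to compare the loaded modules with the packages recorded in the conda environment. The sketch below is a minimal illustration, not a NERSC tool. It assumes the module system exports a colon-separated LOADEDMODULES variable and that conda records each installed package as a JSON file under conda-meta; the function name cudatoolkit_sources is hypothetical.

```python
import os
from pathlib import Path


def cudatoolkit_sources(loaded_modules, conda_prefix):
    """Return the places cudatoolkit appears to come from.

    loaded_modules: colon-separated module list (as in $LOADEDMODULES).
    conda_prefix:   path of the active conda environment, or None.
    """
    sources = []
    # Module entries look like "cudatoolkit/11.7"; compare only the name part.
    names = [entry.split("/")[0] for entry in loaded_modules.split(":") if entry]
    if "cudatoolkit" in names:
        sources.append("module")
    # Assumption: conda writes one "<name>-<version>-<build>.json" per package.
    if conda_prefix:
        meta = Path(conda_prefix) / "conda-meta"
        if meta.is_dir() and any(meta.glob("cudatoolkit-*.json")):
            sources.append("conda")
    return sources


if __name__ == "__main__":
    found = cudatoolkit_sources(
        os.environ.get("LOADEDMODULES", ""), os.environ.get("CONDA_PREFIX")
    )
    if len(found) > 1:
        print("cudatoolkit is provided by both a module and conda; pick one")
    else:
        print("cudatoolkit sources:", found)
```

If the check reports both sources, unloading the module or removing the conda package should restore a single, consistent CUDA installation.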

missing nvcc

The cudatoolkit module provides the nvcc compiler, which is not provided by the cudatoolkit package from conda-forge. If you are using cudatoolkit from conda-forge and your Python application needs to JIT compile CUDA, you may need to install cuda-nvcc from the nvidia conda channel.
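To find out whether nvcc is visible to your Python process before a JIT compilation fails, you can look it up on the search path. This is a minimal sketch using only the standard library; find_nvcc is a hypothetical helper name.

```python
import shutil


def find_nvcc(path=None):
    """Return the full path to nvcc, or None if it is not on the search path.

    path: optional search path string; defaults to $PATH.
    """
    return shutil.which("nvcc", path=path)


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found: JIT compilation of CUDA code may fail")
    else:
        print("using nvcc at", nvcc)
```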

using cudatoolkit module in jupyter

To use the cudatoolkit module in a conda environment kernel from Jupyter, you will need to modify the kernel's kernel.json file to use a helper shell script to load the cudatoolkit module and activate the environment.

For example, if you've created a kernel for a conda environment named "mygpuenv", your kernel-helper.sh script might look like this:

#!/bin/bash
module load cudatoolkit
module load python
source activate mygpuenv
exec "$@"
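The kernel.json change amounts to putting the helper script in front of the command Jupyter already uses to start the kernel. The sketch below shows one way to make that edit programmatically. It assumes the standard kernel spec layout with an "argv" list; wrap_kernel_spec is a hypothetical helper, and the paths you pass to it depend on where your kernel and script live.

```python
import json
from pathlib import Path


def wrap_kernel_spec(kernel_json, helper_script):
    """Prepend helper_script to the kernel's argv (once) and save the file."""
    path = Path(kernel_json)
    spec = json.loads(path.read_text())
    # argv is the command Jupyter runs to start the kernel. The helper script
    # ends with `exec "$@"`, so it runs the original command after loading
    # modules and activating the conda environment.
    if spec["argv"][0] != helper_script:
        spec["argv"] = [helper_script] + spec["argv"]
    path.write_text(json.dumps(spec, indent=1) + "\n")
    return spec
```

Calling wrap_kernel_spec with the path to your kernel.json and the absolute path to kernel-helper.sh updates the file in place; calling it again leaves it unchanged. The helper script itself must be executable.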

mpi4py on Perlmutter

Using mpi4py on Perlmutter is similar to using mpi4py on previous NERSC systems such as Cori. For the most part, the same recommendations apply on Perlmutter, especially if you are only using CPU nodes.

This section provides some additional recommendations for using mpi4py on Perlmutter GPU nodes. Please see the mpi4py documentation for details about GPU-aware MPI support in mpi4py.

Installing mpi4py with GPU-aware Cray MPICH

We recommend installing mpi4py with GPU-aware Cray MPICH. The following examples demonstrate how to install mpi4py with GPU-aware Cray MPICH. The examples explicitly load required modules which may already be loaded by default.

Using the GNU programming environment:

module load PrgEnv-gnu cray-mpich cudatoolkit craype-accel-nvidia80 python
conda create -n gpu-aware-mpi python -y
conda activate gpu-aware-mpi
MPICC="cc -shared" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

Using the NVIDIA programming environment:

module load PrgEnv-nvidia cray-mpich cudatoolkit craype-accel-nvidia80 python
conda create -n gpu-aware-mpi python=3.9 -y
conda activate gpu-aware-mpi
MPICC="cc -shared" CC=nvc CFLAGS="-noswitcherror" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

Note

If mpi4py is installed with GPU-aware Cray MPICH, you must have the CUDA runtime in your environment at runtime, even for CPU-only programs.

Using mpi4py with GPU-aware Cray MPICH

The following example demonstrates how to use an mpi4py built with GPU-aware Cray MPICH together with CuPy. See the CuPy section below for instructions on adding CuPy to your conda environment.

Here is a simple example using mpi4py with CuPy arrays:

from mpi4py import MPI
import cupy as cp
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
sendbuf = cp.arange(10, dtype='i')
recvbuf = cp.empty_like(sendbuf)
print(f"{rank=} before {sendbuf=} {recvbuf=}")
comm.Allreduce(sendbuf, recvbuf)
print(f"{rank=} after {sendbuf=} {recvbuf=}")
assert cp.allclose(recvbuf, sendbuf*size)
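The final assertion holds because every rank contributes the same sendbuf and Allreduce sums by default, so each element of recvbuf equals the matching element of sendbuf multiplied by the number of ranks. The pure-Python sketch below reproduces that arithmetic without MPI or a GPU. simulated_allreduce is a hypothetical stand-in used only for illustration and is not part of mpi4py; the choice of four ranks is arbitrary.

```python
def simulated_allreduce(send_buffers):
    """Element-wise sum across ranks; every rank receives the same result."""
    total = [sum(column) for column in zip(*send_buffers)]
    return [list(total) for _ in send_buffers]


size = 4                   # number of simulated ranks (arbitrary)
sendbuf = list(range(10))  # same contents as cp.arange(10) on every rank
recv = simulated_allreduce([sendbuf] * size)

# Each rank ends up with sendbuf scaled by the number of ranks.
assert all(r == [x * size for x in sendbuf] for r in recv)
print(recv[0])
```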

The following Slurm batch script can be used to run this program. MPICH_GPU_SUPPORT_ENABLED=1 should already be set by default (via the gpu module). It is included here explicitly to illustrate that it is required for GPU-aware Cray MPICH at runtime.

#!/bin/bash
#SBATCH --account=<account>
#SBATCH --constraint=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4

module load PrgEnv-gnu cray-mpich cudatoolkit craype-accel-nvidia80 python
conda activate gpu-aware-mpi
export MPICH_GPU_SUPPORT_ENABLED=1 

srun ./select_gpu_device python test-gpu-aware-mpi.py

Note

select_gpu_device is a wrapper script that maps each MPI task to a single GPU device:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"

This script may not always be necessary, as some applications or libraries handle the mapping of MPI tasks or processes to GPUs at runtime. See the Cray MPICH documentation by running man intro_mpi on Perlmutter for more information about using GPU-aware Cray MPICH.
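The same mapping can also be done inside the Python program instead of a wrapper script. This is a sketch under the assumption that CUDA_VISIBLE_DEVICES is set before any CUDA library (for example CuPy) initializes the GPU; pin_gpu_to_local_rank is a hypothetical helper name.

```python
import os


def pin_gpu_to_local_rank(environ=os.environ):
    """Expose only the GPU whose index equals this task's node-local rank."""
    local_id = environ.get("SLURM_LOCALID")
    if local_id is None:
        # Not launched by srun; leave GPU visibility unchanged.
        return None
    environ["CUDA_VISIBLE_DEVICES"] = local_id
    return local_id


# Call this at the top of the script, before importing cupy.
pin_gpu_to_local_rank()
```

With this approach the srun line would launch python directly, without the select_gpu_device wrapper.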

CuPy

CuPy requires CUDA which is provided by the cudatoolkit module on Perlmutter.

The following instructions demonstrate how to set up a custom conda environment to use CuPy on Perlmutter. They are adapted from the CuPy installation instructions.

Installing with pip

You can install cupy in your conda environment with pip. Make sure to load the cudatoolkit module and specify a cupy wheel that corresponds to the version of cudatoolkit from the module.

# Note the CUDA version from cudatoolkit (11.7)
module load cudatoolkit/11.7
module load python
# Create a new conda environment
conda create -n cupy-demo python=3.9 pip numpy scipy
# Activate the environment before using pip to install CuPy
conda activate cupy-demo
# Install the wheel compatible with CUDA 11.7
pip install cupy-cuda11X

When you use cupy with this environment you should make sure to load the corresponding cudatoolkit module.
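Picking the wheel comes down to the CUDA major version of the loaded module. The helper below is an illustrative sketch. It assumes the CuPy wheel naming scheme in which cupy-cuda11x covers CUDA 11.2 and later 11.x releases and cupy-cuda12x covers CUDA 12.x; please check the CuPy installation instructions for the releases you use. cupy_wheel_for is a hypothetical name.

```python
def cupy_wheel_for(cuda_version):
    """Map a CUDA version string such as "11.7" to a CuPy wheel name."""
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    # Assumption: wheel names follow the cupy-cuda<major>x convention.
    if major == 11 and minor >= 2:
        return "cupy-cuda11x"
    if major == 12:
        return "cupy-cuda12x"
    raise ValueError(f"no known CuPy wheel for CUDA {cuda_version}")


if __name__ == "__main__":
    # The version comes from the module name, e.g. cudatoolkit/11.7.
    print("pip install", cupy_wheel_for("11.7"))
```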

Installing from conda-forge

You can install both cudatoolkit and cupy from conda-forge.

module load python
conda create -c conda-forge -n cupy-demo python=3.9 pip numpy scipy cudatoolkit cupy

In this case, you should avoid loading the cudatoolkit module in your environment which could lead to conflicts with the cudatoolkit installed in your conda environment.

Building CuPy from Source using pip

You can also build CuPy from source on Perlmutter. The build instructions depend slightly on whether you are using PrgEnv-nvidia or PrgEnv-gnu.

  • Compiling with PrgEnv-gnu:
module load PrgEnv-gnu
module load cudatoolkit
LDFLAGS="$CRAY_CUDATOOLKIT_POST_LINK_OPTS" NVCC="nvcc $CRAY_CUDATOOLKIT_INCLUDE_OPTS" CC=cc CXX=CC pip install cupy
  • Compiling with PrgEnv-nvidia:
module load PrgEnv-nvidia
module load cudatoolkit
LDFLAGS="$CRAY_CUDATOOLKIT_POST_LINK_OPTS -L$CUDATOOLKIT_HOME/targets/x86_64-linux/lib" CFLAGS="-I$CUDATOOLKIT_HOME/targets/x86_64-linux/include" NVCC="nvcc $CRAY_CUDATOOLKIT_INCLUDE_OPTS" CC=gcc CXX=g++ pip install cupy 

CuPy builds can be customized in many ways

We recommend that you check out the list of customizations. For example, CUPY_NUM_BUILD_JOBS and CUPY_NUM_NVCC_THREADS can be used to increase the parallelism of your CuPy builds, and CUPY_CACHE_DIR can be used to relocate the directory where CuPy caches the CUDA code it generates.
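If you drive the build from Python, these variables can be collected into an environment dictionary and passed to the build process. The sketch below only assembles the environment. The job count, thread count, and cache path are placeholder values, not recommendations, and cupy_build_env is a hypothetical helper.

```python
import os


def cupy_build_env(jobs, nvcc_threads, cache_dir, base=None):
    """Return a copy of the environment with CuPy customization variables set."""
    env = dict(os.environ if base is None else base)
    env["CUPY_NUM_BUILD_JOBS"] = str(jobs)            # parallel build jobs
    env["CUPY_NUM_NVCC_THREADS"] = str(nvcc_threads)  # threads per nvcc call
    env["CUPY_CACHE_DIR"] = cache_dir                 # where CuPy caches CUDA code
    return env


if __name__ == "__main__":
    # Placeholder values chosen for illustration only.
    env = cupy_build_env(8, 2, "/tmp/cupy-cache")
    for key in ("CUPY_NUM_BUILD_JOBS", "CUPY_NUM_NVCC_THREADS", "CUPY_CACHE_DIR"):
        print(f"{key}={env[key]}")
```

The returned dictionary can be handed to the env argument of subprocess.run when invoking pip, or the same variables can simply be exported in the shell before running the pip commands shown earlier.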

JAX

Setting up JAX

NVIDIA Containers

We recommend that you use NVIDIA's JAX-Toolbox containers to install JAX. The process is relatively straightforward and will reliably give you a working installation of JAX.

You can start a GPU job using the ghcr.io/nvidia/jax:jax image via Shifter with the following Slurm script:

#!/bin/bash
#SBATCH --image=ghcr.io/nvidia/jax:jax
#SBATCH --nodes=1
#SBATCH --qos=regular
#SBATCH --constraint=gpu

srun shifter --module=gpu python3 ./my_script.py

You could also run a script locally as follows:

shifter --module=gpu --image=ghcr.io/nvidia/jax:jax python3 ./my_script.py

See the Shifter documentation for further details on how to use Shifter containers.

Pip Installation

It is possible to install JAX using pip together with Perlmutter modules. However, we recommend against it: finding the right combination of version numbers to get a working installation, both on a single node and especially in a distributed setting, is error-prone.

Using JAX on GPUs requires CUDA and cuDNN, which are provided by the cudatoolkit and cudnn modules on Perlmutter. The following instructions, adapted from the JAX installation instructions, demonstrate how to set up a custom Conda environment to use JAX on Perlmutter with the local CUDA 12 libraries:

module load cudatoolkit/12.2
module load cudnn/8.9.3_cuda12
module load python
# Create a new conda environment
conda create -n jax-demo python=3.9 pip numpy scipy
# Activate the environment before using pip to install JAX
conda activate jax-demo
# Install a compatible wheel
pip install --no-cache-dir "jax==0.4.23" "jaxlib[cuda12_cudnn89]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

You need to ensure that the version of JAX you are installing, here 0.4.23, was built against the versions of Python, cudatoolkit, and cudnn that you are using (it might, for example, have been built against CUDA 12.3, which would trigger a runtime error when you try to run JIT-compiled code on a GPU with CUDA 12.2). Thus, we recommend that you lock down all of your version numbers to ensure a successful installation.
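Once the versions are pinned, it can help to verify that the environment still matches the pins before launching a job. The sketch below compares installed package versions with an expected set using the standard library. check_pins is a hypothetical helper, and stripping a local version suffix (such as a "+cuda12..." tag on CUDA builds of jaxlib) is an assumption about how those wheels may be labeled.

```python
from importlib import metadata


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Assumption: ignore any local suffix after "+" when comparing.
        base = installed.split("+")[0] if installed is not None else None
        if base != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Versions taken from the pip command above.
    problems = check_pins({"jax": "0.4.23", "jaxlib": "0.4.23"})
    print(problems if problems else "all pinned versions match")
```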

Finding an appropriate JAX version number might take some trial and error, going down the list of current JAX releases.

Distributing JAX Computation

JAX offers robust support for multi-GPU computing within a single process through its Parallel Operators, particularly using pmap. For more information, refer to the Parallel Evaluation in JAX documentation.

To distribute computation across multiple nodes, you can use the jax.distributed module. The official multi-host and multi-process documentation, although very TPU-focused, also provides valuable insights. For a detailed guide on distributing JAX computation over a Slurm cluster, refer to Wassim Kabalan's tutorial.
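As a rough illustration of what a multi-node setup needs, the sketch below derives the arguments that jax.distributed.initialize accepts (coordinator_address, num_processes, process_id) from Slurm environment variables. It does not import JAX, so it can be tried anywhere. slurm_distributed_kwargs, the COORDINATOR_HOST variable, and the port number are hypothetical choices for this example, and JAX may be able to infer these values on its own under Slurm.

```python
import os


def slurm_distributed_kwargs(environ, coordinator_host, port=12345):
    """Build keyword arguments for jax.distributed.initialize from Slurm."""
    return {
        # Every process must agree on one host:port to rendezvous at.
        "coordinator_address": f"{coordinator_host}:{port}",
        # Total number of tasks launched by srun.
        "num_processes": int(environ["SLURM_NTASKS"]),
        # Global rank of this task.
        "process_id": int(environ["SLURM_PROCID"]),
    }


if __name__ == "__main__" and "SLURM_NTASKS" in os.environ:
    # COORDINATOR_HOST is a hypothetical variable you would export in the
    # batch script; "localhost" is only meaningful for single-node runs.
    host = os.environ.get("COORDINATOR_HOST", "localhost")
    kwargs = slurm_distributed_kwargs(os.environ, host)
    print(kwargs)
    # In a real job: import jax; jax.distributed.initialize(**kwargs)
```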

For an example of an idiomatic multi-node and multi-GPU JAX job running at NERSC with Slurm and containers, visit this GitHub repository.

Additionally, you can use mpi4jax to incorporate MPI operations within JAX JIT-compiled sections.

cuNumeric

cuNumeric is a Legate library that aims to provide a distributed and accelerated drop-in replacement for the NumPy API on top of the Legion runtime.

The following instructions demonstrate how to install cuNumeric using conda and run an example on a single Perlmutter GPU node:

# install cunumeric using conda
module load conda
conda create -n cunumeric -c nvidia -c conda-forge -c legate cunumeric 
conda activate cunumeric
# download cunumeric repo with examples
git clone https://github.com/nv-legate/cunumeric.git
cd cunumeric
# run an example program using 4 GPUs
# note use of the legate driver to launch the program
legate --gpus 4 examples/gemm.py -n 8000

Multi-node installation currently requires building legate-core and cuNumeric from source; see the nv-legate/quickstart repo for details.

cuNumeric is currently under active development and should be considered experimental or "beta" software. If you have any issues, we recommend opening an issue on the cuNumeric GitHub project.

Known issues

General issues

Our Known Issues page includes more general issues that may also impact Python users.

MPI issues

conda / cray-mpich / libfabric dependencies missing evp symbols

Conda + MPI/mpi4py users may see the following error after the December Perlmutter maintenance when attempting to use cray-mpich or related software inside an active conda environment:

OSError: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d

When mpi4py is built locally with cray-mpich, cray-mpich dependencies on system libraries are essentially pulled into the environment. The system libssh is not compatible with some versions of the openssl package that are automatically included in conda environments.

We've added an evp-patch module that provides a compatible set of libssh and openssl that can be loaded to work around the issue. This module is automatically loaded by the python, pytorch, and tensorflow modules.

If you are using your own conda installation, you can load this module to work around the issue.

elvis@perlmutter> module load evp-patch

Another potential workaround is to add a compatible libssh package directly to your conda environment by installing it from conda-forge:

elvis@perlmutter> conda activate myenv
elvis@perlmutter> conda install -c conda-forge libssh

Issues with fork() in MPI processes

Several Python users have encountered errors that look like this:

mlx5: nid003244: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 20009232 00000000 00000300
00003c40 92083204 000180b8 0085a0e0
MPICH ERROR [Rank 256] [job id 126699.1] [Wed Oct 20 12:32:36 2021] [nid003244] - Abort(70891919) (rank 256 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(415)..............: MPI_Gatherv failed(sbuf=0x55ee4a4ebea0, scount=88, MPI_BYTE, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_BYTE, root=0, comm=MPI_COMM_WORLD) failed
MPIR_CRAY_Gatherv(353).........: 
MPIC_Recv(197).................: 
MPIC_Wait(71)..................: 
MPIR_Wait_impl(41).............: 
MPID_Progress_wait(186)........: 
MPIDI_Progress_test(80)........: 
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - local protection error)

or this:

File "mpi4py/MPI/Comm.pyx", line 1595, in mpi4py.MPI.Comm.allgather
File "mpi4py/MPI/msgpickle.pxi", line 873, in mpi4py.MPI.PyMPI_allgather
File "mpi4py/MPI/msgpickle.pxi", line 177, in mpi4py.MPI.pickle_loadv
File "mpi4py/MPI/msgpickle.pxi", line 152, in mpi4py.MPI.pickle_load
File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: invalid load key, '\x00'.

File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: pickle data was truncated

File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: unpickling stack underflow

These error messages seem to be related to the use of fork() within an MPI process. For example, using the subprocess module to spawn processes, or calling a library function such as os.uname, indirectly uses the fork() system call in a Python application.

This is considered undefined behavior in the MPI standard, and the outcome may vary with the MPI implementation, the MPI middleware, and the network hardware.

On Slingshot 11 (SS11), setting these environment variables may help circumvent the error:

export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1

On Slingshot 10 (SS10; Perlmutter Phase 1 GPU nodes only), setting these environment variables may help circumvent the error:

export IBV_FORK_SAFE=1
export RDMAV_HUGEPAGES_SAFE=1

However, it's possible these variables will not help. The most robust mitigation is to avoid spawning forks/subprocesses in MPI applications.
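In practice, avoiding forks often means replacing shell-outs with in-process calls from the standard library. The sketch below gathers a few pieces of node information without spawning a child process. node_info is a hypothetical helper, and the specific fields are only examples of commands that applications commonly run through subprocess.

```python
import os
import shutil
import socket


def node_info(directory="."):
    """Collect host details without spawning a child process."""
    return {
        # Instead of running `hostname` in a subprocess.
        "hostname": socket.gethostname(),
        # Instead of running `ls`.
        "entries": sorted(os.listdir(directory)),
        # Instead of running `df`.
        "free_bytes": shutil.disk_usage(directory).free,
    }


if __name__ == "__main__":
    info = node_info()
    print(info["hostname"], len(info["entries"]), info["free_bytes"])
```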

If you see this error and/or have questions about this, please open a ticket.

Using AMD CPUs on Perlmutter

Python users should be aware that using the Intel MKL library may be slow on Perlmutter's AMD CPUs, although it is often still faster than OpenBLAS.

We advise users to try our MKL workaround via:

module load fast-mkl-amd