
Guide to Using Python on Perlmutter

This page provides important information and tips about using Python on Perlmutter. Please be aware that the programming environment on Perlmutter changes quickly, and it may be difficult to keep this page fully up to date. We will do our best, but please contact us if you find anything that appears incorrect or deprecated.

Known issues

Python issues

Several Python teams have encountered errors that look like:

mlx5: nid003244: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 20009232 00000000 00000300
00003c40 92083204 000180b8 0085a0e0
MPICH ERROR [Rank 256] [job id 126699.1] [Wed Oct 20 12:32:36 2021] [nid003244] - Abort(70891919) (rank 256 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(415)..............: MPI_Gatherv failed(sbuf=0x55ee4a4ebea0, scount=88, MPI_BYTE, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_BYTE, root=0, comm=MPI_COMM_WORLD) failed
MPIR_CRAY_Gatherv(353).........: 
MPIC_Recv(197).................: 
MPIC_Wait(71)..................: 
MPIR_Wait_impl(41).............: 
MPID_Progress_wait(186)........: 
MPIDI_Progress_test(80)........: 
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - local protection error)

The cause is forking a process within an MPI process, for example by using the Python subprocess module inside an mpi4py rank. This is undefined behavior in the MPI standard, and the result may change depending on the MPI implementation, MPI middleware, and network hardware. We expect it may even change during Perlmutter Phase 2.

Setting these environment variables may help fix this:

export IBV_FORK_SAFE=1
export RDMAV_HUGEPAGES_SAFE=1

However, it is possible that these variables will not fix the error. The most robust and future-proof strategy is to remove any spawning of forks/subprocesses from mpi4py code.
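As a sketch of the kind of refactor this implies, an in-process standard-library call can often replace a forked subprocess. The compression task below is a hypothetical illustration, not taken from any specific user code:

```python
# Hypothetical task: compress a file from within an mpi4py rank.
import gzip
import shutil
import subprocess  # shown only for contrast; avoid inside MPI ranks


def compress_with_fork(path):
    # Forks a child process -- undefined behavior inside an MPI rank.
    subprocess.run(["gzip", "--keep", path], check=True)


def compress_in_process(path):
    # Same result with no fork: safe to call from an mpi4py rank.
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Many common subprocess uses (compression, file manipulation, text processing) have standard-library or third-party in-process equivalents.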

If you see this error and/or have questions about this, please open a ticket.

General issues

Our Known Issues page includes more general issues that may also impact Python users.

mpi4py on Perlmutter

The most current release of mpi4py includes CUDA-aware capabilities. If you intend to use mpi4py to transfer GPU objects, you will need a CUDA-aware build of mpi4py.

The mpi4py you obtain via module load python is CUDA-aware. The mpi4py in module load cray-python is not currently CUDA-aware.

If the mpi4py you are using is CUDA-aware, you must have cudatoolkit loaded whenever you use it, even for CPU-only code, because mpi4py will look for CUDA libraries at runtime.

Building CUDA-aware mpi4py

Here is an example that demonstrates building CUDA-aware mpi4py in a custom conda environment:

module load PrgEnv-gnu cudatoolkit python
conda create -n cudaaware python=3.9 -y
conda activate cudaaware
MPICC="cc -target-accel=nvidia80 -shared" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

Testing CUDA-aware mpi4py with CuPy

You can test that your CUDA-aware mpi4py installation is working with an example like test-cuda-aware-mpi4py.py:

from mpi4py import MPI
import cupy as cp
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
print("starting reduce")
sendbuf = cp.arange(10, dtype='i')
recvbuf = cp.empty_like(sendbuf)
print("rank:", rank, "sendbuff:", sendbuf)
print("rank:", rank, "recvbuff:", recvbuf)
assert hasattr(sendbuf, '__cuda_array_interface__')
assert hasattr(recvbuf, '__cuda_array_interface__')
comm.Allreduce(sendbuf, recvbuf)
print("finished reduce")
print("rank:", rank, "sendbuff:", sendbuf)
print("rank:", rank, "recvbuff:", recvbuf)
assert cp.allclose(recvbuf, sendbuf*size)

Test on one node:

MPICH_GPU_SUPPORT_ENABLED=1 srun -C gpu -n 1 --gpus-per-node=1 python test-cuda-aware-mpi4py.py

Test on two nodes:

MPICH_GPU_SUPPORT_ENABLED=1 srun -C gpu -N 2 --ntasks-per-node 2 --gpus-per-node 2 --gpu-bind=single:1 python test-cuda-aware-mpi4py.py

Python modules

NERSC provides semi-custom Anaconda Python installations. You can use them via module load python.

You will also find a Cray-provided Python module, cray-python/3.8.5.0, but it is not conda-based. Note that the mpi4py provided in the cray-python module is not CUDA-aware.

Please note that Python 2.7 reached end of life in January 2020, so NERSC will not provide Python 2 on Perlmutter.

Customizing Python stacks

We strongly encourage users to install and customize their own software stacks at NERSC via conda environments. You can also customize your Python software stack via Shifter. If you are interested in installing or using Python in other ways, please contact us so we can help.

CuPy

CuPy requires CUDA which is provided by the cudatoolkit module on Perlmutter.

The following instructions demonstrate how to set up a custom conda environment to use CuPy on Perlmutter. They are adapted from the CuPy installation instructions.

# Note the CUDA version from cudatoolkit (11.4)
module load cudatoolkit/21.9_11.4
module load python
# Create a new conda environment
conda create -n cupy-demo python=3.9 pip numpy scipy
# Activate the environment before using pip to install CuPy
conda activate cupy-demo
# Install the wheel compatible with CUDA 11.4
pip install cupy-cuda114
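Because CuPy mirrors the NumPy API, a common portability pattern is to fall back to NumPy where no GPU is available. This is a minimal sketch of that pattern; the array computation here is arbitrary:

```python
# CuPy mirrors the NumPy API, so the same code can run on GPU or CPU.
try:
    import cupy as xp  # GPU path, e.g. on Perlmutter GPU nodes
except ImportError:
    import numpy as xp  # CPU fallback elsewhere

a = xp.arange(6, dtype=xp.float32).reshape(2, 3)
b = a.T @ a  # identical call for both backends
print(type(b).__module__, b.shape)
```

On a Perlmutter GPU node with the environment above, `xp` resolves to CuPy and the arrays live in GPU memory; on a CPU-only system the same script runs unchanged with NumPy.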

Use pip to install CuPy

We recommend installing CuPy using pip install ... rather than conda install -c conda-forge ..., to avoid bringing in the cudatoolkit package from conda-forge, which could lead to conflicts or confusion with the system CUDA libraries.

JAX

JAX on GPUs requires CUDA and cuDNN, which are provided by the cudatoolkit and cudnn modules on Perlmutter.

The following instructions demonstrate how to set up a custom conda environment to use JAX on Perlmutter. They are adapted from the JAX installation instructions.

# Note the CUDA version from cudatoolkit (11.4)
module load cudatoolkit/21.9_11.4
# Also note the version of cuDNN (8.2.0)
module load cudnn/8.2.0
module load python
# Create a new conda environment 
conda create -n jax-demo python=3.9 pip numpy scipy
# Activate the environment before using pip to install JAX
conda activate jax-demo
# Install the wheel compatible with CUDA 11 and cuDNN 8.2
pip install jax[cuda11_cudnn82] -f https://storage.googleapis.com/jax-releases/jax_releases.html
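Once installed, a short script can confirm that JAX imports and runs. This is a generic sketch (not Perlmutter-specific): on a GPU node jax.devices() should list GPU devices, while on a CPU-only system JAX falls back to the CPU backend:

```python
import jax
import jax.numpy as jnp

# Which devices does JAX see? (GPU entries on a GPU node, CPU otherwise)
print(jax.devices())

# A trivial jit-compiled computation to exercise the XLA backend.
double_sum = jax.jit(lambda v: (v * 2.0).sum())
result = double_sum(jnp.arange(10.0))
print(result)
```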

Using AMD CPUs on Perlmutter

Python users should be aware that using the Intel MKL library may be slow on Perlmutter's AMD CPUs, although it is often still faster than OpenBLAS.

We advise users to try our MKL workaround via

module load fast-mkl-amd
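To see which BLAS/LAPACK library your NumPy is actually linked against (MKL or OpenBLAS), you can inspect its build configuration. This is a generic NumPy check, not specific to the fast-mkl-amd module:

```python
import numpy as np

# Prints the BLAS/LAPACK build information NumPy was compiled with,
# which indicates whether MKL or OpenBLAS is in use.
np.show_config()

# A quick sanity check that the linked BLAS produces correct results.
a = np.ones((100, 100))
print((a @ a)[0, 0])
```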