Guide to Using Python on Perlmutter

This page provides important information and tips about using Python on Perlmutter. Please be aware that the programming environment on Perlmutter changes quickly, so it may be difficult to keep this page fully up to date. We will do our best, but please contact us if you find anything that appears incorrect or deprecated.

Python modules

NERSC provides semi-custom Anaconda Python installations. You can use them via module load python.

You will also find a Cray-provided Python module, cray-python, which is not conda-based.

Please note that Python 2.7 reached end of life in 2020, so NERSC does not provide Python 2 on Perlmutter.

Customizing Python stacks

We strongly encourage users to install and customize their own software stacks at NERSC via conda environments (see the sketch below). We also encourage users to customize their Python software stacks via Shifter. If you are interested in installing or using Python in other ways, please contact us so we can help you.
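
A minimal sketch of this workflow (the environment name and package list here are illustrative):

module load python
# Create and activate a project-specific conda environment
conda create -n myenv python=3.9 numpy -y
conda activate myenv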

cudatoolkit dependency

Many Python packages that use GPUs depend on cudatoolkit. NERSC provides a cudatoolkit module that can satisfy this dependency in many cases. The conda-forge channel also provides a cudatoolkit package which conda users can install into their environments.

There are similar modules on Perlmutter and packages on conda-forge for other common dependencies such as nccl, cutensor, and cudnn.

module load vs conda install

You can use either the cudatoolkit module or the cudatoolkit package installed from conda-forge, but we suggest that you avoid using both: do not module load cudatoolkit if you have cudatoolkit installed in your conda environment.
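
A quick way to check which one you are using (a sketch; conda list only reflects the currently activated environment):

# Is the module loaded? (module list reports to stderr)
module list 2>&1 | grep cudatoolkit
# Is the package installed in the active conda environment?
conda list cudatoolkit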

Some packages available on conda-forge do not assume you already have cudatoolkit installed and will install cudatoolkit into your conda environment.

missing nvcc

The cudatoolkit module provides the nvcc compiler, which is not provided by the cudatoolkit package from conda-forge. If you are using cudatoolkit from conda-forge and your Python application needs to JIT compile CUDA code, you may need to install cuda-nvcc from the nvidia conda channel.

using cudatoolkit module in jupyter

To use the cudatoolkit module in a conda environment kernel from Jupyter, you will need to modify the kernel's kernel.json file so that it uses a helper shell script which loads the cudatoolkit module and activates the environment.

For example, if you've created a kernel for a conda environment named "mygpuenv", your kernel-helper.sh script might look like this:

#!/bin/bash
# Load the cudatoolkit module before activating the conda environment
module load cudatoolkit
module load python
source activate mygpuenv
# Run the command passed in by the Jupyter kernel launcher
exec "$@"

mpi4py on Perlmutter

mpi4py releases 3.1.0 and later include CUDA-aware capabilities. If you intend to use mpi4py to transfer GPU objects, you will need a CUDA-aware mpi4py.

The mpi4py provided by the python or cray-python modules is not CUDA-aware. You will have to build CUDA-aware mpi4py in a custom environment using the instructions below.

If the mpi4py you are using is CUDA-aware, you must have cudatoolkit loaded whenever you use it, even for CPU-only code, because mpi4py will look for CUDA libraries at runtime.

Building CUDA-aware mpi4py

Here is an example that demonstrates building CUDA-aware mpi4py in a custom conda environment using the GCC compiler suite:

module load PrgEnv-gnu cudatoolkit python
conda create -n cudaaware python=3.9 -y
conda activate cudaaware
MPICC="cc -target-accel=nvidia80 -shared" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

Here is a similar example that uses the NVIDIA compiler suite instead of the GCC compiler suite:

module load PrgEnv-nvidia cudatoolkit python
conda create -n cudaaware python=3.9 -y
conda activate cudaaware
MPICC="cc -target-accel=nvidia80 -shared" CC=nvc CFLAGS="-noswitcherror" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

Testing CUDA-aware mpi4py with CuPy

You can test that your CUDA-aware mpi4py installation is working with an example like test-cuda-aware-mpi4py.py:

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

print("starting reduce")
# Allocate the send and receive buffers directly on the GPU
sendbuf = cp.arange(10, dtype='i')
recvbuf = cp.empty_like(sendbuf)
print("rank:", rank, "sendbuf:", sendbuf)
print("rank:", rank, "recvbuf:", recvbuf)

# Confirm these are GPU arrays (they expose the CUDA array interface)
assert hasattr(sendbuf, '__cuda_array_interface__')
assert hasattr(recvbuf, '__cuda_array_interface__')

# CUDA-aware MPI reduces directly from GPU memory
comm.Allreduce(sendbuf, recvbuf)
print("finished reduce")
print("rank:", rank, "sendbuf:", sendbuf)
print("rank:", rank, "recvbuf:", recvbuf)

# Every rank contributed the same array, so the result is sendbuf * size
assert cp.allclose(recvbuf, sendbuf*size)

You can follow the CuPy instructions below for adding CuPy to your conda environment.

Test on one node:

MPICH_GPU_SUPPORT_ENABLED=1 srun -C gpu -n 1 --gpus-per-node=1 python test-cuda-aware-mpi4py.py

Test on two nodes:

MPICH_GPU_SUPPORT_ENABLED=1 srun -C gpu -N 2 --ntasks-per-node 2 --gpus-per-node 2 --gpu-bind=single:1 python test-cuda-aware-mpi4py.py

CuPy

CuPy requires CUDA, which is provided by the cudatoolkit module on Perlmutter.

The following instructions demonstrate how to set up a custom conda environment to use CuPy on Perlmutter. They are adapted from the CuPy installation instructions.

Installing with pip

You can install CuPy in your conda environment with pip. Make sure to load the cudatoolkit module and specify a CuPy wheel that corresponds to the version of cudatoolkit from the module.

# Note the CUDA version from cudatoolkit (11.7)
module load cudatoolkit/11.7
module load python
# Create a new conda environment
conda create -n cupy-demo python=3.9 pip numpy scipy
# Activate the environment before using pip to install CuPy
conda activate cupy-demo
# Install the wheel compatible with CUDA 11.x (covers 11.7)
pip install cupy-cuda11x

When you use CuPy with this environment, make sure to load the corresponding cudatoolkit module.
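
Once installed, you can run a quick sanity check from a GPU compute node (a sketch; cupy.show_config() reports the CUDA toolkit and driver versions CuPy detected):

python -c "import cupy; cupy.show_config()"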

Installing from conda-forge

You can install both cudatoolkit and cupy from conda-forge.

module load python
conda create -c conda-forge -n cupy-demo python=3.9 pip numpy scipy cudatoolkit cupy

In this case, you should avoid loading the cudatoolkit module, which could lead to conflicts with the cudatoolkit package installed in your conda environment.

Building CuPy from Source using pip

You can also build CuPy from source on Perlmutter. The build instructions depend slightly on whether you're using PrgEnv-gnu or PrgEnv-nvidia.

  • Compiling with PrgEnv-gnu:
module load PrgEnv-gnu
module load cudatoolkit
LDFLAGS="$CRAY_CUDATOOLKIT_POST_LINK_OPTS" NVCC="nvcc $CRAY_CUDATOOLKIT_INCLUDE_OPTS" CC=cc CXX=CC pip install cupy
  • Compiling with PrgEnv-nvidia:
module load PrgEnv-nvidia
module load cudatoolkit
LDFLAGS="$CRAY_CUDATOOLKIT_POST_LINK_OPTS -L$CUDATOOLKIT_HOME/targets/x86_64-linux/lib" CFLAGS="-I$CUDATOOLKIT_HOME/targets/x86_64-linux/include" NVCC="nvcc $CRAY_CUDATOOLKIT_INCLUDE_OPTS" CC=gcc CXX=g++ pip install cupy 

CuPy builds can be customized in many ways

We recommend that you check out the list of customizations. For example, CUPY_NUM_BUILD_JOBS and CUPY_NUM_NVCC_THREADS can be used to increase the parallelism of your CuPy builds, and CUPY_CACHE_DIR can be used to change the location where CuPy stores generated CUDA code.
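
A hedged sketch of using these variables (values and paths are illustrative; the first two are read at build time, while CUPY_CACHE_DIR takes effect at runtime):

# Increase build parallelism (read at build time)
export CUPY_NUM_BUILD_JOBS=8
export CUPY_NUM_NVCC_THREADS=4
pip install --no-binary=cupy cupy
# Relocate the JIT kernel cache (read at runtime)
export CUPY_CACHE_DIR=$SCRATCH/.cupy/kernel_cache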

JAX

JAX on GPUs requires CUDA and cuDNN, which are provided by the cudatoolkit and cudnn modules on Perlmutter.

The following instructions demonstrate how to set up a custom conda environment to use JAX on Perlmutter. They are adapted from the JAX installation instructions. After loading the cudatoolkit and cudnn modules, verify that they are compatible with the latest version of JAX.

module load cudatoolkit
module load cudnn
module load python
# Verify the versions of cudatoolkit and cudnn are compatible with JAX
module list
# Create a new conda environment 
conda create -n jax-demo python=3.9 pip numpy scipy
# Activate the environment before using pip to install JAX
conda activate jax-demo
# Install the latest wheel
pip install "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
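
To verify the install, you can list the devices JAX sees (a quick sanity check; run it on a GPU compute node, where it should report GPU devices rather than falling back to CPU):

python -c "import jax; print(jax.devices())"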

Known issues

General issues

Our Known Issues page includes more general issues that may also impact Python users.

MPI issues

Missing support for Matched probe/recv

Perlmutter's SS11 libfabric currently lacks support for the MPI functions MPI_Mprobe()/MPI_Improbe() and MPI_Mrecv()/MPI_Imrecv(). This may impact mpi4py applications that use Comm.send/Comm.recv to exchange pickled objects, as well as uses of mpi4py.futures. If these missing functions are impacting your workload, please open a ticket to let us know.

A potential mitigation while we await a fix is to disable matched receive probes in mpi4py, either with an environment variable (export MPI4PY_RC_RECV_MPROBE=0) or with a runtime configuration option (mpi4py.rc.recv_mprobe = False), as sketched below.
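
A minimal sketch of the runtime option (mpi4py.rc must be set before mpi4py.MPI is first imported):

import mpi4py
# Disable matched probes/receives before importing the MPI module
mpi4py.rc.recv_mprobe = False
from mpi4py import MPI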

The following are example error messages related to this issue:

Traceback (most recent call last):
  File ".../test.py", line 10, in <module>
    data = comm.recv(source=0, tag=11)
  File "mpi4py/MPI/Comm.pyx", line 1438, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 341, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 299, in mpi4py.MPI.PyMPI_recv_match
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Mprobe(118)........: MPI_Mprobe(source=0, tag=11, comm=MPI_COMM_WORLD, message=0x7ffeadbaac00, status=0x7ffeadbaac10)
PMPI_Mprobe(101)........:
MPID_Mprobe(199)........:
MPIDI_improbe_safe(146).:
MPIDI_improbe_unsafe(88):
(unknown)(): Other MPI error

Exception in thread Thread-1:
Traceback (most recent call last):
  File ".../lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File ".../lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File ".../lib/python3.9/site-packages/mpi4py/futures/_lib.py", line 357, in _manager_shared
    client_exec(comm, options, tag, workers, queue)
  File ".../lib/python3.9/site-packages/mpi4py/futures/_lib.py", line 565, in client_exec
    recv()
  File ".../lib/python3.9/site-packages/mpi4py/futures/_lib.py", line 525, in recv
    future, request = pending.pop(pid)

Issues with fork() in MPI processes

Several Python users have encountered errors that look like this:

mlx5: nid003244: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 20009232 00000000 00000300
00003c40 92083204 000180b8 0085a0e0
MPICH ERROR [Rank 256] [job id 126699.1] [Wed Oct 20 12:32:36 2021] [nid003244] - Abort(70891919) (rank 256 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(415)..............: MPI_Gatherv failed(sbuf=0x55ee4a4ebea0, scount=88, MPI_BYTE, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_BYTE, root=0, comm=MPI_COMM_WORLD) failed
MPIR_CRAY_Gatherv(353).........: 
MPIC_Recv(197).................: 
MPIC_Wait(71)..................: 
MPIR_Wait_impl(41).............: 
MPID_Progress_wait(186)........: 
MPIDI_Progress_test(80)........: 
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - local protection error)

or this:

File "mpi4py/MPI/Comm.pyx", line 1595, in mpi4py.MPI.Comm.allgather
File "mpi4py/MPI/msgpickle.pxi", line 873, in mpi4py.MPI.PyMPI_allgather
File "mpi4py/MPI/msgpickle.pxi", line 177, in mpi4py.MPI.pickle_loadv
File "mpi4py/MPI/msgpickle.pxi", line 152, in mpi4py.MPI.pickle_load
File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: invalid load key, '\x00'.

File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: pickle data was truncated

File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: unpickling stack underflow

These error messages seem to be related to the use of fork() within an MPI process. For example, using the subprocess module to spawn processes, or calling a library function (such as os.uname) that indirectly uses the fork() system call, can trigger this in a Python application.

This is considered undefined behavior in the MPI standard, and the symptoms may change depending on the MPI implementation, MPI middleware, and network hardware.

On SS11, setting these environment variables may help circumvent the error:

export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1

On SS10 (Perlmutter Phase 1 GPU nodes only), setting these environment variables may help circumvent the error:

export IBV_FORK_SAFE=1
export RDMAV_HUGEPAGES_SAFE=1

However, it's possible these variables will not help. The most robust mitigation is to avoid forking or spawning subprocesses in MPI applications.

If you see this error and/or have questions about this, please open a ticket.

Using AMD CPUs on Perlmutter

Python users should be aware that the Intel MKL library may be slow on Perlmutter's AMD CPUs, although it is often still faster than OpenBLAS.

We advise users to try our MKL workaround via:

module load fast-mkl-amd
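
To confirm which BLAS/LAPACK library your NumPy is linked against after loading the module (a sketch; the output format varies with NumPy version):

python -c "import numpy; numpy.show_config()"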