Guide to Using Python on Perlmutter¶
We aim to provide important information and tips about using Python on Perlmutter. Please be aware that the programming environment on Perlmutter changes quickly and it may be difficult to keep this page fully up to date. We will do our best, but we welcome you to contact us if you find anything that appears incorrect or deprecated. Please also review our current Python on Perlmutter Known Issues.
Python modules¶
NERSC provides customized Anaconda Python installations. You can use them via module load python.
The default programming environment provided by Cray/HPE includes a cray-python module with numpy and scipy configured with Cray LibSci and mpi4py configured with Cray MPICH.
Please note that Python 2.7 was retired on January 1, 2020. NERSC will not provide Python 2 on Perlmutter.
Customizing your Python environment¶
We strongly encourage the use of conda environments at NERSC for Python users to install and customize their own software environment. We also encourage Python users to customize their software stacks via Shifter. If you are interested in installing or using Python in other ways, please contact us so we can help you.
cudatoolkit dependency¶
Many Python packages that use GPUs depend on cudatoolkit. NERSC provides a cudatoolkit module that can satisfy that dependency in many cases. The conda-forge channel also provides a cudatoolkit package which conda users can install into their environment.
There are similar modules on Perlmutter and packages on conda-forge for other common dependencies such as nccl, cutensor, and cudnn.
module load vs conda install¶
You can use either the cudatoolkit module or the cudatoolkit package installed from conda-forge. We suggest that you avoid using both. Do not module load cudatoolkit if you have cudatoolkit installed in your conda environment.
Some packages available on conda-forge do not assume you already have cudatoolkit installed and will install cudatoolkit into your conda environment.
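The advice above can be checked from inside a session. The following is a minimal sketch, not an official NERSC tool: it assumes the module system lists loaded modules in the LOADEDMODULES environment variable (as Environment Modules and Lmod do) and that conda records installed packages as JSON files under conda-meta in the environment prefix. The function name cudatoolkit_sources is hypothetical.

```python
import os
from pathlib import Path


def cudatoolkit_sources(env, conda_prefix):
    """Return the cudatoolkit providers that appear active: 'module', 'conda', or both."""
    sources = []
    # LOADEDMODULES is a colon-separated list such as "PrgEnv-gnu/8.5.0:cudatoolkit/11.7".
    loaded = env.get("LOADEDMODULES", "").split(":")
    if any(name.split("/")[0] == "cudatoolkit" for name in loaded):
        sources.append("module")
    # conda writes one conda-meta/<name>-<version>-<build>.json file per installed package.
    if conda_prefix:
        meta_dir = Path(conda_prefix) / "conda-meta"
        if meta_dir.is_dir() and any(meta_dir.glob("cudatoolkit-*.json")):
            sources.append("conda")
    return sources


if __name__ == "__main__":
    found = cudatoolkit_sources(os.environ, os.environ.get("CONDA_PREFIX"))
    if len(found) == 2:
        print("Warning: cudatoolkit comes from both a module and the conda environment.")
    else:
        print(f"cudatoolkit providers detected: {found or 'none'}")
```

If both providers are reported, unloading the module or removing the conda package should bring the environment back in line with the recommendation above.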
missing nvcc
The cudatoolkit module provides the nvcc compiler, which is not provided by the cudatoolkit package from conda-forge. If you are using cudatoolkit from conda-forge and your Python application needs to JIT compile CUDA, you may need to install cuda-nvcc from the nvidia conda channel.
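One way to find out early whether nvcc is missing is to look for it on the search path before any JIT compilation starts. This is a small sketch using only the standard library; the helper name find_nvcc is hypothetical.

```python
import shutil


def find_nvcc(search_path=None):
    """Return the full path to nvcc, or None if it is not on the search path.

    With search_path=None, shutil.which() searches the PATH of the current process.
    """
    return shutil.which("nvcc", path=search_path)


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found: load the cudatoolkit module or install cuda-nvcc.")
    else:
        print(f"nvcc found at {nvcc}")
```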
using cudatoolkit module in jupyter¶
To use the cudatoolkit module in a conda environment kernel from Jupyter, you will need to modify the kernel's kernel.json file to use a helper shell script to load the cudatoolkit module and activate the environment.
For example, if you've created a kernel for a conda environment named "mygpuenv", your kernel-helper.sh script might look like this:
#!/bin/bash
module load cudatoolkit
module load python
source activate mygpuenv
exec "$@"
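The kernel.json change itself can be sketched in Python. This sketch assumes the usual Jupyter kernel spec layout, in which the "argv" list holds the command that starts the kernel; because the helper script ends with exec "$@", placing the helper first in argv runs the original command after the modules are loaded. The function names and any paths you pass in are hypothetical.

```python
import json
from pathlib import Path


def wrap_kernel_argv(spec, helper_path):
    """Return a copy of a kernel spec whose argv is launched through helper_path."""
    new_spec = dict(spec)
    argv = list(spec["argv"])
    if argv and argv[0] != helper_path:  # avoid prepending the helper twice
        argv.insert(0, helper_path)
    new_spec["argv"] = argv
    return new_spec


def patch_kernel_json(kernel_json, helper_path):
    """Rewrite a kernel.json file in place so the kernel starts via the helper script."""
    path = Path(kernel_json)
    spec = json.loads(path.read_text())
    path.write_text(json.dumps(wrap_kernel_argv(spec, helper_path), indent=1) + "\n")
```

The helper script must be executable (for example, chmod +x kernel-helper.sh), since Jupyter runs the first argv entry directly.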
mpi4py on Perlmutter¶
Using mpi4py on Perlmutter is similar to using mpi4py on previous NERSC systems such as Cori. For the most part, the same recommendations apply on Perlmutter, especially if you are only using CPU nodes.
This section provides some additional recommendations for using mpi4py on Perlmutter GPU nodes. Please see the mpi4py documentation for details about GPU-aware MPI support in mpi4py.
Installing mpi4py with GPU-aware Cray MPICH¶
We recommend installing mpi4py with GPU-aware Cray MPICH. The following examples demonstrate how to install mpi4py with GPU-aware Cray MPICH. The examples explicitly load required modules which may already be loaded by default.
Using the GNU programming environment:
module load PrgEnv-gnu cray-mpich cudatoolkit craype-accel-nvidia80 python
conda create -n gpu-aware-mpi python -y
conda activate gpu-aware-mpi
MPICC="cc -shared" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py
Using the NVIDIA programming environment:
module load PrgEnv-nvidia cray-mpich cudatoolkit craype-accel-nvidia80 python
conda create -n gpu-aware-mpi python=3.9 -y
conda activate gpu-aware-mpi
MPICC="cc -shared" CC=nvc CFLAGS="-noswitcherror" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py
Note
If mpi4py is installed with GPU-aware Cray MPICH, you must have the CUDA runtime in your environment at runtime, even for CPU-only programs.
Using mpi4py with GPU-aware Cray MPICH¶
The following example demonstrates how to use an mpi4py built with GPU-aware Cray MPICH together with CuPy. See the CuPy section below for instructions on adding CuPy to your conda environment.
Here is a simple example using mpi4py with CuPy arrays:
from mpi4py import MPI
import cupy as cp
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
sendbuf = cp.arange(10, dtype='i')
recvbuf = cp.empty_like(sendbuf)
print(f"{rank=} before {sendbuf=} {recvbuf=}")
comm.Allreduce(sendbuf, recvbuf)
print(f"{rank=} after {sendbuf=} {recvbuf=}")
assert cp.allclose(recvbuf, sendbuf*size)
The following Slurm batch script can be used to run this program. MPICH_GPU_SUPPORT_ENABLED=1 should already be set by default (via the gpu module). It is included here explicitly to illustrate that it is required for GPU-aware Cray MPICH at runtime.
#!/bin/bash
#SBATCH --account=<account>
#SBATCH --constraint=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
module load PrgEnv-gnu cray-mpich cudatoolkit craype-accel-nvidia80 python
conda activate gpu-aware-mpi
export MPICH_GPU_SUPPORT_ENABLED=1
srun ./select_gpu_device python test-gpu-aware-mpi.py
Note
select_gpu_device is a wrapper script that maps each MPI task to a single GPU device:
#!/bin/bash
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
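The same mapping can also be done inside the Python program instead of a wrapper script. This is a sketch under the assumption that the job is launched with srun, which sets SLURM_LOCALID for each task; the function name select_gpu_device is hypothetical. The variable has to be set before CUDA is initialized, so call the function before importing CuPy or making any other CUDA call.

```python
import os


def select_gpu_device(env=os.environ):
    """Restrict this process to the GPU whose index matches its node-local Slurm task ID."""
    local_id = env.get("SLURM_LOCALID")
    if local_id is not None:
        # CUDA reads CUDA_VISIBLE_DEVICES once, when it initializes.
        env["CUDA_VISIBLE_DEVICES"] = local_id
    return env.get("CUDA_VISIBLE_DEVICES")


if __name__ == "__main__":
    print(f"CUDA_VISIBLE_DEVICES={select_gpu_device()}")
```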
See man intro_mpi on Perlmutter for more information about using GPU-aware Cray MPICH.
CuPy¶
CuPy requires CUDA, which is provided by the cudatoolkit module on Perlmutter.
The following instructions demonstrate how to set up a custom conda environment to use CuPy on Perlmutter. They are adapted from the CuPy installation instructions.
Installing with pip¶
You can install cupy in your conda environment with pip. Make sure to load the cudatoolkit module and specify a cupy wheel that corresponds to the version of cudatoolkit from the module.
# Note the CUDA version from cudatoolkit (11.7)
module load cudatoolkit/11.7
module load python
# Create a new conda environment
conda create -n cupy-demo python=3.9 pip numpy scipy
# Activate the environment before using pip to install CuPy
conda activate cupy-demo
# Install the wheel compatible with CUDA 11.7
pip install cupy-cuda11x
When you use cupy with this environment you should make sure to load the corresponding cudatoolkit module.
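Picking the wheel that matches the loaded module can be sketched as a small lookup. This sketch assumes the wheel naming scheme in which one wheel covers a whole CUDA major version, and it only handles the CUDA 11 and CUDA 12 wheels; the function name cupy_wheel_for is hypothetical.

```python
def cupy_wheel_for(cuda_version):
    """Map a CUDA version or module name ('11.7', 'cudatoolkit/12.2') to a CuPy wheel name."""
    version = cuda_version.rsplit("/", 1)[-1]  # drop a leading "cudatoolkit/" if present
    major = int(version.split(".")[0])
    if major not in (11, 12):
        # Assumption: only the CUDA 11.x and 12.x wheels are handled in this sketch.
        raise ValueError(f"no wheel known to this sketch for CUDA {version}")
    return f"cupy-cuda{major}x"


if __name__ == "__main__":
    print(cupy_wheel_for("cudatoolkit/11.7"))
```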
Installing from conda-forge¶
You can install both cudatoolkit and cupy from conda-forge.
module load python
conda create -c conda-forge -n cupy-demo python=3.9 pip numpy scipy cudatoolkit cupy
In this case, you should avoid loading the cudatoolkit module in your environment, since doing so could lead to conflicts with the cudatoolkit installed in your conda environment.
Building CuPy from Source using pip¶
You can also build CuPy from source on Perlmutter. The build instructions depend slightly on whether you're using PrgEnv-nvidia or PrgEnv-gnu.
- Compiling with PrgEnv-gnu:
module load PrgEnv-gnu
module load cudatoolkit
LDFLAGS="$CRAY_CUDATOOLKIT_POST_LINK_OPTS" NVCC="nvcc $CRAY_CUDATOOLKIT_INCLUDE_OPTS" CC=cc CXX=CC pip install cupy
- Compiling with PrgEnv-nvidia:
module load PrgEnv-nvidia
module load cudatoolkit
LDFLAGS="$CRAY_CUDATOOLKIT_POST_LINK_OPTS -L$CUDATOOLKIT_HOME/targets/x86_64-linux/lib" CFLAGS="-I$CUDATOOLKIT_HOME/targets/x86_64-linux/include" NVCC="nvcc $CRAY_CUDATOOLKIT_INCLUDE_OPTS" CC=gcc CXX=g++ pip install cupy
CuPy builds can be customized in many ways
We recommend that you check out the list of customizations. For example, CUPY_NUM_BUILD_JOBS and CUPY_NUM_NVCC_THREADS can be used to increase the parallelism of your CuPy builds, and CUPY_CACHE_DIR can be used to relocate the CUDA code generated by CuPy.
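As an illustration, the three variables named above can be collected into an environment for a build launched from Python. This is a sketch only: the values 8 and 2 and the cache location are placeholders rather than tuned recommendations, and the function name cupy_build_env is hypothetical.

```python
import os


def cupy_build_env(build_jobs, nvcc_threads, cache_dir, base_env=None):
    """Return a copy of the environment with CuPy build and cache settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUPY_NUM_BUILD_JOBS"] = str(build_jobs)      # parallel compile jobs
    env["CUPY_NUM_NVCC_THREADS"] = str(nvcc_threads)  # threads used by each nvcc call
    env["CUPY_CACHE_DIR"] = str(cache_dir)            # where CuPy stores generated CUDA code
    return env


# Example use (not executed here): pass the result to a pip build, e.g.
#   subprocess.run([sys.executable, "-m", "pip", "install", "cupy"],
#                  env=cupy_build_env(8, 2, "/path/to/cupy-cache"), check=True)
```

The compiler settings from the build commands above (CC, CXX, NVCC, LDFLAGS) would still need to be present in the environment passed to the build.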
JAX¶
Setting up JAX¶
NVIDIA Containers¶
We recommend that you use NVIDIA's JAX-Toolbox containers to install JAX. The process is relatively straightforward and will reliably give you a working installation of JAX.
You can start a GPU job using the ghcr.io/nvidia/jax:jax image via Shifter with the following Slurm script:
#!/bin/bash
#SBATCH --image=ghcr.io/nvidia/jax:jax
#SBATCH --nodes=1
#SBATCH --qos=regular
#SBATCH --constraint=gpu
srun shifter --module=gpu python3 ./my_script.py
You could also run a script locally as follows:
shifter --module=gpu --image=ghcr.io/nvidia/jax:jax python3 ./my_script.py
See the Shifter documentation for further details on how to use Shifter containers.
Pip Installation¶
It is possible to install JAX using pip and Perlmutter modules. However, we recommend against it because finding the version numbers that give you a working installation, both on a single node and especially in a distributed setting, is error-prone.
Using JAX on GPUs requires CUDA and cuDNN, which are provided by the cudatoolkit and cudnn modules on Perlmutter. The following instructions, adapted from the JAX installation instructions, demonstrate how to set up a custom Conda environment to use JAX on Perlmutter with the local CUDA 12 libraries:
module load cudatoolkit/12.2
module load cudnn/8.9.3_cuda12
module load python
# Create a new conda environment
conda create -n jax-demo python=3.9 pip numpy scipy
# Activate the environment before using pip to install JAX
conda activate jax-demo
# Install a compatible wheel
pip install --no-cache-dir "jax==0.4.23" "jaxlib[cuda12_cudnn89]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
You need to ensure that the version of JAX you are installing, here 0.4.23, was built against the versions of Python, cudatoolkit, and cudnn that you are using (it might, for example, have been built against CUDA 12.3, which would trigger a runtime error when you try to run JIT-compiled code on a GPU with CUDA 12.2). Thus, we recommend that you lock down all of your version numbers to ensure a successful installation.
Finding an appropriate JAX version number might take some trial and error, going down the list of current JAX releases.
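Once a combination works, it can help to confirm that the environment still contains the pinned versions. The following is a minimal sketch using importlib.metadata from the standard library. It assumes that a CUDA-enabled jaxlib may report a local version suffix after a "+" sign, which is stripped before comparing; the function names are hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def public_version(version):
    """Drop a local suffix such as '+cuda12.cudnn89' from a version string."""
    return None if version is None else version.split("+", 1)[0]


def mismatches(expected, actual):
    """Return {package: (expected, actual)} for every package that differs from its pin."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if public_version(actual.get(name)) != want
    }


if __name__ == "__main__":
    pins = {"jax": "0.4.23", "jaxlib": "0.4.23"}
    print(mismatches(pins, installed_versions(pins)) or "all pins satisfied")
```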
Distributing JAX Computation¶
JAX offers robust support for multi-GPU computing within a single process through its Parallel Operators, particularly using pmap. For more information, refer to the Parallel Evaluation in JAX documentation.
To distribute computation across multiple nodes, you can use the jax.distributed module. The official multi-host and multi-process documentation, although very TPU-focused, also provides valuable insights. For a guide to distributing JAX computation over a Slurm cluster, refer to Wassim Kabalan's detailed tutorial.
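A multi-node start-up might look roughly like the following sketch. It assumes one process per Slurm task launched with srun, and relies on jax.distributed.initialize() detecting the coordinator address, process count, and process ID from the Slurm environment when called without arguments. The helper names slurm_task_info and init_distributed are hypothetical, and JAX is imported lazily so the first helper works without it.

```python
import os


def slurm_task_info(env=os.environ):
    """Read this task's position in the Slurm job from the environment (for logging)."""
    return {
        "num_tasks": int(env.get("SLURM_NTASKS", "1")),
        "task_id": int(env.get("SLURM_PROCID", "0")),
        "local_id": int(env.get("SLURM_LOCALID", "0")),
    }


def init_distributed():
    """Initialize multi-process JAX; call once per task, before any other JAX call."""
    import jax  # lazy import: only needed when actually running under JAX

    jax.distributed.initialize()  # no arguments: settings are detected from Slurm
    return jax.process_index(), jax.process_count()
```

After init_distributed() returns, jax.devices() lists the devices of all processes and jax.local_devices() those attached to the calling process.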
For an example of an idiomatic multi-node and multi-GPU JAX job running at NERSC with Slurm and containers, visit this GitHub repository.
Additionally, you can use mpi4jax to incorporate MPI operations within JAX JIT-compiled sections.
Support¶
For JAX support (besides tickets and the NERSC Help Desk) and to connect with other JAX users at NERSC who might share similar problems and solutions, you can join the #jax-users channel on the NERSC Users Slack.
cuNumeric¶
cuNumeric is a Legate library that aims to provide a distributed and accelerated drop-in replacement for the NumPy API on top of the Legion runtime.
The following instructions demonstrate how to install cuNumeric using conda for use on a single Perlmutter GPU node:
# install cunumeric using conda
module load conda
conda create -n cunumeric -c nvidia -c conda-forge -c legate cunumeric
conda activate cunumeric
# download cunumeric repo with examples
git clone https://github.com/nv-legate/cunumeric.git
cd cunumeric
# run an example program using 4 GPUs
# note use of the legate driver to launch the program
legate --gpus 4 examples/gemm.py -n 8000
Multi-node installations currently require building legate-core and cuNumeric from source; see the nv-legate/quickstart repo for details.
cuNumeric is currently under active development and should be considered experimental or "beta" software. If you have any issues, we recommend opening an issue on the cuNumeric GitHub project.
Known issues¶
General issues¶
Our Known Issues page includes more general issues that may also impact Python users.
MPI issues¶
conda / cray-mpich / libfabric dependencies missing evp symbols¶
Conda + MPI/mpi4py users may see the following error after the December Perlmutter maintenance when attempting to use cray-mpich or related software inside an active conda environment:
OSError: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d
When mpi4py is built locally with cray-mpich, cray-mpich dependencies on system libraries are essentially pulled into the environment. The system libssh is not compatible with some versions of the openssl package that are automatically included in conda environments.
We've added an evp-patch module that provides a compatible set of libssh and openssl that can be loaded to work around the issue. This module is automatically loaded by the python, pytorch, and tensorflow modules.
If you are using your own conda installation, you can load this module to work around the issue.
elvis@perlmutter> module load evp-patch
Another potential workaround is to add a compatible libssh package directly to your conda environment by installing it from conda-forge:
elvis@perlmutter> conda activate myenv
elvis@perlmutter> conda install -c conda-forge libssh
Issues with fork() in MPI processes¶
Several Python users have encountered errors that look like this:
mlx5: nid003244: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 20009232 00000000 00000300
00003c40 92083204 000180b8 0085a0e0
MPICH ERROR [Rank 256] [job id 126699.1] [Wed Oct 20 12:32:36 2021] [nid003244] - Abort(70891919) (rank 256 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(415)..............: MPI_Gatherv failed(sbuf=0x55ee4a4ebea0, scount=88, MPI_BYTE, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_BYTE, root=0, comm=MPI_COMM_WORLD) failed
MPIR_CRAY_Gatherv(353).........:
MPIC_Recv(197).................:
MPIC_Wait(71)..................:
MPIR_Wait_impl(41).............:
MPID_Progress_wait(186)........:
MPIDI_Progress_test(80)........:
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - local protection error)
or this:
File "mpi4py/MPI/Comm.pyx", line 1595, in mpi4py.MPI.Comm.allgather
File "mpi4py/MPI/msgpickle.pxi", line 873, in mpi4py.MPI.PyMPI_allgather
File "mpi4py/MPI/msgpickle.pxi", line 177, in mpi4py.MPI.pickle_loadv
File "mpi4py/MPI/msgpickle.pxi", line 152, in mpi4py.MPI.pickle_load
File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: invalid load key, '\x00'.
File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: pickle data was truncated
File "mpi4py/MPI/msgpickle.pxi", line 141, in mpi4py.MPI.cloads
_pickle.UnpicklingError: unpickling stack underflow
These error messages seem to be related to the use of fork() within an MPI process. For example, using the subprocess module to spawn processes, or calling a library function such as os.uname, indirectly uses the fork() system call in a Python application.
This is considered undefined behavior in the MPI standard, and its effects may vary depending on the MPI implementation, MPI middleware, and network hardware.
On Slingshot 11 (SS11), setting these environment variables may help circumvent the error:
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
However, it's possible these variables will not help. The most robust mitigation is to avoid spawning forks/subprocesses in MPI applications.
If you see this error and/or have questions about this, please open a ticket.
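Where a subprocess is only used for a simple task, one way to follow the mitigation above is to do the work in-process with the standard library. This is a sketch of two common cases, with hypothetical function names: getting the node name without running hostname, and copying a file without running cp.

```python
import shutil
import socket


def node_name():
    """Return the hostname via the gethostname() call; no child process is created."""
    return socket.gethostname()


def copy_file(src, dst):
    """Copy a file in-process instead of running `cp` through subprocess."""
    return shutil.copyfile(src, dst)


if __name__ == "__main__":
    print(f"running on {node_name()}")
```

Whether a given third-party library forks internally has to be checked case by case; this sketch covers only calls your own code makes.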
Using AMD CPUs on Perlmutter¶
Python users should be aware that using the Intel MKL library may be slow on Perlmutter's AMD CPUs, although it is often still faster than OpenBLAS.
We advise users to try our MKL workaround via module load fast-mkl-amd.