

PyTorch is a high-productivity Deep Learning framework based on dynamic computation graphs and automatic differentiation. It is designed to be as close to native Python as possible for maximum flexibility and expressivity.

Using PyTorch at NERSC

There are multiple ways to use and run PyTorch on NERSC systems like Perlmutter.

Using NERSC PyTorch modules

The PyTorch modules are the easiest and fastest way to get started with a complete Python + PyTorch environment including all the features supported by the system.

You can load the PyTorch module with

module load pytorch/<version>

where <version> should be replaced with the version string you are trying to load. You can see which PyTorch versions are available with module avail pytorch. We generally recommend using the latest version so that you have the latest PyTorch features.

Customizing the module environments: If you want to integrate your own packages into the NERSC PyTorch module environment, you can simply install packages on top with pip, e.g.:

module load pytorch
pip install --user netCDF4

This leverages the $PYTHONUSERBASE variable which is set by the modulefiles to specify a location for your additional packages specific to that module version. These packages will then be available every time you load the module.
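
To confirm where pip install --user will place packages after loading a module, you can inspect Python's user-site settings. The following is a minimal sketch using only the standard library; the helper name describe_user_site is hypothetical and not part of any NERSC tooling.

```python
import os
import site


def describe_user_site():
    """Report where `pip install --user` would install packages."""
    return {
        # Value set by the modulefile (None if the variable is not set).
        "PYTHONUSERBASE": os.environ.get("PYTHONUSERBASE"),
        # Base directory Python actually uses for user installs.
        "user_base": site.getuserbase(),
        # The site-packages directory underneath that base.
        "user_site": site.getusersitepackages(),
        # False or None means user-site packages are ignored
        # (for example inside some virtual environments).
        "enabled": site.ENABLE_USER_SITE,
    }


if __name__ == "__main__":
    for key, value in describe_user_site().items():
        print(f"{key}: {value}")
```

If user_base does not match the value of PYTHONUSERBASE, the variable was probably changed after the interpreter started.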

Building your own environments

Alternatively, you can install PyTorch into your own software environments. This gives you full control over the included packages and versions. We recommend using conda as described in our Python documentation. Follow the appropriate installation instructions at:

Note that if you install PyTorch via conda, it will not have MPI support. However, you can install PyTorch with GPU and NCCL support via conda.
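
To check which of these features a given installation actually provides, you can query PyTorch directly. This is a sketch, not an official NERSC check; the function name pytorch_feature_report is hypothetical, and the function returns early if PyTorch is not importable in the current environment.

```python
import importlib.util


def pytorch_feature_report():
    """Summarize GPU and distributed-backend support of the installed PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch
    import torch.distributed as dist

    report = {
        "torch_installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "distributed_available": dist.is_available(),
    }
    # The backend queries only exist when the distributed package was built.
    if dist.is_available():
        report["nccl"] = dist.is_nccl_available()
        report["mpi"] = dist.is_mpi_available()
        report["gloo"] = dist.is_gloo_available()
    return report


if __name__ == "__main__":
    print(pytorch_feature_report())
```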

If you need to build PyTorch from source, you can refer to our build scripts for PyTorch in the nersc-pytorch-build repository. If you need assistance, please open a support ticket at

Containers


It is also possible to use containers to run PyTorch code on Perlmutter with shifter. Refer to the NERSC shifter documentation for help deploying your own containers.

On Perlmutter, we provide prebuilt PyTorch images based on NVIDIA GPU Cloud (NGC) containers. They are named like nersc/pytorch:24.08.01, where 24.08 refers to the NGC image tag. For multi-node distributed training with containers, the best performance is achieved by using the NCCL shifter modules together with the default gpu shifter module. Refer to the NCCL shifter modules page to identify the appropriate argument for your container.

To run a container in batch jobs, we strongly recommend using the Slurm --image and --module shifter options for best performance:

#SBATCH --image=nersc/pytorch:24.08.01
#SBATCH --module=gpu,nccl-plugin

srun shifter python

Customizing your containers in shifter: Shifter containers are read-only, which means you cannot modify the image contents at runtime. However, you can specify a path on the host system for additional packages by setting $PYTHONUSERBASE. You can use the Shifter --env option to set this variable, e.g.:

shifter --image=nersc/pytorch:24.08.01 --module gpu,nccl-plugin --env PYTHONUSERBASE=$HOME/.local/perlmutter/nersc_pt_24.08.01
pip install --user netCDF4

You also need to set $PYTHONUSERBASE in your Slurm batch scripts to use your custom libraries at runtime:

#SBATCH --image=nersc/pytorch:24.08.01
#SBATCH --module=gpu,nccl-plugin

srun shifter --env PYTHONUSERBASE=$HOME/.local/perlmutter/nersc_pt_24.08.01 python

Distributed training

PyTorch makes it fairly easy to get up and running with multi-GPU and multi-node training via its distributed package. For an overview, refer to the PyTorch distributed documentation.
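
As a rough illustration of how a training script can initialize the distributed package when launched with srun, the sketch below maps Slurm task variables to ranks and wraps a model for data-parallel training. It rests on several assumptions: one task is launched per GPU, MASTER_ADDR and MASTER_PORT are exported in the batch script, and the helper names slurm_dist_env, init_distributed, and wrap_model are hypothetical.

```python
import os


def slurm_dist_env(environ=None):
    """Map Slurm task variables to (rank, world_size, local_rank)."""
    env = os.environ if environ is None else environ
    rank = int(env.get("SLURM_PROCID", 0))        # global task index
    world_size = int(env.get("SLURM_NTASKS", 1))  # total number of tasks
    local_rank = int(env.get("SLURM_LOCALID", 0)) # task index on this node
    return rank, world_size, local_rank


def init_distributed(backend="nccl"):
    """Initialize torch.distributed; assumes GPUs are visible and
    MASTER_ADDR / MASTER_PORT are already set in the environment."""
    import torch
    import torch.distributed as dist

    rank, world_size, local_rank = slurm_dist_env()
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
    # If Slurm binds a single GPU to each task, only device 0 is visible,
    # so wrap the local rank by the number of visible devices.
    device = torch.device("cuda", local_rank % torch.cuda.device_count())
    torch.cuda.set_device(device)
    return device


def wrap_model(model, device):
    """Move the model to its GPU and wrap it for data-parallel training."""
    from torch.nn.parallel import DistributedDataParallel as DDP

    return DDP(model.to(device), device_ids=[device.index])
```

A training script would call init_distributed() once at startup and then pass its model through wrap_model before the training loop.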

On Perlmutter, the best performance for multi-node distributed training with containers is achieved by using the nccl-plugin shifter module together with the default gpu shifter module.

Unset NCCL_CROSS_NIC for multi-node training with container versions 24.11 and higher

There is an issue with the PyTorch containers 24.11+ (which have NCCL versions 2.23 - 2.25) that hurts performance for multi-node training. To work around it, unset NCCL_CROSS_NIC when launching your jobs with these containers.

#SBATCH --image=nersc/pytorch:25.02.01
#SBATCH --module=gpu,nccl-plugin

srun <srun args> shifter <shifter args> bash -c "
    unset NCCL_CROSS_NIC
    <anything else you have>
    python <train script, etc.>
    "
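
If changing the launch command is inconvenient, the variable can also be dropped from inside the training script. This sketch assumes that removing it from the process environment before the NCCL process group is created has the same effect as the shell unset; that equivalence is an assumption, and the shell form is the documented workaround. The helper name drop_nccl_cross_nic is hypothetical.

```python
import os


def drop_nccl_cross_nic(environ=None):
    """Remove NCCL_CROSS_NIC from the environment and return its old value.

    Call this before torch.distributed.init_process_group so that NCCL
    does not see the variable when it initializes.
    """
    env = os.environ if environ is None else environ
    return env.pop("NCCL_CROSS_NIC", None)


if __name__ == "__main__":
    previous = drop_nccl_cross_nic()
    print(f"NCCL_CROSS_NIC was: {previous!r}")
```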

See below for some complete examples for PyTorch distributed training at NERSC.

Performance optimization

To optimize the performance of PyTorch model training workloads on NVIDIA GPUs, refer to our Deep Learning at Scale Tutorial material from SC23, which includes guidelines for optimizing performance on a single NVIDIA GPU as well as best practices for scaling up model training across many GPUs and nodes.

Examples and tutorials

There is a set of example problems, datasets, models, and training code in this repository:

This repository can serve as a template for your research projects with a flexibly-organized design for layout and code structure. It also demonstrates how you can launch data-parallel distributed training jobs on our systems. The examples include MNIST image classification with a simple CNN and CIFAR10 image classification with a ResNet50 model.
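
In data-parallel training, each rank processes a different slice of the dataset in every epoch. The pure-Python sketch below illustrates the idea with a strided split that pads by wrapping around so that all ranks receive the same number of samples. It is an illustration only, not the code from the repository or PyTorch's own sampler, and the function name shard_indices is hypothetical. In practice you would typically use torch.utils.data.DistributedSampler with a DataLoader.

```python
import math


def shard_indices(num_samples, rank, world_size):
    """Return the dataset indices assigned to one rank (no shuffling).

    Assumes num_samples > 0. Indices are padded by wrapping around so
    that every rank gets exactly ceil(num_samples / world_size) items.
    """
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    # Pad with indices from the start of the dataset.
    indices += [indices[i % num_samples] for i in range(total - num_samples)]
    # Strided split: rank r takes elements r, r + world_size, ...
    return indices[rank:total:world_size]


if __name__ == "__main__":
    for r in range(4):
        print(r, shard_indices(10, r, 4))
```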

We also provide a more lightweight template PyTorch code for data parallel distributed training with the option of integrating with Weights & Biases for experiment tracking and hyperparameter optimization at:

For a general introduction to coding in PyTorch, you can check out this great tutorial from the DL4Sci school at Berkeley Lab in 2020 by Evann Courdier.

Additionally, for an example focused on performance and scaling, we have the material and code example from our Deep Learning at Scale tutorial at SC23.

Finally, PyTorch has a nice set of official tutorials you can learn from as well.