
Current known ML software issues

This page lists current known issues, and workarounds where available, for machine learning software at NERSC.

Issues on Perlmutter

  • The default GPU queues send jobs to compute nodes configured to use the Slingshot11 interconnect. Due to a known software issue, ML codes that use NCCL for communication between GPUs may initially be slower when run at scale on these nodes; we are waiting for a libfabric-optimized fix from the vendors to address this. If you wish to continue using the Slingshot10 GPU nodes, you must request them explicitly by appending "_ss10" to the queue name, e.g., "-C gpu -q regular_ss10". Jobs cannot use a mixture of Slingshot10 and Slingshot11 nodes. Please report any issues you encounter via a ticket.
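    As a sketch, a batch script requesting Slingshot10 nodes might look like the following (node count and time limit are placeholder values, and train.py is a hypothetical training script):

    ```shell
    #!/bin/bash
    # Hypothetical batch script: the "_ss10" suffix on the queue name
    # requests Slingshot10 GPU nodes instead of the default Slingshot11 nodes
    #SBATCH -C gpu
    #SBATCH -q regular_ss10
    #SBATCH -N 2
    #SBATCH -t 00:30:00

    srun python train.py
    ```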

  • Some Nvidia NGC containers may not properly enter compatibility mode when running with shifter. To ensure correct behavior in NGC deep learning containers, you may need to wrap the commands you run inside the container with bash so that the compatibility check is set up properly. For example, the line

    srun shifter python

    would change to

    srun shifter bash -c 'python'

    Alternatively, you can put your commands in a bash script and run that script with shifter.
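    For example, assuming a hypothetical wrapper script train.sh:

    ```shell
    #!/bin/bash
    # train.sh (hypothetical): launching your command through a bash script
    # inside the container triggers the compatibility check before the
    # real work starts; replace the line below with your actual command
    python train.py
    ```

    which you would then launch with "srun shifter bash ./train.sh".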

  • Conda-installed pytorch ships with an older version of NCCL (< 2.8) that is incompatible with an InfiniBand setting on the Perlmutter NICs, so multi-node distributed training with the nccl backend will hang. There are several possible workarounds:

    • Use one of our pytorch modules (version 1.9.0 or later), which are built from source with NCCL 2.9.8.
    • Use a container with pytorch and NCCL >= 2.8. The Nvidia NGC deep learning containers offer many versions and are optimized for Nvidia GPUs.
    • Set the environment variable NCCL_IB_DISABLE=1 before running your training. This disables collective communications over InfiniBand, so it will incur a performance hit.
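    As a sketch, the last workaround would look like this in your job script, before any srun launch line:

    ```shell
    # Workaround: disable NCCL communication over InfiniBand before
    # launching training; collectives fall back to a slower transport
    export NCCL_IB_DISABLE=1
    echo "NCCL_IB_DISABLE=$NCCL_IB_DISABLE"
    ```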