Current known ML software issues

This page lists current known issues, and workarounds if any, for machine learning software at NERSC.

Issues on Perlmutter

  • Some NVIDIA NGC containers may not properly enter compatibility mode when run with Shifter. To ensure correct behavior in NGC deep learning containers, you may need to wrap the command you run inside the container with bash so that the compatibility check is set up properly. For example, the line

    srun shifter python

    would change to

    srun shifter bash -c 'python'

    Alternatively, you can put your commands in a bash script and run that script with Shifter.
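
    The wrapper-script approach might look like the following sketch, where the script name `run.sh` and the training command are placeholders for your own:

    ```shell
    #!/bin/bash
    # run.sh -- hypothetical wrapper script. Launching it with
    #   srun shifter bash ./run.sh
    # starts bash inside the container first, so the NGC compatibility
    # check is set up before your commands run.
    python train.py
    ```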

  • Conda-installed pytorch comes with an older version of NCCL (< 2.8) that is incompatible with an InfiniBand setting on Perlmutter NICs, so multi-node distributed training with the nccl backend will hang. There are a number of possible workarounds:

    • Use one of our pytorch modules (version 1.9.0 or later), which are built from source against NCCL 2.9.8.
    • Use a container with pytorch and NCCL >= 2.8. The NVIDIA NGC deep learning containers provide many such versions and are optimized for NVIDIA GPUs.
    • Set the environment variable NCCL_IB_DISABLE=1 (e.g. export NCCL_IB_DISABLE=1) before running your training. This disables collective communication over InfiniBand, so it will incur a performance hit.
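
    As a sketch of the environment-variable workaround, a batch script might set the variable just before launching training. The node/GPU counts and the script name `train.py` here are placeholder assumptions, not a prescribed configuration:

    ```shell
    #!/bin/bash
    # Sketch of a multi-node PyTorch batch script using the workaround.
    #SBATCH --nodes=4
    #SBATCH --gpus-per-node=4
    #SBATCH --constraint=gpu

    # Disable NCCL collectives over InfiniBand to avoid the hang with
    # NCCL < 2.8; expect reduced inter-node communication performance.
    export NCCL_IB_DISABLE=1

    srun python train.py
    ```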