Current known ML software issues

This page lists current known issues, and workarounds if any, for machine learning software at NERSC.

Issues on Perlmutter

  • Some NVIDIA NGC containers may not properly enter compatibility mode when run with Shifter. To ensure correct behavior in NGC deep learning containers, you may need to wrap the command you run inside the container with bash so that the compatibility check is set up properly. For example, the line

    srun shifter python

    would change to

    srun shifter bash -c 'python'

    Alternatively, you can put your commands in a bash script and run that script with Shifter.
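
    The wrapper-script approach might look like the following sketch, where the script name `run.sh` and the training command are placeholders for your own:

    ```shell
    #!/bin/bash
    # run.sh -- hypothetical wrapper script. Launching it with
    #   srun shifter bash ./run.sh
    # starts bash inside the container first, so the NGC compatibility
    # check is set up before your commands run.
    python train.py
    ```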

  • Conda-installed pytorch comes with an older version of NCCL (< 2.8) that is incompatible with an InfiniBand setting on Perlmutter NICs, so multi-node distributed training with the nccl backend will hang. There are a number of possible workarounds:

    • Use one of our pytorch modules (version 1.9.0 or later), which are built from source against NCCL 2.9.8.
    • Use a container with pytorch and NCCL >= 2.8. The NVIDIA NGC deep learning containers provide many such versions and are optimized for NVIDIA GPUs.
    • Set the environment variable NCCL_IB_DISABLE=1 (e.g. export NCCL_IB_DISABLE=1) before running your training. This disables collective communication over InfiniBand, so it will incur a performance hit.
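
    As a sketch of the environment-variable workaround, a batch script might set the variable just before launching training. The node/GPU counts and the script name `train.py` here are placeholder assumptions, not a prescribed configuration:

    ```shell
    #!/bin/bash
    # Sketch of a multi-node PyTorch batch script using the workaround.
    #SBATCH --nodes=4
    #SBATCH --gpus-per-node=4
    #SBATCH --constraint=gpu

    # Disable NCCL collectives over InfiniBand to avoid the hang with
    # NCCL < 2.8; expect reduced inter-node communication performance.
    export NCCL_IB_DISABLE=1

    srun python train.py
    ```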