Current known ML software issues

This page lists current known issues, and workarounds where available, for machine learning software at NERSC.

Issues on Perlmutter

  • Users sometimes encounter a CUDA Unknown Error during initialization. Nvidia is still investigating the issue but has provided a workaround in the meantime: run a simple executable that creates a GPU context, then run your normal job steps. The executable can be created with the following line:

    srun -C gpu -N1 -n1 bash -c 'echo "int main() {cudaFree(0);}" > dummy.cu && nvcc -o dummy dummy.cu'
    

    The dummy executable can then be saved somewhere (e.g. in your $HOME directory) and reused for your jobs. To prevent the CUDA Unknown Error, run the dummy executable once on each GPU of your job before running your actual code, as in the sketch below. Note that the dummy executable does not need to be run from inside a shifter container.
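
    For example, a pair of job steps might look like the following sketch (the GPU count per node and the training command python train.py are placeholders):

    # run the dummy executable once per GPU to create a context on each device
    srun -C gpu -N1 --ntasks-per-node=4 --gpus-per-task=1 $HOME/dummy
    # then launch the actual job step as usual
    srun -C gpu -N1 --ntasks-per-node=4 --gpus-per-task=1 python train.py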

  • Some Nvidia NGC containers don't properly enter compatibility mode when running with shifter. To ensure correct behavior in NGC deep learning containers, wrap the commands you run inside the container with bash so that the compatibility check is set up properly. For example, the line

    srun shifter --image=nvcr.io/nvidia/pytorch:21.05-py3 python train.py
    

    would change to

    srun shifter --image=nvcr.io/nvidia/pytorch:21.05-py3 bash -c 'python train.py'
    

    Alternatively, you can put your code inside a bash script and run that script with shifter. In contrast to the NGC deep learning containers, the base Nvidia CUDA containers will not work with the bash trick above. For those, you will need to set $LD_LIBRARY_PATH manually to expose the proper compatibility libraries. To do so, set the following environment variable inside the container:

    export LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:$LD_LIBRARY_PATH
    
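    For example, with a base CUDA container the export and the command to run can be combined in a single wrapped invocation. This is a sketch only; the image tag and python train.py are illustrative placeholders:

    srun shifter --image=nvcr.io/nvidia/cuda:11.4.2-devel-ubuntu20.04 \
        bash -c 'export LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:$LD_LIBRARY_PATH && python train.py'
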
  • Conda-installed pytorch comes with an older version of NCCL (< 2.8) that is incompatible with an InfiniBand setting on the Perlmutter NICs, so multi-node distributed training with the nccl backend will hang. There are a number of possible workarounds:

    • Use our pytorch/1.9.0 module, which is built from source with NCCL 2.9.8.
    • Use a container with pytorch and NCCL >= 2.8. The Nvidia NGC deep learning containers provide many such versions and are optimized for Nvidia GPUs.
    • Set the environment variable NCCL_IB_DISABLE=1 before running your training, as in the sketch below. This disables collective communications over InfiniBand, so it incurs a slight performance hit.
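
    For example, a batch script using the NCCL_IB_DISABLE workaround might contain the following sketch (the node/task counts and python train.py are placeholders):

    # disable NCCL communication over InfiniBand to avoid the multi-node hang
    export NCCL_IB_DISABLE=1
    # launch one task per GPU across the allocated nodes
    srun -C gpu -N2 --ntasks-per-node=4 --gpus-per-task=1 python train.py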
  • Tensorflow + horovod built with Cray MPICH hangs when running multi-node training. Currently, we recommend using shifter to run multi-node tensorflow + horovod workflows. Note that since the Nvidia NGC containers use an installation of OpenMPI, the --mpi=pmi2 option to srun is needed to get this working, as in the sketch below.
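
    For example, a multi-node job step might look like the following sketch; the container tag and python train.py are illustrative placeholders, and the bash -c wrapper follows the NGC compatibility workaround described above:

    srun --mpi=pmi2 -C gpu -N2 --ntasks-per-node=4 --gpus-per-task=1 \
        shifter --image=nvcr.io/nvidia/tensorflow:21.05-tf2-py3 bash -c 'python train.py'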