# Current known ML software issues
This page lists current known issues, and workarounds where available, for machine learning software at NERSC.
## Issues on Perlmutter
- Older NGC TensorFlow containers (e.g., 21.08) may experience performance variability or degradation when running distributed training. The issue can be fixed by upgrading to a more recent container image.
- Conda-installed PyTorch comes with an older version of NCCL (<2.8) that is incompatible with an InfiniBand setting on Perlmutter NICs, so multi-node distributed training with the `nccl` backend will hang. A quick way to check your NCCL version is shown after this list. There are a number of possible workarounds:
    - Use one of our `pytorch` modules (version 1.9.0 or later), which are built from source with a newer NCCL.
    - Use a container with PyTorch and NCCL >= 2.8. The NVIDIA NGC deep learning containers have many versions available and are optimized for NVIDIA GPUs.
    - Set the environment variable `export NCCL_IB_DISABLE=1` before running your training (see the batch script sketch below). This disables collective communications over InfiniBand, so it will incur a performance hit.