Current known ML software issues

This page lists current known issues, and workarounds where available, for machine learning software at NERSC.

Issues on Perlmutter

  • Users sometimes encounter a CUDA Unknown Error during initialization. Nvidia is still investigating the issue but has provided a workaround in the meantime: run a simple executable that creates a GPU context, then run your normal job steps. The executable can be created with the following line:

    srun -C gpu -N1 -n1 bash -c 'echo "int main() {cudaFree(0);}" > dummy.cu && nvcc -o dummy dummy.cu'
    

    The dummy executable can then be saved somewhere (e.g. in your $HOME directory) and reused for your jobs. To prevent the CUDA Unknown Error, run the dummy executable once on each GPU of your job before running your actual code, as in the sketch below. Note that the dummy executable does not need to be run from inside a shifter container.
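
    For example, a pair of job steps might look like the following sketch (the GPU count per node and the training command python train.py are placeholders):

    # run the dummy executable once per GPU to create a context on each device
    srun -C gpu -N1 --ntasks-per-node=4 --gpus-per-task=1 $HOME/dummy
    # then launch the actual job step as usual
    srun -C gpu -N1 --ntasks-per-node=4 --gpus-per-task=1 python train.py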

  • Some Nvidia NGC containers don't properly enter compatibility mode when running with shifter. To ensure correct behavior in NGC deep learning containers, wrap the commands you run inside the container with bash so that the compatibility check is set up properly. For example, the line

    srun shifter --image=nvcr.io/nvidia/pytorch:21.05-py3 python train.py
    

    would change to

    srun shifter --image=nvcr.io/nvidia/pytorch:21.05-py3 bash -c 'python train.py'
    

    Alternatively, you can put your code inside a bash script and run that script with shifter. In contrast to the NGC deep learning containers, the base Nvidia CUDA containers will not work with the bash trick above. For those, you will need to set $LD_LIBRARY_PATH manually to expose the proper compatibility libraries. To do so, set the following environment variable inside the container:

    export LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:$LD_LIBRARY_PATH
    
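    For example, with a base CUDA container the export and the command to run can be combined in a single wrapped invocation. This is a sketch only; the image tag and python train.py are illustrative placeholders:

    srun shifter --image=nvcr.io/nvidia/cuda:11.4.2-devel-ubuntu20.04 \
        bash -c 'export LD_LIBRARY_PATH=/usr/local/cuda/compat/lib.real:$LD_LIBRARY_PATH && python train.py'
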
  • Conda-installed pytorch comes with an older version of NCCL (< 2.8) that is incompatible with an InfiniBand setting on the Perlmutter NICs, so multi-node distributed training with the nccl backend will hang. There are a number of possible workarounds:

    • Use our pytorch/1.9.0 module, which is built from source with NCCL 2.9.8.
    • Use a container with pytorch and NCCL >= 2.8. The Nvidia NGC deep learning containers provide many such versions and are optimized for Nvidia GPUs.
    • Set the environment variable NCCL_IB_DISABLE=1 before running your training, as in the sketch below. This disables collective communications over InfiniBand, so it incurs a slight performance hit.
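
    For example, a batch script using the NCCL_IB_DISABLE workaround might contain the following sketch (the node/task counts and python train.py are placeholders):

    # disable NCCL communication over InfiniBand to avoid the multi-node hang
    export NCCL_IB_DISABLE=1
    # launch one task per GPU across the allocated nodes
    srun -C gpu -N2 --ntasks-per-node=4 --gpus-per-task=1 python train.py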
  • Tensorflow + horovod built with Cray MPICH hangs when running multi-node training. Currently, we recommend using shifter to run multi-node tensorflow + horovod workflows. Note that since the Nvidia NGC containers use an installation of OpenMPI, the --mpi=pmi2 option to srun is needed to get this working, as in the sketch below.
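
    For example, a multi-node job step might look like the following sketch; the container tag and python train.py are illustrative placeholders, and the bash -c wrapper follows the NGC compatibility workaround described above:

    srun --mpi=pmi2 -C gpu -N2 --ntasks-per-node=4 --gpus-per-task=1 \
        shifter --image=nvcr.io/nvidia/tensorflow:21.05-tf2-py3 bash -c 'python train.py'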