Current known issues

Perlmutter

Perlmutter is not a production resource

Perlmutter is not a production resource and usage is not charged against your allocation of time. While we will attempt to make the system available to users as much as possible, it is subject to unannounced and unexpected outages, reconfigurations, and periods of restricted access. Please visit the timeline page for more information about changes we've made in our recent upgrades.

NERSC has automated monitoring that tracks failed nodes, so please only open tickets for node failures if the node consistently has poor performance relative to other nodes in the job or if the node repeatedly causes your jobs to fail.

New issues

  • Users will encounter problems linking against the CUDA math libraries (cufft.h, cusolver.h, etc.) with any CUDAToolkit. A temporary workaround is to prepend the math_libs folder to $CPATH, or to $CMAKE_PREFIX_PATH if using CMake. Change the version numbers in the paths to match your CUDAToolkit; the paths below are what you need for cudatoolkit/21.9_11.4:

    • to prepend your CPATH:

      export CPATH=/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/math_libs/include:$CPATH
      
    • to prepend your CMake path:

      export CMAKE_PREFIX_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/math_libs/11.4:$CMAKE_PREFIX_PATH
      
  • Spack on Perlmutter is out of date with respect to the installed compilers and other PE components, and consequently mostly does not work at the moment. We're working to update the configuration.
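
For the CUDA math library workaround above, the two paths can be derived from the module version string instead of being edited by hand. This is a minimal sketch, assuming the cudatoolkit module version always has the form <HPC SDK version>_<CUDA version>, as in 21.9_11.4. The variable names CUDATOOLKIT_VERSION, NVHPC_VERSION, CUDA_VERSION, and MATH_LIBS are hypothetical and chosen only for illustration.

```shell
# Assumption: the module version looks like "<HPC SDK version>_<CUDA version>".
# Change this value to match the cudatoolkit module you have loaded.
CUDATOOLKIT_VERSION=21.9_11.4

# Split on the first underscore: "21.9_11.4" -> "21.9" and "11.4".
NVHPC_VERSION=${CUDATOOLKIT_VERSION%%_*}
CUDA_VERSION=${CUDATOOLKIT_VERSION#*_}

MATH_LIBS=/opt/nvidia/hpc_sdk/Linux_x86_64/${NVHPC_VERSION}/math_libs

# Prepend the math_libs folders. The ${VAR:+:$VAR} form adds the ":" separator
# only when the variable already has a value, so no empty path entry is created.
export CPATH="${MATH_LIBS}/include${CPATH:+:$CPATH}"
export CMAKE_PREFIX_PATH="${MATH_LIBS}/${CUDA_VERSION}${CMAKE_PREFIX_PATH:+:$CMAKE_PREFIX_PATH}"
```

Avoiding a trailing colon matters for CPATH in particular, because GCC treats an empty entry in that variable as the current working directory.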

Ongoing issues

  • MPI users may hit segmentation fault errors when trying to launch an MPI job with many ranks due to incorrect allocation of GPU memory. We provide more information and a suggested workaround.
  • Some users may see messages like -bash: /usr/common/usg/bin/nersc_host: No such file or directory when logging in. This means your dotfiles are outdated and need to be updated. To stop this message, either delete the line that calls nersc_host from your dotfiles or check whether NERSC_HOST is set before overwriting it. Please see our environment page for more details.
  • Known issues for Machine Learning applications
  • Nodes on Perlmutter currently do not get a constant hostid (IP address) response.
  • collabsu is not available. Please create a direct login with sshproxy to log in to Perlmutter, or switch to a collaboration account on Cori and then log in to Perlmutter.
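
For the nersc_host message in the list above, the second option (checking whether NERSC_HOST is set before overwriting it) could look something like the following in a dotfile. This is only a sketch, not an official NERSC snippet: the function name set_nersc_host_if_unset is hypothetical, and it assumes your dotfile currently sets NERSC_HOST from the output of /usr/common/usg/bin/nersc_host.

```shell
# Hypothetical helper for ~/.bashrc or a similar dotfile.
set_nersc_host_if_unset() {
    # Keep any value the system has already provided.
    if [ -n "${NERSC_HOST:-}" ]; then
        return 0
    fi
    # Only call the legacy script if it still exists on this system,
    # which avoids the "No such file or directory" message.
    if [ -x /usr/common/usg/bin/nersc_host ]; then
        NERSC_HOST=$(/usr/common/usg/bin/nersc_host)
        export NERSC_HOST
    fi
}

set_nersc_host_if_unset
```

On systems where NERSC_HOST is already set at login, the function returns without touching it. Where the legacy script is missing, it does nothing rather than printing an error.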

Be careful with NVIDIA Unified Memory to avoid crashing nodes

In your code, NVIDIA Unified Memory typically appears as allocations made with calls such as cudaMallocManaged. At the moment, we cannot limit this kind of memory to a safe size. Users who allocate a large pool of Unified Memory may crash nodes if the allocation does not leave enough room for necessary system tools such as our filesystem client. We expect a fix in early 2022. In the meantime, please keep memory pools allocated via Unified Memory (UVM) relatively small. If you have questions about this, please contact us.

Cori

The Burst Buffer on Cori has a number of known issues, documented at Cori Burst Buffer.