Current known issues

Perlmutter

Perlmutter is not a production resource

Perlmutter is not a production resource and usage is not charged against your allocation of time. While we will attempt to make the system available to users as much as possible, it is subject to unannounced and unexpected outages, reconfigurations, and periods of restricted access. Please visit the timeline page for more information about changes we've made in our recent upgrades.

NERSC has automated monitoring that tracks failed nodes, so please only open tickets for node failures if the node consistently has poor performance relative to other nodes in the job or if the node repeatedly causes your jobs to fail.

New issues

  • MPICH returns null bytes when reading collectively with MPI-IO from a PFL-striped file in /pscratch if the read crosses a boundary between the different striping factors (e.g. when the file exceeds 1 GiB).
    All PFL-striped directories on /pscratch have been restriped with a factor of 1, so new files will be created with that striping; we are in the process of restriping all existing files, but this may take some time. In the meantime, users can either export MPICH_MPIIO_HINTS="*:romio_cb_read=disable" to disable collective reading (a short example is shown after this list) or manually migrate their existing files.
  • Building software in PrgEnv-gnu fails with a message like cannot find -lcublas. This is because cuBLAS (and the other CUDA math libraries) are installed in a different location from the main cudatoolkit libraries, and the modulefiles are missing the environment variables that would let the compiler and linker find them. You can work around this by setting the following environment variables after loading cudatoolkit:
perlmutter$ export LIBRARY_PATH="${LIBRARY_PATH}:${CUDATOOLKIT_HOME}/../../math_libs/lib64"
perlmutter$ export CPATH="${CPATH}:${CUDATOOLKIT_HOME}/../../math_libs/include"
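
Once those variables are exported, the Cray compiler wrapper should be able to resolve cuBLAS. As a quick check (a sketch only; my_app and my_app.cpp are placeholder names, not files provided by the modulefiles):

perlmutter$ CC -o my_app my_app.cpp -lcublas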
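
For the MPI-IO issue above, the hint can be exported before launching the application. This is only a sketch of the workaround already described; the rank count and binary name are placeholders:

perlmutter$ export MPICH_MPIIO_HINTS="*:romio_cb_read=disable"
perlmutter$ srun -n 128 ./my_mpiio_app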

Ongoing issues

  • The Slingshot 11 libfabric used on Perlmutter CPU nodes currently lacks support for MPI_Mprobe()/MPI_Improbe() and MPI_Mrecv()/MPI_Imrecv(). This may especially impact mpi4py users attempting to send or receive pickled objects. One workaround may be to set export MPI4PY_RC_RECV_MPROBE='False' (see the sketch after this list). The current estimated timeframe for a fix is January 2023. If these missing functions are impacting your workload, please open a ticket to let us know.
  • MPI users may hit segmentation fault errors when trying to launch an MPI job with many ranks due to incorrect allocation of GPU memory. We provide more information and a suggested workaround.
  • You may see messages like -bash: /usr/common/usg/bin/nersc_host: No such file or directory when you log in. This means you have outdated dotfiles that need to be updated. To stop this message, either delete this line from your dotfiles or check whether NERSC_HOST is already set before overwriting it (see the sketch after this list). Please see our environment page for more details.
  • Shifter users may see errors about BIND MOUNT FAILED if they attempt to volume mount directories that are not world-executable. We have some workarounds for this issue.
  • Known issues for Machine Learning applications
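
For the missing matched-probe functions above, the mpi4py workaround can be applied by exporting the variable before launching Python. This is a minimal sketch; the rank count and script name are placeholders:

perlmutter$ export MPI4PY_RC_RECV_MPROBE='False'
perlmutter$ srun -n 4 python my_mpi4py_script.py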
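
For the outdated dotfiles above, one way to keep the old line while silencing the error is to test whether NERSC_HOST is already defined before calling the removed command. This is a sketch only; adapt it to however your own dotfiles set NERSC_HOST:

if [ -z "${NERSC_HOST}" ]; then
    export NERSC_HOST=$(/usr/common/usg/bin/nersc_host)
fi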

Be careful with NVIDIA Unified Memory to avoid crashing nodes

In your code, NVIDIA Unified Memory allocations might look like calls to cudaMallocManaged. At the moment, we do not have the ability to control this kind of memory and keep it under a safe limit. Users who allocate a large pool of this kind of memory may end up crashing nodes if the Unified Virtual Memory (UVM) allocation does not leave enough room for necessary system tools like our filesystem client. We expect a fix from the vendor soon. In the meantime, please keep the size of memory pools allocated via UVM relatively small. If you have questions about this, please contact us.

Past issues

For updates on past issues affecting Perlmutter, see the timeline page.

Cori

Ongoing issues

  • The Burst Buffer on Cori has a number of known issues, documented at Cori Burst Buffer.