
Current known issues

Perlmutter

Perlmutter is not a production resource

Perlmutter is not a production resource. While we will attempt to make the system available to users as much as possible, it is subject to performance variability, unannounced and unexpected outages, reconfigurations, and periods of restricted access. Please visit the timeline page for more information about changes we've made in our recent upgrades.

NERSC has automated monitoring that tracks failed nodes, so please only open tickets for node failures if the node consistently has poor performance relative to other nodes in the job or if the node repeatedly causes your jobs to fail.

New issues

  • Large jobs may fail with an error like MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - UNDELIVERABLE). So far, this appears to be an intermittent error that is not associated with any particular type of job. If you see this error, please open a ticket with the job ID.

  • The current Arm Forge (Linaro Forge) version doesn't support CUDA 12 and generates the error 'Could not extract driver version from /proc/driver/nvidia/version'. A workaround is to set the environment variable ALLINEA_FORCE_CUDA_VERSION to 11.0. For details, see the DDT page.

  • If you're in csh running a bash script and want to use modules, you will need to add source /etc/profile.d/zz-cray-pe.sh to your script to get the module command to work. Conversely, if you're in bash and want to run a csh script, you'll need to add source /etc/profile.d/zz-cray-pe.csh to that script.

  • Due to changes in the SLES SP4 system libraries, conda environments built or invoked without using the NERSC-provided python module may need to be changed. Users may see errors like ImportError: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d. Please see our Perlmutter Python documentation for more information.

  • Some machine learning workloads running in older NGC containers (versions from before 2022) may encounter performance variability. These issues can be fixed by upgrading the container to a more recent version. Please see our machine learning known issues page for details.

  • If an instance of your scrontab job is cancelled (for instance, because of maintenance or because the node it was running on needed to be rebooted), the entire scrontab entry will go into a DISABLED state. We've opened a request with SchedMD to have this behavior changed so that only the single instance of the scrontab job is blocked instead of the whole entry. For now, you will have to manually unblock your scrontabs (i.e., delete the DISABLED text in front of your scrontab entry).

  • MPICH returns null bytes when collectively reading with MPI-IO from a file on /pscratch using progressive file layout (PFL) and exceeding a boundary between the different striping factors. If you want to use PFL, you can use export MPICH_MPIIO_HINTS="*:romio_cb_read=disable" to disable collective reading.

  • Host-based authentication is currently not working on Perlmutter. You can still ssh between nodes, but you will need your password and OTP.
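
Several of the workarounds in the list above are environment settings, so they can be collected at the top of a batch script. The following is a minimal sketch rather than an official NERSC recipe: the variable names, values, and the path to zz-cray-pe.sh are taken from the items above, while the readability check around the source line and the idea of grouping everything in one script are assumptions made for illustration. Set only the variables that apply to your job.

```shell
#!/bin/bash
# Sketch only: apply the environment workarounds listed above.

# Make the module command available when this bash script is started
# from a csh login shell. The guard (an assumption, not from the list
# above) keeps the script usable on systems without this file.
if [ -r /etc/profile.d/zz-cray-pe.sh ]; then
    source /etc/profile.d/zz-cray-pe.sh
fi

# Arm Forge (Linaro Forge) with CUDA 12: report CUDA 11.0 to the tool.
export ALLINEA_FORCE_CUDA_VERSION=11.0

# MPI-IO on /pscratch with progressive file layout (PFL):
# disable collective reading to avoid null bytes in the data read.
export MPICH_MPIIO_HINTS="*:romio_cb_read=disable"
```

Disabling collective reading changes how MPI-IO reads are carried out, so it is worth setting MPICH_MPIIO_HINTS only for jobs that actually read PFL-striped files.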

Ongoing issues

  • Our Slingshot 11 libfabric (Perlmutter CPU nodes) is currently missing the functions MPI_Mprobe()/MPI_Improbe() and MPI_Mrecv()/MPI_Imrecv(). This may especially impact mpi4py users attempting to send/receive pickled objects. One workaround may be to set export MPI4PY_RC_RECV_MPROBE='False'. The current estimated timeframe for a fix is January 2023. If these missing functions are impacting your workload, please open a ticket to let us know.
  • MPI users may hit segmentation fault errors when trying to launch an MPI job with many ranks due to incorrect allocation of GPU memory. We provide more information and a suggested workaround.
  • You may see messages like -bash: /usr/common/usg/bin/nersc_host: No such file or directory when you login. This means you have outdated dotfiles that need to be updated. To stop this message, you can either delete this line from your dot files or check if NERSC_HOST is set before overwriting it. Please see our environment page for more details.
  • Shifter users may see errors about BIND MOUNT FAILED if they attempt to volume mount directories that are not world executable. We have some workarounds for this issue.
  • Known issues for Machine Learning applications
  • Users may notice that MKL-based CPU code runs more slowly than expected. If so, please try module load fast-mkl-amd.
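
The shell-level workarounds from the list above can be sketched as follows. This is a hedged example, not the exact content of any NERSC-provided dotfile: the form of the NERSC_HOST assignment is an assumption about what an outdated dotfile line might look like, and the checks for the helper binary and for the module command are added only so the snippet fails quietly on systems where they are missing.

```shell
#!/bin/bash
# Sketch only: shell-level workarounds for the ongoing issues above.

# Dotfiles: only compute NERSC_HOST if it is not already set, and only
# if the old helper still exists. The assignment form is an assumed
# example of an outdated dotfile line.
if [ -z "${NERSC_HOST:-}" ] && [ -x /usr/common/usg/bin/nersc_host ]; then
    NERSC_HOST="$(/usr/common/usg/bin/nersc_host)"
    export NERSC_HOST
fi

# mpi4py: avoid the missing MPI_Mprobe()/MPI_Mrecv() family when
# receiving pickled objects.
export MPI4PY_RC_RECV_MPROBE='False'

# MKL-based CPU code: try the suggested module when modules are available.
if command -v module >/dev/null 2>&1; then
    module load fast-mkl-amd || echo "could not load fast-mkl-amd" >&2
fi
```

The first block belongs in a dotfile such as ~/.bashrc, while the export and module load lines would more typically go in the batch script that runs the affected application.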

Past issues

For updates on past issues affecting Perlmutter, see the timeline page.