Current known issues

Perlmutter

Please visit the timeline page for more information about changes we've made in our recent upgrades.

NERSC has automated monitoring that tracks failed nodes, so please only open tickets for node failures if the node consistently has poor performance relative to other nodes in the job or if the node repeatedly causes your jobs to fail.

New issues

  • Users may see ECC errors during their jobs that look like Message from syslogd@nid00XXXX at Jul 27 02:33:15 ... kernel:[Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0x9f8320000a800a01 or Message from syslogd@nid00XXXX at Jul 27 02:33:15 ... kernel:[Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error. These errors can be ignored. The node corrects them automatically, and we are working to silence the messages.
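Until the messages are silenced, they can be filtered out of a saved job log. The following is only a minimal sketch: the function name filter_ecc_noise is hypothetical, and the patterns assume the notices appear in the log exactly as quoted above, either on one line or split across two. Any other hardware error lines are left untouched.

```shell
# Hypothetical helper (not a NERSC tool): read a job log on stdin and
# drop only the benign ECC notices quoted above.
filter_ecc_noise() {
  # -v inverts the match, -E enables extended regular expressions.
  # Assumption: the syslogd header and the kernel line may be on
  # separate lines, so each is matched independently.
  grep -v -E \
    -e 'Message from syslogd@nid' \
    -e 'kernel:\[Hardware Error\]: IPID: ' \
    -e 'kernel:\[Hardware Error\]: .*DRAM ECC error'
}
```

For example, filter_ecc_noise < job.out > job.clean.out writes a copy of job.out (a placeholder file name) without the notices.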

Ongoing issues

  • Cray MPICH's MPI_Allreduce function can generate an incorrect result when the data size is over 2 GB. A workaround is to disable the SMP-aware algorithm by setting export MPICH_ALLREDUCE_NO_SMP=1.
  • If you're in csh running a bash script and want to use modules, you will need to add source /etc/profile.d/zz-cray-pe.sh to your script to get the module command to work. Conversely, if you're in bash and want to run a csh script, you'll need to add source /etc/profile.d/zz-cray-pe.csh.
  • Due to changes in the SLES SP4 system libraries, changes may be required for conda environments built or invoked without using the NERSC-provided python module. Users may see errors like ImportError: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d. Please see our Perlmutter python documentation for more information.
  • MPI users may hit segmentation fault errors when trying to launch an MPI job with many ranks due to incorrect allocation of GPU memory.
  • You may see messages like -bash: /usr/common/usg/bin/nersc_host: No such file or directory when you log in. This means your dotfiles are outdated and need to be updated. To stop this message, you can either delete this line from your dotfiles or check if NERSC_HOST is set before overwriting it. Please see our environment page for more details.
  • Shifter users may see errors about BIND MOUNT FAILED if they attempt to volume mount directories that are not world-executable. We have some workarounds for this issue.
  • Known issues for Machine Learning applications
  • Users may notice MKL-based CPU code runs more slowly. Please try module load fast-mkl-amd.
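For the outdated-dotfile message above, the "check if NERSC_HOST is set before overwriting it" option might look like the following in a bash dotfile. This is a sketch under assumptions, not NERSC's recommended snippet: the function name set_nersc_host and its optional path argument are hypothetical, and your dotfile may assign NERSC_HOST differently.

```shell
# Hypothetical guard for a bash dotfile: only call the legacy helper when
# NERSC_HOST is not already set and the helper still exists on this system.
set_nersc_host() {
  # The optional first argument overrides the helper path (assumption,
  # added so the guard can be tried out on other systems).
  helper="${1:-/usr/common/usg/bin/nersc_host}"
  if [ -z "${NERSC_HOST:-}" ] && [ -x "$helper" ]; then
    NERSC_HOST="$("$helper")"
    export NERSC_HOST
  fi
}

# No-op when NERSC_HOST is already set or the helper is missing,
# so no "No such file or directory" message is printed.
set_nersc_host
```

Because the existence test comes before the call, the guard stays silent on systems where the helper has been removed and leaves an existing NERSC_HOST value untouched.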

Past issues

For updates on past issues affecting Perlmutter, see the timeline page.