Skip to content

Current known issues

Perlmutter

Please visit the timeline page for more information about changes we've made in our recent upgrades.

NERSC has automated monitoring that tracks failed nodes, so please only open tickets for node failures if the node consistently has poor performance relative to other nodes in the job or if the node repeatedly causes your jobs to fail.

New issues

  • Users may see ECC errors during their jobs that look like Message from syslogd@nid00XXXX at Jul 27 02:33:15 ... kernel:[Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0x9f8320000a800a01 or Message from syslogd@nid00XXXX at Jul 27 02:33:15 ... kernel:[Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error. These errors can be ignored. The node will self-correct these errors and we are working to silence these messages.

Ongoing issues

  • If you're in csh running a bash script and want to use modules, you will need to add source /etc/profile.d/zz-cray-pe.sh to your script to get the module command to work. Conversely if you're in bash and want to run a csh script, you'll need to add source /etc/profile.d/zz-cray-pe.csh.
  • Due to changes in the SLES SP4 system libraries, changes may be required for conda environments built or invoked without using the NERSC provided python module. Users may see errors like ImportError: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d. Please see our Perlmutter python documentation for more information.
  • MPI users may hit segmentation fault errors when trying to launch an MPI job with many ranks due to incorrect allocation of GPU memory.
  • You may see messages like -bash: /usr/common/usg/bin/nersc_host: No such file or directory when you login. This means you have outdated dotfiles that need to be updated. To stop this message, you can either delete this line from your dot files or check if NERSC_HOST is set before overwriting it. Please see our environment page for more details.
  • Shifter users may see errors about BIND MOUNT FAILED if they attempt to volume mount directories that are not world executable. We have some workarounds for this issue.
  • Known issues for Machine Learning applications
  • Users may notice MKL-based CPU code runs more slowly. Please try module load fast-mkl-amd.

Selected Bug Reports Filed with Vendors

Updated on January 22, 2024.

Vendor Title Description Status
Open MPI TCP BTL fails to collect all interface addresses (when interfaces are on different subnets) Multi-node Open MPI point-to-point communications using the tcp BTL component fail because, although one NIC on a Perlmutter node has two IP interfaces for different subnets (one private and one public), only one IP is used per peer kernel interface. In progress
Nvidia CP2K container builds with Open MPI with networking bug CP2K container images on NGC that NERSC suggests to use were built with old Open MPI versions (4.x), and bugs there contribute to multi-node job failures. Requesting new images with Open MPI 5.x built against libfabric. In progress
HPE No scope analysis window pops up with Perftools/Reveal The 'Scope Loop' button on the Reveal tool doesn't open a window that would normally show the scoping result for a selected loop. In progress
Nvidia OpenACC reduction with worker gives wrong answers A procedure declared with an OpenACC routine worker directive returns wrong reduction values in the PrgEnv-nvidia environment when called from within a loop where num_workers and vector_length are set to 32. In progress
HPE RMA performance problems on Perlmutter with GASNet Codes With the GASNet-EX networking library implementing RMA (Remote Memory Access) APIs with fi_read() and fi_write() functions of the vendor-provided libfabric and its cxi provider, it is observed that RMA operations perform very well under ideal conditions. When conditions are not ideal, the performance decreases significantly for both host and GPU memory. In progress
HPE crayftn bug in assignment to unlimited polymorphic variable The Cray Fortran compiler doesn't allow assignment of an expression to a unlimited polymorphic allocatable variable. In progress
HPE Performance issue with fl_write() to GPU memory on Perlmutter The GASNet-EX networking library implements RMA APIs with the vendor-provided libfabric and its cxi provider. RMA Put operations between two GPU nodes when the destination address is in remote GPU memory show unexpectedly much lower performance than MPI. For other source/destination memory and Put/Get mode combinations, the GASNet-EX and MPI benchmarks show similar performance or GASNet-EX performs better. In progress
HPE OFI segfault, and intermittent loss of messages with GASNet The segfault problem has since been fixed. Applications occasionally hang, possibly due to loss of messages sent with fi_send() to be received in buffers posted using fi_recvmsg(). This has been observed with the cxi-provider libfabric. A suggested workaround of setting certain environment variables doesn't appear to be fully effective, and yet induces a waste of large memory. In progress
HPE Internal Compiler Error An internal compiler error occurs when compiling the E3SM code with the AMD compilers. In progress
HPE Issues when linking with PrgEnv-nvidia and cuFFTMp Undefined reference to MPI_Comm_f2c reported at link time. In progress
HPE cray-mpich with GTL not recognising pointer to device memory, that was returned by OpenCL clSVMAlloc When a pointer returned by the OpenCL clSVMAlloc function is used in one-sided MPI communication, it is not getting the correct data. A workaround of wrapping MPI RMA exposure epoch in clEnqueueSVMMap/clEnqueueSVMUnmap causes a large amount of data to be unnecessarily moved between the host and device memory. Asking for advice for using OpenCL with MPICH_GPU_SUPPORT_ENABLED. In progress
HPE Missing mpi_f08 module A Fortran code that uses the module fails to compile with PrgEnv-nvidia. In progress
HPE User-defined reduction code segfaults with Intel compiler A Fortran code using the mpi_f08 interface of the MPI_User_function fails to compile due to a problem in the mpi_f08_callbacks module. (The title refers to runtime behavior of an example code in the MPI standard manual when the bug was initially reported. A corrected version shown in later releases of the manual fails to compile.) In progress

Past Issues

For updates on past issues affecting Perlmutter, see the timeline page.

Selected Resolved Vendor Bug Reports

Vendor Title Description Status
Linaro Problem with srun -c flag over multiple nodes Forge fails to connect to a large per-node tasks when using multiple nodes. A provided workaround implemented; fixed in a future release
HPE MPI_Allreduce gives incorrect result for data size over 2GB The MPI_Allreduce function generates incorrect numerical results when the data size is over 2GB. A workaround provided (export MPICH_ALLREDUCE_NO_SMP=1); fixed in cray-mpich/8.1.27