Current known issues


Please visit the timeline page for more information about changes we've made in our recent upgrades.

NERSC has automated monitoring that tracks failed nodes, so please only open tickets for node failures if the node consistently has poor performance relative to other nodes in the job or if the node repeatedly causes your jobs to fail.

New issues

Ongoing issues

  • Due to changes in the SLES SP4 system libraries, conda environments built or invoked without the NERSC-provided python module may need changes. Users may see errors like ImportError: /usr/lib64/ undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d. Please see our Perlmutter python documentation for more information.
  • Shifter users may see errors about BIND MOUNT FAILED if they attempt to volume mount directories that are not world executable. We have some workarounds for this issue.
  • Users may notice MKL-based CPU code runs more slowly. Please try module load fast-mkl-amd.
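
For the Shifter BIND MOUNT FAILED item above, a quick way to see which directories lack the world-execute bit may help before trying a workaround. The following is a minimal sketch, not a NERSC or Shifter tool: the function name find_non_world_exec is hypothetical, it assumes GNU stat and realpath are available, and it assumes (this page does not say so) that parent directories matter as well as the mounted directory itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Shifter or any NERSC tooling):
# print every directory from the target up to / that is missing the
# world-execute ("other" x) permission bit.
# Returns 0 if none are missing, 1 if any are, 2 on lookup errors.
find_non_world_exec() {
    local dir mode status=0
    dir=$(realpath -- "$1") || return 2
    while :; do
        # GNU stat: %A prints the symbolic mode, e.g. drwxr-x--x
        mode=$(stat -c '%A' -- "$dir") || return 2
        # The 10th character is the "other" execute position.
        case "${mode:9:1}" in
            x|t) ;;                        # executable (t = sticky + execute)
            *) echo "$dir"; status=1 ;;    # missing o+x
        esac
        [ "$dir" = "/" ] && break
        dir=$(dirname -- "$dir")
    done
    return "$status"
}
```

Calling find_non_world_exec on the directory you intend to volume mount lists the directories that would need attention; an empty result means every directory on the path is world executable.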

Selected Bug Reports Filed with Vendors

Updated on April 24, 2024.

| Vendor | Title | Description | Status |
|---|---|---|---|
| HPE | crayftn ICE when trying to build dftd4 | An internal compiler error (ICE) occurs when building dftd4. | In progress |
| HPE | crayftn ICE on WHERE statement with defined assignment | An ICE occurs when the compiler encounters a WHERE statement that uses defined assignment. | In progress |
| HPE | crayftn MERGE Function call ICE when the MASK argument is a derived type component | An ICE occurs on a call to the intrinsic MERGE when the MASK argument is a component of the derived type argument to a type bound procedure and the result of the MERGE call is then passed to the intrinsic TRIM. | In progress |
| HPE | sanitizers4hpc's output aggregation with ThreadSanitizer | ThreadSanitizer output aggregation needs improvement. | In progress |
| HPE | sanitizers4hpc with Compute Sanitizer's memcheck produces output that is not aggregated | Compute Sanitizer output aggregation needs improvement. | In progress |
| HPE | sanitizers4hpc produces stack traces for 'Program hit CUDA_ERROR_INVALID_VALUE error' | The error message appears when sanitizers4hpc is used with Compute Sanitizer, although the desired output is produced. | In progress |
| HPE | No source line number displayed when run with MemorySanitizer in PrgEnv-cray | No information is provided about where in the source code an error occurs or where the memory was allocated. The same problem was later seen with PrgEnv-intel. | In progress |
| HPE | disable_sanitizer_instrumentation attribute doesn't work with PrgEnv-aocc | The attribute doesn't work although the compilers are Clang-based. | In progress |
| HPE | Error occurs when MPI window object is not freed | Error messages report a fatal MPI finalize error when an MPI window object is not freed before MPI_Finalize. | A fix in cray-mpich/ |
| HPE | CCE 17.0.0 Fortran compiler fails four smart-pointers tests | The Cray Fortran compiler fails four tests in the Smart-Pointers test suite. | In progress |
| HPE | crayftn runtime error with user defined operator on associate name | A segmentation fault occurs in a code built with the Cray Fortran compiler when calling a user defined operator on a name associated with a function/expression result. | In progress |
| HPE | Valid coarray code rejected by crayftn | A coarray code is incorrectly rejected with errors by the Cray Fortran compiler. | In progress |
| HPE | Incorrect results and poor performance with do concurrent reduction | A code that does a do concurrent reduce operation gives incorrect results when built with the Cray Fortran compiler. When compiled with the -h thread_do_concurrent flag, the code shows poor performance. | In progress |
| Open MPI | TCP BTL fails to collect all interface addresses (when interfaces are on different subnets) | Multi-node Open MPI point-to-point communications using the tcp BTL component fail because, although one NIC on a Perlmutter node has two IP interfaces for different subnets (one private and one public), only one IP is used per peer kernel interface. | In progress |
| Nvidia | CP2K container builds with Open MPI with networking bug | The CP2K container images on NGC that NERSC suggests using were built with old Open MPI versions (4.x), and bugs there contribute to multi-node job failures. New images with Open MPI 5.x built against libfabric have been requested. | In progress |
| HPE | cray-mpich module does not set LD_LIBRARY_PATH | Loading the module doesn't update the environment variable, so this has to be done manually. | In progress |
| Nvidia | OpenACC reduction with worker gives wrong answers | A procedure declared with an OpenACC routine worker directive returns wrong reduction values in the PrgEnv-nvidia environment when called from within a loop where num_workers and vector_length are set to 32. | In progress |
| HPE | RMA performance problems on Perlmutter with GASNet Codes | The GASNet-EX networking library implements RMA (Remote Memory Access) APIs with the fi_read() and fi_write() functions of the vendor-provided libfabric and its cxi provider. RMA operations perform very well under ideal conditions. When conditions are not ideal, performance decreases significantly for both host and GPU memory. | In progress |
| HPE | crayftn overloaded constructor with polymorphic argument in array constructor | The Cray Fortran compiler generates an internal compiler error for a code that passes a child type to an overloaded structure constructor within an array constructor, where the parent type has a deferred procedure. | In progress |
| HPE | crayftn bug in assignment to unlimited polymorphic variable | The Cray Fortran compiler doesn't allow assignment of an expression to an unlimited polymorphic allocatable variable. | In progress |
| HPE | Performance issue with fi_write() to GPU memory on Perlmutter | The GASNet-EX networking library implements RMA APIs with the vendor-provided libfabric and its cxi provider. RMA Put operations between two GPU nodes show much lower performance than MPI when the destination address is in remote GPU memory. For other source/destination memory and Put/Get mode combinations, the GASNet-EX and MPI benchmarks show similar performance or GASNet-EX performs better. | In progress |
| HPE | Internal Compiler Error | An internal compiler error occurs when compiling the E3SM code with the AMD compilers. | In progress |
| HPE | Issues when linking with PrgEnv-nvidia and cuFFTMp | An undefined reference to MPI_Comm_f2c is reported at link time. | In progress |
| HPE | cray-mpich with GTL not recognising pointer to device memory, that was returned by OpenCL clSVMAlloc | When a pointer returned by the OpenCL clSVMAlloc function is used in one-sided MPI communication, the communication does not get the correct data. A workaround of wrapping the MPI RMA exposure epoch in clEnqueueSVMMap/clEnqueueSVMUnmap causes a large amount of data to be moved unnecessarily between host and device memory. Advice has been requested on using OpenCL with MPICH_GPU_SUPPORT_ENABLED. | In progress |
| HPE | Missing mpi_f08 module | A Fortran code that uses the module fails to compile with PrgEnv-nvidia. | In progress |
| HPE | nvfortran does not support the VALUE attribute for arrays which are not assumed-size | | In progress |
| HPE | nvfortran does not support intrinsic elemental functions BGE, BGT, BLE, BLT | | In progress |
| HPE | nvfortran does not support intrinsic elemental functions DSHIFTL, DSHIFTR | | In progress |
| HPE | nvfortran does not support function references in a variable definition context | | In progress |
| HPE | nvfortran does not support intrinsic assignment to allocatable polymorphic variables | | In progress |
| HPE | nvfortran does not support %RE and %IM complex-part-designators in variables of COMPLEX type | | In progress |
| HPE | User-defined reduction code segfaults with Intel compiler | A Fortran code using the mpi_f08 interface of the MPI_User_function fails to compile due to a problem in the mpi_f08_callbacks module. (The title refers to the runtime behavior of an example code in the MPI standard manual when the bug was initially reported. A corrected version shown in later releases of the manual fails to compile.) | In progress |
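
For the report above that loading cray-mpich does not set LD_LIBRARY_PATH, the manual update might look like the following sketch. It rests on an assumption not stated on this page: that the Cray programming environment exports a CRAY_LD_LIBRARY_PATH variable listing the relevant library directories. The helper name prepend_ld_library_path is hypothetical; if that variable is not set on your system, pass the library directory you need explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend each directory from a colon-separated list
# to LD_LIBRARY_PATH, skipping empty entries and ones already present.
prepend_ld_library_path() {
    local new="${LD_LIBRARY_PATH:-}" entry i
    local -a entries
    IFS=':' read -r -a entries <<< "$1"
    # Walk the list backwards so the original order is kept after prepending.
    for (( i=${#entries[@]}-1; i>=0; i-- )); do
        entry=${entries[i]}
        [ -z "$entry" ] && continue
        case ":$new:" in
            *":$entry:"*) ;;                    # already on the path
            *) new="$entry${new:+:$new}" ;;     # prepend
        esac
    done
    export LD_LIBRARY_PATH="$new"
}

# Assumption: CRAY_LD_LIBRARY_PATH is exported by the loaded Cray PE modules.
# If it is unset, this block does nothing.
if [ -n "${CRAY_LD_LIBRARY_PATH:-}" ]; then
    prepend_ld_library_path "$CRAY_LD_LIBRARY_PATH"
fi
```

Skipping entries that are already present keeps the variable from growing when the snippet is sourced more than once, for example from both a login script and a batch script.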

Past Issues

For updates on past issues affecting Perlmutter, see the timeline page.

Selected Resolved Vendor Bug Reports

| Vendor | Title | Description | Status |
|---|---|---|---|
| Nvidia | Sudo permission issue for cuquantum-appliance:23.10 container | Users don't have permission to access the /home/cuquantum directory in a container. | Fixed in neilmehta87/cuquantum-appliance:23.10 |
| HPE | NCCL workload hitting node failures from network link flaps | Certain apps (not limited to those using the NCCL library, as it turned out) may expose a system condition as a link flap, making them susceptible to a job failure. | Resolved |
| HPE | No scope analysis window pops up with Perftools/Reveal | The 'Scope Loop' button on the Reveal tool doesn't open a window that would normally show the scoping result for a selected loop. | Fixed; version 23.12 works |
| HPE | OFI segfault, and intermittent loss of messages with GASNet | The segfault problem has since been fixed. Applications occasionally hang, possibly due to loss of messages sent with fi_send() to be received in buffers posted using fi_recvmsg(). This has been observed with the cxi-provider libfabric. A suggested workaround of setting certain environment variables doesn't appear to be fully effective and wastes a large amount of memory. | Resolved |