Skip to content

Current known issues

Perlmutter

Please visit the timeline page for more information about changes we've made in our recent upgrades.

NERSC has automated monitoring that tracks failed nodes, so please only open tickets for node failures if the node consistently has poor performance relative to other nodes in the job or if the node repeatedly causes your jobs to fail.

New issues

Ongoing issues

  • Due to changes in the SLES SP4 system libraries, changes may be required for conda environments built or invoked without using the NERSC provided python module. Users may see errors like ImportError: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d. Please see our Perlmutter python documentation for more information.
  • Shifter users may see errors about BIND MOUNT FAILED if they attempt to volume mount directories that are not world executable. We have some workarounds for this issue.
  • Users may notice MKL-based CPU code runs more slowly. Please try module load fast-mkl-amd.

Profiling with hardware counters

NVIDIA Data Center GPU Manager (dcgm) is a light weight tool to measure and monitor GPU utilization and comprehensive diagnostics of GPU nodes on a cluster. NERSC will be using this tool to measure application utilization and monitor the status of the machine. Due to current hardware limitations, collecting profiling metrics using performance tools such as Nsight-Compute, TAU, HPCToolkit applications that require acess to hardware counters will conflict with the DCGM instance running on the system.

To invoke performance collection with ncu one must add dcgmi profile --pause / --resume to your scripts (this script will work for single node or multiple node runs):

srun --ntasks-per-node 1 dcgmi profile --pause
srun <Slurm flags> ncu -o <filename> <other Nsight Compute flags> <program> <program arguments>
srun --ntasks-per-node 1 dcgmi profile --resume

Running profiler on multiple nodes

The DCGM instance on each node must be paused before running the profiler. Please note that you should only use 1 task to pause the dcgm instance as shown above.

Past Issues

For updates on past issues affecting Perlmutter, see the timeline page.