Current known issues

Cori login nodes degraded

Since the afternoon of October 9, 2020, some users have experienced intermittent difficulty with SSH connections to Cori login nodes. The issue is caused by certain processes hanging while interacting with the scratch filesystem, which leaves the node unable to return a timely login prompt.

NERSC is closely monitoring nodes for evidence of this problem and will reboot affected nodes as they are discovered. This is a temporary workaround until the root cause is identified and a fix is in place.

Users on impacted nodes will be notified via `wall` to give advance notice of a reboot when possible. The reboot will log out users and kill any running processes on that node (including xfer jobs).

The Cori load balancer has been reconfigured to route around these nodes once they are detected. If you experience this issue, please log out and log in again; it may take a few minutes for the problem to register and the load balancer to update.

Cori is back in service; mitigations for cscratch1 error conditions

Update: As of 2:48pm (PT) Thursday Oct 8, Cori is available for normal production use.

We believe the conditions that triggered the cscratch1 filesystem crash relate to striping files over many Lustre OSTs (see the notes about Lustre striping).

Warning

Please refrain from running jobs that use files striped over many OSTs. If you are using custom striping, limit the stripe count to 72.

Tip

The default striping of 1 is not expected to trigger the error.

Note

NERSC provides helper scripts stripe_small, stripe_medium and stripe_large to set striping appropriately for different sized files. None of these settings are expected to trigger the problem.
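As a sketch of applying this guidance, the commands below check and cap a directory's stripe count. `lfs getstripe` and `lfs setstripe` are standard Lustre commands (so they only work on a Lustre filesystem such as cscratch1), `stripe_medium` is one of the NERSC helper scripts mentioned above, and the directory path is an example, not a real location:

```shell
# Check the current stripe count of a directory (example path).
lfs getstripe -c $CSCRATCH/my_output_dir

# Cap the stripe count at 72 for files subsequently created in this
# directory, per the mitigation guidance above.
lfs setstripe -c 72 $CSCRATCH/my_output_dir

# Alternatively, let a NERSC helper script choose a safe stripe count:
stripe_medium $CSCRATCH/my_output_dir
```

Note that `lfs setstripe` on a directory affects only files created there afterwards; existing files keep their current striping.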

Danger

Until we have identified and fixed the root cause, a risk remains that the crash could recur. We remind users to be diligent about keeping a second copy of critical data.

The special test queues have been deleted, but users can still see jobs that were submitted to or ran in these queues with sacct, e.g. `sacct -q test`.

The original crash damaged some cscratch1 metadata; see "Possible impacts of the cscratch1 crash" below for details and recovery guidance.

Cori cscratch1 incident history

On Thursday Sept 24, 2020, at 11:55am PDT, a crash of the cscratch1 metadata server damaged the metadata server's filesystem journal and required a complete check and repair of scratch. This was a long process, and Cori was unavailable until shortly after 7pm on Friday evening.

On Sunday Sept 27 at 12:02pm PDT we experienced a similar crash. Starting Wednesday evening, Sept 30, Cori was placed in a non-production "debug" mode with the goal of reproducing and mitigating the cscratch1 crash without damaging files already on cscratch1.

On Friday Oct 2 at about 8:15am PDT, the cscratch1 crash was reproduced. Cori continued to operate in the special debug mode, but with cscratch1 unavailable everywhere while the filesystem was checked and repaired. From the collected data, NERSC was able to create a synthetic workload that quickly reproduces the error.

On Sunday Oct 4, around 5pm, we were able to reproduce the cscratch1 crash using the synthetic workload alongside workloads provided by NERSC users. Following this crash, cscratch1 remained unavailable from Monday through Wednesday.

On the evening of Wednesday Oct 7, Cori was removed from service and prepared for a return to normal production use, with mitigations in place for the conditions that we believe triggered the filesystem crash.

At 2:48pm PDT on Thursday Oct 8 Cori was returned to normal service.

Accessing jobs and data from the special debug period

During the debug period a directory $CSCRATCH/test_20200930/ was created for each user, and users were advised not to write to any cscratch1 location outside that directory. The directory still exists and users are free to access the data inside it. Note that filesystem crashes during the special debug period may have damaged directory information and file metadata (names, access times, locations, etc.) inside that directory, which could cause files to appear "lost".

The special queues created during the debug period have been removed, but job information is still available via sacct:

| For jobs submitted to: |  Use                                                              |
|------------------------|-------------------------------------------------------------------|
| `-q test`              | `sacct -S 2020-09-30 -E 2020-10-07 -q test`                       |
| `-q jgitest`           | `sacct -S 2020-09-30 -E 2020-10-07 -q jgitest`                    |
| `-q jgitest_shared`    | `sacct -S 2020-09-30 -E 2020-10-07 -q jgitest`                    |
| `-q test_login`        | `module load esslurm ; sacct -S 2020-09-30 -E 2020-10-07 -q test` |
| `-q test_cmem`         | `module load esslurm ; sacct -S 2020-09-30 -E 2020-10-07 -q test` |

Possible impacts of the cscratch1 crash

The original crash damaged some cscratch1 metadata, such as directory information and file access times. The actual data on cscratch1 OSTs was not damaged, and files on $HOME or CFS were also not affected. Users may see errors when accessing a damaged directory, or notice some files missing from that time. In many cases these missing files are recoverable; please open a ticket at https://help.nersc.gov for assistance.

Files on cscratch1 that were being read at the time of the crash may have had their access time reset to zero, i.e. Jan 1, 1970 (the Unix epoch). We will suspend purging of cscratch1 while we work to resolve these access-time issues.
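One way to check whether files in a directory were affected is to ask GNU find for files whose access time is at (or before) the epoch. A minimal, self-contained sketch, in which `SCAN_DIR` and the two demo files are stand-ins (on Cori you would point `SCAN_DIR` at a cscratch1 directory and skip the demo setup):

```shell
# Sketch: find files whose access time was reset to the Unix epoch.
# SCAN_DIR and the demo files are stand-ins for a cscratch1 directory.
SCAN_DIR="${SCAN_DIR:-$(mktemp -d)}"
touch -a -d "1970-01-01 00:00:00 UTC" "$SCAN_DIR/affected_file"  # demo only
touch "$SCAN_DIR/normal_file"                                    # demo only
# GNU find: '! -newerat DATE' matches files whose access time is NOT
# newer than DATE, i.e. here, access times at or near the epoch.
find "$SCAN_DIR" -type f ! -newerat "1970-01-02 00:00:00" -print
```

The find command prints only `affected_file`; `normal_file` has a current access time and is excluded.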

Filesystem operations not involving cscratch1 (such as reading or writing to $HOME or the Community File System) will not have been affected.

Getting help

If you experience difficulties not mentioned here, please let us know by opening a ticket at https://help.nersc.gov

Particle counts due to California fires may limit use of HPSS tape

We may need to pause the HPSS tape libraries due to high particle counts from the nearby fires, which would mean that new data cannot be written to tape. If you urgently need to back up data from Cori scratch, please use the Community File System or copy it to another site instead.
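A minimal sketch of such a backup using plain `cp`. The paths and demo file here are illustrative stand-ins so the snippet is runnable anywhere; on Cori, the source would live under $CSCRATCH and the destination under your project's Community File System directory:

```shell
# Illustrative backup of a directory tree; paths are demo stand-ins.
SRC="${SRC:-$(mktemp -d)/results}"    # stands in for a $CSCRATCH directory
DEST="${DEST:-$(mktemp -d)/backup}"   # stands in for a CFS project directory
mkdir -p "$SRC"
echo "sample data" > "$SRC/data.txt"  # demo content only
mkdir -p "$DEST"
# -a preserves permissions and timestamps; "$SRC/." copies the contents.
cp -a "$SRC/." "$DEST/"
ls "$DEST"
```

For large transfers, a resumable tool such as rsync or Globus may be preferable to a single `cp`, since an interrupted copy can then be restarted where it left off.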