Current known issues
Perlmutter
Please visit the timeline page for more information about changes we've made in our recent upgrades.
NERSC has automated monitoring that tracks failed nodes, so please only open tickets for node failures if the node consistently has poor performance relative to other nodes in the job or if the node repeatedly causes your jobs to fail.
New issues
Ongoing issues
- Due to changes in the SLES SP4 system libraries, changes may be required for conda environments built or invoked without using the NERSC-provided `python` module. Users may see errors like `ImportError: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d`. Please see our Perlmutter python documentation for more information.
- Shifter users may see errors about `BIND MOUNT FAILED` if they attempt to volume mount directories that are not world executable. We have some workarounds for this issue.
- Users may notice MKL-based CPU code runs more slowly. Please try `module load fast-mkl-amd`.
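For the Shifter item above, it can help to check a directory's permission bits before using it in a volume mount. The following is a minimal sketch, not an official NERSC tool: the function name `is_world_executable` is hypothetical, it assumes GNU `stat` (as found on Linux systems), and it inspects only the directory itself, not its parent directories. It does not replace the documented workarounds.

```shell
# Hypothetical helper: succeed if the "other" permission bits of a path
# include execute (x). Assumes GNU stat, which supports -c '%a'.
is_world_executable() {
    local mode
    # Octal permission string, for example 755 or 750.
    mode="$(stat -c '%a' "$1")" || return 2
    # The last octal digit holds the "other" bits; an odd digit means x is set.
    case "$mode" in
        *[1357]) return 0 ;;
        *)       return 1 ;;
    esac
}
```

A possible use is `is_world_executable /path/to/dir || echo "not world executable"`, where `/path/to/dir` is a placeholder for the directory you intend to mount.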
Profiling with hardware counters
NVIDIA Data Center GPU Manager (DCGM) is a lightweight tool for measuring and monitoring GPU utilization and for running comprehensive diagnostics on the GPU nodes of a cluster. NERSC will be using this tool to measure application utilization and monitor the status of the machine. Due to current hardware limitations, collecting profiling metrics with performance tools that require access to hardware counters, such as Nsight Compute, TAU, and HPCToolkit, will conflict with the DCGM instance running on the system.
To invoke performance collection with `ncu`, add `dcgmi profile --pause` and `dcgmi profile --resume` to your scripts around the profiling step (this pattern works for both single-node and multi-node runs):
srun --ntasks-per-node 1 dcgmi profile --pause
srun <Slurm flags> ncu -o <filename> <other Nsight Compute flags> <program> <program arguments>
srun --ntasks-per-node 1 dcgmi profile --resume
Running profiler on multiple nodes
The DCGM instance on each node must be paused before running the profiler. Use only one task per node to pause the DCGM instance, as shown above.
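If the profiling step fails partway through a script, the resume command may never run. The sketch below wraps the pause, profile, and resume sequence in a shell function so that DCGM is resumed regardless of the profiler's exit status. The function name `run_with_dcgm_paused` and the `SRUN` variable are hypothetical names introduced for illustration; only the `dcgmi profile --pause` and `dcgmi profile --resume` invocations come from the instructions above.

```shell
# SRUN is a hypothetical override so the sequence can be dry-run
# (for example with a stub command) outside of a Slurm allocation.
SRUN="${SRUN:-srun}"

# Hypothetical wrapper: pause DCGM, run the given command, then resume DCGM.
run_with_dcgm_paused() {
    local status
    # Pause the DCGM instance using exactly one task per node.
    $SRUN --ntasks-per-node 1 dcgmi profile --pause
    # Run the caller's profiling command and remember its exit status.
    "$@"
    status=$?
    # Resume DCGM even if the profiling command failed.
    $SRUN --ntasks-per-node 1 dcgmi profile --resume
    return "$status"
}
```

Inside a batch script, this could be called as `run_with_dcgm_paused srun <Slurm flags> ncu -o <filename> <other Nsight Compute flags> <program> <program arguments>`, with the placeholders filled in as for the commands shown earlier.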
Past Issues
For updates on past issues affecting Perlmutter, see the timeline page.