Using Perlmutter¶
Current Known Issues¶
For updates on past issues, see the timeline page.
Access¶
Perlmutter is now available to general users. All users with an active NERSC account have been enabled to log into Perlmutter. Please follow the steps below to log into the system. If you wish to obtain a NERSC account, please visit our accounts page for an overview of the kinds of allocations and user accounts available.
Connecting to Perlmutter¶
You can connect directly to Perlmutter with
ssh perlmutter.nersc.gov
or
ssh saul.nersc.gov
You can also connect to a Data Transfer Node (DTN) and then connect to Perlmutter with ssh perlmutter.
Connecting to Perlmutter with sshproxy¶
If you have an ssh key generated by sshproxy, you can configure your local computer's ~/.ssh/config
file as suggested in the webpage section SSH Configuration File Options.
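As an illustration, a minimal ~/.ssh/config entry might look like the following; the Host alias, username, and key path are placeholders and should be adjusted to match your own sshproxy setup:
# Illustrative entry only; adjust User and IdentityFile to your own setup
Host perlmutter
    HostName perlmutter.nersc.gov
    User <your_nersc_username>
    IdentityFile ~/.ssh/nersc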
Connecting to Perlmutter with a Collaboration Account¶
Collabsu is not available on Perlmutter. Please use the direct login functionality of sshproxy to log into Perlmutter.
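As a sketch, once you have obtained an ssh key for the collaboration account via sshproxy, you can log in directly with standard ssh options; the key path and account name below are placeholders:
# Hypothetical example: key path and collaboration account name are placeholders
ssh -i ~/.ssh/nersc -l <collab_account> perlmutter.nersc.gov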
Using the Community File System, Global Homes, and Global Common¶
Perlmutter mounts the Community File System directly at /global/cfs/cdirs. Files on CFS are available from the system as they have been on previous systems.
User home directories on NERSC systems are available globally. Similar to CFS, files in the global homes file system are readily available on Perlmutter. There is no need to transfer data in home directories to Perlmutter.
Perlmutter also mounts the global common file system for user-installed software. Files in that file system are available on Perlmutter just as they are on other NERSC systems.
Transferring Data to / from Perlmutter Scratch¶
Perlmutter scratch is only accessible from Perlmutter login or compute nodes.
NERSC has set up a dedicated Globus Endpoint on Perlmutter that has access to Perlmutter Scratch as well as the Community and Homes File Systems at NERSC. This is the recommended way to transfer large volumes of data to/from Perlmutter scratch.
Alternatively, for small transfers you can use scp on a Perlmutter login node.
Larger datasets can also be staged on the Community File System (which is mounted on Perlmutter) using Globus, cp, or rsync on a Data Transfer Node. Once the data is on the Community File System, you can use cp or rsync on a Perlmutter login node to copy the data to Perlmutter scratch.
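For instance, a small direct transfer and a staged transfer might look like the following; all paths, hostnames shown for the DTN, and the username are placeholders:
# Small transfer directly to Perlmutter scratch (paths and username are placeholders)
scp my_input.dat <user>@perlmutter.nersc.gov:/pscratch/sd/<first_letter>/<user>/

# Stage a larger dataset on the Community File System via a Data Transfer Node...
rsync -av my_dataset/ <user>@dtn01.nersc.gov:/global/cfs/cdirs/<project>/my_dataset/

# ...then, from a Perlmutter login node, copy it to Perlmutter scratch
rsync -av /global/cfs/cdirs/<project>/my_dataset/ $PSCRATCH/my_dataset/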
Preparing for Perlmutter¶
Please see Migrating to Perlmutter for information and tips about transitioning your workflow from Cori to Perlmutter.
Please check the Transitioning Applications to Perlmutter webpage for a wealth of useful information on how to transition your applications to Perlmutter.
Compiling/Building Software¶
Running Jobs¶
Perlmutter uses Slurm for batch job scheduling.
- Lists of available queues as well as their time and node limits can be found on our queue policies on Perlmutter page
- Find example job scripts on our running jobs on Perlmutter's GPU nodes page
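As a minimal illustration, a batch script for a single GPU node might look like the following; the account name and application are placeholders, and current limits should be checked on the queue policies page:
#!/bin/bash
#SBATCH -A <account>          # placeholder project/account name
#SBATCH -C gpu                # request GPU nodes
#SBATCH -q regular            # queue; see the queue policies page for limits
#SBATCH -t 00:30:00           # walltime
#SBATCH -N 1                  # one node
#SBATCH --ntasks-per-node=4   # one task per GPU
#SBATCH --gpus-per-node=4     # Perlmutter GPU nodes have 4 GPUs

srun ./my_gpu_app             # placeholder application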
Below is general information on how to submit and monitor jobs with Slurm:
Known issues¶
Profiling with hardware counters¶
NVIDIA Data Center GPU Manager (DCGM) is a lightweight tool for measuring and monitoring GPU utilization and for running comprehensive diagnostics of GPU nodes on a cluster. NERSC uses this tool to measure application utilization and monitor the status of the machine. Due to current hardware limitations, collecting profiling metrics with performance tools such as Nsight Compute, TAU, and HPCToolkit, which require access to hardware counters, will conflict with the DCGM instance running on the system.
To invoke performance collection with ncu, add dcgmi profile --pause / --resume commands to your job script (this works for both single-node and multi-node runs):
srun --ntasks-per-node 1 dcgmi profile --pause
srun <Slurm flags> ncu -o <filename> <other Nsight Compute flags> <program> <program arguments>
srun --ntasks-per-node 1 dcgmi profile --resume
Running the profiler on multiple nodes
The DCGM instance on each node must be paused before running the profiler. Please note that you should use only one task per node to pause the DCGM instance, as shown above.
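Putting these pieces together, a sketch of a batch script that pauses DCGM, profiles with Nsight Compute, and then resumes DCGM might look like the following; the account, node count, task count, output file name, and application are placeholders:
#!/bin/bash
#SBATCH -A <account>          # placeholder project/account name
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 00:30:00
#SBATCH -N 2

# Pause the DCGM instance on every node (one task per node)
srun --ntasks-per-node 1 dcgmi profile --pause

# Profile the application with Nsight Compute (output file name is a placeholder)
srun -n 8 --gpus-per-node=4 ncu -o my_profile ./my_gpu_app

# Resume DCGM monitoring
srun --ntasks-per-node 1 dcgmi profile --resume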