Monitoring Jobs

Continuously running squeue or sqs (for example with watch), and especially running multiple instances of "watch squeue" or "watch sqs", is not allowed. When many users do this at once, it degrades the performance of the job scheduler, which is a shared resource.

If you must monitor your workload, run only a single instance of squeue or sqs. If watch is essential to your workflow, limit the refresh interval to 1 minute (watch -n 60) and be sure to terminate the process when you are not actively using it.
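As an alternative to leaving watch running, the advice above can be sketched as a bounded polling loop. This is a minimal sketch, not a NERSC-provided tool: the function names clamp_interval and poll_queue are hypothetical, and treating 60 seconds as a hard minimum is an assumption based on the guidance above.

```shell
# Hypothetical helpers (not part of Slurm or NERSC tooling).

# Clamp a requested refresh interval (seconds) to a 60-second minimum.
clamp_interval() {
    local requested
    requested=${1:-60}
    if [ "$requested" -lt 60 ]; then
        echo 60
    else
        echo "$requested"
    fi
}

# Run squeue for the current user a bounded number of times, then stop,
# so the loop cannot be left running indefinitely.
# Usage: poll_queue [interval_seconds] [max_polls]
poll_queue() {
    local interval max_polls i
    interval=$(clamp_interval "${1:-60}")
    max_polls=${2:-30}
    i=1
    while [ "$i" -le "$max_polls" ]; do
        squeue -u "$USER"
        # Skip the sleep after the final poll.
        if [ "$i" -lt "$max_polls" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

For example, poll_queue 120 10 would query the queue ten times at two-minute intervals and then exit on its own.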

Additionally, the sacct command (sacct -X -s pd,r) uses less expensive queries for much of the same information, but the same advice about watch applies.
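A single sacct query is often enough to answer "how many of my jobs are pending or running?". The sketch below assumes sacct is run with its --noheader and --parsable2 options so the output is pipe-separated and easy to parse; count_states is a hypothetical helper name, not an existing command.

```shell
# Hypothetical helper: summarize job states from pipe-separated sacct output.
# Reads lines of the form JobID|State on stdin and prints "STATE COUNT".
count_states() {
    awk -F'|' 'NF >= 2 { count[$2]++ }
               END { for (s in count) print s, count[s] }' | sort
}

# Intended use on a system with Slurm (one query, no watch loop):
#   sacct -X -s pd,r --noheader --parsable2 --format=JobID,State | count_states
```

The output is sorted by state name because awk does not guarantee any order when iterating over an array.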

For users who are interested in monitoring their job's resource usage while the job is running, NERSC provides the ssh_job command.

Monitoring Tools

sacct

sacct is used to report job or job step accounting information about active or completed jobs.

For a complete list of sacct options please refer to the sacct manual or run man sacct.

You can invoke sacct without any arguments, and it will show jobs for the current user.

$ sacct      
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
31170188             sh interacti+     nstaff        272     FAILED      1:0 
31170188.ex+     extern                nstaff        272  COMPLETED      0:0 
31170188.0         bash                nstaff          1     FAILED      1:0 
31170188.1        a.out                nstaff        272  COMPLETED      0:0 
31171781             sh       resv     nstaff        272  COMPLETED      0:0 
31171781.ex+     extern                nstaff        272  COMPLETED      0:0 
31171781.0         bash                nstaff          1  COMPLETED      0:0 
31172253             sh       resv     nstaff        272    TIMEOUT      0:0 
31172253.ex+     extern                nstaff        272  COMPLETED      0:0 
31172253.0         bash                nstaff          1  COMPLETED      0:0 
31172376             sh       resv     nstaff        272  COMPLETED      0:0 
31172376.ex+     extern                nstaff        272  COMPLETED      0:0 
31172376.0         bash                nstaff          1  COMPLETED      0:0 
31172552             sh       resv     nstaff        272    TIMEOUT      0:0 
31172552.ex+     extern                nstaff        272  COMPLETED      0:0 
31172552.0         bash                nstaff          1  COMPLETED      0:0 
31173331             sh       resv     nstaff        272    TIMEOUT      0:0 
31173331.ex+     extern                nstaff        272  COMPLETED      0:0 
31173331.0         bash                nstaff          1  COMPLETED      0:0 

You can choose which columns to display with the --format option. For example, to show the User, JobName, State, and Submit columns:

$ sacct --format=User,JobName,State,Submit
     User    JobName      State              Submit 
--------- ---------- ---------- ------------------- 
 siddiq90         sh     FAILED 2020-05-27T07:49:18 
              extern  COMPLETED 2020-05-27T07:49:18 
                bash     FAILED 2020-05-27T07:49:41 
               a.out  COMPLETED 2020-05-27T07:52:31 
 siddiq90         sh  COMPLETED 2020-05-27T08:28:34 
              extern  COMPLETED 2020-05-27T08:28:34 
                bash  COMPLETED 2020-05-27T08:28:42 
 siddiq90         sh    TIMEOUT 2020-05-27T08:51:43 
              extern  COMPLETED 2020-05-27T08:51:43 
                bash  COMPLETED 2020-05-27T08:51:52 
 siddiq90         sh  COMPLETED 2020-05-27T08:57:06 
              extern  COMPLETED 2020-05-27T08:57:07 
                bash  COMPLETED 2020-05-27T08:57:19 
 siddiq90         sh    TIMEOUT 2020-05-27T09:00:16 
              extern  COMPLETED 2020-05-27T09:00:31 
                bash  COMPLETED 2020-05-27T09:00:46 
 siddiq90         sh    TIMEOUT 2020-05-27T09:19:41 
              extern  COMPLETED 2020-05-27T09:19:41 
                bash  COMPLETED 2020-05-27T09:19:54 
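By default sacct truncates values that do not fit a column and marks them with a trailing +, as with interacti+ in the earlier listings. Each field in --format accepts an optional %N suffix that sets the column width. The wrapper below is a minimal sketch: the name sacct_wide and the chosen widths are illustrative assumptions, not NERSC defaults.

```shell
# Hypothetical wrapper: request wider columns so long values are not cut off.
# The %N suffix on a field name sets that column's width in characters.
# Any extra arguments (for example -j or -u) are passed through to sacct.
sacct_wide() {
    sacct --format="JobID%20,JobName%30,Partition%15,State%12,ExitCode" "$@"
}
```

For example, sacct_wide -j 31170188 would show the same job as a plain sacct -j query, with room for the full partition and job names.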

We can retrieve historical data for any given user. For example, to list jobs for user tbrewer between a start time of 2020-05-20 and an end time of 2020-05-27:

$ sacct -u tbrewer -S 2020-05-20 -E 2020-05-27
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
30922369     test_node+     system     mpcray       8256    TIMEOUT      0:0 
30922369.ba+      batch                mpcray         64  CANCELLED     0:15 
30922369.ex+     extern                mpcray       8256  COMPLETED      0:0 
30922369.0   test_node+                mpcray        320     FAILED      1:0 
30922369.1   test_node+                mpcray       1904     FAILED      1:0 
30922378     test_node+     system     mpcray         38    PENDING      0:0 

To filter output by job ID, use the -j option with a comma-separated list of job IDs.

$ sacct -j 31170188,31171781
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
31170188             sh interacti+     nstaff        272     FAILED      1:0 
31170188.ex+     extern                nstaff        272  COMPLETED      0:0 
31170188.0         bash                nstaff          1     FAILED      1:0 
31170188.1        a.out                nstaff        272  COMPLETED      0:0 
31171781             sh       resv     nstaff        272  COMPLETED      0:0 
31171781.ex+     extern                nstaff        272  COMPLETED      0:0 
31171781.0         bash                nstaff          1  COMPLETED      0:0 

To view pending jobs for a particular user, use the -s option (long form --state) with the state name. For a complete list of job states, see JOB STATE CODES in the sacct manual.

$ sacct -u tbrewer --format=User,JobName,State -s PENDING
     User    JobName      State 
--------- ---------- ---------- 
  tbrewer test_node+    PENDING 

sqs

sqs is used to view information about jobs managed by Slurm. It is a custom script provided by NERSC that incorporates information from several sources.

Note

See sqs --help for details about all available options, and man squeue for information about job state and reason codes.

$ sqs
JOBID   ST  USER   NAME         NODES REQUESTED USED  SUBMIT               PARTITION SCHEDULED_START      REASON
864933  PD  elvis  first-job.*  2     10:00     0:00  2018-01-06T14:14:23  regular   avail_in_~48.0_days  None

sstat

sstat is used to display various status information of a running job or job step. For example, one may wish to see the maximum memory usage (resident set size) of all tasks in a running job.

$ sstat -j 864934 -o JobID,MaxRSS
       JobID     MaxRSS 
------------ ---------- 
864934.0          4333K 

For a complete list of sstat options and examples, please see the sstat manual.
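MaxRSS values such as 4333K carry a unit suffix, which makes them awkward to compare across jobs. The sketch below converts such a value to MiB. It assumes binary multiples (1M = 1024K) and treats a value with no suffix as bytes; both are assumptions to check against your site's Slurm configuration, and rss_to_mib is a hypothetical helper name.

```shell
# Hypothetical helper: convert a memory value like 4333K, 12M, or 2G to MiB.
# Assumes binary multiples; a bare number is treated as bytes (assumption).
rss_to_mib() {
    echo "$1" | awk '{
        value = $1 + 0                      # numeric prefix, e.g. 4333 from "4333K"
        unit = substr($1, length($1), 1)    # last character, e.g. "K"
        if (unit == "K")      mib = value / 1024
        else if (unit == "M") mib = value
        else if (unit == "G") mib = value * 1024
        else if (unit == "T") mib = value * 1024 * 1024
        else                  mib = value / (1024 * 1024)
        printf "%.2f\n", mib
    }'
}
```

Under these assumptions, rss_to_mib 4333K prints 4.23, so the job step in the example above peaked at a little over 4 MiB per task.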

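Email notification

Slurm can send email when a job begins, ends, or fails. To enable this, add the following directives to your batch script, replacing the address with your own:
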
#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=user@domain.com

ssh_job

To allow users to check the resource usage of a job while it is running (e.g., by running top), NERSC provides the ssh_job command. This command must be run from a login node, and the syntax is:

$ ssh_job <job ID>

where the argument to the command is the Slurm job ID of interest. For multi-node jobs, this will SSH into the first node in the allocation. From this node, you can then SSH to other nodes allocated to the job.