Monitoring Jobs

Note

Continuously running squeue or sqs (e.g. with watch), and especially running multiple instances of "watch squeue" or "watch sqs", is not allowed. When many users do this at once, it degrades the performance of the job scheduler, which is a shared resource.

If you must monitor your workload, run only a single instance of squeue or sqs. If watch is essential to your workflow, refresh no more often than once per minute (watch -n 60) and be sure to terminate the process when you are not actively using it.

Additionally the sacct command (sacct -X -s pd,r) uses less expensive queries for much of the same information, but the same advice about watch applies.
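
If a job needs to be watched for a while, a small polling loop is one alternative to watch. The sketch below is only an illustration: poll_my_jobs and POLL_INTERVAL are hypothetical names, not NERSC-provided tools. It runs sacct -X -s pd,r at most once per interval (60 seconds by default) and exits on its own once no pending or running jobs remain.

```shell
# Hypothetical helper: poll sacct at a polite interval and stop automatically.
poll_my_jobs() {
    interval="${POLL_INTERVAL:-60}"   # seconds between queries; keep at 60 or more
    while :; do
        # -X: allocations only, -n: no header, -s pd,r: pending and running jobs
        active=$(sacct -X -n -s pd,r | grep -c . || true)
        echo "$(date '+%H:%M:%S') active jobs: $active"
        # Nothing left to watch, so end the loop instead of polling forever
        if [ "$active" -eq 0 ]; then
            break
        fi
        sleep "$interval"
    done
}
```

Calling poll_my_jobs in a single terminal is enough; the loop ends by itself, so no stray process is left querying the scheduler.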

For users who are interested in monitoring their job's resource usage while the job is running, NERSC provides the ssh_job command.

Note

The ssh_job command is temporarily unavailable.

sqs

sqs is a NERSC custom wrapper for the native Slurm squeue command, with a chosen default format for viewing job information in the batch queue managed by Slurm. Without any flags, sqs displays the queued jobs of the logged-in user. Invoking sqs -a displays the jobs of all users.

sqs is fully compatible with squeue in that it accepts any flag that squeue accepts, which gives you more flexibility in customizing the output. For example, you can show only running jobs with -t R, or override the default sqs format with the -o flag to choose the list and format of the fields you are interested in.

Note

Please refer to sqs --help and the squeue man page for the available flags and more information.
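
As a sketch of combining -t and -o, the function below shows only running jobs in a custom layout. The name sqs_running and the chosen column widths are illustrative assumptions; the format specifiers themselves are standard squeue ones.

```shell
# Hypothetical helper: running jobs only, with a custom column layout.
# %i job ID, %j job name, %q QOS, %M time used, %l time limit,
# %D node count, %R node list (or reason, for jobs that are not running)
sqs_running() {
    sqs -t R -o '%.12i %.20j %.12q %.12M %.12l %.6D %R' "$@"
}
```

Any extra arguments are passed straight through, so sqs_running -u elvis would apply the same layout to another user's running jobs.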

$ sqs
JOBID            ST USER      NAME          NODES TIME_LIMIT       TIME  SUBMIT_TIME          QOS             START_TIME           FEATURES       NODELIST(REASON
32308177         R  elvis     myjob1        1024  1-00:00:00       0:00  2020-06-08T05:05:12  regular_0       2020-06-10T06:00:00  knl&quad&cache nid0[2435-2443,
32308268         PD elvis     myjob2        1024  1-00:00:00       0:00  2020-06-08T05:19:59  regular_0       2020-06-10T06:00:00  knl&quad&cache (ReqNodeNotAvai
32305323         PD elvis     myjob3          48     6:00:00       0:00  2020-06-07T05:08:36  regular_1       N/A                  knl&quad&cache (Nodes required
32305332         PD elvis     myjob4          48     6:00:00       0:00  2020-06-07T05:09:06  regular_1       N/A                  knl&quad&cache (Nodes required

squeue

squeue provides information about jobs in the Slurm scheduling queue. Use it to view job and job step information for active jobs (PENDING, RUNNING, SUSPENDED). For more details, refer to the squeue manual or run squeue --help or man squeue.

View current user jobs:

squeue -u $USER

The same output can be retrieved with the --me option, which is equivalent to --user=$USER:

squeue --me

To view all running jobs for current user:

squeue --me -t RUNNING

To view all pending jobs for current user:

squeue --me -t PENDING
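
When many jobs are pending, it can help to see why they are waiting. The following sketch tallies the reason reported for each pending job; pending_reasons is a hypothetical helper name, and the example assumes the standard squeue %r (reason) format field.

```shell
# Hypothetical helper: count the current user's pending jobs by wait reason.
pending_reasons() {
    # -h suppresses the header; %r prints only the reason (e.g. Priority)
    squeue --me -h -t PENDING -o '%r' |
    awk '{ count[$0]++ } END { for (r in count) print r, count[r] }' |
    sort
}
```

This issues a single squeue query per call, so it is no more expensive than the plain command above.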

To view all pending jobs in QOS shared:

squeue -q shared -t PENDING

To view all running jobs for the current user in the shared QOS:

$ squeue --me -q shared -t RUNNING  
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             1000    shared netcdf_r   user1  R    1:16:47      1 nid00527

To view all jobs for a given account (project), use -A <nersc_project>:

$ squeue -A <nersc_project>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2000 debug_knl tokio-ab   admin1 PD       0:00    256 (Burst buffer pre_run error)
              2001 regular_h mpi4py-i   admin2 PD       0:00    150 (Priority)
              2002 regular_h mpi4py-i   admin3 PD       0:00    150 (Priority)
              2003 regular_h preproce   admin4 PD       0:00      1 (Priority)

To filter by job ID, use the -j option and pass the job ID. You can specify multiple job IDs separated by commas.

$ squeue -j 2542,2560
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2542    shared wrfpostp    user1 PD       0:00      1 (Dependency)
              2560    shared netcdf_r    user2 PD       0:00      1 (Resources)

To view a job step use the --steps option and pass the job step ID.

$ squeue --steps 1001.0
         STEPID     NAME PARTITION     USER      TIME NODELIST
         1001.0 vasp_std regular_k    elvis   5:19:26 nid0[2520-2527]

sacct

sacct reports accounting information for active or completed jobs and job steps. Invoked without any arguments, it shows jobs for the current user. sacct can be used for monitoring, but it is primarily used for job accounting.

For a complete list of sacct options please refer to the sacct manual or run man sacct.

jobstats

jobstats provides Slurm accounting and job details from sacct, sreport, and squeue. Run without any arguments, it shows a report for the current user, including today's report from sreport. If you have any pending or running jobs, it shows those as well.

For more information see jobstats documentation.

$ jobstats
User: XXXXXX 
Default Account: YYYYY
User is part of the following slurm accounts ['YYYYY']
User Raw Share: 1
User Raw Usage: 0
Number of Pending Jobs: 0
Number of Running Jobs: 0
Total Jobs Completed: 0
Total Jobs Completed Successfully: 0
Total Jobs Failed: 0
Total Jobs Cancelled: 0
Total Jobs Timeout: 0

Today: 06/15/2020 12:13:37 sreport
--------------------------------------------------------------------------------
Top 10 Users 2020-06-14T00:00:00 - 2020-06-14T23:59:59 (86400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account         Used        Energy 
--------- --------- --------------- --------------- ------------ ------------- 

sstat

sstat is used to display various status information of a running job or job step. For example, one may wish to see the maximum memory usage (resident set size) of all tasks in a running job.

$ sstat -j 864934 -o JobID,MaxRSS
       JobID     MaxRSS 
------------ ---------- 
864934.0          4333K 

For a complete list of sstat options and examples please see sstat manual.
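
For a job with several steps, the step with the highest memory use is often the one of interest. The sketch below parses sstat output to find it. max_rss_step is a hypothetical helper name, and the code assumes MaxRSS values carry a K, M, or G suffix as in the example above; values in other forms would need extra handling.

```shell
# Hypothetical helper: print the step with the largest MaxRSS for a running job.
max_rss_step() {
    # -n: no header, -P: fields separated by "|" without padding
    sstat -j "$1" -n -P -o JobID,MaxRSS |
    awk -F'|' '
        BEGIN { max = -1 }
        {
            value = $2
            unit = substr(value, length(value), 1)
            number = value + 0              # awk keeps the leading numeric part
            # Normalize to KiB (assumes a K, M, or G suffix)
            if (unit == "M") number *= 1024
            else if (unit == "G") number *= 1024 * 1024
            if (number > max) { max = number; step = $1 }
        }
        END { if (step != "") printf "%s %d KiB\n", step, max }
    '
}
```

For example, max_rss_step 864934 would print one line naming the step and its peak resident set size in KiB.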

Email notification

To receive email when a job begins, ends, or fails, add these directives to your batch script:

#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=user@domain.com

ssh_job

To allow users to check resource usage of a job while it is running (e.g., by running top), NERSC provides the ssh_job command. This command must be run from a login node, and the syntax is

ssh_job <JobID>

where the argument to the command is the Slurm job ID of interest. For multi-node jobs, this will SSH into the first node in the allocation. From this node, you can then SSH to other nodes allocated to the job.

Updating Jobs

Cancel jobs

Cancel a specific job:

scancel $JobID

Cancel all jobs owned by a user:

Warning

If you need to cancel several hundred jobs, do not cancel them all in one bulk operation; instead, cancel them in smaller subsets.

scancel -u $USER
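
One possible way to cancel in subsets is to work through the pending jobs in small batches with a pause between them. This is a sketch of that idea rather than a NERSC-recommended procedure: cancel_pending_in_batches, the batch size of 50, and the 5-second pause are all illustrative assumptions.

```shell
# Hypothetical helper: cancel the current user's pending jobs a batch at a time.
cancel_pending_in_batches() {
    batch_size="${1:-50}"   # jobs per scancel call (assumed value)
    pause="${2:-5}"         # seconds to wait between batches (assumed value)
    squeue --me -h -t PENDING -o '%i' | {
        set --              # reuse the positional parameters as the current batch
        while read -r jobid; do
            set -- "$@" "$jobid"
            if [ "$#" -ge "$batch_size" ]; then
                scancel "$@"
                set --
                sleep "$pause"
            fi
        done
        # Cancel whatever is left in the final, partial batch
        if [ "$#" -gt 0 ]; then
            scancel "$@"
        fi
    }
}
```

Calling cancel_pending_in_batches 25 10 would cancel 25 jobs per call and wait 10 seconds between calls.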

Change timelimit

scontrol update jobid=$JobID timelimit=$new_timelimit

Change QOS

scontrol update jobid=$JobID qos=$new_qos

Change account

scontrol update jobid=$JobID account=$new_project_to_charge

Note

The new repo must be eligible to run the job.
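
After an update it is worth confirming that the change took effect. The sketch below applies one or more scontrol update settings and then prints the job's time limit, QOS, and account in a single squeue query. update_and_show is a hypothetical helper name and the column widths are arbitrary.

```shell
# Hypothetical helper: update a job, then display the fields that may have changed.
update_and_show() {
    jobid="$1"
    shift
    # Remaining arguments are passed as key=value pairs, e.g. timelimit=02:00:00
    scontrol update jobid="$jobid" "$@" &&
        # %i job ID, %l time limit, %q QOS, %a account, %T state
        squeue -j "$jobid" -o '%.12i %.12l %.14q %.14a %.8T'
}
```

For example, update_and_show $JobID timelimit=02:00:00 would request a two-hour limit and then show the resulting values; if the update is rejected, the squeue query is skipped.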

Controlling Jobs

Prevent a pending job from being started:

scontrol hold $JobID

Note

A held job will lose its accumulated wait time in the queue. Later if this job is released, it will have the same priority as a newly submitted job.

Release a previously held job (one held with scontrol hold):

scontrol release $JobID

To requeue (cancel and rerun) a particular job:

scontrol requeue $JobID

Job Accounting

sacct example
$ sacct      
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
31170188             sh interacti+     nstaff        272     FAILED      1:0 
31170188.ex+     extern                nstaff        272  COMPLETED      0:0 
31170188.0         bash                nstaff          1     FAILED      1:0 
31170188.1        a.out                nstaff        272  COMPLETED      0:0 
31171781             sh       resv     nstaff        272  COMPLETED      0:0 
31171781.ex+     extern                nstaff        272  COMPLETED      0:0 
31171781.0         bash                nstaff          1  COMPLETED      0:0 
31172253             sh       resv     nstaff        272    TIMEOUT      0:0 
31172253.ex+     extern                nstaff        272  COMPLETED      0:0 
31172253.0         bash                nstaff          1  COMPLETED      0:0 

You can choose which columns to display with the --format option. For example, to show the User, JobName, State, and Submit columns:

sacct format example
$ sacct --format=User,JobName,State,Submit
     User    JobName      State              Submit 
--------- ---------- ---------- ------------------- 
    user1         sh     FAILED 2020-05-27T07:49:18 
              extern  COMPLETED 2020-05-27T07:49:18 
                bash     FAILED 2020-05-27T07:49:41 
               a.out  COMPLETED 2020-05-27T07:52:31 
    user1         sh  COMPLETED 2020-05-27T08:28:34 
              extern  COMPLETED 2020-05-27T08:28:34 
                bash  COMPLETED 2020-05-27T08:28:42 
    user1         sh    TIMEOUT 2020-05-27T08:51:43 
              extern  COMPLETED 2020-05-27T08:51:43 
                bash  COMPLETED 2020-05-27T08:51:52 

You can retrieve historical data for any given user. For example, to show jobs for user elvis with a start time of 2020-05-20 and an end time of 2020-05-27:

sacct format example with Start and End Date
$ sacct -u elvis -S 2020-05-20 -E 2020-05-27
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
30922369     test_node+     system     physics       8256    TIMEOUT      0:0 
30922369.ba+      batch                physics         64  CANCELLED     0:15 
30922369.ex+     extern                physics       8256  COMPLETED      0:0 
30922369.0   test_node+                physics        320     FAILED      1:0 
30922369.1   test_node+                physics       1904     FAILED      1:0 
30922378     test_node+     system     physics         38    PENDING      0:0 
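
A date-range query can also be condensed into a per-state tally. The following sketch counts jobs by final state over a period; job_state_summary is a hypothetical helper name. Only the first word of the State field is used, because sacct reports cancelled jobs as "CANCELLED by <uid>".

```shell
# Hypothetical helper: count the current user's jobs by state between two dates.
job_state_summary() {
    # -X: allocations only, -n: no header, -P: unpadded "|"-separated output
    sacct -X -n -P -S "$1" -E "$2" --format=State |
    awk '{ count[$1]++ } END { for (s in count) print s, count[s] }' |
    sort
}
```

For example, job_state_summary 2020-05-20 2020-05-27 would print one line per state with the number of jobs in that state.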

To filter the output by job ID, use the -j option with a comma-separated list of job IDs.

sacct filter by jobs
$ sacct -j 31170188,31171781
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
31170188             sh interacti+     nstaff        272     FAILED      1:0 
31170188.ex+     extern                nstaff        272  COMPLETED      0:0 
31170188.0         bash                nstaff          1     FAILED      1:0 
31170188.1        a.out                nstaff        272  COMPLETED      0:0 
31171781             sh       resv     nstaff        272  COMPLETED      0:0 
31171781.ex+     extern                nstaff        272  COMPLETED      0:0 
31171781.0         bash                nstaff          1  COMPLETED      0:0 
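
When a job has many steps, it can be quicker to list only those that did not complete. The sketch below does this for a single job; non_completed_steps is a hypothetical helper name. Note that steps still in the RUNNING state are listed too, since the filter only excludes COMPLETED.

```shell
# Hypothetical helper: list the job and steps whose state is not COMPLETED.
non_completed_steps() {
    # -n: no header, -P: unpadded "|"-separated output (job IDs are not truncated)
    sacct -j "$1" -n -P --format=JobID,JobName,State,ExitCode |
    awk -F'|' '$3 != "COMPLETED" { printf "%s %s %s %s\n", $1, $2, $3, $4 }'
}
```

Each output line gives the job or step ID, its name, state, and exit code.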

To view pending jobs for a particular user, use the -s option (long form --state) with the state name. For a complete list of job states, see JOB STATE CODES in the sacct manual.

sacct example with user, format fields and job state
$ sacct -u elvis --format=User,JobName,State -s PENDING
     User    JobName      State 
--------- ---------- ---------- 
    elvis test_node+    PENDING