Monitoring Jobs¶

Note

Continuously running squeue or sqs under watch, and especially running multiple instances of "watch squeue/sqs", is not allowed. When many users do this at once, it degrades the performance of the job scheduler, which is a shared resource.

If you must monitor your workload, run only a single instance of squeue or sqs, or use sacct. If watch is essential to your workflow, set the refresh interval to no less than 1 minute (watch -n 60) and be sure to terminate the process when you are not actively using it.
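As a minimal sketch of the guidance above, the helper below refuses refresh intervals shorter than 60 seconds before handing off to watch. The function names clamp_interval and watch_queue are our own (they are not NERSC or Slurm commands), and the sketch assumes bash with watch and squeue on your PATH.

```shell
# Hypothetical helpers, not NERSC-provided commands.

# Clamp a requested refresh interval (in seconds) to the 60-second minimum.
clamp_interval() {
    local requested=${1:-60}
    if [ "$requested" -lt 60 ]; then
        echo 60
    else
        echo "$requested"
    fi
}

# Run ONE watch instance on your own jobs; stop it with Ctrl-C when done.
watch_queue() {
    watch -n "$(clamp_interval "$1")" squeue --me
}
```

Calling watch_queue 5 would therefore still refresh only once per minute, while watch_queue 300 refreshes every five minutes.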

Users who want to monitor their job's resource usage while the job is running should see the section below on how to log in to compute nodes running your jobs.

sqs¶

sqs is a NERSC custom wrapper for the Slurm native squeue command, with a chosen default format, for viewing job information in the batch queue managed by Slurm. The sqs command without any flag displays queued jobs for the logged-in user. Invoking sqs -a displays the jobs of all users.

sqs is fully compatible with squeue in that it takes any flag that is accepted by squeue, which gives you more flexibility in customizing the output. For example, you could choose to see only running jobs with -t R, or you could override the default format of sqs with the -o flag to provide the list and format of the fields you are interested in.

Note

Please refer to sqs --help and the squeue man page for the available flags and more information.

$ sqs
JOBID            ST USER      NAME          NODES TIME_LIMIT       TIME  SUBMIT_TIME          QOS             START_TIME           FEATURES       NODELIST(REASON
9992934          R  elvis     myjob1        1024    12:00:00       0:00  2023-06-05T05:05:12  regular_0       2023-06-05T06:00:00  cpu            nid[004196-0041
9992980          PD elvis     myjob2        1024    12:00:00       0:00  2023-06-05T05:19:59  regular_0       2023-06-05T06:00:00  cpu            (ReqNodeNotAvai   
9995272          PD elvis     myjob3          48     6:00:00       0:00  2023-06-05T05:38:36  regular_1       N/A                  cpu            (Dependency)
9992985          PD elvis     myjob4          48     6:00:00       0:00  2023-06-05T05:51:06  regular_1       N/A                  cpu            (Nodes required
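The -t and -o flags mentioned above can be combined as in the sketch below. The format string is only an illustration and the variable name SQS_FORMAT is our own; the %-specifiers are standard squeue ones, so check the squeue man page before relying on any of them.

```shell
# Show only your running jobs, keeping the default sqs format:
#   sqs -t R

# Illustrative format string (the variable name is our own):
#   %i job ID, %j job name, %t compact state, %M elapsed time,
#   %D node count, %R node list or pending reason.
# A number after % sets the column width; a leading dot right-justifies.
SQS_FORMAT='%.12i %.20j %.2t %.10M %.6D %R'

# Only call sqs where it exists (for example, on a NERSC login node).
if command -v sqs >/dev/null 2>&1; then
    sqs -t R -o "$SQS_FORMAT"
fi
```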

squeue¶

squeue provides information about jobs in the Slurm scheduling queue and is best used for viewing job and job step information for active jobs (PENDING, RUNNING, SUSPENDED). For more details on squeue, refer to the squeue manual or run squeue --help or man squeue.

To view current user jobs:

squeue -u $USER

The same output can be retrieved with the --me option, which is equivalent to --user=$USER:

squeue --me

To view all running jobs for the current user:

squeue --me -t RUNNING

To view all pending jobs for current user:

squeue --me -t PENDING

To view all pending jobs in QOS shared:

squeue -q shared -t PENDING

To view all running jobs for current user in the shared QOS:

$ squeue --me -q shared -t RUNNING  
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1000    shared netcdf_r    user1  R    1:16:47      1 nid006504

To view all jobs for a particular account (project), use -A <nersc_project>:

$ squeue -A <nersc_project>
         JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
          2000 regular_m tokio-ab    admin1 PD       0:00    256 (Priority)
          2001 regular_m mpi4py-i    admin2 PD       0:00    150 (Priority)
          2002 regular_m mpi4py-i    admin3 PD       0:00    150 (Priority)
          2003 regular_m preproce    admin4 PD       0:00      1 (Priority)

To filter by job ID, use the -j option followed by the job ID. You can specify multiple job IDs separated by commas.

$ squeue -j 2542,2560
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          2542 shared_mi wrfpostp    user1 PD       0:00      1 (Dependency)
          2560 shared_mi netcdf_r    user2 PD       0:00      1 (Resources)

To view a job step use the --steps option with the job step ID.

$ squeue --steps 1001.0            
         STEPID     NAME PARTITION     USER      TIME NODELIST
         1001.0 vasp_std regular_m    elvis   5:19:26 nid004113

sacct¶

sacct is used to report job or job step accounting information about active or completed jobs. You can directly invoke sacct without any arguments and it will show jobs for the current user. sacct can be used for monitoring but it is primarily used for Job Accounting.

For a complete list of sacct options please refer to the sacct manual or run man sacct.

jobstats¶

Note

You must use Python 3.x in order to use jobstats; this can be done with module load python.

jobstats provides Slurm accounting and job details from sacct, sreport and squeue. You can run jobstats without any arguments and it will show a report for the current user from sreport for today. If you have any pending or running jobs it will show that as well.

$ jobstats
User: XXXXXX 
Default Account: YYYYY
User is part of the following slurm accounts ['YYYYY']
User Raw Share: 1
User Raw Usage: 0
Number of Pending Jobs: 0
Number of Running Jobs: 0
Total Jobs Completed: 0
Total Jobs Completed Successfully: 0
Total Jobs Failed: 0
Total Jobs Cancelled: 0
Total Jobs Timeout: 0

Today: 06/05/2023 12:13:37 sreport
--------------------------------------------------------------------------------
Top 10 Users 2020-06-14T00:00:00 - 2020-06-14T23:59:59 (86400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account         Used        Energy 
--------- --------- --------------- --------------- ------------ -------------

Shown below is a list of options for the jobstats command.

$ jobstats --help
usage: jobstats [-h] [-u USER] [-S START] [-E END] [-j]
                [--state {COMPLETED,FAILED,TIMEOUT,CANCELLED}] [-a]

slurm utility for display user job statistics, reporting, and account detail.

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  Select a user
  -S START, --start START
                        Start Date Format: YYYY-MM-DD
  -E END, --end END     End Date Format: YYYY-MM-DD
  -j, --jobsummary      Display job summary for user
  --state {COMPLETED,FAILED,TIMEOUT,CANCELLED}
                        Filter by Job State
  -a, --account         Display information on account shares that user
                        belongs to

Developed by Shahzeb Siddiqui <shahzebmsiddiqui@gmail.com>

For more information see the jobstats documentation.

sstat¶

sstat is used to display various status information of a running job or job step. For example, one may wish to see the maximum memory usage (resident set size) of all tasks in a running job.

$ sstat -j 9992980 -o JobID,MaxRSS
       JobID     MaxRSS 
------------ ---------- 
9992980.0         4333K

For a complete list of sstat options and examples, see the sstat manual.
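Values such as 4333K are easier to compare once converted to a single unit. The sketch below is a hypothetical helper (rss_to_mib is our own name) that converts one MaxRSS value to MiB. It assumes the value carries an optional K, M, or G suffix with 1024-based multiples, and that a bare number is a byte count.

```shell
# Convert a memory value such as 4333K to MiB, printed with one decimal.
# Assumption: K, M, and G are 1024-based; a bare number means bytes.
rss_to_mib() {
    echo "$1" | awk '{
        value = $1 + 0                      # numeric prefix, e.g. 4333
        unit  = substr($1, length($1), 1)   # last character, e.g. K
        if      (unit == "K") mib = value / 1024
        else if (unit == "M") mib = value
        else if (unit == "G") mib = value * 1024
        else                  mib = value / (1024 * 1024)
        printf "%.1f\n", mib
    }'
}

# Usage on a system with Slurm (replace <jobid> with a running job ID);
# -n drops the header and -P prints unpadded, pipe-delimited fields:
#   rss_to_mib "$(sstat -n -P -j <jobid> -o MaxRSS | head -n 1)"
```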

Email notification¶

You can add directives within your job script to notify you when your job starts, finishes, or fails. Using the --mail-type option, you can select one of begin, end, or fail (respectively), or two or more in a comma-separated list (as below). You should specify the email address to which the notifications should go with the --mail-user option.

#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=user@domain.com

How to log in to compute nodes running your jobs¶

It can be useful for troubleshooting or diagnostics to log in to compute nodes running one's job in order to observe activity on those nodes. Below is the series of steps required to log in to a compute node while one's job is running.

Access to compute nodes is enabled only while the job is running

A user's SSH access to compute nodes is enabled only during the lifetime of the job. When the job ends, the user's SSH connections to all compute nodes in the job will be disconnected.

1. Retrieve the list of nodes that your job is running on. This prints either a single host name nid***** or, if the job has more than one node, a range of host names in square brackets.
```
scontrol show job <jobid> | grep -oP  'NodeList=nid(\[.+\]|.+)'
```
2. SSH into any nid***** node in the scontrol list generated in step 1.

Requesting the head-node ID

If you need only the head node (e.g., for DMTCP applications), use BatchHost instead of NodeList:

scontrol show job <jobid>|grep -oP 'BatchHost=\K\w+'
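The steps above can be sketched as a small script. The helper name extract_nodelist is our own; it assumes GNU grep with -P support, as the commands above already do. It relies on scontrol show hostnames, which expands a bracketed range into one host name per line.

```shell
# Print the NodeList value from `scontrol show job` output read on stdin.
# The lookbehind skips fields such as ReqNodeList= and ExcNodeList=.
extract_nodelist() {
    grep -oP '(?<![A-Za-z])NodeList=\K\S+'
}

# Usage on a system with Slurm (replace <jobid> with a running job ID):
#   nodes=$(scontrol show job <jobid> | extract_nodelist)
#   scontrol show hostnames "$nodes"          # one host name per line
#   ssh "$(scontrol show hostnames "$nodes" | head -n 1)"
```

The last usage line logs in to the first node of the expanded list; pick any other line of the expanded list to reach a different node of the job.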

Updating Jobs¶

Cancel jobs¶

To cancel a specific job:

scancel $JobID

You can also cancel more than one job in a single call to scancel:

scancel $JobID1 $JobID2

To cancel all jobs owned by a user:

Warning

If you want to cancel several hundred jobs, do not perform this action as one bulk change; cancel jobs by subset instead.

scancel -u $USER

Because scancel sends a remote procedure call to the Slurm daemon, a degradation of service can result from many scancel calls happening all at once. Therefore we recommend using as few individual calls to this function as possible. In particular, do not wrap scancel in a loop in a script or other function.
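One way to cancel by subset without looping is to add filters to a single scancel call. The sketch below lists a few such calls as comments and adds a hypothetical helper (scancel_preview is our own name) that only prints the command so you can review it before running it yourself. The -t, -q, and -n filters are standard scancel options; check man scancel for the full list.

```shell
# Each of these is ONE scancel call that touches only a subset of jobs:
#   scancel -u "$USER" -t PENDING     # only pending jobs
#   scancel -u "$USER" -q shared      # only jobs in the shared QOS
#   scancel -u "$USER" -n myjob1      # only jobs named myjob1

# Hypothetical helper: build the command and print it instead of running it.
# Arguments: user, optional job state, optional QOS.
scancel_preview() {
    local user=$1 state=$2 qos=$3
    local cmd="scancel -u $user"
    if [ -n "$state" ]; then
        cmd="$cmd -t $state"
    fi
    if [ -n "$qos" ]; then
        cmd="$cmd -q $qos"
    fi
    echo "$cmd"
}
```

For example, scancel_preview elvis PENDING shared prints a single command that would cancel only elvis's pending jobs in the shared QOS.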

Change timelimit¶

scontrol update jobid=$JobID timelimit=$new_timelimit

Change QOS¶

scontrol update jobid=$JobID qos=$new_qos

Change account¶

scontrol update jobid=$JobID account=$new_project_to_charge

Note

The new project must be eligible to run the job.

Controlling Jobs¶

Prevent a pending job from being started:

scontrol hold $JobID

Note

A held job will lose its accumulated wait time in the queue. Later, if this job is released, it will have the same priority as a newly submitted job.

Release a previously held job (one held with scontrol hold):

scontrol release $JobID

To requeue (cancel and rerun) a particular job:

scontrol requeue $JobID

Job Accounting¶

sacct example

$ sacct      
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
10009775             sh  regular_m      proj1        256     FAILED      1:0 
10009775.ex+     extern                 proj1        256  COMPLETED      0:0 
10009775.0         bash                 proj1          1     FAILED      1:0 
10009775.1        a.out                 proj1        256  COMPLETED      0:0 
31171781             sh       resv      proj1        256  COMPLETED      0:0 
31171781.ex+     extern                 proj1        256  COMPLETED      0:0 
31171781.0         bash                 proj1          1  COMPLETED      0:0 
31172253             sh       resv      proj1        256    TIMEOUT      0:0 
31172253.ex+     extern                 proj1        256  COMPLETED      0:0 
31172253.0         bash                 proj1          1  COMPLETED      0:0

You can choose which columns to display using the --format option. For example, to show the User, JobName, State, and Submit columns:

sacct format example

$ sacct --format=User,JobName,State,Submit
     User    JobName      State              Submit 
--------- ---------- ---------- ------------------- 
    user1         sh     FAILED 2023-05-27T07:49:18 
              extern  COMPLETED 2023-05-27T07:49:18 
                bash     FAILED 2023-05-27T07:49:41 
               a.out  COMPLETED 2023-05-27T07:52:31 
    user1         sh  COMPLETED 2023-05-27T08:28:34 
              extern  COMPLETED 2023-05-27T08:28:34 
                bash  COMPLETED 2023-05-27T08:28:42 
    user1         sh    TIMEOUT 2023-05-27T08:51:43 
              extern  COMPLETED 2023-05-27T08:51:43 
                bash  COMPLETED 2023-05-27T08:51:52
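When the output is meant for a script rather than for reading, the padded columns above are awkward to parse. A minimal sketch, assuming sacct's -n (no header), -P (pipe-delimited), and -X (one line per job, no steps) options: the hypothetical helper count_states below tallies jobs per state from such output.

```shell
# Count jobs per state from pipe-delimited lines of the form JobID|State
# read on stdin. Only the first word of the state is used, so
# "CANCELLED by 1000" is counted as CANCELLED.
count_states() {
    awk -F'|' 'NF >= 2 {
                   split($2, part, " ")
                   count[part[1]]++
               }
               END { for (s in count) print s, count[s] }' | sort
}

# Usage on a system with Slurm:
#   sacct -X -n -P --format=JobID,State | count_states
```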

You can retrieve historical data for any given user. For example, to list jobs for user elvis between a start time of 2023-05-20 and an end time of 2023-05-27:

sacct format example with Start and End Date

$ sacct -u elvis -S 2023-05-20 -E 2023-05-27
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
10009730     test_node+     system    physics       4096    TIMEOUT      0:0 
10009730.ba+      batch               physics         64  CANCELLED     0:15 
10009730.ex+     extern               physics       4096  COMPLETED      0:0 
10009730.0   test_node+               physics        128     FAILED      1:0 
10009730.1   test_node+               physics       2048     FAILED      1:0 
10009732     test_node+     system    physics        512    PENDING      0:0

You can retrieve up to 31 days of job records within a given time window; this limit was implemented as a safety measure to avoid overloading the Slurm database. You will see the following error if you exceed the 31-day limit:

$ sacct --start 2023-01-04 
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
sacct: error: slurmdbd: Too wide of a date range in query
$ date
Wed 07 Jun 2023 09:35:56 AM PDT
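To stay inside the limit, you can compute the start date rather than typing it by hand. The sketch below is a hypothetical helper (earliest_start is our own name) that assumes GNU date. Whether a window of exactly 31 days is accepted is an assumption; pass a smaller day count if the query is still rejected.

```shell
# Print the date N days before a given day, in YYYY-MM-DD form.
# Arguments: end day (default: today, UTC), number of days (default: 31).
earliest_start() {
    local end_day=${1:-$(date -u +%F)}
    local days=${2:-31}
    date -u -d "$end_day $days days ago" +%F
}

# Usage on a system with Slurm:
#   sacct --start "$(earliest_start)" --end now
```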

To query by job state, use the -s option (long form --state) followed by the abbreviated state code. For a complete list of job states and their codes, see the JOB STATE CODES section in the sacct manual. The example below queries for all failed jobs. The start and end of the query window, given by the --start and --end options, are required arguments.

sacct example with user, format fields and job states

$ sacct -X --format=User,JobName,State -s f --start=2023-06-01 --end=now 
     User    JobName      State 
--------- ---------- ---------- 
  elvis   81932_161+     FAILED 
  elvis   82105_161+     FAILED

To filter output by JobID, you can specify the -j option with a list of comma-separated job IDs.

sacct filter by jobs

$ sacct -j 9994271,9992980
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
 9994271             sh regular_m+      proj2        256     FAILED      1:0 
 9994271.ex+     extern                 proj2        256  COMPLETED      0:0 
 9994271.0         bash                 proj2          1     FAILED      1:0 
 9994271.1        a.out                 proj2        256  COMPLETED      0:0 
 9992980             sh       resv      proj2        256  COMPLETED      0:0 
 9992980.ex+     extern                 proj2        256  COMPLETED      0:0 
 9992980.0         bash                 proj2          1  COMPLETED      0:0