Monitoring Jobs¶
Note
Running squeue or sqs continuously with watch, and especially running multiple instances of "watch squeue" or "watch sqs", is not allowed. When many users do this at once, it adversely impacts the performance of the job scheduler, which is a shared resource.
If you must monitor your workload, run only a single instance of squeue or sqs. If watch is essential to your workflow, limit the refresh interval to 1 minute (watch -n 60) and be sure to terminate the process when you are not actively using it.
Users interested in monitoring their job's resource usage while the job is running should see the section below on how to log in to compute nodes running your jobs.
sqs¶
sqs is a NERSC custom wrapper for the Slurm native squeue command, with a chosen default format for viewing job information in the batch queue managed by Slurm. The sqs command without any flag displays queued jobs for the logged-in user. Invoking sqs -a displays the jobs of all users.
sqs is fully compatible with squeue in that it takes any flag that is accepted by squeue, which gives you more flexibility in customizing the output. For example, you could choose to see only running jobs with -t R, or you could override the default format of sqs with the -o flag to provide the list and format of the fields that interest you.
Note
Please refer to sqs --help and the squeue man page for the available flags and more information.
$ sqs
JOBID ST USER NAME NODES TIME_LIMIT TIME SUBMIT_TIME QOS START_TIME FEATURES NODELIST(REASON
9992934 R elvis myjob1 1024 12:00:00 0:00 2023-06-05T05:05:12 regular_0 2023-06-05T06:00:00 cpu nid[004196-0041
9992980 PD elvis myjob2 1024 12:00:00 0:00 2023-06-05T05:19:59 regular_0 2023-06-05T06:00:00 cpu (ReqNodeNotAvai
9995272 PD elvis myjob3 48 6:00:00 0:00 2023-06-05T05:38:36 regular_1 N/A cpu (Dependency)
9992985 PD elvis myjob4 48 6:00:00 0:00 2023-06-05T05:51:06 regular_1 N/A cpu (Nodes required
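As a rough sketch of overriding the default format with -o, the snippet below lists running jobs with a narrower set of columns. The format string is only an illustrative choice built from standard squeue field specifiers (%i job ID, %j name, %t state, %M elapsed time, %D node count, %R node list or reason); the guard simply skips the call on machines where sqs is not installed.

```shell
# Illustrative column layout (an assumption, not the sqs default):
# job ID, job name, compact state, elapsed time, node count, node list or reason.
fmt='%.10i %.20j %.2t %.10M %.6D %R'

# Run once; do not wrap this in watch or a tight loop.
if command -v sqs >/dev/null 2>&1; then
    sqs -t R -o "$fmt"
fi
```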
squeue¶
squeue provides information about jobs in the Slurm scheduling queue and is best used for viewing jobs and job step information for active jobs (PENDING, RUNNING, SUSPENDED). For more details on squeue, refer to the squeue manual or run squeue --help or man squeue.
To view current user jobs:
squeue -u $USER
The same output can be retrieved via the --me option, which is equivalent to --user=$USER:
squeue --me
To view all running jobs for the current user:
squeue --me -t RUNNING
To view all pending jobs for the current user:
squeue --me -t PENDING
To view all pending jobs in the shared QOS:
squeue -q shared -t PENDING
To view all running jobs for the current user in the shared QOS:
$ squeue --me -q shared -t RUNNING
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000 shared netcdf_r user1 R 1:16:47 1 nid006504
To view all jobs for a particular account (project), use -A <nersc_project>:
$ squeue -A <nersc_project>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2000 regular_m tokio-ab admin1 PD 0:00 256 (Priority)
2001 regular_m mpi4py-i admin2 PD 0:00 150 (Priority)
2002 regular_m mpi4py-i admin3 PD 0:00 150 (Priority)
2003 regular_m preproce admin4 PD 0:00 1 (Priority)
To filter by job, use the -j option followed by the job ID. You can specify multiple job IDs separated by commas.
$ squeue -j 2542,2560
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2542 shared_mi wrfpostp user1 PD 0:00 1 (Dependency)
2560 shared_mi netcdf_r user2 PD 0:00 1 (Resources)
To view a job step, use the --steps option with the job step ID.
$ squeue --steps 1001.0
STEPID NAME PARTITION USER TIME NODELIST
1001.0 vasp_std regular_m elvis 5:19:26 nid004113
sacct¶
sacct is used to report job or job step accounting information about active or completed jobs. You can invoke sacct directly without any arguments and it will show jobs for the current user. sacct can be used for monitoring, but it is primarily used for Job Accounting.
For a complete list of sacct options, please refer to the sacct manual or run man sacct.
jobstats¶
Note
You must use Python 3.x in order to use jobstats; this can be done with module load python.
jobstats provides Slurm accounting and job details from sacct, sreport, and squeue. You can run jobstats without any arguments and it will show a report for the current user from sreport for today. If you have any pending or running jobs, it will show those as well.
$ jobstats
User: XXXXXX
Default Account: YYYYY
User is part of the following slurm accounts ['YYYYY']
User Raw Share: 1
User Raw Usage: 0
Number of Pending Jobs: 0
Number of Running Jobs: 0
Total Jobs Completed: 0
Total Jobs Completed Successfully: 0
Total Jobs Failed: 0
Total Jobs Cancelled: 0
Total Jobs Timeout: 0
Today: 06/05/2023 12:13:37 sreport
--------------------------------------------------------------------------------
Top 10 Users 2020-06-14T00:00:00 - 2020-06-14T23:59:59 (86400 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
Cluster Login Proper Name Account Used Energy
--------- --------- --------------- --------------- ------------ -------------
Shown below is a list of options for the jobstats command.
$ jobstats --help
usage: jobstats [-h] [-u USER] [-S START] [-E END] [-j]
[--state {COMPLETED,FAILED,TIMEOUT,CANCELLED}] [-a]
slurm utility for display user job statistics, reporting, and account detail.
optional arguments:
-h, --help show this help message and exit
-u USER, --user USER Select a user
-S START, --start START
Start Date Format: YYYY-MM-DD
-E END, --end END End Date Format: YYYY-MM-DD
-j, --jobsummary Display job summary for user
--state {COMPLETED,FAILED,TIMEOUT,CANCELLED}
Filter by Job State
-a, --account Display information on account shares that user
belongs to
Developed by Shahzeb Siddiqui <shahzebmsiddiqui@gmail.com>
For more information see the jobstats documentation.
sstat¶
sstat is used to display various status information of a running job or job step. For example, one may wish to see the maximum memory usage (resident set size) of all tasks in a running job.
$ sstat -j 9992980 -o JobID,MaxRSS
JobID MaxRSS
------------ ----------
9992980.0 4333K
For a complete list of sstat options and examples, please see the sstat manual.
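When comparing memory use across steps, it can help to normalize the MaxRSS values to one unit. The sketch below is a hypothetical helper (maxrss_mib is not part of Slurm) that reads JobID|MaxRSS lines, as produced by sstat with the -P (parsable) and -n (no header) flags, and prints each value in MiB. It assumes that MaxRSS carries a K, M, or G suffix, as in the sample output above, and treats an unsuffixed value as bytes.

```shell
# Convert "JobID|MaxRSS" lines (as from `sstat -P -n -o JobID,MaxRSS`) to MiB.
# Assumption: MaxRSS ends in K, M, or G; a value without a suffix is bytes.
maxrss_mib() {
    awk -F'|' '
        $2 != "" {
            value = $2
            unit = substr(value, length(value), 1)
            if (unit ~ /[KMG]/) {
                value = substr(value, 1, length(value) - 1)
            } else {
                unit = "B"
            }
            if (unit == "K") mib = value / 1024
            else if (unit == "M") mib = value
            else if (unit == "G") mib = value * 1024
            else mib = value / (1024 * 1024)
            printf "%s %.1f MiB\n", $1, mib
        }'
}

# Demonstration with the sample value shown above. On a system with Slurm,
# the input would instead come from: sstat -j <jobid> -o JobID,MaxRSS -P -n
printf '9992980.0|4333K\n' | maxrss_mib
```

Reading from standard input keeps the helper independent of how the sstat query is made, so the same function can be reused on saved output.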
Email notification¶
You can add directives within your job script to notify you when your job starts, finishes, or fails. Using the --mail-type option, you can select one of begin, end, or fail (respectively), or two or more in a comma-separated list (as below). You should specify the email address to which the notifications should go with the --mail-user option.
#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=user@domain.com
How to log in to compute nodes running your jobs¶
It can be useful for troubleshooting or diagnostics to log in to compute nodes running one's job in order to observe activity on those nodes. Below is the series of steps required to log in to a compute node while one's job is running.
Access to compute nodes is enabled only while the job is running
A user's SSH access to compute nodes is enabled only during the lifetime of the job. When the job ends, the user's SSH connections to all compute nodes in the job will be disconnected.
1. Retrieve the list of nodes that your job is running on. This prints either a single host name, nid*****, or, if the job has more than one node, a range of host names in square brackets.
scontrol show job <jobid> | grep -oP 'NodeList=nid(\[.+\]|.+)'
2. SSH into any nid***** node in the scontrol list generated in step 1.
Requesting the head-node ID
If you need only the head node (e.g. for DMTCP applications), use BatchHost instead of NodeList:
scontrol show job <jobid>|grep -oP 'BatchHost=\K\w+'
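The two login steps can be combined in a small script. The sketch below is one possible approach, not an official NERSC tool: job_nodelist and ssh_first_node are hypothetical helper names, and the parsing assumes the NodeList=... field appears as its own whitespace-separated token in the scontrol show job output. It uses scontrol show hostnames to expand a bracketed range into individual host names.

```shell
# Print the NodeList value from `scontrol show job` output read on stdin.
# Matching whole fields keeps ReqNodeList= and ExcNodeList= out of the result.
job_nodelist() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^NodeList=/) { sub(/^NodeList=/, "", $i); print $i }
    }'
}

# Hypothetical helper: log in to the first node of a running job.
# Usage: ssh_first_node <jobid>
ssh_first_node() {
    nodes=$(scontrol show job "$1" | job_nodelist)
    # `scontrol show hostnames` expands a range such as nid[004196-004197].
    ssh "$(scontrol show hostnames "$nodes" | head -n 1)"
}
```

Keeping the parsing in its own function means it can be checked against saved scontrol output without a running job.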
Updating Jobs¶
Cancel jobs¶
To cancel a specific job:
scancel $JobID
You can also cancel more than one job in a single call to scancel:
scancel $JobID1 $JobID2
To cancel all jobs owned by a user:
Warning
If you want to cancel several hundred jobs, do not perform this action as one bulk change; cancel jobs by subset instead.
scancel -u $USER
Because scancel sends a remote procedure call to the Slurm daemon, many scancel calls happening all at once can degrade service. We therefore recommend using as few individual calls to this command as possible. In particular, do not wrap scancel in a loop in a script or other function.
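One way to follow this advice when you have a long list of job IDs is to pass many IDs to each scancel call rather than looping over them one at a time. The sketch below is a hypothetical helper (cancel_in_batches is not a Slurm command); the default batch size of 50 is an arbitrary illustrative value, not a NERSC recommendation, and the SCANCEL variable exists only so the helper can be dry-run with echo.

```shell
# Read job IDs from stdin and hand them to scancel in batches, so that each
# scancel call receives many IDs instead of one call per job.
# $1: batch size (illustrative default of 50).
# SCANCEL: command to run; override with echo for a dry run.
cancel_in_batches() {
    xargs -n "${1:-50}" "${SCANCEL:-scancel}"
}

# Dry run: prints the argument list each call would receive.
printf '1001\n1002\n1003\n' | SCANCEL=echo cancel_in_batches 2
```

scancel also accepts filters such as --state, which can select a subset (for example, only pending jobs) in a single call.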
Change timelimit¶
scontrol update jobid=$JobID timelimit=$new_timelimit
Change QOS¶
scontrol update jobid=$JobID qos=$new_qos
Change account¶
scontrol update jobid=$JobID account=$new_project_to_charge
Note
The new project must be eligible to run the job.
Controlling Jobs¶
Prevent a pending job from being started:
scontrol hold $JobID
Note
A held job will lose its accumulated wait time in the queue. Later, if this job is released, it will have the same priority as a newly submitted job.
Release a job previously held with scontrol hold:
scontrol release $JobID
To requeue (cancel and rerun) a particular job:
scontrol requeue $JobID
Job Accounting¶
sacct example
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10009775 sh regular_m proj1 256 FAILED 1:0
10009775.ex+ extern proj1 256 COMPLETED 0:0
10009775.0 bash proj1 1 FAILED 1:0
10009775.1 a.out proj1 256 COMPLETED 0:0
31171781 sh resv proj1 256 COMPLETED 0:0
31171781.ex+ extern proj1 256 COMPLETED 0:0
31171781.0 bash proj1 1 COMPLETED 0:0
31172253 sh resv proj1 256 TIMEOUT 0:0
31172253.ex+ extern proj1 256 COMPLETED 0:0
31172253.0 bash proj1 1 COMPLETED 0:0
You can choose which columns to display using the --format option. For example, to show the User, JobName, State, and Submit fields:
sacct format example
$ sacct --format=User,JobName,State,Submit
User JobName State Submit
--------- ---------- ---------- -------------------
user1 sh FAILED 2023-05-27T07:49:18
extern COMPLETED 2023-05-27T07:49:18
bash FAILED 2023-05-27T07:49:41
a.out COMPLETED 2023-05-27T07:52:31
user1 sh COMPLETED 2023-05-27T08:28:34
extern COMPLETED 2023-05-27T08:28:34
bash COMPLETED 2023-05-27T08:28:42
user1 sh TIMEOUT 2023-05-27T08:51:43
extern COMPLETED 2023-05-27T08:51:43
bash COMPLETED 2023-05-27T08:51:52
You can retrieve historical data for any given user. For example, to filter jobs by Start Time 2023-05-20 and End Time 2023-05-27 for user elvis, you can do the following:
sacct format example with Start and End Date
$ sacct -u elvis -S 2023-05-20 -E 2023-05-27
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10009730 test_node+ system physics 4096 TIMEOUT 0:0
10009730.ba+ batch physics 64 CANCELLED 0:15
10009730.ex+ extern physics 4096 COMPLETED 0:0
10009730.0 test_node+ physics 128 FAILED 1:0
10009730.1 test_node+ physics 2048 FAILED 1:0
10009732 test_node+ system physics 512 PENDING 0:0
You can retrieve up to 31 days of job records within a given time window; this limit was implemented as a safety measure to prevent bringing down the Slurm database. You will see the following error if you exceed the 31-day limit:
$ sacct --start 2023-01-04
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: error: slurmdbd: Too wide of a date range in query
$ date
Wed 07 Jun 2023 09:35:56 AM PDT
To query by job state, use the option -s (or the long option --state) plus the abbreviated state code. For a complete list of job states and their codes, see the JOB STATE CODES section in the sacct manual. In the example below we query for all failed jobs. The --start and --end options, which set the start and end of the query window, are required arguments.
sacct example with user, format fields and job states
$ sacct -X --format=User,JobName,State -s f --start=2023-06-01 --end=now
User JobName State
--------- ---------- ----------
elvis 81932_161+ FAILED
elvis 82105_161+ FAILED
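To get a quick tally of how jobs ended rather than a full listing, the State column can be summarized. The sketch below is a hypothetical helper (count_states is not part of Slurm) that counts states from header-less, parsable sacct output, for instance from sacct -X -n -P --format=State with a suitable --start and --end. It keeps only the first word of each state, on the assumption that entries such as "CANCELLED by <uid>" should be grouped under CANCELLED.

```shell
# Tally job states from lines of `sacct -X -n -P --format=State` output.
# Only the first word is kept, so "CANCELLED by 1234" counts as CANCELLED.
count_states() {
    awk -F'|' '
        $1 != "" { split($1, word, " "); count[word[1]]++ }
        END { for (state in count) print state, count[state] }
    ' | sort
}

# Demonstration on made-up sample states (not real accounting data).
printf 'FAILED\nCOMPLETED\nFAILED\nCANCELLED by 1234\n' | count_states
```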
To filter output by JobID, you can specify the -j option with a list of comma-separated job IDs.
sacct filter by jobs
$ sacct -j 9994271,9992980
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
9994271 sh regular_m+ proj2 256 FAILED 1:0
9994271.ex+ extern proj2 256 COMPLETED 0:0
9994271.0 bash proj2 1 FAILED 1:0
9994271.1 a.out proj2 256 COMPLETED 0:0
9992980 sh resv proj2 256 COMPLETED 0:0
9992980.ex+ extern proj2 256 COMPLETED 0:0
9992980.0 bash proj2 1 COMPLETED 0:0