Slurm

NERSC uses Slurm for cluster/resource management and job scheduling. Slurm is responsible for allocating resources to users, providing a framework for starting, executing and monitoring work on allocated resources and scheduling work for future execution.

Jobs

A job is an allocation of resources such as compute nodes assigned to a user for an amount of time. Jobs can be interactive or batch (e.g., a script) scheduled for later execution.

Tip

NERSC provides an extensive set of example job scripts

Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps (sets of tasks) in any configuration within the allocation.

When you log in to a NERSC system you land on a login node. Login nodes are for editing, compiling, and preparing jobs; they are not for running jobs. From a login node you can interact with Slurm to submit job scripts or start interactive jobs.

NERSC supports a diverse workload including high-throughput serial tasks, full system capability simulations and complex workflows.

Submitting jobs

sbatch

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

When you submit the job, Slurm responds with the job's ID, which will be used to identify this job in reports from Slurm.

nersc$ sbatch first-job.sh
Submitted batch job 864933

Slurm will also check your file system usage and reject the job if you are over quota in your scratch or home file system. See the Quota Enforcement section below for more details.

salloc

salloc is used to allocate resources for a job in real time as an interactive batch job. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
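
For example, the following sketch requests a two-node interactive allocation and then launches a job step from the shell that salloc spawns (the node count, walltime, constraint, and the placeholder project m1234 and program ./my_app are illustrative only):

nersc$ salloc --nodes=2 --qos=interactive --constraint=haswell --time=00:30:00 --account=m1234
# once the allocation is granted, salloc starts a shell; from that shell:
nersc$ srun --ntasks=4 ./my_app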

srun

srun is used to submit a job for execution or initiate job steps in real time. A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job's node allocation. This command is typically executed within a script which is submitted with sbatch or from an interactive prompt on a compute node obtained via salloc.
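
For example, here is a sketch of a batch script that runs job steps both sequentially and concurrently within a single two-node Haswell allocation (the QOS, the placeholder project m1234, and the program ./my_app are illustrative only):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --constraint=haswell
#SBATCH --qos=regular
#SBATCH --account=m1234

# two job steps that run one after the other, each using the full allocation
srun --ntasks=64 ./my_app input1
srun --ntasks=64 ./my_app input2

# two job steps that run at the same time, one per node, by backgrounding srun
srun --nodes=1 --ntasks=32 ./my_app input3 &
srun --nodes=1 --ntasks=32 ./my_app input4 &
wait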

Options

At a minimum a job script must include the number of nodes, the walltime, the type of nodes (constraint), and the quality of service (QOS). If a script does not specify one of these options, a default may be applied (see Defaults below).

Tip

It is good practice to always set the account option (--account=<NERSC Project>).
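
For example, as a directive in a job script (m1234 is a placeholder project name):

#SBATCH --account=m1234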

The full list of directives is documented in the man pages for the sbatch command (see man sbatch). Each option can be specified either as a directive in the job script:

#!/bin/bash
#SBATCH -N 2

Or as a command line option when submitting the script:

nersc$ sbatch -N 2 ./first-job.sh

The command line and directive versions of an option are equivalent and interchangeable. If the same option is present both on the command line and as a directive, the command line will be honored. If the same option or directive is specified twice, the last value supplied will be used.

Also, many options have both a long form, e.g., --nodes=2, and a short form, e.g., -N 2. These are equivalent and interchangeable.

Many options are common to both sbatch and srun, for example sbatch -N 4 ./first-job.sh allocates 4 nodes to first-job.sh, and srun -N 4 uname -n inside the job runs a copy of uname -n on each of 4 nodes. If you don't specify an option in the srun command line, srun will inherit the value of that option from sbatch.

This inheritance works via environment variables: sbatch sets a number of environment variables with names like SLURM_NNODES, and srun checks the values of those variables. This has two important consequences:

  1. Your job script can see the settings it was submitted with by checking these environment variables (see the example after this list).

  2. You should not override these environment variables. Also be aware that if your job script does certain tricky things, such as using ssh to launch a command on another node, the environment might not be propagated and your job may not behave correctly.
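
As an illustration of the first point, here is a minimal sketch of a job script that reports the values Slurm set for it (the directives shown are arbitrary examples):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=00:10:00
#SBATCH --constraint=haswell
#SBATCH --qos=debug

# report the settings this job was submitted with
echo "Job ${SLURM_JOB_ID} was allocated ${SLURM_NNODES} nodes"

# srun inherits the node count from the same environment, so this runs
# one copy of uname -n on each of the 2 allocated nodes
srun --ntasks-per-node=1 uname -n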

Commonly Used Options

The table below lists some commonly used sbatch/salloc/srun options and their meanings. All of the listed options can be used with the sbatch or salloc commands (either on the command line or as directives within a script). Many are also commonly used with srun within a script or interactive job.

The long and short forms of each option are interchangeable, but their formats differ. The long form begins with a double hyphen and includes a word, acronym, or phrase (with words separated by single hyphens), followed by an equals sign and any argument to the option (e.g., --time=10:00:00), while the short form consists of a single hyphen and a single letter, followed by a space and any argument to the option (e.g., -t 10:00:00). For clarity, we recommend using the long form for Slurm directives in a script -- this makes it easier to understand what options are being set on each line.

Option (long form)    Option (short form)    Meaning                               Use with sbatch/salloc?    Use with srun?
--time                -t                     maximum walltime                      Y                          N
--time-min            (none)                 minimum walltime                      Y                          N
--nodes               -N                     number of nodes                       Y                          Y
--ntasks              -n                     number of MPI tasks                   Y                          Y
--cpus-per-task       -c                     number of processors per MPI task     Y                          Y
--constraint          -C                     constraint (e.g., type of resource)   Y                          N
--qos                 -q                     quality of service (QOS)              Y                          N
--account             -A                     project to charge for this job        Y                          N
--licenses            -L                     licenses (filesystems required for job)  Y                       N
--job-name            -J                     name of job                           Y                          N

Writing a Job Script

A clear job script will include at least the number of nodes, walltime, type of nodes (constraint), and quality of service (QOS). These options could be specified on the command line, but for clarity and to establish a record of the job submission we recommend including all these options (and more) in your job script.

A Slurm job script begins with a shell invocation (e.g., #!/bin/bash) followed by lines of directives, each of which begins with #SBATCH. After these directives, users then include the commands to be run in the script, including the setting of environment variables and the setup of the job. Usually (but not always) the script includes at least one srun command, launching a parallel job onto one or more nodes allocated to the job.

#!/bin/bash
#SBATCH --nodes=<nnodes>
#SBATCH --time=hh:mm:ss
#SBATCH --constraint=<architecture>
#SBATCH --qos=<QOS>
#SBATCH --account=<project_name>

# set up for problem & define any environment variables here

srun -n <num_mpi_processes> -c <cpus_per_task> a.out

# perform any cleanup or short post-processing here

The above script applies only to the simplest of cases and is not widely generalizable. In this simple case, a user would replace the items between < > with specific arguments, e.g., --nodes=2 or --qos=debug. The format for the maximum walltime request is hours, minutes, and seconds, separated by colons (e.g., --time=12:34:56 for 12 hours, 34 minutes, and 56 seconds).
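
For illustration only, a filled-in version of the template might look like the following; the project m1234 and the executable ./a.out are placeholders, and the task layout (64 MPI tasks per node with 4 CPUs per task) is just one possible choice for KNL nodes:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=01:30:00
#SBATCH --constraint=knl
#SBATCH --qos=regular
#SBATCH --account=m1234

# 128 MPI tasks across 2 nodes, 4 CPUs (hardware threads) per task, bound to cores
srun -n 128 -c 4 --cpu-bind=cores ./a.out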

There are many factors to consider when creating a script for your particular job. In our experience, determining the correct settings for the number of CPUs per task, process affinity, etc. can be tricky. Consequently, we recommend using the Job Script Generator to generate the correct #SBATCH directives, srun arguments, and process affinity settings for you.

The job script generator will provide the correct runtime arguments for your job, but may not adequately demonstrate a way to run jobs that fits your particular workflow or application. To help with this, we have developed a curated collection of example job scripts for users to peruse for inspiration.

Defaults

If you do not specify the following options in your script, defaults will be assigned. Note that jobs not specifying the constraint will be rejected.

Option        Cori
nodes         1
time          10 minutes
qos           debug
account       set in Iris
constraint    (reject job)
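
For example, a minimal sketch that relies on all of these defaults would only need to supply the constraint (which has no default); it would run on 1 node for up to 10 minutes in the debug QOS, charged to the default account set in Iris (./a.out is a placeholder):

#!/bin/bash
#SBATCH --constraint=haswell

srun -n 32 ./a.out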

Show Job Details

To view the details of a Slurm job:

scontrol show job <JobID>

Available memory for applications on compute nodes

The operating system uses some of the total memory (128 GB on a Haswell compute node and 96 GB on a KNL compute node), so the memory available in Slurm for applications to use is 118 GB on a Haswell node and 87 GB on a KNL node.
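
To check the memory Slurm has configured for a particular node, one option is to look at the RealMemory field reported by scontrol (a sketch; nid00012 is a placeholder node name):

cori$ scontrol show node nid00012 | grep RealMemory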

Quota Enforcement

Users will not be allowed to submit jobs if they are over quota in their scratch or home directories. This quota check is performed twice, first when the job is submitted and again when the running job invokes srun. This could mean that if you went over quota after submitting the job, the job could fail when it runs. Please check your quota regularly and delete or archive data as needed.

FAQs About Jobs

Q: How long will I wait for my jobs to run?

A: Queue wait times for past jobs can be a useful guide in estimating wait times of current jobs. The wait time depends on the quality of service (QOS), requested resources (nodes, time, filesystems, etc), jobs in the queue, your other jobs and other jobs from the same NERSC project.
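
The scheduler's current (non-binding) estimate of when your pending jobs will start can also be queried with squeue's --start option, for example:

nersc$ squeue -u $USER --start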

Q: How do I check how many free nodes are available in each partition?

A: Below is a sample Slurm sinfo command with selected output fields. Column 1 shows the partition name, Column 2 shows the status of the partition, Column 3 shows the maximum wall time limit for the partition, and Column 4 shows the number of nodes Allocated/Idle/Other/Total in the partition.

cori$ sinfo -o "%.10P %.10a %.15l %.20F"
 PARTITION      AVAIL       TIMELIMIT       NODES(A/I/O/T)
    system         up      1-12:00:00   11611/412/53/12076
    debug*         up           30:00    11299/77/42/11418
   jupyter         up      4-00:00:00            0/10/0/10
   regular         up      4-00:00:00    11199/15/40/11254
  regularx         up      2-00:00:00    11295/77/42/11414
      resv         up     14-00:00:00   11611/412/53/12076
resv_share         up     14-00:00:00     2159/210/19/2388
 benchmark         up      1-12:00:00    11299/77/42/11418
realtime_s         up        12:00:00      1847/75/12/1934
  realtime         up        12:00:00    11299/89/42/11430
    shared         up      2-00:00:00            59/0/1/60
interactiv         up         4:00:00         93/282/9/384
  genepool         up      3-00:00:00         160/31/1/192
genepool_s         up      3-00:00:00         160/31/1/192

Q: How do I find which slurm accounts I am part of?

A: You can view your account membership by running the iris command, which shows your user details. The first column, Project, lists all the Slurm accounts your user is associated with.

$ iris
Project      Used(user)    Allocated(user)        Used    Allocated
---------  ------------  -----------------  ----------  -----------
m3503               0.0          1000000.0     12726.1    1000000.0
nstaff          21690.5          4000000.0  26397725.1   80000000.0

You can also view this information at https://iris.nersc.gov/.

Q: How do I check how many Haswell and KNL nodes are idle now?

A: Below is a sample Slurm sinfo command with selected output fields. Column 1 shows the available compute node features (such as Haswell or KNL), and Column 2 shows the number of nodes Allocated/Idle/Other/Total with each feature. Both knl and knl,cache,quad are KNL quad cache nodes.

cori$ sinfo -o "%.20b %.20F"
     ACTIVE_FEATURES       NODES(A/I/O/T)
                 knl              0/0/6/6
             haswell     2138/231/19/2388
      knl,cache,quad     9412/242/28/9682

Q: How many interactive QOS nodes are available that I can use?

A: Each repo can use up to a total of 64 nodes (Haswell and KNL combined). Run the following command to see how many interactive nodes are being used by the members of your repo:

cori$ squeue --qos=interactive --account=<reponame> -O jobid,username,starttime,timelimit,maxnodes,account

If the total adds up to 64 nodes, please contact the other group members if you feel they need to release interactive resources.

Q: Could I run a single job across both Haswell and KNL compute nodes?

A: This is currently available only for certain NERSC and Cray staff for benchmarking. We will evaluate whether (and if yes, how) to make it available for general NERSC users.

Q: How do I improve my I/O performance?

A: Consider using the Burst Buffer, a tier of SSDs that sits inside the Cori high-speed network and provides high-performance I/O. See the NERSC Burst Buffer documentation for more details. If you are using the $SCRATCH file system, see the NERSC I/O optimization documentation for tips.
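
As a sketch, a per-job Burst Buffer allocation is requested with a #DW directive in the job script; the capacity below is arbitrary, and ./a.out is a placeholder program that is passed the allocation's path as its argument:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#SBATCH --qos=regular
#SBATCH --time=00:30:00
#DW jobdw capacity=10GB access_mode=striped type=scratch

# the Burst Buffer allocation for this job is available at $DW_JOB_STRIPED
srun -n 32 ./a.out $DW_JOB_STRIPED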

Q: What is a slurm cluster?

A: A Slurm cluster comprises all nodes managed by a single slurmctld daemon. Each Slurm cluster is independent, with its own Slurm environment (partitions/QOS) and job accounting. On Cori, we operate in multi-cluster mode with two clusters, cori and escori. You can submit jobs to the local or a remote cluster using sbatch --clusters=<CLUSTERNAME>.

Q: What is the default slurm cluster for Cori?

A: cori

Q: What are the available slurm clusters for Cori?

A: The available Slurm clusters are cori and escori.

Q: How do I submit jobs to the escori slurm cluster?

A: In order to submit jobs to the escori Slurm cluster, you need to load the following module:

module load esslurm

Note

module load esslurm will make escori your default Slurm cluster. If you want to revert to cori, unload the module (module unload esslurm).

The default Slurm binaries are in /usr/bin, but we place the Slurm binaries for esslurm (i.e., sbatch, squeue, sacct, srun) in /opt/esslurm/bin. Once you load the module, which sbatch should show the following:

cori$ which sbatch
/opt/esslurm/bin/sbatch

To submit jobs to escori, use the command line option sbatch --clusters=escori or the job directive #SBATCH --clusters=escori within your script.
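
Putting this together, here is a sketch of submitting a job to escori and checking on it (my_job.sh is a placeholder script name):

cori$ module load esslurm
cori$ sbatch --clusters=escori my_job.sh
cori$ squeue --clusters=escori -u $USER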