# Example job scripts¶

For details of terminology used on this page please see our jobs overview. Correct affinity settings are essential for good performance.

For Perlmutter, please see the running jobs on Perlmutter GPU nodes section. For Cori GPU, see the examples on docs-dev.nersc.gov.

## Basic MPI batch script¶

One MPI process per physical core.

Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=32
#SBATCH --constraint=haswell

srun check-mpi.intel.cori

Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=68
#SBATCH --constraint=knl

srun check-mpi.intel.cori


## Hybrid MPI+OpenMP jobs¶

Warning

In Slurm each hyperthread is considered a "cpu", so the --cpus-per-task option must be adjusted accordingly. Generally, the best performance is obtained with 1 OpenMP thread per physical core. Additional details about affinity settings.
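As a rough illustration of the warning above, a --cpus-per-task value can be derived from the number of OpenMP threads per task and the number of hyperthreads per physical core. This is only a sketch: the function name cpus_per_task is hypothetical (it is not a Slurm or NERSC utility), and the hyperthread counts of 2 per core on Haswell and 4 per core on KNL are assumptions stated here for the example.

```shell
# Hypothetical helper: compute a --cpus-per-task value so that each
# OpenMP thread gets one physical core.
#   $1 = OpenMP threads per MPI task
#   $2 = hyperthreads ("cpus" in Slurm terms) per physical core
cpus_per_task() (
    threads=$1
    hyperthreads_per_core=$2
    echo $(( threads * hyperthreads_per_core ))
)

# Assumed values: 2 hyperthreads per core on Haswell, 4 on KNL.
echo "Haswell, 8 threads per task: --cpus-per-task=$(cpus_per_task 8 2)"
echo "KNL, 8 threads per task:     --cpus-per-task=$(cpus_per_task 8 4)"
```

With these assumptions, 8 threads per task correspond to 16 "cpus" on Haswell and 32 on KNL.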

### Example 1¶

One MPI process per socket and 1 OpenMP thread per physical core

Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=32
#SBATCH --constraint=haswell

export OMP_NUM_THREADS=16
export OMP_PROC_BIND=true

srun check-hybrid.intel.cori

Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=272
#SBATCH --constraint=knl

export OMP_NUM_THREADS=68
export OMP_PROC_BIND=true

srun check-hybrid.intel.cori


### Example 2¶

28 MPI processes with 8 OpenMP threads per process, each OpenMP thread has 1 physical core

Note

The addition of --cpu-bind=cores is useful for getting correct affinity settings.

Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=7
#SBATCH --ntasks=28
#SBATCH --cpus-per-task=16
#SBATCH --constraint=haswell

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true

srun --cpu-bind=cores check-hybrid.intel.cori

Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=4
#SBATCH --ntasks=28
#SBATCH --cpus-per-task=32
#SBATCH --constraint=knl

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true

srun --cpu-bind=cores check-hybrid.intel.cori


## Interactive¶

Interactive jobs are launched with the salloc command.

Tip

Cori has dedicated nodes for interactive work.

Cori Haswell
cori$ salloc --qos=interactive -C haswell --time=60 --nodes=2

Cori KNL
cori$ salloc --qos=interactive -C knl --time=60 --nodes=2


Note

Additional details on Cori's interactive QOS

## Multiple Parallel Jobs Sequentially¶

Multiple sruns can be executed one after another in a single batch script. Be sure to specify the total walltime needed to run all jobs.

Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=4
#SBATCH --time=10:00
#SBATCH --constraint=haswell

srun -n 128 -c 2 --cpu_bind=cores ./a.out
srun -n 64 -c 4 --cpu_bind=cores ./b.out
srun -n 32 -c 8 --cpu_bind=cores ./c.out


Tip

Workflow tools are another option to help you run multiple parallel sequential jobs.

## Multiple Parallel Jobs Simultaneously¶

Multiple sruns can be executed simultaneously in a single batch script.

Tip

Be sure to specify the total number of nodes needed to run all jobs at the same time.

Note

By default, multiple concurrent srun executions cannot share compute nodes under Slurm in the non-shared QOSs.

In the following example, a total of 192 cores are required, which would hypothetically fit on 192 / 32 = 6 Haswell nodes. However, because sruns cannot share nodes by default, we instead have to dedicate:

• 2 nodes to the first execution (44 cores)
• 4 to the second (108 cores)
• 2 to the third (40 cores)

For all three executables the nodes are not fully packed and the number of MPI tasks per node is not a divisor of 64, so both the -c and --cpu-bind flags are used in the srun commands.
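The node counts in the list above follow from rounding each task count up to whole nodes. The sketch below reproduces that arithmetic, assuming one task per physical core and 32 physical cores per Haswell node; the helper name nodes_needed is hypothetical.

```shell
# Hypothetical helper: nodes needed when each task occupies one physical core.
#   $1 = number of tasks
#   $2 = physical cores per node
nodes_needed() (
    tasks=$1
    cores_per_node=$2
    # Integer ceiling division
    echo $(( (tasks + cores_per_node - 1) / cores_per_node ))
)

total=0
for tasks in 44 108 40; do
    n=$(nodes_needed "$tasks" 32)
    echo "$tasks tasks -> $n nodes"
    total=$(( total + n ))
done
echo "total nodes: $total"
```

The total of 8 matches the --nodes=8 request in the batch script for this example.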

Note

The "&" at the end of each srun command and the wait command at the end of the script are very important to ensure the jobs are run in parallel and the batch job will not exit before all the simultaneous sruns are completed.

Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=8
#SBATCH --time=30:00
#SBATCH --constraint=haswell

srun -N 2 -n 44 -c 2 --cpu_bind=cores ./a.out &
srun -N 4 -n 108 -c 2 --cpu_bind=cores ./b.out &
srun -N 2 -n 40 -c 2 --cpu_bind=cores ./c.out &
wait


Tip

Workflow tools are another option to help you run multiple parallel simultaneous jobs.

## Job Arrays¶

Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.

This example submits 3 jobs. Each job uses 1 node and has the same time limit and QOS. The SLURM_ARRAY_TASK_ID environment variable is set to the array index value.

Cori KNL

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --constraint=knl
#SBATCH --time=2
#SBATCH --array=0-2

echo $SLURM_ARRAY_TASK_ID

Additional examples and details

Tip

In many use cases, GNU Parallel is a superior solution to task arrays. This is because the Slurm scheduler prioritizes fewer jobs requesting many nodes ahead of many jobs requesting fewer nodes (array tasks are considered individual jobs). Other workflow tools are available as well.

## Dependencies¶

Job dependencies can be used to construct complex pipelines or chain together long simulations requiring multiple steps.

Note

The --parsable option to sbatch can simplify working with job dependencies.

Example
jobid=$(sbatch --parsable first_job.sh)
sbatch --dependency=afterok:$jobid second_job.sh

Example
jobid1=$(sbatch --parsable first_job.sh)
jobid2=$(sbatch --parsable --dependency=afterok:$jobid1 second_job.sh)
jobid3=$(sbatch --parsable --dependency=afterok:$jobid1 third_job.sh)
sbatch --dependency=afterok:$jobid2,afterok:$jobid3 last_job.sh
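
When a final job depends on many earlier jobs, the dependency string can be assembled in a loop. The sketch below relies on Slurm's afterok:job_id[:job_id...] form, which is equivalent to listing several afterok entries separated by commas. The helper name afterok_list and the job IDs are made up for illustration; in a real pipeline the IDs would come from sbatch --parsable.

```shell
# Hypothetical helper: build "afterok:<id1>:<id2>:..." from a list of job IDs.
afterok_list() (
    dep="afterok"
    for id in "$@"; do
        dep="$dep:$id"
    done
    echo "$dep"
)

# Made-up job IDs, printed rather than submitted so the sketch runs anywhere.
echo "sbatch --dependency=$(afterok_list 1001 1002) last_job.sh"
```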


Note

A job that is dependent on another job does not accumulate eligible queue wait time before the dependency is satisfied.

## Shared¶

Unlike other QOSs, in the shared QOS a single node can be shared by multiple users or jobs. Jobs in the shared QOS are charged for each physical core allocated to the job.

Tip

In many use cases, GNU Parallel is a superior solution to using a shared QOS. This is because the Slurm scheduler prioritizes fewer jobs requesting many nodes ahead of many jobs requesting fewer nodes.

The number of physical cores allocated to a job by Slurm is controlled by three parameters:

• -n (--ntasks)
• -c (--cpus-per-task)
• --mem - Total memory available to the job (MemoryRequested)

Note

In Slurm a "cpu" corresponds to a hyperthread. So there are 2 cpus per physical core.

The memory on a node is divided evenly among the "cpus" (or hyperthreads):

| System | MemoryPerCpu (megabytes) |
|--------|--------------------------|
| Cori   | 1952                     |

The number of physical cores used by a job is computed by

$$
\text{physical cores} = \Bigl\lceil \frac{1}{2} \max \left( \Bigl\lceil \frac{\mathrm{MemoryRequested}}{\mathrm{MemoryPerCpu}} \Bigr\rceil, \mathrm{ntasks} \times \mathrm{CpusPerTask} \right) \Bigr\rceil
$$
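
To make the formula concrete, the sketch below evaluates it with integer arithmetic. The function name physical_cores is hypothetical, memory is given in megabytes (with 1GB taken as 1024 MB, an assumption), and the MemoryPerCpu value of 1952 MB is the Cori figure listed above.

```shell
# Sketch of the formula above (the function name is hypothetical).
#   $1 = memory requested in MB (use 0 if --mem is not given)
#   $2 = ntasks
#   $3 = cpus per task
physical_cores() (
    mem_mb=$1; ntasks=$2; cpus_per_task=$3
    mem_per_cpu=1952    # MB per "cpu" on Cori
    # "cpus" needed to cover the memory request, rounded up
    mem_cpus=$(( (mem_mb + mem_per_cpu - 1) / mem_per_cpu ))
    task_cpus=$(( ntasks * cpus_per_task ))
    if [ "$mem_cpus" -gt "$task_cpus" ]; then
        cpus=$mem_cpus
    else
        cpus=$task_cpus
    fi
    # Two hyperthreads per physical core, rounded up
    echo $(( (cpus + 1) / 2 ))
)

physical_cores 0 2 2        # two ranks with 2 cpus each  -> 2
physical_cores 1024 1 1     # serial job with --mem=1GB   -> 1
physical_cores 10240 1 1    # serial job with --mem=10GB  -> 3
```

The last case shows how a memory request alone can raise the number of cores charged: 10240 MB needs 6 "cpus" worth of memory, which is 3 physical cores.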

Cori-Haswell MPI

A two rank MPI job which utilizes 2 physical cores (and 4 hyperthreads) of a Haswell node.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2

srun --cpu-bind=cores ./a.out

Cori-Haswell MPI/OpenMP

A two rank MPI job which utilizes 4 physical cores (and 8 hyperthreads) of a Haswell node.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=2
srun --cpu-bind=cores ./a.out

Cori-Haswell OpenMP

An OpenMP only code which utilizes 6 physical cores.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12

export OMP_NUM_THREADS=6
./my_openmp_code.exe

Cori-Haswell serial

A serial job should start by requesting a single slot and increase the amount of memory required only as needed to maximize throughput and minimize charge and wait time.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --mem=1GB

./serial.exe


## Intel MPI¶

Applications built with Intel MPI can be launched via srun in the Slurm batch script on Cori compute nodes. The module impi must be loaded, and the application should be built using the mpiicc (for C Codes) or mpiifort (for Fortran codes) or mpiicpc (for C++ codes) commands.

Cori Haswell
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --time=03:00:00
#SBATCH --nodes=8
#SBATCH --constraint=haswell

module load impi
mpiicc -qopenmp -o mycode.exe mycode.c

srun -n 32 -c 16 --cpu-bind=cores ./mycode.exe


## Open MPI¶

Applications built with Open MPI can be launched via srun or Open MPI's mpirun command. The module openmpi needs to be loaded to build an application against Open MPI. Typically one builds the application using the mpicc (for C codes), mpifort (for Fortran codes), or mpiCC (for C++ codes) commands. Alternatively, Open MPI supports use of pkg-config to obtain the include and library paths. For example, pkg-config --cflags --libs ompi-c returns the flags that must be passed to the backend compiler (e.g. gcc, gfortran, icc, ifort) to build against Open MPI. Open MPI also supports Java MPI bindings. Use mpijavac to compile Java codes that use the Java MPI bindings. For Java MPI, it is highly recommended to launch jobs using Open MPI's mpirun command. Note that the Open MPI packages at NERSC do not support static linking.

Cori Haswell Open MPI
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=haswell

module load openmpi

/bin/cat <<EOM > ring_c.c
/*
 * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *
 * Simple ring test program in C.
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message, tag = 201;

    /* Start up MPI */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Calculate the rank of the next process in the ring.  Use the
       modulus operator so that the last process "wraps around" to
       rank zero. */

    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    /* If we are the "master" process (i.e., MPI_COMM_WORLD rank 0),
       put the number of times to go around the ring in the
       message. */

    if (0 == rank) {
        message = 10;

        printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
               message, next, tag, size);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        printf("Process 0 sent to %d\n", next);
    }

    /* Pass the message around the ring.  The exit mechanism works as
       follows: the message (a positive integer) is passed around the
       ring.  Each time it passes rank 0, it is decremented.  When
       each process receives a message containing a 0 value, it
       passes the message on to the next process and then quits.  By
       passing the 0 message first, every process gets the 0 message
       and can quit normally. */

    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        if (0 == rank) {
            --message;
            printf("Process 0 decremented value: %d\n", message);
        }

        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        if (0 == message) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    /* The last process does one extra send to process 0, which needs
       to be received before the program can exit */

    if (0 == rank) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* All done */

    MPI_Finalize();
    return 0;
}
EOM

mpicc -o ring_c ring_c.c
mpirun ring_c
#
# run again with srun
#
srun ring_c

Cori KNL Open MPI
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=knl

module load openmpi

/bin/cat <<EOM > ring_c.c
/*
 * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *
 * Simple ring test program in C.
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message, tag = 201;

    /* Start up MPI */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Calculate the rank of the next process in the ring.  Use the
       modulus operator so that the last process "wraps around" to
       rank zero. */

    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    /* If we are the "master" process (i.e., MPI_COMM_WORLD rank 0),
       put the number of times to go around the ring in the
       message. */

    if (0 == rank) {
        message = 10;

        printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
               message, next, tag, size);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        printf("Process 0 sent to %d\n", next);
    }

    /* Pass the message around the ring.  The exit mechanism works as
       follows: the message (a positive integer) is passed around the
       ring.  Each time it passes rank 0, it is decremented.  When
       each process receives a message containing a 0 value, it
       passes the message on to the next process and then quits.  By
       passing the 0 message first, every process gets the 0 message
       and can quit normally. */

    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        if (0 == rank) {
            --message;
            printf("Process 0 decremented value: %d\n", message);
        }

        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        if (0 == message) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    /* The last process does one extra send to process 0, which needs
       to be received before the program can exit */

    if (0 == rank) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* All done */

    MPI_Finalize();
    return 0;
}
EOM

mpicc -o ring_c ring_c.c
mpirun ring_c
#
# run again with srun
#
srun ring_c


## Xfer queue¶

The intended use of the xfer queue is to transfer data between compute systems and HPSS. The xfer jobs run on one of the login nodes and are free of charge. If you want to transfer data to the HPSS archive system at the end of a regular job, you can submit an xfer job at the end of your batch job script via module load esslurm; sbatch hsi put <my_files> (be sure to load the esslurm module first, or you'll end up in the regular queue), so that you will not get charged for the duration of the data transfer. The xfer jobs can be monitored via module load esslurm; squeue. The number of running jobs for each user is limited to the number of concurrent HPSS sessions (15).

Tip

You must load the module esslurm to access the xfer queue

Warning

Do not run computational jobs in the xfer queue.

Xfer transfer job
#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=my_transfer

#Archive run01 to HPSS
htar -cvf run01.tar run01


Xfer jobs specifying -N nodes will be rejected at submission time. When submitting an Xfer job from Cori, the -C haswell is not needed since the job does not run on compute nodes. By default, xfer jobs get 2GB of memory allocated. The memory footprint scales somewhat with the size of the file, so if you're archiving larger files, you'll need to request more memory. You can do this by adding #SBATCH --mem=XGB to the above script (where X in the range of 5 - 10 GB is a good starting point for large files).

To monitor your xfer jobs please load the esslurm module, then you can use Slurm commands like squeue or scontrol to access the xfer queue on Cori.

## Variable-time jobs¶

After scheduling jobs earlier in the queue, Slurm attempts to fill gaps in the schedule by scanning the remainder of the queue for jobs that can fit in those gaps. If your job can be flexible about the required runtime, you can add a --time-min flag and Slurm will start the job in the first gap larger than the specified --time-min, thus reducing queue wait time. Slurm will set the time limit for the job to either the maximum requested time (--time) or the size of the gap, whichever is smaller.
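
The rule in the paragraph above can be sketched as a small function. This is only an illustration of the described behavior, not Slurm's scheduler logic; the function name granted_minutes is hypothetical, all times are in minutes, and a gap exactly equal to --time-min is treated as usable, which is an assumption.

```shell
# Hypothetical illustration of how the time limit is chosen.
#   $1 = --time-min, $2 = --time, $3 = size of the schedule gap (minutes)
granted_minutes() (
    time_min=$1; time_max=$2; gap=$3
    if [ "$gap" -lt "$time_min" ]; then
        echo "none"          # the gap is too small; the job keeps waiting
    elif [ "$gap" -lt "$time_max" ]; then
        echo "$gap"          # the job gets the whole gap
    else
        echo "$time_max"     # the gap is larger than the maximum request
    fi
)

granted_minutes 120 2880 90      # gap shorter than --time-min -> none
granted_minutes 120 2880 180     # 3-hour gap                  -> 180
granted_minutes 120 2880 4000    # gap longer than --time      -> 2880
```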

Tip

Jobs that are capable of checkpoint/restart are ideal candidates for --time-min.

Pre-terminated jobs can be requeued (or resubmitted) by using the scontrol requeue command (or sbatch) to resume from where the previous executions left off, until the cumulative execution time reaches the desired time limit or the job completes.

When combined with checkpointing, this allows jobs to accumulate more than the usual 48-hour wallclock limit.

For example, a job that requests a fixed 6 hours of wallclock time cannot be used to fill a gap shorter than 6 hours.

Variable-time jobs are jobs submitted with a minimum time, #SBATCH --time-min, in addition to the maximum time (#SBATCH --time). Here is an example job script for variable-time jobs:

Sample job script with --time-min

#!/bin/bash
#SBATCH -J test
#SBATCH -q flex
#SBATCH -C knl
#SBATCH -N 1
#SBATCH --time=48:00:00      #the max walltime allowed for flex QOS jobs
#SBATCH --time-min=2:00:00   #the minimum amount of time the job should run

#this is an example to run an MPI+OpenMP job:
export OMP_PROC_BIND=true

srun -n8 -c32 --cpu_bind=cores ./a.out


### Using the flex QOS for charging discount for variable-time jobs¶

You can access the flex queue (and a substantial discount) by submitting with -q flex. You must specify a minimum running time for this job of 2 hours or less with the --time-min flag. Jobs submitted without the --time-min flag will be automatically rejected by the batch system. The maximum wall time request limit (requested via --time or -t flag) for flex jobs must be greater than 2 hours and cannot exceed 48 hours.

Example

A flex job requesting a minimum time of 1.5 hours, and max wall time of 10 hrs:

sbatch -q flex --time-min=01:30:00 --time=10:00:00 my_batch_script.sl
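
The flex QOS rules described above can be checked before submission. The sketch below encodes them for times given in minutes; the function name check_flex is hypothetical and the check is not part of Slurm or any NERSC tool.

```shell
# Hypothetical pre-submission check for the flex QOS rules.
#   $1 = --time-min in minutes (empty string if not given)
#   $2 = --time in minutes
check_flex() (
    time_min=$1; time_max=$2
    if [ -z "$time_min" ]; then
        echo "rejected: --time-min is required"
    elif [ "$time_min" -gt 120 ]; then
        echo "rejected: --time-min must be 2 hours or less"
    elif [ "$time_max" -le 120 ] || [ "$time_max" -gt 2880 ]; then
        echo "rejected: --time must be greater than 2 hours and at most 48 hours"
    else
        echo "ok"
    fi
)

check_flex 90 600     # the sbatch example above: 1.5 h minimum, 10 h maximum
check_flex "" 600     # missing --time-min
check_flex 180 600    # --time-min too long
check_flex 90 3000    # --time above 48 hours
```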


Tip

Variable-time jobs, specifying a shorter amount of time that a job should run, increase backfill opportunities, meaning users will see a better queue turnaround. In addition, the process of job resubmitting can be automated, so users can run a long job in multiple shorter chunks with a single job script (see the automated job script sample below). However, variable-time jobs incur checkpoint/restart overheads from splitting a longer job into multiple shorter ones. The flex QOS discount aims to compensate for checkpoint-restart overheads.

Note

• The flex QOS has a 75% charging discount on KNL and 50% discount on Haswell. The discount rate is subject to change.
• Variable-time jobs work with any QOS on Cori, but the charging discount is only available with the flex QOS.

### Annotated example - automated variable-time jobs¶

A sample job script for variable-time jobs, which automates the process of executing, pre-terminating, requeuing and restarting the job repeatedly until it runs for the desired amount of time or the job completes.

Cori Haswell
#!/bin/bash
#SBATCH -J vtj
#SBATCH -q regular
#SBATCH -C haswell
#SBATCH -N 2
#SBATCH --time=48:00:00
#SBATCH --time-min=2:00:00 #the minimum amount of time the job should run
#SBATCH --error=vtj-%j.err
#SBATCH --output=vtj-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --comment=96:00:00  #desired timelimit
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append

# specify the command to run to checkpoint your job if any (leave blank if none)
ckpt_command=

# requeueing the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#

# user setting goes here

# srun must execute in the background and catch the signal USR1 on the wait command
srun -n64 -c2 --cpu_bind=cores ./a.out &

wait


Cori KNL

#!/bin/bash
#SBATCH -J vtj
#SBATCH -q flex
#SBATCH -C knl
#SBATCH -N 2
#SBATCH --time=48:00:00
#SBATCH --time-min=2:00:00 #the minimum amount of time the job should run
#SBATCH --error=%x-%j.err
#SBATCH --output=%x-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --comment=96:00:00  #desired time limit
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append

# specify the command to use to checkpoint your job if any (leave blank if none)
ckpt_command=

# requeueing the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#

# user setting goes here
export OMP_PROC_BIND=true

#srun must execute in the background and catch the signal USR1 on the wait command
srun -n32 -c16 --cpu_bind=cores ./a.out &

wait


The --comment option is used to enter the user’s desired maximum wall-clock time (96 hours in this example), which could be longer than the maximum time limit allowed by the batch system. In addition to the time limit (--time), the --time-min option is used to specify the minimum amount of time the job should run (2 hours).

The script setup.sh defines a few bash functions (e.g., requeue_job, func_trap) that are used to automate the process. The requeue_job func_trap USR1 command executes the func_trap function, which contains a list of actions to checkpoint and requeue the job upon trapping the USR1 signal. Users may want to modify the scripts (get a copy) as needed, although they should work for most applications as they are now.

The job script works as follows:

1. User submits the above job script.
2. The batch system looks for a backfill opportunity for the job. If it can allocate the requested number of nodes for this job for any duration (e.g., 3 hours) between the specified minimum time (2 hours) and the time limit (48 hours) before those nodes are used for other higher priority jobs, the job starts execution.
3. The job runs until it receives a signal USR1 (--signal=B:USR1@<sig_time>) 60 seconds (sig_time=60 in this example) before it hits the allocated time limit (3 hours). The sig_time should match the amount of time (in seconds) needed for checkpointing.
4. Upon receiving the signal, the job checkpoints and requeues itself with the remaining max time limit before it gets terminated.
5. Steps 2-4 repeat until the job runs for the desired amount of time (96 hours) or the job completes.
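
The bookkeeping behind steps 4 and 5 can be pictured as follows. This is a hedged sketch and not the logic of NERSC's setup.sh: it assumes that each requeued run asks for the remaining desired time, capped at the per-job maximum. The function name next_time_limit is hypothetical and times are in whole hours.

```shell
# Hypothetical sketch: time limit to request for the next requeued run.
#   $1 = desired total hours (the --comment value)
#   $2 = hours already run
#   $3 = maximum hours per job (the --time value)
next_time_limit() (
    desired=$1; elapsed=$2; per_job_max=$3
    remaining=$(( desired - elapsed ))
    if [ "$remaining" -le 0 ]; then
        echo 0                  # desired time reached; nothing to requeue
    elif [ "$remaining" -lt "$per_job_max" ]; then
        echo "$remaining"
    else
        echo "$per_job_max"
    fi
)

next_time_limit 96 3 48     # after a first 3-hour run -> 48
next_time_limit 96 60 48    # 36 hours left            -> 36
next_time_limit 96 96 48    # desired time reached     -> 0
```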

Note

• If your application requires external triggers or commands to do checkpointing, you need to provide the checkpoint commands using the variable, ckpt_command. It could be a script containing several commands to be executed within the specified checkpoint overhead time.
• Additionally, if you need to change the job input files to resume the job, you can do so within ckpt_command.
• If your application does checkpointing periodically, like most molecular dynamics codes do, you don’t need to specify ckpt_command (just leave it blank).
• You can send the USR1 signal outside the job script any time using the scancel -b -s USR1 <jobid> command to terminate the currently running job. The job still checkpoints and requeues itself before it gets terminated.
• The srun command must execute in the background (notice the & at the end of the srun command line and the wait command at the end of the job script) so that the USR1 signal is caught on the wait command instead of srun. This allows srun to keep running a bit longer (up to sig_time seconds) to complete the checkpointing.

### VASP example¶

VASP atomic relaxation jobs for Cori KNL

#!/bin/bash
#SBATCH -J vt_vasp
#SBATCH -q regular
#SBATCH -C knl
#SBATCH -N 2
#SBATCH --time=48:0:00
#SBATCH --error=%x%j.err
#SBATCH --output=%x%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --comment=96:00:00
#SBATCH --time-min=02:0:00
#SBATCH --signal=B:USR1@300
#SBATCH --requeue
#SBATCH --open-mode=append

# user setting

#srun must execute in background and catch signal on wait command
srun -n 32 -c16 --cpu_bind=cores vasp_std &

# put any commands that need to run to prepare for the next job here
ckpt_vasp() {
    restarts=$(squeue -h -O restartcnt -j $SLURM_JOB_ID)
    echo checkpointing the ${restarts}-th job >&2

    #to terminate VASP at the next ionic step
    echo LSTOP = .TRUE. > STOPCAR

    #wait for VASP to complete the current ionic step, write out WAVECAR file and quit
    srun_pid=$(ps -fle|grep srun|head -1|awk '{print $4}')
    echo srun pid is $srun_pid >&2
0-35 ./a.out
36-96 ./b.out

cori$ cat batch_script.sh
#!/bin/bash
#SBATCH -q regular
#SBATCH -N 5
#SBATCH -n 97  # total of 97 tasks
#SBATCH -t 02:00:00
#SBATCH -C haswell

srun --multi-prog ./mpmd.conf

## Burst buffer¶

All examples for the burst buffer are shown with Cori Haswell nodes, but the burst buffer can also be used with KNL nodes. More details about the Burst Buffer are available in the dedicated page.

Check the DataWarp limitations

Please note that support for DataWarp has been reduced. The Burst Buffer is also not persistent storage, and a reservation can become unavailable if hardware is unstable. A user reported a data corruption event, detailed in the known issues section of the Burst Buffer documentation page. We invite users to consider using the Cori SCRATCH file system whenever possible. DataWarp is still available for those who benefit from it and recognize the possible risks. Make sure to understand all limitations of Burst Buffer reported in the Burst Buffer doc page, to avoid losing data and wasting precious compute hours.

### Scratch¶

Use the burst buffer as a scratch space to store temporary data during the execution of I/O intensive codes. In this mode all data from the burst buffer allocation will be removed automatically at the end of the job.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=32
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch

srun check-mpi.intel.cori > ${DW_JOB_STRIPED}/output.txt
ls ${DW_JOB_STRIPED}
cat ${DW_JOB_STRIPED}/output.txt


### Stage in/out¶

Copy the named file or directory into the Burst Buffer, which can then be accessed using $DW_JOB_STRIPED.

Note

• Only files on the Cori $SCRATCH file system can be staged in, and stage out only works to Cori $SCRATCH; if a destination file not on $SCRATCH is used, the files will be lost
• A full path to the file must be used
• You must have permissions to access the file
• The job start may be delayed until the transfer is complete
• Stage out occurs after the job is completed, so there is no charge

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/dwtest-file destination=$DW_JOB_STRIPED/dwtest-file type=file

srun ls ${DW_JOB_STRIPED}/dwtest-file

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_out source=$DW_JOB_STRIPED/output destination=/global/cscratch1/sd/username/output type=directory

mkdir $DW_JOB_STRIPED/output
srun check-mpi.intel.cori > ${DW_JOB_STRIPED}/output/output.txt

### Persistent Reservations¶

Persistent reservations are useful when multiple jobs need access to the same files.

Warning

• Reservations must be deleted when no longer in use.
• There are no guarantees of data integrity over long periods of time.
• If you have multiple jobs writing to the same directory in a Persistent Reservation, you will run into race conditions due to the DataWarp caching. The second job will likely fail with Permission denied or No such file or directory messages. See the Burst Buffer dedicated page for more details.

Note

Each persistent reservation must have a unique name. Check the existing PRs with scontrol show burst.

#### Create¶

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#BB create_persistent name=PRname capacity=100GB access_mode=striped type=scratch

#### Use¶

Take care if multiple jobs will be using the reservation to not overwrite data.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#DW persistentdw name=PRname

ls $DW_PERSISTENT_STRIPED_PRname/


#### Destroy¶

Any data on the reservation at the time the script executes will be removed.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#BB destroy_persistent name=PRname


### Interactive¶

The burst buffer is available also in interactive sessions. It is recommended to use a configuration file for the burst buffer directives:

cori$ cat bbf.conf
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/filename destination=$DW_JOB_STRIPED/filename type=file

cori$ salloc --qos=interactive -C haswell -t 00:30:00 --bbf=bbf.conf


## Large Memory¶

There are two nodes on Cori, cori22 and cori23, with 750 GB of memory that can be used for jobs that require very high memory per node. There are only two nodes, so this resource is limited and should only be used for jobs that require high memory.

Cori Example

A sample bigmem job which needs only one core.

#!/bin/bash
#SBATCH --clusters=escori
#SBATCH --qos=bigmem
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --job-name=my_big_job

srun -n 1 ./my_big_executable


## Realtime¶

The "realtime" QOS is for jobs that need realtime turnaround. It is intended only for jobs that are connected with an external realtime component (e.g. live beamline runs, telescope time, etc.).

Note

Use of this QOS requires special approval, and is only intended for use with a live, external realtime component that needs on-demand resources. There are limited resources available on this queue. It is not intended to provide faster batch turnaround for regular jobs.

"realtime" QOS Request Form

The realtime QOS is a user-selective shared QOS, meaning you can request either exclusive node access (with the #SBATCH --exclusive flag) or allow multiple applications to share a node (with the #SBATCH --share flag).

Tip

It is recommended to allow sharing the nodes so more jobs can be scheduled in the allocated nodes. Sharing a node is the default setting, and using #SBATCH --share is optional.

Example

Uses two full nodes

#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job
#SBATCH --exclusive

srun -n 64 -c 2 --cpu-bind=cores ./mycode.exe   # pure MPI, 64 MPI tasks


If you are requesting only a portion of a single node, please add --gres=craynetwork:0 as follows to allow more jobs on the node. Similar to using the "shared" QOS, you can request a number of slots on the node (total of 64 CPUs, or 64 slots) by specifying --ntasks and/or --mem. The rules are the same as for the shared QOS.

Example

Two MPI ranks running with 4 OpenMP threads each. The job is using in total 8 physical cores (8 "cpus" or hyperthreads per "task") and 10GB of memory.

#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --gres=craynetwork:0
#SBATCH --mem=10GB
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job2
#SBATCH --share

export OMP_NUM_THREADS=4
srun --cpu-bind=cores ./mycode.exe


Example

OpenMP only code running with 6 threads. Note that srun is not required in this case.

#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --gres=craynetwork:0
#SBATCH --mem=16GB
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job3
#SBATCH --share

export OMP_NUM_THREADS=6
./mycode.exe


## Multiple Parallel Jobs While Sharing Nodes¶

Under certain scenarios, you might want two or more independent applications running simultaneously on each compute node allocated to your job. For example, a pair of applications that interact in a client-server fashion via some IPC mechanism on-node (e.g. shared memory), but must be launched in distinct MPI communicators.

This latter constraint would mean that MPMD mode (see below) is not an appropriate solution, since although MPMD can allow multiple executables to share compute nodes, the executables will also share an MPI_COMM_WORLD at launch.

Slurm can allow multiple executables launched with concurrent srun calls to share compute nodes as long as the sum of the resources assigned to each application does not exceed the node resources requested for the job. Importantly, you cannot over-allocate the CPU, memory, or "craynetwork" resource. While the former two are self-explanatory, the latter refers to limitations imposed on the number of applications per node that can simultaneously use the Aries interconnect, which is currently limited to 4.

Here is an example of an sbatch script that uses two compute nodes and runs two applications concurrently. One application uses 8 cores on each node, while the other uses 24 on each node. The number of tasks per node is controlled with the -n and -N flags and the amount of memory per node with the --mem flag. To specify the "craynetwork" resource, we use the --gres flag available in both sbatch and srun. The --overlap flag is needed to allow overlap on the assigned resources with other job steps.

Cori Haswell

#!/bin/bash

#SBATCH -q regular
#SBATCH -N 2
#SBATCH -t 12:00:00
#SBATCH --gres=craynetwork:2
#SBATCH -L SCRATCH
#SBATCH -C haswell

srun -N 2 -n 16 -c 2 --mem=51200 --gres=craynetwork:1 --overlap ./exec_a &
srun -N 2 -n 48 -c 2 --mem=61440 --gres=craynetwork:1 --overlap ./exec_b &
wait


This example is quite similar to the multiple srun jobs shown for running simultaneous parallel jobs, with the following exceptions:

1. For our sbatch job, we have requested --gres=craynetwork:2 which will allow us to run up to two applications simultaneously per compute node.

2. In our srun calls, we have explicitly defined the maximum amount of memory available to each application per node with --mem (in this example 50 and 60 GB, respectively) such that the sum is less than the resource limit per node (roughly 122 GB).

3. In our srun calls, we have also explicitly used one of the two requested craynetwork resources per call.

4. In our srun calls, we need to use the --overlap flag to allow multiple sruns to share resources on the same nodes with other job steps.

Using this combination of resource requests, we are able to run multiple parallel applications per compute node.
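
As a quick sanity check on the numbers above, the per-node totals of the two srun steps can be added up before submitting. The following is a minimal sketch, not a NERSC tool: the function name `fits_on_node` and the limits are assumptions made for illustration (64 logical CPUs per Haswell node, the "roughly 122 GB" memory limit taken as 122 x 1024 MB, and the 2 craynetwork resources requested by the job).

```shell
#!/bin/bash
# Assumed per-node limits for this example (not queried from Slurm):
NODE_CPUS=64                   # 32 Haswell cores x 2 hyperthreads
NODE_MEM_MB=$((122 * 1024))    # "roughly 122 GB"
NODE_CRAYNETWORK=2             # from "#SBATCH --gres=craynetwork:2"

# fits_on_node CPUS_A MEM_A GRES_A CPUS_B MEM_B GRES_B
# Succeeds if the summed per-node requests of two job steps stay within the limits.
fits_on_node() {
    [ $(( $1 + $4 )) -le "$NODE_CPUS" ] &&
    [ $(( $2 + $5 )) -le "$NODE_MEM_MB" ] &&
    [ $(( $3 + $6 )) -le "$NODE_CRAYNETWORK" ]
}

# exec_a: 8 tasks per node x 2 cpus; exec_b: 24 tasks per node x 2 cpus
if fits_on_node $((8 * 2)) 51200 1 $((24 * 2)) 61440 1; then
    echo "requests fit on one node"
else
    echo "over-allocated: reduce -c, --mem, or --gres=craynetwork" >&2
fi
```

With the values from the example script, the CPU total is exactly 64 and the memory total is 112640 MB, so the check passes.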

Note

It is permitted to specify srun --gres=craynetwork:0 which will not count against the craynetwork resource. This is useful when, for example, launching a bash script or other application that does not use the interconnect. We don't currently anticipate this being a common use case, but if your application(s) do employ this mode of operation it would be appreciated if you let us know.

Tip

Workflow tools are another option to help you run multiple parallel jobs while sharing nodes.

## Compile¶

The compile QOS is intended for compiling codes and should be used in workflows that regularly build from source code. These jobs are submitted to a special queue that can be queried either by loading the esslurm module or by passing -M escori on the command line.

Note

All compile queue jobs run on a single Haswell node, so please be mindful of the resources requested for a job.

Example

A sample compile job.

#!/bin/bash
#SBATCH --clusters=escori
#SBATCH --qos=compile
#SBATCH --job-name=my_compile_job
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --mem=8GB

make -j 4


## Perlmutter GPUs¶

Do not include the account option (-A or --account) for job submission on Perlmutter

For now, do not include the -A (--account) option, which specifies the project account to charge for the job, when submitting jobs. If it is included, the job submission will fail.

--gpus-per-task does not enforce GPU affinity or binding

Despite what its name suggests, --gpus-per-task in the examples below only counts the number of GPUs to allocate to the job; it does not enforce any binding or affinity of GPUs to CPUs or tasks.
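
One way to see what each task has actually been given is to have every task print the GPU list that is present in its environment. The snippet below is a minimal sketch under assumptions: `report_gpus` is a hypothetical helper name, and it relies on `SLURM_PROCID` (the rank that srun sets for each task) and `CUDA_VISIBLE_DEVICES` being exported to the task. What the latter contains depends on how GPU allocation is configured on the system.

```shell
#!/bin/bash
# report_gpus: print this task's rank and the GPU list found in its environment.
# Falls back to placeholders when run outside a Slurm job step.
report_gpus() {
    echo "Rank ${SLURM_PROCID:-?}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
}

report_gpus
```

Saved as an executable script (for example `report_gpus.sh`, a placeholder name), it could be launched with srun in place of the application to inspect a given set of options.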

### 1 node, 1 task, 1 GPU¶

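#!/bin/bash
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 1
#SBATCH -c 128
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
# Placeholder executable name: any MPI program that reports the GPUs visible to each rank
srun ./gpus_for_tasks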


Output:

Rank 0 out of 1 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:C1:00.0


### 1 node, 4 tasks, 4 GPUs, all GPUs visible to all tasks¶

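#!/bin/bash
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 4
#SBATCH -c 32
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
# Placeholder executable name: any MPI program that reports the GPUs visible to each rank
srun ./gpus_for_tasks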


Output:

Rank 0 out of 4 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:02:00.0
1 for rank 0: 0000:41:00.0
2 for rank 0: 0000:81:00.0
3 for rank 0: 0000:C1:00.0
Rank 2 out of 4 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 2: 0000:02:00.0
1 for rank 2: 0000:41:00.0
2 for rank 2: 0000:81:00.0
3 for rank 2: 0000:C1:00.0
Rank 3 out of 4 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 3: 0000:02:00.0
1 for rank 3: 0000:41:00.0
2 for rank 3: 0000:81:00.0
3 for rank 3: 0000:C1:00.0
Rank 1 out of 4 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 1: 0000:02:00.0
1 for rank 1: 0000:41:00.0
2 for rank 1: 0000:81:00.0
3 for rank 1: 0000:C1:00.0


### 1 node, 4 tasks, 4 GPUs, 1 GPU visible to each task¶

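#!/bin/bash
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 4
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=map_gpu:0,1,2,3

export SLURM_CPU_BIND="cores"
# Placeholder executable name: any MPI program that reports the GPUs visible to each rank
srun ./gpus_for_tasks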


Output:

Rank 0 out of 4 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:02:00.0
Rank 3 out of 4 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 3: 0000:C1:00.0
Rank 2 out of 4 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 2: 0000:81:00.0
Rank 1 out of 4 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 1: 0000:41:00.0


### 4 nodes, 16 tasks, 16 GPUs, all GPUs visible to all tasks¶

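#!/bin/bash
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 16
#SBATCH -c 32
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
# Placeholder executable name: any MPI program that reports the GPUs visible to each rank
srun ./gpus_for_tasks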


Output:

Rank 13 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 13: 0000:02:00.0
1 for rank 13: 0000:41:00.0
2 for rank 13: 0000:81:00.0
3 for rank 13: 0000:C1:00.0
Rank 3 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 3: 0000:02:00.0
1 for rank 3: 0000:41:00.0
2 for rank 3: 0000:81:00.0
3 for rank 3: 0000:C1:00.0
Rank 11 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 11: 0000:02:00.0
1 for rank 11: 0000:41:00.0
2 for rank 11: 0000:81:00.0
3 for rank 11: 0000:C1:00.0
Rank 5 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 5: 0000:02:00.0
1 for rank 5: 0000:41:00.0
2 for rank 5: 0000:81:00.0
3 for rank 5: 0000:C1:00.0
Rank 15 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 15: 0000:02:00.0
1 for rank 15: 0000:41:00.0
2 for rank 15: 0000:81:00.0
3 for rank 15: 0000:C1:00.0
Rank 14 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 14: 0000:02:00.0
1 for rank 14: 0000:41:00.0
2 for rank 14: 0000:81:00.0
3 for rank 14: 0000:C1:00.0
Rank 12 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 12: 0000:02:00.0
1 for rank 12: 0000:41:00.0
2 for rank 12: 0000:81:00.0
3 for rank 12: 0000:C1:00.0
Rank 9 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 9: 0000:02:00.0
1 for rank 9: 0000:41:00.0
2 for rank 9: 0000:81:00.0
3 for rank 9: 0000:C1:00.0
Rank 10 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 10: 0000:02:00.0
1 for rank 10: 0000:41:00.0
2 for rank 10: 0000:81:00.0
3 for rank 10: 0000:C1:00.0
Rank 8 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 8: 0000:02:00.0
1 for rank 8: 0000:41:00.0
2 for rank 8: 0000:81:00.0
3 for rank 8: 0000:C1:00.0
Rank 1 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 1: 0000:02:00.0
1 for rank 1: 0000:41:00.0
2 for rank 1: 0000:81:00.0
3 for rank 1: 0000:C1:00.0
Rank 2 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 2: 0000:02:00.0
1 for rank 2: 0000:41:00.0
2 for rank 2: 0000:81:00.0
3 for rank 2: 0000:C1:00.0
Rank 0 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:02:00.0
1 for rank 0: 0000:41:00.0
2 for rank 0: 0000:81:00.0
3 for rank 0: 0000:C1:00.0
Rank 6 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 6: 0000:02:00.0
1 for rank 6: 0000:41:00.0
2 for rank 6: 0000:81:00.0
3 for rank 6: 0000:C1:00.0
Rank 7 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 7: 0000:02:00.0
1 for rank 7: 0000:41:00.0
2 for rank 7: 0000:81:00.0
3 for rank 7: 0000:C1:00.0
Rank 4 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 4: 0000:02:00.0
1 for rank 4: 0000:41:00.0
2 for rank 4: 0000:81:00.0
3 for rank 4: 0000:C1:00.0


### 4 nodes, 16 tasks, 16 GPUs, 1 GPU visible to each task¶

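#!/bin/bash
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 16
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=map_gpu:0,1,2,3

export SLURM_CPU_BIND="cores"
# Placeholder executable name: any MPI program that reports the GPUs visible to each rank
srun ./gpus_for_tasks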


Output:

Rank 6 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 6: 0000:81:00.0
Rank 13 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 13: 0000:41:00.0
Rank 1 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 1: 0000:41:00.0
Rank 10 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 10: 0000:81:00.0
Rank 15 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 15: 0000:C1:00.0
Rank 9 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 9: 0000:41:00.0
Rank 7 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 7: 0000:C1:00.0
Rank 14 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 14: 0000:81:00.0
Rank 11 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 11: 0000:C1:00.0
Rank 5 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 5: 0000:41:00.0
Rank 12 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 12: 0000:02:00.0
Rank 8 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 8: 0000:02:00.0
Rank 4 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 4: 0000:02:00.0
Rank 2 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 2: 0000:81:00.0
Rank 3 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 3: 0000:C1:00.0
Rank 0 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:02:00.0
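
The pattern in the output above (for example, rank 6 reporting 0000:81:00.0 and rank 13 reporting 0000:41:00.0) is consistent with each task receiving the entry of the map_gpu list that corresponds to its position on the node. A minimal sketch of that arithmetic follows. It assumes ranks are placed in blocks of 4 consecutive ranks per node and reuses the bus IDs from the outputs above; `gpu_for_rank` is a hypothetical helper, not part of Slurm.

```shell
#!/bin/bash
# gpu_for_rank RANK [TASKS_PER_NODE]
# Prints the map_gpu index and PCI bus ID expected for a global rank,
# assuming block placement (ranks 0-3 on node 0, 4-7 on node 1, ...)
# and --gpu-bind=map_gpu:0,1,2,3.
gpu_for_rank() {
    local rank=$1 per_node=${2:-4}
    local idx=$(( rank % per_node ))   # node-local task id
    local bus
    case $idx in
        0) bus=0000:02:00.0 ;;
        1) bus=0000:41:00.0 ;;
        2) bus=0000:81:00.0 ;;
        3) bus=0000:C1:00.0 ;;
        *) bus=unknown ;;
    esac
    echo "rank $rank -> GPU $idx ($bus)"
}

for r in 0 6 13 15; do
    gpu_for_rank "$r"
done
```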


### Single-GPU tasks in parallel¶

Users who have many independent single-GPU tasks may wish to pack these into one job which runs the tasks in parallel on different GPUs. There are multiple ways to accomplish this; here we present one example.

#### srun¶

The Slurm srun command can be used to launch individual tasks, each allocated some amount of resources requested by the job script. An example of this is:

#!/bin/bash
#SBATCH -C gpu
#SBATCH -G 2
#SBATCH -N 1
#SBATCH -t 5

srun --gres=craynetwork:0 -n 1 -G 1 ./a.out &
srun --gres=craynetwork:0 -n 1 -G 1 ./b.out &
wait


Each srun invocation requests one task and one GPU, and requesting zero craynetwork resources per task is required to allow the tasks to run in parallel. The & at the end of each line puts the tasks in the background, and the final wait command is needed to allow all of the tasks to run to completion.

Do not use srun for large numbers of tasks

This approach is feasible for relatively small numbers (i.e., tens) of tasks but should not be used for hundreds or thousands of tasks. To run larger numbers of tasks, GNU parallel is preferred; it will be provided on Perlmutter soon.
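
For the small task counts where the srun approach is appropriate, the repeated srun lines can be written as a loop. The following is a minimal sketch: `run_packed`, `dry_srun`, and the `LAUNCHER` variable are hypothetical names introduced here so that the loop can be previewed without Slurm. Inside a real job the default launcher is srun, with the same options as in the example script.

```shell
#!/bin/bash
# dry_srun: stand-in launcher that prints the command instead of running it.
dry_srun() {
    echo "srun $*"
}

# run_packed EXE...
# Starts one single-task, single-GPU job step per executable in the
# background, then waits for all of them to finish.
run_packed() {
    local launcher=${LAUNCHER:-srun}
    local exe
    for exe in "$@"; do
        "$launcher" --gres=craynetwork:0 -n 1 -G 1 "$exe" &
    done
    wait
}

# Preview the job steps; drop LAUNCHER=dry_srun inside a batch script.
LAUNCHER=dry_srun run_packed ./a.out ./b.out
```

Because the steps start in the background, the order of the printed lines may vary. The job script still has to request at least as many GPUs (with -G) as there are executables to run at once.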

## Projects that have exhausted their allocation¶

A project with a zero or negative NERSC-hours balance can submit to the overrun queue.

If you meet the overrun criteria, you can access the overrun queue by submitting with -q overrun (-q shared_overrun for the shared queue). In addition, you must specify a minimum running time for this job of 4 hours or less with the --time-min flag. Jobs submitted without these flags will be automatically rejected by the batch system.

Tip

We recommend that you implement checkpoint/restart in your overrun jobs to save your progress.

Example

A job requesting a minimum time of 1.5 hours:

sbatch -q overrun --time-min=01:30:00 my_batch_script.sl
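
Since jobs whose --time-min exceeds 4 hours are rejected, a submission wrapper could check the value first. The following is a minimal sketch, not a NERSC-provided tool: `hms_to_seconds` and `time_min_ok` are hypothetical helper names, and only the H:MM:SS form of Slurm time strings is handled.

```shell
#!/bin/bash
# hms_to_seconds HH:MM:SS
# Converts an hours:minutes:seconds string to seconds.
# Other Slurm time formats (e.g. "90" or "1-00:00:00") are not handled.
hms_to_seconds() {
    local h m s rest
    h=${1%%:*}; rest=${1#*:}
    m=${rest%%:*}; s=${rest#*:}
    # Strip one leading zero so "08" and "09" are not parsed as octal.
    h=${h#0}; m=${m#0}; s=${s#0}
    echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
}

# time_min_ok HH:MM:SS
# Succeeds if the value is within the 4-hour limit for overrun jobs.
time_min_ok() {
    [ "$(hms_to_seconds "$1")" -le $(( 4 * 3600 )) ]
}

# Print the submission command only if the requested minimum time is allowed.
if time_min_ok 01:30:00; then
    echo "sbatch -q overrun --time-min=01:30:00 my_batch_script.sl"
fi
```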