
Example job scripts

Basic MPI batch script

One MPI process per physical core.

Edison
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=24

srun check-mpi.intel.edison
Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=32
#SBATCH --constraint=haswell

srun check-mpi.intel.cori
Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=68
#SBATCH --constraint=knl

srun check-mpi.intel.cori

Hybrid MPI+OpenMP jobs

One MPI process per socket and 1 OpenMP thread per physical core

Warning

In Slurm each hyperthread is considered a "cpu", so the --cpus-per-task option must be adjusted accordingly. Generally, best performance is obtained with 1 OpenMP thread per physical core.

Edison
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=24

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=12

srun check-hybrid.intel.edison
Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=32
#SBATCH --constraint=haswell

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=16

srun check-hybrid.intel.cori
Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=272
#SBATCH --constraint=knl

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=68

srun check-hybrid.intel.cori

Interactive

Interactive jobs are launched with the salloc command.

Tip

Cori has dedicated nodes for interactive work.

Edison
edison$ salloc --qos=debug --time=30 --nodes=2
Cori Haswell
cori$ salloc --qos=interactive -C haswell --time=60 --nodes=2
Cori KNL
cori$ salloc --qos=interactive -C knl --time=60 --nodes=2

Note

Additional details on Cori's interactive QOS

Multiple Parallel Jobs Sequentially

Multiple sruns can be executed one after another in a single batch script. Be sure to specify the total walltime needed to run all jobs.

Cori Haswell
#!/bin/bash -l

#SBATCH --qos=debug
#SBATCH --nodes=4
#SBATCH --time=10:00
#SBATCH --licenses=project,cscratch1
#SBATCH --constraint=haswell

srun -n 128 -c 2 --cpu_bind=cores ./a.out   
srun -n 64 -c 4 --cpu_bind=cores ./b.out 
srun -n 32 -c 8 --cpu_bind=cores ./c.out

Multiple Parallel Jobs Simultaneously

Multiple sruns can be executed simultaneously in a single batch script.

Tip

Be sure to specify the total number of nodes needed to run all jobs at the same time.

Note

By default, multiple concurrent srun executions cannot share compute nodes under Slurm in the non-shared QOSs.

In the following example, a total of 192 cores are required, which would hypothetically fit on 192 / 32 = 6 Haswell nodes. However, because sruns cannot share nodes by default, we instead have to dedicate:

  • 2 nodes to the first execution (44 cores)
  • 4 to the second (108 cores)
  • 2 to the third (40 cores)

For all three executables the nodes are not fully packed and the number of MPI tasks per node is not a divisor of 64, so both the -c and --cpu-bind flags are used in the srun commands.

Note

The "&" at the end of each srun command and the wait command at the end of the script are very important to ensure the jobs are run in parallel and the batch job will not exit before all the simultaneous sruns are completed.

Cori Haswell
#!/bin/bash -l

#SBATCH --qos=debug
#SBATCH --nodes=8
#SBATCH --time=30:00
#SBATCH --licenses=cscratch1
#SBATCH --constraint=haswell

srun -N 2 -n 44 -c 2 --cpu_bind=cores ./a.out &
srun -N 4 -n 108 -c 2 --cpu_bind=cores ./b.out &
srun -N 2 -n 40 -c 2 --cpu_bind=cores ./c.out &
wait

Job Arrays

Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.

This example submits 3 jobs. Each job uses 1 node and has the same time limit and QOS. The SLURM_ARRAY_TASK_ID environment variable is set to the array index value.

Cori KNL

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --constraint=knl
#SBATCH --time=2
#SBATCH --array=0-2

echo $SLURM_ARRAY_TASK_ID
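
For example, the array index can be used to select per-task inputs. A minimal sketch (the executable name and input-file naming are hypothetical) that could replace the echo line above:

srun ./my_app input_${SLURM_ARRAY_TASK_ID}.dat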

Additional examples and details

Dependencies

Job dependencies can be used to construct complex pipelines or chain together long simulations requiring multiple steps.

Note

The --parsable option to sbatch can simplify working with job dependencies.

Example

jobid=$(sbatch --parsable first_job.sh)
sbatch --dependency=afterok:$jobid second_job.sh

Example

jobid1=$(sbatch --parsable first_job.sh)
jobid2=$(sbatch --parsable --dependency=afterok:$jobid1 second_job.sh)
jobid3=$(sbatch --parsable --dependency=afterok:$jobid1 third_job.sh)
sbatch --dependency=afterok:$jobid2,afterok:$jobid3 last_job.sh

Shared

Unlike other QOSs, in the shared QOS a single node can be shared by multiple users or jobs. Jobs in the shared QOS are charged for each physical core allocated to the job.

The number of physical cores allocated to a job by Slurm is controlled by three parameters:

  • -n (--ntasks)
  • -c (--cpus-per-task)
  • --mem - Total memory available to the job (MemoryRequested)

Note

In Slurm a "cpu" corresponds to a hyperthread. So there are 2 cpus per physical core.

The memory on a node is divided evenly among the "cpus" (or hyperthreads):

System MemoryPerCpu (megabytes)
Edison 1300
Cori 1952

The number of physical cores used by a job is computed by

\text{physical cores} = \Bigl\lceil \frac{1}{2} \text{max} \left( \Bigl\lceil \frac{\mathrm{MemoryRequested}}{\mathrm{MemoryPerCpu}} \Bigr\rceil, \mathrm{ntasks} * \mathrm{CpusPerTask} \right) \Bigr\rceil
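
As a quick check of this formula, consider the Cori-Haswell MPI example below (--ntasks=2, --cpus-per-task=2), assuming the memory request is small enough that the memory term does not dominate:

\text{physical cores} = \Bigl\lceil \frac{1}{2} \text{max} \left( \ldots,\ 2 \times 2 \right) \Bigr\rceil = \lceil 2 \rceil = 2

This matches the 2 physical cores (4 hyperthreads) quoted for that example.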

Cori-Haswell MPI

A two rank MPI job which utilizes 2 physical cores (and 4 hyperthreads) of a Haswell node.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2

srun --cpu-bind=cores ./a.out
Cori-Haswell MPI/OpenMP

A two rank MPI job which utilizes 4 physical cores (and 8 hyperthreads) of a Haswell node.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=2
srun --cpu-bind=cores ./a.out
Cori-Haswell OpenMP

An OpenMP only code which utilizes 6 physical cores.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
export OMP_NUM_THREADS=6
./my_openmp_code.exe
Cori-Haswell serial

A serial job should start by requesting a single slot, and increase the amount of memory requested only as needed, in order to maximize throughput and minimize charge and wait time.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --ntasks=1
#SBATCH --mem=1GB

./serial.exe

Using Intel MPI

Applications built with Intel MPI can be launched via srun in a Slurm batch script on Cori compute nodes. The impi module needs to be loaded, and the application should be built using the mpiicc (for C codes), mpiifort (for Fortran codes), or mpiicpc (for C++ codes) commands. Below is a sample compile and run script.

Cori Haswell
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --time=03:00:00
#SBATCH --nodes=8
#SBATCH --constraint=haswell

module load impi
mpiicc -qopenmp -o mycode.exe mycode.c

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

srun -n 32 -c 16 --cpu-bind=cores ./mycode.exe

Using Open MPI

Applications built with Open MPI can be launched via srun or Open MPI's mpirun command. The openmpi module needs to be loaded to build an application against Open MPI. Typically one builds the application using the mpicc (for C codes), mpifort (for Fortran codes), or mpiCC (for C++ codes) commands. Alternatively, Open MPI supports the use of pkg-config to obtain the include and library paths. For example, pkg-config --cflags --libs ompi-c returns the flags that must be passed to the backend C compiler (e.g. gcc, gfortran, icc, ifort) to build against Open MPI.

Open MPI also supports Java MPI bindings. Use mpijavac to compile Java codes that use the Java MPI bindings. For Java MPI, it is highly recommended to launch jobs using Open MPI's mpirun command. Note that the Open MPI packages at NERSC do not support static linking.
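
For example, a minimal compile sketch using the pkg-config route (assuming gcc as the backend compiler and that the openmpi module makes the ompi-c package visible to pkg-config):

module load openmpi
gcc -o ring ring_c.c $(pkg-config --cflags --libs ompi-c)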

See Open MPI for more information about using Open MPI on NERSC systems.

Cori Haswell Open MPI
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=32
#SBATCH --constraint=haswell

module load openmpi

/bin/cat <<EOM > ring_c.c
/*
 * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2006      Cisco Systems, Inc.  All rights reserved.
 *
 * Simple ring test program in C.
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message, tag = 201;

    /* Start up MPI */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Calculate the rank of the next process in the ring.  Use the
       modulus operator so that the last process "wraps around" to
       rank zero. */

    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    /* If we are the "master" process (i.e., MPI_COMM_WORLD rank 0),
       put the number of times to go around the ring in the
       message. */

    if (0 == rank) {
        message = 10;

        printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
               message, next, tag, size);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        printf("Process 0 sent to %d\n", next);
    }

    /* Pass the message around the ring.  The exit mechanism works as
       follows: the message (a positive integer) is passed around the
       ring.  Each time it passes rank 0, it is decremented.  When
       each processes receives a message containing a 0 value, it
       passes the message on to the next process and then quits.  By
       passing the 0 message first, every process gets the 0 message
       and can quit normally. */

    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        if (0 == rank) {
            --message;
            printf("Process 0 decremented value: %d\n", message);
        }

        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        if (0 == message) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    /* The last process does one extra send to process 0, which needs
       to be received before the program can exit */

    if (0 == rank) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* All done */

    MPI_Finalize();
    return 0;
}
EOM

mpicc -o ring ring_c.c
mpirun ./ring
#
# run again with srun
#
srun ./ring
Cori KNL Open MPI
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=68
#SBATCH --constraint=knl

module load openmpi

/bin/cat <<EOM > ring_c.c
/*
 * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2006      Cisco Systems, Inc.  All rights reserved.
 *
 * Simple ring test program in C.
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message, tag = 201;

    /* Start up MPI */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Calculate the rank of the next process in the ring.  Use the
       modulus operator so that the last process "wraps around" to
       rank zero. */

    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    /* If we are the "master" process (i.e., MPI_COMM_WORLD rank 0),
       put the number of times to go around the ring in the
       message. */

    if (0 == rank) {
        message = 10;

        printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
               message, next, tag, size);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        printf("Process 0 sent to %d\n", next);
    }

    /* Pass the message around the ring.  The exit mechanism works as
       follows: the message (a positive integer) is passed around the
       ring.  Each time it passes rank 0, it is decremented.  When
       each processes receives a message containing a 0 value, it
       passes the message on to the next process and then quits.  By
       passing the 0 message first, every process gets the 0 message
       and can quit normally. */

    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        if (0 == rank) {
            --message;
            printf("Process 0 decremented value: %d\n", message);
        }

        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        if (0 == message) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    /* The last process does one extra send to process 0, which needs
       to be received before the program can exit */

    if (0 == rank) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* All done */

    MPI_Finalize();
    return 0;
}
EOM

mpicc -o ring ring_c.c
mpirun ./ring
#
# run again with srun
#
srun ./ring

Xfer queue

The intended use of the xfer queue is to transfer data between Cori or Edison and HPSS. Xfer jobs run on one of the login nodes and are free of charge. If you want to transfer data to the HPSS archive system at the end of a regular job, you can submit an xfer job at the end of your batch job script via module load esslurm; sbatch hsi put <my_files> (be sure to load the esslurm module first, or the job will end up in the regular queue), so that you are not charged for the duration of the data transfer. Xfer jobs can be monitored via module load esslurm; squeue. The number of running jobs for each user is limited to the number of concurrent HPSS sessions (15).
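
For example, a minimal sketch of the tail end of a regular batch script that hands the archiving off to the xfer queue (archive_to_hpss.sh is a hypothetical xfer job script like the one shown below):

# ... compute steps above ...
module load esslurm
sbatch archive_to_hpss.sh   # the archiving runs in the xfer queue and is not charged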

Warning

Do not run computational jobs in the xfer queue.

Xfer transfer job
#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=my_transfer
#SBATCH --licenses=SCRATCH

#Archive run01 to HPSS
htar -cvf run01.tar run01

#Submit job with
#module load esslurm
#sbatch <job_script>

Xfer jobs specifying -N nodes will be rejected at submission time. When submitting an xfer job from Cori, the -C haswell flag is not needed since the job does not run on compute nodes. By default, xfer jobs are allocated 2 GB of memory. The memory footprint scales somewhat with the size of the files being archived, so if you are archiving larger files you will need to request more memory. You can do this by adding #SBATCH --mem=XGB to the above script (where X in the range of 5 - 10 GB is a good starting point for large files).

To monitor your xfer jobs, load the esslurm module; you can then use Slurm commands such as squeue or scontrol to access the xfer queue on Cori or Edison.

Variable-time jobs

Variable-time jobs are for users who wish to get better queue turnaround and/or need to run long-running jobs, including jobs longer than 48 hours, the maximum wall-clock time allowed on Cori and Edison.

Variable-time jobs are jobs submitted with a minimum time, #SBATCH --time-min, in addition to the maximum time (#SBATCH --time). Jobs specifying a minimum time can start execution earlier than they otherwise would, with a time limit anywhere between the minimum and maximum time requests. Pre-terminated jobs can be requeued (using the scontrol requeue command) to resume from where the previous executions left off, until the cumulative execution time reaches the desired time limit or the job completes.

Note

To use variable-time jobs, applications are required to be able to checkpoint and restart by themselves.

Annotated example

Here is a sample job script for variable-time jobs, which automates the process of executing, pre-terminating, requeuing and restarting the job repeatedly until it runs for the desired amount of time or the job completes.

Edison
#!/bin/bash
#SBATCH -J vtj
#SBATCH -q regular
#SBATCH -N 2
#SBATCH --comment=96:00:00
#SBATCH --time-min=2:00:00 #the minimum amount of time the job should run
#SBATCH --time=48:00:00
#SBATCH --error=vtj-%j.err
#SBATCH --output=vtj-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append

# use the following three variables to specify the time limit per job (max_timelimit), 
# the amount of time (in seconds) needed for checkpointing, 
# and the command to use to do the checkpointing if any (leave blank if none)
max_timelimit=48:00:00   # can match the #SBATCH --time option but don't have to
ckpt_overhead=60         # should match the time in the #SBATCH --signal option
ckpt_command=

# requeueing the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#

# user setting goes here

# srun must execute in the background and catch the signal USR1 on the wait command
srun -n48 -c2 --cpu_bind=cores ./a.out &

wait

Cori Haswell

#!/bin/bash
#SBATCH -J vtj 
#SBATCH -q regular
#SBATCH -C haswell
#SBATCH -N 2
#SBATCH --comment=96:00:00
#SBATCH --time-min=2:00:00 #the minimum amount of time the job should run
#SBATCH --time=48:00:00
#SBATCH --error=vtj-%j.err
#SBATCH --output=vtj-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append

# use the following three variables to specify the time limit per job (max_timelimit), 
# the amount of time (in seconds) needed for checkpointing, 
# and the command to use to do the checkpointing if any (leave blank if none)
max_timelimit=48:00:00   # can match the #SBATCH --time option but don't have to
ckpt_overhead=60         # should match the time in the #SBATCH --signal option 
ckpt_command=

# requeueing the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#

# user setting goes here

# srun must execute in the background and catch the signal USR1 on the wait command
srun -n64 -c2 --cpu_bind=cores ./a.out &

wait
Cori KNL
#!/bin/bash
#SBATCH -J vtj
#SBATCH -q regular
#SBATCH -C knl 
#SBATCH -N 2
#SBATCH --comment=96:00:00
#SBATCH --time-min=2:00:00 #the minimum amount of time the job should run
#SBATCH --time=48:00:00
#SBATCH --error=vtj-%j.err
#SBATCH --output=vtj-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append

# use the following three variables to specify the time limit per job (max_timelimit), 
# the amount of time (in seconds) needed for checkpointing, 
# and the command to use to do the checkpointing if any (leave blank if none)
max_timelimit=48:00:00   # can match the #SBATCH --time option but don't have to
ckpt_overhead=60         # should match the time in the #SBATCH --signal option
ckpt_command=

# requeueing the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#

# user setting goes here
export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=8

#srun must execute in the background and catch the signal USR1 on the wait command
srun -n32 -c16 --cpu_bind=cores ./a.out &

wait

In the above example, the --comment option is used to enter the user's desired maximum wall-clock time (96 hours in this example), which could be longer than the maximum time limit allowed by the batch system (48 hours). In addition to the time limit (--time), the --time-min option is used to specify the minimum amount of time the job should run (2 hours).

The script setup.sh defines a few bash functions (e.g., requeue_job, func_trap) that are used to automate the process. The requeue_job func_trap USR1 command executes the func_trap function, which contains a list of actions to checkpoint and requeue the job, upon trapping the USR1 signal. Users may want to modify the scripts (make a copy first) as needed, although they should work for most applications as they are.

The job script works as follows:

  1. User submits the above job script.
  2. The batch system looks for a backfill opportunity for the job. If it can allocate the requested number of nodes for this job for any duration (e.g., 3 hours) between the specified minimum time (2 hours) and the time limit (48 hours) before those nodes are used for other higher priority jobs, the job starts execution.
  3. The job runs until it receives the USR1 signal (--signal=B:USR1@<sig_time>) sig_time seconds (60 in this example) before it hits the allocated time limit (3 hours in this case).
  4. Upon receiving the signal, the job checkpoints and requeues itself with the remaining max time limit before it gets terminated. The variable ckpt_overhead is used to specify the amount of time (in seconds) needed for checkpointing and requeuing the job. It should match the sig_time in the --signal option.
  5. The steps 2-4 repeat until the job runs for the desired amount of time (96 hours) or the job completes.

Note

  • If your application requires external triggers or commands to do checkpointing, you need to provide the checkpoint commands using the variable, ckpt_command. It could be a script containing several commands to be executed within the specified checkpoint overhead time (ckpt_overhead).
  • Additionally, if you need to change the job input files to resume the job, you can do so within the ckpt_command.
  • If your application does checkpointing periodically, like most of the molecular dynamics codes do, you don’t need the ckpt_command (just leave it blank).
  • You can send the USR1 signal outside the job script any time using the scancel -b -s USR1 <jobid> command to terminate the currently running job. The job still checkpoints and requeues itself before it gets terminated.
  • The srun command must execute in the background (note the & at the end of the srun command line and the wait command at the end of the job script) so that the USR1 signal is caught by the wait command rather than by srun; this allows srun to keep running a bit longer (up to sig_time seconds) to complete the checkpointing.

VASP example

VASP atomic relaxation jobs for Cori KNL

#!/bin/bash 
#SBATCH -J vt_vasp 
#SBATCH -q regular 
#SBATCH -C knl 
#SBATCH -N 2 
#SBATCH --time=48:0:00 
#SBATCH --error=vt_vasp%j.err 
#SBATCH --output=vt_vasp%j.out 
#SBATCH --mail-user=elvis@nersc.gov 
# 
#SBATCH --comment=96:00:00 
#SBATCH --time-min=02:0:00 
#SBATCH --signal=B:USR1@300 
#SBATCH --requeue 
#SBATCH --open-mode=append 
  
#user setting 
export OMP_PROC_BIND=true 
export OMP_PLACES=threads 
export OMP_NUM_THREADS=8 
  
#srun must execute in background and catch signal on wait command 
module load vasp/20171017-knl 
srun -n 8 -c32 --cpu_bind=cores vasp_std & 
  
# put any commands that need to run to continue the next job (fragment) here 
ckpt_vasp() { 
set -x 
restarts=`squeue -h -O restartcnt -j $SLURM_JOB_ID` 
echo checkpointing the ${restarts}-th job 
  
#to terminate VASP at the next ionic step 
echo LSTOP = .TRUE. > STOPCAR 

#wait until VASP completes the current ionic step, writes out the WAVECAR file and quits
srun_pid=`ps -fle|grep srun|head -1|awk '{print $4}'` 
echo srun pid is $srun_pid 
wait $srun_pid 
  
#copy CONTCAR to POSCAR 
cp -p CONTCAR POSCAR 
set +x 
} 
  
ckpt_command=ckpt_vasp 
max_timelimit=48:00:00 
ckpt_overhead=300 
  
# requeueing the job if remaining time >0 
. /usr/common/software/variable-time-job/setup.sh 
requeue_job func_trap USR1 
  
wait

MPMD (Multiple Program Multiple Data) jobs

Run a job with different programs and different arguments for each task. To run MPMD jobs under Slurm use --multi-prog <config_file_name>.

srun -n 8 --multi-prog myrun.conf

Configuration file format

  • Task rank

    One or more task ranks to use this configuration. Multiple values may be comma separated. Ranges may be indicated with two numbers separated with a '-' with the smaller number first (e.g. "0-4" and not "4-0"). To indicate all tasks not otherwise specified, specify a rank of '*' as the last line of the file. If an attempt is made to initiate a task for which no executable program is defined, the following error message will be produced "No executable program specified for this task".

  • Executable

    The name of the program to execute. May be fully qualified pathname if desired.

  • Arguments

    Program arguments. The expression "%t" will be replaced with the task's number. The expression "%o" will be replaced with the task's offset within this range (e.g. a configured task rank value of "1-5" would have offset values of "0-4"). Single quotes may be used to avoid having the enclosed values interpreted. This field is optional. Any arguments for the program entered on the command line will be added to the arguments specified in the configuration file.
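
For instance, a minimal configuration-file sketch (program names and arguments are hypothetical) illustrating the "%t" and "%o" substitutions; it would be launched with srun -n 5 --multi-prog:

0   ./server --id %t
1-4 ./worker --rank %t --offset %o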

Example

Sample job script for MPMD jobs. You need to create a configuration file in the format described above, and a batch script that passes this configuration file to srun via the --multi-prog flag.

Cori-Haswell

cori$ cat mpmd.conf
0-35 ./a.out
36-96 ./b.out

cori$ cat batch_script.sh
#!/bin/bash
#SBATCH -q regular
#SBATCH -N 5
#SBATCH -n 97  # total of 97 tasks
#SBATCH -t 02:00:00
#SBATCH -C haswell

srun --multi-prog ./mpmd.conf

Burst buffer

All examples for the burst buffer are shown with Cori Haswell nodes. Options related to the burst buffer do not depend on Haswell or KNL node choice.

Note

The burst buffer is only available on Cori.

Scratch

Use the burst buffer as a scratch space to store temporary data during the execution of I/O intensive codes. In this mode all data from the burst buffer allocation will be removed automatically at the end of the job.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=32
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch

srun check-mpi.intel.cori > ${DW_JOB_STRIPED}/output.txt
ls ${DW_JOB_STRIPED}
cat ${DW_JOB_STRIPED}/output.txt

Stage in/out

Copy the named file or directory into the Burst Buffer, which can then be accessed using $DW_JOB_STRIPED.

Note

  • Only files on the Cori $SCRATCH filesystem can be staged in
  • A full path to the file must be used
  • You must have permissions to access the file
  • The job start may be delayed until the transfer is complete
  • Stage out occurs after the job is completed so there is no charge
Stage in example
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/dwtest-file destination=$DW_JOB_STRIPED/dwtest-file type=file
srun ls ${DW_JOB_STRIPED}/dwtest-file

Stage out example
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_out source=$DW_JOB_STRIPED/output destination=/global/cscratch1/sd/username/output type=directory
mkdir $DW_JOB_STRIPED/output
srun check-mpi.intel.cori > ${DW_JOB_STRIPED}/output/output.txt

Persistent Reservations

Persistent reservations are useful when multiple jobs need access to the same files.

Warning

  • Reservations must be deleted when no longer in use.
  • There are no guarantees of data integrity over long periods of time.

Note

Each persistent reservation must have a unique name.

Create

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#BB create_persistent name=PRname capacity=100GB access_mode=striped type=scratch

Use

If multiple jobs will be using the reservation, take care not to overwrite data.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#DW persistentdw name=PRname

ls $DW_PERSISTENT_STRIPED_PRname/

Destroy

Any data on the reservation at the time the script executes will be removed.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#BB destroy_persistent name=PRname

Interactive

The burst buffer is available in interactive sessions. It is recommended to use a configuration file for the burst buffer directives:

cori$ cat bbf.conf
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/filename destination=$DW_JOB_STRIPED/filename type=file
cori$ salloc --qos=interactive -C haswell -t 00:30:00 --bbf=bbf.conf

Large Memory

There are two nodes on Cori with 750 GB of memory each that can be used for jobs requiring very high memory per node. Because there are only two such nodes, this resource is limited and should be used only for jobs that require high memory. To make these nodes useful to more users at once, they can be shared among users. If you need to run with multiple threads, you will need to request the whole node: add #SBATCH --exclusive and add the -c 32 flag to your srun call.

Cori Example

A sample bigmem job which needs only one core.

#!/bin/bash
#SBATCH --clusters=escori
#SBATCH --qos=bigmem
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --job-name=my_big_job
#SBATCH --licenses=SCRATCH
#SBATCH --mem=250GB

srun -n 1 ./my_big_executable
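
If the job instead needs multiple threads and therefore the whole node, a minimal sketch following the guidance above (the thread count is illustrative; adjust it to your application):

#!/bin/bash
#SBATCH --clusters=escori
#SBATCH --qos=bigmem
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --time=01:00:00
#SBATCH --job-name=my_big_threaded_job
#SBATCH --licenses=SCRATCH

export OMP_NUM_THREADS=16   # illustrative thread count
srun -n 1 -c 32 ./my_big_executable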

Realtime

The "realtime" QOS is used for running jobs with the need of getting realtime turnaround time.

Note

Use of this QOS requires special approval.

"realtime" QOS Request Form

The realtime QOS is a user-selective shared QOS, meaning you can request either exclusive node access (with the #SBATCH --exclusive flag) or allow multiple applications to share a node (with the #SBATCH --share flag).

Tip

It is recommended to allow node sharing so that more jobs can be scheduled on the allocated nodes. Sharing a node is the default setting, so using #SBATCH --share is optional.

Example

Uses two full nodes

#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job
#SBATCH --licenses=project
#SBATCH --exclusive

srun --cpu-bind=cores ./mycode.exe   # pure MPI, 64 MPI tasks

If you are requesting only a portion of a single node, please add --gres=craynetwork:0 as in the example below to allow more jobs on the node. As with the "shared" QOS, you can request a number of slots on the node (out of 64 CPUs, i.e. 64 slots) by specifying --ntasks and/or --mem. The rules are the same as for the shared QOS.

Example

Two MPI ranks, each running 4 OpenMP threads. The job uses a total of 8 physical cores (8 "cpus", i.e. hyperthreads, per task) and 10 GB of memory.

#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=craynetwork:0
#SBATCH --cpus-per-task=8
#SBATCH --mem=10GB
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job2
#SBATCH --licenses=project
#SBATCH --shared

export OMP_NUM_THREADS=4
srun --cpu-bind=cores ./mycode.exe

Example

OpenMP only code running with 6 threads. Note that srun is not required in this case.

#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=craynetwork:0
#SBATCH --cpus-per-task=12
#SBATCH --mem=16GB
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job3
#SBATCH --licenses=project,SCRATCH
#SBATCH --shared

export OMP_NUM_THREADS=6
./mycode.exe

Multiple Parallel Jobs While Sharing Nodes

Under certain scenarios, you might want two or more independent applications running simultaneously on each compute node allocated to your job. For example, a pair of applications that interact in a client-server fashion via some IPC mechanism on-node (e.g. shared memory), but must be launched in distinct MPI communicators.

This latter constraint means that MPMD mode (described above) is not an appropriate solution: although MPMD can allow multiple executables to share compute nodes, the executables will also share an MPI_COMM_WORLD at launch.

Slurm can allow multiple executables launched with concurrent srun calls to share compute nodes as long as the sum of the resources assigned to each application does not exceed the node resources requested for the job. Importantly, you cannot over-allocate the CPU, memory, or "craynetwork" resource. While the former two are self-explanatory, the latter refers to limitations imposed on the number of applications per node that can simultaneously use the Aries interconnect, which is currently limited to 4.

Here is a quick example of an sbatch script that uses two compute nodes and runs two applications concurrently. One application uses 8 cores on each node, while the other uses 24 on each node. The number of CPUs used on each node is controlled with the "-n" and "-N" flags (together with "-c"), while the amount of memory per node is set with the "--mem" flag. To specify the "craynetwork" resource, we use the "--gres" flag, which is available in both "sbatch" and "srun".

Cori Haswell

#!/bin/bash

#SBATCH -q regular
#SBATCH -N 2
#SBATCH -t 12:00:00
#SBATCH --gres=craynetwork:2
#SBATCH -L SCRATCH
#SBATCH -C haswell

srun -N 2 -n 16 -c 2 --mem=51200 --gres=craynetwork:1 ./exec_a &
srun -N 2 -n 48 -c 2 --mem=61440 --gres=craynetwork:1 ./exec_b &
wait 

This example is quite similar to the multiple srun jobs shown above for running simultaneous parallel jobs, with the following exceptions:

  1. For our sbatch job, we have requested "--gres=craynetwork:2" which will allow us to run up to two applications simultaneously per compute node.

  2. In our srun calls, we have explicitly defined the maximum amount of memory available to each application per node with "--mem" (in this example 50 and 60 GB, respectively) such that the sum is less than the resource limit per node (roughly 122 GB).

  3. In our srun calls, we have also explicitly used one of the two requested craynetwork resources per call.

Using this combination of resource requests, we are able to run multiple parallel applications per compute node.

One additional observation: when calling srun, it is permitted to specify "--gres=craynetwork:0" which will not count against the craynetwork resource. This is useful when, for example, launching a bash script or other application that does not use the interconnect. We don't currently anticipate this being a common use case, but if your application(s) do employ this mode of operation it would be appreciated if you let us know.
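
For example, a minimal sketch of such a call running alongside the compute sruns (the helper script name is hypothetical):

# a helper that does not use the Aries interconnect, so it does not count
# against the craynetwork resource
srun -N 1 -n 1 --mem=2048 --gres=craynetwork:0 ./postprocess_helper.sh &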

Additional information