# Example job scripts¶

For details of the terminology used on this page, please see our jobs overview. Correct affinity settings are essential for good performance.

## Basic MPI batch script¶

One MPI process per physical core.

Edison
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2

srun check-mpi.intel.edison

Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=haswell

srun check-mpi.intel.cori

Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=knl

srun check-mpi.intel.cori
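
If you prefer to make the task layout explicit rather than relying on the defaults, the sketch below shows the Cori Haswell case with the flags spelled out. It assumes 32 physical cores (64 hyperthreads) per Haswell node, so 2 nodes give 64 ranks, and -c 2 hands both hyperthreads of one physical core to each rank. The flags are illustrative, not the only valid choice.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=haswell

# 64 ranks = 2 nodes x 32 physical cores; -c 2 gives each rank one physical core
srun -n 64 -c 2 --cpu-bind=cores check-mpi.intel.cori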


## Hybrid MPI+OpenMP jobs¶

Warning

In Slurm each hyperthread is considered a "cpu", so the --cpus-per-task option must be adjusted accordingly. Generally the best performance is obtained with 1 OpenMP thread per physical core. See the affinity documentation for additional details about affinity settings.

### Example 1¶

One MPI process per socket and 1 OpenMP thread per physical core

Edison
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2

export OMP_PROC_BIND=true

srun check-hybrid.intel.edison

Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=haswell

export OMP_PROC_BIND=true

srun check-hybrid.intel.cori

Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=knl

export OMP_PROC_BIND=true

srun check-hybrid.intel.cori
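
For reference, here is a hedged sketch of the Cori Haswell case with the layout made explicit. It assumes a Haswell node has 2 sockets with 16 physical cores each, so one MPI process per socket means 2 tasks per node with 16 threads per task, and -c 32 because each physical core carries 2 hyperthreads.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=haswell

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=16

# one task per socket (2 per node), 32 "cpus" (hyperthreads) per task
srun --ntasks-per-node=2 -c 32 --cpu-bind=cores check-hybrid.intel.cori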


### Example 2¶

28 MPI processes with 8 OpenMP threads per process, with each OpenMP thread using one physical core

Note

The addition of --cpu-bind=cores is useful for getting correct affinity settings.

Cori Haswell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=7
#SBATCH --constraint=haswell

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true

srun -n 28 -c 16 --cpu-bind=cores check-hybrid.intel.cori

Cori KNL
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=4
#SBATCH --constraint=knl

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true

srun -n 28 -c 32 --cpu-bind=cores check-hybrid.intel.cori
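
The -c values above follow the warning at the top of this section: each task needs 8 physical cores, and Slurm counts hyperthreads as "cpus". A short worked example of the arithmetic (2 hyperthreads per physical core on Haswell, 4 on KNL):

# Haswell: 8 physical cores per task x 2 hyperthreads = 16 "cpus" -> -c 16
# KNL:     8 physical cores per task x 4 hyperthreads = 32 "cpus" -> -c 32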


## Interactive¶

Interactive jobs are launched with the salloc command.

Tip

Cori has dedicated nodes for interactive work.

Edison
edison$ salloc --qos=debug --time=30 --nodes=2

Cori Haswell
cori$ salloc --qos=interactive -C haswell --time=60 --nodes=2

Cori KNL
cori$ salloc --qos=interactive -C knl --time=60 --nodes=2
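
Once the allocation is granted, srun is used exactly as in a batch script. A minimal sketch for the Haswell session above (the executable name is a placeholder):

cori$ salloc --qos=interactive -C haswell --time=60 --nodes=2
cori$ srun -n 64 -c 2 --cpu-bind=cores ./my_executable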


## Dependencies¶

Job dependencies can be used to construct complex pipelines or chain together long simulations requiring multiple steps.

Note

The --parsable option to sbatch can simplify working with job dependencies.

Example

jobid=$(sbatch --parsable first_job.sh)
sbatch --dependency=afterok:$jobid second_job.sh
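
For longer pipelines the same pattern can be placed in a loop. A minimal sketch, with the script names and the chunk count as placeholders, that chains several dependent submissions of the same script:

jobid=$(sbatch --parsable first_job.sh)
for i in $(seq 1 4); do
    jobid=$(sbatch --parsable --dependency=afterok:$jobid next_chunk.sh)
done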


Example

jobid1=$(sbatch --parsable first_job.sh)
jobid2=$(sbatch --parsable --dependency=afterok:$jobid1 second_job.sh)
jobid3=$(sbatch --parsable --dependency=afterok:$jobid1 third_job.sh)
sbatch --dependency=afterok:$jobid2,afterok:$jobid3 last_job.sh


## Shared¶

Unlike other QOSs, in the shared QOS a single node can be shared by multiple users or jobs. Jobs in the shared QOS are charged for each physical core allocated to the job.

The number of physical cores allocated to a job by Slurm is controlled by three parameters:

• -n (--ntasks)

• -c (--cpus-per-task)

• --mem - Total memory available to the job (MemoryRequested)

Note

In Slurm a "cpu" corresponds to a hyperthread, so there are 2 cpus per physical core. The memory on a node is divided evenly among the "cpus" (or hyperthreads):

| System | MemoryPerCpu (megabytes) |
|--------|--------------------------|
| Edison | 1300                     |
| Cori   | 1952                     |

The number of physical cores used by a job is computed by

$$\text{physical cores} = \Bigl\lceil \frac{1}{2} \max\left( \Bigl\lceil \frac{\mathrm{MemoryRequested}}{\mathrm{MemoryPerCpu}} \Bigr\rceil,\ \mathrm{ntasks} \times \mathrm{CpusPerTask} \right) \Bigr\rceil$$

For example, a job requesting --ntasks=2 and --cpus-per-task=2 (without requesting additional memory) is allocated ⌈(2 × 2)/2⌉ = 2 physical cores.

Cori-Haswell MPI

A two-rank MPI job which utilizes 2 physical cores (and 4 hyperthreads) of a Haswell node.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2

srun --cpu-bind=cores ./a.out

Cori-Haswell MPI/OpenMP

A two-rank MPI job which utilizes 4 physical cores (and 8 hyperthreads) of a Haswell node.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=2

srun --cpu-bind=cores ./a.out

Cori-Haswell OpenMP

An OpenMP-only code which utilizes 6 physical cores.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12

export OMP_NUM_THREADS=6

./my_openmp_code.exe

Cori-Haswell serial

A serial job should start by requesting a single slot and increase the amount of memory requested only as needed, in order to maximize throughput and minimize charge and wait time.

#!/bin/bash
#SBATCH --qos=shared
#SBATCH --constraint=haswell
#SBATCH --time=5
#SBATCH --ntasks=1
#SBATCH --mem=1GB

./serial.exe


## Using Intel MPI¶

Applications built with Intel MPI can be launched via srun in the Slurm batch script on Cori compute nodes. The impi module needs to be loaded, and the application should be built using the mpiicc (for C codes), mpiifort (for Fortran codes), or mpiicpc (for C++ codes) commands. Below is a sample compile and run script.

Cori Haswell

#!/bin/bash
#SBATCH --qos=regular
#SBATCH --time=03:00:00
#SBATCH --nodes=8
#SBATCH --constraint=haswell

module load impi
mpiicc -qopenmp -o mycode.exe mycode.c

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

srun -n 32 -c 16 --cpu-bind=cores ./mycode.exe


## Using Open MPI¶

Applications built with Open MPI can be launched via srun or Open MPI's mpirun command. The openmpi module needs to be loaded to build an application against Open MPI. Typically one builds the application using the mpicc (for C codes), mpifort (for Fortran codes), or mpiCC (for C++ codes) commands. Alternatively, Open MPI supports the use of pkg-config to obtain the include and library paths; for example, pkg-config --cflags --libs ompi-c returns the flags that must be passed to the backend compiler (e.g. gcc, gfortran, icc, ifort) to build against Open MPI.

Open MPI also supports Java MPI bindings. Use mpijavac to compile Java codes that use the Java MPI bindings. For Java MPI it is highly recommended to launch jobs using Open MPI's mpirun command.

Note that the Open MPI packages at NERSC do not support static linking. See Open MPI for more information about using Open MPI on NERSC systems.

Cori Haswell Open MPI

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=32
#SBATCH --constraint=haswell

module load openmpi

/bin/cat <<EOM > ring_c.c
/*
 * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
 *                         University Research and Technology Corporation.
 *                         All rights reserved.
 * Copyright (c) 2006      Cisco Systems, Inc.  All rights reserved.
 *
 * Simple ring test program in C.
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message, tag = 201;

    /* Start up MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Calculate the rank of the next process in the ring.  Use the
       modulus operator so that the last process "wraps around" to
       rank zero. */
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    /* If we are the "master" process (i.e., MPI_COMM_WORLD rank 0),
       put the number of times to go around the ring in the message. */
    if (0 == rank) {
        message = 10;

        printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
               message, next, tag, size);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        printf("Process 0 sent to %d\n", next);
    }

    /* Pass the message around the ring.  The exit mechanism works as
       follows: the message (a positive integer) is passed around the
       ring.  Each time it passes rank 0, it is decremented.  When each
       process receives a message containing a 0 value, it passes the
       message on to the next process and then quits.  By passing the 0
       message first, every process gets the 0 message and can quit
       normally. */
    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        if (0 == rank) {
            --message;
            printf("Process 0 decremented value: %d\n", message);
        }

        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        if (0 == message) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    /* The last process does one extra send to process 0, which needs
       to be received before the program can exit */
    if (0 == rank) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* All done */
    MPI_Finalize();
    return 0;
}
EOM

mpicc -o ring_c ring_c.c

mpirun ./ring_c
#
# run again with srun
#
srun ./ring_c

Cori KNL Open MPI

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=68
#SBATCH --constraint=knl

module load openmpi

/bin/cat <<EOM > ring_c.c
/*
 * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
 *                         University Research and Technology Corporation.
 *                         All rights reserved.
 * Copyright (c) 2006      Cisco Systems, Inc.  All rights reserved.
 *
 * Simple ring test program in C.
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message, tag = 201;

    /* Start up MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Calculate the rank of the next process in the ring.  Use the
       modulus operator so that the last process "wraps around" to
       rank zero. */
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    /* If we are the "master" process (i.e., MPI_COMM_WORLD rank 0),
       put the number of times to go around the ring in the message. */
    if (0 == rank) {
        message = 10;

        printf("Process 0 sending %d to %d, tag %d (%d processes in ring)\n",
               message, next, tag, size);
        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        printf("Process 0 sent to %d\n", next);
    }

    /* Pass the message around the ring.  The exit mechanism works as
       follows: the message (a positive integer) is passed around the
       ring.  Each time it passes rank 0, it is decremented.  When each
       process receives a message containing a 0 value, it passes the
       message on to the next process and then quits.  By passing the 0
       message first, every process gets the 0 message and can quit
       normally. */
    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        if (0 == rank) {
            --message;
            printf("Process 0 decremented value: %d\n", message);
        }

        MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
        if (0 == message) {
            printf("Process %d exiting\n", rank);
            break;
        }
    }

    /* The last process does one extra send to process 0, which needs
       to be received before the program can exit */
    if (0 == rank) {
        MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* All done */
    MPI_Finalize();
    return 0;
}
EOM

mpicc -o ring_c ring_c.c

mpirun ./ring_c
#
# run again with srun
#
srun ./ring_c


## Xfer queue¶

The intended use of the xfer queue is to transfer data between Cori or Edison and HPSS. The xfer jobs run on one of the login nodes and are free of charge. If you want to transfer data to the HPSS archive system at the end of a regular job, you can submit an xfer job at the end of your batch job script via module load esslurm; sbatch hsi put <my_files> (be sure to load the esslurm module first, or you'll end up in the regular queue), so that you will not get charged for the duration of the data transfer. The xfer jobs can be monitored via module load esslurm; squeue. The number of running jobs for each user is limited to the number of concurrent HPSS sessions (15).

Warning

Do not run computational jobs in the xfer queue.

Xfer transfer job

#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=my_transfer
#SBATCH --licenses=SCRATCH

#Archive run01 to HPSS
htar -cvf run01.tar run01

#Submit job with
#module load esslurm
#sbatch <job_script>

Xfer jobs specifying -N nodes will be rejected at submission time. When submitting an xfer job from Cori, the -C haswell flag is not needed since the job does not run on compute nodes. By default, xfer jobs get 2GB of memory allocated. The memory footprint scales somewhat with the size of the file, so if you're archiving larger files, you'll need to request more memory. You can do this by adding #SBATCH --mem=XGB to the above script (where X in the range of 5 - 10 GB is a good starting point for large files).

To monitor your xfer jobs, please load the esslurm module; then you can use Slurm commands such as squeue or scontrol to access the xfer queue on Cori or Edison.

## Variable-time jobs¶

Variable-time jobs are for users who wish to get a better queue turnaround and/or need to run long running jobs, including jobs longer than 48 hours, the maximum wall-clock time allowed on Cori and Edison. Variable-time jobs are jobs submitted with a minimum time, #SBATCH --time-min, in addition to the maximum time (#SBATCH --time).

Here is an example job script for variable-time jobs:

Sample job script with --time-min

#!/bin/bash
#SBATCH -J test
#SBATCH -q flex
#SBATCH -C knl
#SBATCH -N 1
#SBATCH --time=12:00:00      #the max walltime allowed for flex QOS jobs
#SBATCH --time-min=2:00:00   #the minimum amount of time the job should run

#this is an example to run an MPI+OpenMP job:
export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=8

srun -n8 -c16 --cpu_bind=cores ./a.out

Jobs specifying a minimum time can start execution earlier than they would otherwise, with a time limit anywhere between the minimum and maximum time requests. Pre-terminated jobs can be requeued (or resubmitted) by using the scontrol requeue command (or sbatch) to resume from where the previous executions left off, until the cumulative execution time reaches the desired time limit or the job completes.

Note

To use variable-time jobs, applications are required to be able to checkpoint and restart by themselves.

### Using the flex QOS for a charging discount for variable-time jobs on KNL¶

Variable-time jobs, by specifying a shorter minimum amount of time that the job should run, increase backfill opportunities, meaning users will see a better queue turnaround. In addition, the process of resubmitting the job can be automated, so users can run a long job in multiple shorter chunks with a single job script (see the automated job script sample below). However, variable-time jobs incur checkpoint/restart overheads from splitting a longer job into multiple shorter ones. To compensate for this overhead and to encourage users to use Cori KNL, where more backfill opportunities are available, we have created a flex QOS on Cori KNL (#SBATCH -q flex) with a charging discount for variable-time jobs. See the Queues and Policy page for Cori KNL for more details on the flex QOS.

Note

• The flex QOS is free of charge currently. The discount rate is subject to change.

• Variable-time jobs work with any QOS on Cori and Edison, but the charging discount is available only with the flex QOS on Cori KNL.

### Annotated example - automated variable-time jobs¶

Here is a sample job script for variable-time jobs, which automates the process of executing, pre-terminating, requeuing and restarting the job repeatedly until it runs for the desired amount of time or the job completes.

Edison

#!/bin/bash
#SBATCH -J vtj
#SBATCH -q regular
#SBATCH -N 2
#SBATCH --comment=96:00:00
#SBATCH --time-min=2:00:00   #the minimum amount of time the job should run
#SBATCH --time=12:00:00
#SBATCH --error=vtj-%j.err
#SBATCH --output=vtj-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append

# use the following three variables to specify the time limit per job (max_timelimit),
# the amount of time (in seconds) needed for checkpointing,
# and the command to use to do the checkpointing if any (leave blank if none)
max_timelimit=12:00:00   # can match the #SBATCH --time option but doesn't have to
ckpt_overhead=60         # should match the time in the #SBATCH --signal option
ckpt_command=

# requeueing the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#

# user setting goes here
# srun must execute in the background and catch the signal USR1 on the wait command
srun -n48 -c2 --cpu_bind=cores ./a.out &
wait

Cori Haswell

#!/bin/bash
#SBATCH -J vtj
#SBATCH -q regular
#SBATCH -C haswell
#SBATCH -N 2
#SBATCH --comment=96:00:00
#SBATCH --time-min=2:00:00   #the minimum amount of time the job should run
#SBATCH --time=12:00:00
#SBATCH --error=vtj-%j.err
#SBATCH --output=vtj-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append

# use the following three variables to specify the time limit per job (max_timelimit),
# the amount of time (in seconds) needed for checkpointing,
# and the command to use to do the checkpointing if any (leave blank if none)
max_timelimit=12:00:00   # can match the #SBATCH --time option but doesn't have to
ckpt_overhead=60         # should match the time in the #SBATCH --signal option
ckpt_command=

# requeueing the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#

# user setting goes here
# srun must execute in the background and catch the signal USR1 on the wait command
srun -n64 -c2 --cpu_bind=cores ./a.out &
wait

Cori KNL

#!/bin/bash
#SBATCH -J vtj
#SBATCH -q flex
#SBATCH -C knl
#SBATCH -N 2
#SBATCH --comment=96:00:00
#SBATCH --time-min=2:00:00   #the minimum amount of time the job should run
#SBATCH --time=12:00:00
#SBATCH --error=vtj-%j.err
#SBATCH --output=vtj-%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append

# use the following three variables to specify the time limit per job (max_timelimit),
# the amount of time (in seconds) needed for checkpointing,
# and the command to use to do the checkpointing if any (leave blank if none)
max_timelimit=12:00:00   # can match the #SBATCH --time option but doesn't have to
ckpt_overhead=60         # should match the time in the #SBATCH --signal option
ckpt_command=

# requeueing the job if remaining time >0 (do not change the following 3 lines)
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1
#

# user setting goes here
export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=8

#srun must execute in the background and catch the signal USR1 on the wait command
srun -n32 -c16 --cpu_bind=cores ./a.out &
wait

In the above example, the --comment option is used to enter the user's desired maximum wall-clock time, which could be longer than the maximum time limit allowed by the batch system (96 hours in this example). In addition to the time limit (--time), the --time-min option is used to specify the minimum amount of time the job should run (2 hours).

The script setup.sh defines a few bash functions (e.g., requeue_job, func_trap) that are used to automate the process. The requeue_job func_trap USR1 command executes the func_trap function, which contains a list of actions to checkpoint and requeue the job, upon trapping the USR1 signal. Users may want to modify the scripts (get a copy) as needed, although they should work for most applications as they are now.

The job script works as follows:

1. The user submits the above job script.

2. The batch system looks for a backfill opportunity for the job. If it can allocate the requested number of nodes for this job for any duration (e.g., 3 hours) between the specified minimum time (2 hours) and the time limit (12 hours) before those nodes are used for other higher priority jobs, the job starts execution.

3. The job runs until it receives the USR1 signal (--signal=B:USR1@<sig_time>) 60 seconds (sig_time=60 in this example) before it hits the allocated time limit (3 hours).

4. Upon receiving the signal, the job checkpoints and requeues itself with the remaining max time limit before it gets terminated. The variable ckpt_overhead is used to specify the amount of time (in seconds) needed for checkpointing and requeuing the job. It should match the sig_time in the --signal option.

5. Steps 2-4 repeat until the job runs for the desired amount of time (96 hours) or the job completes.

Note

• If your application requires external triggers or commands to do checkpointing, you need to provide the checkpoint commands using the variable ckpt_command. It could be a script containing several commands to be executed within the specified checkpoint overhead time (ckpt_overhead).

• Additionally, if you need to change the job input files to resume the job, you can do so within ckpt_command.

• If your application does checkpointing periodically, like most molecular dynamics codes do, you don't need ckpt_command (just leave it blank).

• You can send the USR1 signal outside the job script at any time using the scancel -b -s USR1 <jobid> command to terminate the currently running job. The job still checkpoints and requeues itself before it gets terminated.

• The srun command must execute in the background (notice the & at the end of the srun command line and the wait command at the end of the job script) so that the signal (USR1) is caught on the wait command instead of by srun, which allows srun to run for a bit longer (up to sig_time seconds) to complete the checkpointing.

### VASP example¶

VASP atomic relaxation jobs for Cori KNL

#!/bin/bash
#SBATCH -J vt_vasp
#SBATCH -q regular
#SBATCH -C knl
#SBATCH -N 2
#SBATCH --time=12:0:00
#SBATCH --error=vt_vasp%j.err
#SBATCH --output=vt_vasp%j.out
#SBATCH --mail-user=elvis@nersc.gov
#
#SBATCH --comment=96:00:00
#SBATCH --time-min=02:0:00
#SBATCH --signal=B:USR1@300
#SBATCH --requeue
#SBATCH --open-mode=append

#user setting
export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=8

#srun must execute in background and catch signal on wait command
module load vasp/20171017-knl
srun -n 8 -c32 --cpu_bind=cores vasp_std &

# put any commands that need to run to continue the next job (fragment) here
ckpt_vasp() {
set -x
restarts=$(squeue -h -O restartcnt -j $SLURM_JOB_ID)
echo checkpointing the ${restarts}-th job

#to terminate VASP at the next ionic step
echo LSTOP = .TRUE. > STOPCAR

#wait until VASP completes the current ionic step, writes out the WAVECAR file and quits
srun_pid=$(ps -fle|grep srun|head -1|awk '{print $4}')
echo srun pid is $srun_pid
wait $srun_pid

#copy CONTCAR to POSCAR
cp -p CONTCAR POSCAR
set +x
}

ckpt_command=ckpt_vasp
max_timelimit=12:00:00

# requeueing the job if remaining time >0
. /usr/common/software/variable-time-job/setup.sh
requeue_job func_trap USR1

wait
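
For readers who want to see what the setup.sh helpers do conceptually, here is a minimal, hedged sketch of the trap-and-requeue pattern in plain bash. It only illustrates the mechanism; it is not the content of the NERSC-provided script.

#!/bin/bash
#SBATCH --time=12:00:00
#SBATCH --time-min=2:00:00
#SBATCH --signal=B:USR1@60
#SBATCH --requeue
#SBATCH --open-mode=append

# when USR1 arrives ~60 seconds before the time limit,
# checkpoint (if the application needs it) and requeue this job
trap 'echo "USR1 received, requeueing"; scontrol requeue $SLURM_JOB_ID' USR1

srun ./a.out &   # run in the background so USR1 interrupts "wait", not srun
wait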


## MPMD (Multiple Program Multiple Data) jobs¶

Run a job with different programs and different arguments for each task. To run MPMD jobs under Slurm, use srun --multi-prog <config_file_name>.

srun -n 8 --multi-prog myrun.conf


### Configuration file format¶

• Task rank

One or more task ranks to use this configuration. Multiple values may be comma separated. Ranges may be indicated with two numbers separated with a '-' with the smaller number first (e.g. "0-4" and not "4-0"). To indicate all tasks not otherwise specified, specify a rank of '*' as the last line of the file. If an attempt is made to initiate a task for which no executable program is defined, the following error message will be produced: "No executable program specified for this task".

• Executable

The name of the program to execute. May be fully qualified pathname if desired.

• Arguments

Program arguments. The expression "%t" will be replaced with the task's number. The expression "%o" will be replaced with the task's offset within this range (e.g. a configured task rank value of "1-5" would have offset values of "0-4"). Single quotes may be used to avoid having the enclosed values interpreted. This field is optional. Any arguments for the program entered on the command line will be added to the arguments specified in the configuration file.
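
For example, a small hedged configuration file illustrating the %t and %o substitutions described above: ranks 1-4 each receive their global task number via %t and their offset within the 1-4 range (0-3) via %o. Program names and arguments are placeholders.

0    ./master
1-4  ./worker --task=%t --offset=%o
*    ./other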

### Example¶

Sample job script for MPMD jobs. You need to create a configuration file in the format described above and a batch script that passes this configuration file to srun via the --multi-prog flag.

Cori-Haswell

cori$ cat mpmd.conf
0-35 ./a.out
36-96 ./b.out

cori$ cat batch_script.sh
#!/bin/bash
#SBATCH -q regular
#SBATCH -N 5
#SBATCH -n 97  # total of 97 tasks
#SBATCH -t 02:00:00
#SBATCH -C haswell

srun --multi-prog ./mpmd.conf


## Burst buffer¶

All examples for the burst buffer are shown with Cori Haswell nodes. Options related to the burst buffer do not depend on Haswell or KNL node choice.

Note

The burst buffer is only available on Cori.

### Scratch¶

Use the burst buffer as a scratch space to store temporary data during the execution of I/O intensive codes. In this mode all data from the burst buffer allocation will be removed automatically at the end of the job.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch

srun check-mpi.intel.cori > ${DW_JOB_STRIPED}/output.txt
ls ${DW_JOB_STRIPED}
cat ${DW_JOB_STRIPED}/output.txt


### Stage in/out¶

Copy the named file or directory into the burst buffer, which can then be accessed using $DW_JOB_STRIPED.

Note

• Only files on the Cori $SCRATCH filesystem can be staged in

• A full path to the file must be used

• You must have permissions to access the file

• The job start may be delayed until the transfer is complete

• Stage out occurs after the job is completed, so there is no charge

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/dwtest-file destination=$DW_JOB_STRIPED/dwtest-file type=file

srun ls ${DW_JOB_STRIPED}/dwtest-file

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=5
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_out source=$DW_JOB_STRIPED/output destination=/global/cscratch1/sd/username/output type=directory

mkdir $DW_JOB_STRIPED/output
srun check-mpi.intel.cori > ${DW_JOB_STRIPED}/output/output.txt


### Persistent Reservations¶

Persistent reservations are useful when multiple jobs need access to the same files.

Warning

• Reservations must be deleted when no longer in use.
• There are no guarantees of data integrity over long periods of time.

Note

Each persistent reservation must have a unique name.

#### Create¶

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#BB create_persistent name=PRname capacity=100GB access_mode=striped type=scratch


#### Use¶

If multiple jobs will be using the reservation, take care that they do not overwrite each other's data.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#DW persistentdw name=PRname

ls $DW_PERSISTENT_STRIPED_PRname/


#### Destroy¶

Any data on the reservation at the time the script executes will be removed.

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH --constraint=haswell
#BB destroy_persistent name=PRname


### Interactive¶

The burst buffer is available in interactive sessions. It is recommended to use a configuration file for the burst buffer directives:

cori$ cat bbf.conf
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/filename destination=$DW_JOB_STRIPED/filename type=file

cori$ salloc --qos=interactive -C haswell -t 00:30:00 --bbf=bbf.conf


## Large Memory¶

Cori has two nodes with 750 GB of memory each for jobs that require very high memory per node. Because there are only two such nodes, this is a limited resource and should be used only for jobs that genuinely need the extra memory. To make these nodes useful to more users at once, they can be shared among users. If you need to run with multiple threads, you will need to request the whole node: add #SBATCH --exclusive and add the -c 32 flag to your srun call.

Cori Example

A sample bigmem job which needs only one core.

#!/bin/bash
#SBATCH --clusters=escori
#SBATCH --qos=bigmem
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --job-name=my_big_job
#SBATCH --mem=250GB

srun -n 1 ./my_big_executable
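
A hedged sketch of the whole-node, multi-threaded variant described above (the thread count and executable name are illustrative):

#!/bin/bash
#SBATCH --clusters=escori
#SBATCH --qos=bigmem
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --job-name=my_big_threaded_job
#SBATCH --exclusive

export OMP_NUM_THREADS=32
srun -n 1 -c 32 ./my_big_threaded_executable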


## Realtime¶

The "realtime" QOS is used for running jobs that require real-time turnaround.

Note

Use of this QOS requires special approval.

"realtime" QOS Request Form

The realtime QOS is a user-selective shared QOS, meaning you can request either exclusive node access (with the #SBATCH --exclusive flag) or allow multiple applications to share a node (with the #SBATCH --share flag).

Tip

It is recommended to allow sharing the nodes so more jobs can be scheduled in the allocated nodes. Sharing a node is the default setting, and using #SBATCH --share is optional.

Example

Uses two full nodes

#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job
#SBATCH --exclusive

srun -n 64 -c 2 --cpu-bind=cores ./mycode.exe   # pure MPI, 64 MPI tasks


If you are requesting only a portion of a single node, please add --gres=craynetwork:0 as follows to allow more jobs on the node. Similar to the "shared" QOS, you can request a number of slots on the node (a Haswell node has a total of 64 CPUs, or 64 slots) by specifying --ntasks and/or --mem. The rules are the same as for the shared QOS.

Example

Two MPI ranks running with 4 OpenMP threads each. The job uses 8 physical cores in total (8 "cpus", i.e. hyperthreads, per task) and 10GB of memory.

#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=1
#SBATCH --gres=craynetwork:0
#SBATCH --mem=10GB
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --share

export OMP_NUM_THREADS=4

srun --cpu-bind=cores ./mycode.exe


Example

OpenMP only code running with 6 threads. Note that srun is not required in this case.

#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --constraint=haswell
#SBATCH --nodes=1
#SBATCH --gres=craynetwork:0
#SBATCH --mem=16GB
#SBATCH --time=01:00:00
#SBATCH --job-name=my_job3
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --share

export OMP_NUM_THREADS=6

./mycode.exe


## Multiple Parallel Jobs While Sharing Nodes¶

Under certain scenarios, you might want two or more independent applications running simultaneously on each compute node allocated to your job. For example, a pair of applications that interact in a client-server fashion via some IPC mechanism on-node (e.g. shared memory), but must be launched in distinct MPI communicators.

This latter constraint means that MPMD mode (see above) is not an appropriate solution, since although MPMD can allow multiple executables to share compute nodes, the executables will also share an MPI_COMM_WORLD at launch.

Slurm can allow multiple executables launched with concurrent srun calls to share compute nodes as long as the sum of the resources assigned to each application does not exceed the node resources requested for the job. Importantly, you cannot over-allocate the CPU, memory, or "craynetwork" resource. While the former two are self-explanatory, the latter refers to limitations imposed on the number of applications per node that can simultaneously use the Aries interconnect, which is currently limited to 4.

Here is a quick example of an sbatch script that uses two compute nodes and runs two applications concurrently. One application uses 8 cores on each node, while the other uses 24 on each node. The number of CPUs per node is again controlled with the "-n" and "-N" flags, while the amount of memory per node with the "--mem" flag. To specify the "craynetwork" resource, we use the "--gres" flag available in both "sbatch" and "srun".

Cori Haswell

#!/bin/bash

#SBATCH -q regular
#SBATCH -N 2
#SBATCH -t 12:00:00
#SBATCH --gres=craynetwork:2
#SBATCH -L SCRATCH
#SBATCH -C haswell

srun -N 2 -n 16 -c 2 --mem=51200 --gres=craynetwork:1 ./exec_a &
srun -N 2 -n 48 -c 2 --mem=61440 --gres=craynetwork:1 ./exec_b &
wait


This example is quite similar to the multiple srun jobs shown for running simultaneous parallel jobs, with the following exceptions:

1. For our sbatch job, we have requested "--gres=craynetwork:2" which will allow us to run up to two applications simultaneously per compute node.

2. In our srun calls, we have explicitly defined the maximum amount of memory available to each application per node with "--mem" (in this example 50 and 60 GB, respectively) such that the sum is less than the resource limit per node (roughly 122 GB).

3. In our srun calls, we have also explicitly used one of the two requested craynetwork resources per call.

Using this combination of resource requests, we are able to run multiple parallel applications per compute node.

One additional observation: when calling srun, it is permitted to specify "--gres=craynetwork:0" which will not count against the craynetwork resource. This is useful when, for example, launching a bash script or other application that does not use the interconnect. We don't currently anticipate this being a common use case, but if your application(s) do employ this mode of operation it would be appreciated if you let us know.
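
For instance, here is a hedged sketch of pairing an MPI application with a node-local helper that does not need the interconnect, inside an allocation like the one above that requested --gres=craynetwork:2 (executable names and sizes are placeholders):

srun -N 2 -n 48 -c 2 --mem=61440 --gres=craynetwork:1 ./exec_mpi &
srun -N 2 -n 2 -c 2 --mem=10240 --gres=craynetwork:0 ./node_local_helper.sh &
wait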