GNU Parallel¶
GNU Parallel is a free, open-source tool for running shell commands and scripts in parallel and sequence on a single node.
This tool is best suited for workflows made up of many similar tasks with no execution order requirements or data dependencies between them. Its other main strength is a convenient syntax for specifying file paths, file names, and command parameters.
It is workable, but does not excel, when tasks span multiple nodes or use MPI applications. A rudimentary logging system allows incomplete workflows to be resumed; however, if a project requires consistent workflow state to be maintained across many batch job submissions, GNU Parallel alone is not the best choice.
Strengths of GNU Parallel:¶
- No user installation or configuration
- No database or persistent server needed
- Powerful specification of task filenames, paths, and parameters
- Easily scales to a very large number of tasks
- Does not burden cluster scheduler
Disadvantages of GNU Parallel:¶
- Does not easily load-balance work across nodes
- Requires careful organization of input and output files
- Scaling up requires awareness of system I/O performance
- Does not strongly preserve workflow state between invocations
- Requires modest familiarity with bash scripting
Example Repository¶
Working examples of the following concepts are available in the DOE Cross-facility Workflows Training Repository. Obtain them by running git clone with the repository URL on your Perlmutter storage; the examples are under /DOE-HPC-workflow-training/GNU-Parallel/NERSC in the resulting clone.
How to use GNU Parallel at NERSC¶
In this first example, seq is used to generate four lines of input, which are piped into parallel. Those four input lines cause parallel to run four tasks in total. The command each task runs is passed to parallel, in this case echo and its arguments. The {} in the command marks the location where each input line's content is substituted into the task command.
Basic example
elvis@perlmutter:login13:~> module load parallel
elvis@perlmutter:login13:~> seq 1 4 | parallel echo "Hello world {}!"
Hello world 1!
Hello world 2!
Hello world 3!
Hello world 4!
elvis@perlmutter:login13:~>
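The input lines do not have to arrive on a pipe. GNU Parallel also accepts arguments inline after ::: or reads them from a file after :::: (a form used in a later example). Both commands below are equivalent to the piped version above, assuming a file input.txt holding one argument per line:

Passing input with ::: and ::::
parallel echo "Hello world {}!" ::: 1 2 3 4
parallel echo "Hello world {}!" :::: input.txt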
The next necessary concept is how to pass more substantial tasks and input files to parallel. This example shows how input files created with sequential names can be passed to parallel using bash commands and pipes:
Sequentially named input files for each task
elvis@nid004258:~/work> ls
input01.dat input02.dat input03.dat input04.dat input05.dat
input06.dat input07.dat input08.dat input09.dat input10.dat
elvis@nid004258:~/work> seq -w 1 10 | parallel task_command.sh input{}.dat
Note that this example runs on a compute node inside an interactive salloc session. Though parallel is great for automating mundane, lightweight, and repetitive tasks such as creating many directories or parsing lots of log files, tasks which use a substantial amount of compute resources still need to run on Slurm-allocated compute nodes and not on shared login nodes.

A second approach places all task inputs in the same directory and uses the find command to build a file listing all their paths:
Using find to build an input file list
elvis@nid004258:~/work> find $PWD -type f | grep dat | sort > input.txt
elvis@nid004258:~/work> cat input.txt | parallel task_command.sh {}
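Before launching real work, it can help to preview the exact commands parallel will construct. The --dry-run flag prints each generated command without executing it (task_command.sh remains the hypothetical task script from the examples above):

Previewing commands with --dry-run
cat input.txt | parallel --dry-run task_command.sh {}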
I/O Performance Pitfalls at Large Scale
If your work requires large numbers of tasks and input files (more than roughly 1000), give additional consideration to the I/O systems involved:
- Directories containing more than 1000 files perform poorly; distribute files across subdirectories to avoid this.
- When a pipe passes the task list to GNU Parallel, multiple temporary files are needed per task, which can cause parallel to fail by exceeding the operating system's limit on open file handles (ulimit). Avoid this by explicitly creating a file containing the task list and passing it as an argument to parallel.
- At larger scales it becomes more important to read and write data on higher-performance file systems, such as the Lustre scratch file system.
- If all of your tasks read the same files, you can increase performance by making multiple copies of those files and assigning different tasks to read different copies, as shown in the sketch after this list.
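As a sketch of that last point: GNU Parallel's {%} replacement string expands to the job slot number (1 up to the --jobs limit), so it can select among per-slot copies of a shared input file. The task_command.sh script and its --reference option are hypothetical here.

Spreading reads across file copies with {%}
# ref_copy_1.dat ... ref_copy_4.dat are four copies of a shared input file;
# each of the 4 concurrent job slots reads its own copy
parallel -j 4 task_command.sh --reference ref_copy_{%}.dat {} :::: input.txt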
Running Many Tasks Inside a Single Node Allocation¶
This Slurm batch script requests one CPU node in the regular QOS and then runs parallel on that node. The parallel command runs up to six tasks of payload.sh at a time, one for each line in the file input.txt. If the input file contains more than six lines, the additional tasks wait until earlier tasks finish and a slot becomes available. Each input line becomes an argument to its task's script.
single_node_many_task_with_parallel.sh
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --nodes=1
#SBATCH --constraint=cpu
module load parallel
srun parallel --jobs 6 ./payload.sh argument_{} :::: input.txt
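The payload.sh script holds whatever per-task work is needed; it is not part of GNU Parallel. A minimal hypothetical version, just enough to make the example runnable, could be:

payload.sh (hypothetical)
#!/bin/bash
# Report which argument this task received and where it ran,
# then stand in for real work
echo "task ${1} running on $(hostname)"
sleep 5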
This arrangement is a great alternative to submitting many individual jobs or a task array to the shared Slurm QOS. Current scheduling policy allows only two jobs per user to gain priority at a time, so a single job running many tasks will spend less time waiting in the queue than many jobs each running a single task. This pattern also requires much less interaction with the Slurm controller, making it less likely to cause, or be affected by, heavy load on the controller.
Many Tasks Inside a Multiple Node Allocation¶
This is demonstrated using two scripts: a batch submission to Slurm and a driver script containing the parallel and payload commands.

The batch submission requests two CPU nodes, then srun runs two instances of driver.sh with the $1 argument containing the task input list file.
multiple_nodes_many_tasks_parallel.sh
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH --ntasks-per-node=1
srun --no-kill --ntasks=2 --wait=0 driver.sh $1
The --no-kill argument keeps the Slurm allocation running if any of the allocated nodes fail during the job. The --wait=0 argument prevents the job from terminating the other driver instances when the first one finishes.

The driver script uses environment variables set by Slurm inside a job to distinguish the instances of parallel, and then distributes input tasks to them round-robin using awk.
driver.sh
#!/bin/bash
module load parallel
# Make sure the needed Slurm variables are set (i.e. we are inside a job)
if [[ -z "${SLURM_NODEID}" ]]; then
echo "need \$SLURM_NODEID set"
exit 1
fi
if [[ -z "${SLURM_NNODES}" ]]; then
echo "need \$SLURM_NNODES set"
exit 1
fi
# Round-robin: keep every NNODE-th input line, offset by this node's ID,
# and pass the selected lines to parallel
cat $1 | \
awk -v NNODE="$SLURM_NNODES" -v NODEID="$SLURM_NODEID" \
'NR % NNODE == NODEID' | \
parallel payload.sh {}
The conditional statements make sure the needed Slurm environment variables are present: $SLURM_NNODES holds the total number of nodes in the job and $SLURM_NODEID holds the unique ID of this node. The awk command uses the line number of each input together with these two variables to implement the round-robin assignment of tasks to nodes. An advantage of this method is that the number of nodes requested by the job can be changed freely without adjusting the task-to-node assignment logic.
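The selection logic can be checked by hand by setting the two variables manually. With two nodes, node 0 takes the even-numbered input lines and node 1 the odd-numbered ones:

Testing the round-robin filter
seq 1 6 > tasks.txt
awk -v NNODE=2 -v NODEID=0 'NR % NNODE == NODEID' tasks.txt   # prints 2 4 6
awk -v NNODE=2 -v NODEID=1 'NR % NNODE == NODEID' tasks.txt   # prints 1 3 5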
Grouping Many One-Node MPI Jobs Into a Larger Job¶
This Slurm batch script demonstrates how parallel can be used to distribute multiple single-node MPI tasks within a multi-node job. The batch script starts a job with 4 CPU nodes, then tells parallel to run 4 tasks simultaneously. Note that the number of nodes requested equals the number of concurrent tasks, so one instance of the MPI task script runs on each node. The job script must also request an appropriate number of tasks per node (128 on a Perlmutter CPU node) so that the same value can be used in the MPI task script.
mpi-task-job.sub
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --nodes=4
#SBATCH --constraint=cpu
#SBATCH --ntasks-per-node=128
module load parallel
parallel -j 4 mpi-task.sh {} < input.txt
Below is an example task script to be used in combination with the above job script. The srun command launches the MPI executable, with its argument passed in via parallel and the input.txt input file. Note that the srun call specifies one node and 128 tasks on that node.
mpi-task.sh
#!/bin/bash
srun -N 1 -n 128 mpi.exec $1
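Each line of input.txt becomes the $1 argument of one mpi-task.sh invocation. For example, the file might name one configuration file per task (the contents below are hypothetical):

Creating a sample input.txt
cat > input.txt <<EOF
config_A.in
config_B.in
config_C.in
config_D.in
EOF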
Scaling Parallel with --sshlogin is Not Recommended¶
GNU Parallel includes a feature to distribute tasks to multiple machines over ssh connections. Though this allows work to be balanced between multiple nodes, our testing suggests that it scales much less effectively than dedicated task managers, and a different tool would be a better choice. More detail about this finding is available upon request.
Resuming Unfinished or Retrying Failed Tasks¶
If any tasks in a GNU parallel instance return a non-zero exit code, the parallel command will also return non-zero. Parallel can be configured to use a job log file which tracks failed or incomplete tasks so that they can be resumed or retried.
Add --resume-failed --joblog logfile.txt to the list of parallel arguments and the state of every task will be recorded. When that parallel instance is rerun with exactly the same command line, it will skip any tasks that already completed and re-run any tasks that failed. When using a joblog it is good practice to use the available Slurm environment variables to give each instance of parallel its own log file.
It is very important that the input file and command-line arguments are not modified between runs, and that only one instance of parallel uses a given log file at a time.
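For example, in the multi-node driver script above, each instance of parallel can key its log file on $SLURM_NODEID, which remains stable across resubmissions as long as the node count is unchanged. A sketch of the parallel line inside driver.sh:

Per-node joblog inside driver.sh
parallel --resume-failed --joblog joblog_node${SLURM_NODEID}.txt payload.sh {}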
Note that the --retries n argument looks like it should let an instance of parallel retry a failed task, but this option only takes effect when using the --sshlogin feature.