GNU Parallel

GNU Parallel is a free, open-source tool for running shell commands and scripts in parallel or in sequence on a single node.

This tool is best suited for workflows that contain many similar tasks with no execution-order requirements or data dependencies. Its other main strength is a convenient syntax for specifying patterns of file paths and command parameters.

It is workable, but does not excel, for tasks that span multiple nodes or use MPI applications. A rudimentary logging system allows incomplete workflows to be resumed; however, if a project requires that consistent workflow state be maintained over the course of many batch job submissions, then GNU Parallel alone is not the best choice.

Strengths of GNU Parallel:

  • No user installation or configuration
  • No database or persistent server needed
  • Powerful specification of task filenames, paths, and parameters
  • Easily scales to a very large number of tasks
  • Does not burden cluster scheduler

Disadvantages of GNU Parallel:

  • Doesn't easily load balance work
  • Requires careful organization of input and output files
  • Scaling up requires awareness of system I/O performance
  • Does not strongly preserve workflow state between invocations
  • Modest familiarity with bash scripting recommended

Example Repository

Working examples of the following concepts are available in the DOE Cross-facility Workflows Training Repository. Obtain them by running git clone with the repository URL in your Perlmutter storage, then change into DOE-HPC-workflow-training/GNU-Parallel/NERSC.
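For example (the repository URL shown here is an assumption; use the URL published for the training repository):

elvis@perlmutter:login13:~> git clone https://github.com/CrossFacilityWorkflows/DOE-HPC-workflow-training.git
elvis@perlmutter:login13:~> cd DOE-HPC-workflow-training/GNU-Parallel/NERSC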

How to use GNU Parallel at NERSC

In this first example, seq generates four lines of input, which are piped into parallel. Each of those input lines causes parallel to run one task, four in total. The command each task will run is passed as an argument to parallel, in this case echo and its arguments. The {} in the command marks the location where each input line's content will be substituted into that task's command.

Basic example
elvis@perlmutter:login13:~> module load parallel
elvis@perlmutter:login13:~> seq 1 4 | parallel echo "Hello world {}!"
Hello world 1!
Hello world 2!
Hello world 3!
Hello world 4!
elvis@perlmutter:login13:~>
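The same four tasks can be expressed without a pipe by using parallel's ::: input source syntax. By default, output is printed in task completion order; the --keep-order flag forces input order:

Equivalent ::: syntax
elvis@perlmutter:login13:~> parallel --keep-order echo "Hello world {}!" ::: 1 2 3 4
Hello world 1!
Hello world 2!
Hello world 3!
Hello world 4!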

The next concept is how to pass more substantial tasks and input files to parallel. This example shows how input files with sequentially numbered names can be passed to parallel using bash commands and pipes:

Sequentially named input files for each task

elvis@nid004258:~/work> ls
input01.dat  input02.dat  input03.dat  input04.dat  input05.dat
input06.dat  input07.dat  input08.dat  input09.dat  input10.dat
elvis@nid004258:~/work> seq -w 1 10 | parallel task_command.sh input{}.dat

Note that this example runs in an salloc session. Though parallel is great for automating mundane, lightweight, and repetitive tasks such as creating many directories or parsing many log files, tasks that use a substantial amount of compute resources must run on Slurm-allocated compute nodes, not on the shared login nodes.
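An interactive allocation like the one used above can be requested with salloc; the QOS, constraint, node count, and time limit below are illustrative:

Requesting an interactive compute node
elvis@perlmutter:login13:~> salloc --qos=interactive --constraint=cpu --nodes=1 --time=60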

A second approach places all task inputs into the same directory and uses the find command to build a file listing all of their paths:

Using find to build an input file list
elvis@nid004258:~/work> find $PWD -type f | grep dat | sort > input.txt
elvis@nid004258:~/work> cat input.txt | parallel task_command.sh {}
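If file names may contain spaces or other special characters, a safer variant uses null-delimited records (find's -print0, sort's -z, and parallel's -0 options):

elvis@nid004258:~/work> find $PWD -type f -name '*.dat' -print0 | sort -z | parallel -0 task_command.sh {}
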
I/O Performance Pitfalls at Large Scale

If work requires a large number (more than 1000) of tasks and input files, then additional attention to I/O system performance may be needed or beneficial.

Directories containing more than 1000 files perform poorly; distribute files across subdirectories to avoid this, as sketched below.
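A minimal sketch of distributing the sequentially numbered input files from the earlier example into subdirectories (the choice of 100 buckets is arbitrary):

for f in input*.dat; do
    n=$(echo "$f" | tr -dc '0-9')   # extract the file's number
    sub=$(( 10#$n % 100 ))          # force base 10 and hash into 100 buckets
    mkdir -p "bucket_${sub}"
    mv "$f" "bucket_${sub}/"
done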

When using a pipe to pass the task list to GNU Parallel, multiple temporary files are needed per task, which can cause parallel to fail by exceeding the operating system's ulimit on open file handles. Avoid this by explicitly creating a file containing the task list and passing it as an argument to parallel.
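For example, write the task list to a file and pass it with the :::: file-argument syntax (equivalently, parallel -a tasks.txt task_command.sh {}):

elvis@nid004258:~/work> find $PWD -type f -name '*.dat' | sort > tasks.txt
elvis@nid004258:~/work> parallel task_command.sh {} :::: tasks.txt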

At larger scale it becomes more important to read and write data on higher-performance file systems, such as the Lustre scratch file system.
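On Perlmutter the per-user Lustre scratch directory is available in the $SCRATCH environment variable, so staging inputs there before a large run might look like this (the source path is illustrative):

elvis@perlmutter:login13:~> cp -r ~/work/inputs "$SCRATCH/inputs"
elvis@perlmutter:login13:~> cd "$SCRATCH/inputs"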

If all of your tasks read the same files, you can increase performance by making multiple copies of those files and assigning different tasks to read different copies.
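One sketch of this technique uses parallel's {%} replacement string, which expands to the job slot number (1 through the --jobs value), so each concurrent slot reads its own copy of the shared file. Here task_command.sh and its --shared flag are hypothetical:

for i in 1 2 3 4; do cp shared.dat "shared_${i}.dat"; done   # one copy per job slot
parallel --jobs 4 task_command.sh --shared shared_{%}.dat {} :::: tasks.txt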

Running Many Tasks Inside a Single Node Allocation

This Slurm batch script requests one CPU node in the regular QOS and then runs parallel on that node. The parallel command runs up to six tasks of payload.sh at a time, one task for each line in the file input.txt. If the input file contains more than six lines, the additional tasks wait until earlier tasks finish and slots become available. Each input line becomes an argument to its task's script.

single_node_many_task_with_parallel.sh
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --nodes=1
#SBATCH --constraint=cpu

module load parallel

srun parallel --jobs 6 ./payload.sh argument_{} :::: input.txt 

This arrangement is a great alternative to submitting many individual jobs or a task array to the shared Slurm QOS. Current scheduling policy allows only two jobs per user to gain priority at a time, so a single job running many tasks will spend less time waiting in the queue than many jobs each running a single task. This work pattern also requires much less interaction with the Slurm controller, making it less likely to cause, or be affected by, heavy load on the controller.
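The payload script referenced in the batch script above is user-provided and not shown; a minimal sketch, with my_application standing in for the real work, might look like:

payload.sh (illustrative sketch)
#!/bin/bash
# Each task receives one line of input.txt as $1
echo "task $1 starting on $(hostname)"
./my_application "$1" > "output_$1.log" 2>&1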

Running Many Tasks Inside a Multiple-Node Allocation

This pattern is demonstrated using two scripts: a batch submission to Slurm and a driver script containing the parallel and payload commands.

The batch submission requests two CPU nodes, then srun runs one instance of driver.sh on each node, forwarding the $1 argument that names the task input list.

multiple_nodes_many_tasks_parallel.sh
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH --ntasks-per-node=1

srun --no-kill --ntasks=2 --wait=0 driver.sh $1 

The --no-kill argument keeps the Slurm allocation running if any of the allocated nodes fail during the job. The --wait=0 argument prevents the job from terminating the remaining driver instances when the first one finishes.

The driver script uses environment variables set by Slurm inside a job to distinguish each instance of parallel, and then round-robin distributes input tasks to them using awk.

driver.sh
#!/bin/bash
module load parallel
# Both variables are set by Slurm inside a job allocation
if [[ -z "${SLURM_NODEID}" ]]; then
    echo "need \$SLURM_NODEID set"
    exit 1
fi
if [[ -z "${SLURM_NNODES}" ]]; then
    echo "need \$SLURM_NNODES set"
    exit 1
fi
# Select every NNODE-th input line, offset by this node's ID
# (round-robin), and feed the selected lines to parallel
cat "$1" |                                             \
awk -v NNODE="$SLURM_NNODES" -v NODEID="$SLURM_NODEID" \
'NR % NNODE == NODEID' |                               \
parallel payload.sh {}

The conditional statements make sure the needed Slurm environment variables are in place: $SLURM_NNODES holds the total number of nodes in the job and $SLURM_NODEID holds this node's unique ID. The awk command uses the line number of each input together with these two variables to implement round-robin assignment of tasks to nodes. An advantage of this method is that the number of nodes requested by the job can be changed freely without adjusting the task-to-node assignment logic.
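The round-robin filter can be tested in isolation. With two nodes, node 0 receives the even-numbered lines and node 1 the odd-numbered lines:

elvis@perlmutter:login13:~> seq 1 6 | awk -v NNODE=2 -v NODEID=0 'NR % NNODE == NODEID'
2
4
6
elvis@perlmutter:login13:~> seq 1 6 | awk -v NNODE=2 -v NODEID=1 'NR % NNODE == NODEID'
1
3
5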

Grouping Many One-Node MPI Jobs Into a Larger Job

This Slurm batch script demonstrates how parallel can distribute multiple single-node MPI tasks within a multi-node job. The batch script starts a job with 4 CPU nodes, then tells parallel to run 4 simultaneous tasks. Note that the number of nodes requested and the number of tasks run in parallel are the same, so that one instance of the MPI task script runs on each node. The job script must also request an appropriate number of tasks per node (128 on a Perlmutter CPU node) so that the same value can be used in the MPI task script.

mpi-task-job.sub
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --nodes=4
#SBATCH --constraint=cpu
#SBATCH --ntasks-per-node=128

module load parallel
parallel -j 4 mpi-task.sh {} < input.txt

Below is an example task script that would be used in combination with the above job script. The srun command launches the MPI executable, with its argument passed in via parallel and the input.txt file. Note that the srun call specifies one node and 128 tasks on that node.

mpi-task.sh
#!/bin/bash
srun -N 1 -n 128 mpi.exec $1

GNU Parallel includes a feature to distribute tasks to multiple machines over ssh connections. Though this allows work to be balanced between multiple nodes, our testing suggests that its scaling is much less effective, and a different task manager would be a better choice. More detail about this finding is available upon request.
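For reference only, given the caveat above, the ssh-based distribution looks like the following minimal sketch; the node names are placeholders:

parallel --sshlogin nid000001,nid000002 --jobs 4 payload.sh {} :::: input.txt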

Resuming Unfinished or Retrying Failed Tasks

If any tasks in a GNU parallel instance return a non-zero exit code, the parallel command will also return non-zero. Parallel can be configured to use a job log file which tracks failed or incomplete tasks so that they can be resumed or retried.

Add --resume-failed --joblog logfile.txt to the list of parallel arguments and the state of each task will be recorded. When that parallel instance is rerun with the exact same command line, it will skip any tasks that are already complete and re-run any tasks that failed. When using a joblog it is good practice to use the available Slurm environment variables to distinguish the log files for each instance of parallel.
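For example, in the multiple-node driver above, each instance of parallel could keep its own log (the log file name is illustrative):

parallel --resume-failed --joblog "joblog_node${SLURM_NODEID}.txt" payload.sh {}

Rerunning the driver in a later job submission with the same input file will then skip completed tasks and retry failed ones.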

It is very important that the input file and command line arguments not be modified between runs and that only one instance of parallel per log file run at a time.

Note that the --retries n argument seems as though it should let an instance of parallel retry a failed task, but this feature only works in combination with the --sshlogin feature.