TaskFarmer is a workflow manager developed in-house at NERSC to coordinate single or multicore tasks. It tracks which tasks have completed successfully, and allows straightforward re-submission of failed or un-run jobs from a task list.

Strengths:

• Good for long jobs (minutes to hours)
• Minimal setup required
• TaskFarmer can be restarted and will pick up where it left off
• The server/worker structure is inherently load-balancing
• Excellent support (the author of TaskFarmer is NERSC staff)

Limitations:

• Requires an extra node devoted to running the manager process
• Not well suited to complex job structures

This script launches a server on the head node of your compute allocation. The server keeps track of the tasks on your list and of the workers running on cores of the other compute nodes in your batch job. This is why you need at least two nodes for your TaskFarmer job: one node is reserved for the TaskFarmer server.

The workers check the $THREADS environment variable (this is a TaskFarmer variable, not a Slurm variable) and spin up that many threads per node to run tasks. By default, it is set to the number of available cores on a compute node. Each thread 1) requests a task from the server, 2) is assigned the next task in the task list, and 3) forks off to run the task. Once the task is complete, the thread contacts the server and requests the next task.

## Important TaskFarmer Information¶

• The TaskFarmer server requires a full compute node, so you will need to request a minimum of 2 nodes in your batch script.
• The TaskFarmer server can communicate with about 5-10 tasks per second. If more than 10 tasks contact the server simultaneously, communication will become a bottleneck. You will need to consider this limit when constructing your task list: how many tasks will you have, how long does each of them take, and how many do you expect to contact the server at the same time?
• Example 1: you have many identical tasks that all take exactly 100 seconds to run. All tasks will finish at the same time and contact the server simultaneously. This means that you should not have more than 1000 tasks or you will overwhelm the server.
• Example 2: you have many different tasks which take between 30 and 60 minutes to run. You can have many more of these tasks (thousands or more) because they will not all contact the server at the same time.
• TaskFarmer performs best with long-running jobs (minutes to hours).
• The total walltime you request should be approximately (number of tasks * task time)/$THREADS, not simply the time required to run one task.
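The walltime rule of thumb above is easy to estimate up front. The sketch below uses entirely hypothetical numbers (6000 tasks, 45 minutes each, 4 requested nodes of 32 threads, one node lost to the server):

```python
import math

# Hypothetical workload: 6000 tasks of ~45 minutes each, run on
# 3 worker nodes (4 nodes requested minus 1 for the TaskFarmer
# server), with $THREADS=32 worker threads per node.
n_tasks = 6000
task_minutes = 45
worker_nodes = 3
threads_per_node = 32

# Total concurrent workers across all worker nodes.
total_threads = worker_nodes * threads_per_node

# Walltime to request: (number of tasks * task time) / total threads,
# rounded up to whole minutes.
walltime_minutes = math.ceil(n_tasks * task_minutes / total_threads)
print(walltime_minutes)
```

With these numbers the job needs roughly 2813 minutes of walltime, far more than the 45 minutes a single task takes, which is the point of the rule above.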

In this example, we will use TaskFarmer to orchestrate this Python script (calc_sum.py):

#!/usr/bin/env python

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-a", "--avalue", type=int, required=True)
parser.add_argument("-b", "--bvalue", type=int, required=True)
parser.add_argument("-c", "--cvalue", type=int, required=True)

args = parser.parse_args()

total = args.avalue + args.bvalue + args.cvalue
print("sum is", total)
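The argument handling can be sanity-checked in isolation before farming anything out. This is a minimal sketch of the three-flag parser the script needs (flag names taken from the wrapper in step 1), fed an explicit argument list instead of sys.argv:

```python
import argparse

# Build the same three integer flags that wrapper.sh passes
# (-a, -b, -c mapping to avalue, bvalue, cvalue).
parser = argparse.ArgumentParser()
parser.add_argument("-a", "--avalue", type=int, required=True)
parser.add_argument("-b", "--bvalue", type=int, required=True)
parser.add_argument("-c", "--cvalue", type=int, required=True)

# Parse a sample argument vector instead of reading sys.argv.
args = parser.parse_args(["-a", "1", "-b", "0", "-c", "1"])
print("sum is", args.avalue + args.bvalue + args.cvalue)  # sum is 2
```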

You must make all your scripts executable

To enable TaskFarmer to run your scripts, you must chmod +x calc_sum.py and wrapper.sh.

### Step 1: Write a wrapper (wrapper.sh)¶

Next, write a wrapper script called wrapper.sh that defines a single task. It should contain the executable and any options required; the values for those options will be supplied in the next step.

#!/usr/bin/env bash

cd $SCRATCH/taskfarmer
module load python
python calc_sum.py -a $1 -b $2 -c $3

### Step 2: Create a task list (tasks.txt)¶

This is where you list all the tasks you need to run, one per line, including all job options.

#!/usr/bin/env bash

$SCRATCH/taskfarmer/wrapper.sh 0 0 1
$SCRATCH/taskfarmer/wrapper.sh 0 1 0
$SCRATCH/taskfarmer/wrapper.sh 0 1 1
$SCRATCH/taskfarmer/wrapper.sh 1 0 0
$SCRATCH/taskfarmer/wrapper.sh 1 0 1
$SCRATCH/taskfarmer/wrapper.sh 1 1 1
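For larger parameter sweeps, the task list can be generated rather than typed by hand. This sketch enumerates every 0/1 combination of the three wrapper arguments (all eight, a superset of the six lines shown above); the wrapper path and the tasks.txt file name are assumptions from this example, so adjust them for your own setup:

```python
import itertools

# Hypothetical generator: one wrapper.sh invocation per 0/1
# combination of the three arguments.
wrapper = "$SCRATCH/taskfarmer/wrapper.sh"
lines = [
    f"{wrapper} {a} {b} {c}"
    for a, b, c in itertools.product([0, 1], repeat=3)
]

# Print the task list; redirect to tasks.txt from the shell,
# or write it with open("tasks.txt", "w") instead.
print("\n".join(lines))
```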

### Step 3: Create a batch script (submit_taskfarmer.sl)¶

#!/bin/sh
#SBATCH -N 2 -c 64
#SBATCH -q debug
#SBATCH -t 00:05:00
#SBATCH -C haswell

module load taskfarmer

cd $SCRATCH/taskfarmer
export THREADS=32

runcommands.sh tasks.txt

## Output¶

You will find that several files appear in your job submission directory. Their names will depend on the name of the task list you created in step 2.