TaskFarmer is a workflow manager developed in-house at NERSC to coordinate single or multicore tasks. It tracks which tasks have completed successfully, and allows straightforward re-submission of failed or un-run jobs from a task list.

## Strengths of TaskFarmer

• Good for long jobs (minutes to hours)
• Minimal setup required
• TaskFarmer can be restarted and begin where it left off
• Server/worker structure is inherently load-balancing
• Excellent support (author of TaskFarmer is NERSC staff)

## Disadvantages of TaskFarmer

• Requires an extra node devoted to running the manager process
• Bad for short tasks (communication will be a bottleneck)
• Not well-suited for complex job structure

## How does TaskFarmer Work?

The base functionality is contained in the runcommands.sh script, which is provided by TaskFarmer. When you `module load taskfarmer`, this script is automatically added to your PATH. Type `module show taskfarmer` for more information.

This script launches a server on the head node of your compute allocation. The server keeps track of the tasks on your list and of the workers running on the cores of the other compute nodes in your batch job. This is why you need at least two nodes for a TaskFarmer job: one node is reserved for the TaskFarmer server.

## How to run TaskFarmer

### Step 0: Define your task (calc_sum.py)

In this example, we will use TaskFarmer to orchestrate this Python script (calc_sum.py):

```python
#!/usr/bin/env python

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-a", "--avalue", type=int, required=True)
parser.add_argument("-b", "--bvalue", type=int, required=True)
parser.add_argument("-c", "--cvalue", type=int, required=True)
args = parser.parse_args()

total = args.avalue + args.bvalue + args.cvalue
print("sum is", total)
```


You must make all your scripts executable. To enable TaskFarmer to run your scripts, run `chmod +x calc_sum.py wrapper.sh`.
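
Before submitting anything, it is worth running one task by hand. A self-contained check (the script is recreated inline here so the snippet runs anywhere; swap python3 for whatever Python your system provides):

```shell
# Recreate calc_sum.py inline so this snippet is self-contained.
cat > calc_sum.py <<'EOF'
#!/usr/bin/env python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-a", "--avalue", type=int, required=True)
parser.add_argument("-b", "--bvalue", type=int, required=True)
parser.add_argument("-c", "--cvalue", type=int, required=True)
args = parser.parse_args()
print("sum is", args.avalue + args.bvalue + args.cvalue)
EOF
chmod +x calc_sum.py

# Run one task by hand.
python3 calc_sum.py -a 1 -b 2 -c 3   # prints: sum is 6
```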

### Step 1: Write a wrapper (wrapper.sh)

Next, write a wrapper called wrapper.sh that defines one task to run. It should contain the executable and any options required; the actual argument values will be supplied by the task list in the next step.

```shell
#!/usr/bin/env bash

cd $SCRATCH/taskfarmer
module load python
python calc_sum.py -a $1 -b $2 -c $3
```


### Step 2: Create a task list (tasks.txt)

This is where you can list all the tasks you need, including all job options.

```shell
$SCRATCH/taskfarmer/wrapper.sh 0 0 1
$SCRATCH/taskfarmer/wrapper.sh 0 1 0
$SCRATCH/taskfarmer/wrapper.sh 0 1 1
$SCRATCH/taskfarmer/wrapper.sh 1 0 0
$SCRATCH/taskfarmer/wrapper.sh 1 0 1
$SCRATCH/taskfarmer/wrapper.sh 1 1 1
```
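
Writing long task lists by hand does not scale; a loop can generate them instead. A sketch (this emits all eight 0/1 combinations, whereas the list above shows six of them; $SCRATCH is assumed to be set, as it is on NERSC systems):

```shell
# Emit one wrapper.sh invocation per (a, b, c) combination.
for a in 0 1; do
  for b in 0 1; do
    for c in 0 1; do
      echo "$SCRATCH/taskfarmer/wrapper.sh $a $b $c"
    done
  done
done > tasks.txt

wc -l tasks.txt   # one line per task
```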


### Step 3: Create a batch script (submit_taskfarmer.sl)

Your batch script specifies the total time required to run all your tasks and the number of nodes you want to use. Assuming you are running on the Cori Haswell partition, note that you need the -c 64 option to ensure that all 64 hyperthreads are available to the workers, and export THREADS=32 to ensure that each task runs on one thread. The -N 2 option requests two compute nodes: one runs the TaskFarmer server and the other runs the tasks. tasks.txt is the task list you created in the previous step. The THREADS variable controls the number of tasks per node and defaults to 32 for Cori Haswell; you may need to lower it if, for example, your application requires additional memory or uses internal threading (e.g. OpenMP or POSIX threads).

Remember to `module load taskfarmer` before you submit your job!

```shell
#!/bin/sh
#SBATCH -N 2 -c 64
#SBATCH -q debug
#SBATCH -t 00:05:00
#SBATCH -C haswell

export THREADS=32
runcommands.sh tasks.txt
```

### Step 4: Run TaskFarmer

Load the TaskFarmer module and submit the batch script:

```shell
module load taskfarmer
sbatch submit_taskfarmer.sl
```


## Output

You will find several files appear in your job submission directory. Their names will depend on the name of the tasklist you created in step 2.

• tasks.txt.tfin: Repetition of tasklist, formatted for use by TaskFarmer.
• progress.tasks.txt.tfin: A space-delimited file that captures tracking information for each successful task; a line count of this file shows how many tasks have completed. Its fields are: (1) the byte offset into the formatted task list file, (2) the node/thread that ran the task, (3) the task runtime in seconds, (4) ignored, (5) the completion time in UNIX time format, (6) ignored.
• log.tasks.txt.tfin: Any error messages from the tasks. This is useful for identifying errors for particular tasks.
• fastrecovery.tasks.txt.tfin: A checkpoint file that tracks which tasks have not yet completed. If your job fails or hits its wall-time limit, re-run TaskFarmer with the same options and it will re-run any in-flight tasks and resume progress.
• done.tasks.txt.tfin: Produced once all tasks are completed.
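
Because these bookkeeping files are plain text, progress is easy to inspect with standard tools. For example (the progress lines below are fabricated purely to illustrate the field layout described above):

```shell
# Fabricated progress file: offset, node/thread, runtime (s),
# ignored, completion time, ignored.
cat > progress.tasks.txt.tfin <<'EOF'
0 nid00012-3 12 x 1650000000 x
41 nid00012-7 15 x 1650000003 x
82 nid00013-1 9 x 1650000004 x
EOF

wc -l < progress.tasks.txt.tfin   # number of completed tasks
awk '{s += $3} END {print "total runtime:", s, "s"}' progress.tasks.txt.tfin
```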

## Resuming and Rerunning a TaskFarmer Job

TaskFarmer is designed to recover easily from failures or wall-time limits. For example, if a job hits its wall-time limit, you can resubmit the same batch script and it will resume where it left off. A "done" file is created once the job has finished (i.e. all tasks have completed successfully). If you wish to re-run the same job from scratch, delete the progress and done files (e.g. done.tasks.txt.tfin and progress.tasks.txt.tfin).
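
For the tasks.txt example above, a from-scratch re-run amounts to the following (the sbatch call is guarded so the snippet is harmless on systems without Slurm):

```shell
# Delete the bookkeeping files so TaskFarmer starts over.
rm -f progress.tasks.txt.tfin done.tasks.txt.tfin

# Resubmit the same batch script (requires Slurm, e.g. a NERSC login node).
if command -v sbatch >/dev/null; then
    sbatch submit_taskfarmer.sl
fi
```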