Launchers

In distributed machine learning, a launcher is the tool or command that starts your training processes across one or more compute nodes. A launcher will take care of the following:

  • Start the correct number of processes across nodes and GPUs
  • Set environment variables needed for distributed communication (e.g., RANK, WORLD_SIZE, MASTER_ADDR)
  • Coordinate process startup and synchronization

Your training code then uses these environment variables to initialize the communication backend (e.g., NCCL for GPU training, Gloo for CPU).
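
For example, a minimal sketch of how a training script might consume these variables, assuming the launcher has already exported RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT:

import os
import torch
import torch.distributed as dist

# These variables are set by the launcher (srun wrapper, torchrun, accelerate launch, ...)
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# NCCL for GPU training, Gloo for CPU-only training
backend = "nccl" if torch.cuda.is_available() else "gloo"

# init_method="env://" reads MASTER_ADDR and MASTER_PORT from the environment
dist.init_process_group(backend=backend, init_method="env://")
print(f"Process {rank} of {world_size} initialized with backend {backend}")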

At NERSC, we generally recommend launching distributed jobs using srun or torchrun. Using srun requires you to set the distributed environment variables yourself (offering more control), while torchrun handles this setup automatically. For Hugging Face's ecosystem, accelerate launch can also be used.

srun

When to use: Any distributed training where you want direct control over process launching and environment setup. This is the most flexible approach but requires manual setup of variables for distributed communication.

How it works: srun directly spawns processes across nodes based on the Slurm resource allocation. You need to manually set the environment variables required for distributed training, such as RANK, LOCAL_RANK, and WORLD_SIZE.

Example: This is an example of submitting a distributed training job train.py with srun, using a Shifter container (with the nccl plugin for better communication performance).

#!/bin/bash
#SBATCH -C gpu
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --image=nersc/pytorch:25.02.01
#SBATCH --module=gpu,nccl-plugin

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

srun shifter \
  bash -c "
    export RANK=\$SLURM_PROCID
    export LOCAL_RANK=\$SLURM_LOCALID
    export WORLD_SIZE=\$SLURM_NTASKS
    python train.py
  "

These environment variables can then be used, e.g. when setting up PyTorch's distributed process group:

torch.distributed.init_process_group(backend="nccl", init_method="env://")
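
For example, a minimal sketch of the corresponding setup inside train.py, using LOCAL_RANK to pin each process to its own GPU (the model below is an illustrative placeholder):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# RANK, LOCAL_RANK, and WORLD_SIZE were exported in the srun wrapper above
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="nccl", init_method="env://")

# Toy model for illustration; each process drives the GPU matching its LOCAL_RANK
model = torch.nn.Linear(32, 32).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])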

torchrun

When to use: Any PyTorch distributed training including data parallelism (DDP), fully sharded data parallelism (FSDP), tensor parallelism, or hybrid approaches. This is PyTorch's native distributed launcher and works with all distributed strategies.

How it works: torchrun handles process spawning and sets up the necessary environment variables for distributed communication, such as RANK, WORLD_SIZE, and LOCAL_RANK. For a complete list of available environment variables, see the PyTorch documentation. torchrun uses a rendezvous backend to coordinate process initialization across nodes.

If you're writing a distributed training script from scratch, e.g. using DDP or FSDP, you'll need to initialize the process group at the beginning of your script:

import torch.distributed as dist
dist.init_process_group(backend='nccl', init_method="env://")  # Use 'gloo' instead of NCCL for CPU training
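
Beyond initialization, a typical script queries the process group for its rank and cleans it up on exit. A minimal sketch of that surrounding skeleton (the structure is illustrative, not required by torchrun):

import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl", init_method="env://")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # ... build the model, wrap it in DDP/FSDP, run the training loop ...

    # Restrict logging and checkpointing to a single process
    if rank == 0:
        print(f"Finished training across {world_size} processes")

    # Clean up the process group before exiting
    dist.destroy_process_group()

if __name__ == "__main__":
    main()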

Example: This is an example of submitting a distributed training job train.py with torchrun, using a Shifter container (with the nccl plugin for better communication performance). Note that we use only one task per node, since torchrun itself spawns one process per GPU.

#!/bin/bash
#SBATCH -C gpu
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=128
#SBATCH --image=nersc/pytorch:25.02.01
#SBATCH --module=gpu,nccl-plugin

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export OMP_NUM_THREADS=8

srun shifter \
  torchrun \
  --nnodes=$SLURM_JOB_NUM_NODES \
  --nproc-per-node=$SLURM_GPUS_PER_NODE \
  --rdzv-backend=c10d \
  --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT \
  train.py

accelerate launch

When to use: When running training scripts that use the Hugging Face Accelerate library for distributed training. This launcher is designed to work with Accelerate's API and automatically configures the distributed environment for scripts using Accelerator().

How it works: accelerate launch spawns the training processes and can be used to pass options to the Hugging Face Accelerator (such as whether to use mixed precision during training). Note that torchrun can also be used instead of accelerate launch.

For detailed configuration options, see the Accelerate documentation.
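
A script launched this way typically builds an Accelerator and lets it place the model, optimizer, and data loader on the right devices. A minimal sketch (the model and data below are illustrative placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the distributed setup created by accelerate launch

# Toy model, optimizer, and data for illustration
model = torch.nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64)

# prepare() moves everything to the right device and wraps the model for distributed training
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # use this instead of loss.backward()
    optimizer.step()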

Note

When using Hugging Face, set the HF cache directory to scratch to avoid filesystem issues:

export HF_HOME=$SCRATCH/cache/huggingface

Example: This is an example of submitting a distributed training job train.py with accelerate launch, using a Shifter container (with the nccl plugin for better communication performance). Note that we use only one task per node, since accelerate launch spawns the per-GPU processes itself.

#!/bin/bash
#SBATCH -C gpu
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=128
#SBATCH --image=nersc/pytorch:25.02.01
#SBATCH --module=gpu,nccl-plugin

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export OMP_NUM_THREADS=8

# Set the HF cache directory
export HF_HOME=$SCRATCH/cache/huggingface

srun shifter bash -c "
  accelerate launch \
  --num_processes $((SLURM_NNODES * SLURM_GPUS_PER_NODE)) \
  --num_machines $SLURM_NNODES \
  --rdzv_backend c10d \
  --main_process_ip $MASTER_ADDR \
  --main_process_port $MASTER_PORT \
  --machine_rank \$SLURM_PROCID \
  --multi_gpu \
  train.py
"