## Launchers
In distributed machine learning, a launcher is the tool or command that starts your training processes across one or more compute nodes. A launcher will take care of the following:
- Start the correct number of processes across nodes and GPUs
- Set environment variables needed for distributed communication (e.g., `RANK`, `WORLD_SIZE`, `MASTER_ADDR`)
- Coordinate process startup and synchronization
Your training code then uses these environment variables to initialize the communication backend (e.g., NCCL for GPU training, Gloo for CPU).
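For illustration, here is a minimal sketch of how a training script might consume these variables, assuming the launcher has exported `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` (as in the examples below) and that a GPU is available:

```python
import os

import torch
import torch.distributed as dist

# Sketch only: these variables are assumed to be set by the launcher.
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Bind this process to its GPU and initialize the communication backend.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

print(f"rank {rank}/{world_size} using local GPU {local_rank}")
```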
At NERSC, we generally recommend launching distributed jobs using `srun` or `torchrun`. Using `srun` requires the user to set their own environment variables (offering more control), while `torchrun` handles this setup automatically. For Hugging Face's ecosystem, `accelerate launch` can also be used.
### srun
When to use: Any distributed training where you want direct control over process launching and environment setup. This is the most flexible approach but requires manual setup of variables for distributed communication.
How it works: `srun` directly spawns processes across nodes based on the Slurm resource allocation. You need to manually set the environment variables required for distributed training, such as `RANK`.
Example: This is an example of submitting a distributed training job `train.py` with `srun`, using a Shifter container (with the NCCL plugin for better communication performance).
```bash
#!/bin/bash
#SBATCH -C gpu
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --image=nersc/pytorch:25.02.01
#SBATCH --module=gpu,nccl-plugin

# Use the first node in the allocation as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

# Map Slurm process information to the variables PyTorch expects
srun shifter \
    bash -c "
    export RANK=\$SLURM_PROCID
    export LOCAL_RANK=\$SLURM_LOCALID
    export WORLD_SIZE=\$SLURM_NTASKS
    python train.py
    "
```
These environment variables can then be used, e.g. when setting up PyTorch's distributed process group:

```python
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
```
### torchrun
When to use: Any PyTorch distributed training including data parallelism (DDP), fully sharded data parallelism (FSDP), tensor parallelism, or hybrid approaches. This is PyTorch's native distributed launcher and works with all distributed strategies.
How it works: `torchrun` handles process spawning and sets up the necessary environment variables for distributed communication, such as `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`. For a complete list of available environment variables, see the PyTorch documentation. The tool uses a rendezvous backend to coordinate process initialization across nodes.
If you're writing a distributed training script from scratch, e.g. using DDP or FSDP, you'll need to initialize the process group at the beginning of your script:
```python
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")  # Use "gloo" instead of "nccl" for CPU training
```
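For context, a minimal DDP script that `torchrun` could launch might look like the following sketch. The model, data, and hyperparameters are placeholders, not part of the original example:

```python
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # Placeholder model; replace with your own architecture.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(100):
        # Placeholder batch; in practice use a DataLoader with a DistributedSampler.
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```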
Example: This is an example of submitting a distributed training job `train.py` with `torchrun`, using a Shifter container (with the NCCL plugin for better communication performance). Note that we only use one task per node, since `torchrun` itself spawns one process per GPU.
```bash
#!/bin/bash
#SBATCH -C gpu
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=128
#SBATCH --image=nersc/pytorch:25.02.01
#SBATCH --module=gpu,nccl-plugin

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export OMP_NUM_THREADS=8

srun shifter \
    torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc-per-node=$SLURM_GPUS_PER_NODE \
    --rdzv-backend=c10d \
    --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py
```
### accelerate launch
When to use: When running training scripts that use the Hugging Face Accelerate library for distributed training. This launcher is designed to work with Accelerate's API and automatically configures the distributed environment for scripts using `Accelerator()`.
How it works: `accelerate launch` can be used to pass options to the Hugging Face Accelerator (such as whether to use mixed precision during optimization). Note that `torchrun` can also be used instead of `accelerate launch`.
For detailed configuration options, see the Accelerate documentation.
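As a reference point, a training script launched this way typically builds on the `Accelerator` API roughly as in this sketch. The model, optimizer, and data are placeholders, not taken from the original:

```python
import torch
import torch.nn.functional as F
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset


def main():
    # Accelerator() picks up the distributed configuration set by accelerate launch.
    accelerator = Accelerator()

    # Placeholder model, optimizer, and data; replace with your own.
    model = torch.nn.Linear(128, 10)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    dataloader = DataLoader(dataset, batch_size=32)

    # prepare() moves everything to the right device and wraps the model for distributed training.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for x, y in dataloader:
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        accelerator.backward(loss)  # use accelerator.backward instead of loss.backward
        optimizer.step()


if __name__ == "__main__":
    main()
```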
Note
When using Hugging Face, set the HF cache directory to scratch to avoid filesystem issues:
```bash
export HF_HOME=$SCRATCH/cache/huggingface
```
Example: This is an example of submitting a distributed training job `train.py` with `accelerate launch`, using a Shifter container (with the NCCL plugin for better communication performance). Note that we only use one task per node.
```bash
#!/bin/bash
#SBATCH -C gpu
#SBATCH --nodes=<num_nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=128
#SBATCH --image=nersc/pytorch:25.02.01
#SBATCH --module=gpu,nccl-plugin

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export OMP_NUM_THREADS=8

# Set the HF cache directory
export HF_HOME=$SCRATCH/cache/huggingface

srun shifter bash -c "
    accelerate launch \
    --num_processes $((SLURM_NNODES * SLURM_GPUS_PER_NODE)) \
    --num_machines $SLURM_NNODES \
    --rdzv_backend c10d \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --multi_gpu \
    train.py
    "
```