Training Libraries¶
Training libraries provide high-level abstractions and optimizations for distributed machine learning. While you can implement data parallelism using PyTorch's native features (DDP, FSDP), the libraries discussed here also offer:
- Simplified APIs that reduce boilerplate code for distributed training
- Advanced optimization techniques for memory efficiency and large model training
- Automatic handling of mixed precision, checkpointing, and communication
- Flexibility to switch between different parallelism strategies with minimal code changes
Each library targets different use cases. Choose based on your model size, complexity needs, and desired level of control.
PyTorch Native (DDP/FSDP)¶
PyTorch provides built-in distributed training capabilities through DistributedDataParallel (DDP) and FullyShardedDataParallel (FSDP).
When to use: DDP is ideal when the model and its optimizer state fit in a single GPU's memory, while FSDP enables training larger models by sharding parameters across GPUs.
How it works: DDP replicates the model on each GPU and synchronizes gradients. FSDP shards model parameters, gradients, and optimizer states across GPUs to reduce memory usage. Both require manual setup of process groups and device management.
Running at NERSC: Use torchrun or srun to launch. See PyTorch's distributed documentation on DDP and FSDP for examples.
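For reference, a minimal DDP script might look like the sketch below; the model, data, and hyperparameters are placeholders, and the script assumes one process per GPU (e.g. torchrun --nproc_per_node=4 train.py, where train.py is this file):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).to(f"cuda:{local_rank}")  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for step in range(10):  # placeholder training loop with random data
        x = torch.randn(32, 128, device=f"cuda:{local_rank}")
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()  # gradients are averaged across all ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```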
Hugging Face Accelerate¶
Accelerate is Hugging Face's library that simplifies distributed training and mixed precision by providing a unified API that works across different hardware setups and training configurations.
When to use: When you want minimal code changes to enable distributed training, especially for Hugging Face models. Accelerate is great for researchers who want their training scripts to work seamlessly across single GPU, multi-GPU, and multi-node setups without rewriting code for each configuration.
How it works: Accelerate wraps your PyTorch training loop with the Accelerator class, which automatically handles device placement, distributed synchronization, and mixed precision. The framework handles the underlying distributed strategy (DDP, FSDP, DeepSpeed) based on your configuration. See the Accelerate docs for examples of how to wrap your standard PyTorch code to use Accelerate.
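As a rough sketch (the model, optimizer, and dataloader below are placeholders), the wrapping pattern looks like this:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the distributed setup from the launcher/config

model = torch.nn.Linear(128, 10)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 128), torch.randn(1024, 10))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

# Accelerate moves everything to the right device and wraps it for the chosen strategy
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # used in place of loss.backward()
    optimizer.step()
```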
Running at NERSC: Accelerate can be found in our pytorch module and containers. Accelerate can be launched using either accelerate launch or torchrun.
Note
When using Hugging Face, set the HF cache directory to scratch to avoid filesystem issues:
export HF_HOME=$SCRATCH/cache/huggingface
DeepSpeed¶
DeepSpeed is Microsoft's deep learning optimization library that enables training of extremely large models through memory optimization and parallelism techniques.
When to use: For very large models where you need maximum memory efficiency, advanced features like CPU offloading, or sophisticated pipeline parallelism. While PyTorch FSDP is simpler to set up and sufficient for most models, DeepSpeed remains valuable for large model training and deployments requiring advanced optimizations.
How it works: DeepSpeed implements ZeRO (Zero Redundancy Optimizer) which partitions optimizer states, gradients, and model parameters across multiple GPUs to reduce memory usage. It can offload computations and memory to CPU when GPU memory is insufficient. The framework supports hybrid parallelism combining data and pipeline parallelism. It requires configuration files that specify the optimization strategy and parallelism setup.
Running at NERSC: DeepSpeed can be found in our pytorch module and containers. Although DeepSpeed provides its own launcher, we recommend using torchrun at NERSC. When using torchrun instead of the DeepSpeed launcher, no hostfile has to be created, since Slurm takes care of resource assignment. Rather than supplying the DeepSpeed configuration file to the launcher, provide it during initialization:
    import deepspeed

    # Pass the DeepSpeed config to initialize() rather than to the launcher
    model_engine, optimizer, trainloader, _ = deepspeed.initialize(
        model=model,
        model_parameters=model_parameters,
        training_data=dataset,
        config="path/to/config.json"
    )
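For reference, a minimal ZeRO configuration might look like the sketch below; the keys are standard DeepSpeed options, but the values are illustrative and should be tuned for your model and batch size.

```python
import json

# Illustrative DeepSpeed configuration: ZeRO stage 2 with bf16 and gradient accumulation
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,           # partition optimizer states and gradients across GPUs
        "overlap_comm": True  # overlap gradient communication with the backward pass
    },
}

# Write it out and point the config argument of deepspeed.initialize at this file
with open("path/to/config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```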
Lightning¶
Lightning is a high-level framework built on top of PyTorch that organizes your code into reusable components and automates many aspects of the training process, including distributed training, checkpointing, and logging.
When to use: When you want to focus on model architecture and training logic without managing training loops, device placement, or distributed training setup. Lightning handles training orchestration automatically and supports many distributed strategies, including DDP, FSDP, and DeepSpeed, through simple configuration changes; see the implemented distribution strategies. If you want more control over your training pipeline, standard PyTorch is still recommended.
How it works: Lightning abstracts the training loop into a standardized interface through the [LightningModule](https://lightning.ai/docs/pytorch/stable/common/lightning_module.html) and [Trainer](https://lightning.ai/docs/pytorch/stable/common/trainer.html) classes. Your model inherits from LightningModule, where you define the forward pass, training/validation steps, and optimizers. The Trainer manages the actual training execution, automatically handling mixed precision, checkpointing, logging, and distributed training coordination.
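A minimal sketch (using a placeholder model and random data, and assuming the lightning package; the older pytorch_lightning import works the same way) looks like this:

```python
import torch
import lightning as L

class LitClassifier(L.LightningModule):
    """Placeholder module: define the model, the training step, and the optimizer."""

    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(128, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.model(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

dataset = torch.utils.data.TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=32)

# The Trainer handles device placement, checkpointing, logging, and the distributed strategy
trainer = L.Trainer(accelerator="gpu", devices=4, num_nodes=1, strategy="ddp", max_epochs=1)
trainer.fit(LitClassifier(), train_loader)
```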
Running at NERSC: Lightning can be found in our pytorch module and containers. We recommend running Lightning training with torchrun. Alternatively, Lightning has [SLURM integration](https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html#run-on-a-slurm-managed-cluster), but it requires special care when running in interactive sessions due to automatic environment detection.
Megatron¶
Megatron is NVIDIA's framework for efficiently training large language models using advanced model parallelism techniques, including tensor, pipeline, and sequence parallelism.
When to use: When training models with billions of parameters that exceed single-GPU memory capacity, particularly transformer-based language models. Megatron excels when data parallelism alone is not enough and model parallelism is required.
How it works: Megatron implements efficient parallelism strategies by splitting model layers across GPUs. See our page on distributed training for the different types of model parallelism and when to use them. Models must be implemented using Megatron's parallel layers.
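As a conceptual illustration only (this is plain PyTorch, not Megatron's API), column-style tensor parallelism amounts to each rank holding a slice of a layer's weight matrix and exchanging activations with collectives:

```python
import torch
import torch.distributed as dist

class NaiveColumnParallelLinear(torch.nn.Module):
    """Toy illustration of tensor parallelism: each rank owns a column slice of the
    weight matrix, computes its slice of the output, and the slices are gathered.
    A real implementation (e.g. Megatron's parallel layers) also needs a
    differentiable gather such as torch.distributed.nn.all_gather and fused kernels."""

    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # Each rank stores only out_features / world_size output columns
        self.local_linear = torch.nn.Linear(in_features, out_features // world_size)

    def forward(self, x):
        local_out = self.local_linear(x)  # partial output computed on this rank
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)  # collect the column slices from all ranks
        return torch.cat(gathered, dim=-1)    # full output available on every rank
```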
Running at NERSC: Use the torchrun launcher for running Megatron training. See [this example](https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start) to get started.
Horovod¶
Horovod is Uber's distributed training framework that provides a simple interface for data parallel training across multiple GPUs and nodes.
When to use: When you need compatibility with TensorFlow. While PyTorch's native DDP offers better performance for PyTorch models, Horovod remains useful for multi-framework environments.
How it works: Horovod uses NCCL for efficient communication between processes. It requires minimal code changes: mainly adding hvd.init(), wrapping the optimizer with hvd.DistributedOptimizer, and broadcasting the initial model state.
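A minimal sketch of those changes (with a placeholder model and random data) looks like this:

```python
import torch
import horovod.torch as hvd

hvd.init()                               # start Horovod
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU

model = torch.nn.Linear(128, 10).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3 * hvd.size())  # common choice: scale lr by world size

# Average gradients across workers during optimizer.step()
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make sure every worker starts from the same model and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):  # placeholder training loop
    x = torch.randn(32, 128, device="cuda")
    loss = model(x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```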
Running at NERSC: Launch with srun directly. See the Horovod documentation for examples.