Training Libraries

Training libraries provide high-level abstractions and optimizations for distributed machine learning. While you can implement data parallelism using PyTorch's native features (DDP, FSDP), the libraries discussed here also offer:

  • Simplified APIs that reduce boilerplate code for distributed training
  • Advanced optimization techniques for memory efficiency and large model training
  • Automatic handling of mixed precision, checkpointing, and communication
  • Flexibility to switch between different parallelism strategies with minimal code changes

Each library targets different use cases. Choose based on your model size, complexity needs, and desired level of control.

PyTorch Native (DDP/FSDP)

PyTorch provides built-in distributed training capabilities through DistributedDataParallel (DDP) and FullyShardedDataParallel (FSDP).

When to use: DDP is ideal for models that fit in GPU memory, while FSDP enables training larger models by sharding parameters across GPUs.

How it works: DDP replicates the model on each GPU and synchronizes gradients. FSDP shards model parameters, gradients, and optimizer states across GPUs to reduce memory usage. Both require manual setup of process groups and device management.
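
For illustration, a minimal DDP setup might look like the following sketch, assuming the job is launched with torchrun (which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each process); MyModel and the training loop are placeholders. Switching to FSDP mainly means wrapping the model with torch.distributed.fsdp.FullyShardedDataParallel instead.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Join the process group created by the launcher and pin one GPU per process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)           # MyModel is a placeholder
model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced automatically

# ... standard training loop; use a DistributedSampler in your DataLoader ...

dist.destroy_process_group()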

Running at NERSC: Use torchrun or srun to launch. See PyTorch's distributed documentation on DDP and FSDP for examples.

Hugging Face Accelerate

Accelerate is Hugging Face's library that simplifies distributed training and mixed precision by providing a unified API that works across different hardware setups and training configurations.

When to use: When you want minimal code changes to enable distributed training, especially for Hugging Face models. Accelerate is great for researchers who want their training scripts to work seamlessly across single GPU, multi-GPU, and multi-node setups without rewriting code for each configuration.

How it works: Accelerate wraps your PyTorch training loop with the Accelerator class, which automatically handles device placement, distributed synchronization, and mixed precision. The framework selects the underlying distributed strategy (DDP, FSDP, DeepSpeed) based on your configuration. See the Accelerate docs for examples of how to wrap standard PyTorch code.
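
As a rough sketch (assuming you already have a PyTorch model, optimizer, loss function, and dataloader), the core changes are preparing these objects with the Accelerator and replacing loss.backward():

from accelerate import Accelerator

accelerator = Accelerator()  # picks up the launch configuration (DDP, FSDP, DeepSpeed, ...)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch, labels in dataloader:          # the prepared dataloader places batches on the right device
    optimizer.zero_grad()
    loss = loss_fn(model(batch), labels)  # loss_fn is a placeholder
    accelerator.backward(loss)            # replaces loss.backward()
    optimizer.step()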

Running at NERSC: Accelerate can be found in our pytorch module and containers. It can be launched using either accelerate launch or torchrun.

Note

When using Hugging Face, set the HF cache directory to scratch to avoid filesystem issues:

export HF_HOME=$SCRATCH/cache/huggingface

DeepSpeed

DeepSpeed is Microsoft's deep learning optimization library that enables training of extremely large models through memory optimization and parallelism techniques.

When to use: Very large models where you need maximum memory efficiency, advanced features like CPU offloading, or sophisticated pipeline parallelism. While PyTorch FSDP is simpler to set up and sufficient for most models, DeepSpeed remains valuable for large model training and deployments requiring advanced optimizations.

How it works: DeepSpeed implements ZeRO (Zero Redundancy Optimizer) which partitions optimizer states, gradients, and model parameters across multiple GPUs to reduce memory usage. It can offload computations and memory to CPU when GPU memory is insufficient. The framework supports hybrid parallelism combining data and pipeline parallelism. It requires configuration files that specify the optimization strategy and parallelism setup.
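
For illustration, a minimal ZeRO stage 2 configuration might look like the sketch below. It is written here as a Python dict, which deepspeed.initialize also accepts in place of a JSON file; the values are placeholders.

ds_config = {
    "train_batch_size": 32,                      # global batch size across all GPUs
    "fp16": {"enabled": True},                   # mixed precision
    "zero_optimization": {
        "stage": 2,                              # partition optimizer states and gradients
        "offload_optimizer": {"device": "cpu"},  # optional CPU offloading
    },
}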

Running at NERSC: DeepSpeed can be found in our pytorch module and containers. Although DeepSpeed provides its own launcher, we recommend using torchrun at NERSC. With torchrun, no hostfile needs to be created, since Slurm takes care of resource assignment. Rather than supplying the DeepSpeed configuration file to the launcher, provide it during initialization:

import deepspeed

# The returned engine wraps the model; the config is supplied here
# instead of on the launcher command line.
model_engine, optimizer, trainloader, _ = deepspeed.initialize(
  model=model,
  model_parameters=model_parameters,
  training_data=dataset,
  config="path/to/config.json"
)
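
In the training loop, the returned engine then takes over the backward pass and optimizer step. A sketch, with criterion as a placeholder loss function:

for batch, labels in trainloader:
    batch = batch.to(model_engine.device)
    labels = labels.to(model_engine.device)
    loss = criterion(model_engine(batch), labels)
    model_engine.backward(loss)   # handles loss scaling and ZeRO gradient partitioning
    model_engine.step()           # optimizer step, LR schedule, and gradient zeroing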

Lightning

Lightning is a high-level framework built on top of PyTorch that organizes your code into reusable components and automates many aspects of the training process, including distributed training, checkpointing, and logging.

When to use: When you want to focus on model architecture and training logic without managing training loops, device placement, or distributed training setup. Lightning handles training orchestration automatically and supports many distributed strategies, including DDP, FSDP, and DeepSpeed, through simple configuration changes; see the implemented distributed strategies. If you want more control over your training pipeline, standard PyTorch is still recommended.

How it works: Lightning abstracts the training loop into a standardized interface through the [LightningModule](https://lightning.ai/docs/pytorch/stable/common/lightning_module.html) and [Trainer](https://lightning.ai/docs/pytorch/stable/common/trainer.html) classes. Your model inherits from LightningModule where you define the forward pass, training/validation steps, and optimizers. The Trainer manages the actual training execution, automatically handling mixed precision, checkpointing, logging, and distributed training coordination.
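
A minimal sketch of this structure, using the Lightning 2.x import style (the model, data, and Trainer settings are placeholders):

import lightning as L
import torch

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 2)  # placeholder model

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.net(x), y)  # Lightning runs backward/step

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

trainer = L.Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy="ddp")
trainer.fit(LitModel(), train_dataloaders=train_loader)  # train_loader is a placeholder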

Running at NERSC: Lightning can be found in our pytorch module and containers. We recommend running Lightning training with torchrun. Alternatively, Lightning has [SLURM integration](https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html#run-on-a-slurm-managed-cluster), but it requires special care when running in interactive sessions due to automatic environment detection.

Megatron

Megatron is NVIDIA's framework for efficiently training large language models using advanced model parallelism techniques including tensor, pipeline, and sequence parallelism.

When to use: When training models with billions of parameters that exceed single-GPU memory capacity, particularly transformer-based language models. Megatron is well suited to models that require model parallelism in addition to data parallelism.

How it works: Megatron implements efficient parallelism strategies by splitting model layers across GPUs. See our page on distributed training for different types of model parallelism and when to use them. Models must be implemented using Megatron's parallel layers.
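
As a rough illustration of the setup (API details vary across Megatron-Core versions; see the quick-start guide linked below), model-parallel process groups are created on top of torch.distributed before the parallel layers are constructed:

import torch
from megatron.core import parallel_state

torch.distributed.init_process_group(backend="nccl")

# Split the GPUs into tensor- and pipeline-parallel groups;
# the remaining ranks form the data-parallel groups.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=2,
)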

Running at NERSC: Use the torchrun launcher for running Megatron training. See [this example](https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start) to get started.

Horovod

Horovod is Uber's distributed training framework that provides a simple interface for data parallel training across multiple GPUs and nodes.

When to use: When you need compatibility with TensorFlow. While PyTorch's native DDP typically offers better performance for PyTorch models, Horovod remains useful for multi-framework environments.

How it works: Horovod uses NCCL for efficient communication between processes. It requires minimal code changes: mainly adding hvd.init(), wrapping the optimizer with hvd.DistributedOptimizer, and broadcasting the initial model state.
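
The typical PyTorch changes look roughly like this (the model and optimizer are assumed to already exist):

import horovod.torch as hvd
import torch

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # one GPU per process

# Wrap the optimizer so gradients are averaged across workers
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Make sure every worker starts from the same model and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)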

Running at NERSC: Launch with srun directly. See the Horovod documentation for examples.