Distributed Training¶
Distributed training (or fine-tuning) is often used if you have large datasets and/or large deep learning models. This page outlines guidelines (for example, which parallelization strategy to use), tools (for example, launchers for multi-node training), and workflows relating to distributed training (and fine-tuning) of deep learning applications on Perlmutter.
Distributed training comes in different flavors of parallelization depending on your use case, the most common being data parallelism, followed by various forms of model parallelism (for advanced users). We outline these below and provide steps to help you get started with each on Perlmutter.
Data parallelism¶
When to use: If your model can fit onto a single device (GPU) and you want a faster time to solution or want to experiment with larger batch sizes. This is the most common use case for distributed training.
How it works: Each device (GPU) holds a full copy of the model (the learnable parameters: weights and biases), but each device (GPU) simultaneously processes a different subset of the global data batch. Before the parameters are updated in the optimizer step, the gradients are synchronized across devices using an AllReduce collective operation (a minimal sketch follows the list of tools below).
Tools to get started:
- We recommend PyTorch DistributedDataParallel (DDP) for easy implementation of data parallelism in PyTorch. Refer to this quick DDP example on moving your workflow from single GPU training to multi-GPU (including multi-node) using DDP on Perlmutter.
- You may also use one of several launchers for multi-GPU (and multi-node) training. In this example, we show you how to do this on Perlmutter with either HuggingFace Accelerate or torchrun.
- For TensorFlow, see this example that trains a multi-GPU workflow using MirroredStrategy. For multi-node training, we recommend Horovod (see this example for Perlmutter).
- For scaling and performance considerations, refer to our SC23 Deep Learning at Scale tutorial: Data parallelism.
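The following is a minimal, illustrative DDP training sketch, not a prescribed Perlmutter recipe: the model, dataset, and hyperparameters are placeholders, and it assumes the script is launched with torchrun (or an equivalent launcher) so that the RANK, WORLD_SIZE, LOCAL_RANK, and rendezvous environment variables are already set.

```python
# Minimal PyTorch DDP sketch (illustrative; model, dataset, and hyperparameters are placeholders).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun (or an equivalent launcher) sets RANK, WORLD_SIZE, and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic dataset standing in for your real workload.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
    # DistributedSampler gives each rank a disjoint shard of the global batch.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # DDP all-reduces gradients during the backward pass
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A single-node launch might look like `torchrun --nproc_per_node=4 train_ddp.py` (where `train_ddp.py` is a placeholder filename for the script above); for multi-node runs on Perlmutter, see the Slurm-based launch examples linked above.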
Tensor parallelism¶
When to use: If your model does not fit onto a single device (GPU) because the parameters or the intermediate activation tensors are too large. This is an advanced use case.
How it works: The model parameter tensors are sharded (partitioned) across multiple devices (GPUs), and each device (GPU) works on its local shard (partition). The results are synchronized after each local computation through different communication collectives (a minimal sketch follows the list of tools below).
Tools to get started:
- Refer to our SC23 Deep Learning at Scale tutorial: Model parallelism for guidance on setting up tensor parallelism in your PyTorch workflow on Perlmutter.
- We also recommend Megatron-LM and DeepSpeed for parallelizing large transformers (especially Large Language Models).
- You can also use HuggingFace Accelerate.
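The toy layer below illustrates the idea of column-style tensor parallelism with plain torch.distributed: each rank stores only a shard of the weight matrix, computes a partial output, and an all-gather reassembles the full result. This is a forward-only sketch under illustrative assumptions (layer sizes, LOCAL_RANK environment variable, default process group); real implementations such as Megatron-LM use custom autograd functions so gradients flow correctly through the collectives.

```python
# Toy column-parallel linear layer (illustrative sketch, not Megatron-LM's implementation).
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F

class ColumnParallelLinear(torch.nn.Module):
    """Splits the weight matrix along the output dimension across ranks."""
    def __init__(self, in_features, out_features, group=None):
        super().__init__()
        self.group = group
        world = dist.get_world_size(group)
        assert out_features % world == 0
        # Each rank stores only its shard of the full (out_features x in_features) weight.
        self.local_out = out_features // world
        self.weight = torch.nn.Parameter(torch.randn(self.local_out, in_features) * 0.02)

    def forward(self, x):
        # Local partial result: (batch, local_out)
        local_y = F.linear(x, self.weight)
        # Gather shards from all ranks and concatenate along the feature dimension.
        # Forward-only: this plain all_gather does not propagate gradients from other ranks.
        shards = [torch.empty_like(local_y) for _ in range(dist.get_world_size(self.group))]
        dist.all_gather(shards, local_y, group=self.group)
        return torch.cat(shards, dim=-1)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    layer = ColumnParallelLinear(in_features=1024, out_features=4096).cuda(local_rank)
    x = torch.randn(8, 1024, device=f"cuda:{local_rank}")
    y = layer(x)  # full (8, 4096) output assembled from the per-rank shards
    dist.destroy_process_group()
```

Note that the input x is assumed to be replicated (identical) on every rank in the tensor-parallel group, which is the usual convention for column-parallel layers.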
Pipeline parallelism¶
When to use: If your model does not fit onto a single device (GPU) because the parameters are very large and your neural network has many layers (ideally repeated identical blocks, such as transformer layers). This is an advanced use case.
How it works: Different contiguous groups of layers (stages) of the neural network are placed on different devices (GPUs). A batch of data is split further into microbatches, which pass sequentially through the stages. Sophisticated pipeline schedules overlap the computation of different microbatches with the communication of intermediate results to other devices (GPUs) (a toy sketch follows the list of tools below).
Tools to get started:
- We recommend Megatron-LM and DeepSpeed.
- You can also use HuggingFace Accelerate.
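The forward-only toy below shows the core idea of pipeline parallelism with plain torch.distributed point-to-point communication between two stages; it omits the backward pass and the interleaved schedules (for example, GPipe or 1F1B) that production frameworks implement. The two-rank layout, layer sizes, and microbatch counts are illustrative assumptions.

```python
# Naive two-stage, forward-only pipeline sketch with microbatches (illustrative only;
# real schedules such as GPipe/1F1B also interleave backward passes across microbatches).
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    assert dist.get_world_size() == 2, "this toy assumes exactly two pipeline stages"
    rank = dist.get_rank()            # rank 0 = first stage, rank 1 = last stage
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    hidden = 256
    # Each rank owns a different chunk of the network's layers.
    stage = torch.nn.Sequential(
        torch.nn.Linear(hidden, hidden), torch.nn.ReLU(), torch.nn.Linear(hidden, hidden)
    ).to(device)

    micro_batches = 4
    micro_batch_size = 8

    with torch.no_grad():  # forward-only demonstration
        for _ in range(micro_batches):
            if rank == 0:
                x = torch.randn(micro_batch_size, hidden, device=device)
                out = stage(x)
                dist.send(out, dst=1)          # hand activations to the next stage
            else:
                buf = torch.empty(micro_batch_size, hidden, device=device)
                dist.recv(buf, src=0)          # receive activations from the previous stage
                out = stage(buf)               # final-stage output for this microbatch

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```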
For advanced parallelism (tensor and pipeline), you will typically combine it with data parallelism in a hybrid scheme, and most frameworks support this.
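As an illustration of how such a hybrid can be organized, the sketch below partitions the ranks into tensor-parallel and data-parallel process groups with plain torch.distributed. Frameworks like Megatron-LM and DeepSpeed construct these groups for you; the Megatron-style rank layout and group sizes here are illustrative assumptions.

```python
# Sketch: carve the world of ranks into tensor-parallel (TP) and data-parallel (DP) groups.
import torch.distributed as dist

def build_groups(tensor_parallel_size):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % tensor_parallel_size == 0
    data_parallel_size = world_size // tensor_parallel_size

    tp_group, dp_group = None, None
    # Ranks [0..tp-1] form one tensor-parallel group, [tp..2*tp-1] the next, and so on.
    for i in range(data_parallel_size):
        ranks = list(range(i * tensor_parallel_size, (i + 1) * tensor_parallel_size))
        group = dist.new_group(ranks)   # every rank must call new_group for every group
        if rank in ranks:
            tp_group = group
    # Ranks holding the same model shard (same position within each TP group) form a DP group.
    for j in range(tensor_parallel_size):
        ranks = list(range(j, world_size, tensor_parallel_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group
    return tp_group, dp_group

# Usage (after dist.init_process_group): tensor-parallel collectives run over tp_group,
# while gradient AllReduce for data parallelism runs over dp_group.
```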