
Distributed Training

Distributed training (or fine-tuning) is often used when you have large datasets and/or large deep learning models. This page outlines guidelines (for example, which parallelization strategy to use), tools (for example, launchers for multi-node training), and workflows for distributed training and fine-tuning of deep learning applications on Perlmutter.

Distributed training can use different flavors of parallelization depending on your use case, the most common being data parallelism, followed by various forms of model parallelism (for advanced users). We outline each of these below and provide steps to help you get started on Perlmutter.

Data parallelism

When to use: If your model fits on a single device (GPU) and you want faster time to solution or want to experiment with larger batch sizes. This is the most common use case for distributed training.

How it works: Each device (GPU) holds a copy of the model (the learnable parameters: weights and biases), but each device (GPU) processes its own subset of each data batch simultaneously. During the optimizer step, the gradients are synchronized across devices using an AllReduce collective operation before being used for the parameter update.
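
A minimal sketch of this pattern with PyTorch DistributedDataParallel (DDP) is shown below. It assumes one process per GPU has already been launched (for example with `torchrun` or `srun`) so that `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, and `MASTER_PORT` are set in the environment; the model and training loop are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU, reads env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])            # adds gradient AllReduce hooks
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):                                  # placeholder training loop
        x = torch.randn(32, 1024, device=local_rank)        # each rank sees its own slice of the batch
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()                                      # gradients are AllReduced here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```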

Tools to get started:

Tensor parallelism

When to use: If your model does not fit onto a single device (GPU) because the parameters or the intermediate activation tensors are too large. This is an advanced use case.

How it works: The model parameter tensors are sharded (partitioned) across multiple devices (GPUs) and each device (GPU) works on its local shard (partition). The results are synchronized after every local computation through different communication collectives.
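
As a conceptual sketch (not a production implementation), the snippet below shards the weight of a single linear layer column-wise across ranks: each rank multiplies by its local shard, then an all-gather collective reassembles the full activation. The layer sizes are illustrative, and a process group with one process per GPU is assumed to be initialized (for example via `torchrun`).

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x, full_out_features, in_features):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_out = full_out_features // world_size          # each rank owns a slice of the output columns

    # Local shard of the weight (in a real model this is a learned parameter).
    w_shard = torch.randn(local_out, in_features, device=x.device)

    y_local = x @ w_shard.t()                            # (batch, local_out) computed locally

    # Reassemble the full output across ranks with an all-gather collective.
    gathered = [torch.empty_like(y_local) for _ in range(world_size)]
    dist.all_gather(gathered, y_local)
    return torch.cat(gathered, dim=-1)                   # (batch, full_out_features)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    x = torch.randn(8, 512, device="cuda")
    y = column_parallel_linear(x, full_out_features=1024, in_features=512)
    print(dist.get_rank(), y.shape)
    dist.destroy_process_group()
```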

Tools to get started:

Pipeline parallelism

When to use: If your model does not fit onto a single device (GPU) because the parameters are very large and your neural network has many layers (preferably identical layers, such as transformer blocks). This is an advanced use case.

How it works: Different groups of layers (stages) of the neural network are placed on different devices (GPUs). Each batch of data is split further into microbatches, which pass sequentially through the stages. Sophisticated pipeline schedules allow computation on different microbatches to overlap with communication of intermediate results to other devices (GPUs).
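
The following is a deliberately naive, forward-only sketch of two pipeline stages on two GPUs in a single process, included only to show the stage placement and micro-batching idea; real pipeline engines (for example in DeepSpeed or torch.distributed.pipelining) run stages in separate processes and schedule microbatches so that the stages overlap. Layer sizes and the number of microbatches are illustrative.

```python
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")  # first half of the layers
stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")  # second half of the layers

batch = torch.randn(64, 1024, device="cuda:0")
micro_batches = batch.chunk(4)            # split the batch into 4 microbatches

outputs = []
for mb in micro_batches:
    h = stage0(mb)                        # computed on GPU 0
    h = h.to("cuda:1")                    # intermediate activation sent to GPU 1
    outputs.append(stage1(h))             # computed on GPU 1
result = torch.cat(outputs)
```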

Tools to get started:

For advanced parallelism (tensor and pipeline), you will typically need a hybrid of data and model parallelism; most frameworks support this, as sketched below.
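
The sketch below shows one way such hybrid process groups can be set up by hand with `torch.distributed`, assuming 8 ranks arranged as 4 data-parallel replicas of 2 tensor-parallel shards each; the group sizes are illustrative, and frameworks such as Megatron-LM and DeepSpeed construct equivalent groups for you.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()     # e.g. 8
tp_size = 2                            # GPUs that jointly shard one model replica
dp_size = world_size // tp_size        # number of model replicas

rank = dist.get_rank()
tp_group = None
dp_group = None

# Ranks [0,1], [2,3], ... form tensor-parallel groups.
# new_group must be called by all ranks with the same arguments.
for start in range(0, world_size, tp_size):
    group = dist.new_group(list(range(start, start + tp_size)))
    if start <= rank < start + tp_size:
        tp_group = group

# Ranks holding the same shard position form data-parallel groups,
# e.g. [0,2,4,6] and [1,3,5,7].
for offset in range(tp_size):
    group = dist.new_group(list(range(offset, world_size, tp_size)))
    if rank % tp_size == offset:
        dp_group = group

# Gradients are then AllReduced over dp_group, while sharded layers
# communicate over tp_group.
```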