Overview

NERSC provides a variety of tools and frameworks to help users run machine learning workloads efficiently on Perlmutter.

This section summarizes training launchers, training libraries, and hyperparameter optimization tools that we recommend for distributed training. For each tool, we describe:

  • When to use it — the scenarios where this tool is most effective
  • How it works — the key concepts and approach
  • Running at NERSC — specific examples, modules, and tips for Perlmutter

These tools can help you:

  • Launch distributed training across multiple nodes and GPUs
  • Scale training efficiently with minimal code changes (see the sketch after this list)
  • Optimize hyperparameters and manage experiment sweeps
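
As an illustration of the "minimal code changes" point above, the sketch below shows how a PyTorch script can pick up the rank information that Slurm's srun exports for each task and wrap a model in DistributedDataParallel. This is a minimal sketch, not an official NERSC example: it assumes MASTER_ADDR and MASTER_PORT are exported in the batch script, and the Linear layer stands in for a real model.

```python
# Minimal sketch (illustrative only): data-parallel PyTorch training driven by
# the environment variables that Slurm's srun sets for each task.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    # srun provides SLURM_PROCID, SLURM_NTASKS, and SLURM_LOCALID.
    # MASTER_ADDR and MASTER_PORT are assumed to be exported in the batch
    # script (e.g. the hostname of the first allocated node).
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return local_rank

def main():
    local_rank = setup_distributed()
    # Stand-in for a real model; the existing training loop is unchanged.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    # ... training loop here, one process per GPU ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A launcher such as srun or torchrun (covered under Training Launchers below) then starts one such process per GPU across the allocated nodes, four per Perlmutter GPU node.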

Categories covered:

  1. Training Launchers — Tools for starting distributed training jobs

  2. Training Libraries — Frameworks for scaling and optimizing training

  3. Hyperparameter Optimization — Tools for automated tuning and experiment management (a minimal sweep example follows)
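
As a taste of the third category, the sketch below runs an automated sweep with Optuna, used here purely as an example of a hyperparameter optimization library rather than a specific NERSC recommendation; the search space and objective are placeholders for a real training run.

```python
# Minimal sketch (illustrative only): a small hyperparameter sweep with Optuna.
import optuna

def objective(trial):
    # Hypothetical search space: learning rate and batch size.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
    # In practice, train the model here and return the validation loss.
    # A toy stand-in keeps the sketch self-contained and runnable:
    return (lr - 1e-3) ** 2 + 1.0 / batch_size

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print("Best parameters:", study.best_params)
```

The tool-specific pages in this section describe how to scale sweeps like this one across many Perlmutter nodes and track the resulting experiments.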