Overview

NERSC provides a variety of tools and frameworks to help users run machine learning workloads efficiently on Perlmutter.

This section summarizes training launchers, training libraries, and hyperparameter optimization tools that we recommend for distributed training. For each tool, we describe:

  • When to use it — the scenarios where this tool is most effective
  • How it works — the key concepts and approach
  • Running at NERSC — specific examples, modules, and tips for Perlmutter

These tools can help you:

  • Launch distributed training across multiple nodes and GPUs
  • Scale training efficiently with minimal code changes (see the sketch after this list)
  • Optimize hyperparameters and manage experiment sweeps
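
As an illustration of the "minimal code changes" point above, the sketch below shows how a PyTorch script can pick up the rank information that Slurm's srun exports for each task and wrap a model in DistributedDataParallel. This is a minimal sketch, not an official NERSC example: it assumes MASTER_ADDR and MASTER_PORT are exported in the batch script, and the Linear layer stands in for a real model.

```python
# Minimal sketch (illustrative only): data-parallel PyTorch training driven by
# the environment variables that Slurm's srun sets for each task.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    # srun provides SLURM_PROCID, SLURM_NTASKS, and SLURM_LOCALID.
    # MASTER_ADDR and MASTER_PORT are assumed to be exported in the batch
    # script (e.g. the hostname of the first allocated node).
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return local_rank

def main():
    local_rank = setup_distributed()
    # Stand-in for a real model; the existing training loop is unchanged.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    # ... training loop here, one process per GPU ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A launcher such as srun or torchrun (covered under Training Launchers below) then starts one such process per GPU across the allocated nodes, four per Perlmutter GPU node.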

Categories covered:

  1. Training Launchers — Tools for starting distributed training jobs

  2. Training Libraries — Frameworks for scaling and optimizing training

  3. Hyperparameter Optimization — Tools for automated tuning and experiment management (a minimal sweep example follows)
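
As a taste of the third category, the sketch below runs an automated sweep with Optuna, used here purely as an example of a hyperparameter optimization library rather than a specific NERSC recommendation; the search space and objective are placeholders for a real training run.

```python
# Minimal sketch (illustrative only): a small hyperparameter sweep with Optuna.
import optuna

def objective(trial):
    # Hypothetical search space: learning rate and batch size.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
    # In practice, train the model here and return the validation loss.
    # A toy stand-in keeps the sketch self-contained and runnable:
    return (lr - 1e-3) ** 2 + 1.0 / batch_size

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print("Best parameters:", study.best_params)
```

The tool-specific pages in this section describe how to scale sweeps like this one across many Perlmutter nodes and track the resulting experiments.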