Checkpoint/Restart Overview
Checkpointing is the action of saving the state of a running process to a checkpoint image file. The process can later be restarted from the checkpoint file and continue from where it left off, possibly on a different computer.
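To make the save-and-resume cycle concrete, here is a minimal sketch of application-level checkpointing in Python. It only illustrates the idea: a transparent C/R tool checkpoints the whole process without code changes like these. The function names (`load_state`, `save_state`, `run`), the state layout, and the `stop_after` parameter (which simulates a job being cut off) are all hypothetical.

```python
import os
import pickle
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint file exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Write the checkpoint via a temporary file and an atomic rename,
    so an interruption cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    stop_after (hypothetical) simulates the job being stopped after that
    many steps in the current run, e.g. by a walltime limit.
    """
    state = load_state(path)  # resume if a checkpoint exists
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # "job ended"; progress lives in the checkpoint
        state["total"] += state["step"]
        state["step"] += 1
        save_state(state, path)
        done_this_run += 1
    return state
```

Calling `run("state.pkl", 10, stop_after=4)` and then `run("state.pkl", 10)` finishes the same computation across two runs; the second call picks up at step 4 instead of starting over.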
Checkpoint/Restart (C/R) is critical to fault-tolerant computing and is especially desirable for HPC centers like NERSC. From the user's perspective, C/R enables jobs to run longer than the walltime limit and improves job throughput: a long-running job can be split into multiple shorter ones that better fit the gaps in the job schedule created by Slurm. From NERSC's perspective, it offers flexibility in scheduling jobs and system maintenance, enables preemption in favor of time-sensitive jobs (e.g., real-time data processing for experimental facilities), and improves backfill when the system is drained for large jobs, which increases system utilization.
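As a rough illustration of splitting a long job, the sketch below estimates how many chained jobs are needed to finish a computation under a walltime limit. The function name `jobs_needed`, the `overhead_hours` parameter, and all numbers are made up for the example; they are not NERSC limits or measured checkpoint costs.

```python
import math


def jobs_needed(total_hours, walltime_limit_hours, overhead_hours=0.0):
    """Estimate the number of chained jobs required to finish a computation.

    overhead_hours is the (assumed) time each job spends on restarting
    and checkpointing rather than on useful work.
    """
    useful = walltime_limit_hours - overhead_hours
    if useful <= 0:
        raise ValueError("overhead must be smaller than the walltime limit")
    return math.ceil(total_hours / useful)


# A 100-hour computation under a hypothetical 48-hour limit needs 3 jobs.
print(jobs_needed(100, 48))
# With shorter 12-hour jobs and 0.5 hours of overhead each, it needs 9.
print(jobs_needed(100, 12, overhead_hours=0.5))
```

Shorter segments need more jobs, but each one is easier for the scheduler to fit into a gap, which is the throughput trade-off described above.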
Creating a C/R tool for HPC applications that is transparent to users, however, is challenging. It requires extensive development and maintenance effort because HPC systems change constantly and production workloads are diverse at all scales. MPI support is especially challenging: each combination of MPI implementation (e.g., MPICH, Open MPI, Cray MPICH) and network (e.g., TCP/IP, InfiniBand, Slingshot) could require maintaining a separate version of the C/R code. In addition, transparent checkpointing and restarting often requires cooperation among MPI, OS kernel, and batch system developers, which has proven hard to sustain over time. As a result, there are no ready-to-use C/R tools for users who work with cutting-edge HPC computers, which often deploy new networks and hardware.
Distributed MultiThreaded Checkpointing (DMTCP) takes a different approach by living completely in user space. No OS kernel modifications or hooks into MPI libraries are required. A new implementation of DMTCP, MANA for MPI (MPI-Agnostic Network-Agnostic Transparent Checkpointing), addresses the M×N maintenance problem of supporting M MPI implementations on N networks, and has been shown to scale to a large number of processes. Although MANA may still need separate code bases for emerging hardware, it is a major step toward ready-to-use C/R tools on future HPC platforms.
Both DMTCP and MANA are available on Perlmutter. You are encouraged to checkpoint/restart your MPI jobs with MANA. If you run serial or threaded applications, we recommend that you use DMTCP (the traditional implementation, which does not support MPI) to checkpoint your jobs. The MANA and DMTCP pages have more information about using these tools at NERSC.
Warning
Checkpoint/Restart is not available for GPU applications.
NERSC has been collaborating closely with the DMTCP/MANA team to get DMTCP/MANA working reliably with production workloads at NERSC. Please report any issues you encounter with DMTCP/MANA to NERSC's Help Desk.