Containerized Checkpoint-Restart (C/R) Mechanisms for High-Performance Computing (HPC)¶
High-Performance Computing (HPC) systems are crucial for solving complex scientific problems, but resource management, fault tolerance, and maintaining consistent performance across diverse environments remain difficult. Container technologies such as NERSC's Shifter and Podman-HPC address some of these challenges. We used Distributed MultiThreaded CheckPointing (DMTCP) to implement robust Checkpoint-Restart (C/R) mechanisms that handle fault tolerance and resource management within containerized environments.
This section highlights successful C/R implementations on Perlmutter at NERSC using Shifter, Podman-HPC, and Apptainer. Because Apptainer is widely adopted in the HPC container space, these examples show how C/R within containers could be used at other HPC centers.
Containers play a critical role in optimizing high-performance computing (HPC) workflows. For checkpoint-restart (C/R), containers become significantly more capable when DMTCP is integrated into them, allowing jobs to be paused, resumed, and migrated without restarting computations from scratch. By enabling long-running tasks to resume smoothly after interruptions, containers help reduce computational overhead, improve resource utilization, and optimize job scheduling. As a result, containers have become essential for operational resilience, consistent performance, and cost-effectiveness in HPC environments.
The figure illustrates the automated job management process within NERSC's containerized HPC environment. The workflow covers the entire lifecycle of a computational job, from submission to execution, checkpointing, and signal trapping for job resubmission. When a job reaches its time limit or encounters a termination signal, the checkpoint-restart (C/R) mechanism activates, capturing the job's state and requeuing it to resume later. The diagram highlights the decision-making flow following a termination signal, showing how the system either completes the job or restarts it based on the checkpoint data. This automated C/R strategy ensures efficient use of resources by enabling jobs to continue from their last saved state, minimizing downtime and maximizing computational efficiency. This is a visual representation of how DMTCP-enabled C/R is integrated into the containerized environment, emphasizing the seamless interaction between job scheduling, signal handling, and resource allocation within NERSC’s Shifter and Podman-HPC containers.
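In shell terms, the essence of this lifecycle fits in a few lines. The sketch below is illustrative only: checkpoint_and_requeue and long_running_app are placeholder names, and the full machinery is developed in the scripts that follow. The key detail is that the payload runs in the background while the batch shell blocks in wait, since bash only delivers a trapped signal promptly when it is not stuck behind a foreground command.

#!/bin/bash
# Minimal sketch of the C/R lifecycle (placeholder names, not the real scripts)
trap 'checkpoint_and_requeue' SIGTERM   # Slurm sends SIGTERM shortly before the time limit
long_running_app &                      # payload runs in the background
wait                                    # the trap can fire here; checkpoint, then requeue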
Preparing a Dockerfile for C/R¶
Checkpointing cannot be driven from outside the container: DMTCP must be included within the container image when it is built.
The simulation package can be built using several methods:
- During the container’s build process: The package is compiled and installed when the container is initially built.
- After the container has been built: The source code can be linked from an external location, allowing the code to be updated without rebuilding the entire container.
- Extending an existing container: You can build on top of an already existing container image, which is efficient for quick experimentation and requires minimal modifications.
All of these methods have been thoroughly tested and validated for compatibility with DMTCP-based C/R within containers. This flexibility supports a range of use cases and ensures an efficient setup for simulations in a containerized HPC environment such as NERSC's Perlmutter system.
Below is an example that integrates DMTCP into an existing container by extending its image. The Dockerfile pulls the latest DMTCP source, then configures, compiles, and installs it as part of the container.
Dockerfile
: Integrate DMTCP into an existing container
FROM my_application_container:latest
RUN git clone https://github.com/dmtcp/dmtcp.git \
    && cd dmtcp \
    && ./configure && make \
    && make install
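The same idea works for the first method, compiling DMTCP while the image is first built. The sketch below is a minimal example assuming a Debian/Ubuntu base image; the base image and package names are illustrative and may need adjusting for other distributions.

Dockerfile
: Build DMTCP from a base image during the container build

FROM ubuntu:22.04
# Toolchain needed to build DMTCP (Debian/Ubuntu package names assumed)
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential git ca-certificates \
    && rm -rf /var/lib/apt/lists/*
# Compile and install DMTCP as part of the image build
RUN git clone https://github.com/dmtcp/dmtcp.git \
    && cd dmtcp \
    && ./configure && make \
    && make install
# The simulation package itself would be built and installed below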
C/R Batch Jobs within a Container¶
A custom batch script manages batch jobs in an HPC environment and is central to handling DMTCP-based checkpointing within Slurm-managed jobs. It converts execution time into a human-readable format, calculates the remaining time for job scheduling, and updates job comments to reflect the current state. It also manages job requeuing based on the remaining time, ensuring seamless continuation without user intervention. The script traps termination signals, performs checkpointing, and requeues jobs with updated time limits, integrating DMTCP's checkpointing into the job management workflow so that interrupted tasks resume reliably and automatically.
cr_env.sh
: Script to integrate DMTCP and manage C/R within the container for seamless job execution and requeuing
#!/bin/bash
# -------------------- Time tracking, signal trapping, and requeue functions ------------------------
# Converts seconds to a human-readable format (days-hours:minutes:seconds)
secs2timestr() {
    ((d=${1}/86400))
    ((h=(${1}%86400)/3600))
    ((m=(${1}%3600)/60))
    ((s=${1}%60))
    printf "%d-%02d:%02d:%02d\n" $d $h $m $s
}
# Converts human-readable time to seconds
timestr2secs() {
    if [[ $1 == *-* ]]; then
        # days-hours:minutes:seconds
        echo $1 | sed 's/-/:/' | awk -F: '{print $1*86400+$2*3600+$3*60+$4}'
    elif [[ $1 == *:*:* ]]; then
        # hours:minutes:seconds
        echo $1 | awk -F: '{print $1*3600+$2*60+$3}'
    else
        # minutes:seconds
        echo $1 | awk -F: '{print $2+60*$1}'
    fi
}
# Parses job information, calculates remaining time and requests time for next requeue
parse_job() {
    # Set defaults if not provided
    if [[ -z $ckpt_overhead ]]; then let ckpt_overhead=60; fi
    if [[ -z $max_timelimit ]]; then let max_timelimit=172800; fi
    TOTAL_TIME=$(squeue -h -j $SLURM_JOB_ID -o %k)
    timeAlloc=$(squeue -h -j $SLURM_JOB_ID -o %l)
    # Ensure timeAlloc has three fields (hours:minutes:seconds)
    fields=$(echo $timeAlloc | awk -F ':' '{print NF}')
    if [ $fields -le 2 ]; then
        timeAlloc="0:$timeAlloc"
    fi
    timeAllocSec=$(timestr2secs $timeAlloc)
    TOTAL_TIME=$(timestr2secs $TOTAL_TIME)
    let remainingTimeSec=TOTAL_TIME-timeAllocSec+ckpt_overhead
    if [ $remainingTimeSec -gt 0 ]; then
        remainingTime=$(secs2timestr $remainingTimeSec)
        scontrol update JobId=$SLURM_JOB_ID Comment=$remainingTime
        jobtime=$(secs2timestr $timeAllocSec)
        if [ $remainingTimeSec -gt $timeAllocSec ]; then
            requestTime=$jobtime
        else
            requestTime=$remainingTime
        fi
        echo "Remaining time: $remainingTime"
        echo "Next timelimit: $requestTime"
    fi
}
# Requeue the job if remaining time is available
requeue_job() {
    parse_job
    if [ -n "$remainingTimeSec" ] && [ $remainingTimeSec -gt 0 ]; then
        func="$1"; shift
        for sig; do
            trap "$func $sig" "$sig"
        done
    else
        echo "No more job requeues, job done!"
    fi
}
# Handles signal and performs checkpointing
func_trap() {
    # Perform checkpoint before requeueing; $ckpt_command is set by the job
    # script (e.g., ckpt_command=ckpt_dmtcp) and may be left empty when the
    # checkpoint is triggered inside the container instead
    $ckpt_command
    echo "Signal received. Requeuing job."
    scontrol requeue $SLURM_JOB_ID
    scontrol update JobId=$SLURM_JOB_ID TimeLimit=$requestTime
    echo "Exit status: $?"
}
# Starts the DMTCP coordinator and creates a command wrapper for job communication
start_coordinator() {
    fname=dmtcp_command.$SLURM_JOB_ID
    h=$(hostname)
    # Check if dmtcp_coordinator is installed
    if ! which dmtcp_coordinator > /dev/null; then
        echo "No dmtcp_coordinator found. Check your DMTCP installation."
        exit 1
    fi
    # Start the coordinator in daemon mode
    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1
    # Wait until the coordinator has started and written its port file
    while true; do
        if [ -f "$fname" ]; then
            p=$(cat $fname)
            if [ -n "$p" ]; then
                break
            fi
        fi
    done
    export DMTCP_COORD_HOST=$h
    export DMTCP_COORD_PORT=$p
    # Create a DMTCP command wrapper for easy access
    echo "#!/bin/bash
export PATH=\$PATH
export DMTCP_COORD_HOST=$h
export DMTCP_COORD_PORT=$p
dmtcp_command \$@" > $fname
    chmod a+rx $fname
}
# Waits for the coordinator to complete checkpointing
wait_coord() {
    let sum=0
    ckpt_done=0
    while true; do
        x=($(./dmtcp_command.$SLURM_JOB_ID -s))
        npeers=${x[6]#*=}
        running=${x[7]#*=}
        if [[ $npeers -gt 0 && $running == no ]]; then
            # Checkpoint in progress: processes are suspended
            let sum=sum+1
            sleep 1
        elif [[ $npeers -gt 0 && $running == yes ]]; then
            # Processes resumed: checkpoint is complete
            ckpt_done=1
            break
        else
            break
        fi
    done
    if [[ $ckpt_done == 1 ]]; then
        echo "Checkpointing completed, overhead = $sum seconds"
    else
        echo "No running job to checkpoint"
    fi
}
# Executes DMTCP checkpointing before requeuing the job
ckpt_dmtcp() {
    ./dmtcp_command.$SLURM_JOB_ID -c
    wait_coord
}
Key Points:¶
- Time Tracking: The secs2timestr and timestr2secs functions convert between human-readable time formats and seconds, facilitating time tracking and job scheduling.
- Job Requeuing: The parse_job function calculates the remaining job time and updates the Slurm job comment with the time to request at the next requeue, ensuring efficient requeue management.
- Signal Trapping: The func_trap function handles signals (such as termination signals), performing a checkpoint when a signal is received and ensuring the job is properly requeued for later execution.
- DMTCP Coordination: The start_coordinator function initializes the DMTCP coordinator, enabling the checkpoint-restart functionality that allows seamless pausing and resumption of tasks.
- Checkpointing: The wait_coord and ckpt_dmtcp functions manage the checkpointing process, waiting for the coordinator to complete the checkpoint and ensuring that the job's state is saved correctly.
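As a quick illustration of the time helpers, the two conversion functions are inverses of each other. The session below assumes cr_env.sh has been sourced into an interactive shell (the script only defines functions, so sourcing it has no side effects):

$ source cr_env.sh
$ timestr2secs 1-02:03:04
93784
$ secs2timestr 93784
1-02:03:04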
Automated C/R Strategies (Jobs)¶
The automated C/R strategy ensures seamless job execution and resubmission by integrating DMTCP with Slurm job scheduling. It automates pausing, checkpointing, and resuming jobs, so that work continues from the last checkpoint after a termination signal (e.g., SIGTERM) or a time limit is reached. This is particularly useful in HPC environments, where long-running jobs must be periodically interrupted for resource allocation.
Below, we break down the main components involved in the automation process using the main.sh, wrapper.sh, and example_g4.sh scripts. These scripts work together to manage the lifecycle of an HPC job inside a container, ensuring that the job can checkpoint, terminate, and restart seamlessly.
main.sh¶
The main.sh script defines the Slurm job properties and manages the initial job setup. It handles setting up the container environment, trapping termination signals, and ensuring that the job is automatically requeued and restarted when needed.
main.sh
: main Slurm script to submit the job using Shifter container
#!/bin/bash
# Slurm directives for job properties
#SBATCH -J test # Job name
#SBATCH -q debug # Queue
#SBATCH -N 1 # Number of nodes
#SBATCH -C cpu # CPU architecture
#SBATCH -t 00:07:00 # Wall clock time
#SBATCH -e %x-%j.err # Error file
#SBATCH -o %x-%j.out # Output file
#SBATCH --time-min=00:06:00 # Minimum time allocation
#SBATCH --comment=00:15:00 # Job comment with expected runtime
#SBATCH --signal=SIGTERM@60 # Signal for checkpointing 60 seconds before job ends
#SBATCH --requeue # Automatically requeue the job if it terminates
#SBATCH --open-mode=append # Append output to log files
# Load required module and container image
#SBATCH --module=cvmfs
#SBATCH --image=mtimalsina/geant4_dmtcp:Dec2023 # Container image with Geant4 and DMTCP setup
# Set up environment and C/R helper functions
export DMTCP_COORD_HOST=$(hostname)
source my_env_setup.sh   # must provide the C/R functions from cr_env.sh above
# Trap SIGTERM signal to trigger requeue and checkpoint
requeue_job func_trap SIGTERM
# Launch the job within the Shifter container
shifter --module=cvmfs --image=mtimalsina/geant4_dmtcp:Dec2023 /bin/bash ./wrapper.sh &
# Wait for the job to finish or get requeued
wait
Key Points:¶
- Slurm Job Properties: The script uses Slurm directives to define job parameters such as job name, QOS, node architecture, and wall clock time.
- Signal Handling: The job is configured to catch a SIGTERM signal 60 seconds before termination, triggering the checkpoint process and requeue.
- Containerized Execution: The job runs within a containerized environment (Shifter in this example), ensuring portability and reproducibility.
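Once the job is submitted, its progress across requeues can be followed from a login node. The session below is illustrative (the job ID and values are made up); the %k field is the job comment that parse_job keeps updated with the remaining time, and %l is the current time limit:

$ sbatch main.sh
Submitted batch job 123456
$ squeue -h -j 123456 -o "%k %l"   # comment (remaining time) and current time limit
0-00:09:00 7:00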
wrapper.sh¶
The wrapper.sh script is responsible for managing the checkpoint-restart logic of the job. It starts the DMTCP coordinator, handles job restarts, and traps termination signals for checkpointing. This ensures that the job can either start fresh or resume from a checkpoint.
wrapper.sh
: script to manage the checkpoint-restart logic in the job
#!/bin/bash
# Set DMTCP coordinator host and source environment setup
export DMTCP_COORD_HOST=$(hostname)
source my_env_setup.sh   # must provide the C/R functions from cr_env.sh above
# Function to restart or initiate the job
function restart_job() {
    # Start DMTCP coordinator with a 300-second checkpoint interval
    start_coordinator -i 300
    # SLURM_RESTART_COUNT is set by Slurm after a requeue (unset on first run)
    if [[ ${SLURM_RESTART_COUNT:-0} -eq 0 ]]; then
        # Initial launch, backgrounded so the SIGTERM trap below can fire promptly
        dmtcp_launch --join-coordinator --interval 300 ./example_g4.sh &
        echo "Initial launch successful."
    elif [[ ${SLURM_RESTART_COUNT:-0} -gt 0 ]] && [[ -e dmtcp_restart_script.sh ]]; then
        # Restart the job from the last checkpoint
        echo "Restarting the job..."
        ./dmtcp_restart_script.sh &
        echo "Restart initiated."
    else
        echo "Failed to restart the job, exiting."; exit 1
    fi
    # Trap SIGTERM signal to trigger checkpointing
    trap ckpt_dmtcp SIGTERM
}
# Execute the function to restart or start the job
restart_job
# Wait for the job to complete or terminate
wait
Key Points:¶
- Job Initialization: If the job is launching for the first time, the dmtcp_launch command starts the application (the simple payload.sh example below, or the Geant4 high-energy physics application in example_g4.sh) with checkpointing enabled at 300-second intervals.
- Job Restart: If the job has been checkpointed previously, it restarts from the last saved checkpoint using the generated dmtcp_restart_script.sh, ensuring the job can continue from where it left off.
- Signal Trapping: The script traps the SIGTERM signal, which triggers the ckpt_dmtcp function to checkpoint the job's state, ensuring it is saved before requeuing or terminating.
Similar to the DMTCP section, the following example components demonstrate the basic use of Slurm scripts to checkpoint and restart an application contained in payload.sh.
payload.sh
: script contains the application you wish to checkpoint
#!/bin/bash
for i in {1..45}
do
    echo "step $i"
    date
    sleep 60
done
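Once the first checkpoint interval has elapsed (or ckpt_dmtcp has run), DMTCP writes its state into the working directory. The listing below is indicative rather than exact; checkpoint image names encode details such as the program name, hostname, and process ID:

$ ls
ckpt_bash_*.dmtcp         # checkpoint image(s), one per checkpointed process
dmtcp_restart_script.sh   # generated script that wrapper.sh uses to restart
dmtcp_command.<jobid>     # command wrapper created by start_coordinator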
Additionally, here is an example of a real scientific application in high-energy physics, Geant4, which we have successfully tested with the checkpoint-restart mechanism. An important note: the : (shell no-op) command is required at the end of the script so that DMTCP recognizes the completion of the Geant4 simulation, which it does not natively detect.
example_g4.sh
: actual Geant4 simulation code to run inside the container
#!/bin/bash
# Source Geant4 data from the container environment
export G4ABLADATA=/cvmfs/geant4.cern.ch/share/data/G4ABLA3.3
export G4LEDATA=/cvmfs/geant4.cern.ch/share/data/G4EMLOW8.5
export G4ENSDFSTATEDATA=/cvmfs/geant4.cern.ch/share/data/G4ENSDFSTATE2.3
export G4INCLDATA=/cvmfs/geant4.cern.ch/share/data/G4INCL1.2
export G4NEUTRONHPDATA=/cvmfs/geant4.cern.ch/share/data/G4NDL4.7
export G4PARTICLEXSDATA=/cvmfs/geant4.cern.ch/share/data/G4PARTICLEXS4.0
export G4PIIDATA=/cvmfs/geant4.cern.ch/share/data/G4PII1.3
export G4SAIDXSDATA=/cvmfs/geant4.cern.ch/share/data/G4SAIDDATA2.0
export G4LEVELGAMMADATA=/cvmfs/geant4.cern.ch/share/data/PhotonEvaporation5.7
export G4RADIOACTIVEDATA=/cvmfs/geant4.cern.ch/share/data/RadioactiveDecay5.6
export G4REALSURFACEDATA=/cvmfs/geant4.cern.ch/share/data/RealSurface2.2
# Set environment variables for the Geant4 benchmark
export G4BENCH_INSTALL=/usr/local
export app=ecal
export NEVENTS=10000000
export log=checkpoint
# Run the Geant4 application with the specified parameters
"$G4BENCH_INSTALL/$app/$app-mt" -n 256 -j "$NEVENTS" -p "PERLMUTTER" -b "$log" >>"$log-n256.log"
# Final no-op so DMTCP observes a command completing after the simulation ends
:
g4bench.conf
: file to configure the Geant4 benchmark settings
{
"Run": {
"Seed": 123456789,
"G4DATA": "/cvmfs/geant4.cern.ch/share/data"
},
"Primary": {
"particle": "e-",
"energy": 1000.0, // MeV
"position": [ 0., 0., -45. ], // cm
"direction": [ 0., 0., 1.]
}
}
C/R with podman-hpc and Apptainer¶
The example above is using Shifter for containerized execution. However, if you want to use podman-hpc or Apptainer, you will need to modify the main.sh
file accordingly. Below are the modified versions of the main.sh
file for each case.
Using podman-hpc¶
If you want to use podman-hpc, the main.sh file will look like this:
main_podmanhpc.sh
: main Slurm script to submit the job using podman-hpc container
#!/bin/bash
# Slurm directives for job properties
#SBATCH -J test # Job name
#SBATCH -q debug # Queue
#SBATCH -N 1 # Number of nodes
#SBATCH -C cpu # CPU architecture
#SBATCH -t 00:07:00 # Wall clock time
#SBATCH -e %x-%j.err # Error file
#SBATCH -o %x-%j.out # Output file
#SBATCH --time-min=00:06:00 # Minimum time allocation
#SBATCH --comment=00:17:00 # Job comment with expected runtime
#SBATCH --signal=SIGTERM@60 # Signal for checkpointing 60 seconds before job ends
#SBATCH --requeue # Automatically requeue the job if it terminates
#SBATCH --open-mode=append # Append output to log files
# Load the container image
#SBATCH --image=mtimalsina/geant4_dmtcp:Dec2023 # Container image
# Set up environment and DMTCP coordinator
export DMTCP_COORD_HOST=$(hostname)
# Requeue function to resubmit the job on SIGTERM
function requeue () {
    echo "Got Signal. Going to requeue"
    scontrol requeue ${SLURM_JOB_ID}
}
# Trap SIGTERM signal to trigger requeue function
trap requeue SIGTERM
# Launch the job within the podman-hpc container
podman-hpc run --userns keep-id --rm -it --mpi \
    -e SLURM_JOBID=${SLURM_JOB_ID} \
    -v /cvmfs:/cvmfs \
    -v $(pwd):/podman-hpc \
    -w /podman-hpc \
    mtimalsina/geant4_dmtcp:Dec2023 \
    /bin/bash ./wrapper.sh &
wait
The rest of the files remain the same as the Shifter example.
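Note that the image must already be available to podman-hpc on the system; at NERSC this is typically a one-time pull from a login node, after which podman-hpc keeps a squashed read-only copy for use on compute nodes:

podman-hpc pull mtimalsina/geant4_dmtcp:Dec2023
podman-hpc images   # verify the image (and its squashed copy) is available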
Using Apptainer¶
If you want to use Apptainer, the main.sh file will look like this:
main_apptainer.sh
: main Slurm script to submit the job using apptainer container
#!/bin/bash
# Slurm directives for job properties
#SBATCH -J test-geant4-apptainer # Job name
#SBATCH -q debug # Queue (regular, debug, etc.)
#SBATCH -N 1 # Number of nodes
#SBATCH -C cpu # CPU architecture
#SBATCH -t 00:29:00 # Wall clock time
#SBATCH -e %x-%j.err # Error file
#SBATCH -o %x-%j.out # Output file
#SBATCH --time-min=00:29:00 # Minimum time allocation
#SBATCH --comment=00:50:00 # Job comment with expected runtime
#SBATCH --signal=SIGTERM@60 # Signal for checkpointing 60 seconds before job ends
#SBATCH --requeue # Automatically requeue the job if it terminates
#SBATCH --open-mode=append # Append output to log files
# Load required module
#SBATCH --module=cvmfs # Load CVMFS module for Apptainer
# Set up environment and DMTCP coordinator
export DMTCP_COORD_HOST=$(hostname)
export PATH=${PATH}:/cvmfs/oasis.opensciencegrid.org/mis/apptainer/1.3.3/x86_64/bin
# Requeue function to resubmit the job on SIGTERM
function requeue () {
    echo "Got Signal. Going to requeue"
    scontrol requeue ${SLURM_JOB_ID}
}
# Trap SIGTERM signal to trigger requeue function
trap requeue SIGTERM
# Define the path to the Apptainer image
# download image with
# apptainer pull docker://mtimalsina/geant4_dmtcp:Dec2023
apptainer_image_path="/global/cfs/cdirs/m0000/elvis/geant4_dmtcp_Dec2023.sif"
# Check if the Apptainer image exists
if [ ! -f "$apptainer_image_path" ]; then
    echo "Cannot find Apptainer image at $apptainer_image_path"
    exit 1
fi
# Launch the job within the Apptainer container
apptainer exec -B /cvmfs:/cvmfs \
    $apptainer_image_path /bin/bash ./wrapper.sh &
wait
The rest of the files remain the same as the Shifter example.
The checkpoint-restart (C/R) mechanism can also be executed directly on Perlmutter without the need for containers. For more details, refer to the documentation at NERSC Checkpoint-Restart.
References¶
For more details on the checkpoint-restart mechanisms and containerized HPC environments, refer to the following resources:
- Optimizing Checkpoint-Restart Mechanisms for HPC with DMTCP in Containers at NERSC: this paper provides a comprehensive overview of the techniques used in this work, including the implementation of DMTCP in container platforms such as Shifter and podman-hpc.
- Checkpoint-Restart in Containerized HPC Environments (NERSC Data Day 2024 Presentation): this presentation highlights the use of checkpoint-restart mechanisms and their integration into containerized HPC environments to enhance job reliability and efficiency.