
Containerized Checkpoint-Restart (C/R) Mechanisms for High-Performance Computing (HPC)

High-Performance Computing (HPC) systems are crucial for solving complex scientific problems, but resource management, fault tolerance, and maintaining consistent performance across diverse environments remain difficult. Container technologies such as NERSC's Shifter and Podman-HPC offer solutions to some of these problems. We used Distributed MultiThreaded CheckPointing (DMTCP) to implement robust Checkpoint-Restart (C/R) mechanisms that address fault tolerance and resource management within containerized environments.

This section highlights successful C/R implementations on Perlmutter at NERSC using Shifter, Podman-HPC, and Apptainer. Because Apptainer is widely adopted in the HPC container space, these examples also indicate where C/R within containers could be used at other HPC centers.

Containers play a critical role in optimizing HPC workflows. In the context of checkpoint-restart (C/R), integrating DMTCP into containers allows jobs to be paused, resumed, and migrated without restarting computations from scratch. By enabling the smooth resumption of long-running tasks after interruptions, containers help reduce computational overhead, improve resource utilization, and optimize job scheduling. As a result, containers have become essential for ensuring operational resilience, consistent performance, and cost-effectiveness in HPC environments.
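
At its core, DMTCP wraps an application launch and can later restore the process tree from checkpoint images. A minimal, container-agnostic illustration of the commands involved (my_app is a placeholder):

dmtcp_launch ./my_app        # run the application under DMTCP control
dmtcp_command -c             # ask the coordinator to write a checkpoint
dmtcp_restart ckpt_*.dmtcp   # later, resume from the checkpoint images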

C/R Workflow

The figure illustrates the automated job management process within NERSC's containerized HPC environment. The workflow covers the entire lifecycle of a computational job, from submission to execution, checkpointing, and signal trapping for job resubmission. When a job reaches its time limit or encounters a termination signal, the checkpoint-restart (C/R) mechanism activates, capturing the job's state and requeuing it to resume later. The diagram highlights the decision-making flow following a termination signal, showing how the system either completes the job or restarts it based on the checkpoint data. This automated C/R strategy ensures efficient use of resources by enabling jobs to continue from their last saved state, minimizing downtime and maximizing computational efficiency. This is a visual representation of how DMTCP-enabled C/R is integrated into the containerized environment, emphasizing the seamless interaction between job scheduling, signal handling, and resource allocation within NERSC’s Shifter and Podman-HPC containers.

Preparing a Dockerfile for C/R

An application running inside a container cannot be checkpointed by DMTCP from outside the container: DMTCP must be included within the container when the image is built.

The simulation package can be built using several methods:

  • During the container’s build process: The package is compiled and installed when the container is initially built.
  • After the container has been built: The source code can be linked from an external location, which adds flexibility and allows the code to be updated without rebuilding the entire container.
  • Extending an existing container: You can build on top of an already existing container image, which is efficient for quick experimentation and requires minimal modifications.

All of these methods have been thoroughly tested and validated for compatibility with DMTCP-based C/R within containers. They provide flexibility for various use cases, ensuring an efficient setup for simulations in a containerized HPC environment such as NERSC's Perlmutter system.
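
For the first approach, DMTCP can be compiled as part of the initial image build. Below is a minimal sketch assuming an Ubuntu base image; the package list is illustrative and should be adjusted to your application's needs.

Dockerfile: build DMTCP during the container's initial build (illustrative sketch)
FROM ubuntu:22.04
# Illustrative build dependencies for compiling DMTCP from source
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential git \
    && rm -rf /var/lib/apt/lists/*
# Build and install DMTCP
RUN git clone https://github.com/dmtcp/dmtcp.git \
    && cd dmtcp \
    && ./configure && make \
    && make install
# The simulation package itself would be compiled and installed here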

Below is an example script to integrate DMTCP into an existing container. This script pulls the latest DMTCP source, configures, compiles, and installs it as part of the container. It demonstrates how DMTCP can be embedded within a container by extending an existing container.

Dockerfile: Integrate DMTCP into an existing container
# Assumes the base image already provides git and a C/C++ build toolchain
FROM my_application_container:latest
RUN git clone https://github.com/dmtcp/dmtcp.git \
    && cd dmtcp \
    && ./configure && make \
    && make install
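
One possible build-and-deploy flow for such an image on Perlmutter, assuming a Docker Hub account (the image name is a placeholder):

# On a laptop or workstation with Docker
docker build -t <dockerhub-user>/my_application_dmtcp:latest .
docker push <dockerhub-user>/my_application_dmtcp:latest

# On Perlmutter, pull the image for Shifter ...
shifterimg pull <dockerhub-user>/my_application_dmtcp:latest

# ... or for podman-hpc
podman-hpc pull <dockerhub-user>/my_application_dmtcp:latest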

C/R Batch Jobs within a Container

A custom batch script, cr_env.sh, is central to handling DMTCP-based checkpointing within Slurm-managed jobs. It converts execution time between seconds and a human-readable format, calculates the remaining time for job scheduling, and updates the job comment to reflect the current state. It also manages job requeuing based on the remaining time, ensuring seamless continuation without user intervention: the script traps termination signals, performs checkpointing, and requeues jobs with updated time limits. This design integrates DMTCP's checkpointing capabilities into the fabric of the job management workflow, ensuring reliable and automated resumption of interrupted tasks.

cr_env.sh: Script to integrate DMTCP and manage C/R within the container for seamless job execution and requeuing
#!/bin/bash
# -------------------- Time tracking, signal trapping, and requeue functions ------------------------

# Converts seconds to a human-readable format (days-hours:minutes:seconds)
secs2timestr() {
  ((d=${1}/86400))
  ((h=(${1}%86400)/3600))
  ((m=(${1}%3600)/60))
  ((s=${1}%60))
  printf "%d-%02d:%02d:%02d\n" $d $h $m $s
}

# Converts human-readable time to seconds
timestr2secs() {
  if [[ $1 == *-* ]]; then
    echo $1 | sed 's/-/:/' | awk -F: '{print $1*86400+$2*3600+$3*60+$4}'
  else
    if [[ $1 == *:*:* ]]; then
      echo $1 | awk -F: '{print $3, $2, $1}' | awk '{print $1+60*$2+3600*$3}'
    else
      echo $1 | awk -F: '{print $2+60*$1}'
    fi
  fi
}
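
# Example round trip: secs2timestr 93784 -> "1-02:03:04"; timestr2secs 1-02:03:04 -> 93784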

# Parses job information, calculates the remaining time, and determines the time limit to request for the next requeue
parse_job() {
  # Set defaults if not provided
  if [[ -z $ckpt_overhead ]]; then let ckpt_overhead=60; fi
  if [[ -z $max_timelimit ]]; then let max_timelimit=172800; fi

  TOTAL_TIME=$(squeue -h -j $SLURM_JOB_ID -o %k)   # total remaining runtime, stored in the job comment
  timeAlloc=$(squeue -h -j $SLURM_JOB_ID -o %l)    # time limit of the current allocation

  # Ensure timeAlloc has at least two fields (hours:minutes)
  fields=$(echo $timeAlloc | awk -F ':' '{print NF}')
  if [ $fields -le 2 ]; then
    timeAlloc="0:$timeAlloc"
  fi

  timeAllocSec=$(timestr2secs $timeAlloc)
  TOTAL_TIME=$(timestr2secs $TOTAL_TIME)

  # Useful compute this run is (allocation - checkpoint overhead), so add the overhead back
  let remainingTimeSec=TOTAL_TIME-timeAllocSec+ckpt_overhead
  if [ $remainingTimeSec -gt 0 ]; then
    remainingTime=$(secs2timestr $remainingTimeSec)
    scontrol update JobId=$SLURM_JOB_ID Comment=$remainingTime

    jobtime=$(secs2timestr $timeAllocSec)
    if [ $remainingTimeSec -gt $timeAllocSec ]; then
      requestTime=$jobtime
    else
      requestTime=$remainingTime
    fi
    echo "Remaining time: $remainingTime"
    echo "Next timelimit: $requestTime"
  fi
}

# Sets signal traps to checkpoint and requeue the job while remaining time is available
requeue_job() {
  parse_job
  if [ -n "$remainingTimeSec" ] && [ $remainingTimeSec -gt 0 ]; then
    func="$1"; shift
    for sig; do
      trap "$func $sig" "$sig"
    done
  else
    echo "No more job requeues, job done!"
  fi
}

# Handles a trapped signal: checkpoints the job, then requeues it
func_trap() {
  # Perform checkpoint before requeue; ckpt_command defaults to ckpt_dmtcp, defined below
  ${ckpt_command:-ckpt_dmtcp}
  echo "Signal received. Requeuing job."
  scontrol requeue $SLURM_JOB_ID
  scontrol update JobId=$SLURM_JOB_ID TimeLimit=$requestTime
  echo "Exit status: $?"
}

# Starts the DMTCP coordinator and creates a command wrapper for job communication
start_coordinator() {
  fname=dmtcp_command.$SLURM_JOB_ID   # same wrapper name used by wait_coord and ckpt_dmtcp below
  h=$(hostname)

  # Check if dmtcp_coordinator is installed
  if ! which dmtcp_coordinator > /dev/null; then
    echo "No dmtcp_coordinator found. Check your DMTCP installation."
    exit 1
  fi

  # Start the coordinator in daemon mode on a dynamically chosen port
  dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname "$@" 1>/dev/null 2>&1

  # Wait until the coordinator has started and written its port to the file
  while true; do
    if [ -f "$fname" ]; then
      p=$(cat $fname)
      if [ -n "$p" ]; then
        break
      fi
    fi
    sleep 1
  done

  export DMTCP_COORD_HOST=$h
  export DMTCP_COORD_PORT=$p

  # Create a DMTCP command wrapper for easy access
  echo "#!/bin/bash
  export PATH=\$PATH
  export DMTCP_COORD_HOST=$h
  export DMTCP_COORD_PORT=$p
  dmtcp_command \$@" > $fname
  chmod a+rx $fname
}

# Waits for the coordinator to complete checkpointing
wait_coord() {
  let sum=0
  ckpt_done=0
  while true; do
    # Query coordinator status; fields 6 and 7 of the output are expected to be
    # NUM_PEERS=<n> and RUNNING=<yes|no>; ${x[i]#*=} strips the "KEY=" prefix
    x=($(./dmtcp_command.$SLURM_JOB_ID -s))
    npeers=${x[6]#*=}
    running=${x[7]#*=}
    if [[ $npeers > 0 && $running == no ]]; then
      let sum=sum+1
      sleep 1
    elif [[ $npeers > 0 && $running == yes ]]; then
      ckpt_done=1
      break
    else
      break
    fi
  done

  if [[ $ckpt_done == 1 ]]; then
    echo "Checkpointing completed, overhead = $sum seconds"
  else
    echo "No running job to checkpoint"
  fi
}

# Executes DMTCP checkpointing before requeuing the job
ckpt_dmtcp() {
  ./dmtcp_command.$SLURM_JOB_ID -c
  wait_coord
}

Key Points:

  • Time Tracking: The functions secs2timestr and timestr2secs are used to convert time formats between human-readable form and seconds, facilitating easier tracking and job scheduling.
  • Job Requeuing: The parse_job function calculates the remaining job time, stores it in the Slurm job comment, and determines the time limit to request for the next requeue, keeping requeue management automatic.
  • Signal Trapping: The func_trap function handles signals (such as termination signals), performing checkpointing when a signal is received and ensuring the job is properly requeued for later execution.
  • DMTCP Coordination: The start_coordinator function initializes the DMTCP coordinator, enabling the checkpoint-restart functionality within the job, which allows seamless pausing and resumption of tasks.
  • Checkpointing: The wait_coord and ckpt_dmtcp functions manage the checkpointing process, waiting for the coordinator to complete the checkpoint and ensuring that the job's state is saved correctly.
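
Taken together, these functions are meant to be sourced into a batch or wrapper script. A minimal usage sketch, assuming cr_env.sh sits in the job's working directory and payload.sh is the application to checkpoint:

source ./cr_env.sh              # load the helper functions above
start_coordinator -i 300        # start the DMTCP coordinator, checkpointing every 300 s
requeue_job func_trap SIGTERM   # checkpoint and requeue when SIGTERM arrives
dmtcp_launch --join-coordinator ./payload.sh &
wait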

Automated C/R Strategies (Jobs)

The automated C/R strategy ensures seamless job execution and resubmission by integrating DMTCP with Slurm job scheduling. It automates the process of pausing, checkpointing, and resuming jobs, ensuring that jobs can continue from their last checkpoint after a termination signal (e.g., SIGTERM) or time limit is reached. This is particularly useful in HPC environments, where long-running jobs must be periodically interrupted for resource allocation.

Below, we break down the main components involved in the automation process using the main.sh, wrapper.sh, and example_g4.sh scripts. These scripts work together to manage the lifecycle of an HPC job inside a container, ensuring that the job can checkpoint, terminate, and restart seamlessly.

main.sh

The main.sh script defines the Slurm job properties and manages the initial job setup. It handles setting the container environment, trapping termination signals, and ensuring that the job is automatically requeued and restarted when needed.

main.sh: main Slurm script to submit the job using Shifter container
#!/bin/bash

# Slurm directives for job properties
#SBATCH -J test            # Job name
#SBATCH -q debug           # Queue
#SBATCH -N 1               # Number of nodes
#SBATCH -C cpu             # CPU architecture
#SBATCH -t 00:07:00        # Wall clock time
#SBATCH -e %x-%j.err       # Error file
#SBATCH -o %x-%j.out       # Output file
#SBATCH --time-min=00:06:00   # Minimum time allocation
#SBATCH --comment=00:15:00    # Job comment with expected runtime
#SBATCH --signal=SIGTERM@60   # Signal for checkpointing 60 seconds before job ends
#SBATCH --requeue              # Automatically requeue the job if it terminates
#SBATCH --open-mode=append     # Append output to log files

# Load required module and container image
#SBATCH --module=cvmfs
#SBATCH --image=mtimalsina/geant4_dmtcp:Dec2023   # Container image with Geant4 and DMTCP setup

# Set up the environment and load the C/R helper functions
export DMTCP_COORD_HOST=$(hostname)
source ./cr_env.sh   # defines requeue_job, func_trap, start_coordinator, and ckpt_dmtcp

# Trap SIGTERM signal to trigger requeue and checkpoint
requeue_job func_trap SIGTERM

# Launch the job within the Shifter container
shifter --module=cvmfs --image=mtimalsina/geant4_dmtcp:Dec2023 /bin/bash ./wrapper.sh &

# Wait for the job to finish or get requeued
wait

Key Points:

  • Slurm Job Properties: The script uses Slurm directives to define job parameters such as job name, QOS, node architecture, and wall clock time.
  • Signal Handling: The job is configured to catch a SIGTERM signal 60 seconds before termination, triggering the checkpoint process and requeue.
  • Containerized Execution: The job runs within a containerized environment, using Shifter in this example, ensuring portability and reproducibility.
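
With these pieces in place, the job is submitted like any other Slurm script. One possible way to follow the requeue cycle (the job ID is illustrative):

sbatch main.sh                                 # submit the job
squeue --me                                    # watch it run, hit its limit, and requeue
scontrol show job 12345678 | grep -i comment   # remaining time is tracked in the job comment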

wrapper.sh

The wrapper.sh script is responsible for managing the checkpoint-restart logic in the job. It starts the DMTCP coordinator, handles job restarts, and traps termination signals for checkpointing. This ensures that the job can either start fresh or resume from a checkpoint.

wrapper.sh: script to manage the checkpoint-restart logic in the job
#!/bin/bash

# Set DMTCP coordinator host and load the C/R helper functions
export DMTCP_COORD_HOST=$(hostname)
source ./cr_env.sh   # defines start_coordinator and ckpt_dmtcp used below

# Function to restart or initiate the job
function restart_job() {
    # Start DMTCP coordinator with a checkpoint interval
    start_coordinator -i 300

    # Slurm sets SLURM_RESTART_COUNT on requeued jobs; it is unset on the first launch
    restarts=${SLURM_RESTART_COUNT:-0}

    if [[ $restarts -eq 0 ]]; then
        # Initial job launch, backgrounded so the signal trap below takes effect
        dmtcp_launch --join-coordinator --interval 300 ./example_g4.sh &
        echo "Initial launch successful."
    elif [[ $restarts -gt 0 && -e dmtcp_restart_script.sh ]]; then
        # Restart the job from the most recent checkpoint
        echo "Restarting the job..."
        ./dmtcp_restart_script.sh &
        echo "Restart initiated."
    else
        echo "Failed to restart the job, exiting."; exit 1
    fi

    # Trap SIGTERM signal to trigger checkpointing
    trap ckpt_dmtcp SIGTERM
}

# Execute the function to restart or start the job
restart_job

# Wait for the job to complete or terminate
wait

Key Points:

  • Job Initialization: On the first launch, dmtcp_launch starts the application, either the simple example (payload.sh) or the Geant4 high-energy-physics application (example_g4.sh), with checkpointing enabled at 300-second intervals.
  • Job Restart: If the job has been checkpointed previously, it restarts from the last saved checkpoint using the dmtcp_restart_script.sh file, ensuring the job can continue from where it left off.
  • Signal Trapping: The script traps the SIGTERM signal, which triggers the ckpt_dmtcp function to checkpoint the job's state, ensuring it is saved before requeuing or terminating.

As in the DMTCP section, the following example components demonstrate the basic use of Slurm scripts to checkpoint and restart an application contained in payload.sh.

payload.sh: script contains the application you wish to checkpoint
#!/bin/bash
for i in {1..45}
do
    echo "step $i"
    date
    sleep 60
done
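
Once the first checkpoint interval elapses, DMTCP writes its state into the job's working directory: one or more checkpoint images, the generated dmtcp_restart_script.sh used by wrapper.sh, and the dmtcp_command.<jobid> wrapper created by start_coordinator. The exact image names vary with the checkpointed process:

ls
ckpt_*.dmtcp    dmtcp_restart_script.sh    dmtcp_command.<jobid>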

Additionally, here is an example of a real scientific application in high-energy physics, Geant4, which we have successfully tested with the checkpoint-restart mechanism. An important note: the : (no-op) command at the end of the script ensures that DMTCP recognizes the completion of the Geant4 simulation, since it does not natively detect the end of the application.

example_g4.sh: actual Geant4 simulation code to run inside the container
#!/bin/bash

# Source Geant4 data from the container environment
export G4ABLADATA=/cvmfs/geant4.cern.ch/share/data/G4ABLA3.3
export G4LEDATA=/cvmfs/geant4.cern.ch/share/data/G4EMLOW8.5
export G4ENSDFSTATEDATA=/cvmfs/geant4.cern.ch/share/data/G4ENSDFSTATE2.3
export G4INCLDATA=/cvmfs/geant4.cern.ch/share/data/G4INCL1.2
export G4NEUTRONHPDATA=/cvmfs/geant4.cern.ch/share/data/G4NDL4.7
export G4PARTICLEXSDATA=/cvmfs/geant4.cern.ch/share/data/G4PARTICLEXS4.0
export G4PIIDATA=/cvmfs/geant4.cern.ch/share/data/G4PII1.3
export G4SAIDXSDATA=/cvmfs/geant4.cern.ch/share/data/G4SAIDDATA2.0
export G4LEVELGAMMADATA=/cvmfs/geant4.cern.ch/share/data/PhotonEvaporation5.7
export G4RADIOACTIVEDATA=/cvmfs/geant4.cern.ch/share/data/RadioactiveDecay5.6
export G4REALSURFACEDATA=/cvmfs/geant4.cern.ch/share/data/RealSurface2.2

# Set environment variables for the Geant4 benchmark
export G4BENCH_INSTALL=/usr/local
export app=ecal
export NEVENTS=10000000
export log=checkpoint

# Run the Geant4 application with the specified parameters
"$G4BENCH_INSTALL/$app/$app-mt" -n 256 -j "$NEVENTS" -p "PERLMUTTER" -b "$log" >>"$log-n256.log"
:

g4bench.conf: file to configure the Geant4 benchmark settings
{
  "Run": {
    "Seed": 123456789,
    "G4DATA": "/cvmfs/geant4.cern.ch/share/data"
  },
  "Primary": {
    "particle": "e-",
    "energy": 1000.0,   // MeV
    "position": [ 0., 0., -45. ],  // cm
    "direction": [ 0., 0., 1.]
  }
}

C/R with podman-hpc and Apptainer

The example above uses Shifter for containerized execution. To use podman-hpc or Apptainer instead, modify the main.sh file accordingly. Below are the modified versions of main.sh for each case.


Using podman-hpc

If you want to use podman-hpc, the main.sh file will look like this:

main_podmanhpc.sh: main Slurm script to submit the job using podman-hpc container
#!/bin/bash

# Slurm directives for job properties
#SBATCH -J test            # Job name
#SBATCH -q debug           # Queue
#SBATCH -N 1               # Number of nodes
#SBATCH -C cpu             # CPU architecture
#SBATCH -t 00:07:00        # Wall clock time
#SBATCH -e %x-%j.err       # Error file
#SBATCH -o %x-%j.out       # Output file
#SBATCH --time-min=00:06:00   # Minimum time allocation
#SBATCH --comment=00:17:00    # Job comment with expected runtime
#SBATCH --signal=SIGTERM@60   # Signal for checkpointing 60 seconds before job ends
#SBATCH --requeue              # Automatically requeue the job if it terminates
#SBATCH --open-mode=append     # Append output to log files

# Note: unlike Shifter, podman-hpc does not use the #SBATCH --image directive;
# pull the image onto the system beforehand, e.g.: podman-hpc pull mtimalsina/geant4_dmtcp:Dec2023

# Set up environment and DMTCP coordinator
export DMTCP_COORD_HOST=$(hostname)

# Requeue function to resubmit the job on SIGTERM
function requeue () {
    echo "Got Signal. Going to requeue"
    scontrol requeue ${SLURM_JOB_ID}
}

# Trap SIGTERM signal to trigger requeue function
trap requeue SIGTERM

# Launch the job within the podman-hpc container
podman-hpc run --userns keep-id --rm -it --mpi \
    -e SLURM_JOBID=${SLURM_JOB_ID} \
    -v /cvmfs:/cvmfs \
    -v $(pwd):/podman-hpc \
    -w /podman-hpc \
    mtimalsina/geant4_dmtcp:Dec2023 \
    /bin/bash ./wrapper.sh &

wait

The rest of the files remain the same as the Shifter example.

Using Apptainer

If you want to use Apptainer, the main.sh file will look like this:

main_apptainer.sh: main Slurm script to submit the job using apptainer container
#!/bin/bash

# Slurm directives for job properties
#SBATCH -J test-gent4-apptainer         # Job name
#SBATCH -q debug                        # Queue (regular, debug, etc.)
#SBATCH -N 1                            # Number of nodes
#SBATCH -C cpu                          # CPU architecture
#SBATCH -t 00:29:00                     # Wall clock time
#SBATCH -e %x-%j.err                    # Error file
#SBATCH -o %x-%j.out                    # Output file
#SBATCH --time-min=00:29:00             # Minimum time allocation
#SBATCH --comment=00:50:00              # Job comment with expected runtime
#SBATCH --signal=SIGTERM@60             # Signal for checkpointing 60 seconds before job ends
#SBATCH --requeue                       # Automatically requeue the job if it terminates
#SBATCH --open-mode=append              # Append output to log files

# Note: the Shifter-specific #SBATCH --module=cvmfs directive does not apply to Apptainer;
# /cvmfs is instead bind-mounted in the apptainer exec command below

# Set up environment and DMTCP coordinator
export DMTCP_COORD_HOST=$(hostname)
export PATH=${PATH}:/cvmfs/oasis.opensciencegrid.org/mis/apptainer/1.3.3/x86_64/bin

# Requeue function to resubmit the job on SIGTERM
function requeue () {
    echo "Got Signal. Going to requeue"
    scontrol requeue ${SLURM_JOB_ID}
}

# Trap SIGTERM signal to trigger requeue function
trap requeue SIGTERM

# Define the path to the Apptainer image
# download image with 
# apptainer pull docker://mtimalsina/geant4_dmtcp:Dec2023 
apptainer_image_path="/global/cfs/cdirs/m0000/elvis/geant4_dmtcp_Dec2023.sif"

# Check if the Apptainer image exists
if [ ! -f "$apptainer_image_path" ]; then
    echo "Cannot find Apptainer image at $apptainer_image_path"
    exit 1
fi

# Launch the job within the Apptainer container
apptainer exec -B /cvmfs:/cvmfs \
    $apptainer_image_path /bin/bash ./wrapper.sh &

wait

The rest of the files remain the same as the Shifter example.

The checkpoint-restart (C/R) mechanism can also be executed directly on Perlmutter without the need for containers. For more details, refer to the documentation at NERSC Checkpoint-Restart.

References

For more details on the checkpoint-restart mechanisms and containerized HPC environments, you can refer to the following resources: