gdb4hpc and CCDB

Parallel Debugging with gdb4hpc

gdb4hpc is a GDB-based parallel debugger developed by HPE (formerly Cray). It lets programmers either launch an application or attach to a running application that was launched with srun, and then debug the parallel code from the command line.

Below is an example of running gdb4hpc for a parallel application:

nersc$ salloc -N 1 -C knl -t 30:00 -q debug
...
nersc$ module load gdb4hpc
nersc$ gdb4hpc
...
dbg all> launch $pset{8} ./hello_mpi   # Launch 'hello_mpi' using 8 tasks which I name '$pset'

dbg all> viewset $pset                 # Display the PE set thus defined
Name       Procs
pset       pset{0..7}

dbg all> bt                            # Show where I am - the backtrace
pset{0..7}: #0  0x00000000200009c5 in main at /global/cscratch/sd/elvis/hello_mpi.c:8

dbg all> list                          # List the code
pset{0..7}: 8     MPI_Init(&argc,&argv);
pset{0..7}: 9     MPI_Comm_size(MPI_COMM_WORLD, &nproc);
pset{0..7}: 10    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
pset{0..7}: 11    printf("Hello world from %d\n", myRank);
pset{0..7}: 12    MPI_Finalize();
pset{0..7}: 13    return 0;
pset{0..7}: 14  }

dbg all> break hello_mpi.c:11          # Set a breakpoint at line 11 of hello_mpi.c
dbg all> continue                      # Run

dbg all> print myRank                  # Print the value of 'myRank' for all processes
pset[0]: 0
...
pset[7]: 7
dbg all> print $pset{3}::myRank        # Print the value of 'myRank' for rank 3 only
pset[3]: 3

Note that .., as in pset{0..7}, denotes a range of numbers (here, ranks 0 through 7).
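The brace notation can be modeled in a few lines of Python. This is only an illustrative sketch and not part of gdb4hpc: the function name expand_pe_set is hypothetical, and it handles only the two forms that appear in the session above, a single index (as in $pset{3}) and a range (as in pset{0..7}). Keep in mind that in the launch command the braces give a task count instead ($pset{8} launches 8 tasks), which this sketch does not model.

```python
import re


def expand_pe_set(spec):
    """Expand a PE set reference such as 'pset{0..7}' or '$pset{3}'.

    Hypothetical helper for illustration only; it is not a gdb4hpc API.
    Returns a (name, ranks) tuple, where ranks is a list of integers.
    """
    # Optional leading '$', a set name, then '{n}' or '{first..last}'.
    match = re.fullmatch(r"\$?(\w+)\{(\d+)(?:\.\.(\d+))?\}", spec)
    if match is None:
        raise ValueError(f"unrecognized PE set reference: {spec!r}")
    name = match.group(1)
    first = int(match.group(2))
    # A single index is treated as a one-element range.
    last = first if match.group(3) is None else int(match.group(3))
    return name, list(range(first, last + 1))
```

For example, expand_pe_set("pset{0..7}") yields the name pset with the eight ranks 0 through 7, matching the eight tasks launched above.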

Comparative Debugging

What makes gdb4hpc (and CCDB) unique is its comparative debugging technology, which lets programmers run two applications side by side and compare data structures between them. Users can run two versions of the same application simultaneously, one known to generate correct results and one that gives incorrect results, to identify the location where the two codes start to deviate from each other.

CCDB is a GUI tool for comparative debugging. It runs gdb4hpc underneath. Its interface makes it easy for users to interact with gdb4hpc for debugging. Users are advised to use CCDB over gdb4hpc.

To compare something between two applications, you need to tell gdb4hpc and CCDB the name of the variable, the location where the comparison is to be made, and how the data is distributed over the MPI processes. For these, gdb4hpc and CCDB use three entities:

  • PE set: A set of MPI processes
  • Decomposition: How a variable is distributed over the MPI processes in a PE set
  • Assertion script: A collection of mathematical relationships (e.g., equality of the value of a variable in two codes) to be tested
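To make the three entities concrete, here is a small plain-Python model of them. It is a conceptual sketch only and does not use or reproduce gdb4hpc's commands: the function names block_decompose and find_mismatches are hypothetical, a contiguous block distribution is assumed purely for illustration, and the sample values are made up. The PE set is modeled as ranks 0 to num_ranks - 1, the decomposition as the way a global array is split across those ranks, and the assertion as an equality test (with an optional tolerance) between the two applications.

```python
def block_decompose(global_values, num_ranks):
    """Split a global array into contiguous blocks, one per rank.

    Models a decomposition: how one variable is spread over a PE set.
    Block distribution is an assumption made for this sketch.
    """
    base, extra = divmod(len(global_values), num_ranks)
    blocks, start = [], 0
    for rank in range(num_ranks):
        # The first 'extra' ranks each hold one additional element.
        size = base + (1 if rank < extra else 0)
        blocks.append(global_values[start:start + size])
        start += size
    return blocks


def find_mismatches(blocks_a, blocks_b, tol=0.0):
    """Model an equality assertion between two applications.

    Returns (rank, local_index, value_a, value_b) for every element
    whose values differ by more than tol.
    """
    mismatches = []
    for rank, (local_a, local_b) in enumerate(zip(blocks_a, blocks_b)):
        for index, (a, b) in enumerate(zip(local_a, local_b)):
            if abs(a - b) > tol:
                mismatches.append((rank, index, a, b))
    return mismatches


# Made-up data: the "broken" run differs in one element.
working = block_decompose([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)
broken = block_decompose([1.0, 2.0, 3.5, 4.0, 5.0, 6.0], 3)
print(find_mismatches(working, broken))
```

The report identifies which rank holds the differing element and where in that rank's local block it sits, which is the kind of information a failed assertion points you toward.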

Please see the man page (man gdb4hpc) for usage information about gdb4hpc's comparative debugging feature. Cray's 'XC Series Programming Environment User Guide' also explains how to use the tool. The tutorial uses example codes that are provided in the gdb4hpc distribution package. You can build the executables using the provided script as follows:

nersc$ module load gdb4hpc
nersc$ cp -R $GDB4HPC_DIR/demos/hpcc_demo .    # copy the entire directory to the current directory
nersc$ cd hpcc_demo
nersc$ module swap PrgEnv-intel PrgEnv-cray    # its Makefile uses the Cray compiler
nersc$ ./build_demo.sh

This will build two binaries, hpcc_working and hpcc_broken.

CCDB Example

To use CCDB:

nersc$ salloc -N 2 -C knl -t 30:00 -q debug    # request enough nodes for launching two applications
...
nersc$ module load gdb4hpc
nersc$ module load cray-ccdb
nersc$ ccdb

Then, launch two applications from the CCDB window.

Below is an assertion script that tests whether six variables have the same values in both applications at line 418 of HPL_pdtest.c. The results show that resid0 and XnormI differ between the applications, so both applications have stopped at line 418.

(Screenshot: CCDB window showing the assertion script and its results)