
Debugging Tools

Available tools

NERSC provides many popular debugging tools. Some of them are general-purpose tools and others are geared toward more specific tasks.

A quick guideline on when to use which debugging tool is as follows:

  • DDT: DDT is a GUI parallel debugger. It has features similar to TotalView and a similarly intuitive user interface. It is primarily used for debugging parallel MPI or OpenMP applications.
  • GDB: GDB can be used to quickly and easily examine a core file produced when an execution crashed and obtain an approximate traceback.
  • gdb4hpc and CCDB: gdb4hpc is a GDB-based parallel debugger, developed by Cray. It allows programmers to either launch an application or attach to an already-running application that was launched with srun in order to debug the parallel code in command-line mode.
  • Sanitizers and sanitizers4hpc: LLVM Sanitizers are a group of tools for detecting a variety of problems in C and C++ codes, such as memory errors or race conditions among threads. With a specially instrumented executable, these tools can detect bugs that are often hard to identify otherwise (see the sketch after this list).
  • STAT and ATP: STAT (the Stack Trace Analysis Tool) is a highly scalable, lightweight tool that gathers and merges stack traces from all of the processes of a parallel application. ATP (Abnormal Termination Processing) automatically runs STAT when the code crashes.
  • TotalView: TotalView, from Perforce Software, is a parallel debugging tool. It provides an X Windows-based Graphical User Interface and a command line interface.
  • Valgrind: The Valgrind tool suite provides several debugging and profiling tools that can help make your programs faster and more correct. The most popular tool is Memcheck, which can detect many memory-related errors that are common in C and C++ programs.
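
As a minimal illustration of the kind of bug the sanitizers and Memcheck catch, the hypothetical C snippet below writes one element past the end of a heap allocation. An AddressSanitizer-instrumented build (for example, compiling with -fsanitize=address, where your compiler supports it) or a Valgrind Memcheck run would typically flag the out-of-bounds write and report the allocation site. The file name and sizes are illustrative only.

    /* buggy.c - out-of-bounds heap write (illustrative example).
     * A sanitizer-instrumented build (e.g., cc -g -fsanitize=address buggy.c,
     * compiler support permitting) or Valgrind's Memcheck would typically
     * report the invalid write and where the block was allocated. */
    #include <stdlib.h>

    int main(void) {
        int *a = malloc(10 * sizeof *a);   /* valid indices: 0..9 */
        for (int i = 0; i <= 10; i++)      /* bug: writes one element past the end */
            a[i] = i;
        free(a);
        return 0;
    }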

Debugging Tips

Use a simpler run configuration for debugging

In many cases, the same code failure can be reproduced with a smaller number of MPI tasks or a smaller problem size. If this is the case for your program, try debugging with a simpler run configuration, as it is easier to debug parallel code at a smaller scale. Note, however, that some applications switch algorithms depending on the number of MPI tasks, so a smaller run configuration may not reproduce the error. Also note that a large run configuration may not be allowed to run in the debug or interactive QOS because of the queue policy; if a problem happens only with a large run configuration, you can request a node reservation for debugging.

Try different debuggers

If you cannot find enough information about a code failure with one debugger, try another, as different debuggers have different capabilities and functionalities. For example, if DDT doesn't yield useful information for a code failure, try TotalView (and vice versa), as these tools complement each other. For a subtle error, trying different debuggers can help.

How to debug code hangs

When a code hangs, the first thing to determine is where it hangs. One way to find that out is to run the code with a full-fledged parallel debugger such as DDT or TotalView. When you suspect the code is hanging, pause the execution and check the call stacks to see where all the MPI processes are. You can let the program continue and halt it again (repeating as necessary) to see whether the same call stacks appear, which indicates that the code is indeed hanging.
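
For illustration only, the hypothetical C/MPI program below (meant to run with exactly 2 ranks) hangs because both ranks post a blocking receive before either one sends. Pausing such a run in DDT or TotalView would show every process sitting inside MPI_Recv.

    /* hang.c - minimal MPI deadlock for illustration (run with 2 ranks).
     * Both ranks block in MPI_Recv waiting for a message the other has
     * not yet sent, so neither ever reaches its MPI_Send. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, peer, sendbuf = 0, recvbuf = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;                       /* assumes exactly 2 ranks */

        /* Bug: both ranks block here waiting for the other's send. */
        MPI_Recv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }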

A simpler way is to use the STAT tool if an interactive batch session can be used. The tool samples stack traces over time and presents the results graphically, aggregated over processes, yielding important information about what each process is doing when the problem happens. You can run the STAT tool again to confirm whether similar stack traces are seen. With this tool, you can get an overall picture of a hang problem.

If an application cannot be run in the debug or interactive QOS, you can use ATP instead. When you suspect that your app in a non-interactive batch job is hanging, cancel the srun job step, which triggers generation of STAT results before the app is terminated. You can then view the stack traces for all the processes graphically.

Once you know where the code hangs, you may want to drill down into the problem area with DDT or TotalView to see why the hang happens.

How to debug a segmentation fault

A segmentation fault happens when the program attempts to access a memory address that it is not allowed to access (for example, accessing an array that has not yet been allocated). See the Wikipedia page.

When a code segfaults, rebuild the code for debugging and run it with DDT or TotalView. The code will stop when the error occurs, so you will know where it happens. You can then set a breakpoint shortly before that location and rerun under the debugger. When the program stops, examine variables to see where the invalid memory access is made.
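
A minimal sketch of the situation described above (the file name is hypothetical): the array below is used before it is ever allocated, producing a segmentation fault. Built with debugging enabled (for example, -g -O0), DDT or TotalView would typically stop at the faulting line, where inspecting the pointer shows it is still NULL.

    /* segv.c - minimal segmentation fault: the array is used before it
     * is allocated. A debugger stops at the faulting line, where the
     * pointer 'data' can be seen to be NULL. */
    #include <stdlib.h>

    int main(void) {
        double *data = NULL;        /* allocation forgotten / not yet done */
        data[0] = 3.14;             /* bug: write through a NULL pointer -> SIGSEGV */
        free(data);
        return 0;
    }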

Debugging memory bugs

If memory usage of a code keeps growing as the code runs, it is possibly due to memory leaks: the code allocates memory blocks but does not free them when they are no longer needed.

To check for memory leaks or other memory-related errors, use Valgrind's Memcheck tool, or LLVM's LeakSanitizer or MemorySanitizer.
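
As a minimal, hypothetical example of such a leak, the C program below allocates a work buffer in every call but never frees it, so memory usage grows as the loop runs. A Memcheck run (for example, valgrind --leak-check=full ./a.out) or a LeakSanitizer-instrumented build (e.g., cc -g -fsanitize=leak leak.c, where the compiler supports it) would typically report the leaked blocks and their allocation site.

    /* leak.c - minimal memory leak: the buffer allocated in each call to
     * step() is never freed, so the program's memory footprint keeps
     * growing as the loop runs. */
    #include <stdlib.h>
    #include <string.h>

    void step(void) {
        double *work = malloc(1000 * sizeof *work);  /* allocated every call ... */
        memset(work, 0, 1000 * sizeof *work);
        /* ... but never freed: memory leak */
    }

    int main(void) {
        for (int i = 0; i < 1000; i++)
            step();
        return 0;
    }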

HPE's aggregation tools, valgrind4hpc and sanitizers4hpc, make the output from these tools easier to understand by aggregating the results over MPI processes. Note that sanitizers4hpc doesn't support LLVM's MemorySanitizer.

DDT and TotalView have memory debugging features such as memory usage reports and detection of out-of-bounds array references. TotalView can also detect use of an uninitialized memory block.

Visualize arrays to get a quick visual hint

If a code has an error with MPI communication, array values often display irregular patterns (e.g., jagged topology) near the parallel domain boundaries. DDT and TotalView have a handy sub-tool that visualizes arrays, which can provide a quick visual hint about a problem without requiring you to analyze the values yourself. If such irregular patterns appear at domain boundaries, it often indicates incorrect halo exchanges with neighboring processes.

Arrays without proper initialization can sometimes display such irregular patterns, too.
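
As a serial sketch of this symptom (not an actual MPI halo exchange; the file name and sizes are illustrative), the program below computes the interior of a local array but never fills the halo cells at each end. Visualizing such an array in DDT or TotalView would show smooth interior values with garbage at the domain boundaries, and Memcheck or MemorySanitizer would typically flag the uninitialized reads.

    /* halo.c - sketch of the symptom: interior points are computed, but
     * the halo (ghost) cells at each end are never filled by an exchange
     * with neighboring ranks, so they hold uninitialized garbage. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 16                      /* interior points per process */

    int main(void) {
        /* indices 0 and N+1 are halo cells; 1..N are interior */
        double *u = malloc((N + 2) * sizeof *u);

        for (int i = 1; i <= N; i++)  /* interior is initialized ... */
            u[i] = (double)i / N;
        /* ... but u[0] and u[N+1] are never set (missing exchange) */

        for (int i = 0; i < N + 2; i++)
            printf("%2d  %g\n", i, u[i]);

        free(u);
        return 0;
    }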

TotalView's old ("classic") UI visualizes arrays using a wireframe plot, making it hard to see the true shape of the data. Use the modern UI instead, which uses surface rendering.

Debugging a code that fails only after running for hours

In this case, first try to create a restart file, from a correct run state, as close to the failure as possible. Then debug the app starting from the restart file. This will shorten the time to reach the failure.

Another option for debugging a long-running app is to do offline debugging in a non-interactive batch job.

Sometimes an app restarted from a restart file runs into the same error only after a similar amount of run time. This may indicate that the error is related to some kind of computational resource issue. If this happens, please open a ticket at https://help.nersc.gov and let us know so that we can report the issue to the appropriate vendor.