CUDA-GDB

Introduction

CUDA-GDB, an extension to GDB, is the NVIDIA tool for debugging CUDA applications. It allows simultaneous debugging of both GPU and CPU code within the same application. It supports debugging C/C++ and Fortran CUDA applications.

Compiling Applications

NVCC, the NVIDIA CUDA compiler driver, provides a mechanism for generating the debugging information necessary for CUDA-GDB to work properly. The -g -G option pair must be passed to NVCC when an application is compiled for debugging:

nvcc -g -G foo.cu -o foo

This forces -O0 compilation, with the exception of very limited dead-code eliminations and register-spilling optimizations.

To compile CUDA Fortran code with debugging information, the NVIDIA Fortran compiler, nvfortran, must be invoked with the -g option:

nvfortran -g foo.cuf -o foo

Kernel Focus

The existing GDB commands are unchanged. Every CUDA command or option is prefixed with the cuda keyword. For instance, the GDB commands to display the host threads and to switch to host thread 1 are, respectively:

(cuda-gdb) info threads
(cuda-gdb) thread 1

To display the CUDA threads and switch to cuda thread 1, the user only has to type:

(cuda-gdb) info cuda threads
(cuda-gdb) cuda thread 1

A CUDA application may be running several host threads and many device threads. To simplify the visualization of information about the state of the application, commands are applied to the entity in focus.

On the device side, the focus is set to the device thread. A device thread belongs to a block, which in turn belongs to a kernel. Thread, block, and kernel are the software coordinates of the focus. A device thread runs on a lane. A lane belongs to a warp, which belongs to an SM (Streaming Multiprocessor), which in turn belongs to a device. Lane, warp, SM, and device are the hardware coordinates of the focus. Software and hardware coordinates can be used interchangeably and simultaneously as long as they remain coherent.
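
For example, the focus can be switched with either set of coordinates (the coordinate values below are illustrative, not taken from the session later on this page):

(cuda-gdb) cuda kernel 0 block 0 thread 3      # software coordinates
(cuda-gdb) cuda device 0 sm 0 warp 0 lane 3    # hardware coordinates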

Debugging Example

Below is a CUDA-GDB debugging session using an example code provided in the manual, bitreverse.cu.

1  #include <stdio.h>
2  #include <stdlib.h>
3
4  // Simple 8-bit bit reversal Compute test
5
6  #define N 256
7
8  __global__ void bitreverse(void *data) {
9     unsigned int *idata = (unsigned int*)data;
10    extern __shared__ int array[];
11
12    array[threadIdx.x] = idata[threadIdx.x];
13
14    array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
15                        ((0x0f0f0f0f & array[threadIdx.x]) << 4);
16    array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
17                        ((0x33333333 & array[threadIdx.x]) << 2);
18    array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
19                         ((0x55555555 & array[threadIdx.x]) << 1);
20
21    idata[threadIdx.x] = array[threadIdx.x];
22 }
23
24 int main(void) {
25     void *d = NULL; int i;
26     unsigned int idata[N], odata[N];
27
28     for (i = 0; i < N; i++)
29         idata[i] = (unsigned int)i;
30
31     cudaMalloc((void**)&d, sizeof(int)*N);
32     cudaMemcpy(d, idata, sizeof(int)*N,
33                cudaMemcpyHostToDevice);
34
35     bitreverse<<<1, N, N*sizeof(int)>>>(d);
36
37     cudaMemcpy(odata, d, sizeof(int)*N,
38                cudaMemcpyDeviceToHost);
39
40     for (i = 0; i < N; i++)
41        printf("%u -> %u\n", idata[i], odata[i]);
42
43     cudaFree((void*)d);
44     return 0;
45 }
$ salloc -A <project> -C gpu -N 1 -G 1 -q debug -t 30
...

$ module load PrgEnv-nvidia
$ nvcc -g -G -o bitreverse bitreverse.cu       # Compile

$ cuda-gdb ./bitreverse                        # Debug with CUDA-GDB
...
(cuda-gdb) break main                          # Create a breakpoint at main
Breakpoint 1 at 0x4039ed: file bitreverse.cu, line 25.

(cuda-gdb) break bitreverse                    # Create a breakpoint at bitreverse (kernel)
Breakpoint 2 at 0x403c73: file bitreverse.cu, line 8.

(cuda-gdb) break 21                            # Create a breakpoint at line 21
Breakpoint 3 at 0x403c7f: file bitreverse.cu, line 22.

(cuda-gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00000000004039ed in main at bitreverse.cu:25
2       breakpoint     keep y   <MULTIPLE>
2.1                         n   0x0000000000403c73 in bitreverse(void*) at bitreverse.cu:8
3       breakpoint     keep y   <MULTIPLE>
3.1                         n   0x0000000000403c7f in bitreverse(void*) at bitreverse.cu:22

(cuda-gdb) run                                 # Will hit the breakpoint encountered first
Starting program: /pscratch/sd/e/elvis/CUDA-GDB/bitreverse
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, main () at bitreverse.cu:25
25     void *d = NULL; int i;

(cuda-gdb) continue                            # Until the device kernel is launched
Continuing.
[New Thread 0x7ffff4d13000 (LWP 1531499)]
[Detaching after fork from child process 1531501]
[New Thread 0x7fffe981c000 (LWP 1531942)]
[New Thread 0x7fffe901b000 (LWP 1532048)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Thread 1 "bitreverse" hit Breakpoint 2, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x7fffcf600000) at bitreverse.cu:12
12     array[threadIdx.x] = idata[threadIdx.x];

(cuda-gdb) info cuda threads                   # Verify the CUDA thread of focus
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count         Virtual PC      Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)    (255,0,0)   256 0x00007fffc74528b0 bitreverse.cu    12

(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status                       SMs Mask GridDim  BlockDim Invocation
*      0      -   0    1 Active 0x0000000000000000000000000001 (1,1,1) (256,1,1) bitreverse()

(cuda-gdb) print blockIdx
$1 = {x = 0, y = 0, z = 0}

(cuda-gdb) print threadIdx
$2 = {x = 0, y = 0, z = 0}

(cuda-gdb) print gridDim
$3 = {x = 1, y = 1, z = 1}

(cuda-gdb) print blockDim
$4 = {x = 256, y = 1, z = 1}

(cuda-gdb) next                                # Move to the next line
14     array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |

(cuda-gdb) continue                            # Until the next breakpoint
Continuing.

Thread 1 "bitreverse" hit Breakpoint 3, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x7fffcf600000) at bitreverse.cu:21
21     idata[threadIdx.x] = array[threadIdx.x];

(cuda-gdb) print array[0]@12                   # Check 12 elements
$5 = {0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208}

(cuda-gdb) print &data
$7 = (@generic void * @parameter *) 0x160


(cuda-gdb) cuda thread 170                     # Switch the focus
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (170,0,0), device 0, sm 0, warp 5, lane 10]
0x00007fffc74531c0  16     array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |

(cuda-gdb) cuda kernel block thread            # Display software coords
kernel 0, block (0,0,0), thread (170,0,0)

(cuda-gdb) cuda device sm warp lane            # Display hardware coords
device 0, sm 0, warp 5, lane 10

(cuda-gdb) ...DO SOMETHING...

(cuda-gdb) delete breakpoints
Delete all breakpoints? (y or n) y

(cuda-gdb) continue
Continuing.
0 -> 0
1 -> 128
2 -> 64
...
253 -> 191
254 -> 127
255 -> 255
[Thread 0x7ffff4d13000 (LWP 1531499) exited]
[Thread 0x7fffe901b000 (LWP 1532048) exited]
[Thread 0x7fffe981c000 (LWP 1531942) exited]
[Inferior 1 (process 1531326) exited normally]

(cuda-gdb) quit

Note that CUDA-GDB does not support watchpoints on CUDA (device) code. Watchpoints on host code are supported.
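
For example, once the program is stopped inside main in the bitreverse example above, a watchpoint on the host variable i works as in standard GDB:

(cuda-gdb) watch i                             # host variable -- supported
(cuda-gdb) continue                            # stops whenever i changes on the host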

Autostep

Autostep is a command that increases the precision of CUDA exception reporting to the exact lane and instruction where the exception occurred, when they would not otherwise be known.

Under normal execution, an exception may not be reported until several instructions after the exception occurred, or the exact thread where an exception occurred may not be known unless the exception is a lane error. However, the precise origin of the exception can be determined if the program is being single-stepped when the exception occurs. Single-stepping manually is a slow and tedious process; stepping takes much longer than normal execution and the user has to single-step each warp individually.

Autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur; these sections are then automatically and transparently single-stepped while the program is running. The rest of the program is executed normally to minimize the slowdown caused by single-stepping. The precise origin of an exception will be reported if the exception occurs within these sections.

To set a region for autostepping, run the autostep command as follows:

(cuda-gdb) autostep [LOCATION]
(cuda-gdb) autostep [LOCATION] for LENGTH [lines|instructions]

where

  • LOCATION may be anything that you use to specify the location of a breakpoint, such as a line number, function name, or an instruction address preceded by an asterisk.
  • LENGTH specifies the size of the autostep window in number of lines or instructions (lines and instructions can be shortened, e.g., l or i). If the length type is not specified, then lines is the default. If the for clause is omitted, then the default is 1 line.
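
For illustration, any of the following forms would be accepted (the kernel name and instruction address below are hypothetical):

(cuda-gdb) autostep mykernel                   # 1 line starting at the first line of mykernel
(cuda-gdb) autostep 21 for 5 lines             # 5 lines starting at line 21
(cuda-gdb) autostep *0x7fffc7452960 for 20 i   # 20 instructions starting at this address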

Autosteps and breakpoints share the same numbering, so most commands that work with breakpoints will also work with autosteps. info autosteps shows all breakpoints and autosteps.

$ nvcc -g -G autostep.cu -o autostep

$ cuda-gdb ./autostep

(cuda-gdb) autostep 11 for 2 lines
Breakpoint 1 at 0x7fffc7452960: file autostep.cu, line 11.
Created autostep of length 2 lines

(cuda-gdb) autostep 16 for 3 lines
Breakpoint 2 at 0x7fffc7452d00: file autostep.cu, line 16.
Created autostep of length 3 lines

(cuda-gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       autostep       keep y   0x00007fffc7452960 in example(int**) at autostep.cu:11 for 2 lines
2       autostep       keep y   0x00007fffc7452d00 in example(int**) at autostep.cu:16 for 3 lines

(cuda-gdb) run
Starting program: /pscratch/sd/e/elvis/autostep
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff4d13000 (LWP 1750738)]
[Detaching after fork from child process 1750739]
[New Thread 0x7fffe981c000 (LWP 1750756)]
[New Thread 0x7fffe901b000 (LWP 1750757)]
warning: Overlapping autostep 2 ignored

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7fffc7452e70 (autostep.cu:16)

Thread 1 "autostep" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Current focus set to CUDA kernel 1, grid 1, block (3,0,0), thread (32,0,0), device 0, sm 6, warp 1, lane 0]
Autostep precisely caught exception at autostep.cu:16 (0x7fffc7452e70)
[Switching focus to CUDA kernel 1, grid 1, block (3,0,0), thread (32,0,0), device 0, sm 6, warp 1, lane 0]
example<<<(8,1,1),(64,1,1)>>> (data=0x7fffcf640000) at autostep.cu:17
17    *(data[idx1]) = value3;

In the example, we correctly caught the exception at line 16. We see that the thread that threw the exception must have been in the same warp as block 3, thread 32. We have narrowed the scope of the error from 512 threads down to 32 threads.
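
At this point, the offending warp could be inspected further, for example (a sketch; idx1 and data come from the autostep.cu source, which is not shown in full here):

(cuda-gdb) print idx1                          # index used in the faulting access
(cuda-gdb) print data[idx1]                    # pointer being dereferenced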

MPI+CUDA Application

For debugging MPI+CUDA code, we strongly recommend using DDT or TotalView, which provide a more convenient debugging environment. If you still want to debug with CUDA-GDB for a small number of processes, you can use srun to launch an xterm window per task, each running CUDA-GDB:

srun -n 4 --gpus-per-task=1 xterm -e cuda-gdb ./my_app

Several xterm windows will then appear, one per task. You should do this within a NoMachine session so that the xterm windows can be displayed.
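
If you would rather debug only a single rank, a small wrapper script can limit the number of xterm windows. The sketch below assumes Slurm sets SLURM_PROCID for each task and runs only rank 0 under CUDA-GDB; all other ranks run normally:

#!/bin/bash
# debug_rank0.sh: run rank 0 under CUDA-GDB in an xterm, all other ranks as usual
if [ "${SLURM_PROCID}" -eq 0 ]; then
    exec xterm -e cuda-gdb ./my_app
else
    exec ./my_app
fi

Then launch the script with srun as before:

$ chmod +x debug_rank0.sh
$ srun -n 4 --gpus-per-task=1 ./debug_rank0.sh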

CUDA-GDB Documentation

This page is largely based on NVIDIA's CUDA-GDB manual. For more detailed information, please check the manual.