Compute Sanitizer¶

Introduction¶

Compute Sanitizer, previously known as CUDA-MEMCHECK, is a functional correctness checking suite included in the CUDA toolkit. This suite contains multiple tools that can perform different type of checks.

Memcheck: The memory access error and leak detection tool
Racecheck: The shared memory data access hazard detection tool
Initcheck: The uninitialized device global memory access detection tool
Synccheck: The thread synchronization hazard detection tool

Compiling Applications¶

The -G option to nvcc forces the compiler to generate debug information for the CUDA application. Alternatively, the -lineinfo nvcc option can be used to generate line number information for applications without affecting the optimization level.

The stack backtrace feature of the Compute Sanitizer tools is more useful when the application contains function symbol names. For the host backtrace, the host compiler must be given the -rdynamic option to retain function symbols. When using nvcc, flags to the host compiler can be specified using the -Xcompiler option. For the device backtrace, the full frame information is only available when the application is compiled with device debug information.

An example compile command to build with function symbols is:

nvcc -G -Xcompiler -rdynamic -o myapp myapp.cu

Run App with Compute Sanitizer¶

Compute Sanitizer tools can be invoked by running the compute-sanitizer executable as follows:

compute-sanitizer [options] app_name [app_options]

For an MPI code, you can prepend srun to the command:

srun [srun_options] compute-sanitizer [sanitizer_options] app_name [app_options]

Example codes are provided on the compute-sanitizer GitHub repository.

Memcheck Tool¶

This tool detects the following errors:

Memory access error: Errors due to out of bounds or misaligned accesses to memory by a global, local, shared or global atomic access
Hardware exception: Errors reported by the hardware error reporting mechanism
Malloc/Free errors: Errors that occur due to incorrect use of malloc() / free() in CUDA kernels
CUDA API errors: Reported when a CUDA API call in the application returns a failure
cudaMalloc memory leaks: Allocations of device memory using cudaMalloc() that have not been freed by the application
Device Heap Memory Leaks: Allocations of device memory using malloc() in device code that have not been freed by the application

You can run the tool as follows:

compute-sanitizer [sanitizer_options] ./app_name

The tool name is not specified above as memcheck is the default tool. But it can be specified explicitly:

compute-sanitizer --tool memcheck [sanitizer_options] ./app_name

A run example is shown below:

$ cat memcheck_demo.cu
/* Copyright (c) 2021-2023, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

#include <iostream>

__device__ int x;

__global__ void unaligned_kernel(void)
{
    *(int*) ((char*)&x + 1) = 42;
}

__device__ void out_of_bounds_function(void)
{
    *(int*) 0x87654320 = 42;
}

__global__ void out_of_bounds_kernel(void)
{
    out_of_bounds_function();
}
tatic void run_unaligned(void)
{
    std::cout << "Running unaligned_kernel: ";
    unaligned_kernel<<<1,1>>>();
    std::cout << cudaGetErrorString(cudaDeviceSynchronize()) << std::endl;
}

static void run_out_of_bounds(void)
{
    std::cout << "Running out_of_bounds_kernel: ";
    out_of_bounds_kernel<<<1,1>>>();
    std::cout << cudaGetErrorString(cudaDeviceSynchronize()) << std::endl;
}

int main() {
    int *devMem = nullptr;

    std::cout << "Mallocing memory" << std::endl;
    cudaMalloc((void**)&devMem, 1024);

    run_unaligned();
    run_out_of_bounds();

    // Omitted to demo leakcheck
    // cudaFree(devMem);

    return 0;
}

$ nvcc -G -Xcompiler -rdynamic -o memcheck_demo memcheck_demo.cu

$ compute-sanitizer --destroy-on-device-error kernel ./memcheck_demo
========= COMPUTE-SANITIZER
Mallocing memory
========= Invalid __global__ write of size 4 bytes
=========     at 0x160 in /pscratch/sd/e/elvis/Memcheck/memcheck_demo.cu:34:unaligned_kernel()
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f607b400001 is misaligned
=========     and is inside the nearest allocation at 0x7f607b400000 of size 4 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331380]
=========                in /usr/local/cuda-12.2/compat/libcuda.so.1
...
=========
Running unaligned_kernel: no error
========= Invalid __global__ write of size 4 bytes
=========     at 0xb0 in /pscratch/sd/e/elvis/Memcheck/memcheck_demo.cu:39:out_of_bounds_function()
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x87654320 is out of bounds
=========     and is 140,050,087,722,208 bytes before the nearest allocation at 0x7f607b200000 of size 1,024 bytes
=========     Device Frame:/pscratch/sd/e/elvis/Memcheck/memcheck_demo.cu:44:out_of_bounds_kernel() [0x30]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331380]
=========                in /usr/local/cuda-12.2/compat/libcuda.so.1
...
=========
Running out_of_bounds_kernel: no error
========= ERROR SUMMARY: 2 errors

The --destroy-on-device-error kernel option above is to terminate kernel with a memory access error.

For detecting device side memory leaks, add the --leak-check full option:

$ compute-sanitizer --destroy-on-device-error kernel --leak-check=full ./memcheck_demo
========= COMPUTE-SANITIZER
...
=========
Running out_of_bounds_kernel: no error
========= Leaked 1,024 bytes at 0x7fa4cb200000
=========     Saved host backtrace up to driver entry point at allocation time
=========     Host Frame: [0x2d7b6f]
=========                in /usr/local/cuda-12.2/compat/libcuda.so.1
=========     Host Frame:__cudart607 [0x3d48e]
=========                in /pscratch/sd/e/elvis/Memcheck/memcheck_demo
=========     Host Frame:__cudart616 [0xa26b]
=========                in /pscratch/sd/e/elvis/Memcheck/memcheck_demo
=========     Host Frame:cudaMalloc [0x4b182]
=========                in /pscratch/sd/e/elvis/Memcheck/memcheck_demo
=========     Host Frame:main [0x3dd6]
=========                in /pscratch/sd/e/elvis/Memcheck/memcheck_demo
=========     Host Frame:__libc_start_main [0x3524d]
=========                in /lib64/libc.so.6
=========     Host Frame:../sysdeps/x86_64/start.S:122:_start [0x3b5a]
=========                in /pscratch/sd/e/elvis/Memcheck/memcheck_demo
=========
========= LEAK SUMMARY: 1024 bytes leaked in 1 allocations
========= ERROR SUMMARY: 3 errors

Using the --padding option will automatically extend the allocation size, effectively creating a padding buffer after each allocation. For example, with --padding 32, every allocation is followed by a 32-bytes padding buffer. This improves the out of bounds error detection as accesses to the padding area will always be considered invalid.

Racecheck Tool¶

The racecheck tool detects memory access race conditions in shared memory.

A data access hazard is a case where two threads attempt to access the same location in memory resulting in non-deterministic behavior. The behavior or the output of the application depends on the order in which all parallel threads are executed by the hardware, which can create intermittent application failures.

The racecheck tool identifies three types of canonical hazards in a program:

Write-after-Write (WAW) hazards: two threads attempt to write data to the same memory
Write-After-Read (WAR) hazards: two threads access the same memory location, with one thread performing a read and another a write; in this case, the writing thread is ordered before the reading thread
Read-After-Write (RAW) hazards: two threads access the same memory location, with one thread performing a read and another a write; in this case, the reading thread reads the value before the writing thread commits it

You can run as follows:

compute-sanitizer --tool racecheck [sanitizer_options] ./app_name

The tool can produce two types of output:

Hazard reports: detailed information about one particular hazard; the tool is byte accurate and, therefore, can produce huge output containing messages for each byte on which a hazard was detected
Analysis reports: high-level reports produced by analyzing multiple hazard reports

The --racecheck-report option controls which type will be used. Available values are hazard, analysis (default) and all.

A hazard report example is shown below:

$ compute-sanitizer --tool racecheck --racecheck-report hazard ./block_error
========= COMPUTE-SANITIZER
========= Warning: (Warp Level Programming) Potential RAW hazard detected at __shared__ 0x4 in block (0,0,0) :
=========     Write Thread (1,0,0) at 0x460 in /pscratch/sd/e/elvis/Racecheck/block_error.cu:41:sumKernel(int *, int *)
=========     Read Thread (0,0,0) at 0x7a0 in /pscratch/sd/e/elvis/Racecheck/block_error.cu:51:sumKernel(int *, int *)
=========     Current Value : 1
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331380]
=========                in /usr/local/cuda-12.2/compat/libcuda.so.1
...
=========     Host Frame:../sysdeps/x86_64/start.S:122:_start [0x38da]
=========                in /pscratch/sd/e/elvis/Racecheck/./block_error
=========
========= Warning: (Warp Level Programming) Potential RAW hazard detected at __shared__ 0x5 in block (0,0,0) :
=========     Write Thread (1,0,0) at 0x460 in /pscratch/sd/e/elvis/Racecheck/block_error.cu:41:sumKernel(int *, int *)
=========     Read Thread (0,0,0) at 0x7a0 in /pscratch/sd/e/elvis/Racecheck/block_error.cu:51:sumKernel(int *, int *)
=========     Current Value : 0
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331380]
=========                in /usr/local/cuda-12.2/compat/libcuda.so.1
...
=========
========= RACECHECK SUMMARY: 100 hazards displayed (384 errors, 124 warnings)

An analysis report example is shown below:

$ compute-sanitizer --tool racecheck --racecheck-report analysis ./block_error
========= COMPUTE-SANITIZER
========= Error: Race reported between Write access at 0x460 in /pscratch/sd/e/elvis/Racecheck/block_error.cu:41:sumKernel(int *, int *)
=========     and Read access at 0x7a0 in /pscratch/sd/e/elvis/Racecheck/block_error.cu:51:sumKernel(int *, int *) [508 hazards]
=========
========= RACECHECK SUMMARY: 1 hazard displayed (1 error, 0 warnings)

Once racecheck has identified a hazard, the user can modify their code to run the operations in the deterministically order, by inserting a __syncthreads() call. To avoid races between threads within a single warp, __syncwarp() can be used.

Racecheck supports synchronization through cuda::barrier on Ampere GPUs (Perlmutter GPUs) and newer. The number of barriers tracked by the tool is based on the number of barriers present in the source code as reported by compiler information, which can sometimes undercount. For that the --num-cuda-barriers option can be used to indicate the number of expected barriers in the source code.

Racecheck supports race detection on shared memory for asynchronous memory copy operations from global to shared memory. These can take the form of CUDA C++ cuda::memcpy_async or the PTX cp.async. In these cases, individual hazards when using --racecheck-report hazard will show invalid memcpy_async synchronization. These checks can be disabled by using --racecheck-memcpy-async no.

Initcheck Tool¶

The initcheck tool can identify when device global memory is accessed without it being initialized via device side writes, or via CUDA memcpy and memset API calls.

You can run the tool as follows:

compute-sanitizer --tool initcheck [sanitizer_options] ./app_name

An example is shown below:

$ compute-sanitizer --tool initcheck ./memset_error
========= COMPUTE-SANITIZER
========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x250 in /pscratch/sd/e/elvis/Initcheck/memset_error.cu:41:vectorAdd(int *)
=========     by thread (16,0,0) in block (0,0,0)
=========     Address 0x7f88f3200040
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331380]
=========                in /usr/local/cuda-12.2/compat/libcuda.so.1
...
========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x250 in /pscratch/sd/e/elvis/Initcheck/memset_error.cu:41:vectorAdd(int *)
=========     by thread (31,0,0) in block (1,0,0)
=========     Address 0x7f88f32000fc
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331380]
=========                in /usr/local/cuda-12.2/compat/libcuda.so.1
...
=========
========= ERROR SUMMARY: 48 errors

The initcheck tool can also be used to detect unused memory by using the --track-unused-memory on option. The behavior for this feature can be adjusted with the --unused-memory-threshold option which takes the minimum percentage at which reports should be printed.

Synccheck Tool¶

The synccheck tool is a runtime tool that can identify whether a CUDA application is correctly using synchronization primitives, specifically __syncthreads() and __syncwarp() intrinsics and their Cooperative Groups API counterparts.

You can run the tool as follows:

compute-sanitizer --tool synccheck [sanitizer_options] ./app_name

The following error classes can be reported

Divergent thread(s) in block: Divergence between threads within a block was detected for a barrier that does not support this on the current architecture. For example, this occurs when __syncthreads() is used within conditional code but the conditional does not evaluate equally across all threads in the block.
Divergent thread(s) in wrap: Divergence between threads within a single warp was detected for a barrier that does not support this on the current architecture
Invalid arguments: A barrier instruction or primitive was used with invalid arguments. This can occur for example if not all threads reaching a __syncwarp() declare themselves in the mask parameter.

A sample synccheck report is below:

$ compute-sanitizer --tool synccheck ./illegal_syncwarp
========= COMPUTE-SANITIZER
========= Barrier error detected. Invalid arguments.
=========     at 0x110 in /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/12.2/include/sm_30_intrinsics.hpp:110:__syncwarp(unsigned int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Device Frame:/pscratch/sd/e/elvis/Synccheck/illegal_syncwarp.cu:48:myKernel(int *) [0x3c0]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331380]
=========                in /usr/local/cuda-12.2/compat/libcuda.so.1
...
========= Barrier error detected. Invalid arguments.
=========     at 0x110 in /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/12.2/include/sm_30_intrinsics.hpp:110:__syncwarp(unsigned int)
=========     by thread (16,0,0) in block (0,0,0)
=========     Device Frame:/pscratch/sd/e/elvis/Synccheck/illegal_syncwarp.cu:48:myKernel(int *) [0x3c0]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x331380]
=========                in /usr/local/cuda-12.2/compat/libcuda.so.1
...
=========
========= ERROR SUMMARY: 17 errors

Synccheck supports synchronization through cuda::barrier. Again the number of barriers can be undercounted, and the --num-cuda-barriers option can be used to indicate the number of expected barriers in the source code.

More Info on Compute Sanitizer¶

More info on Compute Sanitizer can be found in the user manual.