Compiler Diagnostic Reports and Annotated Assembly

Many compilers for CPUs and GPUs can emit human-readable diagnostics that show the programmer which optimizations the compiler was able to apply while compiling their code. Some can also produce annotated assembly, which combines these diagnostics with the generated assembly code. Here we provide a few examples from different compilers.

Intel

The flag -qopt-report instructs the Intel compilers to emit optimization diagnostic information:

LOOP BEGIN at /root/hpgmg/finite-volume/source/operators/restriction.c(50,10) inlined into /root/hpgmg/finite-volume/source/
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at /root/hpgmg/finite-volume/source/operators/restriction.c(51,10) inlined into /root/hpgmg/finite-volume/sour
   <Peeled loop for vectorization>
      remark #15301: PEEL LOOP WAS VECTORIZED
   LOOP END

   LOOP BEGIN at /root/hpgmg/finite-volume/source/operators/restriction.c(51,10) inlined into /root/hpgmg/finite-volume/sour
      remark #15300: LOOP WAS VECTORIZED
   LOOP END

   LOOP BEGIN at /root/hpgmg/finite-volume/source/operators/restriction.c(51,10) inlined into /root/hpgmg/finite-volume/sour
   <Remainder loop for vectorization>
      remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP END
LOOP END
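
As a rough illustration of where the peel, main, and remainder loops in such a report come from, consider the hypothetical kernel below (it is not part of HPGMG, and the name axpy is ours). Because the trip count n and the alignment of the pointers are unknown at compile time, a vectorizing compiler may split the loop into a peel loop that runs until the data is aligned, a vectorized main loop, and a remainder loop for the leftover iterations. Whether a particular compiler version does this, and how it words the remarks, is an assumption on our part rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel for illustration: y[i] += alpha * x[i].
   The restrict qualifiers tell the compiler that x and y do not overlap,
   which removes one common obstacle to vectorization. */
void axpy(size_t n, double alpha, const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        y[i] += alpha * x[i];
    }
}
```

Compiling a file like this with -c, an optimization level such as -O3, and -qopt-report should produce a report for the loop, in the same spirit as the one shown above.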

The Intel compilers can also emit annotated assembly. With the "classic" compilers, add the flags -S -fsource-asm. The annotations provide source code line and column numbers, CPU cycle and stall information, and register spills:

vmovsd    .L_2il0floatpacket.115(%rip), %xmm1           #176.70 c1
vmovsd    .L_2il0floatpacket.124(%rip), %xmm7           #176.39 c1
vgetmantsd $0, %xmm0, %xmm0, %xmm19                     #179.22 c1
vfmadd213sd %xmm0, %xmm3, %xmm1                         #176.39 c7 stall 2
vmovapd   %xmm7, %xmm3                                  #176.39 c7
vmovsd    200(%rsp), %xmm8                              #176.39[spill] c7
vmovsd    168(%rsp), %xmm6                              #176.39[spill] c7
vrcp28sd  %xmm19, %xmm19, %xmm20                        #179.22 c9
vgetmantsd $0, %xmm1, %xmm1, %xmm16                     #176.39 c13 stall 1
vgetexpsd %xmm1, %xmm1, %xmm2                           #176.39 c17 stall 1
vrcp28sd  %xmm16, %xmm16, %xmm17                        #176.39 c19
vfnmadd231sd {rn-sae}, %xmm19, %xmm20, %xmm7            #179.22 c23 stall 1
vfnmadd231sd {rn-sae}, %xmm16, %xmm17, %xmm3            #176.39 c27 stall 1

The newer LLVM-based Intel compilers, available in oneAPI beginning with version 2021.1, can also emit annotated assembly via the flags -S -fverbose-asm, which are inherited from Clang.

The precise behavior and latency of instructions on Intel CPU architectures are documented in the Intel Software Developer's Manuals. Intel also offers an interactive Intrinsics Guide with detailed information about each intrinsic.

NVIDIA HPC SDK

The NVIDIA HPC SDK compilers can emit diagnostic reports regarding several different kinds of optimization, including the generation of GPU-accelerated code with OpenMP or OpenACC directives. This information is emitted by adding the -Minfo=all flag during compilation, resulting in output like the following:

main:
      8, Memory set idiom, loop replaced by call to __c_mset4
      9, Memory set idiom, loop replaced by call to __c_mset4
     11, !$omp target loop
         11, Generating "nvkernel_MAIN__F1L11_1" GPU kernel
             Generating Tesla code
           12, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
         11, Generating Multicore code
           12, Loop parallelized across threads
     11, Generating implicit map(tofrom:a(:),c(:),b(:)) 
     12, Generated vector simd code for the loop
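
The report above comes from a Fortran program. As a hedged sketch of a C analog (hypothetical, not the code that produced the report; the name vector_add is ours), the function below offloads a simple loop with an OpenMP directive. Compiled with the NVIDIA C compiler with OpenMP GPU offload enabled and -Minfo=all, it would be expected to yield comparable kernel-generation and data-mapping messages, although the exact wording depends on the compiler version.

```c
#include <assert.h>

/* Hypothetical C analog of an offloaded loop; names are illustrative. */
void vector_add(int n, const float *a, const float *b, float *c)
{
    assert(n >= 0);
    /* Copy the inputs to the device and the result back to the host.
       The map clauses are written out explicitly here; the report above
       shows the compiler generating an implicit map instead. */
    #pragma omp target teams loop map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

When OpenMP is not enabled at compile time, the pragma is ignored and the loop simply runs serially on the host, so the function remains correct either way.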

Assembly code can be produced by adding the flag -Mkeepasm.

CCE

The C and C++ compilers in CCE use the flag -fsave-loopmark to generate a new file with the .lst extension which includes the compiler's optimization report embedded inline with the original source code. The CCE Fortran compiler uses the flag -h list=a to achieve the same result:

    1.              program main
    2.                implicit none
    3.              
    4.                integer, parameter :: sz = 2**26
    5.                real :: s
    6.                real, dimension(sz) :: a
    7.                integer :: i
    8.              
    9.    A------<>   a = 1.0
   10.              
   11.    M-------<   !$omp parallel do reduction(+:s)
   12.    M mVr4--<   do i = 1, sz
   13.    M mVr4        s = s + a(i)
   14.    M mVr4-->   end do
   15.    M------->   !$omp end parallel do
   16.              
   17.                print *, s
   18.              
   19.              end program main

ftn-6202 ftn: VECTOR MAIN, File = main.f90, Line = 9 
  A loop starting at line 9 was replaced by a library call.

ftn-6823 ftn: THREAD MAIN, File = main.f90, Line = 11 
  A region starting at line 11 and ending at line 15 was multi-threaded.

ftn-6005 ftn: SCALAR MAIN, File = main.f90, Line = 12 
  A loop starting at line 12 was unrolled 4 times.

ftn-6204 ftn: VECTOR MAIN, File = main.f90, Line = 12 
  A loop starting at line 12 was vectorized.

ftn-6817 ftn: THREAD MAIN, File = main.f90, Line = 12 
  A loop starting at line 12 was partitioned.
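
For the C and C++ compilers, a comparable listing can be requested with -fsave-loopmark. The sketch below is a hypothetical C translation of the Fortran reduction above (the name sum_array is ours). Unlike the Fortran listing, it initializes the accumulator explicitly before the reduction. The loopmark annotations that CCE would attach to it are not shown, since we have not reproduced them here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C analog of the Fortran reduction, for illustration only. */
float sum_array(const float *a, size_t n)
{
    assert(a != NULL);
    float s = 0.0f;  /* start the reduction from a defined value */
    /* Each thread accumulates a private partial sum; OpenMP combines them. */
    #pragma omp parallel for reduction(+:s)
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}
```

Compiling this file with OpenMP enabled and -fsave-loopmark should write a .lst file next to the object file, with markers along the left margin analogous to the M, V, and r markers in the Fortran listing.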

CCE can also emit annotated assembly when given the flag -S. Additionally, the CCE C and C++ compilers support the flag -fenhanced-asm=<N>, where <N> is an integer and higher values produce more verbose annotations:

vpermd  %ymm6, %ymm7, %ymm6     #  Depth 2 finite-volume/source/operators/boundary_fv.c:475:31[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vpaddd  %ymm4, %ymm6, %ymm8     #  Depth 2 finite-volume/source/operators/boundary_fv.c:475:31[ finite-volume/source/operators/boundary_fv.c:277:3 ]
leal    (%r8,%r8), %ecx         #  Depth 2 finite-volume/source/operators/boundary_fv.c:483:38[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vpbroadcastd    %ecx, %ymm4     #  Depth 2 finite-volume/source/operators/boundary_fv.c:483:36[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vpaddd  %ymm4, %ymm8, %ymm14    #  Depth 2 finite-volume/source/operators/boundary_fv.c:485:36[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vextracti128    $1, %ymm14, %xmm6 #  Depth 2 finite-volume/source/operators/boundary_fv.c:486:21[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vmovd   %xmm6, %ecx             #  Depth 2 finite-volume/source/operators/boundary_fv.c:486:21[ finite-volume/source/operators/boundary_fv.c:277:3 ]
movslq  %ecx, %rcx              #  Depth 2 finite-volume/source/operators/boundary_fv.c:486:21[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vmovsd  (%rax,%rcx,8), %xmm10   #  Depth 2 finite-volume/source/operators/boundary_fv.c:486:21[ finite-volume/source/operators/boundary_fv.c:277:3 ]

LLVM/clang

The clang C/C++ compilers allow the programmer to select which compiler reports to emit via the -Rpass flag, which accepts regular expressions. For example, the user can request the vectorization and function inlining reports:

mpicc -c   -DUSE_MPI=1 -DUSE_BICGSTAB=1 -DUSE_SUBCOMM=1 -DUSE_FCYCLES=1 -DUSE_GSRB=1 -O3 -fopenmp -march=skylake -Rpass='(loop-vectorize|inline)'   /global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c -o obj/finite-volume/source/level.o
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:622:45: remark: MALLOC inlined into build_exchange_ghosts with (cost=30, threshold=375) at callsite build_exchange_ghosts:124 [-Rpass=inline]
    if(stage==1)all_send_buffers = (double*)MALLOC(TotalBufferSize*sizeof(double)); // allocate in bulk
                                            ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:812:45: remark: MALLOC inlined into build_exchange_ghosts with (cost=30, threshold=375) at callsite build_exchange_ghosts:314 [-Rpass=inline]
    if(stage==1)all_recv_buffers = (double*)MALLOC(TotalBufferSize*sizeof(double)); // allocate in bulk
                                            ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:963:45: remark: MALLOC inlined into create_vectors with (cost=30, threshold=375) at callsite create_vectors:34 [-Rpass=inline]
    level->my_boxes[box].fp_base = (double*)MALLOC(malloc_size);
                                            ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:992:22: remark: FREE inlined into create_vectors with (cost=0, threshold=375) at callsite create_vectors:63 [-Rpass=inline]
      if(old_fp_base)FREE(old_fp_base); // free old FP data
                     ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1312:67: remark: FREE inlined into destroy_level with (cost=0, threshold=375) at callsite destroy_level:7 [-Rpass=inline]
  for(i=0;i<level->num_my_boxes;i++)if(level->my_boxes[i].fp_base)FREE(level->my_boxes[i].fp_base);
                                                                  ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1320:27: remark: FREE inlined into destroy_level with (cost=0, threshold=375) at callsite destroy_level:15 [-Rpass=inline]
  if(level->fluxes       )FREE(level->fluxes       );
                          ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1337:50: remark: FREE inlined into destroy_level with (cost=0, threshold=375) at callsite destroy_level:32 [-Rpass=inline]
    if(level->exchange_ghosts[i].recv_buffers[0])FREE(level->exchange_ghosts[i].recv_buffers[0]); // allocated in bulk
                                                 ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1343:50: remark: FREE inlined into destroy_level with (cost=0, threshold=375) at callsite destroy_level:38 [-Rpass=inline]
    if(level->exchange_ghosts[i].send_buffers[0])FREE(level->exchange_ghosts[i].send_buffers[0]); // allocated in bulk
                                                 ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:139:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
    for(k=klo;k<klo+kdim;k++){
    ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:624:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
    for(neighbor=0;neighbor<numSendRanks;neighbor++){
    ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:814:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
    for(neighbor=0;neighbor<numRecvRanks;neighbor++){
    ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:989:7: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
      #pragma omp parallel for
      ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1137:3: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
  for(box=0;box<level->boxes_in.i*level->boxes_in.j*level->boxes_in.k;box++){level->rank_of_box[box]=-1;}  // -1 denotes that there is no actual box assigned to this region
  ^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1226:5: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
    for(i=0-level->box_ghosts;i<level->box_dim+level->box_ghosts;i++){

Additional information about the -Rpass flag is provided in the Clang Compiler User's Manual. Supplying the additional flag -fsave-optimization-record saves the optimization records to a file, which is written in YAML format by default.
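
To experiment with these remarks on something smaller than HPGMG, a minimal sketch such as the following may help. The file and its function names (square, square_all) are hypothetical. Compiled with clang at -O3 and -Rpass='(loop-vectorize|inline)', one would expect a remark about square being inlined and another about the loop being vectorized, though the exact output varies with the clang version and target. The related flags -Rpass-missed and -Rpass-analysis report optimizations that were not applied and the reasons why.

```c
#include <assert.h>
#include <stddef.h>

/* Small helper that is a natural candidate for inlining. */
static inline double square(double v)
{
    return v * v;
}

/* Hypothetical kernel: the call to square() sits inside a simple loop
   with non-overlapping arrays, so the loop is a candidate for
   vectorization once the call has been inlined. */
void square_all(size_t n, const double *restrict in, double *restrict out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = square(in[i]);
    }
}
```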

clang emits assembly code when given the -S flag during compilation. It can also emit LLVM's Intermediate Representation (IR) when given the flags -S -emit-llvm; the IR looks like the following:

88:                                               ; preds = %84, %88
  %89 = phi i64 [ 0, %84 ], [ %102, %88 ]
  %90 = load double**, double*** %85, align 8, !tbaa !324
  %91 = getelementptr inbounds double*, double** %90, i64 %89
  %92 = bitcast double** %91 to i8**
  %93 = load i8*, i8** %92, align 8, !tbaa !15
  %94 = load i32*, i32** %86, align 8, !tbaa !325
  %95 = getelementptr inbounds i32, i32* %94, i64 %89
  %96 = load i32, i32* %95, align 4, !tbaa !16
  %97 = load i32*, i32** %87, align 8, !tbaa !326
  %98 = getelementptr inbounds i32, i32* %97, i64 %89
  %99 = load i32, i32* %98, align 4, !tbaa !16
  %100 = getelementptr inbounds %struct.ompi_request_t*, %struct.ompi_request_t** %19, i64 %89
  %101 = tail call i32 @MPI_Isend(i8* %93, i32 %96, %struct.ompi_datatype_t* bitcast (%struct.ompi_predefined_datatype_t* @ompi_mpi_double to %struct.ompi_datatype_t*), i32 %99, i32 %10, %struct.ompi_communicator_t* bitcast (%struct.ompi_predefined_communicator_t* @ompi_mpi_comm_world to %struct.ompi_communicator_t*), %struct.ompi_request_t** %100) #15
  %102 = add nuw nsw i64 %89, 1
  %103 = load i32, i32* %11, align 4, !tbaa !316
  %104 = sext i32 %103 to i64
  %105 = icmp slt i64 %102, %104
  br i1 %105, label %88, label %106

GCC

The GCC compilers provide a variety of flags, such as -fopt-info, to control which compiler diagnostics are emitted; these flags are documented on the GCC Developer Options page. An example is provided below:

mpicc -c   -DUSE_MPI=1 -DUSE_BICGSTAB=1 -DUSE_SUBCOMM=1 -DUSE_FCYCLES=1 -DUSE_GSRB=1 -O3 -march=skylake -fopt-info   /global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c -o obj/finite-volume/source/mg.o
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:93:3: note: Loop 1 distributed: split to 0 loops and 1 library calls.
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:95:3: note: Loop 3 distributed: split to 0 loops and 1 library calls.
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:96:3: note: Loop 4 distributed: split to 0 loops and 1 library calls.
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:94:3: note: loop vectorized
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:94:3: note: loop with 2 iterations completely unrolled (header execution count 64530389)
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:54:6: note: loop with 4 iterations completely unrolled (header execution count 38751186)
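
The "split to 0 loops and 1 library calls" notes above typically indicate that GCC recognized a loop as equivalent to a library routine such as memset and replaced it with a call. The hypothetical function below (not taken from HPGMG; the name reset_and_scale is ours) contains one loop of that kind and one ordinary arithmetic loop. Compiled with -O3 and -fopt-info, it would plausibly produce a "distributed" note for the first loop and a "loop vectorized" note for the second, but the actual messages depend on the GCC version and target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example for illustration; names are ours. */
void reset_and_scale(size_t n, double alpha,
                     double *restrict a, double *restrict b)
{
    assert(a != NULL && b != NULL);

    /* Zero fill: a candidate for replacement by a memset call. */
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
    }

    /* Element-wise scaling: a candidate for vectorization. */
    for (size_t i = 0; i < n; i++) {
        b[i] = alpha * b[i];
    }
}
```

The more specific variants of the flag, such as -fopt-info-vec-missed, can be used to narrow the output to a single class of optimization or to those that were not applied.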