Compiler Diagnostic Reports and Annotated Assembly¶
Many compilers, for both CPUs and GPUs, can emit human-readable diagnostics that show the programmer which optimizations the compiler was able to apply while compiling their code. Some can also produce annotated assembly, which combines these diagnostics with the generated assembly code. Here we provide a few examples from different compilers.
Intel¶
The flag -qopt-report instructs the Intel compilers to emit optimization diagnostic information:
LOOP BEGIN at /root/hpgmg/finite-volume/source/operators/restriction.c(50,10) inlined into /root/hpgmg/finite-volume/source/
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at /root/hpgmg/finite-volume/source/operators/restriction.c(51,10) inlined into /root/hpgmg/finite-volume/sour
<Peeled loop for vectorization>
remark #15301: PEEL LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at /root/hpgmg/finite-volume/source/operators/restriction.c(51,10) inlined into /root/hpgmg/finite-volume/sour
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at /root/hpgmg/finite-volume/source/operators/restriction.c(51,10) inlined into /root/hpgmg/finite-volume/sour
<Remainder loop for vectorization>
remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END
LOOP END
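To try the report on something small, one could compile a kernel like the sketch below. The function restrict_1d and its arguments are hypothetical, loosely modeled on a restriction operator rather than taken from HPGMG, and whether the compiler peels the loop, vectorizes it, or emits a remainder loop depends on the compiler version and target architecture.

```c
#include <stddef.h>

/* Hypothetical 1-D restriction kernel (not HPGMG code): each coarse cell
 * is the average of the two fine cells it covers. The `restrict`
 * qualifiers promise the compiler that the arrays do not overlap, which
 * removes one common obstacle to vectorization. */
void restrict_1d(double *restrict coarse, const double *restrict fine,
                 size_t n_coarse)
{
    for (size_t i = 0; i < n_coarse; i++) {
        coarse[i] = 0.5 * (fine[2 * i] + fine[2 * i + 1]);
    }
}
```

Compiling this file with -c -O3 -qopt-report would be expected to yield a report with a LOOP BEGIN / LOOP END entry for the loop, along with remarks explaining what the vectorizer decided.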
The Intel compilers can also emit annotated assembly. In the "classic" compilers, one can add the flags -S -fsource-asm. The annotations provide source code line and column numbers, CPU cycle and stall information, and markers for register spills:
vmovsd .L_2il0floatpacket.115(%rip), %xmm1 #176.70 c1
vmovsd .L_2il0floatpacket.124(%rip), %xmm7 #176.39 c1
vgetmantsd $0, %xmm0, %xmm0, %xmm19 #179.22 c1
vfmadd213sd %xmm0, %xmm3, %xmm1 #176.39 c7 stall 2
vmovapd %xmm7, %xmm3 #176.39 c7
vmovsd 200(%rsp), %xmm8 #176.39[spill] c7
vmovsd 168(%rsp), %xmm6 #176.39[spill] c7
vrcp28sd %xmm19, %xmm19, %xmm20 #179.22 c9
vgetmantsd $0, %xmm1, %xmm1, %xmm16 #176.39 c13 stall 1
vgetexpsd %xmm1, %xmm1, %xmm2 #176.39 c17 stall 1
vrcp28sd %xmm16, %xmm16, %xmm17 #176.39 c19
vfnmadd231sd {rn-sae}, %xmm19, %xmm20, %xmm7 #179.22 c23 stall 1
vfnmadd231sd {rn-sae}, %xmm16, %xmm17, %xmm3 #176.39 c27 stall 1
The newer LLVM-based Intel compilers, introduced with oneAPI in version 2021.1, can also emit annotated assembly by using the flags -S -fverbose-asm, which are inherited from Clang.
Documentation regarding the precise behavior and latency of assembly instructions targeting Intel CPU architectures is provided in the Intel 64 and IA-32 Architectures Software Developer's Manuals. Intel also provides an interactive Intrinsics Guide with detailed information about each intrinsic.
NVIDIA HPC SDK¶
The NVIDIA HPC SDK compilers can emit diagnostic reports regarding several different kinds of optimization, including the generation of GPU-accelerated code with OpenMP or OpenACC directives. This information is emitted by adding the -Minfo=all flag during compilation, resulting in output like the following:
main:
8, Memory set idiom, loop replaced by call to __c_mset4
9, Memory set idiom, loop replaced by call to __c_mset4
11, !$omp target loop
11, Generating "nvkernel_MAIN__F1L11_1" GPU kernel
Generating Tesla code
12, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
11, Generating Multicore code
12, Loop parallelized across threads
11, Generating implicit map(tofrom:a(:),c(:),b(:))
12, Generated vector simd code for the loop
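The report above comes from a Fortran program whose source is not shown here. As a rough C counterpart to experiment with, the sketch below offloads a simple loop with an OpenMP target directive. The function vector_add and the chosen map clauses are assumptions made for illustration, and the messages printed by -Minfo=all for it will differ from the ones above.

```c
#include <stddef.h>

/* Hypothetical C counterpart of an offloaded loop (not the program that
 * produced the report above). When OpenMP is not enabled, the pragma is
 * ignored and the loop runs serially on the host, so the function
 * computes the same result either way. */
void vector_add(float *restrict c, const float *restrict a,
                const float *restrict b, size_t n)
{
    /* Assumed data mapping: copy inputs to the device, copy result back. */
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

With the NVIDIA compilers, OpenMP offload is typically enabled with -mp=gpu; adding -Minfo=all should then report how the loop was mapped onto the GPU and which data movement was generated.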
Assembly code can be obtained by adding the flag -Mkeepasm, which keeps the intermediate assembly file generated during compilation.
CCE¶
The C and C++ compilers in the Cray Compiling Environment (CCE) use the flag -fsave-loopmark to generate a new file with the .lst extension, which contains the compiler's optimization report embedded inline with the original source code. The CCE Fortran compiler uses the flag -h list=a to achieve the same result:
1. program main
2. implicit none
3.
4. integer, parameter :: sz = 2**26
5. real :: s
6. real, dimension(sz) :: a
7. integer :: i
8.
9. A------<> a = 1.0
10.
11. M-------< !$omp parallel do reduction(+:s)
12. M mVr4--< do i = 1, sz
13. M mVr4 s = s + a(i)
14. M mVr4--> end do
15. M-------> !$omp end parallel do
16.
17. print *, s
18.
19. end program main
ftn-6202 ftn: VECTOR MAIN, File = main.f90, Line = 9
A loop starting at line 9 was replaced by a library call.
ftn-6823 ftn: THREAD MAIN, File = main.f90, Line = 11
A region starting at line 11 and ending at line 15 was multi-threaded.
ftn-6005 ftn: SCALAR MAIN, File = main.f90, Line = 12
A loop starting at line 12 was unrolled 4 times.
ftn-6204 ftn: VECTOR MAIN, File = main.f90, Line = 12
A loop starting at line 12 was vectorized.
ftn-6817 ftn: THREAD MAIN, File = main.f90, Line = 12
A loop starting at line 12 was partitioned.
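Since the listing above is Fortran, a C rendering of the same reduction may be useful for trying -fsave-loopmark with the C compiler. This is a sketch only: the function name sum_array and its signature are illustrative, and the loopmark annotations produced for it are not guaranteed to match the Fortran ones.

```c
#include <stddef.h>

/* C rendering of the reduction loop from the Fortran listing above.
 * The name and signature are illustrative. When OpenMP is not enabled,
 * the pragma is ignored and the sum is computed serially. */
float sum_array(const float *a, size_t n)
{
    float s = 0.0f;
    #pragma omp parallel for reduction(+:s)
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}
```

Compiling this with -fsave-loopmark, together with the compiler's OpenMP flag, should write a .lst file in which the loop is marked up in a similar way to the Fortran example.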
CCE can also emit annotated assembly when given the flag -S. Additionally, the CCE C and C++ compilers support the flag -fenhanced-asm=<N>, where <N> is an integer; higher values produce more verbose annotations:
vpermd %ymm6, %ymm7, %ymm6 # Depth 2 finite-volume/source/operators/boundary_fv.c:475:31[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vpaddd %ymm4, %ymm6, %ymm8 # Depth 2 finite-volume/source/operators/boundary_fv.c:475:31[ finite-volume/source/operators/boundary_fv.c:277:3 ]
leal (%r8,%r8), %ecx # Depth 2 finite-volume/source/operators/boundary_fv.c:483:38[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vpbroadcastd %ecx, %ymm4 # Depth 2 finite-volume/source/operators/boundary_fv.c:483:36[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vpaddd %ymm4, %ymm8, %ymm14 # Depth 2 finite-volume/source/operators/boundary_fv.c:485:36[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vextracti128 $1, %ymm14, %xmm6 # Depth 2 finite-volume/source/operators/boundary_fv.c:486:21[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vmovd %xmm6, %ecx # Depth 2 finite-volume/source/operators/boundary_fv.c:486:21[ finite-volume/source/operators/boundary_fv.c:277:3 ]
movslq %ecx, %rcx # Depth 2 finite-volume/source/operators/boundary_fv.c:486:21[ finite-volume/source/operators/boundary_fv.c:277:3 ]
vmovsd (%rax,%rcx,8), %xmm10 # Depth 2 finite-volume/source/operators/boundary_fv.c:486:21[ finite-volume/source/operators/boundary_fv.c:277:3 ]
LLVM/clang¶
The clang C/C++ compilers allow the programmer to select which compiler reports to emit via the -Rpass flag, which accepts regular expressions. For example, the user can request the vectorization and function inlining reports:
mpicc -c -DUSE_MPI=1 -DUSE_BICGSTAB=1 -DUSE_SUBCOMM=1 -DUSE_FCYCLES=1 -DUSE_GSRB=1 -O3 -fopenmp -march=skylake -Rpass='(loop-vectorize|inline)' /global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c -o obj/finite-volume/source/level.o
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:622:45: remark: MALLOC inlined into build_exchange_ghosts with (cost=30, threshold=375) at callsite build_exchange_ghosts:124 [-Rpass=inline]
if(stage==1)all_send_buffers = (double*)MALLOC(TotalBufferSize*sizeof(double)); // allocate in bulk
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:812:45: remark: MALLOC inlined into build_exchange_ghosts with (cost=30, threshold=375) at callsite build_exchange_ghosts:314 [-Rpass=inline]
if(stage==1)all_recv_buffers = (double*)MALLOC(TotalBufferSize*sizeof(double)); // allocate in bulk
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:963:45: remark: MALLOC inlined into create_vectors with (cost=30, threshold=375) at callsite create_vectors:34 [-Rpass=inline]
level->my_boxes[box].fp_base = (double*)MALLOC(malloc_size);
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:992:22: remark: FREE inlined into create_vectors with (cost=0, threshold=375) at callsite create_vectors:63 [-Rpass=inline]
if(old_fp_base)FREE(old_fp_base); // free old FP data
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1312:67: remark: FREE inlined into destroy_level with (cost=0, threshold=375) at callsite destroy_level:7 [-Rpass=inline]
for(i=0;i<level->num_my_boxes;i++)if(level->my_boxes[i].fp_base)FREE(level->my_boxes[i].fp_base);
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1320:27: remark: FREE inlined into destroy_level with (cost=0, threshold=375) at callsite destroy_level:15 [-Rpass=inline]
if(level->fluxes )FREE(level->fluxes );
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1337:50: remark: FREE inlined into destroy_level with (cost=0, threshold=375) at callsite destroy_level:32 [-Rpass=inline]
if(level->exchange_ghosts[i].recv_buffers[0])FREE(level->exchange_ghosts[i].recv_buffers[0]); // allocated in bulk
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1343:50: remark: FREE inlined into destroy_level with (cost=0, threshold=375) at callsite destroy_level:38 [-Rpass=inline]
if(level->exchange_ghosts[i].send_buffers[0])FREE(level->exchange_ghosts[i].send_buffers[0]); // allocated in bulk
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:139:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
for(k=klo;k<klo+kdim;k++){
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:624:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
for(neighbor=0;neighbor<numSendRanks;neighbor++){
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:814:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
for(neighbor=0;neighbor<numRecvRanks;neighbor++){
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:989:7: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
#pragma omp parallel for
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1137:3: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
for(box=0;box<level->boxes_in.i*level->boxes_in.j*level->boxes_in.k;box++){level->rank_of_box[box]=-1;} // -1 denotes that there is no actual box assigned to this region
^
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/level.c:1226:5: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
for(i=0-level->box_ghosts;i<level->box_dim+level->box_ghosts;i++){
Additional information about the -Rpass flag is provided in the Clang Compiler User's Manual. One can also supply the flag -fsave-optimization-record, which saves the optimization remarks to a YAML file.
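The two kinds of remark shown above, inlining and loop vectorization, can be provoked with a much smaller file. The sketch below is hypothetical: alloc_doubles stands in for the MALLOC helper seen in the remarks, and make_filled is an invented caller. Whether clang actually inlines the helper or vectorizes the loop depends on the optimization level and target.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical allocation wrapper, standing in for the MALLOC helper
 * named in the remarks above. Small static functions like this are
 * typical inlining candidates. */
static double *alloc_doubles(size_t n)
{
    return (double *)malloc(n * sizeof(double));
}

/* Allocates a buffer of n doubles and fills it with `value`.
 * Returns NULL if the allocation fails; the caller frees the result.
 * The fill loop is a simple candidate for the loop vectorizer. */
double *make_filled(size_t n, double value)
{
    double *buf = alloc_doubles(n);
    if (buf == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        buf[i] = value;
    }
    return buf;
}
```

Compiling with -c -O3 -Rpass='(loop-vectorize|inline)' would be expected to print remarks for both functions. The related flags -Rpass-missed and -Rpass-analysis take the same kind of regular expression and report optimizations that were not applied and the reasoning behind the decision.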
clang can emit assembly code when given the -S flag during compilation. It can also emit LLVM's Intermediate Representation (IR) when given the flags -S -emit-llvm; the IR looks like the following:
88: ; preds = %84, %88
%89 = phi i64 [ 0, %84 ], [ %102, %88 ]
%90 = load double**, double*** %85, align 8, !tbaa !324
%91 = getelementptr inbounds double*, double** %90, i64 %89
%92 = bitcast double** %91 to i8**
%93 = load i8*, i8** %92, align 8, !tbaa !15
%94 = load i32*, i32** %86, align 8, !tbaa !325
%95 = getelementptr inbounds i32, i32* %94, i64 %89
%96 = load i32, i32* %95, align 4, !tbaa !16
%97 = load i32*, i32** %87, align 8, !tbaa !326
%98 = getelementptr inbounds i32, i32* %97, i64 %89
%99 = load i32, i32* %98, align 4, !tbaa !16
%100 = getelementptr inbounds %struct.ompi_request_t*, %struct.ompi_request_t** %19, i64 %89
%101 = tail call i32 @MPI_Isend(i8* %93, i32 %96, %struct.ompi_datatype_t* bitcast (%struct.ompi_predefined_datatype_t* @ompi_mpi_double to %struct.ompi_datatype_t*), i32 %99, i32 %10, %struct.ompi_communicator_t* bitcast (%struct.ompi_predefined_communicator_t* @ompi_mpi_comm_world to %struct.ompi_communicator_t*), %struct.ompi_request_t** %100) #15
%102 = add nuw nsw i64 %89, 1
%103 = load i32, i32* %11, align 4, !tbaa !316
%104 = sext i32 %103 to i64
%105 = icmp slt i64 %102, %104
br i1 %105, label %88, label %106
GCC¶
The GCC compilers provide a variety of flags to control which compiler diagnostics are emitted, such as -fopt-info; these flags are documented at the GCC Developer Options page. An example is provided below:
mpicc -c -DUSE_MPI=1 -DUSE_BICGSTAB=1 -DUSE_SUBCOMM=1 -DUSE_FCYCLES=1 -DUSE_GSRB=1 -O3 -march=skylake -fopt-info /global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c -o obj/finite-volume/source/mg.o
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:93:3: note: Loop 1 distributed: split to 0 loops and 1 library calls.
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:95:3: note: Loop 3 distributed: split to 0 loops and 1 library calls.
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:96:3: note: Loop 4 distributed: split to 0 loops and 1 library calls.
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:94:3: note: loop vectorized
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:94:3: note: loop with 2 iterations completely unrolled (header execution count 64530389)
/global/u2/u/user/benchmarks/hpgmg/finite-volume/source/mg.c:54:6: note: loop with 4 iterations completely unrolled (header execution count 38751186)
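In the output above, notes of the form "split to 0 loops and 1 library calls" indicate that a loop was replaced entirely by a library call such as memset. The sketch below is a hypothetical file, not taken from HPGMG, containing one loop of that kind and one ordinary vectorization candidate; the notes GCC actually prints for it will depend on the GCC version and flags.

```c
#include <stddef.h>

/* Hypothetical kernel (not HPGMG code). The first loop only zeroes
 * memory, a pattern that GCC's loop distribution pass can replace with
 * a memset call. The second loop is a plain vectorization candidate. */
void reset_and_scale(double *restrict acc, double *restrict x,
                     double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        acc[i] = 0.0;
    }
    for (size_t i = 0; i < n; i++) {
        x[i] = alpha * x[i];
    }
}
```

Compiling with -c -O3 -fopt-info should list the optimizations that were applied. Variants such as -fopt-info-vec-missed narrow the report to one optimization group and to loops that were not optimized.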