Roofline Performance Model¶
Performance models and tools are an integral part of the performance analysis and performance optimization process for users who seek higher performance and better utilization of the hardware. The Roofline performance model offers an intuitive and insightful way to compare application performance against machine capabilities, track progress towards optimality, and identify bottlenecks, inefficiencies, and limitations in software implementations and architecture designs. Its ability to extract key computational characteristics and abstract away the complexity of modern memory hierarchies has made Roofline-based analysis an increasingly popular tool in the HPC community.
Roofline Performance Model¶
The most basic form of the Roofline model bounds floating-point performance (GFLOP/s) as a function of machine peak performance, machine peak bandwidth, and the arithmetic intensity of the application. The resulting curve can be viewed as a performance envelope under which kernel or application performance must lie.
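Formally, the attainable performance is the lesser of the compute-limited and bandwidth-limited rates:

$$
P_{\text{attainable}} = \min\left(P_{\text{peak}},\ \text{AI} \times B_{\text{peak}}\right)
$$

where P_peak is the machine's peak compute rate (GFLOP/s), B_peak is its peak bandwidth (GB/s), and AI is the arithmetic intensity (FLOPs/byte).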
The ridge point on the Roofline is called the 'machine balance' point. If an application's arithmetic intensity is lower than this point, it is usually considered bandwidth bound, i.e., limited by how fast data can be moved through the memory system rather than by how fast calculations can be done on the CPU cores or the GPU SMs. To optimize in this case, memory inefficiencies are usually good places to examine, such as the memory access pattern, data locality, and cache reuse. On the other hand, if the application's arithmetic intensity is higher than the machine balance, the application is more likely limited by how fast the computation can be done. In this case, improving vectorization (to more efficiently utilize the vector units on each CPU core) or multi-threading (to better utilize the available cores) usually helps.
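The machine balance is simply the arithmetic intensity at which the two bounds meet:

$$
\text{AI}_{\text{balance}} = \frac{P_{\text{peak}}}{B_{\text{peak}}}
$$

With purely illustrative numbers, a machine with a 3 TFLOP/s compute peak and 450 GB/s of memory bandwidth has a machine balance of roughly 6.7 FLOPs/byte; a kernel whose arithmetic intensity falls below that value sits on the bandwidth-limited part of the Roofline.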
To help examine the complex memory systems on modern architectures, multiple Rooflines can be superimposed on each other to represent the different cache levels in the memory hierarchy; this is called the hierarchical Roofline model. It helps analyze the application's data locality and cache reuse pattern, and shows how efficiently data flows through the memory system.
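As a rough illustration of what such a chart contains, the short matplotlib sketch below draws a compute ceiling together with one bandwidth-derived Roofline per memory level; the ceiling values are made-up placeholders rather than measurements of any particular machine.

```python
# Sketch of a hierarchical Roofline chart (illustrative ceilings only).
import numpy as np
import matplotlib.pyplot as plt

peak_gflops = 3000.0          # placeholder compute peak (GFLOP/s)
bandwidths = {                # placeholder bandwidths (GB/s) for each memory level
    "L1": 6000.0,
    "L2": 2000.0,
    "HBM": 450.0,
}

ai = np.logspace(-2, 2, 200)  # arithmetic intensity range (FLOPs/byte)

fig, ax = plt.subplots()
for level, bw in bandwidths.items():
    # Each level contributes its own Roofline: min(compute peak, AI * bandwidth)
    ax.loglog(ai, np.minimum(peak_gflops, ai * bw), label=f"{level} ({bw:.0f} GB/s)")

ax.set_xlabel("Arithmetic Intensity (FLOPs/byte)")
ax.set_ylabel("Performance (GFLOP/s)")
ax.set_title("Hierarchical Roofline (illustrative)")
ax.legend()
plt.savefig("hierarchical_roofline.png", dpi=150)
```

A kernel is then plotted as one point per memory level, at that level's arithmetic intensity and the kernel's measured GFLOP/s.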
To construct an accurate and meaningful Roofline, we will need to collect performance information such as the peak compute performance and peak bandwidth for the architecture, and arithmetic intensity and achieved throughput (FLOP/s) for the application. In the following, we will detail how to collect such information using various performance tools.
Empirical Roofline Toolkit (ERT) for machine characterization¶
To estimate the peak compute performance (FLOP/s) and peak bandwidth, vendor specifications can be a good starting point. They give insight into the scale of the machine's capabilities; however, they may not capture the realistic execution environment that actual applications run in, such as power/energy constraints and the programming models used. To get a more accurate picture of the machine's attainable peak, the Empirical Roofline Toolkit (ERT) is recommended. ERT runs a variety of micro-kernels and sweeps through a range of runtime configurations. Each micro-kernel is small and designed to test a particular aspect of the system, but together they provide a more realistic set of estimates of machine capability, such as peak bandwidth at the various cache levels and peak GFLOP/s.
Arithmetic Intensity (AI) and achieved performance (FLOP/s)¶
To characterize an application on a Roofline chart, three pieces of information need to be collected: the run time, the total number of FLOPs performed, and the total number of bytes moved (both read and written). This can be done for the entire application or only for a code region of interest. For the hierarchical Roofline, multiple byte counts need to be collected, one per memory/cache level, since the same kernel has a different byte count, and hence a different arithmetic intensity, at each level of the memory hierarchy.
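In other words, a kernel appears at the same height on every level's Roofline but at a different horizontal position for each level:

$$
y = \frac{\text{FLOPs}}{\text{Runtime}}, \qquad x_{\text{level}} = \frac{\text{FLOPs}}{\text{Bytes}_{\text{level}}}
$$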
For large-scale applications, it is infeasible to count FLOPs or bytes by hand, and performance tools are recommended. Through collaboration with Intel and NVIDIA in recent years, automated Roofline data collection has been implemented in Intel Advisor and NVIDIA Nsight Compute. These tools provide fully integrated, production-quality Roofline analysis features and should be the go-to tools; however, for completeness we document a few alternatives for users who seek lighter-weight tools or more customized data collection workflows. We will focus on two architectures, the Intel KNL CPU and the NVIDIA V100 GPU, and the tools discussed will be LIKWID and SDE for KNL, and nvprof and Nsight Compute for NVIDIA GPU architectures.
As of mid-2020, the Roofline analysis feature shipped in Nsight Compute by default covers only the device memory (HBM) level. However, it can be extended to a hierarchical Roofline using customized Nsight Compute section files or metrics-based data collection methodologies, as documented in the Roofline on NVIDIA GPUs repository.
Arithmetic Intensity¶
Arithmetic Intensity (AI) is the ratio of the total floating-point operations (FLOPs) performed by a given code or code section to the total data movement (bytes) required to support those FLOPs. Note the difference between FLOPs and FLOP/s: FLOPs is a count, while FLOP/s is a rate or throughput. As mentioned above, for the hierarchical Roofline a different byte count is collected for each level of cache. For example, the L2-level Roofline uses the bytes moved between L1 and L2, and this byte count also serves as the denominator of the L2 arithmetic intensity.
Intel SDE on KNL¶
The Intel SDE tool offers dynamic instruction tracing and can capture information such as the instructions executed, instruction length, instruction category, and ISA extension grouping, enabling accurate FLOPs counting for a full application or a code region. Because of its instruction-level analysis, SDE usually incurs a high runtime overhead, so caution should be exercised. On Intel CPUs, SDE supports counting both masked and unmasked FLOPs (with respect to vectorization); however, here we will only focus on the unmasked FLOPs estimation due to overhead concerns.
At NERSC, an example command line for SDE is:
srun -n 4 -c 6 sde -knl -d -iform 1 -omix my_mix.out -i -global_region -start_ssc_mark 111:repeat -stop_ssc_mark 222:repeat -- foo.exe
Where:

- `-knl` is used to target Cori's KNL ISA (`-hsw` for Cori Haswell)
- `-d` specifies to only collect dynamic profile information
- `-iform 1` turns on compute ISA iform mix
- `-omix` specifies the output file (and turns on `-mix`)
- `-i` specifies that each process will have a unique file name based on process ID (needed for MPI)
- `-global_region` will include any threads spawned by a process (needed for OpenMP)
- `-start_ssc_mark` and `-stop_ssc_mark` limit the FLOPs counting to a code region within a large application. Note that the code must include markers such as `0x111` and `0x222`:
__SSC_MARK(0x111); // start SDE tracing, note it uses 2 underscores
for (k=0; k<NTIMES; k++) {
    #pragma omp parallel for
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        a[j] = b[j]+scalar*c[j];
}
__SSC_MARK(0x222); // stop SDE tracing
SDE will create a file for every process created by the application. For example, in an MPI code SDE will create a file for each MPI process (one per rank), and if the application contains threads (e.g. OpenMP) they will be encapsulated in the same file (`-global_region` enables this).
To parse the results produced by SDE, use the script parse-sde.sh. For example, the sample code Stream in the same repository produces the following floating-point operation count and total L1 byte count.
$ ./parse-sde.sh sde_2p16t*
Search stanza is "EMIT_GLOBAL_DYNAMIC_STATS"
elements_fp_single_1 = 0
elements_fp_single_2 = 0
elements_fp_single_4 = 0
elements_fp_single_8 = 0
elements_fp_single_16 = 0
elements_fp_double_1 = 2960
elements_fp_double_2 = 0
elements_fp_double_4 = 999999360
elements_fp_double_8 = 0
--->Total single-precision FLOPs = 0
--->Total double-precision FLOPs = 4000000400
--->Total FLOPs = 4000000400
mem-read-1 = 8618384
mem-read-2 = 1232
mem-read-4 = 137276433
mem-read-8 = 149329207
mem-read-16 = 1999998720
mem-read-32 = 0
mem-read-64 = 0
mem-write-1 = 264992
mem-write-2 = 560
mem-write-4 = 285974
mem-write-8 = 14508338
mem-write-16 = 0
mem-write-32 = 499999680
mem-write-64 = 0
--->Total Bytes read = 33752339756
--->Total Bytes written = 16117466472
--->Total Bytes = 49869806228
One can also extract the FLOPs count with a simple bash snippet:
flops=`grep 'Total FLOPs = ' result.sde.after.parse | cut -d '=' -f 2`
gflops=`python -c "print('{0:.3f}'.format($flops/1024.0/1024/1024))"`
echo SDE FLOPS is $gflops GFLOPS
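For reference, plugging the totals from the sample output above into the definition of arithmetic intensity gives the L1-level AI of this Stream run:

$$
\text{AI}_{\text{L1}} = \frac{4000000400\ \text{FLOPs}}{49869806228\ \text{Bytes}} \approx 0.08\ \text{FLOPs/Byte}
$$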
RRZE LIKWID on KNL¶
LIKWID is a very low-overhead profiling tool that can be used to collect both FLOPs and bytes for hierarchical Roofline analysis. It offers several predefined performance groups; on Intel KNL, for example, the following groups can be used to collect hierarchical Roofline information:
- `FLOPS_DP` and `FLOPS_SP` for FLOPs
- `HBM_CACHE` for MCDRAM data movement on KNL
- `L2` for L2 cache, and
- `DATA` for L1 cache
To run LIKWID, users need to compile the code with `-I$LIKWID_INCLUDE -DLIKWID_PERFMON`, request nodes with the `--perf=likwid` flag on Cori, and run the code as:
module load likwid
srun -n 1 -c 272 --cpu-bind=cores --cpu-freq=1401000 likwid-perfctr -c 0-271 -g HBM_CACHE foo.exe >> result.likwid
This runs on a single node with all 272 hyper-threads being traced.
To collect information for a specific code region, please see the LIKWID page for details about its Marker API.
To parse the results LIKWID produces, you can run commands such as the following, for example for the MCDRAM (HBM) data movement:
# MCDRAM bandwidth (MBytes/s) and data volume (GBytes) from the last table in the LIKWID output
hbm_mbytess=`grep "MCDRAM Memory bandwidth" result.likwid | tail -n 1 | cut -d '|' -f 3`
hbm_gbytes=`grep "MCDRAM Memory data volume" result.likwid | tail -n 1 | cut -d '|' -f 3`
# convert the bandwidth from MBytes/s to GBytes/s
hbm_gbytess=`python -c "print('{0:.3f}'.format($hbm_mbytess/1024))"`
echo "MCDRAM Bytes: $hbm_gbytes GB"
echo "MCDRAM Bandwidth: $hbm_gbytess GB/s"
Arithmetic Intensity (AI) on V100¶
NVIDIA's profiling tools nvprof and Nsight Compute can be used to measure FLOPs and bytes on an NVIDIA GPU. For example, the following command line can be used to collect such information with nvprof, for a particular invocation of a particular kernel in a GPU code.
nvprof --kernels "{kernel name}, {[context id/name]:[stream id/name]:[kernel name]:[invocation]}" --metrics flop_count_dp --metrics dram_read_transactions --metrics dram_write_transactions foo.exe
where `flop_count_dp` is the total FLOP count for FP64 operations, and `dram_read_transactions` and `dram_write_transactions` are the read and write transactions from and to HBM. For FP32 or FP16 operations, `flop_count_sp` and `flop_count_hp` can be used. The size of each memory transaction is 32 bytes, so the total HBM data movement can be calculated as `(dram_read_transactions + dram_write_transactions) x 32B`.
The arithmetic intensity of a kernel on an NVIDIA V100 can then be calculated as:
AI (HBM) = flop_count_dp / ((dram_read_transactions + dram_write_transactions)*32)
For more details on the nvprof or Nsight Compute metrics for hierarchical Roofline data collection, please see the Roofline on NVIDIA GPUs repository and the Hierarchical Roofline Analysis: How to Collect Data using Performance Tools on Intel CPUs and NVIDIA GPUs paper.
Application Performance¶
The y-coordinate of a kernel on the Roofline chart is its sustained computational throughput (GFLOP/s), which can be calculated as `FLOPs / Runtime`. The runtime can be obtained from timers in the code, and the FLOP count from nvprof or Nsight Compute.
Together with the arithmetic intensity (obtained from the previous section) and Roofline ceilings (obtained from ERT), we can then construct a Roofline chart.
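As a minimal sketch of this last step (the ceiling values and kernel measurements below are placeholders rather than output from any particular tool or machine), the measured FLOPs, bytes, and runtime combine into the point that is plotted against the ceilings:

```python
# Minimal sketch: turn measured FLOPs, bytes and runtime into a Roofline point.
def roofline_point(flops, bytes_moved, runtime_s):
    """Return (arithmetic intensity [FLOPs/byte], performance [GFLOP/s])."""
    ai = flops / bytes_moved
    gflops = flops / runtime_s / 1e9
    return ai, gflops

# Placeholder machine ceilings, e.g. as measured by ERT.
peak_gflops = 3000.0      # GFLOP/s
peak_bw_gbs = 450.0       # GB/s
machine_balance = peak_gflops / peak_bw_gbs

# Placeholder kernel measurements (FLOPs, bytes, seconds).
ai, gflops = roofline_point(flops=4.0e9, bytes_moved=5.0e10, runtime_s=0.02)

bound = "bandwidth" if ai < machine_balance else "compute"
print(f"AI = {ai:.3f} FLOPs/byte, performance = {gflops:.1f} GFLOP/s ({bound} bound)")
```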
Some example scripts are available at https://github.com/cyanguwa/nersc-roofline, demonstrating how an example code from BerkeleyGW, called General Plasmon Pole (GPP), can be modeled by Roofline on both Intel KNL and NVIDIA V100. The paper Hierarchical Roofline Analysis: How to Collect Data using Performance Tools on Intel CPUs and NVIDIA GPUs accompanies these scripts.
With GPP, as we artificially increase the number of iterations of the innermost loop `nw` from 1 to 6, we get the following Roofline charts on Intel KNL and NVIDIA V100 (for the HBM-level Roofline only).
As you can see, as the parameter `nw` increases from 1 to 6, so does the arithmetic intensity of the application: the total amount of data moved stays the same, while the total number of FLOPs executed grows proportionally with `nw`. This increase in arithmetic intensity takes GPP from a bandwidth-bound regime to a compute-bound regime, and the observed GFLOP/s also increases on both the KNL and V100 Rooflines. The subtlety here is that the bottleneck may differ even for the same `nw`. For example, at `nw=2` the kernel is more bandwidth bound on KNL, whereas on V100 it is more compute bound.
Roofline is able to capture these subtle differences and is very helpful in understanding an application's performance across multiple architectures.