CrayPat¶
Codes can hang or fail when instrumented with perftools
Some applications can hang, segfault, or fail for an unknown reason when instrumented with perftools. HPE suggests setting export PAT_RT_CALLSTACK_MODE=frames until a fix is released.
Description¶
CrayPat is a performance analysis tool offered by Cray. CrayPat has a large feature set.
perftools-lite is a simplified and easy-to-use version of the CrayPat tool. It provides basic performance analysis information automatically with simple steps. Users can decide whether to use the full CrayPat tools after trying perftools-lite.
Here we will highlight basic usage and point to other relevant documentation.
How to use perftools-lite¶
The general workflow for getting performance data using perftools-lite is as follows:
- Load the perftools-base and perftools-lite modules.
- Build your application as normal.
- Run as normal. Performance data is summarized at the end of the job STDOUT.
- More detailed information can also be gathered with pat_report or Cray Apprentice2 after the run.
Outputs from perftools-lite¶
- In the job's stdout file, basic information from the default sample_profile option: execution time, memory high-water mark, aggregate FLOPS rate, top time-consuming user functions, MPI information, etc.
- A .rpt text file with the same info as above.
- A .ap2 file that can be used with pat_report for more detailed information, and with app2 for graphic visualization.
- Possibly one or more suggested MPICH_RANK_ORDER_FILE files.
How to Use CrayPat¶
The general workflow for getting performance data using CrayPat is as follows:
- Load the perftools-base and perftools modules.
- Build your application; keep .o files.
- Instrument the application using pat_build.
- Run the instrumented executable to get a performance data (.xf) file.
- Run pat_report on the generated data directory to view the results.
Darshan is an I/O profiling tool that can interfere with CrayPat. If the darshan module is loaded, it should be unloaded first.
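A quick way to check for Darshan before building is to inspect the list of loaded modules. The sketch below assumes the module system exports the colon-separated LOADEDMODULES variable (Environment Modules and Lmod both do) and that the module is named darshan; the helper name darshan_loaded is hypothetical.

```shell
# Hypothetical helper: succeed if a module named "darshan" (with or without
# a version suffix) appears in the colon-separated LOADEDMODULES list.
darshan_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *:darshan:*|*:darshan/*) return 0 ;;
        *) return 1 ;;
    esac
}

if darshan_loaded; then
    echo "darshan is loaded; run 'module unload darshan' before building"
fi
```

The check only reads an environment variable, so it can sit at the top of a build script without changing the environment itself.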
The perftools module needs to be loaded before you start building your application. Otherwise, the following error message appears when you build a binary instrumented with perftools.
ERROR: Missing required ELF section '.note.link' from the program '<full path for an executable>'. Load the correct 'perftools' module and rebuild the program.
Object files (.o files) need to be made available to CrayPat to correctly build an instrumented executable for profiling or tracing. In other words, the compile and link stages should be separated by using the -c compile flag. Otherwise, one will see the warning message:
$ module load perftools-base perftools
$ ftn mytest.f90
WARNING: PerfTools is saving object files from temporary locations into directory '/global/homes/...'
Please note that the Cray compiler wrappers (ftn, cc and CC) must be used for building an executable, instead of native compiler commands (gfortran, gcc, g++, etc.), since pat_build cannot build an instrumented executable from an executable built with a native compiler.
Try to run a CrayPat-instrumented executable in the $SCRATCH space, or make sure that CrayPat writes its performance data to that space via the PAT_RT_EXPDIR_NAME environment variable (see the intro_craypat man page):
#!/bin/bash
#SBATCH -N 2
...
export PAT_RT_EXPDIR_NAME=$SCRATCH/data_dir # the directory must exist
srun -n 48 ./myprogram+pat
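Because the directory named by PAT_RT_EXPDIR_NAME must already exist, a batch script can create it just before the srun line. This is a minimal sketch that reuses the data_dir name from the example above; the fallback to a temporary directory when SCRATCH is unset is an assumption added only so the snippet runs outside the batch environment.

```shell
# Use $SCRATCH when it is set; the TMPDIR or /tmp fallback is an assumption
# for illustration and is not something CrayPat requires.
data_dir="${SCRATCH:-${TMPDIR:-/tmp}}/data_dir"

# Create the directory (no error if it already exists), then point CrayPat at it.
mkdir -p "$data_dir"
export PAT_RT_EXPDIR_NAME="$data_dir"
```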
Sampling Experiments¶
Sampling (sometimes called an "asynchronous" experiment) records the program counter (PC) or the call stack at given time intervals or when a specified counter overflows. The default experiment type samples the PC at a time interval (i.e., samp_pc_time). Other sampling experiment types are available, and the type can be set with the environment variable PAT_RT_EXPERIMENT (see the intro_craypat man page).
module load perftools
ftn -c myprogram.f90
ftn -o myprogram myprogram.o
pat_build -S myprogram # or simply: pat_build myprogram
This generates a new executable, myprogram+pat. Run this executable on compute nodes, as you would with the regular executable. This will generate a directory containing performance data files with the .xf suffix (e.g., myprogram+pat+245879-9843s; the xf files are in its xf-files subdirectory). To generate human-readable content, run pat_report:
pat_report myprogram+pat+245879-9843s
This command prints an ASCII text report to your terminal and creates files with different suffixes, .ap2 and .apa, in the same directory. The first file is used to view performance data graphically with the Cray Apprentice2 tool, and the latter contains suggested pat_build options for more detailed tracing experiments. To see source line information, use the -O ca+src option to pat_report.
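The experiment directory name contains run-specific numbers, so a script usually cannot hard-code it. The sketch below picks the most recently modified directory whose name starts with the instrumented executable's name; the helper name latest_expdir is hypothetical, and it assumes the directories are created in the current working directory.

```shell
# Hypothetical helper: print the newest directory named "<exe>+*",
# where <exe> is the instrumented executable (e.g. myprogram+pat).
latest_expdir() {
    local newest="" d
    for d in "$1"+*/; do
        [ -d "$d" ] || continue          # skip the unexpanded glob if nothing matches
        if [ -z "$newest" ] || [ "$d" -nt "$newest" ]; then
            newest="$d"
        fi
    done
    # Strip the trailing slash added by the glob; fail if nothing was found.
    [ -n "$newest" ] && printf '%s\n' "${newest%/}"
}

# Example use after a run (not executed here):
#   pat_report "$(latest_expdir myprogram+pat)"
```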
A more detailed source line-by-line profile can be obtained by the following:
pat_build a.out # don't use any pat_build options
srun -n ... a.out+pat
pat_report -O samp_profile+src ....
This will produce an output similar to the following:
Table 1: Profile by Group, Function, and Line
Samp% | Samp | Imb. | Imb. |Group
| | Samp | Samp% | Function
| | | | Source
| | | | Line
100.0% | 3654.0 | -- | -- |Total
|---------------------------------------------------------------
| 99.6% | 3640.0 | -- | -- |USER
||--------------------------------------------------------------
|| 82.5% | 3015.0 | -- | -- |dim3_sweep_module_dim3_sweep_
3| | | | | HOPPER/src/./dim3_sweep.f90
||||------------------------------------------------------------
4||| 1.3% | 47.0 | -- | -- |line.122
4||| 6.2% | 226.0 | -- | -- |line.217
4||| 11.0% | 402.0 | -- | -- |line.218
4||| 11.1% | 407.0 | -- | -- |line.228
4||| 7.5% | 274.0 | -- | -- |line.229
4||| 7.3% | 266.0 | -- | -- |line.238
4||| 4.3% | 158.0 | -- | -- |line.240
4||| 5.8% | 212.0 | -- | -- |line.241
4||| 5.4% | 198.0 | -- | -- |line.242
4||| 6.7% | 245.0 | -- | -- |line.243
4||| 3.3% | 120.0 | -- | -- |line.322
4||| 4.5% | 164.0 | -- | -- |line.371
4||| 3.9% | 142.0 | -- | -- |line.377
This shows that the function dim3_sweep takes up most of the time (about 82%), with source lines 218 and 228 the largest individual contributors (about 11% each). Note that the individual source lines shown do not add up to 82%, probably because some source lines have fallen below the pat_report printing threshold.
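To check how much of a function's share the listed lines account for, the Samp% column of the line rows can be summed. The following is a small sketch using awk on report text read from standard input; the helper name sum_line_percent is hypothetical, and it assumes each source-line row ends in "|line.N" as in the table above.

```shell
# Hypothetical helper: sum the first percentage on every row that ends
# in "|line.<number>" and print the total with one decimal place.
sum_line_percent() {
    awk '/\|line\.[0-9]+ *$/ {
             if (match($0, /[0-9]+\.[0-9]+%/)) {
                 # Drop the trailing "%" before adding.
                 total += substr($0, RSTART, RLENGTH - 1)
             }
         }
         END { printf "%.1f\n", total }'
}

# Three rows copied from the table above; prints 29.6
sum_line_percent <<'EOF'
4|||  11.0% |  402.0 |     -- |     -- |line.218
4|||  11.1% |  407.0 |     -- |     -- |line.228
4|||   7.5% |  274.0 |     -- |     -- |line.229
EOF
```

Applied to all thirteen line rows of the table, the sum is 78.3%, which leaves about 4% of the function's 82.5% in lines that were not printed.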
Tracing Experiments¶
pat_build can also instrument an executable to trace calls to user-defined functions and Cray-provided library functions (e.g., MPI functions). Again, to generate an instrumented executable, one needs to load the perftools module first and then compile and link in separate steps.
module load perftools
ftn -c myprogram.f90
ftn -o myprogram myprogram.o
The -w flag enables tracing. If only this flag is used, the entire program is traced as a whole (as main), with no individual function being traced.
pat_build -w myprogram
To instrument user-defined functions func1 and func2, use the -T option, together with the -w option:
pat_build -w -T func1,func2 myprogram
Be careful to choose the func1 and func2 names properly; the compiler may have appended underscore characters to the Fortran routine name.
To trace a group of functions you list in a text file, tracefile, use the -t option:
pat_build -w -t tracefile myprogram
where the file, tracefile, contains the function names to be traced.
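As an illustration, a trace file for the two functions above could be written as follows. This sketch assumes the file simply lists one function name per line; check the pat_build man page for the accepted format, and adjust the names if the compiler has appended underscores.

```shell
# Write the function names from the example above, one per line
# (the one-name-per-line layout is an assumption, not taken from the man page).
printf '%s\n' func1 func2 > tracefile

# Show what was written.
cat tracefile
```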
To trace all the user-defined functions, use the -u option:
pat_build -u myprogram
Tracing all user functions can slow down the code significantly if it contains many small and frequently called functions. To avoid such excessive overhead, one can restrict tracing to functions of a certain text size or larger by using the trace-text-size directive (see the pat_build man page):
pat_build -u -Dtrace-text-size=800 myprogram # to trace those with text size >= 800 bytes
To trace a Cray-provided library function group (e.g., MPI, OpenMP, ...), specify the function group name after the -g flag. The supported function groups are listed in the pat_build man page. For example, to trace the MPI, OpenMP and heap memory related functions as well as all the user functions, one can do:
pat_build -g mpi,omp,heap -u myprogram
After running the instrumented executable on compute nodes via srun, run pat_report on the generated data file (.xf file). This prints ASCII text output to the terminal and creates a file with the same basename but with the .ap2 suffix, which can be viewed with the Cray Apprentice2 tool.
Automatic Program Analysis (APA)¶
Since a sampling experiment runs with little overhead while a detailed tracing experiment generally comes with large overhead, a good strategy for analyzing a code whose performance characteristics are not yet known is to run a sampling experiment first to identify the routines that should be instrumented for a more detailed tracing experiment later.
CrayPat's Automatic Program Analysis (APA) feature provides an easy way to do this. Using this feature, one can generate an instrumented executable for a sampling experiment. After the binary is executed and pat_report is run on the resulting data, an ASCII text file is produced that contains CrayPat's suggestions for pat_build tracing options, which can be used to re-instrument the executable for detailed tracing experiments.
The general workflow for using APA is as follows.
1. Generate the executable for sampling, using the special -O apa flag. It will generate an instrumented executable, myprogram+pat.
    pat_build -O apa myprogram # generates myprogram+pat
2. Run the executable on compute nodes via srun. This generates a directory containing performance data files. Let's call it myprogram+pat+4571-19s in this example.
3. Run pat_report on the data directory:
    pat_report myprogram+pat+4571-19s
   It will generate the myprogram+pat+4571-19sdot.ap2 and myprogram+pat+4571-19sdot.apa files. The latter contains suggested pat_build options for building an executable for tracing experiments.
4. Examine the myprogram+pat+4571-19sdot.apa file and, if necessary, customize it for your needs using your favorite text editor.
5. Rebuild an executable using the pat_build -O option with the .apa file name as the argument. It generates a new instrumented executable, myprogram+apa.
    pat_build -O myprogram+pat+4571-19sdot.apa # generates myprogram+apa
6. Run the new executable, myprogram+apa, for a tracing experiment.
7. Run pat_report on the newly created performance data directory. Its xf-files subdirectory contains the actual performance data files (ending in .xf), which are the tracing result.
    pat_report myprogram+apa+4590-19t
Monitoring Hardware Performance Counters¶
One can monitor hardware performance counter (HWPC) events while running sampling or tracing experiments (however, doing this with a sampling experiment is discouraged). The supported PAPI standard and Intel native event names can be found by running the papi_avail and papi_native_avail commands on a compute node using a batch script:
#!/bin/bash
#SBATCH -N 1
...
module load perftools
papi_avail
papi_native_avail
By default, hardware performance counters are not monitored during sampling or tracing experiments. To enable monitoring, one has to explicitly specify up to four event names using the PAT_RT_PERFCTR environment variable before the srun command is executed. For example, one can monitor floating point operations and L1 data cache misses with the following:
export PAT_RT_PERFCTR="PAPI_FP_OPS,PAPI_L1_DCM"
Or one can set the environment variable to a predefined hardware counter group number:
export PAT_RT_PERFCTR=1
The meaning of each counter group (1, 2, 3, ...) depends on the Cray system your application is running on. For example, on some systems, group 1 collects floating-point and cache metrics. The groups and their meanings are explained in the hwpc man page on each machine (accessible when the perftools module is loaded).
Some Advanced CrayPat Topics¶
By default, CrayPat will aggregate values from multiple processing elements during a run (and there are options to control the aggregation). If you want to look at HWPC values on a per-processing element (PE) basis, you can just do the following:
pat_report -s pe=ALL file.ap2 ...
The following should work as well:
pat_report -d counters -b pe -s aggr_pe_counters=select0 ...
If you want to sort the report by PEs, you can add -s sort_by_pe=yes.
If you'd like HWPC values for just the whole program (not per function, etc.), you can do
pat_report -O hwpc -s pe=ALL ...
Measuring Load Imbalance¶
You can use CrayPat to measure load imbalance in programs instrumented to trace MPI functions. By default, CrayPat causes the trace wrapper for each MPI collective subroutine to measure the time for a barrier call prior to entering the collective. This time is reported by pat_report as MPI_SYNC, which is separate from the MPI function group itself. The MPI_SYNC time essentially represents the time spent waiting for the MPI call to synchronize; it indicates whether the MPI ranks arrive at the collectives together or not. If the environment variable PAT_RT_MPI_SYNC is set to 1, the time spent waiting at a barrier and synchronizing processes is reported under MPI_SYNC, while the time spent executing after the barrier is reported under MPI. The default value is 1 for tracing experiments and 0 for sampling experiments.
Cray Apprentice2¶
Cray Apprentice2 is a tool used to visualize performance data collected with the CrayPat tool. There are many options for viewing results. Please refer to the app2 man page or Cray's documentation for more details.
module load perftools
app2 myprogram+pat+####-####tdot.ap2
To enable the "Mosaic" and "Traffic Report" views you need to set an environment variable as follows:
export PAT_RT_SUMMARY=0
Doing this may create enormous CrayPat data files and may take a long time.
Cray Reveal¶
Cray Reveal is a tool developed by Cray to help develop applications that use the hybrid MPI/OpenMP programming model. It is part of the Cray Perftools software package. It utilizes the Cray CCE program library (hence it only works under PrgEnv-cray) for loopmark and source code analysis, combined with performance data collected from CrayPat. Reveal helps to identify the most time-consuming loops, with compiler feedback on dependency and vectorization. Its loop scope analysis provides variable scope and compiler directive suggestions for inserting OpenMP parallelism into a serial or pure MPI code. Please see this page with detailed steps for using Reveal.
Further Information¶
NERSC has prepared a detailed tutorial on Cray's perftools. You can view the presentation material.
Please refer to the HPE user guide for more details. Man pages are available, but only after you load the perftools-base modulefile. Try man pat_help for a tutorial. Other man pages include:
craypat
pat_build
pat_report
app2
hwpc
intro_perftools
papi
papi_counters
For questions on using CrayPat at NERSC, please contact the NERSC help desk.