CrayPat

Codes can hang or fail when instrumented with perftools

Some applications can hang, segfault, or fail for an unknown reason when instrumented with perftools. As a workaround, HPE suggests setting the environment variable with export PAT_RT_CALLSTACK_MODE=frames until a fix is released.

Description

CrayPat is a performance analysis tool offered by Cray. CrayPat has a large feature set.

perftools-lite is a simplified and easy-to-use version of the CrayPat tool. It provides basic performance analysis information automatically with simple steps. Users can decide whether to use the full CrayPat tools after trying perftools-lite.

Here we will highlight basic usage and point to other relevant documentation.

How to use `perftools-lite`

The general workflow for getting performance data using perftools-lite is as follows:

Load the perftools-base and perftools-lite modules.
Build your application as normal.
Run as normal. Performance data is summarized at the end of the job STDOUT.
More detailed information can also be gathered with pat_report or Cray Apprentice2 after the run.
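
The steps above can be sketched as a short script. This is a minimal sketch rather than an official recipe: the source file myprogram.f90, the executable name, and the srun arguments are placeholders, and run is a hypothetical helper that only prints each command unless DRY_RUN=0 is set, so the sketch can be read or tried on a machine without the Cray modules.

```shell
#!/bin/bash
# Dry-run sketch of the perftools-lite workflow.
# `run` is a hypothetical helper: it prints each command and executes it
# only when DRY_RUN=0, so nothing is built or launched by default.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

perftools_lite_workflow() {
    run module load perftools-base perftools-lite   # 1. load the modules
    run ftn -o myprogram myprogram.f90              # 2. build as normal (placeholder source file)
    run srun -n 48 ./myprogram                      # 3. run as normal; the summary goes to stdout
}

perftools_lite_workflow
```

Setting DRY_RUN=0 in a real batch script would execute the same three steps instead of printing them.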

Outputs from `perftools-lite`

In the job's stdout file, basic information from the default sample_profile option: execution time, memory high-water mark, aggregate FLOPS rate, top time-consuming user functions, MPI information, etc.
A .rpt text file with the same info as above.
A .ap2 file that can be used with pat_report for more detailed information, and with app2 for graphic visualization.
Possibly one or more suggested MPICH_RANK_ORDER_FILE files.

How to Use CrayPat

The general workflow for getting performance data using CrayPat is as follows:

Load the perftools-base and perftools modules.
Build your application; keep .o files.
Instrument the application using pat_build.
Run the instrumented executable to get a performance data (.xf) file.
Run pat_report on the generated data directory to view the results.

Darshan is an I/O profiling tool that can interfere with CrayPat. If the darshan module is loaded, it should be unloaded first.
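
One way to check for darshan before building is to inspect the LOADEDMODULES variable, the colon-separated list of loaded modules kept by Environment Modules and Lmod. The sketch below assumes the module is named darshan (optionally followed by a version); the helper name darshan_loaded is made up for illustration.

```shell
#!/bin/bash
# Detect a loaded darshan module by inspecting LOADEDMODULES, the
# colon-separated list maintained by Environment Modules and Lmod.
darshan_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *:darshan:*|*:darshan/*) return 0 ;;
        *) return 1 ;;
    esac
}

if darshan_loaded; then
    echo "darshan is loaded; unloading it before using CrayPat"
    # Only call `module` where the command actually exists.
    if command -v module >/dev/null 2>&1; then module unload darshan; fi
else
    echo "darshan is not loaded"
fi
```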

The perftools module needs to be loaded before you start building your application. Otherwise, the following error message appears when you build a binary instrumented with perftools.

ERROR: Missing required ELF section '.note.link' from the program '<full path for an executable>'. Load the correct 'perftools' module and rebuild the program.

Object files (.o files) need to be made available to CrayPat to correctly build an instrumented executable for profiling or tracing. In other words, the compile and link stages should be separated by using the -c compile flag. Otherwise, the following warning appears:

$ module load perftools-base perftools
$ ftn mytest.f90
WARNING: PerfTools is saving object files from temporary locations into directory '/global/homes/...'

Please note that the Cray compiler wrappers (ftn, cc and CC) must be used for building an executable, instead of native compiler commands (gfortran, gcc, g++, etc.) since pat_build cannot build an instrumented executable from an executable built with a native compiler.

Run a CrayPat-instrumented executable in the $SCRATCH space, or make sure that CrayPat writes its performance data there via the PAT_RT_EXPDIR_NAME environment variable (see the intro_craypat man page):

#!/bin/bash
#SBATCH -N 2
...
export PAT_RT_EXPDIR_NAME=$SCRATCH/data_dir  # the directory must exist
srun -n 48 ./myprogram+pat
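
Because the directory must already exist, it can be created in the batch script just before the variable is exported. A minimal sketch, assuming the data_dir name from the example above; the fallback to the current directory when SCRATCH is unset is an addition made only so the snippet also runs outside the batch system.

```shell
#!/bin/bash
# Create the experiment directory first, since it must exist before the run.
# Falls back to the current directory when SCRATCH is unset (an assumption
# made so the sketch also runs outside the batch system).
expdir="${SCRATCH:-$PWD}/data_dir"
mkdir -p "$expdir"
export PAT_RT_EXPDIR_NAME="$expdir"
echo "CrayPat data will be written to $PAT_RT_EXPDIR_NAME"
```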

Sampling Experiments

Sampling (sometimes called an "asynchronous" experiment) records the program counter (PC) or the call stack at given time intervals or when a specified counter overflows. The default experiment type samples the PC at a time interval (i.e., samp_pc_time). Other sampling experiment types are available, and the type can be set with the environment variable PAT_RT_EXPERIMENT (see the intro_craypat man page).

module load perftools
ftn -c myprogram.f90
ftn -o myprogram myprogram.o
pat_build -S myprogram   # or simply: pat_build myprogram

This generates a new executable, myprogram+pat. Run this executable on compute nodes, as you would with the regular executable. The run creates a data directory (e.g., myprogram+pat+245879-9843s) whose xf-files subdirectory holds the performance data files with the .xf suffix. To generate human-readable content, run pat_report:

pat_report myprogram+pat+245879-9843s

This command prints an ASCII text report to your terminal and creates files with the suffixes .ap2 and .apa in the same directory. The former is used to view performance data graphically with the Cray Apprentice2 tool, and the latter contains suggested pat_build options for more detailed tracing experiments. To see source line information, use the -O ca+src option to pat_report.

A more detailed source line-by-line profile can be obtained by the following:

pat_build a.out   # don't use any pat_build options; creates a.out+pat
srun -n ... ./a.out+pat
pat_report -O samp_profile+src ....

This will produce an output similar to the following:

Table 1: Profile by Group, Function, and Line
 Samp% |   Samp | Imb. | Imb.  |Group
       |        | Samp | Samp% | Function
       |        |      |       |  Source
       |        |      |       |   Line
100.0% | 3654.0 | -- | -- |Total
|---------------------------------------------------------------
| 99.6% | 3640.0 | -- | -- |USER
||--------------------------------------------------------------
|| 82.5% | 3015.0 | -- | -- |dim3_sweep_module_dim3_sweep_
3|       |        |    |    | HOPPER/src/./dim3_sweep.f90
||||------------------------------------------------------------
4|||  1.3% |  47.0 | -- | -- |line.122
4|||  6.2% | 226.0 | -- | -- |line.217
4||| 11.0% | 402.0 | -- | -- |line.218
4||| 11.1% | 407.0 | -- | -- |line.228
4|||  7.5% | 274.0 | -- | -- |line.229
4|||  7.3% | 266.0 | -- | -- |line.238
4|||  4.3% | 158.0 | -- | -- |line.240
4|||  5.8% | 212.0 | -- | -- |line.241
4|||  5.4% | 198.0 | -- | -- |line.242
4|||  6.7% | 245.0 | -- | -- |line.243
4|||  3.3% | 120.0 | -- | -- |line.322
4|||  4.5% | 164.0 | -- | -- |line.371
4|||  3.9% | 142.0 | -- | -- |line.377

which shows that the function dim3_sweep accounts for most of the samples (~82%), with source lines 218 and 228 as the largest individual contributors. Note that the individual source lines shown do not add up to 82%, probably because some source lines have fallen below the pat_report printing threshold.
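
The gap between the listed lines and the function total can be checked by summing the per-line rows. The sketch below copies those rows from the report above and adds up the Samp% and Samp columns with awk; the variable names are illustrative only.

```shell
#!/bin/bash
# Per-line rows copied from the report above (values unchanged).
rows='4|||  1.3% |  47.0 | -- | -- |line.122
4|||  6.2% | 226.0 | -- | -- |line.217
4||| 11.0% | 402.0 | -- | -- |line.218
4||| 11.1% | 407.0 | -- | -- |line.228
4|||  7.5% | 274.0 | -- | -- |line.229
4|||  7.3% | 266.0 | -- | -- |line.238
4|||  4.3% | 158.0 | -- | -- |line.240
4|||  5.8% | 212.0 | -- | -- |line.241
4|||  5.4% | 198.0 | -- | -- |line.242
4|||  6.7% | 245.0 | -- | -- |line.243
4|||  3.3% | 120.0 | -- | -- |line.322
4|||  4.5% | 164.0 | -- | -- |line.371
4|||  3.9% | 142.0 | -- | -- |line.377'

# With "|" as the separator, field 4 is Samp% and field 5 is Samp.
totals=$(printf '%s\n' "$rows" |
    awk -F'|' '{ pct += $4; n += $5 } END { printf "%.1f %d", pct, n }')
line_pct=${totals% *}
line_samp=${totals#* }

# 3015 is the sample count reported for the function as a whole.
echo "listed lines: ${line_pct}% (${line_samp} samples)"
echo "below threshold: $((3015 - line_samp)) samples"
```

The listed lines cover 78.3% (2861 samples), so 154 of the function's 3015 samples belong to lines that are not printed.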

Tracing Experiments

pat_build can also instrument an executable to trace calls to user-defined functions and Cray-provided library functions (e.g., MPI functions). Again, to generate an instrumented executable, load the perftools module first, and then compile and link in separate steps.

module load perftools
ftn -c myprogram.f90
ftn -o myprogram myprogram.o

The -w flag enables tracing. If only this flag is used, the entire program is traced as a whole (as main), with no individual function being traced.

pat_build -w myprogram

To instrument user-defined functions func1 and func2, use the -T option, together with the -w option:

pat_build -w -T func1,func2 myprogram

Be careful to choose the func1 and func2 names properly; the compiler may have appended underscore characters to the Fortran routine name.
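
One way to find the exact spelling is to list the symbols in the executable with nm (from binutils) and search for the routine name. The sketch below is a small filter for nm output; match_symbols is a made-up helper name, and the two sample lines are illustrative stand-ins (using the mangled name from the report shown earlier), not real tool output.

```shell
#!/bin/bash
# Print the symbol names (last column of `nm` output) that contain a pattern,
# ignoring case, so the exact spelling can be passed to pat_build -T.
match_symbols() {
    awk -v pat="$1" 'tolower($NF) ~ tolower(pat) { print $NF }'
}

# Stand-in for `nm myprogram`: two illustrative lines, not real tool output.
sample='0000000000401136 T dim3_sweep_module_dim3_sweep_
0000000000401a20 T main'

printf '%s\n' "$sample" | match_symbols dim3_sweep
```

On a real build the same filter would be used as nm myprogram | match_symbols func1.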

To trace a group of functions you list in a text file, tracefile, use the -t option:

pat_build -w -t tracefile myprogram

where the file, tracefile, contains the function names to be traced.
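
A trace file can be written by hand or generated from a script. The sketch below assumes a layout of one function name per line, which should be confirmed against the pat_build man page; func1 and func2 are the placeholder names used above.

```shell
#!/bin/bash
# Write the function names to trace into a file named tracefile.
# One name per line is an assumed layout; confirm it in the pat_build man page.
printf '%s\n' func1 func2 > tracefile
cat tracefile
```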

To trace all the user-defined functions, use the -u option:

pat_build -u myprogram

Tracing all user functions can slow down the code significantly if it contains many small, frequently called functions. To avoid such excessive overhead, one can restrict tracing to functions of a certain text size or larger by using the trace-text-size directive (see the pat_build man page):

pat_build -u -Dtrace-text-size=800 myprogram    # to trace those with text size >= 800 bytes

To trace a Cray-provided library function group (e.g., MPI, OpenMP, ...), specify the function group name after the -g flag. The supported function groups are listed in the pat_build man page. For example, to trace the MPI, OpenMP and heap memory related functions as well as all the user functions, one can do:

pat_build -g mpi,omp,heap -u myprogram

After running the instrumented executable on compute nodes via srun, run pat_report on the generated data directory (which holds the .xf files). This prints ASCII text output to the terminal and creates a file with the same basename but with the .ap2 suffix, which is to be viewed with the Cray Apprentice2 tool.

Automatic Program Analysis (APA)

Since a sampling experiment runs with little overhead while a detailed tracing experiment generally comes with large overhead, a good strategy for analyzing a code whose performance characteristics are not yet known is to run a sampling experiment first, to identify the routines that should be instrumented for a more detailed tracing experiment later.

CrayPat's Automatic Program Analysis (APA) feature provides an easy way to do this. With it, one can generate an instrumented executable for a sampling experiment. After the binary has been run and pat_report has processed the data, an ASCII text file is produced that contains CrayPat's suggested pat_build tracing options, which can be used to re-instrument the executable for detailed tracing experiments.

The general workflow for using APA is as follows.

Generate the executable for sampling, using the special -O apa flag. It will generate an instrumented executable, myprogram+pat.
```
pat_build -O apa myprogram    # generates myprogram+pat
```
Running the executable on compute nodes via srun generates a directory containing performance data files. Let's call it myprogram+pat+4571-19s in this example.
Run pat_report on the data directory:
```
pat_report myprogram+pat+4571-19s
```
It will generate the myprogram+pat+4571-19sdot.ap2 and myprogram+pat+4571-19sdot.apa files. The latter contains suggested pat_build options for building an executable for tracing experiments.
Examine the myprogram+pat+4571-19sdot.apa file and, if necessary, customize it for your needs using your favorite text editor.
Rebuild an executable using the pat_build -O option with the .apa file name as the argument. It generates a new instrumented executable, myprogram+apa.
```
pat_build -O myprogram+pat+4571-19sdot.apa    # generates myprogram+apa
```
Run the new executable, myprogram+apa, for a tracing experiment.
Run pat_report on the newly created performance data directory. Its xf-files subdirectory contains actual performance data files (ending in .xf). They are the tracing result.

pat_report myprogram+apa+4590-19t
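
The whole APA cycle can be summarized as a dry-run script. This is a sketch under stated assumptions: run is a hypothetical helper that only prints each command unless DRY_RUN=0 is set, the srun arguments are placeholders, and the data directory and .apa file names are the example names from the steps above (the real names differ from run to run, so they are passed in as arguments).

```shell
#!/bin/bash
# Dry-run sketch of the APA cycle. `run` is a hypothetical helper that prints
# each command and executes it only when DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

apa_workflow() {
    prog=$1      # executable linked with the perftools module loaded
    sampdir=$2   # data directory written by the sampling run (name varies per run)
    apafile=$3   # .apa file written by pat_report (name varies per run)
    run pat_build -O apa "$prog"        # instrument for sampling: ${prog}+pat
    run srun -n 48 "./${prog}+pat"      # sampling run
    run pat_report "$sampdir"           # produces the .ap2 and .apa files
    run pat_build -O "$apafile"         # re-instrument for tracing: ${prog}+apa
    run srun -n 48 "./${prog}+apa"      # tracing run; then pat_report on its data directory
}

apa_workflow myprogram myprogram+pat+4571-19s myprogram+pat+4571-19sdot.apa
```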

Monitoring Hardware Performance Counters

One can monitor hardware performance counter (HWPC) events while running sampling or tracing experiments (however, doing this with sampling experiments is discouraged). The supported PAPI standard and Intel native event names can be found by running the papi_avail and papi_native_avail commands on a compute node using a batch script:

#!/bin/bash
#SBATCH -N 1
...
module load perftools
papi_avail
papi_native_avail

By default, hardware performance counters are not monitored during sampling or tracing experiments. To enable monitoring, one has to explicitly specify up to four event names using the PAT_RT_PERFCTR environment variable before the srun command is executed. For example, one can monitor floating point operations and L1 data cache misses with the following:

export PAT_RT_PERFCTR="PAPI_FP_OPS,PAPI_L1_DCM"

Or one can set the environment variable to a predefined hardware counter group number:

export PAT_RT_PERFCTR=1

The meaning of each counter group (1, 2, 3, ...) depends on the Cray system your application is running on. For example, on some systems, group 1 collects floating-point and cache metrics. These groups and their meanings are explained in the hwpc man page on each machine (accessible when the perftools module is loaded).
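
Since at most four event names can be given, a batch script can guard against longer lists before exporting the variable. A minimal sketch, assuming a comma-separated list as in the example above; set_perfctr is a made-up helper name, and a single group number passes the check as a list of length one.

```shell
#!/bin/bash
# Set PAT_RT_PERFCTR from a comma-separated event list, refusing lists
# longer than the four-event limit.
set_perfctr() {
    count=$(printf '%s\n' "$1" | awk -F, '{ print NF }')
    if [ "$count" -gt 4 ]; then
        echo "error: $count events requested, at most 4 allowed" >&2
        return 1
    fi
    export PAT_RT_PERFCTR="$1"
}

set_perfctr "PAPI_FP_OPS,PAPI_L1_DCM" && echo "PAT_RT_PERFCTR=$PAT_RT_PERFCTR"
```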

Some Advanced CrayPat Topics

By default, CrayPat will aggregate values from multiple processing elements during a run (and there are options to control the aggregation). If you want to look at HWPC values on a per-processing element (PE) basis, you can just do the following:

pat_report -s pe=ALL file.ap2 ...

The following should work as well:

pat_report -d counters -b pe -s aggr_pe_counters=select0 ...

If you want to sort the report by PEs, you can add -s sort_by_pe=yes

If you'd like HWPC values for just the whole program (not per function, etc.), you can do:

pat_report -O hwpc -s pe=ALL ...

Measuring Load Imbalance

You can use CrayPat to measure load imbalance in programs instrumented to trace MPI functions. By default, CrayPat causes the trace wrapper for each MPI collective subroutine to measure the time for a barrier call prior to entering the collective. This time is reported by pat_report as MPI_SYNC, which is separate from the MPI function group itself. The MPI_SYNC time essentially represents the time spent waiting for the MPI call to synchronize; it indicates whether the MPI ranks arrive at the collectives together or not. If the environment variable PAT_RT_MPI_SYNC is set to 1, the time spent waiting at a barrier and synchronizing processes is reported under MPI_SYNC, while the time spent executing after the barrier is reported under MPI. The default value is 1 for tracing experiments and 0 for sampling experiments.

Cray Apprentice2

Cray Apprentice2 is a tool used to visualize performance data collected with the CrayPat tool. There are many options for viewing results. Please refer to the app2 man page or Cray's documentation for more details.

module load perftools
app2 myprogram+pat+####-####tdot.ap2

To enable the "Mosaic" and "Traffic Report" views you need to set an environment variable as follows:

export PAT_RT_SUMMARY=0

Doing this may create enormous CrayPat data files and may take a long time.

Cray Reveal

Cray Reveal is a tool developed by Cray to help develop hybrid MPI/OpenMP programs. It is part of the Cray Perftools software package. It uses the Cray CCE program library (hence it only works under PrgEnv-cray) for loopmark and source code analysis, combined with performance data collected from CrayPat. Reveal helps to identify the most time-consuming loops, with compiler feedback on dependency and vectorization. Its loop scope analysis provides variable scoping and compiler directive suggestions for adding OpenMP parallelism to a serial or pure MPI code. Please see this page with detailed steps for using Reveal.

Further Information

NERSC has prepared a detailed tutorial on Cray's perftools. You can view the presentation material.

Please refer to the HPE user guide for more details. Man pages are available but only after you load the perftools-base modulefile. Try man pat_help for a tutorial. Other man pages include:

craypat
pat_build
pat_report
app2
hwpc
intro_perftools
papi
papi_counters

For questions on using CrayPat at NERSC, please contact the NERSC help desk.