Cray Reveal¶
Description¶
Cray Reveal is part of the Cray Perftools software package. It uses the Cray CCE program library (hence it works only under PrgEnv-cray) for source code analysis, combined with performance data collected by CrayPat. Reveal helps identify the top time-consuming loops and provides compiler feedback on dependencies and vectorization.
Achieving the best performance on today's supercomputers demands that, besides using MPI between nodes or sockets, the developer also use a shared-memory programming paradigm on the node and vectorize low-level looping structures.
The process of adding parallelism involves finding the top serial work-intensive loops; performing parallel analysis, scoping, and vectorization; adding OpenMP layers of parallelism; and analyzing performance for further optimizations, in particular vectorization of the innermost loops. Cray Reveal simplifies these tasks: its loop scope analysis provides variable scoping and compiler directive suggestions for inserting OpenMP parallelism into serial or pure-MPI code.
Steps to Use Cray Reveal¶
Reveal is available on Perlmutter under PrgEnv-cray by loading the Cray perftools module. Reveal has a GUI, so it needs X11 access (ssh -XY or NX).
1. Basic Steps to Setup User Environment¶
nersc$ module swap PrgEnv-gnu PrgEnv-cray
nersc$ module unload darshan
nersc$ module load perftools-base
nersc$ module load perftools
Note
Once the perftools-base module is loaded, the perftools version will be matched to the corresponding perftools-base module.
2. Generate Loop Work Estimates¶
a. Build with -h profile_generate¶
Fortran code example:
nersc$ ftn -c -h profile_generate myprogram.f90
nersc$ ftn -o myprogram -h profile_generate myprogram.o
C code example (this code will be used in the following steps):
nersc$ cc -c -h profile_generate myprogram.c
nersc$ cc -o myprogram -h profile_generate myprogram.o
NERSC provides the C code example.
Note
It is a good idea to separate compilation and linking to preserve the object files. It is also suggested to separate this step from generating the program library (with -hpl), since -h profile_generate disables all optimizations.
b. Build CrayPat Executable¶
nersc$ pat_build -w myprogram
The executable myprogram+pat will be generated. Here, the -w flag is used to enable tracing.
c. Run the Program to Generate Raw Performance Data in *.xf Format¶
Below is a simple interactive batch session example; a regular batch script can also be used to launch the myprogram+pat program (a sketch of such a script is given at the end of this step). It is recommended that you execute the code from a Lustre file system.
nersc$ salloc -q debug -t 30:00 -C cpu
Then, use srun to run the code. In this case, 4 MPI tasks are used.
nersc$ srun -n 4 ./myprogram+pat
This generates one or more raw data files in *.xf format.
Before proceeding, relinquish the job allocation:
nersc$ exit
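For reference, a regular batch script for this step might look like the following minimal sketch; the job name, node count, constraint, and QOS are placeholders to adjust for your allocation and system.
#!/bin/bash
#SBATCH -q debug
#SBATCH -C cpu
#SBATCH -N 1
#SBATCH -t 30:00
#SBATCH -J reveal_profile   # hypothetical job name

# Run the CrayPat-instrumented executable with 4 MPI tasks
srun -n 4 ./myprogram+pat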
d. Generate *.ap2 and *.rpt Files via pat_report¶
nersc$ pat_report myprogram+pat+......xf > myprogram+pat.rpt
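pat_report also writes the *.ap2 performance data that Reveal reads in step 5. Depending on the perftools version, this may be a standalone *.ap2 file or data inside a myprogram+pat+... experiment directory; a quick listing (the wildcard below stands in for the run-specific suffix) shows what was produced:
nersc$ ls -d myprogram+pat+*   # locate the *.xf/*.ap2 output for this run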
3. Generate a Program Library¶
nersc$ cc -O3 -hpl=myprogram.pl -c myprogram.c
Warning
If there are multiple source code directories, the program library directory must be given as an absolute path so that every compilation writes to the same library (see the sketch below).
Note
myprogram.pl is a directory; users need to clean it out from time to time.
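For example, when sources are split across directories, each compile might pass the same absolute program library path. This is a minimal sketch; the $HOME/myproject path and the src1/src2 layout are hypothetical.
nersc$ cc -O3 -hpl=$HOME/myproject/myprogram.pl -c src1/main.c     # hypothetical absolute path and source layout
nersc$ cc -O3 -hpl=$HOME/myproject/myprogram.pl -c src2/solver.c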
4. Save an Original Copy of Your Source Code¶
nersc$ cp myprogram.c myprogram_orig.c
The Reveal-suggested code may overwrite your original version.
5. Launch Reveal¶
nersc$ reveal myprogram.pl myprogram+pat+....ap2
(Use the exact *.ap2 file name in the above command.)
6. Perform Loop Scoping¶
Choose the "Loop Performance" view from the "Navigation" drop-down list, pick some of the high time-consuming loops, start scoping, and insert directives.
The left panel lists the top time-consuming loops, the top right panel displays the source code, and the bottom right panel displays the compiler information about a loop.
Double-click a line in the "Info" section to display more explanation of the compiler's decisions about each loop, such as whether it was vectorized or unrolled.
Double-click the line corresponding to a loop in the source code panel, and a new "Reveal OpenMP Scoping" window will pop up:
Ensure that only the required loop is checked. Click the "Start Scoping" button at the bottom left. The scoping results for each variable are shown in the "Scoping Results" tab. Some variables are marked in red as "Unresolved", and the reason scoping failed is also given.
Click the "Show Directive" button and the Reveal-suggested OpenMP Directive will be displayed:
Click the "Insert Directive" button to insert the suggested directive in the code:
Click the "Save" button on the top right hand corner of the main window. A "Save Source" window will pop up:
Choose "Yes", and a file having the same name as the original file and with the OpenMP directives inserted will be created. The original file will be overwritten.
The above steps can be repeated for one loop at a time. Note that the newly saved file will have the same file name as your original code.
nersc$ cp myprogram.c myprogram.c.reveal # (myprogram.c.reveal is the code with the OpenMP directives generated by Reveal)
nersc$ cp myprogram_orig.c myprogram.c # (restore your original code from the copy saved in step 4)
nersc$ cp myprogram.c.reveal myprogram_omp.c # (myprogram_omp.c is the copy in which all the variables will be resolved)
7. Work with myprogram_omp.c¶
a. Start to resolve all unresolved variables by changing them to private, shared or reduction¶
For example, Reveal provides the following directive for the selected loop:
// Directive inserted by Cray Reveal. May be incomplete.
#pragma omp parallel for default(none) \
unresolved (i,my_change,my_n,i_max,u_new,u) \
private (j) \
shared (my_rank,N,i_min)
for ( i = i_min[my_rank]; i <= i_max[my_rank]; i++ )
{
for ( j = 1; j <= N; j++ )
{
if ( u_new[INDEX(i,j)] != 0.0 )
{
my_change = my_change
+ fabs ( 1.0 - u[INDEX(i,j)] / u_new[INDEX(i,j)] );
my_n = my_n + 1;
}
}
}
Note that the keyword unresolved is used above because Reveal could not determine the data scope of those variables. We need to change them to reduction(+:my_change,my_n), private(i), and shared(i_max,u_new,u), and save the modified code as myprogram_omp.c:
// Directive inserted by Cray Reveal. May be incomplete.
#pragma omp parallel for default(none) \
reduction (+:my_change,my_n) \
private (i,j) \
shared (my_rank,N,i_min,i_max,u,u_new)
for ( i = i_min[my_rank]; i <= i_max[my_rank]; i++ )
{
for ( j = 1; j <= N; j++ )
{
if( u_new[INDEX(i,j)] != 0.0 )
{
my_change = my_change
+ fabs ( 1.0 - u[INDEX(i,j)] / u_new[INDEX(i,j)] );
my_n = my_n + 1;
}
}
}
b. Compile with OpenMP Enabled¶
This can be done under any PrgEnv. Make sure to resolve compilation warnings and errors.
Note
Use --cpus-per-task=num_threads if you are using srun to execute your code (see the sketch below).
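As an illustration, a minimal compile-and-run sequence might look like the following sketch; the -fopenmp flag and the 4-thread setup are assumptions for this example, so use whatever OpenMP flag and thread count suit your chosen PrgEnv and node.
nersc$ cc -O3 -fopenmp -o myprogram_omp myprogram_omp.c   # OpenMP flag is an assumption; adjust for your compiler
nersc$ export OMP_NUM_THREADS=4
nersc$ srun -n 4 --cpus-per-task=4 ./myprogram_omp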
c. Compare the Performance Between myprogram and myprogram_omp¶
Output for MPI code only (myprogram.c) using 4 MPI Processes¶
POISSON_MPI!!
C version
2-D Poisson equation using Jacobi algorithm
===========================================
MPI version: 1-D domains, non-blocking send/receive
Number of processes = 4
Number of interior vertices = 1200
Desired fractional accuracy = 0.001000
N = 1200, n = 1373424, my_n = 326781, Step 1000 Error = 0.143433
N = 1200, n = 1439848, my_n = 359992, Step 2000 Error = 0.0442104
N = 1200, n = 1439907, my_n = 359994, Step 3000 Error = 0.0200615
N = 1200, n = 1439928, my_n = 360000, Step 4000 Error = 0.0114007
N = 1200, n = 1439936, my_n = 359983, Step 5000 Error = 0.0073485
N = 1200, n = 1439916, my_n = 359983, Step 6000 Error = 0.00513294
N = 1200, n = 1439935, my_n = 359996, Step 7000 Error = 0.00379038
N = 1200, n = 1439915, my_n = 359997, Step 8000 Error = 0.00291566
N = 1200, n = 1439950, my_n = 359997, Step 9000 Error = 0.00231378
N = 1200, n = 1439982, my_n = 360000, Step 10000 Error = 0.00188195
N = 1200, n = 1439988, my_n = 360000, Step 11000 Error = 0.00156145
N = 1200, n = 1439983, my_n = 360000, Step 12000 Error = 0.00131704
N = 1200, n = 1439983, my_n = 360000, Step 13000 Error = 0.00112634
Wall clock time = 29.937182 secs
POISSON_MPI:
Normal end of execution.
Output for MPI+OpenMP code (myprogram_omp.c) using 4 MPI Processes with 4 OpenMP Threads per MPI Process¶
POISSON_MPI!!
C version
2-D Poisson equation using Jacobi algorithm
===========================================
MPI version: 1-D domains, non-blocking send/receive
Number of processes = 4
Number of interior vertices = 1200
Desired fractional accuracy = 0.001000
N = 1200, n = 1373424, my_n = 326781, Step 1000 Error = 0.143433
N = 1200, n = 1439848, my_n = 359992, Step 2000 Error = 0.0442104
N = 1200, n = 1439907, my_n = 359994, Step 3000 Error = 0.0200615
N = 1200, n = 1439928, my_n = 360000, Step 4000 Error = 0.0114007
N = 1200, n = 1439936, my_n = 359983, Step 5000 Error = 0.00734855
N = 1200, n = 1439916, my_n = 359983, Step 6000 Error = 0.00513294
N = 1200, n = 1439935, my_n = 359996, Step 7000 Error = 0.00379038
N = 1200, n = 1439915, my_n = 359997, Step 8000 Error = 0.00291566
N = 1200, n = 1439950, my_n = 359997, Step 9000 Error = 0.00231378
N = 1200, n = 1439982, my_n = 360000, Step 10000 Error = 0.00188195
N = 1200, n = 1439988, my_n = 360000, Step 11000 Error = 0.00156145
N = 1200, n = 1439983, my_n = 360000, Step 12000 Error = 0.00131704
N = 1200, n = 1439983, my_n = 360000, Step 13000 Error = 0.00112634
Wall clock time = 12.807604 secs
POISSON_MPI:
Normal end of execution.
Results¶
The timings shown above were obtained with 4 MPI tasks; adding 4 OpenMP threads per MPI task reduced the wall clock time from about 29.9 to about 12.8 seconds.
Issues and Limitations¶
- Cray Reveal works only under PrgEnv-cray with the CCE compiler
- There will be unresolved and incomplete variable scopes
- More incomplete and incorrect variable scopes may be identified when compiling OpenMP code
- The user still needs to understand OpenMP and resolve the issues
- Reveal does not insert OpenMP task, barrier, critical, atomic, or similar constructs