MAP¶

Linaro Forge sampler's CUPTI error message on CPU nodes

When MAP or Performance Reports is used on CPU nodes, multiple lines of the following message can appear:

Linaro Forge sampler: CUPTI failed to enable kernel activity monitoring - error code 15

This message on CPU nodes is benign and can be ignored. It can be suppressed with the command:

export FORGE_SAMPLER_DISABLE_GPU_PROFILING=1

Don't set this on GPU nodes.
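
If the same startup file or job script is used on both CPU and GPU nodes, the variable can be set conditionally. The following is a minimal sketch rather than an official recipe: the helper name forge_set_gpu_profiling is hypothetical, and using "nvidia-smi -L" as the probe for a visible GPU is an assumption about how the nodes are set up.

```shell
# Hypothetical helper: set FORGE_SAMPLER_DISABLE_GPU_PROFILING only when
# no NVIDIA GPU is visible. The probe command is an assumption; the default
# "nvidia-smi -L" is expected to fail when no GPU or driver is present.
forge_set_gpu_profiling() {
    probe="${1:-nvidia-smi -L}"
    if $probe >/dev/null 2>&1; then
        # GPU node: leave GPU profiling enabled.
        unset FORGE_SAMPLER_DISABLE_GPU_PROFILING
    else
        # CPU node: suppress the benign CUPTI message.
        export FORGE_SAMPLER_DISABLE_GPU_PROFILING=1
    fi
}

forge_set_gpu_profiling
```

Passing a different probe command as the first argument replaces the nvidia-smi check, which may be preferable if a site-specific way of detecting GPU nodes is available.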

MAP, part of the Linaro Forge (previously known as Arm Forge or Allinea Forge) tool suite, is a source-level parallel profiler with a simple graphical user interface.

Note that the performance of the X Window-based MAP graphical user interface can be greatly improved by using it with the free ThinLinc software.

Introduction¶

MAP is a parallel profiler with a simple graphical user interface. It can profile serial, OpenMP, CUDA and MPI codes (up to 2048 tasks).

The Forge User Guide, available from the official web page or at $ALLINEA_TOOLS_DOCDIR/userguide-forge.pdf, is a good resource for learning more about the advanced MAP features. The variable ALLINEA_TOOLS_DOCDIR is defined by the forge module.

Loading the Forge Module¶

To use MAP, first load the forge module to set the correct environment settings:

module load forge

Compiling Code to Run with MAP¶

Dynamic linking is the default mode of linking on Perlmutter. To build a dynamically-linked executable, you don't have to explicitly build MAP libraries. Generally speaking, build your executable as you would normally do, but with the -g compile flag to keep debugging symbols, together with optimization flags that you would normally use:

ftn -c -g -O3 ... testMAP.f
ftn -o testMAP_ex testMAP.o

The recommended set of compilation flags is:

- CPU code (or host code)
  - PrgEnv-gnu: -g1 -O3 -fno-inline -fno-optimize-sibling-calls
  - PrgEnv-nvidia: -g -O3 -Meh_frame -Mnoautoinline
  - PrgEnv-cray
    - C/C++: -g1 -O3 -fno-inline -fno-optimize-sibling-calls
    - Fortran: -G2 -O3 -h ipa0
- nvcc for CUDA kernels: -g -lineinfo -O3

Do not generate debug information for device code with the -G (--device-debug) flag, as it can significantly slow down the code. Use -lineinfo instead.

For more info, please check the user guide.
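
The CPU flags listed above can be collected in a small shell function so that build scripts pick the right set for each programming environment. This is only a sketch: the function name map_cpu_flags and its argument convention are hypothetical, and the flag strings are copied from the list above.

```shell
# Print the recommended CPU (host) flags for a programming environment.
# Arguments: $1 = environment (gnu, nvidia, cray)
#            $2 = language (c or fortran; only matters for cray)
# The function name and arguments are hypothetical, for illustration only.
map_cpu_flags() {
    case "$1" in
        gnu)    echo "-g1 -O3 -fno-inline -fno-optimize-sibling-calls" ;;
        nvidia) echo "-g -O3 -Meh_frame -Mnoautoinline" ;;
        cray)
            case "${2:-c}" in
                fortran) echo "-G2 -O3 -h ipa0" ;;
                *)       echo "-g1 -O3 -fno-inline -fno-optimize-sibling-calls" ;;
            esac ;;
        *) echo "unknown environment: $1" >&2; return 1 ;;
    esac
}

# Example (not run here): compile Fortran under PrgEnv-cray.
# ftn -c $(map_cpu_flags cray fortran) testMAP.f
```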

Static linking is not supported on Perlmutter.

Starting a Job with MAP¶

Running an X Window GUI application can be painfully slow when it is launched from a remote system over the internet. NERSC recommends using the free ThinLinc software because it can greatly improve the performance of the X Window-based MAP GUI. Another way to cope with the problem is to use the Forge remote client, which is discussed in the next section.

Be sure to log in with X Window forwarding enabled. This could mean using the -X or -Y option to ssh. The -Y option often works better for macOS.

ssh -Y username@perlmutter.nersc.gov
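
Once logged in, a quick way to see whether forwarding took effect is to check the DISPLAY variable. The function below is a hypothetical helper, and a non-empty DISPLAY is only a heuristic: it does not guarantee that the GUI will actually open.

```shell
# Return 0 if an X display appears to be available, 1 otherwise.
# A non-empty $DISPLAY is a heuristic, not proof that forwarding works.
x_forwarding_enabled() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "DISPLAY is set to $DISPLAY"
    else
        echo "DISPLAY is not set; reconnect with ssh -X or ssh -Y" >&2
        return 1
    fi
}
```

Running x_forwarding_enabled before launching the GUI gives an early hint when the ssh session was opened without -X or -Y.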

After loading the forge module and compiling with the -g option, request an interactive session:

salloc -A <project> -C cpu -N <numNodes> -q interactive -t 30:00     # Perlmutter CPU

Load the forge module if you haven't loaded it yet:

module load forge

Then launch the profiler with

map ./testMAP_ex

where ./testMAP_ex is the name of your program to profile.

The Forge GUI will pop up, showing a start up menu for you to select what to do. For profiling choose the option 'PROFILE' with the MAP tool. You can also choose to 'LOAD PROFILE DATA FILE' to view profiling results saved in a file created in a previous MAP run.

MAP start window

Then a submission window will appear with a prefilled path to the executable to profile. Select the number of processes on which to run and click 'Run'. To pass command line arguments to a program, enter them in the 'srun arguments' box.

MAP Run window

MAP will start your program and collect performance data from all processes.

MAP window when running

By default, MAP lets your program run to completion and will display data for the entire run. You can also use the 'Stop and Analyze' button and the menu beneath it to control how long to profile your program.

Reverse Connect Using Remote Client¶

If you want to use the ThinLinc tool instead of the remote client, you can skip this section.

Forge remote clients are provided for Windows, macOS and Linux. A client runs on your local desktop and connects via SSH to NERSC systems, so that you can debug, profile, edit and compile files directly on the remote NERSC machine. You can download the clients from the Forge download page and install one on your laptop or desktop.

Please note that the client version must be the same as the Forge version that you're going to use on the NERSC machines.
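
One way to find the version on the NERSC side is to read it from the banner that MAP prints at the start of a run, such as the "Linaro Forge 23.0 - Linaro MAP" line in the sample batch output in the 'Running in Command Line Mode' section. The helpers below are a sketch under that assumption. Both function names are hypothetical, the banner format may differ between Forge versions, and the client version is something you read off the installed client yourself.

```shell
# Extract the version from a Forge banner line read from standard input,
# e.g. "Linaro Forge 23.0 - Linaro MAP". The banner format is taken from
# the sample output in this page and is an assumption for other versions.
forge_banner_version() {
    sed -n 's/^Linaro Forge \([0-9][0-9.]*\) .*/\1/p' | head -n 1
}

# Succeed only if the server-side and client version strings are identical.
forge_versions_match() {
    [ "$1" = "$2" ]
}

# Example (not run here): compare against a client version typed by hand.
# server=$(forge_banner_version < slurm-6130079.out)
# forge_versions_match "$server" "23.0" || echo "version mismatch"
```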

Instructions for configuring the client are provided on the DDT web page. If you have already configured the client for using DDT on a NERSC machine, the same configuration is used for running MAP.

To start a MAP session after the configuration step, select the configuration for the machine that you want to use from the 'Remote Launch' menu.

MAP Reverse Connect window

You'll be prompted to authenticate with password plus MFA (Multi-Factor Authentication) OTP (One-time password):

allinea-remoteclient4

If you have set up ssh to use the ssh keys generated by sshproxy, as shown in the MFA page's 'Ssh Configuration File Options' section, and the keys have not expired, the remote client will connect to the desired machine without asking for your password and OTP.

You can use the Reverse Connection method with the remote client. To do this, put aside the remote client window that you have been working with, and log in to the corresponding machine from a window on your local machine, as you would normally do.

ssh perlmutter.nersc.gov         # Perlmutter

Then, start an interactive batch session there. For example,

salloc -N 2 -G 8 -t 30:00 -q debug -C gpu -A ...  # Perlmutter GPU

and run MAP with the option --connect as follows:

module load forge
map --connect srun -n 32 -c 8 --cpu-bind=cores ./jacobi_mpi

The remote client will ask you whether to accept a Reverse Connect request. Click 'Accept'.

Accept Connection

The usual Run window, as shown near the top of this webpage, will appear, where you can change or set run configurations and profiling options. Click 'Run'.

Now, your program will start under MAP and profiling results are displayed in the remote client.

Profiling Results¶

After completing the run, MAP displays the collected performance data in the GUI.

MAP results

For info on how to interpret the results, please see the Forge User Guide.

MAP saves profiling results in a file named executablename_#p_#n_yyyy-mm-dd_HH-MM.map, where #p is the process count, #n is the node count and yyyy-mm-dd_HH-MM is the time stamp.

$ ls -l
-rw-------  1 elvis elvis 621583 Mar 16 21:31 jacobi_mpi_32p_1n_2023-03-16_21-30.map
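
When many profiles accumulate, the file name can be split back into its parts with standard tools. The function below is a sketch: its name is hypothetical, and the pattern is inferred from the example file name above (executable, process count, node count, time stamp), so it may not cover every Forge version.

```shell
# Split a MAP profile file name of the form
#   <executable>_<P>p_<N>n_<yyyy-mm-dd>_<HH-MM>.map
# into "executable processes nodes timestamp". The pattern is inferred
# from the example above; names that do not match print nothing.
parse_map_filename() {
    basename "$1" | sed -n \
        's/^\(.*\)_\([0-9][0-9]*\)p_\([0-9][0-9]*\)n_\([0-9-]*_[0-9-]*\)\.map$/\1 \2 \3 \4/p'
}

parse_map_filename jacobi_mpi_32p_1n_2023-03-16_21-30.map
# prints: jacobi_mpi 32 1 2023-03-16_21-30
```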

CUDA Code Profiling¶

To enable CUDA analysis mode, click the checkboxes for 'Kernel analysis (CUDA only)' and 'Memory transfers (CUDA only)' under the 'GPU' menu of the Run window.

MAP enable GPU analyses

MAP will display data for lines inside CUDA kernels and memory transfers. CPU time spent waiting for CUDA kernels to complete is shown in purple. For the performance metrics, you can select 'Preset: Nvidia' which will show the 'GPU utilization' and 'GPU memory usage' time-series data.

MAP CUDA profiling

Note that MAP uses the timings from the perspective of the host. So the time spent in a non-blocking kernel is attributed to the next synchronous API call (e.g., cudaMemcpy), not to the kernel itself. This is also seen when the 'Functions' tab is clicked:

MAP CUDA profiling, Functions

To see the actual kernel runtime, click the 'GPU Kernels' tab:

MAP CUDA profiling, Kernels

Running in Command Line Mode¶

MAP can be run from the command line without the GUI by using the --profile option. You can submit a batch job as follows:

$ cat runit
#!/bin/bash
#SBATCH -A <project>
#SBATCH -C cpu
#SBATCH -N 1
#SBATCH -q debug
#SBATCH -t 10:00

module load forge
map --profile --np=32 ./jacobi_mpi

$ sbatch runit
Submitted batch job 6130079

$ cat slurm-6130079.out
Linaro Forge 23.0 - Linaro MAP

Profiling             : /pscratch/sd/e/elvis/jacobi_mpi
Allinea sampler       : preload
MPI implementation    : Auto-Detect (SLURM (MPMD))
* number of processes : 32
* number of nodes     : 1
* Allinea MPI wrapper : preload (JIT compiled)


MAP analysing program...
MAP gathering samples...
MAP generated /pscratch/sd/e/elvis/jacobi_mpi_32p_1n_2023-03-16_21-36.map
           1   85.3816681
           ...
          10   16.8724918
...

$ ls -l
...
-rw-------  1 elvis elvis 654668 Mar 16 21:37 jacobi_mpi_32p_1n_2023-03-16_21-36.map
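
After a batch run, the newest .map file is usually the one to inspect. The helper below is a sketch with a hypothetical name that picks the most recently modified profile in a directory. The file can then be opened through the 'LOAD PROFILE DATA FILE' option described earlier. Passing the file name directly to map, as in the commented example, is an assumption to check against the Forge User Guide for your version.

```shell
# Print the most recently modified .map file in a directory (default: the
# current directory). Returns 1 if none is found. Assumes file names
# without newlines, which holds for the names shown in this page.
latest_map_file() {
    dir="${1:-.}"
    newest=$(ls -t "$dir"/*.map 2>/dev/null | head -n 1)
    [ -n "$newest" ] && echo "$newest"
}

# Example (not run here; assumes map accepts a profile file argument):
# map "$(latest_map_file)"
```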

Troubleshooting¶

If you are having trouble launching MAP, try these steps.

Make sure you have the most recent version of the system.config configuration file. The first time you run a Forge tool (DDT or MAP), you pick up a master template which is then stored locally in your home directory in ~/.allinea/${NERSC_HOST}/system.config, where ${NERSC_HOST} is the machine name. If you are having problems launching MAP, you could be using an older version of the system.config file, and you may want to remove the entire directory:

rm -rf ~/.allinea/${NERSC_HOST}

Remove any stale temporary files that may have been left by a previous Forge session:

rm -rf $TMPDIR/allinea-$USER

In case of a font problem where every character is displayed as a square, please delete the .fontconfig directory in your home directory and restart MAP.

rm -rf ~/.fontconfig

Make sure you are requesting an interactive batch session. NERSC has configured Forge to run from interactive batch jobs.

salloc -q interactive -N <numNodes> -A <project> ...

Finally, make sure you have compiled your code with -g. If none of these tips help, please contact the consultants via https://help.nersc.gov.
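
The rm -rf commands above depend on NERSC_HOST, TMPDIR and USER being set. If NERSC_HOST is empty, the first command removes all of ~/.allinea, and if TMPDIR is empty, the second one points at the filesystem root. The sketch below lists the directories those commands would remove and refuses to continue if a variable is missing. It is a hypothetical helper for a dry run and deletes nothing itself.

```shell
# Print the directories the cleanup steps above would remove, refusing to
# continue if a variable they depend on is empty. Nothing is deleted here.
forge_cleanup_targets() {
    for var in HOME NERSC_HOST TMPDIR USER; do
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            echo "refusing: \$$var is not set" >&2
            return 1
        fi
    done
    echo "$HOME/.allinea/$NERSC_HOST"
    echo "$TMPDIR/allinea-$USER"
    echo "$HOME/.fontconfig"
}
```

Reviewing the printed paths before running the rm -rf commands by hand keeps the cleanup limited to the intended directories.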