VTune
VTune is a performance analysis tool targeting Intel architectures. Users are encouraged to read the Intel VTune Amplifier documentation for general usage.
Using VTune
VTune is available on Cori by loading the vtune module.
Recommended compiler flags for VTune performance collection
Intel provides their recommended compiler flags for compiling applications when collecting performance data with VTune. Users will generally have the best results when compiling codes using the Intel compilers, although the CCE and GCC compilers can also produce applications suitable for analysis with VTune.
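As a rough illustration of the kind of flags involved, the sketch below combines debug symbols (-g) with release-level optimization (-O2), a common baseline for profiling builds. The source file app.c, the output name app, and the cc compiler command are placeholders, and the flag set is an assumption rather than Intel's full list; consult Intel's documentation for the authoritative recommendations.

```shell
# Assumed baseline: keep optimizations on and add debug symbols so that
# the profiler can map samples back to source lines.
VTUNE_CFLAGS="-g -O2"

# app.c and app are placeholder names; cc stands for whichever compiler is in use.
COMPILE_CMD="cc ${VTUNE_CFLAGS} -o app app.c"

# Print the command instead of running it, since the source file is hypothetical.
echo "${COMPILE_CMD}"
```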
When collecting performance data with VTune, users should add the Slurm flag --perf=vtune[/<version>], where <version> is an explicit version of the vtune module, e.g., --perf=vtune/2020.up3.
For some versions of VTune, users must use --perf=likwid instead
The --perf flag is designed to enable users to request a specific version of the vtune modulefile, e.g., --perf=vtune/2020.up3. However, this currently fails with some versions of VTune, with an error like the following:
srun: error: nersc_perf: invalid perf module version specififed "vtune/2020.up3"
srun: error: Invalid --perf argument: vtune/2020.up3
If a user encounters this error, they are encouraged to replace --perf=vtune with --perf=likwid, which has the same effect and enables VTune to perform all of its expected functions.
Defer finalization on KNL
It is generally recommended to defer finalization when running on KNL. Finalization is an inherently serial process, and single-core performance on KNL is poor. Thus, when running VTune on KNL, add the parameter -finalization-mode=deferred:
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --perf=vtune
# ... additional sbatch parameters ...
module load vtune
vtune -finalization-mode=deferred -collect ... -r <result-dir> -- <command-to-profile>
# in some cases, one might want to archive the libraries needed to finalize
vtune -archive -r <result-dir>
and then finalize on a login node:
vtune -finalize -result-dir <PATH>
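Because finalization happens in a separate session from collection, a small guard against a mistyped result path can be useful. The sketch below wraps the finalize command shown above; finalize_result is a hypothetical helper name, not part of VTune, and it assumes the vtune module has already been loaded on the login node.

```shell
# finalize_result is a hypothetical helper, not a VTune command.
# It refuses to run if the given result directory does not exist.
finalize_result() {
    result_dir="$1"
    if [ ! -d "${result_dir}" ]; then
        echo "error: result directory not found: ${result_dir}" >&2
        return 1
    fi
    # Same command as above, with the path supplied by the caller
    vtune -finalize -result-dir "${result_dir}"
}
```

It would be called as finalize_result <PATH>, with <PATH> being the result directory written by the batch job.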
Using VTune with Shifter
VTune can be attached to a Shifter container by executing the process in the background and then attaching VTune to the process via the PID (process identifier).
Cannot directly run collection on containers
The following will not work:
vtune -collect ... -- shifter <command-to-execute-in-container>
The recommended method is as follows:
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --perf=vtune
#SBATCH --image=<username/some-image>
# ... additional sbatch parameters ...
module load vtune
PID_FILE=$(mktemp pid.XXXXXXX)
# the first "&" causes the command to execute in the background
# "echo $!" prints the PID
# "&> ${PID_FILE}" writes the PID to the temporary file
shifter <command-to-execute-in-container> & echo $! &> ${PID_FILE}
# read the PID from the file
TARGET_PID=$(cat ${PID_FILE})
# attach VTune to the process
vtune -collect <collection-mode> --target-pid=${TARGET_PID} ...
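Since $! already holds the PID of the most recently started background job, the temporary file is optional. The following minimal sketch assigns $! directly to a variable. Here sleep stands in for the shifter command so that the snippet can run anywhere, and the vtune invocation is left as a comment.

```shell
#!/bin/bash
# "sleep 30" is a stand-in for: shifter <command-to-execute-in-container>
sleep 30 &
# $! expands to the PID of the most recently started background job
TARGET_PID=$!

# kill -0 delivers no signal; it only reports whether the process exists
if kill -0 "${TARGET_PID}" 2>/dev/null; then
    TARGET_STATE="running"
else
    TARGET_STATE="missing"
fi
echo "target PID ${TARGET_PID} is ${TARGET_STATE}"

# On a compute node, VTune would be attached at this point:
#   vtune -collect <collection-mode> --target-pid=${TARGET_PID} ...

# Cleanup for this demonstration only
kill "${TARGET_PID}"
wait "${TARGET_PID}" 2>/dev/null || true
```

Checking the PID before attaching gives an early, readable failure if the background command exited immediately, for example because of a typo in the container command.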
VTune finalization with Shifter
The Using VTune section recommends deferring finalization on KNL. However, when using containers, deferring finalization creates a problem because the binaries needed for finalization exist only within the container. For this reason, it is recommended not to defer finalization when using containers.
VTune + Shifter Example
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --constraint=knl
#SBATCH --nodes=1
#SBATCH --time=03:00:00
#SBATCH --job-name=tomopy_gridrec
#SBATCH --output=out_tomopy_%j.log
#SBATCH --image=jrmadsen/tomopy-reference:gcc
#SBATCH --perf=vtune
set -o errexit
# ensure VTune module is loaded
module load vtune
# this form of assignment sets the variable to the specified value
# only if it is unset or empty in the environment
: ${OMP_NUM_THREADS:=1}
: ${NUMEXPR_MAX_THREADS:=$(nproc)}
: ${VTUNE_COLLECTION_MODE:="advanced-hotspots"}
: ${VTUNE_SAMPLING_INTERVAL:=25}
: ${VTUNE_RESULTS_DIR:=$(mktemp -d ${PWD}/run-${VTUNE_COLLECTION_MODE}-XXXXX)}
export OMP_NUM_THREADS
export NUMEXPR_MAX_THREADS
export VTUNE_COLLECTION_MODE
export VTUNE_SAMPLING_INTERVAL
export VTUNE_RESULTS_DIR
# make sure the directory does not exist; let vtune create it
rm -rf ${VTUNE_RESULTS_DIR}
# use mktemp to guard against multiple jobs in the same directory
PID_FILE=$(mktemp pid.XXXXXX)
echo -e "\n### Submitting shifter job into background and storing PID in file: ${PID_FILE} ###\n"
shifter /opt/conda/bin/python ./run_tomopy.py -a gridrec -n 256 -s 512 -f jpeg -S 1 -c 8 -p shepp3d -i 5 & echo $! &> ${PID_FILE}
echo -e "\n### Reading PID file: ${PID_FILE} ###\n"
TARGET_PID=$(cat ${PID_FILE})
# print the target PID and the process list for debugging
echo -e "\n### Target PID: ${TARGET_PID} ###\n"
ps
# echo the environment for reference
echo -e "\n### Environment ###\n"
env
echo -e "\n### Attaching VTune process to PID ${TARGET_PID} ###\n"
vtune \
-collect ${VTUNE_COLLECTION_MODE} \
-knob collection-detail=hotspots-sampling \
-knob event-mode=all \
-knob analyze-openmp=true \
-knob sampling-interval=${VTUNE_SAMPLING_INTERVAL} \
-data-limit=0 \
--target-pid=${TARGET_PID} \
-r ${VTUNE_RESULTS_DIR}
echo -e "\nCompleted\n"
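The default-value assignments near the top of the script rely on the shell's ${VAR:=default} parameter expansion, with the leading colon acting as a no-op command that merely triggers the expansion. The short sketch below shows the behavior in isolation; DEMO_THREADS, FIRST, and SECOND are hypothetical names used only for this illustration.

```shell
# Case 1: the variable is unset, so the default is assigned
unset DEMO_THREADS
: ${DEMO_THREADS:=1}
FIRST="${DEMO_THREADS}"

# Case 2: the variable already has a value, so the default is ignored
DEMO_THREADS=8
: ${DEMO_THREADS:=1}
SECOND="${DEMO_THREADS}"

echo "${FIRST} ${SECOND}"   # prints "1 8"
```

Under this pattern, a value exported before submission (for example, VTUNE_SAMPLING_INTERVAL=10) would take precedence over the default in the script, assuming the submission environment is propagated to the job.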