Performance of C++ Parallel Programming Models on Perlmutter using Lulesh
In this study, we evaluate Lulesh performance with different C++ parallel programming models on Perlmutter, including OpenMP, HPX, Kokkos, and NVC++ stdpar. We also use different compilers, such as gcc@11.2.0, clang@16.0.0, and nvhpc@22.9, to compile the applications.
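For readers less familiar with these models, the sketch below shows a hypothetical element-wise update written with the C++17 parallel algorithms that NVC++ -stdpar builds on; the OpenMP, HPX, and Kokkos versions of Lulesh express equivalent loops with their own constructs. The kernel and identifiers here are illustrative only and are not taken from the Lulesh sources.

// Illustrative sketch only: a daxpy-like kernel using C++17 parallel algorithms,
// the model that nvc++ -stdpar targets (GPU with -stdpar=gpu, CPU threads with -stdpar=multicore).
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);   // index space for the element loop

    double* xp = x.data();
    double* yp = y.data();
    // The OpenMP version would express this loop as "#pragma omp parallel for" over i,
    // Kokkos as Kokkos::parallel_for, and HPX with its parallel algorithms.
    std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                  [xp, yp](std::size_t i) { yp[i] += 2.0 * xp[i]; });
    return 0;
}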
Lulesh is a widely used benchmark application that assesses the efficiency of parallel computing architectures in solving partial differential equations related to solid mechanics. For further details about Lulesh, please refer to https://asc.llnl.gov/codes/proxy-apps/lulesh.
If you are interested in other C++ parallel algorithms or programming models, or would like a more detailed performance report, please feel free to contact us via help.nersc.gov.
Performance results
CPU-based Performance
Lulesh benchmark with problem size 30
Lulesh benchmark with problem size 60
Lulesh benchmark with problem size 90
GPU-based Performance
Lulesh benchmark with nvhpc gpu (there is no control over the number of threads for the NVC++ -stdpar=gpu version)
Source code used in this study
This study uses the following open-source repositories, each of which includes build instructions in its repository.
Lulesh OpenMP version
Lulesh HPX version
Lulesh Kokkos version
Lulesh NVC++ version
Notes:

- To obtain correct computation results with the NVC++ version, the changes in https://github.com/LLNL/LULESH/pull/24 must be applied to the original source code.
- To enable multi-threaded execution for the NVC++ version, the extra C++ flag --gcc-toolchain is needed, for example --gcc-toolchain=/opt/cray/pe/gcc/11.2.0/bin/gcc (see the example build commands below). The NVC++ -stdpar=gpu version does not provide control over the number of threads.
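As a rough illustration of how the NVC++ builds might be invoked on Perlmutter (a minimal sketch only; the exact source file list, optimization flags, and any GPU-architecture options should be taken from the build instructions in the repository, and -DUSE_MPI=0 is assumed for a single-node, non-MPI build):

# hypothetical one-step builds of the multicore and GPU stdpar binaries used below
nvc++ -O3 -std=c++17 -DUSE_MPI=0 -stdpar=multicore --gcc-toolchain=/opt/cray/pe/gcc/11.2.0/bin/gcc -o multicoreLulesh2.0 *.cc
nvc++ -O3 -std=c++17 -DUSE_MPI=0 -stdpar=gpu --gcc-toolchain=/opt/cray/pe/gcc/11.2.0/bin/gcc -o gpuLulesh2.0 *.cc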
Example Run Scripts
#!/bin/bash
#SBATCH -A $PROJECT_ID
#SBATCH -C gpu
#SBATCH -t 10:00:00
#SBATCH -q regular
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -o lulesh.out
#SBATCH -e lulesh.err
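# OMP_PROC_BIND=spread and OMP_PLACES=threads pin OpenMP threads, spreading them across hardware threads.
# -s sets the Lulesh problem size (elements per cube edge); --hpx:threads sets the number of HPX worker threads.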
for SIZE in 30 60 90
do
  for NUM_THREADS in 1 2 4 8 16 32 64 128
  do
    echo "running ref_gcc_openmp with $SIZE workload and $NUM_THREADS threads"
    OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./ref_gcc_openmp -s $SIZE
    echo "running ref_clang_openmp with $SIZE workload and $NUM_THREADS threads"
    OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./ref_clang_openmp -s $SIZE
    echo "running hpx_gcc with $SIZE workload and $NUM_THREADS threads"
    ./hpx_gcc -s $SIZE --hpx:threads=$NUM_THREADS
    echo "running hpx_clang with $SIZE workload and $NUM_THREADS threads"
    ./hpx_clang -s $SIZE --hpx:threads=$NUM_THREADS
    echo "running kokkos_gcc_openmp with $SIZE workload and $NUM_THREADS threads"
    OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./kokkos_gcc_openmp -s $SIZE
    echo "running kokkos_clang_openmp with $SIZE workload and $NUM_THREADS threads"
    OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./kokkos_clang_openmp -s $SIZE
    echo "running lulesh nvc++ multicore with $NUM_THREADS threads and workload $SIZE"
    OMP_NUM_THREADS=$NUM_THREADS OMP_PROC_BIND=spread OMP_PLACES=threads ./multicoreLulesh2.0 -s $SIZE
    echo ""
  done
  # The -stdpar=gpu build ignores thread settings, so it is run once per problem size.
  echo "running lulesh nvc++ gpu with workload $SIZE"
  ./gpuLulesh2.0 -s $SIZE
  echo ""
  echo "finished running $SIZE workload size"
done
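Assuming the script above is saved as run_lulesh.sh (a hypothetical file name) and the project account is filled in, it can be submitted to the batch system with:

sbatch run_lulesh.sh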