# Performance variability

There are many potential sources of variability on an HPC system. NERSC has identified the following best practices to mitigate variability and improve application performance.
## hugepages

Use of hugepages can reduce the cost of accessing memory, especially for codes that perform many `MPI_Alltoall` operations.

- Load the hugepages module (`module load craype-hugepages2M`).
- Recompile your code.
- Add `module load craype-hugepages2M` to your batch scripts.
Note: Consider adding `module load craype-hugepages2M` to `~/.bashrc`.

For more details, see the manual page (`man intro_hugepages`).
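As a sketch, a batch script following the steps above might look like the following (the node count, QOS, time limit, and `my_program.x` are placeholders, not prescribed values):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --constraint=cpu
#SBATCH --qos=regular
#SBATCH --time=30

# Load the same hugepages module the application was compiled with
module load craype-hugepages2M

srun -n 64 ./my_program.x
```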
## Location of executables

Compilation of executables should be done in `$HOME` or `/tmp`. Executables can be copied into compute node memory at the start of a job with `sbcast`, which can greatly improve job startup times and reduce run-time variability in some cases.
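Inside a batch script, the `sbcast` step might look like this sketch (`my_program.x` is a placeholder):

```shell
# Copy the executable into node-local /tmp on every compute node,
# then launch the local copy rather than the copy on the shared filesystem.
sbcast --compress ./my_program.x /tmp/my_program.x
srun /tmp/my_program.x
```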
For applications with dynamic executables and many libraries (especially Python-based applications), use Shifter.
## Network Congestion

Communication-intensive workloads running at the same time as your workload can cause variation in the amount of time your job spends on communication. There are Cray MPI environment variables that can be set to change the strategy the system uses to route messages in your job. The Network page provides more details on these environment variables.
## Affinity

Running with correct affinity and binding options can greatly reduce variability.

- Use at least 8 ranks per node (1 rank per node cannot utilize the full network bandwidth).
- Read `man intro_mpi` for additional options.
- Check the job script generator to get correct binding options.
- Use `check-mpi.<compiler>.pm` and `check-hybrid.<compiler>.pm`, where `<compiler>` can be `gnu`, `nvidia`, or `cce`, to check affinity settings.
```console
elvis@perlmutter$ salloc -N 2 -C cpu -q interactive -t 10:00
salloc: Granted job allocation 9887582
salloc: Waiting for resource configuration
salloc: Nodes nid[004434,005440] are ready for job
elvis@nid004434$ srun -n 8 -c 64 --cpu-bind=cores check-mpi.gnu.pm | sort -nk 4
Hello from rank 0, on nid004434. (core affinity = 0-31,128-159)
Hello from rank 1, on nid004434. (core affinity = 64-95,192-223)
Hello from rank 2, on nid004434. (core affinity = 32-63,160-191)
Hello from rank 3, on nid004434. (core affinity = 96-127,224-255)
Hello from rank 4, on nid005440. (core affinity = 0-31,128-159)
Hello from rank 5, on nid005440. (core affinity = 64-95,192-223)
Hello from rank 6, on nid005440. (core affinity = 32-63,160-191)
Hello from rank 7, on nid005440. (core affinity = 96-127,224-255)
```
## Core specialization

Core specialization (`#SBATCH -S n` or `#SBATCH --core-spec=n`) moves OS functions onto cores not in use by user applications, where `n` is the number of cores to dedicate to the OS. The flag only works in a batch script submitted with `sbatch`; it cannot be requested as a flag with `salloc` for interactive jobs, since `salloc` is already a wrapper script for `srun`.
The example below dedicates 1 core per node on Perlmutter CPU to the OS, leaving the other 127 for the application. Note that, when computing the `-c` (or `--cpus-per-task`) value using the formula provided in the affinity page, cores dedicated to the OS should be excluded from the numerator. With 32 tasks spread over 2 nodes (16 tasks per node), the `-c` value is \(2 \left\lfloor (128-1)/(32/2) \right\rfloor = 14\).
```shell
#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH -S 1

srun -n 32 -c 14 --cpu-bind=cores /tmp/my_program.x
```
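The `-c` arithmetic can be checked with a short script (a sketch; the variable names are illustrative, with values taken from this Perlmutter CPU example):

```python
import math

hyperthreads_per_core = 2   # Perlmutter CPU cores support 2 hardware threads
physical_cores = 128        # physical cores per Perlmutter CPU node
core_spec = 1               # cores reserved for the OS (#SBATCH -S 1)
total_tasks = 32            # srun -n 32
nodes = 2                   # #SBATCH --nodes=2

tasks_per_node = total_tasks // nodes
cpus_per_task = hyperthreads_per_core * math.floor(
    (physical_cores - core_spec) / tasks_per_node
)
print(tasks_per_node, cpus_per_task)  # -> 16 14
```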
## Combined example

This example combines the practices above for Perlmutter CPU.
```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --constraint=cpu
#SBATCH --qos=regular
#SBATCH --time=60
#SBATCH --core-spec=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=14

module load craype-hugepages2M
sbcast -f --compress ./my_program.x /tmp/my_program.x
srun --cpu-bind=cores /tmp/my_program.x
```