Process and Thread Affinity¶

Process affinity (or CPU pinning) means binding each MPI process to a CPU or a range of CPUs on the node. It is important to spread MPI processes evenly across the NUMA nodes.

Thread affinity means mapping threads onto a particular subset of CPUs (called "places") that belong to the parent process (such as an MPI process) and binding them to these places so that the OS cannot migrate them to different places. It helps to take advantage of the local process state and to achieve better memory locality.

Memory locality is the degree to which data resides in memory that is close to the processors/threads working with the data.

Modern compute nodes have multiple sockets and NUMA (Non-Uniform Memory Access) domains. A thread accessing memory in a remote NUMA domain is slower than a thread accessing memory in its local NUMA domain.

Improper process and thread affinity can slow down code performance significantly. A combination of OpenMP environment variables and runtime flags is needed, and the right combination depends on the compiler and on the batch scheduler used on the system.

Node architecture¶

Perlmutter CPU-only nodes¶

Each node contains 2 sockets, each containing an AMD EPYC 7763 processor. Each processor has 64 physical cores, and each core has 2 hardware threads (logical CPUs). Each processor has 4 NUMA domains. For details, see Perlmutter architecture.

Socket 1 has physical cores 0 to 63, and socket 2 has physical cores 64 to 127. Core 0 has 2 hardware threads, with the logical CPUs numbered 0 and 128; core 1 has logical CPUs 1 and 129, and so on. When OMP_PLACES is set to "cores", each OpenMP thread binds to one core; when OMP_PLACES is set to "threads", each OpenMP thread binds to one logical CPU.
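The numbering rule above (a physical core and its second hardware thread are offset by the number of physical cores on the node) can be sketched as a small shell function. The name logical_cpus_of_core is a hypothetical helper introduced only for illustration, and the second argument is an assumption that lets the same rule cover other node types.

```bash
# Hypothetical helper: print the two logical CPU IDs that share one physical core.
#   $1 = physical core ID
#   $2 = number of physical cores on the node (128 on a CPU-only node)
logical_cpus_of_core() {
  echo "$1,$(( $1 + $2 ))"
}

logical_cpus_of_core 0 128    # prints 0,128
logical_cpus_of_core 1 128    # prints 1,129
```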

Below is the numactl -H result on a CPU compute node.

$ numactl -H 
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 0 size: 63805 MB
node 0 free: 61260 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 1 size: 64503 MB
node 1 free: 61712 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
node 2 size: 64503 MB
node 2 free: 62726 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 3 size: 64491 MB
node 3 free: 63018 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 4 size: 64503 MB
node 4 free: 60782 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 5 size: 64503 MB
node 5 free: 62521 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 6 size: 64503 MB
node 6 free: 62484 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 7 size: 64501 MB
node 7 free: 61340 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  12  12  32  32  32  32
  1:  12  10  12  12  32  32  32  32
  2:  12  12  10  12  32  32  32  32
  3:  12  12  12  10  32  32  32  32
  4:  32  32  32  32  10  12  12  12
  5:  32  32  32  32  12  10  12  12
  6:  32  32  32  32  12  12  10  12
  7:  32  32  32  32  12  12  12  10
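Reading the listing above, each NUMA domain holds 16 consecutive physical cores plus their second hardware threads, so the NUMA domain of a logical CPU can be derived arithmetically. The following is a minimal sketch under that reading; numa_domain_of_cpu is a hypothetical helper name, and the constant 16 is taken from this particular numactl output rather than queried from the system.

```bash
# Hypothetical helper: NUMA domain of a logical CPU, following the numactl listing.
#   $1 = logical CPU ID
#   $2 = number of physical cores on the node (128 on a CPU-only node)
numa_domain_of_cpu() {
  # Fold the second hardware thread back onto its physical core,
  # then divide by the 16 cores per NUMA domain shown in the listing.
  echo $(( ($1 % $2) / 16 ))
}

numa_domain_of_cpu 143 128    # prints 0 (CPU 143 is the sibling of core 15)
numa_domain_of_cpu 64 128     # prints 4 (first core of the second socket)
```

In the distance table, domains 0-3 and domains 4-7 each form one socket, which is why the distance jumps from 12 to 32 between the two groups.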

Perlmutter GPU nodes¶

Each node contains a single socket with an AMD EPYC 7763 processor and 4 NVIDIA A100 GPUs. An EPYC 7763 processor has 64 physical cores with 2 hardware threads (logical CPUs) per core. A processor has 4 NUMA domains. For details, see Perlmutter architecture.

The processor has physical cores 0 to 63. Core 0 has 2 hardware threads, with the logical CPUs numbered 0 and 64; core 1 has logical CPUs 1 and 65, and so on. When OMP_PLACES is set to "cores", each OpenMP thread binds to one core; when OMP_PLACES is set to "threads", each OpenMP thread binds to one logical CPU.

Below is the numactl -H result on a GPU compute node.

$ numactl -H 
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 63809 MB
node 0 free: 56407 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64503 MB
node 1 free: 58615 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 2 size: 64503 MB
node 2 free: 60857 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 64490 MB
node 3 free: 57008 MB
node distances:
node   0   1   2   3
  0:  10  12  12  12
  1:  12  10  12  12
  2:  12  12  10  12
  3:  12  12  12  10

Slurm Options¶

srun -n <total tasks> -c <logical CPUs per task> --cpu-bind <binding option> [-G <GPUs> --gpu-bind <binding option>] <executable>

The srun option -c sets the "number of logical cores (CPUs) per task" for the executable. For MPI and hybrid MPI/OpenMP applications one task corresponds to one MPI process.

Tip

The -c flag is optional for jobs that use one task per physical core.

The --cpu-bind option binds processes to specific cores or core sets. The available options are none (default), cores, threads, ldoms, sockets, map_cpu, map_ldom, and more. See man srun for the complete list.

On Perlmutter, we recommend --cpu-bind=cores (the most common use case) when the number of processes per node is less than or equal to the number of physical cores (128 on CPU nodes, 64 on GPU nodes). We recommend --cpu-bind=threads when the number of processes per node is greater than 128 on CPU nodes or 64 on GPU nodes. These settings help prevent the operating system (OS) from migrating processes to cores in a remote NUMA domain, and thus minimize the performance loss incurred by the NUMA penalty (memory access from a remote NUMA domain is slower than from a local NUMA domain).
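The recommendation above reduces to a single comparison. The sketch below encodes it as a shell function; recommended_cpu_bind is a hypothetical helper name, not a Slurm or NERSC utility.

```bash
# Hypothetical helper: pick the --cpu-bind value recommended above.
#   $1 = number of processes (tasks) per node
#   $2 = number of physical cores per node (128 on CPU nodes, 64 on GPU nodes)
recommended_cpu_bind() {
  if [ "$1" -le "$2" ]; then
    echo cores      # at most one task per physical core
  else
    echo threads    # more tasks than physical cores
  fi
}

recommended_cpu_bind 8 128      # prints cores
recommended_cpu_bind 256 128    # prints threads
```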

Perlmutter¶

CPU-only nodes¶

A CPU-only compute node has a total of 128 physical cores, each with 2 hardware threads, for 256 logical CPUs in total. The value of -c should be set to 256 / tasks_per_node. In the general case, when the number of tasks per node is not a divisor of 256, set -c to:

$2*\left \lfloor{\frac{128}{\mbox{tasks per node}}}\right \rfloor$
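Shell integer division already rounds down for positive values, so the formula can be evaluated directly in a job script. This is a minimal sketch; cpus_per_task is a hypothetical helper name, and the physical core count is passed as an argument so the function is not tied to one node type.

```bash
# Hypothetical helper: value for srun -c, computed as 2 * floor(cores / tasks).
#   $1 = tasks per node
#   $2 = physical cores per node (128 on a CPU-only node)
cpus_per_task() {
  echo $(( 2 * ($2 / $1) ))
}

cpus_per_task 8 128    # prints 32 (8 divides 256 evenly)
cpus_per_task 6 128    # prints 42 (2 * floor(128 / 6) = 2 * 21)
```

The result could then be passed on the command line, for example as -c "$(cpus_per_task 6 128)". Rounding down to whole physical cores keeps both hardware threads of a core within the same task.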

GPU nodes¶

CPUs¶

The AMD processor on a GPU compute node has a total of 64 physical cores, or 128 logical CPUs in total. The value of -c should be set to 128 / tasks_per_node. In the general case, when the number of tasks per node is not a divisor of 128, set -c to:

$2*\left \lfloor{\frac{64}{\mbox{tasks per node}}}\right \rfloor$
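As a worked example (the task count is chosen purely for illustration), 6 tasks per GPU node do not divide 128 evenly, so the general formula gives:

$2*\left \lfloor{\frac{64}{6}}\right \rfloor = 2*10 = 20$

Each task then receives 10 whole physical cores (20 logical CPUs), and 6 × 20 = 120 of the 128 logical CPUs are in use.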

GPUs¶

The examples in this section use the following program to demonstrate GPU affinity configurations.

// gpus_for_tasks.cpp
#include <cstdio>   // printf
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int deviceCount = 0;
  int rank, nprocs;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  cudaGetDeviceCount(&deviceCount);

  printf("Rank %d out of %d processes: I see %d GPU(s).\n", rank, nprocs, deviceCount);

  int dev, len = 15;
  char gpu_id[15];
  cudaDeviceProp deviceProp;

  for (dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);
    cudaGetDeviceProperties(&deviceProp, dev);
    cudaDeviceGetPCIBusId(gpu_id, len, dev);
    printf("%d for rank %d: %s\n", dev, rank, gpu_id);
  }

  MPI_Finalize ();

  return 0;
}

To compile in PrgEnv-gnu, run:

module load cudatoolkit
CC -o gpus_for_tasks gpus_for_tasks.cpp

and for PrgEnv-nvidia, run:

CC -cuda -o gpus_for_tasks gpus_for_tasks.cpp

When allocating CPUs and GPUs to a job in Slurm, the default behavior is that all GPUs on a particular node allocated to the job can be accessed by all tasks on that same node.

$ srun -C gpu -N 1 -n 2 -c 64 --cpu-bind=cores -G 4 ./gpus_for_tasks
Rank 0 out of 2 processes: I see 4 GPU(s).
0 for rank 0: 0000:03:00.0
1 for rank 0: 0000:41:00.0
2 for rank 0: 0000:81:00.0
3 for rank 0: 0000:C1:00.0
Rank 1 out of 2 processes: I see 4 GPU(s).
0 for rank 1: 0000:03:00.0
1 for rank 1: 0000:41:00.0
2 for rank 1: 0000:81:00.0
3 for rank 1: 0000:C1:00.0

However, when the --gpus-per-task option is used to allocate GPU resources, GPU access is restricted per task:

$ srun -C gpu -N 1 -n 2 -c 64 --cpu-bind=cores --gpus-per-task=1 ./gpus_for_tasks
Rank 1 out of 2 processes: I see 1 GPU(s).
0 for rank 1: 0000:41:00.0
Rank 0 out of 2 processes: I see 1 GPU(s).
0 for rank 0: 0000:03:00.0

For some applications, it is desirable that only certain GPUs can be accessed by certain tasks. For example, a common programming model for MPI + GPU applications is such that each GPU on a node is accessed by only a single task on that node.

Such behavior can be controlled in different ways. As we have seen above, one way is to use --gpus-per-task. Another way is to manipulate the environment variable CUDA_VISIBLE_DEVICES, as documented in the Environment variables section of the CUDA C programming guide. This approach works on any system with NVIDIA GPUs. The variable must be configured per process, and may have different values on different processes, depending on the user's desired GPU affinity settings.

If the number of tasks is the same as the number of GPUs (4 on the GPU nodes) on each node and each task is to use one GPU only, a simple way of initializing CUDA_VISIBLE_DEVICES is to set it to $SLURM_LOCALID. The SLURM_LOCALID variable is the local ID of the task within a node. Since the local ID is only defined after srun has launched the task, you need to wrap both the environment variable setting and the executable in a single shell script, as in the following example:

#!/bin/bash
# select_gpu_device wrapper script
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"

Using -G 4 to allocate GPU resources and the select_gpu_device defined above, we can reproduce the behavior of --gpus-per-task:

$ srun -C gpu -N 1 -n 2 -c 64 --cpu-bind=cores -G 4 ./select_gpu_device ./gpus_for_tasks
Rank 1 out of 2 processes: I see 1 GPU(s).
0 for rank 1: 0000:41:00.0
Rank 0 out of 2 processes: I see 1 GPU(s).
0 for rank 0: 0000:03:00.0

If the number of tasks is greater than the number of GPUs on each node, often in conjunction with the use of CUDA Multi-Process Service (MPS), you can evenly distribute GPUs in a round-robin fashion among MPI tasks by making the following changes to select_gpu_device:

#!/bin/bash
# select_gpu_device wrapper script
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % 4 ))
exec "$@"
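To see which GPU each local task would receive under this round-robin scheme without launching a job, the modulo mapping can be pulled out into a function. This is only a sketch: gpu_for_local_id is a hypothetical helper name, and the number of GPUs per node is passed as an argument instead of being hard-coded to 4.

```bash
# Hypothetical helper: GPU index assigned to a node-local task ID, round-robin.
#   $1 = node-local task ID (the value Slurm provides in SLURM_LOCALID)
#   $2 = number of GPUs per node (4 on the GPU nodes)
gpu_for_local_id() {
  echo $(( $1 % $2 ))
}

# Show the assignment for 8 tasks on a node with 4 GPUs.
for local_id in 0 1 2 3 4 5 6 7; do
  echo "local task $local_id -> GPU $(gpu_for_local_id "$local_id" 4)"
done
```

With 8 tasks and 4 GPUs, local tasks 0 and 4 share GPU 0, tasks 1 and 5 share GPU 1, and so on.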

Another way to achieve a similar result is to use Slurm's GPU affinity flags. In particular, the --gpu-bind flag may be supplied to salloc, sbatch, or srun to control which tasks can access which GPUs. The --gpu-bind flag is described on the Slurm srun page and in man srun. The --gpu-bind=map_gpu:<gpu_id_for_task_0>,<gpu_id_for_task_1>,... option sets the order in which GPUs are assigned to MPI tasks. For example, adding --gpu-bind=map_gpu:0,1 assigns GPUs 0 and 1 to MPI tasks in a round-robin fashion, such that each task on the node may access a single, unique GPU:

$ srun -A mXYZ -C gpu -N 1 -n 2 -c 64 --cpu-bind=cores -G 4 --gpu-bind=map_gpu:0,1 ./gpus_for_tasks
Rank 0 out of 2 processes: I see 1 GPU(s).
0 for rank 0: 0000:03:00.0
Rank 1 out of 2 processes: I see 1 GPU(s).
0 for rank 1: 0000:41:00.0

To run a job across 8 GPU nodes using 32 tasks total, with each task bound to one GPU, one could run the following:

$ srun -C gpu -N 8 -n 32 -c 32 --cpu-bind=cores --gpus-per-node=4 --gpu-bind=map_gpu:0,1,2,3 ./gpus_for_tasks
Rank 31 out of 32 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 31: 0000:C1:00.0
Rank 7 out of 32 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 7: 0000:C1:00.0
Rank 3 out of 32 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 3: 0000:C1:00.0
...
Rank 20 out of 32 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 20: 0000:02:00.0
Rank 24 out of 32 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 24: 0000:02:00.0
Rank 8 out of 32 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 8: 0000:02:00.0

OpenMP Environment Variables¶

`OMP_PLACES`¶

OMP_PLACES defines a list of places that threads can be pinned to. It is used for complex layouts of threads. The possible values are:

threads: Each place corresponds to a single hardware thread on the target machine.
cores: Each place corresponds to a single core (having one or more hardware threads) on the target machine.
sockets: Each place corresponds to a single socket (consisting of one or more cores) on the target machine.
A list with explicit place values: for example, "{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}" or "{0:4},{4:4},{8:4},{12:4}". An interval has the form "{lower-bound:length:stride}", so specifying "{0:3:2}" is the same as specifying "{0,2,4}". Multiple locations can be included in a place.
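The interval notation can be checked by expanding it by hand. The sketch below expands one "{lower-bound:length:stride}" interval into its explicit form; expand_place is a hypothetical helper written for illustration and is not part of any OpenMP runtime.

```bash
# Hypothetical helper: expand one place interval into an explicit CPU list.
#   $1 = lower bound, $2 = length, $3 = stride (defaults to 1 when omitted)
expand_place() {
  local cpu=$1 stride=${3:-1} n=0 out=""
  while [ "$n" -lt "$2" ]; do
    out="${out:+$out,}$cpu"      # append, with a comma after the first entry
    cpu=$(( cpu + stride ))
    n=$(( n + 1 ))
  done
  echo "{$out}"
}

expand_place 0 3 2    # prints {0,2,4}
expand_place 4 4      # prints {4,5,6,7}
```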

`OMP_PROC_BIND`¶

OMP_PROC_BIND sets the binding of threads to processors. The options are:

true: Thread affinity is enabled with an implementation-defined default place list.
false: Thread affinity is disabled.
spread: Bind threads as evenly distributed (spread) as possible.
close: Bind threads close to the master thread while still distributing threads for load balancing.
master: Bind threads to the same place as the master thread.

It is recommended to set OMP_PROC_BIND=spread in general. For nested OpenMP regions, it is recommended to set OMP_PROC_BIND=spread for the outer loop and OMP_PROC_BIND=close for the inner loop.
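For nested regions, OMP_PROC_BIND accepts a comma-separated list with one policy per nesting level, so the outer and inner settings can be given together. The following is a sketch only: the thread counts are hypothetical values chosen for illustration, and OMP_MAX_ACTIVE_LEVELS is included on the assumption that the application needs two active levels of parallelism.

```bash
# Sketch of environment settings for two nested OpenMP levels.
export OMP_MAX_ACTIVE_LEVELS=2       # allow two nested active parallel levels
export OMP_NUM_THREADS=4,2           # hypothetical: 4 outer threads, 2 inner threads each
export OMP_PLACES=cores
export OMP_PROC_BIND=spread,close    # spread at the outer level, close at the inner level
```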

`OMP_NUM_THREADS`¶

OMP_NUM_THREADS sets the number of threads to be used for the OpenMP parallel regions.
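One way to choose a value is to derive an upper bound from the CPUs that Slurm gives each task, so that no place hosts more than one thread. This is a sketch under stated assumptions: omp_threads_per_task is a hypothetical helper, it assumes 2 hardware threads per core as on Perlmutter, and it falls back to 32 when SLURM_CPUS_PER_TASK is unset so that it also runs outside a job.

```bash
# Hypothetical helper: largest thread count with at most one thread per place.
#   $1 = logical CPUs per task (the value given to srun -c)
#   $2 = OMP_PLACES setting: "cores" or "threads"
omp_threads_per_task() {
  case "$2" in
    cores)   echo $(( $1 / 2 )) ;;   # one thread per physical core
    threads) echo "$1" ;;            # one thread per logical CPU
    *)       echo "unsupported OMP_PLACES value: $2" >&2; return 1 ;;
  esac
}

# SLURM_CPUS_PER_TASK is set by Slurm when -c is given; 32 is only a fallback here.
export OMP_NUM_THREADS=$(omp_threads_per_task "${SLURM_CPUS_PER_TASK:-32}" cores)
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Applications often run best with fewer threads than this bound, so treat it as a ceiling rather than a recommendation.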

Job Script Generator¶

An interactive Job Script Generator is available at MyNERSC to provide some guidance on getting optimal process and thread binding. It is a GUI tool that allows you to input the Slurm parameters of your intended application run, and get a job script template.

Methods to Check Process and Thread Affinity¶

Use NERSC Prebuilt Binaries¶

Pre-built binaries from a small test code, xthi.c, built as pure MPI or hybrid MPI/OpenMP, can be used to check affinity. The binaries are in users' default path and are named check-mpi.<compiler>.<machine> (MPI) or check-hybrid.<compiler>.<machine> (MPI/OpenMP), for example: check-mpi.gnu.pm, check-hybrid.gnu.pm, check-mpi.nvidia.pm, check-hybrid.nvidia.pm, etc.

It is recommended that you replace your application executable with one of the small test binaries, and run with the exact same number of nodes, MPI tasks, and OpenMP threads as those your application will use and check if the desired binding is obtained.

Below is sample output (with interleaved notes) from an interactive batch session on Perlmutter CPU, covering tests with several different settings. For each MPI rank, the results report the node it runs on; for each OpenMP thread belonging to an MPI rank, they report the logical CPUs it is bound to.

MPI example¶

8 MPI ranks on one node. Using the correct -n, -c, and --cpu-bind=cores options, the MPI tasks are spread out and bound to both sockets of the Perlmutter CPU node. MPI ranks 0, 2, 4, 6 are on the first socket, and MPI ranks 1, 3, 5, 7 are on the second socket. Each MPI rank binds to 16 physical cores (32 logical CPUs in total). For example, MPI rank 0 binds to physical cores 0-15, which comprise logical CPUs 0-15,128-143.

perlmutter$ salloc -N 1 -C cpu --qos=interactive -t 20:00
salloc: Granted job allocation 9884088
salloc: Waiting for resource configuration
salloc: Nodes nid006369 are ready for job
elvis@nid006369:~> srun -n 8 -c 32 --cpu-bind=cores check-mpi.gnu.pm |sort -k4
srun: Job 9884088 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=9884088.1
Hello from rank 0, on nid006369. (core affinity = 0-15,128-143)
Hello from rank 1, on nid006369. (core affinity = 64-79,192-207)
Hello from rank 2, on nid006369. (core affinity = 16-31,144-159)
Hello from rank 3, on nid006369. (core affinity = 80-95,208-223)
Hello from rank 4, on nid006369. (core affinity = 32-47,160-175)
Hello from rank 5, on nid006369. (core affinity = 96-111,224-239)
Hello from rank 6, on nid006369. (core affinity = 48-63,176-191)
Hello from rank 7, on nid006369. (core affinity = 112-127,240-255)
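A quick sanity check on output like this is to count how many logical CPUs each affinity list contains and compare the result with the -c value. The sketch below does that for a single list; count_cpus is a hypothetical helper name, and it assumes the comma-separated range format shown in the output above.

```bash
# Hypothetical helper: count the logical CPUs in a list such as "0-15,128-143".
count_cpus() {
  local total=0 part lo hi
  local IFS=,                    # split the list on commas
  for part in $1; do
    lo=${part%-*}                # text before the dash (or the whole entry)
    hi=${part#*-}                # text after the dash (or the whole entry)
    total=$(( total + hi - lo + 1 ))
  done
  echo "$total"
}

count_cpus "0-15,128-143"    # prints 32, matching -c 32
```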

MPI/OpenMP example¶

Below is an example with 8 MPI tasks and 4 OpenMP threads per MPI task on Perlmutter CPU. You can see that the MPI tasks are again spread out over both sockets. Setting OMP_PLACES=threads binds each thread to a specific hyperthread (a unique logical CPU as reported by Slurm). Setting OMP_PROC_BIND=spread spreads the threads out across the CPUs.

elvis@nid006369:~> export OMP_NUM_THREADS=4
elvis@nid006369:~> export OMP_PLACES=threads
elvis@nid006369:~> export OMP_PROC_BIND=spread
elvis@nid006369:~> srun -n 8 -c 32 --cpu-bind=cores check-hybrid.gnu.pm |sort -k4,6
Hello from rank 0, thread 0, on nid006369. (core affinity = 0)
Hello from rank 0, thread 1, on nid006369. (core affinity = 4)
Hello from rank 0, thread 2, on nid006369. (core affinity = 8)
Hello from rank 0, thread 3, on nid006369. (core affinity = 12)
Hello from rank 1, thread 0, on nid006369. (core affinity = 64)
Hello from rank 1, thread 1, on nid006369. (core affinity = 68)
Hello from rank 1, thread 2, on nid006369. (core affinity = 72)
Hello from rank 1, thread 3, on nid006369. (core affinity = 76)
Hello from rank 2, thread 0, on nid006369. (core affinity = 16)
Hello from rank 2, thread 1, on nid006369. (core affinity = 20)
Hello from rank 2, thread 2, on nid006369. (core affinity = 24)
Hello from rank 2, thread 3, on nid006369. (core affinity = 28)
Hello from rank 3, thread 0, on nid006369. (core affinity = 80)
Hello from rank 3, thread 1, on nid006369. (core affinity = 84)
Hello from rank 3, thread 2, on nid006369. (core affinity = 88)
Hello from rank 3, thread 3, on nid006369. (core affinity = 92)
Hello from rank 4, thread 0, on nid006369. (core affinity = 32)
Hello from rank 4, thread 1, on nid006369. (core affinity = 36)
Hello from rank 4, thread 2, on nid006369. (core affinity = 40)
Hello from rank 4, thread 3, on nid006369. (core affinity = 44)
Hello from rank 5, thread 0, on nid006369. (core affinity = 96)
Hello from rank 5, thread 1, on nid006369. (core affinity = 100)
Hello from rank 5, thread 2, on nid006369. (core affinity = 104)
Hello from rank 5, thread 3, on nid006369. (core affinity = 108)
Hello from rank 6, thread 0, on nid006369. (core affinity = 48)
Hello from rank 6, thread 1, on nid006369. (core affinity = 52)
Hello from rank 6, thread 2, on nid006369. (core affinity = 56)
Hello from rank 6, thread 3, on nid006369. (core affinity = 60)
Hello from rank 7, thread 0, on nid006369. (core affinity = 112)
Hello from rank 7, thread 1, on nid006369. (core affinity = 116)
Hello from rank 7, thread 2, on nid006369. (core affinity = 120)
Hello from rank 7, thread 3, on nid006369. (core affinity = 124)

MPI/OpenMP example 2¶

Below is another hybrid MPI/OpenMP example on Perlmutter CPU, with 4 MPI tasks and 4 OpenMP threads per MPI task. You can see that the MPI tasks are again spread out over both sockets. Setting OMP_PLACES=cores binds each thread to a specific physical core, which has 2 logical CPUs as reported by Slurm. Again, we set OMP_PROC_BIND=spread.

elvis@nid006369:~> export OMP_NUM_THREADS=4
elvis@nid006369:~> export OMP_PLACES=cores
elvis@nid006369:~> export OMP_PROC_BIND=spread
elvis@nid006369:~> srun -n 4 -c 64 --cpu-bind=cores check-hybrid.gnu.pm |sort -k4,6
Hello from rank 0, thread 0, on nid006369. (core affinity = 0,128)
Hello from rank 0, thread 1, on nid006369. (core affinity = 8,136)
Hello from rank 0, thread 2, on nid006369. (core affinity = 16,144)
Hello from rank 0, thread 3, on nid006369. (core affinity = 24,152)
Hello from rank 1, thread 0, on nid006369. (core affinity = 64,192)
Hello from rank 1, thread 1, on nid006369. (core affinity = 72,200)
Hello from rank 1, thread 2, on nid006369. (core affinity = 80,208)
Hello from rank 1, thread 3, on nid006369. (core affinity = 88,216)
Hello from rank 2, thread 0, on nid006369. (core affinity = 32,160)
Hello from rank 2, thread 1, on nid006369. (core affinity = 40,168)
Hello from rank 2, thread 2, on nid006369. (core affinity = 48,176)
Hello from rank 2, thread 3, on nid006369. (core affinity = 56,184)
Hello from rank 3, thread 0, on nid006369. (core affinity = 96,224)
Hello from rank 3, thread 1, on nid006369. (core affinity = 104,232)
Hello from rank 3, thread 2, on nid006369. (core affinity = 112,240)
Hello from rank 3, thread 3, on nid006369. (core affinity = 120,248)
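The core numbers in this output follow from how the spread policy divides a task's places. As a simplified model, assuming the number of threads divides the number of places evenly and the first thread starts at the first place, thread t lands on place t * (places / threads). The function name spread_place is hypothetical, and the model ignores the uneven cases that the OpenMP specification also covers.

```bash
# Hypothetical helper: index of the place a thread binds to under "spread",
# for the simple case where the thread count divides the place count evenly.
#   $1 = thread number, $2 = number of threads, $3 = number of places
spread_place() {
  echo $(( $1 * ($3 / $2) ))
}

# Rank 0 above owns 32 cores (-c 64 with OMP_PLACES=cores) and runs 4 threads.
for t in 0 1 2 3; do
  echo "thread $t -> place $(spread_place "$t" 4 32)"
done
```

For rank 0 the places start at core 0, so the model gives cores 0, 8, 16, and 24, in agreement with the output above.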

Use OpenMP 5.0 Environment Variables¶

A runtime thread affinity display feature is included in the OpenMP 5.0 standard. This feature provides more portable, standardized thread affinity information across all compilers. It is available for gcc (9.0 and newer) and CCE (9.0 and newer).

The runtime display OpenMP thread affinity feature is enabled with two environment variables:

OMP_DISPLAY_AFFINITY: set to TRUE or FALSE. Setting this to true will cause the system to display affinity information for all OpenMP threads when entering the first parallel region and when any thread affinity information changes in subsequent parallel regions.
OMP_AFFINITY_FORMAT: set to a string that defines the output affinity values that will be output and the format used when displaying them.

The format for each field is: %[[[0].]size]type, where size defines the number of characters used for an output field and type indicates the information to output. The period indicates that the values are to be right-justified (the default is left-justified) and the 0 indicates that you want leading zeros to be included.

Some sample OMP_AFFINITY_FORMAT strings are:

OMP_AFFINITY_FORMAT="host=%H, pid=%P, thread_num=%n, thread affinity=%A"

OMP_AFFINITY_FORMAT="Thread Level=%0.3L, Parent TLevel=%5a, thread_num=%5n, thread_affinity=%15A, host=%10H"

Not specifying a size for a field allows it to expand as needed to display the result. Note that this is an OpenMP-only feature, so it has no specific information on MPI ranks, etc. The host and pid fields are useful for identifying all the OpenMP threads that belong to the same MPI task.

There are also omp_display_affinity() and omp_capture_affinity() APIs that you can call from specific threads. Details on the available display affinity fields (with short and long names) and the runtime APIs are in the OpenMP 5.0 Specification and in the OpenMP 5.0 Examples document.

Below is sample output on Perlmutter CPU (using gcc/11.2.0 and a sample code hybrid-hello.f90 with its own output commented out). You can see that with OMP_PROC_BIND=spread, the OpenMP threads are evenly spread out. With OMP_PLACES=threads, each thread binds to a specific hyperthread on a core; with OMP_PLACES=cores, each thread binds to a specific core and is free to migrate between the hyperthreads of that core (which Slurm reports as logical CPUs; for example, logical CPUs 0,128 are on physical core 0).

perlmutter$ module load PrgEnv-gnu
perlmutter$ module swap gcc gcc/11.2.0
perlmutter$ ftn -fopenmp -o hybrid-hello hybrid-hello.f90

Then, either in the batch script or in an interactive session, set the appropriate variables before the srun command.

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_DISPLAY_AFFINITY=true
export OMP_AFFINITY_FORMAT="host=%H, pid=%P, thread_num=%n, thread affinity=%A"

elvis@nid005385$ srun -n 8 -c 32 --cpu-bind=cores ./hybrid-hello |sort -k2,3
host=nid005385, pid=249438, thread_num=0, thread affinity=0
host=nid005385, pid=249438, thread_num=1, thread affinity=8
host=nid005385, pid=249439, thread_num=0, thread affinity=64
host=nid005385, pid=249439, thread_num=1, thread affinity=72
host=nid005385, pid=249440, thread_num=0, thread affinity=16
host=nid005385, pid=249440, thread_num=1, thread affinity=24
host=nid005385, pid=249442, thread_num=0, thread affinity=80
host=nid005385, pid=249442, thread_num=1, thread affinity=88
host=nid005385, pid=249470, thread_num=0, thread affinity=32
host=nid005385, pid=249470, thread_num=1, thread affinity=40
host=nid005385, pid=249472, thread_num=0, thread affinity=48
host=nid005385, pid=249472, thread_num=1, thread affinity=56
host=nid005385, pid=249471, thread_num=0, thread affinity=96
host=nid005385, pid=249471, thread_num=1, thread affinity=104
host=nid005385, pid=249473, thread_num=0, thread affinity=112
host=nid005385, pid=249473, thread_num=1, thread affinity=120

elvis@nid005385$ export OMP_PLACES=cores
elvis@nid005385$ srun -n 8 -c 32 --cpu-bind=cores ./hybrid-hello |sort -k2,3
host=nid005385, pid=9059, thread_num=0, thread affinity=0,128
host=nid005385, pid=9059, thread_num=1, thread affinity=8,136
host=nid005385, pid=9060, thread_num=0, thread affinity=64,192
host=nid005385, pid=9060, thread_num=1, thread affinity=72,200
host=nid005385, pid=9061, thread_num=0, thread affinity=16,144
host=nid005385, pid=9061, thread_num=1, thread affinity=24,152
host=nid005385, pid=9062, thread_num=0, thread affinity=80,208
host=nid005385, pid=9062, thread_num=1, thread affinity=88,216
host=nid005385, pid=9084, thread_num=0, thread affinity=32,160
host=nid005385, pid=9084, thread_num=1, thread affinity=40,168
host=nid005385, pid=9085, thread_num=0, thread affinity=96,224
host=nid005385, pid=9085, thread_num=1, thread affinity=104,232
host=nid005385, pid=9087, thread_num=0, thread affinity=112,240
host=nid005385, pid=9087, thread_num=1, thread affinity=120,248
host=nid005385, pid=9086, thread_num=0, thread affinity=48,176
host=nid005385, pid=9086, thread_num=1, thread affinity=56,184

Slurm cpu-bind flag¶

The srun flag --cpu-bind=verbose can be used to report process and thread binding. This option is recommended for advanced users only, since it requires interpreting the hexadecimal "mask" output for the CPUs on the node, such as:

cpu-bind=MASK - nid005385, task  0  0 [51353]: mask 0xffff0000000000000000000000000000ffff set
cpu-bind=MASK - nid005385, task  1  1 [51354]: mask 0xffff0000000000000000000000000000ffff0000000000000000 set
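In such a mask, bit i (counting from the least significant bit) stands for logical CPU i, so each hexadecimal digit covers four CPUs. Because the masks are wider than 64 bits, the sketch below decodes them one digit at a time instead of using a single integer conversion. The name mask_to_cpus is a hypothetical helper, not a Slurm tool.

```bash
# Hypothetical helper: list the logical CPU IDs whose bits are set in a hex mask.
mask_to_cpus() {
  local mask=${1#0x} pos=0 out="" digit bit
  # Consume hex digits from the right (least significant) end.
  while [ -n "$mask" ]; do
    digit=${mask#"${mask%?}"}    # last remaining hex digit
    mask=${mask%?}               # drop that digit
    for bit in 0 1 2 3; do
      if [ $(( (0x$digit >> bit) & 1 )) -eq 1 ]; then
        out="${out:+$out,}$(( pos + bit ))"
      fi
    done
    pos=$(( pos + 4 ))           # each hex digit spans four CPUs
  done
  echo "$out"
}

mask_to_cpus 0xf0f    # prints 0,1,2,3,8,9,10,11
```

The same function can be applied to the full-width masks printed by srun.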

Compiler Specific Environment Variables¶

Alternatively, you can set the following compiler-specific runtime environment variables to obtain affinity information as part of the job's stdout:

Intel¶

export KMP_AFFINITY=verbose

Cray CCE¶

export CRAY_OMP_CHECK_AFFINITY=TRUE

The results from each compiler have different formats and are not portable.