# Parallelism

## On-node parallelism
On Perlmutter, it is necessary to think about both inter-node parallelism and on-node parallelism. The baseline programming model for Perlmutter is MPI+X, where X represents some deliberate level of on-node parallelism; X can be MPI itself or a shared-memory programming model such as OpenMP or pthreads. Many codes will have trouble running unchanged with 128 MPI tasks per Perlmutter node. Examples include codes that are sensitive to MPI latency, such as 3D FFTs, and codes that duplicate data structures across MPI ranks (MPI codes often do not perfectly distribute every data structure across ranks), which can exhaust the available memory more quickly when running in pure MPI mode.
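As a concrete illustration, here is a minimal MPI+OpenMP "hello" sketch of the hybrid model: each MPI rank opens an OpenMP parallel region, so a node can be populated with fewer ranks and several threads per rank. The compile command in the comment is only an example; the appropriate wrapper and flags depend on the programming environment.

```c++
// minimal MPI+OpenMP hybrid sketch; compile with an MPI compiler wrapper
// and OpenMP enabled, e.g. "CC -fopenmp hello.cpp" (example invocation)
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // each rank spawns an OpenMP thread team; the team size is
    // controlled by OMP_NUM_THREADS
    #pragma omp parallel
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());
    MPI_Finalize();
    return 0;
}
```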
### Threading
Thread libraries per language:

- C: pthreads, OpenMP
- C++: pthreads, OpenMP, STL threads, TBB
- Fortran: OpenMP
Library Overview:

- pthreads
    - Usage style: functional C
    - Ease-of-use: low
    - Portability: moderate (UNIX-only)
    - Features
        - Generally provides the best performance due to its low-level nature
    - Relevant C/C++ compiler flags: `-pthread` (GCC, LLVM, Intel)
- C++11 threads (a short usage sketch follows this list)
    - Object-oriented C++ wrappers around the OS-defined threading library
    - Ease-of-use: medium
    - Portability: high (any compiler conforming to the C++11 standard)
    - Features
        - Easier to use than pthreads, more difficult to use than OpenMP
        - Lightweight wrapper around pthreads on UNIX with similar performance
- OpenMP
    - Usage style: directives (`#pragma omp`)
    - Ease-of-use: high
    - Portability: moderate
        - Compiler-dependent but supported by most major compilers
        - Performance can vary between compiler versions
    - Features
        - Built-in thread-pool
        - Serial (compiled without OpenMP) and multithreaded (compiled with OpenMP) versions provided by the same block(s) of code
        - SIMD, reductions, GPU offloading (currently limited), tasking
    - Relevant C/C++ compiler flags: `-fopenmp` (GCC, Intel), `-fopenmp=libomp` (LLVM)
- TBB (Intel Threading Building Blocks)
    - Task-based parallelism library
    - Ease-of-use: medium with C++98, high with C++11
    - Portability: highest (any C++ compiler)
    - Features
        - Built-in thread-pool
        - Highly efficient nested and recursive parallelism
        - Constructs for parallel loops, reductions, function calls, etc.
        - Flow-graphs and pipelines
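To illustrate the relative ease-of-use, here is a minimal C++11 `std::thread` sketch that divides `N` loop iterations into contiguous blocks, one per hardware thread (the names `worker` and `nthreads` are illustrative):

```c++
#include <algorithm>
#include <thread>
#include <vector>

// process a contiguous block of iterations [begin, end)
void worker(unsigned begin, unsigned end)
{
    for(unsigned i = begin; i < end; ++i)
    {
        // ... work on iteration i ...
    }
}

int main()
{
    const unsigned N = 1000;
    // hardware_concurrency() may return 0, hence the max()
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    // launch one thread per block of iterations
    for(unsigned t = 0; t < nthreads; ++t)
        threads.emplace_back(worker, t * N / nthreads, (t + 1) * N / nthreads);
    // wait for all threads to finish
    for(auto& thr : threads)
        thr.join();
    return 0;
}
```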
### Performance
The best performance occurs when threads are persistent (i.e., not continuously created and destroyed, which may require a thread-pool), operate at the top of the call-stack, and require minimal synchronization.
#### Thread Pool
Both OpenMP and TBB provide thread-pools by default.
#### Call-Stack
Take the following example for the case of nested loops:
```c++
// loop "A"
for(unsigned i = 0; i < N; ++i)
{
    // loop "B"
    for(unsigned j = 0; j < M; ++j)
    {
        // some work function
        work(i, j);
    }
}
```
Given the choice of using threads on either loop "A" or loop "B", the greatest speed-up comes from having the threads operate on blocks of iterations of loop "A".

When using OpenMP, unless nested parallelism is explicitly enabled, inserting a `#pragma omp parallel for` before both loops:
```c++
// loop "A"
#pragma omp parallel for
for(unsigned i = 0; i < N; ++i)
{
    // loop "B"
    #pragma omp parallel for
    for(unsigned j = 0; j < M; ++j)
    {
        // some work function
        work(i, j);
    }
}
```
will result in OpenMP ignoring the directive on loop "B". Enabling nested parallelism in OpenMP is generally undesirable because it can quickly create an excessive number of threads. In the above example, with `OMP_NUM_THREADS=16` and nesting enabled, 16 threads are created to parallelize loop "A", and each of those 16 threads creates an additional 16 threads to parallelize loop "B", resulting in 256 total threads. Application failures can arise (from exceeding the operating system's thread limit) when the work inside loop "B" also calls external libraries that use OpenMP.
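If both loops must be parallelized, one common alternative to nesting (shown here as a sketch) is OpenMP's `collapse` clause, which fuses the perfectly nested loops into a single iteration space handled by one thread team:

```c++
// one team of threads shares the combined N*M iteration space;
// requires the two loops to be perfectly nested
#pragma omp parallel for collapse(2)
for(unsigned i = 0; i < N; ++i)
{
    for(unsigned j = 0; j < M; ++j)
    {
        work(i, j);
    }
}
```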
When using a tasking model, as with TBB or OpenMP tasks, threads are abstracted away and the work is bundled into "tasks", which can be thought of as function calls. Threads do not operate directly on the loop; instead they sit idle in a thread-pool until given work in the form of a task. The main thread divides the loop iterations into tasks and adds them to a queue; when this happens, idle threads wake up and begin popping tasks off the queue and executing them. When a thread finishes one task, it either proceeds to the next task or, if the queue is empty, goes back to "sleep". The net result is that the tasking model provides better fine-grained parallelism and work-load balancing between threads. For example, the nested loops above can be written with TBB as:
```c++
#include <tbb/parallel_for.h>

// lambda function executing one iteration of loop "A"
auto do_loop_b = [] (int i)
{
    // lambda function wrapping the "work(i, j)" call;
    // [&] captures "i" by reference
    auto do_work = [&] (int j) { work(i, j); };
    // loop "B"
    tbb::parallel_for(0, M, do_work);
};
// loop "A"
tbb::parallel_for(0, N, do_loop_b);
```
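For comparison, a roughly equivalent sketch with OpenMP tasks (the `taskloop` construct, available since OpenMP 4.5) lets one thread generate tasks that the rest of the pool executes:

```c++
#pragma omp parallel
#pragma omp single
{
    // the single thread packages iterations of loop "A" into tasks;
    // idle threads in the pool pick the tasks up and execute them
    #pragma omp taskloop
    for(unsigned i = 0; i < N; ++i)
    {
        for(unsigned j = 0; j < M; ++j)
        {
            work(i, j);
        }
    }
}
```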
#### Synchronization
Synchronizing access to shared memory in multithreaded code creates a serial bottleneck, and a poor synchronization implementation can all but eliminate any potential speed-up. One of the most common synchronization issues involves updating a shared container:
```c++
#include <mutex>
#include <vector>

// a shared container
static std::vector<double> my_container = std::vector<double>(100, 0.0);
// a mutex for synchronization
static std::mutex mtx;

void update_container(int index, double value)
{
    // threads will idle here waiting to acquire the lock;
    // this creates a serial bottleneck because only
    // one thread can hold the lock at a time
    mtx.lock();
    // update value
    my_container[index] += value;
    // release lock so other threads can acquire it
    mtx.unlock();
}
```
A better solution is to have each thread operate on its own container and, once it is finished, merge the contents of its thread-local container into the shared container. Giving each thread its own container is accomplished with thread-local memory.

NOTE: some implementations restrict thread-local variables to types with trivial constructors, so unless the data is plain-old-data (int, double, etc.), store a pointer and allocate the object dynamically. The pointer itself is destroyed when the thread exits, but the object it points to is not freed automatically, so delete it once the merge is complete.
```c++
#include <mutex>
#include <vector>

// a shared container
static std::vector<double> my_container = std::vector<double>(100, 0.0);
// a container local to each thread
static thread_local std::vector<double>* tl_container = new std::vector<double>(100, 0.0);
// a mutex for synchronization
static std::mutex mtx;

void update_container(int index, double value)
{
    // the container is local to the thread so no lock is needed
    (*tl_container)[index] += value;
}

// have each thread call this function when it is done
void merge_container()
{
    // block access to the shared resource
    mtx.lock();
    // add the thread-local contents into shared memory
    for(unsigned i = 0; i < tl_container->size(); ++i)
        my_container[i] += (*tl_container)[i];
    // release lock
    mtx.unlock();
}
```
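As a side note, the explicit `lock()`/`unlock()` pair can be replaced with C++11's `std::lock_guard`, which releases the mutex automatically when it goes out of scope, even if an exception is thrown. A sketch of `merge_container` using it:

```c++
void merge_container()
{
    // RAII: the mutex is locked here and released automatically
    // when "lk" is destroyed at the end of the function
    std::lock_guard<std::mutex> lk(mtx);
    for(unsigned i = 0; i < tl_container->size(); ++i)
        my_container[i] += (*tl_container)[i];
}
```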
## Off-node parallelism

### Strong scaling
Strong scaling describes how the time to solution varies with the number of processing elements for a fixed total problem size.
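For example, if a fixed-size problem takes time T(1) on one processing element and T(n) on n processing elements, the speedup is S(n) = T(1) / T(n); perfect strong scaling means S(n) = n, i.e., doubling the processing elements halves the time to solution. In practice, serial sections and communication overhead limit the achievable speedup (Amdahl's law).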
### Weak scaling
Weak scaling describes how the time to solution varies with the number of processing elements for a fixed problem size per processing element.
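For example, if the problem size grows proportionally with the number of processing elements, perfect weak scaling means the time to solution stays constant, so the weak-scaling efficiency T(1) / T(n) remains close to 1 (Gustafson's law describes the corresponding scaled speedup).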