# Cray MPICH
The default and preferred MPI implementation on Cray systems is Cray MPICH. It is provided via the Cray compiler wrappers `cc`, `CC`, and `ftn`, which are used instead of the common `mpicc`, `mpicxx`, and `mpif90` wrappers.
**Read the manual**

Please read the `intro_mpi` man page; it goes well beyond "intro"-level content!
## CUDA-Aware MPI
HPE Cray MPI is a CUDA-aware MPI implementation. This means that the programmer can use pointers to GPU device memory in MPI buffers, and the MPI implementation will correctly move the data between GPU device memory and the network interface card (NIC): either by implicitly copying the data to host memory first and then from host memory to the NIC, or, on hardware that supports GPUDirect RDMA, by copying the data directly from the GPU to the NIC, bypassing host memory altogether.
At compile time, the HPE GPU Transport Layer (GTL) libraries must be linked. In order to enable this, take at least one of the following actions:

- Keep the default `gpu` module loaded
- Load the `cudatoolkit` and `craype-accel-nvidia80` modules
- Set the environment variable `CRAY_ACCEL_TARGET=nvidia80`
- Pass the compiler flag `-target-accel=nvidia80` (see the sketch below)
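For instance, the compiler-flag route from the last bullet can be applied directly on the compile line. This is only a minimal sketch; the source and executable names are placeholders:

```bash
# Hypothetical file names; -target-accel=nvidia80 tells the Cray wrapper to
# target the NVIDIA A100 GPUs and link the GPU Transport Layer (GTL) library.
module load cudatoolkit
cc -target-accel=nvidia80 -o my_app.ex my_app.c
```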
At run time, `MPICH_GPU_SUPPORT_ENABLED=1` must be set, and the GTL library must have been linked as described above. Otherwise, there will be errors similar to:

```
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
```
An example program is shown below:
```c
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
    int myrank;
    float *val_device, *val_host;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* One float on the host and one in GPU device memory */
    val_host = (float *)malloc(sizeof(float));
    cudaMalloc((void **)&val_device, sizeof(float));

    *val_host = -1.0;
    if (myrank != 0) {
        printf("%s %d %s %f\n", "I am rank", myrank, "and my initial value is:", *val_host);
    }

    if (myrank == 0) {
        /* Rank 0 stages its value into device memory before the broadcast */
        *val_host = 42.0;
        cudaMemcpy(val_device, val_host, sizeof(float), cudaMemcpyHostToDevice);
        printf("%s %d %s %f\n", "I am rank", myrank, "and will broadcast value:", *val_host);
    }

    /* The broadcast buffer is a GPU device pointer, which requires CUDA-aware MPI */
    MPI_Bcast(val_device, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (myrank != 0) {
        cudaMemcpy(val_host, val_device, sizeof(float), cudaMemcpyDeviceToHost);
        printf("%s %d %s %f\n", "I am rank", myrank, "and received broadcasted value:", *val_host);
    }

    cudaFree(val_device);
    free(val_host);
    MPI_Finalize();
    return 0;
}
```
The above C code can be compiled with HPE Cray MPI using:
```bash
module load cudatoolkit
export CRAY_ACCEL_TARGET=nvidia80
cc -o cuda-aware-bcast.ex cuda-aware-bcast.c
```
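As a quick sanity check (not part of the original instructions), you can verify that a GTL library was linked into the executable. The exact library name may vary between programming-environment releases; `libmpi_gtl_cuda` is assumed here:

```bash
# Hypothetical check: look for a GPU Transport Layer (GTL) library among the
# executable's shared-library dependencies.
ldd cuda-aware-bcast.ex | grep -i gtl
```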
**Load the `cudatoolkit` module**

The `cudatoolkit` module needs to be loaded. Otherwise, the executable may hang.
**Set the accelerator target to GPUs for CUDA-aware MPI**

The GTL (GPU Transport Layer) library needs to be linked for MPI communication involving GPUs. To make the library detectable and linked, you need to set the accelerator target environment variable `CRAY_ACCEL_TARGET` to `nvidia80` or use the compiler flag `-target-accel=nvidia80`. Otherwise, you may get the following runtime error:

```
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
```
```console
$ module load cudatoolkit
$ export MPICH_GPU_SUPPORT_ENABLED=1
$ srun -C gpu -n 4 -G 4 --exclusive ./cuda-aware-bcast.ex
I am rank 1 and my initial value is: -1.000000
I am rank 1 and received broadcasted value: 42.000000
I am rank 0 and will broadcast value: 42.000000
I am rank 2 and my initial value is: -1.000000
I am rank 2 and received broadcasted value: 42.000000
I am rank 3 and my initial value is: -1.000000
I am rank 3 and received broadcasted value: 42.000000
```
The `MPICH_GPU_SUPPORT_ENABLED` environment variable must be set to `1` at run time if the application uses CUDA-aware MPI.
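For batch jobs, the same variable can be exported in the job script. Below is a minimal sketch assuming a Slurm batch script on a GPU partition; the constraint, task, GPU counts, and time limit are placeholders:

```bash
#!/bin/bash
#SBATCH -C gpu            # placeholder constraint for GPU nodes
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -G 4
#SBATCH -t 00:05:00

module load cudatoolkit
export MPICH_GPU_SUPPORT_ENABLED=1   # required at run time for CUDA-aware MPI

srun ./cuda-aware-bcast.ex
```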
## Thread Safety (`MPI_THREAD_MULTIPLE`)

Cray MPICH supports thread-safety levels up to `MPI_THREAD_MULTIPLE`.
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int provided;
MPI_Init_thread(&argc,&argv, MPI_THREAD_MULTIPLE,&provided);
printf("Supports level %d of %d %d %d %d\n", provided,
MPI_THREAD_SINGLE,
MPI_THREAD_FUNNELED,
MPI_THREAD_SERIALIZED,
MPI_THREAD_MULTIPLE);
MPI_Finalize();
return 0;
}
```console
$ cc thread_levels.c
$ srun -n 1 ./a.out
Supports level 3 of 0 1 2 3
```
**Note**

Unlike in some previous releases, no additional flags or environment variables are required to access an optimized version of the highest thread-safety level (`MPI_THREAD_MULTIPLE`).
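The sketch below illustrates what `MPI_THREAD_MULTIPLE` permits: several OpenMP threads per rank making MPI calls concurrently. This is an illustrative example rather than part of the original page; the thread count, tag scheme, and pairing of ranks are arbitrary choices.

```c
#include <stdio.h>
#include <omp.h>
#include "mpi.h"

/* Each OpenMP thread exchanges a message with the thread of the same id on a
 * partner rank. Concurrent MPI calls from multiple threads require the
 * MPI_THREAD_MULTIPLE level. Run with an even number of ranks. */
int main(int argc, char *argv[])
{
    int provided, rank, size;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (provided < MPI_THREAD_MULTIPLE || size % 2 != 0) {
        if (rank == 0)
            fprintf(stderr, "Need MPI_THREAD_MULTIPLE and an even number of ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Pair rank 2k with rank 2k+1 */
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        int sendval = rank * 100 + tid, recvval = -1;

        /* Distinct tags per thread keep the concurrent exchanges separate */
        MPI_Sendrecv(&sendval, 1, MPI_INT, partner, tid,
                     &recvval, 1, MPI_INT, partner, tid,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d thread %d received %d from rank %d\n",
               rank, tid, recvval, partner);
    }

    MPI_Finalize();
    return 0;
}
```

It could be built and run with, for example, `cc -fopenmp thread_multiple.c` followed by `srun -n 2 -c 4 ./a.out`; the OpenMP flag and file name are assumptions, since the exact flag depends on the underlying compiler.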
## Resources

- `man intro_mpi`
- LLNL Tutorials
- MPI Forum (standards body)