Cray MPICH

Cray MPICH is the default and preferred MPI implementation on Cray systems. It is provided through the Cray compiler wrappers cc, CC, and ftn, which are used in place of the common mpicc, mpicxx, and mpif90 wrappers.
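
As a quick illustration, the minimal MPI program below (the file name hello_mpi.c is just an example) needs no MPI-specific include paths or -lmpi flags; the wrappers add the Cray MPICH headers and libraries automatically.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The cc wrapper supplies the MPI include path and libraries */
    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

This can be built and run with, for example, cc hello_mpi.c followed by srun -n 2 ./a.out.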

Read the manual

Please read the intro_mpi man page; it goes well beyond "intro"-level content!

CUDA-Aware MPI

HPE Cray MPI is a CUDA-aware MPI implementation: the programmer can use pointers to GPU device memory in MPI buffers, and the MPI implementation will correctly move the data between GPU device memory and the network interface card (NIC), either by implicitly staging it through host memory, or, on hardware that supports GPUDirect RDMA, by copying it directly from the GPU to the NIC, bypassing host memory altogether.

At compile time, the HPE GPU Transport Layer (GTL) libraries must be linked. In order to enable this, take at least one of the following actions:

  1. Keep the default gpu module loaded
  2. Load the cudatoolkit and craype-accel-nvidia80 modules
  3. Set the environment variable CRAY_ACCEL_TARGET=nvidia80
  4. Pass the compiler flag -target-accel=nvidia80

At run time, MPICH_GPU_SUPPORT_ENABLED=1 must also be set. If GPU support is requested this way but the GTL library was not linked at compile time, errors similar to the following will occur:

MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
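
If you want the application itself to fail fast when the variable is missing, a small guard along the lines of the sketch below (a hypothetical helper, not part of Cray MPICH) can check it right after MPI_Init, before any device pointers reach MPI:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: call on every rank right after MPI_Init. */
static void require_gpu_support(void) {
    const char *v = getenv("MPICH_GPU_SUPPORT_ENABLED");
    if (v == NULL || strcmp(v, "1") != 0) {
        fprintf(stderr,
                "Set MPICH_GPU_SUPPORT_ENABLED=1 before using GPU buffers in MPI calls\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}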

An example program is shown below:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
    int myrank;
    float *val_device, *val_host;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* One float on the host and one in GPU device memory */
    val_host = (float*)malloc(sizeof(float));
    cudaMalloc((void **)&val_device, sizeof(float));

    *val_host = -1.0;
    if (myrank != 0) {
      printf("%s %d %s %f\n", "I am rank", myrank, "and my initial value is:", *val_host);
    }

    if (myrank == 0) {
        *val_host = 42.0;
        cudaMemcpy(val_device, val_host, sizeof(float), cudaMemcpyHostToDevice);
        printf("%s %d %s %f\n", "I am rank", myrank, "and will broadcast value:", *val_host);
    }

    /* The GPU device pointer is passed directly to MPI */
    MPI_Bcast(val_device, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (myrank != 0) {
      cudaMemcpy(val_host, val_device, sizeof(float), cudaMemcpyDeviceToHost);
      printf("%s %d %s %f\n", "I am rank", myrank, "and received broadcasted value:", *val_host);
    }

    cudaFree(val_device);
    free(val_host);

    MPI_Finalize();
    return 0;
}

The above C code can be compiled with HPE Cray MPI using:

module load cudatoolkit
export CRAY_ACCEL_TARGET=nvidia80
cc -o cuda-aware-bcast.ex cuda-aware-bcast.c

Load the cudatoolkit module

The cudatoolkit module needs to be loaded. Otherwise, the executable may hang.

Set the accelerator target to GPUs for CUDA-aware MPI

The GTL (GPU Transport Layer) library needs to be linked for MPI communication involving GPU buffers. To ensure the library is found and linked, set the accelerator target environment variable CRAY_ACCEL_TARGET to nvidia80 or use the compiler flag -target-accel=nvidia80. Otherwise, you may get the following runtime error:

MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked

For more info, see the section on setting the accelerator target.

$ module load cudatoolkit
$ export MPICH_GPU_SUPPORT_ENABLED=1
$ srun -C gpu -n 4 -G 4 --exclusive ./cuda-aware-bcast.ex
I am rank 1 and my initial value is: -1.000000
I am rank 1 and received broadcasted value: 42.000000
I am rank 0 and will broadcast value: 42.000000
I am rank 2 and my initial value is: -1.000000
I am rank 2 and received broadcasted value: 42.000000
I am rank 3 and my initial value is: -1.000000
I am rank 3 and received broadcasted value: 42.000000

The MPICH_GPU_SUPPORT_ENABLED environment variable must be set to 1 at run time whenever the application uses CUDA-aware MPI.

Thread Safety MPI_THREAD_MULTIPLE

Cray MPICH supports thread safety levels up to MPI_THREAD_MULTIPLE.

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
   int provided;

   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

   printf("Supports level %d of %d %d %d %d\n", provided,
          MPI_THREAD_SINGLE,
          MPI_THREAD_FUNNELED,
          MPI_THREAD_SERIALIZED,
          MPI_THREAD_MULTIPLE);

   MPI_Finalize();
   return 0;
}

$ cc thread_levels.c
$ srun -n 1 ./a.out
Supports level 3 of 0 1 2 3

Note

Unlike in some previous releases, no additional flags or environment variables are required to access an optimized version of the highest thread safety level (MPI_THREAD_MULTIPLE).
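
As an illustration of what MPI_THREAD_MULTIPLE permits, the sketch below (a hypothetical example, assuming every rank runs the same number of OpenMP threads) has each thread issue its own MPI_Sendrecv concurrently, using the thread id as the message tag so that corresponding threads on neighboring ranks match up.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank, size;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every thread calls MPI concurrently; the thread id is used as the
       tag so each thread's message matches its counterpart on the
       neighboring rank. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int right = (rank + 1) % size;
        int left = (rank - 1 + size) % size;
        int sendval = rank * 100 + tid;
        int recvval = -1;

        MPI_Sendrecv(&sendval, 1, MPI_INT, right, tid,
                     &recvval, 1, MPI_INT, left, tid,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d thread %d received %d\n", rank, tid, recvval);
    }

    MPI_Finalize();
    return 0;
}

Compile with the wrappers plus the OpenMP flag of the underlying compiler (for example cc -fopenmp threaded_sendrecv.c) and run with a few ranks and OMP_NUM_THREADS greater than one.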

Resources