

The default and preferred MPI implementation on Cray systems is Cray MPICH. It is provided through the Cray compiler wrappers cc, CC and ftn, which are used instead of the common mpicc, mpicxx and mpif90 wrappers.

Read the manual

Please read the intro_mpi man pages; they go well beyond "intro" level content!

!!! note Cray MPICH currently works with CUDA up to version 11.x. Cray MPICH does not support CUDA 12 or above at this time. This support will be added in a future release of Cray MPI.


HPE Cray MPI is a CUDA-aware MPI implementation. This means that the programmer can pass pointers to GPU device memory as MPI buffers, and the MPI implementation will correctly move the data between GPU device memory and the network interface card (NIC). It does so either by implicitly staging the data through host memory or, on hardware that supports GPUDirect RDMA, by copying the data directly between the GPU and the NIC, bypassing host memory altogether.

An example of this is shown below:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
    int myrank;
    float *val_device, *val_host;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    val_host = (float*)malloc(sizeof(float));
    cudaMalloc((void **)&val_device, sizeof(float));

    *val_host = -1.0;
    if (myrank != 0) {
      printf("%s %d %s %f\n", "I am rank", myrank, "and my initial value is:", *val_host);
    }

    /* Rank 0 places the value to broadcast in GPU device memory. */
    if (myrank == 0) {
        *val_host = 42.0;
        cudaMemcpy(val_device, val_host, sizeof(float), cudaMemcpyHostToDevice);
        printf("%s %d %s %f\n", "I am rank", myrank, "and will broadcast value:", *val_host);
    }

    /* The device pointer is passed directly to MPI. */
    MPI_Bcast(val_device, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* The other ranks copy the received value back to the host to print it. */
    if (myrank != 0) {
      cudaMemcpy(val_host, val_device, sizeof(float), cudaMemcpyDeviceToHost);
      printf("%s %d %s %f\n", "I am rank", myrank, "and received broadcasted value:", *val_host);
    }

    cudaFree(val_device);
    free(val_host);

    MPI_Finalize();
    return 0;
}

The above C code can be compiled with HPE Cray MPI using:

module load cudatoolkit
export CRAY_ACCEL_TARGET=nvidia80
cc -o cuda-aware-bcast.ex cuda-aware-bcast.c

Load the cudatoolkit module

The cudatoolkit module needs to be loaded. Otherwise, the executable may hang.

Set the accelerator target to GPUs for CUDA-aware MPI

The GTL (GPU Transport Layer) library needs to be linked for MPI communication involving GPUs. To make the library detectable and linked, you need to set the accelerator target environment variable CRAY_ACCEL_TARGET to nvidia80 or use the compile flag -target-accel=nvidia80. Otherwise, you may get the following runtime error:

MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked

For more info, see the section on setting the accelerator target.

$ module load cudatoolkit
$ srun -C gpu -n 4 -G 4 --exclusive ./cuda-aware-bcast.ex
I am rank 1 and my initial value is: -1.000000
I am rank 1 and received broadcasted value: 42.000000
I am rank 0 and will broadcast value: 42.000000
I am rank 2 and my initial value is: -1.000000
I am rank 2 and received broadcasted value: 42.000000
I am rank 3 and my initial value is: -1.000000
I am rank 3 and received broadcasted value: 42.000000

The MPICH_GPU_SUPPORT_ENABLED environment variable must be set to 1 (for example, export MPICH_GPU_SUPPORT_ENABLED=1) if the application is CUDA-aware.
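A CUDA-aware application can check this setting itself before calling MPI_Init, so that a missing variable produces a clear message rather than a harder-to-diagnose failure later. The following is a minimal sketch, not part of Cray MPICH: the helper names gpu_flag_value_enabled and gpu_support_requested are hypothetical, and the sketch assumes that the value 1 is what enables GPU support.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a Cray MPICH API): returns 1 when the given
 * string is the value assumed to enable GPU support ("1"), else 0.
 * A NULL pointer means the variable is unset and counts as disabled. */
int gpu_flag_value_enabled(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Hypothetical helper: reads MPICH_GPU_SUPPORT_ENABLED from the
 * environment of the current process and interprets it as above. */
int gpu_support_requested(void)
{
    return gpu_flag_value_enabled(getenv("MPICH_GPU_SUPPORT_ENABLED"));
}
```

A program could call gpu_support_requested() at the top of main and print a warning, or exit, when it returns 0 and device pointers are about to be passed to MPI.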


Cray MPICH supports thread safety levels up to and including MPI_THREAD_MULTIPLE. The following program requests that level and prints the level actually provided:

#include "mpi.h"
#include <stdio.h>

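int main( int argc, char *argv[] )
{
   int provided;

   /* Request the highest level; the granted level is returned in provided. */
   MPI_Init_thread(&argc,&argv, MPI_THREAD_MULTIPLE,&provided);

   printf("Supports level %d of %d %d %d %d\n", provided,
      MPI_THREAD_SINGLE,
      MPI_THREAD_FUNNELED,
      MPI_THREAD_SERIALIZED,
      MPI_THREAD_MULTIPLE);

   MPI_Finalize();
   return 0;
}
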
$ cc thread_levels.c
$ srun -n 1 ./a.out
Supports level 3 of 0 1 2 3


Unlike in some previous releases, no additional flags or environment variables are required to access an optimized version of the highest level (MPI_THREAD_MULTIPLE).
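An application that depends on a particular thread level may want to verify the value returned in provided rather than just print it. The sketch below shows one way to do that. The helper names thread_level_ok and check_thread_level are hypothetical, and the helpers take plain integers so that they do not depend on mpi.h. The comparison relies on the MPI standard's guarantee that the thread level constants are ordered MPI_THREAD_SINGLE < MPI_THREAD_FUNNELED < MPI_THREAD_SERIALIZED < MPI_THREAD_MULTIPLE.

```c
#include <stdio.h>

/* Hypothetical helper (not part of MPI): returns 1 when the granted
 * thread level satisfies the requested one. Valid because the MPI
 * standard orders the MPI_THREAD_* constants monotonically. */
int thread_level_ok(int provided, int required)
{
    return provided >= required;
}

/* Hypothetical helper: reports an insufficient thread level on stderr.
 * Returns 0 when the level is sufficient and -1 otherwise. */
int check_thread_level(int provided, int required)
{
    if (!thread_level_ok(provided, required)) {
        fprintf(stderr,
                "MPI granted thread level %d, but level %d is required\n",
                provided, required);
        return -1;
    }
    return 0;
}
```

In an MPI program, the call would follow MPI_Init_thread, for example check_thread_level(provided, MPI_THREAD_MULTIPLE), with the caller deciding whether to call MPI_Abort on a nonzero result.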