
Interfacing C++ with Python/ML for HPC applications

Scientific computing has seen tremendous growth in tandem with the increase in available computing power. As is the nature of cutting-edge research, many scientific computing codes are developed or modified with the best computing architecture available at the time in mind. This is feasible for smaller projects, but it quickly becomes intractable for large code bases such as LAMMPS, a molecular dynamics code developed by a multi-disciplinary team of scientists from multiple national labs. With the advent of GPU-based computing, many codes have already undergone refactoring to make them suitable for parallel computing. However, newer computational paradigms such as machine learning (ML) require considerably more effort to integrate with the pre-existing source codes, which are primarily written in C/C++.

Developing ML models and distributed data-parallel training is easier in Python than in C++, and developing ML workflows in Python reduces programming turnaround time. However, the high-performance domain-specific codes, often written in C++, have decades of research and development behind them. Our goal is to facilitate the integration of ML packages such as PyTorch and TensorFlow with legacy codes written in C++, enabling hybrid C++-Python applications to run on GPUs with the least amount of code modification.

Methodology

We present two potential approaches that can be used to interface and run hybrid C++-Python/ML simulations. In our first approach, we demonstrate the conversion of a PyTorch model from Python to C++ using Torch Script. Depending on the application, it may be necessary to call the ML model from a complex C++ application; for such a case, we propose that users employ Torch Script to convert a Python ML model into a C++-readable form. For our second approach, we present pybind11-based interfacing. Pybind11 is a header-only library that enables calling Python functions from C++ and vice versa; as an added advantage, it also enables the transfer of data between C++ and Python. Pybind11 is included with LibTorch, the C++ distribution of PyTorch, making it easy to use PyTorch with legacy C++ codes.

Torch Script: Tracing a PyTorch model to C++

Torch Script creates a representation of a PyTorch model that can be understood, compiled, and serialized by the Torch Script compiler. We use a technique known as tracing, implemented within PyTorch and LibTorch. Tracing captures the structure of the model by evaluating it once using example inputs and recording the flow of those inputs through the model. The following code snippet illustrates how we trace a trained PyTorch checkpoint.

import os

import torch
from networks import UNet
from utils.YParams import YParams

config_name = 'default_sr_large'

checkpoint_number = '000'

params = YParams('./config/UNet.yaml', config_name)

checkpoint_dir = './expts'
checkpoint_file = os.path.join(checkpoint_dir,config_name,checkpoint_number,'training_checkpoints','best_ckpt.tar')
model = UNet.UNet(params).cpu()
checkpoint = torch.load(checkpoint_file, map_location=torch.device('cpu'))
model.load_state_dict(checkpoint)
# switch the model to inference mode before tracing
model.eval()

example = torch.rand(1, 2, 256, 256)

traced_script_module = torch.jit.trace(model, example)
traced_pt_name = config_name + checkpoint_number + ".pt"
traced_script_module.save(traced_pt_name)
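
On the C++ side, the serialized model can then be deserialized and evaluated with LibTorch. The following is a minimal sketch of that step (the AMReX section below shows the same mechanism inside a full application); the file name matches the one produced by the script above:

    #include <torch/script.h>
    #include <iostream>
    #include <vector>

    int main() {
      // Deserialize the traced module saved by the Python script above
      torch::jit::script::Module module =
          torch::jit::load("default_sr_large000.pt");

      // Evaluate the module on an input of the same shape used for tracing
      std::vector<torch::jit::IValue> inputs{torch::rand({1, 2, 256, 256})};
      at::Tensor output = module.forward(inputs).toTensor();
      std::cout << output.sizes() << std::endl;
      return 0;
    }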

Pybind11: Transferring data from C++ to Python

While it is indeed possible to run hybrid C++ and Python codes, the critical bottleneck in such an approach is the transfer of data between the two languages. When running a hybrid application on GPUs, a naive transfer from C++ to Python would first copy the array data residing in device memory to host memory, hand it to Python, and then send it from the host back to GPU memory before GPU-enabled Python code could use it. This movement of array data increases both the runtime and the memory footprint of the code. We have used pybind11 to enable the transfer of arrays without these copy operations, directly handing array ownership from C++ to Python and vice versa.
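
The mechanism that makes this possible is a detail of the py::array_t constructor: when a base handle is supplied, pybind11 wraps the existing buffer instead of allocating and copying. The following standalone sketch (our illustration, not part of the application code) shows the difference:

    #include <pybind11/embed.h>
    #include <pybind11/numpy.h>
    #include <iostream>

    namespace py = pybind11;

    int main() {
      py::scoped_interpreter guard{};
      double data[4] = {0.0, 1.0, 2.0, 3.0};

      // Without a base handle, pybind11 allocates new memory and copies the buffer
      py::array_t<double> copied(4, data);

      // With any base handle (here a dummy py::str), pybind11 wraps the
      // existing C++ buffer without copying it
      py::array_t<double> wrapped(4, data, py::str{});

      // Writes through the wrapped array are visible in the original buffer
      wrapped.mutable_at(0) = 42.0;
      std::cout << data[0] << std::endl;  // prints 42
      return 0;
    }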

An additional advantage of this method is that it can be used on GPUs from multiple vendors. Performance portability has become an increasingly necessary prerequisite for codes running on modern HPC systems: the new class of near-exascale and exascale machines is outfitted with host and device combinations from multiple vendors. Code users and developers are therefore encouraged to use vendor-neutral programming models such as OpenMP target offload and Kokkos so that their codes can run on any host-device combination. We have demonstrated that our approach can be used with Kokkos, a C++ performance-portability programming model.

To better understand our approach, we use a simple example to demonstrate data transfer between C++ and Python. In this example, we define a Kokkos view, which is analogous to an array: a Kokkos view is a C++ class, templated on the data type, that stores multi-dimensional data in a single contiguous allocation. In our simple example, we send a Kokkos view from C++ to Python. We need the following modules loaded in order to run the example code:

  • PrgEnv-llvm/12.0.0-git-20210117

  • pytorch/1.7.1-gpu

  • cuda/11.0.3

  • cudnn/8.0.5

  • cmake/3.14.4

Our code is as follows:

#include<torch/torch.h>
#include<pybind11/pybind11.h>
#include<pybind11/numpy.h>
#include<pybind11/embed.h>
#include<Kokkos_Core.hpp>
#include<Kokkos_Random.hpp>
#include<iostream>
#include<random>
#include<cmath>
#include<cstdio>

//Inputs for view sizes
const int64_t N = 64; 
const int64_t D_in = 1000;

using namespace torch;
using namespace std;
namespace py = pybind11;

//Define the Kokkos view
typedef Kokkos::View<double **, Kokkos::LayoutRight> View2D;

int main(int argc, char* argv[])
{
  Kokkos::initialize(argc, argv);
  {
    View2D foo1("Foo1", N,D_in);

    Kokkos::parallel_for(N*D_in, KOKKOS_LAMBDA(const int iter)
    {
      int i = iter / D_in;
      int j = iter % D_in;
      foo1(i,j) = i*j;
    });
    Kokkos::fence(); // ensure the device fill has completed before handing data to Python

// Calling simple NN implemented in Python
    pybind11::scoped_interpreter guard{};
    pybind11::module sys = pybind11::module::import("sys");
    sys.attr("path").attr("insert")(1, CUSTOM_SYS_PATH);
    pybind11::module py_simplenn = pybind11::module::import("py_simple");
    py::object ob1 = py_simplenn.attr("add_NN")\
                    (py::array_t<double, py::array::c_style |\
                     py::array::forcecast>(N*D_in,foo1.data(),\
                     py::str{}),N,D_in);
    py::gil_scoped_release no_gil;
  }
  Kokkos::finalize();
  return 0;
}

In the above listing, lines 21 and 27 through 34 relate to defining, declaring, and populating the Kokkos view 'foo1', which we will send from C++ to Python (line 35 fences to ensure the fill has completed). The main part of the transfer between C++ and Python is handled between lines 38 and 45. Lines 38 to 40 are boilerplate that creates the scope of the embedded Python interpreter and makes our source directory visible to it. On line 41, we provide the name of the Python file that contains the function we intend to call from C++, in this case "py_simple". On lines 42 through 45, using the pybind11 force-cast syntax, we transfer the Kokkos view 'foo1' to the function "add_NN", which is accessed as an attribute of the pybind11 module "py_simple". The syntax dictating the transfer of the Kokkos view is as follows:

    py::array_t<double, py::array::c_style |
    py::array::forcecast>(N*D_in, foo1.data(), py::str{})

Before transferring the view data over to Python, we specify that the data will be received by the Python function as a numpy array by constructing a py::array_t<"type & layout flags">("size", "data pointer", "base handle"). The data type of this array and its layout in memory are declared by the template arguments (double, py::array::c_style), and py::array::forcecast forces conversion of any non-conforming input. In this case, we wish to transfer the data of the Kokkos view 'foo1' of size N*D_in, which is done by passing N*D_in, foo1.data(), py::str{}. Note that without py::str{}, the above operation copies the data being transferred. We use a placeholder string as the base handle to make this a non-copy operation that only transfers data ownership from C++ to Python, making it cheaper than legacy copy-based methods. The Python function "add_NN" defined in "py_simple.py" is as follows:

import numpy as np
import torch
import sys

def add_NN(foo1, N, D_in):
    print("*********************************************")
    print("Start python")
    dtype = torch.float
    if torch.cuda.is_available():
        device = torch.device("cuda:0")
        print("CUDA is available! Training on GPU")
    else:
        device = torch.device("cpu")
        print("Training on CPU")

    foo1 = np.reshape(foo1, (-1,D_in))

    x = torch.from_numpy(foo1).to(device).float()

    print("End python")
    print("*********************************************")
    return foo1

In the code above, once we receive the data, we recast it into a PyTorch tensor. Lines 9 through 14 detect whether the Python code is running on the host or the device execution space. On line 16, the data received from C++ as a one-dimensional array is reshaped into a multi-dimensional numpy array with a shape matching that of the Kokkos view. On line 18, this numpy array is converted to a torch tensor residing in the chosen execution space's memory.
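
Back on the C++ side, the py::object returned by the call can be cast to a numpy array type and read directly. A minimal sketch, assuming ob1 is the object returned by add_NN in the earlier listing:

    // Interpret the object returned by add_NN as a 2-D numpy array
    py::array_t<double> result = ob1.cast<py::array_t<double>>();
    auto buf = result.unchecked<2>();  // read-only accessor into the 2-D data
    std::cout << "result(0, 1) = " << buf(0, 1) << std::endl;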

Please be aware that integrating the Kokkos and pybind11 library files requires the use of CMake to build the application. The locations of both the Kokkos and LibTorch (which contains pybind11) directories have to be provided to CMake. Prior to building this application, it is necessary to build the Kokkos nvcc wrapper. Due to version conflicts, the Kokkos CUDA nvcc wrapper has to be built with CUDA 11.0.3, the C++14 standard, and GCC 7.3. The CMakeLists.txt used for the above example is as follows:

    cmake_minimum_required(VERSION 3.12 FATAL_ERROR)
    project(KokkosToPytorch CXX)

    # find dependencies
    find_package(Kokkos REQUIRED)
    find_package(Torch REQUIRED)
    find_package(Python REQUIRED COMPONENTS Interpreter Development)

    # targets
    file(GLOB SOURCES "src/*.cpp")
    add_executable(FirstNN ${SOURCES})

    # properties
    # C++ properties: at least a C++14 capable compiler is needed
    target_compile_features(FirstNN PUBLIC cxx_std_14)
    set_target_properties(FirstNN PROPERTIES
        CXX_EXTENSIONS OFF 
        CXX_STANDARD_REQUIRED ON
    )

    # link/include
    target_compile_definitions(FirstNN PRIVATE -DCUSTOM_SYS_PATH="${KokkosToPytorch_SOURCE_DIR}/include")
    target_include_directories(FirstNN SYSTEM PRIVATE ${PYTHON_INCLUDE_DIRS})
    target_include_directories(FirstNN PRIVATE $<BUILD_INTERFACE:${KokkosToPytorch_SOURCE_DIR}/include>)
    target_link_libraries(FirstNN PRIVATE Kokkos::kokkos)
    target_link_libraries(FirstNN PRIVATE "${TORCH_LIBRARIES}")
    target_link_libraries(FirstNN PRIVATE Python::Python)

The command to build the application is as follows:

    CXX=/global/homes/n/namehta4/kokkos/install_cuda/bin/nvcc_wrapper cmake \
    -DCMAKE_PREFIX_PATH="/global/homes/n/namehta4/kokkos/install_cuda;/global/homes/n/namehta4/libtorch" ../

Real-world applications

Deploying the traced model within the AMReX code using Torch Script

Our goal is to use an ML model to augment fluid simulations in AMReX (https://amrex-codes.github.io), a high-performance C++ code. AMReX is a software framework for massively parallel, block-structured adaptive mesh refinement (AMR) applications developed at LBNL, NREL, and ANL as part of the Block-Structured AMR Co-Design Center in DOE's Exascale Computing Project. We focus on IAMR (https://amrex-fluids.github.io/IAMR/), an adaptive-mesh, variable-density incompressible Navier-Stokes solver built on AMReX. Our workflow requires on-the-fly interaction between an IAMR simulation and an ML model written in PyTorch.

The ML model and training workflow are written in PyTorch. Given the need for rapid development of state-of-the-art ML algorithms and for data and model parallelism when training the ML models, it is infeasible to host the ML training workflow in C++. A Python-centric workflow for training the ML model ensures a faster turnaround for research and development. However, the trained ML model then needs to be deployed in C++.

AMReX and LibTorch can be compiled in the same environment using a GCC compiler. Data is represented in AMReX using a structure called a MultiFab. In order to evaluate an ML model on AMReX data, the data needs to be converted from a MultiFab to a Torch tensor. The following snippet illustrates the conversion of a single-level MultiFab to a tensor.

    #include <torch/torch.h>
    #include <torch/script.h>

    torch::Tensor t1 = torch::zeros({1,ncomp,bx_onegrid_bounds.x+1,bx_onegrid_bounds.y+1});

    #ifdef _OPENMP
    #pragma omp parallel if (Gpu::notInLaunchRegion())
    #endif
    for (MFIter mfi(SnewRefined,TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
      const Box& bx = mfi.tilebox();
      auto const& SnewRefined_fab = SnewRefined.array(mfi);
      FArrayBox fab_tmp(bx,ncomp);
      auto fab = fab_tmp.array(0);

      // Here we retrieve data from device to a fab array on host    
      amrex::ParallelFor(bx, ncomp, [fab,SnewRefined_fab]
      AMREX_GPU_DEVICE (int i, int j, int k, int n) noexcept
      {
       fab(i,j,k,n) = SnewRefined_fab(i,j,k,n) ;
      });

      // Now we copy the fab array into the Torch Tensor
      {
      for (int n = 0; n < ncomp; n++ )
        for (int jj = 0; jj < bx_onegrid_bounds.y+1; jj++ )
          for (int ii = 0; ii < bx_onegrid_bounds.x+1; ii++ )
          {
            t1[0][n][ii][jj] = fab(ii,jj,0,n);
          }
      }
    }

Now the traced model, similar to the one shown previously, is loaded into the C++ code as follows:

    torch::jit::script::Module module;
    try {
      // Deserialize the ScriptModule from a file using torch::jit::load().
      module = torch::jit::load("kolmogorov256default4.pt");
    }
    catch (const c10::Error& e) {
      amrex::Abort("error loading the model\n");
    }
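
If the input tensors reside in GPU memory, the deserialized module must live on the same device. A small hedged addition (not part of the original snippet) handles this:

    // Move the loaded module to the GPU when the input tensors reside there
    if (torch::cuda::is_available()) {
        module.to(torch::kCUDA);
    }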

The loaded model is then evaluated on the tensor t1 as follows:

    std::vector<torch::jit::IValue> inputs{t1};
    at::Tensor output = module.forward(inputs).toTensor();
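
Since the model is only evaluated here and never trained, the forward pass can also be wrapped in a torch::NoGradGuard so that no autograd graph is recorded; a minimal sketch of the same two lines under that guard:

    {
      torch::NoGradGuard no_grad;  // disable gradient tracking during inference
      std::vector<torch::jit::IValue> inputs{t1};
      at::Tensor output = module.forward(inputs).toTensor();
    }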

Finally, the model output needs to be written back to a data structure that AMReX can use, namely a MultiFab. The process of copying a tensor to a MultiFab mirrors the MultiFab-to-tensor copy described above. The following code snippet shows how to copy a tensor named 'output' (obtained by evaluating the ML model as described above) back to a MultiFab.

    #ifdef _OPENMP
    #pragma omp parallel if (Gpu::notInLaunchRegion())
    #endif
    for (MFIter mfi(CorrectedState,TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
      const Box& bx = mfi.tilebox();
      auto const& CorrectedState_fab = CorrectedState.array(mfi);
      FArrayBox fab_tmp(bx,ncomp);
      auto fab = fab_tmp.array(0);

      // Now we copy the Torch Tensor into the temporary fab array
      {
        for (int n = 0; n < ncomp; n++ )
          for (int jj = 0; jj < bx_onegrid_bounds.y+1; jj++ )
            for (int ii = 0; ii < bx_onegrid_bounds.x+1; ii++ )
            {
              fab(ii,jj,0,n) = output[0][n][ii][jj].item<double>();
            }
      }

      // Finally, copy the temporary fab array back into the MultiFab on the
      // device, mirroring the device-to-host copy in the first snippet
      amrex::ParallelFor(bx, ncomp, [fab,CorrectedState_fab]
      AMREX_GPU_DEVICE (int i, int j, int k, int n) noexcept
      {
        CorrectedState_fab(i,j,k,n) = fab(i,j,k,n);
      });
    }

Neural Net assisted TestSNAP using Pybind11

The SNAP potential is part of the LAMMPS molecular dynamics (MD) toolkit. The SNAP method specifies the potential energy of each atom as a curve fit over weighted bispectrum components. The bispectrum, also known as the descriptor, encodes the positions of the neighboring atoms and hence the local energy of each atom for a given structural configuration; it is represented by its components, which are used to reproduce the local energy. To aid in optimizing the SNAP algorithm, Dr. Aidan Thompson has written a proxy application called TestSNAP, which has a computational load similar to the SNAP algorithm but reduces complexity by removing the conventional MD functionality of atom movement. TestSNAP provides a good testbed for implementing parallelization schemes and algorithmic improvements. Optimizing TestSNAP has been part of the larger Exascale Computing Project (ECP).

As is evident from the description, the accuracy of a TestSNAP simulation depends strongly on the number of bispectrum components, because the energy of the system is computed directly from the curve fit over those components, as mentioned previously. However, the computational cost of this method becomes intractable as more bispectrum components are used in the simulation. Instead, we can use a machine-learning-based neural network (NN) to generate a more accurate fit from fewer bispectrum components. In order to use such an approach, we need to generate the bispectrum from the atom configurations in C++ and transfer that information to Python, where we can run the ML model. A brief snippet of this code is shared next.

    pybind11::module sys = pybind11::module::import("sys");
    sys.attr("path").attr("insert")(1, CUSTOM_SYS_PATH);
    pybind11::module py_simplenn = pybind11::module::import("MLIAP");
    py::object ob1 = py_simplenn.attr("run_NN")(py::array_t<double,\
                     py::array::c_style | py::array::forcecast>\
                     (num_atoms*idxb_max,blist.data(),py::str{}),\
                     py::array_t<double,py::array::c_style | py::array::forcecast>\
                     (ncoeff,beta.data(),py::str{}),\
                     py::array_t<double,py::array::c_style | py::array::forcecast>\
                     (num_atoms,ene_nn.data(),py::str{}),\
                     num_atoms,idxb_max,ncoeff);

In this code snippet, we are transferring the Kokkos views blist, beta, and ene_nn, which hold the bispectrum components, the beta coefficients, and the energy data, respectively. In our approach, beta and ene_nn are empty Kokkos views that we send to Python, where they are populated with the results of the ML model run by the function "run_NN" in the Python file "MLIAP.py". Note that currently we are able to send the Kokkos views, but we are still testing the transfer of data back from Python to C++.
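
One route back that the zero-copy transfer makes possible (a sketch of the idea, not the tested path) is for Python to fill beta and ene_nn in place, since they wrap the original C++ buffers; the results would then be readable from the same views on the C++ side:

    // Because ene_nn was handed to Python without a copy, an in-place fill
    // such as ene_nn[:] = ... inside run_NN would be visible here through
    // the original view (assuming the view data is host accessible).
    for (int i = 0; i < num_atoms; ++i)
      std::cout << "ene_nn(" << i << ") = " << ene_nn(i) << "\n";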

The corresponding Python part of the code is as follows:

import numpy as np
import torch

class Torchwrapper(torch.nn.Module):
    def __init__(self, model, dtype=torch.float64):
        super().__init__()
        self.dtype = dtype
        self.model = model
        self.model.to(self.dtype)

    def __call__(self, bispectrum, beta, energy):
        bispectrum = torch.from_numpy(bispectrum).to(self.dtype).requires_grad_(True)
        with torch.autograd.enable_grad():
            energy_nn = self.model(bispectrum)
            if energy_nn.ndim > 1:
                energy_nn = energy_nn.flatten()
            beta_nn = torch.autograd.grad(energy_nn.sum(), bispectrum)[0]
        return energy_nn, beta_nn

def run_NN(blist,beta,ene_nn,num_atoms,idxb_max,ncoeff):
    print("*********************************************")
    print("Start python")
    dtype = torch.float
    if torch.cuda.is_available():
        device = torch.device("cuda:0")
        print("CUDA is available! Training on GPU")
    else:
        device = torch.device("cpu")
        print("Training on CPU")

    blist = np.reshape(blist, (-1,idxb_max))
    x = torch.from_numpy(blist).to(device)
    x = x.float()
    #Code is not complete yet
    #Torchwrapper.__call__(...)

    print("End python")
    print("*********************************************")

Please note that this code is a work in progress and not a working example.

Acknowledgement

Neil Mehta would like to thank Dr. Axel Huebl (ATAP, LBL) and Dr. Jonathan Madsen (NERSC, LBL) for their immense help with this project. Jaideep Pathak would like to acknowledge Dr. Emmanuel Motheau (CRD, LBL), Dr. Mustafa Mustafa (NERSC, LBL) and Dr. Marcus Day (NREL) for their immeasurable help with this project.