Skip to content

Kokkos Fortran Interoperability

Galen Shipman edited this page Sep 26, 2019 · 7 revisions

Fortran Interop Use Case

Operations on multi-dimensional fortran allocated arrays using Kokkos

This example demonstrates usage of the teh Fortran Language Compatibility Layer (FLCL) in the context of performing a simple DAXPY (double precision A * X + Y) using Kokkos from a simple fortran program. Such a use case occurs when using Kokkos for for performance portability within a Fortran application.

Program structure

This example uses the Kokkos fortran interop utilities https://github.com/kokkos/kokkos-fortran-interop. This includes a set of fortran routines for converting fortran allocated arrays into an ndarray and a set of C++ functions for converting an ndarray into a Kokkos unmanaged view.

The ndarray type (flcl_ndarray_t) is a simple struct that captures the rank, dimensions, strides (equivalent to a dope vector) along with the flattened data.

typedef struct _flcl_nd_array_t {
    size_t rank;
    size_t dims[FLCL_NDARRAY_MAX_RANK];
    size_t strides[FLCL_NDARRAY_MAX_RANK];
    void *data;
} flcl_ndarray_t;

The

Input: inputData(C,P,D,D) - a rank 4 View inputField(C,F,P,D) - a rank 4 View

Return: outputField(C,F,P,D) - a rank 4 View

Computation: For each triple in C,F,P compute an output field from the two input views:

  for each (c,f,p) in (C,F,P)
    compute the product inputData(c,p,:,:) * inputField(c,f,p,:)
    store result in outputField(c,f,p,:)

Serial implementation

for (int c = 0; c < C; ++c)
for (int f = 0; f < F; ++f)
for (int p = 0; p < P; ++p)
{

  auto result = Kokkos::subview(outputField, c, f, p, Kokkos::ALL);
  auto left   = Kokkos::subview(inputData, c, p, Kokkos::ALL, Kokkos::ALL);
  auto right  = Kokkos::subview(inputField, c, f, p, Kokkos::ALL);
  
  for (int i=0;i<D;++i) {
  
    double tmp(0);
    
    for (int j=0;j<D;++j)
      tmp += left(i, j)*right(j);
    
    result(i) = tmp;
  }
}

Parallelization with Kokkos

Initial implementation - RangePolicy

The most straightforward way to parallize the serial code above is to convert the outer for loop over cells with the sequential iteration pattern into a parallel for loop using a RangePolicy

Kokkos::parallel_for("for_all_cells", 
  Kokkos::RangePolicy<>(0,C),
   KOKKOS_LAMBDA (const int c) {
     for (int f = 0; f < F; ++f)
     for (int p = 0; p < P; ++p)
     {

      auto result = Kokkos::subview(outputField, c, f, p, Kokkos::ALL);
      auto left   = Kokkos::subview(inputData, c, p, Kokkos::ALL, Kokkos::ALL);
      auto right  = Kokkos::subview(inputField, c, f, p, Kokkos::ALL);
  
      for (int i=0;i<D;++i) {
  
        double tmp(0);
    
        for (int j=0;j<D;++j)
          tmp += left(i, j)*right(j);
    
        result(i) = tmp;
      }
     }
  });

If the number of cells is large enough to merit parallelization, that is the overhead for parallel dispatch plus computation time is less than total serial execution time, then the simple implementation above will result in improved performance.

There is more parallelism to exploit, particularly within the for loops over fields F and points P. One way to accomplish this would involve taking the product of the three iteration ranges, C*F*P, and performing a parallel_for over that product. However, this would require extraction routines to map between indices from the flattened iteration range, C*F*P, and the multidimensional indices required by data structures in this example. In addition, to achieve performance portability the mapping between the 1-D product iteration range and multi-dimensional 3-D indices would require architecture-awareness, akin to the notion of LayoutLeft and LayoutRight used in Kokkos to establish data access patterns.

The MDRangePolicy provides a natural way to accomplish the goal of parallelizing over all three iteration ranges without requiring manually computing the product of the iteration ranges and mapping between 1-D and 3-D multidimensional indices. The MDRangePolicy is suitable for use with tightly-nested for loops and provides a method to expose additional parallelism in computations beyond simply parallelizing in a single dimension, as was shown in the first implementation using the RangePolicy.

Implementation - MDRangePolicy

Kokkos::parallel_for("mdr_for_all_cells", 
  Kokkos::MDRangePolicy< Kokkos::Rank<3> > ({0,0,0}, {C,F,P}),
   KOKKOS_LAMBDA (const int c, const int f, const int p) {
    auto result = Kokkos::subview(outputField, c, f, p, Kokkos::ALL);
    auto left   = Kokkos::subview(inputData, c, p, Kokkos::ALL, Kokkos::ALL);
    auto right  = Kokkos::subview(inputField, c, f, p, Kokkos::ALL);
  
    for (int i=0;i<D;++i) {
  
      double tmp(0);
    
      for (int j=0;j<D;++j)
        tmp += left(i, j)*right(j);
    
      result(i) = tmp;
    }
  });

MDRangePolicy usage

The MDRangePolicy accepts the same template parameters as the RangePolicy, but also requires an additional type - the Kokkos::Rank<R> parameter, where R is the rank, that is the number of nested for-loops, and must be provided at compile-time.

The policy requires two arguments:

  1. An initializer list, or Kokkos::Array, of "begin" indices
  2. An initializer list, or Kokkos::Array, of "end" indices

Internally the MDRangePolicy uses tiling over the multi-dimensional iteration space. For customization an optional third argument may be passed to the policy - an initializer list of tile dimension sizes. This argument might become important when performance tuning, as simple default sizes can be problem-dependent and difficult to determine automatically.

The signature of the lambda (or access operator of the functor) requires an argument for each rank.

The MDRangePolicy can be used with both the parallel_for and parallel_reduce patterns in Kokkos.

References

The API reference for the MDRangePolicy is available on the Kokkos wiki: wiki link

The use case that this example is based on comes from the Intrepid2 package of Trilinos. For more examples, check out code in Trilinos in files at: Trilinos/packages/intrepid2/src/Shared/Intrepid2_ArrayToolsDef*.hpp.

This link provides some overview of the Intrepid package: documentation link

Clone this wiki locally