-
Notifications
You must be signed in to change notification settings - Fork 0
Kokkos Fortran Interoperability
This example demonstrates usage of the teh Fortran Language Compatibility Layer (FLCL)
in the context of performing a simple DAXPY (double precision A * X + Y) using Kokkos from a simple fortran program. Such a use case occurs when using Kokkos for for performance portability within a Fortran application.
This example uses the Kokkos fortran interop utilities https://github.com/kokkos/kokkos-fortran-interop. This includes a set of fortran routines for converting fortran allocated arrays into an ndarray and a set of C++ functions for converting an ndarray into a Kokkos unmanaged view.
The ndarray type (flcl_ndarray_t) is a simple struct that captures the rank, dimensions, strides (equivalent to a dope vector) along with the flattened data.
typedef struct _flcl_nd_array_t {
size_t rank;
size_t dims[FLCL_NDARRAY_MAX_RANK];
size_t strides[FLCL_NDARRAY_MAX_RANK];
void *data;
} flcl_ndarray_t;
The
Input:
inputData(C,P,D,D)
- a rank 4 View
inputField(C,F,P,D)
- a rank 4 View
Return:
outputField(C,F,P,D)
- a rank 4 View
Computation:
For each triple in C,F,P
compute an output field from the two input views:
for each (c,f,p) in (C,F,P)
compute the product inputData(c,p,:,:) * inputField(c,f,p,:)
store result in outputField(c,f,p,:)
for (int c = 0; c < C; ++c)
for (int f = 0; f < F; ++f)
for (int p = 0; p < P; ++p)
{
auto result = Kokkos::subview(outputField, c, f, p, Kokkos::ALL);
auto left = Kokkos::subview(inputData, c, p, Kokkos::ALL, Kokkos::ALL);
auto right = Kokkos::subview(inputField, c, f, p, Kokkos::ALL);
for (int i=0;i<D;++i) {
double tmp(0);
for (int j=0;j<D;++j)
tmp += left(i, j)*right(j);
result(i) = tmp;
}
}
The most straightforward way to parallize the serial code above is to convert the outer for
loop over cells with the sequential iteration pattern into a parallel for loop using a RangePolicy
Kokkos::parallel_for("for_all_cells",
Kokkos::RangePolicy<>(0,C),
KOKKOS_LAMBDA (const int c) {
for (int f = 0; f < F; ++f)
for (int p = 0; p < P; ++p)
{
auto result = Kokkos::subview(outputField, c, f, p, Kokkos::ALL);
auto left = Kokkos::subview(inputData, c, p, Kokkos::ALL, Kokkos::ALL);
auto right = Kokkos::subview(inputField, c, f, p, Kokkos::ALL);
for (int i=0;i<D;++i) {
double tmp(0);
for (int j=0;j<D;++j)
tmp += left(i, j)*right(j);
result(i) = tmp;
}
}
});
If the number of cells is large enough to merit parallelization, that is the overhead for parallel dispatch plus computation time is less than total serial execution time, then the simple implementation above will result in improved performance.
There is more parallelism to exploit, particularly within the for loops over fields F
and points P
. One way to accomplish this would involve taking the product of the three iteration ranges, C*F*P
, and performing a parallel_for
over that product. However, this would require extraction routines to map between indices from the flattened iteration range, C*F*P
, and the multidimensional indices required by data structures in this example. In addition, to achieve performance portability the mapping between the 1-D product iteration range and multi-dimensional 3-D indices would require architecture-awareness, akin to the notion of LayoutLeft
and LayoutRight
used in Kokkos to establish data access patterns.
The MDRangePolicy
provides a natural way to accomplish the goal of parallelizing over all three iteration ranges without requiring manually computing the product of the iteration ranges and mapping between 1-D and 3-D multidimensional indices. The MDRangePolicy
is suitable for use with tightly-nested for loops and provides a method to expose additional parallelism in computations beyond simply parallelizing in a single dimension, as was shown in the first implementation using the RangePolicy
.
Kokkos::parallel_for("mdr_for_all_cells",
Kokkos::MDRangePolicy< Kokkos::Rank<3> > ({0,0,0}, {C,F,P}),
KOKKOS_LAMBDA (const int c, const int f, const int p) {
auto result = Kokkos::subview(outputField, c, f, p, Kokkos::ALL);
auto left = Kokkos::subview(inputData, c, p, Kokkos::ALL, Kokkos::ALL);
auto right = Kokkos::subview(inputField, c, f, p, Kokkos::ALL);
for (int i=0;i<D;++i) {
double tmp(0);
for (int j=0;j<D;++j)
tmp += left(i, j)*right(j);
result(i) = tmp;
}
});
The MDRangePolicy
accepts the same template parameters as the RangePolicy
, but also requires an additional type - the Kokkos::Rank<R>
parameter, where R
is the rank, that is the number of nested for-loops, and must be provided at compile-time.
The policy requires two arguments:
- An initializer list, or
Kokkos::Array
, of "begin" indices - An initializer list, or
Kokkos::Array
, of "end" indices
Internally the MDRangePolicy
uses tiling over the multi-dimensional iteration space. For customization an optional third argument may be passed to the policy - an initializer list of tile dimension sizes. This argument might become important when performance tuning, as simple default sizes can be problem-dependent and difficult to determine automatically.
The signature of the lambda (or access operator of the functor) requires an argument for each rank.
The MDRangePolicy
can be used with both the parallel_for
and parallel_reduce
patterns in Kokkos.
The API reference for the MDRangePolicy
is available on the Kokkos wiki:
wiki link
The use case that this example is based on comes from the Intrepid2 package of Trilinos. For more examples, check out code in Trilinos in files at: Trilinos/packages/intrepid2/src/Shared/Intrepid2_ArrayToolsDef*.hpp
.
This link provides some overview of the Intrepid package: documentation link
Home:
- Introduction
- Machine Model
- Programming Model
- Compiling
- Initialization
- View
- Parallel Dispatch
- Hierarchical Parallelism
- Custom Reductions
- Atomic Operations
- Subviews
- Interoperability
- Kokkos and Virtual Functions
- Initialization and Finalization
- View
- Data Parallelism
- Execution Policies
- Spaces
- Task Parallelism
- Utilities
- STL Compatibility
- Numerics
- Detection Idiom