diff --git a/.gitignore b/.gitignore
index e7013a99f..593401bcb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,5 +1,6 @@
 # Python Virtual Environment
 /venv
+.venv/
 # PyCharm files
 /.idea
 # Rendered html files
diff --git a/docs/Makefile b/docs/Makefile
index b5218c28e..004bdd74a 100644
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -1,5 +1,5 @@
 html:
-	sphinx-build -b html ./source/ ./generated_docs/
+	sphinx-build -b html ./source/ ./generated_docs/ -W --keep-going
 	python3 ./source/edit_button_handler.py
 
 clean:
diff --git a/docs/source/API/alphabetical.md b/docs/source/API/alphabetical.md
deleted file mode 100644
index 71e66664e..000000000
--- a/docs/source/API/alphabetical.md
+++ /dev/null
@@ -1,121 +0,0 @@
-# API in Alphabetical Order
-
-All functions and classes listed here are part of the `Kokkos::` namespace.
-
-## Algorithms
-|Name |Library | Category | Description |
-|:---------|:--------|:-----------|:----------------------------|
-|Rand| Algorithm | Random Number | Generator Type (12), draw options (3) |
-|Rand| Algorithm | Random Number | Generator Type (12), draw options (3) |
-|Random_XorShift64_Pool| Algorithm | Random Number | Random Number Generator, pool for threads |
-|Random_XorShift64| Algorithm | Random Number | Random Number Generator for 12 types, plus normal distribution|
-|init| Algorithm | Random Number | initialize state using seed for Random_XorShift64_Pool |
-|Random_XorShift1024_Pool| Algorithm | Random Number | Random Number Generator, 1024 bit, pool for threads |
-|Random_XorShift1024| Algorithm | Random Number | Random Number Generator for 12 types, plus normal distribution)|
-|fill_random| Algorithm | Random Number | create sample space to fit a (0 to) range or begin-end space |
-
-
-## Containers
-| Name |Library | Category | Description |
-|:--------------------------------------------|:--------|:-----------|:----------------------------|
-| [Bitset](containers/Bitset) | [Containers](containers-index) | View | A concurrent Bitset
class. | -| [DualView](containers/DualView) | [Containers](containers-index) | View | Host-Device Mirror of View with Host-Device Memory | -| [DynamicView](containers/DynamicView) | [Containers](containers-index) | View | A view which can change its size dynamically. | -| [DynRankView](containers/DynRankView) | [Containers](containers-index) | View | A view which can determine its rank at runtime. | -| ErrorReporter | [Containers](containers-index) | View | A class supporting error recording in parallel code. | -| [OffsetView](containers/Offset-View) | [Containers](containers-index) | View | View structure supporting non-zero start indicies. | -| [ScatterView](containers/ScatterView) | [Containers](containers-index) | View | View structure to transpartently support atomic and data replication strategies for scatter-reduce algorithms. | -| [StaticCrsGraph](containers/StaticCrsGraph) | [Containers](containers-index) | View | A non-resizable CRS graph structure with view semantics. | -| [UnorderedMap](containers/Unordered-Map) | [Containers](containers-index) | View | A map data structure optimized for concurrent inserts. | -| [vector](containers/vector) | [Containers](containers-index) | View | A class providing similar interfaces to `std::vector`. | - - -## Core -| Name |Library | Category | Description | -|:------------------------------------------------------------------------------|:--------|:----------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------| -| [abort](core/utilities/abort) | [Core](core-index) | [Utilities](core/Utilities) | Causes abnormal program termination. | -| [ALL](core/utilities/all) | [Core](core-index) | [Utilities](core/Utilities) | Selects all elements in a dimension. 
| -| [atomic_exchange](core/atomics/atomic_exchange) | [Core](core-index) | [Atomic-Operations](core/atomics) | Atomic operation which exchanges a value and returns the old. | -| [atomic_compare_exchange](core/atomics/atomic_compare_exchange) | [Core](core-index) | [Atomic-Operations](core/atomics) | Atomic operation which exchanges a value only if the old value matches a comparison value and returns the old value. | -| [atomic_compare_exchange_strong](core/atomics/atomic_compare_exchange_strong) | [Core](core-index) | [Atomic-Operations](core/atomics) | Atomic operation which exchanges a value only if the old value matches a comparison value and returns true if the exchange is executed. | -| [atomic_load](core/atomics/atomic_load) | [Core](core-index) | [Atomic-Operations](core/atomics) | Atomic operation which loads a value. | -| [atomic_\[op\]](core/atomics/atomic_op) | [Core](core-index) | [Atomic-Operations](core/atomics) | Atomic operation which don't return anything. | -| [atomic_fetch_\[op\]](core/atomics/atomic_fetch_op) | [Core](core-index) | [Atomic-Operations](core/atomics) | Various atomic operations which return the old value. | -| [atomic_\[op\]_fetch](core/atomics/atomic_op_fetch) | [Core](core-index) | [Atomic-Operations](core/atomics) | Various atomic operations which return the updated value. | -| [atomic_store](core/atomics/atomic_store) | [Core](core-index) | [Atomic-Operations](core/atomics) | Atomic operation which stores a value. 
| -| [BAnd](core/builtinreducers/BAnd) | [Core](core-index) | [Atomic-Operations](core/atomics) | Reducer for Binary 'And' reduction | -| [BOr](core/builtinreducers/BOr) | [Core](core-index) | [Atomic-Operations](core/atomics) | Reducer for Binary 'Or' reduction | -| [complex](core/utilities/complex) | [Core](core-index) | [STL Compatibility](core/STL-Compatibility) | Complex numbers which work on host and device | -| [(X)create_mirror](core/view/create_mirror) | [Core](core-index) | [View](core/View) | Mirror Host data to Device data | -| [(X)create_mirror_view](core/view/create_mirror) | [Core](core-index) | [View](core/View) | Mirror Host data to Device data | -| [Cuda](Cuda) | [Core](core-index) | [Spaces](core/Spaces) | The CUDA Execution Space. | -| [CudaSpace](CudaSpace) | [Core](core-index) | [Spaces](core/Spaces) | The primary CUDA Memory Space. | -| [CudaUVMSpace](CudaUVMSpace) | [Core](core-index) | [Spaces](core/Spaces) | The CUDA Memory Space providing access to unified memory page migratable allocations. | -| [CudaHostPinnedSpace](CudaHostPinnedSpace) | [Core](core-index) | [Spaces](core/Spaces) | The CUDA Memrory Space providing access to host pinned GPU-accessible host memory. | -| [deep_copy](core/view/deep_copy) | [Core](core-index) | [View](core/View) | Copy Views | -| [ExecutionPolicy Concept](core/policies/ExecutionPolicyConcept) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Concept for execution policies. | -| [ExecutionSpace concept](ExecutionSpaceConcept) | [Core](core-index) | [Spaces](core/Spaces) | Concept for execution spaces. | -| [fence](core/parallel-dispatch/fence) | [Core](core-index) | | Fences execution spaces. | -| [finalize](core/initialize_finalize/finalize) | [Core](core-index) | [Initialization and Finalization](core/Initialize-and-Finalize) | function to finalize Kokkos | -| [HostSpace](HostSpace) | [Core](core-index) | [Spaces](core/Spaces) | The primary Host Memory Space. 
| -| [HPX](HPX) | [Core](core-index) | [Spaces](core/Spaces) | Execution space using the HPX runtime system execution mechanisms. | -| [InitArguments](core/initialize_finalize/InitArguments) | [Core](core-index) | [Initialization and Finalization](core/Initialize-and-Finalize) | struct to programmatically define how to initialize Kokkos (deprecated in version 3.7) | -| [InitializationSettings](core/initialize_finalize/InitializationSettings) | [Core](core-index) | [Initialization and Finalization](core/Initialize-and-Finalize) | class to programmatically define how to initialize Kokkos | -| [initialize](core/initialize_finalize/initialize) | [Core](core-index) | [Initialization and Finalization](core/Initialize-and-Finalize) | function to initialize Kokkos | -| [is_array_layout](is_array_layout) | [Core](core-index) | [Traits](core/Traits) | Trait to detect types that model the Layout concept | -| [is_execution_policy](is_execution_policy) | [Core](core-index) | [Traits](core/Traits) | Trait to detect types that model ExecutionPolicy concept | -| is_execution_space | [Core](core-index) | [Traits](core/Traits) | Trait to detect types that model [ExecutionSpace concept](ExecutionSpaceConcept) | -| [is_memory_space](is_memory_space) | [Core](core-index) | [Traits](core/Traits) | Trait to detect types that model [MemorySpace concept](MemorySpaceConcept) | -| [is_memory_traits](is_memory_traits) | [Core](core-index) | [Traits](core/Traits) | Trait to detect specializations of `Kokkos::MemoryTraits` | -| [is_reducer](is_reducer) | [Core](core-index) | [Traits](core/Traits) | Trait to detect types that model the [Reducer concept](core/builtinreducers/ReducerConcept) | -| [is_space](is_space) | [Core](core-index) | [Traits](core/Traits) | Trait to detect types that model the Space concept | -| [LayoutLeft](core/view/layoutLeft) | [Core](core-index) | [Views](core/View) | Memory Layout matching Fortran | -| [LayoutRight](core/view/layoutRight) | [Core](core-index) | 
[Views](core/View) | Memory Layout matching C | -| [LayoutStride](core/view/layoutStride) | [Core](core-index) | [Views](core/View) | Memory Layout for arbitrary strides | -| [kokkos_free](core/c_style_memory_management/free) | [Core](core-index) | [Memory Management](core/c_style_memory_management) | Dellocates previously allocated memory | -| [kokkos_malloc](core/c_style_memory_management/malloc) | [Core](core-index) | [Memory Management](core/c_style_memory_management) | Allocates memory | -| [kokkos_realloc](core/c_style_memory_management/realloc) | [Core](core-index) | [Memory Management](core/c_style_memory_management) | Expands previously allocated memory block | -| [LAnd](core/builtinreducers/LAnd) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Logical 'And' reduction | -| [LOr](core/builtinreducers/LOr) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Logical 'Or' reduction | -| [Max](core/builtinreducers/Max) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Maximum reduction | -| [MaxLoc](core/builtinreducers/MaxLoc) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Reduction providing maximum and an associated index | -| [(U)MDRangePolicy](core/policies/MDRangePolicy) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy to iterate over a multidimensional index range. | -| [MemorySpace concept](MemorySpaceConcept) | [Core](core-index) | [Spaces](core/Spaces) | Concept for execution spaces. 
| -| [Min](core/builtinreducers/Min) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Minimum reduction | -| [MinLoc](core/builtinreducers/MinLoc) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Reduction providing minimum and an associated index | -| [MinMax](core/builtinreducers/MinMax) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Reduction providing both minimum and maximum | -| [MinMaxLoc](core/builtinreducers/MinMaxLoc) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Reduction providing both minimum and maximum and associated indicies | -| [OpenMP](OpenMP) | [Core](core-index) | [Spaces](core/Spaces) | Execution space using non-target OpenMP parallel execution mechanisms. | -| [OpenMPTarget](OpenMPTarget) | [Core](core-index) | [Spaces](core/Spaces) | Execution space using targetoffload OpenMP parallel execution mechanisms. | -| [pair](core/stl-compat/pair) | [Core](core-index) | [STL Compatibility](core/STL-Compatibility) | Device compatible std::pair analogue | -| [parallel_for](core/parallel-dispatch/parallel_for) | [Core](core-index) | | Bulk execute of independent work items. | -| [ParallelForTag](core/parallel-dispatch//ParallelForTag) | [Core](core-index) | | Tag passed to team\_size functions | -| [parallel_reduce](core/parallel-dispatch/parallel_reduce) | [Core](core-index) | | Bulk execute of independent work items, which contribute to a reduction. | -| [ParallelReduceTag](core/parallel-dispatch//ParallelReduceTag) | [Core](core-index) | | Tag passed to team\_size functions | -| [parallel_scan](core/parallel-dispatch/parallel_scan) | [Core](core-index) | | Bulk execute of work items, which a simple pre- or postfix scan dependency. 
| -| [ParallelScanTag](core/parallel-dispatch//ParallelScanTag) | [Core](core-index) | | Tag passed to team\_size functions | -| [partition_space](core/spaces/partition_space) | [Core](core-index) | [Spaces](core/Spaces) | Split an existing execution space instance into multiple | -| [PerTeam](PerTeam) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy used in single construct to indicate once per team execution. | -| [PerThread](PerThread) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy used in single construct to indicate once per thread execution. | -| [Prod](core/builtinreducers/Prod) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Multiplicative reduction | -| [RangePolicy](core/policies/RangePolicy) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy to iterate over a 1D index range. | -| [realloc](core/view/realloc) | [Core](core-index) | [View](core/View) | Resize an existing view without maintaining the content | -| [ReducerConcept](core/builtinreducers/ReducerConcept) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Provides the concept for Reducers. | -| [resize](core/view/resize) | [Core](core-index) | [View](core/View) | Resize an existing view while maintaining the content | -| [Serial](Serial) | [Core](core-index) | [Spaces](core/Spaces) | Execution space using serial execution the CPU. | -| [ScopeGuard](core/initialize_finalize/ScopeGuard) | [Core](core-index) | [Initialization and Finalization](core/Initialize-and-Finalize) | class to aggregate initializing and finalizing Kokkos | -| [SpaceAccessibility](core/SpaceAccessibility) | [Core](core-index) | [Spaces](core/Spaces) | Facility to query accessibility rules between execution and memory spaces. 
| -| [Subview](core/view/Subview_type) | [Core](core-index) | [View](core/View) | Type of multi-dimensional array which is returned by the subview function | -| [subview](core/view/subview) | [Core](core-index) | [View](core/View) | Crating multi-dimensional array which is a slice of a view | -| [Sum](core/builtinreducers/Sum) | [Core](core-index) | [Built-in Reducers](core/builtin_reducers) | Reducer for Sum reduction | -| [TeamHandle concept](core/policies/TeamHandleConcept) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Provides the concept for the `member_type` of a [TeamPolicy](core/policies/TeamPolicy). | -| [(U)TeamPolicy](core/policies/TeamPolicy) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy to iterate over a 1D index range, assigning to each iteration a team of threads. | -| [TeamThreadMDRange](core/policies/TeamThreadMDRange) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy to iterate over a multidimensional index range with the threads of a team. | -| [TeamThreadRange](core/policies/TeamThreadRange) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy to iterate over a 1D index range with the threads of a team. | -| [TeamVectorMDRange](core/policies/TeamVectorMDRange) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy to iterate over a multidimensional index range with the threads and vector lanes of a team. | -| [TeamVectorRange](core/policies/TeamVectorRange) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy to iterate over a 1D index range with the threads and vector lanes of a team. | -| [ThreadVectorMDRange](core/policies/ThreadVectorMDRange) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy to iterate over a multidimensional index range with the vector lanes of a thread. 
| -| [ThreadVectorRange](core/policies/ThreadVectorRange) | [Core](core-index) | [Execution Policies](core/Execution-Policies) | Policy to iterate over a 1D index range with the vector lanes of a thread. | -| [Timer](core/utilities/timer) | [Core](core-index) | [Utilities](core/Utilities) | A basic timer returning seconds | -| [View](core/view/view) | [Core](core-index) | [View](core/View) | A multi-dimensional array | -| [View-like Type Concept](core/view/view_like) | [Core](core-index) | [View](core/View) | A set of class templates that act like a View | diff --git a/docs/source/API/alphabetical.rst b/docs/source/API/alphabetical.rst new file mode 100644 index 000000000..4116457cd --- /dev/null +++ b/docs/source/API/alphabetical.rst @@ -0,0 +1,234 @@ +API in Alphabetical Order +========================= + +All functions and classes listed here are part of the ``Kokkos::`` namespace. + +Algorithms +---------- + ++--------------------------+-----------+---------------+----------------------------------------------------------------+ +| Name | Library | Category | Description | ++==========================+===========+===============+================================================================+ +| Rand | Algorithm | Random Number | Generator Type (12), draw options (3) | ++--------------------------+-----------+---------------+----------------------------------------------------------------+ +| Random_XorShift64_Pool | Algorithm | Random Number | Random Number Generator, pool for threads | ++--------------------------+-----------+---------------+----------------------------------------------------------------+ +| Random_XorShift64 | Algorithm | Random Number | Random Number Generator for 12 types, plus normal distribution | ++--------------------------+-----------+---------------+----------------------------------------------------------------+ +| init | Algorithm | Random Number | Initialize state using seed for Random_XorShift64_Pool | 
++--------------------------+-----------+---------------+----------------------------------------------------------------+ +| Random_XorShift1024_Pool | Algorithm | Random Number | Random Number Generator, 1024 bit, pool for threads | ++--------------------------+-----------+---------------+----------------------------------------------------------------+ +| Random_XorShift1024 | Algorithm | Random Number | Random Number Generator for 12 types, plus normal distribution | ++--------------------------+-----------+---------------+----------------------------------------------------------------+ +| fill_random | Algorithm | Random Number | Create sample space to fit a (0 to) range or begin-end space | ++--------------------------+-----------+---------------+----------------------------------------------------------------+ + +Containers +---------- + ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| Name | Library | Category | Description | ++====================================================+=======================================+==========+================================================================================================================+ +| `Bitset `_ | `Containers `_ | View | A concurrent Bitset class. 
| ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| `DualView `_ | `Containers `_ | View | Host-Device Mirror of View with Host-Device Memory | ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| `DynamicView `_ | `Containers `_ | View | A view which can change its size dynamically. | ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| `DynRankView `_ | `Containers `_ | View | A view which can determine its rank at runtime. | ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| ErrorReporter | `Containers `_ | View | A class supporting error recording in parallel code. | ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| `OffsetView `_ | `Containers `_ | View | View structure supporting non-zero start indicies. | ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| `ScatterView `_ | `Containers `_ | View | View structure to transpartently support atomic and data replication strategies for scatter-reduce algorithms. 
| ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| `StaticCrsGraph `_ | `Containers `_ | View | A non-resizable CRS graph structure with view semantics. | ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| `UnorderedMap `_ | `Containers `_ | View | A map data structure optimized for concurrent inserts. | ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ +| `vector `_ | `Containers `_ | View | A class providing similar interfaces to ``std::vector``. | ++----------------------------------------------------+---------------------------------------+----------+----------------------------------------------------------------------------------------------------------------+ + +Core +---- + +.. |SubviewType| replace:: Subview +.. 
_SubviewType: core/view/Subview_type.html + ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| Name | Library | Category | Description | ++======================================================================================+===========================+========================================================================+=========================================================================================================================================+ +| `abort `_ | `Core `_ | `Utilities `_ | Causes abnormal program termination. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ALL `_ | `Core `_ | `Utilities `_ | Selects all elements in a dimension. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `atomic_exchange `_ | `Core `_ | `Atomic-Operations `_ | Atomic operation which exchanges a value and returns the old. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `atomic_compare_exchange `_ | `Core `_ | `Atomic-Operations `_ | Atomic operation which exchanges a value only if the old value matches a comparison value and returns the old value. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `atomic_compare_exchange_strong `_ | `Core `_ | `Atomic-Operations `_ | Atomic operation which exchanges a value only if the old value matches a comparison value and returns true if the exchange is executed. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `atomic_load `_ | `Core `_ | `Atomic-Operations `_ | Atomic operation which loads a value. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `atomic_\[op\] `_ | `Core `_ | `Atomic-Operations `_ | Atomic operation which don't return anything. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `atomic_fetch_\[op\] `_ | `Core `_ | `Atomic-Operations `_ | Various atomic operations which return the old value. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `atomic_\[op\]_fetch `_ | `Core `_ | `Atomic-Operations `_ | Various atomic operations which return the updated value. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `atomic_store `_ | `Core `_ | `Atomic-Operations `_ | Atomic operation which stores a value. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `BAnd `_ | `Core `_ | `Atomic-Operations `_ | Reducer for Binary 'And' reduction | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `BOr `_ | `Core `_ | `Atomic-Operations `_ | Reducer for Binary 'Or' reduction | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `complex `_ | `Core `_ | `STL Compatibility `_ | Complex numbers which work on host and device | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `(X)create_mirror `_ | `Core `_ | `View and related `_ | Mirror Host data to Device data | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `(X)create_mirror_view `_ | `Core `_ | `View and related `_ | Mirror Host data to Device data | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `Cuda `_ | `Core `_ | `Spaces `_ | The CUDA Execution Space. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `CudaSpace `_ | `Core `_ | `Spaces `_ | The primary CUDA Memory Space. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `CudaUVMSpace `_ | `Core `_ | `Spaces `_ | The CUDA Memory Space providing access to unified memory page migratable allocations. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `CudaHostPinnedSpace `_ | `Core `_ | `Spaces `_ | The CUDA Memrory Space providing access to host pinned GPU-accessible host memory. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `deep_copy `_ | `Core `_ | `View and related `_ | Copy Views | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ExecutionPolicy Concept `_ | `Core `_ | `Execution Policies `_ | Concept for execution policies. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ExecutionSpace concept `_ | `Core `_ | `Spaces `_ | Concept for execution spaces. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `fence `_ | `Core `_ | | Fences execution spaces. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `finalize `_ | `Core `_ | `Initialization and Finalization `_ | function to finalize Kokkos | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `HostSpace `_ | `Core `_ | `Spaces `_ | The primary Host Memory Space. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `HPX `_ | `Core `_ | `Spaces `_ | Execution space using the HPX runtime system execution mechanisms. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `InitArguments `_ | `Core `_ | `Initialization and Finalization `_ | struct to programmatically define how to initialize Kokkos (deprecated in version 3.7) | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `InitializationSettings `_ | `Core `_ | `Initialization and Finalization `_ | class to programmatically define how to initialize Kokkos | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `initialize `_ | `Core `_ | `Initialization and Finalization `_ | function to initialize Kokkos | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `is_array_layout `_ | `Core `_ | `Traits `_ | Trait to detect types that model the Layout concept | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `is_execution_policy `_ | `Core `_ | `Traits `_ | Trait to detect types that model ExecutionPolicy concept | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| is_execution_space | `Core `_ | `Traits `_ | Trait to detect types that model `ExecutionSpace concept `_ | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `is_memory_space `_ | `Core `_ | `Traits `_ | Trait to detect types that model `MemorySpace concept `_ | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `is_memory_traits `_ | `Core `_ | `Traits `_ | Trait to detect specializations of `Kokkos::MemoryTraits` | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `is_reducer `_ | `Core `_ | `Traits `_ | Trait to detect types that model the `Reducer concept `_ | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `is_space `_ | `Core `_ | `Traits `_ | Trait to detect types that model the Space concept | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `LayoutLeft `_ | `Core `_ | `View and related `_ | Memory Layout matching Fortran | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `LayoutRight `_ | `Core `_ | `View and related `_ | Memory Layout matching C | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `LayoutStride `_ | `Core `_ | `View and related `_ | Memory Layout for arbitrary strides | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `kokkos_free `_ | `Core `_ | `Memory Management `_ | Deallocates previously allocated memory | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `kokkos_malloc `_ | `Core `_ | `Memory Management `_ | Allocates memory | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `kokkos_realloc `_ | `Core `_ | `Memory Management `_ | Expands previously allocated memory block | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `LAnd `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Logical 'And' reduction | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `LOr `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Logical 'Or' reduction | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `Max `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Maximum reduction | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `MaxLoc `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Reduction providing maximum and an associated index | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `(U)MDRangePolicy `_ | `Core `_ | `Execution Policies `_ | Policy to iterate over a multidimensional index range. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `MemorySpace concept `_ | `Core `_ | `Spaces `_ | Concept for memory spaces. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `Min `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Minimum reduction | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `MinLoc `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Reduction providing minimum and an associated index | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `MinMax `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Reduction providing both minimum and maximum | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `MinMaxLoc `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Reduction providing both minimum and maximum and associated indices | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `OpenMP `_ | `Core `_ | `Spaces `_ | Execution space using non-target OpenMP parallel execution mechanisms. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `OpenMPTarget `_ | `Core `_ | `Spaces `_ | Execution space using target offload OpenMP parallel execution mechanisms. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `pair `_ | `Core `_ | `STL Compatibility `_ | Device compatible std::pair analogue | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `parallel_for `_ | `Core `_ | | Bulk execute of independent work items. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ParallelForTag `_ | `Core `_ | | Tag passed to team_size functions | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `parallel_reduce `_ | `Core `_ | | Bulk execute of independent work items, which contribute to a reduction. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ParallelReduceTag `_ | `Core `_ | | Tag passed to team_size functions | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `parallel_scan `_ | `Core `_ | | Bulk execute of work items, which have a simple pre- or postfix scan dependency. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ParallelScanTag `_ | `Core `_ | | Tag passed to team_size functions | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `partition_space `_ | `Core `_ | `Spaces `_ | Split an existing execution space instance into multiple | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `PerTeam `_ | `Core `_ | `Execution Policies `_ | Policy used in single construct to indicate once per team execution. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `PerThread `_ | `Core `_ | `Execution Policies `_ | Policy used in single construct to indicate once per thread execution. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `Prod `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Multiplicative reduction | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `RangePolicy `_ | `Core `_ | `Execution Policies `_ | Policy to iterate over a 1D index range. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `realloc `_ | `Core `_ | `View and related `_ | Resize an existing view without maintaining the content | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ReducerConcept `_ | `Core `_ | `Built-in Reducers `_ | Provides the concept for Reducers. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `resize `_ | `Core `_ | `View and related `_ | Resize an existing view while maintaining the content | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `Serial `_ | `Core `_ | `Spaces `_ | Execution space using serial execution on the CPU. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ScopeGuard `_ | `Core `_ | `Initialization and Finalization `_ | class to aggregate initializing and finalizing Kokkos | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `SpaceAccessibility `_ | `Core `_ | `Spaces `_ | Facility to query accessibility rules between execution and memory spaces. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| |SubviewType|_ | `Core `_ | `View and related `_ | Type of multi-dimensional array which is returned by the subview function | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `subview `_ | `Core `_ | `View and related `_ | Creates a multi-dimensional array which is a slice of a view | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `Sum `_ | `Core `_ | `Built-in Reducers `_ | Reducer for Sum reduction | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `TeamHandle concept `_ | `Core `_ | `Execution Policies `_ | Provides the concept for the `member_type` of a `TeamPolicy `_. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `(U)TeamPolicy `_ | `Core `_ | `Execution Policies `_ | Policy to iterate over a 1D index range, assigning to each iteration a team of threads. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `TeamThreadMDRange `_ | `Core `_ | `Execution Policies `_ | Policy to iterate over a multidimensional index range with the threads of a team. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `TeamThreadRange `_ | `Core `_ | `Execution Policies `_ | Policy to iterate over a 1D index range with the threads of a team. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `TeamVectorMDRange `_ | `Core `_ | `Execution Policies `_ | Policy to iterate over a multidimensional index range with the threads and vector lanes of a team. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `TeamVectorRange `_ | `Core `_ | `Execution Policies `_ | Policy to iterate over a 1D index range with the threads and vector lanes of a team. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ThreadVectorMDRange `_ | `Core `_ | `Execution Policies `_ | Policy to iterate over a multidimensional index range with the vector lanes of a thread. 
| ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `ThreadVectorRange `_ | `Core `_ | `Execution Policies `_ | Policy to iterate over a 1D index range with the vector lanes of a thread. | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `Timer `_ | `Core `_ | `Utilities `_ | A basic timer returning seconds | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `View `_ | `Core `_ | `View and related `_ | A multi-dimensional array | ++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ +| `View-like Type Concept `_ | `Core `_ | `View and related `_ | A set of class templates that act like a View | 
++--------------------------------------------------------------------------------------+---------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/docs/source/API/containers/DynRankView.rst b/docs/source/API/containers/DynRankView.rst index b3ef49de0..8a30fe3d3 100644 --- a/docs/source/API/containers/DynRankView.rst +++ b/docs/source/API/containers/DynRankView.rst @@ -70,19 +70,11 @@ Description are specified, ``MemorySpace`` must come before ``MemoryTraits``. - .. rubric:: Public Enums + .. rubric:: Public Static Variables - .. cppkokkos:type:: rank - - Rank of the view (i.e. the dimensionality). - - .. cppkokkos:type:: rank_dynamic - - Number of runtime determined dimensions. - - .. cppkokkos:type:: reference_type_is_lvalue_reference - - Whether the reference type is a C++ lvalue reference. + * ``rank``: Rank of the view (i.e. the dimensionality). + * ``rank_dynamic``: Number of runtime determined dimensions. + * ``reference_type_is_lvalue_reference``: Whether the reference type is a C++ lvalue reference. .. rubric:: Public Data Types Typedefs @@ -247,7 +239,7 @@ Description * ``ptr``: pointer to a user provided memory allocation. Must provide storage of size ``DynRankView::required_allocation_size(n0,...,nR)``. * ``indices``: runtime dimensions of the view. - .. cppkokkos:function:: DynRankView(const std::string& name, const array_layout& layout) + .. cppkokkos:function:: DynRankView(const pointer_type& ptr, const array_layout& layout) Unmanaged data wrapper constructor. 
diff --git a/docs/source/API/containers/Unordered-Map.rst b/docs/source/API/containers/Unordered-Map.rst index 192a5a900..0e05d883e 100644 --- a/docs/source/API/containers/Unordered-Map.rst +++ b/docs/source/API/containers/Unordered-Map.rst @@ -67,9 +67,10 @@ Description Insert the given key into the map with a default constructed value - .. cppkokkos:kokkosinlinefunction:: UnorderedMapInsertResult insert(Key key, Value value) const; + .. cppkokkos:kokkosinlinefunction:: UnorderedMapInsertResult insert(Key key, Value value, Insert op = NoOp) const; - Insert the given key/value pair into the map + Insert the given key/value pair into the map and optionally specify + the operator, op, used for combining values if key already exists .. cppkokkos:kokkosinlinefunction:: uint32_t find(Key key) const @@ -116,8 +117,27 @@ Description Index where the key exists in the map as long as failed() == false -Insertion ---------- +.. cppkokkos:struct:: template UnorderedMapInsertOpTypes + + :tparam ValueTypeView: The UnorderedMap value array type. + + :tparam ValuesIdxType: The index type for lookups in the value array. + + .. rubric:: *Public* Insertion Operator Types + + .. cppkokkos:struct:: NoOp + + Insert the given key/value pair into the map + + .. cppkokkos:struct:: AtomicAdd + + Duplicate key insertions sum values together. + + +.. _unordered_map_insert_op_types_noop: + +Insertion using default ``UnorderedMapInsertOpTypes::NoOp`` +----------------------------------------------------------- There are 3 potential states for every insertion which are reported by the ``UnorderedMapInsertResult``: @@ -129,6 +149,40 @@ There are 3 potential states for every insertion which are reported by the ``Uno with a bounded search of the internal atomic bitset. A ``failed`` insertion requires the user to increase the capacity (``rehash``) and restart the algoritm. +.. 
code-block:: cpp + + // use the default NoOp insert operation + using map_op_type = Kokkos::UnorderedMapInsertOpTypes; + using noop_type = typename map_op_type::NoOp; + noop_type noop; + parallel_for(N, KOKKOS_LAMBDA (uint32_t i) { + map.insert(i, values(i), noop); + }); + // OR; + parallel_for(N, KOKKOS_LAMBDA (uint32_t i) { + map.insert(i, values(i)); + }); + +Insertion using ``UnorderedMapInsertOpTypes::AtomicAdd`` +-------------------------------------------------------- + +The behavior from :ref:`unordered_map_insert_op_types_noop` holds true with the +exception that the ``UnorderedMapInsertResult``: + +- ``existing`` implies that the key is already in the map and the existing value at key was summed + with the new value being inserted. + +.. code-block:: cpp + + // use the AtomicAdd insert operation + using map_op_type = Kokkos::UnorderedMapInsertOpTypes; + using atomic_add_type = typename map_op_type::AtomicAdd; + atomic_add_type atomic_add; + parallel_for(N, KOKKOS_LAMBDA (uint32_t i) { + map.insert(i, values(i), atomic_add); + }); + + Iteration --------- diff --git a/docs/source/API/core/Utilities.rst b/docs/source/API/core/Utilities.rst index f26e4facd..9ca55cbd7 100644 --- a/docs/source/API/core/Utilities.rst +++ b/docs/source/API/core/Utilities.rst @@ -7,6 +7,7 @@ Utilities ./utilities/abort ./utilities/all ./utilities/complex + ./utilities/printf ./utilities/timer ./utilities/device_id ./utilities/num_threads diff --git a/docs/source/API/core/initialize_finalize/InitArguments.md b/docs/source/API/core/initialize_finalize/InitArguments.md deleted file mode 100644 index 6c7f4b928..000000000 --- a/docs/source/API/core/initialize_finalize/InitArguments.md +++ /dev/null @@ -1,44 +0,0 @@ -# InitArguments - -Defined in `` header. 
- -## Interface -```C++ -struct InitArguments { // (deprecated since 3.7) - int num_threads; - int num_numa; - int device_id; - int ndevices; - int skip_device; - bool disable_warnings; - InitArguments(); -}; -``` - -**DEPRECATED: use `Kokkos::InitializationSettings` instead** -`InitArguments` is a struct that can be used to programmatically define the -arguments passed to [`Kokkos::initialize`](initialize). It was deprecated in -version 3.7 in favor of -[`Kokkos::InitializationSettings`](InitializationSettings). -One of the main reasons for replacing it was that user-specified data members -cannot be distinguished from defaulted ones. - - -### Example -```C++ -#include - -int main() { - Kokkos::InitArguments arguments; - arguments.num_threads = 2; - arguments.device_id = 1; - arguments.disable_warnings = true; - Kokkos::initialize(arguments); - // ... - Kokkos::finalize(); -} -``` - -### See also -* [`Kokkos::InitializationSettings`](InitializationSettings) -* [`Kokkos::initialize`](initialize) diff --git a/docs/source/API/core/initialize_finalize/InitArguments.rst b/docs/source/API/core/initialize_finalize/InitArguments.rst new file mode 100644 index 000000000..b0562525e --- /dev/null +++ b/docs/source/API/core/initialize_finalize/InitArguments.rst @@ -0,0 +1,62 @@ +InitArguments +============= + +.. role:: cppkokkos(code) + :language: cppkokkos + +.. _KokkosInitialize: initialize.html +.. |KokkosInitialize| replace:: ``Kokkos::initialize`` + +.. _KokkosInitializationSetting: InitializationSettings.html +.. |KokkosInitializationSetting| replace:: ``Kokkos::InitializationSettings`` + +Defined in ```` header. + +.. warning:: Deprecated since 3.7, **use** ``Kokkos::InitializationSettings`` **instead** + +Interface +--------- + +.. cppkokkos:struct:: InitArguments + + .. cppkokkos:member:: int num_threads + + .. cppkokkos:member:: int num_numa + + .. cppkokkos:member:: int device_id + + .. cppkokkos:member:: int ndevices + + .. 
cppkokkos:member:: int skip_device + + .. cppkokkos:member:: bool disable_warnings + + .. cppkokkos:function:: InitArguments() + +``InitArguments`` is a struct that can be used to programmatically define the arguments passed to |KokkosInitialize|_. It was deprecated in version 3.7 in favor of |KokkosInitializationSetting|_. + +One of the main reasons for replacing it was that user-specified data members cannot be distinguished from defaulted ones. + +Example +~~~~~~~ + +.. code-block:: cpp + + #include + + int main() { + Kokkos::InitArguments arguments; + arguments.num_threads = 2; + arguments.device_id = 1; + arguments.disable_warnings = true; + Kokkos::initialize(arguments); + // ... + Kokkos::finalize(); + } + + +See also +~~~~~~~~ + +* |KokkosInitializationSetting|_ +* |KokkosInitialize|_ diff --git a/docs/source/API/core/parallel-dispatch/parallel_for.rst b/docs/source/API/core/parallel-dispatch/parallel_for.rst index 8648f2b5e..e369e4647 100644 --- a/docs/source/API/core/parallel-dispatch/parallel_for.rst +++ b/docs/source/API/core/parallel-dispatch/parallel_for.rst @@ -74,7 +74,7 @@ More Detailed Examples are provided in the ExecutionPolicy documentation. int N = atoi(argv[1]); - Kokkos::parallel_for("Loop1", N, KOKKOS_LAMBDA (const int& i) { + Kokkos::parallel_for("Loop1", N, KOKKOS_LAMBDA (const int i) { printf("Greeting from iteration %i\n",i); }); diff --git a/docs/source/API/core/parallel-dispatch/parallel_reduce.rst b/docs/source/API/core/parallel-dispatch/parallel_reduce.rst index 40a225037..3f2ebc99b 100644 --- a/docs/source/API/core/parallel-dispatch/parallel_reduce.rst +++ b/docs/source/API/core/parallel-dispatch/parallel_reduce.rst @@ -11,12 +11,12 @@ Usage .. code-block:: cpp - Kokkos::parallel_reduce( name, policy, functor, reducer... 
); - Kokkos::parallel_reduce( name, policy, functor, result...); - Kokkos::parallel_reduce( name, policy, functor); - Kokkos::parallel_reduce( policy, functor, reducer...); - Kokkos::parallel_reduce( policy, functor, result...); - Kokkos::parallel_reduce( policy, functor); + Kokkos::parallel_reduce(name, policy, functor, reducer...); + Kokkos::parallel_reduce(name, policy, functor, result...); + Kokkos::parallel_reduce(name, policy, functor); + Kokkos::parallel_reduce(policy, functor, reducer...); + Kokkos::parallel_reduce(policy, functor, result...); + Kokkos::parallel_reduce(policy, functor); Dispatches parallel work defined by ``functor`` according to the *ExecutionPolicy* and performs a reduction of the contributions provided by workers as defined by the execution policy. The optional label name is used by profiling and debugging tools. The reduction type is either a ``sum``, is defined by the ``reducer`` or is deduced from an optional ``join`` operator on the functor. The reduction result is stored in ``result``, or through the ``reducer`` handle. It is also provided to the ``functor.final()`` function if such a function exists. Multiple ``reducers`` can be used in a single ``parallel_reduce`` and thus, it is possible to compute the ``min`` and the ``max`` values in a single ``parallel_reduce``. @@ -79,7 +79,7 @@ Parameters: - `TeamThreadRange <../policies/TeamThreadRange.html>`_: defines a 1D iteration range to be executed by a thread-team. Only valid inside a parallel region executed through a ``TeamPolicy`` or a ``TaskTeam``. - `ThreadVectorRange <../policies/ThreadVectorRange.html>`_: defines a 1D iteration range to be executed through vector parallelization dividing the threads within a team. Only valid inside a parallel region executed through a ``TeamPolicy`` or a ``TaskTeam``. * FunctorType: A valid functor with (at minimum) an ``operator()`` with a matching signature for the ``ExecPolicy`` combined with the reduced type. 
-* ReducerArgument: Either a class fullfilling the "Reducer" concept or a ``Kokkos::View`` +* ReducerArgument: Either a class fulfilling the "Reducer" concept or a ``Kokkos::View``. * ReducerArgumentNonConst: A scalar type or an array type; see below for functor requirements. Requirements: @@ -106,7 +106,7 @@ Requirements: + ReducerValueType must match the array signature. + the functor must define FunctorType::value_type the same as ReducerValueType. + the functor must declare a public member variable ``int value_count`` which is the length of the array. - + the functor must implement the function ``void init( ReducerValueType dst [] ) const``. + + the functor must implement the function ``void init( ReducerValueType dst[] ) const``. + the functor must implement the function ``void join( ReducerValueType dst[], ReducerValueType src[] ) const``. + If the functor implements the ``final`` function, the argument must also match those of init and join. @@ -125,45 +125,45 @@ Further examples are provided in the `Custom Reductions <../../../ProgrammingGui .. code-block:: cpp - #include - #include + #include + #include int main(int argc, char* argv[]) { - Kokkos::initialize(argc,argv); + Kokkos::initialize(argc, argv); int N = atoi(argv[1]); double result; - Kokkos::parallel_reduce("Loop1", N, KOKKOS_LAMBDA (const int& i, double& lsum ) { + Kokkos::parallel_reduce("Loop1", N, KOKKOS_LAMBDA (const int& i, double& lsum) { lsum += 1.0*i; - },result); + }, result); - printf("Result: %i %lf\n",N,result); + printf("Result: %i %lf\n", N, result); Kokkos::finalize(); } .. 
code-block:: cpp - #include - #include + #include + #include int main(int argc, char* argv[]) { - Kokkos::initialize(argc,argv); + Kokkos::initialize(argc, argv); int N = atoi(argv[1]); double sum, min; - Kokkos::parallel_reduce("Loop1", N, KOKKOS_LAMBDA (const int& i, double& lsum, double& lmin ) { + Kokkos::parallel_reduce("Loop1", N, KOKKOS_LAMBDA (const int& i, double& lsum, double& lmin) { lsum += 1.0*i; lmin = lmin < 1.0*i ? lmin : 1.0*i; - },sum,Min(min)); + }, sum, Kokkos::Min(min)); - printf("Result: %i %lf %lf\n",N,sum,min); + printf("Result: %i %lf %lf\n", N, sum, min); Kokkos::finalize(); } .. code-block:: cpp - #include - #include + #include + #include struct TagMax {}; struct TagMin {}; @@ -171,28 +171,28 @@ Further examples are provided in the `Custom Reductions <../../../ProgrammingGui struct Foo { KOKKOS_INLINE_FUNCTION void operator() (const TagMax, const Kokkos::TeamPolicy<>::member_type& team, double& lmax) const { - if( team.league_rank % 17 + team.team_rank % 13 > lmax ) - lmax = team.league_rank % 17 + team.team_rank % 13; + if (team.league_rank() % 17 + team.team_rank() % 13 > lmax) + lmax = team.league_rank() % 17 + team.team_rank() % 13; } KOKKOS_INLINE_FUNCTION - void operator() (const TagMin, const Kokkos::TeamPolicy<>::member_type& team, double& lmin ) const { - if( team.league_rank % 17 + team.team_rank % 13 < lmin ) - lmin = team.league_rank % 17 + team.team_rank % 13; + void operator() (const TagMin, const Kokkos::TeamPolicy<>::member_type& team, double& lmin) const { + if (team.league_rank() % 17 + team.team_rank() % 13 < lmin) + lmin = team.league_rank() % 17 + team.team_rank() % 13; } }; int main(int argc, char* argv[]) { - Kokkos::initialize(argc,argv); + Kokkos::initialize(argc, argv); int N = atoi(argv[1]); Foo foo; - double max,min; + double max, min; Kokkos::parallel_reduce(Kokkos::TeamPolicy(N,Kokkos::AUTO), foo, Kokkos::Max(max)); Kokkos::parallel_reduce("Loop2", Kokkos::TeamPolicy(N,Kokkos::AUTO), foo, Kokkos::Min(min)); 
Kokkos::fence(); - printf("Result: %lf %lf\n",min,max); + printf("Result: %lf %lf\n", min, max); Kokkos::finalize(); } diff --git a/docs/source/API/core/parallel-dispatch/parallel_scan.rst b/docs/source/API/core/parallel-dispatch/parallel_scan.rst index f6f68758b..6fc934fbb 100644 --- a/docs/source/API/core/parallel-dispatch/parallel_scan.rst +++ b/docs/source/API/core/parallel-dispatch/parallel_scan.rst @@ -16,8 +16,8 @@ Usage Kokkos::parallel_scan( policy, functor, result); Kokkos::parallel_scan( policy, functor ); -Dispatches parallel work defined by ``functor`` according to the *ExecutionPolicy* ``policy`` and perform a pre (inclusive) or post (exclusive) scan of the contributions -provided by the work items. The optional label ``name`` is used by profiling and debugging tools. If provided, the final result is placed in result. +Dispatches parallel work defined by ``functor`` according to the *ExecutionPolicy* ``policy`` and performs a pre (exclusive) or post (inclusive) scan of the contributions +provided by the work items. The optional label ``name`` is used by profiling and debugging tools. If provided, the final result is placed in ``result``. Interface --------- @@ -33,14 +33,14 @@ Interface Parameters: ~~~~~~~~~~~ -* ``name``: A user provided string which is used in profiling and debugging tools via the Kokkos Profiling Hooks. +* ``name``: A user-provided string which is used in profiling and debugging tools via the Kokkos Profiling Hooks. * ExecPolicy: An *ExecutionPolicy* which defines iteration space and other execution properties. Valid policies are: - ``IntegerType``: defines a 1D iteration range, starting from 0 and going to a count. - - `RangePolicy <../policies/RangePolicy.html>`_: defines a 1D iteration range. + - `RangePolicy <../policies/RangePolicy.html>`_: defines a 1D iteration range. 
- `ThreadVectorRange <../policies/ThreadVectorRange.html>`_: defines a 1D iteration range to be executed through vector parallelization dividing the threads within a team. Only valid inside a parallel region executed through a ``TeamPolicy`` or a ``TaskTeam``. * FunctorType: A valid functor with (at minimum) an ``operator()`` with a matching signature for the ``ExecPolicy`` combined with the reduced type. -* ReturnType: a POD type with ``operator +=`` and ``operator =``, or a ``Kokkos::View``. +* ReturnType: a POD type with ``operator +=`` and ``operator =``, or a ``Kokkos::View``. Requirements: ~~~~~~~~~~~~~ @@ -49,15 +49,15 @@ Requirements: - The ``WorkTag`` free form of the operator is used if ``ExecPolicy`` is an ``IntegerType`` or ``ExecPolicy::work_tag`` is ``void``. - ``HandleType`` is an ``IntegerType`` if ``ExecPolicy`` is an ``IntegerType`` else it is ``ExecPolicy::member_type``. -* The type ``ReturnType`` of the ``functor`` operator must be compatible with the ``ReturnType`` of the parallel_scan and must match the arguments of the ``init`` and ``join`` functions of the functor. +* The type ``ReturnType`` of the ``functor`` operator must be compatible with the ``ReturnType`` of the parallel_scan and must match the arguments of the ``init`` and ``join`` functions of the functor. * the functor must define FunctorType::value_type the same as ReturnType - + Semantics --------- -* Neither concurrency nor order of execution are guaranteed. -* The ``ReturnType`` content will be overwritten, i.e. the value does not need to be initialized to the reduction-neutral element. -* The input value to the operator may contain a partial result, Kokkos may only combine the thread local contributions in the end. The operator should modify the input value according to the desired scan operation. +* Neither concurrency nor order of execution are guaranteed. +* The ``ReturnType`` content will be overwritten, i.e. 
the value does not need to be initialized to the reduction-neutral element. +* The input value to the operator may contain a partial result, Kokkos may only combine the thread local contributions in the end. The operator should modify the input value according to the desired scan operation. Examples -------- @@ -82,8 +82,8 @@ Examples if(is_final) post(i) = partial_sum; }, result); - // pre: 0,0,1,3,6,10,... - // post: 0,1,3,6,10,... + // pre (exclusive): 0,0,1,3,6,10,... + // post (inclusive): 0,1,3,6,10,... // result: N*(N-1)/2 printf("Result: %i %li\n",N,result); } diff --git a/docs/source/API/core/policies/TeamThreadMDRange.md b/docs/source/API/core/policies/TeamThreadMDRange.md deleted file mode 100644 index 17c956b94..000000000 --- a/docs/source/API/core/policies/TeamThreadMDRange.md +++ /dev/null @@ -1,80 +0,0 @@ -# `TeamThreadMDRange` - -Header File: `Kokkos_Core.hpp` - -Usage: - ```c++ - parallel_for(TeamThreadMDRange, TeamHandle>(team, extent1, extent2, ...), - [=] (int i1, int i2, ...) {...}); - parallel_reduce(TeamThreadMDRange, TeamHandle>(team, extent1, extent2, ...), - [=] (int i1, int i2, ..., double& lsum) {...}, sum); - ``` - -TeamThreadMDRange is a [nested execution policy](https://kokkos.github.io/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html?highlight=nested#nested-parallelism) -used inside of hierarchical parallelism. - -## Interface - ```c++ - template - struct TeamThreadMDRange, TeamHandle> { - TeamThreadMDRange(team, extent1, extent2, ..., extentN) { /* ... */ } - - /* ... */ - }; - ``` - -## Description - - * ```c++ - template - struct TeamThreadMDRange, TeamHandle>; - ``` - Splits the index range `0` to `extent` over the threads of the team, - where extent is the backend dependent rank that will be threaded - - * **Arguments** - * `team`: TeamHandle to the calling team execution context. - * `extent_i`: index range length of each rank. 
- - * **Requirements** - * `TeamHandle` is a type that models [TeamHandle](Kokkos%3A%3ATeamHandleConcept) - * extents are ints. - * Every member thread of `team` must call the operation in the same branch, i.e. it is not legal to have some - threads call this function in one branch, and the other threads of `team` call it in another branch. - * `N >= 2 && N <= 8` is true; - - -## Examples - - ```c++ - using TeamHandle = TeamPolicy<>::member_type; - - parallel_for(TeamPolicy<>(N,AUTO), - KOKKOS_LAMBDA (TeamHandle const& team) { - - int leagueRank = team.league_rank(); - - auto teamThreadMDRange = - TeamThreadMDRange, TeamHandle>( - team, n0, n1, n2, n3); - - parallel_for(teamThreadMDRange, [=](int i0, int i1, int i2, int i3) { - A(leagueRank, i0, i1, i2, i3) = B(leagueRank, i1) + C(i1, i2, i3); - }); - - team.team_barrier(); - - int teamSum = 0; - - parallel_reduce(teamThreadMDRange, - [=](int i0, int i1, int i2, int i3, int& threadSum) { - threadSum += D(leagueRank, i0, i1, i2, i3); - }, teamSum - ); - - single(PerTeam(team), [&leagueSum, teamSum]() { leagueSum += teamSum; }); - - A_rowSum[leagueRank] = leagueSum; - }); - ``` - diff --git a/docs/source/API/core/policies/TeamThreadMDRange.rst b/docs/source/API/core/policies/TeamThreadMDRange.rst new file mode 100644 index 000000000..daa7e00b5 --- /dev/null +++ b/docs/source/API/core/policies/TeamThreadMDRange.rst @@ -0,0 +1,80 @@ +``TeamThreadMDRange`` +===================== + +.. role::cpp(code) + :language: cpp + +Header File: ```` + +Description +----------- + +TeamThreadMDRange is a `nested execution policy <./NestedPolicies.html>`_ used inside of hierarchical parallelism. + + +Interface +--------- + +.. cppkokkos:class:: template TeamThreadMDRange + + .. rubric:: Constructor + + .. 
cppkokkos:function:: TeamThreadMDRange(team, extent_1, extent_2, ...); + + Splits the index range ``0`` to ``extent`` over the threads of the team, + where ``extent`` is the backend-dependent rank that will be threaded + + :param team: TeamHandle to the calling team execution context + + :param extent_1, extent_2, ...: index range lengths of each rank + + + * **Requirements** + + * ``TeamHandle`` is a type that models `TeamHandle <./TeamHandleConcept.html>`_ + + * ``extent_1, extent_2, ...`` are ints + + * Every member thread of ``team`` must call the operation in the same branch, + i.e. it is not legal to have some threads call this function in one branch, + and the other threads of ``team`` call it in another branch + + * ``extent_i`` is such that ``i >= 2 && i <= 8`` is true. + For example: + + .. code-block:: cpp + + TeamThreadMDRange(team, 4); // NOT OK, violates i>=2 + + TeamThreadMDRange(team, 4,5); // OK + TeamThreadMDRange(team, 4,5,6); // OK + TeamThreadMDRange(team, 4,5,6,2,3,4,5,6); // OK, max num of extents allowed + +Examples +-------- + +.. 
code-block:: cpp + + using TeamHandle = TeamPolicy<>::member_type; + + parallel_for(TeamPolicy<>(N,AUTO), + KOKKOS_LAMBDA (TeamHandle const& team) { + + int leagueRank = team.league_rank(); + + auto range = TeamThreadMDRange, TeamHandle>(team, n0, n1, n2, n3); + + parallel_for(range, [=](int i0, int i1, int i2, int i3) { + A(leagueRank, i0, i1, i2, i3) = B(leagueRank, i1) + C(i1, i2, i3); + }); + team.team_barrier(); + + int teamSum = 0; + parallel_reduce(range, + [=](int i0, int i1, int i2, int i3, int& threadSum) { + threadSum += D(leagueRank, i0, i1, i2, i3); + }, teamSum + ); + single(PerTeam(team), [&leagueSum, teamSum]() { leagueSum += teamSum; }); + A_rowSum[leagueRank] = leagueSum; + }); diff --git a/docs/source/API/core/policies/TeamVectorMDRange.md b/docs/source/API/core/policies/TeamVectorMDRange.md deleted file mode 100644 index d714301d5..000000000 --- a/docs/source/API/core/policies/TeamVectorMDRange.md +++ /dev/null @@ -1,82 +0,0 @@ -# `TeamVectorMDRange` - -Header File: `Kokkos_Core.hpp` - -Usage: - ```c++ - parallel_for(TeamVectorMDRange, TeamHandle>(team, extent1, extent2, ...), - [=] (int i1, int i2, ...) {...}); - parallel_reduce(TeamVectorMDRange, TeamHandle>(team, extent1, extent2, ...), - [=] (int i1, int i2, ..., double& lsum) {...}, sum); - ``` - -TeamVectorMDRange is a [nested execution policy](https://kokkos.github.io/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html?highlight=nested#nested-parallelism) -used inside of hierarchical parallelism. - -## Interface - ```c++ - template - struct TeamVectorMDRange, TeamHandle> { - TeamVectorMDRange(team, extent1, extent2, ..., extentN) { /* ... */ } - - /* ... */ - }; - ``` - -## Description - - * ```c++ - template - struct TeamVectorMDRange, TeamHandle>; - ``` - Splits an index range `0` to `extent1` over the threads of the team and - another index range `0` to `extent2` over their vector lanes. - Ranks for threading and vectorization determined by the backend. 
- - * **Arguments** - * `team`: TeamHandle to the calling team execution context. - * `extent_i`: index range length of each rank. - - * **Requirements** - * `TeamHandle` is a type that models [TeamHandle](Kokkos%3A%3ATeamHandleConcept) - * extents are ints. - * Every member thread of `team` must call the operation in the same branch, i.e. it is not legal to have some - threads call this function in one branch, and the other threads of `team` call it in another branch. - * `N >= 2 && N <= 8` is true; - -## Examples - - ```c++ - using TeamHandle = TeamPolicy<>::member_type; - - parallel_for(TeamPolicy<>(N,AUTO), - KOKKOS_LAMBDA(TeamHandle const& team) { - - int leagueRank = team.league_rank(); - - auto teamVectorMDRange = - TeamVectorMDRange, TeamType>( - team, n0, n1, n2, n3); - - parallel_for(teamVectorMDRange, - [=](int i0, int i1, int i2, int i3) { - A(leagueRank, i0, i1, i2, i3) = B(leagueRank, i1) + C(i1, i2, i3); - }); - - team.team_barrier(); - - int teamSum = 0; - - parallel_reduce(teamVectorMDRange, - [=](int i0, int i1, int i2, int i3, int& vectorSum) { - vectorSum += v(leagueRank, i, j, k, l); - }, teamSum - ); - - single(PerTeam(team), [&leagueSum, teamSum]() { leagueSum += teamSum; }); - - A_rowSum[leagueRank] = leagueSum; - }); - - ``` - diff --git a/docs/source/API/core/policies/TeamVectorMDRange.rst b/docs/source/API/core/policies/TeamVectorMDRange.rst new file mode 100644 index 000000000..9eb1e7c86 --- /dev/null +++ b/docs/source/API/core/policies/TeamVectorMDRange.rst @@ -0,0 +1,76 @@ +``TeamVectorMDRange`` +===================== + +Header File: ```` + +Description +----------- + +TeamVectorMDRange is a `nested execution policy <./NestedPolicies.html>`_ used inside of hierarchical parallelism. + +Interface +--------- + +.. cppkokkos:class:: template TeamVectorMDRange + + .. rubric:: Constructor + + .. 
cppkokkos:function:: TeamVectorMDRange(team, extent_1, extent_2, ...); + + Splits an index range over the threads of the team and another index range over their vector lanes. + Ranks for threading and vectorization determined by the backend. + + :param team: TeamHandle to the calling team execution context + + :param extent_1, extent_2, ...: index range lengths of each rank + + * **Requirements** + + * ``TeamHandle`` is a type that models `TeamHandle <./TeamHandleConcept.html>`_ + + * ``extent_1, extent_2, ...`` are ints + + * Every member thread of ``team`` must call the operation in the same branch, + i.e. it is not legal to have some threads call this function in one branch, + and the other threads of ``team`` call it in another branch + + * ``extent_i`` is such that ``i >= 2 && i <= 8`` is true. + For example: + + .. code-block:: cpp + + TeamVectorMDRange(team, 4); // NOT OK, violates i>=2 + + TeamVectorMDRange(team, 4,5); // OK + TeamVectorMDRange(team, 4,5,6); // OK + TeamVectorMDRange(team, 4,5,6,2,3,4,5,6); // OK, max num of extents allowed + +Examples +-------- + +.. 
code-block:: cpp + + using TeamHandle = TeamPolicy<>::member_type; + + parallel_for(TeamPolicy<>(N,AUTO), + KOKKOS_LAMBDA(TeamHandle const& team) { + + int leagueRank = team.league_rank(); + + auto range = TeamVectorMDRange, TeamHandle>(team, n0, n1, n2, n3); + + parallel_for(range, + [=](int i0, int i1, int i2, int i3) { + A(leagueRank, i0, i1, i2, i3) = B(leagueRank, i1) + C(i1, i2, i3); + }); + team.team_barrier(); + + int teamSum = 0; + parallel_reduce(range, + [=](int i0, int i1, int i2, int i3, int& vectorSum) { + vectorSum += v(leagueRank, i0, i1, i2, i3); + }, teamSum + ); + single(PerTeam(team), [&leagueSum, teamSum]() { leagueSum += teamSum; }); + A_rowSum[leagueRank] = leagueSum; + }); diff --git a/docs/source/API/core/policies/ThreadVectorMDRange.md b/docs/source/API/core/policies/ThreadVectorMDRange.md deleted file mode 100644 index a0595c92d..000000000 --- a/docs/source/API/core/policies/ThreadVectorMDRange.md +++ /dev/null @@ -1,83 +0,0 @@ -# `ThreadVectorMDRange` - -Header File: `Kokkos_Core.hpp` - -Usage: - ```c++ - parallel_for(ThreadVectorMDRange, TeamHandle>(team, extent1, extent2, ...), - [=] (int i1, int i2, ...) {...}); - parallel_reduce(ThreadVectorMDRange, TeamHandle>(team, extent1, extent2, ...), - [=] (int i1, int i2, ..., double& lsum) {...}, sum); - ``` - -ThreadVectorMDRange is a [nested execution policy](https://kokkos.github.io/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html?highlight=nested#nested-parallelism) -used inside of hierarchical parallelism. - -## Interface - ```c++ - template - struct ThreadVectorMDRange, TeamHandle> { - ThreadVectorMDRange(team, extent1, extent2, ..., extentN) { /* ... 
*/ - }; - ``` - -## Description - - * ```c++ - template - struct ThreadVectorMDRange, TeamHandle>; - ``` - Splits the index range `0` to `extent` over the vector lanes of the calling thread, - where extent is the backend dependent rank that will be vectorized - - * **Arguments** - * `team`: TeamHandle to the calling team execution context. - * `extent_i`: index range length of each rank. - - * **Requirements** - * `TeamHandle` is a type that models [TeamHandle](Kokkos%3A%3ATeamHandleConcept) - * extents are ints. - * This function can not be called inside a parallel operation dispatched using a - `TeamVectorRange` policy, `TeamVectorRange` policy, `TeamVectorMDRange` policy - or `ThreadVectorMDRange` policy. - * `N >= 2 && N <= 8` is true; - -## Examples - - ```c++ - using TeamHandle = TeamPolicy<>::member_type; - - parallel_for(TeamPolicy<>(N, Kokkos::AUTO), - KOKKOS_LAMBDA(TeamHandle const& team) { - int leagueRank = team.league_rank(); - - auto teamThreadRange = TeamThreadRange(team, n0); - auto threadVectorMDRange = - ThreadVectorMDRange, TeamHandle>( - team, n1, n2, n3); - - parallel_for(teamThreadRange, [=](int i0) { - parallel_for(threadVectorMDRange, [=](int i1, int i2, int i3) { - A(leagueRank, i0, i1, i2, i3) += B(leagueRank, i1) + C(i1, i2, i3); - }); - }); - - team.team_barrier(); - - int teamSum = 0; - - parallel_for(teamThreadRange, [=, &teamSum](int const& i0) { - int threadSum = 0; - parallel_reduce(threadVectorMDRange, - [=](int i1, int i2, int i3, int& vectorSum) { - vectorSum += D(leagueRank, i0, i1, i2, i3); - }, threadSum - ); - - teamSum += threadSum; - }); - }); - ``` - diff --git a/docs/source/API/core/policies/ThreadVectorMDRange.rst b/docs/source/API/core/policies/ThreadVectorMDRange.rst new file mode 100644 index 000000000..da95197d0 --- /dev/null +++ b/docs/source/API/core/policies/ThreadVectorMDRange.rst @@ -0,0 +1,85 @@ +``ThreadVectorMDRange`` +======================= + +.. 
role::cpp(code) + :language: cpp + +Header File: ```` + +Description +----------- + +ThreadVectorMDRange is a `nested execution policy <./NestedPolicies.html>`_ used inside of hierarchical parallelism. + +Interface +--------- + +.. cppkokkos:class:: template ThreadVectorMDRange + + .. rubric:: Constructor + + .. cppkokkos:function:: ThreadVectorMDRange(team, extent_1, extent_2, ...); + + Splits the index range ``0`` to ``extent`` over the vector lanes of the calling thread, + where ``extent`` is the backend-dependent rank that will be vectorized + + :param team: TeamHandle to the calling team execution context + + :param extent_1, extent_2, ...: index range lengths of each rank + + * **Requirements** + + * ``TeamHandle`` is a type that models `TeamHandle <./TeamHandleConcept.html>`_ + + * ``extent_1, extent_2, ...`` are ints + + * ``extent_i`` is such that ``i >= 2 && i <= 8`` is true. + For example: + + .. code-block:: cpp + + ThreadVectorMDRange(team, 4); // NOT OK, violates i>=2 + + ThreadVectorMDRange(team, 4,5); // OK + ThreadVectorMDRange(team, 4,5,6); // OK + ThreadVectorMDRange(team, 4,5,6,2,3,4,5,6); // OK, max num of extents allowed + + * The constructor cannot be called inside a parallel operation dispatched using a + ``TeamVectorRange`` policy, ``ThreadVectorRange`` policy, ``TeamVectorMDRange`` policy + or ``ThreadVectorMDRange`` policy. + +Examples +-------- + +.. 
code-block:: cpp + + using TeamHandle = TeamPolicy<>::member_type; + + parallel_for(TeamPolicy<>(N, Kokkos::AUTO), + KOKKOS_LAMBDA(TeamHandle const& team) { + int leagueRank = team.league_rank(); + + auto teamThreadRange = TeamThreadRange(team, n0); + auto threadVectorMDRange = + ThreadVectorMDRange, TeamHandle>( + team, n1, n2, n3); + + parallel_for(teamThreadRange, [=](int i0) { + parallel_for(threadVectorMDRange, [=](int i1, int i2, int i3) { + A(leagueRank, i0, i1, i2, i3) += B(leagueRank, i1) + C(i1, i2, i3); + }); + }); + team.team_barrier(); + + int teamSum = 0; + parallel_for(teamThreadRange, [=, &teamSum](int const& i0) { + int threadSum = 0; + parallel_reduce(threadVectorMDRange, + [=](int i1, int i2, int i3, int& vectorSum) { + vectorSum += D(leagueRank, i0, i1, i2, i3); + }, threadSum + ); + + teamSum += threadSum; + }); + }); diff --git a/docs/source/API/core/utilities/printf.rst b/docs/source/API/core/utilities/printf.rst new file mode 100644 index 000000000..0e60e5eec --- /dev/null +++ b/docs/source/API/core/utilities/printf.rst @@ -0,0 +1,16 @@ +``Kokkos::printf`` +================== + +.. role:: cppkokkos(code) + :language: cppkokkos + +Defined in header ```` + +.. code-block:: cpp + + template + KOKKOS_FUNCTION void printf(const char* format, Args... args); + +Prints the data specified in ``format`` and ``args...`` to ``stdout``. +The behavior is analogous to ``std::printf``, but the return type is ``void`` +to ensure a consistent behavior across backends. diff --git a/docs/source/API/core/view/Subview_type.md b/docs/source/API/core/view/Subview_type.md deleted file mode 100644 index 0f0dcf421..000000000 --- a/docs/source/API/core/view/Subview_type.md +++ /dev/null @@ -1,43 +0,0 @@ -# `Kokkos::Subview` - -Header File: `Kokkos_Core.hpp` - -Alias template to deduce the type that is returned by a call to the subview function with given arguments. 
- -Usage: -```c++ -Kokkos::Subview subView; -``` - -## Description - -```c++ -template -using Subview = IMPL_DETAIL; // deduce subview type from source view traits -``` -Type of a `Kokkos::View` viewing a subset of `ViewType` specified by `Args...`. -Same type as returned by a call to the subview function with corresponding arguments. -For restrictions on Args see [`Kokkos::subview`](Kokkos%3A%3Asubview) documentation. - -## Examples - -```c++ - -using view_type = Kokkos::View; -view_type a("A",N0,N1,N2); - -struct subViewHolder { -Kokkos::Subview, - int, - decltype(Kokkos::ALL), - int> s; -} subViewHolder; - -subViewHolder.s = Kokkos::subview(a, - std::pair(3,15), - 5, - Kokkos::ALL, - 3); - -``` diff --git a/docs/source/API/core/view/Subview_type.rst b/docs/source/API/core/view/Subview_type.rst new file mode 100644 index 000000000..d6f247229 --- /dev/null +++ b/docs/source/API/core/view/Subview_type.rst @@ -0,0 +1,60 @@ +``Kokkos::Subview`` +=================== + +.. role:: cppkokkos(code) + :language: cppkokkos + +.. _subviewfunc: subview.html + +.. |subviewfunc| replace:: ``Kokkos::subview()`` + +Header File: ``Kokkos_Core.hpp`` + +Description +----------- + +Alias template to deduce the type that is returned by a call to the |subviewfunc|_ function with given arguments. + +Interface +--------- + +.. code-block:: cpp + + template + using Subview = IMPL_DETAIL; // deduce subview type from source view traits + +Type of the result of ``Kokkos::subview(ViewType view_arg, Args... args)``. + +Requirements +------------ + +Requires: + +- ``ViewType`` is a specialization of ``Kokkos::View``. + +- ``Args...`` are slice specifiers as defined in |subviewfunc|_. + +- ``sizeof... (Args) == ViewType::rank()``. + + +Examples +-------- + +.. 
code-block:: cpp + + using view_type = Kokkos::View; + view_type a("A",N0,N1,N2); + + struct subViewHolder { + Kokkos::Subview, + int, + decltype(Kokkos::ALL), + int> s; + } subViewHolder; + + subViewHolder.s = Kokkos::subview(a, + std::pair(3,15), + 5, + Kokkos::ALL, + 3); diff --git a/docs/source/API/core/view/view.rst b/docs/source/API/core/view/view.rst index 71eb1ba7f..380fb71ac 100644 --- a/docs/source/API/core/view/view.rst +++ b/docs/source/API/core/view/view.rst @@ -25,15 +25,15 @@ Parameters .. _LayoutRight: layoutRight.html -.. |LayoutRight| replace:: :cppkokkos:func:`LayoutRight` +.. |LayoutRight| replace:: ``LayoutRight()`` .. _LayoutLeft: layoutLeft.html -.. |LayoutLeft| replace:: :cppkokkos:func:`LayoutLeft` +.. |LayoutLeft| replace:: ``LayoutLeft()`` .. _LayoutStride: layoutStride.html -.. |LayoutStride| replace:: :cppkokkos:func:`LayoutStride` +.. |LayoutStride| replace:: ``LayoutStride()`` Template parameters other than ``DataType`` are optional, but ordering is enforced. That means for example that ``LayoutType`` can be omitted but if both ``MemorySpace`` @@ -106,7 +106,7 @@ member function callable from host and device side. Users are encouraged to use ``rank()`` and ``rank_dynamic()`` (akin to a static member function call) instead of relying on implicit conversion to an integral type. -The actual type of ``rank[_dymanic]`` as it was defined until Kokkos 4.1 was left up to the implementation +The actual type of ``rank[_dynamic]`` as it was defined until Kokkos 4.1 was left up to the implementation (that is, up to the compiler not to Kokkos) but in practice it was often ``int`` which means this change may yield warnings about comparing signed and unsigned integral types. It may also break code that was using the type of ``View::rank``. @@ -519,7 +519,7 @@ In the following we use ``DstType`` and ``SrcType`` as the type of the destinati .. 
code-block:: cpp - ScrType src_view(...); + SrcType src_view(...); DstType dst_view(src_view); dst_view = src_view; diff --git a/docs/source/ProgrammingGuide/Atomic-Operations.md b/docs/source/ProgrammingGuide/Atomic-Operations.md index 6929d4217..ebfa22a97 100644 --- a/docs/source/ProgrammingGuide/Atomic-Operations.md +++ b/docs/source/ProgrammingGuide/Atomic-Operations.md @@ -82,7 +82,7 @@ void compute_force(View neighbours, View values) { } ``` -There are also atomic operations which return the old or the new value. They follow the [`atomic_fetch_[op]`](../API/core/atomics/atomic_fetch_op) and [`atomic_[op]_fetch`](../API/core/atomics/atomic_op_fetch.md) naming scheme. For example if one would want to find all the indices of negative values in an array and store them in a list this would be the algorithm: +There are also atomic operations which return the old or the new value. They follow the [`atomic_fetch_[op]`](../API/core/atomics/atomic_fetch_op) and [`atomic_[op]_fetch`](../API/core/atomics/atomic_op_fetch) naming scheme. For example if one would want to find all the indices of negative values in an array and store them in a list this would be the algorithm: ```c++ void find_indicies(View indicies, View values) { View count("Count"); diff --git a/docs/source/ProgrammingGuide/HierarchicalParallelism.md b/docs/source/ProgrammingGuide/HierarchicalParallelism.md index 60ddc8bb1..8607ed35d 100644 --- a/docs/source/ProgrammingGuide/HierarchicalParallelism.md +++ b/docs/source/ProgrammingGuide/HierarchicalParallelism.md @@ -263,7 +263,7 @@ The third pattern is [`parallel_scan()`](../API/core/parallel-dispatch/parallel_ #### 8.4.1.1 Team Barriers -In instances where one loop operation might need to be sequenced with a different loop operation, such as filling of arrays as a preparation stage for following computations on that data, it is important to be able to control threads in time; this can be done through the use of barriers. 
In nested loops, the outside loop ( [`TeamPolicy<> ()`](../API/core/policies/TeamPolicy) ) has a built-in (implicit) team barrier; inner loops ( [`TeamThreadRange ()`](../API/core/policies/TeamThreadRange.md) ) do not. This latter condition is often referred to as a 'non-blocking' condition. When necessary, an explicit barrier can be introduced to synchronize team threads; an example is shown in the previous example. +In instances where one loop operation might need to be sequenced with a different loop operation, such as filling of arrays as a preparation stage for following computations on that data, it is important to be able to control threads in time; this can be done through the use of barriers. In nested loops, the outside loop ( [`TeamPolicy<> ()`](../API/core/policies/TeamPolicy) ) has a built-in (implicit) team barrier; inner loops ( [`TeamThreadRange ()`](../API/core/policies/TeamThreadRange) ) do not. This latter condition is often referred to as a 'non-blocking' condition. When necessary, an explicit barrier can be introduced to synchronize team threads; an example is shown in the previous example. ### 8.4.2 Vector loops diff --git a/docs/source/ProgrammingGuide/Machine-Model.md b/docs/source/ProgrammingGuide/Machine-Model.md deleted file mode 100644 index bd1ec71a7..000000000 --- a/docs/source/ProgrammingGuide/Machine-Model.md +++ /dev/null @@ -1,77 +0,0 @@ -# 2. Machine Model - -After reading this chapter you will understand the abstract model of a parallel computing node which underlies the design choices and structure of the Kokkos framework. The machine model ensures the applications written using Kokkos will have portability across architectures while being performant on a range of hardware. - -The machine model has two important components: -* _Memory spaces_, in which data structures can be allocated -* _Execution spaces_, which execute parallel operations using data from one or more _memory spaces_. 
- -## 2.1 Motivations - -Kokkos is comprised of two orthogonal aspects. The first of these is an underlying -_abstract machine model_ which describes fundamental concepts required for the development of future portable and performant high performance computing applications; the second is a _concrete instantiation of the programming model_ written in C++, which allows programmers to write to the concept machine model. It is important to treat these two aspects of Kokkos as distinct entities because the underlying model being used by Kokkos could, in the future, be instantiated in additional languages beyond C++ yet the algorithmic specification would remain valid. - -### 2.1.1 Kokkos Abstract Machine Model -Kokkos assumes an _abstract machine model_ for the design of future shared-memory computing architectures. The model (shown in Figure 2.1) assumes that there may be multiple execution units in a compute node. For a more general discussion of abstract machine models for Exascale computing the reader should consult reference Ang1. In the figure shown here, we have elected to show two different types of compute units - one which represents multiple latency-optimized cores, similar to contemporary processor cores, and a second source of compute in the form of an off die accelerator. Of note is that the processor and accelerator each have distinct memories, each with unique performance properties, that may or may not be accessible across the node (i.e. the memory may be reachable or _shared_ by all execution units, but specific memory spaces may also be only accessible by specific execution units). The specific layout shown in Figure 2.1 is an instantiation of the Kokkos abstract machine model used to describe the potential for multiple types of compute engines and memories within a single node. 
In future systems, there may be a range of execution engines which are used in the node ranging from a single type of core, as in many/multicore processors found today, through to a range of execution units where many-core processors may be joined to numerous types of accelerator cores. In order to ensure portability to the potential range of nodes, an abstraction of the compute engines and available memories are required. -*** -1 Ang, J.A., et. al., **Abstract Machine Models and Proxy Architectures for Exascale Computing**, -2014, Sandia National Laboratories and Lawrence Berkeley National Laboratory, DOE Computer Architecture Laboratories Project -*** - -![node](figures/kokkos-node-doc.png) - -

Figure 2.1 Conceptual Model of a Future High Performance Computing Node

- -## 2.2 Kokkos Spaces -Kokkos uses the term _execution spaces_ to describe a logical grouping of computation units which share an identical set of performance properties. An execution space provides a set of parallel execution resources which can be utilized by the programmer using several types of fundamental parallel operation. For a list of the operations available see [Chapter 7 - Parallel dispatch](ParallelDispatch). The term _memory spaces_ is used to describe a logical distinct memory resource, which is available to allocate data. - -### 2.2.1 Execution Space Instances -An _instance_ of an execution space is a specific instantiation of an execution space to which a programmer can target parallel work. By means of example, an execution space might be used to describe a multicore processor. In this example, the execution space contains several homogeneous cores which share some logical grouping. In a program written to the Kokkos model, an instance of this execution space would be made available on which parallel kernels could be executed. As a second example, if we were to add a GPU to the multicore processor so a second execution space type is available in the system, the application programmer would then have two execution space instances available to select from. The important consideration here is that the method of compiling code for different execution spaces and the dispatch of kernels to instances is abstracted by the Kokkos model. This allows application programmers to be free from writing algorithms in hardware specific languages. - -![execution-space](figures/kokkos-execution-space-doc.png) - -

Figure 2.2 Example Execution Spaces in a Future Computing Node

- -### 2.2.2 Kokkos Memory Spaces -The multiple types of memory which will become available in future computing nodes are abstracted by Kokkos through _memory spaces_. Each memory space provides a finite storage capacity at which data structures can be allocated and accessed. Different memory space types have different characteristics with respect to accessibility from execution spaces as well as their performance characteristics. - -### 2.2.3 Instances of Kokkos Memory Spaces -In much the same way execution spaces have specific instantiations through the availability of an _instance_ so do memory spaces. An instance of a memory space provides a concrete method for the application programmer to request data storage allocations. Returning to the examples provided for execution spaces, the multicore processor may have multiple memory spaces available including on-package memory, slower DRAM and additional sets of non-volatile memories. The GPU may also provide an additional memory space through its local on-package memory. The programmer is free to decide where each data structure may be allocated by requesting these from the specific instance associated with that memory space. Kokkos provides the appropriate abstraction of the allocation routines and any associated data management operations including releasing the memory, returning it for future use, as well as for copy operations. - -![memory-space](figures/kokkos-memory-space-doc.png) - -

Figure 2.3 Example Memory Spaces in a Future Computing Node

- -**Atomic accesses to Memory in Kokkos** In cases where multiple executing threads attempt to read a memory address, complete a computation on the item, and write it back to same address in memory, an ordering collision may occur. These situations, known as _race conditions_ (because the data value stored in memory as the threads complete is dependent on which thread completes its memory operation last), are often the cause of non-determinism in parallel programs. A number of methods can be employed to ensure that race conditions do not occur in parallel programs including the use of locks (which allow only a single thread to gain access to data structure at a time), critical regions (which allow only one thread to execute a code sequence at any point in time) and _atomic_ operations. Memory operations which are atomic guarantee that a read, simple computation, and write to memory are completed as a single unit. This might allow application programmers to safely increment a memory value for instance, or more commonly, to safely accumulate values from multiple threads into a single memory location. - -**Memory Consistency in Kokkos** Memory consistency models are a complex topic in and of themselves and usually rely on complex operations associated with hardware caches or memory access coherency (for more information see Hennessy and Paterson2). Kokkos does not _require_ caches to be present in hardware and so assumes an extremely weak memory consistency model. In the Kokkos model, the programmer should not assume any specific ordering of memory operations being issued by a kernel. This has the potential to create race conditions between memory operations if these are not appropriately protected. In order to provide a guarantee that memory operations are completed, Kokkos provides a _fence_ operation which forces the compute engine to complete all outstanding memory operations before any new ones can be issued. 
With appropriate use of fences, programmers are thereby able to ensure that guarantees can be made as to when data will _definitely_ have been written to memory. - -*** -2 Hennessy J.L. and Paterson D.A., **Computer Architecture, Fifth Edition: A Quantitative Approach**, Morgan Kaufmann, 2011. -*** - - -## 2.3 Program execution - -It is tempting to try to define formally what it means for a processor to execute code. None of us authors have a background in logic or what computer scientists call "formal methods," so our attempt might not go very far! We will stick with informal definitions and rely on Kokkos' C++ implementation as an existence proof that the definitions make sense. - -Kokkos lets users tell execution spaces to execute parallel operations. These include parallel for, reduce, and scan (see [Chapter 7 - Parallel dispatch](ParallelDispatch)) as well as [View allocation](View) and [Initialization](Initialization). We name the class of all such operations _parallel dispatch_. - -From our perspective, there are three kinds of code: - -1. Code executing inside of a Kokkos parallel operation -1. Code outside of a Kokkos parallel operation that asks Kokkos to do something (e.g., parallel dispatch itself) -1. Code that has nothing to do with Kokkos - -The first category is the most restrictive. [Section 8.2](HP_thread_teams) explains restrictions on inter-team synchronization. In general, we limit the ability of Kokkos-parallel code to invoke Kokkos operations (other than for nested parallelism; see [Chapter 8 - Hierarchical Parallelism](HierarchicalParallelism) and especially [Section 8.2](HP_thread_teams)). We also forbid dynamic memory allocation (other than from the team's scratch pad) in parallel operations. Whether Kokkos-parallel code may invoke operating system routines or third-party libraries depends on the execution and memory spaces being used. 
Regardless, restrictions on inter-team synchronization have implications for things like filesystem access. - -_Kokkos threads are for computing in parallel_, not for overlapping I/O and computation, and not for making graphical user interfaces responsive. Use other kinds of threads (e.g., operating system threads) for the latter two purposes. You may be able to mix Kokkos' parallelism with other kinds of threads; see [Section 2.3.1](MM_thread_safety). Kokkos' developers are also working on a task parallelism model that will work with Kokkos' existing data-parallel constructs. - -**Reproducible reductions and scans** Kokkos promises _nothing_ about the order in which the iterations of a parallel loop occur. However, it _does_ promise that if you execute the same parallel reduction or scan, using the same hardware resources and run-time settings, then you will get the same results each time you run the operation. "Same results" even means "with respect to floating-point rounding error." - -**Asynchronous parallel dispatch** This concerns the second category of code that calls Kokkos operations. In Kokkos, parallel dispatch executes _asynchronously_. This means that it may return "early," before it has actually completed. Nevertheless, it executes _in sequence_ with respect to other Kokkos operations on the same execution or memory space. This matters for things like timing. For example, a [`parallel_for()`](../API/core/parallel-dispatch/parallel_for) may return "right away," so if you want to measure how long it takes, you must first call [`fence()`](../API/core/parallel-dispatch/fence) on that execution space. This forces all functors to complete before [`fence()`](../API/core/parallel-dispatch/fence) returns. - -(MM_thread_safety)= -### 2.3.1 Thread safety? - -Users may wonder about "thread safety," that is, whether multiple operating system threads may safely call into Kokkos concurrently. 
Kokkos' thread safety depends on both its implementation and on the execution and memory spaces that the implementation uses. The C++ implementation has made great progress towards (non-Kokkos) thread safety of View memory management. For now, however, the most portable approach is for only one (non-Kokkos) thread of execution to control Kokkos. Also, be aware that operating system threads might interfere with Kokkos' performance depending on the execution space that you use. diff --git a/docs/source/ProgrammingGuide/Machine-Model.rst b/docs/source/ProgrammingGuide/Machine-Model.rst new file mode 100644 index 000000000..10389d958 --- /dev/null +++ b/docs/source/ProgrammingGuide/Machine-Model.rst @@ -0,0 +1,134 @@ +2. Machine Model +================ + +.. role:: cppkokkos(code) + :language: cppkokkos + +.. |node| image:: figures/kokkos-node-doc.png + :alt: Figure 2.1 Conceptual Model of a Future High Performance Computing Node + +.. _Chap7ParallelDispatch: ParallelDispatch.html +.. |Chap7ParallelDispatch| replace:: Chapter 7 - Parallel dispatch + +.. |execution-space| image:: figures/kokkos-execution-space-doc.png + :alt: Figure 2.2 Example Execution Spaces in a Future Computing Node + +.. |memory-space| image:: figures/kokkos-memory-space-doc.png + :alt: Figure 2.3 Example Memory Spaces in a Future Computing Node + +.. _ViewAllocation: View.html +.. |ViewAllocation| replace:: View allocation + +.. _Initialization: Initialization.html +.. |Initialization| replace:: Initialization + +.. _Section82: HierarchicalParallelism.html#hp-thread-teams +.. |Section82| replace:: Section 8.2 + +.. _Chap8HierarchicalParallelism: HierarchicalParallelism.html +.. |Chap8HierarchicalParallelism| replace:: Chapter 8 - Hierarchical Parallelism + +.. _Section231: Machine-Model.html#thread-safety +.. |Section231| replace:: Section 2.3.1 + +.. _ParallelFor: ../API/core/parallel-dispatch/parallel_for.html +.. |ParallelFor| replace:: ``parallel_for()`` + +.. 
_Fence: ../API/core/parallel-dispatch/fence.html +.. |Fence| replace:: ``fence()`` + +After reading this chapter you will understand the abstract model of a parallel computing node which underlies the design choices and structure of the Kokkos framework. The machine model ensures the applications written using Kokkos will have portability across architectures while being performant on a range of hardware. + +The machine model has two important components: + +* *Memory spaces*, in which data structures can be allocated +* *Execution spaces*, which execute parallel operations using data from one or more *memory spaces*. + +2.1 Motivations +--------------- + +Kokkos is comprised of two orthogonal aspects. The first of these is an underlying +*abstract machine model* which describes fundamental concepts required for the development of future portable and performant high performance computing applications; the second is a *concrete instantiation of the programming model* written in C++, which allows programmers to write to the concept machine model. It is important to treat these two aspects of Kokkos as distinct entities because the underlying model being used by Kokkos could, in the future, be instantiated in additional languages beyond C++ yet the algorithmic specification would remain valid. + +2.1.1 Kokkos Abstract Machine Model +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Kokkos assumes an *abstract machine model* for the design of future shared-memory computing architectures. The model (shown in Figure 2.1) assumes that there may be multiple execution units in a compute node. For a more general discussion of abstract machine models for Exascale computing the reader should consult reference Ang\ :sup:`1`. In the figure shown here, we have elected to show two different types of compute units - one which represents multiple latency-optimized cores, similar to contemporary processor cores, and a second source of compute in the form of an off die accelerator. 
Of note is that the processor and accelerator each have distinct memories, each with unique performance properties, that may or may not be accessible across the node (i.e. the memory may be reachable or *shared* by all execution units, but specific memory spaces may also be only accessible by specific execution units). The specific layout shown in Figure 2.1 is an instantiation of the Kokkos abstract machine model used to describe the potential for multiple types of compute engines and memories within a single node. In future systems, there may be a range of execution engines which are used in the node ranging from a single type of core, as in many/multicore processors found today, through to a range of execution units where many-core processors may be joined to numerous types of accelerator cores. In order to ensure portability to the potential range of nodes, an abstraction of the compute engines and available memories is required. + +----- + +:sup:`1` Ang, J.A., et al., **Abstract Machine Models and Proxy Architectures for Exascale Computing**, +2014, Sandia National Laboratories and Lawrence Berkeley National Laboratory, DOE Computer Architecture Laboratories Project + +----- + +|node| + +Figure 2.1 Conceptual Model of a Future High Performance Computing Node +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +2.2 Kokkos Spaces +----------------- + +Kokkos uses the term *execution spaces* to describe a logical grouping of computation units which share an identical set of performance properties. An execution space provides a set of parallel execution resources which can be utilized by the programmer using several types of fundamental parallel operation. For a list of the operations available see |Chap7ParallelDispatch|_. The term *memory spaces* is used to describe a logically distinct memory resource, which is available to allocate data.
+ +2.2.1 Execution Space Instances +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +An *instance* of an execution space is a specific instantiation of an execution space to which a programmer can target parallel work. By means of example, an execution space might be used to describe a multicore processor. In this example, the execution space contains several homogeneous cores which share some logical grouping. In a program written to the Kokkos model, an instance of this execution space would be made available on which parallel kernels could be executed. As a second example, if we were to add a GPU to the multicore processor so a second execution space type is available in the system, the application programmer would then have two execution space instances available to select from. The important consideration here is that the method of compiling code for different execution spaces and the dispatch of kernels to instances is abstracted by the Kokkos model. This allows application programmers to be free from writing algorithms in hardware specific languages. + +|execution-space| + +Figure 2.2 Example Execution Spaces in a Future Computing Node +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +2.2.2 Kokkos Memory Spaces +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The multiple types of memory which will become available in future computing nodes are abstracted by Kokkos through *memory spaces*. Each memory space provides a finite storage capacity at which data structures can be allocated and accessed. Different memory space types have different characteristics with respect to accessibility from execution spaces as well as their performance characteristics. + +2.2.3 Instances of Kokkos Memory Spaces +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In much the same way execution spaces have specific instantiations through the availability of an *instance* so do memory spaces. 
An instance of a memory space provides a concrete method for the application programmer to request data storage allocations. Returning to the examples provided for execution spaces, the multicore processor may have multiple memory spaces available including on-package memory, slower DRAM and additional sets of non-volatile memories. The GPU may also provide an additional memory space through its local on-package memory. The programmer is free to decide where each data structure may be allocated by requesting these from the specific instance associated with that memory space. Kokkos provides the appropriate abstraction of the allocation routines and any associated data management operations including releasing the memory, returning it for future use, as well as for copy operations. + +|memory-space| + +Figure 2.3 Example Memory Spaces in a Future Computing Node +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +**Atomic accesses to Memory in Kokkos** In cases where multiple executing threads attempt to read a memory address, complete a computation on the item, and write it back to same address in memory, an ordering collision may occur. These situations, known as *race conditions* (because the data value stored in memory as the threads complete is dependent on which thread completes its memory operation last), are often the cause of non-determinism in parallel programs. A number of methods can be employed to ensure that race conditions do not occur in parallel programs including the use of locks (which allow only a single thread to gain access to data structure at a time), critical regions (which allow only one thread to execute a code sequence at any point in time) and *atomic* operations. Memory operations which are atomic guarantee that a read, simple computation, and write to memory are completed as a single unit. 
This might allow application programmers to safely increment a memory value for instance, or more commonly, to safely accumulate values from multiple threads into a single memory location. + +**Memory Consistency in Kokkos** Memory consistency models are a complex topic in and of themselves and usually rely on complex operations associated with hardware caches or memory access coherency (for more information see Hennessy and Patterson\ :sup:`2`). Kokkos does not *require* caches to be present in hardware and so assumes an extremely weak memory consistency model. In the Kokkos model, the programmer should not assume any specific ordering of memory operations being issued by a kernel. This has the potential to create race conditions between memory operations if these are not appropriately protected. In order to provide a guarantee that memory operations are completed, Kokkos provides a *fence* operation which forces the compute engine to complete all outstanding memory operations before any new ones can be issued. With appropriate use of fences, programmers are thereby able to ensure that guarantees can be made as to when data will *definitely* have been written to memory. + +----- + +:sup:`2` Hennessy J.L. and Patterson D.A., **Computer Architecture, Fifth Edition: A Quantitative Approach**, Morgan Kaufmann, 2011. + +----- + +2.3 Program execution +--------------------- + +It is tempting to try to define formally what it means for a processor to execute code. None of us authors have a background in logic or what computer scientists call "formal methods," so our attempt might not go very far! We will stick with informal definitions and rely on Kokkos' C++ implementation as an existence proof that the definitions make sense. + +Kokkos lets users tell execution spaces to execute parallel operations. These include parallel for, reduce, and scan (see |Chap7ParallelDispatch|_) as well as |ViewAllocation|_ and |Initialization|_.
We name the class of all such operations *parallel dispatch*. + +From our perspective, there are three kinds of code: + +#. Code executing inside of a Kokkos parallel operation +#. Code outside of a Kokkos parallel operation that asks Kokkos to do something (e.g., parallel dispatch itself) +#. Code that has nothing to do with Kokkos + +The first category is the most restrictive. |Section82|_ explains restrictions on inter-team synchronization. In general, we limit the ability of Kokkos-parallel code to invoke Kokkos operations (other than for nested parallelism; see |Chap8HierarchicalParallelism|_ and especially |Section82|_). We also forbid dynamic memory allocation (other than from the team's scratch pad) in parallel operations. Whether Kokkos-parallel code may invoke operating system routines or third-party libraries depends on the execution and memory spaces being used. Regardless, restrictions on inter-team synchronization have implications for things like filesystem access. + +*Kokkos threads are for computing in parallel*, not for overlapping I/O and computation, and not for making graphical user interfaces responsive. Use other kinds of threads (e.g., operating system threads) for the latter two purposes. You may be able to mix Kokkos' parallelism with other kinds of threads; see |Section231|_. Kokkos' developers are also working on a task parallelism model that will work with Kokkos' existing data-parallel constructs. + +**Reproducible reductions and scans** Kokkos promises *nothing* about the order in which the iterations of a parallel loop occur. However, it *does* promise that if you execute the same parallel reduction or scan, using the same hardware resources and run-time settings, then you will get the same results each time you run the operation. "Same results" even means "with respect to floating-point rounding error." + +**Asynchronous parallel dispatch** This concerns the second category of code that calls Kokkos operations. 
In Kokkos, parallel dispatch executes *asynchronously*. This means that it may return "early," before it has actually completed. Nevertheless, it executes *in sequence* with respect to other Kokkos operations on the same execution or memory space. This matters for things like timing. For example, a |ParallelFor|_ may return "right away," so if you want to measure how long it takes, you must first call |Fence|_ on that execution space. This forces all functors to complete before |Fence|_ returns. + +2.3.1 Thread safety? +~~~~~~~~~~~~~~~~~~~~ + +Users may wonder about "thread safety," that is, whether multiple operating system threads may safely call into Kokkos concurrently. Kokkos' thread safety depends on both its implementation and on the execution and memory spaces that the implementation uses. The C++ implementation has made great progress towards (non-Kokkos) thread safety of View memory management. For now, however, the most portable approach is for only one (non-Kokkos) thread of execution to control Kokkos. Also, be aware that operating system threads might interfere with Kokkos' performance depending on the execution space that you use. diff --git a/docs/source/contributing.rst b/docs/source/contributing.rst index d3262880a..a58b7d00b 100644 --- a/docs/source/contributing.rst +++ b/docs/source/contributing.rst @@ -1,6 +1,12 @@ Contributing ============ +.. toctree:: + :maxdepth: 1 + :hidden: + + templates/index + We are open and try to encourage contributions from external developers. To do so please first open an issue describing the contribution and then issue a pull request against the develop branch. @@ -13,3 +19,10 @@ not just for public purposes but also for closed source commercial projects. For specifics see the `LICENSE `__. Open an issue/feature req. `ISSUES `_ + +Contributing Documentation +-------------------------- + +Please see the `README `_ for general instructions on building the documentation. 
+ +To make it easier to contribute API documentation, we have a page of documentation templates :doc:`here ` diff --git a/docs/source/keywords.rst b/docs/source/keywords.rst index 7dd1ef5e8..e6635449b 100644 --- a/docs/source/keywords.rst +++ b/docs/source/keywords.rst @@ -253,6 +253,26 @@ Architecture Keywords * Optimize for the NVIDIA Ada generation CC 8.9 :sup:`since Kokkos 4.1` * ``OFF`` + * * ``Kokkos_ARCH_AMD_GFX906`` + * Optimize for AMD GPU MI50/MI60 GFX906 :sup:`since Kokkos 4.2` + * ``OFF`` + + * * ``Kokkos_ARCH_AMD_GFX908`` + * Optimize for AMD GPU MI100 GFX908 :sup:`since Kokkos 4.2` + * ``OFF`` + + * * ``Kokkos_ARCH_AMD_GFX90A`` + * Optimize for AMD GPU MI200 series GFX90A :sup:`since Kokkos 4.2` + * ``OFF`` + + * * ``Kokkos_ARCH_AMD_GFX1030`` + * Optimize for AMD GPU V620/W6800 GFX1030 :sup:`since Kokkos 4.2` + * ``OFF`` + + * * ``Kokkos_ARCH_AMD_GFX1100`` + * Optimize for AMD GPU 7900xt GFX1100 :sup:`since Kokkos 4.2` + * ``OFF`` + * * ``Kokkos_ARCH_AMPERE80`` * Optimize for the NVIDIA Ampere generation CC 8.0 * ``OFF`` @@ -278,11 +298,7 @@ Architecture Keywords * ``OFF`` * * ``Kokkos_ARCH_ARMV8_THUNDERX2`` - * Optimize for the ARMV8_TX2 architecture - * ``OFF`` - - * * ``Kokkos_ARCH_ARMV8_TX2`` - * Optimize for ARMV8_TX2 architecture + * Optimize for the ARMV8_THUNDERX2 architecture * ``OFF`` * * ``Kokkos_ARCH_BDW`` @@ -365,8 +381,8 @@ Architecture Keywords * Optimize for MAXWELL53 architecture * ``OFF`` - * * ``Kokkos_ARCH_NAVI1030`` :red:`[Since 4.0]` - * Optimize for AMD GPU V620/W6800 GFX1030 + * * ``Kokkos_ARCH_NAVI1030`` + * Optimize for AMD GPU V620/W6800 GFX1030 :sup:`since Kokkos 4.0` (Prefer ``Kokkos_ARCH_AMD_GFX1030``) * ``OFF`` * * ``Kokkos_ARCH_PASCAL60`` @@ -405,20 +421,20 @@ Architecture Keywords * Optimize for TURING75 architecture * ``OFF`` - * * ``Kokkos_ARCH_VEGA900`` :red:`[Removed in 4.0]` - * Optimize for AMD GPU MI25 GFX900 + * * ``Kokkos_ARCH_VEGA900`` + * Optimize for AMD GPU MI25 GFX900 :sup:`removed in 4.0` * ``OFF`` * * 
``Kokkos_ARCH_VEGA906`` - * Optimize for AMD GPU MI50/MI60 GFX906 + * Optimize for AMD GPU MI50/MI60 GFX906 (Prefer ``Kokkos_ARCH_AMD_GFX906``) * ``OFF`` * * ``Kokkos_ARCH_VEGA908`` - * Optimize for AMD GPU MI100 GFX908 + * Optimize for AMD GPU MI100 GFX908 (Prefer ``Kokkos_ARCH_AMD_GFX908``) * ``OFF`` * * ``Kokkos_ARCH_VEGA90A`` - * Optimize for AMD GPU MI200 series GFX90A + * Optimize for AMD GPU MI200 series GFX90A (Prefer ``Kokkos_ARCH_AMD_GFX90A``) * ``OFF`` * * ``Kokkos_ARCH_VOLTA70`` @@ -450,3 +466,34 @@ request Ahead-Of-Time compilation. Just-In-Time compilation means that the compi are actually executed and only at that point the architecture to compile for is determined. On the other hand, Ahead-Of-Time compilation describes the standard model where the compiler is only invoked once to create the binary and the architecture to compile for is determined before the program is run. + +.. _kweyword_amd: + +AMD Architectures +================= + +.. list-table:: + :widths: 65 35 + :header-rows: 1 + :align: left + + * - AMD GPU + - Kokkos ARCH + + * * ``7900xt`` + * AMD_GFX1100 + + * * ``MI50/MI60`` + * AMD_GFX906 + + * * ``MI100`` + * AMD_GFX908 + + * * ``MI200`` series: ``MI210``, ``MI250``, ``MI250X`` + * AMD_GFX90A + + * * ``V620`` + * AMD_GFX1030 + + * * ``W6800`` + * AMD_GFX1030 diff --git a/docs/source/requirements.rst b/docs/source/requirements.rst index deaa7b402..308fcc39f 100644 --- a/docs/source/requirements.rst +++ b/docs/source/requirements.rst @@ -32,7 +32,7 @@ Kokkos 4.x * * Clang (CUDA) * 10.0.0 - * 10.0.0, 14.0.0 + * 12.0.0, 14.0.0 * * AppleClang * 8.0 diff --git a/docs/source/templates/class_api.rst b/docs/source/templates/class_api.rst new file mode 100644 index 000000000..399947d70 --- /dev/null +++ b/docs/source/templates/class_api.rst @@ -0,0 +1,197 @@ +.. 
+ Use the following convention for headings: + + # with overline, for parts (collections of chapters) + + * with overline, for chapters + + = for sections + + - for subsections + + ^ for subsubsections + + " for paragraphs + +.. + (Class / method / container name) + for free functions that are callable, preserve the naming convention, `view_alloc()` + +``CoolerView`` +============== + +.. role:: cppkokkos(code) + :language: cppkokkos + +.. + The (public header) file the user will include in their code + +Header File: ``Kokkos_Core.hpp`` + +.. + High-level, human-language summary of what the thing does, and if possible, brief statement about why it exists (2 - 3 sentences, max). + +Description +----------- + +.. + The API of the entity. + +Interface +--------- + +.. + The declaration or signature of the entity. + +.. cppkokkos:class:: template CoolerView + + .. + Template parameters (if applicable) + Omit template parameters that are just used for specialization, are deduced, and/or should not be exposed to the user. + + .. rubric:: Template Parameters + + :tparam Foo: Description of the Foo template parameter + + .. + Parameters (if applicable) + + .. rubric:: Parameters + + :param bar: Description of the bar parameter + + .. rubric:: Public Types + + .. cppkokkos:type:: data_type + + Some interesting description of the type and how to use it. + + .. rubric:: Static Public Member Variables + + .. cppkokkos:member:: int some_var = 5; + + Description of some_var + + .. + If you have related info + + .. seealso:: + + .. + We can cross-reference entities + + The :cppkokkos:func:`frobrnicator` free function. + + .. rubric:: Constructors + + .. cppkokkos:function:: CoolerView(CoolerView&& rhs) + + Whether it's a move/copy/default constructor. Describe what it does. + + .. + Only include the destructor if it does something interesting as part of the API, such as RAII classes that release a resource on their destructor.
Classes that merely + clean up or destroy their members don't need this member documented. + + .. rubric:: Destructor + + .. cppkokkos:function:: ~CoolerView() + + Performs some special operation when destroyed. + + .. rubric:: Public Member Functions + + .. cppkokkos:function:: template foo(U x) + + Brief description of the function. + + :tparam U: Description of U + + :param x: Description of x + + .. + Describe any API changes between versions. + + .. versionchanged:: 3.7.1 + + What changed between versions: e.g. only takes one parameter for foo-style operations instead of two. + + .. + Use the C++ syntax for deprecation (don't use the Kokkos deprecated macro) as Sphinx will recognize it. We may in the future + add extra parsing after the HTML is generated to render this more nicely. + + .. cppkokkos:type:: [[deprecated("in version 4.0.1")]] foobar + + Represents the foobar capability. + + .. deprecated:: 4.0.1 + + Use :cppkokkos:type:`foobat` instead. + + .. cppkokkos:type:: foobat + + A better version of foobar. + + .. versionadded:: 4.0.1 + + +Non-Member Functions +-------------------- + +.. + These should only be listed here if they are closely related. E.g. friend operators. However, + something like view_alloc shouldn't be here for view + +..
cppkokkos:function:: template bool operator==(CoolerView, ViewSrc); + + :tparam ViewDst: the other + + :return: true if :cppkokkos:type:`View::value_type`, :cppkokkos:type:`View::array_layout`, :cppkokkos:type:`View::memory_space`, :cppkokkos:member:`View::rank`, :cppkokkos:func:`View::data()` and :cppkokkos:expr:`View::extent(r)`, for :cppkokkos:expr:`0<=r + #include + + int main(int argc, char* argv[]) { + Kokkos::initialize(argc,argv); + + int N0 = atoi(argv[1]); + int N1 = atoi(argv[2]); + + Kokkos::View a("A",N0); + Kokkos::View b("B",N1); + + Kokkos::parallel_for("InitA", N0, KOKKOS_LAMBDA (const int& i) { + a(i) = i; + }); + + Kokkos::parallel_for("InitB", N1, KOKKOS_LAMBDA (const int& i) { + b(i) = i; + }); + + Kokkos::View c("C",N0,N1); + { + Kokkos::View const_a(a); + Kokkos::View const_b(b); + Kokkos::parallel_for("SetC", Kokkos::MDRangePolicy>({0,0},{N0,N1}), + KOKKOS_LAMBDA (const int& i0, const int& i1) { + c(i0,i1) = a(i0) * b(i1); + }); + } + + Kokkos::finalize(); + } diff --git a/docs/source/templates/index.rst b/docs/source/templates/index.rst new file mode 100644 index 000000000..94f6213e9 --- /dev/null +++ b/docs/source/templates/index.rst @@ -0,0 +1,10 @@ +Documentation Templates +======================= + +The following documents are templates that may be useful for documenting new API members. For each template, you can see the source by clicking +on the "Edit this page" (pencil button) on the top right corner. + +.. toctree:: + :maxdepth: 1 + + Class API Template diff --git a/docs/source/testing-and-issue-tracking.rst b/docs/source/testing-and-issue-tracking.rst index bc684fe45..48491f573 100644 --- a/docs/source/testing-and-issue-tracking.rst +++ b/docs/source/testing-and-issue-tracking.rst @@ -1,9 +1,10 @@ -Testing and Issue Tracking -########################## +Kokkos Planning and Testing +########################### .. 
toctree:: :maxdepth: 1 + ./testing-and-issue-tracking/Kokkos-Project-Planning ./testing-and-issue-tracking/Requirements-Issues-and-Feedback ./testing-and-issue-tracking/Testing-Process-Details ./testing-and-issue-tracking/Testing-Processes diff --git a/docs/source/testing-and-issue-tracking/Kokkos-Project-Planning.md b/docs/source/testing-and-issue-tracking/Kokkos-Project-Planning.md new file mode 100644 index 000000000..d8da422d8 --- /dev/null +++ b/docs/source/testing-and-issue-tracking/Kokkos-Project-Planning.md @@ -0,0 +1,210 @@ +# Kokkos Project Planning + +## Requirements Gathering + +There are four requirement categories for the Kokkos Core project: + +- provide a stable, well-tested API avoiding breakage +- support all relevant compute platforms, at the time of their fielding +- provide programming model features enabling performance portability +- enable an on-ramp to future ISO C++ features + +A separate overarching requirement is the stability of the Kokkos API. + +All related specific actionable tasks are recorded and tracked in GitHub issues and pull requests. + +### Kokkos API Stability + +Robustness and API stability are ensured through test-driven development and an +explicit deprecation and removal process of existing features. + +If an existing capability is determined to be outdated, or not useful anymore the Kokkos +team will deprecate the feature and thus mark it for removal in the next major release (occurring every three years). +A Kokkos configure option furthermore allows the de-facto removal of deprecated features, enabling +customers to test whether they rely on them. + +The deprecation-removal cycle provides warnings for a minimum of 6 months to users. +During the deprecation phase, customer feedback allows for a revision of the deprecation decision. 
+ +#### Activities to support this requirement: + +- provide complete testing for existing features, to ensure no accidental breakage +- evaluate features for continued usefulness and fundamental defects +- tag features as deprecated, when put on the deprecation/removal path +- remove deprecated features only at major release version change + +### Platform Support + +The primary requirement for the Kokkos project is to provide a robust performance-portability solution +for current and upcoming computing platforms. +The goal is to enable a seamless transition of codes between systems and avoid situations where existing +Kokkos-based codes cannot leverage a desired computing platform. + +In order to meet this requirement, the Kokkos team has to anticipate new hardware platforms, before these +are field-tested by customers. +The Kokkos project also needs to verify functionality with updated software stacks (compilers, runtime libraries) +on platforms as soon as they become available to the Kokkos team (ideally before deployment on customer platforms). + +Thus, the Kokkos team must engage with hardware vendors in co-design efforts both independently and in conjunction +with system procurement efforts of funding agencies. + +#### Activities to support this requirement: + +- participate in facility system procurement efforts +- monitor system software stack releases from vendors (AMD, Intel, NVIDIA, HPE) +- engage vendors to enable testing of Kokkos with pre-release software development kits +- procure new test systems where necessary +- update testing processes to account for new software stacks + +### Programming Model Capabilities + +Requirements for the Kokkos project are gathered from both customers, and research efforts conducted by Kokkos team members. + +Customer requirements are gathered via the Kokkos Slack channel, GitHub issues, Hackathons, and at user group meetings. 
+Kokkos team members assigned to a feature request will gather details of the use case and perform an initial evaluation +of the feature's general applicability. +The findings will be reported and discussed at the Kokkos developer meeting, enabling a decision on whether the feature +will be included in the roadmap. +Feature discussions will be recorded and tracked in public GitHub issues. + +New capability requirements by Kokkos team members are developed in separate research efforts, which explore functionality, +use cases, and general applicability. +They are then presented to the entire Kokkos team and discussed for inclusion in the main project. +These discussions lead to a decision on where a feature should go, whether it is important enough to be included in the primary core package, +or whether it should live as a separate library in its own repository under the Kokkos GitHub organization. + +#### Activities to support this requirement: + +- monitor Slack channel and GitHub issues for new feature requests +- participate in Hackathons organized by the HPC community +- organize bi-annual user group meeting +- discuss proposed features at developer meeting for inclusion into roadmap + +### ISO C++ Compatibility + +A third requirement for Kokkos is to provide an on-ramp for future ISO C++ standards, as well as influence where the standard goes. +This requirement serves the long-term sustainability goals of Kokkos by enabling the inclusion of Kokkos capabilities into ISO C++ and +thus sharing the maintenance burden with the entire C++ implementer community in the long run. + +To enable the on-ramp, Kokkos will provide backports of ISO C++ features to prior C++ standards, where appropriate and desired. +Kokkos will also provide extensions of ISO C++ features that work on GPUs, something which is not available by default. + +Kokkos features that have proven themselves and are of interest to a wide audience are evaluated for possible inclusion in the ISO C++ standard.
+The Kokkos team will write proposals for the ISO C++ committee when appropriate. + +If a feature is included in the ISO C++ standard, the Kokkos team will make the API variants provided in the future C++ standard +available on currently Kokkos-supported software stacks to the greatest extent possible. + +#### Activities to support this requirement: + +- participate in ISO C++ committee meetings +- monitor requests for ISO C++ features to be provided by Kokkos +- write proposals for ISO C++ for mature Kokkos features with wide applicability +- backport relevant future ISO C++ features to standards supported by Kokkos + +## Release Planning + +Kokkos releases are based on the "catch the train" model - i.e. the primary goal is to have regular releases, +not a specific feature list for each release. + +Major releases happen every three years, minor releases are aimed at every 3-4 months, with additional patch releases as necessary. + +The primary difference between a major and a minor release is that deprecated features are only removed at major releases, and +major releases come with a bump in minimal compiler version requirements and an updated minimum ISO C++ standard version. +Other than that, there is no difference in the planning and execution of major and minor releases. + +In contrast to major and minor releases, patch releases generally only contain bug fixes and no new capabilities. + +At the beginning of a release cycle, the Kokkos Core leadership will determine high-priority thrusts for the release cycle. +Furthermore, each team member will make a list of their personal priorities for the release cycle. +The priorities are discussed and refined at the Kokkos developer meeting and collected in internal documents. + +Issues for each item are assigned to the [Kokkos project plan](https://github.com/orgs/kokkos/projects/1) including team member assignments. 
+ +The [Kokkos project plan](https://github.com/orgs/kokkos/projects/1) assigns issues to one of 7 categories: + +- *Unassigned:* issues that aren't assigned yet to team members. +- *Unassigned - Priority:* issues that aren't assigned yet to team members, but are high priority. These should be assigned at the next weekly developer meeting. +- *To Do:* Issue was assigned to a team member but is not yet actively worked on. +- *To Do - Priority:* Issue was assigned to a team member, but is not yet actively worked on. It is expected to be the next item in the queue of the assigned developer. If this item does not transition to *In Progress* by the next developer meeting, reassignment is considered. +- *In Progress:* Issue is getting worked on. +- *In Progress - Priority:* Issue is getting worked on. Code reviews for this issue are considered a priority, in order to get this resolved as soon as possible. +- *Done:* Issue is addressed via merged pull request, or was closed because of new information which made it obsolete. For merged pull requests it is ensured that a changelog entry was generated, if appropriate, before removing the item from the project plan. + + +## Issue Prioritization + +Issue prioritization is performed via two avenues: +- Kokkos Leadership meeting +- General Kokkos developer meeting. + +The Leadership meeting happens every week on Mondays. +It serves multiple purposes: +- determine urgent action items for the week +- go through new issue list, and triage criticality +- work through Kokkos planning items +- perform preliminary team assignments for new action items +- generate a draft for the developer meeting agenda + +Prioritization of items is recorded in the [Kokkos project plan](https://github.com/orgs/kokkos/projects/1) + +Meeting notes are kept in a private repository: [internal repository](https://github.com/kokkos/internal-documents) + +Further issue prioritization happens at the developer meeting discussed below. 
+ +## Developer Coordination + +The team primarily uses the #nucleus channel on Slack to communicate. +Members are added by Christian or Damien once they have joined [Slack](https://kokkosteam.slack.com). +Developers can have both public and private conversations with each other. +They can ask questions about parts of the code they are less familiar with or +ask for feedback on any ongoing issue. +Conversations on Slack are to be considered ephemeral. Messages older than 90 days are deleted (unpaid plan). +If something needs to be referenceable longer term, then it needs to be discussed on GitHub wherever appropriate. +Private information may be hosted on the [internal repository](https://github.com/kokkos/internal-documents) but do not post NDA data there. + +The Kokkos developer meeting is held once a week on Wednesdays at 2 pm ET / 12 pm MT / 18:00 UTC on Zoom. +The agenda is posted on the internal repository ahead of time (it can be found under the [`meeting-notes/`](https://github.com/kokkos/internal-documents/tree/master/meeting-notes/2023) directory). +Developers are allowed to edit the agenda and add topics or issues that they would like to be discussed at the meeting. + +## Release Process + +The release process has six steps: + +- create release candidate branch +- perform integration tests with release candidate +- resolve issues and cherry-pick fixes to release candidate +- check Changelog +- tag a release +- conduct release briefing for user community + +When nearing a desired release date, the release candidate branch will be created from the Kokkos develop branch. +Before creating the release candidate, possible delay reasons will be discussed at the developer meeting. +This could include important bug fixes, or an important feature being in the last phase of code review, +but is generally done only under exceptional circumstances.
+Furthermore, merging major new features into the development branch may be delayed until after the creation +of the release candidate. +This ensures that major new features have a period of testing in the develop branch before they are shipped. + +After creating the release candidate branch integration testing is started. +This includes internal testing by the Kokkos team with selected customer codes, as well as partnering +with some primary customers who will try the release candidate in their testing processes. + +The release candidate creation is also announced on the Slack channel, inviting the general Kokkos +user community to test it, and provide feedback. + +Defect reports (both functionality and performance) are collected as GitHub issues and marked with +"Blocks Promotion". +These items are then assigned to Kokkos team members at highest priority. + +Defect resolutions are merged into the develop branch first, and then cherry picked onto the +release candidate branch, ensuring that no regression remains unaddressed on the primary development +branch. + +Upon resolution of all defect reports the release candidate branch is used to create a GitHub release tag, +after checking and merging the Changelog. + +After the release is created a Release Briefing date is set approximately two to three weeks after the release, +providing an overview of new capabilities to users. +The release briefing also serves as an additional point for feedback collection. + diff --git a/docs/source/testing-and-issue-tracking/Testing-Processes.md b/docs/source/testing-and-issue-tracking/Testing-Processes.md index 2fa9438bd..05bda7613 100644 --- a/docs/source/testing-and-issue-tracking/Testing-Processes.md +++ b/docs/source/testing-and-issue-tracking/Testing-Processes.md @@ -9,29 +9,80 @@ Kokkos testing falls into three categories: ## Pull Request Testing All changes to Kokkos are introduced via pull requests against the github.com develop branch of Kokkos. 
+Pull requests are tested using GitHub Actions workflows, as well as external testing servers. + In order to be merged two conditions must be met: 1) Automatic testing of the pull request must pass. -2) A Kokkos core developer must approve the pull request, after checking the changes for alignment with Kokkos developer standards. +2) Two Kokkos core developers must approve the pull request, after checking the changes for alignment with Kokkos developer standards. The tested configurations in Pull Request testing cover the major deployment systems -and are executed via jenkins and travis at various institutions. +and are executed via jenkins and travis at various institutions. + +New test configurations are proposed to the Kokkos team in its developer meeting. +Inclusion of new configurations is decided based on test resource availability, +duration of the entire testing pipeline, and primary computing facility software stacks. + Pull request testing also includes verification that the formatting meets the clang-format style specified in the repository. -Test configurations are defined in the `kokkos/.jenkins` and `kokkos/.travis.yml` files. + +Test configurations are defined in the `kokkos/.jenkins` and `kokkos/.github/workflows/*` files and determine the official +primary software stack support. +The tested compiler versions are also listed [here](https://kokkos.github.io/kokkos-core-wiki/requirements.html). +These test configurations (sparsely) cover the cross product of hardware platforms (e.g. NVIDIA, Intel, and AMD), +compilers (e.g. GCC, Clang, NVC++), C++ standards (17-23), Kokkos backends (e.g. Cuda, OpenMP, and HIP) and Kokkos +configuration options (e.g. Debug, Relocatable Device Code). + The clang-format style file is `kokkos/.clang-format`. +Only the primary Kokkos maintainers can merge pull requests; they have the responsibility to judge whether conducted reviews meet the desired thoroughness.
+ ## Nightly Testing -Nightly testing covers a wider range of compilers and configuration of Kokkos -on an extensive list of platforms. -Test configurations are given in `kokkos/scripts/testings/test_all_sandia`. -Executing this script on the (in it) specified platforms will meet full testing requirements. -Nightly tests are set up via Jenkins and execute this script in stages. +Nightly testing covers a wider range of compilers and configuration of Kokkos +on an extensive list of platforms. + +All participating institutions are invited to perform nightly testing. +Test configurations are given in `kokkos/scripts/testings/` in institution-specific test configuration files. + +Each institution designates a test point of contact (POC), who will report failures to the entire Kokkos team, +and file GitHub issues with reproduction steps. ## Integration Testing (Release Testing) -In order for a new Kokkos version to be released full integration testing is performed. -A release is then formed by merging the Kokkos develop branch into its master branch, -and creating a git tag with the version number. -Details of the process are described in Testing Process Details. +In order for a new Kokkos version to be released, integration testing is performed. +Integration testing configurations are determined and maintained by the customer projects. + +This testing has three components: + +### Internal Integration Testing + +Kokkos team members will perform integration testing with a select number of customer codes that they are directly involved with. +Currently that includes two code bases: + +- Trilinos +- ArborX + +Trilinos in particular consists of several million lines of code over multiple packages. +Both codes are tested on the primary hardware platforms, and possibly multiple software stacks (compilers in particular). +They are also tested with a limited set of configurations during nightly testing, allowing the Kokkos team to catch issues early.
+ +### Preferred Customer Testing + +Customers funded by the same agencies as Kokkos are explicitly asked to test the release candidate before the actual release, and provide feedback. +This includes currently NNSA and Office of Science DOE users, specifically: + +- SNL Empire +- SNL LAMMPS +- SNL Sparta +- SNL Sierra - Aria +- ORNL Cabana +- ANL PETSc + +### General Community testing + +The release candidate is publicly available as a GitHub branch, and is advertised on the Kokkos Slack channel. +Any user of Kokkos is encouraged to test the release candidate and provide feedback. +The testing phase is at least two weeks. + + diff --git a/docs/source/usecases/MDRangePolicy.md b/docs/source/usecases/MDRangePolicy.md index 6293baa21..743ebe539 100644 --- a/docs/source/usecases/MDRangePolicy.md +++ b/docs/source/usecases/MDRangePolicy.md @@ -88,7 +88,7 @@ Kokkos::parallel_for("for_all_cells", If the number of cells is large enough to merit parallelization, that is the overhead for parallel dispatch plus computation time is less than total serial execution time, then the simple implementation above will result in improved performance. -There is more parallelism to exploit, particularly within the for loops over fields `F` and points `P`. One way to accomplish this would involve taking the product of the three iteration ranges, `C*F*P`, and performing a [`parallel_for`](../API/core/parallel-dispatch/parallel_for) over that product. However, this would require extraction routines to map between indices from the flattened iteration range, `C*F*P`, and the multidimensional indices required by data structures in this example. In addition, to achieve performance portability the mapping between the 1-D product iteration range and multidimensional 3-D indices would require architecture-awareness, akin to the notion of [`LayoutLeft`](../API/core/view/layoutLeft.md) and [`LayoutRight`](../API/core/view/layoutRight) used in Kokkos to establish data access patterns. 
+There is more parallelism to exploit, particularly within the for loops over fields `F` and points `P`. One way to accomplish this would involve taking the product of the three iteration ranges, `C*F*P`, and performing a [`parallel_for`](../API/core/parallel-dispatch/parallel_for) over that product. However, this would require extraction routines to map between indices from the flattened iteration range, `C*F*P`, and the multidimensional indices required by data structures in this example. In addition, to achieve performance portability the mapping between the 1-D product iteration range and multidimensional 3-D indices would require architecture-awareness, akin to the notion of [`LayoutLeft`](../API/core/view/layoutLeft) and [`LayoutRight`](../API/core/view/layoutRight) used in Kokkos to establish data access patterns. The [`MDRangePolicy`](../API/core/policies/MDRangePolicy) provides a natural way to accomplish the goal of parallelize over all three iteration ranges without requiring manually computing the product of the iteration ranges and mapping between 1-D and 3-D multidimensional indices. The [`MDRangePolicy`](../API/core/policies/MDRangePolicy) is suitable for use with tightly-nested for loops and provides a method to expose additional parallelism in computations beyond simply parallelize in a single dimension, as was shown in the first implementation using the [`RangePolicy`](../API/core/policies/RangePolicy). diff --git a/docs/source/usecases/Moving_from_EnableUVM_to_SharedSpace.md b/docs/source/usecases/Moving_from_EnableUVM_to_SharedSpace.md deleted file mode 100644 index 3c09c805d..000000000 --- a/docs/source/usecases/Moving_from_EnableUVM_to_SharedSpace.md +++ /dev/null @@ -1,56 +0,0 @@ -# Moving code from requiring `Kokkos_ENABLE_CUDA_UVM` to using `SharedSpace` - -With Kokkos 4.0 `Kokkos_ENABLE_CUDA_UVM` is deprecated and can only be used with `Kokkos_ENABLE_DEPRECATED_CODE_4`. 
The main reason for the deprecation was, that using the option changed the `memory_space` of the `Cuda` `ExecutionSpace`. This lead to several problems. For example: The driver is allowed to move chunks of this memory to the device or host depending on the access at any time without notice. -The accesses in `parallel_for`, `parallel_reduce`, or `parallel_scan` do not occur in any guaranteed order and furthermore depend on other kernels running on the same GPU. This makes debugging tedious. Especially, if the memory an allocation resides in is not apparent but dependent on the options when running `cmake`. - -## The alternative - -We introduced a new alias named [`SharedSpace`](SharedSpace) in Kokkos 4.0. This always points to memory that is accessible by every [`ExecutionSpace`](ExecutionSpaceConcept) and is migrated without user interaction to the acessing `ExecutionSpace` on demand. After migration the memory is accessed locally. -Using the alias e.g. in `Views` is expressive and thus easier to read. Furthermore, it is portable to every backend that can automatically migrate memory between `ExecutionSpaces`. -Furthermore, we introduced the alias [`SharedHostPinnedSpace`](SharedHostPinnedSpace) which points to memory that is accessible by all enabled `ExecutionSpaces` but always resides in the memory of the host. - -## The transition - -Basically it comes down to spelling [`Kokkos::SharedSpace`](SharedSpace) as a template argument in all allocations. -Below is an example of a transition: - - * Code requiring `Kokkos_ENABLE_CUDA_UVM` at configure time (until 4.0) -```c++ -#include - -int main (){ - Kokkos::initialize(); - { - unsigned int N = 100; - Kokkos::View myView("myView",N); - void* c_style_memory = Kokkos::kokkos_malloc("c_style_alloc",N*sizeof(double)); - - ... 
- - Kokkos::kokkos_free(c_style_memory); - } - Kokkos::finalize(); - return 0; -} -``` - - * Code using `SharedSpace` (since 4.0) -```c++ -#include - -int main (){ - Kokkos::initialize(); - { - static_assert(Kokkos::has_shared_space(),"code only works on backends with SharedSpace"); - - unsigned int N = 100; - Kokkos::View myView("myView",N); - void* c_style_memory = Kokkos::kokkos_malloc("c_style_alloc",N*sizeof(double)); - - ... - - Kokkos::kokkos_free(c_style_memory); - } - Kokkos::finalize(); - return 0; -} diff --git a/docs/source/usecases/Moving_from_EnableUVM_to_SharedSpace.rst b/docs/source/usecases/Moving_from_EnableUVM_to_SharedSpace.rst new file mode 100644 index 000000000..48811bf5b --- /dev/null +++ b/docs/source/usecases/Moving_from_EnableUVM_to_SharedSpace.rst @@ -0,0 +1,77 @@ +Moving code from requiring ``Kokkos_ENABLE_CUDA_UVM`` to using ``SharedSpace`` +============================================================================== + +.. role:: cppkokkos(code) + :language: cppkokkos + +.. _SharedSpace: ../API/core/memory_spaces.html#kokkos-sharedspace +.. |SharedSpace| replace:: ``SharedSpace`` + +.. _ExecutionSpace: ../API/core/execution_spaces.html#kokkos-executionspaceconcept +.. |ExecutionSpace| replace:: ``ExecutionSpace`` + +.. _SharedHostPinnedSpace: ../API/core/memory_spaces.html#kokkos-sharedhostpinnedspace +.. |SharedHostPinnedSpace| replace:: ``SharedHostPinnedSpace`` + +.. _KokkosSharedSpace: ../API/core/memory_spaces.html#kokkos-sharedspace +.. |KokkosSharedSpace| replace:: ``Kokkos::SharedSpace`` + +With Kokkos 4.0 ``Kokkos_ENABLE_CUDA_UVM`` is deprecated and can only be used with ``Kokkos_ENABLE_DEPRECATED_CODE_4``. The main reason for the deprecation was that using the option changed the ``memory_space`` of the ``Cuda`` ``ExecutionSpace``. This led to several problems. For example, the driver is allowed to move chunks of this memory to the device or host depending on the access at any time without notice.
+The accesses in ``parallel_for``, ``parallel_reduce``, or ``parallel_scan`` do not occur in any guaranteed order and furthermore depend on other kernels running on the same GPU. This makes debugging tedious, especially if the memory space an allocation resides in is not apparent but depends on the options passed to ``cmake``. + +The alternative +--------------- + +We introduced a new alias named |SharedSpace|_ in Kokkos 4.0. This always points to memory that is accessible by every |ExecutionSpace|_ and is migrated without user interaction to the accessing ``ExecutionSpace`` on demand. After migration the memory is accessed locally. +Using the alias e.g. in ``Views`` is expressive and thus easier to read. Furthermore, it is portable to every backend that can automatically migrate memory between ``ExecutionSpaces``. +In addition, we introduced the alias |SharedHostPinnedSpace|_ which points to memory that is accessible by all enabled ``ExecutionSpaces`` but always resides in the memory of the host. + +The transition +-------------- + +The transition essentially comes down to spelling |KokkosSharedSpace|_ as a template argument in all allocations. +Below is an example of a transition: + +* Code requiring ``Kokkos_ENABLE_CUDA_UVM`` at configure time (until 4.0) + +.. code-block:: cpp + + #include + + int main (){ + Kokkos::initialize(); + { + unsigned int N = 100; + Kokkos::View myView("myView",N); + void* c_style_memory = Kokkos::kokkos_malloc("c_style_alloc",N*sizeof(double)); + + ... + + Kokkos::kokkos_free(c_style_memory); + } + Kokkos::finalize(); + return 0; + } + +* Code using ``SharedSpace`` (since 4.0) + +.. code-block:: cpp + + #include + + int main (){ + Kokkos::initialize(); + { + static_assert(Kokkos::has_shared_space(),"code only works on backends with SharedSpace"); + + unsigned int N = 100; + Kokkos::View myView("myView",N); + void* c_style_memory = Kokkos::kokkos_malloc("c_style_alloc",N*sizeof(double)); + + ...
+ + Kokkos::kokkos_free(c_style_memory); + } + Kokkos::finalize(); + return 0; + }