This repository has been archived by the owner on Jan 20, 2024. It is now read-only.

WIP Reduction support. #252

Merged 25 commits into ROCm-Developer-Tools:amd-trunk-dev on Jan 8, 2024

Conversation

jsjodin
Contributor

@jsjodin jsjodin commented Dec 28, 2023

This patch enables reduction support for the reduction tests in aomp/trunk.
It is still a WIP with some debug printouts, etc., but it should run the tests successfully.

Contributor

@skatrak skatrak left a comment


I just have some small questions and suggestions, mostly trying to help a bit. I haven't looked too closely into the whole approach, since I guess it's still bound to change.

@@ -2231,7 +2231,7 @@ calculateTripCount(Fortran::lower::AbstractConverter &converter,
};

// Start with signless i32 by default.
auto tripCount = b.createIntegerConstant(loc, b.getI32Type(), 1);
Contributor


Just a small nit: Is this change still needed after PR #250 got merged? I think the MLIR trip count should accept any integer type, as long as it gets cast to i64 to match the function signature when emitting the call to __tgt_target_kernel. That was something introduced with that PR (in emitTargetCall() at line 5201 of OMPIRBuilder.cpp).

Contributor Author


I was able to remove this change. Thanks for pointing it out.

builder.SetInsertPoint(regionBlock->getTerminator());

// FIXME(JAN): We need to know if we are inside a distribute and
// if there is an inner wsloop reduction, in that case we need to
Contributor


Not sure if this helps here, but a method called getInnermostCapturedOmpOp() was added to omp.target, which you could reach via the parents of opInst. If that omp.target op contains a multi-level nest of OpenMP ops with the innermost region corresponding to an omp.wsloop, it will return that op. I guess you could read its reduction clause at that point, if needed. Something like:

if (auto targetOp = opInst->getParentOfType<omp::TargetOp>()) {
  if (auto wsLoopOp = dyn_cast_if_present<omp::WsLoopOp>(targetOp.getInnermostCapturedOmpOp())) {
    if (wsLoopOp.getReductions()...) { ... }
  }
}

Also, the TargetOp::isTargetSPMDLoop() function will tell you whether it is a "target teams distribute parallel do", if you need to check that here.

Contributor Author


Perhaps. I will look into this later on.

Type *ParallelTaskPtr, Value *TripCountOrig, Function &LoopBodyFn) {
// FIXME(JAN): The trip count is 1 larger than it should be; this may not be
// the right way to fix it, but it could be.
Value *TripCount = OMPBuilder->Builder.CreateSub(
Contributor


Does this change break regular "target teams distribute parallel do" in exchange for making it work with reductions, or can the off-by-one problem be reproduced in both situations?

Contributor Author


It doesn't seem to affect target teams distribute parallel do, but it is clear that the DeviceRTL functions for reductions want the reduced trip count. I would be surprised if the other DeviceRTL functions assumed a different trip count.

@jsjodin
Contributor Author

jsjodin commented Jan 4, 2024

Rebased with latest ATD.

we read outside the array when doing a reduction.
…ms portion of the

reductions. Also the allocation of the private arrays must be done at the teams (distribute) level, not at the wsloop level.
…addition

either for parallel or teams, not both, because the main thread will double its
final result.
…that

multiple reduction regions in the same function will not inadvertently share
reduction info. This needs to be fixed for GPU if multiple loops with
reductions are specified within a single target region.
@jsjodin jsjodin changed the title from "WIP Reduction suppport." to "WIP Reduction support." Jan 8, 2024
@jsjodin jsjodin merged commit d709125 into ROCm-Developer-Tools:amd-trunk-dev Jan 8, 2024
3 checks passed
2 participants