This repository has been archived by the owner on Jan 20, 2024. It is now read-only.

WIP Reduction support. #252

Merged 25 commits into ROCm-Developer-Tools:amd-trunk-dev on Jan 8, 2024

Conversation

jsjodin
Contributor

@jsjodin jsjodin commented Dec 28, 2023

This patch enables reduction support for the reduction tests in aomp/trunk.
It is still a WIP with some debug printouts, etc., but it should run the tests successfully.

Contributor

@skatrak skatrak left a comment


I just have some small questions and suggestions, mostly trying to help a bit. I haven't looked too closely into the whole approach, since I guess it's still bound to change.

@@ -2231,7 +2231,7 @@ calculateTripCount(Fortran::lower::AbstractConverter &converter,
};

// Start with signless i32 by default.
auto tripCount = b.createIntegerConstant(loc, b.getI32Type(), 1);
Contributor


Just a small nit: Is this change still needed after PR #250 got merged? I think the MLIR trip count should accept any integer type, as long as it gets cast to i64 to match the function signature when emitting the call to __tgt_target_kernel. That was something introduced with that PR (in emitTargetCall() at line 5201 of OMPIRBuilder.cpp).

Contributor Author


I was able to remove this change. Thanks for pointing it out.

builder.SetInsertPoint(regionBlock->getTerminator());

// FIXME(JAN): We need to know if we are inside a distribute and
// if there is an inner wsloop reduction, in that case we need to
Contributor


Not sure if this helps here, but a method called getInnermostCapturedOmpOp() was added to omp.target, which you could reach via the parents of opInst. If that omp.target op contains a multi-level nest of OpenMP ops with the innermost region corresponding to an omp.wsloop, it will return that op. I guess you could read its reduction clause at that point, if needed. Something like:

if (auto targetOp = opInst->getParentOfType<omp::TargetOp>()) {
  if (auto wsLoopOp = dyn_cast_if_present<omp::WsLoopOp>(targetOp.getInnermostCapturedOmpOp())) {
    if (wsLoopOp.getReductions()...) { ... }
  }
}

Also, the TargetOp::isTargetSPMDLoop() function will tell you whether it is a "target teams distribute parallel do", if you need to check that here.

Contributor Author


Perhaps. I will look into this later on.

Type *ParallelTaskPtr, Value *TripCountOrig, Function &LoopBodyFn) {
// FIXME(JAN): The trip count is 1 larger than it should be; this may not be
// the right way to fix it, but it could be.
Value *TripCount = OMPBuilder->Builder.CreateSub(
Contributor


Does this change break regular "target teams distribute parallel do" in exchange for making it work with reductions, or can the off-by-one problem be reproduced in both situations?

Contributor Author


It doesn't seem to affect target teams distribute parallel do, but it is clear that the DeviceRTL functions for reductions want the reduced trip count. I would be surprised if the other DeviceRTL functions assumed a different trip count.

@jsjodin
Contributor Author

jsjodin commented Jan 4, 2024

Rebased with latest ATD.

we read outside the array when doing a reduction.
…ms portion of the

reductions. Also the allocation of the private arrays must be done at the teams (distribute) level, not at the wsloop level.
…addition

either for parallel or teams, not both, because the main thread will double its
final result.
…that

multiple reduction regions in the same function will not inadvertently share
reduction info. This needs to be fixed for GPU if multiple loops with
reductions are specified within a single target region.
@jsjodin jsjodin changed the title from "WIP Reduction suppport." to "WIP Reduction support." Jan 8, 2024
@jsjodin jsjodin merged commit d709125 into ROCm-Developer-Tools:amd-trunk-dev Jan 8, 2024
3 checks passed
2 participants