WIP Reduction support. #252
Conversation
I just have some small questions and suggestions, trying mostly to help somewhat. I haven't looked too closely into the whole approach, since I guess it's still bound to change a bit.
flang/lib/Lower/OpenMP.cpp
@@ -2231,7 +2231,7 @@ calculateTripCount(Fortran::lower::AbstractConverter &converter,
};

// Start with signless i32 by default.
auto tripCount = b.createIntegerConstant(loc, b.getI32Type(), 1);
Just a small nit: is this change still needed after PR #250 got merged? I think the MLIR trip count should accept any integer type, as long as it gets cast to i64 to match the function signature when emitting the call to __tgt_target_kernel. That cast was introduced with that PR (in emitTargetCall(), around line 5201 of OMPIRBuilder.cpp).
I was able to remove this change. Thanks for pointing it out.
builder.SetInsertPoint(regionBlock->getTerminator());

// FIXME(JAN): We need to know if we are inside a distribute and
// if there is an inner wsloop reduction, in that case we need to
Not sure if this helps here, but a method called getInnermostCapturedOmpOp() was added to omp.target, which you could reach from opInst through its parents. If that omp.target op contains a multi-level nest of OpenMP ops whose innermost region corresponds to an omp.wsloop, it will return that op. You could then read its reduction clause, if needed. Something like:
if (auto targetOp = opInst->getParentOfType<omp::TargetOp>()) {
  if (auto wsLoopOp = dyn_cast_if_present<omp::WsLoopOp>(
          targetOp.getInnermostCapturedOmpOp())) {
    if (wsLoopOp.getReductions()...) { ... }
  }
}
Also, the TargetOp::isTargetSPMDLoop() function will tell you whether it is a "target teams distribute parallel do", if you need to check that here.
Perhaps, I will look into this later on.
Type *ParallelTaskPtr, Value *TripCountOrig, Function &LoopBodyFn) {
  // FIXME(JAN): The trip count is 1 larger than it should be, this may not be
  // the right way to fix it, but it could be.
  Value *TripCount = OMPBuilder->Builder.CreateSub(
Does this change break regular "target teams distribute parallel do" in exchange for making it work with reductions, or can the off-by-one problem be reproduced in both situations?
It doesn't seem to affect "target teams distribute parallel do", but it is clear that the DeviceRTL functions for reductions want the reduced trip count. I would be surprised if the other DeviceRTL functions assumed a different trip count.
Force-pushed from 5adc610 to af6ed32.
Rebased with latest ATD.
…e debug printouts.
… can now run, but we get the wrong result.
we read outside the array when doing a reduction.
…ms portion of the reductions. Also the allocation of the private arrays must be done at the teams (distribute) level, not at the wsloop level.
…addition either for parallel or teams, not both, because the main thread will double its final result.
…-bit assumptions in helper functions.
generating code for GPU.
…o i64 by default.
Force-pushed from 0aa533a to 21dafbf.
…that multiple reduction regions in the same function will not inadvertently share reduction info. This needs to be fixed for GPU if multiple loops with reductions are specified within a single target region.
This patch enables reduction support for the reduction tests in aomp/trunk.
It is still a WIP with some debug printouts etc., but it should be able to run the tests successfully.