
tensor.cat of dynamic length tensors always results in slow memcopies #19092

Open
rsuderman opened this issue Nov 9, 2024 · 9 comments · May be fixed by #19148

@rsuderman
Contributor

The following block always generates a slow memory copy to perform the concatenation:

func.func @main(%arg0: tensor<4x?x4x64xf32>, %arg1: tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32> {
  %1 = tensor.concat dim(0) %arg0, %arg1 : (tensor<4x?x4x64xf32>, tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32>
  return %1 : tensor<8x?x4x64xf32>
}
@MaheshRavishankar
Contributor

Well, yes. The concats aren't contiguous?

If the concat cannot be made contiguous on ingestion, you can try --iree-opt-outer-dim-concat=true.
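For reference, the flag goes on the compile invocation, e.g. `iree-compile --iree-opt-outer-dim-concat=true input.mlir -o output.vmfb` (the file names here are placeholders, and the usual target flags still apply).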

@IanWood1
Contributor

I thought they were, right? The concat is happening on the outermost dim. I think the issue is that when this gets decomposed into tensor.insert_slice ops we lose the knowledge that we are inserting along the full extent of the dynamic dim.
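
For illustration, the decomposition produces roughly the following (a hand-written sketch, not actual compiler output):

%c1 = arith.constant 1 : index
%dim = tensor.dim %arg0, %c1 : tensor<4x?x4x64xf32>
%empty = tensor.empty(%dim) : tensor<8x?x4x64xf32>
%0 = tensor.insert_slice %arg0 into %empty[0, 0, 0, 0] [4, %dim, 4, 64] [1, 1, 1, 1] : tensor<4x?x4x64xf32> into tensor<8x?x4x64xf32>
%1 = tensor.insert_slice %arg1 into %0[4, 0, 0, 0] [4, %dim, 4, 64] [1, 1, 1, 1] : tensor<4x?x4x64xf32> into tensor<8x?x4x64xf32>

Each insert covers the full extent of the dynamic dim, but the op only records the size as the SSA value %dim, so a later pass has to re-prove that %dim equals the destination's extent.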

@qedawkins
Contributor

The choice to decompose tensor.concat was just because that's where I stopped plumbing through support for it. It should be easy to keep it around and convert it to flow.tensor.update ops when it's on the outermost dim.
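
As a rough sketch of the target form (written from memory, so treat the exact flow.tensor.update syntax as an assumption rather than verified Flow dialect syntax):

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%dim = tensor.dim %arg1, %c1 : tensor<4x?x4x64xf32>
// Overwrite rows [4, 8) of the outer dim of %dest with %arg1: pure data
// movement that the runtime can service with a single contiguous copy.
%0 = flow.tensor.update %arg1, %dest[%c4, %c0, %c0, %c0] : tensor<4x?x4x64xf32>{%dim} -> %dest as tensor<8x?x4x64xf32>{%dim}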

@qedawkins
Contributor

qedawkins commented Nov 11, 2024

func.func @main(%arg0: tensor<4x?x4x64xf32>, %arg1: tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32> {
  %1 = tensor.concat dim(0) %arg0, %arg1 : (tensor<4x?x4x64xf32>, tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32>
  return %1 : tensor<8x?x4x64xf32>
}

Wait, isn't this sample degenerate? We always have to copy because the only thing here is a concat.

I think we need some surrounding dispatches to see whether this actually always happens.
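
A hypothetical non-degenerate variant would give the concat a consumer so its result feeds a real dispatch, e.g.:

func.func @main(%arg0: tensor<4x?x4x64xf32>, %arg1: tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32> {
  %0 = tensor.concat dim(0) %arg0, %arg1 : (tensor<4x?x4x64xf32>, tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32>
  // Elementwise consumer that forms a dispatch around the concat result.
  %1 = arith.addf %0, %0 : tensor<8x?x4x64xf32>
  return %1 : tensor<8x?x4x64xf32>
}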

@IanWood1
Contributor

IanWood1 commented Nov 11, 2024

@qedawkins I think the problem is that it gets lowered to a flow.dispatch + flow.tensor.load + flow.tensor.store instead of a flow.tensor.update, because isOffsetSizeAndStrideMappableToFlow returns false. I tried to change

} else {
  // TODO: Use ValueBoundsAnalysis to check whether two dynamic values
  // are equal.
  if (!(staticOffset == 0 && !ShapedType::isDynamic(staticSize) &&
        !ShapedType::isDynamic(baseShape[dim - 1]) &&
        staticSize == baseShape[dim - 1])) {
    fullSlices = false;
  }
}
}
but ValueBoundsOpInterface doesn't work with hal buffer dim ops.

@MaheshRavishankar
Contributor

Sorry, I missed that it is contiguous. @IanWood1, let's look at this more deeply today in our 1:1.

@qedawkins
Contributor

Again, this might be a degenerate case. isOffsetSizeAndStrideMappableToFlow tries to do some hacky analysis of whether an SSA value is based on a value computed on device. Because the inputs to the concat are function arguments, the tensor.dim %arg0 doesn't get replaced with any real SSA values, so that helper gives up. That might be the bug here, but this particular test case will always produce memcopies even if we convert to flow.tensor.update.
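
Concretely, the size the helper has to reason about bottoms out at a block argument (sketch):

%c1 = arith.constant 1 : index
// %arg0 is a function argument, so the def chain of this dim value ends
// here; there is no producing op left for the analysis to inspect.
%dim = tensor.dim %arg0, %c1 : tensor<4x?x4x64xf32>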

@MaheshRavishankar
Contributor

I haven't looked at this in a while, so I'll need some context. Even for this degenerate case we should be able to avoid generating slow memcopies, even if it means we keep the concat around longer.

@MaheshRavishankar
Contributor

OK, for now I am taking this from Ian. I'll start by trying to push the concat further down the compilation flow, because at the insert_slice level you don't have the information that this is contiguous.

MaheshRavishankar self-assigned this Nov 12, 2024
MaheshRavishankar added a commit to MaheshRavishankar/iree that referenced this issue Nov 13, 2024
…pdate`.

These are in preparation to delay the decomposition of `tensor.concat`
into `tensor.insert_slice`s. This patch just adds the patterns to
lower a `tensor.concat` along the outer dimension to
`flow.tensor.update`. Future changes will delay the decomposition of
`tensor.concat` to allow for non-outer dimension concatenations to be
converted into `tensor.insert_slice`s before dispatch formation, with
the `tensor.insert_slice` fused into its producers.

Towards iree-org#19092

Signed-off-by: MaheshRavishankar <[email protected]>
MaheshRavishankar added a commit that referenced this issue Nov 14, 2024
…pdate`. (#19126)

These are in preparation to delay the decomposition of `tensor.concat`
into `tensor.insert_slice`s. This patch just adds the patterns to lower
a `tensor.concat` along the outer dimension to `flow.tensor.update`.
Future changes will delay the decomposition of `tensor.concat` to allow
for non-outer dimension concatenations to be converted into
`tensor.insert_slice`s before dispatch formation, with the
`tensor.insert_slice` fused into its producers.

Towards #19092

---------

Signed-off-by: MaheshRavishankar <[email protected]>