
tensor.cat of dynamic length tensors always results in slow memcopies #19092

Open
rsuderman opened this issue Nov 9, 2024 · 9 comments · May be fixed by #19148

@rsuderman
Contributor

The following block always generates a slow memory copy to perform the concatenation:

func.func @main(%arg0: tensor<4x?x4x64xf32>, %arg1: tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32> {
  %1 = tensor.concat dim(0) %arg0, %arg1 : (tensor<4x?x4x64xf32>, tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32>
  return %1 : tensor<8x?x4x64xf32>
}
@MaheshRavishankar
Contributor

Well, yes. The concats aren't contiguous?

If the concat cannot be made contiguous on ingestion, you can try --iree-opt-outer-dim-concat=true.
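For reference, the flag goes on the compile invocation, e.g. `iree-compile --iree-opt-outer-dim-concat=true input.mlir -o output.vmfb` (the file names here are placeholders, and the usual target flags still apply).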

@IanWood1
Contributor

I thought they were, right? The concat is happening on the outermost dim. I think the issue is that when this gets decomposed into tensor.insert_slice ops we lose the knowledge that we are inserting along the full extent of the dynamic dim.
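
For illustration, the decomposition produces roughly the following (a hand-written sketch, not actual compiler output):

%c1 = arith.constant 1 : index
%dim = tensor.dim %arg0, %c1 : tensor<4x?x4x64xf32>
%empty = tensor.empty(%dim) : tensor<8x?x4x64xf32>
%0 = tensor.insert_slice %arg0 into %empty[0, 0, 0, 0] [4, %dim, 4, 64] [1, 1, 1, 1] : tensor<4x?x4x64xf32> into tensor<8x?x4x64xf32>
%1 = tensor.insert_slice %arg1 into %0[4, 0, 0, 0] [4, %dim, 4, 64] [1, 1, 1, 1] : tensor<4x?x4x64xf32> into tensor<8x?x4x64xf32>

Each insert covers the full extent of the dynamic dim, but the op only records the size as the SSA value %dim, so a later pass has to re-prove that %dim equals the destination's extent.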

@qedawkins
Contributor

The choice to decompose tensor.concat was just because that's where I stopped plumbing through support for it. It should be easy to keep it around and convert it to flow.tensor.update ops when it's on the outermost dim.
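
As a rough sketch of the target form (written from memory, so treat the exact flow.tensor.update syntax as an assumption rather than verified Flow dialect syntax):

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%dim = tensor.dim %arg1, %c1 : tensor<4x?x4x64xf32>
// Overwrite rows [4, 8) of the outer dim of %dest with %arg1: pure data
// movement that the runtime can service with a single contiguous copy.
%0 = flow.tensor.update %arg1, %dest[%c4, %c0, %c0, %c0] : tensor<4x?x4x64xf32>{%dim} -> %dest as tensor<8x?x4x64xf32>{%dim}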

@qedawkins
Contributor

qedawkins commented Nov 11, 2024

func.func @main(%arg0: tensor<4x?x4x64xf32>, %arg1: tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32> {
  %1 = tensor.concat dim(0) %arg0, %arg1 : (tensor<4x?x4x64xf32>, tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32>
  return %1 : tensor<8x?x4x64xf32>
}

Wait, isn't this sample degenerate? We always have to copy because the only thing here is a concat.

I think we need some surrounding dispatches to see whether this actually always happens.
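
A hypothetical non-degenerate variant would give the concat a consumer so its result feeds a real dispatch, e.g.:

func.func @main(%arg0: tensor<4x?x4x64xf32>, %arg1: tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32> {
  %0 = tensor.concat dim(0) %arg0, %arg1 : (tensor<4x?x4x64xf32>, tensor<4x?x4x64xf32>) -> tensor<8x?x4x64xf32>
  // Elementwise consumer that forms a dispatch around the concat result.
  %1 = arith.addf %0, %0 : tensor<8x?x4x64xf32>
  return %1 : tensor<8x?x4x64xf32>
}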

@IanWood1
Contributor

IanWood1 commented Nov 11, 2024

@qedawkins I think the problem is that it gets lowered to a flow.dispatch + flow.tensor.load + flow.tensor.store instead of a flow.tensor.update, because isOffsetSizeAndStrideMappableToFlow returns false. I tried to change

} else {
  // TODO: Use ValueBoundsAnalysis to check whether two dynamic values
  // are equal.
  if (!(staticOffset == 0 && !ShapedType::isDynamic(staticSize) &&
        !ShapedType::isDynamic(baseShape[dim - 1]) &&
        staticSize == baseShape[dim - 1])) {
    fullSlices = false;
  }
}
}
but ValueBoundsOpInterface doesn't work with hal buffer dim ops.

@MaheshRavishankar
Contributor

Sorry, I missed that it is contiguous. @IanWood1, let's look at this more deeply today in our 1:1.

@qedawkins
Contributor

Again, this might be a degenerate case. isOffsetSizeAndStrideMappableToFlow tries to do some hacky analysis of whether an SSA value is based on a value computed on device. Because the inputs to the concat are function arguments, the tensor.dim %arg0 doesn't get replaced with any real SSA values, so that helper gives up. That might be the bug here, but this particular test case will always produce memcopies even if we convert to flow.tensor.update.
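
Concretely, the size the helper has to reason about bottoms out at a block argument (sketch):

%c1 = arith.constant 1 : index
// %arg0 is a function argument, so the def chain of this dim value ends
// here; there is no producing op left for the analysis to inspect.
%dim = tensor.dim %arg0, %c1 : tensor<4x?x4x64xf32>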

@MaheshRavishankar
Contributor

I haven't looked at this in a while, so I'll need some context. Even for this degenerate case we should be able to avoid generating slow memcopies, even if it means we keep the concat around longer.

@MaheshRavishankar
Contributor

OK, for now I am taking this from Ian. I'll start by trying to push the concat further down the compilation flow, because at the insert_slice level you don't have the information that this is contiguous.

MaheshRavishankar self-assigned this Nov 12, 2024
MaheshRavishankar added a commit to MaheshRavishankar/iree that referenced this issue Nov 13, 2024
…pdate`.

These are in preparation to delay the decomposition of `tensor.concat`
into `tensor.insert_slice`s. This patch just adds the patterns to
lower a `tensor.concat` along the outer dimension to
`flow.tensor.update`. Future changes will delay the decomposition of
`tensor.concat` to allow for non-outer dimension concatenations to be
converted into `tensor.insert_slice`s before dispatch formation, with
the `tensor.insert_slice` fused into its producers.

Towards iree-org#19092

Signed-off-by: MaheshRavishankar <[email protected]>
MaheshRavishankar added a commit that referenced this issue Nov 14, 2024
…pdate`. (#19126)

These are in preparation to delay the decomposition of `tensor.concat`
into `tensor.insert_slice`s. This patch just adds the patterns to lower
a `tensor.concat` along the outer dimension to `flow.tensor.update`.
Future changes will delay the decomposition of `tensor.concat` to allow
for non-outer dimension concatenations to be converted into
`tensor.insert_slice`s before dispatch formation, with the
`tensor.insert_slice` fused into its producers.

Towards #19092

---------

Signed-off-by: MaheshRavishankar <[email protected]>