
Request for being able to compile an efficient blockwise dequantization integer matmul #19133

Open
ziereis opened this issue Nov 13, 2024 · 0 comments
Labels
enhancement ➕ New feature or request

Comments

@ziereis
Contributor

ziereis commented Nov 13, 2024

Request description

When a quantized model has both its weights and its activations quantized, it is possible to compute the inner "blockwise" dot product in integer arithmetic. For this, llama.cpp has several optimized dot-product functions, for example:

https://github.com/ggerganov/llama.cpp/blob/80dd7ff22fd050fed58b552cc8001aaf968b7ebf/ggml/src/ggml-quants.c#L3921

A simplified version of the computation in pseudocode looks like this:

def mmt_absmax_quant_weight(activation: tensor<64x16x32xi8>, weight: tensor<128x16x32xi8>, w_scale: tensor<128x16xf32>, a_scale: tensor<64x16xf32>) -> tensor<64x128xf32> {
    out = zeros<64x128xf32>;
    for (int a_row = 0; a_row < activation.shape[0]; a_row++) {
        for (int w_row = 0; w_row < weight.shape[0]; w_row++) {
            float fsum = 0;
            for (int block = 0; block < weight.shape[1]; block++) {
                int isum = 0;
                // integer dotproduct over one block
                for (int elem = 0; elem < weight.shape[2]; elem++) {
                    isum += weight[w_row][block][elem] * activation[a_row][block][elem];
                }
                // convert to float and apply the per-block scales
                fsum += cast<float>(isum) * w_scale[w_row][block] * a_scale[a_row][block];
            }
            out[a_row][w_row] = fsum;
        }
    }
    return out;
}
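
For reference, a minimal runnable NumPy sketch of the same computation (the function name and the random test data here are just placeholders; only the shapes match the pseudocode above):

import numpy as np

def mmt_absmax_quant_ref(activation, weight, w_scale, a_scale):
    # activation: (A, B, E) i8, weight: (W, B, E) i8,
    # a_scale: (A, B) f32, w_scale: (W, B) f32 -> result: (A, W) f32
    # Integer dot product per (a_row, w_row, block), accumulated in i32 ...
    isum = np.einsum("abe,wbe->awb",
                     activation.astype(np.int32), weight.astype(np.int32))
    # ... then dequantize each block partial sum and reduce over the blocks.
    return np.einsum("awb,ab,wb->aw",
                     isum.astype(np.float32), a_scale, w_scale)

# Random test data with the shapes from the pseudocode above.
rng = np.random.default_rng(0)
activation = rng.integers(-128, 128, size=(64, 16, 32), dtype=np.int8)
weight = rng.integers(-128, 128, size=(128, 16, 32), dtype=np.int8)
a_scale = rng.random((64, 16), dtype=np.float32)
w_scale = rng.random((128, 16), dtype=np.float32)
out = mmt_absmax_quant_ref(activation, weight, w_scale, a_scale)  # (64, 128) f32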

Since this loop nest is not perfectly nested, it can't be expressed with a single linalg op, so I came up with these two ops that perform the same computation:

module {
  util.func public @int8_blocked_dequant_mmt(%arg0: tensor<128x256x256xi8>, %arg1: tensor<128x256x256xi8>, %arg2: tensor<128x256xf32>, %arg3: tensor<128x256xf32>) -> tensor<128x128xf32> {
    %cst = arith.constant 0.0 : f32
    %cst_0 = arith.constant 0 : i32
    %0 = tensor.empty() : tensor<128x128xf32>
    %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<128x128xf32>) -> tensor<128x128xf32>
    %2 = tensor.empty() : tensor<128x128x256xi32>
    %3 = linalg.fill ins(%cst_0 : i32) outs(%2 : tensor<128x128x256xi32>) -> tensor<128x128x256xi32>
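    // Blockwise integer matmul: extend the i8 operands to i32 and accumulate the
    // per-block dot products, reducing only the innermost element dimension (d3).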
    %4 = linalg.generic {
      indexing_maps = [
        affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>,
        affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>,
        affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>
      ],
      iterator_types = ["parallel", "parallel", "parallel", "reduction"]
    } ins(%arg1, %arg0 : tensor<128x256x256xi8>, tensor<128x256x256xi8>) outs(%3 : tensor<128x128x256xi32>) {
    ^bb0(%in: i8, %in_1: i8, %out: i32):
      %6 = arith.extsi %in : i8 to i32
      %7 = arith.extsi %in_1 : i8 to i32
      %8 = arith.muli %6, %7 : i32
      %9 = arith.addi %8, %out : i32
      linalg.yield %9 : i32
    } -> tensor<128x128x256xi32>
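    // Dequantize: cast each i32 block partial sum to f32, apply both per-block
    // scales, and reduce over the block dimension (d2).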
    %5 = linalg.generic {
      indexing_maps = [
        affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
        affine_map<(d0, d1, d2) -> (d0, d2)>,
        affine_map<(d0, d1, d2) -> (d1, d2)>,
        affine_map<(d0, d1, d2) -> (d0, d1)>
      ],
      iterator_types = ["parallel", "parallel", "reduction"]
    } ins(%4, %arg3, %arg2 : tensor<128x128x256xi32>, tensor<128x256xf32>, tensor<128x256xf32>) outs(%1 : tensor<128x128xf32>) {
    ^bb0(%in: i32, %in_1: f32, %in_2: f32, %out: f32):
      %6 = arith.sitofp %in : i32 to f32
      %7 = arith.mulf %in_2, %in_1 : f32
      %8 = arith.mulf %7, %6 : f32
      %9 = arith.addf %8, %out : f32
      linalg.yield %9 : f32
    } -> tensor<128x128xf32>
    util.return %5 : tensor<128x128xf32>
  }
}

It is very important that these two ops end up in the same dispatch and get fused in the end; otherwise a very large intermediate result gets materialized (the tensor<128x128x256xi32> produced by the first generic is already 16 MiB for these shapes). However, this is currently not the case.
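
To illustrate, here is a minimal NumPy sketch of what the two generics compute (variable names are placeholders; the einsums mirror the indexing maps above). The partial-sum tensor corresponds to %4 and is what would have to be materialized between dispatches if the two ops are not fused:

import numpy as np

# Shapes from the IR above: two 128x256x256 i8 operands, two 128x256 f32 scales.
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(128, 256, 256), dtype=np.int8)  # plays the role of %arg1
b = rng.integers(-128, 128, size=(128, 256, 256), dtype=np.int8)  # plays the role of %arg0
a_scale = rng.random((128, 256), dtype=np.float32)                # plays the role of %arg3
b_scale = rng.random((128, 256), dtype=np.float32)                # plays the role of %arg2

# First linalg.generic: blockwise i8 x i8 dot products accumulated in i32,
# reducing only the innermost element dimension (d3).
partial = np.einsum("ike,jke->ijk", a.astype(np.int32), b.astype(np.int32))
print(partial.nbytes)  # 128 * 128 * 256 * 4 bytes = 16777216 (16 MiB)

# Second linalg.generic: cast to f32, apply both per-block scales, reduce over blocks (d2).
out = np.einsum("ijk,ik,jk->ij", partial.astype(np.float32), a_scale, b_scale)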

To reproduce, compile the IR above with:

iree-compile --iree-dispatch-creation-enable-aggressive-fusion=true --compile-to=flow

What component(s) does this issue relate to?

No response

Additional context

No response
