
Request for being able to compile an efficient blockwise dequantization integer matmul #19133

Open
ziereis opened this issue Nov 13, 2024 · 0 comments
Labels
enhancement ➕ New feature or request

Comments

@ziereis
Contributor

ziereis commented Nov 13, 2024

Request description

When a quantized model has both its weights and its activations quantized, it is possible to compute the inner "blockwise" dot product in integer arithmetic. For this, llama.cpp has several optimized dot-product functions, for example:

https://github.com/ggerganov/llama.cpp/blob/80dd7ff22fd050fed58b552cc8001aaf968b7ebf/ggml/src/ggml-quants.c#L3921

A simplified version of the computation in pseudocode looks like this:

def mmt_absmax_quant_weight(activation: tensor<64x16x32xi8>, weight: tensor<128x16x32xi8>, w_scale: tensor<128x16xf32>, a_scale: tensor<64x16xf32>) -> tensor<64x128xf32> {
    out = zeros<64x128xf32>;
    for (int a_row = 0; a_row < activation.shape[0]; a_row++) {
        for (int w_row = 0; w_row < weight.shape[0]; w_row++) {
            float fsum = 0;
            for (int block = 0; block < weight.shape[1]; block++) {
                int isum = 0;
                // integer dotproduct over one block
                for (int elem = 0; elem < weight.shape[2]; elem++) {
                    isum += weight[w_row][block][elem] * activation[a_row][block][elem];
                }
                // convert to float and apply the per-block scales
                fsum += cast<float>(isum) * w_scale[w_row][block] * a_scale[a_row][block];
            }
            out[a_row][w_row] = fsum;
        }
    }
    return out;
}
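
For reference, a minimal runnable NumPy sketch of the same computation (the function name and the random test data here are just placeholders; only the shapes match the pseudocode above):

import numpy as np

def mmt_absmax_quant_ref(activation, weight, w_scale, a_scale):
    # activation: (A, B, E) i8, weight: (W, B, E) i8,
    # a_scale: (A, B) f32, w_scale: (W, B) f32 -> result: (A, W) f32
    # Integer dot product per (a_row, w_row, block), accumulated in i32 ...
    isum = np.einsum("abe,wbe->awb",
                     activation.astype(np.int32), weight.astype(np.int32))
    # ... then dequantize each block partial sum and reduce over the blocks.
    return np.einsum("awb,ab,wb->aw",
                     isum.astype(np.float32), a_scale, w_scale)

# Random test data with the shapes from the pseudocode above.
rng = np.random.default_rng(0)
activation = rng.integers(-128, 128, size=(64, 16, 32), dtype=np.int8)
weight = rng.integers(-128, 128, size=(128, 16, 32), dtype=np.int8)
a_scale = rng.random((64, 16), dtype=np.float32)
w_scale = rng.random((128, 16), dtype=np.float32)
out = mmt_absmax_quant_ref(activation, weight, w_scale, a_scale)  # (64, 128) f32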

Since this loop nest is not perfectly nested, it can't be expressed with a single linalg op, so I came up with these two ops that perform the same computation:

module {
  util.func public @int8_blocked_dequant_mmt(%arg0: tensor<128x256x256xi8>, %arg1: tensor<128x256x256xi8>, %arg2: tensor<128x256xf32>, %arg3: tensor<128x256xf32>) -> tensor<128x128xf32> {
    %cst = arith.constant 0.0 : f32
    %cst_0 = arith.constant 0 : i32
    %0 = tensor.empty() : tensor<128x128xf32>
    %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<128x128xf32>) -> tensor<128x128xf32>
    %2 = tensor.empty() : tensor<128x128x256xi32>
    %3 = linalg.fill ins(%cst_0 : i32) outs(%2 : tensor<128x128x256xi32>) -> tensor<128x128x256xi32>
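    // Blockwise integer matmul: extend the i8 operands to i32 and accumulate the
    // per-block dot products, reducing only the innermost element dimension (d3).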
    %4 = linalg.generic {
      indexing_maps = [
        affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>,
        affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>,
        affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>
      ],
      iterator_types = ["parallel", "parallel", "parallel", "reduction"]
    } ins(%arg1, %arg0 : tensor<128x256x256xi8>, tensor<128x256x256xi8>) outs(%3 : tensor<128x128x256xi32>) {
    ^bb0(%in: i8, %in_1: i8, %out: i32):
      %6 = arith.extsi %in : i8 to i32
      %7 = arith.extsi %in_1 : i8 to i32
      %8 = arith.muli %6, %7 : i32
      %9 = arith.addi %8, %out : i32
      linalg.yield %9 : i32
    } -> tensor<128x128x256xi32>
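    // Dequantize: cast each i32 block partial sum to f32, apply both per-block
    // scales, and reduce over the block dimension (d2).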
    %5 = linalg.generic {
      indexing_maps = [
        affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
        affine_map<(d0, d1, d2) -> (d0, d2)>,
        affine_map<(d0, d1, d2) -> (d1, d2)>,
        affine_map<(d0, d1, d2) -> (d0, d1)>
      ],
      iterator_types = ["parallel", "parallel", "reduction"]
    } ins(%4, %arg3, %arg2 : tensor<128x128x256xi32>, tensor<128x256xf32>, tensor<128x256xf32>) outs(%1 : tensor<128x128xf32>) {
    ^bb0(%in: i32, %in_1: f32, %in_2: f32, %out: f32):
      %6 = arith.sitofp %in : i32 to f32
      %7 = arith.mulf %in_2, %in_1 : f32
      %8 = arith.mulf %7, %6 : f32
      %9 = arith.addf %8, %out : f32
      linalg.yield %9 : f32
    } -> tensor<128x128xf32>
    util.return %5 : tensor<128x128xf32>
  }
}

It is very important that these two ops end up in the same dispatch and get fused in the end; otherwise a very large intermediate result gets materialized (the tensor<128x128x256xi32> produced by the first generic is already 16 MiB for these shapes). However, this is currently not the case.
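
To illustrate, here is a minimal NumPy sketch of what the two generics compute (variable names are placeholders; the einsums mirror the indexing maps above). The partial-sum tensor corresponds to %4 and is what would have to be materialized between dispatches if the two ops are not fused:

import numpy as np

# Shapes from the IR above: two 128x256x256 i8 operands, two 128x256 f32 scales.
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(128, 256, 256), dtype=np.int8)  # plays the role of %arg1
b = rng.integers(-128, 128, size=(128, 256, 256), dtype=np.int8)  # plays the role of %arg0
a_scale = rng.random((128, 256), dtype=np.float32)                # plays the role of %arg3
b_scale = rng.random((128, 256), dtype=np.float32)                # plays the role of %arg2

# First linalg.generic: blockwise i8 x i8 dot products accumulated in i32,
# reducing only the innermost element dimension (d3).
partial = np.einsum("ike,jke->ijk", a.astype(np.int32), b.astype(np.int32))
print(partial.nbytes)  # 128 * 128 * 256 * 4 bytes = 16777216 (16 MiB)

# Second linalg.generic: cast to f32, apply both per-block scales, reduce over blocks (d2).
out = np.einsum("ijk,ik,jk->ij", partial.astype(np.float32), a_scale, b_scale)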

To reproduce, compile the IR above with:

iree-compile --iree-dispatch-creation-enable-aggressive-fusion=true --compile-to=flow

What component(s) does this issue relate to?

No response

Additional context

No response
