ziereis changed the title to "Request for being able to compile an efficient blockwise dequantization integer matmul" on Nov 13, 2024
Request description
When a quantized model has both the weights and the activations quantized, it is possible to compute the inner "blockwise" dot product in integer arithmetic. llama.cpp has several optimized dot-product functions for this, for example:
https://github.com/ggerganov/llama.cpp/blob/80dd7ff22fd050fed58b552cc8001aaf968b7ebf/ggml/src/ggml-quants.c#L3921
A simplified version of the computation in pseudocode is here:
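Roughly, the pattern looks like this (a sketch assuming a Q8_0-style layout with one float scale per block of QK values; the actual llama.cpp block structs differ per quantization type):

```c
#include <stdint.h>

#define QK 32  /* elements per quantization block (assumed block size) */

typedef struct {
    float  d;        /* per-block scale */
    int8_t qs[QK];   /* quantized values */
} block_q8;

float blockwise_dot(const block_q8 *x, const block_q8 *y, int nblocks) {
    float sum = 0.0f;
    for (int b = 0; b < nblocks; ++b) {
        int32_t acc = 0;                       /* integer accumulator per block */
        for (int i = 0; i < QK; ++i)
            acc += (int32_t)x[b].qs[i] * (int32_t)y[b].qs[i];
        sum += (float)acc * x[b].d * y[b].d;   /* apply both scales once per block */
    }
    return sum;
}
```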
Since this is not perfectly nested, it can't be expressed with a single linalg op, so I came up with these two ops to do the same computation.
It is very important that these two ops end up in the same dispatch and get fused in the end; otherwise a very large intermediate result gets materialized. However, this is currently not the case.
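To illustrate why the intermediate is so large, here is a rough sketch of the unfused two-stage form (hypothetical shapes: M x N output, NB blocks per contracted row). Stage 1 materializes an M x N x NB buffer of per-block i32 partial sums, NB times larger than the final result; stage 2 applies the scales and reduces over the block dimension:

```c
#include <stdint.h>

#define QK 32                                              /* assumed block size */
typedef struct { float d; int8_t qs[QK]; } block_q8;       /* as in the sketch above */

void matmul_unfused(const block_q8 *A, const block_q8 *B, float *C,
                    int M, int N, int NB, int32_t *partial /* size M*N*NB */) {
    /* stage 1: per-block integer dot products (the large intermediate) */
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int b = 0; b < NB; ++b) {
                int32_t acc = 0;
                for (int i = 0; i < QK; ++i)
                    acc += (int32_t)A[m * NB + b].qs[i] * (int32_t)B[n * NB + b].qs[i];
                partial[(m * N + n) * NB + b] = acc;
            }
    /* stage 2: apply per-block scales and reduce over the block dimension */
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (int b = 0; b < NB; ++b)
                sum += (float)partial[(m * N + n) * NB + b]
                       * A[m * NB + b].d * B[n * NB + b].d;
            C[m * N + n] = sum;
        }
}
```

When the two stages are fused, each per-block partial sum can be scaled and accumulated immediately, so the M x N x NB buffer never needs to exist.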
To reproduce:
iree-compile --iree-dispatch-creation-enable-aggressive-fusion=true --compile-to=flow
What component(s) does this issue relate to?
No response
Additional context
No response