Switch to libxsmm JIT backend #26
Comments
Don't think this would help for alpha:
Might be that the doc is out of date, though.
Stupid question, but what is the performance penalty for the JIT backend? It's probably quite small, as lookup works in O(1) with a hashmap, right? We also have to consider this for benchmarking purposes. We do not want to include the compile time in our benchmark (it's init after all), so we should run the JIT phase "just before" ;)
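To make the benchmarking point concrete, here is a minimal sketch that keeps the setup (JIT) phase outside the timed loop. It does not call libxsmm: `setup_kernel`, `stub_kernel`, and `time_kernel` are hypothetical stand-ins, and in real code the setup function is where the code generation would happen.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel type; a JIT backend would hand back a pointer like this. */
typedef void (*kernel_fn)(const double *a, const double *b, double *c);

static int setup_calls = 0; /* counts how often the (expensive) setup ran */

/* Stand-in for a generated kernel: c[i] += a[i] * b[i] for 8 entries. */
static void stub_kernel(const double *a, const double *b, double *c) {
    for (int i = 0; i < 8; ++i) c[i] += a[i] * b[i];
}

/* Stand-in for the JIT phase; runs once, before any timing starts. */
static kernel_fn setup_kernel(void) {
    ++setup_calls;
    return stub_kernel;
}

/* Times only the kernel calls and returns the elapsed CPU seconds. */
static double time_kernel(kernel_fn kernel, int reps,
                          const double *a, const double *b, double *c) {
    assert(kernel != NULL && reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; ++r) kernel(a, b, c);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling `setup_kernel()` once during init and passing the returned pointer to `time_kernel` means the measured interval contains kernel calls only.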
Don't think there will be a performance penalty. You can just directly store the function pointer. No one was motivated to code that so far.
Not completely sure whether "LIBXSMM is falling back to BLAS" applies only to large problem sizes or also to alpha != 1.
Likely calls MKL.
Thank you for considering the transition to LIBXSMM's JIT backend! Indeed, the stand-alone generator driver that outputs inline-assembly C functions is deprecated and already lacks support for the latest microarchitectural extensions (mostly relevant for low/mixed precision). The discussion here is correct about the choices: managing the function pointers inside the application, or relying on libxsmm to manage them. Matrix multiplication kernels are always managed by the library, but they can of course also be tabulated inside the application. If only a limited set of kernels is needed and it can be determined upfront, it is best to keep a table of pointers available, since this avoids any lookup cost (no matter how small). An application that worked with a fixed set of kernels supplied by the deprecated stand-alone generator can be considered to have such upfront knowledge and to rely only on a limited set of kernels. The code registry not only provides the lookup service but also manages the lifetime of the buffers storing the executable code, and it offers kernel introspection as well as advanced lookup for custom data (which can be used to look up multiple kernels at once, or for totally unrelated data). One can read about code generation, lookup, and caching in this comment.
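As a rough illustration of the "table of pointers" option, the sketch below fills a fixed table once and then calls kernels by index, so no hash lookup happens on the hot path. Nothing here is libxsmm API: the `DEFINE_KERNEL` macro, the kernel names, and `kernel_table` are hypothetical, and the macro-generated loops merely stand in for JIT-generated code.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*gemm_kernel)(const double *a, const double *b, double *c);

/* Stand-in for a generated kernel: C += A * B, column-major, fixed M x N x K. */
#define DEFINE_KERNEL(NAME, M, N, K)                                   \
    static void NAME(const double *a, const double *b, double *c) {   \
        for (int j = 0; j < (N); ++j)                                  \
            for (int p = 0; p < (K); ++p)                              \
                for (int i = 0; i < (M); ++i)                          \
                    c[i + j * (M)] += a[i + p * (M)] * b[p + j * (K)]; \
    }

DEFINE_KERNEL(kernel_2x2x2, 2, 2, 2)
DEFINE_KERNEL(kernel_4x1x3, 4, 1, 3)

/* The set of kernels is known upfront, so each one gets a fixed slot. */
enum kernel_id { KERNEL_2X2X2, KERNEL_4X1X3, KERNEL_COUNT };

static gemm_kernel kernel_table[KERNEL_COUNT];

/* Init phase: with a JIT backend, each slot would be filled by a dispatch call. */
static void init_kernel_table(void) {
    kernel_table[KERNEL_2X2X2] = kernel_2x2x2;
    kernel_table[KERNEL_4X1X3] = kernel_4x1x3;
}

/* Hot path: a plain indexed load followed by an indirect call. */
static void run_kernel(enum kernel_id id,
                       const double *a, const double *b, double *c) {
    assert(id < KERNEL_COUNT && kernel_table[id] != NULL);
    kernel_table[id](a, b, c);
}
```

A code generator that knows every GEMM shape at generation time could emit the enum and the init function itself, which matches the "upfront knowledge" case described above.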
Problem: The current approach using libxsmm_gemm_generator can only generate GEMMs with alpha = +/-1. If YATeTo encounters a GEMM with |alpha| != 1, it falls back to default code (nested for loops), which is not performant.
Solution: Use the new libxsmm interface
libxsmm_?gemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
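For reference, the operation behind that argument list is the usual BLAS-style update C = alpha * A * B + beta * C. The sketch below spells it out for the non-transposed, column-major case with plain nested loops, roughly the kind of code a fallback path would run. `ref_dgemm_nn` is a hypothetical name, not a libxsmm function, and the sketch skips the BLAS convention that C is not read when beta is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Reference for C = alpha * A * B + beta * C (no transposes, column-major).
 * A is m x k with leading dimension lda, B is k x n with ldb, C is m x n with ldc. */
static void ref_dgemm_nn(int m, int n, int k, double alpha,
                         const double *a, int lda,
                         const double *b, int ldb,
                         double beta, double *c, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0; /* dot product of row i of A and column j of B */
            for (int p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

A routine like this is handy as a correctness oracle when comparing the output of generated kernels for alpha values other than +/-1.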