
Switch to libxsmm JIT backend #26

Open
sebwolf-de opened this issue Mar 26, 2021 · 8 comments
@sebwolf-de
Contributor
Problem: The current approach using `libxsmm_gemm_generator` can only generate GEMMs with alpha = +/-1. If YATeTo encounters a GEMM with |alpha| != 1, it falls back to its default code path (nested for loops), which is not performant.
Solution: Use the newer LIBXSMM interface `libxsmm_?gemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);`
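For context, the fallback mentioned above amounts to a plain triple loop. A minimal sketch of such a general-alpha kernel (plain C, column-major storage, naming hypothetical, no libxsmm dependency):

```c
/* Naive column-major GEMM fallback: C = alpha * A * B + beta * C.
 * Handles arbitrary alpha and beta, but leaves all optimization to the
 * compiler -- which is why it is slow compared to a JIT-ed kernel. */
static void naive_dgemm(int m, int n, int k, double alpha,
                        const double *a, int lda,
                        const double *b, int ldb,
                        double beta, double *c, int ldc) {
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      double acc = 0.0;
      for (int p = 0; p < k; ++p) {
        acc += a[i + p * lda] * b[p + j * ldb];
      }
      c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
    }
  }
}
```

Switching to the JIT interface would replace calls to such a loop nest with a single generated kernel, while still covering |alpha| != 1.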

@sebwolf-de sebwolf-de self-assigned this Mar 26, 2021
@uphoffc
Contributor

uphoffc commented Mar 26, 2021

Don't think this would help for alpha:

What is a small matrix multiplication? When characterizing the problem-size by using the M, N, and K parameters, a problem-size suitable for LIBXSMM falls approximately within (M N K)^(1/3) <= 64 (which illustrates that non-square matrices or even "tall and skinny" shapes are covered as well). The library is typically used to generate code up to the specified threshold. Raising the threshold may not only generate excessive amounts of code (due to unrolling in M or K dimension), but also miss to implement a tiling scheme to effectively utilize the cache hierarchy. For auto-dispatched problem-sizes above the configurable threshold (explicitly JIT'ted code is not subject to the threshold), LIBXSMM is falling back to BLAS. In terms of GEMM, the supported kernels are limited to Alpha := 1, Beta := { 1, 0 }, and TransA := 'N'.

Might be that the doc is out of date, though.

@krenzland
Contributor

krenzland commented Mar 26, 2021

Stupid question, but what is the performance penalty of the JIT backend? It's probably quite small, as the lookup works in O(1) via a hash map, right?
Actually, you could generate a lookup table inside of yateto, so in the end it's just a function pointer call?

We also have to consider this for benchmarking purposes. We do not want to include the compile time in our benchmark (it's init after all), so we should do the JIT phase "Just before" ;)

@uphoffc
Contributor

uphoffc commented Mar 26, 2021

Don't think there will be a performance penalty. You can just directly store the function pointer. No one was motivated to code that so far.
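A sketch of that pattern: do the (hash-map) dispatch once during initialization, store the returned function pointer in a table, and call it directly afterwards. Note that `jit_dispatch`, `gemm_kernel_fn`, and the trivial kernel below are hypothetical stand-ins, not the actual libxsmm API (the real dispatch would be e.g. `libxsmm_dmmdispatch`):

```c
/* Kernel signature comparable to a JIT-ed GEMM kernel. */
typedef void (*gemm_kernel_fn)(const double *a, const double *b, double *c);

/* Trivial 1x1 "GEMM" stand-in for a JIT-generated kernel. */
static void sample_kernel(const double *a, const double *b, double *c) {
  c[0] = a[0] * b[0];
}

/* Stand-in for a JIT dispatch call: in libxsmm this would involve a
 * code-registry lookup (and possibly JIT compilation on a miss). */
static gemm_kernel_fn jit_dispatch(int m, int n, int k) {
  (void)m; (void)n; (void)k;
  return sample_kernel;
}

/* Table of pointers filled once during init, so the hot loop pays
 * neither the lookup nor the compile cost. */
static gemm_kernel_fn kernel_table[1];

static void init_kernels(void) {
  kernel_table[0] = jit_dispatch(1, 1, 1);  /* JIT "just before" the run */
}
```

This way the per-call overhead is a plain indirect call, and any compile cost is paid in the init phase, matching the benchmarking concern above.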

@sebwolf-de
Contributor Author

`libxsmm_dgemm(NULL, NULL, &m, &n, &k, &alpha, &a[0], NULL, &b[0], NULL, &beta, &c[0], NULL);` works with alpha != 1 (the NULL arguments select LIBXSMM's defaults, e.g. no transposition).

@sebwolf-de
Contributor Author

Not completely sure whether "LIBXSMM is falling back to BLAS" refers only to large problem sizes, or also to alpha != 1.

@uphoffc
Contributor

uphoffc commented Mar 26, 2021

Likely calls MKL.

@uphoffc uphoffc closed this as completed Mar 26, 2021
@uphoffc uphoffc reopened this Mar 26, 2021
@hfp

hfp commented Apr 27, 2021

Thank you for considering transition to LIBXSMM's JIT backend!

Indeed, the stand-alone generator driver that outputs inline-assembly C functions is deprecated and already lacks support for the latest microarchitectural extensions (mostly relevant for low/mixed precision).

The discussion here is correct about the choices: managing the function pointers inside the application, or relying on libxsmm to manage them. Matrix multiplication kernels are always managed by the library, but can of course additionally be tabulated inside the application. If only a limited set of kernels can be determined upfront, it is best to keep a table of pointers, since any lookup cost, no matter how small, is avoided this way. If an application worked with a set of fixed kernels supplied by the deprecated stand-alone generator, one can consider that application to have such upfront knowledge and to rely on only a limited set of kernels.

The code registry not only provides the lookup service, but also manages the lifetime of the buffers storing the executable code, and offers kernel introspection as well as advanced lookup for custom data (which can be used to look up multiple kernels at once, or entirely unrelated data). One can read about code generation, lookup, and caching in this comment.
