small prime FFT based on ulong #2107
if (depth == 1)
{
    ulong p_hi, p_lo, tmp;
    IDFT2_LAZY22(p[0], p[1], F->mod, F->mod2, F->tab_w[2*node], F->tab_w[2*node+1], p_hi, p_lo, tmp);
}
Codecov complains about not reaching these lines. Is it possible to reach these?
Yes, these lines are not reached right now because only the node0 variants are currently tested (they do call the function containing these lines, but never with `depth == 1`). So these lines will be reached once the tests are more complete. I'm adding a todo in the code to make sure I do not forget about this. Thanks for the catch!
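For readers following the review thread, here is a minimal sketch of what a lazy inverse radix-2 butterfly of this kind typically computes. It is not the PR's `IDFT2_LAZY22` macro: the function names (`mul_hi`, `precomp_quotient`, `mulmod_lazy`, `idft2_lazy`) are hypothetical, `uint64_t` stands in for `ulong`, and the sketch assumes a modulus n < 2^62, a twiddle w < n stored next to its precomputed quotient floor(w * 2^64 / n), and the GCC/Clang `unsigned __int128` extension. Inputs and outputs stay in [0, 2n) rather than being fully reduced.

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a*b
   (plays the role of the high word returned by umul_ppmm). */
static uint64_t mul_hi(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Precomputed quotient floor(w * 2^64 / n), assuming w < n. */
static uint64_t precomp_quotient(uint64_t w, uint64_t n)
{
    return (uint64_t)(((unsigned __int128)w << 64) / n);
}

/* Returns a value congruent to a*w mod n, lying in [0, 2n).
   Valid for any 64-bit a, provided w < n and w_pre = precomp_quotient(w, n). */
static uint64_t mulmod_lazy(uint64_t a, uint64_t w, uint64_t w_pre, uint64_t n)
{
    uint64_t q = mul_hi(a, w_pre);   /* approximate quotient of a*w by n */
    return a * w - q * n;            /* wraps mod 2^64; true value is in [0, 2n) */
}

/* Lazy inverse butterfly: (a, b) -> (a + b, w * (a - b)) mod n.
   Inputs and outputs are in [0, 2n); assumes n < 2^62 so 4n fits in 64 bits. */
static void idft2_lazy(uint64_t *a, uint64_t *b,
                       uint64_t n, uint64_t w, uint64_t w_pre)
{
    uint64_t n2 = 2 * n;
    uint64_t u = *a + *b;            /* in [0, 4n) */
    uint64_t v = *a + n2 - *b;       /* in (0, 4n), congruent to a - b */
    if (u >= n2)
        u -= n2;                     /* bring the sum back to [0, 2n) */
    *a = u;
    *b = mulmod_lazy(v, w, w_pre, n);
}
```

The only full reduction skipped here is the final conditional subtraction to [0, n), which is what makes the butterfly "lazy"; a caller would normalize once at the end of the transform.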
Looks great so far! Any thoughts on supporting (smooth) non-power-of-two sizes as an alternative or complement to truncation? Do you plan to add threading? One of the weaknesses of
To be sure about "(smooth) non-power-of-two-sizes". Do you mean, e.g. on a very specific case: if the size is just below
Ok, thanks for the insight. I was not sure whether threading was an important goal for small-prime FFTs, that is good to know. This was not really in my plans for the near future, because I have other things I would like to make progress on (notably
This PR aims to provide a ulong-based version of small prime FFT. This is a draft; comments and suggestions are highly welcome (on any aspect: for example I have no idea if `n_fft` is relevant naming).
For the moment, the features implemented are:
Performance: observed on a few different machines (AMD Zen 4 and various Intel processors), this slightly outperforms NTL's versions of the forward and inverse FFTs (speedup of 0% to 30% depending on lengths). It is between 2 and 4 times slower, often around 3, than the vectorized floating-point-based small-prime FFT in `fft_small` (or than the similar AVX-based version in NTL). This version uses no SIMD: enabling or disabling automatic vectorization does not change performance, and a straightforward "manual" vectorization should not bring much, because every few operations there is a full 64-bit multiplication (`umul_ppmm`). (Still, I made some experiments that suggest AVX could help, maybe substantially on AMD processors which have a very fast vpmullq, but I leave this aside for later.)
Planned:
Planned, but likely not within this PR:
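As an aside on the performance remark about the full 64-bit multiplication: the sketch below illustrates, in scalar C, why that step is awkward to vectorize when only 32x32 -> 64-bit multiplies are available per lane (as with AVX2's vpmuludq). The high word of a 64x64 product then has to be assembled from four partial products plus carry handling. The function names are hypothetical and this is not code from the PR; the reference version assumes the GCC/Clang `unsigned __int128` extension.

```c
#include <stdint.h>

/* Reference: high 64 bits of a*b via a 128-bit product. */
static uint64_t mul_hi_ref(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Same result built only from 32x32 -> 64-bit multiplications,
   mimicking what a SIMD lane without a 64x64 -> 128 multiply must do. */
static uint64_t mul_hi_from_32bit_parts(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;   /* contributes at weight 2^0  */
    uint64_t lh = a_lo * b_hi;   /* contributes at weight 2^32 */
    uint64_t hl = a_hi * b_lo;   /* contributes at weight 2^32 */
    uint64_t hh = a_hi * b_hi;   /* contributes at weight 2^64 */

    /* Sum of the pieces landing in bits 32..63; it is below 3 * 2^32,
       so it cannot overflow, and its top part is the carry into bit 64. */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```

Four multiplies and several shifts and adds replace a single scalar instruction, which is consistent with the observation that a straightforward manual vectorization is not expected to bring much.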