small prime FFT based on ulong #2107

Draft
wants to merge 74 commits into
base: main


vneiger
Collaborator

@vneiger vneiger commented Nov 9, 2024

This PR aims to provide a ulong-based version of the small-prime FFT. This is a draft; comments and suggestions are highly welcome (on any aspect: for example, I have no idea whether n_fft is a suitable name).

For the moment, the features implemented are:

  • forward FFT, inverse FFT, transposed forward FFT, transposed inverse FFT
  • restriction on the modulus: it must be 62 bits at most (for performance reasons)
  • the length must be a power of 2 (other lengths require zero padding, hence non-smooth timings between powers of 2)
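
For readers unfamiliar with the setting, here is a minimal reference sketch of a forward power-of-2 FFT modulo a prime. It is not the code of this PR (which relies on precomputed tables and lazy reductions): the names `mulmod` and `ntt_dif` are made up for this illustration, and the reduction uses the GCC/Clang `unsigned __int128` extension for simplicity. It assumes entries in [0, p), p below 2^63, and w of order exactly n modulo p; the output is in bit-reversed order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: a*b mod p through a 128-bit product (GCC/Clang extension). */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t) (((unsigned __int128) a * b) % p);
}

/* In-place decimation-in-frequency FFT of length n (a power of 2) modulo p.
 * w is assumed to have order exactly n modulo p, entries are in [0, p).
 * On output, a[bitrev(k)] = sum_j x[j] * w^(j*k) mod p. */
static void ntt_dif(uint64_t * a, size_t n, uint64_t w, uint64_t p)
{
    assert(n >= 1 && (n & (n - 1)) == 0);
    assert(p > 1 && p < (UINT64_C(1) << 63));

    /* at each level, w is a root of unity of order len; it is squared afterwards */
    for (size_t len = n; len >= 2; len /= 2, w = mulmod(w, w, p))
    {
        size_t half = len / 2;
        for (size_t i = 0; i < n; i += len)
        {
            uint64_t wk = 1;  /* w^j for the current block */
            for (size_t j = 0; j < half; j++)
            {
                uint64_t u = a[i + j];
                uint64_t v = a[i + j + half];
                uint64_t s = u + v;                        /* < 2p < 2^64 */
                uint64_t d = (u >= v) ? u - v : u + p - v; /* (u - v) mod p */
                a[i + j] = (s >= p) ? s - p : s;
                a[i + j + half] = mulmod(d, wk, p);
                wk = mulmod(wk, w, p);
            }
        }
    }
}
```

For instance, with p = 17, w = 4 (of order 4 modulo 17) and input (1, 2, 3, 4), the array becomes (10, 15, 7, 6), that is, the evaluations at w^0, w^2, w^1, w^3.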

Performance: observed on a few different machines (AMD Zen 4 and various Intel processors). This slightly outperforms NTL's versions of the forward and inverse FFTs (speedup of 0% to 30% depending on the length). It is between 2 and 4 times slower, often around 3, than the vectorized floating-point-based small-prime FFT in fft_small (or than the similar AVX-based version in NTL). This version uses no SIMD: enabling or disabling automatic vectorization does not change performance, and a straightforward "manual" vectorization should not bring much, because every few operations there is a full 64-bit multiplication (umul_ppmm). (Still, I made some experiments that suggest AVX could help, maybe substantially on AMD processors, which have a very fast vpmullq, but I leave this aside for later.)
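
To illustrate where the full 64-bit multiplication comes from, here is a sketch of the classical modular multiplication by a fixed element with a precomputed quotient (often attributed to Shoup, and the kind of operation an FFT butterfly performs with a root of unity). This is an illustration under assumptions, not the code of this PR: the names `shoup_precomp` and `mulmod_shoup` are hypothetical, and the high word of the product is obtained via the GCC/Clang `unsigned __int128` extension rather than umul_ppmm.

```c
#include <assert.h>
#include <stdint.h>

/* Precompute floor(w * 2^64 / n) for a fixed multiplier w < n. */
static uint64_t shoup_precomp(uint64_t w, uint64_t n)
{
    assert(n > 1 && w < n);
    return (uint64_t) ((((unsigned __int128) w) << 64) / n);
}

/* Return a*w mod n in [0, n), where w_pr = shoup_precomp(w, n).
 * Requires n < 2^63; a may be any 64-bit value. */
static uint64_t mulmod_shoup(uint64_t a, uint64_t w, uint64_t w_pr, uint64_t n)
{
    assert(n < (UINT64_C(1) << 63));
    /* high word of a 64x64 -> 128 bit product: the "full" multiplication */
    uint64_t q = (uint64_t) (((unsigned __int128) a * w_pr) >> 64);
    /* only low words are needed here: the exact value a*w - q*n lies in
     * [0, 2n), so computing it modulo 2^64 loses nothing */
    uint64_t r = a * w - q * n;
    return (r >= n) ? r - n : r;
}
```

The high-word product is the part that does not map to a cheap vector instruction, which is consistent with the remark above. Skipping the final conditional subtraction leaves a result in [0, 2n), which is the usual starting point for lazy reductions.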

Planned:

  • more thorough testing files (for the transposed variants, which are only tested indirectly at the moment)
  • cleaning things here and there, add documentation
  • add a mechanism to avoid overly memory-consuming precomputations when a root of unity of very large order is available (maybe, in a first version, simply forbid transforms of length more than 2**25 or so?).

Planned, but likely not within this PR:

  • truncated FFT variants, for smooth performance when length varies from one power of 2 to the next
  • versions with strides, useful e.g. for polynomial matrices stored as a list of matrix coefficients (e.g., might help for the half-GCD algorithm)

src/n_fft.h (outdated)
Comment on lines +218 to +222
if (depth == 1)
{
    ulong p_hi, p_lo, tmp;
    IDFT2_LAZY22(p[0], p[1], F->mod, F->mod2, F->tab_w[2*node], F->tab_w[2*node+1], p_hi, p_lo, tmp);
}
Collaborator

Codecov complains about not reaching these lines. Is it possible to reach these?

Collaborator Author

Yes, these lines are not reached right now because only the node0 variants are currently tested (they do call the function containing these lines, but never with depth == 1). So these lines will be reached once the tests are more complete. I'm adding a todo in the code to make sure I do not forget about this. Thanks for the catch!

@fredrik-johansson
Collaborator

fredrik-johansson commented Nov 12, 2024

Looks great so far!

Thoughts on supporting (smooth) non-power-of-two sizes as an alternative or complement to truncation?

Do you plan to add threading? One of the weaknesses of fft_small for huge convolutions is that the FFTs are single-threaded. Also, huge FFTs should be bandwidth-constrained rather than arithmetic-constrained, especially if you have a lot of cores. Considering that you get nearly 25% more bits per coefficient with 62-bit modulus instead of 50-bit, n_fft should perform quite well asymptotically.

@vneiger
Collaborator Author

vneiger commented Nov 13, 2024

> Thoughts on supporting (smooth) non-power-of-two sizes as an alternative or complement to truncation?

To be sure I understand "(smooth) non-power-of-two sizes": do you mean, to take a very specific case, that if the size is just below $3^k$, one would use a radix-3 FFT with a root of unity of order $3^k$? (And more generally, only slightly overshoot the actual size by choosing a size that factors into small primes?)
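
To make the "slightly overshoot" idea concrete, here is a small sketch that finds the smallest length of the form $2^a 3^b$ that is at least a given size. The function name `next_smooth_23` and the restriction to the primes 2 and 3 are assumptions made for this illustration; an actual transform of that length would additionally require a root of unity of that order modulo the prime.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest m >= len of the form 2^a * 3^b, assuming 1 <= len <= 2^62. */
static uint64_t next_smooth_23(uint64_t len)
{
    assert(len >= 1 && len <= (UINT64_C(1) << 62));
    uint64_t best = UINT64_MAX;
    for (uint64_t p3 = 1; ; p3 *= 3)
    {
        /* for this power of 3, take the fewest doublings that reach len */
        uint64_t m = p3;
        while (m < len)
            m *= 2;
        if (m < best)
            best = m;
        /* larger powers of 3 alone already exceed len: nothing better left */
        if (p3 >= len)
            break;
    }
    return best;
}
```

For example, a size of 65 would be rounded up to 72 = 2^3 * 3^2 instead of 128, and a size of 100 to 108 = 2^2 * 3^3 instead of 128.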

> Do you plan to add threading? One of the weaknesses of fft_small for huge convolutions is that the FFTs are single-threaded. Also, huge FFTs should be bandwidth-constrained rather than arithmetic-constrained, especially if you have a lot of cores. Considering that you get nearly 25% more bits per coefficient with 62-bit modulus instead of 50-bit, n_fft should perform quite well asymptotically.

OK, thanks for the insight. I was not sure whether threading was an important goal for small-prime FFTs, so that is good to know. This was not really in my plans for the near future, because I have other things I would like to make progress on (notably nmod_poly_mat), and because I do not have much hands-on experience with threading, so it is not something I could do very quickly (or it could lead to poor choices/design, poor usability, poor performance, etc.). But if someone with better threading skills wants to help, I will gladly collaborate.
