small prime FFT based on ulong #2107

vneiger · 2024-11-09T10:24:02Z

This PR aims to provide a ulong-based version of the small-prime FFT. This is a draft; comments and suggestions are highly welcome (on any aspect: for example, I have no idea whether n_fft is a suitable name).

For the moment, the features implemented are:

forward FFT, inverse FFT, transposed forward FFT, transposed inverse FFT
restriction on the modulus: it must be 62 bits at most (for performance reasons)
length must be a power of 2 (other lengths require zero padding, so timings are not smooth between consecutive powers of 2)
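To illustrate why zero padding gives non-smooth timings, here is a minimal sketch of the padded length computation. The helper name `next_pow2` is hypothetical and is not taken from the PR; it only shows that a length just above a power of 2 is rounded up to the next one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the PR): smallest power of two >= n.
 * Requires 1 <= n <= SIZE_MAX/2 + 1 so that the shift cannot overflow. */
static size_t next_pow2(size_t n)
{
    assert(n >= 1 && n <= SIZE_MAX / 2 + 1);
    size_t len = 1;
    while (len < n)
        len <<= 1;   /* double until the input fits */
    return len;
}
```

With this rounding, an input of length 1025 is padded to 2048 and so costs as much as a full transform of length 2048, which is the jump in timings mentioned above.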

Performance: observed on a few different machines (AMD Zen 4 and various Intel processors). This slightly outperforms NTL's versions of the forward and inverse FFTs (speedup of 0% to 30% depending on the length). It is between 2 and 4 times slower, often around 3, than the vectorized floating-point-based small-prime FFT in fft_small (or than the similar AVX-based version in NTL). This version uses no SIMD: enabling or disabling automatic vectorization does not change performance, and a straightforward "manual" vectorization should not bring much, because every few operations there is a full 64-bit multiplication (umul_ppmm). (Still, I made some experiments that suggest AVX could help, maybe substantially on AMD processors, which have a very fast vpmullq, but I leave this aside for later.)
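To make the "full 64-bit multiplication every few operations" concrete, here is a minimal sketch of a modular multiplication by a fixed element with a precomputed quotient (the technique usually attributed to Shoup). This is an illustration under assumptions, not the PR's code: the names `precomp` and `mulmod_precomp` are hypothetical, and the sketch uses the GCC/Clang extension `unsigned __int128` in place of umul_ppmm to get the high word of a 64x64 product.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

/* Precomputation for a fixed multiplier w < n: floor(w * 2^64 / n). */
static uint64_t precomp(uint64_t w, uint64_t n)
{
    assert(w < n);
    return (uint64_t)(((u128)w << 64) / n);
}

/* Returns w*a mod n in [0, n), for any 64-bit a, assuming n < 2^63. */
static uint64_t mulmod_precomp(uint64_t a, uint64_t w, uint64_t w_pr, uint64_t n)
{
    /* High word of the full 64x64 product: approximate quotient of w*a by n. */
    uint64_t q = (uint64_t)(((u128)w_pr * a) >> 64);
    /* Computed modulo 2^64; the true value lies in [0, 2n). */
    uint64_t r = w * a - q * n;
    return (r >= n) ? r - n : r;
}
```

The high-word product is the step that has no cheap counterpart in common SIMD instruction sets, which is consistent with the remark above that straightforward vectorization is not expected to help much. Skipping the final conditional subtraction leaves the result in [0, 2n), which is the usual idea behind "lazy" variants.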

Planned:

more thorough test files (for the transposed variants, which are only tested indirectly at the moment)
clean things up here and there, add documentation
add a mechanism to avoid overly memory-consuming precomputation when a root of unity of very large order is available (maybe, in a first version, simply forbid transforms of length more than 2**25 or so?).
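As a rough illustration of the memory concern, the following sketch estimates the size of a table of precomputed roots. The layout is an assumption made purely for this example and may not match the PR: one entry per butterfly twiddle (len/2 entries for a transform of length len = 2^depth), each entry holding `words_per_entry` 64-bit words (for instance a root together with its precomputed quotient). The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical estimate (layout assumed, see above): bytes used by a table
 * with len/2 entries of words_per_entry 64-bit words, where len = 2^depth. */
static uint64_t twiddle_table_bytes(unsigned depth, unsigned words_per_entry)
{
    assert(depth >= 1 && depth <= 50);
    assert(words_per_entry >= 1 && words_per_entry <= 16);
    uint64_t entries = UINT64_C(1) << (depth - 1);   /* len / 2 */
    return entries * words_per_entry * sizeof(uint64_t);
}
```

Under these assumptions, `twiddle_table_bytes(25, 2)` gives 2^28 bytes (256 MiB) and `twiddle_table_bytes(30, 2)` gives 2^33 bytes (8 GiB), which shows how quickly a full precomputation grows when a root of very large order is available.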

Planned, but likely not within this PR:

truncated FFT variants, for smooth performance when length varies from one power of 2 to the next
versions with strides, useful e.g. for polynomial matrices stored as a list of matrix coefficients (e.g., might help for the half-GCD algorithm)

src/n_fft.h

albinahlback · 2024-11-09T12:49:25Z

src/n_fft/idft.c

+    if (depth == 1)
+    {
+        ulong p_hi, p_lo, tmp;
+        IDFT2_LAZY22(p[0], p[1], F->mod, F->mod2, F->tab_w[2*node], F->tab_w[2*node+1], p_hi, p_lo, tmp);
+    }


Codecov complains about not reaching these lines. Is it possible to reach these?

Yes, these lines are not reached right now because only the node0 variants are currently tested (they do call the function containing these lines, but never with depth == 1). So these lines will be reached once the tests are more complete. I'm adding a todo in the code to make sure I do not forget about this. Thanks for the catch!
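For readers unfamiliar with the kind of operation in the quoted lines, here is a minimal sketch of a lazy inverse radix-2 butterfly. It is not the PR's IDFT2_LAZY22 macro, whose definition is not shown in this conversation; the function name, the argument order, and the input/output ranges [0, 2n) are assumptions chosen for illustration. It relies on the GCC/Clang extension `unsigned __int128`.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

/* Hypothetical lazy inverse butterfly: (a, b) -> (a + b, w * (a - b)) mod n.
 * Inputs are assumed to lie in [0, 2n) and outputs are returned in [0, 2n).
 * Requires n < 2^62 so that values below 4n fit in 64 bits, and
 * w_pr = floor(w * 2^64 / n) with w < n. */
static void idft2_lazy(uint64_t *a, uint64_t *b,
                       uint64_t n, uint64_t w, uint64_t w_pr)
{
    assert(n < (UINT64_C(1) << 62));
    assert(*a < 2 * n && *b < 2 * n);
    uint64_t n2 = 2 * n;
    uint64_t diff = *a + n2 - *b;    /* congruent to a - b, in (0, 4n) */
    uint64_t sum = *a + *b;          /* in [0, 4n) */
    if (sum >= n2)
        sum -= n2;                   /* back to [0, 2n) */
    /* Lazy multiplication by w: approximate quotient, no final correction. */
    uint64_t q = (uint64_t)(((u128)w_pr * diff) >> 64);
    *a = sum;
    *b = w * diff - q * n;           /* congruent to w * (a - b), in [0, 2n) */
}
```

The bound n < 2^62 in this sketch matches the 62-bit restriction on the modulus stated in the PR description, since intermediate values up to 4n must fit in a 64-bit word.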

fredrik-johansson · 2024-11-12T10:46:13Z

Looks great so far!

Thoughts on supporting (smooth) non-power-of-two sizes as an alternative or complement to truncation?

Do you plan to add threading? One of the weaknesses of fft_small for huge convolutions is that the FFTs are single-threaded. Also, huge FFTs should be bandwidth-constrained rather than arithmetic-constrained, especially if you have a lot of cores. Considering that you get nearly 25% more bits per coefficient with 62-bit modulus instead of 50-bit, n_fft should perform quite well asymptotically.

vneiger · 2024-11-13T08:57:24Z

Thoughts on supporting (smooth) non-power-of-two sizes as an alternative or complement to truncation?

Just to be sure what you mean by "(smooth) non-power-of-two sizes": taking a very specific case, if the size is just below $3^k$, would one use a radix-3 FFT with a root of unity of order $3^k$? (And more generally, only slightly overshoot the actual size, using a size that factors into small primes?)
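The "slightly overshoot" idea can be sketched as a search for the smallest size of the form $2^a 3^b$ that is at least the requested length. The function name `next_smooth23` is hypothetical, and restricting to the primes 2 and 3 is an assumption made to keep the example short; an actual transform of such a size would also need a modulus with a root of unity of that order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest m = 2^a * 3^b with m >= n.
 * Requires 1 <= n <= 2^62 so that no intermediate product overflows. */
static uint64_t next_smooth23(uint64_t n)
{
    assert(n >= 1 && n <= (UINT64_C(1) << 62));
    uint64_t best = UINT64_MAX;
    for (uint64_t p3 = 1; ; p3 *= 3) {
        uint64_t m = p3;
        while (m < n)
            m *= 2;          /* pad the power of 3 with factors of 2 */
        if (m < best)
            best = m;
        if (p3 >= n)
            break;           /* larger powers of 3 cannot improve on this */
    }
    return best;
}
```

For a requested length of 1025, this returns 1152 (that is, $2^7 \cdot 3^2$) instead of the next power of two 2048, and for a length of 700 it returns $3^6 = 729$, the "just below $3^k$" case described above.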

Do you plan to add threading? One of the weaknesses of fft_small for huge convolutions is that the FFTs are single-threaded. Also, huge FFTs should be bandwidth-constrained rather than arithmetic-constrained, especially if you have a lot of cores. Considering that you get nearly 25% more bits per coefficient with 62-bit modulus instead of 50-bit, n_fft should perform quite well asymptotically.

OK, thanks for the insight. I was not sure whether threading was an important goal for small-prime FFTs, so that is good to know. This was not really in my plans for the near future, because I have other things I would like to make progress on (notably nmod_poly_mat), and because I do not have much hands-on experience with threading, so it is not something I could do very quickly (or it could lead to poor choices or design, poor usability, poor performance, etc.). But if someone with better threading skills wants to help, I will gladly collaborate.

vneiger added 30 commits September 16, 2024 11:24

add profile for powmod

0bf0127

Merge branch 'main' into introduce_nmod_fft

c363adb

add .h file

4a887b4

fix ifndef

aee38b3

context and init code

17faaea

add profile

e738256

fix include

dcaede7

improve profile init

cd50787

rename ctx init

afa5ddc

testing init

fd24de2

fix explanations and complete test for init

211ab75

remove printf

6368823

forgot to add main

9eeedd6

dft, test passes

3fa7944

add profile

ff33533

clean things a bit

f4520c9

introducing dft32 base case

e10c29c

dft32 base case

7b605a6

cleaning things

1f236d8

testing from length 1

9bf18c7

fix

fb88c54

remove useless function argument

f6cc96c

vaguely faster with added lazy14 layer

a675b68

clean explanations

28b3276

finalize lazy14 version

b71649d

small fixes

8cd392c

tentative fix for flint_bits == 32

9fa9020

dft8 is now a macro, code generation was too unpredictable

ccd3f71

putting more args slightly slows down for large lengths...

f0587e5

macro for dft16 helps, let's see for dft32

4cf7343

vneiger added 25 commits October 27, 2024 15:40

notes about init

0e0df84

wip: use multipoint eval in test

40539de

use multipoint eval in test

8c8b08b

idft_t

b1fb674

idft_t, not tested yet

dcf2ae7

minor changes

9d8845d

progress

a872720

add files

c29a5d1

idft test passes

66523a7

idft test passes

453c068

idft in progress

82d5cbf

fix name

07a7dbe

idft in progress

f0fed07

idft in progress

f79ba6f

idft in progress

c41bcd4

idft in progress

f09bca0

idft in progress

38a5aba

minor fix

ece93dc

a bit lazier

9ef50f1

idft becoming good, remains some fine tuning to do

3b5669d

a bit lazier

614dacb

fix name

9567b15

Merge branch 'flintlib:main' into introduce_nmod_fft

42f6fe6

minor comment

c438c43

Merge branch 'introduce_nmod_fft' of github.com:vneiger/flint into in…

eaf6d5c

…troduce_nmod_fft

albinahlback reviewed Nov 9, 2024

vneiger added 2 commits November 9, 2024 15:38

remove unnecessary includes

4bb083a

add todo about tests idft any node small depths

99284b5
