substantially slower execution time on linux than windows? #235
Hi, I think it is because the linux Python wheels uploaded to PyPI are compiled with the flags -march=x86-64 -mtune=generic -msse4, see https://github.com/flatironinstitute/finufft/runs/7796400053?check_suite_focus=true#step:6:18. The wheel is generated to work on a broad range of CPU architectures. Your machine probably supports AVX2 or AVX512; to get the full speed of your machine, you may need to check out the source repository and compile/install from source as described at https://finufft.readthedocs.io/en/latest/install.html#install-python, instead of using `pip install finufft`. In a future release, @wendazhou and the team will work on CPU dispatching, so that the Python binding's lower-level C++ binary will utilize the fastest CPU instruction sets on the user's machine.
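(A quick, generic way to see which instruction sets your CPU supports on Linux - a minimal sketch reading /proc/cpuinfo, nothing FINUFFT-specific:)

```python
# Minimal sketch (Linux/x86 only): list which SIMD instruction sets the CPU
# advertises, to judge whether the generic SSE4-only PyPI wheel is the
# bottleneck. Reads /proc/cpuinfo directly; no third-party packages needed.
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()
for isa in ("sse4_1", "sse4_2", "avx", "avx2", "avx512f"):
    print(f"{isa:8s} {'yes' if isa in flags else 'no'}")
```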
Hi - compiling and installing locally into python isn't hard. As Libin
said, see here:
https://finufft.readthedocs.io/en/latest/install.html#install-python
and let us know how it goes.
Best, Alex
Hi, I've done the following commands (all inside a Python virtual environment):
git clone https://github.com/flatironinstitute/finufft.git
cd finufft/
make test
make python
I've also tried copying both make.inc.linux_ICC and make.inc.manylinux to make.inc and recompiling. However, the execution time remains at ~5 seconds per slice in all cases... Do I need to be adding some additional flag? I have an Intel i7 CPU if that's relevant...
Hi, no, the default flags should be sufficient on i7: (from the makefile)
-O3 -funroll-loops -march=native -fcx-limited-range
Don't use manylinux since it's not avx2 :(
How many k-points and how big is your uniform grid?
Eg, say M=1e5 pts and N=300^2 grid, you can time a test executable for that
size to know how fast the CPU code is:
OMP_NUM_THREADS=1 test/finufft2d_test 300 300 1e5 1e-6
test 2d type 1:
100000 NU pts to (300,300) modes in 0.0167 s 5.99e+06 NU pts/s
one mode: rel err in F[111,78] is 7.11e-08
test 2d type 2:
(300,300) modes to 100000 NU pts in 0.015 s 6.65e+06 NU pts/s
one targ: rel err in c[50000] is 1.71e-07
etc
This is for ryzen, similar to i7 per thread. Not a great test since it's
too short, but gives an idea.
If your slices have the same k-pts you can test the "many interface" speed.
Eg for 100 slices:
OMP_NUM_THREADS=8 test/finufft2dmany_test 100 300 300 1e5 1e-6
test 2d1 many vs repeated single: ------------------------------------
ntr=100: 100000 NU pts to (300,300) modes in 0.342 s 2.92e+07 NU pts/s
one mode: rel err in F[111,78] of trans#99 is 4.14e-07
100 of: 100000 NU pts to (300,300) modes in 0.717 s 1.4e+07 NU pts/s
speedup T_FINUFFT2D1 / T_finufft2d1many = 2.09
consistency check: sup ( ||f_many-f||_2 / ||f||_2 ) = 8.19e-17
test 2d2 many vs repeated single: ------------------------------------
ntr=100: (300,300) modes to 100000 NU pts in 0.294 s 3.4e+07 NU pts/s
one targ: rel err in c[50000] of trans#99 is 4.67e-08
100 of: (300,300) modes to 100000 NU pts in 0.6 s 1.67e+07 NU pts/s
speedup T_FINUFFT2D2 / T_finufft2d2many = 2.04
etc
If your speeds (eg compare throughputs in NU pts/s) don't match this, you have a problem.
First `make test -j` and try the above on your chip. Don't give it all threads.
If the py speed is much less, could it be you're not feeding in numpy dtype=complex128 arrays, so it's making unnecessary copies of I/O arrays on the py side?
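(A minimal numpy guard of the kind meant here - a generic sketch, not FINUFFT-specific API:)

```python
import numpy as np

def as_finufft_input(a):
    """Return `a` as a contiguous complex128 array; a no-op if it already is,
    otherwise an explicit copy, so any conversion cost is visible up front."""
    return np.ascontiguousarray(a, dtype=np.complex128)
```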
You should also test the basic timings from py: (here I include my 2d1 and
2d2 only)
OMP_NUM_THREADS=8 python python/test/run_speed_tests.py
Accuracy and speed tests for 1000000 nonuniform points and eps=1e-06 (error
estimates use 5 samples per run)
...
finufft2d1:
Est rel l2 err 1.6e-05
CPU time (sec) 0.0551
tot NU pts/sec 1.82e+07
finufft2d1many:
Est rel l2 err 1.58e-05
CPU time (sec) 0.289
tot NU pts/sec 2.77e+07
finufft2d2:
Est rel l2 err 2.47e-06
CPU time (sec) 0.0528
tot NU pts/sec 1.89e+07
finufft2d2many:
Est rel l2 err 3.27e-06
CPU time (sec) 0.282
tot NU pts/sec 2.84e+07
...
Let me know your results for the above and how it works out. Alex
Hi,
I've copied the results of both tests you ran below. We are currently using 2D nufft type 3 (I think maybe we could be using type 2, but anyway that doesn't explain the windows/linux discrepancy). The data type is numpy.complex128 - plus, I don't think this could explain the ~8x speed difference between windows/linux? In case it's helpful, I put instructions here on how to reproduce this issue in linux. I'm far from a NUFFT expert so I can certainly believe my implementation isn't optimal - but again, the surprising thing to me is the different speed on different operating systems...
C++ test:
(venv2) brendan@BigDog:~/python/MRI_DistortionQA/examples/finufft$ OMP_NUM_THREADS=1
(venv2) brendan@BigDog:~/python/MRI_DistortionQA/examples/finufft$ test/finufft2d_test 300 300 1e5 1e-6
test 2d type 1:
100000 NU pts to (300,300) modes in 0.0687 s 1.45e+06 NU pts/s
one mode: rel err in F[111,78] is 7.11e-08
test 2d type 2:
(300,300) modes to 100000 NU pts in 0.0174 s 5.74e+06 NU pts/s
one targ: rel err in c[50000] is 1.71e-07
test 2d type 3:
100000 NU to 90000 NU in 0.0486 s 3.91e+06 tot NU pts/s
one targ: rel err in F[45000] is 1.12e-07
Python test:
(venv2) brendan@BigDog:~/python/MRI_DistortionQA/examples/finufft$ python python/test/run_speed_tests.py
Accuracy and speed tests for 1000000 nonuniform points and eps=1e-06 (error estimates use 5 samples per run)
finufft1d1:
Est rel l2 err 8.47e-07
CPU time (sec) 0.189
tot NU pts/sec 5.3e+06
finufft1d2:
Est rel l2 err 6.7e-07
CPU time (sec) 0.186
tot NU pts/sec 5.38e+06
finufft1d3:
Est rel l2 err 5.64e-08
CPU time (sec) 0.121
tot NU pts/sec 1.65e+07
finufft2d1:
Est rel l2 err 9.52e-06
CPU time (sec) 0.135
tot NU pts/sec 7.42e+06
finufft2d1many:
Est rel l2 err 1.1e-05
CPU time (sec) 0.564
tot NU pts/sec 1.42e+07
finufft2d2:
Est rel l2 err 4.62e-06
CPU time (sec) 0.0831
tot NU pts/sec 1.2e+07
finufft2d2many:
Est rel l2 err 4.57e-06
CPU time (sec) 0.418
tot NU pts/sec 1.91e+07
finufft2d3:
Est rel l2 err 1.33e-07
CPU time (sec) 0.154
tot NU pts/sec 1.3e+07
finufft3d1:
Est rel l2 err 2.57e-06
CPU time (sec) 0.332
tot NU pts/sec 3.01e+06
finufft3d2:
Est rel l2 err 2.19e-06
CPU time (sec) 0.21
tot NU pts/sec 4.76e+06
finufft3d3:
Est rel l2 err 2.91e-08
CPU time (sec) 0.414
tot NU pts/sec 4.83e+06
Any further suggestions on this? Otherwise I can close the issue and just run on windows...
Hi,
Well, your linux timings look similar to mine, maybe half the speed.
Please note that OMP_NUM_THREADS=1 has to be on the *same line* as the
command - the shell has no memory from line to line unless you export the
variable. Hence you are using all threads.
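(In Python you can get the same effect by setting the variable in the process environment before the import - a sketch, assuming the OpenMP runtime reads it when the library loads:)

```python
import os

# Cap OpenMP threads before any OpenMP-using library is loaded;
# equivalent to prefixing the shell command with OMP_NUM_THREADS=1.
os.environ["OMP_NUM_THREADS"] = "1"

import finufft  # imported after the env tweak on purpose
```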
What I was hoping you'd do is replace the transform sizes and number of vectors with your actual parameters from your application, not use my (arbitrary) choices. I have no idea what image size or number of k-space points etc you have in MRI :) The goal was to compare their timing via these simple testers with what you're observing.
Sorry, I haven't had time to install your pkgs and try it out that way.
But a factor of 8 is a huge one, should be easy to debug.
Are you sure it's not due to number of threads being given to finufft or
your python driver?
Best, Alex
Sorry - not being a NUFFT expert, I am struggling with the nomenclature a bit. The NUFFT is initialized like this:
self.Nufft_Ax_Plan = Plan(3, 2, 1, 1e-06, -1)
self.Nufft_Ax_Plan.setpts(self.xj, self.yj, None, self.sk, self.tk)
self.Nufft_Atb_Plan = Plan(3, 2, 1, 1e-06, 1)
self.Nufft_Atb_Plan.setpts(self.sk, self.tk, None, self.xj, self.yj)
xj, yj, sk and tk are all 1D arrays (created from 2D geometry) of the same size. In the example I'm currently running, the size is 26208. We call each of these interfaces 20 times as part of a least squares optimization - so 40 times in total. Since each slice is taking ~5 seconds, this implies each call is taking 5/40 = ~0.125 seconds. I'm struggling to figure out how to relate this back to the test parameters, sorry... Verified that all cores are being utilized during execution.
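(For reference, a self-contained sketch that times one type-3 call at roughly these sizes - random points rather than the real geometry, and the k-space extents are guesses:)

```python
import time
import numpy as np
import finufft

# Time a single 2D type-3 transform with 26208 points on each side,
# mimicking the Plan(3, 2, 1, 1e-06, -1) setup quoted above.
M = 26208
rng = np.random.default_rng(0)
xj = rng.uniform(-np.pi, np.pi, M)
yj = rng.uniform(-np.pi, np.pi, M)
sk = rng.uniform(-40.0, 40.0, M)   # illustrative k-space extents
tk = rng.uniform(-40.0, 40.0, M)

plan = finufft.Plan(3, 2, 1, 1e-06, -1)
plan.setpts(xj, yj, None, sk, tk)
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

t0 = time.perf_counter()
f = plan.execute(c)
print(f"one type-3 call: {time.perf_counter() - t0:.4f} s")
```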
Hi, thanks. Since you're doing type 3, the underlying FFT size (in each of
the 2 dimensions) is determined by the space-bandwidth product (range of xj
times range of sk). So it is possible that is unusually large, but it
shouldn't be in MRI, and doesn't explain windows/linux difference.
I agree: you should be using type 2 for Ax and type 1 ("adjoint") for Atb,
since the image is a regular grid.
It will speed you up by 3x or more.
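(A hedged sketch of that switch, assuming the image sits on a regular N1 x N2 grid; the grid size and point coordinates below are placeholders, not the poster's real data:)

```python
import numpy as np
import finufft

N1, N2, M = 162, 162, 26208            # illustrative grid dims and point count
rng = np.random.default_rng(0)
kx = rng.uniform(-np.pi, np.pi, M)     # nonuniform coords, rescaled to [-pi, pi)
ky = rng.uniform(-np.pi, np.pi, M)

ax_plan = finufft.Plan(2, (N1, N2), 1, 1e-06, -1)   # type 2: grid -> NU pts (Ax)
ax_plan.setpts(kx, ky)
atb_plan = finufft.Plan(1, (N1, N2), 1, 1e-06, 1)   # type 1: NU pts -> grid (Atb)
atb_plan.setpts(kx, ky)

image = rng.standard_normal((N1, N2)) + 0j          # complex128 grid data
samples = ax_plan.execute(image)                    # forward model A @ x
back = atb_plan.execute(samples)                    # adjoint A^H @ b (unscaled)
```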
It is strange that your number of k-space points is exactly the same as
your number of image points.
Normally the image would be 256*256 or something.
Another thing you could do is iterate on all slices together using the
"many" interface. That might be another 3x speedup.
But your timing is definitely 10x too slow, since it implies 4e5 total
pts/sec.
(26208*2 / .125)
Even on one thread I get 10x this speed. You should be getting 5e6 pts/sec,
so 40 transforms done in 0.5 sec.
Eg test codes use a typical space-bandwidth product:
OMP_NUM_THREADS=1 test/finufft2dmany_test 40 162 162 26208 1e-6
...
40 of: 26208 NU to 26244 NU in 0.553 s 3.77e+06 tot NU pts/s
...
(that's using repeated "single" calls, like you do)
Are you positive it is FINUFFT that is slower? Can you set debug=1 in your
Py code and report timings from your use case? That would show us which
internal step is taking longer.
eg:
OMP_NUM_THREADS=1 test/finufft2d_test 162 162 26208 1e-6 1
...
test 2d type 3:
[finufft_makeplan] new plan: FINUFFT version 2.1.0 .................
[finufft_makeplan] 2d3: ntrans=1
M=26208 N=26244
X1=3.14 C1=2 S1=81 D1=138 gam1=1.06678 nf1=216
X2=3.14 C2=-3 S2=81 D2=-40.5 gam2=1.06668 nf2=216
[finufft_setpts t3] widcen, batch 0.00GB alloc: 9.8e-05 s
[finufft_setpts t3] phase & deconv factors: 0.00842 s
[finufft_setpts t3] sort (didSort=1): 0.000161 s
[finufft_setpts t3] inner t2 plan & setpts: 0.00128 s
[finufft_execute t3] start ntrans=1 (1 batches, bsize=1)...
[finufft_execute t3] done. tot prephase: 7.01e-05 s
tot spread: 0.00187 s
tot type 2: 0.00418 s
tot deconvolve: 3.5e-05 s
26208 NU to 26244 NU in 0.0161 s 3.26e+06 tot NU pts/s
Note it's 0.016 sec total, nearly 10x faster than 0.125 sec.
S1 and S2 are the space-bandwidth products, BTW.
Good luck, Alex
Another simple idea to try: set nthreads=1. Now that we know you're doing small transforms, many cores don't help and sometimes hinder.
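(A sketch, assuming the Python Plan constructor forwards option keywords such as nthreads and debug to the planner - check the finufft Python docs for the exact spelling:)

```python
import finufft

# Pin the transform to one thread and turn on internal timing printouts;
# keyword option names are assumed from the FINUFFT opts struct.
plan = finufft.Plan(3, 2, 1, 1e-06, -1, nthreads=1, debug=1)
```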
Sorry I've been so slow to respond to this. I promise I will get back to it one day!
No worries. We just want users to be happy, so bugs/slowdowns are of
interest...
One question: did you install fftw using the package manager, or did you compile it yourself? The one from the package manager does not have AVX enabled on linux and is substantially slower.
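(One quick way to check this from Python - a ctypes sketch; the fftw_version/fftw_cc symbols are part of FFTW 3's public exports, but treat the details here as an assumption:)

```python
import ctypes
import ctypes.util

# Hedged sketch: libfftw3 exports its version string (which lists the SIMD
# sets it was built with, e.g. "fftw-3.3.10-sse2-avx2-avx2_128") and the
# compiler flags (fftw_cc) as global C strings. Reading them shows whether a
# distro's FFTW has AVX enabled. Library/symbol names assumed from FFTW 3.x.
path = ctypes.util.find_library("fftw3") or "libfftw3.so"
lib = ctypes.CDLL(path)

def exported_cstring(name):
    # the symbols are const char[] arrays: take their address, read as C string
    addr = ctypes.addressof(ctypes.c_char.in_dll(lib, name))
    return ctypes.cast(addr, ctypes.c_char_p).value.decode()

print(exported_cstring("fftw_version"))
print(exported_cstring("fftw_cc"))
```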
Hey @DiamonDinoia - good question, it was a long time ago now!
Yes, you should do
Hey all, thanks for all the help on this. Just for context, let me state that it is not critical for us that our code runs fast, nor that it runs on linux. In other words, this is somewhat of a curiosity for me rather than something of pressing importance. With that said, because I would like to respect the time of everyone who has contributed to this so far, and because I think it may be of importance for other users, I have done the following:
I previously provided an example to reproduce this issue, but I appreciate it is kind of annoying to have to download data etc, so I have developed a new example which is a standalone python script.
You can see that:
This is totally not what I expected. I think we should check if this is a python issue. Could you check the performance on windows/linux with `make test`? If I remember correctly it should print metrics like time. Am I correct, @ahbarnett? Thanks,
I can run the tests on linux. I might need some help to get them working in windows (I hardly use C++ and never use it in windows).
Hi, I was able to natively install dual systems (archlinux and windows) on a computer with an Intel(R) Xeon(R) E-2176M CPU @ 2.70GHz. Both fftw packages on archlinux and windows are installed via pacman. I tested the following, with the results below (with -O3 -march=native flags).
On Windows:
On Archlinux:
It seems that on linux the C++ binary is faster than on windows, within a factor of 2 for type-3 2D. Is it possible to test that the python function
Aha! @lu1and10, good idea.
So you are right, this is not a finufft issue - or at the very least, not only a finufft issue. My apologies, everyone. As mentioned at the top, my initial profiling showed a lot of time spent in
open an issue on scipy
OK - I will close this for now, but I will update if I ever get to the bottom of it...
Hi,
Thanks for this code. We are using the python interface for distortion correction in MRI.
I have noticed something rather curious. When I run my code on windows, it takes around 0.7 seconds per slice to correct. On linux, the time is ~5 seconds. Since we have to correct several hundred slices, this makes a big difference to total execution time. I have tested this on two computers, one of which has the exact same CPU as the windows computer. I tested using fresh environments with python 3.9 and 3.10, with the same results.
You can see the way we are calling the code around line 293 here; we are using the Plan interface.
I can just run this on windows so it's not urgent, but I thought it was very strange. It might be the first time anything's ever worked better on windows 😝
I've also attached a cProfile file I ran which you can visualize with e.g. snakeviz, but this doesn't appear too informative to me - all I can tell is that it's spending a lot of time in _interfaces.py, so I guess the problem is at a lower level?
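(For anyone reproducing this, a generic sketch of that kind of profiling; `correct_slice` here is a stand-in for the per-slice routine, not a real function in the project:)

```python
import cProfile
import pstats

def correct_slice():
    # stand-in workload for the poster's per-slice correction routine
    sum(i * i for i in range(10**6))

# Dump stats to a file, then print the ten most expensive call sites.
cProfile.run("correct_slice()", "slice.prof")
pstats.Stats("slice.prof").sort_stats("cumulative").print_stats(10)
```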