
substantially slower execution time on linux than windows? #235

Closed
bwheelz36 opened this issue Aug 17, 2022 · 22 comments

Comments

@bwheelz36

Hi

Thanks for this code. We are using the Python interface for distortion correction in MRI.

I have noticed something rather curious. When I run my code on Windows, it takes around 0.7 seconds per slice to correct; on Linux, the time is ~5 seconds. Since we have to correct several hundred slices, this makes a big difference to total execution time. I have tested this on two computers, one of which has the exact same CPU as the Windows computer. I tested using fresh environments with Python 3.9 and 3.10, with the same results.

You can see the way we are calling the code around line 293 here; we are using the Plan interface.

I can just run this on Windows, so it's not urgent, but I thought it was very strange. It might be the first time anything's ever worked better on Windows 😝
I've also attached a cProfile file which you can visualize with e.g. snakeviz, but it doesn't appear too informative to me; all I can tell is that a lot of time is spent in _interfaces.py, so I guess the problem is at a lower level?
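For reference, a profile like the one attached can be generated along these lines (a minimal sketch; correct_slice and slice_data are hypothetical stand-ins for the per-slice correction routine and data in our code):

    import cProfile
    import pstats

    def correct_slice(slice_data):
        """Hypothetical stand-in for the per-slice distortion-correction routine."""
        return slice_data

    # profile a single slice correction and save the stats to a file
    cProfile.run("correct_slice(None)", "slice_profile.prof")

    # print the 15 most expensive calls; the .prof file can also be
    # opened with snakeviz for an interactive view
    stats = pstats.Stats("slice_profile.prof")
    stats.sort_stats("cumulative").print_stats(15)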

@lu1and10
Member

lu1and10 commented Aug 17, 2022

Hi

I think it is because the Linux Python wheels uploaded to PyPI are compiled with the flags -march=x86-64 -mtune=generic -msse4, see https://github.com/flatironinstitute/finufft/runs/7796400053?check_suite_focus=true#step:6:18
The wheel is generated to work on a broad range of CPU architectures. Your machine probably supports AVX2 or AVX-512; to get the full speed of your machine, you may need to check out the source repository and compile/install from source as described at https://finufft.readthedocs.io/en/latest/install.html#install-python, instead of using pip install finufft.

In a future release, @wendazhou and the team will work on CPU dispatching, so that the lower-level C++ binary behind the Python binding will use the fastest CPU instruction sets available on the user's machine.
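As a quick check of what your CPU supports before rebuilding, something like this works on Linux (a minimal sketch reading /proc/cpuinfo; it is not part of finufft):

    # report which SIMD instruction sets the CPU advertises (Linux only)
    with open("/proc/cpuinfo") as f:
        flags = set()
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
                break

    for isa in ("sse4_1", "sse4_2", "avx", "avx2", "avx512f"):
        print(f"{isa}: {'yes' if isa in flags else 'no'}")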

@ahbarnett
Collaborator

ahbarnett commented Aug 17, 2022 via email

@bwheelz36
Author

Hi,
I've run the following commands (all inside a Python virtual environment):

git clone https://github.com/flatironinstitute/finufft.git
cd finufft/
make test
make python

I've also tried copying both make.inc.linux_ICC and make.inc.manylinux to make.inc and recompiling.
However, the execution time remains at ~5 seconds per slice in all cases...
Do I need to add some additional flag? I have an Intel i7 CPU, if that's relevant...

@ahbarnett
Collaborator

ahbarnett commented Aug 18, 2022 via email

@bwheelz36
Author

Hi,

"you can time a test executable for that size to know how fast the CPU code is"

I've copied the results of both tests you ran below. We are currently using the 2D NUFFT type 3 (I think maybe we could be using type 2, but anyway that doesn't explain the Windows/Linux discrepancy).
To me these results all look reasonably similar to yours...

"could it be you're not feeding in numpy dtype = complex128?"

The data type is numpy.complex128; besides, I don't think this could explain the ~8x speed difference between Windows and Linux?

In case it's helpful, I put instructions here on how to reproduce this issue on Linux. I'm far from a NUFFT expert, so I can certainly believe my implementation isn't optimal; but again, the surprising thing to me is the different speed on different operating systems...

C++ test

(venv2) brendan@BigDog:~/python/MRI_DistortionQA/examples/finufft$ OMP_NUM_THREADS=1
(venv2) brendan@BigDog:~/python/MRI_DistortionQA/examples/finufft$ test/finufft2d_test 300 300 1e5 1e-6
test 2d type 1:
        100000 NU pts to (300,300) modes in 0.0687 s    1.45e+06 NU pts/s
        one mode: rel err in F[111,78] is 7.11e-08
test 2d type 2:
        (300,300) modes to 100000 NU pts in 0.0174 s    5.74e+06 NU pts/s
        one targ: rel err in c[50000] is 1.71e-07
test 2d type 3:
        100000 NU to 90000 NU in 0.0486 s               3.91e+06 tot NU pts/s
        one targ: rel err in F[45000] is 1.12e-07

Python test

(venv2) brendan@BigDog:~/python/MRI_DistortionQA/examples/finufft$ python python/test/run_speed_tests.py
Accuracy and speed tests for 1000000 nonuniform points and eps=1e-06 (error estimates use 5 samples per run)
finufft1d1:
    Est rel l2 err  8.47e-07
    CPU time (sec)  0.189
    tot NU pts/sec  5.3e+06

finufft1d2:
    Est rel l2 err  6.7e-07
    CPU time (sec)  0.186
    tot NU pts/sec  5.38e+06

finufft1d3:
    Est rel l2 err  5.64e-08
    CPU time (sec)  0.121
    tot NU pts/sec  1.65e+07

finufft2d1:
    Est rel l2 err  9.52e-06
    CPU time (sec)  0.135
    tot NU pts/sec  7.42e+06

finufft2d1many:
    Est rel l2 err  1.1e-05
    CPU time (sec)  0.564
    tot NU pts/sec  1.42e+07

finufft2d2:
    Est rel l2 err  4.62e-06
    CPU time (sec)  0.0831
    tot NU pts/sec  1.2e+07

finufft2d2many:
    Est rel l2 err  4.57e-06
    CPU time (sec)  0.418
    tot NU pts/sec  1.91e+07

finufft2d3:
    Est rel l2 err  1.33e-07
    CPU time (sec)  0.154
    tot NU pts/sec  1.3e+07

finufft3d1:
    Est rel l2 err  2.57e-06
    CPU time (sec)  0.332
    tot NU pts/sec  3.01e+06

finufft3d2:
    Est rel l2 err  2.19e-06
    CPU time (sec)  0.21
    tot NU pts/sec  4.76e+06

finufft3d3:
    Est rel l2 err  2.91e-08
    CPU time (sec)  0.414
    tot NU pts/sec  4.83e+06

@bwheelz36
Author

Any further suggestions on this? Otherwise I can close the issue and just run on Windows...

@ahbarnett
Collaborator

ahbarnett commented Aug 26, 2022 via email

@bwheelz36
Author

Sorry - not being a NUFFT expert, I am struggling with the nomenclature a bit.
The NUFFT is initialized like this:

  self.Nufft_Ax_Plan = Plan(3, 2, 1, 1e-06, -1)
  self.Nufft_Ax_Plan.setpts(self.xj, self.yj, None, self.sk, self.tk)
  self.Nufft_Atb_Plan = Plan(3, 2, 1, 1e-06, 1)
  self.Nufft_Atb_Plan.setpts(self.sk, self.tk, None, self.xj, self.yj)

xj, yj, sk, and tk are all 1D arrays (created from 2D geometry) of the same size. In the example I'm currently running, the size is 26208.

We call each of these interfaces 20 times as part of a least-squares optimization, so 40 times in total.
Since each slice takes ~5 seconds, this implies each call takes roughly 5/40 ≈ 0.125 seconds.

I'm struggling to figure out how to relate this back to the test parameters, sorry...

Verified that all cores are being utilized during execution.
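In case it helps, here is a minimal standalone sketch of the call pattern, timing one type 3 execute on random points of the same size (a simplification of our actual setup; the point distributions are placeholders):

    import time
    import numpy as np
    from finufft import Plan

    n = 26208  # same number of points as in our example
    rng = np.random.default_rng(0)
    xj = rng.uniform(-np.pi, np.pi, n)   # nonuniform source points
    yj = rng.uniform(-np.pi, np.pi, n)
    sk = rng.uniform(-50.0, 50.0, n)     # nonuniform target frequencies
    tk = rng.uniform(-50.0, 50.0, n)
    cj = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # complex128 strengths

    plan = Plan(3, 2, 1, 1e-06, -1)      # 2D type 3, one transform, eps=1e-6
    plan.setpts(xj, yj, None, sk, tk)

    t0 = time.perf_counter()
    fk = plan.execute(cj)
    print(f"one execute took {time.perf_counter() - t0:.4f} s")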

@ahbarnett
Collaborator

ahbarnett commented Aug 30, 2022 via email

@ahbarnett
Collaborator

ahbarnett commented Aug 30, 2022 via email

@bwheelz36
Author

Sorry I've been so slow to respond to this - I promise I will get back to it one day!

@ahbarnett
Collaborator

ahbarnett commented Oct 13, 2022 via email

@DiamonDinoia
Collaborator

DiamonDinoia commented Oct 13, 2022

One question: did you install FFTW using the package manager, or did you compile it yourself? The one from the package manager does not have AVX enabled on Linux and is substantially slower.

@bwheelz36
Author

Hey @DiamonDinoia - good question, it was a long time ago now!
Looking at the build instructions, I guess that I would have done sudo apt-get install make build-essential libfftw3-dev.
Do you think I should try again, building fftw3 from source? And I assume I would then have to update something to tell cmake where the compiled version is? (Sorry, this is getting slightly over my head!)

@DiamonDinoia
Collaborator

DiamonDinoia commented Oct 18, 2022

Yes. You should run sudo apt purge libfftw3-dev, then go to fftw.org and download the tarball. Unpack it and follow the commands as explained here, configuring with ./configure CFLAGS="-march=native -fPIC" --enable-openmp --enable-avx2 (or --enable-avx512 if supported by your CPU). Once done, run sudo make install -j. Then repeat the configuration with --enable-float (in addition to the flags used the first time), and install again with sudo make install -j. Then you can build finufft as you did before, and the performance should be better. To be clear, single- and double-precision FFTW are two separate libraries, hence FFTW needs to be built and installed twice.

@bwheelz36
Author

Hey all, thanks for all the help on this.

Just for context, let me state that it is not critical for us that our code runs fast, nor that it runs on Linux. In other words, this is somewhat of a curiosity for me rather than something of pressing importance. With that said, because I would like to respect the time of everyone who has contributed to this so far, and because I think it may be of importance for other users, I have done the following:

  • @ahbarnett: we did attempt switching the type 3 for a type 1/2; however, we only saw a moderate speed-up of ~20% (which may very well have been due to a bad implementation on our end, but with the above comment in mind, we decided to stick with the slow but definitely-working version)
  • @DiamonDinoia: thanks for your tip. I've carried out your suggestion, and this seems to explain at least part of the story, as I elaborate below:

I previously provided an example to reproduce this issue, but I appreciate it is kind of annoying to have to download data etc., so I have developed a new example which is a standalone Python script.
I have tested this example on Windows, and on Linux under various scenarios. I have yet to get it to run as fast on Linux as on Windows, although the suggestion by @DiamonDinoia did make a substantial difference. The results are outlined in the table below; these are the mean ± std of ten runs (a sketch of the timing harness is included at the end of this comment).

Case                                                    Run time (s)
Linux: pip installed finufft                            3.4 ± 0.33
Linux: built fftw from source, pip installed finufft    1.23 ± 0.16
Linux: built fftw and finufft from source               1.44 ± 0.22
Windows: pip installed finufft                          0.45 ± 0.02

You can see that:

  • Building fftw from scratch makes a substantial difference (one caveat here: the 'default fftw' run is on different hardware, because I couldn't figure out how to uninstall the compiled version; nevertheless, based on previous experience, I'm pretty confident this result would be similar on the same hardware)
  • Building finufft from scratch actually seemed to result in a slight slowdown; you can see the exact build commands I used in the links above. This was on Ubuntu 22.04
  • Windows is always substantially faster than Linux
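For anyone wanting to reproduce these numbers, the mean ± std figures can be gathered with a harness along these lines (a minimal sketch; run_example is a hypothetical stand-in for the standalone script's correction routine):

    import time
    import statistics

    def run_example():
        """Hypothetical stand-in for the standalone script's correction routine."""
        time.sleep(0.1)  # replace with the actual per-slice correction

    # time ten runs and report mean ± standard deviation
    times = []
    for _ in range(10):
        t0 = time.perf_counter()
        run_example()
        times.append(time.perf_counter() - t0)

    print(f"{statistics.mean(times):.2f} ± {statistics.stdev(times):.2f} s")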

@DiamonDinoia
Collaborator

This is totally not what I expected. I think we should check whether this is a Python issue.

Could you check the performance on Windows/Linux with make test? If I remember correctly it should print timing metrics. Am I correct, @ahbarnett?

Thanks,
Marco

@bwheelz36
Author

I can run the tests on Linux. I might need some help getting them working on Windows (I hardly use C++, and never use it on Windows).
Looking through the build instructions:

"assign the parent directories of the FFTW header file to FFTW_H_DIR, of the FFTW libraries to FFTW_LIB_DIR"

If the library is compiled successfully, you can try to run the tests. Note that your system has to fulfill the following prerequisites to this end: A Linux distribution set up via WSL (has been tested with Ubuntu 20.04 LTS from the Windows Store) and the 64bit gnu-make mentioned before.

  • this kind of sounds like the tests will run through WSL anyway, which might not be that useful for testing performance?

@lu1and10
Member

lu1and10 commented Oct 28, 2022


Hi,

I was able to install dual systems (Arch Linux and Windows) natively on a computer with an Intel(R) Xeon(R) E-2176M CPU @ 2.70GHz.

The FFTW packages on both Arch Linux and Windows are installed via pacman. I tested the following, with results below (built with the -O3 -march=native flags):

On Windows:

# export OMP_NUM_THREADS=1
Administrator@Win11-2022BDZKV MSYS ~/projects/finufft/test
# ./finufft2d_test 1e2 1e2 1e4 1e-12 0 2 0.0 1e-11
max threads: 1
test 2d type 1:
        10000 NU pts to (100,100) modes in 0.00477 s    2.1e+06 NU pts/s
        one mode: rel err in F[37,26] is 1.69e-13
        dirft2d: rel l2-err of result F is 2.36e-12
test 2d type 2:
        (100,100) modes to 10000 NU pts in 0.0031 s     3.22e+06 NU pts/s
        one targ: rel err in c[5000] is 9.54e-14
        dirft2d: rel l2-err of result c is 2.32e-12
test 2d type 3:
        10000 NU to 10000 NU in 0.0209 s                9.58e+05 tot NU pts/s
        one targ: rel err in F[5000] is 5.07e-13
        dirft2d: rel l2-err of result F is 2.65e-12

On Archlinux:

[root@libin test]# export OMP_NUM_THREADS=1
[root@libin test]# ./finufft2d_test 1e2 1e2 1e4 1e-12 0 2 0.0 1e-11
max threads: 1
test 2d type 1:
        10000 NU pts to (100,100) modes in 0.00315 s     3.18e+06 NU pts/s
        one mode: rel err in F[37,26] is 1.66e-13
        dirft2d: rel l2-err of result F is 2.35e-12
test 2d type 2:
        (100,100) modes to 10000 NU pts in 0.0028 s      3.57e+06 NU pts/s
        one targ: rel err in c[5000] is 1.13e-12
        dirft2d: rel l2-err of result c is 2.38e-12
test 2d type 3:
        10000 NU to 10000 NU in 0.0118 s                 1.69e+06 tot NU pts/s
        one targ: rel err in F[5000] is 1.34e-12
        dirft2d: rel l2-err of result F is 2.75e-12

It seems that on Linux the C++ binary is faster than on Windows, within a factor of 2 for 2D type 3.

Is it possible to test whether the Python function scipy.sparse.linalg.lsqr used in your script has similar performance on Windows and Linux?
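To make the comparison concrete, a standalone timing along these lines could be run on both systems (a minimal sketch; the matrix size, density, and iteration limit are arbitrary placeholders, not taken from your script):

    import time
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import lsqr

    # build a random sparse least-squares problem (sizes are placeholders)
    rng = np.random.default_rng(0)
    A = sp.random(26208, 26208, density=1e-4, format="csr", random_state=0)
    b = rng.standard_normal(26208)

    # time a fixed number of lsqr iterations
    t0 = time.perf_counter()
    x = lsqr(A, b, iter_lim=20)[0]
    print(f"lsqr took {time.perf_counter() - t0:.3f} s")
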
Thanks.

@bwheelz36
Author

bwheelz36 commented Oct 31, 2022

Aha! @lu1and10, good idea.
I've reproduced that example using pynufft instead of finufft, and I actually see similar behavior:

OS         Run time (s)
Windows    2.18 ± 0.18
Linux      4.23 ± 1.36

So you are right, this is not a finufft issue - or at the very least, not only a finufft issue.

My apologies, everyone. As mentioned at the top, my initial profiling showed a lot of time spent in _interfaces.py; I mistakenly believed this was part of finufft, but it turns out it is part of scipy.
I'll keep digging and update you all - now I just want to find out what is happening!

@bwheelz36
Author

Opened an issue on scipy.

@bwheelz36
Author

OK - I will close this for now, but I'll update if I ever get to the bottom of it...
