substantially slower execution time on linux than windows? #235
Hi, I think it is because the linux Python wheels uploaded to PyPI are compiled with the flags -march=x86-64 -mtune=generic -msse4, see https://github.com/flatironinstitute/finufft/runs/7796400053?check_suite_focus=true#step:6:18. The wheel is generated to work on a broad range of CPU architectures. Your machine probably supports AVX2 or AVX512; to get the full speed of your machine, you may need to check out the source repository and compile/install from source as described at https://finufft.readthedocs.io/en/latest/install.html#install-python, instead of using `pip install finufft`. In a future release, @wendazhou and the team will work on CPU dispatching, so that the Python binding's lower-level C++ binary will utilize the fastest CPU instruction sets on the user's machine.
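(A quick, generic way to see which instruction sets your CPU supports on Linux - a minimal sketch reading /proc/cpuinfo, nothing FINUFFT-specific:)

```python
# Minimal sketch (Linux/x86 only): list which SIMD instruction sets the CPU
# advertises, to judge whether the generic SSE4-only PyPI wheel is the
# bottleneck. Reads /proc/cpuinfo directly; no third-party packages needed.
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()
for isa in ("sse4_1", "sse4_2", "avx", "avx2", "avx512f"):
    print(f"{isa:8s} {'yes' if isa in flags else 'no'}")
```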
Hi - compiling and installing locally into python isn't hard. As Libin
said, see here:
https://finufft.readthedocs.io/en/latest/install.html#install-python
and let us know how it goes.
Best, Alex
Hi, I've done the following commands (all inside a Python virtual environment):
git clone https://github.com/flatironinstitute/finufft.git
cd finufft/
make test
make python
I've also tried copying both make.inc.linux_ICC and make.inc.manylinux to make.inc and recompiling. However, the execution time remains at ~5 seconds per slice in all cases... Do I need to be adding some additional flag? I have an Intel i7 CPU if that's relevant...
Hi, no, the default flags should be sufficient on i7: (from the makefile)
-O3 -funroll-loops -march=native -fcx-limited-range
Don't use manylinux since it's not avx2 :(
How many k-points and how big is your uniform grid?
Eg, say M=1e5 pts and N=300^2 grid, you can time a test executable for that
size to know how fast the CPU code is:
OMP_NUM_THREADS=1 test/finufft2d_test 300 300 1e5 1e-6
test 2d type 1:
100000 NU pts to (300,300) modes in 0.0167 s 5.99e+06 NU pts/s
one mode: rel err in F[111,78] is 7.11e-08
test 2d type 2:
(300,300) modes to 100000 NU pts in 0.015 s 6.65e+06 NU pts/s
one targ: rel err in c[50000] is 1.71e-07
etc
This is for ryzen, similar to i7 per thread. Not a great test since it's
too short, but gives an idea.
If your slices have the same k-pts you can test the "many interface" speed.
Eg for 100 slices:
OMP_NUM_THREADS=8 test/finufft2dmany_test 100 300 300 1e5 1e-6
test 2d1 many vs repeated single: ------------------------------------
ntr=100: 100000 NU pts to (300,300) modes in 0.342 s 2.92e+07 NU pts/s
one mode: rel err in F[111,78] of trans#99 is 4.14e-07
100 of: 100000 NU pts to (300,300) modes in 0.717 s 1.4e+07 NU pts/s
speedup T_FINUFFT2D1 / T_finufft2d1many = 2.09
consistency check: sup ( ||f_many-f||_2 / ||f||_2 ) = 8.19e-17
test 2d2 many vs repeated single: ------------------------------------
ntr=100: (300,300) modes to 100000 NU pts in 0.294 s 3.4e+07 NU pts/s
one targ: rel err in c[50000] of trans#99 is 4.67e-08
100 of: (300,300) modes to 100000 NU pts in 0.6 s 1.67e+07 NU pts/s
speedup T_FINUFFT2D2 / T_finufft2d2many = 2.04
etc
If your speeds (eg compare throughputs in NU pts/s) don't match this, you have a problem.
First `make test -j` and try the above on your chip. Don't give it all threads.
If the py speed is much less, could it be you're not feeding in numpy dtype=complex128 arrays, so it's making unnecessary copies of I/O arrays on the py side?
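(A minimal numpy guard of the kind meant here - a generic sketch, not FINUFFT-specific API:)

```python
import numpy as np

def as_finufft_input(a):
    """Return `a` as a contiguous complex128 array; a no-op if it already is,
    otherwise an explicit copy, so any conversion cost is visible up front."""
    return np.ascontiguousarray(a, dtype=np.complex128)
```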
You should also test the basic timings from py: (here I include my 2d1 and
2d2 only)
OMP_NUM_THREADS=8 python python/test/run_speed_tests.py
Accuracy and speed tests for 1000000 nonuniform points and eps=1e-06 (error
estimates use 5 samples per run)
...
finufft2d1:
Est rel l2 err 1.6e-05
CPU time (sec) 0.0551
tot NU pts/sec 1.82e+07
finufft2d1many:
Est rel l2 err 1.58e-05
CPU time (sec) 0.289
tot NU pts/sec 2.77e+07
finufft2d2:
Est rel l2 err 2.47e-06
CPU time (sec) 0.0528
tot NU pts/sec 1.89e+07
finufft2d2many:
Est rel l2 err 3.27e-06
CPU time (sec) 0.282
tot NU pts/sec 2.84e+07
...
Let me know your results for the above and how it works out. Alex
Hi,
I've copied the results of both tests you ran below. We are currently using 2D nufft type 3 (I think maybe we could be using type 2, but anyway that doesn't explain the windows/linux discrepancy). The data type is numpy.complex128 - plus, I don't think this could explain the ~8x speed difference between windows/linux? In case it's helpful, I put instructions here on how to reproduce this issue in linux. I'm far from a NUFFT expert so I can certainly believe my implementation isn't optimal - but again, the surprising thing to me is the different speed on different operating systems...
C++ test:
(venv2) brendan@BigDog:~/python/MRI_DistortionQA/examples/finufft$ OMP_NUM_THREADS=1
(venv2) brendan@BigDog:~/python/MRI_DistortionQA/examples/finufft$ test/finufft2d_test 300 300 1e5 1e-6
test 2d type 1:
100000 NU pts to (300,300) modes in 0.0687 s 1.45e+06 NU pts/s
one mode: rel err in F[111,78] is 7.11e-08
test 2d type 2:
(300,300) modes to 100000 NU pts in 0.0174 s 5.74e+06 NU pts/s
one targ: rel err in c[50000] is 1.71e-07
test 2d type 3:
100000 NU to 90000 NU in 0.0486 s 3.91e+06 tot NU pts/s
one targ: rel err in F[45000] is 1.12e-07
Python test:
(venv2) brendan@BigDog:~/python/MRI_DistortionQA/examples/finufft$ python python/test/run_speed_tests.py
Accuracy and speed tests for 1000000 nonuniform points and eps=1e-06 (error estimates use 5 samples per run)
finufft1d1:
Est rel l2 err 8.47e-07
CPU time (sec) 0.189
tot NU pts/sec 5.3e+06
finufft1d2:
Est rel l2 err 6.7e-07
CPU time (sec) 0.186
tot NU pts/sec 5.38e+06
finufft1d3:
Est rel l2 err 5.64e-08
CPU time (sec) 0.121
tot NU pts/sec 1.65e+07
finufft2d1:
Est rel l2 err 9.52e-06
CPU time (sec) 0.135
tot NU pts/sec 7.42e+06
finufft2d1many:
Est rel l2 err 1.1e-05
CPU time (sec) 0.564
tot NU pts/sec 1.42e+07
finufft2d2:
Est rel l2 err 4.62e-06
CPU time (sec) 0.0831
tot NU pts/sec 1.2e+07
finufft2d2many:
Est rel l2 err 4.57e-06
CPU time (sec) 0.418
tot NU pts/sec 1.91e+07
finufft2d3:
Est rel l2 err 1.33e-07
CPU time (sec) 0.154
tot NU pts/sec 1.3e+07
finufft3d1:
Est rel l2 err 2.57e-06
CPU time (sec) 0.332
tot NU pts/sec 3.01e+06
finufft3d2:
Est rel l2 err 2.19e-06
CPU time (sec) 0.21
tot NU pts/sec 4.76e+06
finufft3d3:
Est rel l2 err 2.91e-08
CPU time (sec) 0.414
tot NU pts/sec 4.83e+06
Any further suggestions on this? Otherwise I can close the issue and just run on windows...
Hi,
Well, your linux timings look similar to mine, maybe half the speed.
Please note that OMP_NUM_THREADS=1 has to be on the *same line* as the
command - the shell has no memory from line to line unless you export the
variable. Hence you are using all threads.
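(In Python you can get the same effect by setting the variable in the process environment before the import - a sketch, assuming the OpenMP runtime reads it when the library loads:)

```python
import os

# Cap OpenMP threads before any OpenMP-using library is loaded;
# equivalent to prefixing the shell command with OMP_NUM_THREADS=1.
os.environ["OMP_NUM_THREADS"] = "1"

import finufft  # imported after the env tweak on purpose
```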
What I was hoping you'd do is replace the transform sizes and number of vectors with your actual parameters from your application, not use my (arbitrary) choices. I have no idea what image size or number of k-space points etc you have in MRI :) The goal was to compare their timing via these simple testers with what you're observing.
Sorry, I haven't had time to install your pkgs and try it out that way.
But a factor of 8 is a huge one, should be easy to debug.
Are you sure it's not due to number of threads being given to finufft or
your python driver?
Best, Alex
Sorry - not being a NUFFT expert, I am struggling with the nomenclature a bit. The NUFFT is initialized like this:
self.Nufft_Ax_Plan = Plan(3, 2, 1, 1e-06, -1)
self.Nufft_Ax_Plan.setpts(self.xj, self.yj, None, self.sk, self.tk)
self.Nufft_Atb_Plan = Plan(3, 2, 1, 1e-06, 1)
self.Nufft_Atb_Plan.setpts(self.sk, self.tk, None, self.xj, self.yj)
xj, yj, sk and tk are all 1D arrays (created from 2D geometry) of the same size. In the example I'm currently running, the size is 26208. We call each of these interfaces 20 times as part of a least squares optimization - so 40 times in total. Since each slice is taking ~5 seconds, this implies each call is taking 5/40 = ~0.125 seconds. I'm struggling to figure out how to relate this back to the test parameters, sorry... Verified that all cores are being utilized during execution.
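(For reference, a self-contained sketch that times one type-3 call at roughly these sizes - random points rather than the real geometry, and the k-space extents are guesses:)

```python
import time
import numpy as np
import finufft

# Time a single 2D type-3 transform with 26208 points on each side,
# mimicking the Plan(3, 2, 1, 1e-06, -1) setup quoted above.
M = 26208
rng = np.random.default_rng(0)
xj = rng.uniform(-np.pi, np.pi, M)
yj = rng.uniform(-np.pi, np.pi, M)
sk = rng.uniform(-40.0, 40.0, M)   # illustrative k-space extents
tk = rng.uniform(-40.0, 40.0, M)

plan = finufft.Plan(3, 2, 1, 1e-06, -1)
plan.setpts(xj, yj, None, sk, tk)
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

t0 = time.perf_counter()
f = plan.execute(c)
print(f"one type-3 call: {time.perf_counter() - t0:.4f} s")
```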
Hi, thanks. Since you're doing type 3, the underlying FFT size (in each of
the 2 dimensions) is determined by the space-bandwidth product (range of xj
times range of sk). So it is possible that is unusually large, but it
shouldn't be in MRI, and doesn't explain windows/linux difference.
I agree: you should be using type 2 for Ax and type 1 ("adjoint") for Atb,
since the image is a regular grid.
It will speed you up by 3x or more.
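(A hedged sketch of that switch, assuming the image sits on a regular N1 x N2 grid; the grid size and point coordinates below are placeholders, not the poster's real data:)

```python
import numpy as np
import finufft

N1, N2, M = 162, 162, 26208            # illustrative grid dims and point count
rng = np.random.default_rng(0)
kx = rng.uniform(-np.pi, np.pi, M)     # nonuniform coords, rescaled to [-pi, pi)
ky = rng.uniform(-np.pi, np.pi, M)

ax_plan = finufft.Plan(2, (N1, N2), 1, 1e-06, -1)   # type 2: grid -> NU pts (Ax)
ax_plan.setpts(kx, ky)
atb_plan = finufft.Plan(1, (N1, N2), 1, 1e-06, 1)   # type 1: NU pts -> grid (Atb)
atb_plan.setpts(kx, ky)

image = rng.standard_normal((N1, N2)) + 0j          # complex128 grid data
samples = ax_plan.execute(image)                    # forward model A @ x
back = atb_plan.execute(samples)                    # adjoint A^H @ b (unscaled)
```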
It is strange that your number of k-space points is exactly the same as
your number of image points.
Normally the image would be 256*256 or something.
Another thing you could do is iterate on all slices together using the
"many" interface. That might be another 3x speedup.
But your timing is definitely 10x too slow, since it implies 4e5 total
pts/sec.
(26208*2 / .125)
Even on one thread I get 10x this speed. You should be getting 5e6 pts/sec,
so 40 transforms done in 0.5 sec.
Eg test codes use a typical space-bandwidth product:
OMP_NUM_THREADS=1 test/finufft2dmany_test 40 162 162 26208 1e-6
...
40 of: 26208 NU to 26244 NU in 0.553 s 3.77e+06 tot NU pts/s
...
(that's using repeated "single" calls, like you do)
Are you positive it is FINUFFT that is slower? Can you set debug=1 in your
Py code and report timings from your use case? That would show us which
internal step is taking longer.
eg:
OMP_NUM_THREADS=1 test/finufft2d_test 162 162 26208 1e-6 1
...
test 2d type 3:
[finufft_makeplan] new plan: FINUFFT version 2.1.0 .................
[finufft_makeplan] 2d3: ntrans=1
M=26208 N=26244
X1=3.14 C1=2 S1=81 D1=138 gam1=1.06678 nf1=216
X2=3.14 C2=-3 S2=81 D2=-40.5 gam2=1.06668 nf2=216
[finufft_setpts t3] widcen, batch 0.00GB alloc: 9.8e-05 s
[finufft_setpts t3] phase & deconv factors: 0.00842 s
[finufft_setpts t3] sort (didSort=1): 0.000161 s
[finufft_setpts t3] inner t2 plan & setpts: 0.00128 s
[finufft_execute t3] start ntrans=1 (1 batches, bsize=1)...
[finufft_execute t3] done. tot prephase: 7.01e-05 s
tot spread: 0.00187 s
tot type 2: 0.00418 s
tot deconvolve: 3.5e-05 s
26208 NU to 26244 NU in 0.0161 s 3.26e+06 tot NU pts/s
Note it's 0.016 sec total, nearly 10x faster than 0.125 sec.
S1 and S2 are the space-bandwidth products, BTW.
Good luck, Alex
Another simple idea to try: set nthreads=1. Now that we know you're doing small transforms, many cores don't help and sometimes hinder.
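(A sketch, assuming the Python Plan constructor forwards option keywords such as nthreads and debug to the planner - check the finufft Python docs for the exact spelling:)

```python
import finufft

# Pin the transform to one thread and turn on internal timing printouts;
# keyword option names are assumed from the FINUFFT opts struct.
plan = finufft.Plan(3, 2, 1, 1e-06, -1, nthreads=1, debug=1)
```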
Sorry I've been so slow to respond to this. I promise I will get back to it one day!
No worries. We just want users to be happy, so bugs/slowdowns are of
interest...
One question: did you install fftw using the package manager, or did you compile it yourself? The one from the package manager does not have AVX enabled on linux and is substantially slower.
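(One quick way to check this from Python - a ctypes sketch; the fftw_version/fftw_cc symbols are part of FFTW 3's public exports, but treat the details here as an assumption:)

```python
import ctypes
import ctypes.util

# Hedged sketch: libfftw3 exports its version string (which lists the SIMD
# sets it was built with, e.g. "fftw-3.3.10-sse2-avx2-avx2_128") and the
# compiler flags (fftw_cc) as global C strings. Reading them shows whether a
# distro's FFTW has AVX enabled. Library/symbol names assumed from FFTW 3.x.
path = ctypes.util.find_library("fftw3") or "libfftw3.so"
lib = ctypes.CDLL(path)

def exported_cstring(name):
    # the symbols are const char[] arrays: take their address, read as C string
    addr = ctypes.addressof(ctypes.c_char.in_dll(lib, name))
    return ctypes.cast(addr, ctypes.c_char_p).value.decode()

print(exported_cstring("fftw_version"))
print(exported_cstring("fftw_cc"))
```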
Hey @DiamonDinoia - good question, it was a long time ago now!
Yes, you should do
Hey all, thanks for all the help on this. Just for context, let me state that it is not critical for us that our code runs fast, nor that it runs on linux. In other words, this is somewhat of a curiosity for me rather than something of pressing importance. With that said, because I would like to respect the time of everyone who has contributed to this so far, and because I think it may be of importance for other users, I have done the following:
I previously provided an example to reproduce this issue, but I appreciate it is kind of annoying to have to download data etc, so I have developed a new example which is a standalone python script.
You can see that:
This is totally not what I expected. I think we should check if this is a python issue. Could you check the performance on windows/linux with `make test`? If I remember correctly it should print metrics like time. Am I correct, @ahbarnett? Thanks,
I can run the tests on linux. I might need some help to get them working in windows (I hardly use C++ and never use it in windows).
Hi, I was able to natively install dual systems (archlinux and windows) on a computer with an Intel(R) Xeon(R) E-2176M CPU @ 2.70GHz. Both fftw packages on archlinux and windows are installed via pacman. I tested the following, with the results below (with -O3 -march=native flags).
On Windows:
On Archlinux:
It seems that on linux the C++ binary is faster than on windows, within a factor of 2 for type-3 2D. Is it possible to test that the python function
Aha! @lu1and10, good idea.
So you are right, this is not a finufft issue - or at the very least, not only a finufft issue. My apologies, everyone. As mentioned at the top, my initial profiling showed a lot of time spent in
open an issue on scipy
OK - I will close this for now, but I will update if I ever get to the bottom of it...
Hi,
Thanks for this code. We are using the python interface for distortion correction in MRI.
I have noticed something rather curious. When I run my code on windows, it takes around 0.7 seconds per slice to correct. On linux, the time is ~5 seconds. Since we have to correct several hundred slices, this makes a big difference to total execution time. I have tested this on two computers, one of which has the exact same CPU as the windows computer. I tested using fresh environments with python 3.9 and 3.10, with the same results.
You can see the way we are calling the code around line 293 here; we are using the Plan interface.
I can just run this on windows so it's not urgent, but I thought it was very strange. It might be the first time anything's ever worked better on windows 😝
I've also attached a cProfile file I ran which you can visualize with e.g. snakeviz, but this doesn't appear too informative to me - all I can tell is that it's spending a lot of time in _interfaces.py, so I guess the problem is at a lower level?
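(For anyone reproducing this, a generic sketch of that kind of profiling; `correct_slice` here is a stand-in for the per-slice routine, not a real function in the project:)

```python
import cProfile
import pstats

def correct_slice():
    # stand-in workload for the poster's per-slice correction routine
    sum(i * i for i in range(10**6))

# Dump stats to a file, then print the ten most expensive call sites.
cProfile.run("correct_slice()", "slice.prof")
pstats.Stats("slice.prof").sort_stats("cumulative").print_stats(10)
```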