Bad Performance of Parallelization with On-the-fly Transformation #144

yuxuanzhuang · 2020-09-18T16:11:23Z

Expected behaviour

Analysis of a Universe with on-the-fly transformation scales good (reasonable).

Actual behaviour

The scaling performance is really bad even with two cores.

Code

import MDAnalysis as mda
from MDAnalysis import transformations as trans
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['LONG_TRAJ']) #  9000 frames

fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

n_jobs = [1, 2, 4, 8, 16, 32, 64]

rmsd = parallel_rmsd(u.atoms, u.atoms)
rmsd.run(n_blocks=nj,
               n_jobs=nj) #  timeit

Reason

In some Transformations includes numpy.dot which itself is multi-threaded. So the cores are oversubscribed.

Possible solution

define NUM_THREADS=1 for numpy (https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads). which is surprisingly faster even for serial (single-core) performance.
use cupy(https://cupy.dev/) to leverage the GPU power. (only replacing the numpy.dot operation of the Transformation)

Benchmarking result

Benchmarking system:
- AMD EPYC 7551 32-Core Processor
- RTX 2080 Ti
- cephfs file system

Currently version of MDAnalysis:

(run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.0.0 dev
(run python -c "import pmda; print(pmda.__version__)")
(run python -c "import dask; print(dask.__version__)")

The text was updated successfully, but these errors were encountered:

orbeckst · 2020-09-18T17:02:02Z

Your results are interesting.

What is the single blue bar "RMSD" result? Is this running the code in serial but allow multi threaded operations? I don't understand why it improves with n_cores/

Interestingly, cupy does not do better than single threaded numpy. This reminds me of results from a REU where Robert Delgado found the same http://doi.org/10.6084/m9.figshare.3823293.v1 because the data transfer to/from GPU was expensive.

yuxuanzhuang · 2020-09-18T17:36:54Z

The blue bar is the RMSD without any Transformation, just as a reference.

yuxuanzhuang · 2020-09-18T17:41:53Z

So ideally we would like to only use one thread for numpy here. But since it's a global setting, I am not sure it can be set locally.

orbeckst · 2020-09-18T17:44:15Z

If you set it as an environment variable from Python, does it work?

Is there a with block (context manager) available to do it?

yuxuanzhuang · 2020-09-19T16:41:56Z

Actually there is one. threadpoolctl.
I am even able to use it as a decorator. The issue is, ideally, it should be able to decorate _single_frame (so everything inside would be limit to one thread), but I am not sure if there's a way to only do it once (e.g. inheritance from the Base class)

For the Transformation case, I guess it shouldn't be limited to only the parallel code, as it also improves (a lot) for serial code. So I am starting a PR (MDAnalysis/mdanalysis#2950) for that.

Also, I should double-check the benchmark with another benchmarking system.

orbeckst · 2021-05-04T19:42:51Z

@yuxuanzhuang have you tried PMDA again since PR MDAnalysis/mdanalysis#2950 got merged?

yuxuanzhuang · 2021-05-12T07:22:48Z

EDIT:
Okay, just ignore the following benchmark. I realize I was using the current pmda master branch, which does not support on-the-fly transformations (and just ignore it when reconstructing the Universe). Maybe we should push #132 forward since we will soon have a beta 2.0.0 MDA version release.

Sorry, I need to find some time to look into it. It is really weird to see parallel RMSD performs better with a single core... maybe I messed up something.

from MDAnalysis.analysis.rms import RMSD as serial_rmsd
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['SHORT_TRAJ'])
fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

rmsd = serial_rmsd(u.atoms, u.atoms)
%time rmsd.run()

CPU times: user 20.6 s, sys: 961 ms, total: 21.5 s
Wall time: 19.7 s

rmsd = parallel_rmsd(u.atoms, u.atoms)
%time rmsd.run(n_blocks=1, n_jobs=1)

CPU times: user 11.4 s, sys: 0 ns, total: 11.4 s
Wall time: 11.4 s

yuxuanzhuang · 2021-05-12T16:12:12Z

The performance in our cluster is pretty bumpy recently so I am using my NUC for benchmarking instead (with the code above), here is the result.

The bottleneck from transformations no longer exists.

orbeckst · 2021-05-12T16:30:45Z

Which versions of PMDA and MDA did you use for the above benchmark that shows similar scaling for RMSD and RMSD+fit+rot trans?

yuxuanzhuang · 2021-05-12T17:25:11Z

Which versions of PMDA and MDA did you use for the above benchmark that shows similar scaling for RMSD and RMSD+fit+rot trans?

The develop branch of MDA and PR #132 branch of PMDA.

yuxuanzhuang mentioned this issue Sep 19, 2020

Limit numpy thread usage for Transformation classes MDAnalysis/mdanalysis#2950

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad Performance of Parallelization with On-the-fly Transformation #144

Bad Performance of Parallelization with On-the-fly Transformation #144

yuxuanzhuang commented Sep 18, 2020 •

edited

Loading

orbeckst commented Sep 18, 2020

yuxuanzhuang commented Sep 18, 2020

yuxuanzhuang commented Sep 18, 2020

orbeckst commented Sep 18, 2020

yuxuanzhuang commented Sep 19, 2020 •

edited

Loading

orbeckst commented May 4, 2021

yuxuanzhuang commented May 12, 2021 •

edited

Loading

yuxuanzhuang commented May 12, 2021

orbeckst commented May 12, 2021

yuxuanzhuang commented May 12, 2021

Bad Performance of Parallelization with On-the-fly Transformation #144

Bad Performance of Parallelization with On-the-fly Transformation #144

Comments

yuxuanzhuang commented Sep 18, 2020 • edited Loading

Expected behaviour

Actual behaviour

Code

Reason

Possible solution

Benchmarking result

Currently version of MDAnalysis:

orbeckst commented Sep 18, 2020

yuxuanzhuang commented Sep 18, 2020

yuxuanzhuang commented Sep 18, 2020

orbeckst commented Sep 18, 2020

yuxuanzhuang commented Sep 19, 2020 • edited Loading

orbeckst commented May 4, 2021

yuxuanzhuang commented May 12, 2021 • edited Loading

yuxuanzhuang commented May 12, 2021

orbeckst commented May 12, 2021

yuxuanzhuang commented May 12, 2021

yuxuanzhuang commented Sep 18, 2020 •

edited

Loading

yuxuanzhuang commented Sep 19, 2020 •

edited

Loading

yuxuanzhuang commented May 12, 2021 •

edited

Loading