Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 47 pleafletfinder #69

Merged
merged 21 commits into from
Oct 17, 2018
Merged

Conversation

iparask
Copy link
Collaborator

@iparask iparask commented Sep 30, 2018

Fixes #47

Changes made in this Pull Request:

  • Adding new LeafletFInder class which analyzes in parallel a frame and reports the two leaflets.
  • Added a test file.
  • Add Timings functionality
  • Traverse over multiple frames.

This PR can be used as a place holder. I would like to initiate a discussion as to as the written code is based on the requirements. I would like to report two major differences from other cases. This code uses Dask Bags to execute instead of delayed. In addition, parallelization is done on a frame basis. This was the most efficient implementation we had in the paper (code linked in the respective issue).

Please review/ comment as necessary and code can grow.

Giannis

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@orbeckst
Copy link
Member

orbeckst commented Oct 2, 2018

@iparask sklearn is not getting installed. https://travis-ci.org/MDAnalysis/pmda/jobs/435384534#L1101

For MDAnalysis, sklearn is optional. Is BallTree required for your code or can you make it optional?

@iparask
Copy link
Collaborator Author

iparask commented Oct 2, 2018

The most efficient one uses BallTrees. It also reduces the memory footprint a lot. Let me include KDTrees instead, which are part of Scipy. Otherwise we will have to go with cdist or something similar.
@orbeckst, I missed the travis output. Sorry about that

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial comments inline. Getting meaningful tests to pass is the next step as far as I can tell.

Can you make sklearn.BallTree optional, i.e., use something from scipy (which we always install) and use BallTree if sklearn is available?

pmda/leaflet.py Outdated
# Released under the GNU Public Licence, v2 or any higher version
"""
LeafletFInder Analysis tool --- :mod:`pmda.leaflet`
=======================================
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs longer underscores or sphinx will barff

pmda/leaflet.py Show resolved Hide resolved
pmda/leaflet.py Outdated
import networkx as nx

import MDAnalysis as mda
from sklearn.neighbors import BallTree
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required or can it be made optional or replaced by something from scipy?

At the moment the tests fail because sklearn is not installed.

We can decide to install sklearn if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use the KDTree we've got inside MDAnalysis, which does periodic boundaries. See: https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/pkdtree.py

pmda/leaflet.py Show resolved Hide resolved
pmda/leaflet.py Outdated

Attributes
----------
_results : list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If users are not supposed to use it (underscore prefixed) then don't document.

If users should be able to work with it, don't use underscore.

pmda/leaflet.py Outdated
else:
train = np.vstack([window[0],window[1]])
test = np.vstack([window[0],window[1]])
tree = BallTree(train, leaf_size=40)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be replaced with something from scipy and we use sklearn if available?

assert_array_almost_equal)

import MDAnalysis
from MDAnalysisTests.datafiles import (PSF, DCD)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask which atoms to select from the large trajectory? Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely "name P*" – sorry, can't check right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VOD555 what kind of lipids are in these trajectories and how do you select the headgroup phosphorous?

Copy link
Collaborator

@VOD555 VOD555 Oct 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are POPG and POPE lipids in these trajectories, and you can use
u.select_atoms('name P and resname POPG') for phosphorous in POPG/POPE, or
u.select_atoms('name P') for all phosphorous in lipids.

@orbeckst
Copy link
Member

orbeckst commented Oct 2, 2018

Is BallTree a lot better?

MDAnalysis also recently got capped_distance which could perhaps be used instead. We now use this for neighborsearching. Importantly, it takes periodicity into account.

@richardjgowers can you comment on capped_distance() vs BallTree() – did this come up during @ayushsuhane 's benchmarking during GSOC?

@richardjgowers
Copy link
Member

@orbeckst a BallTree is just a KDTree (I think...). grid benched faster than KDTree nearly all of the time, so if we can use that it would be better

@iparask
Copy link
Collaborator Author

iparask commented Oct 3, 2018

@orbeckst @richardjgowers I have a wiki page at least with some information for KD-Trees and BallTrees. Also, I remember @drelu had created a plot comparing different trees, I will put it here as soon as I find it.

@iparask
Copy link
Collaborator Author

iparask commented Oct 3, 2018

Here is Andre's experiment link

I also found this benchmarking blog

@kain88-de
Copy link
Member

kain88-de commented Oct 3, 2018

@iparask we had a student (@ayushsuhane) putting in a lot of work in the summer to efficiently calculate distances. Have a look at his blog. He implemented several algorithms in MDAnalysis to efficiently calculate distances with a cutoff. He has found that the cell-list algorithm yielded the best results. It's available in MDAnalysis as lib.distances.capped_distance in the develop branch. Please have look at these functions. All functions included in MDAnalysis also take care of periodicity for triclinic boxes.

@orbeckst
Copy link
Member

orbeckst commented Oct 5, 2018

Currently only as simple PEP8 check fails.

You can run the PEP8 checks locally, too (together with coverage):

pytest  --pep8 -v --cov pmda  pmda

(needs pytest-pep8 and pytest-cov).

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @iparask , see quick comments inline.

pmda/leaflet.py Outdated Show resolved Hide resolved
pmda/leaflet.py Outdated Show resolved Hide resolved
else:
train = np.vstack([window[0], window[1]])
test = np.vstack([window[0], window[1]])
tree = cKDTree(train, leafsize=40)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment here that this does not respect PBC; possible TODO: work with capped_distances() instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by PBC?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://en.wikipedia.org/wiki/Periodic_boundary_conditions

like Pacman geometry. Particles are simulated within a finite box of size [Lx, Ly, Lz], Particles near x=0 and x=Lx are actually very close to each other.

So the leaflet is actually an infinite sheet and has no "edge". Without considering periodic boundaries you'll identify the leaflets, but you'll miss some of the edges in the graph representation, which might be important for some analysis

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see. May I suggest the following. Have a working implementation without taking into account PBCs and without breaking the trajectory into multiple blocks. As soon as that is done and this PR can be accepted, merge it so that we do not miss the release this is intended for.

Next version of the implementation will take into account PBCs and also try to see how to parallelize over trajectory blocks.

@orbeckst , @richardjgowers do you agree with that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with an initial version that does not do PBC if

  • the docs mention this
  • we have an issue open for it

Breaking trajectory into blocks can also wait in my opinion.

----
At the moment, this class has far fewer features than the serial
version :class:`MDAnalysis.analysis.leaflet.LeafletFinder`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional notes to add:

  • Currently, periodic boundaries are not taken into account.
  • The calculation is parallelized on a per-frame basis [Paraskevakos2018]_; at the moment, no parallelization over trajectory blocks is performed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

pmda/leaflet.py Outdated Show resolved Hide resolved
pmda/leaflet.py Outdated
# Partition the data based on a 2-dimensional partitioning
for i in range(1, matrix_size + 1, part_size):
for j in range(1, matrix_size + 1, part_size):
arraged_coord.append(([atoms[i - 1:i - 1 + part_size],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, you are partitioning a (M, M) matrix A into sub-blocks. A_{ij} will represent a contact.

Because we evaluate the atomgroup with itself, this is symmetric, A_ij = A_ji. Couldn't you save on the order of M^2/2 evaluations by only taking the diagonal and lower triangle blocks?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me give it a try and see what is happening with the tests. This is an exact transport from the code I used for the paper. I will fix it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you have a look at optimizing this?

If it should be optimized, add it to the #72 issue.

For this PR we can put the code as it was for the paper.

pmda/leaflet.py Outdated
comp = [g for g in subgraphs]
return comp

def _single_frame(self, scheduler_kwargs, n_blocks, cutoff=15.0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why n_blocks – it's not used at the moment. Omit?

(Unless you immediately want to parallelize over blocks, too)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parallelization happens in a single frame. blocks specify the number of partitioning the data over.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove it and change the way I used it in the code

pmda/leaflet.py Outdated
number of tasks to start, if `-1` use number of logical cpu cores.
This argument will be ignored when the distributed scheduler is
used
n_blocks : int, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment that this is currently ignore: no parallelization over trajectory blocks. Or are you already working on adding the parallelization over frames, too?

pmda/leaflet.py Outdated
if n_jobs == -1:
n_jobs = cpu_count()

if n_blocks is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COuld be removed if n_blocks not used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is being used to produce the correct number of tasks. Because it is optional, if it is not set we set it manually based on the number of jobs. Now that I think about it n_jobs is not really needed since I parallelize based on the number of blocks.

pmda/leaflet.py Show resolved Hide resolved

# partial connected components

subgraphs = nx.connected_components(graph)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc networkx is wholly Python

This function in MDAnalysis:

https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/_cutil.pyx#L322

Is a c++ function which does the same, pass it node and edge indices and it will return a list of the indices in each subgraph

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand your comment correctly, you are suggesting to change from netwrokx to the function MDAnalysis implements, right?

When I started the implementation of the Leaflet Finder for my paper I used the class that exists in MDAnalysis (link). There networkx is being used. In addition, I was very cautious when I built my version to use highly used python packages for things I did not want to reimplement.

Did you reimplement a Connected component calculator or did that pre-exist Leaflet Finder?

Also, I am extremely bad with abbreviations and I have to google them all the time. Can you expand iirc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, iirc = if I remember correctly.

Yeah I think the MDAnalysis function will be faster. It was added after the original leaflet finder class.

If you want to keep this the same for a closer comparison to the original class that's fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardjgowers did you benchmark find_fragments() against networkx?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also, we should update the serial LeafletFinder with capped_distances and find_fragments...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we use networkx for the initial version because this is what was described in the paper.

We can then open an new issue to optimize parallel leafletfinder and add PBC.

setup.py Outdated Show resolved Hide resolved
@orbeckst
Copy link
Member

@iparask there's still an import of BallTree somewhere:

==================================== ERRORS ====================================
__________________ ERROR collecting pmda/test/test_leaflet.py __________________
ImportError while importing test module '/home/travis/build/MDAnalysis/pmda/pmda/test/test_leaflet.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
pmda/test/test_leaflet.py:10: in <module>
    from pmda import leaflet
pmda/leaflet.py:28: in <module>
    from sklearn.neighbors import BallTree
E   ModuleNotFoundError: No module named 'sklearn'

Once the tests pass and you need a final review, please ping me. Thanks!

@codecov
Copy link

codecov bot commented Oct 12, 2018

Codecov Report

Merging #69 into master will decrease coverage by 1.34%.
The diff coverage is 94.64%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
- Coverage   99.27%   97.92%   -1.35%     
==========================================
  Files           7        8       +1     
  Lines         274      386     +112     
  Branches       26       48      +22     
==========================================
+ Hits          272      378     +106     
- Misses          1        4       +3     
- Partials        1        4       +3
Impacted Files Coverage Δ
pmda/leaflet.py 94.64% <94.64%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d78fc7...ce5d471. Read the comment docs.

@iparask
Copy link
Collaborator Author

iparask commented Oct 12, 2018

@orbeckst, this is a request for a final review. I have added all parts as asked. I believe it is as close as I can see it getting. Travis fails with warnings mainly because I have overridden and changed the methods of the base class in terms of input parameters.

@orbeckst
Copy link
Member

We'll have to fix the pylint warnings or exclude them; if we merge it like this, all future CI will fail.

Run

pylint pmda

locally while testing. I get:

Using config file /Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/pmda/.pylintrc
************* Module pmda.leaflet
pmda/leaflet.py:73: [W0231(super-init-not-called), LeafletFinder.__init__] __init__ method from base class 'ParallelAnalysisBase' is not called
pmda/leaflet.py:115: [E1102(not-callable), LeafletFinder._find_connected_components] cKDTree is not callable
pmda/leaflet.py:159: [W0221(arguments-differ), LeafletFinder._single_frame] Parameters differ from overridden '_single_frame' method
pmda/leaflet.py:192: [W1619(old-division), LeafletFinder._single_frame] division w/o __future__ statement
pmda/leaflet.py:229: [W0221(arguments-differ), LeafletFinder.run] Parameters differ from overridden 'run' method

-----------------------------------
Your code has been rated at 9.77/10

Most of these warnings look fixable to me

  • pmda/leaflet.py:115: [E1102(not-callable), LeafletFinder._find_connected_components] cKDTree is not callable — ???
  • pmda/leaflet.py:192: [W1619(old-division), LeafletFinder._single_frame] division w/o future statement — add the future import or replace with //
  • pmda/leaflet.py:159: [W0221(arguments-differ), LeafletFinder._single_frame] Parameters differ from overridden '_single_frame' method — I think you still need to be able to accept all arguments that the standard _single_frame() could and just ignore them.
  • pmda/leaflet.py:229: [W0221(arguments-differ), LeafletFinder.run] Parameters differ from overridden 'run' method — We might have decided on the different call signature during the review process. However, if at all possible we should keep the call generic signature intact.

Disable the other pylint warning in code by bracketing the code with # pylint: disable=.../# pylint: enable=... blocks; see for example

# pylint: disable=redefined-builtin

@orbeckst
Copy link
Member

On a second look, the mismatching call signatures might be a bit more difficult.... start with disabling the warnings in the code. I'll then comment in a code review.

(One way or another, the linter has to pass.)

@orbeckst
Copy link
Member

Looking good – is there a way to increase test coverage? It will decrease our 99% to 96%...

@orbeckst
Copy link
Member

Actually, just checked the diff – there's not much you can do for coverage.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me; this is the code as we had it for the paper.

Follow-up in #72 .

pmda/leaflet.py Outdated
# Partition the data based on a 2-dimensional partitioning
for i in range(1, matrix_size + 1, part_size):
for j in range(1, matrix_size + 1, part_size):
arraged_coord.append(([atoms[i - 1:i - 1 + part_size],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you have a look at optimizing this?

If it should be optimized, add it to the #72 issue.

For this PR we can put the code as it was for the paper.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add yourself to AUTHORS etc – see comments.

I'll merge then.

@@ -21,6 +21,7 @@ Enhancements
* add add timing for _conclude and _prepare (Issue #49)
* add parallel particle-particle RDF calculation module pmda.rdf (Issue #41)
* add readonly_attributes context manager to ParallelAnalysisBase
* add parallel implementation of Leaflet Finder (Issue #47)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add your GitHub name at the top of 0.2.0!

add yourself to AUTHORS and theauthors line in doc/conf.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I will!

@iparask
Copy link
Collaborator Author

iparask commented Oct 17, 2018

Can you help me read the output of the codecov bot you included? I think that if we can test using the distributed scheduler the coverage will increase.

@orbeckst
Copy link
Member

orbeckst commented Oct 17, 2018

@ianmkenney @iparask in PR #66 we will automatically run all possible parallelizations so no need to do this manually here.

Just add yourself as an author and we're done here.

EDIT: sorry, autocompleted the wrong username

@orbeckst orbeckst merged commit a718250 into MDAnalysis:master Oct 17, 2018
@orbeckst
Copy link
Member

Thank you @iparask , congratulations to your first contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants