Fix 47 pleafletfinder #69

iparask · 2018-09-30T23:13:56Z

Fixes #47

Changes made in this Pull Request:

Adding new LeafletFInder class which analyzes in parallel a frame and reports the two leaflets.
Added a test file.
Add Timings functionality
Traverse over multiple frames.

This PR can be used as a place holder. I would like to initiate a discussion as to as the written code is based on the requirements. I would like to report two major differences from other cases. This code uses Dask Bags to execute instead of delayed. In addition, parallelization is done on a frame basis. This was the most efficient implementation we had in the paper (code linked in the respective issue).

Please review/ comment as necessary and code can grow.

Giannis

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

Bringing PMDA matser changes here before continuing development

sync

orbeckst · 2018-10-02T18:22:20Z

@iparask sklearn is not getting installed. https://travis-ci.org/MDAnalysis/pmda/jobs/435384534#L1101

For MDAnalysis, sklearn is optional. Is BallTree required for your code or can you make it optional?

iparask · 2018-10-02T18:29:20Z

The most efficient one uses BallTrees. It also reduces the memory footprint a lot. Let me include KDTrees instead, which are part of Scipy. Otherwise we will have to go with cdist or something similar.
@orbeckst, I missed the travis output. Sorry about that

orbeckst

Initial comments inline. Getting meaningful tests to pass is the next step as far as I can tell.

Can you make sklearn.BallTree optional, i.e., use something from scipy (which we always install) and use BallTree if sklearn is available?

orbeckst · 2018-10-02T18:22:51Z

pmda/leaflet.py

+# Released under the GNU Public Licence, v2 or any higher version
+"""
+LeafletFInder Analysis tool --- :mod:`pmda.leaflet`
+=======================================


needs longer underscores or sphinx will barff

pmda/leaflet.py

orbeckst · 2018-10-02T18:24:39Z

pmda/leaflet.py

+import networkx as nx
+
+import MDAnalysis as mda
+from sklearn.neighbors import BallTree


Is this required or can it be made optional or replaced by something from scipy?

At the moment the tests fail because sklearn is not installed.

We can decide to install sklearn if needed.

This should use the KDTree we've got inside MDAnalysis, which does periodic boundaries. See: https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/pkdtree.py

pmda/leaflet.py

orbeckst · 2018-10-02T18:26:04Z

pmda/leaflet.py

+
+        Attributes
+        ----------
+        _results : list


If users are not supposed to use it (underscore prefixed) then don't document.

If users should be able to work with it, don't use underscore.

orbeckst · 2018-10-02T18:26:54Z

pmda/leaflet.py

+        else:
+            train = np.vstack([window[0],window[1]])
+            test  = np.vstack([window[0],window[1]])
+        tree = BallTree(train, leaf_size=40)


Can this be replaced with something from scipy and we use sklearn if available?

orbeckst · 2018-10-02T18:31:27Z

pmda/test/test_leaflet.py

+                                   assert_array_almost_equal)
+
+import MDAnalysis
+from MDAnalysisTests.datafiles import (PSF, DCD)


These datafiles do not include a membrane.

Use the MDAnalysis tests for a single frame https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/testsuite/MDAnalysisTests/analysis/test_leaflet.py#L30 to compare results and the files https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/testsuite/MDAnalysisTests/analysis/test_rdf_s.py#L31 contain a trajectory with a membrane protein in a membrane.

May I ask which atoms to select from the large trajectory? Thank you!

Most likely "name P*" – sorry, can't check right now.

@VOD555 what kind of lipids are in these trajectories and how do you select the headgroup phosphorous?

There are POPG and POPE lipids in these trajectories, and you can use
u.select_atoms('name P and resname POPG') for phosphorous in POPG/POPE, or
u.select_atoms('name P') for all phosphorous in lipids.

orbeckst · 2018-10-02T18:47:27Z

Is BallTree a lot better?

MDAnalysis also recently got capped_distance which could perhaps be used instead. We now use this for neighborsearching. Importantly, it takes periodicity into account.

@richardjgowers can you comment on capped_distance() vs BallTree() – did this come up during @ayushsuhane 's benchmarking during GSOC?

richardjgowers · 2018-10-02T19:22:30Z

@orbeckst a BallTree is just a KDTree (I think...). grid benched faster than KDTree nearly all of the time, so if we can use that it would be better

iparask · 2018-10-03T13:38:37Z

@orbeckst @richardjgowers I have a wiki page at least with some information for KD-Trees and BallTrees. Also, I remember @drelu had created a plot comparing different trees, I will put it here as soon as I find it.

iparask · 2018-10-03T14:10:06Z

Here is Andre's experiment link

I also found this benchmarking blog

kain88-de · 2018-10-03T14:25:36Z

@iparask we had a student (@ayushsuhane) putting in a lot of work in the summer to efficiently calculate distances. Have a look at his blog. He implemented several algorithms in MDAnalysis to efficiently calculate distances with a cutoff. He has found that the cell-list algorithm yielded the best results. It's available in MDAnalysis as lib.distances.capped_distance in the develop branch. Please have look at these functions. All functions included in MDAnalysis also take care of periodicity for triclinic boxes.

orbeckst · 2018-10-05T18:33:51Z

Currently only as simple PEP8 check fails.

You can run the PEP8 checks locally, too (together with coverage):

pytest  --pep8 -v --cov pmda  pmda

(needs pytest-pep8 and pytest-cov).

orbeckst

Hi @iparask , see quick comments inline.

pmda/leaflet.py

orbeckst · 2018-10-05T18:42:05Z

pmda/leaflet.py

+        else:
+            train = np.vstack([window[0], window[1]])
+            test = np.vstack([window[0], window[1]])
+        tree = cKDTree(train, leafsize=40)


Please add a comment here that this does not respect PBC; possible TODO: work with capped_distances() instead

What do you mean by PBC?

https://en.wikipedia.org/wiki/Periodic_boundary_conditions

like Pacman geometry. Particles are simulated within a finite box of size [Lx, Ly, Lz], Particles near x=0 and x=Lx are actually very close to each other.

So the leaflet is actually an infinite sheet and has no "edge". Without considering periodic boundaries you'll identify the leaflets, but you'll miss some of the edges in the graph representation, which might be important for some analysis

Okay, I see. May I suggest the following. Have a working implementation without taking into account PBCs and without breaking the trajectory into multiple blocks. As soon as that is done and this PR can be accepted, merge it so that we do not miss the release this is intended for.

Next version of the implementation will take into account PBCs and also try to see how to parallelize over trajectory blocks.

@orbeckst , @richardjgowers do you agree with that?

I am ok with an initial version that does not do PBC if

the docs mention this

we have an issue open for it

Breaking trajectory into blocks can also wait in my opinion.

orbeckst · 2018-10-05T18:45:30Z

pmda/leaflet.py

+    ----
+    At the moment, this class has far fewer features than the serial
+    version :class:`MDAnalysis.analysis.leaflet.LeafletFinder`.
+


Additional notes to add:

Currently, periodic boundaries are not taken into account.

The calculation is parallelized on a per-frame basis [Paraskevakos2018]_; at the moment, no parallelization over trajectory blocks is performed.

pmda/leaflet.py

orbeckst · 2018-10-05T19:00:48Z

pmda/leaflet.py

+        # Partition the data based on a 2-dimensional partitioning
+        for i in range(1, matrix_size + 1, part_size):
+            for j in range(1, matrix_size + 1, part_size):
+                arraged_coord.append(([atoms[i - 1:i - 1 + part_size],


If I understand this correctly, you are partitioning a (M, M) matrix A into sub-blocks. A_{ij} will represent a contact.

Because we evaluate the atomgroup with itself, this is symmetric, A_ij = A_ji. Couldn't you save on the order of M^2/2 evaluations by only taking the diagonal and lower triangle blocks?

Let me give it a try and see what is happening with the tests. This is an exact transport from the code I used for the paper. I will fix it

Did you have a look at optimizing this?

If it should be optimized, add it to the #72 issue.

For this PR we can put the code as it was for the paper.

orbeckst · 2018-10-05T19:04:00Z

pmda/leaflet.py

+        comp = [g for g in subgraphs]
+        return comp
+
+    def _single_frame(self, scheduler_kwargs, n_blocks, cutoff=15.0):


Why n_blocks – it's not used at the moment. Omit?

(Unless you immediately want to parallelize over blocks, too)

The parallelization happens in a single frame. blocks specify the number of partitioning the data over.

I will remove it and change the way I used it in the code

orbeckst · 2018-10-05T19:04:29Z

pmda/leaflet.py

+            number of tasks to start, if `-1` use number of logical cpu cores.
+            This argument will be ignored when the distributed scheduler is
+            used
+        n_blocks : int, optional


Comment that this is currently ignore: no parallelization over trajectory blocks. Or are you already working on adding the parallelization over frames, too?

orbeckst · 2018-10-05T19:05:15Z

pmda/leaflet.py

+        if n_jobs == -1:
+            n_jobs = cpu_count()
+
+        if n_blocks is None:


COuld be removed if n_blocks not used.

It is being used to produce the correct number of tasks. Because it is optional, if it is not set we set it manually based on the number of jobs. Now that I think about it n_jobs is not really needed since I parallelize based on the number of blocks.

pmda/leaflet.py

richardjgowers · 2018-10-10T18:26:07Z

pmda/leaflet.py

+
+        # partial connected components
+
+        subgraphs = nx.connected_components(graph)


iirc networkx is wholly Python

This function in MDAnalysis:

https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/_cutil.pyx#L322

Is a c++ function which does the same, pass it node and edge indices and it will return a list of the indices in each subgraph

If I understand your comment correctly, you are suggesting to change from netwrokx to the function MDAnalysis implements, right?

When I started the implementation of the Leaflet Finder for my paper I used the class that exists in MDAnalysis (link). There networkx is being used. In addition, I was very cautious when I built my version to use highly used python packages for things I did not want to reimplement.

Did you reimplement a Connected component calculator or did that pre-exist Leaflet Finder?

Also, I am extremely bad with abbreviations and I have to google them all the time. Can you expand iirc?

Sorry, iirc = if I remember correctly.

Yeah I think the MDAnalysis function will be faster. It was added after the original leaflet finder class.

If you want to keep this the same for a closer comparison to the original class that's fine

@richardjgowers did you benchmark find_fragments() against networkx?

(Also, we should update the serial LeafletFinder with capped_distances and find_fragments...)

I suggest we use networkx for the initial version because this is what was described in the paper.

We can then open an new issue to optimize parallel leafletfinder and add PBC.

setup.py

orbeckst · 2018-10-11T21:40:17Z

@iparask there's still an import of BallTree somewhere:

==================================== ERRORS ====================================
__________________ ERROR collecting pmda/test/test_leaflet.py __________________
ImportError while importing test module '/home/travis/build/MDAnalysis/pmda/pmda/test/test_leaflet.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
pmda/test/test_leaflet.py:10: in <module>
    from pmda import leaflet
pmda/leaflet.py:28: in <module>
    from sklearn.neighbors import BallTree
E   ModuleNotFoundError: No module named 'sklearn'

Once the tests pass and you need a final review, please ping me. Thanks!

codecov · 2018-10-12T14:37:38Z

Codecov Report

Merging #69 into master will decrease coverage by 1.34%.
The diff coverage is 94.64%.

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
- Coverage   99.27%   97.92%   -1.35%     
==========================================
  Files           7        8       +1     
  Lines         274      386     +112     
  Branches       26       48      +22     
==========================================
+ Hits          272      378     +106     
- Misses          1        4       +3     
- Partials        1        4       +3

Impacted Files	Coverage Δ
pmda/leaflet.py	`94.64% <94.64%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d78fc7...ce5d471. Read the comment docs.

iparask · 2018-10-12T18:59:47Z

@orbeckst, this is a request for a final review. I have added all parts as asked. I believe it is as close as I can see it getting. Travis fails with warnings mainly because I have overridden and changed the methods of the base class in terms of input parameters.

orbeckst · 2018-10-12T21:45:18Z

We'll have to fix the pylint warnings or exclude them; if we merge it like this, all future CI will fail.

Run

pylint pmda

locally while testing. I get:

Using config file /Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/pmda/.pylintrc
************* Module pmda.leaflet
pmda/leaflet.py:73: [W0231(super-init-not-called), LeafletFinder.__init__] __init__ method from base class 'ParallelAnalysisBase' is not called
pmda/leaflet.py:115: [E1102(not-callable), LeafletFinder._find_connected_components] cKDTree is not callable
pmda/leaflet.py:159: [W0221(arguments-differ), LeafletFinder._single_frame] Parameters differ from overridden '_single_frame' method
pmda/leaflet.py:192: [W1619(old-division), LeafletFinder._single_frame] division w/o __future__ statement
pmda/leaflet.py:229: [W0221(arguments-differ), LeafletFinder.run] Parameters differ from overridden 'run' method

-----------------------------------
Your code has been rated at 9.77/10

Most of these warnings look fixable to me

pmda/leaflet.py:115: [E1102(not-callable), LeafletFinder._find_connected_components] cKDTree is not callable — ???
pmda/leaflet.py:192: [W1619(old-division), LeafletFinder._single_frame] division w/o future statement — add the future import or replace with //
pmda/leaflet.py:159: [W0221(arguments-differ), LeafletFinder._single_frame] Parameters differ from overridden '_single_frame' method — I think you still need to be able to accept all arguments that the standard _single_frame() could and just ignore them.
pmda/leaflet.py:229: [W0221(arguments-differ), LeafletFinder.run] Parameters differ from overridden 'run' method — We might have decided on the different call signature during the review process. However, if at all possible we should keep the call generic signature intact.

Disable the other pylint warning in code by bracketing the code with # pylint: disable=.../# pylint: enable=... blocks; see for example

pmda/pmda/rdf.py

Line 90 in 3d78fc7

# pylint: disable=redefined-builtin

orbeckst · 2018-10-12T21:48:49Z

On a second look, the mismatching call signatures might be a bit more difficult.... start with disabling the warnings in the code. I'll then comment in a code review.

(One way or another, the linter has to pass.)

orbeckst · 2018-10-17T19:23:49Z

Looking good – is there a way to increase test coverage? It will decrease our 99% to 96%...

orbeckst · 2018-10-17T19:25:22Z

Actually, just checked the diff – there's not much you can do for coverage.

orbeckst

Looks good to me; this is the code as we had it for the paper.

Follow-up in #72 .

orbeckst · 2018-10-17T19:26:28Z

pmda/leaflet.py

+        # Partition the data based on a 2-dimensional partitioning
+        for i in range(1, matrix_size + 1, part_size):
+            for j in range(1, matrix_size + 1, part_size):
+                arraged_coord.append(([atoms[i - 1:i - 1 + part_size],


Did you have a look at optimizing this?

If it should be optimized, add it to the #72 issue.

For this PR we can put the code as it was for the paper.

orbeckst

Please add yourself to AUTHORS etc – see comments.

I'll merge then.

orbeckst · 2018-10-17T19:31:00Z

CHANGELOG

@@ -21,6 +21,7 @@ Enhancements
  * add add timing for _conclude and _prepare (Issue #49)
  * add parallel particle-particle RDF calculation module pmda.rdf (Issue #41)
  * add readonly_attributes context manager to ParallelAnalysisBase
+  * add parallel implementation of Leaflet Finder (Issue #47)


add your GitHub name at the top of 0.2.0!

add yourself to AUTHORS and theauthors line in doc/conf.py

Okay I will!

iparask · 2018-10-17T19:46:49Z

Can you help me read the output of the codecov bot you included? I think that if we can test using the distributed scheduler the coverage will increase.

orbeckst · 2018-10-17T19:59:42Z

~~@ianmkenney~~ @iparask in PR #66 we will automatically run all possible parallelizations so no need to do this manually here.

Just add yourself as an author and we're done here.

EDIT: sorry, autocompleted the wrong username

orbeckst · 2018-10-17T21:57:00Z

Thank you @iparask , congratulations to your first contribution!

Ioannis Paraskevakos and others added 5 commits August 19, 2018 16:07

Setting up parallel leaflet finder

cbe2af2

Merge pull request #1 from MDAnalysis/master

6acc133

Bringing PMDA matser changes here before continuing development

Merge pull request #2 from iparask/master

ebe668f

sync

Changed run method

df3b72c

Adding errors in LF and adding test

140616b

orbeckst reviewed Oct 2, 2018

View reviewed changes

Adding Scipy and networkx in setup. Changed balltree with ckdtree

5d1d946

Ioannis Paraskevakos added 3 commits October 3, 2018 14:10

Single frame test passes, PEP8 both class and test.

ab81b90

Removing code comments based on PEP8 messages

bb65a59

trajectory level testing and pep8 changes

56f39ac

orbeckst reviewed Oct 5, 2018

View reviewed changes

Ioannis Paraskevakos added 2 commits October 10, 2018 12:07

Changing where the paper is mentioned

146dcaa

Updated Changelog

0ad2998

richardjgowers reviewed Oct 10, 2018

View reviewed changes

orbeckst reviewed Oct 10, 2018

View reviewed changes

setup.py Outdated Show resolved Hide resolved

Ioannis Paraskevakos added 2 commits October 11, 2018 00:56

Continuing PR review resolve

d5813ac

Reduced number of tasks

063b76a

iparask mentioned this pull request Oct 11, 2018

Extend Parallel Leaflet Finder to take Periodic Boundary Conditions into account #72

Open

Ioannis Paraskevakos added 3 commits October 11, 2018 20:37

Removing redundant import

4765fd9

PEP8 corrections

594b19e

Further PEP8 corrections

4ddc64f

Debug, references, increased coverage and PEP8.

b7d99f7

Ioannis Paraskevakos added 2 commits October 12, 2018 13:59

Docs added for Leaflet Finder

8cfcb19

Addind missing file

042e5bc

Trying to fix linter's messages

df64155

orbeckst approved these changes Oct 17, 2018

View reviewed changes

orbeckst requested changes Oct 17, 2018

View reviewed changes

Adding my name on AUTHORS, changelog and conf.py

ce5d471

orbeckst approved these changes Oct 17, 2018

View reviewed changes

orbeckst merged commit a718250 into MDAnalysis:master Oct 17, 2018


		# partial connected components

		subgraphs = nx.connected_components(graph)

Fix 47 pleafletfinder #69

Fix 47 pleafletfinder #69

Conversation

iparask commented Sep 30, 2018 • edited Loading

PR Checklist

orbeckst commented Oct 2, 2018

iparask commented Oct 2, 2018 • edited Loading

orbeckst left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VOD555 Oct 3, 2018 • edited Loading

Choose a reason for hiding this comment

orbeckst commented Oct 2, 2018

richardjgowers commented Oct 2, 2018

iparask commented Oct 3, 2018 • edited by kain88-de Loading

iparask commented Oct 3, 2018

kain88-de commented Oct 3, 2018 • edited Loading

orbeckst commented Oct 5, 2018

orbeckst left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

orbeckst commented Oct 11, 2018

codecov bot commented Oct 12, 2018 • edited Loading

Codecov Report

iparask commented Oct 12, 2018

orbeckst commented Oct 12, 2018

orbeckst commented Oct 12, 2018

orbeckst commented Oct 17, 2018

orbeckst commented Oct 17, 2018

orbeckst left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

orbeckst left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iparask commented Oct 17, 2018

orbeckst commented Oct 17, 2018 • edited Loading

orbeckst commented Oct 17, 2018

iparask commented Sep 30, 2018 •

edited

Loading

iparask commented Oct 2, 2018 •

edited

Loading

VOD555 Oct 3, 2018 •

edited

Loading

iparask commented Oct 3, 2018 •

edited by kain88-de

Loading

kain88-de commented Oct 3, 2018 •

edited

Loading

codecov bot commented Oct 12, 2018 •

edited

Loading

orbeckst commented Oct 17, 2018 •

edited

Loading