-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix 47 pleafletfinder #69
Conversation
Bringing PMDA matser changes here before continuing development
@iparask sklearn is not getting installed. https://travis-ci.org/MDAnalysis/pmda/jobs/435384534#L1101 For MDAnalysis, sklearn is optional. Is BallTree required for your code or can you make it optional? |
The most efficient one uses BallTrees. It also reduces the memory footprint a lot. Let me include KDTrees instead, which are part of Scipy. Otherwise we will have to go with cdist or something similar. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial comments inline. Getting meaningful tests to pass is the next step as far as I can tell.
Can you make sklearn.BallTree optional, i.e., use something from scipy (which we always install) and use BallTree if sklearn is available?
pmda/leaflet.py
Outdated
# Released under the GNU Public Licence, v2 or any higher version | ||
""" | ||
LeafletFInder Analysis tool --- :mod:`pmda.leaflet` | ||
======================================= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs longer underscores or sphinx will barff
pmda/leaflet.py
Outdated
import networkx as nx | ||
|
||
import MDAnalysis as mda | ||
from sklearn.neighbors import BallTree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this required or can it be made optional or replaced by something from scipy?
At the moment the tests fail because sklearn is not installed.
We can decide to install sklearn if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use the KDTree we've got inside MDAnalysis, which does periodic boundaries. See: https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/pkdtree.py
pmda/leaflet.py
Outdated
|
||
Attributes | ||
---------- | ||
_results : list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If users are not supposed to use it (underscore prefixed) then don't document.
If users should be able to work with it, don't use underscore.
pmda/leaflet.py
Outdated
else: | ||
train = np.vstack([window[0],window[1]]) | ||
test = np.vstack([window[0],window[1]]) | ||
tree = BallTree(train, leaf_size=40) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be replaced with something from scipy and we use sklearn if available?
pmda/test/test_leaflet.py
Outdated
assert_array_almost_equal) | ||
|
||
import MDAnalysis | ||
from MDAnalysisTests.datafiles import (PSF, DCD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These datafiles do not include a membrane.
Use the MDAnalysis tests for a single frame https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/testsuite/MDAnalysisTests/analysis/test_leaflet.py#L30 to compare results and the files https://github.com/MDAnalysis/mdanalysis/blob/edb1643c0a78f7e5447aa3ccc02712e4205ff073/testsuite/MDAnalysisTests/analysis/test_rdf_s.py#L31 contain a trajectory with a membrane protein in a membrane.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I ask which atoms to select from the large trajectory? Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most likely "name P*" – sorry, can't check right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VOD555 what kind of lipids are in these trajectories and how do you select the headgroup phosphorous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are POPG and POPE lipids in these trajectories, and you can use
u.select_atoms('name P and resname POPG')
for phosphorous in POPG/POPE, or
u.select_atoms('name P')
for all phosphorous in lipids.
Is BallTree a lot better? MDAnalysis also recently got capped_distance which could perhaps be used instead. We now use this for neighborsearching. Importantly, it takes periodicity into account. @richardjgowers can you comment on |
@orbeckst a BallTree is just a KDTree (I think...). grid benched faster than KDTree nearly all of the time, so if we can use that it would be better |
@orbeckst @richardjgowers I have a wiki page at least with some information for KD-Trees and BallTrees. Also, I remember @drelu had created a plot comparing different trees, I will put it here as soon as I find it. |
@iparask we had a student (@ayushsuhane) putting in a lot of work in the summer to efficiently calculate distances. Have a look at his blog. He implemented several algorithms in MDAnalysis to efficiently calculate distances with a cutoff. He has found that the cell-list algorithm yielded the best results. It's available in MDAnalysis as |
Currently only as simple PEP8 check fails. You can run the PEP8 checks locally, too (together with coverage):
(needs pytest-pep8 and pytest-cov). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @iparask , see quick comments inline.
else: | ||
train = np.vstack([window[0], window[1]]) | ||
test = np.vstack([window[0], window[1]]) | ||
tree = cKDTree(train, leafsize=40) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment here that this does not respect PBC; possible TODO: work with capped_distances()
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by PBC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://en.wikipedia.org/wiki/Periodic_boundary_conditions
like Pacman geometry. Particles are simulated within a finite box of size [Lx, Ly, Lz], Particles near x=0 and x=Lx are actually very close to each other.
So the leaflet is actually an infinite sheet and has no "edge". Without considering periodic boundaries you'll identify the leaflets, but you'll miss some of the edges in the graph representation, which might be important for some analysis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I see. May I suggest the following. Have a working implementation without taking into account PBCs and without breaking the trajectory into multiple blocks. As soon as that is done and this PR can be accepted, merge it so that we do not miss the release this is intended for.
Next version of the implementation will take into account PBCs and also try to see how to parallelize over trajectory blocks.
@orbeckst , @richardjgowers do you agree with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with an initial version that does not do PBC if
- the docs mention this
- we have an issue open for it
Breaking trajectory into blocks can also wait in my opinion.
---- | ||
At the moment, this class has far fewer features than the serial | ||
version :class:`MDAnalysis.analysis.leaflet.LeafletFinder`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional notes to add:
- Currently, periodic boundaries are not taken into account.
- The calculation is parallelized on a per-frame basis [Paraskevakos2018]_; at the moment, no parallelization over trajectory blocks is performed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pmda/leaflet.py
Outdated
# Partition the data based on a 2-dimensional partitioning | ||
for i in range(1, matrix_size + 1, part_size): | ||
for j in range(1, matrix_size + 1, part_size): | ||
arraged_coord.append(([atoms[i - 1:i - 1 + part_size], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, you are partitioning a (M, M) matrix A into sub-blocks. A_{ij} will represent a contact.
Because we evaluate the atomgroup with itself, this is symmetric, A_ij = A_ji. Couldn't you save on the order of M^2/2 evaluations by only taking the diagonal and lower triangle blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me give it a try and see what is happening with the tests. This is an exact transport from the code I used for the paper. I will fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you have a look at optimizing this?
If it should be optimized, add it to the #72 issue.
For this PR we can put the code as it was for the paper.
pmda/leaflet.py
Outdated
comp = [g for g in subgraphs] | ||
return comp | ||
|
||
def _single_frame(self, scheduler_kwargs, n_blocks, cutoff=15.0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why n_blocks
– it's not used at the moment. Omit?
(Unless you immediately want to parallelize over blocks, too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parallelization happens in a single frame. blocks
specify the number of partitioning the data over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove it and change the way I used it in the code
pmda/leaflet.py
Outdated
number of tasks to start, if `-1` use number of logical cpu cores. | ||
This argument will be ignored when the distributed scheduler is | ||
used | ||
n_blocks : int, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment that this is currently ignore: no parallelization over trajectory blocks. Or are you already working on adding the parallelization over frames, too?
pmda/leaflet.py
Outdated
if n_jobs == -1: | ||
n_jobs = cpu_count() | ||
|
||
if n_blocks is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COuld be removed if n_blocks
not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is being used to produce the correct number of tasks. Because it is optional, if it is not set we set it manually based on the number of jobs. Now that I think about it n_jobs
is not really needed since I parallelize based on the number of blocks.
|
||
# partial connected components | ||
|
||
subgraphs = nx.connected_components(graph) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iirc networkx is wholly Python
This function in MDAnalysis:
https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/lib/_cutil.pyx#L322
Is a c++ function which does the same, pass it node and edge indices and it will return a list of the indices in each subgraph
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand your comment correctly, you are suggesting to change from netwrokx to the function MDAnalysis implements, right?
When I started the implementation of the Leaflet Finder for my paper I used the class that exists in MDAnalysis (link). There networkx is being used. In addition, I was very cautious when I built my version to use highly used python packages for things I did not want to reimplement.
Did you reimplement a Connected component calculator or did that pre-exist Leaflet Finder?
Also, I am extremely bad with abbreviations and I have to google them all the time. Can you expand iirc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, iirc = if I remember correctly.
Yeah I think the MDAnalysis function will be faster. It was added after the original leaflet finder class.
If you want to keep this the same for a closer comparison to the original class that's fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardjgowers did you benchmark find_fragments()
against networkx?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also, we should update the serial LeafletFinder with capped_distances and find_fragments...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we use networkx for the initial version because this is what was described in the paper.
We can then open an new issue to optimize parallel leafletfinder and add PBC.
@iparask there's still an import of BallTree somewhere:
Once the tests pass and you need a final review, please ping me. Thanks! |
Codecov Report
@@ Coverage Diff @@
## master #69 +/- ##
==========================================
- Coverage 99.27% 97.92% -1.35%
==========================================
Files 7 8 +1
Lines 274 386 +112
Branches 26 48 +22
==========================================
+ Hits 272 378 +106
- Misses 1 4 +3
- Partials 1 4 +3
Continue to review full report at Codecov.
|
@orbeckst, this is a request for a final review. I have added all parts as asked. I believe it is as close as I can see it getting. Travis fails with warnings mainly because I have overridden and changed the methods of the base class in terms of input parameters. |
We'll have to fix the pylint warnings or exclude them; if we merge it like this, all future CI will fail. Run
locally while testing. I get:
Most of these warnings look fixable to me
Disable the other pylint warning in code by bracketing the code with Line 90 in 3d78fc7
|
On a second look, the mismatching call signatures might be a bit more difficult.... start with disabling the warnings in the code. I'll then comment in a code review. (One way or another, the linter has to pass.) |
Looking good – is there a way to increase test coverage? It will decrease our 99% to 96%... |
Actually, just checked the diff – there's not much you can do for coverage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me; this is the code as we had it for the paper.
Follow-up in #72 .
pmda/leaflet.py
Outdated
# Partition the data based on a 2-dimensional partitioning | ||
for i in range(1, matrix_size + 1, part_size): | ||
for j in range(1, matrix_size + 1, part_size): | ||
arraged_coord.append(([atoms[i - 1:i - 1 + part_size], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you have a look at optimizing this?
If it should be optimized, add it to the #72 issue.
For this PR we can put the code as it was for the paper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add yourself to AUTHORS etc – see comments.
I'll merge then.
@@ -21,6 +21,7 @@ Enhancements | |||
* add add timing for _conclude and _prepare (Issue #49) | |||
* add parallel particle-particle RDF calculation module pmda.rdf (Issue #41) | |||
* add readonly_attributes context manager to ParallelAnalysisBase | |||
* add parallel implementation of Leaflet Finder (Issue #47) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add your GitHub name at the top of 0.2.0!
add yourself to AUTHORS and theauthors
line in doc/conf.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I will!
Can you help me read the output of the codecov bot you included? I think that if we can test using the distributed scheduler the coverage will increase. |
Just add yourself as an author and we're done here. EDIT: sorry, autocompleted the wrong username |
Thank you @iparask , congratulations to your first contribution! |
Fixes #47
Changes made in this Pull Request:
This PR can be used as a place holder. I would like to initiate a discussion as to as the written code is based on the requirements. I would like to report two major differences from other cases. This code uses
Dask Bags
to execute instead ofdelayed
. In addition, parallelization is done on a frame basis. This was the most efficient implementation we had in the paper (code linked in the respective issue).Please review/ comment as necessary and code can grow.
Giannis
PR Checklist