-
-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
change the pdb_paths working style and support for loading both local… #214
base: master
Are you sure you want to change the base?
Conversation
… and downloading files
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Awesome work, thanks @1511878618 ! If you merge into #213 the CI pipeline should be much faster & I can push a release to PyPI tonight :) |
I'm not sure about how to change into #213, and i see master branch is associated with #213. Or, maybe you can help me? Also, I'm not sure whether are there some potential bugs in it~ |
@@ -0,0 +1,247 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please switch out this notebook for unit tests :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i'll try it later in this month
graph_label_map: Optional[Dict[str, torch.Tensor]] = None, | ||
node_label_map: Optional[Dict[str, torch.Tensor]] = None, | ||
chain_selection_map: Optional[Dict[str, List[str]]] = None, | ||
pdb_paths: Optional[List[str]] = [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning for using empty lists as the default arg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty lists
can add together even if they are empty, while None
can't. So we can skip some if
for different statements of the user pass pdb_paths
or pdb_codes
or uniprot_ids
, and just merge them into self.structures,
which is used at process
func and it works like os.listdir(self.raw_dir)
.
As for some potential bugs, i'm really not sure would this will cause some bugs as i use empty list
instead of None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be None
If you want to retain the behaviour inside the object, you could do:
if working_list is None: working_list = []
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and actually i have done that in the latest commit
else: | ||
self.pdb_paths_name = [] | ||
|
||
self.structures = list(set(self.pdb_codes + self.uniprot_ids + self.pdb_paths_name)) # remove some pdb_codes is in pdb_path and loaded repeately |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be a set operation. With chain selections you may want to have e.g. 3eiy_A
and 3eiy_B
as different examples in your dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, i guess it would'n make a difference at chain selection, this set operation is to drop duplicate in the result list of pdb_codes
+ uniprot_ids
+ paths_name
. As you can see, local dir may have some pdb files like 10gs.pdb
, and if pdb_codes
also have 10gs
to download, and self.structures would contain double 10gs
and so the finial dataset
object will have duplicate Data
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It becomes a problem here though (L283), no?
def process(self):
"""Process structures into PyG format and save to disk."""
# Read data into huge `Data` list.
structure_files = [
f"{self.raw_dir}/{pdb}.pdb" for pdb in self.structures
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, i guess not.
code like below in the tests/ml/test.ipynb
from graphein.ml.datasets import InMemoryProteinGraphDataset
local_dir = "../protein/test_data"
pdb_paths = [osp.join(local_dir, pdb_file) for pdb_file in os.listdir(local_dir) if pdb_file.endswith(".pdb")]
ds = InMemoryProteinGraphDataset(root = "../protein/test_data/InMemoryProteinGraphDataset",
name = "InMemoryProteinGraphDataset_test",
pdb_paths=pdb_paths,
pdb_codes=["10gs"],
uniprot_ids=["A0A6J1BG53", "A0A6P5Z5F7"],
af_version=3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you see what happens with:
from graphein.ml.dataset import InMemoryProteinGraphDataset
ds = InMemoryProteinGraphDataset(root = ""../protein.test_data/InMemoryProteinGraphDataset", pdb_paths=pdb_paths, pdb_codes = ["4hhb", "4hhb"], chain_selection=["A","B"])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, i'll try later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you see what happens with:
from graphein.ml.dataset import InMemoryProteinGraphDataset ds = InMemoryProteinGraphDataset(root = ""../protein.test_data/InMemoryProteinGraphDataset", pdb_paths=pdb_paths, pdb_codes = ["4hhb", "4hhb"], chain_selection=["A","B"])
and why i ['4hhs', '4hhs']
I guess this may need lots of change~
@@ -225,6 +251,7 @@ def download(self): | |||
for pdb in set(self.pdb_codes) | |||
if not os.path.exists(Path(self.raw_dir) / f"{pdb}.pdb") | |||
] | |||
print("downloading uniprotids") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using log
would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohhhhhhhhhhh, too sry for these print
, forget to remove them, XD.
I'll remove them today
@@ -237,6 +264,7 @@ def download(self): | |||
] | |||
|
|||
def __len__(self) -> int: | |||
"""Returns length of data set (number of structures).""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should return the number of examples (not just the number of structures for the multiple chain reason I mentioned previously)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
@@ -492,8 +513,10 @@ def processed_file_names(self) -> List[str]: | |||
|
|||
@property | |||
def raw_dir(self) -> str: | |||
if self.pdb_paths is not None: | |||
return self.pdb_path # replace raw dir with user local pdb_path | |||
if self.pdb_paths: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think it would be useful to allow users to choose a path for raw_dir
when initialising the Dataset objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree.
If we simply change self.raw_dir
instead of self.pdb_paths
, where the former is a folder dir the latter is a list containing pdb_file dir, i guess we will use os.listdir
to get local pdb files dir.
And the question is if os.listdir
in the func, and then the order of self.structure
maybe hard to match the order of graph_labels
and node_labels
, since we match the labels
by index of list, i guess.
I'm not sure about this, i prefer to dict, which key is the names like {'10gs':0}
would be better than {0:0}
. And then we could just change the raw_dir
and os.listdir
and get a list of pdb file dir containing both local and downloaded pdb files, and process and assign each pdb files with their node_graph_label or chain_selection or graph_label by their name (remove root path and suffix like ./test/10gs.pdb
-> 10gs
) not by the enumunated index (which i think it is hard to match the correct order with pdb files when passing the graph_labels)
This description is not very clear, i'll try to make it clear later...
If something wrong in my understanding, please tell me 😄 , i'm still reading and learning your code lol. It's really a pythonic code, i learnt a lot 👍 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we simply change
self.raw_dir
instead ofself.pdb_paths
, where the former is a folder dir the latter is a list containing pdb_file dir, i guess we will use `os.listdir`` to get local pdb files dir.
I don't think this is the best idea. I think being explicit about the paths users want to use is best. For instance, people may want to use only a subset of their dataset (rather than everything in the directory - e.g. imagine where you want to keep all your pdb files together but train/test on different subsets). It also has the potential problem with hidden files like .DS_Store
etc. You're also completely right about the matching the list to node labels etc.
I'm not sure about this, i prefer to dict, which key is the names like
{'10gs':0}
would be better than{0:0}
This was my initial implementation. However, this ran into the problem where you may have different examples in your dataset drawn from different chains of the same PDB. E.g. imagine you have 3eiy_A
and 3eiy_B
with different labels. The current implementation allows for this, whereas indexing on the PDB name does not.
If something wrong in my understanding, please tell me 😄 , i'm still reading and learning your code lol. It's really a pythonic code, i learnt a lot 👍 👍
Thanks!! Me too!
…dataset_pyg_dev merge from v1.5.2
…rameters to Keyword Parameters since the former do not work and update the usage of it into dataloader_tutorial.ipynb also some little change in datasetload.py
… add first transform which turn attr of pyg Data to tensor
Codecov ReportBase: 40.27% // Head: 48.00% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #214 +/- ##
==========================================
+ Coverage 40.27% 48.00% +7.73%
==========================================
Files 48 86 +38
Lines 2811 5458 +2647
==========================================
+ Hits 1132 2620 +1488
- Misses 1679 2838 +1159
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
SonarCloud Quality Gate failed. 0 Bugs No Coverage information |
SonarCloud Quality Gate failed. 0 Bugs No Coverage information |
Reference Issues/PRs
Fixes #210
What does this implement/fix? Explain your changes
support for loading both local and downloading pdb files, and save at
self.raw_dir
What testing did you do to verify the changes in this PR?
Currently, none of this, just test on dataset loader doc notebook.
More changes and complements will be done at tomorrow morning.........
Pull Request Checklist
./CHANGELOG.md
file (if applicable)./graphein/tests/*
directories (if applicable)./notebooks/
(if applicable)python -m py.test tests/
and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g.,python -m py.test tests/protein/test_graphs.py
)black .
andisort .