RFAM family <-> PDB structure ID mapping #324

Open
wants to merge 11 commits into
base: master

Conversation

rvinas
Collaborator

@rvinas rvinas commented Jun 27, 2023

Here's a script to retrieve a mapping between RFAM families and PDB structure IDs. How could this be integrated into the codebase?

There are two points that might need to be addressed:

  • In the RFAM API, I couldn't figure out how to get a complete list of RFAM families. Judging from here, it seems that the family accession IDs follow the format: RF00001, RF00002, ..., RF04236. To retrieve all families, I introduced an argument max_id to specify the max ID limit (e.g. setting max_id=4236 will stop querying after RF04236).
  • Downloading the mappings for all families is time-consuming: it seems we can only query a single family at a time, and the full run took ~40 min on my laptop. Would it be good to cache the data in graphein/datasets? It might be important to let users re-download the data when the RFAM database is updated.
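
As an illustration of the two points above, here is a minimal sketch of enumerating accession IDs up to max_id and caching the resulting mapping with an option to force a re-download. It is not the script from this PR: the names rfam_accessions, load_mapping, fetch, cache_file and force_download are hypothetical, and fetch stands in for whatever function queries a single family.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def rfam_accessions(max_id: int) -> List[str]:
    """Return accession IDs RF00001 ... RF<max_id>, zero-padded to five digits."""
    return [f"RF{i:05d}" for i in range(1, max_id + 1)]


def load_mapping(
    cache_file: Path,
    fetch: Callable[[str], List[str]],
    max_id: int,
    force_download: bool = False,
) -> Dict[str, List[str]]:
    """Load the family -> PDB IDs mapping from cache, or rebuild it with `fetch`.

    `fetch` is a hypothetical callable that returns the PDB IDs for one
    accession (one query per family, as described above).
    """
    if cache_file.exists() and not force_download:
        # Reuse the cached mapping and skip the slow per-family queries.
        return json.loads(cache_file.read_text())

    mapping = {acc: fetch(acc) for acc in rfam_accessions(max_id)}

    # Persist the result so later calls are instant.
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(mapping))
    return mapping
```

Passing force_download=True rebuilds the cache, which covers the case where the RFAM database has been updated since the last download.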

Pull Request Checklist

  • Added a note about the modification or contribution to the ./CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./graphein/tests/* directories (if applicable)
  • Modified documentation in the corresponding Jupyter Notebook under ./notebooks/ (if applicable)
  • Ran python -m py.test tests/ and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
  • Checked for style issues by running black . and isort .

@codecov-commenter

codecov-commenter commented Jun 27, 2023

Codecov Report

Attention: Patch coverage is 0% with 66 lines in your changes missing coverage. Please review.

Project coverage is 44.70%. Comparing base (8123f42) to head (2f3e6d6).
Report is 184 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| graphein/rna/download_rfam.py | 0.00% | 66 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #324      +/-   ##
==========================================
+ Coverage   40.27%   44.70%   +4.43%     
==========================================
  Files          48      114      +66     
  Lines        2811     7982    +5171     
==========================================
+ Hits         1132     3568    +2436     
- Misses       1679     4414    +2735     


@a-r-j
Owner

a-r-j commented Jun 27, 2023

Thanks for this Ramon, looks great! Let me have a think about how to integrate this more. An immediate thought is to couple this to the PDBManager which can take care of retrieving structures etc. We'd probably need some adaptations to support splitting RNA datasets properly.

Re finding families, is family.txt.gz what you want?

There's also Rfam.pdb.gz that could be helpful?

I think favouring the metadata/indices stored on the FTP server over the API might be better from a user POV (probably faster & no worries about being rate limited). We could make a wrapper for this similar to the PDBManager?

There also seems to be a ton of metadata on the FTP server. I'm not sure what else could be useful to pull in 🤔
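
To make the FTP-based approach concrete, here is a rough sketch of building the family-to-PDB mapping from Rfam.pdb.gz in a single download. The URL and the column layout (tab-separated, Rfam accession in the first column, PDB ID in the second) are assumptions that should be checked against the FTP server, and the function names are hypothetical.

```python
import csv
import gzip
import io
from collections import defaultdict
from typing import Dict, Set
from urllib.request import urlopen

# Assumed location of the file; verify against the Rfam FTP listing.
RFAM_PDB_URL = "https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.pdb.gz"


def parse_family_to_pdb(gz_bytes: bytes) -> Dict[str, Set[str]]:
    """Map each Rfam accession to the set of PDB IDs listed for it.

    Assumes tab-separated rows whose first two columns are the Rfam
    accession and the PDB ID. Rows that do not start with "RF" (such as
    a header row) are skipped.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    mapping = defaultdict(set)
    for row in reader:
        if len(row) >= 2 and row[0].startswith("RF"):
            # Lowercase PDB IDs so the same entry is not counted twice.
            mapping[row[0]].add(row[1].lower())
    return dict(mapping)


def download_family_to_pdb(url: str = RFAM_PDB_URL) -> Dict[str, Set[str]]:
    """Download the gzipped file once and parse it in memory."""
    with urlopen(url) as response:
        return parse_family_to_pdb(response.read())
```

Keeping the parser separate from the download makes it testable offline, and a single file fetch avoids the per-family API requests and any rate limiting.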

@rvinas
Collaborator Author

rvinas commented Jun 28, 2023

This sounds good. I feel quite silly, I completely missed these two files! Yes, downloading via the FTP server would definitely be much faster. I'll modify the script to just download these two files (and perhaps merge everything into a single dataframe?). We could then look into how to integrate this into PDBManager.
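
One possible shape for the merge mentioned above, sketched with plain dictionaries so that it does not depend on how the two files end up being parsed. The key name rfam_acc and the function name merge_on_accession are assumptions; the resulting list of records could be passed to pandas.DataFrame to get a single dataframe.

```python
from typing import Dict, Iterable, List


def merge_on_accession(
    families: Iterable[Dict[str, str]],
    structures: Iterable[Dict[str, str]],
    key: str = "rfam_acc",
) -> List[Dict[str, str]]:
    """Inner-join structure rows with family metadata on the accession key.

    Each output record holds the family fields plus the structure fields;
    structure rows whose accession has no family entry are dropped.
    """
    by_acc = {fam[key]: fam for fam in families}
    merged = []
    for row in structures:
        fam = by_acc.get(row[key])
        if fam is not None:
            merged.append({**fam, **row})
    return merged
```

This yields one record per (family, structure) pair, so a family with several PDB structures appears on several rows.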

@sonarcloud

sonarcloud bot commented Jul 4, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
No Duplication information

sonarcloud bot commented Apr 1, 2024

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication
