Describe the bug
Trying to load a PDB of the new blockbuster drug semaglutide, a 31-residue polypeptide with two non-canonical residues, using the experimental _additional_substructures argument rapidly consumes memory. On my machine it consumed all 64 GB of RAM and caused the OS to kill the Python kernel in about a minute.
I think this is because one of the two NCAAs is huge (a mostly linear sidechain roughly 40 atoms long), and doing an all-against-all substructure search across the whole peptide is problematic. We might also be holding on to information longer than we need to (possibly in the match cache?), or we may have a memory leak.
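For what it's worth, the blow-up I'm guessing at is easy to demonstrate without the toolkit at all. The toy backtracking matcher below is just an illustration (it is not the toolkit's actual search code): it counts substructure matches of an O-(CH2)k chain against itself, and every CH2 group doubles the number of symmetry-equivalent matches because its two hydrogens can swap. A mostly linear ~40-atom sidechain could therefore produce far more matches than fit in memory if they are all enumerated or cached.

```python
from collections import deque

def count_matches(t_adj, t_elem, p_adj, p_elem):
    """Brute-force count of substructure matches of a pattern graph in a
    target graph. Toy backtracking matcher for illustration only."""
    # Visit pattern atoms in BFS order so each new atom touches a mapped one.
    start = next(iter(p_adj))
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in p_adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)

    total = 0
    mapping, used = {}, set()

    def place(i):
        nonlocal total
        if i == len(order):
            total += 1
            return
        p = order[i]
        mapped = [mapping[q] for q in p_adj[p] if q in mapping]
        # Candidates must be adjacent to every already-mapped neighbour.
        if mapped:
            cands = set.intersection(*(set(t_adj[m]) for m in mapped))
        else:
            cands = set(t_adj)
        for t in cands:
            if t in used or t_elem[t] != p_elem[p]:
                continue
            mapping[p] = t
            used.add(t)
            place(i + 1)
            del mapping[p]
            used.discard(t)

    place(0)
    return total

def ch2_chain(k):
    """O-(CH2)k chain with explicit hydrogens, as adjacency + element dicts."""
    adj, elem, prev = {"O": []}, {"O": "O"}, "O"
    for i in range(k):
        c = f"C{i}"
        adj[c], elem[c] = [prev], "C"
        adj[prev].append(c)
        for tag in "ab":
            h = f"H{i}{tag}"
            adj[h], elem[h] = [c], "H"
            adj[c].append(h)
        prev = c
    return adj, elem

for k in (2, 4, 6, 8):
    adj, elem = ch2_chain(k)
    print(k, count_matches(adj, elem, adj, elem))
# Matches double with every CH2 group: 4, 16, 64, 256
```

Extrapolating the doubling to a sidechain with a dozen or more CH2-like groups gives millions of equivalent matches, which would comfortably explain 64 GB disappearing if each match is materialized or cached.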
Because I haven't been able to load the PDB, I don't know whether I've made a mistake in my SMILES, but the memory consumption is an issue regardless.
To Reproduce
PDB of semaglutide and a notebook that reproduces the issue:
semaglutide_from_pdb.zip

Output
"Python kernel has crashed and is restarting."

Computing environment (please complete the following information):
micromamba list

Additional context
I LOVE the _additional_substructures API.

I... can't tell if you're joking. That API is so janky that I've been looking forward to dropping it once I make _custom_substructures public. Do you unironically like _additional_substructures? I can consider leaving it in if so.

For once in my life I am not being sarcastic! It's just so much easier to write a SMILES than an unambiguous SMARTS with mapped hydrogens. Yes, it's janky, but I think the jank can be fixed: optionally take atom (and residue?) names from the molecule when they're present, support dummy atoms (atomic number 0) that are automatically excluded from the matched substructure, or even just replace the argument with a free function that converts a molecule into a substructure dictionary entry that can be passed to _custom_substructures.
But yeah, the key thing I love is just writing a SMILES (which I already know is unambiguous, because from_smiles on it yields only one molecule) instead of a SMARTS with all explicit atoms and connectivity (which I could not pull off for RNA even with several hours of work). It's the difference between minutes and hours of work.
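The free-function idea from the comments could be sketched roughly as follows. Everything here is hypothetical: the function name, the plain-dict stand-in for a molecule, and the entry layout are invented for illustration and are not the OpenFF Toolkit API. It only shows the shape of the proposal, namely that dummy atoms (atomic number 0) are dropped from the match and the real atoms they were bonded to become attachment points.

```python
def molecule_to_substructure_entry(atoms, bonds):
    """Hypothetical sketch of the proposed free function: convert a
    molecule-like description into a substructure dictionary entry.

    `atoms` maps atom name -> atomic number; `bonds` is a list of
    (name_a, name_b) pairs. Dummy atoms (atomic number 0) are excluded
    from the match; real atoms bonded to a dummy become attachment points.
    """
    real = {name: z for name, z in atoms.items() if z != 0}
    kept_bonds = [(a, b) for a, b in bonds if a in real and b in real]
    attachment_points = sorted(
        name
        for name in real
        if any(name in pair and (atoms[pair[0]] == 0 or atoms[pair[1]] == 0)
               for pair in bonds)
    )
    return {
        "atoms": real,
        "bonds": kept_bonds,
        "attachment_points": attachment_points,
    }

# A backbone fragment capped by two dummies standing in for the rest of
# the chain (names here are illustrative, not PDB-validated):
entry = molecule_to_substructure_entry(
    atoms={"*prev": 0, "N": 7, "CA": 6, "C": 6, "O": 8, "*next": 0},
    bonds=[("*prev", "N"), ("N", "CA"), ("CA", "C"), ("C", "O"), ("C", "*next")],
)
print(entry["attachment_points"])  # ['C', 'N']
```

A helper of this shape would keep the part everyone likes (write a plain SMILES-derived molecule, not a hand-built SMARTS) while feeding the existing _custom_substructures machinery, so the janky argument itself could still be dropped.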