Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atom-level and residue-level metadata easily fall out of sync #1921

Open
mattwthompson opened this issue Aug 2, 2024 · 5 comments
Open

Atom-level and residue-level metadata easily fall out of sync #1921

mattwthompson opened this issue Aug 2, 2024 · 5 comments

Comments

@mattwthompson
Copy link
Member

mattwthompson commented Aug 2, 2024

Describe the bug

Updating an atom's residue-relevant metadata does not update the same data on the corresponding residue object. Editing the residue also doesn't update the atom. I'm not sure if this is the intended behavior - if it is, there needs to be safeguards or at least documentation, and if it's not, this is a bug.

To Reproduce

from openff.toolkit import Topology

from openff.utilities import get_data_file_path


protein = Topology.from_pdb(
    get_data_file_path(
        "proteins/MainChain_HIE.pdb",
        "openff.toolkit",
    ),
).molecule(0)


def check():
    for index in [0, -1]:
        assert [*protein.residues][index].residue_number == protein.atom(
            index,
        ).metadata["residue_number"], index


check()  # Passes

for atom in protein.atoms:
    atom.metadata["residue_number"] = str(int(atom.metadata["residue_number"]) + 10)

try:
    check()
except AssertionError:
    print("Mismatch if atoms are edited")

# Reset changes at atom level
for atom in protein.atoms:
    atom.metadata["residue_number"] = str(int(atom.metadata["residue_number"]) - 10)

for atom in protein.atoms:
    atom.metadata["residue_number"] = str(int(atom.metadata["residue_number"]) + 10)

try:
    check()
except AssertionError:
    print("Mismatch if residues are edited")

Output

$ python indexing.pyThe OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
The OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
Mismatch if atoms are edited
Mismatch if residues are edited

Computing environment (please complete the following information):

Additional context

This behavior makes sense given what I know about the design, but I think it's reasonable for a user to expect that updating residue data on an atom might correspondingly update the residue itself. The Atom.metadata field is implied to do this sort of thing:

An optional dictionary where keys are strings and values are strings or ints. This is intended to record atom-level information used to inform hierarchy definition and iteration, such as grouping atom by residue and chain.

If this is the intended behavior, it'd be nice to have it documented in some sort of "PDB loading cookbook" #1554

@mattwthompson
Copy link
Member Author

Something might be going on here, can't tell if this is related

ipdb> protein.atom(0).metadata["residue_number"]
'14'
ipdb> protein.residues[0].residue_number
'14'
ipdb> protein.to_topology().atom(0).metadata["residue_number"]
'14'
ipdb> protein.to_topology().residues[0].residue_number
'1'

@j-wags
Copy link
Member

j-wags commented Aug 6, 2024

Thanks for the great writeup. It's not a bug, just a poorly documented design.

The core issue is that Atom and Molecule/Topology-level metadata aren't kept in sync by default.

Changes to Atom-level metadata can be propagated to relevant HierarchyElements by running offmol.percieve_hierarchy.

There's currently no way to propagate HierarchyElement metadata changes back to the underlying atoms.

Some options I could see are:

  • If we don't want to add any major new functionality
    • Better documentation of this behavior
    • Raise an error if the user tries to __setattr__ on a HierarchyElement
  • Additionally, if we want to reduce confusion about the state of HierarchyElements after changes to Atom metadata:
    • When an Atom is iterated over inside a perceive_hierarchy call, have an attribute like _hierarchy_perceived be set on it. Then if someone updates the atom's metadata after that, print a warning saying that they'll need to call perceive_hierarchy again for those changes to be reflected in Molecule/Topology iterators
  • If we want to automatically update HierarchyElements based on changes to Atom metadata:
    • Modify Atoms to know which HierarchyElements they're part of, and if the Atom's metadata is changed, mark the HierarchyElement as being in a "dirty" state. Also, implement a check for this dirty state any time it might become important and re-percieve hierarchy when appropriate. This seems potentially complex, performance-impacting, and brittle.
  • After thinking through it, I don't think we want to allow setting atom metadata through a HierarchyElement attribute (so, I don't think we should allow offmol.residues[0].residue_name = "foo"). That's because this could easily land a molecule in a state where it has two residues with identical metadata, which violates the workings of HierarchyScheme (atoms with identical metadata all go into a single residue).

Happy to take a PR for any of these or discuss further.

@mattwthompson
Copy link
Member Author

My low-effort before-coffee opinion on this now is that we should quickly document most of this behavior (~few days) before we make any decisions about major design changes (~weeks to several months)

@Yoshanuikabundi
Copy link
Collaborator

Yoshanuikabundi commented Aug 7, 2024

FWIW, this is documented:

Hierarchy schemes are not updated dynamically; if a Molecule with hierarchy schemes changes, Molecule.update_hierarchy_schemes() must be called before the scheme is iterated over again or else the grouping may be incorrect.

When I was writing the documentation for this feature, I did find it a bit difficult to decide where to put this information, because a lot of users will only interact with, say, Molecule.residues, which can't be documented because it's not really a method. I kind of ended up just trying to put it everywhere. I'm a bit miffed that I didn't put it in Topology.hierarchy_iterator(), because that seems like an obvious place that people might run into it. If you have any other suggestions for more places to mention this, or clearer ways to phrase it, I'm all ears!

I also agree that dynamic iterators would be much less surprising.

@mattwthompson
Copy link
Member Author

FWIW, I searched mostly for "metadata" since that's the way I was interacting with the API. I appreciate that this behavior is mentioned several times in those methods, but they're also methods I didn't need to interact with to get this behavior.

My only suggestion at the documentation level is to include it when Atoms are described, i.e. here or a probably-copy-pasted-bit in Molecule.add_atom:

image

A "Beware! This other thing ... see here" might have helped in this case.

Anyway, I think making the actual behavior less squishy would be more impactful than documenting it a fifth time over. The power and flexibility of this functionality has huge upside but I seem to keep shooting myself in the foot when I try to hook into it for tasks I naively think are straightforward

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants