Atom-level and residue-level metadata easily fall out of sync #1921

mattwthompson · 2024-08-02T19:59:36Z

Describe the bug

Updating an atom's residue-relevant metadata does not update the same data on the corresponding residue object. Editing the residue also doesn't update the atom. I'm not sure if this is the intended behavior; if it is, there need to be safeguards or at least documentation, and if it's not, this is a bug.

To Reproduce

from openff.toolkit import Topology

from openff.utilities import get_data_file_path


protein = Topology.from_pdb(
    get_data_file_path(
        "proteins/MainChain_HIE.pdb",
        "openff.toolkit",
    ),
).molecule(0)


def check():
    for index in [0, -1]:
        assert [*protein.residues][index].residue_number == protein.atom(
            index,
        ).metadata["residue_number"], index


check()  # Passes

for atom in protein.atoms:
    atom.metadata["residue_number"] = str(int(atom.metadata["residue_number"]) + 10)

try:
    check()
except AssertionError:
    print("Mismatch if atoms are edited")

# Reset changes at atom level
for atom in protein.atoms:
    atom.metadata["residue_number"] = str(int(atom.metadata["residue_number"]) - 10)

# Now make the same change at the residue level instead
for residue in protein.residues:
    residue.residue_number = str(int(residue.residue_number) + 10)

try:
    check()
except AssertionError:
    print("Mismatch if residues are edited")

Output

$ python indexing.py
The OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
The OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
Mismatch if atoms are edited
Mismatch if residues are edited

Computing environment (please complete the following information):

Operating system: macOS
Output of running conda list: https://gist.github.com/mattwthompson/087a81f4eb1a6bdb1900e1e287615f8b

Additional context

This behavior makes sense given what I know about the design, but I think it's reasonable for a user to expect that updating residue data on an atom might correspondingly update the residue itself. The Atom.metadata field is implied to do this sort of thing:

An optional dictionary where keys are strings and values are strings or ints. This is intended to record atom-level information used to inform hierarchy definition and iteration, such as grouping atom by residue and chain.

If this is the intended behavior, it'd be nice to have it documented in some sort of "PDB loading cookbook" #1554


mattwthompson · 2024-08-02T20:11:24Z

Something might be going on here, can't tell if this is related

ipdb> protein.atom(0).metadata["residue_number"]
'14'
ipdb> protein.residues[0].residue_number
'14'
ipdb> protein.to_topology().atom(0).metadata["residue_number"]
'14'
ipdb> protein.to_topology().residues[0].residue_number
'1'

j-wags · 2024-08-06T23:01:29Z

Thanks for the great writeup. It's not a bug, just a poorly documented design.

The core issue is that Atom and Molecule/Topology-level metadata aren't kept in sync by default.

Changes to Atom-level metadata can be propagated to relevant HierarchyElements by running offmol.perceive_hierarchy.

There's currently no way to propagate HierarchyElement metadata changes back to the underlying atoms.
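The one-way, on-demand relationship described above can be illustrated with a toy model. This is a minimal sketch, not the toolkit's implementation: `ToyAtom`, `ToyResidue`, and `ToyMolecule` are hypothetical stand-ins, and `ToyMolecule.perceive_hierarchy` here only plays the role that the thread attributes to re-perceiving the hierarchy (or to `Molecule.update_hierarchy_schemes()`, as quoted later in the thread).

```python
from dataclasses import dataclass, field


@dataclass
class ToyAtom:
    """Hypothetical stand-in for an atom carrying a metadata dict."""
    name: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyResidue:
    """Hypothetical stand-in for a hierarchy element: a snapshot, not a live view."""
    residue_number: str
    atoms: list


class ToyMolecule:
    def __init__(self, atoms):
        self.atoms = list(atoms)
        self.residues = []
        self.perceive_hierarchy()

    def perceive_hierarchy(self):
        """Rebuild the residue list from the atoms' current metadata."""
        groups = {}
        for atom in self.atoms:
            groups.setdefault(atom.metadata["residue_number"], []).append(atom)
        self.residues = [ToyResidue(num, members) for num, members in groups.items()]


mol = ToyMolecule([
    ToyAtom("N", {"residue_number": "1"}),
    ToyAtom("CA", {"residue_number": "1"}),
    ToyAtom("N", {"residue_number": "2"}),
])

# Edit the atoms: the residue objects still hold the old snapshot.
for atom in mol.atoms:
    atom.metadata["residue_number"] = str(int(atom.metadata["residue_number"]) + 10)
stale = [res.residue_number for res in mol.residues]

# Re-perceiving rebuilds the residues from the edited atom metadata.
mol.perceive_hierarchy()
fresh = [res.residue_number for res in mol.residues]

# The reverse direction has no propagation path: the atom is unchanged.
mol.residues[0].residue_number = "99"
atom_view = mol.atoms[0].metadata["residue_number"]
```

In this model the atoms are the single source of truth, which is why an edit made on a residue is silently lost the next time the hierarchy is rebuilt.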

Some options I could see are:

If we don't want to add any major new functionality:
- Better documentation of this behavior
- Raise an error if the user tries to __setattr__ on a HierarchyElement

Additionally, if we want to reduce confusion about the state of HierarchyElements after changes to Atom metadata:
- When an Atom is iterated over inside a perceive_hierarchy call, have an attribute like _hierarchy_perceived be set on it. Then if someone updates the atom's metadata after that, print a warning saying that they'll need to call perceive_hierarchy again for those changes to be reflected in Molecule/Topology iterators

If we want to automatically update HierarchyElements based on changes to Atom metadata:
- Modify Atoms to know which HierarchyElements they're part of, and if the Atom's metadata is changed, mark the HierarchyElement as being in a "dirty" state. Also, implement a check for this dirty state any time it might become important and re-perceive hierarchy when appropriate. This seems potentially complex, performance-impacting, and brittle.

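Two of the lighter-weight options listed above (raising on `__setattr__`, and warning when metadata is edited after perception) could look roughly like the following. This is only a sketch under stated assumptions: `TrackedMetadata`, `ReadOnlyResidue`, and the `hierarchy_perceived` flag are hypothetical names invented for illustration and are not part of the toolkit.

```python
import warnings


class TrackedMetadata(dict):
    """Hypothetical metadata dict that warns when edited after perception.

    Limitation of this sketch: only item assignment is intercepted;
    dict.update() and similar methods bypass __setitem__.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # A perception pass would set this to True for every atom it visits.
        self.hierarchy_perceived = False

    def __setitem__(self, key, value):
        if self.hierarchy_perceived:
            warnings.warn(
                f"{key!r} changed after hierarchy perception; re-perceive "
                "the hierarchy before iterating over residues again.",
                stacklevel=2,
            )
            # Warn once per perception pass rather than on every edit.
            self.hierarchy_perceived = False
        super().__setitem__(key, value)


class ReadOnlyResidue:
    """Hypothetical hierarchy element that rejects attribute assignment."""

    def __init__(self, residue_number):
        # Bypass our own __setattr__ for the initial assignment.
        object.__setattr__(self, "residue_number", residue_number)

    def __setattr__(self, name, value):
        raise AttributeError(
            f"cannot set {name!r} on a hierarchy element; edit the atom "
            "metadata and re-perceive the hierarchy instead"
        )


metadata = TrackedMetadata(residue_number="1")
metadata["residue_number"] = "2"       # silent: nothing perceived yet
metadata.hierarchy_perceived = True    # simulate a perception pass

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    metadata["residue_number"] = "12"  # warns: hierarchy is now stale

residue = ReadOnlyResidue("12")
try:
    residue.residue_number = "foo"
    blocked = False
except AttributeError:
    blocked = True
```

Neither piece keeps the two views in sync; they only make the stale state visible, which is the trade-off behind the lower-effort options.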
After thinking through it, I don't think we want to allow setting atom metadata through a HierarchyElement attribute (so, I don't think we should allow offmol.residues[0].residue_name = "foo"). That's because this could easily land a molecule in a state where it has two residues with identical metadata, which violates the workings of HierarchyScheme (atoms with identical metadata all go into a single residue).
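The collision concern can be shown with a small grouping function. This is a simplified sketch: it assumes residues are keyed only on `residue_name` and `residue_number`, and `group_by_residue` is a hypothetical helper, not a toolkit function.

```python
from collections import defaultdict


def group_by_residue(atom_metadata):
    """Group atom indices by (residue_name, residue_number).

    Atoms with identical metadata always land in the same group.
    """
    groups = defaultdict(list)
    for index, metadata in enumerate(atom_metadata):
        key = (metadata["residue_name"], metadata["residue_number"])
        groups[key].append(index)
    return dict(groups)


atoms = [
    {"residue_name": "ALA", "residue_number": "1"},
    {"residue_name": "ALA", "residue_number": "1"},
    {"residue_name": "GLY", "residue_number": "2"},
]
before = group_by_residue(atoms)

# Simulate a write-through residue setter that gives the second residue
# the same metadata as the first.
for index in before[("GLY", "2")]:
    atoms[index].update(residue_name="ALA", residue_number="1")

# On the next perception pass the two residues can no longer be told apart.
after = group_by_residue(atoms)
```

Under these assumptions, `before` has two groups and `after` has one, so a setter on the residue would silently merge residues as soon as the hierarchy is rebuilt.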

Happy to take a PR for any of these or discuss further.

mattwthompson · 2024-08-07T13:56:33Z

My low-effort before-coffee opinion on this now is that we should quickly document most of this behavior (~few days) before we make any decisions about major design changes (~weeks to several months).

Yoshanuikabundi · 2024-08-07T17:33:46Z

FWIW, this is documented:

- The HierarchyScheme class and Molecule.add_hierarchy_scheme() method both say it explicitly with the wording quoted below
- Molecule.add_default_hierarchy_schemes() mentions it
- I tried to point to one of the above sources everywhere a hierarchy scheme is mentioned.

Hierarchy schemes are not updated dynamically; if a Molecule with hierarchy schemes changes, Molecule.update_hierarchy_schemes() must be called before the scheme is iterated over again or else the grouping may be incorrect.

When I was writing the documentation for this feature, I did find it a bit difficult to decide where to put this information, because a lot of users will only interact with, say, Molecule.residues, which can't be documented because it's not really a method. I kind of ended up just trying to put it everywhere. I'm a bit miffed that I didn't put it in Topology.hierarchy_iterator(), because that seems like an obvious place that people might run into it. If you have any other suggestions for more places to mention this, or clearer ways to phrase it, I'm all ears!

I also agree that dynamic iterators would be much less surprising.

mattwthompson · 2024-08-08T13:34:09Z

FWIW, I searched mostly for "metadata" since that's the way I was interacting with the API. I appreciate that this behavior is mentioned several times in those methods, but they're also methods I didn't need to interact with to get this behavior.

My only suggestion at the documentation level is to include it when Atoms are described, i.e. here or a probably-copy-pasted-bit in Molecule.add_atom:

A "Beware! This other thing ... see here" might have helped in this case.

Anyway, I think making the actual behavior less squishy would be more impactful than documenting it a fifth time over. The power and flexibility of this functionality has huge upside, but I seem to keep shooting myself in the foot when I try to hook into it for tasks I naively think are straightforward.

mattwthompson added bug documentation reliability pdb reading labels Aug 2, 2024
