
Loose the hashing computing without include core version #6215

Merged

Conversation

@unkcpz (Member) commented Dec 8, 2023

Fixes #6214

I removed the redundant version and kept only the major version for now (TBD); maybe we can remove the core version entirely.

@danielhollas (Collaborator) commented

This change makes a lot of sense; it would be good to get it into 2.5.0 if @sphuber agrees.

@sphuber (Contributor) commented Dec 11, 2023

I think the idea is definitely interesting and we probably want to apply it; however, I think we need a bit more time to think about any issues related to backwards compatibility. We might want to add a database migration to update all existing nodes. After internal discussion, we slated this most likely for the release following v2.5.

Review threads on aiida/plugins/utils.py (two comments) and tests/engine/processes/calcjobs/test_calc_job.py (outdated, resolved)
@unkcpz force-pushed the fix/6214/loose-hashing-make-without-core-version branch from b35ce50 to ae3e749 on December 12, 2023
@unkcpz force-pushed the fix/6214/loose-hashing-make-without-core-version branch from 2b14896 to 2ccb2a4 on December 14, 2023
@danielhollas (Collaborator) left a comment

As @sphuber mentioned above, this probably needs a migration. I left some comments, but I do not know the caching system well enough to do a proper review.

Review thread on docs/source/reference/command_line.rst (outdated, resolved)
# From the updated test: the aiida-core version string must no longer be among the hashed objects.
objects = node._get_objects_to_hash()
core_version = import_module(node.__module__.split('.', maxsplit=1)[0]).__version__

assert core_version not in objects
@danielhollas (Collaborator) commented:

Would this assertion fail before the change? Shouldn't it look at obj['version']['core'] or something like that?

@unkcpz (Member, Author) replied:

It will fail. Besides changing the core version to contain only the major version, the first change in this PR is to remove the core_version that was redundantly added in _get_objects_to_hash; see aiidateam/aiida-pseudo#167 (comment) for the original return value of the method.

@unkcpz force-pushed the fix/6214/loose-hashing-make-without-core-version branch from 163b263 to 7e2f3fb on December 18, 2023
@unkcpz force-pushed the fix/6214/loose-hashing-make-without-core-version branch from 4fe37a1 to 8f53d77 on January 11, 2024
@unkcpz (Member, Author) commented Jan 11, 2024

Hi @sphuber, I rebased this PR. You can give it a look and then we can find time to discuss it.
I think PR #6000 includes the required migration. Should we do that separately or in this PR?

@unkcpz (Member, Author) commented Feb 8, 2024

I am starting to need this change, since I updated aiida-core in my production environment and nothing is cached anymore. Any idea on how to proceed with it? @sphuber

@sphuber (Contributor) commented Feb 8, 2024

I am starting to need this change, since I updated aiida-core in my production environment and nothing is cached anymore. Any idea on how to proceed with it? @sphuber

To me it is not clear whether this is the right solution and whether it doesn't introduce problems. For starters, if the version is a problem, why keep the major version? If the real thing that matters in caching is the hash of a node, and the hash is simply defined by its content, then if the content doesn't change, the hash also shouldn't change, even if the aiida-core version that created it changes.

Furthermore, even if we change the core version to include just the major version, the plugin version is still there. That might also change often and you would still have the same problem. This would now push it down to plugins to update their Data plugins to manually exclude that attribute from _get_objects_to_hash. If we think the version is not needed in the hash and really is just making caching useless in many cases, shouldn't we just get rid of it entirely?

And if the answer is yes, I don't think the current solution is the right one. Why change the actual versions that are stored (or even remove them)? That way we are losing valuable information, when we can simply modify NodeCaching._get_objects_to_hash to exclude this key. This would have the same effect, without any information loss.

Finally, there is the question whether we need a migration if we go through with this. We already changed hashing in v2.4 and added a migration for it. That was the first migration since v2.0, and we might have to add another one in the next release and drop the hashes once again. That is not to say we shouldn't do it, but it is something to keep in mind, as it will make it more difficult for users to switch between this new version and older ones.

So I think some discussion is still required before we can go on with this. Thoughts?

@unkcpz (Member, Author) commented Feb 22, 2024

After the discussion with @sphuber, we agreed to completely remove the aiida-core version from the hash computation.

In this PR, I didn't remove the plugin version from caching, and I think for the moment we'd better keep it there. As mentioned by @GeigerJ2, false-positive cache hits can happen if the plugin changes its implementation and the calcjob would give different outputs; inside a workchain, the cached calcjob node is then taken and fails the workchain.
I remember I already encountered this issue during plugin development, so even when the plugin version exists it does not help during development.
But we need to prevent this from happening to users of a plugin by including the plugin version, so that once a new version of the plugin is released, its calcjobs won't be cached from nodes created by the old version.
We can keep loosening this restriction in the future by including only the minor or even only the major version of the plugin.
For now, I assume that fully removing the aiida-core version from the hashing is already a bold move.

@sphuber (Contributor) commented Feb 23, 2024

Before I forget, another thing we might want to change while we are changing the hashing algorithm: it would be useful if _get_objects_to_hash returned a dictionary instead of a list. Sometimes, to debug hashing, it is useful to look at the value returned by this method, but it being a list makes it difficult to identify what each value represents. If we turn it into a dictionary, that would make it easier to identify each part.
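As a rough, framework-free illustration of why a dictionary helps when debugging (the entries below are invented and do not reflect the actual return value of _get_objects_to_hash):

# With a list, it is hard to tell which element corresponds to what:
objects_as_list = ['CalcJobNode', {'parser_name': 'core.arithmetic.add'}, '2.5.0']

# With a dictionary, each part of the hash input is self-describing:
objects_as_dict = {
    'class': 'CalcJobNode',
    'attributes': {'parser_name': 'core.arithmetic.add'},
    'version': '2.5.0',
}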

@unkcpz force-pushed the fix/6214/loose-hashing-make-without-core-version branch from 6b66903 to 579ee5f on February 23, 2024
@@ -135,6 +132,7 @@ def _hash_ignored_attributes(cls) -> Tuple[str, ...]: # noqa: N805
'priority',
'max_wallclock_seconds',
'max_memory_kb',
'version', # ?? How I exclude the version.core ??
@unkcpz (Member, Author) commented:

@sphuber
First, attributes like queue_name are in last_job_info, which is in _node._updatable_attributes and therefore already excluded, so there is no point keeping them there.
Second, exclusion currently only works by top-level key, and for nested dictionaries the logic in CalcJobNodeCaching does not yet go level by level to exclude things.

In order to make this work, I think _hash_ignored_attributes here could specify version.core, and the NodeCaching class would use that format to exclude precisely version.core, turning the attributes object from {'version': {'core': --, 'plugin': --}} into {'version': {'plugin': --}}.

What do you think?

@sphuber (Contributor) replied:

You could change _hash_ignored_attributes and the implementation that consumes it to support nested keys, but you could also update CalcJobNodeCaching._get_objects_to_hash() to just call the super, then from the returned list get the element that corresponds to the dictionary of attributes, and then remove the 'version.core' key.

I am still not 100% clear about the hypothetical case of the plugin being updated needing the cache to be invalidated (by the hash being different). Wouldn't the same reasoning also apply to aiida-core? What if the CalcJob base class changes some small (but significant) part of the implementation? Wouldn't that be the same scenario? So why keep the plugin version but drop the core version in that case? And if the changes could happen due to a fix of a bug (which could be released with a patch version of aiida-core), then just checking the major or even minor version also won't protect against this hypothetical case.

So it seems we are back to square one. Yes, keeping the core and plugin versions in the hash might be the safest, but how common are the cases that it is trying to protect against? If they are rare enough, should we really invalidate hashes with each (patch) version update, strongly reducing the effectiveness of the caching mechanism?
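A rough sketch of the second approach mentioned above, post-processing the objects returned by the super call; the shape of the returned objects is assumed here purely for illustration:

def strip_core_version(objects):
    # Remove only the nested 'version.core' entry, keeping 'version.plugin'.
    cleaned = []
    for obj in objects:
        if isinstance(obj, dict) and isinstance(obj.get('version'), dict):
            obj = {**obj, 'version': {key: value for key, value in obj['version'].items() if key != 'core'}}
        cleaned.append(obj)
    return cleaned

objects = [{'version': {'core': '2.5.0', 'plugin': '1.0.0'}, 'max_memory_kb': None}]
assert strip_core_version(objects)[0]['version'] == {'plugin': '1.0.0'}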

@danielhollas (Collaborator) commented

I am still not 100% clear about the hypothetical case of the plugin being updated needing the cache to be invalidated (by the hash being different). Wouldn't the same reasoning also apply to aiida-core? What if the CalcJob base class changes some small (but significant) part of the implementation? Wouldn't that be the same scenario? So why keep the plugin version but drop the core version in that case? And if the changes could happen due to a fix of a bug (which could be released with a patch version of aiida-core), then just checking the major or even minor version also won't protect against this hypothetical case.
So it seems we are back to square one. Yes, keeping the core and plugin versions in the hash might be the safest, but how common are the cases that it is trying to protect against? If they are rare enough, should we really invalidate hashes with each (patch) version update, strongly reducing the effectiveness of the caching mechanism?

I have the same question as @sphuber. Ignoring the aiida-core version but invalidating the cache upon a plugin update seems very strange, especially since plugins might undergo more rapid development. It does make sense in the sense that we have tight control over what we do in aiida-core, and if we did make some breaking change, we can always drop the hashes via a DB migration. This would seem more difficult to handle for plugins (not to mention that plugin developers might not be aware of caching subtleties).

In this PR, I didn't remove the plugin version from caching, and I think for the moment we'd better keep it there. As mentioned by @GeigerJ2, false-positive cache hits can happen if the plugin changes its implementation and the calcjob would give different outputs; inside a workchain, the cached calcjob node is then taken and fails the workchain.
I remember I already encountered this issue during plugin development, so even when the plugin version exists it does not help during development.

@unkcpz @GeigerJ2 It would be very useful if you could write down a couple of such scenarios in detail. What are the exact conditions for the breakage? What would actually happen when there is a breaking change? Would the whole workflow crash or silently do the wrong thing? Based on that, I wonder if we could have some mitigation strategies in aiida-core to prevent these issues.

@unkcpz (Member, Author) commented Feb 23, 2024

@unkcpz @GeigerJ2 It would be very useful if you could write down a couple of such scenarios in detail. What are the exact conditions for the breakage?

The example I encountered: I run a workchain of CALCJOB_A -> Dict_FORMAT_A -> CALCJOB_B. CALCJOB_A outputs a dict and I use values from that dict as inputs for CALCJOB_B (there can be some calcfunction in the middle for provenance, but that doesn't matter for the moment). A minimal sketch of the resulting failure is shown after the list.

  • I run the workchain; CALCJOB_A gets cached, but for some reason CALCJOB_B does not finish successfully and therefore is not a valid cache source.
  • Let's assume I now change the output format of CALCJOB_A so that it is different, containing more items, and the newly included items are used by CALCJOB_B.
  • I rerun the workchain, expecting CALCJOB_A to run and generate the new format, but it actually takes the old CALCJOB_A from the cache and gives the old format.
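To make the failure concrete, here is a tiny, framework-free sketch of the scenario above; the names and values are invented for illustration and do not come from any real plugin.

# Old-format output of CALCJOB_A, served from the cache even after the plugin update:
cached_output_a = {'energy': -1.23}

# The updated workchain now also expects the newly added 'forces' item:
def run_calcjob_b(inputs):
    return inputs['energy'], inputs['forces']  # raises KeyError: 'forces' for the cached dict

run_calcjob_b(cached_output_a)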

@sphuber (Contributor) commented Feb 23, 2024

I see, that would indeed fail. There is a deeper problem here. The real cause for the discrepancy in this scenario is the Parser. Although this can be (and in practice almost always is) part of the same plugin package as the CalcJob, it doesn't have to be. But I admit that in practice this is unlikely, as usually the calcjob and parser live in the same package.

Still, in this situation, one could "invalidate" the original CALCJOB_A node so it won't be used for the cache anymore. How common are these scenarios? Would it not be more reasonable to request users invalidate these old nodes if the plugins have changed significantly, rather than making caching practically ineffective by default across the board?

@danielhollas (Collaborator) commented Feb 23, 2024

Still, in this situation, one could "invalidate" the original CALCJOB_A node so it won't be used for the cache anymore. How common are these scenarios? Would it not be more reasonable to request users invalidate these old nodes if the plugins have changed significantly, rather than making caching practically ineffective by default across the board?

This sounds reasonable as long as it is thoroughly documented. How does one currently invalidate the cache for a single node? Is it possible to do through the CLI? If not, we should make it as easy as possible. One disadvantage I see is that the user might not be aware that caching was the reason for the failure. Is there anything that could be done to make that more apparent?

(this is just thinking out loud. I haven't used caching myself yet so perhaps these are non-issues)

@unkcpz (Member, Author) commented Feb 23, 2024

Not yet; I planned it in #6096 but haven't finished it.

@sphuber (Contributor) commented Feb 23, 2024

How does one currently invalidate a cache for a single node?

As @unkcpz said, there isn't a CLI endpoint yet, but he has a PoC. In the Python API, you literally just do load_node().is_valid_cache = False. Once the CLI endpoint is there, I think it should be relatively easy to control for users.
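For example, to invalidate a single node as a cache source (the pk 1234 is just a placeholder):

from aiida.orm import load_node

node = load_node(1234)       # placeholder pk of the calculation to exclude
node.is_valid_cache = False  # this node will no longer be considered as a cache source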

One disadvantage I see is that the user might not be aware that caching was the reason for the failure. Is there anything that could be done to make that more apparent?

Not sure, I doubt it though. If you think about @unkcpz's example, a KeyError or AttributeError would probably be raised when the workchain code tries to access an attribute in the Dict node that doesn't exist. Should the aiida-core code then try to catch this exception and start inspecting the code that raises it, to see if it was operating on an output node of a CalcJobNode that was cached? Seems impossible ^^

I did recently improve the verdi process list output to show by default whether a process was taken from the cache or not. So if the workchain fails and they check that, they would see that the last calculation was cached, which might give a hint at least. Not sure how much more we could do. It surely won't be immediately obvious to new users that are not familiar with caching.

@unkcpz (Member, Author) commented Feb 23, 2024

Maybe we can take this as the action: we remove both the core and plugin versions. Since caching is not enabled by default, users will not encounter any issues caused by false-positive caching unless they opt in.
If a user tries to enable caching, we pop up a warning with a link, along the lines of "SHOULD READ BEFORE USING CACHING".

But sure, adding CLI support for caching control is also an important thing to come to the rescue if weird things happen with caching.

@danielhollas (Collaborator) commented

If a user tries to enable caching, we pop up a warning with a link, along the lines of "SHOULD READ BEFORE USING CACHING".

I like that idea; we can link to the "Limitations and guidelines" section:

https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/provenance/caching.html#limitations-and-guidelines

Note that this section needs to be updated if we go forward with this PR.

The real cause for the discrepancy in this scenario is the Parser. Although this can be (and in practice almost always is) part of the same plugin package as the CalcJob, it doesn't have to be. But I admit that in practice this is unlikely, as usually the calcjob and parser live in the same package.

I don't know the CalcJob internals well, but I wonder if there's anything currently not included in the hash that would decrease the chance of a bad cache hit (i.e. something more fine-grained than the plugin version?).

@unkcpz (Member, Author) commented Feb 23, 2024

I don't know the CalcJob internals well, but I wonder if there's anything currently not included in the hash that would decrease the chance of a bad cache hit (i.e. something more fine-grained than the plugin version?).

I don't think so. You can take a look (ideally at a calcjob submitted to a remote computer using a non-direct scheduler, which may contain more attributes) by checking node.base.attributes; those items, excluding node._updatable_attributes, are what is used for the hash computation.

@sphuber (Contributor) commented Mar 17, 2024

Do you want to update this PR @unkcpz and remove the version information entirely from the hash calculation? We have given ample opportunity to discuss, and I think we are converging on the conclusion that including the version makes caching ineffective and that we are willing to take the slight increase in the risk of false positives.

I would like to include these changes in the upcoming release. I just opened #6321 which fixes a bug in caching. It would be good to include all these changes in one go with a database migration that drops all hashes.

@danielhollas (Collaborator) commented Mar 17, 2024

Would be great to have this; I've just started using caching and it is perfect for our use case.

How common are these scenarios? Would it not be more reasonable to request users invalidate these old nodes if the plugins have changed significantly, rather than making caching practically ineffective by default across the board?

I have to say that I am still uncomfortable about passing this burden on to users, and I would like us to think harder about how to make these cases less probable, since caching bugs are notoriously hard to debug.

The scenario pointed out by @unkcpz could be traced to the change in the parser. I am wondering if, instead of just copying the output nodes, we could run the current parser on top of the finished calculation?

Another scenario would be if the plugin changed the implementation of the function that prepares the input files for the QM program (prepare_submission?). I could imagine e.g. plugin changing some default parameters of the ab initio calculations. Is this something we could capture somehow, e.g. by running the function and comparing the input files with the ones from the cached nodes?

Another, somewhat orthogonal question -- in aiida-core we can do a DB migration and drop the hashes if the implementation changes. Do plugin authors have any possibility of invalidating the cache in a similar manner, if the plugin version is no longer part of the cache?

@sphuber (Contributor) commented Mar 17, 2024

The scenario pointed out by @unkcpz could be traced to the change in the parser. I am wondering if, instead of just copying the output nodes, we could run the current parser on top of the finished calculation?

This is unfortunately not really an option, because the parser could rely on files that are retrieved with the retrieve_temporary_list but these files are not stored in the retrieved node and so cannot be provided to the parser.

Another scenario would be if the plugin changed the implementation of the function that prepares the input files for the QM program (prepare_submission?). I could imagine e.g. plugin changing some default parameters of the ab initio calculations. Is this something we could capture somehow, e.g. by running the function and comparing the input files with the ones from the cached nodes?

This would become very slow though. The whole premise of the caching mechanism is that to match nodes, a query is performed based on their hash, which is computed just once when the node is stored. With your suggestion, you would now have to find all nodes that match the hash, and then actually one-by-one start running the prepare_for_submission and then manually checking all input files, comparing them. This would be a huge performance hit. And then we are still not 100% sure that we capture everything, because just checking the input files generated by prepare_for_submission is not everything that determines the conditions for a calculation. So I would definitely not go in this direction.

Another, somewhat orthogonal question -- in aiida-core we can do a DB migration and drop the hashes if the implementation changes. Do plugin authors have any possibility of invalidating the cache in a similar manner, if the plugin version is no longer part of the cache?

No, they cannot; only aiida-core can add migrations. Users would have to essentially perform a query and invalidate those nodes, e.g. something like

from aiida.orm import QueryBuilder  # PwCalculation here would come from the aiida-quantumespresso plugin

for node, in QueryBuilder().append(PwCalculation).iterall():
    node.is_valid_cache = False

@unkcpz (Member, Author) commented Mar 18, 2024

Hi @sphuber, I am going to finish this one today. So in the end, we agree on removing all versions from caching, correct? If so, this PR will then be a tiny one.
As also mentioned in #6215 (comment), do we need the change to the prompt when activating caching?

@sphuber (Contributor) commented Mar 18, 2024

I am going to finish this one today. So in the end, we agree on removing all versions from caching, correct?

I did suggest to go on with removing all versions from the hash, but then @danielhollas voiced concerns. I responded to his suggestions, explaining why I think they are not feasible. Not sure if those counter comments were satisfactory.

As also mentioned in #6215 (comment), do we need the change to the prompt when activating caching?

Not sure to be honest. It would require a special exception in Config.set_option which feels a bit dirty. I think this should simply be part of the caching documentation.

@danielhollas (Collaborator) commented

I did suggest to go on with removing all versions from the hash, but then @danielhollas voiced concerns. I responded to his suggestions, explaining why I think they are not feasible. Not sure if those counter comments were satisfactory.

I am fine with the proposed change; the purpose of voicing concerns was to make us think a bit harder about how to make the situation better. :-) I was aware that the proposed ideas were probably not going to fly, but I was hoping they would trigger some ideas from @sphuber, who is knowledgeable about the implementation details. :-)

tl;dr @unkcpz please go ahead with the implementation in this PR. Also, finishing the CLI caching control in #6096 would be great. Any further work is better done separately anyway.

The whole premise of the caching mechanism is that to match nodes, a query is performed based on their hash, which is computed just once when the node is stored. With your suggestion, you would now have to find all nodes that match the hash, and then actually one-by-one start running the prepare_for_submission and then manually checking all input files, comparing them. This would be a huge performance hit.

I think the biggest argument against this is the implementation complexity. I wonder if there is some middle ground that could catch most issues (e.g. just checking the default input file?).

I am not convinced that this would be a big performance hit. In most cases, prepare_submission would only be called on a single node that matched, and its cost should be negligible compared to the cost of the actual QM calculation?

(perhaps we should move this discussion to Discourse and focus on the implementation in this PR)

@unkcpz (Member, Author) commented Mar 18, 2024

tl;dr @unkcpz please go ahead with the implementation in this PR.

👍 Will do it.

In most cases, prepare_submission would only be called on a single node that matched

I sort of agree with @danielhollas and think this would bring a bit more robustness to the caching mechanism. In the end there may not be too many cached nodes found.

@unkcpz force-pushed the fix/6214/loose-hashing-make-without-core-version branch 2 times, most recently from 4a1bc4b to 64f1531, on March 18, 2024
@unkcpz requested a review from sphuber on March 18, 2024
@unkcpz force-pushed the fix/6214/loose-hashing-make-without-core-version branch from 64f1531 to 6fc33b2 on March 18, 2024
@unkcpz (Member, Author) commented Mar 19, 2024

Hi @sphuber, I think this is ready for the final review.

@sphuber (Contributor) commented Mar 19, 2024

Thanks @unkcpz . I gave the changes a quick look and they look ok to me. I would just first want to make sure that what I suggested in this comment would provide the necessary control to plugin developers to essentially reset the cache if the implementation of a Data or CalcJob plugin changes significantly. I will try to make a PR tonight/tomorrow and then @danielhollas and yourself can maybe take a look at it. If that seems sufficient, we can merge both changes.

@sphuber (Contributor) commented Mar 20, 2024

I just opened #6328 as a draft PR that attempts an implementation, but as I describe in detail there, it seems not so easy after all...

@danielhollas (Collaborator) left a comment

LGTM! (with one suggestion for an extra test).
Might be nice to get this landed first and integrate it into #6328 so that we can test everything end-to-end.

Review thread on tests/orm/nodes/test_node.py (outdated, resolved)
When the caching mechanism was first introduced, precautions were taken to reduce the chances of false-positive cache hits to a minimum. For that reason, the version of the package that provides the node class was included in the calculation of the hash. This covers `Data` plugins; however, since the `ProcessNode` class is not subclassable, there the version information was added slightly differently. When a `Process` is run, the `version` attribute is added to the attributes of its `ProcessNode`, which includes the version of `aiida-core` that is currently installed as well as that of the plugin package providing the process plugin.

Now that caching has been battle-tested and shown to be useful in practice, this approach turns out to be too limiting: whenever a new version of `aiida-core` or a plugin is installed, the associated nodes are invalidated as valid cache sources, because future node instances will have a different hash merely because the version has changed.

Therefore, the version information is removed from the calculation of the hash. The version of the package providing the node class that was explicitly added in `NodeCaching.get_objects_to_hash` is removed, and the `version` attribute is added to the list of `_hash_ignored_attributes`.
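A minimal, self-contained illustration of the behaviour described above; the attribute names and the helper function below are invented for illustration and are not the actual aiida-core implementation:

# The 'version' attribute is simply ignored when collecting the objects that feed into the hash.
_hash_ignored_attributes = ('version',)

def objects_to_hash(attributes):
    return {key: value for key, value in attributes.items() if key not in _hash_ignored_attributes}

attributes = {'version': {'core': '2.5.1', 'plugin': '1.2.0'}, 'parser_name': 'core.arithmetic.add'}
assert 'version' not in objects_to_hash(attributes)  # the hash no longer depends on any version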
@sphuber force-pushed the fix/6214/loose-hashing-make-without-core-version branch from 7396c6c to 2222c65 on April 16, 2024
@sphuber (Contributor) left a comment

Thanks to you both @unkcpz and @danielhollas

@sphuber merged commit 4c60bbe into aiidateam:main on Apr 16, 2024 (19 checks passed)
@unkcpz (Member, Author) commented Apr 17, 2024

Thanks for taking care of this @sphuber, and sorry for not being active on aiida stuff this month.

@unkcpz deleted the fix/6214/loose-hashing-make-without-core-version branch on April 17, 2024