Loosen the hash computation by not including the core version #6215
Conversation
This change makes a lot of sense; it would be good to get it into 2.5.0 if @sphuber agrees.
I think the idea is definitely interesting and we probably want to apply it. However, I think we need a bit more time to think about any issues related to backwards compatibility. We might want to add a database migration to update all existing nodes. After internal discussion, we have slated this for the release following v2.5.
As @sphuber mentioned above, this probably needs a migration. Some comments, but I do not know the caching system well enough to do a proper review.
```python
objects = node._get_objects_to_hash()
core_version = import_module(node.__module__.split('.', maxsplit=1)[0]).__version__

assert core_version not in objects
```
Would this assertion fail before the change? Shouldn't it look at `obj['version']['core']` or something like it?
It will fail. Besides changing the core version to only contain the major version, the first change in this PR is to remove the `core_version` that was redundantly added to `_get_objects_to_hash`; see aiidateam/aiida-pseudo#167 (comment) for the original return value of the method.
I am starting to need this change, since I updated the
To me it is not clear whether this is the right solution and whether it doesn't introduce problems. For starters, if the version is a problem, why keep the major version? If the real thing that matters in caching is the hash of a node, and the hash is simply defined by its content, then if the content doesn't change, the hash also shouldn't change, even if the version does.

Furthermore, even if we change the core version to just include the major version, the plugin version is still there. That might also change often, and you would still have the same problem; it would now just be pushed down to the plugins.

And if the answer is yes, I don't think the current solution is the right one. Why change the actual versions that are stored (or even remove them)? This way we are losing valuable information, when we can simply modify the hash computation to ignore them.

Finally, there is the question whether we need a migration if we go through with this. We already changed hashing in v2.4 and added a migration for it. That was the first migration since v2.0, and we might have to add one once again in the next release and drop the hashes once again. Not saying we shouldn't do it, but it is something to keep in mind, as it will make it more difficult for users to switch between this new version and older ones.

So I think some discussion is still required before we can go on with this. Thoughts?
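To illustrate the trade-off being discussed, here is a minimal, hypothetical sketch (plain Python, not aiida-core code) of content-based hashing: including a package version among the hashed objects means every release changes the hash, even when the node's actual content is identical.

```python
import hashlib
import json


def compute_hash(objects: dict) -> str:
    # Serialize deterministically, then hash -- mimics content-based node hashing.
    serialized = json.dumps(objects, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()


content = {'attributes': {'x': 1.0, 'parser_name': 'some.parser'}}

# With the version included, a mere version bump changes the hash:
hash_v1 = compute_hash({**content, 'version': {'core': '2.4.0'}})
hash_v2 = compute_hash({**content, 'version': {'core': '2.5.0'}})
print(hash_v1 == hash_v2)  # False

# Without it, identical content always yields the same hash:
hash_a = compute_hash(content)
hash_b = compute_hash(content)
print(hash_a == hash_b)  # True
```

The version strings and attribute names here are purely illustrative; the point is only that any hashed field that changes per release invalidates all existing cache sources.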
After the discussion with @sphuber, we agreed to completely remove the aiida-core version from the hash computation. In this PR I didn't remove the plugin version from caching, and I think for the moment we'd better keep it there. As mentioned by @GeigerJ2, a false-positive cache hit can happen if the plugin changes its implementation and the calcjob gives different outputs; inside a workchain, the cached calcjob node would then be taken and fail the workchain.
Before I forget, another thing we might want to change while we are changing the hashing algorithm: it would be useful if
```diff
@@ -135,6 +132,7 @@ def _hash_ignored_attributes(cls) -> Tuple[str, ...]:  # noqa: N805
     'priority',
     'max_wallclock_seconds',
     'max_memory_kb',
+    'version',  # ?? How do I exclude version.core ??
```
@sphuber
First, attributes like `queue_name` are in `last_job_info`, which is in `_node._updatable_attributes` and already excluded, so there is no point in keeping them there.
Second, exclusion currently only works by top-level key; for nested dictionaries, the logic in `CalcJobNodeCaching` does not yet go level by level to exclude things. To make this work, I think `_hash_ignored_attributes` could specify `version.core`, and the `NodeCaching` class would then use that format to precisely exclude `version.core`, turning the attributes object from `{'version': {'core': --, 'plugin': --}}` into `{'version': {'plugin': --}}`.
What do you think?
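The dotted-path exclusion proposed above could be sketched roughly as follows. This is a hypothetical, self-contained illustration; `exclude_nested` is not an aiida-core function.

```python
import copy


def exclude_nested(attributes: dict, dotted_key: str) -> dict:
    """Return a copy of ``attributes`` with the dotted key (e.g. 'version.core') removed."""
    result = copy.deepcopy(attributes)
    *parents, leaf = dotted_key.split('.')
    target = result
    for key in parents:
        # Walk down the nesting; a missing intermediate makes this a no-op.
        target = target.get(key, {})
    target.pop(leaf, None)
    return result


attrs = {'version': {'core': '2.5.0', 'plugin': '1.0.0'}, 'priority': 5}
print(exclude_nested(attrs, 'version.core'))
# {'version': {'plugin': '1.0.0'}, 'priority': 5}
```

The original dictionary is left untouched, which matters if the same attributes object is also stored on the node.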
You could change `_hash_ignored_attributes` and the implementation that consumes it to support nested keys, but you could also update `CalcJobNodeCaching._get_objects_to_hash()` to just call the `super`, get the element of the returned list that corresponds to the dictionary of attributes, and remove the `'version.core'` key from it.

I am still not 100% clear about the hypothetical case of the plugin being updated needing the cache to be invalidated (by the hash being different). Wouldn't the same reasoning also apply to `aiida-core`? What if the `CalcJob` base class changes some small (but significant) part of the implementation? Wouldn't that be the same scenario? So why keep the plugin version but drop the core version in that case? And if the change could happen due to a bug fix (which could be released with a patch version of `aiida-core`), then just checking the major or even minor version also won't protect against this hypothetical case.

So it seems we are back to square one. Yes, keeping the core and plugin versions in the hash might be the safest, but how common are the cases it is trying to protect against? If they are rare enough, should we really invalidate hashes with each (patch) version update, strongly reducing the effectiveness of the caching mechanism?
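The subclass-override alternative suggested above might look roughly like this. The class and method names mirror the discussion but are a simplified stand-in, not the real aiida-core API.

```python
class NodeCaching:
    """Stand-in for the base caching class."""

    def __init__(self, attributes: dict):
        self.attributes = attributes

    def _get_objects_to_hash(self) -> list:
        # Base implementation: hash the class name and the raw attributes.
        return [type(self).__name__, dict(self.attributes)]


class CalcJobNodeCaching(NodeCaching):
    def _get_objects_to_hash(self) -> list:
        # Call the parent, then strip the nested 'core' entry from the
        # attributes dict in the returned list, so version bumps of the
        # core package do not change the hash.
        objects = super()._get_objects_to_hash()
        for obj in objects:
            if isinstance(obj, dict) and 'version' in obj:
                obj['version'] = {k: v for k, v in obj['version'].items() if k != 'core'}
        return objects


caching = CalcJobNodeCaching({'version': {'core': '2.5.0', 'plugin': '1.0'}, 'x': 1})
print(caching._get_objects_to_hash()[1]['version'])  # {'plugin': '1.0'}
```

The advantage of this route is that the generic `_hash_ignored_attributes` machinery stays flat and simple; only the one subclass that knows about the nested `version` attribute handles it.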
I have the same question as @sphuber. Ignoring the aiida-core version but invalidating the cache upon a plugin update seems very strange, especially since plugins might undergo more rapid development. It does make some sense, in that we have tight control over what we do in aiida-core, and if we did make some breaking change, we can always drop the hashes via a DB migration. This would be more difficult to handle for plugins (not to mention that plugin developers might not be aware of caching subtleties).
@unkcpz @GeigerJ2 It would be very useful if you could write down a couple of such scenarios in detail. What are the exact conditions for the breakage? What would actually happen when there is a breaking change? Would the whole workflow crash or silently do the wrong thing? Based on that, I wonder if we could have some mitigation strategies in aiida-core to prevent these issues.
The example I did encounter is that I ran a workchain of
I see, that would indeed fail. There is a deeper problem here: the real cause for the discrepancy in this scenario is the change in the parser. Still, in this situation, one could "invalidate" the original
This sounds reasonable as long as it is thoroughly documented. How does one currently invalidate the cache for a single node? Is it possible to do through the CLI? If not, we should make it as easy as possible. One disadvantage I see is that the user might not be aware that caching was the reason for the failure. Is there anything that could be done to make that more apparent? (This is just thinking out loud; I haven't used caching myself yet, so perhaps these are non-issues.)
Not yet; I planned it in #6096 but have not finished it.
As @unkcpz said, there isn't a CLI endpoint yet, but he has a PoC. In the Python API, you literally just do
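The exact API call was lost in the page scrape above, but the mechanism being described, invalidating a single node as a cache source, amounts to flipping a per-node flag so future hash lookups skip it. A minimal, hypothetical sketch (plain Python, not the aiida-core API):

```python
class CachedNode:
    """Toy stand-in for a stored node; the flag name is an assumption."""

    def __init__(self, node_hash: str):
        self.node_hash = node_hash
        self.is_valid_cache = True  # assumption: a boolean flag per node


def find_cache_source(nodes, node_hash):
    # Only nodes that still advertise themselves as valid cache sources match.
    for node in nodes:
        if node.node_hash == node_hash and node.is_valid_cache:
            return node
    return None


a, b = CachedNode('abc'), CachedNode('abc')
a.is_valid_cache = False  # "invalidate" the first node
print(find_cache_source([a, b], 'abc') is b)  # True
```

In this model, invalidation is per node and reversible, without touching the node's hash or stored content.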
Not sure, I doubt it though. If you think about @unkcpz's example, a

I did recently improve the
Maybe we can take this as the action: we remove both the core and plugin versions. Since caching is not enabled by default, users will not encounter any issues caused by false-positive caching unless they opt in. But sure, adding CLI control for caching is also important, to come to the rescue if weird things happen with caching.
I like that idea; we can link to the "Limitations and guidelines" section. Note that this section needs to be updated if we go forward with this PR.
I don't know the CalcJob internals well, but I wonder if there's anything currently not included in the hash that would decrease the chance of a bad cache hit (i.e. something more fine-grained than the plugin version?).
I don't think so, but you can take a look (ideally at a calcjob submitted to a remote machine using a non-direct scheduler, which may contain more attributes) by checking the
Do you want to update this PR, @unkcpz, and remove the version information entirely from the hash calculation? We have given ample opportunity to discuss, and I think we are converging on the conclusion that it makes caching ineffective and that we are willing to take the slight increase in risk of false positives. I would like to include these changes in the upcoming release. I just opened #6321, which fixes a bug in caching. It would be good to include all these changes in one go, with a database migration that drops all hashes.
Would be great to have this, I've just started using caching and it is perfect for our use case.
I have to say that I am still uncomfortable passing this burden on to users, and I would like us to think harder about making these cases less probable, since caching bugs are notoriously hard to debug. The scenario pointed out by @unkcpz could be traced to the change in the parser. I am wondering if, instead of just copying the output nodes, we could run the current parser on top of the finished calculation? Another scenario would be the plugin changing the implementation of the function that prepares the input files for the QM program (prepare_submission?). I could imagine, e.g., a plugin changing some default parameters of the ab initio calculation. Is this something we could capture somehow, e.g. by running the function and comparing the input files with the ones from the cached nodes? Another, somewhat orthogonal question: in aiida-core we can do a DB migration and drop the hashes if the implementation changes. Do plugin authors have any possibility of invalidating the cache in a similar manner, if the plugin version is no longer part of the hash?
This is unfortunately not really an option, because the parser could rely on files that are retrieved with the
This would become very slow, though. The whole premise of the caching mechanism is that, to match nodes, a query is performed based on their hash, which is computed just once when the node is stored. With your suggestion, you would now have to find all nodes that match the hash and then actually, one by one, start running the
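The performance premise described above can be made concrete with a toy sketch (illustrative only, not aiida-core code): matching by a precomputed hash is a single indexed lookup, whereas re-verifying each candidate (e.g. by re-running a parser on it) would scale with the number of matching nodes.

```python
# hash -> list of node ids, mimicking an indexed database query
nodes_by_hash = {}


def store(node_id, node_hash):
    nodes_by_hash.setdefault(node_hash, []).append(node_id)


def find_cache_source(node_hash):
    # One dictionary/DB lookup, independent of how many nodes exist.
    matches = nodes_by_hash.get(node_hash, [])
    return matches[0] if matches else None


store(1, 'abc')
store(2, 'abc')
print(find_cache_source('abc'))  # 1
print(find_cache_source('zzz'))  # None
```

Any scheme that verifies candidates by recomputation would replace the single `find_cache_source` lookup with a loop that does real work per candidate, which is the cost being objected to.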
No, they cannot; only
Hi @sphuber, I am going to finish this one today. So, in the end, we agreed on removing all versions from caching, correct? If so, this PR will be a tiny one.
I did suggest going ahead with removing all versions from the hash, but then @danielhollas voiced concerns. I responded to his suggestions, explaining why I think they are not feasible. I am not sure whether those counter-comments were satisfactory.
Not sure, to be honest. It would require a special exception in
I am fine with the proposed change; the purpose of voicing concerns was to make us think a bit harder about how to improve the situation. :-) I was aware that the proposed ideas were probably not going to fly, but I was hoping they would trigger some ideas from @sphuber, who is knowledgeable about the implementation details. :-) tl;dr: @unkcpz, please go ahead with the implementation in this PR. Also, finishing the CLI caching control in #6096 would be great. Any further work is better done separately anyway.
I think the biggest argument against this is the implementation complexity. I wonder if there is some middle ground that could catch most issues (e.g. just checking the default input file?). I am not convinced that this would be a big performance hit. In most cases, the

(Perhaps we should move this discussion to Discourse and focus on the implementation in this PR.)
👍 Will do it.
I think I sort of agree with @danielhollas and think this would bring a bit more robustness to the caching mechanism. In the end, there may not be too many cached nodes found anyway.
Hi @sphuber, I think this is ready for the final review.
Thanks @unkcpz. I gave the changes a quick look and they look OK to me. I would just first want to make sure that what I suggested in this comment would provide the necessary control for plugin developers to essentially reset the cache if the implementation of a
I just opened #6328 as a draft PR that attempts an implementation but, as I describe in detail there, it seems not so easy after all...
LGTM! (with one suggestion for an extra test).
Might be nice to get this landed first and integrate it into #6328 so that we can test everything end-to-end.
When the caching mechanism was first introduced, precautions were taken to reduce the chances of false-positive cache hits as much as possible. For that reason, the version of the package that provides the node class was included in the calculation of the hash. This would cover `Data` plugins; however, since the `ProcessNode` class is not subclassable, there the version information was added slightly differently. When a `Process` is run, the `version` attribute is added to the attributes of its `ProcessNode`, which includes the version of `aiida-core` that is currently installed, as well as that of the plugin package providing the process plugin.

Now that caching has been battle-tested and shown to be useful in practice, this approach turns out to be too limiting. Whenever a new version of `aiida-core` or a plugin is installed, the associated nodes are invalidated as valid cache sources, as future node instances will have a different hash merely because the version has changed.

Therefore, the version information is removed from the calculation of the hash. The version of the package providing the node class that was explicitly added in `NodeCaching.get_objects_to_hash` is removed, and the `version` attribute is added to the list of `_hash_ignored_attributes`.
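The net effect described in the commit message can be sketched as follows. This is a simplified, hypothetical illustration of the ignore-list mechanism; the names mirror the discussion but are not the real aiida-core implementation.

```python
# After this PR, the version is no longer injected into the hashed objects;
# instead, the 'version' attribute is simply listed among the ignored ones
# and filtered out before hashing.
_hash_ignored_attributes = ('queue_name', 'priority', 'version')


def objects_to_hash(attributes: dict) -> dict:
    """Return only the attributes that participate in the hash."""
    return {k: v for k, v in attributes.items() if k not in _hash_ignored_attributes}


attrs = {'x': 1.0, 'version': {'core': '2.5.0', 'plugin': '0.3.0'}, 'priority': 5}
print(objects_to_hash(attrs))  # {'x': 1.0}
```

Because `version` is ignored wholesale, no nested-key handling is needed, which is what makes the final change so small.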
Thanks to you both @unkcpz and @danielhollas
Thanks for taking care of this @sphuber, and sorry for not being active on aiida stuff this month.
fixes #6214
I removed the redundant version and still keep the major version (TBD); maybe we can remove the core version entirely.