
Do not migrate legacy images if snapshots are already present #1990

Merged

Conversation

davidcassany
Contributor

This commit prevents executing the legacy images migration logic if the snapshotter already finds available snapshots. This mostly means the migration was already executed and the legacy images already had a chance to be converted into snapshots.
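For illustration, here is a minimal Go sketch of the guard this PR describes: skip the legacy image migration whenever the snapshotter already finds snapshots on disk. The type, method names and paths below (Snapshotter, listSnapshots, migrateLegacyImage, the snapshots directory) are hypothetical stand-ins, not the actual pkg/snapshotter/loopdevice.go code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Snapshotter is a hypothetical stand-in for the loopdevice snapshotter.
type Snapshotter struct {
	snapshotsDir string // assumed snapshots location, e.g. <state mountpoint>/.snapshots
}

// listSnapshots returns the snapshot directories already present on disk.
func (s *Snapshotter) listSnapshots() ([]string, error) {
	entries, err := os.ReadDir(s.snapshotsDir)
	if os.IsNotExist(err) {
		return nil, nil // no snapshots directory yet, nothing migrated so far
	}
	if err != nil {
		return nil, err
	}
	var ids []string
	for _, e := range entries {
		if e.IsDir() {
			ids = append(ids, e.Name())
		}
	}
	return ids, nil
}

// migrateLegacyImage stands in for the legacy .img to snapshot conversion.
func (s *Snapshotter) migrateLegacyImage(legacyImg string) error {
	fmt.Printf("migrating legacy image %s into a snapshot\n", legacyImg)
	return nil
}

// InitSnapshotter applies the guard introduced by this PR: if snapshots are
// already available, the migration most likely already ran, so it is skipped.
func (s *Snapshotter) InitSnapshotter(legacyImg string) error {
	snaps, err := s.listSnapshots()
	if err != nil {
		return err
	}
	if len(snaps) > 0 {
		fmt.Printf("found %d snapshot(s), skipping legacy image migration\n", len(snaps))
		return nil
	}
	if _, err := os.Stat(legacyImg); err == nil {
		return s.migrateLegacyImage(legacyImg)
	}
	return nil // no snapshots and no legacy image: fresh setup
}

func main() {
	// Assumed state mountpoint and legacy image path, for illustration only.
	state := "/run/initramfs/elemental-state"
	s := &Snapshotter{snapshotsDir: filepath.Join(state, ".snapshots")}
	_ = s.InitSnapshotter(filepath.Join(state, "cOS", "active.img"))
}
```

The point is only the ordering: check for existing snapshots first, and only fall back to the migration when none are found and a legacy image is present.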

@davidcassany davidcassany requested a review from a team as a code owner March 5, 2024 14:01
@codecov-commenter

codecov-commenter commented Mar 5, 2024

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 72.52%. Comparing base (b266360) to head (63df62d).

Files                          Patch %   Lines
pkg/snapshotter/loopdevice.go  0.00%     3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1990      +/-   ##
==========================================
- Coverage   72.55%   72.52%   -0.04%     
==========================================
  Files          76       76              
  Lines        8902     8906       +4     
==========================================
  Hits         6459     6459              
- Misses       1910     1913       +3     
- Partials      533      534       +1     


Contributor

@anmazzotti anmazzotti left a comment


Nice. I think this explains why I saw multiple snapshots being created at times.

@davidcassany
Contributor Author

davidcassany commented Mar 5, 2024

Right, I just manually tested this with the following sequence:

  1. upgrade from stable to dev
    1.1 manual reboot to fallback brings cluster up again
  2. upgrade from dev to this PR
    2.1 manual reboot to fallback brings cluster up again
  3. reboot to fallback and from there upgrade to this PR again
    3.1 manual reboot to fallback brings cluster up again

@anmazzotti @fgiudici about the upgrade resources lifecycle: all three upgrades (1st, 2nd and 3rd) were done with a dedicated ManagedOSImage resource without deleting the previous one, so after all three upgrades three different upgrade groups are still alive. Then:

  1. I deleted the cluster (reset was enabled) and the node got reset to stable (recovery was not upgraded)
  2. Created a new cluster with the same name as the deleted one
  3. All three ManagedOSImages got triggered and started simultaneously
  4. The 2nd upgrade resource won the download race and managed to run the upgrade; the 3rd upgrade got blocked in suc-upgrade (the script prevents concurrent executions).
  5. On reboot the 3rd upgrade got applied quickly (it was already the same version, hence it succeeded without making any change). The 1st upgrade got rebooted while its upgrade pod was initializing and it does not seem to resume after reboot; it feels like it got stuck in pod initialization...
  6. I deleted the pod stuck initializing and the system-upgrade-controller immediately recreated it and triggered the upgrade. This upgrade was fully executed because its image was pulled from a different repository in the registry (suc-upgrade only prevents downgrades if the host and the upgrade image come from the same repository in the registry).

In the end the cluster got up and ready again and the node got the three upgrade labels, but the process turned out to be a bit insane, with no guarantee of knowing what the end result would have been (the image from the 1st or from the 2nd upgrade). I bet that if it had been a multi-node cluster we would have seen heterogeneous results, some nodes on image A and some others on image B.

All in all, keeping upgrade resources around seems to be a really tricky practice; however, I can't think of a meaningful criterion for deciding when to delete an upgrade resource (think at scale). In this particular case I think it would have been far saner and more reliable to just keep updating the same upgrade resource to change the image to upgrade to, so you could adopt the rule of thumb that a single upgrade group shall exist for a given group of clusters. I do think making upgrade groups editable makes sense.

Finally, also because I was in this weird scenario, I tried to actually re-trigger the 3rd upgrade on the cluster. I could do that by deleting the related label from the node resource, but funnily enough, I could not delete the label from the Rancher UI; I had to kubectl into that cluster to do so 🤷🏽‍♂️. Deleting the label from the Rancher UI did nothing, it was not effective.

This commit prevents executing the legacy images migration logic
if snapshotter already finds available snapshots. This mostly
means the migration was already executed and legacy images
had already a chance to be converted into snapshots.

Signed-off-by: David Cassany <[email protected]>
@davidcassany davidcassany enabled auto-merge (rebase) March 5, 2024 17:37
@davidcassany davidcassany linked an issue Mar 5, 2024 that may be closed by this pull request
@fgiudici
Member

fgiudici commented Mar 5, 2024

[...]
All in all, keeping upgrade resources around seems to be a really tricky practice; however, I can't think of a meaningful criterion for deciding when to delete an upgrade resource (think at scale). In this particular case I think it would have been far saner and more reliable to just keep updating the same upgrade resource to change the image to upgrade to, so you could adopt the rule of thumb that a single upgrade group shall exist for a given group of clusters. I do think making upgrade groups editable makes sense.
[...]

Thanks for sharing all the tests, David, quite interesting!
Keeping the upgrade resources around makes things messy indeed... having the upgrade groups editable could be an option. Also having a "Done" flag in the status, as per Andrea's idea, would be handy; we should surely figure out how to give "feedback" to the upgrade process. Right now having these two improvements together sounds great to me.
In any case, we have to improve the upgrade lifecycle for sure.

@davidcassany davidcassany enabled auto-merge (rebase) March 5, 2024 22:56
Signed-off-by: David Cassany <[email protected]>
@davidcassany davidcassany merged commit c4bea71 into rancher:main Mar 6, 2024
17 of 19 checks passed
@davidcassany davidcassany deleted the fix_legacy_images_management branch March 6, 2024 09:10
@davidcassany davidcassany restored the fix_legacy_images_management branch May 15, 2024 13:23
Successfully merging this pull request may close these issues.

Root password lost after an OS upgrade