
Polling for lvm2 objects might deliver results out of order #966

Closed
mvollmer opened this issue Apr 4, 2022 · 4 comments · Fixed by #967

@mvollmer
Contributor

mvollmer commented Apr 4, 2022

We started noticing a failure in the Cockpit CI tests for lvm2. UDisks2 would fail with

Error waiting for logical volume object for 'lvol0': Timed out waiting for object

The root cause was that multiple vgs_task_func instances were running concurrently and delivering their results out of order.

The details are like this:

The test creates a VG named "TEST1", does some operations with it, deletes it, creates a new VG named "vgroup0", and then creates a logical volume named "lvol0" in it. (All very quickly.) UDisks2 would quite often fail at this point with the timeout error mentioned above.

UDisks2 would follow along by running vgs_task_func and lvs_task_func, each instance in a new thread.

This sequence happened (a sketch of a guard against exactly this race follows the list):

  • Instance A of vgs_task_func sees that there are no volume groups after TEST1 has been deleted.
  • Instance B sees that vgroup0 exists.
  • Instance C sees that vgroup0 exists.
  • The results of B are delivered; object O1 is created for vgroup0.
  • A method is called on O1 to create "lvol0".
  • The method starts waiting for a "lvol0" object to show up on O1.
  • The results of A are delivered; O1 is destroyed.
  • The results of C are delivered; O2 is created for vgroup0.
  • lvs_task_func runs, and an object for "lvol0" is eventually created and attached to O2.
  • The method times out because O1 never sees an object for "lvol0".
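
For illustration, here is a minimal, self-contained sketch of a guard against exactly this kind of stale delivery (all names are made up; this is not the udisks code): each poll records a monotonically increasing serial when it starts, and its result is applied only if no newer poll has delivered in the meantime, so the late snapshot from instance A would be dropped instead of destroying O1.

```c
#include <glib.h>

static GMutex poll_lock;
static guint64 poll_serial;          /* bumped each time a poll starts    */
static guint64 last_applied_serial;  /* serial of the last applied result */

typedef struct {
  guint64 serial;      /* serial captured when this poll started */
  GPtrArray *vgs;      /* hypothetical snapshot of volume groups */
} PollResult;

/* Called when a poll starts: capture the current serial. */
static PollResult *
poll_start (void)
{
  PollResult *res = g_new0 (PollResult, 1);
  g_mutex_lock (&poll_lock);
  res->serial = ++poll_serial;
  g_mutex_unlock (&poll_lock);
  return res;   /* a worker thread then fills res->vgs */
}

/* Called when a poll's result comes back, possibly out of order. */
static void
poll_deliver (PollResult *res)
{
  g_mutex_lock (&poll_lock);
  if (res->serial > last_applied_serial)
    {
      last_applied_serial = res->serial;
      /* ... update the D-Bus object tree from res->vgs ... */
    }
  /* else: a newer poll already delivered; drop this stale snapshot */
  g_mutex_unlock (&poll_lock);
}
```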
@tbzatek
Copy link
Member

tbzatek commented Apr 4, 2022

Correct, I was also seeing similar issues when doing simple load tests. My old attempt to improve things was #814, but it was more or less just a workaround and I was not fully satisfied with the result. I have some more ideas for a proper solution, going down to the udisks modular design; ultimately, time and prioritization are my only problems now.

FYI the current code is more or less unchanged storaged legacy of yours :-)

@mvollmer
Contributor Author

mvollmer commented Apr 5, 2022

> My old attempt to improve things was #814, but it was more or less just a workaround and I was not fully satisfied with the result.

I'll try to produce a simple PR that restores correctness without any attempt at optimization.
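
Roughly what I have in mind (just a sketch, not necessarily what the final PR will look like): funnel all update jobs through an exclusive single-threaded GThreadPool, so polls run one at a time and deliver strictly in submission order.

```c
#include <glib.h>

typedef struct {
  gchar *reason;   /* purely illustrative payload; must be non-NULL */
} UpdateJob;

static GThreadPool *update_pool;

static void
update_func (gpointer task_data, gpointer user_data)
{
  UpdateJob *job = task_data;
  /* With max_threads = 1 and exclusive = TRUE below, jobs run strictly
   * one at a time, in the order they were pushed, so results can never
   * be delivered out of order.  Run vgs/lvs and deliver the result here. */
  g_free (job->reason);
  g_free (job);
}

static void
updates_init (void)
{
  update_pool = g_thread_pool_new (update_func, NULL,
                                   1 /* max_threads */, TRUE /* exclusive */,
                                   NULL);
}

static void
updates_trigger (const gchar *reason)
{
  UpdateJob *job = g_new0 (UpdateJob, 1);
  job->reason = g_strdup (reason);
  g_thread_pool_push (update_pool, job, NULL);
}
```

This trades the parallelism of the current code for correctness, which matches the "no attempt at optimization" goal above.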

> FYI the current code is more or less unchanged storaged legacy of yours :-)

Heh! :-) I am pretty sure it was much simpler back then, without so much asynchronicity. It was also probably much slower.

@mvollmer
Contributor Author

mvollmer commented Apr 5, 2022

> FYI the current code is more or less unchanged storaged legacy of yours :-)
>
> Heh! :-) I am pretty sure it was much simpler back then, without so much asynchronicity. It was also probably much slower.

Oh, no, I just had a look and you are right! We did spawn a process instead of a thread, but otherwise it was all the same. So I am grumpy at myself!

mvollmer added commits to mvollmer/udisks that referenced this issue Apr 5, 2022
@tbzatek
Member

tbzatek commented Apr 5, 2022

So there are more optimizations that I tried to address in #814: if an update is already running, avoid spawning a new one, wait for its result, and then compare the timestamps of the last uevent received for the particular object. This prevents running multiple lvs/vgs processes at the same time when we know we'd throw out the results anyway because a new uevent arrived in the meantime. With a particularly large setup (1000+ LVs) and a burst of uevents, I've seen updates still slowly finishing after 20 minutes - that's how many updates had been spawned. So this was a kind of rate limiting to ensure reasonable responsiveness.
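
The coalescing part of that idea can be sketched like this (a simplified illustration, not the actual #814 code): when an update is already running, remember that another one is wanted instead of spawning a new thread; a whole burst of uevents then costs at most one queued re-run.

```c
#include <glib.h>

static GMutex state_lock;
static gboolean update_running;
static gboolean update_pending;

/* Assumed to spawn the actual vgs/lvs poll in a worker thread which
 * calls update_finished() once its result has been delivered. */
static void run_update (void);

static void
request_update (void)   /* called for every incoming uevent */
{
  g_mutex_lock (&state_lock);
  if (update_running)
    update_pending = TRUE;   /* coalesce: remember, don't spawn */
  else
    {
      update_running = TRUE;
      run_update ();
    }
  g_mutex_unlock (&state_lock);
}

static void
update_finished (void)
{
  g_mutex_lock (&state_lock);
  if (update_pending)
    {
      update_pending = FALSE;
      run_update ();   /* one re-run covers the whole burst */
    }
  else
    update_running = FALSE;
  g_mutex_unlock (&state_lock);
}
```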

All of this was just a workaround anyway. The logic tells me that it's not always necessary to update the whole system picture; it is enough to look up the potentially related devices and update only those. That was my proposal for a proper solution, unimplemented at the moment.
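
As a sketch of that proposal (purely illustrative wiring; DM_VG_NAME is set by lvm2's udev rules and bd_lvm_lvs() is the libblockdev call, but everything else here is made up): derive the affected VG from the uevent and refresh only that VG instead of the whole system picture.

```c
#include <gudev/gudev.h>
#include <blockdev/lvm.h>

static void
handle_uevent (GUdevDevice *device)
{
  /* lvm2's udev rules export DM_VG_NAME on LV device-mapper nodes */
  const gchar *vg_name = g_udev_device_get_property (device, "DM_VG_NAME");
  GError *error = NULL;
  BDLVMLVdata **lvs;

  if (vg_name == NULL)
    return;   /* not an LVM device, nothing to refresh */

  /* refresh just this VG's LVs instead of the whole system picture */
  lvs = bd_lvm_lvs (vg_name, &error);
  if (lvs == NULL)
    {
      g_warning ("lvs for %s failed: %s", vg_name,
                 error ? error->message : "unknown error");
      g_clear_error (&error);
      return;
    }

  /* ... update only the objects belonging to vg_name ... */

  for (guint i = 0; lvs[i] != NULL; i++)
    bd_lvm_lvdata_free (lvs[i]);
  g_free (lvs);
}
```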

Now, with tons of work still pending on the nvme support and libblockdev-3.0 API adaptations, the udisks-2.10 release is still far away. If there's a desire to get the lvm2 module in better shape anytime soon, we might merge parts of #814 in the meantime. The uevent processing rewrite would be major work, and big chunks of the code would change anyway.
