What happened?
It was detected that the resource-usage-tracker service did not hold its lock inside Redis and was reporting the following error: redis.exceptions.LockNotOwnedError: Cannot reacquire a lock that's no longer owned.
What triggered this situation?
The cause is currently unknown.
What could trigger this situation?
The lock is removed manually or somehow expires.
The Redis database becomes unavailable for a short period of time.
Why is this bad?
Imagine having multiple instances relying on this lock: if it is somehow freed (and lock_context does nothing about it), a different instance will acquire it and use the resource protected by the lock as if it were free.
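As a minimal illustration (assuming redis-py >= 5 with a local Redis instance; the key name is made up), a second client can acquire the "same" lock as soon as its TTL expires without being extended:

```python
import asyncio

from redis.asyncio import Redis


async def main() -> None:
    client = Redis()

    # two "instances" competing for the same protected resource
    lock_a = client.lock("protected-resource", timeout=1)
    lock_b = client.lock("protected-resource", timeout=1)

    assert await lock_a.acquire(blocking=False)  # instance A owns the lock

    await asyncio.sleep(2)  # the 1s TTL expires because nobody extended it

    # instance B can now acquire the lock while A still believes it holds it:
    # both will use the protected resource concurrently
    assert await lock_b.acquire(blocking=False)

    await lock_b.release()
    await client.aclose()


asyncio.run(main())
```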
How does it present itself?
Consider the following code:
```python
async with lock_context(...):
    # some user defined code here
```
Two situations combine to produce the issue (see the sketch after this list):
lock_context creates a task which raises redis.exceptions.LockNotOwnedError but only logs the issue without handling it.
lock_context (context manager) has no way of stopping the execution of the code defined by the user.
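To make these two points concrete, here is a minimal sketch of how such a context manager can end up with both problems. It assumes redis-py's asyncio client; the structure and names are illustrative, not the actual servicelib implementation:

```python
import asyncio
import logging
from contextlib import asynccontextmanager, suppress

from redis.asyncio import Redis
from redis.exceptions import LockNotOwnedError

_logger = logging.getLogger(__name__)
LOCK_TIMEOUT_S = 10


@asynccontextmanager
async def lock_context(redis_client: Redis, lock_name: str):
    lock = redis_client.lock(lock_name, timeout=LOCK_TIMEOUT_S)
    await lock.acquire()

    async def _auto_extend() -> None:
        # keeps pushing the TTL forward while the user code runs
        while True:
            await asyncio.sleep(LOCK_TIMEOUT_S / 2)
            try:
                await lock.reacquire()
            except LockNotOwnedError:
                # situation 1: the error is only logged, nothing reacts to it
                _logger.exception("lock '%s' is no longer owned", lock_name)

    auto_extend_task = asyncio.create_task(_auto_extend())
    try:
        # situation 2: whatever runs in the `async with` body cannot be
        # interrupted from here if the lock is lost
        yield
    finally:
        auto_extend_task.cancel()
        with suppress(asyncio.CancelledError):
            await auto_extend_task
        with suppress(LockNotOwnedError):
            await lock.release()
```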
Is there a way to reproduce it in a test?
Yes, apply the following changes in your repo: changes.diff.zip
Then run one of the following tests to make the issue appear:
tests/test_redis_utils.py::test_possible_regression_lock_extension_fails_if_key_is_missing
tests/test_redis_utils.py::test_possible_regression_first_extension_delayed_and_expires_key
What was already tried?
To mitigate the issue, it was proposed to try to stop the user-defined code.
The only viable option for a context manager is using signals (which are only available on Linux).
When running the test tests/test_redis_utils.py::test_context_manager_timing_out, the code hangs unexpectedly inside the signal's handler function, which means the entire asyncio runtime is halted.
This approach cannot be used.
Running tests/test_redis_utils.py::test_context_manager_timing_out produces a very unexpected result:
```
def handler(signum, frame):
>   raise RuntimeError(f"Operation timed out after {after} seconds")
E   RuntimeError: Operation timed out after 10 seconds

tests/test_redis_utils.py:190: RuntimeError
> /home/silenthk/work/pr-osparc-redis-lock-issues/packages/service-library/tests/test_redis_utils.py(190)handler()
-> raise RuntimeError(f"Operation timed out after {after} seconds")
```
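For reference, this is roughly the shape of the signal-based timeout that was tried (a hypothetical sketch, not the actual test code). A SIGALRM handler raises into whatever frame happens to be executing when the alarm fires, which inside an asyncio application can be the event loop itself; that matches the observed hang:

```python
import signal
from contextlib import contextmanager


@contextmanager
def fail_after(after: int):
    # Linux-only: SIGALRM fires `after` seconds from now and the handler
    # raises inside whatever code is currently running
    def handler(signum, frame):
        raise RuntimeError(f"Operation timed out after {after} seconds")

    previous_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(after)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous_handler)
```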
Different proposals
Feel free to suggest more below.
@GitHK: I would do the following: the locking mechanism should receive a task (which runs the user-defined code) that it can cancel if something goes wrong. This solution no longer uses a context manager, but uses a battle-tested method of stopping the running user-defined code.
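A possible sketch of that idea, with illustrative names and assuming redis-py's asyncio client (not a committed implementation):

```python
import asyncio
from collections.abc import Coroutine
from contextlib import suppress
from typing import Any, TypeVar

from redis.asyncio import Redis
from redis.exceptions import LockNotOwnedError

T = TypeVar("T")
LOCK_TIMEOUT_S = 10


async def run_with_lock(
    redis_client: Redis, lock_name: str, work: Coroutine[Any, Any, T]
) -> T:
    """Run `work` as a task and cancel it if the lock cannot be kept alive."""
    lock = redis_client.lock(lock_name, timeout=LOCK_TIMEOUT_S)
    await lock.acquire()
    work_task: asyncio.Task[T] = asyncio.create_task(work)
    try:
        while True:
            done, _ = await asyncio.wait({work_task}, timeout=LOCK_TIMEOUT_S / 2)
            if done:
                return work_task.result()
            try:
                await lock.reacquire()  # push the TTL forward
            except LockNotOwnedError:
                # the lock was lost: stop the user-defined code instead of
                # letting it keep running as if it still owned the resource
                work_task.cancel()
                with suppress(asyncio.CancelledError):
                    await work_task
                raise
    finally:
        with suppress(LockNotOwnedError):
            await lock.release()
```

Cancellation reaches the user-defined code as a regular asyncio.CancelledError, which is the standard, well-supported way of stopping a running coroutine.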
During the same period, a similar issue was observed in storage: the same Cannot reacquire a lock error was logged. The storage background task was running properly, and the lock key was present in the Redis lock DB table. (This is strange; the Resource Tracker issue differed because the key was missing there. However, why this error was logged in storage is unknown.)
OpenTelemetry tracing was introduced recently; could this be causing some blocking behavior?
I would suggest not rushing but instead taking two steps:
Analyze if there might be any blocking issues that could have caused this.
To make the platform more robust, implement a mechanism to handle this situation.
Easiest solution: consider marking the application as unhealthy (when it restarts, it should gracefully resolve the issue by itself); a possible shape of this is sketched after this list.
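As a rough illustration of that easiest solution (hypothetical names, not actual osparc-simcore code), the lock keep-alive loop could flip a flag that the service's health endpoint exposes, so the orchestrator restarts the instance instead of letting it keep running without the lock:

```python
import asyncio
import logging

from redis.exceptions import LockNotOwnedError

_logger = logging.getLogger(__name__)


class AppHealth:
    """Shared state checked by the service's health endpoint."""

    def __init__(self) -> None:
        self.healthy: bool = True


async def keep_lock_alive(lock, health: AppHealth, interval_s: float) -> None:
    while health.healthy:
        await asyncio.sleep(interval_s)
        try:
            await lock.reacquire()
        except LockNotOwnedError:
            _logger.exception("lost ownership of lock '%s'", lock.name)
            # mark the application as unhealthy: the health check starts
            # failing and the platform restarts this instance, which then
            # re-acquires the lock from a clean state
            health.healthy = False
```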
@matusdrobuliak66 can we please make a different issue out of this one? It is something different and I don't want to mix them.
Yes, no problem. I just wanted to make a note of it because, since we don't fully understand the issue yet, we can't be certain that the two are not somehow interconnected.