RESOURCE_EXHAUSTED - Use algorithm to retry and increase back-off time #140
Comments
Sounds good! There appears to be a library that implements exponential backoff called retrying; however, it was last released in 2014. I would prefer to use something similar to this instead of implementing it ourselves.
We should refrain from depending on Kubernetes.
Hello and first of all thanks for having created this project. I'm starting out with Zeebe and having an already available Python client is really practical :)
Just had another thought about it: if Pyzeebe becomes async as per #143, the waiting backoff function should of course be awaitable.
https://github.com/jd/tenacity is an alternative (a fork of https://github.com/invl/retry which I've been happy with in the past) with support for both synchronous Python and async. It's a single-dependency package, rather popular and seemingly well-maintained. I think it could be a good fit for our use case - if we design it correctly, we could allow the user to pass a custom retry strategy consisting of tenacity objects.
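For illustration, here is a rough sketch of what a tenacity-based retry could look like on an async call. The coroutine name fetch_jobs is hypothetical, the import path for ZeebeBackPressure is assumed from the issue description, and the wait/stop parameters are placeholders rather than a proposed API:

from pyzeebe.exceptions import ZeebeBackPressure  # assumed import path, per the issue description
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential


# tenacity's @retry decorator also works on coroutines, so the backoff is awaitable
@retry(
    retry=retry_if_exception_type(ZeebeBackPressure),     # only retry on back-pressure
    wait=wait_random_exponential(multiplier=0.5, max=10),  # exponential backoff with jitter
    stop=stop_after_attempt(5),                            # give up after five attempts
    reraise=True,                                          # surface the last error instead of a RetryError
)
async def fetch_jobs():  # hypothetical stand-in for the ActivateJobs call
    ...

With something like this, the user-facing knob could simply be a tenacity wait/stop/retry combination passed into the worker.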
Updating this issue, because I think we need to rethink the approach here, and also because #172 will remove the watcher. I've been seeing some issues with RESOURCE_EXHAUSTED being thrown quite often. I posted the following on the Zeebe/Camunda Slack:

zell pointed me to how it's being done in C#, Node and Java:

Studying the C# example, I notice that the Pyzeebe approach is a bit different, e.g. pyzeebe/pyzeebe/grpc_internals/zeebe_job_adapter.py, lines 49 to 58 in 71689f0. I.e. if the worker receives a …
@Chadys @JonatanMartens @kbakk we are working with … and we get lots of exceptions; when we use … we don't have this problem. We also have a Java client, and in Java we rarely see this exception and it always manages to connect in the end. Can anyone point us to the problem? Example of the logs we get (we get it much more): …
I have some good news :) You can use the built-in gRPC client retry mechanism, which is described in more detail here. Here's a quick example:

import json

from pyzeebe import create_insecure_channel

retryPolicy = json.dumps(
{
"methodConfig": [
{
"name": [{"service": "gateway_protocol.Gateway"}],
"retryPolicy": {
"maxAttempts": 5,
"initialBackoff": "0.1s",
"maxBackoff": "10s",
"backoffMultiplier": 2,
"retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"],
},
}
]
}
)
grpc_channel = create_insecure_channel(hostname="localhost", port=26500, channel_options={"grpc.service_config": retryPolicy})

This will retry all requests returning "UNAVAILABLE" or "RESOURCE_EXHAUSTED". You can assign specific policies on a per-method basis as well, e.g.:

{
"methodConfig": [
{
"name": [{"service": "gateway_protocol.Gateway"}],
"retryPolicy": {
"maxAttempts": 5,
"initialBackoff": "1s",
"maxBackoff": "10s",
"backoffMultiplier": 2,
"retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"],
},
},
{
"name": [{"service": "gateway_protocol.Gateway", "method": "Topology"}],
"retryPolicy": {
"maxAttempts": 2,
"initialBackoff": "0.1s",
"maxBackoff": "5s",
"backoffMultiplier": 2,
"retryableStatusCodes": ["UNAVAILABLE"],
},
}
]
}

I've tested it with request timeouts as well. If you assign a request timeout directly on the client RPC call, that timeout spans all the retries. Meaning if I set a request timeout of 5 seconds, it will return DEADLINE_EXCEEDED before the maximum number of retry attempts is reached. When all your retries are used up, you will get the most recent error.

So this doesn't fully solve the issues with the job poller. There we would likely want to retry forever, just keep polling, and only log (perhaps throttled) warnings when things are not working (ideally configurable). For the job poller, catching the errors and retrying them is probably best; there we can do something like the Java job worker does. Hope this helps!
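To make the "retry forever with throttled warnings" idea a bit more concrete, here is a very rough sketch. This is not pyzeebe API: poll_once, warn_every and the set of retryable status codes are assumptions for illustration only.

import asyncio
import logging
import random

import grpc
from grpc import aio

logger = logging.getLogger(__name__)

RETRYABLE = {grpc.StatusCode.RESOURCE_EXHAUSTED, grpc.StatusCode.UNAVAILABLE}


async def poll_forever(poll_once, max_backoff: float = 30.0, warn_every: float = 60.0):
    """Keep polling; back off on retryable errors and throttle the warning logs."""
    backoff = 1.0
    last_warning = 0.0
    while True:
        try:
            await poll_once()  # hypothetical coroutine wrapping the ActivateJobs call
            backoff = 1.0      # reset the backoff after a successful poll
        except aio.AioRpcError as err:
            if err.code() not in RETRYABLE:
                raise          # non-retryable errors still bubble up
            now = asyncio.get_running_loop().time()
            if now - last_warning > warn_every:
                logger.warning("Polling failed with %s, backing off %.1fs", err.code(), backoff)
                last_warning = now
            await asyncio.sleep(backoff + random.uniform(0, 1))  # add a little jitter
            backoff = min(backoff * 2, max_backoff)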
As this may be useful for others, I can expand a bit on what's safe to retry or not. The following errors should never be retried: …

Not safe to retry on: …
Sometimes safe to retry on: …
Generally safe to retry on DEADLINE_EXCEEDED: …

So now we can talk about … So one could use the following retry policy:

{
"methodConfig": [
{
"name": [],
"retryPolicy": {
"maxAttempts": 5,
"initialBackoff": "0.1s",
"maxBackoff": "5s",
"backoffMultiplier": 3,
"retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
}
},
{
"name": [
{"service": "gateway_protocol.Gateway", "method": "ActivateJobs"},
{"service": "gateway_protocol.Gateway", "method": "CancelProcessInstance"},
{"service": "gateway_protocol.Gateway", "method": "EvaluateDecision"},
{"service": "gateway_protocol.Gateway", "method": "DeployResource"},
{"service": "gateway_protocol.Gateway", "method": "FailJob"},
{"service": "gateway_protocol.Gateway", "method": "ResolveIncident"},
{"service": "gateway_protocol.Gateway", "method": "ThrowError"},
{"service": "gateway_protocol.Gateway", "method": "Topology"},
{"service": "gateway_protocol.Gateway", "method": "UpdateJobRetries"},
],
"retryPolicy": {
"maxAttempts": 5,
"initialBackoff": "0.1s",
"maxBackoff": "5s",
"backoffMultiplier": 3,
"retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED"]
}
}
]
}

This will always retry … Note here that the … So you would create your channel as:

import json

from pyzeebe import create_insecure_channel

# create retry policy
retryPolicy = json.dumps(
{
"methodConfig": [
{
"name": [{"service": "gateway_protocol.Gateway"}],
"retryPolicy": {
"maxAttempts": 5,
"initialBackoff": "0.1s",
"maxBackoff": "10s",
"backoffMultiplier": 4,
"retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
}
},
{
"name": [
{"service": "gateway_protocol.Gateway", "method": "ActivateJobs"},
{"service": "gateway_protocol.Gateway", "method": "CancelProcessInstance"},
{"service": "gateway_protocol.Gateway", "method": "EvaluateDecision"},
{"service": "gateway_protocol.Gateway", "method": "DeployResource"},
{"service": "gateway_protocol.Gateway", "method": "FailJob"},
{"service": "gateway_protocol.Gateway", "method": "ResolveIncident"},
{"service": "gateway_protocol.Gateway", "method": "ThrowError"},
{"service": "gateway_protocol.Gateway", "method": "Topology"},
{"service": "gateway_protocol.Gateway", "method": "UpdateJobRetries"},
],
"retryPolicy": {
"maxAttempts": 5,
"initialBackoff": "0.1s",
"maxBackoff": "10s",
"backoffMultiplier": 5,
"retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED"]
}
}
]
}
)
# Create an insecure channel (no credentials)
grpc_channel = create_insecure_channel(hostname="localhost", port=26500, channel_options={"grpc.service_config": retryPolicy})
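For completeness, here is a sketch of how that channel might then be handed to pyzeebe. This assumes a pyzeebe version whose worker accepts a pre-built channel; the task type "example" and the task body are made up.

import asyncio

from pyzeebe import ZeebeWorker

worker = ZeebeWorker(grpc_channel)  # grpc_channel from the snippet above


@worker.task(task_type="example")  # hypothetical task type
async def example_task() -> dict:
    # UNAVAILABLE / RESOURCE_EXHAUSTED responses to the underlying gRPC calls are now
    # retried transparently by the channel before pyzeebe ever sees them.
    return {"output": "hello"}


asyncio.run(worker.work())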
Is your feature request related to a problem? Please describe.
With the current behaviour of the watcher, there's no backoff before restarting a task. According to the Zeebe docs, in case a RESOURCE_EXHAUSTED status is returned (which translates to pyzeebe.exceptions.zeebe_exceptions.ZeebeBackPressure), we should "retry with an appropriate retry policy (e.g. a combination of exponential backoff or jitter wrapped in a circuit breaker)".

Describe the solution you'd like
E.g. wait attempt * 5 + randint(1,3) sec between each attempt (see the sketch below).

Describe alternatives you've considered
If the process is being managed in e.g. Kubernetes or similar, there may be some way to instrument Kubernetes to wait n sec before restarting the process. Then we could have Kubernetes handle this. I am however not familiar with such possibilities.
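For illustration, a minimal sketch of the suggested wait formula inside a retry loop. The names run_with_backoff, poll_once and max_attempts are hypothetical, and the ZeebeBackPressure import path is assumed from the description above.

import time
from random import randint

from pyzeebe.exceptions import ZeebeBackPressure  # assumed import path


def run_with_backoff(poll_once, max_attempts: int = 5):
    """Retry poll_once using the suggested linear backoff plus a little jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return poll_once()
        except ZeebeBackPressure:
            if attempt == max_attempts:
                raise
            time.sleep(attempt * 5 + randint(1, 3))  # waits roughly 6-8s, 11-13s, 16-18s, ...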