Improve timeout and retry mechanic of exporters #4183
Conversation
I noticed the public symbols test being red (https://github.com/open-telemetry/opentelemetry-python/actions/runs/10760943700/job/29900823921). I understand its purpose, but I guess some guidance is needed on which names are acceptable as new public names and which are not. Any advice? Regarding the changelog: is it intended that I add the changelog entry?
In that case we need a changelog for this one
This is the first change in a chain of commits to rework the retry mechanic. It is based on the work of open-telemetry#3764 and basically tries to land the changes proposed by that monolithic commit step by step. The plan is roughly to proceed in these steps:

* Extract the retry mechanic from the gRPC exporters
* Consolidate the HTTP with the gRPC exporter retry implementation
* Pipe the timeout through RetryingExporter
* Make the exporter lock protect the whole export instead of just a single iteration
* Make the timeout a float instead of an int
* Add back-off with jitter (see the sketch after this list)

It's pretty likely that the plan will change along the way.
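As a rough illustration of the last step, a back-off with "full jitter" could look like the following minimal sketch (the function name and parameters are hypothetical, not taken from this PR):

```python
import random


def backoff_with_jitter(attempt: int, base_s: float = 1.0, max_s: float = 64.0) -> float:
    """Return a randomized wait time ("full jitter") for the given retry attempt."""
    # Cap the exponential growth at max_s, then pick a uniformly random
    # delay below the cap so that many clients do not retry in lockstep.
    capped = min(max_s, base_s * (2 ** attempt))
    return random.uniform(0.0, capped)
```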
This unifies the implementation of the OTLP exporters and the HTTP log exporter. Next step is to consolidate the remaining HTTP exporters. Fixes open-telemetry#4043.
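For orientation, a minimal sketch of such a shared wrapper might look like this (`RetryingExporter` is the name used in this PR, but the signature of `export_function` and `max_attempts` are illustrative assumptions):

```python
from typing import Callable


class RetryingExporter:
    """Sketch: one retry loop shared by the gRPC and HTTP exporters."""

    def __init__(self, export_function: Callable[[object], bool], max_attempts: int = 5):
        # export_function performs a single export attempt and returns
        # True on success, False on a retryable failure.
        self._export_function = export_function
        self._max_attempts = max_attempts

    def export_with_retry(self, payload: object) -> bool:
        # Back-off between attempts is omitted here; later commits in
        # this chain add waiting, deadlines, and jitter.
        for _ in range(self._max_attempts):
            if self._export_function(payload):
                return True
        return False
```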
This unifies the implementation of the OTLP exporters and the HTTP metric exporter.
This unifies the implementation of the OTLP exporters and the HTTP trace exporter.
Propagate the given timeout value from all exporters to the RetryingExporter, which will hand it over to the individual `_export` function call. Currently the same unaltered value is handed over; in the following commits the RetryingExporter will update the timeout handed over to `_export` depending on the remaining time (see the sketch below).
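Continuing the sketch above, the pass-through could look roughly like this (signatures are illustrative; in this commit the value is not yet adjusted per attempt):

```python
from typing import Callable


class RetryingExporter:
    """Sketch: pipe the caller's timeout down to each export attempt."""

    def __init__(self, export_function: Callable[[object, float], bool]):
        self._export_function = export_function

    def export_with_retry(self, payload: object, timeout_s: float) -> bool:
        # For now the same unaltered timeout is passed on every attempt;
        # a later commit replaces it with the remaining time to a deadline.
        for _ in range(5):
            if self._export_function(payload, timeout_s):
                return True
        return False
```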
TL;DR: the time assert in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the mocked backoff was fixed at 4 seconds with 6 iterations, so the time until shutdown returned was around 24 seconds. The default timeout of the lock wait in shutdown did not hit before the retry of the exporter was over.

The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seems to have been wrong from the beginning. It just ensured that the shutdown call took at least 30 ms; I assume it was supposed to do something else, but it's not clear to me what exactly. One explanation would be that it was intended to be `self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But that would mean shutdown should take at least 30 seconds to pass, which does not make sense imho. Another option is that it was intended to be `self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That would mean shutdown should take at most 30 seconds to pass, which actually could make more sense: it would ensure that the call to `self.exporter.shutdown` returns within 30 seconds.

My biggest issue here is that it tests the happy path, where the retry is over before the timeout, while there is no mechanism to ensure that the exporter thread is stopped in a sensible time in case its current exponential backoff is already in a longer sleep while the shutdown flag is set. If I understand it correctly, it can hit the beginning of a 32 second sleep, which could mean that a process needs 30 seconds to signal the shutdown plus an additional 32 seconds to finish the exporter thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the test case is basically the same as originally. The retry interval has recently been changed to be way shorter. The default lock timeout stayed the same at 30 seconds, but that does not matter too much since the retry mechanism finishes long before that. And the aforementioned assert is still wrong, with an additional problem from open-telemetry#4014.
Previously the export lock was released by the export between retries, which gave a shutdown the opportunity to interrupt the export, depending on which thread was first to get the lock again. This behavior is not desired, as it relies heavily on timing and makes things less deterministic. The retry loop still checks whether the shutdown flag is set and will break in case it was told to do so (see the sketch below).
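A hedged sketch of the new locking behavior, assuming an `_export_lock` and a boolean `_shutdown` flag (attribute names are illustrative):

```python
import threading
from typing import Callable


class RetryingExporter:
    """Sketch: the lock now protects the whole retry loop, not one attempt."""

    def __init__(self, export_function: Callable[[object, float], bool]):
        self._export_function = export_function
        self._export_lock = threading.Lock()
        self._shutdown = False

    def export_with_retry(self, payload: object, timeout_s: float) -> bool:
        # Holding the lock across all retries prevents shutdown() from
        # racing the exporter between two attempts.
        with self._export_lock:
            for _ in range(5):
                if self._shutdown:
                    break  # told to stop; give up on this export
                if self._export_function(payload, timeout_s):
                    return True
        return False

    def shutdown(self) -> None:
        # Only sets the flag; the running retry loop notices it on its
        # next iteration.
        self._shutdown = True
```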
A few tests are still commented out and will be enabled later, once the implementation is updated to implement the desired behavior. The tests are, like other existing tests, timing sensitive, which makes them prone to fail on slow systems. It would certainly be better to get rid of this issue, but I haven't found another approach in the existing test cases, so I stuck with the current pattern.
Replace the RetryableExporter shutdown flag and sleep with a `threading.Event` to make the waiting time for the next retry interruptible. This enables us to change the semantics of the shutdown timeout to work in a less surprising way. Previously the timeout worked roughly like this:

1. Wait for the current export iteration to finish.
2. Try to get the lock faster than the export thread.
3. If the lock was acquired, the shutdown flag was set.
4. If the lock was not acquired, the shutdown flag was set.
5. The export thread would pick this flag up after finishing the current retry cycle.

This led to hanging applications on shutdown, giving the applications no way to influence the shutdown behavior. Users could not decide how much time should be spent trying to prevent the loss of data. This change allows shutdown to interrupt the current export waiting call, so that the exporter can be shut down with a user-provided deadline (see the sketch below).
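A minimal sketch of the `threading.Event`-based wait (delay and attempt count are illustrative):

```python
import threading
from typing import Callable


class RetryingExporter:
    """Sketch: interruptible back-off via threading.Event."""

    def __init__(self, export_function: Callable[[object], bool], delay_s: float = 1.0):
        self._export_function = export_function
        self._delay_s = delay_s
        self._shutdown = threading.Event()

    def export_with_retry(self, payload: object) -> bool:
        for _ in range(5):
            if self._export_function(payload):
                return True
            # Event.wait returns True as soon as shutdown() sets the event,
            # so even a long back-off no longer blocks shutdown.
            if self._shutdown.wait(self._delay_s):
                return False
        return False

    def shutdown(self) -> None:
        self._shutdown.set()  # wakes any waiter immediately
```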
Derive a deadline from the initial timeout for each export_with_retry call and try to stick to this time frame as well as possible. Each call to export_function will get the remaining time as its timeout parameter and is expected to return quickly in case of an issue, so that export_with_retry can take care of waiting and retrying. This also removes the surprising deadline triggered by the max_value of the exponential backoff.
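A sketch of the deadline handling described above, reusing the Event-based `_shutdown` from the previous sketch (names remain illustrative):

```python
import threading
import time
from typing import Callable


class RetryingExporter:
    """Sketch: derive a deadline once and pass the remaining time down."""

    def __init__(self, export_function: Callable[[object, float], bool]):
        self._export_function = export_function
        self._shutdown = threading.Event()

    def export_with_retry(self, payload: object, timeout_s: float) -> bool:
        deadline = time.monotonic() + timeout_s
        delay_s = 1.0
        while not self._shutdown.is_set():
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False  # overall deadline exhausted
            # Each attempt only gets the time that is actually left.
            if self._export_function(payload, remaining):
                return True
            # Never wait past the deadline, and wake early on shutdown.
            wait_s = min(delay_s, max(0.0, deadline - time.monotonic()))
            if self._shutdown.wait(wait_s):
                return False
            delay_s *= 2
        return False
```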
This change allows the user to call retrying exports with individual timeouts. This argument will be needed in the next step, where the gRPC metric exporter's batch processing is changed to respect the timeout for the whole export.
The timeout passed as an argument to the export method now overrides the default value set during initialization, if given. Additionally, ensure the batching in OTLPMetricExporter.export respects the export timeout.
The timeout passed as an argument to the export method now overrides the default value set during initialization, if given. This change is made to align the API of all exporters. In the metric exporter this parameter already existed but was not used; that was changed with the previous commit (2102e3b25b52).
It's desirable to have a common API across all exporters. Since the previously existing timeout_millis argument wasn't used anyway, it should not be an issue to change the default value here. The None case falls back to the object-level timeout, which is the same as the previous default argument value (see the sketch below).
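A sketch of the None fallback described here (`_timeout_s` and the class name are assumed, not the PR's actual attribute names):

```python
from typing import Optional


class OTLPExporterSketch:
    """Sketch: a per-call timeout overrides the instance default."""

    def __init__(self, timeout_s: float = 10.0):
        self._timeout_s = timeout_s

    def export(self, batch: object, timeout_millis: Optional[float] = None) -> bool:
        # None falls back to the timeout configured at construction time,
        # which matches the previous default behavior.
        timeout_s = (
            timeout_millis / 1000.0 if timeout_millis is not None else self._timeout_s
        )
        # ... hand timeout_s to the retrying export, e.g.:
        # return self._exporter.export_with_retry(batch, timeout_s)
        return True
```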
To my understanding the following new public symbols are necessary:
If I understood CONTRIBUTING.md correctly, I cannot do anything further about it, and now some maintainer / code owner needs to judge it.
Description
I took a deep look into #3764 and tried to separate the individual changes. While I think the approach of that PR is good in general, I changed some aspects of it. For example, I don't agree with the change picking the lowest of the given timeouts (see here).
Even split into multiple steps, this is still a big undertaking, so I guess a reviewer will have to take some time to get into it. Nevertheless, I hope someone will be found who feels up to it.
Fixes #4043
Type of change
How Has This Been Tested?
Several tests were added (retry logic, shutdown logic, new timeout export argument) and, of course, also executed :)
Does This PR Require a Contrib Repo Change?
Checklist: