Long delay between EA received input and loaded #14337
Comments
I'm fairly convinced that it is due to slow httpServer and grpcServer shutdown, as I managed to reproduce something similar with an artificial delay:
With this change, "gc tick" is initially logged every 2 seconds, but after the first reload the "gc tick" frequency doubles, indicating that 2 gc goroutines are running. This causes #14305 when gc is called while another gc is already running. The reason this delay causes 2 apm-server processors (e.g. TBS) to run concurrently is that, when a reload is triggered by the EA "received input", the server's context is canceled. The server is actually a wrapper over the actual gmux server and the processors; let's call it wrappedServer. In the shutdown sequence, the gmux server shuts down first, and only after the gmux server has shut down do the processors stop.

TLDR: in a hot reload, there is a period of time during which the old server and the new server run concurrently. We need to limit this reload time, and also ensure that the processors are fine with running concurrently (e.g. having 2 TBS processors running at the same time during a hot reload).
Fixing this bug requires a few changes.
#14339 is merged, but I am keeping this issue open, as we want to double-check that all the processors are fine with concurrent runs.
@carsonip please close this task if everything works as expected, or update it with follow-up steps.
APM Server version (apm-server version): 8.14.3

Description of the problem including expected versus actual behavior:
In EA-managed apm-server, the delay between the logs "received input from elastic-agent" and "loaded input config" has been observed to be on the order of days. This implies that 2 servers are actually running in the apm-server process during this long period while the old server is stopping. Since 2 concurrent TBS gc goroutines may cause #14305, #14305 is itself further evidence that 2 servers are running at the same time. Additionally, as only 1 reload can happen at a time, a long reload will stall other input updates.
Edit: as explained below, there are a few parts to this problem:
Provide logs (if relevant):