Long delay between EA received input and loaded #14337

Open
carsonip opened this issue Oct 10, 2024 · 4 comments
carsonip commented Oct 10, 2024

APM Server version (apm-server version): 8.14.3

Description of the problem including expected versus actual behavior:

In an EA-managed apm-server, we have observed delays on the order of days between the log lines "received input from elastic-agent" and "loaded input config". This implies that two servers are actually running inside the apm-server process during this long period, while the old server is stopping. Since two TBS gc goroutines running together can cause #14305, that issue is itself evidence that two servers were running at the same time. Additionally, because only one reload can happen at a time, a long reload will stall other input updates.

Edit: as explained below, there are a few parts to this problem:

  • Bound the time during which the old and new servers overlap in a hot reload situation
  • Check and fix processor code that is not concurrent-safe (e.g. monitoring metrics reporting in go-docappender, the TBS processor)

Provide logs (if relevant):

"Oct 7, 2024 @ 19:46:30.592"	"loaded input config"
"Oct 7, 2024 @ 19:46:30.577"	"tail sampler aborted"
"Oct 7, 2024 @ 19:46:30.576"	"LSM aggregator stopped"
"Oct 7, 2024 @ 19:46:30.576"	"stopping aggregator"
"Oct 7, 2024 @ 19:46:30.576"	"Server stopped"
"Sep 30, 2024 @ 23:41:27.228"	"Path /intake/v3/rum/events added to request handler"
"Sep 30, 2024 @ 23:41:27.228"	"Path /config/v1/rum/agents added to request handler"
"Sep 30, 2024 @ 23:41:27.228"	"Path /intake/v2/rum/events added to request handler"
"Sep 30, 2024 @ 23:41:27.228"	"RUM endpoints enabled!"
"Sep 30, 2024 @ 23:41:27.228"	"Starting apm-server [ceaf482859c86da0ba6de99005ab2fffae7551c6 built 2024-07-08 17:22:03 +0000 UTC]. Hit CTRL-C to stop it."
"Sep 30, 2024 @ 23:41:27.224"	"stopping apm-server... waiting maximum of 30s for queues to drain"
"Sep 30, 2024 @ 23:41:27.224"	"Listening on: [::]:8200"
"Sep 30, 2024 @ 23:41:27.224"	"Stop listening on: 0.0.0.0:8200"
"Sep 30, 2024 @ 23:41:27.222"	"received input from elastic-agent"
@carsonip carsonip added the bug label Oct 10, 2024
@carsonip carsonip self-assigned this Oct 10, 2024
carsonip commented Oct 10, 2024

I'm fairly convinced that it is due to slow httpServer and grpcServer shutdown, as I managed to reproduce something similar with an artificial delay:

diff --git a/internal/beatcmd/beat.go b/internal/beatcmd/beat.go
index 8a62e1c89..6d19f67fb 100644
--- a/internal/beatcmd/beat.go
+++ b/internal/beatcmd/beat.go
@@ -365,7 +365,7 @@ func (b *Beat) Run(ctx context.Context) error {
 		return err
 	}
 
-	if b.Manager.Enabled() {
+	if b.Manager.Enabled() || true {
 		reloader, err := NewReloader(b.Info, b.newRunner)
 		if err != nil {
 			return err
@@ -377,6 +377,37 @@ func (b *Beat) Run(ctx context.Context) error {
 			return fmt.Errorf("failed to start manager: %w", err)
 		}
 		defer b.Manager.Stop()
+
+		g.Go(func() error {
+			for {
+				in := config.MustNewConfigFrom(map[string]interface{}{
+					"apm-server": map[string]interface{}{
+						"rum.enabled": true,
+						"host":        "0.0.0.0:8200",
+						"sampling.tail": map[string]interface{}{
+							"enabled": true,
+							"policies": []map[string]interface{}{
+								{"sampling_rate": 0.1},
+							},
+							"storage_gc_interval": "2s",
+						},
+					},
+				})
+				out := config.MustNewConfigFrom(map[string]interface{}{
+					"elasticsearch": map[string]interface{}{
+						"host":     []string{"localhost:9200"},
+						"username": "admin",
+						"password": "changeme",
+					},
+				})
+				if err := reloader.reload(in, out, nil); err != nil {
+					logp.Err("reload error")
+				}
+				time.Sleep(time.Minute)
+			}
+
+			return nil
+		})
 	} else {
 		if !b.Config.Output.IsSet() {
 			return errors.New("no output defined, please define one under the output section")
diff --git a/internal/beater/server.go b/internal/beater/server.go
index 397e21e50..8bfec6c05 100644
--- a/internal/beater/server.go
+++ b/internal/beater/server.go
@@ -21,6 +21,7 @@ import (
 	"context"
 	"net"
 	"net/http"
+	"time"
 
 	"go.elastic.co/apm/module/apmgorilla/v2"
 	"go.elastic.co/apm/v2"
@@ -227,6 +228,7 @@ func (s server) run(ctx context.Context) error {
 		// See https://github.com/elastic/gmux/issues/13
 		s.httpServer.stop()
 		s.grpcServer.GracefulStop()
+		time.Sleep(5 * time.Minute)
 		return nil
 	})
 	if err := g.Wait(); err != http.ErrServerClosed {
diff --git a/x-pack/apm-server/sampling/processor.go b/x-pack/apm-server/sampling/processor.go
index 82dc2df59..48546a542 100644
--- a/x-pack/apm-server/sampling/processor.go
+++ b/x-pack/apm-server/sampling/processor.go
@@ -394,8 +394,10 @@ func (p *Processor) Run() error {
 		for {
 			select {
 			case <-p.stopping:
+				p.logger.Error("gc stopping")
 				return nil
 			case <-ticker.C:
+				p.logger.Error("gc tick")
 				const discardRatio = 0.5
 				var err error
 				for err == nil {

With this change, "gc tick" is initially logged every 2 seconds, but after the first reload the "gc tick" frequency doubles, indicating that two gc goroutines are running. This causes #14305 when one gc pass is invoked while another is still running.
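For illustration, here is a minimal, self-contained sketch (not the actual TBS processor code; the guard type and function names are hypothetical) of the kind of protection a GC loop would need to tolerate a concurrent twin: skip a pass when another goroutine is already running one.

```go
// Hypothetical sketch: only one GC pass may run at a time, even if two
// processors tick concurrently during a hot reload.
package main

import (
	"fmt"
	"sync"
	"time"
)

type gcGuard struct {
	mu sync.Mutex
}

// runOnce skips the GC pass entirely if another goroutine is already running one.
func (g *gcGuard) runOnce(gc func() error) error {
	if !g.mu.TryLock() {
		return nil // another GC pass is in flight; skip this tick
	}
	defer g.mu.Unlock()
	return gc()
}

func main() {
	guard := &gcGuard{}
	gc := func() error {
		time.Sleep(500 * time.Millisecond) // stands in for a slow value-log GC pass
		fmt.Println("gc pass ran")
		return nil
	}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two "processors" ticking at the same time
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := guard.runOnce(gc); err != nil {
				fmt.Println("gc error:", err)
			}
		}()
	}
	wg.Wait() // only one "gc pass ran" line is printed
}
```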

The reason this delay causes two apm-server processors (e.g. TBS) to run concurrently is that when a reload is triggered by the EA "received input", the server's context is canceled. The server is actually a wrapper around the underlying gmux server and the processors; call it wrappedServer. In the shutdown sequence, the gmux server shuts down first, and only after it has shut down is the processors' .Stop() called. But in the reloader, a new wrappedServer is already running while the old wrappedServer is still stopping (it has stopped listening, but its processors are still running). This is a plausible explanation of the observed logs.

TLDR: during a hot reload there is a period of time when the old and the new server run concurrently. We need to bound this overlap, and also ensure the processors can tolerate running concurrently (e.g. two TBS processors running at the same time during a hot reload).
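A standalone toy sketch of that overlap (not the reloader or server code; all names are made up): the old runner's processors keep working until its slow listener drain finishes, while the reload starts the replacement immediately, so both sets of ticks interleave for a while.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runner stands in for a wrappedServer: its "processor" keeps working until
// the slow listener shutdown has completed.
type runner struct {
	cancel context.CancelFunc
	done   chan struct{}
}

func start(name string, shutdownDelay time.Duration) *runner {
	ctx, cancel := context.WithCancel(context.Background())
	r := &runner{cancel: cancel, done: make(chan struct{})}

	// "Processor" goroutine: runs until the whole runner is done.
	go func() {
		ticker := time.NewTicker(500 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-r.done:
				return
			case <-ticker.C:
				fmt.Println(name, "processor tick")
			}
		}
	}()

	// Shutdown goroutine: processors are only stopped after the (slow)
	// listener drain, mirroring httpServer.stop / grpcServer.GracefulStop.
	go func() {
		<-ctx.Done()
		fmt.Println(name, "stop requested; draining listeners...")
		time.Sleep(shutdownDelay)
		close(r.done)
		fmt.Println(name, "fully stopped")
	}()
	return r
}

func main() {
	old := start("old", 2*time.Second)
	time.Sleep(time.Second)

	// Hot reload: cancel the old runner but do not wait for it, then start the
	// new one right away. Old and new "processor tick" lines interleave until
	// the old drain finishes.
	old.cancel()
	_ = start("new", 2*time.Second)

	<-old.done
	time.Sleep(time.Second)
}
```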

carsonip commented Oct 10, 2024

To fix this bug, a few changes are required:

  1. httpServer and grpcServer need to respect a shutdown timeout; they cannot just wait indefinitely for connections to terminate (see the sketch after this list).
  2. But even with that and e.g. a shutdown timeout of 10s, the current design still allows 2 wrapped servers to run in parallel during those 10s. This is fine for the most part, e.g. the aggregation processor (since badger runs in memory) and the bulk indexer, but it will be fatal for the TBS processor, and possibly for some libbeat metric registry code (e.g. in newFinalBatchProcessor, which may or may not be broken now, as well as TBS metrics). TBS indexing might actually be ok, but GC definitely cannot run concurrently.
  3. Alternatively, we can make the new wrappedServer wait for the old wrappedServer to shut down completely, but that may imply some downtime, especially when there is a shutdown timeout > 0. This is the safest option, though.
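For point 1, a minimal sketch of what a bounded shutdown could look like (not the actual apm-server code; it assumes a plain net/http server and a google.golang.org/grpc server rather than the gmux wiring, and the function name is made up):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"google.golang.org/grpc"
)

// stopWithTimeout asks both servers to drain gracefully, but forces them to
// close once shutdownTimeout elapses.
func stopWithTimeout(httpSrv *http.Server, grpcSrv *grpc.Server, shutdownTimeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
	defer cancel()

	// HTTP: Shutdown honours the context deadline; fall back to Close on timeout.
	if err := httpSrv.Shutdown(ctx); err != nil {
		log.Printf("http graceful shutdown did not finish: %v; forcing close", err)
		_ = httpSrv.Close()
	}

	// gRPC: GracefulStop has no deadline of its own, so race it against the context.
	done := make(chan struct{})
	go func() {
		grpcSrv.GracefulStop()
		close(done)
	}()
	select {
	case <-done:
	case <-ctx.Done():
		log.Printf("grpc graceful shutdown timed out; forcing stop")
		grpcSrv.Stop()
	}
}

func main() {
	httpSrv := &http.Server{Addr: ":8200"}
	grpcSrv := grpc.NewServer()
	// ... servers would be started and serving traffic here ...
	stopWithTimeout(httpSrv, grpcSrv, 10*time.Second)
}
```

With a bound like this, an old server could overlap with its replacement for at most the configured timeout rather than indefinitely, which addresses the days-long gap in the logs above but not the concurrency-safety concerns in point 2.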

carsonip commented

#14339 is merged, but I'm keeping this issue open, as we want to double-check that all the processors are fine with concurrent runs.

simitt commented Nov 11, 2024

@carsonip please close this issue if everything works as expected, or update it with follow-up steps.
