Strong correlation between ingestion rate limit errors and p99 latency #9720

LasseHels commented Oct 23, 2024

Note: I've opened this as a bug, but I appreciate that it may as well be a misconfiguration on our end.

Describe the bug

We observe a strong correlation between rate limit errors returned by Mimir distributors and p99 latency on the write path.

As the rate of 429 responses from distributors increases, so does the p99 latency of both distributors and ingesters:
[Image: rate of distributor 429 responses alongside distributor and ingester p99 latency]

The immediate drop in rate limit responses at around 10:30 UTC on October 18th is when we increased the limit. Once we did, rate limit responses ceased and we no longer saw spikes in p99 latency.

[Images: distributor and ingester p99 latency around the limit increase]

As the latency graphs show, p99 latency doesn't gradually increase. Rather, it seems that the p99 latency for certain pods will briefly spike to 100 seconds (which we assume is a timeout value somewhere) before dropping back to normal. Zooming in on the p99 latency for a single pod shows this behaviour more clearly:
[Image: p99 latency for a single distributor pod]

In the above graph, we see that distributor-68fb644758-cj929 normally sits at ~50 ms p99 latency, but suddenly spikes to 100 seconds at around 06:30:30 UTC before immediately dropping back to its previous p99 latency of ~50 ms.

We observe that the spikes in distributor p99 latency apply across several response codes: 200, 202, 400 and 429.
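
For anyone trying to reproduce the graphs above, queries roughly along these lines should show the same correlation. This is a sketch: the api_v1_push route and the pod/status_code labels assume Mimir's standard cortex_request_duration_seconds histogram and our scrape labelling, so they may need adjusting.

      # Rate of 429 responses returned by distributors on the write path.
      sum(rate(cortex_request_duration_seconds_count{job=~".*distributor.*", route="api_v1_push", status_code="429"}[5m]))

      # p99 write latency per distributor pod; this is where the 100-second spikes show up.
      histogram_quantile(
        0.99,
        sum by (le, pod) (
          rate(cortex_request_duration_seconds_bucket{job=~".*distributor.*", route="api_v1_push"}[5m])
        )
      )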

To Reproduce

At this time, we are unsure of how to reproduce the issue. We can reproduce it somewhat inconsistently in our non-production environment by setting our ingestion limits artificially low.

Expected behavior

Our naive expectation would be that hitting ingestion rate limits does not impact p99 write latency. If anything, we would expect that 429 responses are faster than regular requests, as the write is presumably rejected early in the ingestion process.
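
If it is useful, p99 latency can also be split by status code to compare rate-limited and accepted requests directly (same assumptions about metric and label names as the sketch above). In our graphs, the 429 series spikes together with 200, 202 and 400 rather than staying consistently fast.

      histogram_quantile(
        0.99,
        sum by (le, status_code) (
          rate(cortex_request_duration_seconds_bucket{job=~".*distributor.*", route="api_v1_push"}[5m])
        )
      )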

Environment

  • Infrastructure: We run Mimir on Kubernetes version 1.29.7. Our Mimir version is 2.13.0.
  • Deployment tool: We deploy Mimir via Jsonnet.

Additional Context

For reference, here are our distributor and ingester traffic statistics:
[Image: distributor and ingester traffic statistics]

Here are our distributor args:

      -auth.multitenancy-enabled=true
      -auth.no-auth-tenant=fake
      -config.expand-env=true
      -config.file=/etc/mimir/config/mimir.yaml
      -distributor.drop-label=metrics_ha
      -distributor.ha-tracker.cluster=metrics_ha
      -distributor.ha-tracker.consul.hostname=mimir-consul-consul-server:8500
      -distributor.ha-tracker.enable=true
      -distributor.ha-tracker.enable-for-all-users=true
      -distributor.ha-tracker.etcd.endpoints=etcd-client.mimir.svc.cluster.local.:2379
      -distributor.ha-tracker.max-clusters=0
      -distributor.ha-tracker.prefix=prom_ha/
      -distributor.ha-tracker.store=consul
      -distributor.health-check-ingesters=true
      -distributor.ingestion-burst-size=100000000
      -distributor.ingestion-rate-limit=15000000
      -distributor.ring.heartbeat-period=15s
      -distributor.ring.heartbeat-timeout=4m
      -distributor.ring.prefix=
      -distributor.ring.store=memberlist
      -ingester.ring.heartbeat-period=20s
      -ingester.ring.heartbeat-timeout=1m
      -ingester.ring.prefix=
      -ingester.ring.replication-factor=3
      -ingester.ring.store=memberlist
      -ingester.ring.zone-awareness-enabled=true
      -mem-ballast-size-bytes=1073741824
      -memberlist.bind-port=7947
      -memberlist.join=gossip-ring.mimir.svc.cluster.local:7947
      -runtime-config.file=/etc/mimir/overrides.yaml
      -server.grpc.keepalive.max-connection-age=60s
      -server.grpc.keepalive.max-connection-age-grace=5m
      -server.grpc.keepalive.max-connection-idle=1m
      -server.grpc.keepalive.min-time-between-pings=10s
      -server.grpc.keepalive.ping-without-stream-allowed=true
      -server.http-listen-port=8080
      -shutdown-delay=90s
      -target=distributor
      -tenant-federation.enabled=true
      -validation.max-label-names-per-series=40

Note that -distributor.ingestion-rate-limit was set to 10000000 at the time of the issue. We have since increased it to the current value of 15000000.
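
For completeness, the per-tenant equivalent of these limits in the runtime config file referenced by -runtime-config.file looks roughly like the following (a sketch; the tenant name is a placeholder and the values mirror the flags above):

      # /etc/mimir/overrides.yaml
      overrides:
        example-tenant:                    # placeholder tenant ID
          ingestion_rate: 15000000         # samples per second
          ingestion_burst_size: 100000000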
