Strong correlation between ingestion rate limit errors and p99 latency #9720

LasseHels commented Oct 23, 2024

Note: I've opened this as a bug, but I appreciate that it may as well be a misconfiguration on our end.

Describe the bug

We observe a strong correlation between rate limit errors returned by Mimir distributors and p99 latency on the write path.

As the rate of 429 responses from distributors increases, so does the p99 latency of both distributors and ingesters:
[Image: rate of distributor 429 responses alongside distributor and ingester p99 latency]

The immediate drop in rate limit responses at around 10:30 UTC on October 18th is when we increased the limit. Once we did, rate limit responses ceased and we no longer saw spikes in p99 latency.

[Images: distributor and ingester p99 latency around the limit increase]

As the latency graphs show, p99 latency doesn't gradually increase. Rather, it seems that the p99 latency for certain pods will briefly spike to 100 seconds (which we assume is a timeout value somewhere) before dropping back to normal. Zooming in on the p99 latency for a single pod shows this behaviour more clearly:
[Image: p99 latency for a single distributor pod]

In the above graph, we see that distributor-68fb644758-cj929 normally sits at ~50 ms p99 latency, but suddenly spikes to 100 seconds at around 06:30:30 UTC before immediately dropping back to its previous p99 latency of ~50 ms.

We observe that the spikes in distributor p99 latency apply across several response codes: 200, 202, 400 and 429.
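
For anyone trying to reproduce the graphs above, queries roughly along these lines should show the same correlation. This is a sketch: the api_v1_push route and the pod/status_code labels assume Mimir's standard cortex_request_duration_seconds histogram and our scrape labelling, so they may need adjusting.

      # Rate of 429 responses returned by distributors on the write path.
      sum(rate(cortex_request_duration_seconds_count{job=~".*distributor.*", route="api_v1_push", status_code="429"}[5m]))

      # p99 write latency per distributor pod; this is where the 100-second spikes show up.
      histogram_quantile(
        0.99,
        sum by (le, pod) (
          rate(cortex_request_duration_seconds_bucket{job=~".*distributor.*", route="api_v1_push"}[5m])
        )
      )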

To Reproduce

At this time, we are unsure of how to reproduce the issue. We can reproduce it somewhat inconsistently in our non-production environment by setting our ingestion limits artificially low.

Expected behavior

Our naive expectation would be that hitting ingestion rate limits does not impact p99 write latency. If anything, we would expect that 429 responses are faster than regular requests, as the write is presumably rejected early in the ingestion process.
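
If it is useful, p99 latency can also be split by status code to compare rate-limited and accepted requests directly (same assumptions about metric and label names as the sketch above). In our graphs, the 429 series spikes together with 200, 202 and 400 rather than staying consistently fast.

      histogram_quantile(
        0.99,
        sum by (le, status_code) (
          rate(cortex_request_duration_seconds_bucket{job=~".*distributor.*", route="api_v1_push"}[5m])
        )
      )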

Environment

  • Infrastructure: We run Mimir on Kubernetes version 1.29.7. Our Mimir version is 2.13.0.
  • Deployment tool: We deploy Mimir via Jsonnet.

Additional Context

For reference, here are our distributor and ingester traffic statistics:
[Image: distributor and ingester traffic statistics]

Here are our distributor args:

      -auth.multitenancy-enabled=true
      -auth.no-auth-tenant=fake
      -config.expand-env=true
      -config.file=/etc/mimir/config/mimir.yaml
      -distributor.drop-label=metrics_ha
      -distributor.ha-tracker.cluster=metrics_ha
      -distributor.ha-tracker.consul.hostname=mimir-consul-consul-server:8500
      -distributor.ha-tracker.enable=true
      -distributor.ha-tracker.enable-for-all-users=true
      -distributor.ha-tracker.etcd.endpoints=etcd-client.mimir.svc.cluster.local.:2379
      -distributor.ha-tracker.max-clusters=0
      -distributor.ha-tracker.prefix=prom_ha/
      -distributor.ha-tracker.store=consul
      -distributor.health-check-ingesters=true
      -distributor.ingestion-burst-size=100000000
      -distributor.ingestion-rate-limit=15000000
      -distributor.ring.heartbeat-period=15s
      -distributor.ring.heartbeat-timeout=4m
      -distributor.ring.prefix=
      -distributor.ring.store=memberlist
      -ingester.ring.heartbeat-period=20s
      -ingester.ring.heartbeat-timeout=1m
      -ingester.ring.prefix=
      -ingester.ring.replication-factor=3
      -ingester.ring.store=memberlist
      -ingester.ring.zone-awareness-enabled=true
      -mem-ballast-size-bytes=1073741824
      -memberlist.bind-port=7947
      -memberlist.join=gossip-ring.mimir.svc.cluster.local:7947
      -runtime-config.file=/etc/mimir/overrides.yaml
      -server.grpc.keepalive.max-connection-age=60s
      -server.grpc.keepalive.max-connection-age-grace=5m
      -server.grpc.keepalive.max-connection-idle=1m
      -server.grpc.keepalive.min-time-between-pings=10s
      -server.grpc.keepalive.ping-without-stream-allowed=true
      -server.http-listen-port=8080
      -shutdown-delay=90s
      -target=distributor
      -tenant-federation.enabled=true
      -validation.max-label-names-per-series=40

Note that -distributor.ingestion-rate-limit was set to 10000000 at the time of the issue. We have since increased it to the current value of 15000000.
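
For completeness, the per-tenant equivalent of these limits in the runtime config file referenced by -runtime-config.file looks roughly like the following (a sketch; the tenant name is a placeholder and the values mirror the flags above):

      # /etc/mimir/overrides.yaml
      overrides:
        example-tenant:                    # placeholder tenant ID
          ingestion_rate: 15000000         # samples per second
          ingestion_burst_size: 100000000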
