Note: I've opened this as a bug, but I appreciate that it may well be a misconfiguration on our end.
Describe the bug
We observe a strong correlation between rate limit errors returned by Mimir distributors and p99 latency on the write path.
As the rate of 429 responses from distributors increases, so does the p99 latency of both distributors and ingesters:
The immediate drop in rate limit responses at around 10:30 UTC on October 18th was when we increased the limit. When we did this, rate limit responses ceased and we no longer saw spikes in p99 latency.
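The graphs themselves aren't reproduced here, but queries along the following lines show the correlation. This is a sketch only: the cortex_request_duration_seconds metric and the route/status_code label names are assumptions based on Mimir's standard instrumentation, and the job selector will differ per setup.

```
# Rate of 429 responses returned by distributors on the write path (assumed route label).
sum(rate(cortex_request_duration_seconds_count{job=~".*distributor.*", route=~"api_(v1|prom)_push", status_code="429"}[5m]))

# p99 write-path latency across distributors.
histogram_quantile(0.99,
  sum by (le) (rate(cortex_request_duration_seconds_bucket{job=~".*distributor.*", route=~"api_(v1|prom)_push"}[5m]))
)
```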
As the latency graphs show, p99 latency doesn't gradually increase. Rather, it seems that the p99 latency for certain pods will briefly spike to 100 seconds (which we assume is a timeout value somewhere) before dropping back to normal. Zooming in on the p99 latency for a single pod shows this behaviour more clearly:
In the above graph, we see that distributor-68fb644758-cj929 normally sits at ~50 ms p99 latency, but then suddenly spikes to 100 seconds at around 06:30:30 UTC. It then immediately drops back down to its previous p99 latency of ~50 ms.
We observe that the spikes in distributor p99 latency apply across several response codes: 200, 202, 400 and 429.
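For reference, the per-pod breakdown above comes from a query of roughly this shape (again a sketch; the pod and status_code label names are assumptions and may need adjusting to your labelling):

```
# p99 latency per distributor pod, split by response code.
histogram_quantile(0.99,
  sum by (pod, status_code, le) (
    rate(cortex_request_duration_seconds_bucket{job=~".*distributor.*", route=~"api_(v1|prom)_push"}[1m])
  )
)
```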
To Reproduce
At this time, we are unsure how to reproduce the issue reliably. We can somewhat inconsistently reproduce it in our non-prod environment by setting our ingestion limits artificially low.
Expected behavior
Our naive expectation would be that hitting ingestion rate limits does not impact p99 write latency. If anything, we would expect that 429 responses are faster than regular requests, as the write is presumably rejected early in the ingestion process.
Environment
Infrastructure: We run Mimir on Kubernetes version 1.29.7. Our Mimir version is 2.13.0.
Deployment tool: We deploy Mimir via Jsonnet.
Additional Context
For reference, here are our distributor and ingester traffic statistics:
Here are our distributor args:
Note that distributor.ingestion-rate-limit was set to 10000000 at the time of the issue. We have since increased it to the current value of 15000000.