We're using Remora for exporting consumer group lag to CloudWatch metrics. Thanks for open sourcing this!
The issue
Metrics are currently exported as follows:
This limits how they can be queried (for example in Grafana). When creating a single graph that shows the lag for all partitions in a certain consumer group, you have to add a query for each partition individually. This is because you can't do wildcard searches on a metric name. Grafana allows for up to 5 CloudWatch searches in a single panel, so a maximum of 5 partitions can be plotted.
It is possible to do wildcard searches on dimensions though. This way, you would be able to do a single query that displays all partition offsets regardless of the number of partitions.
Proposed solution
I propose we change how metrics are exported to CloudWatch:
Metric name: By consumer group.<Consumer group id>.<metric> where <metric> is one of 'lag', 'logend' and 'offset'
Metric dimensions:
Topic (e.g. 'MyTopic')
Partition (e.g. '2')
For internal metrics like KafkaClientActor.receiveCounter:
Metric name: Remora internals.<metric> where <metric> is the same as what it is now
Metric dimensions:
metricType (e.g. 'gauge' or 'counterCount')
This would be a breaking change, so we'd have to change the version to 2.0.0.
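To make the proposed layout concrete, here is a minimal sketch of what a single data point could look like under this scheme. It is written in Python for brevity and is not Remora code: the helper names `build_partition_datum` and `build_internal_datum` are hypothetical, and the dict shape simply follows the `MetricData` entries that the CloudWatch `PutMetricData` API accepts.

```python
from typing import Any, Dict

# Metric kinds named in the proposal.
VALID_METRICS = ("lag", "logend", "offset")


def build_partition_datum(group_id: str, topic: str, partition: int,
                          metric: str, value: float) -> Dict[str, Any]:
    """Build one per-partition datum with topic and partition as dimensions."""
    if metric not in VALID_METRICS:
        raise ValueError(f"metric must be one of {VALID_METRICS}, got {metric!r}")
    return {
        # Metric name as proposed: By consumer group.<Consumer group id>.<metric>
        "MetricName": f"By consumer group.{group_id}.{metric}",
        "Dimensions": [
            {"Name": "Topic", "Value": topic},
            # CloudWatch dimension values are strings.
            {"Name": "Partition", "Value": str(partition)},
        ],
        "Value": float(value),
    }


def build_internal_datum(metric: str, metric_type: str,
                         value: float) -> Dict[str, Any]:
    """Build one datum for an internal metric, with metricType as a dimension."""
    return {
        "MetricName": f"Remora internals.{metric}",
        "Dimensions": [{"Name": "metricType", "Value": metric_type}],
        "Value": float(value),
    }
```

A list of such dicts could then be passed as `MetricData` to a CloudWatch client's `put_metric_data` call together with a namespace. Because topic and partition are now dimensions, one search over the metric name would match every partition.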
Turns out there is a workaround for this issue. In Grafana, you can set a CloudWatch expression. The following expression plots each partition as a single line.
SEARCH(' {TheValueForCLOUDWATCH_NAME} gauge.TOPIC_NAME AND CONSUMER_GROUP_NAME.lag NOT TOPIC_NAME.CONSUMER_GROUP_NAME', 'Average', 60)
It would still be beneficial to improve the way metrics are stored as suggested in my original post though.
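If you need the workaround expression for several topics or consumer groups, it can be generated rather than typed by hand. The sketch below only fills in the placeholders of the expression above; the function name `build_search_expression` and its parameters are hypothetical, and it assumes the term layout of the expression is exactly as quoted.

```python
def build_search_expression(cloudwatch_name: str, topic: str, group: str,
                            stat: str = "Average", period: int = 60) -> str:
    """Fill in the placeholders of the SEARCH workaround expression."""
    # Doubled braces produce literal '{' and '}' around the CLOUDWATCH_NAME value.
    query = (f" {{{cloudwatch_name}}} gauge.{topic} "
             f"AND {group}.lag NOT {topic}.{group}")
    return f"SEARCH('{query}', '{stat}', {period})"


# Example with made-up values for the three placeholders.
print(build_search_expression("Remora", "MyTopic", "MyGroup"))
```

The resulting string can be pasted into the expression field of a Grafana CloudWatch query.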
I definitely see the value in the changes you suggested. However, I would suggest making this a configurable option rather than a breaking change.
Yeah, having it configurable makes sense. Keep it turned off by default for 1.x.x and possibly flip it to on by default whenever Remora upgrades to 2.0.0.
I'm not part of the team that uses Remora anymore, so I don't think I'll be able to write an MR for this. CC'ing my old teammate @wgreven, so he's in the loop for this issue.
More info on CloudWatch dimensions: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html
What do you think?