Add operator metric for backup failures and successes #112

JamesLaverack · 2019-12-04T16:00:11Z

Users of the operator want to monitor backup failures and successes, in particular to alert on failed backups or a lack of successful ones.

Design

A counter of backup successes and failures for EtcdBackupSchedule resources will be added to the metrics the operator exposes. The counter will be labelled with the namespace and name of the EtcdBackupSchedule resource.
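As a rough illustration of the shape of this counter, here is a minimal sketch using only the Go standard library. It is not the operator's implementation: the `BackupCounter` type, the `outcome` label with its `success` and `failure` values, and the metric name `etcd_backup_schedule_runs_total` are all hypothetical names chosen for the example. A real operator would more likely register a labelled counter with a Prometheus client library than hand-roll one.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// backupKey identifies one counter series: an EtcdBackupSchedule
// (namespace and name) plus the outcome of a run.
type backupKey struct {
	Namespace string
	Name      string
	Outcome   string // "success" or "failure" (assumed label values)
}

// BackupCounter is a hypothetical stand-in for a labelled counter metric.
type BackupCounter struct {
	mu     sync.Mutex
	counts map[backupKey]uint64
}

// NewBackupCounter returns an empty counter.
func NewBackupCounter() *BackupCounter {
	return &BackupCounter{counts: make(map[backupKey]uint64)}
}

// Observe records the result of one backup run for the given schedule.
func (c *BackupCounter) Observe(namespace, name string, succeeded bool) {
	outcome := "failure"
	if succeeded {
		outcome = "success"
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[backupKey{namespace, name, outcome}]++
}

// Value returns the current count for one series (zero if never observed).
func (c *BackupCounter) Value(namespace, name, outcome string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[backupKey{namespace, name, outcome}]
}

// Render formats every series in a Prometheus-style text form,
// one line per series, sorted so the output is stable.
func (c *BackupCounter) Render(metric string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	lines := make([]string, 0, len(c.counts))
	for k, v := range c.counts {
		lines = append(lines, fmt.Sprintf("%s{namespace=%q,name=%q,outcome=%q} %d",
			metric, k.Namespace, k.Name, k.Outcome, v))
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n")
}

func main() {
	c := NewBackupCounter()
	// Three runs of a schedule named "nightly" in namespace "prod".
	c.Observe("prod", "nightly", true)
	c.Observe("prod", "nightly", false)
	c.Observe("prod", "nightly", true)
	// The metric name here is an assumption, not the operator's.
	fmt.Println(c.Render("etcd_backup_schedule_runs_total"))
}
```

With the namespace and name as labels, an alert can fire on a rising failure count, or on a success count that has stopped rising, for one specific schedule.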

Other options

Instrumenting all backups

All backups could be counted by building the counter from EtcdBackup resources directly. However, a backup resource has no unique name to key on, only a list of endpoints, so there is no good way to identify which cluster is being backed up.

Without labels on the metric, it would be hard to tell from a dashboard or alert which etcd cluster (if there are several) is failing to back up.

Not using a metric

Alternatively, all of this information is already available in the Kubernetes API via a status field on EtcdBackup resources. However, this relies on a Kubernetes administrator deploying and configuring something like kube-state-metrics to support alerts and dashboards on this data.

adamhosier mentioned this issue Dec 4, 2019: 0.2.0 checklist #38 (Closed, 13 tasks)

cheahjs mentioned this issue Feb 12, 2020: 0.3.0 checklist #155 (Open, 5 tasks)
