Rule 1.2.25. Host RAID array got inactive has a misleading description that does not match its expression:
Expression:
(node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Description:
RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.
I had a disk causing issues in a RAID 1. I manually failed the disk (mdadm --fail), upon which rule 1.2.26. Host RAID disk failure correctly reported it. Then I removed the bad disk from the array (mdadm --remove), after which 1.2.26 no longer reports it, because the disk is no longer in a failed state. But rule 1.2.25 does not report anything either, since the RAID is still active and the server is still fully operational. The RAID is merely degraded, which the rule doesn't actually check for, contrary to its description.
I suppose the description should be fixed and a new rule added to detect degraded RAIDs. I tried node_md_disks{state="active"} < node_md_disks_required but it doesn't seem to work (I'm not that proficient in the query language).
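Presumably that comparison returns nothing because node_md_disks carries a state label that node_md_disks_required does not, so the two vectors never match under PromQL's default one-to-one matching. Something along these lines might work instead (untested, just a sketch; the node_uname_info join could be added the same way as in the existing rules):

node_md_disks{state="active"} < ignoring(state) node_md_disks_required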
My metrics:
md0/md1/md2 are all RAID1 on the same two disks.
# HELP node_md_disks Number of active/failed/spare disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md0",state="active"} 1
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
node_md_disks{device="md1",state="active"} 1
node_md_disks{device="md1",state="failed"} 0
node_md_disks{device="md1",state="spare"} 0
node_md_disks{device="md2",state="active"} 1
node_md_disks{device="md2",state="failed"} 0
node_md_disks{device="md2",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
node_md_disks_required{device="md1"} 2
node_md_disks_required{device="md2"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="check"} 0
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0
node_md_state{device="md1",state="active"} 1
node_md_state{device="md1",state="check"} 0
node_md_state{device="md1",state="inactive"} 0
node_md_state{device="md1",state="recovering"} 0
node_md_state{device="md1",state="resync"} 0
node_md_state{device="md2",state="active"} 1
node_md_state{device="md2",state="check"} 0
node_md_state{device="md2",state="inactive"} 0
node_md_state{device="md2",state="recovering"} 0
node_md_state{device="md2",state="resync"} 0
- adjust rule for inactive arrays to reflect it not being about degraded arrays
- add rule for arrays with less active disks than expected, suggesting a degraded state
samber/awesome-prometheus-alerts#395
Signed-off-by: Georg Pfuetzenreuter <[email protected]>
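A rough sketch of what the two rules described in that commit could look like in the collection's YAML rule format (the alert names, for durations and severities below are assumptions, not the exact rules that were merged):

- alert: HostRaidArrayInactive
  expr: (node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Host RAID array inactive (instance {{ $labels.instance }})
    description: "RAID array {{ $labels.device }} is inactive."

- alert: HostRaidArrayDegraded
  expr: (node_md_disks{state="active"} < ignoring(state) node_md_disks_required) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host RAID array degraded (instance {{ $labels.instance }})
    description: "RAID array {{ $labels.device }} has fewer active disks than required, which suggests it is degraded."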