Rule 1.2.25. Host RAID array got inactive has a misleading description that does not match its expression:
Expression:
(node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
Description:
RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.
I had a disk causing issues in a RAID 1. I manually failed the disk (mdadm --fail), upon which rule 1.2.26. Host RAID disk failure correctly reported it. Then I removed the bad disk from the array (mdadm --remove), after which 1.2.26 no longer reports it, because the disk is no longer in a failed state. But rule 1.2.25 does not report anything either, since the RAID is still active and the server is still fully operational. The RAID is merely degraded, which the rule doesn't actually check for, contrary to its description.
I suppose the description should be fixed and a new rule added to detect degraded RAIDs. I tried node_md_disks{state="active"} < node_md_disks_required but it doesn't seem to work (I'm not that proficient in the query language).
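Presumably that comparison returns nothing because node_md_disks carries a state label that node_md_disks_required does not, so the two vectors never match under PromQL's default one-to-one matching. Something along these lines might work instead (untested, just a sketch; the node_uname_info join could be added the same way as in the existing rules):

node_md_disks{state="active"} < ignoring(state) node_md_disks_required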
My metrics:
md0/md1/md2 are all RAID1 on the same two disks.
# HELP node_md_disks Number of active/failed/spare disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md0",state="active"} 1
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
node_md_disks{device="md1",state="active"} 1
node_md_disks{device="md1",state="failed"} 0
node_md_disks{device="md1",state="spare"} 0
node_md_disks{device="md2",state="active"} 1
node_md_disks{device="md2",state="failed"} 0
node_md_disks{device="md2",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
node_md_disks_required{device="md1"} 2
node_md_disks_required{device="md2"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="check"} 0
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0
node_md_state{device="md1",state="active"} 1
node_md_state{device="md1",state="check"} 0
node_md_state{device="md1",state="inactive"} 0
node_md_state{device="md1",state="recovering"} 0
node_md_state{device="md1",state="resync"} 0
node_md_state{device="md2",state="active"} 1
node_md_state{device="md2",state="check"} 0
node_md_state{device="md2",state="inactive"} 0
node_md_state{device="md2",state="recovering"} 0
node_md_state{device="md2",state="resync"} 0
- adjust rule for inactive arrays to reflect it not being about degraded arrays
- add rule for arrays with less active disks than expected, suggesting a degraded state
samber/awesome-prometheus-alerts#395
Signed-off-by: Georg Pfuetzenreuter <[email protected]>
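A rough sketch of what the two rules described in that commit could look like in the collection's YAML rule format (the alert names, for durations and severities below are assumptions, not the exact rules that were merged):

- alert: HostRaidArrayInactive
  expr: (node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Host RAID array inactive (instance {{ $labels.instance }})
    description: "RAID array {{ $labels.device }} is inactive."

- alert: HostRaidArrayDegraded
  expr: (node_md_disks{state="active"} < ignoring(state) node_md_disks_required) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host RAID array degraded (instance {{ $labels.instance }})
    description: "RAID array {{ $labels.device }} has fewer active disks than required, which suggests it is degraded."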