Rule "Host RAID array got inactive" has misleading description #395

Open
jlherren opened this issue Jan 2, 2024 · 1 comment

jlherren commented Jan 2, 2024

Rule 1.2.25. Host RAID array got inactive has a misleading description that does not match its expression:

Expression: (node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}

Description:

RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.

I had a disk causing issues in a RAID 1 array. I manually failed the disk (mdadm --fail), upon which rule 1.2.26. Host RAID disk failure correctly reported it. Then I removed the bad disk from the array (mdadm --remove), after which 1.2.26 no longer reports it, because the disk is no longer in a failed state. But rule 1.2.25 does not report anything either, since the array is still active and the server is still fully operational. The array is merely degraded, which the rule doesn't actually check for, contrary to its description.

I suppose the description should be fixed and a new rule added to detect degraded RAID arrays. I tried node_md_disks{state="active"} < node_md_disks_required, but it doesn't seem to work (I'm not very proficient in the query language).
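
Presumably that fails because PromQL matches series on all labels by default, and node_md_disks carries a state label that node_md_disks_required does not, so the two sides never pair up. A rough sketch of a degraded-array expression based on the metrics below (just my guess, not a tested rule) would drop that label with ignoring(state):

# Matches arrays that have fewer active member disks than they require,
# i.e. arrays that are still active but running degraded. ignoring(state)
# lets the two metrics match on the remaining labels (device, instance, job).
node_md_disks{state="active"} < ignoring(state) node_md_disks_required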

My metrics (md0, md1 and md2 are all RAID 1 arrays on the same two disks):
# HELP node_md_disks Number of active/failed/spare disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md0",state="active"} 1
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
node_md_disks{device="md1",state="active"} 1
node_md_disks{device="md1",state="failed"} 0
node_md_disks{device="md1",state="spare"} 0
node_md_disks{device="md2",state="active"} 1
node_md_disks{device="md2",state="failed"} 0
node_md_disks{device="md2",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
node_md_disks_required{device="md1"} 2
node_md_disks_required{device="md2"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="check"} 0
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0
node_md_state{device="md1",state="active"} 1
node_md_state{device="md1",state="check"} 0
node_md_state{device="md1",state="inactive"} 0
node_md_state{device="md1",state="recovering"} 0
node_md_state{device="md1",state="resync"} 0
node_md_state{device="md2",state="active"} 1
node_md_state{device="md2",state="check"} 0
node_md_state{device="md2",state="inactive"} 0
node_md_state{device="md2",state="recovering"} 0
node_md_state{device="md2",state="resync"} 0

guruevi commented Feb 25, 2024

Fixed in #405

cboltz pushed a commit to openSUSE/heroes-salt that referenced this issue Aug 7, 2024
- adjust rule for inactive arrays to reflect it not being about
  degraded arrays
- add rule for arrays with less active disks than expected,
  suggesting a degraded state

samber/awesome-prometheus-alerts#395

Signed-off-by: Georg Pfuetzenreuter <[email protected]>