Graceful handling of data from Redis that fails to be parsed #607

julianbrost · 2023-06-21T12:30:04Z

At the moment, Icinga DB completely relies on Icinga 2 that there is no data present in Redis it can't process. If this assumption is violated, Icinga DB simply crashes at the moment. There already were a few instances where such bugs were reported, for example:

Icinga DB feature should normalize *command.arguments[*].{required,skip_key,repeat_key}` to boolean icinga2#9576
icingadb service crashes on negative downtime duration or end before start icinga2#9774
IcingaDB conversion of strings to number types, to avoid crashes icinga2#9791

Not writing data that the receiver does not understand at all is a good idea, but in order to make Icinga DB more robust, we should implement more graceful/fine-grained error handling in Icinga DB as well.

Task

Go through all code that reads from Redis and the errors that can occur, especially where data is parsed. Check if there is a more graceful way of handling the error than just bailing out. For example, for the object sync, an individual object that fails to be parsed could just be skipped. Summarize your findings in a comment to this issue and propose changes where and how things could be handled better.

Al2Klimov · 2023-07-11T09:29:44Z

propose changes where and how things could be handled better.

a health check bit in the DB telling the user to look into logs
a new sublogger (grep-able 👍) triggering the latter, probably even named health, for the following
(overdue: nothing to do here)
config+runtime: can't decode? – ignore it. (I guess. Object( update)s would disappear.)
history: here we can't do the above (ex. for state+notification) as an end event upserts its start event. But here we have History(Table)Meta we can fall back to. This gives us the object identification we can blacklist in memory for future successfully decoded events. However, ignored stuff is kept in Redis!

OK?

Al2Klimov · 2023-07-11T10:06:58Z

Almost forgot: to be honest, I'd prefer not to do anything at all here. As Eric said, our main problem is string/uint discrepancy which we fixed in 2.14/2.13.8, not individual values being provocatively out of range. E.g. if you've set you check timeout to 2^64, well, please set it back to an actual value. Or, if you've got an actual problem, please report it.

A crash is a good motivator for the latter and a reasonable one if(!) 2.14/2.13.8 is enough which it is IMAO. In contrast, a health check is easy to ignore.

julianbrost · 2023-07-11T11:44:34Z

a health check bit in the DB telling the user to look into logs

a new sublogger (grep-able +1) triggering the latter, probably even named health, for the following

These two points are for #608.

(overdue: nothing to do here)

Why? I mean this probably only takes very few inputs so likely not our biggest concern.

config+runtime: can't decode? – ignore it. (I guess. Object( update)s would disappear.)

For the initial sync (i.e. the one computing the delta first), would this leave the object as-is in the database or delete it (that might happen if you treat it like it wouldn't exist in Redis)?

history: here we can't do the above (ex. for state+notification) as an end event upserts its start event. But here we have History(Table)Meta we can fall back to. This gives us the object identification we can blacklist in memory for future successfully decoded events. However, ignored stuff is kept in Redis!

And if even that part wouldn't be decodable, it would stop the sync for the entire type? Or discard it and go on as there probably isn't hope in fixing it either?

Almost forgot: to be honest, I'd prefer not to do anything at all here. As Eric said, our main problem is string/uint discrepancy which we fixed in 2.14/2.13.8, not individual values being provocatively out of range. E.g. if you've set you check timeout to 2^64, well, please set it back to an actual value. Or, if you've got an actual problem, please report it.

I stick to "if it crashes, it's a bug", we can argue about the priority. It just doesn't look good if you can (accidentally or on purpose) input a value and this knocks out significant parts of your monitoring.

A crash is a good motivator for the latter and a reasonable one if(!) 2.14/2.13.8 is enough which it is IMAO. In contrast, a health check is easy to ignore.

What you call a good motivator potentially stops people from working with Icinga Web. If anything, that should be an argument for not having a "hide warning entirely" button.

Al2Klimov · 2023-07-11T12:29:24Z

(overdue: nothing to do here)

Why? I mean this probably only takes very few inputs so likely not our biggest concern.

It even is not a concern at all. It doesn't suffer from such problems at all. By design. So there's nothing to do here.

config+runtime: can't decode? – ignore it. (I guess. Object( update)s would disappear.)

For the initial sync (i.e. the one computing the delta first), would this leave the object as-is in the database or delete it (that might happen if you treat it like it wouldn't exist in Redis)?

I'd do the latter.

more motivating to fix the error
less confusing than it's there, but not up to date

history: here we can't do the above (ex. for state+notification) as an end event upserts its start event. But here we have History(Table)Meta we can fall back to. This gives us the object identification we can blacklist in memory for future successfully decoded events. However, ignored stuff is kept in Redis!

And if even that part wouldn't be decodable, it would stop the sync for the entire type? Or discard it and go on as there probably isn't hope in fixing it either?

Oh. 😅 I'd stop the sync for the entire type. This -if occurs- is a serious problem.

If anything, that should be an argument for not having a "hide warning entirely" button.

@nilmerg Can we do this?

julianbrost added area/configsync area/statesync area/history area/runtime labels Jun 21, 2023

This was referenced Jun 21, 2023

Report non-fatal errors/warning via the database for use in the Icinga Web health checks #608

Open

IcingaDB::PrepareObject(): round Notification#interval and limit it to >=0 Icinga/icinga2#9793

Merged

Al2Klimov self-assigned this Jul 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Graceful handling of data from Redis that fails to be parsed #607

Graceful handling of data from Redis that fails to be parsed #607

julianbrost commented Jun 21, 2023

Al2Klimov commented Jul 11, 2023

Al2Klimov commented Jul 11, 2023

julianbrost commented Jul 11, 2023

Al2Klimov commented Jul 11, 2023

Graceful handling of data from Redis that fails to be parsed #607

Graceful handling of data from Redis that fails to be parsed #607

Comments

julianbrost commented Jun 21, 2023

Task

Al2Klimov commented Jul 11, 2023

Al2Klimov commented Jul 11, 2023

julianbrost commented Jul 11, 2023

Al2Klimov commented Jul 11, 2023