
monitor.flux: deadman alerts not generating a state change notification ('when status changes from') #3807

Closed
umaplehurst opened this issue Jun 16, 2021 · 6 comments

Comments

@umaplehurst

Related to: influxdata/influxdb#18769 -- but may not cover all possible causes for the problem(s) that people are seeing

Testing on InfluxDB Cloud, I am finding that I can reproduce the following:

a) set up a deadman check with "Schedule Every" set to 1m and "When values are not reporting for" set to 5m,
b) i) set up a notification rule, "When status changes from" OK to ANY,
b) ii) set up a notification rule, "When status changes from" ANY to OK,

A notification is successfully generated for the transition out of the deadman condition (dead host comes back to life) but not into the deadman condition (alive host becomes dead). One important precondition for triggering this bug seems to be setting the deadman check's "When values are not reporting for" threshold to a longer period (say, 5m) than the notification rule's Schedule Task Every interval (say, 1m).

I did some dissecting of bits of monitor.flux in the Query Editor. If my analysis and understanding of how everything works is right, _source_timestamp stops ticking as soon as the incoming data feed goes dead; i.e. for the alert checks that run after the host has gone dead, it keeps the value of the very last successfully received sample from the host, while _time keeps incrementing with the alert check time.

So, if the incoming data has been dead for quite some time, let's say 5m, then by the time _stateChanges is executed in monitor.flux, the last two 1m-apart check samples that it compares for the state change (e.g. OK --> CRIT when doing a deadman check) cannot be ordered, because both of those samples carry the exact same value of _source_timestamp (the input source has been dead the whole time). This means the subsequent difference/filter logic in _stateChanges does not work as expected, since that code assumes the rows are sorted in the correct direction of time flow. The result is the bug: the notification is never generated.

The simple fix here is to extend the list of columns in the sort and include _time as a second sort column after _source_timestamp. This way, the rows are still ordered by _time whenever sorting by _source_timestamp alone cannot provide an ordering. I tested this on InfluxDB Cloud by manually modifying the default Notification Rule that the UI creates, editing monitor["stateChanges"] in the UI-generated code to use my own patched _stateChanges function with the sort change described. This sorts out the problem for me, and OK --> CRIT (i.e. deadman condition) notifications are now correctly being generated onwards to Slack.
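For reference, here is a minimal sketch of what such a patched helper could look like. This is not the verbatim monitor.flux source: the function name, the level-to-number mapping, and the final filter are simplified stand-ins, and the real stateChanges also takes fromLevel/toLevel arguments. Only the sort() line is the actual change being proposed.

// Simplified stand-in for _stateChanges with the proposed sort fix (sketch only).
_patchedStateChanges = (tables=<-) =>
    tables
        // Map the status level to a number so difference() can detect a change.
        |> map(fn: (r) => ({r with level_value:
            if r._level == "crit" then 4
            else if r._level == "warn" then 3
            else if r._level == "info" then 2
            else if r._level == "ok" then 1
            else 0}))
        // The fix: add _time as a secondary sort key so that rows sharing the
        // same _source_timestamp (the last sample re-evaluated on every check
        // run after the host dies) still come out in check-execution order.
        |> sort(columns: ["_source_timestamp", "_time"], desc: false)
        |> difference(nonNegative: false, columns: ["level_value"])
        // Keep only the rows where the level actually changed.
        |> filter(fn: (r) => r.level_value != 0)

In the UI-generated notification rule task, the statuses pipeline would then call this helper in place of the monitor.stateChanges(...) call, which is essentially what the manual test described above amounts to.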

@umaplehurst
Author

@alespour Would you mind taking a look at this? Nobody else seems to have picked it up yet, and you were the last person to touch this file, introducing the change (4551f74) that I think is causing the issue.

@alespour
Contributor

alespour commented Aug 25, 2021

@umaplehurst I must admit I am completely puzzled :)

The issue with state change detection that I struggled with before is that monitor.check saves the original data point time as _source_timestamp, while _time is the time of the check execution. So I think (or thought, at least) that in a table (series):

  • there cannot be two or more statuses with the same _source_timestamp
  • all statuses have the same _time

Therefore sorting by source timestamp helped.
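For illustration (hypothetical values), such a table would come from a single check run at 10:00 that evaluated several source points:

_source_timestamp   _time      _level
09:56:00            10:00:00   ok
09:57:00            10:00:00   ok
09:58:00            10:00:00   crit

Here _time gives no ordering at all, so sorting by _source_timestamp is what puts the statuses into the right sequence.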

Assuming something like this (pseudo Flux code) is behind the deadman check:

import "experimental"
import "influxdata/influxdb/monitor"

data = from(bucket: "...")
  |> range(start: -${staleTime})
data
  |> monitor.deadman(t: experimental.subDuration(from: now(), d: ${timeSince}))
  |> monitor.check(data: check, messageFn: messageFn, {predicate})

Then, with a pipeline like this, when the incoming data goes dead there is no data -> no statuses -> the rule cannot detect any change -> no notifications, I think... I recently had to make a workaround in a custom deadman solution to handle such a situation because the provided deadman check and rule did not work.

But when you say that fixing the sorting helped you, there must be something I misunderstand or do not understand.

@umaplehurst
Author

@alespour Many thanks for looking into this; I think the more eyeballs here the better. What I'm finding is that if the incoming data goes dead, statuses are still successfully logged (with level crit) for "a little while". How long "a little while" lasts seems to depend on the "and stop checking after" setting in the deadman rule -- so, if that's set to 3h, then after 3h of a dead pipeline no more statuses get pushed. But until then, e.g. right after a failure, there is definitely a change in status from ok to crit making its way into the table; the bug is that the rule notification does not fire...

@desa Would you be able to provide some input on this? It looks like you rewrote stateChanges in 298f935, and your version had |> sort(columns: ["_time"], desc: false). This was later changed to |> sort(columns: ["_source_timestamp"], desc: false), and I think that change broke things for deadman checks. What I'm proposing is that it be changed to |> sort(columns: ["_source_timestamp", "_time"], desc: false), which sorts out the problem for me.

@alespour
Contributor

alespour commented Aug 25, 2021

@umaplehurst Oh, I see now. With certain timings, the last data point is dead-evaluated multiple times, so there are multiple statuses with the same source time but, of course, different check times. Hence the rows must also be sorted by time as the second column to give the proper sequence. I'll prepare a PR with a fix.
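For illustration (hypothetical values; 5m deadman threshold, 1m check interval), the statuses around a failure would look something like:

_source_timestamp   _time      _level
09:58:00            09:59:00   ok      (last real sample still within 5m)
09:58:00            10:03:00   ok
09:58:00            10:04:00   crit    (deadman threshold exceeded)
09:58:00            10:05:00   crit

Sorted by _source_timestamp alone, the order of these rows is undefined; with _time as a secondary sort key they stay in check-execution order, so the ok -> crit transition remains visible to the difference/filter step.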

@alespour
Contributor

alespour commented Oct 4, 2021

@umaplehurst So this should be fixed in the just-released v2.0.9.

@umaplehurst
Author

@alespour Awesome, many thanks for getting the fix merged! I'm going to close this ticket, as the code change looks good to me and I see no reason why it shouldn't resolve the problem.
