
monitor.flux: deadman alerts not generating a state change notification ('when status changes from') #3807

Closed
umaplehurst opened this issue Jun 16, 2021 · 6 comments

Comments

@umaplehurst

Related to: influxdata/influxdb#18769 -- but may not cover all possible causes for the problem(s) that people are seeing

Testing on InfluxDB Cloud, I am finding that I can reproduce the following:

a) set up a deadman check with "Schedule Every" set to 1m and "When values are not reporting for" set to 5m,
b) i) set up a notification rule, "When status changes from" OK to ANY,
b) ii) set up a notification rule, "When status changes from" ANY to OK,

A notification is successfully generated for the transition out of the deadman condition (dead host comes back to life) but not into the deadman condition (alive host becomes dead). One important precondition for triggering this bug seems to be setting the deadman check's "When values are not reporting for" threshold to a longer period (say, 5m) than the notification rule's Schedule Task Every interval (say, 1m).

I did some dissecting of bits of monitor.flux in the Query Editor. If my analysis and understanding of how everything works is right, _source_timestamp stops ticking as soon as the incoming data feed goes dead; i.e. for the alert checks that run after the host has gone dead, it keeps the value of the very last successfully received sample from the host, while _time keeps incrementing with the alert check time.

So, if the incoming data has been dead for quite some time, let's say 5m, then by the time _stateChanges is executed in monitor.flux, the last two 1m-apart check samples that it compares for the state change (e.g. OK --> CRIT when doing a deadman check) cannot be ordered, because both of those samples carry the exact same value of _source_timestamp (the input source has been dead the whole time). This means the subsequent difference/filter logic in _stateChanges does not work as expected, since that code assumes the rows are sorted in the correct direction of time flow. The result is the bug: the notification is never generated.

The simple fix here is to extend the list of columns in the sort and include _time as a second sort column after _source_timestamp. This way, the rows are still ordered by _time whenever sorting by _source_timestamp alone cannot provide an ordering. I tested this on InfluxDB Cloud by manually modifying the default Notification Rule that the UI creates, editing monitor["stateChanges"] in the UI-generated code to use my own patched _stateChanges function with the sort change described. This sorts out the problem for me, and OK --> CRIT (i.e. deadman condition) notifications are now correctly being generated onwards to Slack.
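For reference, here is a minimal sketch of what such a patched helper could look like. This is not the verbatim monitor.flux source: the function name, the level-to-number mapping, and the final filter are simplified stand-ins, and the real stateChanges also takes fromLevel/toLevel arguments. Only the sort() line is the actual change being proposed.

// Simplified stand-in for _stateChanges with the proposed sort fix (sketch only).
_patchedStateChanges = (tables=<-) =>
    tables
        // Map the status level to a number so difference() can detect a change.
        |> map(fn: (r) => ({r with level_value:
            if r._level == "crit" then 4
            else if r._level == "warn" then 3
            else if r._level == "info" then 2
            else if r._level == "ok" then 1
            else 0}))
        // The fix: add _time as a secondary sort key so that rows sharing the
        // same _source_timestamp (the last sample re-evaluated on every check
        // run after the host dies) still come out in check-execution order.
        |> sort(columns: ["_source_timestamp", "_time"], desc: false)
        |> difference(nonNegative: false, columns: ["level_value"])
        // Keep only the rows where the level actually changed.
        |> filter(fn: (r) => r.level_value != 0)

In the UI-generated notification rule task, the statuses pipeline would then call this helper in place of the monitor.stateChanges(...) call, which is essentially what the manual test described above amounts to.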

@umaplehurst
Author

@alespour Would you mind taking a look at this? Nobody else seems to have picked it up yet, and you were the last person to touch this file, introducing the change (4551f74) that I think is causing the issue.

@alespour
Contributor

alespour commented Aug 25, 2021

@umaplehurst I must admit I am completely puzzled :)

The issue with state change detection that I struggled with before is that monitor.check saves the original data point time as _source_timestamp, while _time is the time of the check execution. So I think (or thought, at least) that in a table (series):

  • there cannot be two or more statuses with the same _source_timestamp
  • all statuses have the same _time

Therefore sorting by source timestamp helped.
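For illustration (hypothetical values), such a table would come from a single check run at 10:00 that evaluated several source points:

_source_timestamp   _time      _level
09:56:00            10:00:00   ok
09:57:00            10:00:00   ok
09:58:00            10:00:00   crit

Here _time gives no ordering at all, so sorting by _source_timestamp is what puts the statuses into the right sequence.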

Assuming something like this (pseudo Flux code) is behind the deadman check:

import "experimental"
import "influxdata/influxdb/monitor"

data = from(bucket: "...")
  |> range(start: -${staleTime})
data
  |> monitor.deadman(t: experimental.subDuration(from: now(), d: ${timeSince}))
  |> monitor.check(data: check, messageFn: messageFn, {predicate})

Then, with a pipeline like this, when the incoming data goes dead there is no data -> no statuses -> the rule cannot detect any change -> no notifications, I think... I recently had to make a workaround in a custom deadman solution to handle such a situation because the provided deadman check and rule did not work.

But when you say that fixing the sorting helped you, there must be something I misunderstand or do not understand.

@umaplehurst
Author

@alespour Many thanks for looking into this; I think the more eyeballs here the better. What I'm finding is that if the incoming data goes dead, statuses are still successfully logged (with level crit) for "a little while". How long "a little while" lasts seems to depend on the "and stop checking after" setting in the deadman rule -- so, if that's set to 3h, then after 3h of a dead pipeline no more statuses get pushed. But until then, e.g. right after a failure, there is definitely a change in status from ok to crit making its way into the table; the bug is that the rule notification does not fire...

@desa Would you be able to provide some input on this? It looks like you rewrote stateChanges in 298f935, and your version had |> sort(columns: ["_time"], desc: false). This was later changed to |> sort(columns: ["_source_timestamp"], desc: false), and I think that change broke things for deadman checks. What I'm proposing is that it be changed to |> sort(columns: ["_source_timestamp", "_time"], desc: false), which sorts out the problem for me.

@alespour
Contributor

alespour commented Aug 25, 2021

@umaplehurst Oh, I see now. With certain timings, the last data point is dead-evaluated multiple times, so there are multiple statuses with the same source time but, of course, different check times. Hence the rows must also be sorted by time as the second column to give the proper sequence. I'll prepare a PR with a fix.
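For illustration (hypothetical values; 5m deadman threshold, 1m check interval), the statuses around a failure would look something like:

_source_timestamp   _time      _level
09:58:00            09:59:00   ok      (last real sample still within 5m)
09:58:00            10:03:00   ok
09:58:00            10:04:00   crit    (deadman threshold exceeded)
09:58:00            10:05:00   crit

Sorted by _source_timestamp alone, the order of these rows is undefined; with _time as a secondary sort key they stay in check-execution order, so the ok -> crit transition remains visible to the difference/filter step.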

@alespour
Contributor

alespour commented Oct 4, 2021

@umaplehurst So this should be fixed in the just-released v2.0.9.

@umaplehurst
Author

@alespour Awesome, many thanks for getting the fix merged! I'm going to close this ticket, as the code change looks good to me and I see no reason why it shouldn't resolve the problem.
