-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
monitor.flux: deadman alerts not generating a state change notification ('when status changes from') #3807
Comments
@umaplehurst I must admit I am completely puzzled :) The issue with state change detection I struggled before is that
Therefore sorting by source timestamp helped. Assuming something like this (pseudo flux code) is behind deadman check
Then, with a pipeline like this when incoming data goes dead, there are no data -> no statuses -> rule cannot detect any change -> no notifications, I think... I recently had to make a workaround in custom deadman solution to handle such situation because provided deadman check and rule did not work. But when you say that fixed sorting helped you, there must be something I misunderstand or do not understand. |
@alespour Many thanks for looking into this, I think the more eyeballs here the better. What I'm finding is that if the incoming data goes dead, there are still statuses being successfully logged (with level @desa Would you be able to provide some input on this? Looks like you rewrote |
@umaplehurst Oh, I see now. With certain timings, the last data point is dead-evaluated multiple times, so there are multiple statuses with the same source time but of course different time. Hence it must be also sorted by time as the second column to give proper sequence. I'll prepare a PR with fix. |
@umaplehurst So this should be fixed in just released v2.0.9. |
@alespour Awesome, many thanks for getting the fix merged! I'm going to close this ticket as the code change looks good to me, I see no reason why it shouldn't resolve the problem. |
Related to: influxdata/influxdb#18769 -- but may not cover all possible causes for the problem(s) that people are seeing
Testing on InfluxDB Cloud I am finding that I can reproduce following:
a) set up a deadman check, with "Schedule Every" set to 1m, and "When values are not reporting for" set to 5m,
b) i) set up a notification rule, "When status changes from"
OK
toANY
,b) ii) set up a notification rule, "When status changes from"
ANY
toOK
,A notification is successfully generated for the transition out of the deadman condition (dead host comes back to life) but not into the deadman condition (alive host becomes dead). It seems that one important precondition for causing this bug is setting the deadman alert check "When values are not reporting for" threshold for a longer period of time (say, 5m) than the notification rule Schedule Task Every time (say, 1m).
I did some dissecting of bits of
monitor.flux
in the Query Editor. If my analysis and understanding of how everything works is right,_source_timestamp
stops ticking as soon as the incoming data feed goes dead; i.e. for the alert checks that subsequently occur after as the host has gone dead, it will keep the same value of the very last successfully received sample from the host, but_time
will still keep incrementing with the alert check time.So, if the incoming data has been dead for quite some time, let's say 5m, by the time
_stateChanges
is executed inmonitor.flux
then the last two 1m-apart check samples that it will compare for the state change (e.g. forOK
-->CRIT
when doing a deadman check) will notsort
because both of those samples will have the exact same value of_source_timestamp
(as the input source has been dead for quite some time now). This means the subsequentdifference
/filter
logic in_stateChanges
will not work as expected, as the code expects the rows to be sorted in the correct direction of time flow. And this results in the bug that the notification is never generated.The simple fix here is to extend the list of columns in the
sort
and to include_time
as a second sort column after_source_timestamp
. This way, the rows will always be sorted by_time
if sorting by_source_timestamp
doesn't provide an ordering. I tested this on InfluxDB Cloud by manually modifying the default Notification Rule that the UI creates and editingmonitor["stateChanges"]
in that UI-generated code to use my own patched_stateChanges
function with thissort
change described. This sorts out the problem for me andOK
-->CRIT
(i.e. deadman condition) notifications are now correctly being generated further upwards to Slack.The text was updated successfully, but these errors were encountered: