Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(stdlib): sort statuses also by _time #3996

Merged
merged 5 commits into from
Aug 31, 2021

Conversation

alespour
Copy link
Contributor

This PR fixes issue in alerting where, depending on check and rule scheduling and timing configuration, state changes are sometimes not detected and therefore notifications not sent.

The problem is that status of data points may be evaluated multiple times with certain scheduling, for example as reported in #3807, deadman check evaluates status based on data point time and not a metric value. That results in existence of multiple statuses with the same source timestamp but different time (and level) in the monitoring bucket. Then, when an alert rule attempts to detect state change, it gets incorrect sequence of statuses, because sorting by only (the same) source timestamp is insufficient.

Affected functions are stateChangesOnly and _stateChanges functions in monitor package, where before sort,
_level is dropped from the group key which results in a new series with possibly unpredictable order (by _time).

The issue is a side effect of the changes in #2725. Mea culpa.

Done checklist

  • Test cases written

@alespour alespour marked this pull request as ready for review August 25, 2021 17:20
@nathanielc nathanielc requested a review from wolffcm August 30, 2021 18:21
Copy link

@wolffcm wolffcm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good and the changes make sense. Thanks for fixing this!

Can I trouble you to adapt your test to our newer format which uses a Flux keyword testcase? We want to be using it moving forward. Here is an example:

testcase window_period_gaps {

@alespour alespour force-pushed the fix/monitor-state-change branch from 104b1af to 462093c Compare August 31, 2021 14:37
@alespour
Copy link
Contributor Author

@wolffcm Done. Nice feature!

Copy link

@wolffcm wolffcm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look great. Thanks again!

@wolffcm wolffcm merged commit 1fbf56e into influxdata:master Aug 31, 2021
@wolffcm
Copy link

wolffcm commented Sep 1, 2021

@alespour Unfortunately we had to revert this commit because the new test doesn't pass in the OSS version of influxdb---these tests (like the older ones) will use testing.load() to insert data into InfluxDB and then select it back.

The culprit seems to be that the data being passed to testing.load() is pivoted. There are a couple ways to solve this:

  • Remove testing.load() and just run the test against in-memory data. This is fine because you're not testing push-down logic or planner rules or anything like that.
  • Stream unpivoted data into testing.load().

The first option is the easiest and that's what I recommend. It should be as simple as just deleting the |> testing.load() line from your test.

Unfortunately, I'm not sure of the right steps to check if a new Flux test like this will pass in influxdb before actually merging to master. We need to look into that.

jsternberg added a commit that referenced this pull request Sep 1, 2021
@jsternberg
Copy link
Contributor

Fixed using @wolffcm's first suggestion in #4012.

jsternberg added a commit that referenced this pull request Sep 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants