
influxdb2 does not send notifications #18769

Open · omers opened this issue Jun 28, 2020 · 58 comments
Labels: area/2.x OSS 2.0 related issues and PRs, kind/bug

Comments

@omers commented Jun 28, 2020

I set up influxdb2 to send alerts to our Slack channel.
The alert history shows alerts that were sent 4 days ago, but nothing has arrived in the Slack channel since.

For debugging, I created a Flask app to listen as an HTTP endpoint and created an HTTP notification rule to route all alerts to it. I see no requests.
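(A quick way to rule out the endpoint itself, as a sketch: post to the Flask app directly from the Data Explorer. The URL and body below are hypothetical placeholders, and since the call returns just a status code, expect no table output.)

```flux
import "http"

// Bypass the check/rule pipeline entirely: if this POST doesn't reach the
// Flask app either, the endpoint is the problem; if it does, the problem
// sits in the check/notification-rule pipeline.
http.post(
    url: "http://localhost:5000/alert",
    headers: {"Content-Type": "application/json"},
    data: bytes(v: "{\"test\": \"manual notification\"}")
)
```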

Steps to reproduce:

  1. Set up an alert
  2. Set up a Slack notification endpoint
  3. Create an alert that sets the status to Critical
  4. Go to the alert history page and check that the alert was sent

Expected behavior:
Alerts are delivered to the Slack channel.

Actual behavior:
No notifications are delivered, to either Slack or the HTTP endpoint.

Environment info:

  • System info: Linux 5.3.0-1023-aws x86_64
  • InfluxDB version: InfluxDB 2.0.0-beta.13 (git: 86796dd) build_date: 2020-06-28T06:23:47Z
  • Other relevant environment details: influxd runs as a daemon on Ubuntu OS

Config:
/usr/sbin/influxd run --engine-path /influx/engine --bolt-path /influx/boltdb.db --http-bind-address 127.0.0.1:9999 --log-level info

The last time alerts were sent was 4 days ago:
[screenshot: notification history]

@marcosciatta commented Jul 1, 2020

Same issue, but it seems to occur only with threshold alerts, at least for me.
InfluxDB 2.0.0-beta.12 (git: ff620782eb) build_date: 2020-07-01T10:38:19Z
In some cases it seems random: some checks' notifications are sent, others are not.

The checks themselves work well; I can see the row in the check's history, but I don't see the correlated row in the notification history.

@xiaoxuanzi commented Jul 3, 2020

Same issue.
influxdb version:

influxdb_2.0.0-beta.13_linux_amd64

@mjf commented Jul 8, 2020

The very same issue here. 😞

```
# influxd2 version
InfluxDB 2.0.0-beta.13 (git: 86796ddf2d) build_date: 2020-07-08T11:07:50Z
```

I don't see anything in the notification tab (but checks seem to work). It also seems that alerts are being scheduled somehow (at least the last run time seems to get updated regularly) and perhaps even run (?), but nothing actually happens. To clarify: I am trying the HTTP PUSH method. And of course, nothing special gets logged... 😢

How can I debug this to learn more, please? 😁

@omers (Author) commented Jul 9, 2020

Any feedback from the Influx team?

@mjf commented Jul 13, 2020

@omers No.

@Pupix commented Jul 13, 2020

I encountered the same problem during the past week. The notification rules seem to work only for checks created before the rule was created; newly created checks don't fire notifications.

To "fix" it, you have to do a PUT request (PATCH doesn't work) on the rule, with the same data, effectively overwriting it with itself; that seems to refresh the list of checks the rule is looking for.

@zeesumwang

+1

@alex88 commented Aug 5, 2020

I see the same error on my end, as I've reported here: https://community.influxdata.com/t/notification-when-status-changes-from-ok-crit/13847/9
I tried multiple configs:

  • with and without tags
  • state change from any to crit and from crit to any
  • when state is crit

All with a 1-minute rule interval and a 1-minute check interval; no notifications have been sent out.

@ivanpricewaycom

+1 We're running version=2.0.0-beta.15; statuses are being written by checks and the statuses are changing, but notifications using the 'changes from' type are not firing.

Notifications for 'is equal to' work as expected.

[screenshot]

I tried 're-putting' the rule as per @Pupix's suggestion (via the UI); it makes no difference.

It seems this is the exact same problem as #17809.

It seems to me the issue should be reopened...

-ivan

mhall119 added the area/2.x OSS 2.0 related issues and PRs label Aug 6, 2020
@ivanpricewaycom

Adding to my previous comment: after playing with different intervals for the notification rule and the check, I believe this is related to:

#18284

Indeed, setting the notification rule interval to ~1m9s, with a check interval of 30s, ensures that there is (almost) always a check that has run within the notification window, and I am receiving state change notifications (via HTTP).

This setup seems fragile, however, and I'm not sure we're able to rely on it for production use... will continue experimenting.
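A minimal sanity check for that window, as a sketch (the check name "my-check" is a hypothetical placeholder; the 1m9s range mirrors the rule interval described above): if every run of this query returns at least one row, the rule always has a status to evaluate.

```flux
import "influxdata/influxdb/monitor"

// List the statuses that fall inside one notification-rule window.
monitor.from(start: -1m9s)
    |> filter(fn: (r) => r._check_name == "my-check")
    |> keep(columns: ["_time", "_level", "_check_name"])
```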

@ivanpricewaycom

Following on from this: we have now modified our config so that the checks run every 10s and the notification rule every 30s, but notifications are still being missed.

Executing the following:

```flux
import "influxdata/influxdb/monitor"

monitor.from(start: -30s)
    |> filter(fn: (r) => r["line_code"] == "LI-XXXX")
    |> monitor.stateChanges(toLevel: "crit")
```

results in the 'correct' list of state changes, but not all of these changes result in a notification.

Is there a way to get more debug information from the notification sending, to understand where the ones that are not being sent are failing?
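(One low-tech angle, an assumption on my part rather than an official debug facility: InfluxDB records sent notifications in the `notifications` measurement of the `_monitoring` bucket, so you can diff what the rule actually recorded against the state changes returned above.)

```flux
// Inspect what the notification rules recorded as sent in the last hour.
from(bucket: "_monitoring")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "notifications")
    |> filter(fn: (r) => r._field == "_message")
```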

@omers (Author) commented Aug 13, 2020

@ivanpricewaycom I couldn't find any option to debug alerts and rules, so I guess we need to wait for a response from the Influx team.

@docere-priv

I've observed the same issue in my setup with a 'changes from' notification rule. For 'is equal to', notifications are sent to the HTTP endpoint (interestingly, however, I always receive the same event 3 times with the same timestamp).

I am still trying to find out if @ivanpricewaycom's workaround works. Unfortunately, without much success so far :(.

I am using the beta.16 version:

```
# influxd version
InfluxDB 2.0.0-beta.16 (git: 50964d732c) build_date: 2020-08-07T20:18:07Z
```

@abhi1693

@mhall119 Any ETA on fixing this issue? I have confirmed that alerts and notifications work on the 2.0.0-beta.9 branch. Is there a way I can build a Docker image for that tag?

@mhall119 (Contributor)

@abhi1693 Right now all efforts are on finishing the storage engine change; after that, I think there is a new version of Flux ready to be added which might have a fix for this.

@abhi1693

@mhall119 Thanks for replying so soon. Is there a timeline on this?

@mhall119 (Contributor)

The storage engine change is currently in the works; I think it's supposed to land in the next week or so. After that I'm not as sure of the schedule, but you can ask in our Slack in the #influxdb-v2 channel.

@russorat (Contributor)

I am able to confirm the following in the current beta-16 build:

  • Checks are firing and writing to the _monitoring bucket as expected.
  • Configuring endpoints seems to work as expected. I configured both the HTTP endpoint and the Slack endpoint using a URL from https://webhook.site/
  • Notification rules are running at the correct interval and generating the correct Flux code, but notifications ARE NOT actually being sent. If I take the generated Flux code (http://localhost:9999/api/v2/notificationRules/ID/query) and run it in the Data Explorer, a notification fires as expected.

We will have to investigate what is going on.

@aanthony1243 (Contributor) commented Aug 25, 2020

We merged a fix for this recently: #19392

Summary: when a check's data is perfectly aligned on a boundary with the notification rule's schedule, we had a bug that trimmed off both the starting and ending points of the alerting time range. The fix makes sure that one side of the window is always included, ensuring that no rows are missed.

Unfortunately, it requires opening and saving each notification rule in order to regenerate the correct code. We are looking into migration solutions for users with a large number of notification rules.

@ivanpricewaycom

OK, great news; as soon as the beta.17 Docker image is released we'll be testing this.

@tiny-pangolin commented Sep 1, 2020 via email

@abhi1693

@aanthony1243 Any ETA on closing this bug and releasing another beta version?

@tiny-pangolin

This bug has been fixed in the latest beta; I have been able to receive alerts via PagerDuty and Slack.

@abhi1693

@tiny6996 There hasn't been a v2 release since Aug 8. Can you please confirm which version you are referring to?

@ivanpricewaycom

Given that the fix @aanthony1243 refers to was merged on 24/8, it is clearly not included in beta.16, so @tiny6996 must be referring to a different problem. I'm still waiting for beta.17 to see if the fix addresses the problem we're experiencing.

@ivanpricewaycom

Yeah, sorry, I don't have any other suggestions for you @pavleec; what we need here is better debug visibility into the task/notification system as a whole, to understand where the blockages are.

A sandbox environment on a publicly available Influx instance would also be useful, to help share the problem with Influx devs.

@OlafHaalstra

After some more debugging, the issue seemed to be that we had configured a notification rule that triggered upon:

(1) When status changes from OK to ANY

alongside notifications for:

(2) When equal to INFO/CRIT/WARN

The notifications are now pushed by rule (2), with some repetition if the status stays the same. The problem seems to be that the query created for the notification task returns an empty result, and therefore the status never reverts back to OK. I was able to manually trigger notifications by changing the status back to OK first and then to any level that triggered rule (1). To make sure that there is always a value, I think we need a combination of:

|> aggregateWindow(.., createEmpty: true)

and

|> fill(value: 0)

where filling empty results only works with interpolate, which is mentioned in this issue.
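Put together, a sketch of that combination (bucket, measurement, field, and window below are hypothetical placeholders):

```flux
from(bucket: "mybucket")
    |> range(start: -15m)
    |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_percent")
    |> aggregateWindow(every: 1m, fn: mean, createEmpty: true)  // emit a row even for empty windows
    |> fill(value: 0.0)  // give those rows a value so the status can revert to OK
```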

@psteinbachs psteinbachs assigned danxmoran and unassigned psteinbachs Dec 15, 2020
@alespour (Contributor)

I believe influxdata/flux#1877 is related to this.

@NZSmartie

Hi, this is still a problem. I have a few rules that rely on status changes, as reporting whenever a status equals crit et al. would be too repetitive. However, the alerts do not trigger and no notifications are sent.

@danxmoran (Contributor) commented Feb 25, 2021

~~This big TODO might be the reason why #19392 didn't fix this issue~~ Red herring: looks like that function is unused...

@brobotic

Just upgraded from 1.x to 2.0.4 (git: 4e7a59bb9a) build_date: 2021-02-08T17:47:02Z and am experiencing this same issue. I can see checks hit CRIT correctly in the check status, but Slack notifications for OK (or ANY) -> CRIT do not fire. I do see CRIT -> OK fire notifications, though. I really want to stick with 2.0, but it's looking like back to 1.x until this is fixed.

@Glokeru commented Apr 30, 2021

Same behavior with 2.0.6:

  • OK -> CRIT: does not send notifications
  • CRIT -> OK: sends notifications
  • CRIT: sends notifications

@omers (Author) commented May 31, 2021

Well, this has become a real blocker.
We are evaluating their managed cloud service, and the whole alerting system just does not work!

I think that version 2.0 is not ready at all for production usage.

Can anyone advise?

@cammurray

I've spent the past two hours trying different combinations, and alerts are just broken.

I can get equals conditions to fire (e.g. state = CRIT), but any "change" condition (e.g. ANY to CRIT) just does not want to send a notification. This means the alerting is basically useless, as you need another interim system in between to filter notifications that have already been sent.

@omers (Author) commented Jul 19, 2021 via email

@russorat (Contributor)

Hi all, thank you for your comments. We are very aware that our checks and alerting UI needs some improvement, and we are in the process of making those changes.

If you haven't already, check out this blog post for a detailed description of what's going on behind the scenes: https://www.influxdata.com/blog/influxdbs-checks-and-notifications-system/

Long story short, alerts are just customized tasks behind the scenes, and you can customize them however you like. Today our UI is limited in what you can build, but that should be changing soon.

We also have documentation for building custom alerting, which can help troubleshoot alerts that are not firing: https://docs.influxdata.com/influxdb/cloud/monitor-alert/custom-checks/
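To give a flavor of what those docs cover, here is a minimal custom-check sketch (the bucket, measurement, field, threshold, and `_check_id` below are all hypothetical placeholders; in practice `_check_id` must be a valid 16-character ID):

```flux
import "influxdata/influxdb/monitor"
import "influxdata/influxdb/v1"

option task = {name: "CPU check (custom)", every: 30s}

check = {_check_id: "0000000000000001", _check_name: "CPU check (custom)", _type: "custom", tags: {}}
messageFn = (r) => "CPU usage is ${string(v: r.usage_user)}%"

from(bucket: "telegraf")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_user")
    |> v1.fieldsAsCols()  // pivot fields into columns so the threshold can reference them
    |> monitor.check(data: check, messageFn: messageFn, crit: (r) => r.usage_user > 95.0)
```

Writing a check this way also makes its window explicit, which helps when reasoning about the alignment problems discussed in this thread.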

I understand the frustration with the current setup, and we are taking steps to make the process easier. Thank you!

@TechCiel

Hi contributors, and everyone here bothered by this problem,

I was surprised to find that the problem I've encountered is more than a year old. After digging into notification/rule/rule.go, especially the func increaseDur, I have some thoughts about the cause.

First, my conclusion: when the interval of a check is >= that of a notification rule, its status transitions might be discarded.

After reading the comment above func increaseDur and looking at #1877, I realized that we're filtering check results by the interval of the rule. Consider a case where we have one check per 1h and one rule per 1s. Every second we'll examine statuses within the last 2s, according to the code, which will fetch 0 or 1 records and so construct no transition. So the rule never fires.

But when it comes to equal intervals (the = in >=), things become tricky. Consider a check and a rule, both with a 5s interval. We will have a status at 0 / 5 / 10 / 15 ...s. In this case, the rule fires almost simultaneously with the check. If, at 10s, the rule queries the db before the check's status is written, the system loses the point at 10s. But will it get both 0s and 5s?

After looking into notification/rule/http_test.go, I found there is an experimental["subDuration"](from: now(), d: 1h). The check records are always saved at an exact second, with no milliseconds; the function now(), however, is not. This leads to a mismatch at the 0s point, off by a very small difference of time, which is the execution delay of the rule. That way, the point at the 0s position will always be filtered out. And if the check didn't finish writing its status before the rule executed, the rule will also fail to fire at that point.
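If I read it correctly, the window filter in question looks roughly like the sketch below (reconstructed from notification/rule/http_test.go; the 1h duration comes from that test, and the surrounding query is simplified):

```flux
import "influxdata/influxdb/monitor"
import "experimental"

// Statuses are stamped at whole seconds, but now() carries sub-second
// precision, so the window edge lands a few milliseconds past the boundary
// and a status written exactly at the edge fails the >= comparison.
monitor.from(start: -2h)
    |> filter(fn: (r) => r._time >= experimental.subDuration(from: now(), d: 1h))
```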

It has taken me several hours of investigating the monitor system's weird behaviors to come up with these ideas. The hypothesis has held up in a few tests, but it's late in my timezone and I might just be getting myself into chaos. I'm not familiar with Go; my apologies for the time wasted if I misunderstood the code.

Thank you all. <3

@umaplehurst

@TechCiel Please take a look at influxdata/flux#3807, where I'm proposing a patch for the issue. If you could test it out on your side, that would be useful feedback.

danxmoran removed their assignment Oct 26, 2021
@lukasvida

any updates on this?

@umaplehurst

@lukasvida influxdata/flux#3807 was fixed in v2.0.9, so I believe that should resolve some of the problems observed here.

@lukasvida commented Feb 23, 2022

I have a problem with ANY -> OK notifications.
My checks were not running (I don't know why), but I re-wrote them using tasks and now they change statuses correctly and at the correct intervals. They are threshold checks.

Whenever a status changes to CRIT, a notification is fired, but when that same status changes back to OK after the next task run, no notification is fired.

I'm using two notification rules: one is 'equals CRIT' and the other is ANY -> OK. The latter runs with an offset 5s larger than the first, and sometimes does not fire. Any help?

EDIT: I'm using InfluxDB 2.1.1 (git: 657e1839de) on Docker
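One thing worth verifying first (a debugging sketch, with "my-check" as a hypothetical check name): whether the OK transitions are visible as state changes inside the rule's window at all. If this returns the expected rows but the rule still doesn't fire, the problem is in the rule's window, not in the statuses.

```flux
import "influxdata/influxdb/monitor"

monitor.from(start: -10m)
    |> filter(fn: (r) => r._check_name == "my-check")
    |> monitor.stateChanges(toLevel: "ok")
```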

@sawo1337

I'm getting pretty much the same issue with 2.5: I can see states changing, but notifications are rarely sent out. On a test alert that fires every couple of minutes, the history shows the last notification event over an hour ago. This is a new install, with only one event configured.

This is the third major issue I see still open since 2020. It took us two days to overcome limitations such as the lack of SMTP support and no official Teams connector, but it looks like this issue is going to mean a rollback for us.

From what I can see, InfluxDB 2.x has been at a "take it or leave it" stage for the past two years; I would advise anyone even remotely considering 2.x to do full-scale testing first and only then migrate. Even ridiculously easy implementations such as SMTP support get dismissed as exotic features, despite being available in Kapacitor, which was supposed to be integrated into InfluxDB. Teams is likewise a third-party module without guaranteed support, and it takes 10 lines of code to implement. The list could go on.

@omers (Author) commented Dec 17, 2022

This BUG is 2 years old already.
Is anyone from Influx planning to acknowledge or respond?

@sawo1337 commented Dec 19, 2022

> This BUG is 2 years old already.
> Is anyone from Influx planning to acknowledge or respond?

@omers I don't think there is going to be a fix soon, unfortunately. This could be a serious design flaw related to locking/blocking, which would explain why they've put offsets pretty much everywhere: to avoid locking. That is unreliable at best, and random-level success at worst. We've got 100 Telegraf agents logging data into the same database, and so far I can't find a way to make global alerts work. If you select data from a single host it mostly works, but as soon as you query larger datasets it just doesn't. For example, I set up a CPU check with 100 VMs logging every 30 seconds, then ran a benchmark on one of the VMs. I can clearly see on the check graph that the CPU for that VM is at 99-100%, yet there is no critical status in the log (critical is configured as >95%). In another instance I get the status to change but no alert is issued, or I get the alert to fire but no actual data is sent to the endpoint.

Even on a check that is linked to a single host, I've noticed it takes 2-3 intervals (sometimes more) to recognize data that is already visible in the graph, and that is with an offset configured. This is why blocking is the most likely reason the alert system fails; unfortunately, fixing it would most likely take quite a bit of work.

@ghost commented May 22, 2023

> I've spent the past two hours trying different combinations, and alerts are just broken.
>
> I can get equals conditions to fire (e.g. state = CRIT), but any "change" condition (e.g. ANY to CRIT) just does not want to send a notification. This means the alerting is basically useless, as you need another interim system in between to filter notifications that have already been sent.

This was the cause of my problems. After I changed the status rule to EQUALS crit instead of FROM ok TO crit, the notification rule triggered and sent an HTTP notification. This is still more spammy than the from-to condition rule, but at least it works. Please fix this, Influx.

@sawo1337

> > I've spent the past two hours trying different combinations, and alerts are just broken.
> >
> > I can get equals conditions to fire (e.g. state = CRIT), but any "change" condition (e.g. ANY to CRIT) just does not want to send a notification. This means the alerting is basically useless, as you need another interim system in between to filter notifications that have already been sent.
>
> This was the cause of my problems. After I changed the status rule to EQUALS crit instead of FROM ok TO crit, the notification rule triggered and sent an HTTP notification. This is still more spammy than the from-to condition rule, but at least it works. Please fix this, Influx.

How many clients do you have logging data there?

@ghost commented May 23, 2023

> > > I've spent the past two hours trying different combinations, and alerts are just broken.
> > >
> > > I can get equals conditions to fire (e.g. state = CRIT), but any "change" condition (e.g. ANY to CRIT) just does not want to send a notification. This means the alerting is basically useless, as you need another interim system in between to filter notifications that have already been sent.
> >
> > This was the cause of my problems. After I changed the status rule to EQUALS crit instead of FROM ok TO crit, the notification rule triggered and sent an HTTP notification. This is still more spammy than the from-to condition rule, but at least it works. Please fix this, Influx.
>
> How many clients do you have logging data there?

About 5 at any given time.

@amit12cool

I'm experiencing this issue. Has anyone got it working? If so, please share the fix.

See my open issue here: #24319

@OlexanderKulyk

Notifications sometimes work for me, sometimes they don't; it's a nightmare (((

@DonKingMat commented Oct 29, 2023

This will never be fixed at all.

Await version 3, pay a totally unreal subscription fee, and it might work.

v2 remains an experiment.
