Feature Request - Notification also from degraded to active #11

Open
jayhding opened this issue Aug 12, 2016 · 6 comments

@jayhding
Contributor

Right now a notification is sent only after a service has been degraded for a while. We would also like to receive a notification when it recovers from the degraded state, so that we know the service's final status.

@ndelitski
Owner

How do you think we should notify when a service keeps jumping between the degraded and active states? Is it OK to receive that many notifications? With the current logic, when a service becomes degraded you receive only one notification, regardless of later status changes. Maybe we should add a specific setting to enable this feature?
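
To make the trade-off concrete, here is a rough sketch that counts how many messages a flapping service would produce under each approach. It is not the rancher-alarms code: the function name, the policy names, and the reading of the current logic as "alert once, then stay silent" are all assumptions made for illustration.

```python
from typing import Iterable


def count_notifications(states: Iterable[str], policy: str) -> int:
    """Count messages sent for a sequence of polled service states.

    policy "once": a single message the first time the service is seen
                   degraded, then silence (an ASSUMED reading of the
                   current behaviour, not taken from the project code).
    policy "both": one message on every change between active and degraded.
    """
    if policy not in ("once", "both"):
        raise ValueError(f"unknown policy: {policy}")

    sent = 0
    previous = "active"      # assumption: the service starts out healthy
    already_alerted = False  # only used by the "once" policy

    for state in states:
        if policy == "once":
            if state == "degraded" and not already_alerted:
                sent += 1
                already_alerted = True
        elif state != previous:
            # "both": every transition, in either direction, is reported
            sent += 1
        previous = state

    return sent


# Hypothetical polling results for a service that keeps flapping.
flapping = ["active", "degraded", "active", "degraded", "active", "degraded"]

print(count_notifications(flapping, "once"))
print(count_notifications(flapping, "both"))
```

For this sample sequence the "both" policy sends a message on each of the five state changes, which is the volume a per-feature setting would let operators opt into.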

@jayhding
Contributor Author

That's exactly what we often see: a service flapping between active and degraded. I actually changed the code to notify in both directions, and we have been using it for some time.

It is true that it generates more emails, which is why we switched to sending the notifications as Slack messages.

But given how convenient Slack is to access, we can easily tell whether a service is back to its normal state without connecting to the private network after hours.

It is definitely fine to control this feature with a specific flag.

Let's see whether @SydOps shares my opinion.

@ozbillwang
Contributor

ozbillwang commented Aug 12, 2016

I don't really care about the recovered status. I agree with @ndelitski: there is no need for too many notifications. On a Rancher server plus hosts, especially in an enterprise, we run thousands of containers; if there are too many notifications, operators will simply ignore them.

Second, we don't use Rancher Alarms as our main alarm system. We have others, such as Sensu, Dynatrace, etc. Those systems report high-level application and service health rather than container health. If one container is unhealthy but the HA/ELB or website works fine, we don't spend time on the problem immediately. For me, rancher-alarms is only for operators or developers who want a quick notification for particular Rancher container services. Notify only when it is needed.

Ideally, since messages can be deleted in Slack, a smart enough Slack bot could delete the previous degraded message once it decides the broken container is back and active. But I don't know how difficult that would be to code.

Recovery notification is a good feature if we can add the code, but make sure there is an option to turn it on and off easily.

@flaccid
Contributor

flaccid commented Aug 14, 2016

We all have different desires and use cases; however, the degraded->active notification has been very useful for detecting flapping. Control by settings, yes please.

Too many notifications to Slack isn't really an issue, particularly if you use a dedicated channel. The concept of DevOps/agile/CD here is to stop work and fix the problem to keep the pipeline going. The spice must flow!

I doubt you can delete messages posted by a webhook, as it's not a real bot/user; worth checking though. In my opinion, however, deletion changes history, and a log of those messages can help in a post-mortem of certain events.

What we have found is that if you get a Rancher alarm, something is wrong, so an operator really should look at it straight away, especially in production. It's very different from getting noise from something like host alerting on 'busy cpu', which is informational and can be ignored.

@ndelitski
Owner

As a start, what if we implement an option like notifyWhenRecovered=true, disabled by default? Is everybody OK with that? It would be configurable per target (email|slack...).
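
A minimal sketch of how such a per-target option could behave, assuming each target carries its own flag. Only the option name notifyWhenRecovered comes from this thread; the Target class, the targets_to_notify helper, and the state names are hypothetical and do not reflect the actual rancher-alarms implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Target:
    """A notification target such as email or Slack (hypothetical shape)."""
    name: str
    # Mirrors the proposed notifyWhenRecovered option: off unless enabled.
    notify_when_recovered: bool = False


def targets_to_notify(targets: List[Target], previous: str, current: str) -> List[str]:
    """Return the names of targets that should get a message for this change."""
    if previous == current:
        return []  # no state change, nothing to send

    if current == "degraded":
        # Every target is told when a service becomes degraded.
        return [t.name for t in targets]

    if previous == "degraded" and current == "active":
        # Recovery messages go only to targets that opted in.
        return [t.name for t in targets if t.notify_when_recovered]

    return []


# Example configuration: email keeps the default, Slack opts in.
targets = [Target("email"), Target("slack", notify_when_recovered=True)]

print(targets_to_notify(targets, "active", "degraded"))
print(targets_to_notify(targets, "degraded", "active"))
```

With this configuration, both targets hear about the degradation while only Slack receives the recovery message, so email volume stays the same as today.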

@flaccid
Contributor

flaccid commented Aug 14, 2016

Absolutely. For a first version that is great, using the same template I'd assume.
