What did you do?
Updated a template and triggered a config reload. This template was invalid.
What did you expect to see?
The metric alertmanager_config_last_reload_successful being set to false.
What did you see instead? Under which circumstances?
The metric alertmanager_config_last_reload_successful was set to true, while the Coordinator was reporting an invalid template in the logs.
Extras
This triggered an incident for us. AM was scheduled on non-preemptible nodes and therefore ran for a long time. The template was updated a month ago, and the reload metric remained true across several reloads between that update and now, so we never noticed that the reload had failed. After we drained these nodes, all AM instances were rescheduled and could not boot because of the invalid template syntax, which we were not aware of.
In other words, this template failure is silent on reload and catastrophic on reboot, especially in combination with Thanos ruler: ruler will not boot if it cannot resolve an IP for AM, which in k8s happens when there are no ready pods behind a service.
ts=2020-09-11T12:20:27.391Z caller=coordinator.go:137 component=configuration msg="one or more config change subscribers failed to apply new config" file=/etc/config/alertmanager.yml err="failed to parse templates: template: messagebird.tmpl:21: invalid syntax"
OGKevin changed the title from "Coordinator subscriber errors should be included in alertmanager_config_last_reload_successful" to "Coordinator subscriber errors are excluded in alertmanager_config_last_reload_successful" on Sep 15, 2020
Indeed, thanks for reporting it! From a quick glance, the config metrics need to be moved to the Coordinator.Reload() method. Are you willing and able to send a PR?
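To make the suggested change concrete, here is a minimal sketch of what recording the metric inside a Reload() method could look like. It is not the actual Alertmanager code: the gauge type, the Coordinator fields, the helper newCoordinator and the brokenTemplates subscriber are all hypothetical stand-ins chosen for illustration. The only point it shows is the ordering, where the gauge is set once, after both the file load and every subscriber have run.

```go
package main

import (
	"errors"
	"fmt"
)

// gauge is a hypothetical stand-in for the Prometheus gauge behind
// alertmanager_config_last_reload_successful.
type gauge struct{ value float64 }

func (g *gauge) Set(v float64) { g.value = v }

// Coordinator is a simplified illustration, not the real Alertmanager type.
type Coordinator struct {
	lastReloadSuccessful *gauge
	loadFromFile         func() (string, error)
	subscribers          []func(cfg string) error
}

// Reload records the outcome only after the file load and every subscriber
// have run, so a subscriber failure sets the gauge to 0.
func (c *Coordinator) Reload() error {
	if err := c.reload(); err != nil {
		c.lastReloadSuccessful.Set(0)
		return err
	}
	c.lastReloadSuccessful.Set(1)
	return nil
}

// reload loads the config and hands it to each subscriber in turn,
// stopping at the first error.
func (c *Coordinator) reload() error {
	cfg, err := c.loadFromFile()
	if err != nil {
		return err
	}
	for _, s := range c.subscribers {
		if err := s(cfg); err != nil {
			return err
		}
	}
	return nil
}

// brokenTemplates simulates a template subscriber rejecting a template.
func brokenTemplates(string) error {
	return errors.New("failed to parse templates: invalid syntax")
}

// newCoordinator builds a Coordinator whose gauge starts at 1, as it would
// after an earlier successful reload. The loader always succeeds.
func newCoordinator(subs ...func(cfg string) error) *Coordinator {
	return &Coordinator{
		lastReloadSuccessful: &gauge{value: 1},
		loadFromFile:         func() (string, error) { return "route: {}", nil },
		subscribers:          subs,
	}
}

func main() {
	c := newCoordinator(brokenTemplates)
	err := c.Reload()
	fmt.Printf("err=%v gauge=%v\n", err, c.lastReloadSuccessful.value)
}
```

With this ordering, a reload whose config file parses but whose templates do not would leave the gauge at 0, which is what the issue expected to see.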
Now that I think of it, however, should AM not revert to the previous config if the Coordinator fails to update everything? In our situation we would have ended up with, e.g., an updated config paired with an outdated template. Or are we assuming that, with this metric fix, folks will have reload-failure alerting in place and will fix the config ASAP?
* #2372 Move config reload metrics to Coordinator.Reload()
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
* #2372 Minor refactoring.
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
This is due to the following code:
alertmanager/config/coordinator.go
Lines 119 to 130 in 1fdff6b
alertmanager/config/coordinator.go
Lines 96 to 111 in 1fdff6b
After setting the metrics to success, AM triggers the subscribers to do their work:
alertmanager/config/coordinator.go
Lines 136 to 143 in 1fdff6b
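The ordering described above can be reproduced in a small, self-contained sketch. This is a simplified model and not the code at the referenced lines: the Coordinator fields, the loadFromFile signature and the templateSubscriber function are assumptions made for illustration. It only shows why the gauge stays at 1 when the success value is written during the file load and a subscriber fails afterwards.

```go
package main

import (
	"errors"
	"fmt"
)

// Coordinator is a hypothetical, simplified model of the config coordinator.
type Coordinator struct {
	lastReloadSuccessful float64 // stands in for the gauge value
	subscribers          []func(cfg string) error
}

// loadFromFile pretends to parse the main config and records success as
// soon as the file itself is valid, before any subscriber has run.
func (c *Coordinator) loadFromFile(cfg string) error {
	if cfg == "" {
		c.lastReloadSuccessful = 0
		return errors.New("empty configuration")
	}
	c.lastReloadSuccessful = 1
	return nil
}

// Reload loads the file first and only then notifies the subscribers.
func (c *Coordinator) Reload(cfg string) error {
	if err := c.loadFromFile(cfg); err != nil {
		return err
	}
	for _, s := range c.subscribers {
		if err := s(cfg); err != nil {
			// The error is returned, but the gauge was already set to 1.
			return err
		}
	}
	return nil
}

// templateSubscriber simulates the template subscriber hitting a syntax error.
func templateSubscriber(string) error {
	return errors.New("failed to parse templates: invalid syntax")
}

func main() {
	c := &Coordinator{subscribers: []func(cfg string) error{templateSubscriber}}
	err := c.Reload("route: {}")
	fmt.Printf("reload error: %v\n", err)
	fmt.Printf("last reload successful: %v\n", c.lastReloadSuccessful)
}
```

In this model the reload returns an error while the gauge still reports success, matching the mismatch between the log line and the metric reported in the issue.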
The template subscriber is created here:
alertmanager/cmd/alertmanager/main.go
Lines 381 to 454 in 1fdff6b
and does not expose any metric indicating whether it succeeded.
The following code makes the reboot catastrophic:
alertmanager/cmd/alertmanager/main.go
Lines 456 to 458 in 1fdff6b
Environment
System information: insert output of uname -srm here
Alertmanager version: v0.21.0
Alertmanager configuration file: