Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coordinator subscriber errors are excluded in alertmanager_config_last_reload_successful #2372

Closed
OGKevin opened this issue Sep 15, 2020 · 2 comments · Fixed by #2373
Closed

Comments

@OGKevin
Copy link
Contributor

OGKevin commented Sep 15, 2020

What did you do?
Update a template and triggered a config reload. This template was invalid.
What did you expect to see?
The metric alertmanager_config_last_reload_successful being set to false.

What did you see instead? Under which circumstances?
The metric alertmanager_config_last_reload_successful was set to true, while the Coordinator was reporting an invalid template in the logs.

Extras
This triggered an incident for us. What happened is AM got scheduled on non-preemptible nodes and therefore was running for a while. This template was updated a month ago and the reload metric remained true, even after a bunch of reloads during the template update and now, hence we never noticed the reload to fail. After draining these nodes, all AM's got rescheduled and could not boot up because of invalid template syntax which we were not aware of.

In other words, this template failure is silent on reload and catastrophic on reboot. Especially in combination with Thanos ruler. As ruler will not boot up if it can not resolve an IP for AM. Wich in k8s will happen if there are no ready pods behind a service.

This is due to the following code:

level.Info(c.logger).Log(
"msg", "Loading configuration file",
"file", c.configFilePath,
)
if err := c.loadFromFile(); err != nil {
level.Error(c.logger).Log(
"msg", "Loading configuration file failed",
"file", c.configFilePath,
"err", err,
)
return err
}

// loadFromFile triggers a configuration load, discarding the old configuration.
func (c *Coordinator) loadFromFile() error {
conf, err := LoadFile(c.configFilePath)
if err != nil {
c.configSuccessMetric.Set(0)
return err
}
c.config = conf
c.configSuccessMetric.Set(1)
c.configSuccessTimeMetric.Set(float64(time.Now().Unix()))
hash := md5HashAsMetricValue([]byte(c.config.original))
c.configHashMetric.Set(hash)
return nil
}

After setting the metrics to success, AM triggers the subscribers to do their work:

if err := c.notifySubscribers(); err != nil {
c.logger.Log(
"msg", "one or more config change subscribers failed to apply new config",
"file", c.configFilePath,
"err", err,
)
return err
}

The template subscriber is created here:

configCoordinator.Subscribe(func(conf *config.Config) error {
tmpl, err = template.FromGlobs(conf.Templates...)
if err != nil {
return errors.Wrap(err, "failed to parse templates")
}
tmpl.ExternalURL = amURL
// Build the routing tree and record which receivers are used.
routes := dispatch.NewRoute(conf.Route, nil)
activeReceivers := make(map[string]struct{})
routes.Walk(func(r *dispatch.Route) {
activeReceivers[r.RouteOpts.Receiver] = struct{}{}
})
// Build the map of receiver to integrations.
receivers := make(map[string][]notify.Integration, len(activeReceivers))
var integrationsNum int
for _, rcv := range conf.Receivers {
if _, found := activeReceivers[rcv.Name]; !found {
// No need to build a receiver if no route is using it.
level.Info(configLogger).Log("msg", "skipping creation of receiver not referenced by any route", "receiver", rcv.Name)
continue
}
integrations, err := buildReceiverIntegrations(rcv, tmpl, logger)
if err != nil {
return err
}
// rcv.Name is guaranteed to be unique across all receivers.
receivers[rcv.Name] = integrations
integrationsNum += len(integrations)
}
inhibitor.Stop()
disp.Stop()
inhibitor = inhibit.NewInhibitor(alerts, conf.InhibitRules, marker, logger)
silencer := silence.NewSilencer(silences, marker, logger)
pipeline := pipelineBuilder.New(
receivers,
waitFunc,
inhibitor,
silencer,
notificationLog,
peer,
)
configuredReceivers.Set(float64(len(activeReceivers)))
configuredIntegrations.Set(float64(integrationsNum))
api.Update(conf, func(labels model.LabelSet) {
inhibitor.Mutes(labels)
silencer.Mutes(labels)
})
disp = dispatch.NewDispatcher(alerts, routes, pipeline, marker, timeoutFunc, logger, dispMetrics)
routes.Walk(func(r *dispatch.Route) {
if r.RouteOpts.RepeatInterval > *retention {
level.Warn(configLogger).Log(
"msg",
"repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected.",
"repeat_interval",
r.RouteOpts.RepeatInterval,
"retention",
*retention,
"route",
r.Key(),
)
}
})
go disp.Run()
go inhibitor.Run()
return nil
})

and does not expose any type of metrics if this succeeded or not.

The following code makes the reboot catastrophic:

if err := configCoordinator.Reload(); err != nil {
return 1
}

Environment

  • System information:

    insert output of uname -srm here

  • Alertmanager version:
    v0.21.0

  • Alertmanager configuration file:

insert configuration here
  • Logs:
ts=2020-09-11T12:20:27.391Z caller=coordinator.go:137 component=configuration msg="one or more config change subscribers failed to apply new config" file=/etc/config/alertmanager.yml err="failed to parse templates: template: messa
gebird.tmpl:21: invalid syntax"
@OGKevin OGKevin changed the title Coordinator subscriber errors should be included in alertmanager_config_last_reload_successful Coordinator subscriber errors are excluded in alertmanager_config_last_reload_successful Sep 15, 2020
@simonpasquier
Copy link
Member

Indeed thanks for reporting it! From a quick glance, the config metrics need to be moved to the Coordinator.Reload() method. Do you want to/can send a PR?

@OGKevin
Copy link
Contributor Author

OGKevin commented Sep 15, 2020

Now that I think of it, however, should AM not revert back to the previous config if the Coordinator fails to update everything? Bc in our situation, we would have e.g. updated config with an outdated template. Or are we assuming that now with this metric fix, that folks will have reload failure alerting in place and try to fix the config ASAP?

simonpasquier pushed a commit that referenced this issue Sep 28, 2020
* #2372 Move config reload metrics to Coordinator.Reload()

Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>

* #2372 Minor refactoring.

Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants