Coordinator subscriber errors are excluded in alertmanager_config_last_reload_successful #2372

OGKevin · 2020-09-15T09:55:00Z

What did you do?
Update a template and triggered a config reload. This template was invalid.
What did you expect to see?
The metric alertmanager_config_last_reload_successful being set to false.

What did you see instead? Under which circumstances?
The metric alertmanager_config_last_reload_successful was set to true, while the Coordinator was reporting an invalid template in the logs.

Extras
This triggered an incident for us. What happened is AM got scheduled on non-preemptible nodes and therefore was running for a while. This template was updated a month ago and the reload metric remained true, even after a bunch of reloads during the template update and now, hence we never noticed the reload to fail. After draining these nodes, all AM's got rescheduled and could not boot up because of invalid template syntax which we were not aware of.

In other words, this template failure is silent on reload and catastrophic on reboot. Especially in combination with Thanos ruler. As ruler will not boot up if it can not resolve an IP for AM. Wich in k8s will happen if there are no ready pods behind a service.

This is due to the following code:

alertmanager/config/coordinator.go

Lines 119 to 130 in 1fdff6b

    
           level.Info(c.logger).Log( 
        
           	"msg", "Loading configuration file", 
        
           	"file", c.configFilePath, 
        
           ) 
        
           if err := c.loadFromFile(); err != nil { 
        
           	level.Error(c.logger).Log( 
        
           		"msg", "Loading configuration file failed", 
        
           		"file", c.configFilePath, 
        
           		"err", err, 
        
           	) 
        
           	return err 
        
           }

alertmanager/config/coordinator.go

Lines 96 to 111 in 1fdff6b

    
           // loadFromFile triggers a configuration load, discarding the old configuration. 
        
           func (c *Coordinator) loadFromFile() error { 
        
           	conf, err := LoadFile(c.configFilePath) 
        
           	if err != nil { 
        
           		c.configSuccessMetric.Set(0) 
        
           		return err 
        
           	} 
        
           	c.config = conf 
        
           	c.configSuccessMetric.Set(1) 
        
           	c.configSuccessTimeMetric.Set(float64(time.Now().Unix())) 
        
           	hash := md5HashAsMetricValue([]byte(c.config.original)) 
        
           	c.configHashMetric.Set(hash) 
        
           	return nil 
        
           }

After setting the metrics to success, AM triggers the subscribers to do their work:

alertmanager/config/coordinator.go

Lines 136 to 143 in 1fdff6b

    
           if err := c.notifySubscribers(); err != nil { 
        
           	c.logger.Log( 
        
           		"msg", "one or more config change subscribers failed to apply new config", 
        
           		"file", c.configFilePath, 
        
           		"err", err, 
        
           	) 
        
           	return err 
        
           }

The template subscriber is created here:

alertmanager/cmd/alertmanager/main.go

Lines 381 to 454 in 1fdff6b

    
           configCoordinator.Subscribe(func(conf *config.Config) error { 
        
           	tmpl, err = template.FromGlobs(conf.Templates...) 
        
           	if err != nil { 
        
           		return errors.Wrap(err, "failed to parse templates") 
        
           	} 
        
           	tmpl.ExternalURL = amURL 
        
           	// Build the routing tree and record which receivers are used. 
        
           	routes := dispatch.NewRoute(conf.Route, nil) 
        
           	activeReceivers := make(map[string]struct{}) 
        
           	routes.Walk(func(r *dispatch.Route) { 
        
           		activeReceivers[r.RouteOpts.Receiver] = struct{}{} 
        
           	}) 
        
           	// Build the map of receiver to integrations. 
        
           	receivers := make(map[string][]notify.Integration, len(activeReceivers)) 
        
           	var integrationsNum int 
        
           	for _, rcv := range conf.Receivers { 
        
           		if _, found := activeReceivers[rcv.Name]; !found { 
        
           			// No need to build a receiver if no route is using it. 
        
           			level.Info(configLogger).Log("msg", "skipping creation of receiver not referenced by any route", "receiver", rcv.Name) 
        
           			continue 
        
           		} 
        
           		integrations, err := buildReceiverIntegrations(rcv, tmpl, logger) 
        
           		if err != nil { 
        
           			return err 
        
           		} 
        
           		// rcv.Name is guaranteed to be unique across all receivers. 
        
           		receivers[rcv.Name] = integrations 
        
           		integrationsNum += len(integrations) 
        
           	} 
        
           	inhibitor.Stop() 
        
           	disp.Stop() 
        
           	inhibitor = inhibit.NewInhibitor(alerts, conf.InhibitRules, marker, logger) 
        
           	silencer := silence.NewSilencer(silences, marker, logger) 
        
           	pipeline := pipelineBuilder.New( 
        
           		receivers, 
        
           		waitFunc, 
        
           		inhibitor, 
        
           		silencer, 
        
           		notificationLog, 
        
           		peer, 
        
           	) 
        
           	configuredReceivers.Set(float64(len(activeReceivers))) 
        
           	configuredIntegrations.Set(float64(integrationsNum)) 
        
           	api.Update(conf, func(labels model.LabelSet) { 
        
           		inhibitor.Mutes(labels) 
        
           		silencer.Mutes(labels) 
        
           	}) 
        
           	disp = dispatch.NewDispatcher(alerts, routes, pipeline, marker, timeoutFunc, logger, dispMetrics) 
        
           	routes.Walk(func(r *dispatch.Route) { 
        
           		if r.RouteOpts.RepeatInterval > *retention { 
        
           			level.Warn(configLogger).Log( 
        
           				"msg", 
        
           				"repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected.", 
        
           				"repeat_interval", 
        
           				r.RouteOpts.RepeatInterval, 
        
           				"retention", 
        
           				*retention, 
        
           				"route", 
        
           				r.Key(), 
        
           			) 
        
           		} 
        
           	}) 
        
           	go disp.Run() 
        
           	go inhibitor.Run() 
        
           	return nil 
        
           })

and does not expose any type of metrics if this succeeded or not.

The following code makes the reboot catastrophic:

alertmanager/cmd/alertmanager/main.go

Lines 456 to 458 in 1fdff6b

    
           if err := configCoordinator.Reload(); err != nil { 
        
           	return 1 
        
           }

Environment

System information:

insert output of uname -srm here
Alertmanager version:
v0.21.0
Alertmanager configuration file:

insert configuration here

Logs:

ts=2020-09-11T12:20:27.391Z caller=coordinator.go:137 component=configuration msg="one or more config change subscribers failed to apply new config" file=/etc/config/alertmanager.yml err="failed to parse templates: template: messa
gebird.tmpl:21: invalid syntax"

The text was updated successfully, but these errors were encountered:

simonpasquier · 2020-09-15T10:00:17Z

Indeed thanks for reporting it! From a quick glance, the config metrics need to be moved to the Coordinator.Reload() method. Do you want to/can send a PR?

OGKevin · 2020-09-15T10:14:43Z

Now that I think of it, however, should AM not revert back to the previous config if the Coordinator fails to update everything? Bc in our situation, we would have e.g. updated config with an outdated template. Or are we assuming that now with this metric fix, that folks will have reload failure alerting in place and try to fix the config ASAP?

* #2372 Move config reload metrics to Coordinator.Reload() Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com> * #2372 Minor refactoring. Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>

OGKevin changed the title ~~Coordinator subscriber errors should be included in alertmanager_config_last_reload_successful~~ Coordinator subscriber errors are excluded in alertmanager_config_last_reload_successful Sep 15, 2020

This was referenced Sep 15, 2020

[#2372] Move config reload metrics to Coordinator.Reload() #2373

Merged

Thanos ruler dies on startup if AM DNS can't be resolved thanos-io/thanos#3186

Closed

simonpasquier closed this as completed in #2373 Sep 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Coordinator subscriber errors are excluded in alertmanager_config_last_reload_successful #2372

Coordinator subscriber errors are excluded in alertmanager_config_last_reload_successful #2372

OGKevin commented Sep 15, 2020 •

edited

Loading

simonpasquier commented Sep 15, 2020

OGKevin commented Sep 15, 2020

Coordinator subscriber errors are excluded in alertmanager_config_last_reload_successful #2372

Coordinator subscriber errors are excluded in alertmanager_config_last_reload_successful #2372

Comments

OGKevin commented Sep 15, 2020 • edited Loading

simonpasquier commented Sep 15, 2020

OGKevin commented Sep 15, 2020

OGKevin commented Sep 15, 2020 •

edited

Loading