Exponential hinted-handoff interval on fail #4284

otoolep · 2015-10-01T04:29:38Z

This change makes the hinted handoff service back off exponentially if any hinted handoff write fails.

This is not ideal since hinted handoff may be writing to many nodes, some of which have come back online. This change means that all nodes will be written to slowly if any node is down. It may be better to have a separate goroutine for each node that is due to receive data, so nodes that are up get the data as quickly as possible, and the backoff only applies to downed nodes.

otoolep · 2015-10-01T04:31:10Z

@pauldix @jwilder

If we're going to do this, it seems to me we should pull the existing HH code apart and run a goroutine for each node's data. Each goroutine would have its own exponential backoff, as needed. Thoughts?
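A minimal sketch of the per-node idea, under stated assumptions: `drainNode`, `drainAll`, `flakyWriter`, and the `result` type are hypothetical names invented for this example, and the write function stands in for whatever actually delivers queued data to a node. It is not the existing HH code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// result records how one node's delivery went (hypothetical type).
type result struct {
	attempts  int
	delivered bool
}

// drainNode retries write for one node, doubling its own wait (capped at
// max) after each failure. It gives up after maxAttempts tries.
func drainNode(nodeID uint64, write func(uint64) error, initial, max time.Duration, maxAttempts int) result {
	wait := initial
	for attempt := 1; ; attempt++ {
		if write(nodeID) == nil {
			return result{attempts: attempt, delivered: true}
		}
		if attempt >= maxAttempts {
			return result{attempts: attempt, delivered: false}
		}
		time.Sleep(wait)
		wait *= 2
		if wait > max {
			wait = max
		}
	}
}

// drainAll runs one goroutine per node, so a node that is down only
// delays its own delivery.
func drainAll(nodes []uint64, write func(uint64) error, initial, max time.Duration, maxAttempts int) map[uint64]result {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	results := make(map[uint64]result, len(nodes))
	for _, id := range nodes {
		wg.Add(1)
		go func(id uint64) {
			defer wg.Done()
			r := drainNode(id, write, initial, max, maxAttempts)
			mu.Lock()
			results[id] = r
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return results
}

// flakyWriter simulates nodes that fail a fixed number of writes before
// accepting data. It exists only to exercise the sketch.
func flakyWriter(failures map[uint64]int) func(uint64) error {
	var mu sync.Mutex
	remaining := make(map[uint64]int, len(failures))
	for id, n := range failures {
		remaining[id] = n
	}
	return func(nodeID uint64) error {
		mu.Lock()
		defer mu.Unlock()
		if remaining[nodeID] > 0 {
			remaining[nodeID]--
			return errors.New("node unavailable")
		}
		return nil
	}
}

func main() {
	write := flakyWriter(map[uint64]int{2: 2}) // node 2 fails twice, node 1 never fails
	results := drainAll([]uint64{1, 2}, write, time.Millisecond, 4*time.Millisecond, 5)
	for _, id := range []uint64{1, 2} {
		fmt.Printf("node %d: attempts=%d delivered=%v\n", id, results[id].attempts, results[id].delivered)
	}
}
```

In this arrangement the healthy node is written on the first attempt while the flaky node backs off independently.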

otoolep · 2015-10-01T18:54:02Z

Discussed this with @jwilder, and it's actually not too bad because when a node comes online, it will get all its data in one shot.

[ci skip]

jwilder · 2015-10-01T19:01:54Z

services/hh/service.go

@@ -119,17 +119,26 @@ func (s *Service) WriteShard(shardID, ownerID uint64, points []models.Point) err

 func (s *Service) retryWrites() {
 	defer s.wg.Done()
-	ticker := time.NewTicker(time.Duration(s.cfg.RetryInterval))
-	defer ticker.Stop()
+	currInterval := time.Duration(s.cfg.RetryInterval)


We should probably clamp the initial retry interval down to the max here as well, in case someone misconfigures it, e.g. sets retry-interval = 5m but retry-max-interval = 1m. In that case the first retry would wait 5m, but subsequent ones would wait 1m.

Good call, will do.
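The clamp suggested in this thread could look roughly like the sketch below. The helper name `clampInterval` is an assumption for illustration; the merged commit may implement the check differently.

```go
package main

import (
	"fmt"
	"time"
)

// clampInterval limits the configured initial retry interval to max, so a
// misconfigured retry-interval larger than retry-max-interval does not make
// the first wait longer than every later one.
func clampInterval(initial, max time.Duration) time.Duration {
	if initial > max {
		return max
	}
	return initial
}

func main() {
	// Misconfigured: retry-interval = 5m, retry-max-interval = 1m.
	fmt.Println(clampInterval(5*time.Minute, time.Minute)) // prints 1m0s
	// Sensible configuration is left unchanged.
	fmt.Println(clampInterval(time.Second, time.Minute)) // prints 1s
}
```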

jwilder · 2015-10-01T19:02:23Z

One small thing but otherwise 👍

This could happen due to misconfiguration, so do something sensible in that case.

otoolep · 2015-10-01T19:05:03Z

Thanks @jwilder -- made that change, will merge on green.

otoolep added 2 commits September 30, 2015 21:10

Add config support for max HH retry interval

4eba2c1

Exponential backoff if any hinted-handoff fails

878f776

otoolep added the 2 - Working label Oct 1, 2015

otoolep mentioned this pull request Oct 1, 2015

Increase HH retry interval to 10 seconds #4268

Closed

Update CHANGELOG

c7599e0

[ci skip]

jwilder reviewed Oct 1, 2015

Clamp initial value of HH retry interval

8a1e5a9

This could happen due to misconfiguration, so do something sensible in that case.

otoolep added a commit that referenced this pull request Oct 1, 2015

Merge pull request #4284 from influxdb/hh_backoff

0a63bb1

Exponential hinted-handoff interval on fail

otoolep merged commit 0a63bb1 into master Oct 1, 2015

otoolep removed the 2 - Working label Oct 1, 2015

otoolep deleted the hh_backoff branch October 1, 2015 19:13
