Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponential hinted-handoff interval on fail #4284

Merged
merged 4 commits into from
Oct 1, 2015
Merged

Exponential hinted-handoff interval on fail #4284

merged 4 commits into from
Oct 1, 2015

Conversation

otoolep
Copy link
Contributor

@otoolep otoolep commented Oct 1, 2015

This change has the hinted handoff service backoff exponentially if any hinted handoff write fails.

This is not ideal since hinted handoff may be writing to many nodes, some of which have come back online. This change means that all nodes will be written to slowly if any node is down. It may be better to have a separate goroutine for each node that is due to receive data, so nodes that are up get the data as quickly as possible, and the backoff only applies to downed nodes.

@otoolep
Copy link
Contributor Author

otoolep commented Oct 1, 2015

@pauldix @jwilder

If we're going to do this, it seems to me we should pull the existing HH code apart and run a goroutine for each node's data. Each goroutine would have its own exponential back off, as needed Thoughts?

@otoolep
Copy link
Contributor Author

otoolep commented Oct 1, 2015

Discussed this with @jwilder, and it's actually not too bad because when a node comes online, it will get all its data in one shot.

[ci skip]
@@ -119,17 +119,26 @@ func (s *Service) WriteShard(shardID, ownerID uint64, points []models.Point) err

func (s *Service) retryWrites() {
defer s.wg.Done()
ticker := time.NewTicker(time.Duration(s.cfg.RetryInterval))
defer ticker.Stop()
currInterval := time.Duration(s.cfg.RetryInterval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably clamp the initial retry interval down to the Max as well here in case someone happened to misconfigure it. e.g. they set a retry-interval = 5m but retry-max-interval = 1m. The first retry would wait 5m but subsequent ones would wait 1m.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, will do.

@jwilder
Copy link
Contributor

jwilder commented Oct 1, 2015

One small thing but otherwise 👍

This could happen due to misconfiguration, so do something sensible in
that case.
@otoolep
Copy link
Contributor Author

otoolep commented Oct 1, 2015

Thanks @jwilder -- made that change, will merge on green.

otoolep added a commit that referenced this pull request Oct 1, 2015
Exponential hinted-handoff interval on fail
@otoolep otoolep merged commit 0a63bb1 into master Oct 1, 2015
@otoolep otoolep deleted the hh_backoff branch October 1, 2015 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants