Exponential hinted-handoff interval on fail #4284
Discussed this with @jwilder, and it's actually not too bad because when a node comes online, it will get all its data in one shot.
[ci skip]
@@ -119,17 +119,26 @@ func (s *Service) WriteShard(shardID, ownerID uint64, points []models.Point) err

func (s *Service) retryWrites() {
	defer s.wg.Done()
	ticker := time.NewTicker(time.Duration(s.cfg.RetryInterval))
	defer ticker.Stop()
	currInterval := time.Duration(s.cfg.RetryInterval)
Should probably clamp the initial retry interval down to the max here as well, in case someone happens to misconfigure it, e.g. they set retry-interval = 5m but retry-max-interval = 1m. The first retry would wait 5m but subsequent ones would wait 1m.
Good call, will do.
One small thing but otherwise 👍
This could happen due to misconfiguration, so do something sensible in that case.
Thanks @jwilder -- made that change, will merge on green.
Exponential hinted-handoff interval on fail
This change makes the hinted handoff service back off exponentially if any hinted handoff write fails.
This is not ideal since hinted handoff may be writing to many nodes, some of which have come back online. This change means that all nodes will be written to slowly if any node is down. It may be better to have a separate goroutine for each node that is due to receive data, so nodes that are up get the data as quickly as possible, and the backoff only applies to downed nodes.