-
Notifications
You must be signed in to change notification settings - Fork 529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding circuit breakers on ingester server side for write path #8180
Conversation
8666953
to
26fd19e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Yuri. I haven't reviewed tests yet, but I have enough comments to send this first batch now.
2a6c937
to
1121ffe
Compare
In internal discussion with Yuri, I proposed to change the direction of this PR, and use "Standalone Usage" of circuit breakers: try to acquire permit from circuit breaker in |
134a503
to
cd7e066
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't finished my review yet, because I've run out of time. Will continue tomorrow.
c5f7fff
to
6b915ca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next batch of comments.
pkg/ingester/ingester.go
Outdated
} | ||
if shouldFinish { | ||
defer i.finishPushRequest(reqSize) | ||
defer func() { | ||
i.finishPushRequest(ctx, reqSize, time.Since(start), acquiredCircuitBreakerPermit, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be fine to just call i.FinishPushRequest
. For that we need to preserve ctx
returned by startPushRequest
, but we need to do that anyway if we want to update pushRequestState
in the defer
statement below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have applied your suggestion. The only thing to be aware of was if a call to this PushWithCleanup()
actually acquired a permit (during a call to i.startPushRequest()
). In that case, the value of pushRequestState.acquiredCircuitBreakerPermit
also has to be updated in the defer
statement below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, since in the previous version of this implementation was:
_, shouldFinish, err := i.startPushRequest(ctx, reqSize)
if err != nil {
return middleware.DoNotLogError{Err: err}
}
if shouldFinish {
defer i.finishPushRequest(reqSize)
}
does it mean that we had a bug?
pkg/ingester/circuitbreaker.go
Outdated
return cb.active.Load() | ||
} | ||
|
||
func (cb *circuitBreaker) setActive() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rename this to:
func (cb *circuitBreaker) setActive() { | |
func (cb *circuitBreaker) activate() { |
I would expect setActive
to receive an active bool
arg to set active property to a desired value, but this method only activates the circuit breaker.
pkg/ingester/circuitbreaker.go
Outdated
if cb == nil { | ||
return false | ||
} | ||
return cb.active.Load() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make this shorter:
if cb == nil { | |
return false | |
} | |
return cb.active.Load() | |
return cb != nil && cb.active.Load() |
f.DurationVar(&cfg.PushTimeout, prefix+"push-timeout", defaultPushTimeout, "How long is execution of ingester's Push supposed to last before it is reported as timeout in a circuit breaker. This configuration is used for circuit breakers only, and timeout expirations are not reported as errors") | ||
} | ||
|
||
type circuitBreaker struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type circuitBreaker struct { | |
// circuitBreaker abstracts the ingester's server-side circuit breaker functionality. | |
// A nil *circuitBreaker is a valid noop implementation. | |
type circuitBreaker struct { |
pkg/ingester/circuitbreaker.go
Outdated
func (cb *circuitBreaker) tryAcquirePermit() (bool, error) { | ||
if !cb.isActive() { | ||
return false, nil | ||
} | ||
if !cb.cb.TryAcquirePermit() { | ||
cb.metrics.circuitBreakerResults.WithLabelValues(resultOpen).Inc() | ||
return false, newCircuitBreakerOpenError(cb.cb.RemainingDelay()) | ||
} | ||
return true, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that if tryAcquirePermit
returns false, nil
, then I don't have permit to proceed. However, it's not like that.
I would try to refactor the method to either avoid having to read the comment to understand the return values (what's the first return value? prod code calls it "acquired" while tests call it "status") or at least to force the user to think about it.
The comment above says:
If it was not possible to acquire a permit, success flag false is returned. In this case no call to finishPushRequest is needed.
Which puts logic in the caller of this method who shouldn't care whether it's active or not, and I think we can do that better by just returning a callback:
func (cb *circuitBreaker) tryAcquirePermit() (bool, error) { | |
if !cb.isActive() { | |
return false, nil | |
} | |
if !cb.cb.TryAcquirePermit() { | |
cb.metrics.circuitBreakerResults.WithLabelValues(resultOpen).Inc() | |
return false, newCircuitBreakerOpenError(cb.cb.RemainingDelay()) | |
} | |
return true, nil | |
} | |
func (cb *circuitBreaker) tryAcquirePermit() (finish func(duration time.Duration, pushErr error), err error) | |
if !cb.isActive() { | |
return func(_ time.Duration, _ error,) {}, nil | |
} | |
if !cb.cb.TryAcquirePermit() { | |
cb.metrics.circuitBreakerResults.WithLabelValues(resultOpen).Inc() | |
return nil, newCircuitBreakerOpenError(cb.cb.RemainingDelay()) | |
} | |
return cb.finishPushRequest, nil | |
} |
Then the caller doesn't have more logic than just:
finish, err := cb.tryAquirePermit()
if err != nil {
return ..., err
}
// ...
finish(time.Since(t0), err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I understand your rationale (client just calls returned function), it looks like slightly over-engineered solution to a trivial problem, where a simple boolean works just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also keep the current implementation as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like slightly over-engineered solution
Closures are first class citizens just like bools :)
to a trivial problem
The difference is that it removes the problem of having to decide things from the caller.
pkg/ingester/circuitbreaker.go
Outdated
return true, nil | ||
} | ||
|
||
func (cb *circuitBreaker) recordResult(err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can you move this method below finishPushRequest()
?
pkg/ingester/ingester.go
Outdated
if cfg.CircuitBreakerConfig.Enabled { | ||
// We create an inactive circuit breaker, which will be activated on a successful completion of starting. | ||
i.circuitBreaker = newCircuitBreaker(cfg.CircuitBreakerConfig, false, logger, registerer) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given you're passing the entire cfg.CircuitBreakerConfig
to newCircuitBreaker
, and nil
instance of *circuitBreaker
is valid, can we move this logic to newCircuitBreaker
?
Otherwise this method is taking decisions based on that entity's config and if it's false, it relies on that entity's nil behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved the logic in newCircuitBreaker()
.
pkg/ingester/ingester.go
Outdated
if i.cfg.CircuitBreakerConfig.InitialDelay == 0 { | ||
i.circuitBreaker.setActive() | ||
} else { | ||
time.AfterFunc(i.cfg.CircuitBreakerConfig.InitialDelay, func() { | ||
i.circuitBreaker.setActive() | ||
}) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, IMO, this logic should be inside of newCircuitBreaker
.
This method shouldn't read circuit breaker's config and take decisions on it, it should just call i.circuitBreaker.onStarting()
or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it should be inside of newCircuitBreaker
specifically (how would we know when ingester finished starting?), but moving it to circuit breaker method sounds fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes, sorry, I wrote half comment and then realised that this wasn't part of newIngester
anymore. I agree, it shouldn't be in newCircuitBreaker
, but should be in a method of circuit breaker (as I stated in the second part of my comment 😂 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved the logic in the circuitBreaker.activate()
method: if cfg.InitialDelay
is 0, the circuit breaker gets activated immediately. Otherwise it gets activated after the initial delay.
Then, ingester.startig()
will just call i.circuitbreaker.activate()
, which will activate the circuit breaker in the right moment. WDYT @colega?
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
2949ada
to
e136df3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job! Thank you very much for addressing all my feedback!
* Adding circuit breakers on ingester server side for write path Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findigs Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Implementing the gauge as NewGaugeFunc * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing lint issues Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Adding test for hitting deadline when ingester.Push is used Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fix additional review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Get rid of finishPushRequest Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Add unit test for startPushRequest Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Activate circuit breaker on a successful completion of ingester.starting Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Rename the output error of ingester.PushWithCleanup() Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Get rid of test-delay key from in context Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> --------- Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
…na#8180) * Adding circuit breakers on ingester server side for write path Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findigs Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Implementing the gauge as NewGaugeFunc * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing lint issues Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Adding test for hitting deadline when ingester.Push is used Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fix additional review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Get rid of finishPushRequest Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Add unit test for startPushRequest Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Activate circuit breaker on a successful completion of ingester.starting Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Rename the output error of ingester.PushWithCleanup() Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Get rid of test-delay key from in context Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> * Fixing review findings Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com> --------- Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
What this PR does
Similarly to what was done in #5650, when circuit breakers were added to ingester client side, this PR adds circuit breakers to the server side.
Which issue(s) this PR fixes or relates to
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.