Add optional circuit breaker to ingester client #5650
Conversation
}

// MakeIngesterClient makes a new IngesterClient
func MakeIngesterClient(addr string, cfg Config, metrics *Metrics) (HealthAndIngesterClient, error) {
-	dialOpts, err := cfg.GRPCClientConfig.DialOption(grpcclient.Instrument(metrics.RequestDuration))
+	unary, stream := grpcclient.Instrument(metrics.requestDuration)
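For context on what such an interceptor might look like, here is a minimal sketch of a circuit-breaker unary client interceptor built on sony/gobreaker; the helper name and the use of gobreaker are assumptions for illustration, not necessarily the implementation in this PR. The review thread below discusses which methods the breaker should actually cover.

package client

import (
	"context"

	"github.com/sony/gobreaker"
	"google.golang.org/grpc"
)

// NewCircuitBreakerInterceptor returns a unary client interceptor that runs each
// call through the given breaker: while the breaker is open, calls fail fast
// without reaching the ingester. Sketch only, built on sony/gobreaker for
// illustration.
func NewCircuitBreakerInterceptor(cb *gobreaker.CircuitBreaker) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{}, cc *grpc.ClientConn,
		invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		_, err := cb.Execute(func() (interface{}, error) {
			return nil, invoker(ctx, method, req, reply, cc, opts...)
		})
		return err
	}
}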
Note from testing: this wraps the HealthClient methods with a circuit breaker too, which is probably not what we want.
ts=2023-08-04T19:51:10.560584491Z caller=pool.go:196 level=warn msg="removing ingester failing healthcheck" addr=XXX:9095 reason="circuit breaker is open"
The circuit breaker interceptor knows the method too. What if we configure the circuit breaker to only act on a list of configured methods, so that we're explicit about which gRPC methods we want the circuit breaker to apply to?
What do you think about only applying the interceptor to methods that contain cortex.Ingester? This would avoid the need to keep a list of methods up to date and allow non-ingester things like the health check to bypass it. Or would you prefer an explicit list of methods?
I've added a list of methods to circuit-break in e312654, LMK what you think.
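To make the allow-list idea concrete, a rough sketch is below; the method names and the exact shape of the list are illustrative guesses, not necessarily what e312654 adds.

// circuitBreakMethods is an illustrative allow-list of full gRPC method names
// that the breaker guards; anything not listed (for example the gRPC health
// check) bypasses the breaker entirely.
var circuitBreakMethods = map[string]struct{}{
	"/cortex.Ingester/Push":        {},
	"/cortex.Ingester/QueryStream": {},
}

// Inside the interceptor, unlisted methods are forwarded untouched:
//
//	if _, ok := circuitBreakMethods[method]; !ok {
//		return invoker(ctx, method, req, reply, cc, opts...)
//	}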
Note that I've marked this functionality as experimental because we really do need to experiment with this at scale before enabling it by default. It's also possible the implementation will change completely (@jhalterman has some ideas about what this behavior would look like in an ideal world). However, I think it's at a point where it makes sense to include it in
Good job! LGTM. I left a few comments that I'd be glad if you could look at before merging, thanks!
This is nice overall. @56quarters and I talked about this - we should consider switching to a time-based circuit breaker when one is available (hopefully soon). This would allow us to threshold off of the recent failure % rather than consecutive failures (which is not great for dynamic systems). That aside, I do have a few suggestions:
I'm also curious if you have an idea of what settings we might use in prod?
👍
👍
👍
The defaults (10s cooldown, 10 consecutive failures) worked well in my testing. I'd like to make the defaults the appropriate settings for production to reduce the amount of tuning everyone has to do.
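For reference, a sketch of what the experimental settings could look like with those defaults; the field names, flag names, and prefix are assumptions, not the flags actually registered by this PR.

package client

import (
	"flag"
	"time"
)

// CircuitBreakerConfig sketches what the experimental settings could look like,
// using the defaults mentioned above. Names are illustrative only.
type CircuitBreakerConfig struct {
	Enabled          bool          `yaml:"enabled"`
	FailureThreshold uint          `yaml:"failure_threshold"`
	CooldownPeriod   time.Duration `yaml:"cooldown_period"`
}

func (cfg *CircuitBreakerConfig) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.BoolVar(&cfg.Enabled, prefix+"circuit-breaker.enabled", false, "Enable circuit breaking for ingester requests (experimental).")
	f.UintVar(&cfg.FailureThreshold, prefix+"circuit-breaker.failure-threshold", 10, "Consecutive failures before the breaker opens.")
	f.DurationVar(&cfg.CooldownPeriod, prefix+"circuit-breaker.cooldown-period", 10*time.Second, "How long the breaker stays open before allowing requests again.")
}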
Add a circuit breaker when making gRPC requests to ingesters to avoid making the request if the ingester is down or hitting per-instance limits. All other errors are ignored by the circuit and result in the usual behavior. Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Nicely done 👍
Switch to the failsafe circuit breaker implementation, which allows us to define an error rate over a moving window instead of new windows every N seconds. This helps in high-traffic clusters where the long tail of requests might exceed the timeout often enough in raw numbers, but still only very infrequently relative to request volume. Related #5650 Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
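To illustrate the difference, here is a simplified, hand-rolled sketch of tracking a failure rate over a moving window; it is only a conceptual illustration, not the failsafe-go breaker this commit actually switches to.

package client

import (
	"sync"
	"time"
)

// slidingWindow is a conceptual sketch of a moving-window failure-rate tracker:
// outcomes older than the window are discarded, so the rate reflects only
// recent traffic rather than consecutive failures or fixed N-second buckets.
type slidingWindow struct {
	mu       sync.Mutex
	window   time.Duration
	outcomes []outcome // recent request outcomes, oldest first
}

type outcome struct {
	at     time.Time
	failed bool
}

// record notes the result of one request.
func (w *slidingWindow) record(failed bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now()
	w.prune(now)
	w.outcomes = append(w.outcomes, outcome{at: now, failed: failed})
}

// failureRate returns the fraction of failed requests within the window.
func (w *slidingWindow) failureRate() float64 {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.prune(time.Now())
	if len(w.outcomes) == 0 {
		return 0
	}
	failures := 0
	for _, o := range w.outcomes {
		if o.failed {
			failures++
		}
	}
	return float64(failures) / float64(len(w.outcomes))
}

// prune drops outcomes that have aged out of the window.
func (w *slidingWindow) prune(now time.Time) {
	cutoff := now.Add(-w.window)
	i := 0
	for i < len(w.outcomes) && w.outcomes[i].at.Before(cutoff) {
		i++
	}
	w.outcomes = w.outcomes[i:]
}

A breaker built on this idea would open once failureRate() crosses a configured threshold after enough recent requests have been observed.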
What this PR does
Add a circuit breaker when making gRPC requests to ingesters to avoid making the request if the ingester is down or hitting per-instance limits. All other errors are ignored by the circuit and result in the usual behavior.
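As a rough illustration of the "only certain errors count" behavior, the sketch below classifies errors by gRPC status code; the assumption that ingester-down and per-instance-limit failures surface as Unavailable or ResourceExhausted is mine, and the PR's actual classification may differ.

package client

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isCircuitBreakerFailure reports whether an error should be counted by the
// breaker. Only errors that suggest the ingester is down or over its
// per-instance limits count; all other errors leave the breaker state alone.
// The status-code mapping here is an assumption for illustration.
func isCircuitBreakerFailure(err error) bool {
	if err == nil {
		return false
	}
	s, ok := status.FromError(err)
	if !ok {
		return false
	}
	switch s.Code() {
	case codes.Unavailable, codes.ResourceExhausted:
		return true
	default:
		return false
	}
}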
Which issue(s) this PR fixes or relates to
N/A
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]