Add optional circuit breaker to ingester client #5650

56quarters · 2023-08-01T13:21:19Z

What this PR does

Add a circuit breaker when making gRPC requests to ingesters to avoid making
the request if the ingester is down or hitting per-instance limits. All other
errors are ignored by the circuit breaker and result in the usual behavior.

Which issue(s) this PR fixes or relates to

N/A

Checklist

- Tests updated
- Documentation added
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

56quarters · 2023-08-04T19:55:10Z

pkg/ingester/client/client.go

 }

 // MakeIngesterClient makes a new IngesterClient
 func MakeIngesterClient(addr string, cfg Config, metrics *Metrics) (HealthAndIngesterClient, error) {
-	dialOpts, err := cfg.GRPCClientConfig.DialOption(grpcclient.Instrument(metrics.RequestDuration))
+	unary, stream := grpcclient.Instrument(metrics.requestDuration)


Note from testing: this wraps the HealthClient methods with a circuit breaker too which is probably not what we want.

ts=2023-08-04T19:51:10.560584491Z caller=pool.go:196 level=warn msg="removing ingester failing healthcheck" addr=XXX:9095 reason="circuit breaker is open"

The circuit breaker interceptor knows the method too. What if we configure the circuit breaker to only act on a list of configured methods, so that we're explicit about which gRPC methods the circuit breaker applies to?

What do you think about only applying the interceptor to methods that contain cortex.Ingester? This would avoid the need to keep a list of methods up to date and allow non-ingester things like the health check to bypass it. Or would you prefer an explicit list of methods?

I've added a list of methods to circuit-break in e312654, LMK what you think.
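
To make the two options discussed above concrete, here is a small sketch of both checks. It assumes gRPC full method names of the form `/package.Service/Method`; the contents of the allow-list and the function names are hypothetical and not copied from e312654.

```go
package main

import (
	"fmt"
	"strings"
)

// circuitBreakMethods is a hypothetical explicit allow-list. The entries
// are illustrative; the real list lives in the PR.
var circuitBreakMethods = map[string]struct{}{
	"/cortex.Ingester/Push":        {},
	"/cortex.Ingester/QueryStream": {},
}

// inAllowList implements the explicit-list option: only methods that were
// deliberately listed go through the circuit breaker.
func inAllowList(fullMethod string) bool {
	_, ok := circuitBreakMethods[fullMethod]
	return ok
}

// isIngesterMethod implements the substring option: anything served by
// cortex.Ingester is circuit-broken, everything else bypasses it.
func isIngesterMethod(fullMethod string) bool {
	return strings.Contains(fullMethod, "cortex.Ingester")
}

func main() {
	for _, m := range []string{
		"/cortex.Ingester/Push",
		"/grpc.health.v1.Health/Check",
	} {
		fmt.Printf("%s allow-list=%v substring=%v\n", m, inAllowList(m), isIngesterMethod(m))
	}
}
```

Either check lets the health check bypass the breaker. The explicit list trades some maintenance for being obvious about what is covered.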

56quarters · 2023-08-08T20:09:01Z

Note that I've marked this functionality as experimental because we really do need to experiment with this at scale before enabling it by default. It's also possible the implementation will change completely (@jhalterman has some ideas about what this behavior would look like in an ideal world). However, I think it's at a point where it makes sense to include it in main.

pkg/ingester/client/circuitbreaker.go

pracucci

Good job! LGTM. I left a few comments that I'd be glad if you could look at before merging, thanks!

CHANGELOG.md

docs/sources/mimir/configure/about-versioning.md

pkg/ingester/client/client.go

pracucci · 2023-08-18T15:18:54Z

pkg/ingester/client/client.go

 }

 // MakeIngesterClient makes a new IngesterClient
 func MakeIngesterClient(addr string, cfg Config, metrics *Metrics) (HealthAndIngesterClient, error) {
-	dialOpts, err := cfg.GRPCClientConfig.DialOption(grpcclient.Instrument(metrics.RequestDuration))
+	unary, stream := grpcclient.Instrument(metrics.requestDuration)


The circuit breaker interceptor knows the method too. What if we configure the circuit breaker to only act on a list of configured methods, so that we're explicit about which gRPC methods the circuit breaker applies to?

pkg/ingester/client/circuitbreaker.go

jhalterman · 2023-08-18T21:52:53Z

This is nice overall. @56quarters and I talked about this - we should consider switching to a time-based circuit breaker when one is available (hopefully soon). This would allow us to threshold off of the recent failure % rather than consecutive failures (which is not great for dynamic systems).

That aside, I do have a few suggestions:

Could we obscure some of the circuit breaker details, so that users don't have to know or care how that pattern works? A few things that could help with this:
- We could rename circuit-breaker-max-consecutive-failures to circuit-breaker-failure-threshold, and use that same value for ReadyToTrip and MaxRequests (since both of these states are basically doing failure thresholding). That would allow us to remove the circuit-breaker-max-half-open-requests setting.
- We could remove circuit-breaker-closed-interval and just use the default interval (0) since we're only thresholding off of consecutive failures, and the total number of successes/failures doesn't matter.
- We should rename circuit-breaker-open-timeout to something like circuit-breaker-open-duration or circuit-breaker-half-open-delay since timeout implies the breaker might be open for <= that time when in practice it's = that time. Alternatively, we might consider calling this something like circuit-breaker-cooldown-period so that users don't have to know what open/half-open mean.

I'm also curious if you have an idea what settings we might use in prod?

56quarters · 2023-08-22T13:51:46Z

> This is nice overall. @56quarters and I talked about this - we should consider switching to a time-based circuit breaker when one is available (hopefully soon). This would allow us to threshold off of the recent failure % rather than consecutive failures (which is not great for dynamic systems).
>
> That aside, I do have a few suggestions:
>
> Could we obscure some of the circuit breaker details, so that users don't have to know or care how that pattern works? A few things that could help with this:
>
> We could rename circuit-breaker-max-consecutive-failures to circuit-breaker-failure-threshold, and use that same value for ReadyToTrip and MaxRequests (since both of these states are basically doing failure thresholding). That would allow us to remove the circuit-breaker-max-half-open-requests setting.

👍

> We could remove circuit-breaker-closed-interval and just use the default interval (0) since we're only thresholding off of consecutive failures, and the total number of successes/failures doesn't matter.

👍

> We should rename circuit-breaker-open-timeout to something like circuit-breaker-open-duration or circuit-breaker-half-open-delay since timeout implies the breaker might be open for <= that time when in practice it's = that time. Alternatively, we might consider calling this something like circuit-breaker-cooldown-period so that users don't have to know what open/half-open mean.

👍

> I'm also curious if you have an idea what settings we might use in prod?

The defaults (10s cooldown, 10 consecutive failures) worked well in my testing. I'd like to make the defaults the appropriate settings for production to reduce the amount of tuning everyone has to do.

Add a circuit breaker when making gRPC requests to ingesters to avoid making the request if the ingester is down or hitting per-instance limits. All other errors are ignored by the circuit breaker and result in the usual behavior.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

jhalterman

Nicely done 👍

Switch to failsafe circuit breaker implementation that allows us to define an error rate over a moving window instead of new windows every N seconds. This helps high-traffic clusters, where the long tail of requests might exceed the timeout often in raw numbers but still very infrequently compared to request volume.

Related #5650

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

56quarters force-pushed the 56quarters/circuit-breaker branch from 108ff49 to b97f94e Compare August 1, 2023 17:47

56quarters force-pushed the 56quarters/instance-errors branch from 2295ff5 to 26d0e0d Compare August 1, 2023 20:11

56quarters force-pushed the 56quarters/circuit-breaker branch from b97f94e to 57eee0e Compare August 1, 2023 20:11

56quarters force-pushed the 56quarters/instance-errors branch from 26d0e0d to aec69ee Compare August 3, 2023 13:45

56quarters force-pushed the 56quarters/circuit-breaker branch from d172ebf to f7cf432 Compare August 3, 2023 13:59

Base automatically changed from 56quarters/instance-errors to main August 3, 2023 17:25

56quarters force-pushed the 56quarters/circuit-breaker branch from f7cf432 to ee1bf45 Compare August 3, 2023 17:28

56quarters force-pushed the 56quarters/circuit-breaker branch from e47a3a9 to cd875e8 Compare August 8, 2023 20:01

56quarters marked this pull request as ready for review August 8, 2023 20:09

56quarters requested review from a team as code owners August 8, 2023 20:09

56quarters requested a review from jhalterman August 9, 2023 14:39

pracucci self-requested a review August 18, 2023 15:12

pracucci approved these changes Aug 18, 2023

56quarters added 3 commits August 22, 2023 10:58

Log when the circuit breaker changes state

67bd2d0

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Code review changes

e9fc920

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

56quarters force-pushed the 56quarters/circuit-breaker branch from 233818c to e9fc920 Compare August 22, 2023 15:27

56quarters requested a review from grafanabot as a code owner August 22, 2023 15:27

56quarters added 4 commits August 22, 2023 11:35

Make the linter happy

1a62ff6

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Fix YAML configuration names

7fa5807

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Update YAML field names

45f59ac

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Only circuit-break ingester methods

e312654

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

56quarters requested a review from pracucci August 22, 2023 21:11

jhalterman approved these changes Aug 23, 2023

56quarters merged commit 057e3e4 into main Aug 23, 2023
27 checks passed

56quarters deleted the 56quarters/circuit-breaker branch August 23, 2023 17:31

56quarters mentioned this pull request Sep 6, 2023

Switch to failsafe-go circuit breaker implementation #5951

Merged

3 tasks

duricanikolic mentioned this pull request Sep 19, 2023

Ingester: test hitting per-instance limits with circuit breaker #6065

Merged

3 tasks

duricanikolic mentioned this pull request May 24, 2024

Adding circuit breakers on ingester server side for write path #8180

Merged

4 tasks
