Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional circuit breaker to ingester client #5650

Merged
merged 7 commits into from
Aug 23, 2023

Conversation

56quarters
Copy link
Contributor

@56quarters 56quarters commented Aug 1, 2023

What this PR does

Add a circuit breaker when making gRPC requests to ingesters to avoid making
the request if the ingester is down or hitting per-instance limits. All other
errors are ignored by the circuit and result in the usual behaivor.

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Base automatically changed from 56quarters/instance-errors to main August 3, 2023 17:25
}

// MakeIngesterClient makes a new IngesterClient
func MakeIngesterClient(addr string, cfg Config, metrics *Metrics) (HealthAndIngesterClient, error) {
dialOpts, err := cfg.GRPCClientConfig.DialOption(grpcclient.Instrument(metrics.RequestDuration))
unary, stream := grpcclient.Instrument(metrics.requestDuration)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note from testing: this wraps the HealthClient methods with a circuit breaker too which is probably not what we want.

ts=2023-08-04T19:51:10.560584491Z caller=pool.go:196 level=warn msg="removing ingester failing healthcheck" addr=XXX:9095 reason="circuit breaker is open"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The circuit breaker interceptor knows the method too. What if configure the circuit breaker to only act on a list of configured methods, so that we're explicit on which gRPC methods we want the circuit breaker to apply?

Copy link
Contributor Author

@56quarters 56quarters Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about only applying the interceptor to methods that contain cortex.Ingester? This would avoid the need to keep a list of methods up to date and allow non-ingester things like the health check to bypass it. Or would you prefer an explicit list of methods?

Copy link
Contributor Author

@56quarters 56quarters Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a list of methods to circuit-break in e312654, LMK what you think.

@56quarters
Copy link
Contributor Author

Note that I've marked this functionality as experimental because we really do need to experiment with this at scale before enabling it by default. It's also possible the implementation will change completely (@jhalterman has some ideas about what this behavior would look like in an ideal world). However, I think it's at a point where it makes sense to include it in main.

@56quarters 56quarters marked this pull request as ready for review August 8, 2023 20:09
@56quarters 56quarters requested review from a team as code owners August 8, 2023 20:09
@pracucci pracucci self-requested a review August 18, 2023 15:12
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! LGTM. I left few comments I would be glad if you could look at before merging, thanks!

CHANGELOG.md Outdated Show resolved Hide resolved
pkg/ingester/client/client.go Outdated Show resolved Hide resolved
}

// MakeIngesterClient makes a new IngesterClient
func MakeIngesterClient(addr string, cfg Config, metrics *Metrics) (HealthAndIngesterClient, error) {
dialOpts, err := cfg.GRPCClientConfig.DialOption(grpcclient.Instrument(metrics.RequestDuration))
unary, stream := grpcclient.Instrument(metrics.requestDuration)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The circuit breaker interceptor knows the method too. What if configure the circuit breaker to only act on a list of configured methods, so that we're explicit on which gRPC methods we want the circuit breaker to apply?

pkg/ingester/client/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/client/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/client/circuitbreaker.go Outdated Show resolved Hide resolved
pkg/ingester/client/circuitbreaker.go Show resolved Hide resolved
@jhalterman
Copy link
Member

jhalterman commented Aug 18, 2023

This is nice overall. @56quarters and I talked about this - we should consider switching to a time based circuit breaker when one is available (hopefully soon). This would allow us to threshold off of recent failure % rather than consecutive failures (which is not great for dynamic systems).

That aside, I do have a few suggestions:

  • Could we obscure some of the circuit breaker details, so that users don't need to have to know/care how that pattern works? A few things that could help with this:
    • We could rename circuit-breaker-max-consecutive-failures to circuit-breaker-failure-threshold, and use that same value for ReadyToTrip and MaxRequests (since both of these states are basically doing failure thresholding). That would allow us to remove the circuit-breaker-max-half-open-requests setting.
    • We could remove circuit-breaker-closed-interval and just use the default interval (0) since we're only thresholding off of consecutive failures, and the total number of successes/failures don't matter.
    • We should rename circuit-breaker-open-timeout to something like circuit-breaker-open-duration or circuit-breaker-half-open-delay since timeout implies the breaker might be open for <= that time when in practice it's = that time. Alternatively, we might consider calling this something like circuit-breaker-cooldown-period so that users don't have to know what open/half-open mean.

I'm also curious if you have an idea what settings we might use in prod?

@56quarters
Copy link
Contributor Author

This is nice overall. @56quarters and I talked about this - we should consider switching to a time based circuit breaker when one is available (hopefully soon). This would allow us to threshold off of recent failure % rather than consecutive failures (which is not great for dynamic systems).

That aside, I do have a few suggestions:

  • Could we obscure some of the circuit breaker details, so that users don't need to have to know/care how that pattern works? A few things that could help with this:

  • We could rename circuit-breaker-max-consecutive-failures to circuit-breaker-failure-threshold, and use that same value for ReadyToTrip and MaxRequests (since both of these states are basically doing failure thresholding). That would allow us to remove the circuit-breaker-max-half-open-requests setting.

👍

  • We could remove circuit-breaker-closed-interval and just use the default interval (0) since we're only thresholding off of consecutive failures, and the total number of successes/failures don't matter.

👍

  • We should rename circuit-breaker-open-timeout to something like circuit-breaker-open-duration or circuit-breaker-half-open-delay since timeout implies the breaker might be open for <= that time when in practice it's = that time. Alternatively, we might consider calling this something like circuit-breaker-cooldown-period so that users don't have to know what open/half-open mean.

👍

I'm also curious if you have an idea what settings we might use in prod?

The defaults (10s cooldown, 10 consecutive failures) worked well in my testing. I'd like to make the defaults the appropriate settings for production to reduce the amount of tuning everyone has to do.

Add a circuit breaker when making gRPC requests to ingesters to avoid making
the request if the ingester is down or hitting per-instance limits. All other
errors are ignored by the circuit and result in the usual behaivor.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Copy link
Member

@jhalterman jhalterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done 👍

@56quarters 56quarters merged commit 057e3e4 into main Aug 23, 2023
27 checks passed
@56quarters 56quarters deleted the 56quarters/circuit-breaker branch August 23, 2023 17:31
56quarters added a commit that referenced this pull request Sep 6, 2023
Switch to failsafe circuit breaker implementation that allows us to define
an error rate over a moving window instead new windows every N seconds. Helps
for high traffic clusters where the long tail of requests might exceed the
timeout enough in raw numbers but still very infrequently compared to request
volume.

Related #5650

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this pull request Sep 13, 2023
Switch to failsafe circuit breaker implementation that allows us to define
an error rate over a moving window instead new windows every N seconds. Helps
for high traffic clusters where the long tail of requests might exceed the
timeout enough in raw numbers but still very infrequently compared to request
volume.

Related #5650

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants