Resource-based circuit breaking #3332

danielhochman · 2018-05-09T17:27:32Z

Description

Envoy should have the ability to circuit break on system resources like CPU.

Circuit breakers at ingress are used to protect our hosts from resource exhaustion. To determine circuit breaker thresholds, we run a "redline" test, which gradually ramps up traffic on a single host until it degrades. We note rq_active at that point, then set the threshold to that value less some buffer.
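
As a rough illustration of the threshold calculation described above, the sketch below derives a limit from the rq_active value observed at the point of degradation. The function name and the 25% buffer are hypothetical; the issue does not say how large the buffer is.

```python
import math


def threshold_from_redline(rq_active_at_degradation: int,
                           buffer_fraction: float = 0.25) -> int:
    """Return a concurrency limit set below the observed redline value.

    buffer_fraction is an assumed safety margin, not a value from the issue.
    """
    if rq_active_at_degradation <= 0:
        raise ValueError("rq_active_at_degradation must be positive")
    if not 0.0 <= buffer_fraction < 1.0:
        raise ValueError("buffer_fraction must be in [0, 1)")
    # Round down so the limit never exceeds the intended margin; keep it >= 1.
    return max(1, math.floor(rq_active_at_degradation * (1.0 - buffer_fraction)))


# Host degraded at 200 in-flight requests; keep a 25% buffer.
print(threshold_from_redline(200))  # prints 150
```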

Over time, this ends up being a poor approximation of the real bottleneck for most of our services, which is CPU:

- If any of the service's dependencies slow down, a single host can handle additional concurrency without exhausting its local resources.
- If a service (the local service or a downstream) ships code that changes the overall load profile and/or the overall request mix, the circuit breaker may no longer protect the service, or it may circuit break too early.
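
A small worked example of the first point above, using the approximation that in-flight requests equal request rate times latency (Little's law). The numbers and function names are made up for illustration, and the sketch assumes that time spent waiting on a dependency costs no local CPU.

```python
def active_requests(rate_rps: int, latency_ms: int) -> float:
    """Approximate in-flight requests: arrival rate times time in system."""
    return rate_rps * latency_ms / 1000


def cpu_utilization(rate_rps: int, cpu_ms_per_request: int, cores: int) -> float:
    """Fraction of total CPU capacity consumed by request handling."""
    return rate_rps * cpu_ms_per_request / (1000 * cores)


# Same request rate and per-request CPU cost; only dependency latency changes.
healthy = active_requests(rate_rps=500, latency_ms=100)
slow_dependency = active_requests(rate_rps=500, latency_ms=200)
cpu = cpu_utilization(rate_rps=500, cpu_ms_per_request=4, cores=4)

print(healthy, slow_dependency, cpu)  # prints 50.0 100.0 0.5
```

With a fixed rq_active threshold of, say, 80, the host would start rejecting traffic in the second case even though CPU sits at 50% in both.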

Working out the platform-dependent implementation and the algorithm will be the fun part. I'd like to get a first impression from other users before getting into that.

mattklein123 · 2018-05-09T23:36:24Z

IMO this is best done as a dedicated filter, as the circuit breaking is a bit different from what we currently do and I think it's pretty self-contained. I think I would make this a general resource-based ingress circuit breaking filter that could eventually be extended to memory and other things. As long as we have the right platform abstractions for getting the information we need, I think this sounds like a very useful feature to add.
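
A minimal sketch of the shape such a filter might take, written in Python for brevity rather than against Envoy's actual filter API. All names here (ResourceCircuitBreaker, the monitor callables, the threshold values) are hypothetical; the point is only that resources plug in behind a common "current utilization" abstraction, so memory or others can be added later.

```python
from typing import Callable, Dict


class ResourceCircuitBreaker:
    """Rejects new requests while any monitored resource is at or over its limit."""

    def __init__(self,
                 monitors: Dict[str, Callable[[], float]],
                 thresholds: Dict[str, float]) -> None:
        missing = set(thresholds) - set(monitors)
        if missing:
            raise ValueError(f"no monitor for: {sorted(missing)}")
        self._monitors = monitors
        self._thresholds = thresholds

    def allow_request(self) -> bool:
        # Each monitor returns utilization in [0, 1]; the platform-specific
        # work of obtaining that number lives behind the callable.
        return all(self._monitors[name]() < limit
                   for name, limit in self._thresholds.items())


# Stand-in readings; a real monitor would sample the platform.
readings = {"cpu": 0.50, "memory": 0.40}
breaker = ResourceCircuitBreaker(
    monitors={"cpu": lambda: readings["cpu"],
              "memory": lambda: readings["memory"]},
    thresholds={"cpu": 0.90, "memory": 0.85},
)
```

A filter built this way would call allow_request() once per incoming request and return an error response when it yields False.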

alyssawilk · 2018-06-11T13:53:29Z

I think we may also want to tie this in to the centralized system for #373. I can imagine hitting some threshold (event loop time?) at which we simply stop accepting new requests so we can make forward progress on existing ones.

stale · 2018-07-11T14:12:08Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

eightnoteight · 2020-07-10T15:33:54Z

Came across this issue and wondered about the implementation in the context of containers, with Envoy running as a sidecar in an ECS task or a k8s pod.

One way to do this would be to share the process namespace, so that Envoy can track the CPU usage of the other container via /proc/{main-container-pid}/root/sys/fs/cgroup/cpuacct/cpuacct.usage_all

But this solution puts too many requirements on the end user: sharing the process namespace, running Envoy as root, and allowing Envoy to access the entire disk of the main container.

Is there any other easy and secure way to track system resources in a container-based system?
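
For reference, the cgroup-based sampling mentioned in this comment could look roughly like the sketch below. It assumes the cgroup v1 cpuacct.usage_all format: a "cpu user system" header followed by one line per CPU with cumulative nanosecond counters. The function names are hypothetical, and reading the file (with the namespace and permission requirements noted above) is left to the caller.

```python
def total_cpu_ns(usage_all_text: str) -> int:
    """Sum user + system nanoseconds across all CPUs in cpuacct.usage_all."""
    total = 0
    for line in usage_all_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        total += int(fields[1]) + int(fields[2])
    return total


def utilization(prev_ns: int, curr_ns: int, elapsed_s: float, cores: int) -> float:
    """CPU utilization in [0, 1] between two samples of the cumulative counter."""
    if elapsed_s <= 0 or cores <= 0:
        raise ValueError("elapsed_s and cores must be positive")
    return (curr_ns - prev_ns) / (elapsed_s * 1e9 * cores)


# Made-up sample contents for a two-CPU container.
SAMPLE = "cpu user system\n0 100 50\n1 200 25\n"
print(total_cpu_ns(SAMPLE))  # prints 375
```

Taking two such samples a known interval apart and passing them to utilization() gives a value that a CPU-based breaker could compare against its threshold.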

danielhochman added the "enhancement" label May 9, 2018

mattklein123 added this to the 1.7.0 milestone May 11, 2018

mattklein123 assigned danielhochman May 11, 2018

mattklein123 modified the milestones: 1.7.0, 1.8.0 May 28, 2018

stale bot added the "stale" label Jul 11, 2018

danielhochman added the "help wanted" label Jul 11, 2018

stale bot removed the "stale" label Jul 11, 2018

mattklein123 modified the milestones: 1.8.0, 1.9.0 Sep 21, 2018

mattklein123 removed this from the 1.9.0 milestone Oct 5, 2018
