[http] Adaptive timeouts for idle HTTP client connections waiting for a request #11427

Closed
antoniovicente opened this issue Jun 3, 2020 · 50 comments · Fixed by #14155

@antoniovicente
Contributor

HTTP1 downstream connections are either waiting for the first byte of the request headers, parsing request headers, or handling the active request. We benefit from a relatively long timeout for connections with no active request, as it allows for effective prefetching and connection reuse. We expect a relatively short time between the arrival of the first and last bytes of the request headers, which could be covered by a shorter timeout.

Similar logic could also be applied to H2 connections with 0 active requests.

The timeouts for idle connections with no active requests and for connections parsing headers could be reduced based on memory pressure, to increase resiliency to memory exhaustion attacks and improve the proxy's ability to accept legitimate connections/requests. Config strawman: min/max timeouts plus high/low memory thresholds or high/low connection counts at which the timeouts apply.

@akonradi
Contributor

akonradi commented Jun 5, 2020

High/low memory thresholds and connection counts sound like something that should be done through the overload manager. There is already a resource monitor for watching memory usage, and it shouldn't be too bad to add one for the number of open connections. We could then wire that into an action that modifies either all or a subset of connection timeouts.

@mattklein123
Member

@akonradi yup that's the plan. Our goal is to thread the overload manager through more places and use it for active/passive adaptive thresholds. cc @tonya11en

@akonradi
Contributor

akonradi commented Jun 5, 2020

Has there been any work on adding a resource monitor for connection counts? If not I'd be happy to take that on.

@mattklein123
Member

Has there been any work on adding a resource monitor for connection counts? If not I'd be happy to take that on.

There has, let's sync up on this briefly offline.

@antoniovicente
Contributor Author

@tonya11en

One thing to think about is how to implement adaptive timeouts. The current practice of using independent timers to track request timeouts makes reducing timeouts under memory pressure challenging. You can't rely on the timeout value being known at the time you want to schedule the timer. A potential solution may involve use of a timer list for operations that are waiting for timeouts with the same parameters, plus logic on how to interpret the scaled/adjusted timeouts.

@akonradi
Contributor

akonradi commented Jul 21, 2020

I've been thinking about this for a while and after a couple discombobulated PRs, I've come up with a way to implement adaptive timeouts fairly efficiently. TL;DR: have the thread-local Overload Manager state hand out timers that can be scaled after the fact, and use a new range trigger and "scale" action to reduce those timeouts when load increases. Read on for the gory details.

Range timer

The existing Overload Manager uses Timer objects to track connection timeouts. Since the Timer interface isn't especially amenable to scaling, my plan is to introduce a new RangeTimer class like this that requires the caller to specify a minimum and maximum timeout value.
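
For illustration, here is a minimal sketch of what such an interface might look like; the exact names and signatures below are assumptions, not the final API.

#include <chrono>
#include <memory>

// Sketch of a RangeTimer interface: the caller supplies both a minimum and a
// maximum timeout, and the implementation may fire anywhere in that window
// depending on current load.
class RangeTimer {
public:
  virtual ~RangeTimer() = default;

  // Arms the timer to fire no earlier than min_duration and no later than
  // max_duration from now.
  virtual void enableTimer(std::chrono::milliseconds min_duration,
                           std::chrono::milliseconds max_duration) = 0;

  // Cancels the timer if it is pending.
  virtual void disableTimer() = 0;

  // Returns whether the timer is currently armed.
  virtual bool enabled() = 0;
};

using RangeTimerPtr = std::unique_ptr<RangeTimer>;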

Scaled timer manager

This is a new object that acts as a factory for RangeTimers which can be scaled in response to load. I imagine an API that looks something like this:

class ScaledTimerManager {
public:
  ScaledTimerManager(Dispatcher& dispatcher);
  /**
   * Creates a new range timer whose actual timeout values depend on the current scale factor.
   */
  RangeTimerPtr createRangeTimer(TimerCb);
  /**
   * Updates the scale factor used for new and previously created range timers.
   * @param scale_factor must be in the range [0, 1].
   */
  void setScaleFactor(float scale_factor);
};

An efficient implementation of this class is discussed further down.
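
As a rough usage sketch (the dispatcher and callback body are stand-ins, and the assumed scale-factor semantics are that 1 means unscaled while smaller values shorten timeouts):

ScaledTimerManager manager(dispatcher);

// A connection arms a timer with a [min, max] window for its idle timeout;
// the effective deadline depends on the manager's current scale factor.
RangeTimerPtr idle_timer = manager.createRangeTimer([] { /* close the idle connection */ });
idle_timer->enableTimer(std::chrono::seconds(30), std::chrono::minutes(10));

// Later, as resource pressure rises, the overload machinery shrinks all
// outstanding and future timeouts toward their minimums.
manager.setScaleFactor(0.25f);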

'Scaling' state for overload manager actions

The existing Overload Manager deals with actions that are either on or off. Since timeout scaling works best when done gradually in response to load, I'd like to rename 'inactive' to scaling and 'active' to saturated. The result is a thin wrapper around a float that is clamped to the range [0, 1], where [0, 1) = scaling and 1 = saturated. The currently implemented actions can simply check for saturation instead of for the 'active' state.
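
A minimal sketch of that wrapper, assuming the semantics described here (the method names are illustrative, apart from isSaturated(), which is discussed further below):

#include <algorithm>

// Action state as a float clamped to [0, 1]: values in [0, 1) mean "scaling",
// exactly 1 means "saturated".
class OverloadActionState {
public:
  explicit OverloadActionState(float value) : value_(std::clamp(value, 0.0f, 1.0f)) {}

  float value() const { return value_; }

  // Existing on/off actions only need to know whether the trigger is fully engaged.
  bool isSaturated() const { return value_ == 1.0f; }

private:
  float value_;
};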

Overload Manager Range trigger

The existing threshold trigger will continue to produce inactive and saturated action states. To enable the new functionality, a new RangeTrigger message will be added to the OverloadManager config. I imagine it'll look something like this:

message RangeTrigger {
  // If the resource pressure is greater than this value, the trigger will enter the
  // :ref:`scaling <arch_overview_overload_manager-triggers-state>` state.
  double min_value = 1 [(validate.rules).double = {lte: 1.0 gte: 0.0}];

  // If the resource pressure is greater than this value, the trigger will enter saturation.
  double max_value = 2 [(validate.rules).double = {lte: 1.0 gte: 0.0}];
}

The range trigger will produce action states that scale linearly between 0 (inactive) and 1 (saturated) as resource pressure varies between min_value and max_value (and are clamped outside that range).
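
In other words, something along these lines (a sketch of the mapping only, not the trigger implementation):

// Maps resource pressure to an action state: 0 below min_value, 1 above
// max_value, and linear interpolation in between.
float actionStateForPressure(double pressure, double min_value, double max_value) {
  if (pressure <= min_value) {
    return 0.0f;
  }
  if (pressure >= max_value) {
    return 1.0f;  // saturated
  }
  return static_cast<float>((pressure - min_value) / (max_value - min_value));
}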

When the (still-to-be-defined) "scale connection timeouts" overload action is enabled, the ThreadLocalOverloadStateImpl object will update its ScaledTimerManager with the new scale value as it changes, thus reducing timeouts.

Bonus: how to implement ScaledTimerManager

To implement the proposed API efficiently, the ScaledTimerManager needs to be able to rescale all timers in O(1) time. This can be implemented by using bucketing to group timers by the allowed variance in timeout (i.e. (max - min)). This basically puts each timer on a fixed track determined by the timeout for the bucket. To make this easier to implement, each range timer will actually be implemented as a Timer that expires after min and then places the range timer in the bucket for (max - min).

Each bucket has a Timer object and a list of structs that back enabled ScaledRangeTimers in that bucket. The backing struct lists the time when the range timer was enabled and the allowed variance. The timer for the bucket is always enabled to fire at the desired point in time for the front range timer in the list, taking the current scaling factor into account. As the scaling factor changes, the timer for each bucket is updated.

This doesn't scale timers perfectly uniformly since a timer that is supposed to fire with variance 3 minutes might get put in the 3 minute bucket with one supposed to fire with variance 4 minutes, but it does satisfy the RangeTimer API. Then all timers in the same bucket can be scaled together since they are ordered by enqueue time and changing the scaling factor doesn't reorder their target trigger times - it just moves them closer or further apart, like points on the bellows of an accordion.

Having each ScaledRangeTimer inserted in at most one bucket and maintaining a reference to its position in the bucket's std::list of structs means that disabling timers takes O(1) time. Adjusting the scale factor is still non-trivial since it requires touching every bucket, but choosing buckets so their durations grow exponentially keeps the number of buckets bounded. With buckets sized at a factor of 2 and 64-bit duration values, there are still only 64 possible buckets.
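
To make the bookkeeping concrete, here is a data-layout sketch of the bucketing approach; the type and member names are illustrative, not the eventual implementation.

// One entry per enabled ScaledRangeTimer that has passed its min duration.
struct ScaledTimerEntry {
  MonotonicTime enable_time;           // when the timer entered the bucket
  std::chrono::milliseconds variance;  // requested (max - min)
  Event::TimerCb callback;
};

struct Bucket {
  // Armed for the front entry's scaled deadline:
  //   enable_time + scale_factor * bucket_duration.
  Event::TimerPtr timer;
  // Ordered by enable_time; a scale-factor change never reorders entries.
  std::list<ScaledTimerEntry> entries;
};

// Buckets keyed by exponentially growing durations; each ScaledRangeTimer
// keeps an iterator to its entry so disabling it is an O(1) list erase.
std::map<std::chrono::milliseconds, Bucket> buckets_;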

@antoniovicente
Contributor Author

Tell me more about how the scaling of timers in a bucket would work.

Let's say that we are operating in the bucket with timers that have timeouts (120 sec, 240 secs]
Let's say that we add the following timers:
240 sec timer added at time N (deadline: N+240)
180 sec timer added at time N+59 (deadline: N+239)
150 sec timer added at time N+88 (deadline: N+238)

At time N+121 we process an event to scale timers down to 50% and run expired timers. Computing the scaled deadline using an efficient algorithm like expiring timers at N+121+bucket_upper_range*50% (also known as N+241) would result in all 3 of the timers above triggering. Is that acceptable? If not, how do we make case above more correct while also remaining efficient?

@htuch
Member

htuch commented Jul 21, 2020

Throwing out another alternative, apologies if this is old ground. To simplify, could we effectively just have two clocks: one that is regular real time for regular timers and another that is scaled? We might have a minimum/maximum scaling factor for the clock, but all timers on the scaled clock are effectively in some EDF queue and maintain relative order. Let's say you want to maintain min bounds in this world. You could provide this by rescheduling if an event happens too early for its min bound. When you rescale the clock, each pending timer might need a correctional rescheduling if it happens before its min bound in real time.

This might appear to be O(n), but I'd argue it's actually O(1) with amortized analysis (you pay at most one reschedule per timer event). If you reduce the scaling factor for the variable speed clock, then you don't pay any extra cost. If you increase it, each time you do this you could potentially force a reschedule. So the complexity might be O(m) where m is the number of times you reschedule within the max deadline of the set of timers. I'm thinking m should be small, since you probably don't want to be continuously churning anyway.

Addition: borrowing from Alex's idea above of treating the portion of time before the min bound and after it differently, what if we place a ScaledTimer first in the real-time queue, up to its min bound. Once that is hit, it gets transferred to the variable-time queue. The max bound is enforced by maintaining an event for the ScaledTimer on the real-time queue. Once in variable mode, the ScaledTimer fires when either its max-bound timer or the scaled one expires. This has no rescheduling cost and comes at the expense of maintaining possibly up to two timer events for every real timer.
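
A hand-wavy sketch of the variable-speed-clock conversion, under the assumption that a speed-up factor of 1 is real time and larger factors make scaled time elapse faster, pulling deadlines in:

#include <chrono>

// Converts a timer's scaled duration into a real-time deadline under the
// current clock speed-up. A separate check (or one-time reschedule) keeps the
// timer from firing before its min bound when the factor changes.
std::chrono::steady_clock::time_point
realDeadline(std::chrono::steady_clock::time_point enabled_at,
             std::chrono::milliseconds scaled_duration, double speedup) {
  const auto real_duration =
      std::chrono::duration_cast<std::chrono::milliseconds>(scaled_duration / speedup);
  return enabled_at + real_duration;
}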

@akonradi
Contributor

@antoniovicente That's consistent with what I'm proposing. Since timers are only added to a bucket once their min time has elapsed, they can be legally triggered anywhere between 0 and their requested max value. The actual triggering sequence is determined only by the time at which the timer is added to a bucket and by which bucket it gets placed in. Bucket intervals should be thought of as [min, max), where all timers placed in them will trigger in min * scaling factor seconds after they are placed.

So for your example (assuming a [120, 241) bucket), the target times for each timer would be

  • 240: N + 0 + 120 = N + 120
  • 180: N + 59 + 120 = N + 179
  • 150: N + 88 + 120 = N + 208

At N+120, the first timer would trigger. At N+121 the scaling factor changes to 50% and the second timer triggers because its scaled deadline is N + 59 + (.5 * 120) = N + 119. The third timer would then trigger at N + 88 + 60 = N + 148, in another 27 seconds.

@akonradi
Contributor

@htuch I had been assuming that scaled time was always at least as fast as regular time so I don't think we need the sentinel timer. I'm not familiar with EDF scheduling but it looks like the implementation in Envoy (thanks for the pointer) picks the next task in O(log n) time. Maybe this is a good time to ask what exactly the time complexities should be for this implementation? I've gone through a couple iterations of design that range from O(n log n) rescaling to O(1). The buckets in this proposal allow for all operations to be O(1) at the cost of imprecision (a 5-10 second timer can trigger after 6 seconds, even if the scaling factor is 1). I could imagine a std::set or heap-based implementation that does all operations in O(log n) but is much more correct with respect to scaling the timeouts in the requested range.

@antoniovicente
Contributor Author

@antoniovicente That's consistent with what I'm proposing. Since timers are only added to a bucket once their min time has elapsed, they can be legally triggered anywhere between 0 and their requested max value. The actual triggering sequence is determined only by the time at which the timer is added to a bucket and by which bucket it gets placed in. Bucket intervals should be thought of as [min, max), where all timers placed in them will trigger in min * scaling factor seconds after they are placed.

So for your example (assuming a [120, 241) bucket), the target times for each timer would be

  • 240: N + 0 + 120 = N + 120
  • 180: N + 59 + 120 = N + 179
  • 150: N + 88 + 120 = N + 208

At N+120, the first timer would trigger. At N+121 the scaling factor changes to 50% and the second timer triggers because its scaled deadline is N + 59 + (.5 * 120) = N + 119. The third timer would then trigger at N + 88 + 60 = N + 148, in another 27 seconds.

I think that the case you're describing ends up rounding the specified timeout up to the top of the range in the bucket. I think that does work better since at least it ensures that timers don't fire much earlier than intended. They do fire later than expected.

We should not do an O(n) recomputation of deadlines in each of the buckets when the scaling factor changes.

@akonradi
Contributor

I'd rather fire the timers early, but within their [min, max] window, than much later, since with the proposed bucketing a timer scheduled for [min, max] could end up firing almost 2 * max after the scheduling time, which seems bad.

Right, O(n) definitely seems bad. O(log n) seems acceptable though?

@antoniovicente
Contributor Author

I'd rather fire the timers early, but within their [min, max] window, than much later, since with the proposed bucketing a timer scheduled for [min, max] could end up firing almost 2 * max after the scheduling time, which seems bad.

I don't know whether early or late is preferable. That said, it makes some sense to me to have timers trigger at the right value when not in overload, which may argue for an implementation similar to the one in my example above, which does result in timers triggering early under partial overload.

Use of a smaller growth factor between bucket sizes may be helpful as a way to reduce the delta between when timers should trigger and when they do. Using 1.4 instead of 2 would provide finer granularity at the expense of up to 2x as many buckets.

Right, O(n) definitely seems bad. O(log n) seems acceptable though?

Can we do better than log n? Earlier I mentioned something about tracking which buckets have items added to them recently. Focusing on the buckets with higher add rates and presumably the most items in them provides the most benefit.

Another possible avenue to consider is to try to reduce the total number of distinct timeout values and thus avoid the need for bucketing.
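
For a rough sense of the bucket counts involved (assuming, purely for illustration, that bucket durations need to span roughly 1 ms to 1 hour):

#include <cmath>
#include <cstdio>

int main() {
  // Number of exponentially sized buckets needed to cover 1 ms .. 1 hour.
  const double range_ms = 3'600'000.0;
  std::printf("growth factor 2.0: %d buckets\n",
              static_cast<int>(std::ceil(std::log(range_ms) / std::log(2.0))));  // ~22
  std::printf("growth factor 1.4: %d buckets\n",
              static_cast<int>(std::ceil(std::log(range_ms) / std::log(1.4))));  // ~45, about twice as many
  return 0;
}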

@mattklein123
Member

Question: in the initial implementation do we even need to modify existing timers? I'm wondering if scaling only new timers is good enough, which would give time for old timers to age out. The system wouldn't be nearly as responsive, but it would be much simpler.

@antoniovicente
Contributor Author

@antoniovicente That's consistent with what I'm proposing. Since timers are only added to a bucket once their min time has elapsed, they can be legally triggered anywhere between 0 and their requested max value. The actual triggering sequence is determined only by the time at which the timer is added to a bucket and by which bucket it gets placed in. Bucket intervals should be thought of as [min, max), where all timers placed in them will trigger in min * scaling factor seconds after they are placed.

So for your example (assuming a [120, 241) bucket), the target times for each timer would be

  • 240: N + 0 + 120 = N + 120
  • 180: N + 59 + 120 = N + 179
  • 150: N + 88 + 120 = N + 208

At N+120, the first timer would trigger. At N+121 the scaling factor changes to 50% and the second timer triggers because its scaled deadline is N + 59 + (.5 * 120) = N + 119. The third timer would then trigger at N + 88 + 60 = N + 148, in another 27 seconds.

Use of 120 as the deadline for all of these 3 timers despite the requested values being higher seems very strange.

@akonradi
Contributor

Use of a smaller growth factor between bucket sizes may be helpful as a way to reduce the delta between when timers should trigger and when they do. Using 1.4 instead of 2 would provide finer granularity at the expense of up to 2x as many buckets.

Yeah if we only instantiate buckets when they're used, and empty buckets are mostly free, then an exponential growth factor of 1.4 isn't that bad, especially if there are very few distinct [min, max] values.

Can we do better than log n?

Better for which operations? The ones I'm looking at are enabling a timer, disabling a timer, changing the scale factor, and executing a callback.

Use of 120 as the deadline for all of these 3 timers despite the requested values being higher seems very strange.

That's a fundamental property of the bucketing strategy. It relies on quantization to be efficient. And while strange, it doesn't violate the proposed RangeTimer contract: the timers are being triggered somewhere in the requested range, just earlier than they could have been.

@mattklein123 Antonio and I discussed this a bit, and I think the opinion is that when dealing with connection timeouts that are on the scale of minutes, reducing timeouts for future timers won't necessarily do much to actually address overload.

@htuch I thought a bit more about it, and I think my suggestion of using an ordered container for tracking timers by their current trigger time (taking the scaling factor into account) isn't workable, since two timers' trigger times can be reordered by a change in scaling factor. I think you're onto something with thinking about this in terms of a separate scaled time point/clock, and will think about that some more.

@akonradi
Contributor

akonradi commented Jul 23, 2020

I've been talking with a couple of y'all offline about how to implement the ScaledTimerManager. I wanted to circle back and see if the rest of the proposal seemed reasonable though, specifically the augmentation of the Overload Manager's action state and addition of a range trigger.

+ @eziskind who designed the OM and provided feedback on #11697

@akonradi
Contributor

akonradi commented Jul 28, 2020

I've written an alternate implementation based on the suggestion from Harvey about scaling time itself. You can see this approach here. It uses the same ScaledTimerManager API as the bucketing approach.

I'd appreciate feedback about the larger proposal outlined above (ignoring the Bonus section) regarding the Overload Manager, which is unaffected by the choice of scaled timer implementations.

@antoniovicente
Contributor Author

The proposal in #11427 (comment) looks good from my perspective.

The only piece of feedback I have is about the "'Scaling' state for overload manager actions" section: You may want to talk about OverloadActionState::isSaturated(), which is the new way to determine whether it is time to stop accepting connections/requests or take other saturation actions. Boolean actions only care whether the state is saturated or not saturated. Consider removing "inactive" from the list of possible states.

The "Bonus: how to implement ScaledTimerManager" section may be out of date since you have been exploring several different options on how to implement scaled timers. I made some high level comments on the variable-speed clock prototype mentioned in the previous comment.

@akonradi
Contributor

Consider removing "inactive" from the list of possible states.

That's a good idea. I guess there's really no value in having a separate "inactive" state since existing actions will treat the "inactive" and "scaling" states identically. I've updated the proposal above.

@akonradi
Contributor

Thanks, I've sent #12352 for the OM changes, which can be made independent of the scaled timers.

@antoniovicente
Contributor Author

/assign @akonradi

@mattklein123
Member

@akonradi would it be possible to write a short doc on the recommended option, and the alternatives considered? This thread is getting pretty long and I'm having a hard time keeping track. Thank you!

@akonradi
Contributor

I've written a short design doc, available here, that discusses scaled timers, the recommended option, and the alternatives. Please take a look, and feel free to leave feedback here or there.

akonradi added a commit to akonradi/envoy that referenced this issue Sep 16, 2020
Add a ScaledRangeTimerManager class that produces RangeTimers that can
be adjusted after the fact by changing the scaling factor on the
manager. This class implements the dynamic queues approach suggested
in envoyproxy#11427#issuecomment-691154144. The code is tested but not yet
integrated anywhere.

Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi
Contributor

Now that #13129 is almost in, it's time to talk about how scaled timers should be configured. Here are some options:

  1. replace existing timeout config fields with XXX_min and XXX_max fields
  2. for an existing timeout config field XXX, add a min_XXX or XXX_min_scaled field
  3. add an opaque config field to the OverloadAction message and use it with a new overload action to spell out a list of timer names and adjustment values
    a) where adjustments are specified as a percentage of the existing time
    b) where adjustments are specified as absolute minimums
    c) where adjustments are specified as overrides and are absolute [min, max] values

I lean towards 3.a since it avoids adding a bunch of extra config fields, keeps the overload config together, and doesn't make any existing fields or values redundant (like 3.c would).
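
To make 3.a concrete, here is a hypothetical shape for the opaque action config; the message, field, and enum names below are invented for illustration and are not a proposed API.

message ScaleTimersOverloadActionConfig {
  enum TimerType {
    UNSPECIFIED = 0;
    HTTP_DOWNSTREAM_CONNECTION_IDLE = 1;
    HTTP_DOWNSTREAM_STREAM_IDLE = 2;
  }
  message ScaleTimer {
    // Which timer this entry applies to.
    TimerType timer = 1;
    // The minimum timeout, expressed as a percentage of the statically
    // configured value (option 3.a).
    envoy.type.v3.Percent min_scale = 2;
  }
  repeated ScaleTimer timer_scale_factors = 1;
}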

@htuch
Member

htuch commented Sep 30, 2020

@akonradi I like 3.a or 3.b. 3.c is somewhat redundant, and 1/2 add a lot of complexity for a feature that is likely not used by most users and that other xDS clients won't want to implement. CC @envoyproxy/api-shepherds

@akonradi
Contributor

I prefer 3.a over 3.b since it's easier to eyeball a percent and say "that looks reasonable" without the context of what the max is.

@mattklein123
Member

Yeah I have a strong preference for (3). I like 3a if I had to pick one, though it seems like we could eventually have 3a and 3b, etc. if we wanted.

@antoniovicente
Contributor Author

I have some worry about there being a big separation between the places where the original timeout and the threshold/min are configured.

The most worrying timeouts would be ones per cluster, where it may be appropriate to use different thresholds by cluster. I guess that max(min_override, threshold * full_timeout) would probably work in general.

That said, the first 2 timeouts that we want to apply scaling logic for are timeouts waiting for SSL handshake and timeout waiting for HTTP/1.1 request headers. Use of approach 3a or 3b for those seems totally fine.

@mattklein123
Member

That said, the first 2 timeouts that we want to apply scaling logic for are timeouts waiting for SSL handshake and timeout waiting for HTTP/1.1 request headers. Use of approach 3a or 3b for those seems totally fine.

I agree about config discoverability concerns. My feeling is let's try 3a/3b for these 2 cases and see how it looks/feels and then we can go from there?

@antoniovicente
Contributor Author

antoniovicente commented Oct 1, 2020

Another thing to consider: the two timeouts mentioned above don't exist yet. #13341 implements the first of the two timeouts. We have the option to introduce a timeout config message with either a max and an optional min, or a timeout and an optional scale.

3a/3b still seems fine since it does allow us to make forward progress.

@mattklein123
Member

We have the option to introduce a timeout config message with either a max and an optional min, or a timeout and an optional scale.

Yeah good point. It would be fine to introduce a new config wrapper for this type of timeout, though then we might end up in a situation in which some have the new message and some use 3a/3b and we would end up with deprecations. I could go either way.

@htuch
Member

htuch commented Oct 1, 2020

I'm good with a timeout message. I'm fine also with limited deprecation if the number of sites is small, i.e. I can count it on one hand. I wouldn't advise an API-wide replacement with new message and deprecation as that is a lot of churn for all places we have timeouts. It's important to keep in mind that most users will never use this feature and all known xDS clients other than Envoy (i.e. gRPC, but others are coming) are unlikely to support this.

@antoniovicente
Contributor Author

I talked with Alex about this offline and understand his proposal better now. I think he's going for option 3a/3b.

@akonradi
Contributor

akonradi commented Oct 7, 2020

I've added a prototype that supports options 3a & 3b here.

@htuch
Member

htuch commented Oct 9, 2020

@akonradi can you enumerate the set of timer types? I'm wondering how large/small it is?

@akonradi
Contributor

akonradi commented Oct 9, 2020

I was thinking

I know Antonio has also mentioned an additional timer for (and I'm badly paraphrasing) the TLS handshake, but I don't know that we'd want to scale that during overload.

@htuch
Member

htuch commented Oct 12, 2020

OK, so two of them are new and can be done with a dedicated type, the other two wouldn't be so bad to deprecate/migrate, but I'm OK with 3a.

@antoniovicente
Contributor Author

I was thinking

I know Antonio has also mentioned an additional timer for (and I'm badly paraphrasing) the TLS handshake, but I don't know that we'd want to scale that during overload.

We want a timeout that covers from the accept of a new SSL connection until the SSL handshake completes. Yes, we should scale that timeout during overload.

@antoniovicente removed the "help wanted" label on Nov 20, 2020
@antoniovicente
Contributor Author

I was thinking

  • HTTP connection idle timeout

IIRC #13475 added scaling for this timeout

  • HTTP stream idle timeout

Not sure what plans we have for this case. I know a timeout exists; do we want to provide a way to scale this timeout? I think we could go either way here.

Do we have plans to scale this timeout? If we expect this to be configured to a small value most of the time, I think it would be fine to omit scaling.

I know Antonio has also mentioned an additional timer for (and I'm badly paraphrasing) the TLS handshake, but I don't know that we'd want to scale that during overload.

Issue #11426 covers the SSL connection case.

Based on the replies to the questions above, we should decide whether there's work remaining on this issue, or whether it is time to close it and declare victory.

@akonradi
Contributor

Scaling the stream idle timeout should be pretty trivial, and should help with resource attacks, since otherwise the connection idle timeout is useless for preventing them. I don't have a strong opinion about the header processing timeout since it's supposed to be small.

@akonradi
Contributor

akonradi commented Dec 7, 2020

From discussion on #14290: Should we consider scaling the drain timeout for HTTP connections? From Antonio:

If we allow the downstream connection timeout to scale to 0 secs, a 5 second drain timeout may seem like an eternity.
