Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

router: add new ratelimited retry backoff strategy #12202

Merged
merged 37 commits into from
Aug 14, 2020

Conversation

numerodix
Copy link
Contributor

This PR implements a new retry back off strategy that uses values from response headers like Retry-After or X-RateLimit-Reset (the headers are configurable) to decide the back off interval before retrying a request.

This is PR (2/2) that addresses issue #12200. See also the design document: https://docs.google.com/document/d/1NSzrx3-KsAFs0ObaLQZUVIt9O4PZv3mXi3Lm7LNzmZE/edit?usp=sharing

Signed-off-by: Martin Matusiak numerodix@gmail.com

Risk Level: Low
Testing: New unit tests added for all code changes.
Docs Changes: added
Release Notes: added

@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #12202 was opened by numerodix.

see: more, trace.

@zuercher zuercher self-assigned this Jul 21, 2020
// 1. An interval in seconds as an integer, for example: ``120``.
// 2. A unix timestamp as an integer, for example: ``1595320702``. The timestamp
// must be in the future relative to the current system time.
repeated HeaderMatcher reset_headers = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if there is a way to explain this with less jargon to the user. We're basically just taking whatever he backend tells us and using this as the duration for next retry?

What happens when there are multiple retries and the backend only specifies a retry for the first?

Copy link
Contributor Author

@numerodix numerodix Jul 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if there is a way to explain this with less jargon to the user. We're basically just taking whatever he backend tells us and using this as the duration for next retry?

Yes, that's right. What do you think would be a better way of wording it?

What happens when there are multiple retries and the backend only specifies a retry for the first?

The backend doesn't anticipate that multiple retries will be necessary. It just says "try again in 3 seconds, which is when your quota will be refilled". If at that time your quota is up again (because you've made a lot of requests from other threads/other machines) then the backend will give you a new duration which it has the discretion to decide any way it wants to.

Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Some comments below, mostly around the API changes which I think will have some knock-on effects on your code.

// 1. An interval in seconds as an integer, for example: ``120``.
// 2. A unix timestamp as an integer, for example: ``1595320702``. The timestamp
// must be in the future relative to the current system time.
repeated HeaderMatcher reset_headers = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments for this field:

  1. Should this have a validation annotation that requires at least one value? Or is there a default?
  2. I'm not sure HeaderMatcher is the right type. You could easily specify a HeaderMatcher that would only match headers whose values do not conform to the described values. For example, matching a header named retry-after with a value prefixed by x is a legal HeaderMatcher. I also think inverted matches aren't likely to be useful. I think maybe we just want repeated string reset_headers?
  3. Should the interval value be in milliseconds instead of seconds? I think that matches most of our other header-related intervals (timeouts and such).
  4. I think we should consider some way to make the semantics of these values unambiguous. The interpretation of large values can change over time as current time gets larger. One way would be to maintain a separate set of interval vs. fixed time headers (though this implies an extra x-envoy header as well).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing to at least document is what happens when multiple conflicting headers are present

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing to at least document is what happens when multiple conflicting headers are present

I've added this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have a validation annotation that requires at least one value? Or is there a default?

Yes, it should. Without at least one header this will do nothing. I've added that.

I'm not sure HeaderMatcher is the right type. You could easily specify a HeaderMatcher that would only match headers whose values do not conform to the described values. For example, matching a header named retry-after with a value prefixed by x is a legal HeaderMatcher. I also think inverted matches aren't likely to be useful. I think maybe we just want repeated string reset_headers?

I completely agree. I had misgivings about using HeaderMatcher here, because the only field that it makes senses to use is name. Using exact_match, range_match, regex_match, present_match, prefix/suffix or invert makes no sense, because as you say, the user might make the header no longer match.

The reason I used HeaderMatcher is that this is the UX pattern we have in the retry policy. All the other header related fields use it, so users don't have to learn a new type.

I'm happy to change it if you think it makes more sense to use repeated string reset_headers?

Should the interval value be in milliseconds instead of seconds? I think that matches most of our other header-related intervals (timeouts and such).

This is not something we (ie. Envoy) control. The fields Retry-After and X-RateLimit-Reset are standardized:

But in the wild they are not always used consistently. I discuss this in the design doc (linked in the first PR comment).

I think we should consider some way to make the semantics of these values unambiguous. The interpretation of large values can change over time as current time gets larger. One way would be to maintain a separate set of interval vs. fixed time headers (though this implies an extra x-envoy header as well).

I see where you're coming from, but I'm skeptical here because I've found that the two commonly used representations today are a) seconds as an int and b) unix timestamp as an int. That's why I propose supporting these two. If in the future we wanted to support a date in the HTTP date format (which is mentioned in the Retry-After RFC but does not seem to be used much) then we'd have to add yet another parameter + header for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to use repeated string reset_headers, but defer to the api shepherds.

For the ambiguity between interval & timestamp we could split the headers into those that take intervals and those that take timestamps. For example, either by having separate header lists or introducing a message with header name and an enum that describes the expected type of value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, how about something like this:

  message RateLimitedRetryBackOff {
    message ResetHeader {
      enum ValueFormat {
        INT_SECONDS = 0;
        INT_UNIX_TIMESTAMP = 1;
      }

      string name = 1 [
        (validate.rules).string = {min_bytes: 1 well_known_regex: HTTP_HEADER_NAME strict: false}
      ];

      ValueFormat format = 2;
    }

    repeated ResetHeader reset_headers = 1 [(validate.rules).repeated = {min_items: 1}];

    google.protobuf.Duration reset_max_interval = 2 [(validate.rules).duration = {gt {}}];
  }

This makes it pretty extensible for adding other formats in the future if we need to.

As for the header based API, in the interest of not adding a separate header for each format we could do something like this:

x-envoy-ratelimited-reset-headers: Retry-After:INT_SECONDS, X-RateLimit-RESET:INT_UNIX_TIMESTAMP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 or @alyssawilk would you like to provide direction?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That addresses my concerns. But we do need someone from @envoyproxy/api-shepherds to approve in Harvey's stead.

@@ -17,6 +17,7 @@

#include "absl/container/inlined_vector.h"
#include "absl/strings/string_view.h"
#include "absl/types/optional.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unused import?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thank you!

PROTOBUF_GET_OPTIONAL_MS(retry_policy.rate_limited_retry_back_off(), reset_max_interval);
if (ratelimited_reset_max_interval_ && ((*ratelimited_reset_max_interval_).count()) < 1) {
ratelimited_reset_max_interval_ = std::chrono::milliseconds(1);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to provide the 300s default here, as noted in the proto definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking here was that in the RetryPolicy this member is optional and doesn't have a default value. So it reflects whatever was configured by the user (possibly nothing).

Then in RetryStateImpl it has a default defined in the header file. If no value was passed via config nor http header, this default will stand. Otherwise it will be overridden by user input. This test should bear that out:
https://github.com/envoyproxy/envoy/pull/12202/files#diff-0a6a6327b31a217b22c653b1d1ebf4acR914

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a mild preference for doing it here to keep the lookup from the proto and the default together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I see the value in moving the default to where most of the other defaults live. I've moved it to RetryPolicy now.


// Try to parse the value of the header as an int storing the number of seconds
if (absl::SimpleAtoi(header_value, &num_seconds)) {
parsed_value = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a break? Or else add a note in the proto def comments that in the case of multiple matches the last one takes precedence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I missed that when refactoring. I added a test to lock in the behavior of first matching header is used.

@numerodix numerodix force-pushed the ratelimited-backoff-strategy branch 3 times, most recently from 9ae0d0d to bce0cc1 Compare July 22, 2020 22:49
@mattklein123 mattklein123 self-assigned this Jul 27, 2020
@mattklein123
Copy link
Member

Please merge master and I will take a look. Thank you!

/wait

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this neat feature. A few API/interface questions to get started. Thank you!

/wait

Comment on lines 1099 to 1101
// 1. An interval in seconds as an integer, for example: ``120``.
// 2. A unix timestamp as an integer, for example: ``1595320702``. The timestamp
// must be in the future relative to the current system time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can you differentiate between an interval and a unix time stamp? Doesn't there need to be some other format included or some configuration to tell what type of format the header is? It seems like maybe it should be a repeated list of messages where each message includes a HeaderMatcher and then there is a format enum or something?

Also, can you clarify in the docs what happens if the format is invalid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Retry-After spec allows http-date as a format but not a unix timestamp.

The X-Ratelimit-Reset draft spec allows IMF-fixdate, but it is not recommended. Also doesn't specify unix timestamp as a possibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Pchelolo !

Yes, that's correct. I'm aware of the standards, however in the wild they are not followed as written. I have surveyed a number of public APIs which you can find below and it's based on this that I've proposed to support the above two formats:
https://docs.google.com/document/d/1NSzrx3-KsAFs0ObaLQZUVIt9O4PZv3mXi3Lm7LNzmZE/edit#heading=h.nd3aike2c678

Copy link
Contributor Author

@numerodix numerodix Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial idea was that we could tell whether the value in the header represents a number of seconds or a Unix timestamp. For it to be a valid Unix timestamp in a retry context it needs to be greater than the current system time. @zuercher convinced me (above) that we should instead have a dedicated protobuf message that captures both pieces of information: a) the header name and b) the format

Also, can you clarify in the docs what happens if the format is invalid?

Added

Comment on lines 1104 to 1107
// Specifies the maximum rate limited retry back off interval that Envoy
// will allow. If a rate limited reset header contains an interval longer
// than this then the default exponential back off strategy will be used
// instead. Defaults to 300 seconds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also clarify that this request is still governed by any global timeouts in place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to the docs

// the rate limited retry back off strategy will be used instead of the
// default exponential back off strategy (configured using `retry_back_off`)
// whenever a response includes the matching headers.
RateLimitedRetryBackOff rate_limited_retry_back_off = 11;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about folding this in to the RetryBackoff message? It seems logically related since it will govern what the final backoff time is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely possible. I figured it made more sense to make it clear that it's a separate code path with its own parameters and the new stat upstream_rq_retry_backoff_ratelimited to reflect whether it's being used or not.

In the doc update I pushed the mechanism does take a fair bit of explaining, so I think that might be another reason to keep them separated.

Comment on lines 276 to 277
HEADER_FUNC(EnvoyRateLimitedResetHeaders) \
HEADER_FUNC(EnvoyRateLimitedResetMaxIntervalMs) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not crazy about adding new inline headers here that very few people will use. See my comments above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger that, I'll remove them.

Comment on lines 779 to 786
/**
* This getter exists so that once matchHeaders() has been called on a HeaderMap the concrete
* header in that map can be extracted by using this getter to discover its name. This is possible
* in all cases, except when invert_match_ is true, because in the latter case name_ would *not*
* reflect the matching header.
* @return the header name the matcher will match against.
*/
virtual const LowerCaseString* name() const PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused how this can be idempotent. If you need this information shouldn't it be returned by matchesHeaders?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the way the HeaderMatcher is used is first we call matchesHeaders(request_headers). If that returns true we still don't know which request header actually matched, only that one of them did. Thus I needed to add this getter function to extract what the header name actually is. With that I can get the value from request_headers.

Comment on lines 260 to 270
x-envoy-ratelimited-reset-headers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Setting this header will extend the list of rate limited reset headers specified in the :ref:`route configuration rate limited retry back off
<envoy_v3_api_field_config.route.v3.RetryPolicy.RateLimitedRetryBackOff.reset_headers>`.

x-envoy-ratelimited-reset-max-interval-ms
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Setting this header will override the rate limited reset max interval specified in the :ref:`route configuration rate limited retry back off
<envoy_v3_api_field_config.route.v3.RetryPolicy.RateLimitedRetryBackOff.reset_max_interval>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the headers are configurable, why do we need these well defined headers? Can we remove them to make this simpler? I'm not completely clear what they do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an attempt to allow users to configure the new back off strategy using either the bootstrap or using headers, like so:

# bootstrap API
retry_policy:
  # as before - rename to exponential_retry_back_off ?
  retry_back_off:
    base_interval: 25
    max_interval: 250
 
  rate_limited_retry_back_off:
    reset_headers: \
      Retry-After,RateLimit-Reset,X-RateLimit-Reset,X-Rate-Limit-Reset
    reset_max_interval: 5000
 
# header API
x-envoy-ratelimited-reset-headers: \
  Retry-After,RateLimit-Reset,X-RateLimit-Reset,X-Rate-Limit-Reset
x-envoy-ratelimited-reset-max-interval-ms: 5000

But I'm thinking that's overkill and having the header based API is not really useful.

Comment on lines 428 to 429
upstream_rq_retry_backoff_exponential, Counter, Total retries using the exponential backoff strategy
upstream_rq_retry_backoff_ratelimited, Counter, Total retries using the ratelimited backoff strategy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can skip this on the virtual cluster. I doubt this will be used practically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, removed it.

@numerodix
Copy link
Contributor Author

/wait

@Pchelolo
Copy link
Contributor

FYI: I'm currently working on #12356 that will ultimately provide the X-RateLimit-Reset header.

@numerodix
Copy link
Contributor Author

numerodix commented Jul 31, 2020

I think my initial stab at this PR was a little under documented. I've pushed a decent update to the docs which aims to improve this.

It includes the new protobuf message hashed out with the help of @zuercher. It also includes an example config and walks the user through exactly what will happen, including the various corner cases. My hope is that this makes the design much clearer and addresses @htuch's concern above. I belive it also clarifies most of your questions, Matt.

I haven't removed the "old" fields from the protobuf because the existing code still depends on them.

The doc changes:
eeae8e8#diff-762b98ebf225e9214f643752cf504760

The initial implementation was leaning on HeaderMatcher, which doesn't quite fit the use case. In order to process the new ResetHeader protobuf message in code I need to implement an abstraction similar to HeaderMatcher, but it will be specific to the use case of this PR. I'm thinking it will have an interface in header_map.h and an implementation similar to header_utility.h. @mattklein123 does this approach sound good to you?

@mattklein123
Copy link
Member

Can you merge main and fix CI and I can take another look? Thank you.

/wait

@numerodix numerodix marked this pull request as draft August 1, 2020 04:37
@numerodix
Copy link
Contributor Author

Converted to draft because I need to implement the new design to make the code consistent with the api and get all the CI builds green.

@numerodix numerodix force-pushed the ratelimited-backoff-strategy branch from 233d730 to 950e7e2 Compare August 1, 2020 12:10
@numerodix numerodix marked this pull request as ready for review August 1, 2020 21:50
@numerodix
Copy link
Contributor Author

Ready now, @mattklein123

Signed-off-by: Martin Matusiak <numerodix@gmail.com>
Signed-off-by: Martin Matusiak <numerodix@gmail.com>
Signed-off-by: Martin Matusiak <numerodix@gmail.com>
Signed-off-by: Martin Matusiak <numerodix@gmail.com>
Signed-off-by: Martin Matusiak <numerodix@gmail.com>
Signed-off-by: Martin Matusiak <numerodix@gmail.com>
@numerodix
Copy link
Contributor Author

I think we're good here, @mattklein123 ?

To the person merging this - can I get a little help getting this over the line? PRs are merged to master all the time and create conflicts against the release notes and the stats integration test. I've fixed it a bunch of times already, but if it conflicts again during US working hours I won't have a chance to resolve it until my Sydney morning and it won't have a chance to get merged until a day later, by which time there's probably a new conflict. So would you mind fixing the conflict if you see one? That'd be awesome!

@mattklein123
Copy link
Member

I believe @zuercher was owning the review here. @zuercher can you see this through?

Signed-off-by: Martin Matusiak <numerodix@gmail.com>
Signed-off-by: Martin Matusiak <numerodix@gmail.com>
Signed-off-by: Martin Matusiak <numerodix@gmail.com>
Signed-off-by: Martin Matusiak <numerodix@gmail.com>
zuercher
zuercher previously approved these changes Aug 13, 2020
@zuercher
Copy link
Member

@mattklein123 still needs api approval

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than a merge issue. Thank you!

/wait

@@ -62,7 +62,7 @@ New Features
* postgres network filter: :ref:`metadata <config_network_filters_postgres_proxy_dynamic_metadata>` is produced based on SQL query.
* ratelimit: added :ref:`enable_x_ratelimit_headers <envoy_v3_api_msg_extensions.filters.http.ratelimit.v3.RateLimit>` option to enable `X-RateLimit-*` headers as defined in `draft RFC <https://tools.ietf.org/id/draft-polli-ratelimit-headers-03.html>`_.
* rbac filter: added a log action to the :ref:`RBAC filter <envoy_v3_api_msg_config.rbac.v3.RBAC>` which sets dynamic metadata to inform access loggers whether to log.
* redis: added fault injection support :ref:`fault injection for redis proxy <envoy_v3_api_field_extensions.filters.network.redis_proxy.v3.RedisProxy.faults>`, described further in :ref:`configuration documentation <config_network_filters_redis_proxy>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you had a merge issue here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Fixed now.

Signed-off-by: Martin Matusiak <numerodix@gmail.com>
@mattklein123
Copy link
Member

@numerodix please never force push. It makes reviews incredibly difficult. Just add commits and merge main. Thank you!

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@numerodix
Copy link
Contributor Author

@mattklein123 Thanks! Can you merge this now?

@mattklein123 mattklein123 merged commit bd2b989 into envoyproxy:master Aug 14, 2020
mpuncel added a commit to mpuncel/envoy that referenced this pull request Aug 14, 2020
* master: (67 commits)
  logger: support log control in admin interface and command line option for Fancy Logger (envoyproxy#12369)
  test: fix http_timeout_integration_test flake (envoyproxy#12654)
  [fuzz]added an input check in writefilter fuzzer and added test cases (envoyproxy#12628)
  add 'explicit' restriction. (envoyproxy#12643)
  scoped_rds_integration_test migrate from api v2 to api v3. (envoyproxy#12633)
  fuzz: added fuzz test for listener filter tls_inspector (envoyproxy#12617)
  testing: fix multiple race conditions in simulated time tests (envoyproxy#12527)
  [tls] Move handshaking behavior into SslSocketInfo. (envoyproxy#12571)
  header: getting rid of exception-throwing behaviors in header files [the rest] (envoyproxy#12611)
  router: add new ratelimited retry backoff strategy (envoyproxy#12202)
  [redis_proxy] added a constraint for route.prefix().size() (envoyproxy#12637)
  network: add tcp listener backlog config (envoyproxy#12625)
  runtime: debug log that condition is always true when fractionalPercent numerator > denominator (envoyproxy#12068)
  WatchDog Extension hook (envoyproxy#12416)
  router: add dynamic metadata header formatter (envoyproxy#11858)
  statsd: revert visibility to public (envoyproxy#12621)
  Fix regression of /build_* in gitignore (envoyproxy#12630)
  Added a missing extension point to documentation. (envoyproxy#12620)
  Reverts proxy protocol test on windows (envoyproxy#12619)
  caching: Improved the tests and coverage of the CacheFilter tree (envoyproxy#12544)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants