Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: respect service's suggested retryAfter when throttled #39

Merged
merged 1 commit into from
Oct 18, 2023

Conversation

ddneilson
Copy link
Contributor

What was the problem/requirement? (What/Why)

When calling a deadline cloud service API and getting a throttle/retry response the exception object may contain a "retryAfterSeconds" field alongside the error. When that field is present, the calling client should treat that as a request to retry in no sooner than the given number of seconds; it is a load-shedding mechanism for the service. We should respect the service's request.

What was the solution? (How)

Added to the logic of all of the deadline-cloud API wrappers to have
them extract the value of the "retryAfterSeconds" field if it's present, and pass that to our backoff-delay calculator. We use the value as a lower limit on the returned delay.
I also made the scheduler use the API wrapper for update_worker; it
still had its own implementation that didn't properly handle exceptions. This necessitated adding the ability to interrupt the update_worker's throttled-retries so preserve the functionality at that call site.

What is the impact of this change?

The worker agent will be a more kind and cooperative client to the deadline cloud service.

How was this change tested?

This part of the code is well unit tested, so I just ensured that the unit tests cover the new functionality.

Was this change documented?

N/A

Is this a breaking change?

No

@ddneilson ddneilson requested a review from a team as a code owner September 25, 2023 16:21
@jusiskin jusiskin self-requested a review September 26, 2023 14:45
@jusiskin jusiskin added the enhancement New feature or request label Sep 26, 2023
@ddneilson ddneilson force-pushed the ddneilson/13268 branch 2 times, most recently from e3abf72 to e4a8739 Compare September 27, 2023 18:56
@jericht jericht self-requested a review September 27, 2023 19:49
Copy link
Contributor

@jusiskin jusiskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good, but my primary concern is how this change would affect the load on the service. We'd done a prior analysis in #20 on the backoff algorithm being used and this might change the outcome.

src/deadline_worker_agent/boto/retries.py Outdated Show resolved Hide resolved
When calling a deadline cloud service API and getting a throttle/retry
response the exception object may contain a "retryAfterSeconds" field
alongside the error. When that field is present, the calling client
should treat that as a request to retry in no sooner than the given number
of seconds; it is a load-shedding mechanism for the service. We should
respect the service's request.

Solution:
 Added to the logic of all of the deadline-cloud API wrappers to have
them extract the value of the "retryAfterSeconds" field if it's present,
and pass that to our backoff-delay calculator. We use the value as a
lower limit on the returned delay.
 I also made the scheduler use the API wrapper for update_worker; it
still had its own implementation that didn't properly handle exceptions.
This necessitated adding the ability to interrupt the update_worker's
throttled-retries so preserve the functionality at that call site.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
@jusiskin jusiskin merged commit d83066a into mainline Oct 18, 2023
9 checks passed
@jusiskin jusiskin deleted the ddneilson/13268 branch October 18, 2023 19:10
gmchale79 pushed a commit that referenced this pull request Oct 31, 2023
When calling a deadline cloud service API and getting a throttle/retry
response the exception object may contain a "retryAfterSeconds" field
alongside the error. When that field is present, the calling client
should treat that as a request to retry in no sooner than the given number
of seconds; it is a load-shedding mechanism for the service. We should
respect the service's request.

Solution:
 Added to the logic of all of the deadline-cloud API wrappers to have
them extract the value of the "retryAfterSeconds" field if it's present,
and pass that to our backoff-delay calculator. We use the value as a
lower limit on the returned delay.
 I also made the scheduler use the API wrapper for update_worker; it
still had its own implementation that didn't properly handle exceptions.
This necessitated adding the ability to interrupt the update_worker's
throttled-retries so preserve the functionality at that call site.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
Signed-off-by: Graeme McHale <gmchale@amazon.com>
gmchale79 pushed a commit that referenced this pull request Nov 2, 2023
When calling a deadline cloud service API and getting a throttle/retry
response the exception object may contain a "retryAfterSeconds" field
alongside the error. When that field is present, the calling client
should treat that as a request to retry in no sooner than the given number
of seconds; it is a load-shedding mechanism for the service. We should
respect the service's request.

Solution:
 Added to the logic of all of the deadline-cloud API wrappers to have
them extract the value of the "retryAfterSeconds" field if it's present,
and pass that to our backoff-delay calculator. We use the value as a
lower limit on the returned delay.
 I also made the scheduler use the API wrapper for update_worker; it
still had its own implementation that didn't properly handle exceptions.
This necessitated adding the ability to interrupt the update_worker's
throttled-retries so preserve the functionality at that call site.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
Signed-off-by: Graeme McHale <gmchale@amazon.com>
gahyusuh pushed a commit that referenced this pull request Nov 6, 2023
When calling a deadline cloud service API and getting a throttle/retry
response the exception object may contain a "retryAfterSeconds" field
alongside the error. When that field is present, the calling client
should treat that as a request to retry in no sooner than the given number
of seconds; it is a load-shedding mechanism for the service. We should
respect the service's request.

Solution:
 Added to the logic of all of the deadline-cloud API wrappers to have
them extract the value of the "retryAfterSeconds" field if it's present,
and pass that to our backoff-delay calculator. We use the value as a
lower limit on the returned delay.
 I also made the scheduler use the API wrapper for update_worker; it
still had its own implementation that didn't properly handle exceptions.
This necessitated adding the ability to interrupt the update_worker's
throttled-retries so preserve the functionality at that call site.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
Signed-off-by: Graeme McHale <gmchale@amazon.com>
Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
gmchale79 pushed a commit that referenced this pull request Feb 12, 2024
When calling a deadline cloud service API and getting a throttle/retry
response the exception object may contain a "retryAfterSeconds" field
alongside the error. When that field is present, the calling client
should treat that as a request to retry in no sooner than the given number
of seconds; it is a load-shedding mechanism for the service. We should
respect the service's request.

Solution:
 Added to the logic of all of the deadline-cloud API wrappers to have
them extract the value of the "retryAfterSeconds" field if it's present,
and pass that to our backoff-delay calculator. We use the value as a
lower limit on the returned delay.
 I also made the scheduler use the API wrapper for update_worker; it
still had its own implementation that didn't properly handle exceptions.
This necessitated adding the ability to interrupt the update_worker's
throttled-retries so preserve the functionality at that call site.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
Signed-off-by: Graeme McHale <gmchale@amazon.com>
gmchale79 pushed a commit that referenced this pull request Mar 11, 2024
When calling a deadline cloud service API and getting a throttle/retry
response the exception object may contain a "retryAfterSeconds" field
alongside the error. When that field is present, the calling client
should treat that as a request to retry in no sooner than the given number
of seconds; it is a load-shedding mechanism for the service. We should
respect the service's request.

Solution:
 Added to the logic of all of the deadline-cloud API wrappers to have
them extract the value of the "retryAfterSeconds" field if it's present,
and pass that to our backoff-delay calculator. We use the value as a
lower limit on the returned delay.
 I also made the scheduler use the API wrapper for update_worker; it
still had its own implementation that didn't properly handle exceptions.
This necessitated adding the ability to interrupt the update_worker's
throttled-retries so preserve the functionality at that call site.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
Signed-off-by: Graeme McHale <gmchale@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants