fix: respect service's suggested retryAfter when throttled #39

ddneilson · 2023-09-25T16:21:45Z

What was the problem/requirement? (What/Why)

When a call to a deadline cloud service API returns a throttle/retry response, the exception object may contain a "retryAfterSeconds" field alongside the error. When that field is present, the calling client should treat it as a request to wait at least that many seconds before retrying; it is a load-shedding mechanism for the service. We should respect the service's request.
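
For illustration only, here is a minimal sketch of reading that field. It assumes the parsed error response is a plain dict shaped like botocore's `ClientError.response`, with "retryAfterSeconds" at the top level; that location and the helper name `get_retry_after_seconds` are assumptions for this example, not the code in this PR.

```python
from typing import Any, Dict, Optional


def get_retry_after_seconds(error_response: Dict[str, Any]) -> Optional[float]:
    """Return the service's suggested minimum wait in seconds, or None.

    Hypothetical helper: assumes the field sits at the top level of the
    parsed error response.
    """
    value = error_response.get("retryAfterSeconds")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    # A negative wait makes no sense; treat it as if the field were absent.
    if value < 0:
        return None
    return float(value)
```

Returning `None` for a missing or malformed value lets the caller fall back to its normal backoff without special-casing.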

What was the solution? (How)

Extended all of the deadline-cloud API wrappers to extract the value of the "retryAfterSeconds" field, if it's present, and pass it to our backoff-delay calculator. We use the value as a lower limit on the returned delay.
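
A rough sketch of the lower-limit idea follows. It uses full-jitter exponential backoff as a stand-in; the actual algorithm in the backoff-delay calculator is not described here, and the function name, parameters, and defaults are all assumptions for this example.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    *,
    base: float = 0.5,
    max_delay: float = 30.0,
    min_delay: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter exponential backoff that is never shorter than min_delay.

    Hypothetical helper: min_delay would carry the service's
    retryAfterSeconds value when one was returned.
    """
    source = rng if rng is not None else random
    # The exponential ceiling grows with each attempt, capped at max_delay.
    ceiling = min(max_delay, base * (2 ** attempt))
    delay = source.uniform(0.0, ceiling)
    if min_delay is not None:
        # The floor is applied last, so it wins even if it exceeds max_delay:
        # the service explicitly asked for at least this much time.
        delay = max(delay, min_delay)
    return delay
```

Applying the floor after the jitter keeps the randomized spread for ordinary retries while guaranteeing the client never retries sooner than the service asked.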
I also made the scheduler use the API wrapper for update_worker; it still had its own implementation that didn't properly handle exceptions. This necessitated adding the ability to interrupt update_worker's throttled retries, to preserve the existing functionality at that call site.
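
To show what interrupting throttled retries could look like, here is a sketch built on `threading.Event`. The `Throttled` and `Interrupted` exceptions, `call_with_retries`, and the `delay_for` callback are hypothetical names for this example and do not mirror the wrapper's real signature.

```python
import threading
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class Throttled(Exception):
    """Hypothetical stand-in for a throttle error returned by the service."""

    def __init__(self, retry_after_seconds: Optional[float] = None) -> None:
        super().__init__("throttled")
        self.retry_after_seconds = retry_after_seconds


class Interrupted(Exception):
    """Raised when the caller asks the retry loop to stop."""


def call_with_retries(
    call: Callable[[], T],
    interrupt: threading.Event,
    delay_for: Callable[[int, Optional[float]], float],
) -> T:
    """Retry `call` while it is throttled, stopping early if `interrupt` is set."""
    attempt = 0
    while True:
        if interrupt.is_set():
            raise Interrupted()
        try:
            return call()
        except Throttled as exc:
            delay = delay_for(attempt, exc.retry_after_seconds)
            # Event.wait returns True if the event was set before the timeout
            # elapsed, so the sleep itself can be cut short.
            if interrupt.wait(timeout=delay):
                raise Interrupted() from exc
            attempt += 1
```

Waiting on the event instead of calling `time.sleep` is what lets another thread, such as a scheduler that is shutting down, end a long service-requested delay immediately.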

What is the impact of this change?

The worker agent will be a kinder, more cooperative client to the deadline cloud service.

How was this change tested?

This part of the code is well unit tested, so I just ensured that the unit tests cover the new functionality.

Was this change documented?

N/A

Is this a breaking change?

No

src/deadline_worker_agent/aws/deadline/__init__.py

jusiskin

The changes look good, but my primary concern is how this change would affect the load on the service. We'd done a prior analysis in #20 on the backoff algorithm being used and this might change the outcome.

src/deadline_worker_agent/boto/retries.py

When a call to a deadline cloud service API returns a throttle/retry response, the exception object may contain a "retryAfterSeconds" field alongside the error. When that field is present, the calling client should treat it as a request to wait at least that many seconds before retrying; it is a load-shedding mechanism for the service. We should respect the service's request.

Solution: Extended all of the deadline-cloud API wrappers to extract the value of the "retryAfterSeconds" field, if it's present, and pass it to our backoff-delay calculator. We use the value as a lower limit on the returned delay. I also made the scheduler use the API wrapper for update_worker; it still had its own implementation that didn't properly handle exceptions. This necessitated adding the ability to interrupt update_worker's throttled retries, to preserve the existing functionality at that call site.

Signed-off-by: Daniel Neilson <53624638+ddneilson@users.noreply.github.com>
Signed-off-by: Graeme McHale <gmchale@amazon.com>
Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>

ddneilson requested a review from a team as a code owner September 25, 2023 16:21

jusiskin self-requested a review September 26, 2023 14:45

jusiskin added the enhancement label Sep 26, 2023

ddneilson commented Sep 26, 2023

src/deadline_worker_agent/aws/deadline/__init__.py (outdated)

ddneilson commented Sep 26, 2023

src/deadline_worker_agent/aws/deadline/__init__.py (outdated)

ddneilson force-pushed the ddneilson/13268 branch 2 times, most recently from e3abf72 to e4a8739 Compare September 27, 2023 18:56

jericht self-requested a review September 27, 2023 19:49

jericht approved these changes Sep 27, 2023

jusiskin reviewed Sep 27, 2023

src/deadline_worker_agent/boto/retries.py (outdated)

ddneilson force-pushed the ddneilson/13268 branch from e4a8739 to eff5b41 Compare October 18, 2023 17:53

ddneilson force-pushed the ddneilson/13268 branch from eff5b41 to 92087e5 Compare October 18, 2023 18:00

jericht approved these changes Oct 18, 2023

jusiskin approved these changes Oct 18, 2023

jusiskin merged commit d83066a into mainline Oct 18, 2023
9 checks passed

jusiskin deleted the ddneilson/13268 branch October 18, 2023 19:10

client-software-ci mentioned this pull request Feb 22, 2024

chore(release): 0.21.0 #172

Closed
