Backoff retries in the activator. #1814

Merged
@knative-prow-robot merged 2 commits into knative:master on Aug 10, 2018

Conversation

@markusthoemmes (Contributor) commented Aug 8, 2018

Fixes #1229

Proposed Changes

Added an exponential backoff to the activator's retry logic. In the process, I lowered the initial timeout (we might need to tune it a bit to hit a sweet spot). The total retry time is now bounded by the elapsed time spent retrying plus requesting.

To determine a good retry interval, the following table can help, though production data on how many retries are actually needed will help tune it further.

[image: table of retry intervals]

Regarding tests: I didn't find any for this specific file. I'd love to add some, but I'd need some guidance on how to do so.

Release Note

Added an exponential backoff to the activator's retry logic

@knative-prow-robot knative-prow-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 8, 2018
@markusthoemmes
Contributor Author

/test pull-knative-serving-integration-tests

@markusthoemmes
Contributor Author

/test pull-knative-serving-integration-tests

@markusthoemmes
Contributor Author

/assign @josephburnett

@@ -68,6 +68,10 @@ type retryRoundTripper struct {
	start time.Time
}

func (rrt *retryRoundTripper) CalculateDelay(retries int, minRetryInterval time.Duration) time.Duration {
	return time.Duration(int(minRetryInterval/time.Millisecond)*retries*retries) * time.Millisecond
Contributor

I believe this is quadratic, not exponential. What we want is an aggressive retry during normal activation times, but a quickly growing retry interval thereafter, which is easier to achieve with an exponential curve because of its hockey-stick shape.

In my experience, a small base like 1.3 is a good start, with the retry index as the exponent. Then multiply by the minimum retry interval.

E.g. return time.Duration(float64(minRetryInterval) * math.Pow(1.3, float64(retries)))

It would look something like this. (The actual numbers should be tuned, but the point is to keep the curve low and fast until we leave normal operating conditions.)

[image: plot of the suggested exponential retry-interval curve]

Contributor Author

Doh. Of course it's quadratic... very much my bad. Thanks for pointing that out, I'll fix accordingly.

Contributor Author

Went for Base=1.3, MinRetry=100ms for now, giving me a progression as shown in the table:

[image: table of the retry-interval progression]

@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 10, 2018
@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to run the coverage report again

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/activator/util/retryer.go | 100.0% | 84.6% | -15.4 |

@josephburnett (Contributor) left a comment

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 10, 2018
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: josephburnett, markusthoemmes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 10, 2018
@markusthoemmes
Contributor Author

/retest

@josephburnett
Contributor

/retest

@markusthoemmes
Contributor Author

/retest

@srinivashegde86
Contributor

/restest

@srinivashegde86
Contributor

/retest

@markusthoemmes
Contributor Author

/retest

@knative-prow-robot knative-prow-robot merged commit 1345d3a into knative:master Aug 10, 2018
@trisberg trisberg mentioned this pull request Aug 16, 2018
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.