Retry on network failures #4454
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: pierDipi. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Codecov Report
@@ Coverage Diff @@
## master #4454 +/- ##
==========================================
+ Coverage 81.06% 81.19% +0.12%
==========================================
Files 281 281
Lines 7981 7981
==========================================
+ Hits 6470 6480 +10
+ Misses 1122 1112 -10
Partials 389 389
Continue to review full report at Codecov.
Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
Force-pushed from 4e63ba9 to 51a8165.
/cc @slinkydeveloper
Before:
func checkRetry(_ context.Context, resp *nethttp.Response, _ error) (bool, error) {
	return resp != nil && resp.StatusCode >= 300, nil
}

After:
func checkRetry(_ context.Context, resp *nethttp.Response, err error) (bool, error) {
	return !(resp != nil && resp.StatusCode < 300), err
}
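To make the behavioral change easier to see, here is a minimal, self-contained sketch (not the PR code itself) that contrasts the two predicates on a few representative inputs; the response values are invented for illustration.

package main

import (
	"fmt"
	"net/http"
)

// oldCheck mirrors the previous predicate: retry only when a response
// arrived with a status code >= 300; a nil response (i.e. a network
// error) was never retried.
func oldCheck(resp *http.Response) bool {
	return resp != nil && resp.StatusCode >= 300
}

// newCheck mirrors the new predicate: retry unless a response arrived
// with a status code < 300, so network-level failures are retried too.
func newCheck(resp *http.Response) bool {
	return !(resp != nil && resp.StatusCode < 300)
}

func main() {
	cases := []struct {
		name string
		resp *http.Response
	}{
		{"network error (no response)", nil},
		{"200 OK", &http.Response{StatusCode: http.StatusOK}},
		{"502 Bad Gateway", &http.Response{StatusCode: http.StatusBadGateway}},
	}
	for _, c := range cases {
		fmt.Printf("%-28s old=%v new=%v\n", c.name, oldCheck(c.resp), newCheck(c.resp))
	}
}

The only input on which the two disagree is the nil-response case, which is exactly the network-failure scenario this PR targets.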
IMO we should not retry on every error... Some of them are reasonable not to retry, like a dial tcp timeout when the tcp address is not reachable, or dns errors... My feeling is that it should be configurable which errors to retry and which not, because otherwise you bloat the dispatchers with constant retrying when there is no need (because maybe the pointed service doesn't exist at all)....
What are people's feelings about that? @matzew, @vaikas, @lionelvillard, @grantr ?
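To make the suggestion concrete, here is a hypothetical sketch (not part of this PR, and not the library's API) of a predicate that treats DNS errors and dial timeouts as permanent and everything else as transient; the helper name and the exact policy are assumptions for illustration only.

package retrypolicy

import (
	"context"
	"errors"
	"net"
	"net/http"
)

// isRetryableError is one possible classification, following the comment
// above: DNS errors and dial tcp timeouts are treated as permanent, any
// other error is assumed to be transient and worth retrying.
func isRetryableError(err error) bool {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return false // e.g. "no such host": the target likely does not exist
	}
	var opErr *net.OpError
	if errors.As(err, &opErr) && opErr.Op == "dial" && opErr.Timeout() {
		return false // dial tcp timeout: the address is not reachable
	}
	return true
}

// checkRetry shows how such a policy could plug into a predicate with the
// same shape as the one in the diff above.
func checkRetry(_ context.Context, resp *http.Response, err error) (bool, error) {
	if err != nil {
		return isRetryableError(err), nil
	}
	return resp != nil && resp.StatusCode >= 300, nil
}

Making the classification itself configurable would be the next step, but that is beyond this sketch.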
I think retrying on any error is fine. I'd rather not have the user deal with having to define too many of the errors, network, dns, etc. If they want retry, seems reasonable to expect any error to be retried and not just errors that happen at the protocol level.
On k8s, pods may be started and deleted for many reasons (scale-up/down, node drain, node restart, ...), and not all addressables will implement proper readiness and graceful shutdown, so it may be quite common to get all kinds of errors during normal operation, and IMHO users will expect that setting "retries" will make the dispatcher retry in these kinds of situations...
Another thing to consider is that doing retries on network errors may cause a duplicate event to be delivered (which seems allowed by the cloudevents spec, https://github.com/cloudevents/spec/blob/v1.0/spec.md#id ), so perhaps some users may prefer for events to be lost instead of potentially delivered twice... (of course, they can still set retries to 0 then...)
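On the duplicate-delivery point, a receiver that wants retries but cares about duplicates can deduplicate on the CloudEvents source and id attributes, which the spec requires to be unique per event. A rough, illustrative sketch follows (the in-memory store is an assumption; a real consumer would need expiry and a store shared across replicas).

package dedup

import "sync"

// Deduper remembers which source+id pairs have already been delivered.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// FirstDelivery reports whether this source+id pair has not been seen
// before, and records it if so; redeliveries return false.
func (d *Deduper) FirstDelivery(source, id string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	key := source + "|" + id
	if _, ok := d.seen[key]; ok {
		return false
	}
	d.seen[key] = struct{}{}
	return true
}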
Another thing to consider is that doing retries on network errors may cause a duplicate event to be delivered (which seems allowed by the cloudevents spec, https://github.com/cloudevents/spec/blob/v1.0/spec.md#id ), so perhaps some users may prefer for events to be lost instead of potentially delivered twice... (of course, they can still set retries to 0 then...)
Well, but this ends up in the discussion of "exactly once", which atm is out of scope for eventing.
I think retrying is a reasonable default; then, if there is interest in specific configurations, we can always add them.
I agree with @slinkydeveloper that not every error should be retried. fwiw... this is the logic in the "distributed" KafkaChannel dispatcher for determining when to retry. Also, we only retry some of the following because the broker munges the actual responses (400), and the retry test expects them (429)...
//
// Note - Normally we would NOT want to retry 400 responses, BUT the knative-eventing
// filter handler (due to CloudEvents SDK V1 usage) is swallowing the actual
// status codes from the subscriber and returning 400s instead. Once this has
// been resolved we can remove 400 from the list of codes to retry.
//
if statusCode >= 500 || statusCode == 400 || statusCode == 404 || statusCode == 429 {
logger.Warn("Failed To Send Message To Subscriber Service - Retrying")
return true, nil
} else if statusCode >= 300 && statusCode <= 399 {
logger.Warn("Failed To Send Message To Subscriber Service - Not Retrying")
return false, nil
} else if statusCode == -1 {
logger.Warn("No StatusCode Detected In Error - Retrying")
return true, nil
}
// Do Not Retry 1XX, 2XX, & Most 4XX StatusCode Responses
return false, nil
We're discussing network errors, not status codes; for status codes there is a separate issue: #2411
+1 for retry on error. Network errors may happen for various reasons IMO. If we have concerns about too many retries, we can always use another backoff strategy, e.g. exponential.
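As a rough illustration of the exponential option mentioned here (the base delay, cap, and jitter are arbitrary example values, not anything the dispatcher actually uses):

package backoff

import (
	"math"
	"math/rand"
	"time"
)

// Delay returns the wait before the given 0-based retry attempt: the base
// delay doubles on each attempt, is capped at max, and gets up to 20%
// random jitter so many dispatchers do not retry in lockstep.
func Delay(attempt int, base, max time.Duration) time.Duration {
	d := time.Duration(float64(base) * math.Pow(2, float64(attempt)))
	if d > max {
		d = max
	}
	jitter := time.Duration(rand.Int63n(int64(d)/5 + 1))
	return d + jitter
}

For example, Delay(0, 50*time.Millisecond, 5*time.Second) is roughly 50-60ms, while attempt 4 is roughly 800-960ms.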
+1 for retry on error.
I think retrying on any error is fine. I'd rather not have the user deal with having to define too many of the errors, network, dns, etc. If they want retry, seems reasonable to expect any error to be retried and not just errors that happen at the protocol level.
that would be my thinking here too
func TestRetriesOnNetworkErrors(t *testing.T) {
	port := rand.Int31n(math.MaxUint16-1024) + 1024
	n := int32(10)
I don't think you need all the int32.
I use n to set DeliverySpec.Retry, which is an int32, and then I compare it with nCalls, so instead of a cast when I compare them I just use the same type for both. Does that make sense?
Ah, makes sense. I thought you were just doing that in n and nCalls. I personally would do the cast once in Int32Ptr, but it is just me.
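For what it's worth, a tiny hypothetical comparison of the two styles being discussed; the int32Ptr helper below is a stand-in written for this sketch, not the project's actual Int32Ptr.

package main

import "fmt"

// int32Ptr is a local stand-in for a pointer helper.
func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Style 1 (as in the test): keep everything int32, so the retry count
	// and the call counter compare without casts.
	n := int32(10)
	var nCalls int32 = 10
	fmt.Println("retries exhausted:", nCalls == n)

	// Style 2: plain int, with a single cast where the *int32 is built.
	m := 10
	retry := int32Ptr(int32(m)) // the only cast
	fmt.Println("retry setting:", *retry)
}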
What's the approximate run time for this test? I think the linear backoff was recently changed to t, 2t, 3t, 4t for the interval, instead of t, t, t, t.
I attached test logs in the PR body; it ran in 3.61s. I use the smallest backoff supported by the library, but we can reduce n if that's too much for a single unit test.
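To put rough numbers on the run-time question: with the newer linear policy the waits are t, 2t, ..., nt, so the total backoff alone is t * n * (n + 1) / 2. The 50ms base delay below is only an assumed example, not the value the test actually uses.

package main

import (
	"fmt"
	"time"
)

// totalLinearBackoff sums the waits t + 2t + ... + nt for n retries.
func totalLinearBackoff(n int, t time.Duration) time.Duration {
	return t * time.Duration(n*(n+1)/2)
}

func main() {
	// Hypothetical: 10 retries with a 50ms base delay spend 2.75s in backoff
	// alone, before adding the time spent on the requests themselves.
	fmt.Println(totalLinearBackoff(10, 50*time.Millisecond))
}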
Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
The following is the coverage report on the affected files.
/lgtm
Should we backport this?
It would be good to backport to 0.18 and 0.17?
Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* Retry on network failure
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* Use a fixed port number
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
@matzew backported, please check linked PRs.
* Retry on network failures (#4454)
* Retry on network failure
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* Use a fixed port number
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* nethttp -> http
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* Retry on network failures (knative#4454)
* Retry on network failure
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* Use a fixed port number
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* fixing imports
  Signed-off-by: Matthias Wessendorf <mwessend@redhat.com>
  Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* Retry on network failures (knative#4454)
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* nethttp -> http
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* Update pingsource-mt-adapter.yaml
* Like on 0.18.3, we skip the tracing tests
  Signed-off-by: Matthias Wessendorf <mwessend@redhat.com>
* [release-0.18] Retry on network failures (knative#4454) (knative#4457)
* Retry on network failures (knative#4454)
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* nethttp -> http
  Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
* Backport knative#4465 (knative#4468)
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
* [0.18] Backport knative#4466 (knative#4471)
* Remove double invocations to responseWriter.WriteHeader in filter handler (knative#4466)
* Fix knative#4464
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
* Docs
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
* Moar tests
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
* Linting
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
* Nit with metrics
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
  (cherry picked from commit a6fc540)
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
* Nit
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
* fixed wrong marshall in apiserversouece which will fix the missing ceOverrides extension (knative#4477) (knative#4480)
* fixed wrong marshall
* fixed UT
* [0.18] Readyness probe in broker ingress (knative#4483)
* Fix knative#4473
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
* Massage the filter yaml
  Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
  Co-authored-by: Matthias Wessendorf <mwessend@redhat.com>
  Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
  Co-authored-by: Francesco Guardiani <francescoguard@gmail.com>
  Co-authored-by: capri-xiyue <52932582+capri-xiyue@users.noreply.github.com>
Fixes #4453
Proposed Changes
Release Note
Test execution logs: