
Retry on network failures #4454

Merged: 2 commits merged into knative:master on Nov 3, 2020
Conversation

@pierDipi (Member, Author) commented Nov 3, 2020

Fixes #4453

Proposed Changes

  • Retry on network failures

Release Note

- 🐛 Fix bug: Retry on network failures

Test execution logs:

2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697
2020/11/03 13:25:27 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697: retrying in 0s (10 left)
2020/11/03 13:25:27 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697: retrying in 100ms (9 left)
2020/11/03 13:25:27 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697: retrying in 200ms (8 left)
2020/11/03 13:25:27 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:27 [DEBUG] POST http://127.0.0.1:21697: retrying in 300ms (7 left)
2020/11/03 13:25:28 [ERR] POST http://127.0.0.1:21697 request failed: Post "http://127.0.0.1:21697": dial tcp 127.0.0.1:21697: connect: connection refused
2020/11/03 13:25:28 [DEBUG] POST http://127.0.0.1:21697: retrying in 400ms (6 left)
2020/11/03 13:25:28 [DEBUG] POST http://127.0.0.1:21697 (status: 503): retrying in 500ms (5 left)
2020/11/03 13:25:29 [DEBUG] POST http://127.0.0.1:21697 (status: 503): retrying in 600ms (4 left)
2020/11/03 13:25:29 [DEBUG] POST http://127.0.0.1:21697 (status: 503): retrying in 700ms (3 left)
2020/11/03 13:25:30 [DEBUG] POST http://127.0.0.1:21697 (status: 503): retrying in 800ms (2 left)
--- PASS: TestRetriesOnNetworkErrors (3.61s)
PASS

Process finished with exit code 0

@knative-prow-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Nov 3, 2020
@knative-prow-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pierDipi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Nov 3, 2020
@google-cla bot added the cla: yes label (indicates the PR's author has signed the CLA) on Nov 3, 2020
@codecov bot commented Nov 3, 2020

Codecov Report

Merging #4454 into master will increase coverage by 0.12%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4454      +/-   ##
==========================================
+ Coverage   81.06%   81.19%   +0.12%     
==========================================
  Files         281      281              
  Lines        7981     7981              
==========================================
+ Hits         6470     6480      +10     
+ Misses       1122     1112      -10     
  Partials      389      389              
Impacted Files                          Coverage            Δ
pkg/kncloudevents/message_sender.go     79.66% <100.00%>    (+16.94%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f064837...722560c.

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
@pierDipi (Member, Author) commented Nov 3, 2020

/cc @slinkydeveloper

Before:

func checkRetry(_ context.Context, resp *nethttp.Response, _ error) (bool, error) {
	return resp != nil && resp.StatusCode >= 300, nil
}

After:

func checkRetry(_ context.Context, resp *nethttp.Response, err error) (bool, error) {
	return !(resp != nil && resp.StatusCode < 300), err
}
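The surrounding retry loop isn't shown in this thread. The log format in the PR body ("[DEBUG] POST ... retrying in ...") matches hashicorp/go-retryablehttp, so the following is a minimal sketch, assuming that library, of how a predicate like the new checkRetry can be plugged into a client. RetryMax, the wait bounds, and the target URL here are illustrative assumptions, not the values the dispatcher derives from the DeliverySpec.

package main

import (
	"context"
	"log"
	nethttp "net/http"
	"time"

	"github.com/hashicorp/go-retryablehttp"
)

// checkRetry retries whenever we do not have a successful (2xx) response in
// hand: on any transport error (resp == nil) as well as on >=300 statuses.
func checkRetry(_ context.Context, resp *nethttp.Response, err error) (bool, error) {
	return !(resp != nil && resp.StatusCode < 300), err
}

func main() {
	client := retryablehttp.NewClient()
	client.CheckRetry = checkRetry // plug in the custom predicate
	client.RetryMax = 10           // illustrative value, not a project default
	client.RetryWaitMin = 100 * time.Millisecond
	client.RetryWaitMax = time.Second

	req, err := retryablehttp.NewRequest("POST", "http://127.0.0.1:21697", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := client.Do(req)
	if err != nil {
		log.Printf("gave up after retries: %v", err)
		return
	}
	defer resp.Body.Close()
	log.Printf("final status: %s", resp.Status)
}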
@slinkydeveloper (Contributor) commented Nov 3, 2020

IMO we should not retry on every error... For some of them it's reasonable not to retry, like a dial tcp timeout when the tcp address isn't reachable, or DNS errors... My feeling is that which errors are retried and which are not should be configurable, because otherwise you bloat the dispatchers with constant retrying when there is no need (because maybe the target service doesn't exist at all)....

What are people's feelings about that? @matzew, @vaikas, @lionelvillard, @grantr?
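To make the distinction being discussed concrete, here is a hypothetical sketch (not something this PR adds) of a predicate that treats connection refusals and timeouts as retryable but gives up on DNS resolution failures, using only the Go standard library. The helper name and the policy itself are assumptions for illustration.

package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// retryableNetworkError is a hypothetical classifier: retry when the target
// may simply be restarting or slow, give up when the name does not resolve.
func retryableNetworkError(err error) bool {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return false // name does not resolve; retrying is unlikely to help
	}
	if errors.Is(err, syscall.ECONNREFUSED) {
		return true // nothing listening yet; the endpoint may come up shortly
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true // plain timeouts are usually transient
	}
	return false
}

func main() {
	_, err := net.Dial("tcp", "127.0.0.1:1") // almost certainly refused
	fmt.Println("retryable:", retryableNetworkError(err))
}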

Contributor commented:

I think retrying on any error is fine. I'd rather not make the user define too many error classes: network, DNS, etc. If they want retries, it seems reasonable to expect any error to be retried, not just errors that happen at the protocol level.

Contributor commented:

On k8s, pods may be started and deleted for many reasons (scale-up/down, node drain, node restart, ...), and not all addressables will implement proper readiness and graceful shutdown, so it may be quite common to get all kinds of errors during normal operation. IMHO users will expect that setting "retries" makes the dispatcher retry in these kinds of situations...

Another thing to consider is that retrying on network errors may cause a duplicate event to be delivered (which seems allowed by the CloudEvents spec, https://github.com/cloudevents/spec/blob/v1.0/spec.md#id ), so perhaps some users may prefer events to be lost instead of potentially delivered twice... (of course, they can still set retries to 0 then...)

Contributor commented:

Another thing to consider is that retrying on network errors may cause a duplicate event to be delivered (which seems allowed by the CloudEvents spec, https://github.com/cloudevents/spec/blob/v1.0/spec.md#id ), so perhaps some users may prefer events to be lost instead of potentially delivered twice... (of course, they can still set retries to 0 then...)

Well, but this ends up in the "exactly once" discussion, which atm is out of scope for eventing.

Member (Author) commented:

I think retrying is a reasonable default; if there is interest in specific configurations we can always add them later.

Contributor commented:

I agree with @slinkydeveloper that not every error should be retried. FWIW... this is the logic in the "distributed" KafkaChannel dispatcher for determining when to retry. Also, we only retry some of the following because the broker munges the actual responses (400) and the retry test expects them (429)...

	//
	// Note - Normally we would NOT want to retry 400 responses, BUT the knative-eventing
	//        filter handler (due to CloudEvents SDK V1 usage) is swallowing the actual
	//        status codes from the subscriber and returning 400s instead.  Once this has,
	//        been resolved we can remove 400 from the list of codes to retry.
	//
	if statusCode >= 500 || statusCode == 400 || statusCode == 404 || statusCode == 429 {
		logger.Warn("Failed To Send Message To Subscriber Service - Retrying")
		return true, nil
	} else if statusCode >= 300 && statusCode <= 399 {
		logger.Warn("Failed To Send Message To Subscriber Service - Not Retrying")
		return false, nil
	} else if statusCode == -1 {
		logger.Warn("No StatusCode Detected In Error - Retrying")
		return true, nil
	}

	// Do Not Retry 1XX, 2XX, & Most 4XX StatusCode Responses
	return false, nil

Member (Author) commented:

We're discussing network errors, not status codes; for status codes there is a separate issue: #2411

Contributor commented:

+1 for retrying on error. Network errors may happen for various reasons IMO. If we have concerns about too many retries, we can always use another backoff strategy, e.g. exponential.
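For context, selecting a different backoff strategy is already expressible on the delivery side. The sketch below assumes the knative.dev/eventing duck v1 DeliverySpec fields (Retry, BackoffPolicy, BackoffDelay); the concrete values are illustrative.

package main

import (
	"fmt"

	duckv1 "knative.dev/eventing/pkg/apis/duck/v1"
)

func main() {
	// Illustrative spec: up to 10 attempts with exponential backoff starting
	// at 100ms (BackoffDelay is an ISO-8601 duration string).
	retry := int32(10)
	policy := duckv1.BackoffPolicyExponential
	delay := "PT0.1S"

	spec := duckv1.DeliverySpec{
		Retry:         &retry,
		BackoffPolicy: &policy,
		BackoffDelay:  &delay,
	}
	fmt.Printf("%+v\n", spec)
}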

Member commented:

+1 for retrying on error.

Member commented:

I think retrying on any error is fine. I'd rather not make the user define too many error classes: network, DNS, etc. If they want retries, it seems reasonable to expect any error to be retried, not just errors that happen at the protocol level.

That would be my thinking here too.

pkg/kncloudevents/message_sender_test.go (outdated review thread)
func TestRetriesOnNetworkErrors(t *testing.T) {
	port := rand.Int31n(math.MaxUint16-1024) + 1024
	n := int32(10)
	// ...
Contributor commented:

I don't think you need all the int32s.

Member (Author) commented:

I use n to set DeliverySpec.Retry, which is an int32, and then I compare it with nCalls, so instead of casting when I compare them I just use the same type for both. Does that make sense?

Contributor commented:

Ah, makes sense. I thought you were just doing that for n and nCalls. I personally would do the cast once in Int32Ptr, but that's just me.

Contributor commented:

What's the approximate run time for this test? I think the linear backoff was recently changed to t, 2t, 3t, 4t, ... instead of t, t, t, t for the interval.

Member (Author) commented:

I attached the test logs in the PR body; it ran in 3.61s.
I use the smallest backoff supported by the library, but we can reduce n if that's too much for a single unit test.
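For reference, the waits in the attached logs already follow that linear pattern with t = 100ms: 0s, 100ms, 200ms, ..., 800ms across the nine logged retries, which sum to roughly 3.6s and line up with the reported 3.61s run time.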

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
@knative-metrics-robot commented:
The following is the coverage report on the affected files.
Say /test pull-knative-eventing-go-coverage to re-run this coverage report

File                                    Old Coverage   New Coverage   Delta
pkg/kncloudevents/message_sender.go     65.9%          82.9%          17.1

@zhongduo (Contributor) commented Nov 3, 2020

/lgtm

@knative-prow-robot added the lgtm label (indicates that a PR is ready to be merged) on Nov 3, 2020
@knative-prow-robot merged commit c2bfc44 into knative:master on Nov 3, 2020
@pierDipi (Member, Author) commented Nov 3, 2020

Should we backport this?

@matzew (Member) commented Nov 3, 2020

Should we backport this?

It would be good to backport to 0.18 and 0.17?

pierDipi added a commit to pierDipi/eventing that referenced this pull request Nov 3, 2020
Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
pierDipi added a commit to pierDipi/eventing that referenced this pull request Nov 3, 2020
* Retry on network failure

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Use a fixed port number

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
@pierDipi (Member, Author) commented Nov 3, 2020

@matzew backported; please check the linked PRs.

matzew pushed a commit to matzew/eventing that referenced this pull request Nov 3, 2020
* Retry on network failure

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Use a fixed port number

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
knative-prow-robot pushed a commit that referenced this pull request Nov 3, 2020
* Retry on network failures (#4454)

* Retry on network failure

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Use a fixed port number

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
openshift-merge-robot pushed a commit to openshift/knative-eventing that referenced this pull request Nov 4, 2020
* Retry on network failures (knative#4454)

* Retry on network failure

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Use a fixed port number

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* fixing imports

Signed-off-by: Matthias Wessendorf <mwessend@redhat.com>

Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
knative-prow-robot pushed a commit that referenced this pull request Nov 4, 2020
* Retry on network failures (#4454)

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
matzew pushed a commit to matzew/eventing that referenced this pull request Nov 7, 2020
* Retry on network failures (knative#4454)

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
matzew pushed a commit to openshift/knative-eventing that referenced this pull request Nov 7, 2020
* Retry on network failures (knative#4454)

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
openshift-merge-robot pushed a commit to openshift/knative-eventing that referenced this pull request Nov 7, 2020
* Update pingsource-mt-adapter.yaml

* Like on 0.18.3, we skip the tracing tests

Signed-off-by: Matthias Wessendorf <mwessend@redhat.com>

* [release-0.18] Retry on network failures (knative#4454) (knative#4457)

* Retry on network failures (knative#4454)

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* nethttp -> http

Signed-off-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>

* Backport knative#4465 (knative#4468)

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* [0.18] Backport knative#4466 (knative#4471)

* Remove double invocations to responseWriter.WriteHeader in filter handler (knative#4466)

* Fix knative#4464

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Docs

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Moar tests

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Linting

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Nit with metrics

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

(cherry picked from commit a6fc540)
Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Nit

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* fixed wrong marshall in apiserversouece which will fix the missing ceOverrides extension  (knative#4477) (knative#4480)

* fixed wrong marshall

* fixed UT

* [0.18] Readyness probe in broker ingress (knative#4483)

* Fix knative#4473

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

* Massage the filter yaml

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>

Co-authored-by: Matthias Wessendorf <mwessend@redhat.com>
Co-authored-by: Pierangelo Di Pilato <pierangelodipilato@gmail.com>
Co-authored-by: Francesco Guardiani <francescoguard@gmail.com>
Co-authored-by: capri-xiyue <52932582+capri-xiyue@users.noreply.github.com>
@pierDipi deleted the KNATIVE-4453 branch on November 25, 2021 14:28
Successfully merging this pull request may close these issues.

imc-dispatcher doesn't retry on connection failures or EOFs
10 participants