fix(executor): Replace default retry in executor with an increased value retryer #3891

Merged (4 commits) on Aug 31, 2020

@anggao (Contributor) commented Aug 28, 2020

Replace the default retry in the executor with a retryer that uses increased values.

#3675 added a pod watch retry to handle timeouts.
We need to apply this increased retry value in several other places. For example, the wait container's requests to etcd can time out while it waits for the main container.
We are experiencing the following error with argo-workflow 2.10.0.

time="2020-08-28T16:02:23.208Z" level=info msg="Waiting on main container"
time="2020-08-28T16:02:23.702Z" level=info msg="main container started with container ID: 17ec51e28b2a528c7c5238b5858e3d3a19c8c2f599f6c4950690da0f5a5c641a"
time="2020-08-28T16:02:23.702Z" level=info msg="Starting annotations monitor"
time="2020-08-28T16:02:23.709Z" level=info msg="Waiting for container 17ec51e28b2a528c7c5238b5858e3d3a19c8c2f599f6c4950690da0f5a5c641a to complete"
time="2020-08-28T16:02:23.709Z" level=info msg="Starting to wait completion of containerID 17ec51e28b2a528c7c5238b5858e3d3a19c8c2f599f6c4950690da0f5a5c641a ..."
time="2020-08-28T16:02:23.709Z" level=info msg="Starting deadline monitor"
time="2020-08-28T16:07:23.208Z" level=info msg="Alloc=7343 TotalAlloc=42779 Sys=72128 NumGC=11 Goroutines=10"
time="2020-08-28T16:08:05.884Z" level=warning msg="Failed to wait for container id '17ec51e28b2a528c7c5238b5858e3d3a19c8c2f599f6c4950690da0f5a5c641a': etcdserver: request timed out"
time="2020-08-28T16:08:05.884Z" level=error msg="executor error: etcdserver: request timed out"
time="2020-08-28T16:08:05.884Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-08-28T16:08:05.884Z" level=info msg="Capturing script exit code"
time="2020-08-28T16:08:05.884Z" level=info msg="Getting exit code of 17ec51e28b2a528c7c5238b5858e3d3a19c8c2f599f6c4950690da0f5a5c641a"
time="2020-08-28T16:08:05.884Z" level=info msg="Annotations monitor stopped"
time="2020-08-28T16:08:05.907Z" level=info msg="No output parameters"
time="2020-08-28T16:08:05.907Z" level=info msg="No output artifacts"
time="2020-08-28T16:08:05.907Z" level=info msg="Killing sidecars"
time="2020-08-28T16:08:05.911Z" level=info msg="Alloc=6540 TotalAlloc=46240 Sys=72128 NumGC=12 Goroutines=9"
time="2020-08-28T16:08:05.954Z" level=fatal msg="etcdserver: request timed out"

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed the CLA.
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@CLAassistant commented Aug 28, 2020

CLA assistant check
All committers have signed the CLA.

@alexec requested a review from jessesuen August 28, 2020 19:07
@alexec merged commit bb79e3f into argoproj:master Aug 31, 2020
@alexec added this to the v2.11 milestone Aug 31, 2020
alexec pushed a commit that referenced this pull request Sep 2, 2020