
add restart policy & scheduler name for workflow pods #1109

Closed
wants to merge 5 commits into from

Conversation

@houz42 commented Dec 1, 2018

RestartPolicy and SchedulerName are useful for controlling how workflow pods run.
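
A minimal sketch of what the added fields might look like in pkg/apis/workflow/v1alpha1/types.go (field names and placement here are assumptions, not the exact diff):

```go
package v1alpha1

import (
	apiv1 "k8s.io/api/core/v1"
)

// WorkflowSpecSketch sketches the two proposed knobs; in the real change
// they would be fields on the existing WorkflowSpec struct.
type WorkflowSpecSketch struct {
	// SchedulerName is copied into each pod's spec.schedulerName,
	// so a custom Kubernetes scheduler can place workflow pods.
	SchedulerName string `json:"schedulerName,omitempty"`

	// RestartPolicy is copied into each pod's spec.restartPolicy.
	RestartPolicy apiv1.RestartPolicy `json:"restartPolicy,omitempty"`
}
```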

@alexmt (Contributor) left a comment

I agree we should provide the ability to specify the scheduler name and restart policy, but at the step level rather than for the whole workflow. Users would have to repeat the settings, but that problem should be solved as part of #799.

@jessesuen, I need your opinion about the restart policy. At first I thought we don't need it since RetryStrategy is available, but after some thought I've decided it is useful: a user might want to choose the pod restart policy to make sure the retry happens on the same node (kubelet restarts the failed container in place, whereas a RetryStrategy retry schedules a fresh pod that may land on a different node).

Two review comments on pkg/apis/workflow/v1alpha1/types.go (outdated, resolved)
@houz42 (Author) commented Jan 12, 2019

@alexmt

  1. The scheduler name and restart policy have been moved to the step level.
  2. The restart policy has been restricted to Never and OnFailure (sketched below).
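
A minimal sketch of how that restriction might be validated (the helper name is hypothetical, not the actual validation code):

```go
package validate

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
)

// validateRestartPolicy is a hypothetical helper: only Never and
// OnFailure (or unset) are accepted at the step level, since Always
// would keep restarting the main container even after success.
func validateRestartPolicy(policy apiv1.RestartPolicy) error {
	switch policy {
	case "", apiv1.RestartPolicyNever, apiv1.RestartPolicyOnFailure:
		return nil
	default:
		return fmt.Errorf("restartPolicy must be Never or OnFailure, got %q", policy)
	}
}
```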

@jessesuen (Member) commented Jan 16, 2019

@houz42 I think the scheduler is a fine addition.

However, a restartPolicy of OnFailure is problematic to set because restartPolicy is a pod-spec-level setting, so it applies to the wait sidecar as well. The current design relies on the wait sidecar exiting non-zero in many situations so that the controller can understand the status of the step. For example, the wait logic returns non-zero if any of the following goes wrong:

  • artifact loading
  • log retrieval
  • output parameter retrieval
  • artifact saving
  • output annotation
  • waiting on a k8s resource that reaches a failure condition

In order to support a restartPolicy of OnFailure, we would need to modify the executor to always exit zero and communicate a step failure back to the controller in some other way. It's unclear what this mechanism would be.
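
To make the coupling concrete, here is a minimal sketch (not Argo's actual pod-construction code) of why the setting leaks into the sidecar:

```go
package controller

import (
	apiv1 "k8s.io/api/core/v1"
)

// buildStepPod sketches why restartPolicy cannot target one container:
// it is a PodSpec-level field, so kubelet applies it to the main
// container AND the wait sidecar alike.
func buildStepPod(mainImage string) *apiv1.Pod {
	return &apiv1.Pod{
		Spec: apiv1.PodSpec{
			// Applies to every container below. With OnFailure, kubelet
			// would restart a wait sidecar that exited non-zero on
			// purpose to report a failed step.
			RestartPolicy: apiv1.RestartPolicyOnFailure,
			Containers: []apiv1.Container{
				{Name: "main", Image: mainImage},
				{Name: "wait", Image: "argoproj/argoexec:latest"},
			},
		},
	}
}
```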

One thought: we already use pod annotations to communicate error messages to the controller. The controller could be modified so that it always expects a pod annotation to be set (even on success). If the pod completed without setting the annotation, something went wrong and the controller could fail the step.
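
A rough sketch of that controller-side check (the annotation key and outcome values here are assumptions, not the real Argo annotation names):

```go
package controller

import (
	apiv1 "k8s.io/api/core/v1"
)

// Hypothetical annotation the executor would always set before exiting 0.
const outcomeAnnotation = "workflows.argoproj.io/outcome"

// assessStep sketches the proposed protocol: with argoexec always
// exiting 0, a completed pod with no outcome annotation means the
// executor died before it could report, so the step is failed.
func assessStep(pod *apiv1.Pod) string {
	if pod.Status.Phase != apiv1.PodSucceeded && pod.Status.Phase != apiv1.PodFailed {
		return "Running"
	}
	outcome, ok := pod.Annotations[outcomeAnnotation]
	if !ok {
		return "Failed"
	}
	return outcome // e.g. "Succeeded" or "Failed" as reported by argoexec
}
```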

So as it stands, supporting restartPolicy: OnFailure can't go in without:

  1. changing the behavior of argoexec to always exit 0, and
  2. replacing the current exit-1 error communication mechanism with something else.

@houz42 (Author) commented Jan 17, 2019

@jessesuen maybe I should submit the schedulerName changes first and consider restartPolicy later.

@jessesuen (Member) commented:

Yes, scheduler-only changes would be fine.

@houz42 mentioned this pull request Jan 20, 2019
@houz42 (Author) commented Jan 23, 2019

Added the scheduler name change only, in #1184.

@houz42 closed this Jan 23, 2019
icecoffee531 pushed a commit to icecoffee531/argo-workflows that referenced this pull request Jan 5, 2022
* chore: deprecate in v1.5 comments

Signed-off-by: Derek Wang <whynowy@gmail.com>