-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add handling for podFailurePolicy #269
Add handling for podFailurePolicy #269
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danielvegamyhre The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
609397c
to
94c3e70
Compare
func (r *JobSetReconciler) triggeredPodFailurePolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) bool { | ||
log := ctrl.LoggerFrom(ctx) | ||
for _, failedJob := range ownedJobs.failed { | ||
for _, c := range failedJob.Status.Conditions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use IsStatusConditionTrue
true from kubernetes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this more, I don't think you would be able to use this. That allows you to see if you have a FailedJob conditon but it doesn't match on the reason. I guess you could match on FailedJob condition and then check reason.
pkg/controllers/jobset_controller.go
Outdated
apiGVStr = jobset.GroupVersion.String() | ||
jobOwnerKey = ".metadata.controller" | ||
apiGVStr = jobset.GroupVersion.String() | ||
JobConditionReasonPodFailurePolicy = "PodFailurePolicy" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that this change comes from this location and if we go this route I think we’d want to maybe version the constant as a API for job.
I think moving the PodFailurePolicy reason into staging for job would make it so that it would be more difficult to change the constant. Maybe I’m being paranoid but if someone changes this constant name, it could break you pretty easily. It’d be nice to reference this constant value as part of the API rather than a hard coded constant.
@alculquicondor wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it would be ideal if we could simply reference this constant as part of the Job API but since we currently can't, in the interest of velocity I just hard coded a constant here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You did the correct thing. I have kubernetes/kubernetes#120175 where we would add these to API but it would be a few releases until you can use it probably.
n-2 support an all.
@@ -459,12 +460,30 @@ func (r *JobSetReconciler) executeFailurePolicy(ctx context.Context, js *jobset. | |||
} | |||
|
|||
func (r *JobSetReconciler) executeRestartPolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) error { | |||
if js.Spec.FailurePolicy.MaxRestarts == 0 { | |||
if js.Spec.FailurePolicy.MaxRestarts == 0 || r.triggeredPodFailurePolicy(ctx, js, ownedJobs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So just checking, we only want to obey PodFailurePolicy
if RestartPolicy
is specified. Default behavior would be to fail on any failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If no jobset failure policy is specified, the jobset will fail immediately without restarts anyway. So the only place we need to do this podFailurePolicy check is if restarting is an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.
Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation looks good to me! Its very clean.
Just a few comments.
Not sure if you want @alculquicondor or @ahg-g thoughts on it.
/hold
/lgtm
/lgtm |
New changes are detected. LGTM label has been removed. |
@@ -459,12 +460,30 @@ func (r *JobSetReconciler) executeFailurePolicy(ctx context.Context, js *jobset. | |||
} | |||
|
|||
func (r *JobSetReconciler) executeRestartPolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) error { | |||
if js.Spec.FailurePolicy.MaxRestarts == 0 { | |||
if js.Spec.FailurePolicy.MaxRestarts == 0 || r.triggeredPodFailurePolicy(ctx, js, ownedJobs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.
Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.
Yeah in my original proposal I wanted to have it configurable as part of the JobSet's FailurePolicy, but I realized modifying the API would probably require a new minor version, and v0.3.0 isn't scheduled until early October, and in the meantime we have users who want this functionality sooner rather than later. Personally I would strongly prefer it be a configurable option in the JobSet Failure Policy though so we may have no choice but to wait. I would be curious to get others thoughts on this though. |
So I was thinking about this a bit more and I realize that we have a few cases we should consider with RestartPolicy. PodFailurePolicy is one case but what about other FailedJob cases. And other area that I was thinking is what about failure policies on different replicas. Could we see a case where |
I wonder the same. These failures are also indicative of a problem in the user workload. Let's flip the question: in which scenarios (or use cases), it makes sense to retry the entire job? |
@danielvegamyhre wdyt? I can see a more general implementation to obey all failures of a job. |
/retest |
@kannon92 as long as there is any easy way to check if a Job failed due to |
We pushed a change in 1.29 to start exposing these as published apis in the reason field. Both of these work like PodFailurePolicy. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@danielvegamyhre: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Closing in favor of revised API discussed in #262 |
Fixes #262
Note I didn't modify the FailurePolicy API as I originally proposed in #262 but rather made respecting the podFailurePolicy the default behavior. Let me know if you have any feedback on this design decision, I am happy to discuss/revisit it.
cc @kannon92 @alculquicondor