Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add handling for podFailurePolicy #269

Closed

Conversation

danielvegamyhre
Copy link
Contributor

@danielvegamyhre danielvegamyhre commented Aug 24, 2023

Fixes #262

Note I didn't modify the FailurePolicy API as I originally proposed in #262 but rather made respecting the podFailurePolicy the default behavior. Let me know if you have any feedback on this design decision, I am happy to discuss/revisit it.

cc @kannon92 @alculquicondor

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 24, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g August 24, 2023 23:20
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 24, 2023
func (r *JobSetReconciler) triggeredPodFailurePolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) bool {
log := ctrl.LoggerFrom(ctx)
for _, failedJob := range ownedJobs.failed {
for _, c := range failedJob.Status.Conditions {
Copy link
Contributor

@kannon92 kannon92 Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use IsStatusConditionTrue
true from kubernetes here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this more, I don't think you would be able to use this. That allows you to see if you have a FailedJob conditon but it doesn't match on the reason. I guess you could match on FailedJob condition and then check reason.

apiGVStr = jobset.GroupVersion.String()
jobOwnerKey = ".metadata.controller"
apiGVStr = jobset.GroupVersion.String()
JobConditionReasonPodFailurePolicy = "PodFailurePolicy"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/kubernetes/kubernetes/blob/714e77595c8b19b693925bda2a96ab80c307d38f/pkg/controller/job/job_controller.go#L59

I know that this change comes from this location and if we go this route I think we’d want to maybe version the constant as a API for job.

I think moving the PodFailurePolicy reason into staging for job would make it so that it would be more difficult to change the constant. Maybe I’m being paranoid but if someone changes this constant name, it could break you pretty easily. It’d be nice to reference this constant value as part of the API rather than a hard coded constant.

@alculquicondor wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it would be ideal if we could simply reference this constant as part of the Job API but since we currently can't, in the interest of velocity I just hard coded a constant here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did the correct thing. I have kubernetes/kubernetes#120175 where we would add these to API but it would be a few releases until you can use it probably.

n-2 support an all.

@@ -459,12 +460,30 @@ func (r *JobSetReconciler) executeFailurePolicy(ctx context.Context, js *jobset.
}

func (r *JobSetReconciler) executeRestartPolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) error {
if js.Spec.FailurePolicy.MaxRestarts == 0 {
if js.Spec.FailurePolicy.MaxRestarts == 0 || r.triggeredPodFailurePolicy(ctx, js, ownedJobs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So just checking, we only want to obey PodFailurePolicy if RestartPolicy is specified. Default behavior would be to fail on any failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no jobset failure policy is specified, the jobset will fail immediately without restarts anyway. So the only place we need to do this podFailurePolicy check is if restarting is an option.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.

Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.

Copy link
Contributor

@kannon92 kannon92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good to me! Its very clean.

Just a few comments.

Not sure if you want @alculquicondor or @ahg-g thoughts on it.
/hold
/lgtm

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 25, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023
@kannon92
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2023
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

pkg/controllers/jobset_controller.go Outdated Show resolved Hide resolved
@@ -459,12 +460,30 @@ func (r *JobSetReconciler) executeFailurePolicy(ctx context.Context, js *jobset.
}

func (r *JobSetReconciler) executeRestartPolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) error {
if js.Spec.FailurePolicy.MaxRestarts == 0 {
if js.Spec.FailurePolicy.MaxRestarts == 0 || r.triggeredPodFailurePolicy(ctx, js, ownedJobs) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.

Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.

@danielvegamyhre
Copy link
Contributor Author

Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.
Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.

Yeah in my original proposal I wanted to have it configurable as part of the JobSet's FailurePolicy, but I realized modifying the API would probably require a new minor version, and v0.3.0 isn't scheduled until early October, and in the meantime we have users who want this functionality sooner rather than later.

Personally I would strongly prefer it be a configurable option in the JobSet Failure Policy though so we may have no choice but to wait. I would be curious to get others thoughts on this though.

@kannon92
Copy link
Contributor

Not sure if this behavior would be immediately intuitive to users:
If one Job fails due to PodFailure policy, the entire job set fails.
Either make sure this is properly documented or make it a JobSet's FailurePolicy whether to respect the Job's.

Yeah in my original proposal I wanted to have it configurable as part of the JobSet's FailurePolicy, but I realized modifying the API would probably require a new minor version, and v0.3.0 isn't scheduled until early October, and in the meantime we have users who want this functionality sooner rather than later.

Personally I would strongly prefer it be a configurable option in the JobSet Failure Policy though so we may have no choice but to wait. I would be curious to get others thoughts on this though.

So I was thinking about this a bit more and I realize that we have a few cases we should consider with RestartPolicy.

PodFailurePolicy is one case but what about other FailedJob cases. ActiveDeadlineExceeded, BackoffLimitExceeded.

And other area that I was thinking is what about failure policies on different replicas. Could we see a case where parent RJ is allowed to failure but workers are not?

@alculquicondor
Copy link

PodFailurePolicy is one case but what about other FailedJob cases. ActiveDeadlineExceeded, BackoffLimitExceeded.

I wonder the same. These failures are also indicative of a problem in the user workload.

Let's flip the question: in which scenarios (or use cases), it makes sense to retry the entire job?

@kannon92
Copy link
Contributor

PodFailurePolicy is one case but what about other FailedJob cases. ActiveDeadlineExceeded, BackoffLimitExceeded.

I wonder the same. These failures are also indicative of a problem in the user workload.

Let's flip the question: in which scenarios (or use cases), it makes sense to retry the entire job?

@danielvegamyhre wdyt?

I can see a more general implementation to obey all failures of a job.

@kannon92
Copy link
Contributor

/retest

@danielvegamyhre
Copy link
Contributor Author

@kannon92 as long as there is any easy way to check if a Job failed due to ActiveDeadlineExceeded or BackoffLimitExceeded then we can include those, and perhaps change the name of the configuration to RespectNonRetriableErrors or something? We can probably think of a better name, but the general idea seems fine to me.

@kannon92
Copy link
Contributor

kannon92 commented Oct 17, 2023

@kannon92 as long as there is any easy way to check if a Job failed due to ActiveDeadlineExceeded or BackoffLimitExceeded then we can include those, and perhaps change the name of the configuration to RespectNonRetriableErrors or something? We can probably think of a better name, but the general idea seems fine to me.

We pushed a change in 1.29 to start exposing these as published apis in the reason field. Both of these work like PodFailurePolicy.

kubernetes/kubernetes@a62eb45

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 10, 2023
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

@danielvegamyhre: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-jobset-test-e2e-main-1-29 81ae7d5 link true /test pull-jobset-test-e2e-main-1-29

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@danielvegamyhre
Copy link
Contributor Author

Closing in favor of revised API discussed in #262

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Configurable Failure Policy API
4 participants