-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: Configurable Failure Policy API #262
Comments
With replicas we have some complications as you would need a API that can understand that some replicas may fail while others don't. I think generally, a SOC should be that if the Job declares a job as failed with PodFailurePolicy we should respect that. I don't quite understand why we would want exit code information at the JobSet level. |
Say we have many large JobSet workloads enqueued in Kueue. JobSet 1 is scheduled and executed first, and has a failure policy of This could be avoided if the JobSet controller could be aware that the Job failed due to the non-retriable exit code specified in podFailurePolicy, and simply fail the JobSet immediately rather than restarting N times. |
I see. I still don't think the API would be exit codes/conditions of the container but you want some kind of JobFailurePolicy at the JobSet level. In this case, it could be conditions or statuses from the jobs that we only restart. I believe that you know that the jobs failed due to some kind of match of PodFailurePolicy. IIRC we publish a condition on match for PodFailurePolicy so that info would be in the job. What kind of API are you thinking? I find it easier to work backwards from the API. |
Here are a couple ideas for a possible API: Option 1: Generic, but more configuration required type FailurePolicy struct {
// TargetPodFailurePolicies, if set, will cause the JobSet to fail immediately
// without restart if any of its failed child jobs failed due to matching a podFailurePolicy.
FollowPodFailurePolicy *bool`json:"followPodFailurePolicy,omitempty"`
// MaxRestarts defines the limit on the number of JobSet restarts.
// A restart is achieved by recreating all active child jobs.
MaxRestarts int32 `json:"maxRestarts,omitempty"`
} This would require the user to define podFailurePolicies themselves in their Job templates. The implementation would then involve checking if a job failure was due to a match on a podFailurePolicy and then failing the JobSet rather than restarting One downside is failing the JobSet if it matches ANY podFailurePolicy is rather sweeping and inexact, however, I didn't see any obvious way to specify specific pod failure policies (e.g. a name or id). Option 2: Less generic, but less configuration required type FailurePolicy struct {
// FailNonRetriable, if set, will not restart the JobSet if one or more
// of its child jobs failed due to a non-retriable container exit code.
// Exit code 143 (SIGTERM) is treated as the only retriable exit code, as it
// is used for maintenance events and workload pre-emption, which are scenarios
// where JobSet recreation is desired.
FailNonRetriable *bool `json:"failNonRetriable,omitempty"`
// MaxRestarts defines the limit on the number of JobSet restarts.
// A restart is achieved by recreating all active child jobs.
MaxRestarts int32 `json:"maxRestarts,omitempty"`
} In this option, the user does not have to define a podFailurePolicy in their Job template. If podFailurePolicy we would inject into the Job template: podFailurePolicy:
rules:
- action: FailJob
onExitCodes:
containerName: user-container
operator: NotIn
values: [143] # SIGTERM Personally I am leaning toward Option 1, since it is more generic/flexible and we don't lock ourselves into any specific podFailurePolicy. The downside is it requires our users to have more k8s specific knowledge and configure more "knobs," which I would like to minimize. |
Maybe you can draft a KEP with some examples of Job with PodFailurePolicies and JobSet. I'd also like to ask @mimowo for his thoughts on this once he is back from vacation. Generally PodFailurePolicy has the concept of ExitCodes and Conditions. One thing I don't get is why is this needed? If one wants to fail a JobSet with |
If a Job fails due to a maintenance event on the underlying VMs, or workload preemption, we DO want it to be restarted as soon as possible. If a Job fails due to some software bug, we DON'T want it to restart at all, for the reasons described previously. |
I agree but that is the responsibility of This would be the same behavior as not specifying
|
How would one configure a podFailurePolicy + JobSet failurePolicy to achieve this behavior? If a maintenance event kills some pods on a node during graceful node shutdown, we want the job to fail AND for the entire JobSet to restart. Right now, the only way for the JobSet failurePolicy to execute is if 1+ child jobs have failed, so unless we change that, we want the maintenance event to fail the job. If a software bug causes the pods to die, we also want the job to fail, but we want the JobSet failure policy to know NOT to restart the entire JobSet. In both cases we want the Job to fail, but in only 1 case do we want the JobSet to restart. At the jobset level, we have no way of distinguishing between job failures which require restarts vs job failures which do not. Basically, we need the JobSet failurePolicy to be selective about when it should use the restarts allowed by |
I see. I think option 1 would be sufficient but this seems to be more subtle than a github issue can surmise. |
@alculquicondor I'd be curious to get your thoughts on this as well. When a Job fails, is there a way to check if it failed due to triggering a podFailurePolicy with action To summarize the issue: when a JobSet's child job fails, we need the JobSet failurePolicy to be selective about when it should use the restarts allowed by The use case requested by users: when pods are killed with SIGTERM due to maintenance events or workload preemption, we want the job to fail, triggering the JobSet failurePolicy and restarting the whole JobSet in accordance with it's See this comment for proposed API options so far. Basically, an error in the application code means the workload is doomed to crash no matter what, and each restart requires rescheduling thousands of pods using the computationally expensive/slow exclusive affinities (only ~58 pods/sec scheduling throughput), and so restarting N times will block other enqueued workloads from utilizing these resources the entire time, which could be 30min+ for large workloads. |
The Failed condition contains this reason https://github.com/kubernetes/kubernetes/blob/714e77595c8b19b693925bda2a96ab80c307d38f/pkg/controller/job/job_controller.go#L62C21-L62C21 Give it a try and let us know if there is anything missing, so we can include it in the next iteration for the change (hopefully we don't need extra API fields). The use case you mention is exactly why we developed PodFailurePolicy in Job. |
I think the api can be a bit more generic and can mimic the podFailurePolicy one, think of the following cases that can be qualified based on the child job's failure reason:
|
I like this, definitely more generic/flexible. |
Think of Job = pod in the following comparison between JobSet and Job: |
Oh, now that I read the descriptions of the actions, |
The current behavior, so yes. |
I like the revised API proposed by @ahg-g. More users have been requesting this feature lately, so unless there's any major concerns I'm going to proceed with the implementation soon. |
/assign |
/reopen |
@kannon92: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign |
@jedwins1998 let's make the ETA for this feature around the end of the month, we'll include it in the next release. This is an important feature which will require careful review and testing, and I don't want to rush it to squeeze it into v0.5.0. In addition to the implementation we will need:
|
What would you like to be added:
Right now, the JobSet FailurePolicy only allows specifying the max number of JobSet restarts to attempt before marking it as failed.
Users have requested that we should also allow them to configure the failure conditions under which the JobSet will be restarted, or when the JobSet should be failed immediately without going through the allowed number of restarts (see "Example user stories" section below for why this is needed).
As an initial thought, we could define a field in the JobSet FailurePolicy which allows the user to specify the JobSet should respect the Job's podFailurePolicy (i.e., if a Job failed due to a podFailurePolicy configuring the job to fail immediately without restart under certain conditions, then JobSet should also respect this and not restart the JobSet).
For example, in this list of common exit codes used by containers, we can see container exit code 143 is used for containers killed by graceful termination (SIGTERM), which is the signal used by maintenance events via graceful node shutdown, as well as by workload pre-emption.
A user could configure a
podFailurePolicy
on their job to fail the job immediately if there is any exit code except 143. If the JobSet respected this, it would restart if the child job was killed by a SIGTERM (maintenance event), but would not restart if there an application code error.Why is this needed:
Example use cases / user story:
As a user, when I enqueue and run many JobSet workloads using Kueue, I'd like to be able to specify the JobSet failure policy to restart if the containers exited with a SIGTERM exit code (e.g. maintenance event or pre-emption), but NOT restart if the container exited with an application code error, in order to not consume valuable resources repeatedly attempting to run pods that are doomed to fail, while other workloads sit idle in Kueue awaiting free resources to run.
Example API:
Example pod failure policy which the user could use in combination with
.spec.failurePolicy.followPodFailurePolicy = true
to allow jobsets to be restarted if the jobs were killed due to maintenance events, but not if the jobs failed due to application code errors or other bugs:The text was updated successfully, but these errors were encountered: