Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simpler top level status fields for total, succeeded and failed jobs #545

Closed
shrinandj opened this issue Apr 22, 2024 · 25 comments · Fixed by #594
Closed

Simpler top level status fields for total, succeeded and failed jobs #545

shrinandj opened this issue Apr 22, 2024 · 25 comments · Fixed by #594
Assignees

Comments

@shrinandj
Copy link

shrinandj commented Apr 22, 2024

What would you like to be added:
Currently, the status of a jobset object consists only of two arrays: conditions and replicatedJobsStatus. This is great for providing detailed status of individual conditions and replicated jobs.

However, it makes referencing the overall status of the jobset a little harder.

Along with the current conditions and replicatedJobsStatus fields, it will be useful to have at least 3 additional simple fields in the status indicating the overall status of the jobset, similar to K8s jobs:

status:
  totalJobs: 2
  succeededJobs: 2
  failedJobs: 0
  conditions:
  - lastTransitionTime: "2024-04-22T18:10:09Z"
....
  replicatedJobsStatus:
  - active: 0
...

Why is this needed:
We were trying to run Jobsets via Argo-workflows as part of a distributed training setup. Argo-workflows can create Kubernetes resources. In order to do this correctly, Argo allows users to specify a successCondition and failureCondition. These conditions need to be simple key=value conditions. Referencing array elements, filtering for specific conditions, etc. is not possible.

If the new fields are available, adding them to Argo or even using any other K8s tools would become that much easier:

      successCondition: status.succeededJobs = status.totalJobs
      failureCondition: status.failedJobs > 0
@kannon92
Copy link
Contributor

Is it not possible to use Conditions for this purpose? We emit a conditon with the job is successful or failed and you could check that.

@kannon92
Copy link
Contributor

The reason we have an array of this is because replicated jobs can all finish at different times so we can't really flatten the array to report jobs without that information.

@shrinandj
Copy link
Author

In some cases, it is not possible to use array notations and filters for specific conditions.

In the case of Argo workflows, they use a label selector to parse the success and failure conditions (which I see only works with key = value pattern).

Is it possible to have totalJobs in status populated when the jobset is created and then populate succeededJobs and failedJobs when the AllJobsCompleted condition is being populated?

@kannon92
Copy link
Contributor

The concern is that I think that would be insufficient for all use cases. If you had a lot of replicated jobs you may have some of the jobs finishing while others are not.

say JobSet with rep a, rep b, and rep c jobs starting in parallel. User really wants all of rep c to finish to declare the job successful but rep a and rep b finish and user uses a count based approach. This would end up with a case where you would declare the job successfully (if totalJobs is greater than value suggested). If one specified success_jobs = 2.

Other counter point is that we don't require certain replicated jobs to finish (say driver/worker where workers finish). In this case our job is successfully but the driver doesn't technically finish.

I think the condition is going to be the reliable approach.

@kannon92
Copy link
Contributor

@ahg-g @danielvegamyhre WDYT?

@shrinandj
Copy link
Author

The goal wouldn't be to remove existing conditions or replicatedJobStatuses. They have to be there to present the complete picture. The top level numerical metrics provide (a) a summary of overall status and (b) usable for cases like the one mentioned above.

@googs1025
Copy link
Member

googs1025 commented Apr 25, 2024

Maybe we can have a jobset's own status, so that argo workflow can also use this status to determine whether it is completed.

NAME             RESTARTS   COMPLETED   SUSPENDED   AGE.   Status
failure-policy   3                                  67m     Running
network-jobset                                      54m     Succeed
paralleljobs                True                    55m.    Failed

@shrinandj
Copy link
Author

Sure.. Having a Jobset's own top level status field simply indicating whether it is Running, Succeeded, Failed (or Suspended) would work too.

@kannon92
Copy link
Contributor

I think a phase field in status would be useful. I was surprised we don't actually have one.

@ahg-g @danielvegamyhre

WDYT of adding a phase field in JobSet.Status that has the options of "Running, Succeeded, Failed or Suspended"?

@danielvegamyhre
Copy link
Contributor

Yeah I agree. This would be useful for higher level tools which use JobSet as a building block, like XPK.

@kannon92
Copy link
Contributor

Nice idea @googs1025!

@ahg-g
Copy link
Contributor

ahg-g commented Apr 25, 2024

Conditions is the new "Phase". But if phase makes it easier to consume, I am fine with that.

@shrinandj
Copy link
Author

If there is a jobset with multiple jobs and say some of them are running while the others have failed, what would the status of the jobset be: Running? Failed?

One benefit of simply having the numbers (totalJobs, succeeded, failed) is that it might slightly less ambiguous and it leaves the caller to decide how to interpret the numbers. And if they care specifically about which job succeeded or failed, they can look at conditions or replicatedJobStatuses.

@danielvegamyhre
Copy link
Contributor

danielvegamyhre commented Apr 25, 2024

If there is a jobset with multiple jobs and say some of them are running while the others have failed, what would the status of the jobset be: Running? Failed?

One benefit of simply having the numbers (totalJobs, succeeded, failed) is that it might slightly less ambiguous and it leaves the caller to decide how to interpret the numbers. And if they care specifically about which job succeeded or failed, they can look at conditions or replicatedJobStatuses.

It would depend on the success policy and failure policy defined in the JobSet.

Defaults are:

  • if any job fails, the JobSet fails.
  • if all jobs succeed, the JobSet succeeds.

These policies are configurable though.

@ahg-g
Copy link
Contributor

ahg-g commented Apr 25, 2024

basically phase will be set once the JobSet reaches a terminal state

@kannon92
Copy link
Contributor

@ahg-g is correct. "Conditions" are the reliable option for most kubernetes controllers. Job uses conditions to detect certain states and other workflows support that.

It would be ideal to work with argo-workflows on supporting conditions for extended resources. For JobSet I think a phase is fine but I can't say the same for other projects.

Happy to support the jobset change though.

@shrinandj
Copy link
Author

Just to make this concrete, the new status field will be called phase and the status of the jobset will look like this, right?

status:
  phase: running # or suspended | succeeded | failed 
  conditions:
  - lastTransitionTime: "2024-04-22T18:10:09Z"
....
  replicatedJobsStatus:
  - active: 0
...

@jessicaxiejw
Copy link

@shrinandj Are you planning on merging the jobset integration with argo workflow upstream? I am looking into implementing something similar.

@shrinandj
Copy link
Author

Are you planning on merging the jobset integration with argo workflow upstream? I am looking into implementing something similar.

I haven't been able to get to this yet. If you want to go ahead with the implementation, I can help review it and test it and provide any other feedback.

Just to ensure we're on the same page: Argo can create any K8s resources based on a yaml/json file. But it needs a way to easily read back the status of the created object. This issue is to make that possible by having a simple high level status.

So, by "jobset integration with argo workflow", you mean adding this status field in the jobset object, correct?

@ahg-g
Copy link
Contributor

ahg-g commented May 21, 2024

I am supportive of this change if someone would like to take on this task, and we can track it for 0.7.0

@shrinandj
Copy link
Author

@jessicaxiejw Let me know when you have something ready to test. I can certainly take it out for a spin and provide feedback.

@danielvegamyhre
Copy link
Contributor

@jessicaxiejw if you want to implement this that would be super helpful! Please go ahead and assign yourself if you plan on working on this, or let us know if not and we can find another owner.

@jessicaxiejw
Copy link

jessicaxiejw commented Jun 4, 2024

Sorry about the late response. I am not working on it right now 😅 @shrinandj feel free taking a stab at it.

We decided to go with a completely different direction.

@googs1025
Copy link
Member

/assign

@googs1025
Copy link
Member

googs1025 commented Jun 5, 2024

I want to try it.
This is my opinion on the current situation.
@danielvegamyhre @kannon92

// the status of JobSet
const (
        // JobSetRunning means the job is running.
	JobSetRunning JobSetConditionType = "Running"
	// JobSetCompleted means the job has completed its execution.
	JobSetCompleted JobSetConditionType = "Completed"
	// JobSetFailed means the job has failed its execution.
	JobSetFailed JobSetConditionType = "Failed"
	// JobSetSuspended means the job is suspended.
	JobSetSuspended JobSetConditionType = "Suspended"
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants