Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Running condition #571

Closed
tenzen-y opened this issue May 14, 2024 · 18 comments
Closed

Support Running condition #571

tenzen-y opened this issue May 14, 2024 · 18 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@tenzen-y
Copy link
Member

What would you like to be added:
I would like to support the new JobSetConditionType and Running so that researchers can easily find out if the JobSet is still running.

// These are built-in conditions of a JobSet.
const (
// JobSetCompleted means the job has completed its execution.
JobSetCompleted JobSetConditionType = "Completed"
// JobSetFailed means the job has failed its execution.
JobSetFailed JobSetConditionType = "Failed"
// JobSetSuspended means the job is suspended.
JobSetSuspended JobSetConditionType = "Suspended"
// JobSetStartupPolicyInProgress means the StartupPolicy is in progress.
JobSetStartupPolicyInProgress JobSetConditionType = "StartupPolicyInProgress"
// JobSetStartupPolicyCompleted means the StartupPolicy has completed.
JobSetStartupPolicyCompleted JobSetConditionType = "StartupPolicyCompleted"
)

However, the current batch/v1 Job doesn't support the Running condition: https://github.com/kubernetes/kubernetes/blob/4f04dffe5b2cd652a20b362eaea30164e3e5ea54/pkg/apis/batch/types.go#L612C4-L627

So, we need to define when we can say the Job/JobSet is still running. Starting this discussion from KEP #2804 might be better.

Why is this needed:
Currently, we represent a similar context with Suspend=False, Completed=False, and Failed=False, but that representation approach is not intuitive. So, I would like to express the Running state clearly.

@tenzen-y
Copy link
Member Author

cc: @andreyvelich @ahg-g

@tenzen-y
Copy link
Member Author

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 14, 2024
@googs1025
Copy link
Member

hi! @tenzen-y There are some discussions on this issue #545

@danielvegamyhre
Copy link
Contributor

I have considered adding an "Active" or "Running" field before but it would be inconsistent with the Job API, where as you noted the Active/Running state is implicit (i.e., not failed, completed or suspended).

@alculquicondor I am curious, is there a reason an "active" or "running" condition is not part of the Job status?

@tenzen-y
Copy link
Member Author

hi! @tenzen-y There are some discussions on this issue #545

Thank you for pointing this out. I would like to focus on the new condition type separate from counting in this issue.

@tenzen-y
Copy link
Member Author

I have considered adding an "Active" or "Running" field before but it would be inconsistent with the Job API, where as you noted the Active/Running state is implicit (i.e., not failed, completed or suspended).

@alculquicondor I am curious, is there a reason an "active" or "running" condition is not part of the Job status?

@danielvegamyhre I have the same concern as you. So, as I mentioned in the issue description, it might be better to start discussing the Running condition in kubernetes/enhancements#3524.

Additionally, actually, Aldo mentioned the Running condition in the batch/v1 Job here: kubernetes/enhancements#3524 (comment)

@alculquicondor
Copy link

The question is what does "running" mean, when you have multiple pods. A field such as ready (for the number of ready pods) in the job status, gives the controllers enough information without having to define what a "Running" condition would be.

Maybe we should aim for a similar value in jobset status? Maybe it can just add up all the active fields from all of its jobs?

@andreyvelich
Copy link

The question is what does "running" mean, when you have multiple pods.

Ideally, Running means that all PyTorch workers are scheduled on the Kubernetes control plane, pulled the image, and ready to process Model Training.
As of today, we marked PyTorchJob Running if master replica is active or one of the workers is active in case PyTorchJob doesn't have a master.
From my point of view, that is wrong, since model training can start only when all replicas are active.
Image pull might take significant amount of time especially for the large models, and Job should not be in Running condition during that time. cc @johnugeorge @tenzen-y

Maybe it can just add up all the active fields from all of its jobs?

That looks good to me.

@danielvegamyhre
Copy link
Contributor

Maybe it can just add up all the active fields from all of its jobs?

That looks good to me.

@andreyvelich @alculquicondor Since active includes pending pods I think we would want is the ready values from each child Job.

JobSet already tracks the number of "Ready" jobs for each ReplicatedJob in the ReplicatedJobStatus here. A child Job is counted as ready if ready pods + completed pods >= expected pod count for the job, where the expected pod count is minimum of parallelism and completions.

So we could consider a JobSet "Active/Running" if all ReplicatedJobs have replicatedJobStatus.Ready == replicatedJob.Spec.Replicas

@alculquicondor
Copy link

Oops, sorry, I meant to say ready indeed :)

@ahg-g
Copy link
Contributor

ahg-g commented May 21, 2024

So we could consider a JobSet "Active/Running" if all ReplicatedJobs have replicatedJobStatus.Ready == replicatedJob.Spec.Replicas

@tenzen-y Do we need to set a condition for that given that the info is already in the status?

@kannon92
Copy link
Contributor

Ping @tenzen-y @andreyvelich

Is this still necessary?

@andreyvelich
Copy link

Yeah, I know that @googs1025 did some work to improve JobSet conditions in this PR: #594.
Also, I understand that is hard to define what does Running mean exactly.

So maybe we could just follow @danielvegamyhre suggestion here: #571 (comment)

From my perspective we should add this condition to the Job, since user assumes the following conditions when they create training job:
Created -> Running -> Succeeded

Any thoughts @tenzen-y ?

@tenzen-y
Copy link
Member Author

@ahg-g @andreyvelich @kannon92 As we discussed here, I'm hestating to add this running condition since at first glance, this Running phase / condition can be introduce easily, but decision of Running is too hard.

So, I would like not to introduce the Running phase/condition in the JobSet side. But, I'm wondering if the Kubeflow Training v2 can introduce the Running condition type depending on the JobSet Terminal Phase (

// TerminalState the state of the JobSet when it finishes execution.
// It can be either Complete or Failed. Otherwise, it is empty by default.
TerminalState string `json:"terminalState,omitempty"`
).

@tenzen-y
Copy link
Member Author

In conclusion, I don't prefer to introduce the Running phase/condition in the JobSet side.

@kannon92
Copy link
Contributor

Sounds good. I think phase can suffice.

Maybe it’s something you can add to the TrainJob KEP and we can close this issue.

@tenzen-y
Copy link
Member Author

Sounds good. I think phase can suffice.

Maybe it’s something you can add to the TrainJob KEP and we can close this issue.

That makes sense. Anyway, let's close this issue now.
Once we can find the use cases that it can get over the drawbacks, we can revisit here.
/close

@k8s-ci-robot
Copy link
Contributor

@tenzen-y: Closing this issue.

In response to this:

Sounds good. I think phase can suffice.

Maybe it’s something you can add to the TrainJob KEP and we can close this issue.

That makes sense. Anyway, let's close this issue now.
Once we can find the use cases that it can get over the drawbacks, we can revisit here.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

8 participants