Support Running condition #571
Comments
cc: @andreyvelich @ahg-g

/kind feature
I have considered adding an "Active" or "Running" field before, but it would be inconsistent with the Job API, where, as you noted, the Active/Running state is implicit (i.e., not failed, completed, or suspended). @alculquicondor I am curious: is there a reason an "active" or "running" condition is not part of the Job status?
@danielvegamyhre I have the same concern as you. So, as I mentioned in the issue description, it might be better to start by discussing what "running" should mean in the KEP. Additionally, Aldo raised a related point.
The question is what "running" means when you have multiple pods. The Job status already exposes a `ready` count for pods. Maybe we should aim for a similar value in the jobset status? Maybe it can just add up all the `ready` counts from the child Jobs.
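The aggregation suggested above can be sketched in a few lines. The struct below is a simplified stand-in for the batch/v1 `JobStatus` (the real type lives in `k8s.io/api/batch/v1`); it is illustrative only, not the actual JobSet API.

```go
package main

import "fmt"

// JobStatus is a simplified stand-in for batch/v1 Job status; only the
// ready count matters for this sketch.
type JobStatus struct {
	Ready int32 // pods with the Ready condition, per the Job's status.ready
}

// totalReady adds up the ready counts of all child Jobs, one way a
// jobset-level "ready" value could be derived.
func totalReady(children []JobStatus) int32 {
	var sum int32
	for _, c := range children {
		sum += c.Ready
	}
	return sum
}

func main() {
	children := []JobStatus{{Ready: 2}, {Ready: 3}, {Ready: 0}}
	fmt.Println(totalReady(children)) // total ready pods across child Jobs
}
```

Such a summed value would tell users how many pods are ready overall, but by itself it still would not distinguish "all workers up" from "partially up", which is why the discussion below turns to per-ReplicatedJob readiness.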
Ideally, Running means that all PyTorch workers have been scheduled by the Kubernetes control plane, have pulled the image, and are ready to process model training.
That looks good to me.
@andreyvelich @alculquicondor JobSet already tracks the number of "Ready" jobs for each ReplicatedJob in the ReplicatedJobStatus here; a child Job is counted as ready based on its ready pod count. So we could consider a JobSet "Active/Running" if all ReplicatedJobs report all of their child Jobs as ready.
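The check proposed above can be sketched as follows. `ReplicatedJobStatus` here is a simplified mirror of JobSet's type in `jobset/api/jobset/v1alpha2`; the field names are assumptions for illustration, not the exact API.

```go
package main

import "fmt"

// ReplicatedJobStatus is a simplified mirror of JobSet's per-ReplicatedJob
// status; only the fields needed for this sketch are included.
type ReplicatedJobStatus struct {
	Name     string
	Replicas int32 // desired child Jobs for this ReplicatedJob (assumed field)
	Ready    int32 // child Jobs currently counted as ready
}

// allReplicatedJobsReady treats the JobSet as "Active/Running" once every
// ReplicatedJob reports all of its child Jobs as ready, as proposed above.
func allReplicatedJobsReady(statuses []ReplicatedJobStatus) bool {
	for _, s := range statuses {
		if s.Ready < s.Replicas {
			return false
		}
	}
	return true
}

func main() {
	statuses := []ReplicatedJobStatus{
		{Name: "leader", Replicas: 1, Ready: 1},
		{Name: "workers", Replicas: 4, Ready: 3},
	}
	fmt.Println(allReplicatedJobsReady(statuses)) // workers not fully ready yet
}
```

Note that a real implementation would also need to decide how succeeded child Jobs count toward readiness, which the thread does not settle.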
Oops, sorry, I meant to say
@tenzen-y Do we need to set a condition for that, given that the info is already in the status?
Ping @tenzen-y @andreyvelich: is this still necessary?
Yeah, I know that @googs1025 did some work to improve JobSet conditions in this PR: #594. So maybe we could just follow @danielvegamyhre's suggestion here: #571 (comment). From my perspective, we should add this condition to JobSet, since users assume such a condition when they create a training job. Any thoughts @tenzen-y?
@ahg-g @andreyvelich @kannon92 As we discussed here, I'm hesitant to add this running condition since, at first glance, it duplicates information that is already in the status. So, I would prefer not to introduce it.

jobset/api/jobset/v1alpha2/jobset_types.go, lines 137 to 139 in 56c77da
In conclusion, I would prefer not to introduce the Running condition.
Sounds good. I think phase can suffice. Maybe it's something you can add to the TrainJob KEP, and we can close this issue.
That makes sense. Anyway, let's close this issue now.
@tenzen-y: Closing this issue.
What would you like to be added:

I would like to support the new `Running` `JobSetConditionType` so that researchers can easily find out if the JobSet is still running.

jobset/api/jobset/v1alpha2/jobset_types.go, lines 50 to 62 in 235745c
However, the current batch/v1 Job doesn't support the Running condition: https://github.com/kubernetes/kubernetes/blob/4f04dffe5b2cd652a20b362eaea30164e3e5ea54/pkg/apis/batch/types.go#L612C4-L627
So, we need to define when we can say the Job/JobSet is still running. Starting this discussion from KEP #2804 might be better.
Why is this needed:

Currently, we represent a similar context with `Suspend=False`, `Completed=False`, and `Failed=False`, but that representation is not intuitive. So, I would like to express the Running state explicitly.
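The indirect check described above can be sketched as follows. The `Condition` struct is a minimal stand-in for `metav1.Condition`, and the condition type names follow the discussion in this thread; they may not match the exact API constants.

```go
package main

import "fmt"

// Condition is a minimal stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True" or "False"
}

// isImplicitlyRunning shows the check users must perform today: a JobSet is
// considered "running" only because none of Suspended, Completed, or Failed
// is true. There is no positive Running signal; the state is inferred by
// elimination.
func isImplicitlyRunning(conds []Condition) bool {
	for _, c := range conds {
		switch c.Type {
		case "Suspended", "Completed", "Failed":
			if c.Status == "True" {
				return false
			}
		}
	}
	return true
}

func main() {
	conds := []Condition{{Type: "Suspended", Status: "False"}}
	fmt.Println(isImplicitlyRunning(conds)) // running, but only by elimination
}
```

A dedicated `Running` condition would replace this negative-space inference with a single positive check, which is the intuitiveness argument the issue makes.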