KEP-2170: Add the TrainJob state transition design #2298

tenzen-y · 2024-10-20T21:51:54Z

What this PR does / why we need it:
I added the TrainJob state machine design.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #2207
Relates to: #2170

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2024-10-20T21:51:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from tenzen-y. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tenzen-y · 2024-10-20T21:52:22Z

/hold
/assign @kubeflow/wg-training-leads

coveralls · 2024-10-20T21:56:55Z

Pull Request Test Coverage Report for Build 11430149305

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 11426928089:	0.0%
Covered Lines:	73
Relevant Lines:	73

💛 - Coveralls

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

andreyvelich

Thank you for this @tenzen-y!
Please take a look @kubeflow/wg-training-leads @kannon92 @akshaychitneni @varshaprasad96 @ahg-g @vsoch @danielvegamyhre

andreyvelich · 2024-10-21T17:39:56Z

docs/proposals/2170-kubeflow-training-v2/README.md

+	// TrainJobSuspended means the TrainJob is suspended.
+	TrainJobSuspended string = "Suspended"
+
+	// TrainJobCompleted means that the actual jobs have completed its execution.


Should it be TrainJob, and similar for other statuses ?

Suggested change

// TrainJobCompleted means that the actual jobs have completed its execution.

// TrainJobCompleted means that the TrainJob has completed its execution.

andreyvelich · 2024-10-21T17:42:16Z

docs/proposals/2170-kubeflow-training-v2/README.md

@@ -279,6 +279,43 @@ type TrainJob struct {
 	Status TrainJobStatus `json:"status,omitempty"`
 }

+const (


Can we define those consts as TrainJobConditionType similar to Batch/Job: https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L617 ?

I don't think so since our conditions typed are metav1.Condition and the Type is typed string.
The batch/v1 Job uses the typed JobConditionType since Job conditions typed are dedicated JobCondition: https://github.com/kubernetes/api/blob/2be23f6c5a7184af9846ff458a11751765ea2bde/batch/v1/types.go#L662

When we use the dedicated typed TrainJobConditionType, we need to cast the TrainJobConditionType to string everywhere. That is not ideal.

Oh, you are right.

andreyvelich · 2024-10-21T17:43:30Z

docs/proposals/2170-kubeflow-training-v2/README.md

+
+	// TrainJobResumedReason is the "Suspended" condition reason.
+	// When the TrainJob suspension is changed from True to False, this is added.
+	TrainJobResumedReason string = "Resumed"


@tenzen-y @kannon92 Do we have that status reason in JobSet or Job ?

Yes, we have. In the batch/v1 Job level, JobResumed is used as a reason for the Suspended condition type.
In the JobSet level, ResumeJobs is used as a reason for the Suspended condition type.

andreyvelich · 2024-10-21T17:45:25Z

docs/proposals/2170-kubeflow-training-v2/README.md

+	// TrainJobJobsCreationSucceededReason is the "Created" condition reason.
+	// When the creating objects succeeded after building succeeded, this is added.
+	TrainJobJobsCreationSucceededReason string = "JobsCreationSucceeded"
+
+	// TrainJobJobsBuildFailedReason is the "Created" condition reason.
+	// When the building objects based on the TrainJob and the specified runtime failed,
+	// this is added.
+	TrainJobJobsBuildFailedReason string = "JobsBuildFailed"
+
+	// TrainJobJobsCreationFailedReason is "Created" condition reason.
+	// When the creating objects failed even though building succeeded, this is added.
+	TrainJobJobsCreationFailedReason string = "JobsCreationFailed"


Should we introduce those conditions in the 2nd iteration ?
I think, we should discuss if users really want to decouple conditions when TrainJob's object creation failed and when TrainJob fails.

These are reasons, so we use these in the following. Additionally, if we do not distinguish JobCreationFailed and JobBuildFailed, it is hard to understand which part failed. So, these reasons must be included in the first phase.

JobsBuildFailed:

training-operator/pkg/controller.v2/trainjob_controller.go

Line 71 in 9e04bdd

objs, err := run.NewObjects(ctx, trainJob)

JobsCreationFailed:

training-operator/pkg/controller.v2/trainjob_controller.go

Line 98 in 9e04bdd

return creationErr

status: conditions: - type: Created status: false reason: JobsCreationFailed

status: conditions: - type: Created status: false reason: JobsBuildFailed

status: conditions: - type: Created status: true reason: JobsCreationSucceeded

Since we have validation webhook, what is the use-case you see when we can hit the JobsBuildFailed reason ?

We can not validate in advance if NewObject succeed since we can not know in advance what happens during the NewObject.

Only during reconciling, we can know the actual error.

So, I guess one of the examples could be if NewObject calls Kubernetes API server and this call fails.
In that case, we want to transition TrainJob to JobsBuildFailed status.

In that case, we want to transition TrainJob to JobsBuildFailed status.

Yes, that's right. In the current JobSet plugin, the building error could happen in the following:

training-operator/pkg/runtime.v2/framework/plugins/jobset/jobset.go

Line 93 in fd4d102

return nil, err

training-operator/pkg/runtime.v2/framework/plugins/jobset/jobset.go

Line 98 in fd4d102

return nil, err

Additionally, every plugin could return errors during Kubeflow Job Pipeline Framerowk.

andreyvelich · 2024-10-21T17:49:05Z

docs/proposals/2170-kubeflow-training-v2/README.md

+  - type: JobSetSuspended
+    status: false
+  - type: JobSetCompleted
+    status: true


Why do we need to propagate JobSet conditions to the TrainJob ?

I propagated these JobSet conditions so that runtimes can be propagated to the arbitrary conditions.
But, in the JobSet case, these seem to be slightly redundant. So, I'm wondering if we only extend the runtime interface so that the plugin can propagate arbitrary conditions, but JobSet plugin does not propagate own conditions, then TrainJob does not have the JobSet conditions.

But, I guess that the JobSet StartupPolocy conditions could help to understand what happens in the actual Jobs.
So, we can propagate only StartupPolicyInProgress and StartupPolicyCompleted.

https://github.com/kubernetes-sigs/jobset/blob/440f53e13ed1db15d4f1d5a04c2450e74df4d1e8/api/jobset/v1alpha2/jobset_types.go#L75-L78

@andreyvelich WDYT?

So, I'm wondering if we only extend the runtime interface so that the plugin can propagate arbitrary conditions, but JobSet plugin does not propagate own conditions, then TrainJob does not have the JobSet conditions.

I would suggest that we add those once we have initial implementation of TrainJob status.
I think, this is valid option, but we should discuss what conditions be be propagated by runtime plugins.

Any thoughts @tenzen-y ?

I would suggest that we add those once we have initial implementation of TrainJob status.
I think, this is valid option, but we should discuss what conditions be be propagated by runtime plugins.

I think that we need to implement the conditions propagation mechanism, but it's not clear in the JobSet situation. Especially, which conditions should be propagated to the TrainJob. So, I would propose that we extend the runtime interface so that plugins can propagate the conditions to TrainJob. But, in the first status implementation, we do not propagate any JobSet conditions to the TrainJob.

What do you think? @andreyvelich

So, I would propose that we extend the runtime interface so that plugins can propagate the conditions to TrainJob.

Sure, we can have this. Maybe we should create a tracking issue to implement it ?
I don't think is has a high priority compare to other tasks.

Yes, sure. I will open the issue.
But, let me mention the propagation mechanism in this KEP.

akshaychitneni · 2024-10-22T14:44:09Z

docs/proposals/2170-kubeflow-training-v2/README.md

+[...]
+status:
+  conditions:
+  - type: Created


Would it be helpful to add transition time and also reason for failure within conditions? I believe it would be help debugging failures

This is pseudo conditions. So, we actually support all fields supported in metav1.Conditions including transition time as you can see there:

training-operator/pkg/apis/kubeflow.org/v2alpha1/trainjob_types.go

Lines 261 to 262 in 66ebecd

// Conditions for the TrainJob.

Conditions []metav1.Condition `json:"conditions,omitempty"`

google-oss-prow bot requested review from jinchihe and kuizhiqing October 20, 2024 21:51

google-oss-prow bot added the size/L label Oct 20, 2024

google-oss-prow bot added the do-not-merge/hold label Oct 20, 2024

tenzen-y force-pushed the add-trainjob-state-transition branch from 851738c to 069345f Compare October 20, 2024 22:00

google-oss-prow bot added size/M and removed size/L labels Oct 20, 2024

tenzen-y force-pushed the add-trainjob-state-transition branch from 069345f to 39bfab2 Compare October 20, 2024 22:05

KEP-2170: Add the TrainJob state transition design

8ba2206

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

tenzen-y force-pushed the add-trainjob-state-transition branch from 39bfab2 to 8ba2206 Compare October 20, 2024 22:28

andreyvelich reviewed Oct 21, 2024

View reviewed changes

tenzen-y mentioned this pull request Oct 21, 2024

KEP-2170: Decouple JobSet from TrainJob #2296

Open

1 task

akshaychitneni reviewed Oct 22, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KEP-2170: Add the TrainJob state transition design #2298

KEP-2170: Add the TrainJob state transition design #2298

tenzen-y commented Oct 20, 2024

google-oss-prow bot commented Oct 20, 2024

tenzen-y commented Oct 20, 2024

coveralls commented Oct 20, 2024 •

edited

Loading

andreyvelich left a comment

andreyvelich Oct 21, 2024

tenzen-y Oct 23, 2024

andreyvelich Oct 21, 2024 •

edited

Loading

tenzen-y Oct 21, 2024

andreyvelich Oct 21, 2024

andreyvelich Oct 21, 2024

tenzen-y Oct 21, 2024 •

edited

Loading

andreyvelich Oct 21, 2024

tenzen-y Oct 21, 2024 •

edited

Loading

andreyvelich Oct 21, 2024

tenzen-y Oct 21, 2024

andreyvelich Oct 21, 2024

tenzen-y Oct 21, 2024

andreyvelich Oct 21, 2024

tenzen-y Oct 21, 2024

andreyvelich Oct 21, 2024

tenzen-y Oct 21, 2024

andreyvelich Oct 21, 2024

tenzen-y Oct 21, 2024

akshaychitneni Oct 22, 2024

tenzen-y Oct 22, 2024 •

edited

Loading

	// TrainJobCompleted means that the actual jobs have completed its execution.
	// TrainJobCompleted means that the TrainJob has completed its execution.

	// Conditions for the TrainJob.
	Conditions []metav1.Condition `json:"conditions,omitempty"`

KEP-2170: Add the TrainJob state transition design #2298

Are you sure you want to change the base?

KEP-2170: Add the TrainJob state transition design #2298

Conversation

tenzen-y commented Oct 20, 2024

google-oss-prow bot commented Oct 20, 2024

tenzen-y commented Oct 20, 2024

coveralls commented Oct 20, 2024 • edited Loading

Pull Request Test Coverage Report for Build 11430149305

Details

💛 - Coveralls

andreyvelich left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andreyvelich Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tenzen-y Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tenzen-y Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tenzen-y Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

coveralls commented Oct 20, 2024 •

edited

Loading

andreyvelich Oct 21, 2024 •

edited

Loading

tenzen-y Oct 21, 2024 •

edited

Loading

tenzen-y Oct 21, 2024 •

edited

Loading

tenzen-y Oct 22, 2024 •

edited

Loading