
Conversation

dejanzele
Contributor

  • One-line PR description: Promote KEP-4368: Job Managed By to GA
  • Other comments: NONE

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 29, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Sep 29, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Sep 29, 2025
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 29, 2025
@wojtek-t wojtek-t self-assigned this Sep 29, 2025
@mimowo
Contributor

mimowo commented Sep 30, 2025

I would suggest updating the PR description, or the PR content, to clearly state the status of each graduation criterion. Here is my review; feel free to use it as a basis.

So far, with the adoption in MultiKueue, we have not had users requesting that extra debuggability. With the managedBy field being immutable, it seems the feature is not causing issues that require extensive debugging. I would suggest keeping these ideas as follow-up features rather than blockers for graduation.

  • Re-evaluate the need to skip reconciliation in the event handlers to optimize performance

I would say it could be a follow-up performance improvement, but it does not seem required. I have no record of MultiKueue users reporting performance issues on the management cluster due to the Job controller during syncJob.
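For illustration, a minimal sketch of the skip-inside-sync pattern being discussed (the helper name is an assumption for illustration, not the upstream Job controller code):

```go
package jobcontroller

import (
    batchv1 "k8s.io/api/batch/v1"
)

// Default value of spec.managedBy when the built-in controller owns the Job.
const builtInControllerName = "kubernetes.io/job-controller"

// shouldSkipSync is an illustrative helper: the check lives in the sync path
// rather than in the event handlers, so the handlers stay cheap and
// unconditional, and the skip decision is made in exactly one place.
func shouldSkipSync(job *batchv1.Job) bool {
    return job.Spec.ManagedBy != nil && *job.Spec.ManagedBy != builtInControllerName
}
```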

  • Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and asses their conformance with k8s.

We already have a couple of implementations of the ManagedBy field in order to integrate with MultiKueue:

These implementations follow the k8s core design, in particular making the field immutable.

There are two slight differences:

Will be done in the follow-up PR.

PTAL @soltysh @atiratree @tenzen-y @mwielgus @kannon92

@wojtek-t
Member

wojtek-t commented Oct 3, 2025

/approve PRR

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dejanzele, wojtek-t
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@tenzen-y tenzen-y left a comment

Thank you!
LGTM from the Job PoV.

/lgtm
/hold
for other reviewers.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 3, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2025
@tenzen-y
Member

tenzen-y commented Oct 3, 2025


I totally agree with @mimowo.

Member

@atiratree atiratree left a comment

So far, with the adoption in MultiKueue we have not got users requesting that extra debuggability. With the managedBy field being immutable it seems the feature is not causing issues requiring extensive debug. I would suggest to keep this ideas as follow up features and not blockers for the graduation.

+1, the metric and the immutable field should be sufficient as a source of truth. Adding other options seems like extra work without an apparent benefit.

We have already a couple of implementations of the ManagedBy field in order to integrate with MultiKueue:

Nice to see the usage, thanks for compiling the list. Can we mention these projects in the KEP?

It is good to see that the length constraint (The value cannot exceed 64 characters) is sufficient. But this validation is missing in some of the controllers. So I suppose it will throw an error if it is longer than that?

asses their conformance with k8s.

How do we want to approach this? Should we have a set of tests that these implementations have to pass?

stable: "v1.35"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
Member

@atiratree atiratree Oct 5, 2025

Can you please update the metric name to jobs_by_external_controller_total to reflect reality? The README also needs an update.
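For reference, a minimal sketch of the metric shape implied here, using k8s.io/component-base/metrics (the subsystem, help text, and stability level are assumptions for illustration):

```go
package jobmetrics

import (
    "k8s.io/component-base/metrics"
)

// Counter of Jobs handed off to an external controller, partitioned by the
// value of spec.managedBy; shape assumed for illustration.
var jobsByExternalControllerTotal = metrics.NewCounterVec(
    &metrics.CounterOpts{
        Subsystem:      "job_controller", // assumed subsystem
        Name:           "jobs_by_external_controller_total",
        Help:           "The number of Jobs managed by an external controller, by controller name.",
        StabilityLevel: metrics.ALPHA, // bumping this is discussed in the next thread
    },
    []string{"controller_name"},
)
```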

  - Re-evaluate the need to skip reconciliation in the event handlers to optimize performance
- - Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and asses their conformance with k8s.
+ - Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and assess their conformance with k8s.
  - Lock the feature gate
Member

When are we going to bump the new metric's stability level?

Contributor

I guess we could do it in 1.35 in the implementation phase, wdyt @dejanzele

###### What steps should be taken if SLOs are not being met to determine the problem?

N/A.

Member

Do we have any new findings for the Troubleshooting section?

Contributor

I'm not aware of any. We have worked with some users setting up MultiKueue, but it doesn't seem like spec.managedBy has introduced any particular troubleshooting needs (so far at least, and to my knowledge).


#### GA

- Address reviews and bug reports from Beta users
Member

Can we please update the e2e tests section and link the tests?

@mimowo
Contributor

mimowo commented Oct 6, 2025

It is good to see that the length constraint (The value cannot exceed 64 characters) is sufficient. But this validation is missing in some of the controllers. So I suppose it will throw an error if it is longer than that?

Oh, good spot. Well, if the other controllers, say KubeRay, allow more than 64 characters, then nothing will break for them. I think the idea of restricting it in core k8s was to keep the field more constrained.

How do we want to approach this? Should we have a set of tests that these implementations have to pass?

One idea would be to just review a couple of aspects of the implementations, say:

  • validation for spec.managedBy to be at most 63 characters long
  • validation for spec.managedBy to have a specific format
  • validation for the status
  • immutability
  • skip inside event handlers, vs. controller Reconcile

Then we could compile these aspects into a table and get a bigger picture. If we think some of the aspects are important for following k8s core, we could open issues in the projects to "reconsider" them. This way we may achieve better coherence in the ecosystem, but the maintainers of the projects may not prioritize full alignment.
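As a rough illustration of these aspects, a minimal sketch of what such validation could look like (the helper name and error wording are assumptions, not any project's actual code):

```go
package jobvalidation

import (
    "fmt"
    "strings"
)

// validateManagedBy is an illustrative helper covering the aspects listed above.
func validateManagedBy(oldValue, newValue *string) []error {
    var errs []error
    // Immutability: once set (including by defaulting), the field cannot change.
    if oldValue != nil && newValue != nil && *oldValue != *newValue {
        errs = append(errs, fmt.Errorf("spec.managedBy is immutable"))
    }
    if newValue != nil {
        // Length: at most 63 characters.
        if len(*newValue) > 63 {
            errs = append(errs, fmt.Errorf("spec.managedBy must be no more than 63 characters"))
        }
        // Format: a domain-prefixed path, e.g. "kueue.x-k8s.io/multikueue";
        // this keeps the set of values open, unlike the closed-set projects above.
        if !strings.Contains(*newValue, "/") {
            errs = append(errs, fmt.Errorf("spec.managedBy must be a domain-prefixed path"))
        }
    }
    return errs
}
```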

@dejanzele dejanzele mentioned this pull request Oct 6, 2025
@mimowo
Contributor

mimowo commented Oct 6, 2025

Then we could compile these aspects into a table and get a bigger picture.

This is the table I was able to compile by checking the implementations in the projects:

| Aspect | k8s Job | JobSet | Kubeflow Trainer v1 | Kubeflow Trainer v2 | KubeRay | Tekton Pipelines | AppWrapper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| open set of allowed values | y | y | n | n | n | y | y |
| max length 63 | y | y | y (by closed set of options) | y (by closed set of options) | y (indirectly by closed set) | n | n |
| specific format | y | y | y (by closed set of options) | y (by closed set of options) | y (by closed set of options) | n | n |
| immutability | y | y | y | y | y | y | y |
| validation for the status | y | n | n | n | n | n | n |
| skip inside Reconcile | y | y | y | y | y | y (also in filtering) | y |

So, there is a bit of inconsistency between an "open set of values" and a "closed set of values". I know that in Kubeflow Trainer the motivation was to avoid the extra work on status validation by closing the set of "recognized" orchestrators to Kueue only (potentially more in the future). It seems all projects consistently wanted to avoid the effort of validating the status, which we took on in the k8s Job.

I think the main point is that all of them follow the key aspects: immutability and skipping inside Reconcile.

@mimowo
Contributor

mimowo commented Oct 6, 2025

Actually, on the "status" validation note, I'm wondering if we should plan to simplify this code by answering "true" for all. Maybe except "RejectFailedIndexesOverlappingCompleted", which can be considered a perf optimization (debatable).

The main reason we introduced the checks was to gracefully handle Jobs which could have been updated by the Job controller in an older version of k8s before the validation was introduced.

However:

  1. all Jobs created in 1.32+ wouldn't require the graceful handling, as the rules prevent entering an invalid state
  2. older Jobs are very unlikely to violate the rules, because we haven't had any reports of the rules being violated since the validation was enabled by default (except for "Fix validation for Job with suspend=true,completions=0 to set Complete condition" kubernetes#132614, but there the Jobs wouldn't compute anything useful anyway).

So, I'm wondering whether it would be beneficial to clean up the code and always perform the checks; the question is "when". Some ideas that come to mind:

  1. just now, we have waited long enough to be safe
  2. later under a dedicated feature gate, part of the KEP
  3. later without a feature gate, just send a PR in 1.36 or so
  4. later under a dedicated feature gate in the future

I'm personally hesitating between (1.) and (3.); (2.) and (4.) seem more involved, but I'm also OK with going that way.
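A minimal sketch of what (2.)/(4.) could look like; the gate name and the legacy helper are hypothetical, and the options struct is abbreviated to the field named above:

```go
package jobstrategy

import (
    utilfeature "k8s.io/apiserver/pkg/util/feature"
    featuregate "k8s.io/component-base/featuregate"
    batch "k8s.io/kubernetes/pkg/apis/batch"
    batchvalidation "k8s.io/kubernetes/pkg/apis/batch/validation"
)

// JobStatusValidationCleanup is a hypothetical gate name for illustration.
const JobStatusValidationCleanup featuregate.Feature = "JobStatusValidationCleanup"

func statusValidationOptions(oldJob *batch.Job) batchvalidation.JobStatusValidationOptions {
    if utilfeature.DefaultFeatureGate.Enabled(JobStatusValidationCleanup) {
        // Answer "true" for all checks, regardless of the old object's state.
        return batchvalidation.JobStatusValidationOptions{
            RejectFailedIndexesOverlappingCompleted: true,
            // ... remaining Reject* options, also set to true
        }
    }
    // Current behavior: stay lenient towards Jobs last written before the
    // validation existed.
    return legacyStatusValidationOptions(oldJob)
}

// legacyStatusValidationOptions stands in for the current old-object-aware logic.
func legacyStatusValidationOptions(oldJob *batch.Job) batchvalidation.JobStatusValidationOptions {
    return batchvalidation.JobStatusValidationOptions{}
}
```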

wdyt @atiratree @soltysh @deads2k ?

@deads2k
Contributor

deads2k commented Oct 6, 2025

So, I'm wondering whether it would be beneficial to clean up the code and always perform the checks; the question is "when". Some ideas that come to mind:

  1. just now, we have waited long enough to be safe
  2. later under a dedicated feature gate, part of the KEP
  3. later without a feature gate, just send a PR in 1.36 or so
  4. later under a dedicated feature gate in the future

I'm personally hesitating between (1.) and (3.); (2.) and (4.) seem more involved, but I'm also OK with going that way.

(1.) and (3.) are definitely out. Is that table saying we have no known writers of violating data? If we have never had known writers of violating data, then I'm comfortable with (4.). If we have ever had a known writer of violating data, then we cannot guarantee that all of the invalid data has been cleaned up.

If we wait long enough, declarative validation ratcheting will likely clean this up, but in the meantime we cannot risk making /status unwritable.

@mimowo
Copy link
Contributor

mimowo commented Oct 6, 2025

Is that table saying we have no known writers of violating data?

Except for the corner case issue:

  • in 1.32+ there is no writer, as the validation is "on" (assuming the Beta feature is enabled)
  • prior to 1.32 it could only be the Job controller, but I cannot recall any specific issue violating the rules

If we have never had known writers of violating data, then I'm comfortable with 4.

Sure, I'm good with (4.). I think turning the Reject... rules into "true" will not simplify the code dramatically anyway.

If we wait long enough, declarative validation ratcheting will likely clean this up, but in the meantime we cannot risk making /status unwritable.

Makes sense. I think we can make the "JobManagedBy" feature gate "GA" in 1.35 and wait a couple of releases to assume all Jobs are good. Is there any k8s guidance for how long to wait in similar cases?

I think "jobs" are a bit safer than "Deployments" which are meant to run "Indefinitely". Jobs are meant to run only for a finite amount of time, so having a Job surviving a year would be very unusual ( I assume, but I know there might be exceptions ).

@deads2k
Contributor

deads2k commented Oct 6, 2025

I'm open to trying a dedicated feature gate in the future. We'd likely leave the gate in place for four releases before locking it.
