KEP-4368: Job Managed By; Promote to GA #5583
base: master
Conversation
dejanzele commented Sep 29, 2025
- One-line PR description: Promote KEP-4368: Job Managed By to GA
- Issue link: Job API managed-by mechanism #4368
- Other comments: NONE
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
I would suggest updating the PR description, or the PR content, to clearly address the status of the graduation criteria. Here is my review; feel free to use it as a basis.
So far, with the adoption in MultiKueue, we have not had users requesting that extra debuggability. With the managedBy field being immutable, it seems the feature is not causing issues that require extensive debugging. I would suggest keeping these ideas as follow-up features and not blockers for the graduation.
I would say it could be a follow-up performance improvement, but it does not seem required. I have no record of MultiKueue users reporting performance issues on the management cluster due to the Job controller during SyncJob.
We already have a couple of implementations of the ManagedBy field that integrate with MultiKueue. These implementations follow the k8s core design, in particular making the field immutable. There are two slight differences:
Will be done in the follow-up PR.
/approve PRR
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: dejanzele, wojtek-t. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thank you!
LGTM from the Job PoV.
/lgtm
/hold
for other reviewers.
I totally agree with @mimowo.
+1, the metric and immutable field should be sufficient as a source of truth. Having other options seems like an extra work without an apparent benefit.
Nice to see the usage, thanks for compiling the list. Can we mention these projects in the KEP?
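For context on what these projects implement, a minimal sketch of a Job delegated to an external controller via `spec.managedBy` (the `kueue.x-k8s.io/multikueue` value is the one Kueue uses; any domain-prefixed value is allowed, with `kubernetes.io/job-controller` reserved for the built-in controller):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
spec:
  # Immutable after creation; the built-in Job controller skips this Job
  # and leaves status updates to the named external controller.
  managedBy: kueue.x-k8s.io/multikueue
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sleep", "1"]
```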
It is good to see that the length constraint ("The value cannot exceed 64 characters") is sufficient. But this validation is missing in some of the controllers. So I suppose it will throw an error if the value is longer than that?
> assess their conformance with k8s.

How do we want to approach this? Should we have a set of tests that these implementations have to pass?
stable: "v1.35"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
Can you please update the metric name to `jobs_by_external_controller_total` to reflect the reality? The README also needs an update.
- Re-evaluate the need to skip reconciliation in the event handlers to optimize performance
- Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and assess their conformance with k8s.
- Lock the feature gate
When are we going to bump the new metric stability level?
I guess we could do it in 1.35 in the implementation phase, wdyt @dejanzele?
###### What steps should be taken if SLOs are not being met to determine the problem?

N/A.
Do we have some new findings for the `Troubleshooting` section?
I'm not aware of any. We have worked with some users setting up MultiKueue, but it doesn't seem like spec.managedBy brings any particular troubleshooting concerns (so far at least, and to my knowledge).
#### GA

- Address reviews and bug reports from Beta users
Can we please update the e2e tests section and link the tests?
Oh, good spot. Well, if the other controllers, say KubeRay, allow more than 64 characters, then nothing will break for them. I think the idea of restricting it in core k8s was to make it more constrained.
One idea would be to just review a couple of aspects of the implementations, say:
Then we could compile these aspects into a table and get a bigger picture. If we think some of the aspects are important to align with k8s core, we could open an issue in the projects to "reconsider" that aspect. We may achieve better coherence in the ecosystem this way, but the maintainers of the projects may not prioritize full alignment.
This is the table I was able to compile by checking the implementations in the projects:
So, there is a bit of inconsistency about the "open set of values" vs the "closed set of values". I know that in the Kubeflow Trainer the motivation was to avoid the extra work on status validation, closing the set of "recognized" orchestrators to Kueue only (potentially more in the future). It seems like all projects consistently wanted to avoid the effort of validating the status, which we took on in the k8s Job. I think the main point is that all of them follow the key aspects: immutability and the skip inside Reconcile.
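The "skip inside Reconcile" aspect that all the implementations share boils down to an early return keyed on the field. A minimal sketch with hypothetical types (only the reserved default value `kubernetes.io/job-controller` is taken from the KEP):

```go
package main

import "fmt"

// builtInController is the reserved managedBy value that keeps the Job
// under the built-in Job controller's management.
const builtInController = "kubernetes.io/job-controller"

// JobSpec is a hypothetical stand-in for the relevant part of the Job API:
// ManagedBy is immutable once set; nil means the built-in controller.
type JobSpec struct {
	ManagedBy *string
}

// skipReconcile reports whether the built-in controller should ignore the
// Job because another controller declared ownership of it.
func skipReconcile(spec JobSpec) bool {
	return spec.ManagedBy != nil && *spec.ManagedBy != builtInController
}

func main() {
	kueue := "kueue.x-k8s.io/multikueue"
	fmt.Println(skipReconcile(JobSpec{ManagedBy: &kueue})) // external controller owns it
	fmt.Println(skipReconcile(JobSpec{}))                  // built-in default, reconcile normally
}
```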
Actually, on the "status" validation note, I'm wondering if we should plan to simplify this code by answering "true" for all. Maybe except "RejectFailedIndexesOverlappingCompleted", which can be considered a perf optimization (debatable). The main reason we introduced the checks was to gracefully handle Jobs which could have been updated by the Job controller in an older version of k8s, before the validation was introduced. However:
So, I'm wondering if it would be beneficial to clean up the code and perform the checks always; the question is "when". Some ideas that come to my mind:
I'm personally hesitant between (1.) and (3.). (2.) and (4.) seem more involved, but I'm also OK with going that way. wdyt @atiratree @soltysh @deads2k ?
If we wait long enough, declarative validation ratcheting will likely clean this up, but in the meantime we cannot risk making /status unwritable.
Except for the corner case issue:
Sure, I'm good with 4. I think turning the Reject... rules into "true" will not simplify the code dramatically anyway.
Makes sense. I think we can make the "JobManagedBy" feature gate GA in 1.35 and wait a couple of releases to assume all Jobs are good. Is there any k8s guidance for how long to wait in similar cases? I think Jobs are a bit safer than Deployments, which are meant to run indefinitely. Jobs are meant to run only for a finite amount of time, so having a Job surviving a year would be very unusual (I assume, but I know there might be exceptions).
I'm open to trying with a dedicated feature gate in the future. We'd likely leave the gate for four releases before locking it.