
Conversation

dejanzele
Contributor

  • One-line PR description: Promote KEP-4368: Job Managed By to GA
  • Other comments: NONE

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 29, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Sep 29, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Sep 29, 2025
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 29, 2025
@wojtek-t wojtek-t self-assigned this Sep 29, 2025
@mimowo
Contributor

mimowo commented Sep 30, 2025

I would suggest updating the PR description, or the PR content, to clearly state the status of each graduation criterion. Here is my review; feel free to use it as a basis.

So far, with the adoption in MultiKueue, we have not had users requesting that extra debuggability. With the managedBy field being immutable, it seems the feature is not causing issues that require extensive debugging. I would suggest keeping these ideas as follow-up features rather than blockers for graduation.

  • Re-evaluate the need to skip reconciliation in the event handlers to optimize performance

I would say it could be a follow-up performance improvement, but it does not seem required. I have no record of MultiKueue users reporting performance issues on the management cluster due to the Job controller during syncJob.
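For illustration, a minimal sketch of the skip-inside-sync pattern being discussed (the helper name is an assumption for illustration, not the upstream Job controller code):

```go
package jobcontroller

import (
    batchv1 "k8s.io/api/batch/v1"
)

// Default value of spec.managedBy when the built-in controller owns the Job.
const builtInControllerName = "kubernetes.io/job-controller"

// shouldSkipSync is an illustrative helper: the check lives in the sync path
// rather than in the event handlers, so the handlers stay cheap and
// unconditional, and the skip decision is made in exactly one place.
func shouldSkipSync(job *batchv1.Job) bool {
    return job.Spec.ManagedBy != nil && *job.Spec.ManagedBy != builtInControllerName
}
```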

  • Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and asses their conformance with k8s.

We already have a couple of implementations of the ManagedBy field in order to integrate with MultiKueue:

These implementations follow the k8s core design, in particular making the field immutable.

There are two slight differences:

Will be done in the follow-up PR.

PTAL @soltysh @atiratree @tenzen-y @mwielgus @kannon92

@wojtek-t
Member

wojtek-t commented Oct 3, 2025

/approve PRR

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dejanzele, wojtek-t
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@tenzen-y tenzen-y left a comment

Thank you!
LGTM from the Job PoV.

/lgtm
/hold
for other reviewers.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 3, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2025
@tenzen-y
Member

tenzen-y commented Oct 3, 2025


I totally agree with @mimowo.

Member

@atiratree atiratree left a comment

So far, with the adoption in MultiKueue we have not got users requesting that extra debuggability. With the managedBy field being immutable it seems the feature is not causing issues requiring extensive debug. I would suggest to keep this ideas as follow up features and not blockers for the graduation.

+1, the metric and the immutable field should be sufficient as a source of truth. Adding other options seems like extra work without an apparent benefit.

We have already a couple of implementations of the ManagedBy field in order to integrate with MultiKueue:

Nice to see the usage, thanks for compiling the list. Can we mention these projects in the KEP?

It is good to see that the length constraint (The value cannot exceed 64 characters) is sufficient. But this validation is missing in some of the controllers. So I suppose it will throw an error if it is longer than that?

asses their conformance with k8s.

How do we want to approach this? Should we have a set of tests that these implementations have to pass?

stable: "v1.35"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
Member

@atiratree atiratree Oct 5, 2025

Can you please update the metric name to jobs_by_external_controller_total to reflect reality? The README also needs an update.
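For reference, a minimal sketch of the metric shape implied here, using k8s.io/component-base/metrics (the subsystem, help text, and stability level are assumptions for illustration):

```go
package jobmetrics

import (
    "k8s.io/component-base/metrics"
)

// Counter of Jobs handed off to an external controller, partitioned by the
// value of spec.managedBy; shape assumed for illustration.
var jobsByExternalControllerTotal = metrics.NewCounterVec(
    &metrics.CounterOpts{
        Subsystem:      "job_controller", // assumed subsystem
        Name:           "jobs_by_external_controller_total",
        Help:           "The number of Jobs managed by an external controller, by controller name.",
        StabilityLevel: metrics.ALPHA, // bumping this is discussed in the next thread
    },
    []string{"controller_name"},
)
```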

  - Re-evaluate the need to skip reconciliation in the event handlers to optimize performance
- - Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and asses their conformance with k8s.
+ - Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and assess their conformance with k8s.
  - Lock the feature gate
Member

When are we going to bump the new metric's stability level?

Contributor

I guess we could do it in 1.35 in the implementation phase, wdyt @dejanzele

###### What steps should be taken if SLOs are not being met to determine the problem?

N/A.

Member

Do we have any new findings for the Troubleshooting section?

Contributor

I'm not aware of any. We have worked with some users setting up MultiKueue, but it doesn't seem like spec.managedBy has introduced any particular troubleshooting needs (so far at least, and to my knowledge).


#### GA

- Address reviews and bug reports from Beta users
Member

Can we please update the e2e tests section and link the tests?

@mimowo
Contributor

mimowo commented Oct 6, 2025

It is good to see that the length constraint (The value cannot exceed 64 characters) is sufficient. But this validation is missing in some of the controllers. So I suppose it will throw an error if it is longer than that?

Oh, good spot. Well, if the other controllers, say KubeRay, allow more than 64 characters, then nothing will break for them. I think the idea of restricting it in core k8s was to keep the field more constrained.

How do we want to approach this? Should we have a set of tests that these implementations have to pass?

One idea would be to just review a couple of aspects of the implementations, say:

  • validation for spec.managedBy to be at most 63 characters long
  • validation for spec.managedBy to have a specific format
  • validation for the status
  • immutability
  • skip inside event handlers, vs. controller Reconcile

Then we could compile these aspects into a table and get a bigger picture. If we think some of the aspects are important for following k8s core, we could open issues in the projects to "reconsider" them. This way we may achieve better coherence in the ecosystem, but the maintainers of the projects may not prioritize full alignment.
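As a rough illustration of these aspects, a minimal sketch of what such validation could look like (the helper name and error wording are assumptions, not any project's actual code):

```go
package jobvalidation

import (
    "fmt"
    "strings"
)

// validateManagedBy is an illustrative helper covering the aspects listed above.
func validateManagedBy(oldValue, newValue *string) []error {
    var errs []error
    // Immutability: once set (including by defaulting), the field cannot change.
    if oldValue != nil && newValue != nil && *oldValue != *newValue {
        errs = append(errs, fmt.Errorf("spec.managedBy is immutable"))
    }
    if newValue != nil {
        // Length: at most 63 characters.
        if len(*newValue) > 63 {
            errs = append(errs, fmt.Errorf("spec.managedBy must be no more than 63 characters"))
        }
        // Format: a domain-prefixed path, e.g. "kueue.x-k8s.io/multikueue";
        // this keeps the set of values open, unlike the closed-set projects above.
        if !strings.Contains(*newValue, "/") {
            errs = append(errs, fmt.Errorf("spec.managedBy must be a domain-prefixed path"))
        }
    }
    return errs
}
```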

@dejanzele dejanzele mentioned this pull request Oct 6, 2025
@mimowo
Contributor

mimowo commented Oct 6, 2025

Then we could compile these aspects into a table and get a bigger picture.

This is the table I was able to compile by checking the implementations in the projects:

| Aspect | k8s Job | JobSet | Kubeflow Trainer v1 | Kubeflow Trainer v2 | KubeRay | Tekton Pipelines | AppWrapper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| open set of allowed values | y | y | n | n | n | y | y |
| max length 63 | y | y | y (by closed set of options) | y (by closed set of options) | y (indirectly by closed set) | n | n |
| specific format | y | y | y (by closed set of options) | y (by closed set of options) | y (by closed set of options) | n | n |
| immutability | y | y | y | y | y | y | y |
| validation for the status | y | n | n | n | n | n | n |
| skip inside Reconcile | y | y | y | y | y | y (also in filtering) | y |

So, there is a bit of inconsistency between an "open set of values" and a "closed set of values". I know that in Kubeflow Trainer the motivation was to avoid the extra work on status validation by closing the set of "recognized" orchestrators to Kueue only (potentially more in the future). It seems all projects consistently wanted to avoid the effort of validating the status, which we took on in the k8s Job.

I think the main point is that all of them follow the key aspects: immutability and skipping inside Reconcile.

@mimowo
Contributor

mimowo commented Oct 6, 2025

Actually, on the "status" validation note, I'm wondering if we should plan to simplify this code by answering "true" for all. Maybe except "RejectFailedIndexesOverlappingCompleted", which can be considered a perf optimization (debatable).

The main reason we introduced the checks was to gracefully handle Jobs which could have been updated by the Job controller in an older version of k8s before the validation was introduced.

However:

  1. all Jobs created in 1.32+ wouldn't require the graceful handling, as the rules prevent entering an invalid state
  2. older Jobs are very unlikely to violate the rules, because we haven't had any reports of the rules being violated since the validation was enabled by default (except for "Fix validation for Job with suspend=true,completions=0 to set Complete condition" kubernetes#132614, but there the Jobs wouldn't compute anything useful anyway).

So, I'm wondering whether it would be beneficial to clean up the code and always perform the checks; the question is "when". Some ideas that come to mind:

  1. just now, we have waited long enough to be safe
  2. later under a dedicated feature gate, part of the KEP
  3. later without a feature gate, just send a PR in 1.36 or so
  4. later under a dedicated feature gate in the future

I'm personally hesitating between (1.) and (3.); (2.) and (4.) seem more involved, but I'm also OK with going that way.
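A minimal sketch of what (2.)/(4.) could look like; the gate name and the legacy helper are hypothetical, and the options struct is abbreviated to the field named above:

```go
package jobstrategy

import (
    utilfeature "k8s.io/apiserver/pkg/util/feature"
    featuregate "k8s.io/component-base/featuregate"
    batch "k8s.io/kubernetes/pkg/apis/batch"
    batchvalidation "k8s.io/kubernetes/pkg/apis/batch/validation"
)

// JobStatusValidationCleanup is a hypothetical gate name for illustration.
const JobStatusValidationCleanup featuregate.Feature = "JobStatusValidationCleanup"

func statusValidationOptions(oldJob *batch.Job) batchvalidation.JobStatusValidationOptions {
    if utilfeature.DefaultFeatureGate.Enabled(JobStatusValidationCleanup) {
        // Answer "true" for all checks, regardless of the old object's state.
        return batchvalidation.JobStatusValidationOptions{
            RejectFailedIndexesOverlappingCompleted: true,
            // ... remaining Reject* options, also set to true
        }
    }
    // Current behavior: stay lenient towards Jobs last written before the
    // validation existed.
    return legacyStatusValidationOptions(oldJob)
}

// legacyStatusValidationOptions stands in for the current old-object-aware logic.
func legacyStatusValidationOptions(oldJob *batch.Job) batchvalidation.JobStatusValidationOptions {
    return batchvalidation.JobStatusValidationOptions{}
}
```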

wdyt @atiratree @soltysh @deads2k ?

@deads2k
Contributor

deads2k commented Oct 6, 2025

So, I'm wondering whether it would be beneficial to clean up the code and always perform the checks; the question is "when". Some ideas that come to mind:

  1. just now, we have waited long enough to be safe
  2. later under a dedicated feature gate, part of the KEP
  3. later without a feature gate, just send a PR in 1.36 or so
  4. later under a dedicated feature gate in the future

I'm personally hesitating between (1.) and (3.); (2.) and (4.) seem more involved, but I'm also OK with going that way.

(1.) and (3.) are definitely out. Is that table saying we have no known writers of violating data? If we have never had known writers of violating data, then I'm comfortable with (4.). If we have ever had a known writer of violating data, then we cannot guarantee that all of the invalid data has been cleaned up.

If we wait long enough, declarative validation ratcheting will likely clean this up, but in the meantime we cannot risk making /status unwritable.

@mimowo
Copy link
Contributor

mimowo commented Oct 6, 2025

Is that table saying we have no known writers of violating data?

Except for the corner case issue:

  • in 1.32+ there is no writer, as the validation is "on" (assuming the Beta feature is enabled)
  • prior to 1.32 it could only be the Job controller, but I cannot recall any specific issue violating the rules

If we have never had known writers of violating data, then I'm comfortable with 4.

Sure, I'm good with (4.). I think turning the Reject... rules into "true" will not simplify the code dramatically anyway.

If we wait long enough, declarative validation ratcheting will likely clean this up, but in the meantime we cannot risk making /status unwritable.

Makes sense. I think we can make the "JobManagedBy" feature gate "GA" in 1.35 and wait a couple of releases to assume all Jobs are good. Is there any k8s guidance for how long to wait in similar cases?

I think "jobs" are a bit safer than "Deployments" which are meant to run "Indefinitely". Jobs are meant to run only for a finite amount of time, so having a Job surviving a year would be very unusual ( I assume, but I know there might be exceptions ).

@deads2k
Contributor

deads2k commented Oct 6, 2025

I'm open to trying a dedicated feature gate in the future. We'd likely leave the gate in place for four releases before locking it.
