KEP-2307: Graduate JobTrackingWithFinalizers to GA #3482

Merged
merged 4 commits into kubernetes:master on Sep 23, 2022

Conversation

alculquicondor
Member

  • One-line PR description: Add GA and deprecation notes
  • Other comments:
    • Add metric job_pod_tracking_finalizer.
    • Add legacy to finalizers migration under MigrateJobLegacyTracking feature gate.
    • Add known failure modes.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Aug 29, 2022
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Aug 29, 2022
@alculquicondor
Member Author

/assign @soltysh
cc @ahg-g @mimowo

@alculquicondor
Member Author

FYI @liggitt

the job controller migrates jobs with legacy tracking to tracking with finalizers.

If a Job doesn't have the annotation `batch.kubernetes.io/job-completion`, it
means that is not currently tracked with finalizers. The job controller starts
Member

Suggested change
means that is not currently tracked with finalizers. The job controller starts
means that it is not currently tracked with finalizers. The job controller starts

Member Author

Done
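
Not part of the PR, just a minimal sketch of the check being discussed: a Job without the annotation is treated as not tracked with finalizers. The annotation key is taken from the KEP text quoted above; the package and helper names are hypothetical.

```go
// Hypothetical helper, not the controller's actual code: a Job is considered
// tracked with finalizers only if it carries the tracking annotation.
package jobtracking

import batchv1 "k8s.io/api/batch/v1"

// Annotation key as quoted in the KEP text above.
const jobTrackingAnnotation = "batch.kubernetes.io/job-completion"

// trackedWithFinalizers reports whether the Job already carries the
// tracking annotation and therefore uses finalizer-based tracking.
func trackedWithFinalizers(job *batchv1.Job) bool {
	_, ok := job.Annotations[jobTrackingAnnotation]
	return ok
}
```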

- For the remaining Jobs, the Job controller already accounted most of the
finished Pods in the status. The controller might leave some
finished Pods unaccounted, if they finish before the controller has a chance
to add a finalizer. This situation is no worse that the legacy tracking
Member

Suggested change
to add a finalizer. This situation is no worse that the legacy tracking
to add a finalizer. This situation is no worse than the legacy tracking

Member Author

Done


In 1.26:

- Declare deprecation of annotation `batch.kubernetes.io/job-completion` in
Member

just curious, are we expecting anyone to use/rely on this annotation?

Member Author

You never know

Contributor

👍 on deprecating and announcing that fact, we've been bitten several times in the past by alpha things that users started to rely on heavily 😞

Member

+1


## Migrating Jobs with legacy tracking

Starting in 1.26, when the feature gate `MigrateJobLegacyTracking` is enabled,
Member

this is a new feature gate, but it is not described earlier. Are we currently not migrating legacy jobs, so that they continue to use the legacy path?

Member

Worth adding the reason for this, which is to facilitate the migration of jobs that may run across two upgrades. It looks like overkill, since jobs are not like deployments in that they don't live forever, but there are probably cases where people upgrade their clusters back to back.

Member Author

Added notes


`.status.phase=(Pending/Running)`.
2. Ignore pods with `.status.phase=(Complete/Failed)` that don't have the `batch.kubernetes.io/job-completion`.
They are considered to be already counted in `.status.(failed/succeeded)`.
3. Add the annotation `batch.kubernetes.io/job-completion`.
Contributor

This bit is probably the only one I have some doubts about. Can you estimate how many jobs might be using the old mechanism, assuming both cases:

  1. with the feature on by default in 1.25.
  2. with the feature off in 1.25.

From my quick calculations, the 2nd case will be the most impactful, since that's basically all the jobs using the old mechanism that need to be migrated. Is there something we can do here to make the pain for users smaller? Maybe advise them to run 1.25 with the feature on before the upgrade? Would that kind of approach make it more reliable?

Contributor

Should we allow users to disable MigrateJobLegacyTracking at all?

Contributor

Or maybe we should do it gradually, i.e. new jobs should use only the new tracking, but leave the old ones as they were?

Member Author

In general, I have no way to know.

For case 1, and assuming that admins are not doing "binge upgrading", it should be close to zero. There are 4 months between releases, which should be plenty of time for even the slowest jobs to finish.

Case 2 can be directly impacted.

I'm thinking that the recommendation should be either of these:

  1. If JobTrackingWithFinalizers was enabled in 1.25, enable MigrateJobLegacyTracking in 1.26.
  2. If the feature was disabled in 1.24, disable MigrateJobLegacyTracking in 1.26. The remaining jobs will be migrated in 1.27.

We could also never introduce the code for migration and document this:
If a Job was created in 1.25 or earlier with JobTrackingWithFinalizers disabled and it is still running in 1.26, the job controller would recreate all its pods upon upgrade to 1.27.

For people that had JobTrackingWithFinalizers enabled in 1.25, that gives them 2 releases of buffer for their jobs to finish.
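
As an aside for operators trying to answer the question above for their own clusters, here is a hedged client-go sketch that counts Jobs still lacking the tracking annotation. The annotation key comes from the KEP text in this PR; everything else, including the kubeconfig handling, is illustrative only. Jobs without the annotation would be the ones still on legacy tracking and therefore affected by the upgrade scenarios discussed here.

```go
// Illustrative program: count Jobs in all namespaces that do not carry the
// tracking annotation and thus still use legacy tracking.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Annotation key as quoted in the KEP text above.
const jobTrackingAnnotation = "batch.kubernetes.io/job-completion"

func main() {
	// Build a client from the default kubeconfig location (illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	jobs, err := client.BatchV1().Jobs(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	legacy := 0
	for _, job := range jobs.Items {
		if _, ok := job.Annotations[jobTrackingAnnotation]; !ok {
			legacy++
		}
	}
	fmt.Printf("%d of %d Jobs still use legacy tracking\n", legacy, len(jobs.Items))
}
```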

Member
@ahg-g ahg-g Sep 14, 2022

I think the migration code is overkill; the code is too complex as is, and this will only increase the sources of bugs/errors and misconfigurations. I think it is sufficient to have a release note indicating that upgrading clusters back to back through two releases may disrupt running jobs.

Member Author

Removed

Contributor

Totally agree with Abdullah about that migration being overkill (and potential source of unwanted and unexpected problems). All we should do is to ensure good communication about these upcoming changes.

@wojtek-t
Member

/assign

@alculquicondor - please ping me on Slack once you have SIG lgtm

@kikisdeliveryservice kikisdeliveryservice changed the title 2307: Graduate JobTrackingWithFinalizers to GA KEP-2307: Graduate JobTrackingWithFinalizers to GA Sep 16, 2022
Contributor

@soltysh soltysh left a comment

One suggestion, otherwise
/lgtm
/approve

[#111646](https://github.com/kubernetes/kubernetes/pull/111646)).
In newer versions, this can still happen if there is a buggy webhook
that prevents the pod updates that remove finalizers.
- Testing: Discovered bugs are covered by unit and integration tests.
Contributor

I'd add here the problem you described at the beginning, i.e. that we recommend folks running with JobTrackingWithFinalizers in 1.25 and 1.26 ensure all jobs with batch.kubernetes.io/job-completion are completed prior to upgrading to 1.27. I doubt we'll be updating the KEP then, so let's add it right away.

Contributor

On top of that, I'd suggest writing a blog post for 1.26 release describing this and recommendations.

Member Author

Good ideas. Added here and in the graduation criteria.
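
Related to the failure mode quoted above (a buggy webhook blocking the update that removes the finalizer), here is an illustrative remediation sketch, not from the KEP: once the webhook is fixed or bypassed, an operator could strip the stuck finalizer manually. The finalizer name is taken as a parameter because this thread doesn't spell it out; the function name is hypothetical.

```go
// Illustrative remediation, not the controller's code: remove a stuck
// finalizer from a Pod once whatever blocked the update has been resolved.
package jobtracking

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// removeFinalizer drops the named finalizer from the Pod's finalizer list
// and writes the update back to the API server.
func removeFinalizer(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, finalizer string) error {
	kept := pod.Finalizers[:0]
	for _, f := range pod.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
	_, err := client.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```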

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 22, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 22, 2022
@alculquicondor
Member Author

@wojtek-t this is ready for PRR.

@wojtek-t
Member

This is great!

/lgtm
/approve PRR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 23, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 23, 2022
@k8s-ci-robot k8s-ci-robot merged commit a4b5f5c into kubernetes:master Sep 23, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Sep 23, 2022