KEP-2307: Graduate JobTrackingWithFinalizers to GA #3482
Conversation
FYI @liggitt
> the job controller migrates jobs with legacy tracking to tracking with finalizers.
>
> If a Job doesn't have the annotation `batch.kubernetes.io/job-completion`, it
> means that is not currently tracked with finalizers. The job controller starts
```suggestion
means that it is not currently tracked with finalizers. The job controller starts
```
Done
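For illustration, the check described in the quoted hunk could look like this minimal Go sketch. The annotation key is taken from the quoted KEP text; this is not the actual controller code:

```go
package main

import "fmt"

// Annotation key as quoted in the KEP text above.
const jobTrackingAnnotation = "batch.kubernetes.io/job-completion"

// trackedWithFinalizers reports whether a Job is tracked with finalizers.
// Per the quoted text, a Job without the annotation is not currently
// tracked with finalizers (legacy tracking).
func trackedWithFinalizers(annotations map[string]string) bool {
	_, ok := annotations[jobTrackingAnnotation]
	return ok
}

func main() {
	legacy := map[string]string{}
	migrated := map[string]string{jobTrackingAnnotation: ""}
	fmt.Println(trackedWithFinalizers(legacy))   // false
	fmt.Println(trackedWithFinalizers(migrated)) // true
}
```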
> - For the remaining Jobs, the Job controller already accounted most of the
>   finished Pods in the status. The controller might leave some
>   finished Pods unaccounted, if they finish before the controller has a chance
>   to add a finalizer. This situation is no worse that the legacy tracking
```suggestion
to add a finalizer. This situation is no worse than the legacy tracking
```
Done
> In 1.26:
>
> - Declare deprecation of annotation `batch.kubernetes.io/job-completion` in
just curious, are we expecting anyone to use/rely on this annotation?
You never know
👍 on deprecating and announcing that fact; we've been bitten several times in the past by alpha things that users started to rely on heavily 😞
+1
> ## Migrating Jobs with legacy tracking
>
> Starting in 1.26, when the feature gate `MigrateJobLegacyTracking` is enabled,
this is a new feature gate, but it is not described before. Are we currently not migrating legacy jobs, so they continue to use the legacy path?
Worth adding the reason for this, which is to facilitate the migration of jobs that may run across two upgrades. It looks like overkill, since jobs are not like deployments in that they don't live forever, but there are probably cases where people upgrade their clusters back to back.
Added notes
> `.status.phase=(Pending/Running)`.
> 2. Ignore pods with `.status.phase=(Complete/Failed)` that don't have the `batch.kubernetes.io/job-completion`.
>    They are considered to be already counted in `.status.(failed/succeeded)`.
> 3. Add the annotation `batch.kubernetes.io/job-completion`.
This bit is probably the only one which I have some doubts around. Can you estimate how many jobs might be using the old mechanism assuming both cases:
- with the feature on by default in 1.25.
- with the feature off in 1.25.
From my quick calculations, the 2nd case will be the most impactful, since that's basically all the jobs using the old mechanism that need to be migrated. Is there something we can do here to make the pain for users smaller? Maybe advise them to run 1.25 with the feature on before the upgrade? Would that kind of approach make it more reliable?
Should we allow users to disable `MigrateJobLegacyTracking` at all?
Or maybe we should do it gradually; in other words, new jobs should use only the new tracking, but leave the old ones as they were?
In general, I have no way to know.
For case 1, and assuming that admins are not doing "binge upgrading", it should be close to zero. There are 4 months between releases, which should be plenty of time for even the slowest jobs to finish.
Case 2 can be directly impacted.
I'm thinking that the recommendation should be either of these:
- If `JobTrackingWithFinalizers` was enabled in 1.25, enable `MigrateJobLegacyTracking` in 1.26.
- If the feature was disabled in 1.24, disable `MigrateJobLegacyTracking` in 1.26. The remaining jobs will be migrated in 1.27.
We could also never introduce the code for migration and document this:
If a Job was created in 1.25 or earlier with `JobTrackingWithFinalizers` disabled and it is still running in 1.26, the job controller would recreate all its pods upon upgrade to 1.27.
For people that had JobTrackingWithFinalizers enabled in 1.25, that gives them 2 releases of buffer for their jobs to finish.
I think the migration code is overkill; the code is too complex as is, and this will only increase the sources of bugs, errors, and misconfigurations. I think it is sufficient to have a release note indicating that upgrading clusters back to back through two releases may disrupt running jobs.
Removed
Totally agree with Abdullah about that migration being overkill (and potential source of unwanted and unexpected problems). All we should do is to ensure good communication about these upcoming changes.
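For reference, the migration steps quoted earlier (and ultimately dropped from the KEP) could be sketched roughly as follows in Go. Step 1 of the quoted list is truncated in this excerpt, so the "add finalizer" branch for Pending/Running pods is an assumption; the phase names follow the quote, which writes "Complete" where standard Kubernetes uses "Succeeded":

```go
package main

import "fmt"

type pod struct {
	phase         string // "Pending", "Running", "Complete", "Failed" (as in the quote)
	hasAnnotation bool   // whether the pod carries batch.kubernetes.io/job-completion
}

// migrationAction classifies a pod per the proposed (later removed) steps:
// finished pods without the annotation are ignored, as they are considered
// already counted in .status.(failed/succeeded); other pods would get a
// finalizer so the controller can track them (assumed from the truncated step 1).
func migrationAction(p pod) string {
	finished := p.phase == "Complete" || p.phase == "Failed"
	if finished && !p.hasAnnotation {
		return "ignore: already counted in .status"
	}
	return "add finalizer"
}

func main() {
	fmt.Println(migrationAction(pod{phase: "Running"}))                     // add finalizer
	fmt.Println(migrationAction(pod{phase: "Failed"}))                      // ignore: already counted in .status
	fmt.Println(migrationAction(pod{phase: "Failed", hasAnnotation: true})) // add finalizer
}
```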
/assign @alculquicondor - please ping me on Slack once you have SIG lgtm
One suggestion, otherwise
/lgtm
/approve
> [#111646](https://github.com/kubernetes/kubernetes/pull/111646)).
> In newer versions, this can still happen if there is a buggy webhook
> that prevents pod updates to remove finalizers.
> - Testing: Discovered bugs are covered by unit and integration tests.
I'd add here the problem you described at the beginning, i.e. that we recommend folks running with `JobTrackingWithFinalizers` in 1.25 and 1.26 to ensure all jobs with `batch.kubernetes.io/job-completion` are completed prior to upgrading to 1.27. I doubt we'll be updating the KEP then, so let's add it right away.
On top of that, I'd suggest writing a blog post for 1.26 release describing this and recommendations.
Good ideas. Added here and in the graduation criteria.
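To illustrate the stuck-finalizer scenario from the quoted hunk: removing the tracking finalizer is an ordinary pod update, which a buggy webhook can reject, leaving the pod undeletable. A minimal Go sketch; the finalizer name here is hypothetical, since this excerpt does not state the real one:

```go
package main

import "fmt"

// Hypothetical finalizer name, for illustration only.
const trackingFinalizer = "batch.kubernetes.io/job-tracking"

// removeFinalizer returns the finalizer list without the given name.
// In the controller this change is applied via a pod update; if a buggy
// webhook rejects that update, the finalizer is never removed and the
// pod stays pinned.
func removeFinalizer(finalizers []string, name string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	podFinalizers := []string{trackingFinalizer, "example.com/other"}
	fmt.Println(removeFinalizer(podFinalizers, trackingFinalizer)) // [example.com/other]
}
```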
@wojtek-t this is ready for PRR.
This is great! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alculquicondor, soltysh, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing