ForegroundDeletion of Jobs is not always enforced before recreation #665
Assuming the culprit is another call here: jobset/pkg/controllers/jobset_controller.go, line 560 at 2dcc751. I think the fix would be to just remove this conditional.
Hey! Would you be open to creating an e2e test to see if your suggestion fixes it?
Actually, reading about Foreground deletion, it seems that it is not a blocking delete call. It sets a finalizer, and then JobSet would recreate. So I'm not sure this is a bug, tbh. Sounds like you want to only recreate the jobs once they are fully deleted. Jobs can take a long time to be deleted, especially if the pods have graceful termination. I think this is really a feature request.
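For context, here is a minimal sketch (not the actual JobSet controller code) of how a foreground delete is typically issued with controller-runtime. The call returns as soon as the API server accepts the delete; it does not block until the Job's pods are gone.

```go
package example

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteJobForeground requests foreground cascading deletion of a Job.
// The garbage collector then deletes the Job's pods before removing the
// Job object itself; in the meantime the Job lingers with a
// deletionTimestamp and the foregroundDeletion finalizer set.
func deleteJobForeground(ctx context.Context, c client.Client, job *batchv1.Job) error {
	return c.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationForeground))
}
```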
From the GH issue that specified foreground deletion, it appears that the expected behavior is to block until all Pods are gone (otherwise bad things happen): #392
Yes, I should be able to do that.
@ahg-g WDYT of #665 (comment)? Reading the documentation on ForegroundDeletion, I don't understand how that becomes a blocking delete call.
Foreground will keep the child job around with deletionTimestamp set until all its pods are deleted. This will prevent JobSet from creating a replacement for the child job until all the pods are deleted. So this is working as expected. @nstogner why do you want jobset to wait until all child jobs are deleted before creating their replacements?
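To illustrate that point, here is a rough sketch (a hypothetical helper, not the actual controller logic) of how a reconciler could postpone recreating a particular child Job while its previous incarnation still exists, which is exactly what a pending foreground delete causes:

```go
package example

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// previousJobGone reports whether the prior incarnation of a child Job has
// been fully removed from the API server. While a foreground delete is in
// progress, the old Job still exists (with a deletionTimestamp), so this
// returns false and the replacement for that Job is postponed. Other child
// Jobs are unaffected, which is why recreation can proceed in parallel.
func previousJobGone(ctx context.Context, c client.Client, key types.NamespacedName) (bool, error) {
	var old batchv1.Job
	err := c.Get(ctx, key, &old)
	if apierrors.IsNotFound(err) {
		return true, nil
	}
	if err != nil {
		return false, err
	}
	return false, nil
}
```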
@nstogner this is not the intended behavior - as Abdullah mentioned above, the intended behavior is that each Job will not be recreated until the prior iteration of that Job has been completely cleaned up (all pods and the parent Job object removed from etcd). It doesn't wait for all Jobs to be deleted before recreating any Job. Closing this for now since there's been no activity for ~1 month; feel free to reopen if you want to continue the discussion.
I am observing overlap of pods between two consecutive restart attempts. A straggler pod from restart attempt N-1 will try to connect to a new incarnation of a pod in restart attempt N. So I'm pretty sure the cleanup is not happening fully before the new set of Jobs/pods is created.
Yes, the intent is not to clean up all jobs before starting to create the replacements. The behavior is to recreate child jobs in parallel to facilitate a quick restart. If we want to block on the recreation until all jobs are deleted, then this is a new FailurePolicy feature; we can call it
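As a rough sketch of what such a blocking variant could look like (hypothetical code, not an existing JobSet API; the label key used to select child Jobs is an assumption), the controller would only start recreating once no Jobs from the previous attempt remain:

```go
package example

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// allChildJobsGone lists the child Jobs belonging to a JobSet and reports
// whether every Job from the previous restart attempt has been removed,
// including Jobs that are still mid foreground deletion. A blocking failure
// policy could gate recreation on this returning true.
func allChildJobsGone(ctx context.Context, c client.Client, ns, jobSetName string) (bool, error) {
	var jobs batchv1.JobList
	if err := c.List(ctx, &jobs,
		client.InNamespace(ns),
		client.MatchingLabels{"jobset.sigs.k8s.io/jobset-name": jobSetName}, // assumed label key
	); err != nil {
		return false, err
	}
	return len(jobs.Items) == 0, nil
}
```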
What happened:
I noticed this issue when I was testing what would happen if I were to trigger a failure in a Job when multiple .replicatedJobs[] are specified, and observed that the JobSet controller did not wait for all Jobs to be deleted before creating new ones.
What you expected to happen:
I expected all Jobs from a given attempt to be fully deleted with ForegroundDeletion before any new Jobs are recreated.
How to reproduce it (as minimally and precisely as possible):
I used this manifest:
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- JobSet version (use git describe --tags --dirty --always): v0.6.0
- Cloud provider or hardware configuration: kind (with podman)
- Install tools: kubectl apply