fix PodGroup being incorrectly deleted due to frequent creation and deletion of pods #3375

Merged: 1 commit, Mar 29, 2024

guoqinwill
Contributor

Fixes #3374

@wangyang0616
Member

Can you briefly describe the processing time sequence of this scenario using a diagram?

oldPgVersion := oldJob.PgUID
newPgVersion := job.PgUID
klog.V(5).Infof("just add pguid:%v, try to delete pguid:%v", oldPgVersion, newPgVersion)
if oldPgVersion == newPgVersion {
Member

In which case would the PodGroup UID change? In my opinion, once a PodGroup is created it keeps existing with the same UID, regardless of pods being created or deleted in the same job.

Contributor Author

The UID stored in JobInfo is the same as the name of the pg, so it never changes. However, when the number of pods drops to 0, the pg is deleted, and when the number of pods becomes non-zero again, a new pg is created. The new pg object gets its own UID, which is unique per object and therefore differs from that of the old pg. For a detailed analysis, please refer to issue #3374, where I have described the problem scenario in detail.
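
To make the distinction concrete, here is a minimal sketch of the scenario, assuming trimmed-down, hypothetical `podGroup` and `jobInfo` types (only the field names `UID` and `PgUID` come from the diff; everything else, including `shouldDelete`, is illustrative and not the actual Volcano code). The cache key stays the same across pg generations, while the object UID does not, so comparing `PgUID` is what tells a stale delete request apart from a current one.

```go
package main

import "fmt"

// podGroup is a hypothetical stand-in for the real PodGroup object.
type podGroup struct {
	Name string // namespace/name, stable across re-creation
	UID  string // unique per object, changes on re-creation
}

// jobInfo is a hypothetical, trimmed-down cache entry.
type jobInfo struct {
	UID   string // cache key, same as the pg name
	PgUID string // UID of the pg object this entry was built from
}

// shouldDelete reports whether a queued delete request still refers to the
// pg generation that is currently held in the cache.
func shouldDelete(cached, pending jobInfo) bool {
	return cached.PgUID == pending.PgUID
}

func main() {
	cache := map[string]jobInfo{}

	// First generation of the pg.
	pg1 := podGroup{Name: "default/job-a", UID: "uid-1"}
	cache[pg1.Name] = jobInfo{UID: pg1.Name, PgUID: pg1.UID}

	// Pods drop to 0: pg1 is deleted and a delete request is queued.
	pending := cache[pg1.Name]

	// Pods come back before the request is processed: a new pg with the
	// same name but a new UID replaces the cache entry.
	pg2 := podGroup{Name: "default/job-a", UID: "uid-2"}
	cache[pg2.Name] = jobInfo{UID: pg2.Name, PgUID: pg2.UID}

	// The queued request belongs to the old generation, so it is skipped.
	fmt.Println(shouldDelete(cache[pending.UID], pending)) // prints "false"
}
```

Without the `PgUID` comparison, the queued request would match the cache entry by key alone and remove the entry that belongs to the new pg.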

@guoqinwill
Contributor Author

Can you briefly describe the processing time sequence of this scenario using a diagram?

Added to issue #3374.

metrics.DeleteJobMetrics(job.Name, string(job.Queue), job.Namespace)
klog.V(3).Infof("Job <%v:%v/%v> was deleted.", job.UID, job.Namespace, job.Name)
sc.DeletedJobs.Forget(obj)
if oldJob, found := sc.Jobs[job.UID]; found {
Member

change to:

oldJob, found := sc.Jobs[job.UID]
if !found {
	return
}

Contributor Author

done

pkg/scheduler/cache/cache.go (outdated)
metrics.DeleteJobMetrics(job.Name, string(job.Queue), job.Namespace)
klog.V(3).Infof("Job <%v:%v/%v> was deleted.", job.UID, job.Namespace, job.Name)
sc.DeletedJobs.Forget(obj)
}
Member

If oldPgVersion is not equal to newPgVersion, the current obj should also be discarded. Otherwise, it will be retried in DeletedJobs indefinitely, and DeletedJobs may eventually fill up.

Member

Calling the Forget and Done methods guarantees that the old key is discarded.

Contributor Author

done

oldJob, found := sc.Jobs[job.UID]
if !found {
klog.V(3).Infof("Failed to find Job <%v:%v/%v>, ignore it", job.UID, job.Namespace, job.Name)
return
Member

Should obj be discarded here to avoid infinite retries?

Contributor Author

done
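
The last two review threads ask for the same thing: every exit path of the cleanup handler should drop the item's retry state. A rough sketch of that shape follows, assuming a hypothetical `tracker` type that only imitates the retry bookkeeping of a rate-limited queue (it is not client-go's workqueue), and a simplified `cleanup` function that is not the actual Volcano handler.

```go
package main

import "fmt"

// tracker is a hypothetical stand-in for a queue's retry bookkeeping.
type tracker struct{ failures map[string]int }

// Retry records one more failed attempt for key.
func (t *tracker) Retry(key string) { t.failures[key]++ }

// Forget drops all retry state for key.
func (t *tracker) Forget(key string) { delete(t.failures, key) }

// jobInfo is a hypothetical, trimmed-down cache entry.
type jobInfo struct{ UID, PgUID string }

// cleanup handles one queued delete request and returns a label for the
// path taken. All three paths call Forget, so no request lingers.
func cleanup(cache map[string]jobInfo, q *tracker, job jobInfo) string {
	cached, found := cache[job.UID]
	if !found {
		q.Forget(job.UID) // nothing to delete, do not retry
		return "not found"
	}
	if cached.PgUID != job.PgUID {
		q.Forget(job.UID) // request is for an older pg, do not retry
		return "stale"
	}
	delete(cache, job.UID)
	q.Forget(job.UID)
	return "deleted"
}

func main() {
	cache := map[string]jobInfo{"ns/job-a": {UID: "ns/job-a", PgUID: "uid-2"}}
	q := &tracker{failures: map[string]int{}}
	q.Retry("ns/job-a") // an earlier attempt was requeued

	stale := jobInfo{UID: "ns/job-a", PgUID: "uid-1"}
	fmt.Println(cleanup(cache, q, stale), len(cache), len(q.failures)) // prints "stale 1 0"
}
```

The stale request leaves the cache entry of the new pg in place and clears its own retry state, which is the behavior both reviewers asked for.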

Signed-off-by: guoqin <gq411will@163.com>
Signed-off-by: g00673948 <guoqin10@huawei.com>
@wangyang0616
Member

wangyang0616 commented Mar 29, 2024

/lgtm

Member

@william-wang left a comment

/approve

@volcano-sh-bot added the lgtm label (Indicates that a PR is ready to be merged.) on Mar 29, 2024
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

@volcano-sh-bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Mar 29, 2024
@volcano-sh-bot merged commit 11037f8 into volcano-sh:master on Mar 29, 2024
14 checks passed
@Monokaix
Member

/lgtm

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/S: Denotes a PR that changes 10-29 lines, ignoring generated files.
6 participants