Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add thread pool to verify pod deletion. #1227

Closed
wants to merge 1 commit into from

Conversation

ggallen
Copy link

@ggallen ggallen commented Aug 9, 2022

The node deletion in the plugin does not respect the terminationGracePeriodSeconds value. This can result in a race condition where a node is deleted in the plugin, but it has not actually been deleted from Kubernetes as the pod is taking some time to shut down after receiving the SIGTERM signal from Kubernetes.

The problem may be hidden in most cases unless quota is involved. In this situation, it is possible to get a quota exceeded error as the plugin thinks nodes have been deleted and it tries to allocate new nodes. In reality, they may still be around in Kubernetes.

This may lead people to assume that the concurrency limit setting is being ignored. This may be related to https://issues.jenkins.io/browse/JENKINS-59959.

The fix proposed here is to move the node deletion to a thread pool and have the deletion task ensure that the node is actually removed from Kubernetes before decrementing the plugin counter. The node deletion task waits for a maximum of "the grace period * 120% seconds", checking the node status every three seconds. Currently, the grace period is hard coded to the default 30 seconds.

Existing tests all run, and those already seem to test delete. I am unsure how to write a test for this but am willing to try with some help.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@ggallen ggallen requested a review from a team as a code owner August 9, 2022 20:17
@ggallen ggallen closed this Aug 10, 2022
@ggallen ggallen reopened this Aug 10, 2022
@ggallen ggallen closed this Aug 10, 2022
@ggallen ggallen reopened this Aug 10, 2022
@Vlatombe Vlatombe added the enhancement Improvements label Aug 16, 2022
@ggallen
Copy link
Author

ggallen commented Aug 16, 2022

I have spent three days trying to get the KubenetesPluginTest to pass, but no luck. I don't know enough about what is going on to fix it, but I'm happy to continue if I can get some help. There seems to be some sort of timing issue with a template being deleted, and that throws off the plugin counters since the unregister only happens when the pod is confirmed to be deleted.

I suspect someone with more knowledge about the plugin and the test could see the problem relatively quickly.

Can anyone help? I seem to be spinning my wheels and can no longer afford to do so.

@Vlatombe
Copy link
Member

Looking at this.

@Vlatombe
Copy link
Member

The test failure definitely looks related to the change

   8.422 [id=1257]	INFO	o.c.j.p.k.KubernetesNodeDeletedTask#run: Pod octal-permissions-1-gvp1l-6c5nx-bbf28 has been confirmed deleted.
   8.423 [id=1257]	WARNING	o.c.j.p.k.KubernetesProvisioningLimits#unregister: Pod template count for octal_Permissions_1-gvp1l-6c5nx went below zero. There is likely a bug in kubernetes-plugin

@@ -136,15 +138,13 @@ int getPodTemplateCount(String podTemplate) {

@Extension
public static class NodeListenerImpl extends NodeListener {
private static ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newCachedThreadPool();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use Computer.threadPoolForRemoting

}
try {
LOGGER.log(Level.INFO, "Pod " + node.getNodeName() + " has NOT been confirmed deleted, waiting.");
Thread.sleep(3000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not sleep in an executor task; instead, reschedule.

@Vlatombe
Copy link
Member

Closing as #1543 & #1553 add generic mechanisms to ensure pods are not left behind.

@Vlatombe Vlatombe closed this May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants