Add thread pool to verify pod deletion. #1227
Conversation
I have spent three days trying to get the test to pass. I suspect someone with more knowledge about the plugin and the test could see the problem relatively quickly. Can anyone help? I seem to be spinning my wheels and can no longer afford to do so.
Looking at this.
The test failure definitely looks related to the change.
@@ -136,15 +138,13 @@ int getPodTemplateCount(String podTemplate) {

@Extension
public static class NodeListenerImpl extends NodeListener {
    private static ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newCachedThreadPool();
use Computer.threadPoolForRemoting
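For context, a minimal sketch of what that suggestion might look like, assuming the listener only needs to hand the verification work off to a background thread. verifyPodDeleted is a hypothetical placeholder for the actual Kubernetes check, not code from this PR:

import hudson.Extension;
import hudson.model.Computer;
import hudson.model.Node;
import jenkins.model.NodeListener;

public class PodDeletionListeners {

    // Sketch: reuse Jenkins' shared remoting pool instead of a private
    // cached thread pool owned by the listener.
    @Extension
    public static class NodeListenerImpl extends NodeListener {
        @Override
        protected void onDeleted(Node node) {
            Computer.threadPoolForRemoting.submit(() -> verifyPodDeleted(node.getNodeName()));
        }

        private void verifyPodDeleted(String podName) {
            // Placeholder: poll Kubernetes until the pod is gone, then
            // decrement the plugin's pod template counter (not shown).
        }
    }
}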
}
try {
    LOGGER.log(Level.INFO, "Pod " + node.getNodeName() + " has NOT been confirmed deleted, waiting.");
    Thread.sleep(3000);
Do not sleep in an executor task; instead, reschedule.
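As a rough illustration of that suggestion (not the plugin's actual code), the check could reschedule itself on a ScheduledExecutorService, for example jenkins.util.Timer.get(), rather than blocking a pool thread with Thread.sleep. isPodGone and the counter update are hypothetical placeholders:

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PodDeletionCheck implements Runnable {

    private final ScheduledExecutorService scheduler; // e.g. jenkins.util.Timer.get()
    private final String podName;
    private final long deadlineNanos;

    public PodDeletionCheck(ScheduledExecutorService scheduler, String podName, long deadlineNanos) {
        this.scheduler = scheduler;
        this.podName = podName;
        this.deadlineNanos = deadlineNanos;
    }

    @Override
    public void run() {
        if (isPodGone(podName) || System.nanoTime() > deadlineNanos) {
            // Pod confirmed deleted (or the deadline passed): decrement the counter here.
            return;
        }
        // Not gone yet: check again in three seconds without tying up a thread.
        scheduler.schedule(this, 3, TimeUnit.SECONDS);
    }

    private boolean isPodGone(String podName) {
        return false; // placeholder for the real Kubernetes API lookup
    }
}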
The node deletion in the plugin does not respect the terminationGracePeriodSeconds value. This can result in a race condition where a node has been deleted in the plugin but not yet in Kubernetes, because the pod takes some time to shut down after receiving the SIGTERM signal from Kubernetes.
The problem is usually hidden unless a quota is involved. In that situation it is possible to hit a quota-exceeded error: the plugin thinks the nodes have been deleted and tries to allocate new ones, while in reality the pods may still exist in Kubernetes.
This may lead people to assume that the concurrency limit setting is being ignored. This may be related to https://issues.jenkins.io/browse/JENKINS-59959.
The fix proposed here is to move the node deletion to a thread pool and have the deletion task confirm that the pod has actually been removed from Kubernetes before decrementing the plugin's counter. The deletion task waits for at most 120% of the grace period, checking the pod's status every three seconds. Currently, the grace period is hard-coded to the default of 30 seconds.
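A small sketch of the timing described above, with hypothetical helper names (isPodDeleted, decrementCounter); the real change lives in the plugin's NodeListener and is not reproduced here:

import java.util.concurrent.TimeUnit;

public class DeletionWaitSketch {

    static void waitForPodDeletion(String podName) throws InterruptedException {
        int gracePeriodSeconds = 30;                                // currently hard-coded default
        long maxWaitSeconds = Math.round(gracePeriodSeconds * 1.2); // grace period * 120% = 36 seconds
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(maxWaitSeconds);

        while (!isPodDeleted(podName) && System.nanoTime() < deadline) {
            TimeUnit.SECONDS.sleep(3);                              // check every three seconds
        }
        // Only decrement the plugin's counter once the pod is confirmed gone
        // (or the wait has timed out).
        decrementCounter(podName);
    }

    static boolean isPodDeleted(String podName) {
        return false; // placeholder for the Kubernetes API check
    }

    static void decrementCounter(String podName) {
    }
}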
All existing tests run, and they already seem to cover deletion. I am unsure how to write a test for this change but am willing to try with some help.