Add thread pool to verify pod deletion. #1227
Conversation
I have spent three days trying to get the test to pass. I suspect someone with more knowledge about the plugin and the test could see the problem relatively quickly. Can anyone help? I seem to be spinning my wheels and can no longer afford to do so.
Looking at this.
The test failure definitely looks related to the change.
@@ -136,15 +138,13 @@ int getPodTemplateCount(String podTemplate) {

@Extension
public static class NodeListenerImpl extends NodeListener {
    private static ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newCachedThreadPool();
use Computer.threadPoolForRemoting
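For context, a minimal sketch of what that suggestion might look like, assuming the listener only needs to hand the verification work off to a background thread. verifyPodDeleted is a hypothetical placeholder for the actual Kubernetes check, not code from this PR:

import hudson.Extension;
import hudson.model.Computer;
import hudson.model.Node;
import jenkins.model.NodeListener;

public class PodDeletionListeners {

    // Sketch: reuse Jenkins' shared remoting pool instead of a private
    // cached thread pool owned by the listener.
    @Extension
    public static class NodeListenerImpl extends NodeListener {
        @Override
        protected void onDeleted(Node node) {
            Computer.threadPoolForRemoting.submit(() -> verifyPodDeleted(node.getNodeName()));
        }

        private void verifyPodDeleted(String podName) {
            // Placeholder: poll Kubernetes until the pod is gone, then
            // decrement the plugin's pod template counter (not shown).
        }
    }
}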
}
try {
    LOGGER.log(Level.INFO, "Pod " + node.getNodeName() + " has NOT been confirmed deleted, waiting.");
    Thread.sleep(3000);
Do not sleep in an executor task; instead, reschedule.
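As a rough illustration of that suggestion (not the plugin's actual code), the check could reschedule itself on a ScheduledExecutorService, for example jenkins.util.Timer.get(), rather than blocking a pool thread with Thread.sleep. isPodGone and the counter update are hypothetical placeholders:

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PodDeletionCheck implements Runnable {

    private final ScheduledExecutorService scheduler; // e.g. jenkins.util.Timer.get()
    private final String podName;
    private final long deadlineNanos;

    public PodDeletionCheck(ScheduledExecutorService scheduler, String podName, long deadlineNanos) {
        this.scheduler = scheduler;
        this.podName = podName;
        this.deadlineNanos = deadlineNanos;
    }

    @Override
    public void run() {
        if (isPodGone(podName) || System.nanoTime() > deadlineNanos) {
            // Pod confirmed deleted (or the deadline passed): decrement the counter here.
            return;
        }
        // Not gone yet: check again in three seconds without tying up a thread.
        scheduler.schedule(this, 3, TimeUnit.SECONDS);
    }

    private boolean isPodGone(String podName) {
        return false; // placeholder for the real Kubernetes API lookup
    }
}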
The node deletion in the plugin does not respect the terminationGracePeriodSeconds value. This can result in a race condition where a node has been deleted in the plugin but not yet in Kubernetes, because the pod takes some time to shut down after receiving the SIGTERM signal from Kubernetes.
The problem is usually hidden unless a quota is involved. In that situation it is possible to hit a quota-exceeded error: the plugin thinks the nodes have been deleted and tries to allocate new ones, while in reality the pods may still exist in Kubernetes.
This may lead people to assume that the concurrency limit setting is being ignored. This may be related to https://issues.jenkins.io/browse/JENKINS-59959.
The fix proposed here is to move the node deletion to a thread pool and have the deletion task confirm that the pod has actually been removed from Kubernetes before decrementing the plugin's counter. The deletion task waits for at most 120% of the grace period, checking the pod's status every three seconds. Currently, the grace period is hard-coded to the default of 30 seconds.
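A small sketch of the timing described above, with hypothetical helper names (isPodDeleted, decrementCounter); the real change lives in the plugin's NodeListener and is not reproduced here:

import java.util.concurrent.TimeUnit;

public class DeletionWaitSketch {

    static void waitForPodDeletion(String podName) throws InterruptedException {
        int gracePeriodSeconds = 30;                                // currently hard-coded default
        long maxWaitSeconds = Math.round(gracePeriodSeconds * 1.2); // grace period * 120% = 36 seconds
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(maxWaitSeconds);

        while (!isPodDeleted(podName) && System.nanoTime() < deadline) {
            TimeUnit.SECONDS.sleep(3);                              // check every three seconds
        }
        // Only decrement the plugin's counter once the pod is confirmed gone
        // (or the wait has timed out).
        decrementCounter(podName);
    }

    static boolean isPodDeleted(String podName) {
        return false; // placeholder for the Kubernetes API check
    }

    static void decrementCounter(String podName) {
    }
}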
All existing tests run, and they already seem to cover deletion. I am unsure how to write a test for this change but am willing to try with some help.