-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix leaking task resources when nodes are deleted #2806
Conversation
When a node is deleted, its tasks are asked to restart, which involves putting them into a desired state of Shutdown. However, the Allocator will not deallocate a task which is not in an actual state of a terminal state. Once a node is deleted, the only opportunity for its tasks to recieve updates and be moved to a terminal state is when the function moving those tasks to TaskStateOrphaned is called, 24 hours after the node enters the Down state. However, if a leadership change occurs, then that function will never be called, and the tasks will never be moved to a terminal state, leaking resources. With this change, upon node deletion, all of its tasks will be moved to TaskStateOrphaned, allowing those tasks' resources to be cleaned up. Signed-off-by: Drew Erny <drew.erny@docker.com>
Codecov Report
@@ Coverage Diff @@
## master #2806 +/- ##
==========================================
+ Coverage 61.93% 61.94% +0.01%
==========================================
Files 137 137
Lines 22131 22132 +1
==========================================
+ Hits 13706 13710 +4
+ Misses 6944 6940 -4
- Partials 1481 1482 +1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any other implications of deleting tasks right away instead of waiting for 24hrs?
ts.Store.View(func(tx store.ReadTx) { | ||
tasks, err := store.FindTasks(tx, store.All) | ||
require.NoError(t, err) | ||
// should only be 3 tasks left | ||
require.Len(t, tasks, 4) | ||
// and the list should not contain task1 | ||
// and the list should not contain task1 or task2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the task1 and task2 should have been removed, why are we testing for those tasks below?
@@ -1089,22 +1087,24 @@ func TestRemoveNodeAttachments(t *testing.T) { | |||
|
|||
// Now, call the function with our nodeID. make sure it returns no error | |||
err = ts.Store.Update(func(tx store.Tx) error { | |||
return removeNodeAttachments(tx, "id2") | |||
return orphanNodeTasks(tx, "id2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the orphaned tasks are deleted by the task reaper, correct? Can we also test for task deletions?
full diff: moby/swarmkit@7dded76...a8bbe7d changes included: - moby/swarmkit#2867 Only update non-terminal tasks on node removal - related to moby/swarmkit#2806 Fix leaking task resources when nodes are deleted - moby/swarmkit#2880 Bump to golang 1.12.9 - moby/swarmkit#2886 Bump vendoring to match current docker/docker master - regenerates protobufs - moby/swarmkit#2890 Remove hardcoded IPAM config subnet value for ingress network - fixes [ENGORC-2651] Specifying --default-addr-pool for docker swarm init is not picked up by ingress network Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
full diff: moby/swarmkit@7dded76...a8bbe7d changes included: - moby/swarmkit#2867 Only update non-terminal tasks on node removal - related to moby/swarmkit#2806 Fix leaking task resources when nodes are deleted - moby/swarmkit#2880 Bump to golang 1.12.9 - moby/swarmkit#2886 Bump vendoring to match current docker/docker master - regenerates protobufs - moby/swarmkit#2890 Remove hardcoded IPAM config subnet value for ingress network - fixes [ENGORC-2651] Specifying --default-addr-pool for docker swarm init is not picked up by ingress network Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Drew Erny drew.erny@docker.com
- What I did
When a node is deleted, its tasks are asked to restart, which involves putting them into a desired state of Shutdown. However, the Allocator will not deallocate a task which is not in an actual state of a terminal state. Once a node is deleted, the only opportunity for its tasks to recieve updates and be moved to a terminal state is when the function moving those tasks to
TaskStateOrphaned
is called, 24 hours after the node enters the Down state. However, if a leadership change occurs, then that function will never be called, and the tasks will never be moved to a terminal state, leaking resources.With this change, upon node deletion, all of its tasks will be moved to
TaskStateOrphaned
, allowing those tasks' resources to be cleaned up.- How I did it
Altered the controlapi to move all tasks belonging to the node to the actual state of Orphaned.
- How to test it
Includes a unit test. However, testing for the full failure condition is nontrivial and likely not worth the effort.
- Description for the changelog
Fix bug where deleting nodes could result in task resources leaking.