Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix leaking task resources when nodes are deleted #2806

Merged
merged 1 commit into from
Mar 21, 2019

Conversation

dperny
Copy link
Collaborator

@dperny dperny commented Jan 14, 2019

Signed-off-by: Drew Erny drew.erny@docker.com

- What I did

When a node is deleted, its tasks are asked to restart, which involves putting them into a desired state of Shutdown. However, the Allocator will not deallocate a task which is not in an actual state of a terminal state. Once a node is deleted, the only opportunity for its tasks to recieve updates and be moved to a terminal state is when the function moving those tasks to TaskStateOrphaned is called, 24 hours after the node enters the Down state. However, if a leadership change occurs, then that function will never be called, and the tasks will never be moved to a terminal state, leaking resources.

With this change, upon node deletion, all of its tasks will be moved to TaskStateOrphaned, allowing those tasks' resources to be cleaned up.

- How I did it

Altered the controlapi to move all tasks belonging to the node to the actual state of Orphaned.

- How to test it

Includes a unit test. However, testing for the full failure condition is nontrivial and likely not worth the effort.

- Description for the changelog

Fix bug where deleting nodes could result in task resources leaking.

When a node is deleted, its tasks are asked to restart, which involves
putting them into a desired state of Shutdown. However, the Allocator
will not deallocate a task which is not in an actual state of a terminal
state. Once a node is deleted, the only opportunity for its tasks to
recieve updates and be moved to a terminal state is when the function
moving those tasks to TaskStateOrphaned is called, 24 hours after the
node enters the Down state. However, if a leadership change occurs, then
that function will never be called, and the tasks will never be moved to
a terminal state, leaking resources.

With this change, upon node deletion, all of its tasks will be moved to
TaskStateOrphaned, allowing those tasks' resources to be cleaned up.

Signed-off-by: Drew Erny <drew.erny@docker.com>
@codecov
Copy link

codecov bot commented Jan 14, 2019

Codecov Report

Merging #2806 into master will increase coverage by 0.01%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #2806      +/-   ##
==========================================
+ Coverage   61.93%   61.94%   +0.01%     
==========================================
  Files         137      137              
  Lines       22131    22132       +1     
==========================================
+ Hits        13706    13710       +4     
+ Misses       6944     6940       -4     
- Partials     1481     1482       +1

Copy link
Contributor

@anshulpundir anshulpundir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other implications of deleting tasks right away instead of waiting for 24hrs?

ts.Store.View(func(tx store.ReadTx) {
tasks, err := store.FindTasks(tx, store.All)
require.NoError(t, err)
// should only be 3 tasks left
require.Len(t, tasks, 4)
// and the list should not contain task1
// and the list should not contain task1 or task2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the task1 and task2 should have been removed, why are we testing for those tasks below?

@@ -1089,22 +1087,24 @@ func TestRemoveNodeAttachments(t *testing.T) {

// Now, call the function with our nodeID. make sure it returns no error
err = ts.Store.Update(func(tx store.Tx) error {
return removeNodeAttachments(tx, "id2")
return orphanNodeTasks(tx, "id2")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the orphaned tasks are deleted by the task reaper, correct? Can we also test for task deletions?

@dperny dperny merged commit 18e7e58 into moby:master Mar 21, 2019
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Oct 7, 2019
full diff: moby/swarmkit@7dded76...a8bbe7d

changes included:

- moby/swarmkit#2867 Only update non-terminal tasks on node removal
  - related to moby/swarmkit#2806 Fix leaking task resources when nodes are deleted
- moby/swarmkit#2880 Bump to golang 1.12.9
- moby/swarmkit#2886 Bump vendoring to match current docker/docker master
  - regenerates protobufs
- moby/swarmkit#2890 Remove hardcoded IPAM config subnet value for ingress network
  - fixes [ENGORC-2651] Specifying --default-addr-pool for docker swarm init is not picked up by ingress network

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Oct 21, 2019
full diff: moby/swarmkit@7dded76...a8bbe7d

changes included:

- moby/swarmkit#2867 Only update non-terminal tasks on node removal
  - related to moby/swarmkit#2806 Fix leaking task resources when nodes are deleted
- moby/swarmkit#2880 Bump to golang 1.12.9
- moby/swarmkit#2886 Bump vendoring to match current docker/docker master
  - regenerates protobufs
- moby/swarmkit#2890 Remove hardcoded IPAM config subnet value for ingress network
  - fixes [ENGORC-2651] Specifying --default-addr-pool for docker swarm init is not picked up by ingress network

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants