Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[18.03 backport] Fix leaking task resources when nodes are deleted #2841

Merged

Conversation

thaJeztah
Copy link
Member

@thaJeztah thaJeztah commented Mar 26, 2019

this is for ENGCORE-711 and ENGCORE-937

backport for the bump_v18.03 branch of

When a node is deleted, its tasks are asked to restart, which involves
putting them into a desired state of Shutdown. However, the Allocator
will not deallocate a task which is not in an actual state of a terminal
state. Once a node is deleted, the only opportunity for its tasks to
recieve updates and be moved to a terminal state is when the function
moving those tasks to TaskStateOrphaned is called, 24 hours after the
node enters the Down state. However, if a leadership change occurs, then
that function will never be called, and the tasks will never be moved to
a terminal state, leaking resources.

With this change, upon node deletion, all of its tasks will be moved to
TaskStateOrphaned, allowing those tasks' resources to be cleaned up.

@thaJeztah
Copy link
Member Author

ping @dperny PTAL

@dperny
Copy link
Collaborator

dperny commented Mar 28, 2019

Fails tests because 18.03 uses an older version of gogo/proto/types which does not include the TimestampNow function.

@thaJeztah thaJeztah changed the title [18.03 backport] Fix leaking task resources when nodes are deleted [WIP][18.03 backport] Fix leaking task resources when nodes are deleted Aug 6, 2019
@thaJeztah
Copy link
Member Author

@dperny would you be able to carry this one and #2840 (and fix the gogo/proto/types thing)?

@dperny
Copy link
Collaborator

dperny commented Aug 7, 2019

Yeah, sorry about not getting to it earlier. Let me see if I can fix it.

@dperny dperny force-pushed the 18.03_backport_fix_leaking_task_resources branch from 02f6933 to 0c472d8 Compare August 7, 2019 15:29
@dperny
Copy link
Collaborator

dperny commented Aug 7, 2019

@thaJeztah I fixed the issue by coding around TimestampNow, and left a huge comment explaining what I did.

@dperny
Copy link
Collaborator

dperny commented Aug 7, 2019

I'm also gonna add a cherry pick of #2867, because it's sort of a necessary fix for this fix.

@thaJeztah
Copy link
Member Author

Failing on linting;

/home/circleci/.go_workspace/src/github.com/docker/swarmkit/agent/exec/dockerapi/controller.go:657:3: ineffectual assignment to protocol
/home/circleci/.go_workspace/src/github.com/docker/swarmkit/cmd/swarmctl/service/flagparser/tmpfs.go:67:12: ineffectual assignment to multiplier
make: *** [ineffassign] Error 1
Exited with code 2

@thaJeztah
Copy link
Member Author

@dperny if you have to update this one again, would it make sense to do a clean cherry-pick of the original commit, and an extra commit with the local changes to address the TimestampNow issue?

In that case if (for whatever reason) we will update gogo/proto/types, we can easily revert that commit

(it also makes it slightly clearer what the actual modifications were)

@kolyshkin
Copy link
Contributor

We do need this one for 18.03.1-ee-11 (of which we already have -tp1)

I can't set milestone here, can someone else please?

@thaJeztah
Copy link
Member Author

opened #2877 to fix CI failures

@kolyshkin
Copy link
Contributor

@dperny can you please rebase/push (to trigger CI and include #2877)

When a node is deleted, its tasks are asked to restart, which involves
putting them into a desired state of Shutdown. However, the Allocator
will not deallocate a task which is not in an actual state of a terminal
state. Once a node is deleted, the only opportunity for its tasks to
recieve updates and be moved to a terminal state is when the function
moving those tasks to TaskStateOrphaned is called, 24 hours after the
node enters the Down state. However, if a leadership change occurs, then
that function will never be called, and the tasks will never be moved to
a terminal state, leaking resources.

With this change, upon node deletion, all of its tasks will be moved to
TaskStateOrphaned, allowing those tasks' resources to be cleaned up.

Additionally, as part of this backport, avoid using the gogo
types.TimestampNow function, which does not exist in the vendored
version.

Signed-off-by: Drew Erny <drew.erny@docker.com>
(cherry picked from commit 8467e6a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When a node is removed, its tasks are set in state ORPHANED. This does
not need to be done for tasks that are already in a terminal state, and
if all tasks in all states are updated, the size of the transaction may
grow too large to process, and node removal becomes impossible.

This changes to only set non-terminal tasks to state ORPHANED, and
terminal tasks are left alone.

Cherry pick does not apply cleanly, but the fix is rather simple.

Signed-off-by: Drew Erny <drew.erny@docker.com>
(cherry picked from commit d5df265)
Signed-off-by: Drew Erny <drew.erny@docker.com>
@dperny dperny force-pushed the 18.03_backport_fix_leaking_task_resources branch from ade2201 to 847a883 Compare September 10, 2019 16:08
@dperny
Copy link
Collaborator

dperny commented Sep 10, 2019

@kolyshkin just rebased and force-pushed.

@thaJeztah
Copy link
Member Author

Looks like it failed;

time="2019-09-10T16:17:35Z" level=error msg="update failed" error="task uw1lpkrhh7dzxemk3xbqqa0s5 was already shut down when reached by updater" task.id=d6gaiwohphh95jdjdfcqlx2qz 
--- FAIL: TestUpdaterRollback (3.46s)
    --- FAIL: TestUpdaterRollback/pause/monitor_set/spec_version_unset (0.64s)
	Error Trace:	update_test.go:279
			update_test.go:19
	Error:		Not equal: "image2" (expected)
            	        != "image1" (actual)
            

FAIL

@dperny
Copy link
Collaborator

dperny commented Sep 10, 2019

known flaky test in the codebase that i have spent many hours on and still not fixed.

@codecov
Copy link

codecov bot commented Sep 17, 2019

Codecov Report

❗ No coverage uploaded for pull request base (bump_v18.03@fa7db2d). Click here to learn what that means.
The diff coverage is 88.88%.

@@              Coverage Diff              @@
##             bump_v18.03   #2841   +/-   ##
=============================================
  Coverage               ?   61.7%           
=============================================
  Files                  ?     134           
  Lines                  ?   21805           
  Branches               ?       0           
=============================================
  Hits                   ?   13455           
  Misses                 ?    6909           
  Partials               ?    1441

@dperny dperny merged commit 7ef6769 into moby:bump_v18.03 Sep 17, 2019
@thaJeztah thaJeztah deleted the 18.03_backport_fix_leaking_task_resources branch September 17, 2019 19:19
@thaJeztah thaJeztah changed the title [WIP][18.03 backport] Fix leaking task resources when nodes are deleted [18.03 backport] Fix leaking task resources when nodes are deleted Sep 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants