[18.03 backport] Fix leaking task resources when nodes are deleted #2841
ping @dperny PTAL |
Fails tests because 18.03 uses an older version of gogo/proto/types which does not include the TimestampNow function. |
Yeah, sorry about not getting to it earlier. Let me see if I can fix it. |
02f6933 to 0c472d8
@thaJeztah I fixed the issue by coding around |
I'm also gonna add a cherry pick of #2867, because it's sort of a necessary fix for this fix. |
Failing on linting;
|
@dperny if you have to update this one again, would it make sense to do a clean cherry-pick of the original commit, and an extra commit with the local changes to address the In that case if (for whatever reason) we will update (it also makes it slightly clearer what the actual modifications were) |
We do need this one for 18.03.1-ee-11 (of which we already have -tp1). I can't set the milestone here; can someone else please? |
opened #2877 to fix CI failures |
When a node is deleted, its tasks are asked to restart, which involves putting them into a desired state of Shutdown. However, the Allocator will not deallocate a task whose actual state is not terminal. Once a node is deleted, the only opportunity for its tasks to receive updates and be moved to a terminal state is when the function moving those tasks to TaskStateOrphaned is called, 24 hours after the node enters the Down state. However, if a leadership change occurs, then that function will never be called, and the tasks will never be moved to a terminal state, leaking resources.

With this change, upon node deletion, all of its tasks will be moved to TaskStateOrphaned, allowing those tasks' resources to be cleaned up.

Additionally, as part of this backport, avoid using the gogo types.TimestampNow function, which does not exist in the vendored version.

Signed-off-by: Drew Erny <drew.erny@docker.com>
(cherry picked from commit 8467e6a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When a node is removed, its tasks are set in state ORPHANED. This does not need to be done for tasks that are already in a terminal state, and if all tasks in all states are updated, the size of the transaction may grow too large to process, and node removal becomes impossible.

This change sets only non-terminal tasks to state ORPHANED and leaves terminal tasks alone.

Cherry pick does not apply cleanly, but the fix is rather simple.

Signed-off-by: Drew Erny <drew.erny@docker.com>
(cherry picked from commit d5df265)
Signed-off-by: Drew Erny <drew.erny@docker.com>
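A rough sketch of the filtering described in this commit message, under stated assumptions: the TaskState enum, the Task struct, and orphanNodeTasks are simplified, hypothetical stand-ins (the real code works against the store inside a transaction), and "terminal" is assumed to mean any state ordered after Running.

```go
package main

import "fmt"

// TaskState is a simplified stand-in for the task state enum.
// Assumption of this sketch: every state after Running is terminal.
type TaskState int

const (
	TaskStateRunning TaskState = iota
	TaskStateCompleted
	TaskStateShutdown
	TaskStateFailed
	TaskStateOrphaned
)

// Task is a hypothetical, minimal task record.
type Task struct {
	ID     string
	NodeID string
	State  TaskState
}

// isTerminal reports whether a state is past Running.
func isTerminal(s TaskState) bool { return s > TaskStateRunning }

// orphanNodeTasks marks only the non-terminal tasks of the removed node
// as orphaned and returns how many tasks were updated. Skipping terminal
// tasks keeps the number of writes (and so the transaction) small.
func orphanNodeTasks(tasks []*Task, nodeID string) int {
	updated := 0
	for _, t := range tasks {
		if t.NodeID != nodeID || isTerminal(t.State) {
			continue
		}
		t.State = TaskStateOrphaned
		updated++
	}
	return updated
}

func main() {
	tasks := []*Task{
		{ID: "t1", NodeID: "n1", State: TaskStateRunning},
		{ID: "t2", NodeID: "n1", State: TaskStateFailed},
		{ID: "t3", NodeID: "n2", State: TaskStateRunning},
	}
	fmt.Println(orphanNodeTasks(tasks, "n1")) // prints 1
}
```

Only t1 is rewritten here: t2 is already terminal and t3 belongs to a different node.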
ade2201 to 847a883
@kolyshkin just rebased and force-pushed. |
Looks like it failed;
|
Known flaky test in the codebase that I have spent many hours on and still have not fixed. |
Codecov Report
@@ Coverage Diff @@
## bump_v18.03 #2841 +/- ##
=============================================
Coverage ? 61.7%
=============================================
Files ? 134
Lines ? 21805
Branches ? 0
=============================================
Hits ? 13455
Misses ? 6909
Partials ? 1441 |
this is for ENGCORE-711 and ENGCORE-937
backport for the bump_v18.03 branch of
When a node is deleted, its tasks are asked to restart, which involves
putting them into a desired state of Shutdown. However, the Allocator
will not deallocate a task whose actual state is not terminal. Once a
node is deleted, the only opportunity for its tasks to receive updates
and be moved to a terminal state is when the function moving those tasks
to TaskStateOrphaned is called, 24 hours after the node enters the Down
state. However, if a leadership change occurs, then that function will
never be called, and the tasks will never be moved to a terminal state,
leaking resources.

With this change, upon node deletion, all of its tasks will be moved to
TaskStateOrphaned, allowing those tasks' resources to be cleaned up.
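To make the leak concrete, here is a small model of the rule described above. It assumes the allocator frees resources only when a task's actual state is terminal, regardless of its desired state. The types and the canDeallocate function are hypothetical simplifications for illustration, not the Allocator's real code.

```go
package main

import "fmt"

// TaskState is a simplified stand-in; states after Running are
// treated as terminal (an assumption of this sketch).
type TaskState int

const (
	TaskStateRunning TaskState = iota
	TaskStateShutdown
	TaskStateOrphaned
)

// Task tracks both the state the orchestrator wants and the state
// last reported for the task.
type Task struct {
	DesiredState TaskState
	ActualState  TaskState
}

// canDeallocate models the rule from the description: only the actual
// state matters, so a desired state of Shutdown alone frees nothing.
func canDeallocate(t Task) bool {
	return t.ActualState > TaskStateRunning
}

func main() {
	// Node deleted: the task is asked to shut down, but no agent is
	// left to report a new actual state.
	t := Task{DesiredState: TaskStateShutdown, ActualState: TaskStateRunning}
	fmt.Println(canDeallocate(t)) // prints false

	// With the change, node deletion moves the task to Orphaned.
	t.ActualState = TaskStateOrphaned
	fmt.Println(canDeallocate(t)) // prints true
}
```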