Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allocator: Avoid allocating tasks that are no longer running #2017

Merged
merged 1 commit into from
Mar 10, 2017

Conversation

aaronlehmann
Copy link
Collaborator

The allocator currently uses taskDead to check if a task should not be allocated. However, this checks that both the desired state and actual state are past "running". In the case of network attachment tasks, there is no orchestration, and the desired state is always "running". This means allocation can be reattempted on a failed attachment task indefinitely.

Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.

Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.

cc @alexmavr @yongtang @aboch @dongluochen

@codecov
Copy link

codecov bot commented Mar 8, 2017

Codecov Report

Merging #2017 into master will increase coverage by 0.06%.
The diff coverage is 33.33%.

@@            Coverage Diff             @@
##           master    #2017      +/-   ##
==========================================
+ Coverage   53.76%   53.83%   +0.06%     
==========================================
  Files         109      109              
  Lines       19037    19033       -4     
==========================================
+ Hits        10236    10247      +11     
+ Misses       7563     7557       -6     
+ Partials     1238     1229       -9

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5eb9c0...9de6ccd. Read the comment docs.

@aboch
Copy link

aboch commented Mar 8, 2017

The issue would potentially be exposed only during leader re-election, because at swarm init there can't be any network attachment task in store. Correct ?

@aaronlehmann
Copy link
Collaborator Author

That's right. It wouldn't happen when the task first fails, because the state would be >= Pending. But after a leader election, the task permanently gets added to the unallocated list.

@aboch
Copy link

aboch commented Mar 8, 2017

Looks good to me

@aluzzardi
Copy link
Member

Having a hard time mentally mapping this.

Current state:

checks that both the desired state and actual state are past "running"

Proposed state:

Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.

So we are completely reversing the behavior?

@aaronlehmann
Copy link
Collaborator Author

@aluzzardi: This shouldn't be reversing the behavior. I think maybe you're missing the ! in !taskRunning?

Current state:

  • checks that both the desired state and actual state are past "running"
  • Skips tasks when both the desired state and actual state are past "running"

Proposed state:

  • Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below.

@aluzzardi
Copy link
Member

Sorry, I don't get it :/

Aren't those two the same? Also, "Change the allocator to only try to allocate tasks where both the desired and actual state are "running" or below." -> why would we want to allocate tasks that are running?

@aaronlehmann
Copy link
Collaborator Author

Sorry, I don't get it :/ Aren't those two the same?

Basically this is changing an && to an ||. Pseudocode for before:

if desiredState > running && actualState > running {
    skip task
}

Pseudocode for after:

if desiredState > running || actualState > running {
    skip task
}

The motivation for this is that network attachment tasks never change their desired state. The tasks I saw in a dump from a cluster getting lots of log spam from repeated allocations were network attachment tasks with desired state running but actual state failed.

why would we want to allocate tasks that are running?

I had a comment in the PR description about this:

Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.

Basically trying to make the least invasive change that will fix the problem, but I think tightening the check to skip running tasks as well would also be correct.

@aluzzardi
Copy link
Member

Why not:

if actualState > running {
    skip task
}

If we don't desire a task to be running, but it is actually running, should we really skip it?

@aaronlehmann
Copy link
Collaborator Author

I think you're right. Paying any attention at all to desired state here seems bogus.

Suppose the orchestrator tries to start a task, but soon after, the service is updated, and it ends up changing that task to desired state = shutdown. That task might get stuck in the allocator forever if the allocator ignores it on the grounds that desired state is past running.

Should we just switch to observed state? I think we should, but I'm wondering if I'm missing anything here.

The only potential risk I see in making this change is that we won't deallocate failed tasks until the observed state changes. So if a node is down, we might not be able to free the resources until the task becomes orphaned. But I thought that was how it was supposed to work anyway! The fact that it's freeing resources based on desired state seems like a bug.

@aaronlehmann aaronlehmann force-pushed the allocator-dead-tasks branch from 81be482 to aad8422 Compare March 9, 2017 22:03
@aaronlehmann
Copy link
Collaborator Author

@aluzzardi: I've updated this to only check observed state. I think this is the correct thing to do. It also makes the code simpler. It's a little bit more of a change from what was there before, though. I removed the taskRunning and taskDead functions becuase I think they're confusing and obscure what is actually being checked.

@aboch: Can you please take a second look?

The allocator currently uses taskDead to check if a task should not be
allocated. However, this checks that both the desired state and actual
state are past "running". In the case of attachment tasks, there is no
orchestration, and the desired state is always "running". This means
allocation can be reattempted on a failed attachment task indefinitely.

Change the allocator to only try to allocate tasks where both the
desired and actual state are "running" or below.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann aaronlehmann force-pushed the allocator-dead-tasks branch from aad8422 to 9de6ccd Compare March 9, 2017 22:21
@aaronlehmann
Copy link
Collaborator Author

The PR that added desired state checking is #435. I reviewed it, but right now I don't understand why it's necessary or beneficial. I see that the conditions inside taskRunning and taskDead have changed significantly since then. They used to check for specific desired states, and now they check for a range. Maybe this made sense at the time, but it doesn't make sense now in the context of how desired state has evolved.

Still, I'm a little concerned that we don't understand this code very well. I'm tempted to go back to the original fix to be safer, but that's just going to delay addressing the technical debt.

What about removing the desired state checking in master, and just doing the tweak in the original PR for the stable release? Not perfect either because it introduces a weird difference, but it seems worth considering.

@aaronlehmann
Copy link
Collaborator Author

FWIW, I vendored this PR (along with all the others) into docker, and all integration tests passed.

@@ -281,7 +281,7 @@ func (a *Allocator) doNetworkInit(ctx context.Context) (err error) {
}

for _, t := range tasks {
if taskDead(t) {
if t.Status.State > api.TaskStateRunning {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, because if the task's state is beyond running, we should not bother attempting an allocation for it, regardless of the desired state. Once a task moves beyond running, it cannot be brought back in the pipeline right ?
A new task will be created in place of it, I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once a task moves beyond running, it cannot be brought back in the pipeline right ?

That's correct.

if taskDead(t) || isDelete {
// If the task has stopped running then we should free the network
// resources associated with the task right away.
if t.Status.State > api.TaskStateRunning || isDelete {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only concern here is, as you guys already discussed here and also in the pr which introduced the current check, is that we release the resources associated to this task, while the task may still be running on an unresponsive node, and a new task coming up will be given those resources.
So, we could end up with two containers in the cluster having same IP address.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I don't think this is a problem. This is looking at observed state, not desired state. So we err on the side of keeping the resources tied up if the node is unresponsive. There is a mechanism that marks tasks as orphaned after 24 hours of the node being down, so the allocator can finally free these resources.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for the clarification.

@dongluochen
Copy link
Contributor

Technically we could probably exclude any task which is at "pending" or beyond, because these have already been allocated, but I'm less confident in that.

Looking at https://github.com/docker/swarmkit/blob/master/manager/allocator/network.go#L295-L300, I think "pending" and beyond is not enough. If I understand correctly, t.status.State > running is right here. While a task could move between pending and running, network allocation for the task might not have propagated.

@aaronlehmann
Copy link
Collaborator Author

Looking at https://github.com/docker/swarmkit/blob/master/manager/allocator/network.go#L295-L300, I think "pending" and beyond is not enough. If I understand correctly, t.status.State > running is right here.

Agreed. I realized that after making that comment.

@aboch
Copy link

aboch commented Mar 10, 2017

Looks good to me still.

@dongluochen
Copy link
Contributor

LGTM

1 similar comment
@aluzzardi
Copy link
Member

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants