
Only create global service tasks on nodes that satisfy the constraints #1570

Merged: 3 commits merged into moby:master from global-service-orchestration, Oct 6, 2016

Conversation

@aaronlehmann (Collaborator) commented Sep 26, 2016

Previously, the global orchestrator would always create a task for every node, and the scheduler would figure out which ones could proceed into a running state. This was inefficient, confusing to users, and caused problems with rolling updates (updates would get stuck if they encountered tasks that couldn't start due to constraints). Change the global orchestrator to only create tasks on nodes that meet the constraints. The scheduler still handles resource reservations.

This change also fixes a few problems I found in the global orchestrator:

  • Global service updates shouldn't apply to paused nodes. Since the node can't accept new tasks, trying to update the service just prevents it from running on that node.
  • As a consequence, node reconciliation needs to make sure existing tasks match the current service spec. For example, you could pause a node, update the service, and then later activate the node.
  • Drained nodes weren't being reconciled on orchestrator startup, so a leftover task on one of these nodes would never be shut down.
  • The Batch callback in service reconciliation would wrongly terminate early if it encountered a task that had completed.

The first commit moves constraint parsing and evaluation to its own package so the orchestrator can use it as well. It makes a few cleanups to the method naming in this code.
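
As a rough illustration, this is how an orchestrator-side caller might use the relocated package. Parse (renamed from ParseExprs) and the Constraint type are named in this PR; the import paths, the node-matching helper (called NodeMatches below), and the exact signatures are assumptions made only for this sketch.

package main

import (
	"fmt"

	"github.com/docker/swarmkit/api"
	"github.com/docker/swarmkit/manager/constraint"
)

// nodeSatisfiesConstraints reports whether a node is eligible for a global
// service's tasks, based on the service's placement constraints.
func nodeSatisfiesConstraints(service *api.Service, node *api.Node) (bool, error) {
	placement := service.Spec.Task.Placement
	if placement == nil || len(placement.Constraints) == 0 {
		// No constraints: every node is eligible.
		return true, nil
	}

	// Parse was renamed from ParseExprs as part of this PR.
	constraints, err := constraint.Parse(placement.Constraints)
	if err != nil {
		return false, err
	}

	// NodeMatches is assumed here as the evaluation helper.
	return constraint.NodeMatches(constraints, node), nil
}

func main() {
	fmt.Println("sketch only; the global orchestrator would run this check before creating a task on a node")
}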

The second commit does the actual change to the global orchestrator.

The third commit is an optimization that changes reconcileOneService and reconcileServiceOneNode to handle more than one service at a time. This avoids a raft round trip for each service, and also avoids some redundant queries to the memory store. It should not cause any behavior difference and could be split into a separate PR, but I think it's easier to review and test all of this at once.
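
To make the batching idea concrete, here is a minimal sketch under stated assumptions: the Batch and Tx types below are hypothetical stand-ins, not swarmkit's actual store API. The point is only that per-node reconciliation queues writes for many services on one shared batch, which is committed once, instead of paying one raft round trip per service.

package main

import "fmt"

// Hypothetical stand-ins for the memory store's transaction/batch primitives;
// the real swarmkit types differ.
type Tx struct{}

// Batch queues task writes so a single commit covers all of them, instead of
// one commit per service.
type Batch struct{ queued []func(Tx) error }

func (b *Batch) Update(fn func(Tx) error) { b.queued = append(b.queued, fn) }

func (b *Batch) Commit() error {
	tx := Tx{}
	for _, fn := range b.queued {
		if err := fn(tx); err != nil {
			return err
		}
	}
	fmt.Printf("committed %d queued writes in one round trip\n", len(b.queued))
	return nil
}

// reconcileServicesOneNode reconciles every listed service against a single
// node, queuing all resulting task writes on the shared batch.
func reconcileServicesOneNode(batch *Batch, serviceIDs []string, nodeID string) {
	for _, serviceID := range serviceIDs {
		sid := serviceID
		batch.Update(func(tx Tx) error {
			// Create, update, or remove the (service, node) task here.
			fmt.Printf("reconcile service %s on node %s\n", sid, nodeID)
			return nil
		})
	}
}

func main() {
	b := &Batch{}
	reconcileServicesOneNode(b, []string{"svc1", "svc2", "svc3"}, "node1")
	_ = b.Commit()
}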

I've run the Docker integration tests against this and I've tested this a bit with the Docker CLI. I'm working on an integration test to cover global services that use constraints.

Addresses moby/moby#26325

cc @dongluochen @aluzzardi

@codecov-io commented Sep 26, 2016

Current coverage is 54.13% (diff: 56.28%)

Merging #1570 into master will increase coverage by 0.10%

@@             master      #1570   diff @@
==========================================
  Files            83         82     -1   
  Lines         13621      13658    +37   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           7360       7394    +34   
  Misses         5265       5265          
- Partials        996        999     +3   


@aaronlehmann (Collaborator, Author)

I wrote a Docker integration test: aaronlehmann/docker@d6bc367

@aaronlehmann (Collaborator, Author)

And a test covering draining and pausing for global services: aaronlehmann/docker@dd5f3cd

g.addTask(ctx, batch, service.Service, nodeID)
} else {
// If task is out of date, update it. This can happen
// on node reconciliation if, for example, we drain a
@dongluochen (Contributor) commented Sep 27, 2016

If the node is drained, global service tasks are removed. This should be covered by if len(tasks) == 0. Do you mean the node was paused?

@aaronlehmann (Collaborator, Author)

Yes, this was meant to say paused. I'll fix it.

@aaronlehmann (Collaborator, Author)

Fixed.

}

if node.Spec.Availability == api.NodeAvailabilityPause {
// the node is paused, so we won't add or update
@dongluochen (Contributor) commented Sep 27, 2016

Should this move before if _, exists := nodeCompleted[nodeID]; exists || !meetsConstraints {? I think paused is a state that should block reconciliation.

@aaronlehmann (Collaborator, Author)

Paused means it doesn't accept new tasks, so I think a paused node should still kill the task if the service is deleted. That's how it works for replicated services.
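
To make the availability handling discussed in this thread concrete, here is a hedged sketch of the per-node decision; the types and helper names are illustrative only, not swarmkit's actual code. A deleted service or a drained node always gets its tasks removed, a paused node keeps its existing tasks but receives no new or updated ones, and an active node that satisfies the constraints gets a task created or brought up to date.

package main

import "fmt"

// Illustrative availability values; the real API uses api.NodeAvailability*
// constants.
type availability int

const (
	availabilityActive availability = iota
	availabilityPause
	availabilityDrain
)

type action int

const (
	actionNone action = iota
	actionRemoveTasks
	actionCreateTask
	actionUpdateTask
)

// decide sketches the per-node reconciliation choice discussed above.
func decide(avail availability, serviceDeleted, meetsConstraints, hasTask, taskUpToDate bool) action {
	switch {
	case serviceDeleted:
		// Even a paused node should shut down tasks of a deleted service,
		// matching how replicated services behave.
		return actionRemoveTasks
	case avail == availabilityDrain:
		// Drained nodes run no tasks at all.
		return actionRemoveTasks
	case avail == availabilityPause:
		// Paused nodes accept no new or updated tasks; leave existing ones.
		return actionNone
	case !meetsConstraints:
		// With this PR, tasks exist only on nodes that satisfy the
		// service's constraints.
		if hasTask {
			return actionRemoveTasks
		}
		return actionNone
	case !hasTask:
		return actionCreateTask
	case !taskUpToDate:
		// Covers the pause-then-update-then-activate case: the existing
		// task must be brought up to the current service spec.
		return actionUpdateTask
	default:
		return actionNone
	}
}

func main() {
	fmt.Println(decide(availabilityPause, false, true, true, false) == actionNone) // prints true
}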

This code was formerly part of the scheduler package. It needs to be
moved to an independent package so the orchestrator can use it when
considering where to create tasks for global services.

While moving it, made a few minor cleanups:

- Renamed Expr to Constraint
- Unexport Key field
- Renamed ParseExprs to Parse

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
…the constraints

Previously, the global orchestrator would always create a task for every
node, and the scheduler would figure out which ones could proceed into a
running state. This was inefficient, confusing to users, and caused
problems with rolling updates (updates would get stuck if they
encountered tasks that couldn't start due to constraints). Change the
global orchestrator to only create tasks on nodes that meet the
constraints. The scheduler still handles resource reservations.

This change also fixes a few problems I found in the global
orchestrator:

- Global service updates shouldn't apply to paused nodes. Since the node
  can't accept new tasks, trying to update the service just prevents it
  from running on that node.
- As a consequence, node reconciliation needs to make sure existing tasks
  match the current service spec. For example, you could pause a node,
  update the service, and then later activate the node.
- Drained nodes weren't being reconciled on orchestrator startup, so a
  leftover task on one of these nodes would never be shut down.
- The Batch callback in service reconciliation would wrongly terminate
  early if it encountered a task that had completed.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
This avoids extra raft writes and extra reads from the store.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann aaronlehmann force-pushed the global-service-orchestration branch from a3bff52 to bc7b79e on September 27, 2016 at 21:08
@dongluochen (Contributor)

LGTM

@aaronlehmann (Collaborator, Author)

@aluzzardi: Do you want to review this one? Should I go ahead and merge it?

@aluzzardi (Member) left a comment

LGTM

Minor nit

@@ -0,0 +1,164 @@
package constraint
@aluzzardi (Member)

nit: Shouldn't this be a subpackage of scheduler?

@aaronlehmann (Collaborator, Author)

Would that make sense, now that the orchestrator uses it as well?

@aluzzardi (Member)

Yeah, it's a moot point. To me it felt like it was part of the scheduler (which, at the end of the day, is the component responsible for placing tasks onto nodes), and it turns out another component (the global orchestrator) wants to use a piece of the scheduler. I don't think the scheduler and orchestrator are sharing a common piece; it's more that the orchestrator wants to use the scheduler for some things.

But it's a minor personal preference; there are no objective arguments here.

@aaronlehmann aaronlehmann merged commit 0ff1041 into moby:master Oct 6, 2016
@aaronlehmann aaronlehmann deleted the global-service-orchestration branch October 6, 2016 14:33
@allencloud (Contributor)

Great work. However, I have a question. Here is the scenario:

  1. a cluster with 5 nodes: 2 managers and 3 workers;
  2. create a global service constrained to run only on managers: docker service create --mode global --constraint node.role==manager ubuntu:14.04 sleep 100000; this creates a service with two tasks;
  3. if I promote a worker node to a manager node, will a new task be created on the promoted node?

@aaronlehmann
Thanks

@aaronlehmann (Collaborator, Author)

@allencloud: Currently, this isn't handled. There is an issue covering it: #1009

When we fix #1009, we will have to make sure that the orchestrator handles these cases (for global services), not just the scheduler.

@aaronlehmann (Collaborator, Author)

Actually, I believe the particular case you're talking about should be handled correctly because of this PR. When a node is updated (for example, when its role is changed to manager), the orchestrator will reconcile the global services running on that node. If it now meets a constraint that it didn't meet before, a task will be created for that node. If it no longer meets a constraint that it met before the update, its task will be removed.

But #1009 is still a problem for replicated services.
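
As a rough sketch of the flow described above, with hypothetical event and orchestrator types (not swarmkit's actual code): a node update such as a role change triggers reconciliation of every global service against that node, so a constraint that is newly satisfied or newly violated results in a task being created or removed.

package main

import "fmt"

// Hypothetical event and orchestrator types used only for this sketch.
type NodeUpdated struct{ NodeID string }

type globalOrchestrator struct {
	globalServiceIDs []string
}

// handleNodeUpdate re-evaluates every global service against the updated
// node, so a node that newly satisfies (or newly violates) a constraint gets
// a task created (or removed).
func (g *globalOrchestrator) handleNodeUpdate(ev NodeUpdated) {
	for _, serviceID := range g.globalServiceIDs {
		// In the real orchestrator this would call into the per-node
		// reconciliation that checks constraints and availability.
		fmt.Printf("reconcile global service %s on node %s\n", serviceID, ev.NodeID)
	}
}

func main() {
	g := &globalOrchestrator{globalServiceIDs: []string{"global-svc"}}
	// For example, a worker promoted to manager shows up as a node update.
	g.handleNodeUpdate(NodeUpdated{NodeID: "promoted-node"})
}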

@colinmollenhour

Did this not make it into 1.12.2? I tested like so:

# docker service create --mode global --constraint "node.foo==bar" --name foo busybox sleep 100000
# docker service ps foo
ID                         NAME     IMAGE    NODE   DESIRED STATE  CURRENT STATE             ERROR
9e2w5ymlj6s6758n6fo4kgmn5  foo      busybox  test1  Running        Allocated 13 seconds ago
2xg2w0ewde3jtlwrmgjl75oyv   \_ foo  busybox  test4  Running        Allocated 13 seconds ago
9cy73sxyb2ps9aqp2pedjlc1l   \_ foo  busybox  test2  Running        Allocated 13 seconds ago
agyleclrfwigvz1wib17s1el2   \_ foo  busybox  test5  Running        Allocated 13 seconds ago
cn7fixwz6mzwelhgdxp4fr6no   \_ foo  busybox  test3  Running        Allocated 13 seconds ago
# docker node inspect --format {{.Spec.Labels}} test1 test2 test3 test4 test5
map[]
map[foo:bar]
map[role:cache]
map[]
map[]

So if I understand correctly, this service should only be allocated on test2, right?

@aaronlehmann (Collaborator, Author)

This is for 1.13, sorry.

@allencloud (Contributor)

Thanks for your detailed explanation. @aaronlehmann
🐼
