fix crash when there are attachments with null network #2819
base: master
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2819      +/-   ##
==========================================
+ Coverage   61.94%   61.95%   +0.01%
==========================================
  Files         139      139
  Lines       22120    22122       +2
==========================================
+ Hits        13702    13706       +4
- Misses       6937     6939       +2
+ Partials     1481     1477       -4
lol, went to look through the logic to see if we could safely ignore this case (leading to some PRs I opened 😂)
LGTM - we should probably backport this one to relevant release branches
ping @dperny @anshulpundir PTAL
whoops, needs a rebase @andrey-ko 🤗
@@ -1052,6 +1052,9 @@ func (a *Allocator) allocateNode(ctx context.Context, node *api.Node, existingAd
 	// for every attachment, go through every network. if the attachment
 	// belongs to one of the networks, then go to the next attachment. if
 	// no network matches, then the the attachment should be removed.
+	if attach.Network == nil {
+		continue
+	}
Should arguably be placed before the comment on lines 1052-1054, since that comment is about what follows this.
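A minimal runnable sketch of that ordering, using simplified stand-in types rather than the real api package (the names and loop shape here are illustrative assumptions):

package main

import "fmt"

// Simplified stand-ins for the swarmkit api types (assumptions for
// illustration only).
type Network struct{ ID string }
type NetworkAttachment struct{ Network *Network }

func main() {
	attachments := []*NetworkAttachment{
		{Network: &Network{ID: "net1"}},
		{Network: nil}, // the state that triggered the crash
	}
	for _, attach := range attachments {
		// Guard placed before the descriptive comment, as suggested,
		// so the comment stays attached to the matching logic it
		// describes.
		if attach.Network == nil {
			continue
		}
		// for every attachment, go through every network...
		fmt.Println("matching attachment on network", attach.Network.ID)
	}
}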
@@ -1052,6 +1052,9 @@ func (a *Allocator) allocateNode(ctx context.Context, node *api.Node, existingAd
 	// for every attachment, go through every network. if the attachment
 	// belongs to one of the networks, then go to the next attachment. if
 	// no network matches, then the the attachment should be removed.
+	if attach.Network == nil {
How can this happen? When attaching a network that's not working properly (e.g., one that conflicts with another subnet)? Shouldn't we remove it from the node's attachments then?
In any case, seems worth at least a comment :)
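One possible wording for such a comment, in the same shape as the diff above; the explanation of how the nil state arises is an assumption, since the root cause was still being investigated in this thread:

// attach.Network can be nil if the attachment was left in a partial
// state, e.g. when allocation was interrupted or its network was
// deleted underneath it (assumed causes, not confirmed). The code
// below dereferences attach.Network, which would panic, so skip such
// attachments here.
if attach.Network == nil {
	continue
}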
+1 for comment.
Yes, that's a great point, but maybe we should just return an error and fail the operation?
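A sketch of that alternative as a hypothetical helper (simplified stand-in types, not swarmkit's real api package); whether failing the whole allocation is safer than skipping was left open in this thread:

package allocator

import "fmt"

// Simplified stand-ins for the swarmkit api types.
type Network struct{ ID string }
type NetworkAttachment struct{ Network *Network }

// validateAttachments fails the operation instead of silently
// skipping a malformed attachment (hypothetical helper name).
func validateAttachments(attachments []*NetworkAttachment) error {
	for i, attach := range attachments {
		if attach.Network == nil {
			return fmt.Errorf("attachment %d has a nil network", i)
		}
	}
	return nil
}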
Looks good once the comments are addressed and the branch is rebased.
@@ -1052,6 +1052,9 @@ func (a *Allocator) allocateNode(ctx context.Context, node *api.Node, existingAd
 	// for every attachment, go through every network. if the attachment
 	// belongs to one of the networks, then go to the next attachment. if
 	// no network matches, then the the attachment should be removed.
+	if attach.Network == nil {
+1 for comment.
I wonder if this is related to #2764...?
Signed-off-by: Andrey Kolomentsev <andrey.kolomentsev@docker.com>
Doesn't look like it would cause the issue, because in "restoring" mode it would return before reaching that part: https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1041-L1043
In the "non-restore" case, the attachment would be created (if a given network was not yet found in the list of attachments): https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1020-L1028
I do wonder, though, whether (in the "restoring" case) we should remove such attachments in (or before) the first loop, i.e. remove all empty attachments first: https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L995-L1007
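A sketch of that idea as a hypothetical helper (simplified stand-in types; the real code would operate on node.Attachments):

package allocator

// Simplified stand-ins for the swarmkit api types.
type Network struct{ ID string }
type NetworkAttachment struct{ Network *Network }

// pruneNilNetworkAttachments drops attachments whose network reference
// is missing, so the loops that follow never see them. It filters in
// place to avoid an extra allocation.
func pruneNilNetworkAttachments(attachments []*NetworkAttachment) []*NetworkAttachment {
	kept := attachments[:0]
	for _, attach := range attachments {
		if attach.Network != nil {
			kept = append(kept, attach)
		}
	}
	return kept
}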
LGTM
I don't want to merge this without a root cause. The network allocator is absurdly fragile, and making changes without understanding them almost always causes further breakage. I'm looking at the code now.
@andrey-ko Have you reproduced this crash anywhere?
Is it possible that a network was deleted around the same time that the leadership change occurred?
Are we even sure it's the attachment network that's nil, and not the attachment itself? The line includes …
I hate the allocator.
So it definitely happened after the leadership change occurred, but I don't remember if I deleted any network around the same time this event happened.
I'm going to try to reproduce it again, but it may take some time; I don't remember the exact sequence of events that led to this.
One of the reasons to add this check was that there is a similar check a little bit earlier in this function.
ping @dperny PTAL
Hi guys, I think I'm having the same issue on my production swarm. Almost every time I deploy or remove a stack I get the same error: first my Docker leader exits, followed by the other managers. However, after waiting a certain period I am able to successfully deploy all the stacks to the swarm. docker node ls:
Logs:
Fix for a crash that I caught while playing with a swarm cluster.
Here is the stack trace of this crash:
The stack trace is very similar to the issue described here:
moby/moby#38287
In my case I had a swarm cluster with 3 managers, then I demoted 2 of them.
After that, Docker started crashing 100% of the time on the remaining manager node.
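For illustration, a minimal stand-alone program that reproduces the same class of nil-pointer panic the guard prevents (simplified stand-in types, not the real api package):

package main

type Network struct{ ID string }
type NetworkAttachment struct{ Network *Network }

func main() {
	attach := &NetworkAttachment{} // Network left nil, as in the crash
	_ = attach.Network.ID          // panics: nil pointer dereference
}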