Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix crash when there are attachments with null network #2819

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

andrey-ko
Copy link

fix for crash that I caught while playing with swarm cluster.
here the stack trace of this crash:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x20257db]

goroutine 5668 [running]:
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNode(0xc0012db840, 0x2bcf040, 0xc001360600, 0xc00008b040, 0x0, 0xc001a8a310, 0x1, 0x1, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:1057 +0x80b
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNodes(0xc0012db840, 0x2bcf040, 0xc001360600, 0x0, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:489 +0x3ae
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkInit(0xc0012db840, 0x2bcf040, 0xc001360600, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:166 +0x7d1
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkInit-fm(0x2bcf040, 0xc001360600, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:105 +0x52
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).init(0xc0012db840, 0x2bcf040, 0xc001360600, 0xc0019ceba0, 0x0, 0x0, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:174 +0x69e
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2(0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:122 +0xf9
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run(0xc0012db840, 0x2bcf040, 0xc0001920c0, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:142 +0x62e
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader.func6(0x2bcf040, 0xc0001920c0, 0xc0012db840)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1059 +0x56
created by github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1058 +0xf6c

stack trace is very similar to issue described here:
moby/moby#38287

so in my case I had a swarm cluster with 3 managers, than I demote 2 of them.
And after this docker start crashing 100% on the manager node.

@codecov
Copy link

codecov bot commented Feb 9, 2019

Codecov Report

Merging #2819 into master will increase coverage by 0.01%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2819      +/-   ##
==========================================
+ Coverage   61.94%   61.95%   +0.01%     
==========================================
  Files         139      139              
  Lines       22120    22122       +2     
==========================================
+ Hits        13702    13706       +4     
- Misses       6937     6939       +2     
+ Partials     1481     1477       -4

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, went to look through the logic to see if we could safely ignore this case (leading to some PR's I opened 😂)

LGTM - we should probably backport this one to relevant release branches

@thaJeztah
Copy link
Member

ping @dperny @anshulpundir PTAL

@thaJeztah
Copy link
Member

whoops, needs a rebase @andrey-ko 🤗

@@ -1052,6 +1052,9 @@ func (a *Allocator) allocateNode(ctx context.Context, node *api.Node, existingAd
// for every attachment, go through every network. if the attachment
// belongs to one of the networks, then go to the next attachment. if
// no network matches, then the the attachment should be removed.
if attach.Network == nil {
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should arguably be placed before the comment on lines 1052-1054, since that comment is about what follows this.

@@ -1052,6 +1052,9 @@ func (a *Allocator) allocateNode(ctx context.Context, node *api.Node, existingAd
// for every attachment, go through every network. if the attachment
// belongs to one of the networks, then go to the next attachment. if
// no network matches, then the the attachment should be removed.
if attach.Network == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this happen? When attaching a network that's not properly working (eg conflicts with another subnet)? Shouldn't we remove it from the node's attachments then?

In any case, seems worth at least a comment :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's great point, but may be we should just return error and fail operation?

@anshulpundir anshulpundir requested a review from dperny February 13, 2019 19:59
Copy link
Contributor

@anshulpundir anshulpundir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good after comments addressed and rebased.

@@ -1052,6 +1052,9 @@ func (a *Allocator) allocateNode(ctx context.Context, node *api.Node, existingAd
// for every attachment, go through every network. if the attachment
// belongs to one of the networks, then go to the next attachment. if
// no network matches, then the the attachment should be removed.
if attach.Network == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for comment.

@dperny
Copy link
Collaborator

dperny commented Feb 13, 2019

i wonder if this is related to #2764...?

Signed-off-by: Andrey Kolomentsev <andrey.kolomentsev@docker.com>
@thaJeztah
Copy link
Member

thaJeztah commented Feb 17, 2019

i wonder if this is related to #2764...?

Doesn't look like it would cause the issue, because in "restoring" mode, it would return before reaching that part https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1041-L1043

In the "non-restore" case, the attachment would be created (if a given network was not yet found in the list of attachments https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1020-L1028

I do wonder though if (in the "restoring" case), we should such attachments in (or before) the first loop; https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L995-L1007

(i.e., remove all empty attachments first)

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dperny
Copy link
Collaborator

dperny commented Feb 19, 2019

I don't want to merge this without a root cause. The network allocator is absurdly fragile, and making changes without understanding them almost always causes further breakage. I'm looking at the code now.

@dperny
Copy link
Collaborator

dperny commented Feb 19, 2019

@andrey-ko Have you reproduced this crash anywhere?

@dperny
Copy link
Collaborator

dperny commented Feb 19, 2019

Is it possible that a network was deleted around the same time that the leadership change occurred?

@dperny
Copy link
Collaborator

dperny commented Feb 19, 2019

Are we even sure it's the attachment network that's nil, and not the attachment itself? The line includes if _, ok := nwIDs[attach.Network.ID], which means either the attachment OR the Network ID could be nil.

@dperny
Copy link
Collaborator

dperny commented Feb 19, 2019

I hate the allocator.

@andrey-ko
Copy link
Author

Is it possible that a network was deleted around the same time that the leadership change occurred?

So it definitely happened after leadership change occurred, but I don't remember if deleted any network around the same time when this event happened

@andrey-ko
Copy link
Author

ve you reproduced this crash any

I'm going to try to reproduce it again, but it may take some time, I don't remember exact sequence of events that led to this

@andrey-ko
Copy link
Author

one of the reasoning to add this check, was that there is similar check in this function a little bit earlier
https://github.com/docker/swarmkit/blob/bf7b2cb3d8a1d10fa96e836239ff6abf1459badd/manager/allocator/network.go#L1003 so I assumed it is possible to have null network in attachment

@thaJeztah
Copy link
Member

ping @dperny PTAL

@zhonghuiwen
Copy link

Hi guys, i think i'm having the same issue on my production swarm. Almost every time i deploy/remove a stack i would get the same error. So firstly my docker leader would exit followed by the other managers. However after waiting a certain period i would be able to successfully deploy all the stacks to the swarm.

docker node ls:

ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
ksxkgs5zt8akxe0ap3zr9dh4h     UNI-VMTST002        Ready               Active                                  18.03.1-ce
r64e1bdkgd8k5mdy0hwsr4xox *   uni-vmsvr30010      Ready               Active              Reachable           19.03.2
2qdiplbssg4mbt578rt91kn5g     uni-vmsvr30011      Ready               Active              Leader              18.03.1-ce
ibayugacan7nvfnqcutvgipeu     uni-vmsvr30012      Ready               Active              Reachable           19.03.1

Logs:

Sep 10 20:38:28 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:28.329163232+08:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:28 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:28.329197180+08:00" level=info msg="waiting 1.375068579s before registering session" module=node/agent node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347410569+08:00" level=info msg="3677eedf18e41c22 is starting a new election at term 4758" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347462517+08:00" level=info msg="3677eedf18e41c22 became candidate at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347482630+08:00" level=info msg="3677eedf18e41c22 received MsgVoteResp from 3677eedf18e41c22 at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347497730+08:00" level=info msg="3677eedf18e41c22 [logterm: 4758, index: 17917234] sent MsgVote request to d72e99397b72e11 at term 4759" module=raft node.id=r64e1bdk
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347510672+08:00" level=info msg="3677eedf18e41c22 [logterm: 4758, index: 17917234] sent MsgVote request to 543885594517f92f at term 4759" module=raft node.id=r64e1bd
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347524057+08:00" level=info msg="raft.node: 3677eedf18e41c22 lost leader d72e99397b72e11 at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350930604+08:00" level=info msg="3677eedf18e41c22 received MsgVoteResp from 543885594517f92f at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350958464+08:00" level=info msg="3677eedf18e41c22 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350981629+08:00" level=info msg="3677eedf18e41c22 became leader at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350996379+08:00" level=info msg="raft.node: 3677eedf18e41c22 elected leader 3677eedf18e41c22 at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.352796360+08:00" level=error msg="error creating cluster object" error="name conflicts with an existing object" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.352915264+08:00" level=info msg="leadership changed from not yet part of a raft cluster to r64e1bdkgd8k5mdy0hwsr4xox" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.353048497+08:00" level=info msg="dispatcher starting" module=dispatcher node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.373961675+08:00" level=error msg="failed to delete task" error="object does not exist" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.384913365+08:00" level=error msg="failed to delete task" error="object does not exist" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.385125825+08:00" level=error msg="failed to delete task" error="object does not exist" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442209977+08:00" level=error msg="IPAM failure while releasing IP address 10.0.11.2/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442256535+08:00" level=error msg="IPAM failure while releasing IP address 10.0.21.2/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442269586+08:00" level=error msg="IPAM failure while releasing IP address 10.0.13.4/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442279521+08:00" level=error msg="IPAM failure while releasing IP address 10.0.56.2/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442292200+08:00" level=error msg="error deallocating attachment for network kkcbtuhbcmnx1pw7xg3cfd98m on node 2qdiplbssg4mbt578rt91kn5g" error="could not find networ
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442308186+08:00" level=error msg="error deallocating attachment for network omala7bwk6ezqp8ay4d9n7pp7 on node 2qdiplbssg4mbt578rt91kn5g" error="could not find networ
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442321474+08:00" level=error msg="IPAM failure while releasing IP address 10.0.3.3/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442334972+08:00" level=error msg="IPAM failure while releasing IP address 10.0.16.8/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: panic: runtime error: invalid memory address or nil pointer dereference
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x56513d90a1eb]
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: goroutine 16421 [running]:
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNode(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0xc005405e00, 0x0, 0xc00420ed80, 0x7, 0x8, 0
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:1055 +0x75b
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNodes(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0x0, 0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:489 +0x1dd
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkInit(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:166 +0x4ab
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).init(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0xc00129f200, 0x0, 0x0, 0xc005c89ad0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:174 +0x496
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2(0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:122 +0xbc
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run(0xc003ecb600, 0x56513eb51fe0, 0xc000fbd1c0, 0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:142 +0x391
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader.func6(0x56513eb51fe0, 0xc000fbd1c0, 0xc003ecb600)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1065 +0x5f
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: created by github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1064 +0xf8e
Sep 10 20:38:31 uni-vmsvr30010 systemd[1]: docker.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 10 20:38:31 uni-vmsvr30010 systemd[1]: docker.service: Unit entered failed state.
Sep 10 20:38:31 uni-vmsvr30010 systemd[1]: docker.service: Failed with result 'exit-code'.
Sep 10 20:38:33 uni-vmsvr30010 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Sep 10 20:38:33 uni-vmsvr30010 systemd[1]: Stopped Docker Application Container Engine.
Sep 10 20:38:33 uni-vmsvr30010 systemd[1]: Starting Docker Application Container Engine...
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.556082894+08:00" level=info msg="Starting up"
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.556988572+08:00" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting --tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!]"
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558032385+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558055991+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558078720+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0  <nil>}] }" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558090946+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558156800+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc00081cf90, CONNECTING" module=grpc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants