
create WAL error: fileutil: file already locked on quick restart #1421

Closed
tonistiigi opened this issue Aug 22, 2016 · 8 comments

Comments

@tonistiigi (Member)

TestSwarmInit sometimes blocks after the latest swarmkit vendoring (moby/moby#25833 (comment)). The test creates a single-node cluster, stops it, removes state (docker swarm leave), and creates a new one. On the second run, it seems that the manager fails with an error and blocks because of a related issue (#1283 (comment)).

I added debugging, and it seems that the manager is failing with the error: time="2016-08-19T23:46:02.628112210Z" level=debug msg="m.Run=can't initialize raft node: create WAL error: fileutil: file already locked".

I bisected it to #1376, but that PR only seems to change the timing between stop and start. I can also make the test pass by adding a small sleep to the test (reversing the speedup of the key change). Even with the key change, I can't reproduce it with v1.12.1, so the timing seems to be very important, or something changed in the last week (#1369, maybe).

cc @aaronlehmann

@aaronlehmann (Collaborator)

It's really strange that Run would hit this error, because this code path is only reachable if the WAL directory does not exist.

Do you think it's possible that two goroutines are calling manager.Run at the same time, or that two daemons are using the same state directory? The only explanation I can come up with that makes sense is two managers racing to create the WAL.

@tonistiigi (Member Author)

@aaronlehmann https://gist.github.com/tonistiigi/f3e7496e1b8523174d0729080705ea38#file-daemon-log-L166-L184 Added logs for the manager starting and the error. Both seem to appear only twice, and the second one only after the first has finished.

@tonistiigi (Member Author)

@aaronlehmann Added logs around the file locks: https://gist.github.com/tonistiigi/607c9647f8c50264410ea6bc25ab6c3c . Can't say the result makes much sense.

@aaronlehmann (Collaborator)

I finally have an explanation for this that at least makes sense. A file is locked if any process holds the lock, and these locks are inherited across fork().

Go's os.OpenFile opens files with close-on-exec enabled, but between the fork and the exec, the lock is still held by the second process.

wal.Create releases the lock and then relocks the file after renaming the directory. If Docker forked to spawn a process just before this, the lock could still be held by that child process (because it hasn't called exec yet).

The only solution I can think of is adding some retry logic. This logic probably belongs in etcd/wal or etcd/pkg/fileutil.

@aaronlehmann (Collaborator)

Another fix would be to patch upstream etcd/wal so that it only temporarily releases the lock for the rename on Windows. I'll try to open a PR for that.

@aaronlehmann (Collaborator)

@tonistiigi: Can you try this patch? aaronlehmann/etcd@5a39edf

@tonistiigi (Member Author)

@aaronlehmann Yes, that patch seems to fix it for me.

@tonistiigi (Member Author)

Fixed by #1448.
