
create WAL error: fileutil: file already locked on quick restart #1421

Closed
tonistiigi opened this issue Aug 22, 2016 · 8 comments

Comments

@tonistiigi (Member)

TestSwarmInit sometimes blocks after the latest swarmkit vendoring (moby/moby#25833 (comment)). The test creates a single-node cluster, stops it, removes state (docker swarm leave), and creates a new one. On the second run, it seems that the manager fails with an error and blocks because of a related issue (#1283 (comment)).

I added debugging, and it seems that the manager is failing with the error: time="2016-08-19T23:46:02.628112210Z" level=debug msg="m.Run=can't initialize raft node: create WAL error: fileutil: file already locked".

I bisected it to #1376, but that PR only seems to change the timing between stop and start. I can also make the test pass by adding a small sleep to the test (reversing the speedup of the key change). Even with the key change, I can't reproduce it with v1.12.1, so the timing seems to be very important, or something changed in the last week (#1369, maybe).

cc @aaronlehmann

@aaronlehmann (Collaborator)

It's really strange that Run would hit this error, because this code path is only reachable if the WAL directory does not exist.

Do you think it's possible that two goroutines are calling manager.Run at the same time, or that two daemons are using the same state directory? The only explanation I can come up with that makes sense is two managers racing to create the WAL.

@tonistiigi (Member Author)

@aaronlehmann https://gist.github.com/tonistiigi/f3e7496e1b8523174d0729080705ea38#file-daemon-log-L166-L184 Added logs for the manager starting and the error. Both seem to appear only twice, and the second one only after the first has finished.

@tonistiigi (Member Author)

@aaronlehmann Added logs around the file locks: https://gist.github.com/tonistiigi/607c9647f8c50264410ea6bc25ab6c3c . Can't say the result makes much sense.

@aaronlehmann (Collaborator)

I finally have an explanation for this that at least makes sense. A file is locked if any process holds the lock, and these locks are inherited across fork().

Go's os.OpenFile opens files with close-on-exec enabled, but between the fork and the exec, the lock is still held by the second process.

wal.Create releases the lock and then relocks the file after renaming the directory. If Docker forked to spawn a process just before this, the lock could still be held by that child process (because it hasn't called exec yet).

The only solution I can think of is adding some retry logic. This logic probably belongs in etcd/wal or etcd/pkg/fileutil.

@aaronlehmann (Collaborator)

Another fix would be to patch upstream etcd/wal so that it only temporarily releases the lock for the rename on Windows. I'll try to open a PR for that.

@aaronlehmann (Collaborator)

@tonistiigi: Can you try this patch? aaronlehmann/etcd@5a39edf

@tonistiigi (Member Author)

@aaronlehmann Yes, that patch seems to fix it for me.

@tonistiigi (Member Author)

Fixed by #1448.
