Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a race condition in Bazel's Windows process management. #6860

Closed
wants to merge 26 commits into from
Closed

Fix a race condition in Bazel's Windows process management. #6860

wants to merge 26 commits into from

Conversation

philwo
Copy link
Member

@philwo philwo commented Dec 7, 2018

(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.)

While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit.

This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI:

https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303

"Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied."

This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing.

Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743

@meteorcloudy
Copy link
Member

Can we add a test to prevent #6853?
Maybe based on https://github.com/laszlocsomor/projects/tree/04e4749626dd76b04227e30739050643aefba813/bazel/sleepy_test? ;)

Copy link
Contributor

@laszlocsomor laszlocsomor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the incremental commits! It was much easier to review this PR in small pieces than it would've been as a whole.

@philwo
Copy link
Member Author

philwo commented Dec 7, 2018

Thanks so much for the good comments :) I've addressed them.

@philwo
Copy link
Member Author

philwo commented Dec 7, 2018

I will add a test as requested by @meteorcloudy, too.

@jin jin added the area-Windows Windows-specific issues and feature requests label Dec 10, 2018
@laszlocsomor
Copy link
Contributor

After discussing with @philwo in person: let's address the stylistic comments in a follow-up PR and only focus on the technical one.

This allows us to safely close the process handle early, while still allowing Bazel to ask for the exit code later.
If we don't have job objects, we terminate the process itself, but this could fail in case the process has already exited itself at this point.
Check for this condition and also wait for the termination to succeed to guarantee that the exit code can be safely queried afterwards.
While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit.

This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI:

https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303

"Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied."

This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing.

Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743
@philwo
Copy link
Member Author

philwo commented Jan 15, 2019

I noticed that we already have the sleepy test in our tests: test_test_timeout in bazel_test_test.sh.

https://source.bazel.build/bazel/+/fbfd2c0b60a19f56420b5c31ed7a109238683c7c:src/test/shell/bazel/bazel_test_test.sh;l=225

@bazel-io bazel-io closed this in 4473bb1 Jan 15, 2019
@philwo philwo deleted the windows-race branch January 16, 2019 09:21
@meteorcloudy
Copy link
Member

Should we cherry-pick the fix into 0.22?

@philwo
Copy link
Member Author

philwo commented Jan 18, 2019

Should we cherry-pick the fix into 0.22?

I think it's too risky given the size of the change and also it's technically not a regression... :|

@meteorcloudy
Copy link
Member

OK, the good thing is we don't see the Windows flaky error as often as before with 0.21.0. So I'm fine waiting for 0.23

aehlig pushed a commit that referenced this pull request Jan 18, 2019
(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.)

While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit.

This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI:

https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303

"Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied."

This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing.

Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743

Closes #6860.

PiperOrigin-RevId: 229371044
aehlig pushed a commit that referenced this pull request Jan 23, 2019
(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.)

While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit.

This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI:

https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303

"Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied."

This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing.

Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743

Closes #6860.

PiperOrigin-RevId: 229371044
aehlig pushed a commit that referenced this pull request Jan 23, 2019
(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.)

While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit.

This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI:

https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303

"Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied."

This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing.

Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743

Closes #6860.

PiperOrigin-RevId: 229371044
weixiao-huang pushed a commit to weixiao-huang/bazel that referenced this pull request Jan 31, 2019
(Second version of this PR that fixes the issues from the last time and refactors the code a bit to make it easier to understand and maintain.)

While we did correctly terminate the job object of a process when calling Subprocess.destroy and we did correctly wait for the process itself to exit in Subprocess.waitFor, we didn't wait for its children to exit.

This is usually not a big problem, except when they keep open some file handles to files that Bazel immediately tris to delete after killing / waiting for the main process, in which case the files might still be in use. This is the reason why we're seeing errors like this on CI:

https://buildkite.com/bazel/bazel-bazel/builds/5774#27afa2c9-9ff1-4224-9df7-ba0253e7e317/193-303

"Caught I/O exception: Cannot delete path (something below $TEST_TMPDIR): Access is denied."

This CL modifies the behavior to explicitly terminate the job object in Subprocess.waitFor when the main process has exited and then wait for all transitive subprocesses to exit, before continuing.

Due to some interesting semantics, one cannot simply use WaitForSingleObject on the Job handle, instead one has to use an IoCompletionPort that gets a signal when the last job in the group has exited: https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743

Closes bazelbuild#6860.

PiperOrigin-RevId: 229371044
@philwo philwo added the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Jun 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-Windows Windows-specific issues and feature requests cla: yes team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants