-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os: remove 5ms sleep on Windows in (*Process).Wait #25965
Comments
If you do not delete executable file immediately after its process exits, then you could remove that 5ms sleep, and nothing will change for you. So we really need that extra sleep mainly to make cmd/go work properly, because cmd/go runs programs and deletes them immediately. Perhaps we could move that 5ms sleep from os package and into cmd/go code. Alex |
It should be possible to delete a file after it finishes executing, for any program, not just cmd/go. This bug is about figuring out what synchronization we're missing. |
I understand that. I was suggesting what to do in case we cannot find such synchronization. Alex |
Is the sleep needed on every process we want to wait on?
We might want to use a different wait method on windows? |
When I tried to remove the sleep in https://go-review.googlesource.com/c/go/+/84175 I was doing it mostly for debugging Windows XP problems (#2866 and #17245), and it apparently didn't work. But we don't support Windows XP anymore, so maybe we can remove it now? But I forget everything about Windows I had learned while researching that patch. |
I've seen a lot of code that waits on a process handle, and none of it sleeps. It would be a major bug if you couldn't trust a process handle to be signaled at the right time, and would surely be fixed. The sleep must be masking a problem somewhere else. I'm all for removing the sleep and seeing what shakes loose. |
SGTM Alex |
Change https://golang.org/cl/145221 mentions this issue: |
CL 145221 have broken windows-2008 builder - latest snapshot of https://build.golang.org/ Even one failure on windows-2012 https://build.golang.org/log/d37c2bad6c49d20fe4bec6f1a8d380e583ea79e8 What should we do now? Alex |
Haven't we seen those errors in the past, though? I guess removing the 5ms sleep just makes them more likely? Is there some other handle we're not closing quickly? @jordanrh1, could you audit some of these code paths and make sure we're doing things right? (clearly we aren't) |
I'm not seeing anything obviously wrong with the code, but these are the things I'm suspecting:
To rule out (1), I want to see if I can repro the issue with a simple C program. If it is (1), then a Sleep may be the best solution, especially if Go's contract is that all files used by the process are released by the time Wait() returns. |
Yes, these errors happens more often. Maybe it was happening on Windows XP is even more often.
I think that once process handle is signaled - syscall.WaitForSingleObject returns - the process is exited. So opened handles could not matter here. But I could be wrong about that.
I do not see how to write similar C program for (1). The program that fails is cmd/go tests - how do you write C program that imitate cmd/go test? I don't see how it could be related to (2). Like I said above, we wait for syscall.WaitForSingleObject on process handle to return, so whatever handles process opens, they should be all closed now. And when new process starts, we use syscall.DuplicateHandle to create stdin, stdout and stderr handles for child process and close the handles immediately - see syscall.StartProcess for details. I am not sure what to say about (3) and (4). Should I revert CL 145221 in the meantime? Alex |
Sure? Super sad, but we don't have good short-term leads. If you do, please reopen this bug and reference both this bug in the commit, and #23171. /cc @aclements |
How about creating an executable file (probably by just making a copy of an existing executable), run it, Is there any way to wait for locks on a file to be released? We can't just do that unconditionally here (there could be other copies of the binary running, too), but maybe we could make remove willing to block for a little while if the file is locked before failing? |
Yes, what @aclements suggested. I should mention that I don't have any cycles to test this until the 19th. |
Change https://golang.org/cl/148957 mentions this issue: |
https://go-review.googlesource.com/c/go/+/148957
I will try to do what you have asked. I would not be surprised, if the system configuration or OS version will also matter here - imagine, if you have an antivirus program eager to catch executable with virus in them, or another "helpful" service.
I am not sure what locks you are talking about. Please, explain more. All errors, we see here, happen during cmd/go test. Perhaps, the exe in questions starts itself as a child, and then parent process exists, and we notice parent exit, but do not wait for the child? But all examples of faults are simple run of a program - they do not look like starting any child processes. Alex |
This reverts CL 145221 (commit 5c35973) Reason for revert: breaks the build occasionally. Updates #23171 Updates #25965 Change-Id: Ie1e3c76ab9bcd8d28b6118440b5f80c76f9b1852 Reviewed-on: https://go-review.googlesource.com/c/148957 Run-TryBot: Alex Brainman <alex.brainman@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
I created github.com/alexbrainman/winapi/tree/master/issue25965.TestIssue25965 The test runs required exercise for both Go version (only syscall package is used) and C version. The test fails immediately on my Windows 7 computer. But it was much harder to make it fail on my Windows 10 computer - I managed to make it fail after I added a loop of 100 iterations around the test - so be warned that the test might take a while to run. Anyway my Windows 10 output looks like this:
Let me know, if you want me to do anything else. Alex |
(CC @bufflig: this is currently one of the larger sources of test flakes on the Windows builders.) |
Change https://golang.org/cl/369017 mentions this issue: |
Change https://golang.org/cl/371296 mentions this issue: |
This works around what appears to be either a kernel bug or a Go runtime or syscall bug affecting certain Windows versions (possibly all pre-2016?). The retry loop is a simplified version of the one used in cmd/go/internal/robustio. We use the same 2-second arbitrary timeout as was used in that package, since it seems to be reliable in practice on the affected builders. (If it proves to be too short, we can lengthen it, within reason, in a followup CL.) Since this puts a higher-level workaround in place, we can also revert the lower-level workaround added to a specific test in CL 345670. This addresses the specific occurrences of the bug for users of (*testing.T).TempDir, but does not fix the underlying bug for Go users outside the "testing" package (which remains open as #25965). Fixes #50051 Updates #48012 Updates #25965 Change-Id: I35be7125f32f05c8350787f5ca9a22974b8d0770 Reviewed-on: https://go-review.googlesource.com/c/go/+/371296 Trust: Bryan Mills <bcmills@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Patrik Nyblom <pnyb@google.com> Trust: Patrik Nyblom <pnyb@google.com> Run-TryBot: Patrik Nyblom <pnyb@google.com>
Change https://go.dev/cl/423194 mentions this issue: |
The HTTP/1 server deletes multipart form tempfiles after ServeHTTP returns, but the HTTP/2 server does not. Add a test to verify cleanup happens in both cases, temporarily disabled for the HTTP/2 path. For #20253 Updates #25965 Change-Id: Ib753f2761fe73b29321d9d4337dbb5090fd193c2 Reviewed-on: https://go-review.googlesource.com/c/go/+/423194 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Damien Neil <dneil@google.com> Reviewed-by: Robert Griesemer <gri@google.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Change https://go.dev/cl/463840 mentions this issue: |
I've given another shot with CL 463840. It's novel approach is that it calls |
The contract of WaitForSingleObject is to wait until the process
is complete. It is NOT to wait until the executable file has been closed
(is this right? -- I'm no Windows expert, but it seems to be the case). But
the expectation of WaitForSingleObject *is* that the executable will be
closed soon. Not waiting for file closure is probably good, since the vast
majority of external process invocations do not require deletion of the
executable when finished -- that is done only by a few tools that
build-run-delete. Why should invocations of external processes that don't
require deletion of their executables always wait a ms or two or three
longer? Perhaps it is the responsibility of build-run-delete tools to
manage that fact that the executable might not already be deleted by the OS
when the process completes.
Dealing with that is pretty simple. On process
completion (when WaitForSingleObject returns) just retry closing a few
times:
WaitForSingleObject()
i := numberOfRetries
for ; i > 0; i-- {
if err := attemptToDeleteExecutable(); err != nil {
waitAShortTime()
}
}
if i == 0 {
// Indicate error that executable could not be deleted
}
I have used this technique in other programming languages for exactly this
issue. Instead of a fixed delay, it waits only as long as necessary. And
it's only used for those rather rare cases where the executable has to be
deleted after the process completes. The downside? Code like this has to be
put into a few tools (e.g. go run). This issue exists only in Windows,
since in *nix it's OK to "remove" an open file -- actual deletion occurs
automatically when the file is closed.
I suggest:
-- Go should return from Wait as soon as WaitForSingleObject returns.
-- Tools that have to delete executables when their processes complete
should incorporate retry loops similar to the example above.
…On Sat, Jan 28, 2023 at 11:09 AM Quim Muntal ***@***.***> wrote:
I've given another shot with CL 463840. It's novel approach is that it
calls WaitForSingleObject not on the handle itself but on a handle with
SYNCHRONIZE rights. Well, this thread is quite long, so it might also
been tried before, but if someone did, I haven't seen it.
—
Reply to this email directly, view it on GitHub
<#25965 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABLA2AMIK3EZ4E6HXSLIXX3WUVVF5ANCNFSM4FFZIIIQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
The problem with that philosophy is that many programs don't know whether the data they are deleting may include recently-executed programs. That forces either a layering violation (the use-agnostic tool needing to know about executables) or additional latency (the use-agnostic tool performing spurious retries). This is not a theoretical problem: the Go |
That also raises the problem of: how long do you retry for? If you attempt only a small number of retries, you'll have false-positive failures; but if you attempt a larger number of retries, then the latency in the cases where the retries are spurious gets worse and worse. For a project like Go that aims to support a wide variety of hardware and platforms, there isn't an obvious setting that addresses both of those constraints. |
Replying to myself with a little correction to the example retry loop for
deleting a just-completed process's executable file. It should be:
cmd := exec.Command(...)
cmd.Start()
...
cmd.Wait()
i := numberOfRetries
for ; i > 0; i-- {
if err := attemptToDeleteExecutable(); err != nil {
waitAShortTime()
}
}
if i == 0 {
// Indicate error that executable could not be deleted
}
This was supposed to be Go application code. Previously the first line of "
WaitForSingleObject()" would not be typically be part of an application.
…On Tue, Jan 31, 2023 at 11:12 AM Bob Alexander ***@***.***> wrote:
The contract of WaitForSingleObject is to wait until the process
is complete. It is NOT to wait until the executable file has been closed
(is this right? -- I'm no Windows expert, but it seems to be the case). But
the expectation of WaitForSingleObject *is* that the executable will be
closed soon. Not waiting for file closure is probably good, since the vast
majority of external process invocations do not require deletion of the
executable when finished -- that is done only by a few tools that
build-run-delete. Why should invocations of external processes that don't
require deletion of their executables always wait a ms or two or three
longer? Perhaps it is the responsibility of build-run-delete tools to
manage that fact that the executable might not already be deleted by the OS
when the process completes.
Dealing with that is pretty simple. On process
completion (when WaitForSingleObject returns) just retry closing a few
times:
WaitForSingleObject()
i := numberOfRetries
for ; i > 0; i-- {
if err := attemptToDeleteExecutable(); err != nil {
waitAShortTime()
}
}
if i == 0 {
// Indicate error that executable could not be deleted
}
I have used this technique in other programming languages for exactly this
issue. Instead of a fixed delay, it waits only as long as necessary. And
it's only used for those rather rare cases where the executable has to be
deleted after the process completes. The downside? Code like this has to be
put into a few tools (e.g. go run). This issue exists only in Windows,
since in *nix it's OK to "remove" an open file -- actual deletion occurs
automatically when the file is closed.
I suggest:
-- Go should return from Wait as soon as WaitForSingleObject returns.
-- Tools that have to delete executables when their processes complete
should incorporate retry loops similar to the example above.
On Sat, Jan 28, 2023 at 11:09 AM Quim Muntal ***@***.***>
wrote:
> I've given another shot with CL 463840. It's novel approach is that it
> calls WaitForSingleObject not on the handle itself but on a handle with
> SYNCHRONIZE rights. Well, this thread is quite long, so it might also
> been tried before, but if someone did, I haven't seen it.
>
> —
> Reply to this email directly, view it on GitHub
> <#25965 (comment)>, or
> unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ABLA2AMIK3EZ4E6HXSLIXX3WUVVF5ANCNFSM4FFZIIIQ>
> .
> You are receiving this because you commented.Message ID:
> ***@***.***>
>
|
I took another stab at this from another side. I'm pretty sure that all boils down to Windows not deleting a file when
This behavior is not documented, but it's what I could gather from exploring different sources, including the aforementioned dotnet and PowerShell issues. Going back to the 5ms sleep, it seems that deleting an executable file whose process has just been released makes the asynchronous deletion process take more than for normal files. If we add the fact that testing cleanup calls Luckily for us, the This is inline with @alexbrainman and @bradfitz findings while investigating this issue, where they found that it doesn't occur in Windows 10. What I'm doing now is trying to explain those findings so we don't just remove the 5ms seconds based on experimentation but also on a good reasoning. |
Change https://go.dev/cl/509335 mentions this issue: |
In our Windows implementation of os/Process.Wait, we always do a 5 millisecond sleep:
Sleeps in code are always gross. They're either too fast (and still flaky) or too slow (incurring extra delays on everybody when not needed).
This bug is about figuring out what we're doing wrong on Windows that made us need that sleep in the first place.
https://go-review.googlesource.com/c/go/+/84175 was one attempt to remove it, but the goal then was fixing a Windows XP test failure (for #17245), and it didn't fix the XP failure. We no longer support XP, though, so maybe that CL is okay now.
But maybe it's not correct.
There was also https://golang.org/cl/84896 from @johnsonj to use jobs to wait for process completion on windows, but it was "not working as expected on Server 2008" and @alexbrainman had concerns in #17245 (comment) (that whole thread is worth a read, even if it's mostly about debugging an XP issue, which might've been related to this bug.)
The text was updated successfully, but these errors were encountered: