
Spurious file-related failures on Windows runners #10483

Closed

tgross35 opened this issue Aug 24, 2024 · 22 comments
Labels: Area: Rust, bug report, investigate, OS: Windows

tgross35 commented Aug 24, 2024

Description

For the past few months, the rust-lang/rust project has had a lot of spurious failures on the Windows runners. These are typically either failure to open a file (mostly from link.exe) or failure to remove a file:

  • LINK : fatal error LNK1104: cannot open file ...
  • error: failed to remove file ..., Access is denied (os error 5)

Example run: https://github.com/rust-lang-ci/rust/actions/runs/10537107932/job/29198090275

Is it possible that something changed that would cause this? Even if not and this is a problem with our tooling, we could use assistance debugging.

Further context, links to failed jobs, and attempts to reproduce are at rust-lang/rust#127883. Almost every PR showing up in the mentions list is from one of these failures. These errors are similar to what was reported in #4086.

Cc @ChrisDenton and @ehuss who have been working to reproduce this.

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • Ubuntu 24.04
  • macOS 12
  • macOS 13
  • macOS 13 Arm64
  • macOS 14
  • macOS 14 Arm64
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Current runner version: '2.319.1'
Runner name: 'windows-2022-8core-32gb_4d2ba789d359'
Runner group name: 'Default Larger Runners'
Machine name: 'runner'
Operating System
  Microsoft Windows Server 2022
  10.0.20348
  Datacenter
Runner Image
  Image: windows-2022
  Version: 20240811.1.0
  Included Software: https://github.com/actions/runner-images/blob/win22/20240811.1/images/windows/Windows2022-Readme.md
  Image Release: https://github.com/actions/runner-images/releases/tag/win22%2F20240811.1

Is it a regression?

Yes, around 2024-06-27, although the exact start is unknown. It has seemingly gotten significantly worse in the past week or so; the affected job has had at least a 25% failure rate from this issue over the past couple of days (probably closer to 50%).

Expected behavior

Accessing or removing the files should succeed.

Actual behavior

The file operations are encountering spurious failures, as linked above.

Repro steps

No known consistent reproduction.

@vidyasagarnimmagaddi (Contributor)

Hi @tgross35, thank you for bringing this issue to us. We are looking into it and will update you after investigating.

tgross35 commented Aug 26, 2024

Thank you for the response. If you need to watch active jobs, there is always one running at https://github.com/rust-lang-ci/rust/actions (mostly on the auto branch). The x86_64-msvc-ext, dist-x86_64-msvc-alt, and dist-x86_64-msvc jobs all fail frequently, usually between 1 and 2 hours after the job starts; -ext seems to be the most common failure.

There are also ongoing experiments that run the jobs multiple times with different tweaks to see what fails, e.g. rust-lang/rust#129504 and rust-lang/rust#129522.

tgross35 commented Sep 3, 2024

@vidyasagarnimmagaddi is there something we could do to debug this better? Our failure rate is currently over 50% due to this issue.

Somebody was able to confirm that we encounter this issue even running CI on an older state of our repo (from before this problem was noticed), which does seem to indicate it is caused by a change to the runner environment rather than changes to our code.

ijunaidm commented Sep 3, 2024

@tgross35 - sure, we will update you shortly with a workaround or solution to the issue.

ehuss commented Sep 4, 2024

@ijunaidm Thanks! I'm one of the people working on this on the Rust side. Another data point: I've never been able to reproduce this on the windows-2022 runner. I've only been able to do it on the windows-2022-8core-32gb runner. I don't know if that is just because of performance (windows-2022 might be too slow to trip whatever race condition), or if it is fundamental to differences in the image. One thing I noticed is that windows-2022-8core-32gb uses the C: drive whereas windows-2022 uses the D: drive. I'm not sure if that is relevant at all.

@tgross35 (Author)

@ijunaidm are there any updates here, or are you able to help us debug in some way (e.g. provide a way to SSH into active runners)? We were forced to switch to the small runners, which seems to make this issue less prevalent (though still very common), but we need to move back to the large runners at some point.

ijunaidm commented Oct 4, 2024

@tgross35 - Sorry, I will update you shortly on this issue.

ijunaidm assigned subir0071 and unassigned ijunaidm on Oct 17, 2024
subir0071 added the investigate label on Oct 22, 2024
@subir0071 (Contributor)

Hi @tgross35 - I was going through the runner logs via the link provided and found that the workflow is attempting to delete miri.exe, which might still be busy when the attempt is made.

(screenshot of the runner log showing the failed attempt to delete miri.exe)

If possible, exclude the deletion step from the workflow, since GitHub provisions a new runner for every workflow run.
Otherwise, make the pipeline wait until miri.exe is no longer busy; a busy file is another possible cause of an access-denied error on deletion. A rough sketch of that kind of wait-and-retry follows.
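The following is only an illustrative sketch in Rust, not something taken from the rust-lang CI scripts; the path, retry count, and delays are hypothetical assumptions:

    use std::{fs, io, path::Path, thread, time::Duration};

    // Treat "Access is denied" (os error 5) as possibly transient, e.g. a
    // scanner or a not-yet-torn-down process still holding the file, and
    // retry the removal with exponential backoff before giving up.
    fn remove_with_retry(path: &Path) -> io::Result<()> {
        let mut delay = Duration::from_millis(100);
        for _ in 0..10 {
            match fs::remove_file(path) {
                Ok(()) => return Ok(()),
                Err(e) if e.raw_os_error() == Some(5) => {
                    thread::sleep(delay);
                    delay *= 2;
                }
                Err(e) => return Err(e),
            }
        }
        // Final attempt; propagate the error if the file is still held.
        fs::remove_file(path)
    }

    fn main() -> io::Result<()> {
        // Hypothetical path standing in for the file the workflow deletes.
        remove_with_retry(Path::new(r"build\stage2-tools-bin\miri.exe"))
    }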

tgross35 commented Oct 23, 2024

@subir0071 thank you for reaching out.

If possible, exclude the deletion step from the workflow, since GitHub provisions a new runner for every workflow run.

We create this file during the run and later need to delete it, so eliminating this step is not possible. Several people from rust-lang have already tried workarounds, including:

Further, this issue doesn't always take the same form; sometimes it is failure to remove a different file, sometimes it is failure to open files.

Otherwise, make the pipeline wait until miri.exe is no longer busy; a busy file is another possible cause of an access-denied error on deletion.

This is our main hypothesis for what is going on, but it doesn't seem to be us holding the file (meaning there may be some indexing, antivirus, or monitoring service added within the past few months that is causing the problem). We have exhausted every avenue we have for figuring out why miri.exe would be busy. The PRs linked above contain some of those attempts; see also e.g. rust-lang/rust#127883 (comment).
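One crude in-job probe (sketched here as an assumption; none of this exists in our CI) is an exclusive open of the file right before deleting it: share mode 0 makes Windows refuse the open with a sharing violation if any other process holds a handle, which at least confirms that something outside the build has the file open, even if it cannot say what.

    use std::fs::OpenOptions;
    use std::os::windows::fs::OpenOptionsExt;
    use std::path::Path;

    // Ask for an exclusive open (share mode 0). If any other process still
    // holds a handle to the file, Windows refuses with a sharing violation
    // (os error 32); success means nothing else has it open at that instant.
    fn probe_exclusive(path: &Path) {
        match OpenOptions::new().read(true).share_mode(0).open(path) {
            Ok(_) => println!("{}: no other open handles", path.display()),
            Err(e) if e.raw_os_error() == Some(32) => {
                println!("{}: held open by another process", path.display())
            }
            Err(e) => println!("{}: open failed: {e}", path.display()),
        }
    }

    fn main() {
        // Hypothetical path; in CI this would be the file that later fails to delete.
        probe_exclusive(Path::new(r"build\stage2-tools-bin\miri.exe"));
    }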

We cannot really do any further debugging here without GitHub's help. We need a way to do something like SSH into one of the runner images or create a kernel dump, or have somebody from GitHub do this for us while we run jobs. Is this possible?

@geekosaur

@jasagredo, I think this will look familiar to you.

@jasagredo

@geekosaur Not exactly; our issue was with permissions on read-only files that prevented Windows from deleting them. This one seems to be related to Windows holding the lock on an in-use file for too long.

@tgross35

SSH into one of the runner images

Maybe it is not very useful, but there is https://github.com/mxschmitt/action-tmate, which allows you to SSH into a runner, although I'm not sure how elevated the permissions are there, so you might not be able to dump protected stuff.

@subir0071 (Contributor)

Hi @tgross35 - Are you still facing this issue?

tgross35 commented Dec 9, 2024

We have workarounds in place to make this issue less common, but as far as we know it still exists.

@subir0071 (Contributor)

Thanks for your response, @tgross35.
Would it be possible to provide the URL of a failing job run that is less than five days old?

@tgross35 (Author)

@subir0071 I reverted our workarounds in rust-lang/rust#134150 and started a test job. Usually the failures happen 1-2 hours after the job starts; I'll retry it if it doesn't fail.

Here is a link to the running jobs: https://github.com/rust-lang-ci/rust/actions/runs/12266538500

@tgross35 (Author)

@subir0071 it did fail with the same "failed to remove file" error that we had been experiencing before: rust-lang/rust#134150 (comment)

@subir0071 (Contributor)

The job failed with the file lock issue again, so the error appears to have been reproduced.
(screenshot of the failed job log showing the lock error)

@subir0071 (Contributor)

Hi @tgross35, we have completed our investigation and found that none of the related pieces of infrastructure malfunctioned during the execution of the failed job.
Based on the logs we can conclude that the participating systems are working as expected.
There is nothing more we can investigate in this regard.

Hence, we would ask you to analyze the error on your side, as we can assure you the issue is not caused by the runner images.
Thanks.

@tgross35 (Author)

@subir0071 thank you for looking into this. We would love to analyze this more on our side, but I need to reiterate that this error has never been reproduced on anything other than the GitHub Windows runners. We have tried debugging this but have reached the limits of what we can do from within running jobs (Handles does not report anything), so we are stuck without help from GitHub.

Can you provide us with a way to SSH into runners so we can see what is using the files? Unfortunately #10483 (comment) does not have sufficient permissions.

@subir0071 (Contributor)

Hi @tgross35 - Unfortunately, it is not possible to SSH into the runner. However, as a workaround, please feel free to provision a similar runner in an Azure VM to analyze the workflow.
Here is the documentation for that.
Thanks.

@subir0071 (Contributor)

Closing this issue as we have completed our investigation and found no issues at our end.
The user needs to debug the workflow steps and determine the possible reasons for the failure.

A documentation link was provided to the user for provisioning a VM similar to the one used to execute the workload on a GitHub-hosted runner.
Thanks.

ChrisDenton commented Dec 19, 2024

Just for the sake of completeness, this is a summary of our investigation so far:

  • The error message on delete is consistent with the application still running while we try to delete it (a minimal local demonstration of that failure mode is sketched after this list).
  • However, Handles does not show the file as being open, and procmon shows the process exiting (and its file handle being closed) long before the call to delete.
  • We're experimenting with using a dev drive for builds, which so far appears to work around the issue.
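
For completeness, the first point can be demonstrated locally with a small Rust program (a minimal sketch, not a reproduction of the CI failure itself): it copies its own executable, runs the copy, and tries to delete the copy while it is still running, which fails with the same "Access is denied (os error 5)" seen in the CI logs.

    use std::{env, fs, process::Command, thread, time::Duration};

    fn main() -> std::io::Result<()> {
        // Child mode: sleep so the executable image stays mapped.
        if env::args().nth(1).as_deref() == Some("--sleep") {
            thread::sleep(Duration::from_secs(10));
            return Ok(());
        }

        // Copy this program to a scratch path and launch the copy.
        let victim = env::temp_dir().join("victim.exe");
        fs::copy(env::current_exe()?, &victim)?;
        let mut child = Command::new(&victim).arg("--sleep").spawn()?;

        // While the copy is running its image is mapped, so deleting it fails
        // with "Access is denied (os error 5)", matching the CI error message.
        match fs::remove_file(&victim) {
            Err(e) => println!("remove_file failed as expected: {e}"),
            Ok(()) => println!("unexpectedly succeeded"),
        }

        child.wait()?;
        fs::remove_file(&victim)?; // succeeds once the copy has exited
        Ok(())
    }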
