Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XUnitLogChecker causes GCStress test failures #82143

Closed
BruceForstall opened this issue Feb 15, 2023 · 9 comments · Fixed by #82212
Closed

XUnitLogChecker causes GCStress test failures #82143

BruceForstall opened this issue Feb 15, 2023 · 9 comments · Fixed by #82212
Assignees
Milestone

Comments

@BruceForstall
Copy link
Member

BruceForstall commented Feb 15, 2023

https://dev.azure.com/dnceng-public/public/_build/results?buildId=171904&view=ms.vss-test-web.build-test-results-tab&runId=3436659&resultId=112296&paneView=debug

It appears that a test that is excluded from GCStress causes the new XUnitLogChecker (#80751) to report a test failure.

e.g.,

+ JIT/HardwareIntrinsics/HardwareIntrinsics_ro/HardwareIntrinsics_ro.sh
SKIPPING EXECUTION BECAUSE DOTNET_GCStress IS SET
+ dotnet /root/helix/work/correlation/XUnitLogChecker/XUnitLogChecker.dll JIT/HardwareIntrinsics/HardwareIntrinsics_ro HardwareIntrinsics_ro
[XUnitLogChecker]: No logs were found. Something went very wrong with this item...
[XUnitLogChecker]: Expected log name: 'HardwareIntrinsics_ro.tempLog.xml'
+ export _commandExitCode=254

@ivdiazsa @trylek @hoyosjs

Thus, all the HardwareIntrinsics tests are marked as failing, for example.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Feb 15, 2023
@ghost
Copy link

ghost commented Feb 15, 2023

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Issue Details

https://dev.azure.com/dnceng-public/public/_build/results?buildId=171904&view=ms.vss-test-web.build-test-results-tab&runId=3436659&resultId=112296&paneView=debug

It appears that a test that is excluded from GCStress causes the new XUnitLogChecker (#80751) to report a test failure.

e.g.,

+ JIT/HardwareIntrinsics/HardwareIntrinsics_ro/HardwareIntrinsics_ro.sh
SKIPPING EXECUTION BECAUSE DOTNET_GCStress IS SET
+ dotnet /root/helix/work/correlation/XUnitLogChecker/XUnitLogChecker.dll JIT/HardwareIntrinsics/HardwareIntrinsics_ro HardwareIntrinsics_ro
[XUnitLogChecker]: No logs were found. Something went very wrong with this item...
[XUnitLogChecker]: Expected log name: 'HardwareIntrinsics_ro.tempLog.xml'
+ export _commandExitCode=254

@ivdiazsa @trylek @hoyosjs

Author: BruceForstall
Assignees: ivdiazsa
Labels:

area-Infrastructure-coreclr, blocking-clean-ci-optional

Milestone: -

@jkoritzinsky
Copy link
Member

I guess this means that we're sending work items that don't do anything in GCStress? We should definitely add some condition to not construct no-op work items, but getting the lane clean is more important.

@ivdiazsa
Copy link
Member

So, if I'm understanding correctly, there are certain conditions that make work items be skipped altogether, like DOTNET_GCStress being set for the Hardware Intrinsics ones. Is there a way to know this beforehand, so that we might avoid sending that work item to Helix altogether like Jeremy said?

@trylek
Copy link
Member

trylek commented Feb 15, 2023

I think you've nailed the problem exactly Ivan, at the time of generating Helix work items we don't yet know that the runtime cmd / sh script will filter all tests out in GC stress mode, while I agree it's somewhat wasteful I suspect it's a very exceptional case and my initial thinking is that we should just fix the tooling to tolerate it.

@ivdiazsa
Copy link
Member

ivdiazsa commented Feb 15, 2023

I think you've nailed the problem exactly Ivan, at the time of generating Helix work items we don't yet know that the runtime cmd / sh script will filter all tests out in GC stress mode, while I agree it's somewhat wasteful I suspect it's a very exceptional case and my initial thinking is that we should just fix the tooling to tolerate it.

I see. I added that check in hopes of also catching items that crashed badly enough that we ended up with nothing to work with. But if this can also happen on purpose, like in this example, I can remove it altogether and have the checker exit successfully in these cases as well.

@trylek
Copy link
Member

trylek commented Feb 15, 2023

Could we perhaps emit something like an empty canary item into the xml at startup? I agree with you that it would be unfortunate to conflate an early crash in the merged wrapper with this situation.

@ivdiazsa
Copy link
Member

That could be a good suggestion Tomas. But could you explain in more detail what you mean by "an empty canary item into the xml at startup"?

@trylek
Copy link
Member

trylek commented Feb 15, 2023

Well, if I recall correctly, the xml is basically a header followed by a list of entries corresponding to the individual unit tests. I don't remember how exactly your xml validation logic works but I guess that putting something like an empty test entry at the top might let us communicate "OK, I, the merged wrapper, have successfully launched, emitted the xml file header and the initial entry and I'm now launching the first test so consider this xml valid despite the fact that it may contain no entries if all tests cases get filtered out. I am still somewhat unsure how that could happen though - it would mean that a particular test wrapper runs all test cases out-of-proc and each of them is marked as CGStressIncompatible, that sounds somewhat far-fetched. I'm right now somewhat swamped with other bugs but I can try to take a closer look tomorrow to see what is the exact situation where this is happening.

@ivdiazsa
Copy link
Member

I think that could work but would require more effort. Here, the fixer is failing due to there not being any log at all. This is because the sh script is skipping everything entirely because of GCStress being there, which means the logic that writes the XML logs is not getting a chance to run in the first place. We would have to add some more logic that generates the empty log before this happens, and then make it collaborate with the usual log writer.

I think I can just remove this additional check in the fixer for now, just to keep our CI dirty for as little as possible, and then we can discuss more in-depth your proposed approach and also get @jkoritzinsky's input on the matter :)

@ivdiazsa ivdiazsa added this to the 8.0.0 milestone Feb 15, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Feb 15, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 15, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 16, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants