Access Violation on x86 #12113
We are also seeing something similar with this stack trace:
I see quite a different managed stack trace in the dump you provided:
The unmanaged one is also quite different and ends in:
Check thread 35, which is what WinDBG stopped on when I was doing live debugging for this.
Weird, usually the debugger switches me to the correct thread after |
I am taking a look.
@ayende I assume you have used 64 bit windbg to open the dump, right? If you open it with the wow64 version of windbg, then the call stack shown is quite different. The frame with the unknown IP address is gone and there is no EH. While it is strange that the call stack differs this way, the one shown with Wow64 (which is the right one to use for debugging 32 bit code) matches your managed stack dump, as you can see below.
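For reference, a 64-bit WinDbg session can be switched to the 32-bit (WOW64) view of a dump with the wow64exts extension; a typical sequence looks like this (commands are standard WinDbg; the thread number matches the faulting thread mentioned later in this thread):

```
.load wow64exts   $$ load the WOW64 debugger extension
!wow64exts.sw     $$ switch between the x64 and x86 views of the target
~35s              $$ select thread 35, the thread the debugger stopped on
k                 $$ show the native call stack in the current (x86) view
```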
If this helps, it is fairly easy to reproduce. You can go to |
That's great, I was just about to ask if the code is open source.
I've collected the following additional details:
The call at offset 0x11D4 - address 0e995fc4 (return address at offset 0x11DA - 0e995fca) is passed an object reference in ECX which is extracted from [ebp-134h] right before the call. The call site is in a catch handler and the [ebp-134h] is initialized before the respective |
@ayende I've tried to repro the issue, but the commit doesn't exist (I've found another one that was the result of merging that commit in), so that's what I've checked out: ravendb/ravendb#8784. The dotnet run fails to build with:
Can you try |
Somehow the 3.0 SDK was failing to build it even though it should be able to target older SDKs. Now I was able to build it successfully, but running it fails:
Sorry about that. Forgot that you need it.
Please note that I'm running this on CoreCLR 2.1.8, not 3.0. We haven't verified this issue on other versions of the runtime.
I know. But it should be possible to build using 3.0 targeting older versions - it should just honor the versions specified in the project files.
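If the 3.0 SDK keeps getting picked up anyway, one common workaround is to pin the SDK with a global.json at the repo root. A sketch, assuming a 2.1-series SDK is installed (the exact version number below is an assumption, not taken from this thread):

```shell
# Hypothetical: pin the repo to a 2.1-series SDK so `dotnet build`/`dotnet run`
# do not resolve to an installed 3.0 preview SDK. Substitute a 2.1 SDK version
# that is actually installed on the machine.
cat > global.json <<'EOF'
{
  "sdk": {
    "version": "2.1.503"
  }
}
EOF
```

`dotnet` looks for global.json in the current directory and its parents, so the file only needs to sit at or above the project being built.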
Ok, I was able to start the test now, but it is getting one exception at the beginning and then some timeout exceptions:
Are you running this inside a debugger, or on some really slow I/O?
No, without a debugger on my main dev box (Intel XEON E5530, 8 cores, 24GB RAM). Only the disk is not an SSD. Let me try to run it from an SSD drive.
Oh, I've found that I was running some heavy tests in the background, I've forgotten about it. Now it doesn't timeout anymore. I am sorry for the confusion.
This test spins up 5 servers internally, which may hit the disk. If you have high disk activity and a non-SSD disk, that can certainly explain it. Great that this works now. On my end, in 64 bits, it just works. In 32 bits, it fails after about 50 runs.
Hmm, it seems I was just lucky before. That test run crashed after 4 iterations with a stack overflow, though. An attempt to re-run the test got the timeout again. So it seems I am really at an edge. Is there a way to increase that timeout? I would like to run it with GC stress enabled to trigger the issue more reliably, and that will slow everything down a lot.
Try this:
By the way, nothing in this test should cause a StackOverflow. I also ran into that error (but couldn't reproduce it in WinDBG so can't pull a stack trace). I'm assuming that this is another instance of: https://github.com/dotnet/coreclr/issues/22597
I have forgotten to ask - are you hitting this issue with both Debug and Release configurations?
If you are throwing exceptions in (reasonably deep) async it could also be dotnet/roslyn#26567
My impression is that the issue behind #12038 will only cause spurious AVs in a short window just after calling |
@janvorli I'm sorry, I forgot to mention that. I'm hitting that in
@benaadams We do have a lot of async calls, yes.
@AndyAyersMS I'm not sure what the root cause is for this, but wouldn't calling |
VirtualAlloc'd memory and the stack are in different parts of the virtual address space. A stack overflow is usually triggered by accessing a reserved guard page at the end of the stack. So, no, it's not likely they'd interfere with one another.
@ayende with COMPlus_GCStress=4, I've hit an interesting exception that is probably unrelated to the issue we are hunting, but that might indicate a bug in your code. So I am sharing it in case you'd like to double check that:
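For context, GC stress is enabled through CoreCLR's COMPlus_* environment variables, read at process startup. A typical setup for this kind of hunt might look like the following (mode 4 forces a GC at every GC-safe point in JITted code, which is exactly what shakes out GC-info holes like the one under discussion; the stress-log variables are optional but make a later dump much more useful):

```shell
# CoreCLR GC stress configuration (COMPlus_* variables are read at startup).
export COMPlus_GCStress=4          # force a GC at every JIT GC-safe point
export COMPlus_StressLog=1         # keep an in-memory stress log for later analysis
export COMPlus_StressLogSize=0x1000000
# dotnet run                       # then launch the repro under stress
```

Expect the run to be dramatically slower than normal; that is the point of the exercise.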
I keep getting the stack overflow even in Release builds. So I've tried to run it under WinDbg and I can see it getting a constant stream of exceptions (I guess at least 10 per second or even more):
Ok, VerifyHeap has shown there is quite large-scale heap corruption. Also, from the stress log, I can see that the last GC scan of the current thread was executed when Voron.Data.Fixed.FixedSizeTree.AddEmbeddedEntry was not on the stack. So that rules out a GC hole in its GC info. We can get some more detail on that by running the app with env var |
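For reference, the two checks mentioned above correspond to standard SOS commands in WinDbg (a sketch; the output filename is arbitrary):

```
.loadby sos coreclr   $$ load SOS from next to the loaded CoreCLR module
!VerifyHeap           $$ walk the GC heap and report corrupted objects
!DumpLog stress.log   $$ write the in-memory stress log out to a file
```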
My main concern here is that this is running on Win 2019 server, and it crashes consistently. |
My understanding is that #12038 will result in crashes but not heap corruption. |
I'm evaluating a candidate fix now. It is not x86 specific, and while I haven't looked in depth yet, I think the issue it addresses would apply to all architectures... will update when I know more. |
The issue is x86 windows specific. We handle that differently on other architectures. The reason is that on other architectures / platforms, when we are running in the filter, we have the try block frame on the call stack too, and so we walk it and report the locals from there. Thus we don't need to report them from the filter frame. It is not the case on x86 Windows due to the way SEH works.
When a filter is finished executing, control can logically pass to the associated handler, any enclosing handler or filter, or any finally or fault handler nested within the associated try. This is a consequence of two-pass EH. The jit was not propagating liveness from the nested handlers, which led to a live object being collected inadvertently. This change updates `fgGetHandlerLiveVars` to find the nested handlers and merge their live-in into the filter block live sets. Because these implicit EH flow edges can create cycles in the liveness dataflow equations, the jit will also now always iterate liveness when it sees there is exception flow, to ensure liveness reaches the appropriate fixed point. Added test case. Closes #22820.
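A minimal sketch of the EH shape the commit message describes, in C# (names and structure are hypothetical; this is not the RavenDB code). The filter runs during the first pass of two-pass SEH, while the try's frame is still on the stack; the nested finally still has to run afterwards, so anything it uses must be reported as live from the filter's GC info on x86:

```csharp
using System;

class FilterLivenessSketch
{
    // Hypothetical repro shape. During the first (filter) pass of two-pass
    // SEH, Filter() runs before the nested finally does. On x86 the filter's
    // GC info must therefore report everything that is live-in to handlers
    // nested in the try - here `obj` - or a GC occurring inside the filter
    // (e.g. under COMPlus_GCStress) can collect `obj` before the second pass
    // runs the finally.
    static void Repro()
    {
        object obj = new object();
        try
        {
            try
            {
                ThrowSomething();
            }
            finally
            {
                GC.KeepAlive(obj);           // obj must survive until here
            }
        }
        catch (Exception e) when (Filter(e)) // filter may trigger a GC
        {
        }
    }

    static bool Filter(Exception e) => true;               // may allocate
    static void ThrowSomething() => throw new InvalidOperationException();
}
```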
Yes, I see. From the jit's internal standpoint, the liveness computation was still wrong for filters on other architectures, but fixing that didn't impact jit codegen or gc info. PR for the filter liveness fix is up: dotnet/coreclr#23044.
@janvorli @ayende @arekpalinski would be great if you could verify the fix if possible...
@AndyAyersMS I will. |
@AndyAyersMS I've tried the same (CoreCLR 2.1.8 with dotnet/coreclr#23044) on x86 and ran the original repro reported by @ayende here - no failure. Great! @janvorli
@arekpalinski can you please try with |
@janvorli We have a lead that the heap corruption might be caused by a change in our code. We're investigating that.
The x86 issue seems to be fixed. I'm going to re-open this until we've got more clarity on what is happening for x64.
We are pretty sure that the x64 stuff is our fault and not related to this issue. |
We can confirm that we no longer experience the issue on x64. @janvorli Thanks for help in narrowing it down. |
Thanks. Now keeping this open to track the (proposed) porting of this fix to 2.1. |
Port of dotnet#23044 to release/2.1.
@AndyAyersMS Should the milestone be changed to 2.1/2.2? |
Makes sense, yes. |
@AndyAyersMS @BruceForstall Fixed by dotnet/coreclr#23138? |
Yes. Also merged to 2.2 via dotnet/coreclr#23256. |
We have a scenario in which we are getting an access violation exception during a particular load in our system.
I have a full dump of the process when the crash happened, available here:
https://drive.google.com/file/d/11oqZaegxKcoNT8Xj1u9YDIcH7LcmMBsW/view?usp=sharing
The scenario we have is a few servers running and communicating with one another on the same process.
This is part of our test setup. We recently started seeing hard failures, such as this one:
(d2c.3390): Access violation - code c0000005 (first chance)
The event log reports:
This machine has the following hotfixes applied:
The actual stack we are seeing is always something similar to:
The managed stack, FWIW, is:
We are using unsafe code, but we are pretty sure that we aren't corrupting the heap in any manner (lots of tests cover that) and if we were, I would expect to see the failure in different locations.
From trying to figure out what is going on, a few really strange things seem to be happening here:
Here is the actual failure:
And the full register usage is:
As you can see, the esp register has a non-null value, but checking the memory location with the offset provided to the instruction shows just zeros.
While troubleshooting this, we found a NullReferenceException in our code. We fixed it, and that made the problem go away. We suspect that this is some issue related to error handling inside the CoreCLR during JIT generation.
We have run into a different issue with KB4487017 (See: https://github.com/dotnet/coreclr/issues/22597), but we are reproducing this on different versions of Windows and without the KB in question.
We aren't able to reproduce this issue in 64 bits.