Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert failure: m_alignpad == 0 in libraries tests #70231

Closed
jkotas opened this issue Jun 4, 2022 · 28 comments · Fixed by #71113 or #72475
Closed

Assert failure: m_alignpad == 0 in libraries tests #70231

jkotas opened this issue Jun 4, 2022 · 28 comments · Fixed by #71113 or #72475
Assignees
Labels
area-GC-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Milestone

Comments

@jkotas
Copy link
Member

jkotas commented Jun 4, 2022

Hit in #70226

Dump and logs: https://dev.azure.com/dnceng/public/_build/results?buildId=1806089&view=ms.vss-test-web.build-test-results-tab&runId=48090090&resultId=197319&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Stacktrace:

coreclr!DbgAssertDialog+0x1af [D:\a\_work\1\s\src\coreclr\utilcode\debug.cpp @ 594] 
coreclr!ObjHeader::IllegalAlignPad+0x33 [D:\a\_work\1\s\src\coreclr\vm\syncblk.cpp @ 2952] 
coreclr!ObjHeader::GetBits+0x41 [D:\a\_work\1\s\src\coreclr\vm\syncblk.h @ 1545] 
coreclr!ObjHeader::Validate+0x50 [D:\a\_work\1\s\src\coreclr\vm\syncblk.cpp @ 2042] 
coreclr!Object::ValidateInner+0x493 [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 581] 
coreclr!Object::Validate+0xa1 [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 498] 
coreclr!OBJECTREF::operator->+0x21 [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 1242] 
coreclr!MarshalNative::GCHandleInternalAlloc+0x171 [D:\a\_work\1\s\src\coreclr\vm\marshalnative.cpp @ 492] 
System_Private_CoreLib!System.ReadOnlyMemory`1[[System.Byte, System.Private.CoreLib]].Pin()+0xffffffff`a4114a41
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 4, 2022
@jkotas
Copy link
Member Author

jkotas commented Jun 4, 2022

This is failing in the next-object validation.

m_alignpad is 0 in the dump. It means that it got cleared between the assert condition was checked and the crash dump was taken.

The most likely explanation is that there is a race condition between GCHeap::NextObj and the memory clearing for newly allocated objects in the GC that causes GCHeap::NextObj to return next object that was not cleared yet. @dotnet/gc Could you please take a look?

@ghost
Copy link

ghost commented Jun 4, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Hit in #70226

Dump and logs: https://dev.azure.com/dnceng/public/_build/results?buildId=1806089&view=ms.vss-test-web.build-test-results-tab&runId=48090090&resultId=197319&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Stacktrace:

coreclr!DbgAssertDialog+0x1af [D:\a\_work\1\s\src\coreclr\utilcode\debug.cpp @ 594] 
coreclr!ObjHeader::IllegalAlignPad+0x33 [D:\a\_work\1\s\src\coreclr\vm\syncblk.cpp @ 2952] 
coreclr!ObjHeader::GetBits+0x41 [D:\a\_work\1\s\src\coreclr\vm\syncblk.h @ 1545] 
coreclr!ObjHeader::Validate+0x50 [D:\a\_work\1\s\src\coreclr\vm\syncblk.cpp @ 2042] 
coreclr!Object::ValidateInner+0x493 [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 581] 
coreclr!Object::Validate+0xa1 [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 498] 
coreclr!OBJECTREF::operator->+0x21 [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 1242] 
coreclr!MarshalNative::GCHandleInternalAlloc+0x171 [D:\a\_work\1\s\src\coreclr\vm\marshalnative.cpp @ 492] 
System_Private_CoreLib!System.ReadOnlyMemory`1[[System.Byte, System.Private.CoreLib]].Pin()+0xffffffff`a4114a41
Author: jkotas
Assignees: -
Labels:

area-GC-coreclr, area-Interop-coreclr, untriaged

Milestone: -

@BruceForstall
Copy link
Member

Same assert here: #68511

@BruceForstall
Copy link
Member

@jkotas jkotas added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Jun 5, 2022
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Jun 6, 2022
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Jun 6, 2022
@mdh1418
Copy link
Member

mdh1418 commented Jun 6, 2022

@agocke
Copy link
Member

agocke commented Jun 20, 2022

This is still failing quite a lot

@Maoni0
Copy link
Member

Maoni0 commented Jun 21, 2022

these tests' logs all say "windows.10.amd64.open.rt". what is "open.rt"? I've been trying running the system.IO.pipes tests for a few hours with no repro.

@Maoni0
Copy link
Member

Maoni0 commented Jun 21, 2022

so both failures I looked at (System.IO.Pipes and System.Text.Json.Tests) were running on "windows.10.amd64.open.rt". @hoyosjs tells me this is a type of queue that's "Windows Server 2016-Datacenter running on 2 cores". that's a pretty old OS. I've been running these on win10 over night and haven't seen a repro. is it possible to check if we only see ever saw these failures on this particular type of queue?

@Maoni0
Copy link
Member

Maoni0 commented Jun 21, 2022

'System.Text.RegularExpressions.Unit.Tests' that failed also ran on that particular queue. I would imagine we certainly these same tests running on newer OSs.

@jkotas
Copy link
Member Author

jkotas commented Jun 21, 2022

We run of libraries tests with checked coreclr on Windows in windows.10.amd64.open.rt queue only. I do not see any other queues running this combination on Windows.

Here is a query you can use to find all libraries tests that failed with runtime asserts in last 10 days. (Some of these failed with a different assert.)

https://dataexplorer.azure.com/clusters/engsrvprod/databases/engineeringdata

let wi =
WorkItems
| join kind=leftsemi (Jobs | where Queued > ago (10d) ) on $left.JobName == $right.Name
| where ExitCode == -1073740286;
Files
| lookup kind=inner wi on $left.WorkItemName == $right.Name
| where FileName == "how-to-debug-dump.md"
| join WorkItems on $left.WorkItemName == $right.Name
| join Jobs on $left.JobName == $right.Name
| extend PhaseName = tostring(parse_json(Properties)["System.PhaseName"]),
Pipeline = tostring(parse_json(Properties).DefinitionName),
BuildId = tostring(parse_json(Properties).BuildId)
| where Pipeline !contains("jitstress")
| project Timestamp, QueueName, ExitCode, Uri, ConsoleUri, PhaseName, Pipeline, BuildId

Maoni0 added a commit that referenced this issue Jun 22, 2022
this is to fix #70231.

for regions we could run into this situation -

object is the last object before heap_segment_allocated (hs)
T0 calls NextObj, gets next obj which starts at heap_segment_allocated (hs)
T1 changes ephemeral_heap_segment to hs
T0 does these comparisons
        (nextobj >= heap_segment_allocated(hs) && hs != hp->ephemeral_heap_segment) ||
        (nextobj >= hp->alloc_allocated))
both still false because alloc_allocated hasn't been changed just yet (and the old alloc_allocated is larger than nextobj)

T0 validates next obj, concludes its m_alignpad is not 0, asserts

T1 forms an allocation context starting at heap_segment_allocated, clears memory so by the time the dump is taken, m_alignpad is already cleared (actually we clear it in a_fit_segment_end)

I'm fixing this by saving the ephemeral_heap_segment and alloc_allocated and bail if nextobj is not on the saved eph seg or if those 2 saved values are no long in sync.
@AaronRobinsonMSFT
Copy link
Member

Potentially still hitting this issue.

https://dev.azure.com/dnceng/public/_build/results?buildId=1858589&view=logs&jobId=bc70a89f-8fe2-5049-a7a5-2a5fb5d5e472

Console log: 'System.Text.Encoding.Tests' from job 747c1976-2e0a-4c21-808c-688692504295 workitem 3d7e3d3f-7501-42f3-b028-0eb4f802deb1 (windows.10.amd64.open.rt) executed on machine a000RKC

C:\h\w\9C510836\w\AACC0967\e>taskkill.exe /f /im corerun.exe 
ERROR: The process "corerun.exe" not found.

C:\h\w\9C510836\w\AACC0967\e>call RunTests.cmd --runtime-path C:\h\w\9C510836\p 
----- start Sat 07/02/2022 16:56:26.83 ===============  To repro directly: ===================================================== 
pushd C:\h\w\9C510836\w\AACC0967\e\
"C:\h\w\9C510836\p\dotnet.exe" exec --runtimeconfig System.Text.Encoding.Tests.runtimeconfig.json --depsfile System.Text.Encoding.Tests.deps.json xunit.console.dll System.Text.Encoding.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================

C:\h\w\9C510836\w\AACC0967\e>"C:\h\w\9C510836\p\dotnet.exe" exec --runtimeconfig System.Text.Encoding.Tests.runtimeconfig.json --depsfile System.Text.Encoding.Tests.deps.json xunit.console.dll System.Text.Encoding.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing  
  Discovering: System.Text.Encoding.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.Encoding.Tests (found 340 of 343 test cases)
  Starting:    System.Text.Encoding.Tests (parallel test collections = on, max threads = 4)

Assert failure(PID 4932 [0x00001344], Thread: 3996 [0x0f9c]): m_alignpad == 0

CORECLR! GetCLRRuntimeHost + 0x1CC930 (0x00007ffd`52b2ade0)
CORECLR! GetCLRRuntimeHost + 0x153BF3 (0x00007ffd`52ab20a3)
CORECLR! GetCLRRuntimeHost + 0x153721 (0x00007ffd`52ab1bd1)
CORECLR! GetCLRRuntimeHost + 0x17C071 (0x00007ffd`52ada521)
CORECLR! GetCLRRuntimeHost + 0x363CC9 (0x00007ffd`52cc2179)
<no module>! <no symbol> + 0x0 (0x00007ffc`f37e784d)
<no module>! <no symbol> + 0x0 (0x000001a1`ba02d0f0)
<no module>! <no symbol> + 0x0 (0x000001a1`bd00c020)
    File: D:\a\_work\1\s\src\coreclr\vm\syncblk.cpp Line: 2956
    Image: C:\h\w\9C510836\p\dotnet.exe

----- end Sat 07/02/2022 16:56:41.28 ----- exit code -1073740286 ----------------------------------------------------------
2022-07-02T16:56:41.895Z	INFO   	run.py	run(48)	main	Beginning reading of test results.
2022-07-02T16:56:41.895Z	INFO   	run.py	__init__(42)	read_results	Searching 'C:\h\w\9C510836\w\AACC0967\e' for test results files
2022-07-02T16:56:41.895Z	INFO   	run.py	__init__(42)	read_results	Searching 'C:\h\w\9C510836\w\AACC0967\uploads' for test results files
2022-07-02T16:56:41.895Z	WARNING	run.py	__init__(55)	read_results	No results file found in any of the following formats: xunit, junit, trx
2022-07-02T16:56:41.895Z	INFO   	run.py	packing_test_reporter(30)	report_results	Packing 0 test reports to 'C:\h\w\9C510836\w\AACC0967\e\__test_report.json'
2022-07-02T16:56:41.895Z	INFO   	run.py	packing_test_reporter(33)	report_results	Packed 1417 bytes
read file: C:\h\w\9C510836\p\debug-dump-template.md
writing output file: C:\h\w\9C510836\w\AACC0967\uploads\how-to-debug-dump.md
done writing debug dump information

@jkotas
Copy link
Member Author

jkotas commented Jul 3, 2022

I have looked at the dump. Yes, it is still the same issue.

@jkotas jkotas reopened this Jul 3, 2022
@MichalPetryka
Copy link
Contributor

I hit this in #71685.

@janvorli
Copy link
Member

The case that @jkotas has just closed as a dup happened on Linux arm64. I am in a process of attempting to repro it on Linux at the moment. Maybe I'll get more lucky with the repro there.

@Maoni0
Copy link
Member

Maoni0 commented Jul 14, 2022

so I just started looking at some dumps that @mrsharm gathered. from an initial look nothing seems unusual. it's not hitting what I fixed so either that issue was no longer reproing (which is good as it was definitely an issue) or it was never due to that reason. nextObj looks correct (as in, it is indeed the next obj from the object that's being validated) and is always in gen0 so just got allocated. in one case I see that it was at the end of an alloc context but in the other 2 cases it was in the middle of an alloc context. it's too bad that there's a stress log entry that actually doesn't get logged (even though it seems like its attention is since it calls LogSpewAlways, it does log this ASSERT:D:\a\_work\1\s\src\coreclr\vm\syncblk.cpp, line:2956 which is not useful...) but even if it was it wouldn't be logging the interesting thing anyway. so if we can't get much from the dumps we'd at least turn on the logging for this so we can get something potentially useful in the stresslog.

@Maoni0
Copy link
Member

Maoni0 commented Jul 14, 2022

some updates on this (and I will be OOF tomorrow) - @PeterSolMS continued the investigation and made a lot of progress on it - his theory is the nextObj was allocated from a free list item but it's escaping the demotion check (I didn't write the code in NextObj to begin with but my guess is the same as Peter that it's for escaping verifying when nextObj is allocated from the free list). and the syncblock and method table were cleared out of order. we have a potential fix. @janvorli could actually repro this so he will run with the fix tomorrow and see if our theory is correct.

@EgorBo
Copy link
Member

EgorBo commented Jul 17, 2022

@Maoni0 @janvorli A sort of "stable" repro on Windows-x64:

build.cmd Clr+Libs -c Release -rc Checked

cd src\libraries\System.Collections\tests

dotnet build -c Release

cd C:\prj\runtime\artifacts\bin\System.Collections.Tests\Release\net7.0

$env:DOTNET_TieredCompilation=1
$env:DOTNET_TieredPGO=1
$env:DOTNET_JitRandomGuardedDevirtualization=1
$env:DOTNET_ReadyToRun=0

for(;;){C:\prj\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\dotnet.exe exec --runtimeconfig System.Collections.Tests.runtimeconfig.json --depsfile System.Collections.Tests.deps.json xunit.console.dll System.Collections.Tests.dll  -xml testResults.xml -nologo -notrait category=OuterLoop -notrait category=failing}

^ eventually fails with:

Assert failure(PID 20008 [0x00004e28], Thread: 32408 [0x7e98]): m_alignpad == 0

CORECLR! ObjHeader::Validate + 0x50 (0x00007ffd`6a8de630)
CORECLR! Object::ValidateInner + 0x493 (0x00007ffd`6a865773)
CORECLR! Object::Validate + 0xA1 (0x00007ffd`6a8652a1)
CORECLR! OBJECTREF::operator-> + 0x21 (0x00007ffd`6a85e951)
CORECLR! JIT_ClassProfile32 + 0xC7 (0x00007ffd`6aa82707)
<no module>! <no symbol> + 0x0 (0x00007ffd`0deade75)
<no module>! <no symbol> + 0x0 (0x0000013c`a74005f0)
<no module>! <no symbol> + 0x0 (0x00000037`cdd7bd30)
<no module>! <no symbol> + 0x0 (0x00007ffd`0d441780)
<no module>! <no symbol> + 0x0 (0x0000013c`00000000

I also observe the behaviour @jkotas noticed that m_alignpad is quickly cleared, I even put multiple if (m_alignpad != 0) { checks there

@EgorBo
Copy link
Member

EgorBo commented Jul 17, 2022

note: doesn't repro when I change optimization level from -O2 to -O0 for cee_wks_core (Checked config)

@janvorli
Copy link
Member

@Maoni0 has a fix already. I also found a reasonably frequent repro last week - the readytorun\coreroot_determinism\coreroot_determinism coreclr test and run it with her fix over the weekend. Before the fix, it reproed about every 80 iterations. With the fix, 1000 iterations have passed without any repro.

@Maoni0
Copy link
Member

Maoni0 commented Jul 18, 2022

thanks so much @janvorli for verifying. @mrsharm will be submitting a PR.

@Maoni0
Copy link
Member

Maoni0 commented Jul 18, 2022

and thanks @EgorBo for finding another test that repros consistently.

@mrsharm
Copy link
Member

mrsharm commented Jul 19, 2022

Thank you very much, @EgorBo! Based on your list of instructions, I was able to fairly quickly repro the assertion failure on my Windows desktop and have confirmed that after:

  1. Removing the demotion check:
//  if ((g == 0) && hp->settings.demotion)
//        return NULL;//could be racing with another core allocating
  1. Adding a null check on the GCSafeMethodTable:
   Object * nextObj = GCHeapUtilities::GetGCHeap ()->NextObj (this);
            if ((nextObj != NULL) &&
                (nextObj->GetGCSafeMethodTable() != nullptr) &&
                (nextObj->GetGCSafeMethodTable() != g_pFreeObjectMethodTable))

the assertion failure doesn't occur based on running the tests overnight.

The main question I had was: how do I find out which test did the assertion failure occur for?

Right now, I get a pretty general message that there was an assertion failure for a System.Collections.Test but no indication as to which test it failed on:

=== TEST EXECUTION SUMMARY ===
   System.Collections.Tests  Total: 32703, Errors: 0, Failed: 0, Skipped: 0, Time: 26.645s
  Discovering: System.Collections.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Collections.Tests (found 5562 of 7453 test cases)
  Starting:    System.Collections.Tests (parallel test collections = on, max threads = 20)

Assert failure(PID 24604 [0x0000601c], Thread: 18352 [0x47b0]): m_alignpad == 0

CORECLR! ObjHeader::Validate + 0x43 (0x00007ffa`66f8bcc3)
CORECLR! Object::ValidateInner + 0x493 (0x00007ffa`66f12ed3)
CORECLR! Object::Validate + 0xA1 (0x00007ffa`66f12a01)
CORECLR! OBJECTREF::operator-> + 0x21 (0x00007ffa`66f0c0b1)
CORECLR! JIT_ClassProfile32 + 0xC7 (0x00007ffa`671310b7)
<no module>! <no symbol> + 0x0 (0x00007ffa`0a65e527)
<no module>! <no symbol> + 0x0 (0x000000dd`6cafb030)
<no module>! <no symbol> + 0x0 (0x000000dd`6cafac00)
<no module>! <no symbol> + 0x0 (0x00000276`e4fd77f0)
<no module>! <no symbol> + 0x0 (0x00000276`e543c740)
    File: C:\runtime\src\coreclr\vm\syncblk.cpp Line: 2955
    Image: C:\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\dotnet.exe

  Discovering: System.Collections.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Collections.Tests (found 5562 of 7453 test cases)
  Starting:    System.Collections.Tests (parallel test collections = on, max threads = 20)
  Finished:    System.Collections.Tests

If we know the exact test, we'll have a quicker repro with a more targeted run.

@EgorBo
Copy link
Member

EgorBo commented Jul 19, 2022

@mrsharm in my repro you can append -verbose after -notrait category=failing and xunit will print test names
I think it were diffrient tests each run and I wasn't able to reproduce the issue when I was asking xunit to run just one test specifically (via -method xx)

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 19, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 20, 2022
@kunalspathak kunalspathak reopened this Aug 9, 2022
@jkotas
Copy link
Member Author

jkotas commented Aug 9, 2022

This assert can be hit for many different GC holes, GC heap corruptions and GC bugs. It is important to at least capture the stacktrace where the assert is hit, and track different stack traces by separate issues.

I think it is unlikely that the Linux arm crash you have seen is same root cause as this issue. I expect that it is going to have a different stacktrace. Unfortunately, we cannot tell for sure since no dump was collected.

I am closing this as non-actionable. If you see a test failing with this assert, do not re-activate this issue. Instead open a new issue and capture stack trace where the assert is hit in the issue description.

@jkotas jkotas closed this as completed Aug 9, 2022
@kunalspathak
Copy link
Member

It seems that it is inconsistent. Tried reproing it but doesn't repro. I will keep an eye and open new issue.

image

@ghost ghost locked as resolved and limited conversation to collaborators Sep 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-GC-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Projects
None yet
Development

Successfully merging a pull request may close this issue.