
Enable QJFL and OSR by default for x64 and arm64 #65675

Merged
merged 1 commit on Mar 30, 2022

Conversation

AndyAyersMS
Member

@AndyAyersMS commented Feb 21, 2022

Change these default values when the jit targets x64 or arm64:

  • COMPlus_TC_QuickJitForLoops=1
  • COMPlus_TC_OnStackReplacement=1

The upshot is that on x64/arm64 more methods will be jitted at Tier0,
and we will rely on OSR to get out of long-running Tier0 methods.

Other architectures continue to use the old behavior for now, as
OSR is not yet supported for x86 or arm.

See OSR Details and Debugging for more on how this might impact everyone's day-to-day development.
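
For local testing, the same variables can be set explicitly to compare the two behaviors. A minimal sketch (bash syntax shown; on Windows use set instead of export):

# New x64/arm64 defaults after this change (equivalent to leaving them unset):
export COMPlus_TC_QuickJitForLoops=1
export COMPlus_TC_OnStackReplacement=1

# Opt back into the previous behavior for comparison: methods with loops
# bypass Tier0 and are jitted with full optimization up front.
export COMPlus_TC_QuickJitForLoops=0
export COMPlus_TC_OnStackReplacement=0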

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label Feb 21, 2022
@ghost assigned AndyAyersMS Feb 21, 2022
@ghost

ghost commented Feb 21, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@AndyAyersMS
Member Author

cc @dotnet/jit-contrib
fyi @jkotas @kouvel

@AndyAyersMS
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@AndyAyersMS
Member Author

AndyAyersMS commented Feb 22, 2022

One jit stress failure in baseservices/threading/regressions/13662-a. Also saw this test fail in an earlier OSR PR #63642 (comment), so will need to investigate.

@AndyAyersMS
Member Author

Haven't been able to repro the stress failure. The test's Main method goes through OSR:

; Assembly listing for method Test_13662_a:Main():int
; Emitting BLENDED_CODE for generic ARM64 CPU - MacOS
; Tier-1 compilation
; OSR variant for entry point 0x2e

Might be something timing related?
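
For reference, a listing like the one above can be captured by running the test with jit disasm enabled on a Checked build. Roughly (a sketch; the assembly name is inferred from the test path, so treat it as a placeholder):

export COMPlus_TC_QuickJitForLoops=1
export COMPlus_TC_OnStackReplacement=1
export COMPlus_JitDisasm=Main     # dump jitted code for Main; OSR compiles are
                                  # tagged "OSR variant for entry point ..."
./corerun 13662-a.dll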

@AndyAyersMS
Member Author

Ah, the issue is that STRESS_LCL_FLDS is incompatible with OSR: we can't add padding to existing locals for OSR methods, as those locals live on the Tier0 frame.

@AndyAyersMS
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@kunalspathak
Member

/azp run Fuzzlyn, Antigen

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@AndyAyersMS
Member Author

Latest round of jitstress shows quite a few more failures than the original run. main was clean last night, so I have to assume these could be related to this change.

@AndyAyersMS
Member Author

AndyAyersMS commented Feb 23, 2022

Fuzzlyn seems to have run cleanly but failed in some post-processing step (Jakob points out Fuzzlyn currently runs with TC disabled, so OSR won't be a factor here).

Antigen failures look like they match up pretty well with the failures from the last rolling run (Sunday).

@kunalspathak
Member

Antigen failures look like they match up pretty well with the failures from the last rolling run (Sunday).

Yes, I verified that too. It seems I need to add more flags in Antigen to exercise the OSR code paths.

@AndyAyersMS
Member Author

Latest round of jitstress shows quite a few more failures

One issue is more incompatibility with STRESS_LCL_FLDS -- this time in Tier0 frames: if we pad out locals, we mis-report where they are located on the Tier0 frame.

@AndyAyersMS
Member Author

/azp run runtime-coreclr jitstress

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Member Author

There may still be more fixes needed, similar to the above... still looking.

@AndyAyersMS
Member Author

AndyAyersMS commented Feb 23, 2022

GC\Features\SustainedLowLatency\scenario\scenario.dll seems to fail with QJFL=1, even with OSR and stress disabled.

TID 9f88: thread [os id=0x084918 id=0x084] redirect failed due to ContextFlags of 0xc810000b

Assert failure(PID 60484 [0x0000ec44], Thread: 18712 [0x4918]): Consistency check failed: AV in clr at this callstack:
------
CORECLR! PinObject + 0x90 (0x00007ffb`e54e4190)
CORECLR! ScanConsecutiveHandlesWithoutUserData + 0x65 (0x00007ffb`e54e2815)
CORECLR! BlockScanBlocksWithoutUserData + 0x44 (0x00007ffb`e54e2224)
CORECLR! xxxTableScanQueuedBlocksAsync + 0x1B5 (0x00007ffb`e54e30e5)
CORECLR! xxxAsyncSegmentIterator + 0x56 (0x00007ffb`e54e2dd6)
CORECLR! TableScanHandles + 0x281 (0x00007ffb`e54e2bf1)
CORECLR! xxxTableScanHandlesAsync + 0xBF (0x00007ffb`e54e2ebf)
CORECLR! HndScanHandlesForGC + 0x1B8 (0x00007ffb`e54dfca8)
CORECLR! Ref_TracePinningRoots + 0x116 (0x00007ffb`e54e6526)
CORECLR! GCScan::GcScanHandles + 0x94 (0x00007ffb`e54e8a14)
CORECLR! WKS::gc_heap::background_mark_phase + 0x723 (0x00007ffb`e5537983)
CORECLR! WKS::gc_heap::gc1 + 0x28A (0x00007ffb`e55490da)
CORECLR! WKS::gc_heap::bgc_thread_function + 0xD1 (0x00007ffb`e553bf11)
CORECLR! <lambda_d9a0428bbecf3d379716300c87d22bd6>::operator() + 0xA1 (0x00007ffb`e51ad611)
KERNEL32! BaseThreadInitThunk + 0x10 (0x00007ffc`594154e0)
NTDLL! RtlUserThreadStart + 0x2B (0x00007ffc`5afa485b)
-----
.AV on tid=0x4918 (18712), cxr=0000005367D7EB00, exr=0000005367D7EFF0

This also fails on main with COMPlus_TC_QuickJitForLoops=1, so it is evidently a pre-existing problem.

I'll open a side issue and mark this test as incompatible with stress in the meantime.
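
For the record, the repro boils down to flipping just the loop quick-jit default when running the test (a sketch; bash syntax, corerun invocation assumed from the usual test layout):

export COMPlus_TC_QuickJitForLoops=1
export COMPlus_TC_OnStackReplacement=0   # OSR off; jit stress also off
./corerun GC/Features/SustainedLowLatency/scenario/scenario.dll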

@AndyAyersMS
Member Author

Odd build break building the cross-platform DAC

FAILED: pal/src/libunwind/src/CMakeFiles/libunwind_xdac.dir/dwarf/Gparser.c.obj 
C:\PROGRA~2\MICROS~1\2019\ENTERP~1\VC\Tools\MSVC\1429~1.301\bin\Hostx86\x64\cl.exe  /nologo -DCROSS_COMPILE -DDISABLE_CONTRACTS -DHAVE_CONFIG_H=1 -DHAVE_UNW_GET_ACCESSORS -DHAVE___THREAD=0 -DHOST_64BIT -DHOST_AMD64 -DHOST_WINDOWS -DNDEBUG -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_STRING=\"\" -DTARGET_64BIT -DTARGET_ARM64 -DTARGET_LINUX -DTARGET_UNIX -DUNW_REMOTE_ONLY -DURTBLDENV_FRIENDLY=Retail -D_CRT_SECURE_NO_WARNINGS -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_Thread_local="" -D__aarch64__ -D__linux__ -Ipal\src\libunwind\src -ID:\a\_work\1\s\src\coreclr\pal\src\libunwind\src -ID:\a\_work\1\s\src\native -ID:\a\_work\1\s\src\coreclr\pal\src\libunwind\include\tdep -ID:\a\_work\1\s\src\coreclr\pal\src\libunwind\include -Ipal\src\libunwind\include\tdep -Ipal\src\libunwind\include -ID:\a\_work\1\s\src\coreclr\pal\src\libunwind\include\win /DWIN32 /D_WINDOWS  /guard:cf /guard:ehcont /O2 /Ob2 /DNDEBUG -MT   /Ox /nologo /W3 /WX /Oi /Oy- /Gm- /Zp8 /Gy /GS /fp:precise /FC /MP /Zm200 /Zc:strictStrings /Zc:wchar_t /Zc:inline /Zc:forScope /wd4065 /wd4100 /wd4127 /wd4189 /wd4200 /wd4201 /wd4245 /wd4291 /wd4456 /wd4457 /wd4458 /wd4733 /wd4838 /wd4960 /wd4961 /wd5105 /we4007 /we4013 /we4102 /we4551 /we4700 /we4640 /we4806 /w34092 /w34121 /w34125 /w34130 /w34132 /w34212 /w34530 /w35038 /w44177 /Zi /ZH:SHA_256 /source-charset:utf-8 /GL /TC /permissive- -wd4068 -wd4146 -wd4244 -wd4267 -wd4334 -wd4311 -wd4475 -wd4477 /showIncludes /Fopal\src\libunwind\src\CMakeFiles\libunwind_xdac.dir\dwarf\Gparser.c.obj /Fdpal\src\libunwind\src\CMakeFiles\libunwind_xdac.dir\ /FS -c D:\a\_work\1\s\src\coreclr\pal\src\libunwind\src\dwarf\Gparser.c
D:\a\_work\1\s\src\coreclr\pal\src\libunwind\include\libunwind-aarch64.h(198): error C2061: syntax error: identifier 'alignas'
D:\a\_work\1\s\src\coreclr\pal\src\libunwind\include\libunwind-aarch64.h(199): error C2059: syntax error: '}'
D:\a\_work\1\s\src\coreclr\pal\src\libunwind\include\libunwind-aarch64.h(215): error C2079: 'uc_mcontext' uses undefined struct 'unw_sigcontext'

@kunalspathak
Member

kunalspathak commented Feb 23, 2022

I have added support for OSR stress switches in Antigen in kunalspathak/Antigen@e544a76. Triggering another run.

/azp run Antigen

@kunalspathak
Member

/azp run Antigen

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@kunalspathak
Member

Odd build break building the cross-platform DAC

@jkoritzinsky @hoyosjs

@AndyAyersMS
Member Author

Odd build break building the cross-platform DAC

@jkoritzinsky @hoyosjs

@agocke is trying to fix this via #65798 but hitting snags.

@AndyAyersMS
Member Author

It seems like something is wiping out huge numbers of test runs. Console log has:

If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.

What this means:

- All attempts to retry execution of this work item were unable to complete.  This can be both for infrastructure reasons (problems within Azure) or issues with the work item (for instance, causing a machine to reboot unexpectedly or killing the Helix client on the machine will force a retry).
- No further work will be done for this specific work item, and its exit code is set to an artificial -1 (since it did not complete, there is no real exit code).

Common causes:

- Disabled queue (end-of-life Helix queues are automatically forwarded to deadletter and will fail instantly)
- Unhealthy Helix Client machine(s)
- Queue has been backed up heavily by a large amount of work and was manually purged by the engineering team
- Azure issues (e.g. Service Bus is overloaded)
- Malformed payloads; if Helix cannot download and unzip all payloads successfully, work will retry until dead-lettered.

For follow up:

- Check if your Helix Queue is still enabled, either via the metadata you see by browsing to https://helix.dot.net/api/info/queues?api-version=2019-06-17 or recent emails from the .NET Engineering Infrastructure team.
- Check that all work item payloads are accessible using a browser.
- If you are sending to a non-disabled queue and find this error repeatedly occurring, please contact the dnceng team.
- If a single, specific work item dead-letters and others do not, consider local debugging; it may be causing a spontaneous reboot (or triggering one intentionally).

I am going to trigger jitstress on main to try and sort out what's broken, but no point doing that until #65798 or similar lands.

@hoyosjs
Member

hoyosjs commented Feb 24, 2022

@AndyAyersMS this is a bug in helix. See https://github.com/dotnet/core-eng/issues/15685; a hotfix is being worked on.

@AndyAyersMS
Member Author

@AndyAyersMS this is a bug in helix. See dotnet/core-eng#15685; a hotfix is being worked on.

Ok, thanks.... any idea what we should do about the broken builds issue?

@hoyosjs
Member

hoyosjs commented Feb 25, 2022

/azp run runtime

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Member Author

Merged in stress fixes and enabling promotion. Will start a new round of testing.

@AndyAyersMS
Member Author

Seeing some AvxVnni HW tests fail with

Unhandled exception. System.BadImageFormatException: Could not load file or assembly 'C:\h\w\A8BE0A34\p\system.runtime.interopservices.dll'. Format of the executable (.exe) or library (.dll) is invalid.
File name: 'C:\h\w\A8BE0A34\p\system.runtime.interopservices.dll'

will retry

@AndyAyersMS force-pushed the OSROnByDefaultForX64andArm64 branch from f2c98ca to 0ca4006 on March 23, 2022 02:30
@AndyAyersMS
Member Author

AndyAyersMS commented Mar 28, 2022

This is the main PR to enable OSR, currently blocked by:

@JulieLeeMSFT added this to the 7.0.0 milestone Mar 28, 2022
@AndyAyersMS force-pushed the OSROnByDefaultForX64andArm64 branch from 0ca4006 to 561b808 on March 29, 2022 20:12
@AndyAyersMS
Member Author

Libraries failure looks like #60962.

@AndyAyersMS
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@AndyAyersMS
Member Author

The arm64 jitstress failure in Regressions\coreclr\GitHub_45929 seems to be happening sporadically in recent runs, so I don't think it is caused by this PR.

@AndyAyersMS
Member Author

@dotnet/jit-contrib I think this is finally ready.

I need to do one last double-check of the perf numbers before merging.

@AndyAyersMS
Member Author

Jitstress failure is #60152.

@dotnet/jit-contrib ping -- somebody needs to approve this.

@EgorBo
Member

EgorBo commented Apr 7, 2022

Lots of Ubuntu-Arm64 improvements: dotnet/perf-autofiling-issues#4427

