
Fix test execution jobs in the coreclr-release-outerloop-nightly pipeline #85278

Merged: trylek merged 1 commit into main from dev/trylek/CoreClrReleaseOuterloopNightlyFix on Apr 25, 2023.


@trylek (Member) commented Apr 24, 2023

As described in issue #85263, the coreclr-release-outerloop-nightly pipeline has been malfunctioning for quite a while because it did not publish the native test components. This simple change fixes that. It doesn't make the pipeline completely green, but we're now at least sending the tests to Helix and seeing just a couple of remaining test failures.

Thanks

Tomas

/cc @dotnet/runtime-infrastructure

@ghost commented Apr 24, 2023

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Author: trylek
Assignees: -
Labels: area-Infrastructure-coreclr
Milestone: -

@hoyosjs (Member) left a comment

How much does this pipeline get used? It's been broken and nobody looked at it, so is it worth spending the resources on it? How does it differ from the outerloop pipeline?

@trylek (Member, Author) commented Apr 24, 2023

@hoyosjs, I raised exactly the same concerns on the original issue thread; I believe we should consider consolidating it with the runtime-coreclr outerloop pipeline.

@trylek (Member, Author) commented Apr 24, 2023

Having said that, I think the additional testing does provide some value. If I understand it correctly, it runs the tests in release mode, whereas runtime-coreclr outerloop only runs the tests in checked mode (even though it also builds the debug and release versions of the runtime).

@kunalspathak (Member) left a comment

LGTM

@trylek (Member, Author) commented Apr 24, 2023

Just FYI, I have rerun the failed jobs because many of them apparently failed due to transient network errors; the updated results should be available in an hour or so.

@kunalspathak (Member) commented:

> Just FYI I have rerun the failed jobs because apparently many of them failed due to transient network failures; the updated results should be available in an hour or so.

I did see some failures related to crossgen2 and AVX-512 that seem to need investigation.

@trylek (Member, Author) commented Apr 24, 2023

So did I. One of the annoying limitations of AzDO is the inability to rerun just a single job (or I just don't know how to do it).

@trylek (Member, Author) commented Apr 25, 2023

OK, most of the runs have finished, and I now believe the failures are "real". One consistently reproducing bug is the "coreroot_determinism" failure; I'll look into that, as I suspect I have an idea of what is going on there. For some of the AVX-512 failures, I'm seeing the relatively uncommon exit code C000001D, meaning invalid instruction, so I wonder whether we're sure the Helix hardware we're using supports these new processor features.
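For reference, C000001D is the Windows NTSTATUS value STATUS_ILLEGAL_INSTRUCTION. A test harness may report it either as the unsigned hex value or as a signed 32-bit process exit code (-1073741795), which is easy to miss when scanning logs. The C++17 sketch below shows how the two forms relate. The helper names are hypothetical and are not part of Helix or the runtime test scripts.

```cpp
#include <cstdint>

// Windows NTSTATUS value STATUS_ILLEGAL_INSTRUCTION: raised when the CPU
// executes an opcode it does not support (e.g. an unsupported ISA extension).
constexpr std::uint32_t kStatusIllegalInstruction = 0xC000001Du;

// Hypothetical helper: reinterpret a signed 32-bit exit code as the
// unsigned NTSTATUS value (well-defined modular conversion).
constexpr std::uint32_t ToNtStatus(std::int32_t exitCode) {
    return static_cast<std::uint32_t>(exitCode);
}

// Hypothetical helper: true if the exit code is STATUS_ILLEGAL_INSTRUCTION.
constexpr bool IsIllegalInstruction(std::int32_t exitCode) {
    return ToNtStatus(exitCode) == kStatusIllegalInstruction;
}

// NTSTATUS stores its severity in the top two bits; 0b11 means "error".
constexpr bool IsErrorSeverity(std::uint32_t status) {
    return (status >> 30) == 0x3u;
}

// Compile-time checks: the signed exit code -1073741795 is 0xC000001D.
static_assert(IsIllegalInstruction(-1073741795), "signed form of 0xC000001D");
static_assert(IsErrorSeverity(kStatusIllegalInstruction), "0xC... codes are errors");
```

The checks run at compile time, so building the snippet is enough to confirm the mapping.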

@kunalspathak (Member) commented:

> For some of the AVX 512 failures, I'm seeing the relatively uncommon exit code C000001D meaning invalid instruction so I'm wondering whether we're sure the Helix HW we're using supports these new processor features.

@tannergooding

@tannergooding (Member) commented:

> For some of the AVX 512 failures, I'm seeing the relatively uncommon exit code C000001D meaning invalid instruction so I'm wondering whether we're sure the Helix HW we're using supports these new processor features.

It's more likely that there is some encoding bug for a new instruction and outerloop hits the right edge case to trigger it.

Is there a repro?

@trylek merged commit 41cc61e into main on Apr 25, 2023
@trylek (Member, Author) commented Apr 25, 2023

@tannergooding, I can reproduce the issue locally by running the HW intrinsic tests in Crossgen2 mode with essentially the following command sequence:

build clr+libs -c Release
src\tests\build release test JIT\HardwareIntrinsics\HardwareIntrinsics_X86_Avx512_r.csproj
src\tests\run release crossgen2

In Release mode, I see a failure that looks similar to the one in the lab:

17:07:08.042 Failed test: _Avx512DQ_VL_Vector128_r::JIT.HardwareIntrinsics.X86._Avx512DQ_VL_Vector128.Program.BroadcastPairScalarToVector128UInt32()
17:07:08.053 Running test: _Avx512DQ_VL_Vector128_r::JIT.HardwareIntrinsics.X86._Avx512DQ_VL_Vector128.Program.BroadcastPairScalarToVector128Single()
Beginning scenario: RunBasicScenario_UnsafeRead
Fatal error. System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
   at JIT.HardwareIntrinsics.X86._Avx512DQ_VL_Vector128.SimpleUnaryOpTest__BroadcastPairScalarToVector128Single.RunBasicScenario_UnsafeRead()
   at JIT.HardwareIntrinsics.X86._Avx512DQ_VL_Vector128.Program.BroadcastPairScalarToVector128Single()
   at Program.<$>g__TestExecutor3|0_2(System.IO.StreamWriter, System.IO.StreamWriter, <>c__DisplayClass0_0 ByRef)
   at Program.$(System.String[])

Sadly, the call stack includes about ten JITted methods without symbol information, so I haven't yet figured out how to dig deeper into the failure.

In Debug mode, I hit a JIT assertion instead:

build clr -c Debug
src\tests\build test JIT\HardwareIntrinsics\HardwareIntrinsics_X86_Avx512_r.csproj
src\tests\run crossgen2

16:56:42.023 Running test: _Avx512BW_r::JIT.HardwareIntrinsics.X86._Avx512BW.Program.ShiftLeftLogicalVariableInt16()

Assert failure(PID 32348 [0x00007e5c], Thread: 13636 [0x3544]): Assertion failed 'inputSize == 4 || inputSize == 8' in 'JIT.HardwareIntrinsics.X86._Avx512BW.SimpleBinaryOpTest__ShiftLeftLogicalVariableInt16:RunBasicScenario_UnsafeRead():this' during 'Generate code' (IL size 114; hash 0x77e786fa; MinOpts)

    File: C:\git\runtime7\src\coreclr\jit\emitxarch.cpp Line: 15481
    Image: c:\git\runtime7\artifacts\tests\coreclr\windows.x64.debug\tests\core_root\corerun.exe

with the following call stack:

 	clrjit.dll!assertAbort(const char * why, const char * file, unsigned int line) Line 304	C++
>	clrjit.dll!emitter::TryEvexCompressDisp8Byte(emitter::instrDesc * id, __int64 dsp, bool * dspInByte) Line 15481	C++
 	clrjit.dll!emitter::emitInsSizeSVCalcDisp(emitter::instrDesc * id, unsigned __int64 code, int var, int dsp) Line 3751	C++
 	clrjit.dll!emitter::emitInsSizeSV(emitter::instrDesc * id, unsigned __int64 code, int var, int dsp) Line 3828	C++
 	clrjit.dll!emitter::emitIns_R_R_S(instruction ins, emitAttr attr, _regNumber_enum reg1, _regNumber_enum reg2, int varx, int offs) Line 6844	C++
 	clrjit.dll!emitter::emitIns_SIMD_R_R_S(instruction ins, emitAttr attr, _regNumber_enum targetReg, _regNumber_enum op1Reg, int varx, int offs) Line 8279	C++
 	clrjit.dll!CodeGen::inst_RV_RV_TT(instruction ins, emitAttr size, _regNumber_enum targetReg, _regNumber_enum op1Reg, GenTree * op2, bool isRMW) Line 1130	C++
 	clrjit.dll!CodeGen::genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic * node, instruction ins, emitAttr attr, _regNumber_enum targetReg, _regNumber_enum op1Reg, GenTree * op2) Line 665	C++
 	clrjit.dll!CodeGen::genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic * node, instruction ins, emitAttr attr) Line 637	C++
 	clrjit.dll!CodeGen::genHWIntrinsic(GenTreeHWIntrinsic * node) Line 261	C++
 	clrjit.dll!CodeGen::genCodeForTreeNode(GenTree * treeNode) Line 1898	C++
 	clrjit.dll!CodeGen::genCodeForBBlist() Line 469	C++
 	clrjit.dll!CodeGen::genGenerateMachineCode() Line 1915	C++
 	clrjit.dll!CodeGenPhase::DoPhase() Line 1650	C++
 	clrjit.dll!Phase::Run() Line 61	C++
 	clrjit.dll!DoPhase(CodeGen * _codeGen, Phases _phase, void(CodeGen::*)() _action) Line 1664	C++
 	clrjit.dll!CodeGen::genGenerateCode(void * * codePtr, unsigned int * nativeSizeOfCode) Line 1674	C++

when compiling the method

JIT.HardwareIntrinsics.X86._Avx512BW.SimpleBinaryOpTest__ShiftLeftLogicalVariableInt16.RunBasicScenario_UnsafeRead()

The proximate cause is that inputSize equals 2 in emitter::TryEvexCompressDisp8Byte.
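The assertion sits in the emitter's handling of EVEX compressed displacement (disp8*N). Under AVX-512's EVEX encoding, an 8-bit memory displacement is implicitly scaled by a factor N that depends on the instruction's tuple type and operand size. A displacement can therefore use the short form only if it is a multiple of N and the quotient fits in a signed byte; otherwise a full 32-bit displacement is needed. The C++17 sketch below illustrates only that general rule. TryCompressDisp8 and its explicit n parameter are hypothetical, and this is not the JIT's emitter::TryEvexCompressDisp8Byte, which derives N internally and is where the assertion fired.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical illustration of EVEX disp8*N compression.
// `n` is the scale factor the hardware applies to the stored byte; here the
// caller supplies it instead of deriving it from a tuple type.
constexpr std::optional<std::int8_t> TryCompressDisp8(std::int64_t disp, std::int64_t n) {
    if (n <= 0 || disp % n != 0) {
        return std::nullopt; // not a multiple of N: needs a 32-bit displacement
    }
    const std::int64_t scaled = disp / n;
    if (scaled < INT8_MIN || scaled > INT8_MAX) {
        return std::nullopt; // scaled value does not fit in a signed byte
    }
    return static_cast<std::int8_t>(scaled); // stored byte; CPU computes byte * N
}

// Examples, checked at compile time.
static_assert(*TryCompressDisp8(64, 16) == 4);         // 64 = 4 * 16
static_assert(!TryCompressDisp8(6, 4).has_value());    // 6 is not a multiple of 4
static_assert(!TryCompressDisp8(4096, 4).has_value()); // 1024 exceeds int8 range
```

Because the sketch treats N as an input, it happily accepts N = 2, the input size that tripped the assertion; it deliberately leaves out how a real encoder maps tuple types and operand sizes to N.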

Hope that helps

Tomas

@tannergooding (Member) commented Apr 25, 2023

Thanks! I believe these are all fixed as part of #85281 and should be resolved once it's merged.

They slipped through CI because of #85056 and the corresponding jobs showing as passing. That should itself be resolved by #85183.

@hoyosjs deleted the dev/trylek/CoreClrReleaseOuterloopNightlyFix branch on April 25, 2023 at 19:54
@ghost locked this conversation as resolved and limited it to collaborators on May 26, 2023
4 participants