CI: Bump LDC-LLVM to v18.1.3 (except for Android and macOS arm64) #4604

Merged
merged 1 commit into from
Apr 13, 2024

Conversation

kinke
Member

@kinke kinke commented Mar 27, 2024

No description provided.

@JohanEngelen
Member

I tried to fix the ASan and fuzzer lit tests.

@kinke
Member Author

kinke commented Mar 28, 2024

Oh nice, thx Johan! Note that this is totally a draft and will probably be split up into at least 2 PRs, so expect further force-pushes etc. My main interest here is Android and native TLS - I've gotten rid of our TLS hack in our LDC-LLVM 18, and the prebuilt Android LLVM binaries are built with the latest NDK, requiring Android >= 10. After very few mods here, it seems to cross-compile & -link just fine for 32-bit Android, but fails with undefined __tls_get_addr for AArch64.

The current macOS arm64 error wrt. unknown stack probe is most likely caused by the prebuilt LLVM binaries being compiled on macOS 14 (I've switched to the native M1 CI runners in the LLVM repo). So that's probably going to need #4541 in some form.

Windows required 2 further CMake tweaks; the current libxml2 issue for Win64 is going to be fixed in some hours.

@kinke kinke force-pushed the llvm18 branch 2 times, most recently from d622e96 to 068285e Compare March 28, 2024 03:25
@kinke kinke mentioned this pull request Mar 28, 2024
@kinke kinke force-pushed the llvm18 branch 2 times, most recently from dae4d1a to b29b6bb Compare March 28, 2024 16:52
@kinke
Member Author

kinke commented Mar 29, 2024

@thewilsonator: 2 out of the 3 remaining lit-test failures are dcompute-related (codegen/dcompute_cl_{addrspaces_new,images}.d), dying because of the same LLVM assertion:

llvm/lib/Target/SPIRV/SPIRVDuplicatesTracker.cpp:59: void llvm::SPIRVGeneralDuplicatesTracker::buildDepsGraph(std::vector<SPIRV::DTSortableEntry *> &, llvm::MachineModuleInfo *): Assertion `(MI->getOpcode() == SPIRV::OpVariable && i == 3) || Reg2Entry.count(RegOp)' failed.

It'd be nice if you could look into that.

@kinke
Member Author

kinke commented Mar 29, 2024

Oh WTF - sick & tired of the macOS issues. I thought I could just cherry-pick #4541 and fix the 'unsupported stack probe' error that way - as both LLVM and LDC are compiled on the same GHA macos-14 image (and with its default Xcode version) then. And hoped that the driver/config_diag.d problem in #4541 might vanish for the same reason (as our LLVM 17 was cross-compiled on an x86_64 macos-11 image IIRC).

But no, of course not. First of all, the 'unsupported stack probe' error persisted. Disabling LTO and PGO works around that. But now the stage-2 compiler (compiled with itself) crashes multiple times with libc++abi: Pure virtual function called! - for a bunch of tests, and when compiling some Phobos unittests. No stack traces.

@kinke
Member Author

kinke commented Mar 30, 2024

Okay, slowly getting somewhere now wrt. macOS arm64. The /usr/bin/c++ workaround seems to have made matters worse; more luck with LDC_LINK_MANUALLY=ON but skipping the ExtractDMDSystemLinker step by manually providing D_LINKER_ARGS. That detection step failed with a missing System library; so it's not that the lib{m,pthread} symlinks (to libSystem) had vanished in newer Xcodes (as I had assumed) - bizarrely, it doesn't seem to find any system libs at all.

@kinke
Member Author

kinke commented Mar 30, 2024

I.e., in https://github.com/ldc-developers/ldc/actions/runs/8486647167/job/23253351620?pr=4604, there's now a single 'pure virtual function called' error, for the fail_compilation/ice10922.d test, after some error output. Other than that, the 2 general dcompute lit-test failures with LLVM 18 above, and just one more failure - the std.experimental.allocator.building_blocks.kernighan_ritchie unittests with enabled optimizations; possibly just some missing alignment in some unittest. The old issues in #4541 aren't encountered anymore - both driver/config_diag.d and the gammafunction Phobos unittests pass.

@JohanEngelen
Member

@thewilsonator I tried to find the cause of the crash of tests/codegen/dcompute_cl_addrspaces_new.d with LLVM 18, but I haven't found anything. Can you help with that? Thanks.

@kinke
Member Author

kinke commented Mar 30, 2024

The macOS arm64 LTO problems appear serious. The 'unsupported stack probe' seems to come from the C++ IR (Apple clang v15.0 vs. LLVM 18 from stage-1 bootstrap compiler); disabling LTO on the C++ side fixed that error. But the LTO'd ldc2-unittest executable now hangs at runtime in the latest CI job; building it already spit out something like ~15k warnings wrt. definition subprograms cannot be nested within DICompositeType when enabling ODR...

@kinke
Member Author

kinke commented Mar 30, 2024

The numerous Pure virtual function called! errors are back when enabling PGO (separately, without LTO) - as before with LDC_LINK_MANUALLY=OFF and /usr/bin/c++, but both LTO and PGO disabled. WTF.

@thewilsonator
Contributor

> but I haven't found anything. Can you help with that?

Built LLVM from releases/18.x and LDC on that based on master. It works for me locally. No idea.

@thewilsonator
Contributor

Also that code hasn't been changed for a while. Only changed with some extension addition.

@kinke
Member Author

kinke commented Mar 30, 2024

The Pure virtual function called! errors definitely appear to be non-deterministic/random across equivalent CI runs. :(

> Built LLVM from releases/18.x and LDC on that based on master. It works for me locally. No idea.

With enabled assertions? The tests pass without assertions, as shown by the vanilla-LLVM-workflow job.

@thewilsonator
Contributor

Yep my bad, built Release LLVM instead of Debug

@thewilsonator
Contributor

Has this ever worked on an LLVM with assertions enabled? e.g. LLVM 16/17 (with opaque pointers enabled?). i.e. is this an LLVM regression or a problem on our behalf?

@JohanEngelen
Member

JohanEngelen commented Mar 30, 2024

> Has this ever worked on an LLVM with assertions enabled? e.g. LLVM 16/17 (with opaque pointers enabled?). i.e. is this an LLVM regression or a problem on our behalf?

Yes, it works for me with LLVM 17 (whereas the same build params with LLVM 18 hit the assertion).

There is a difference in how 17 and 18 transform the IR in the pipeline. Compare the output using --print-before-all.

@kinke
Member Author

kinke commented Mar 30, 2024

Wrt. macOS arm64, I was about to give up, running out of ideas - but then switching from default Xcode v15.0 to latest v15.2 (in the GHA image) finally seems to have vanquished the Pure virtual function called! errors - 3 CI jobs without any of these errors now.

With LTO for the D parts only, the driver/config_diag.d issue from #4541 is now back.

@kinke
Member Author

kinke commented Mar 30, 2024

And 3 Pure virtual function called! errors are back in the latest CI job, with PGO alone. So it looks like both LTO and PGO are broken in LLVM 18 for macOS arm64; maybe there's a good reason why there isn't any official macOS LLVM 18 GitHub artifact yet. Linux AArch64 tested by Cirrus CI was just fine IIRC (before exceeding our budget for this month); it's using LTO for the compiler, but no PGO.

In the meantime, I'm preparing new prebuilt binaries for our LLVM fork - rebased onto latest upstream release/18.x with some more fixes (v18.1.3 scheduled for April 2nd), and for macOS arm64, using Xcode v15.2 too, and not disabling the EH unwind tables anymore (hopefully fixing driver/config_diag.d with LTO'd LDC, but just a wild guess).

@JohanEngelen
Member

JohanEngelen commented Mar 30, 2024

I can try to reproduce the LTO+PGO issue locally, but it will take a little while, as I need to tinker with the build settings.

Edit: as before, I can repro the driver/config_diag.d issue with LTO (it works without LTO). Updating to the new macOS and latest Xcode now.

@kinke
Member Author

kinke commented Mar 30, 2024

Oh no, the Pure virtual function called! errors are back with the new LLVM binaries - without LTO or PGO. The 3 consecutive jobs without these errors might have been a fluke after all. In which case the natively compiled binary would be useless and not fit for distribution to users. I can only hope that our current cross-compiled (and so totally untested at runtime) macOS arm64 CI artifacts aren't that bad.

Edit: The problem with the macOS arm64 GHA runners is that they only support the macos-14 image, so I cannot simply use the macos-12 image as for the x86_64 job and see if it works with an older OS/toolchain image, which has no issues on x86 at least.

@kinke
Member Author

kinke commented Mar 30, 2024

@JohanEngelen: If you have some time, it'd be great if you could try to reproduce the apparently consistent failure with enabled optimizations only (no stacktrace unfortunately, so I don't know which test it is):

core.exception.AssertError@std/experimental/allocator/building_blocks/kernighan_ritchie.d(329): Assertion failure

AFAICT, the ctor expects the buffer to be at least pointer-aligned. And this stack-allocated buffer with alignment 1 looks very suspicious then, but there might be more: https://github.com/dlang/phobos/blob/a2ade9dec49e70c6acd447df52321988a4c2fb9f/std/experimental/allocator/building_blocks/kernighan_ritchie.d#L649-L651

Edit: Yep, at least https://github.com/dlang/phobos/blob/a2ade9dec49e70c6acd447df52321988a4c2fb9f/std/experimental/allocator/building_blocks/kernighan_ritchie.d#L919 too.

@kinke
Member Author

kinke commented Mar 30, 2024

One piece of potentially good news at least - with full LTO (for the D parts alone), driver/config_diag.d has just passed (maybe really thanks to the now-enabled EH unwind tables for the macOS-arm64 LLVM libs). In that latest CI job (https://github.com/ldc-developers/ldc/actions/runs/8494091408/job/23268828803?pr=4604), there were 2 Pure virtual function called! errors - one during dmd-testsuite, and one during the initial test runners compilation (working later when making sure the test runners are built before running them). So it looks like LTO might not make matters worse anymore; the virtual-func errors seem unrelated, a general issue on these macos-14 runners at least, and show up randomly at runtime - for most CI jobs, something like 0-5 times. Out of maybe 5-10 thousand compiler invocations, no idea really. :)

@kinke
Member Author

kinke commented Mar 31, 2024

Aaand we have a lucky CI job again (no virtual-func errors), with full LTO for the D parts alone, and no PGO: https://github.com/ldc-developers/ldc/actions/runs/8495061837/job/23270913949?pr=4604 - only the 2 expected dcompute lit-test failures, and the 2 (shared/static) optimized kernighan_ritchie unittest assertions.

driver/config_diag.d has been passing a few times in a row now with LTO, so that might be consistently fixed by the unwind tables.

@JohanEngelen
Member

> fixed by the unwind tables.

Can you point me to what you meant by "not disabling the EH unwind tables" ?

@kinke
Member Author

kinke commented Mar 31, 2024

Building LLVM with (default) LLVM_ENABLE_UNWIND_TABLES=ON; I've been disabling those to reduce the executable sizes.

@JohanEngelen
Member

Testcase:

void main() {
    throw new Exception("Test exception");
}

Before updating macOS (Sonoma 14.2 iirc), this test crashed with -O -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto, but now after updating it works. And driver/config_diag.d also works. I propose we add this testcase to the testsuite (with -O -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto).
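
A sketch of what such a lit test might look like. Only the compiler flags and the D code come from the comment above; the REQUIRES/RUN/CHECK directives and the %ldc, %s, %t and %exe substitutions are assumptions about how the test suite's lit configuration is set up, so treat this as illustrative rather than as the test that was actually added.

```d
// Sketch only: the lit directives and substitutions below are assumed,
// not taken from this PR.
// REQUIRES: LTO
// RUN: %ldc -O -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto %s -of=%t%exe
// RUN: not %t%exe 2>&1 | FileCheck %s

// The uncaught exception should end the program with a non-zero exit code
// and its message in the output, rather than a crash.
// CHECK: Test exception
void main() {
    throw new Exception("Test exception");
}
```

The idea is that the test would pass only if the LTO'd executable reports the uncaught exception normally.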

Moving on to the kernighan_ritchie unittests.

@JohanEngelen
Member

> @JohanEngelen: If you have some time, it'd be great if you could try to reproduce the apparently consistent failure with enabled optimizations only (no stacktrace unfortunately, so I don't know which test it is):
>
> core.exception.AssertError@std/experimental/allocator/building_blocks/kernighan_ritchie.d(329): Assertion failure
>
> AFAICT, the ctor expects the buffer to be at least pointer-aligned. And this stack-allocated buffer with alignment 1 looks very suspicious then, but there might be more: https://github.com/dlang/phobos/blob/a2ade9dec49e70c6acd447df52321988a4c2fb9f/std/experimental/allocator/building_blocks/kernighan_ritchie.d#L649-L651

Correct, this was the bug. Fixed in LDC phobos (upstream: dlang/phobos#8965).
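
For readers following along, a minimal D sketch of the alignment issue described above. It is not the Phobos unittest or the actual fix from dlang/phobos#8965: the `isAligned` helper and the buffer sizes are hypothetical, and it assumes a compiler that honours `align` on local variables.

```d
import std.stdio : writefln;

/// Hypothetical helper (not part of Phobos): is address `p` a multiple of `alignment`?
bool isAligned(const(void)* p, size_t alignment)
{
    return (cast(size_t) p) % alignment == 0;
}

void main()
{
    // A ubyte array type only guarantees byte alignment, so an optimizing
    // compiler may place this stack buffer at any address.
    ubyte[1024] plain;
    static assert(typeof(plain).alignof == 1);

    // Explicitly requesting pointer alignment makes the buffer suitable for
    // code that expects to store pointer-sized data at its start.
    align(size_t.sizeof) ubyte[1024] aligned;
    assert(isAligned(aligned.ptr, size_t.sizeof));

    // `plain` may or may not happen to be aligned; nothing guarantees it.
    writefln("plain pointer-aligned:   %s", isAligned(plain.ptr, size_t.sizeof));
    writefln("aligned pointer-aligned: %s", isAligned(aligned.ptr, size_t.sizeof));
}
```

An unaligned buffer can pass by luck in unoptimized builds and fail once the optimizer lays out the stack differently, which would be consistent with the failure showing up only with optimizations enabled.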

@kinke
Member Author

kinke commented Apr 1, 2024

> Correct, this was the bug.

Perfect, thx.

> Before updating macOS (Sonoma 14.2 iirc), this test crashed with -O -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto, but now after updating it works.

With the same Xcode version (which one?)?!

And what about the random Pure virtual function called errors? I don't find anything useful via Google. Even if the chance of hitting that error is something like 1:1000, that's bad enough to break CI quite often as seen here, and IMO unsuited for distribution to users.

@thewilsonator
Contributor

LLVM issue for the SPIRV regression: llvm/llvm-project#87315

@kinke
Member Author

kinke commented Apr 4, 2024

For macOS arm64, no improvements wrt. the virtual-func issue with LLVM v18.1.3 and Xcode v15.3 (newly available in the GHA image).

@kinke kinke changed the title from "CI: Bump LDC-LLVM to v18.1.2" to "CI: Bump LDC-LLVM to v18.1.3" Apr 4, 2024
@kinke kinke force-pushed the llvm18 branch 2 times, most recently from 3af7563 to 17d43af Compare April 13, 2024 20:20
@kinke kinke changed the title from "CI: Bump LDC-LLVM to v18.1.3" to "CI: Bump LDC-LLVM to v18.1.3 (except for Android and macOS arm64)" Apr 13, 2024
@kinke kinke marked this pull request as ready for review April 13, 2024 20:22
@kinke kinke enabled auto-merge (squash) April 13, 2024 20:27
I've removed our own TLS emulation for Android, and switched to the
latest NDK (r26d). Switching to native TLS supported since Android
10/11 requires a few compiler and druntime changes; I'll follow up.

The macOS arm64 LLVM binaries were built natively on macOS 14 arm64,
with Xcode v15.3. Experiments with a native LDC build on such a macOS
14 arm64 CI runner show random 'Pure virtual function called' errors
for the compiler itself (compiled with itself). Not sure whether
these are regressions (with a failure rate of maybe very roughly
1:1000), or happening with the current cross-compiled macOS arm64
binaries too - opting for the safe variant of keeping LLVM 17 for
the macOS arm64 job.
@kinke kinke merged commit d2498cc into ldc-developers:master Apr 13, 2024
23 checks passed
@kinke kinke deleted the llvm18 branch April 13, 2024 22:36
@thewilsonator
Contributor

Don't know exactly when that LLVM-SPIRV bug was fixed but it works with current LLVM main branch.
