-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix redhat arm64 #52244
Fix redhat arm64 #52244
Conversation
Something is broken in the build on many platforms, I am investigating it. |
Hmm, the problem is that we don't have lld on our build machines / in the build docker images. And there is no lld on macOS. |
I made a PR to add LLD to the images here. Do we just need to condition the linker for OSX? I don't have an M1 Mac to try this on so I don't know if the issue repros there. |
@crummel we need lld in our docker images, but it turned out to be more involved than I've thought. I have created a PR adding it to the Ubuntu 16.04 images that we use for building, but there was a pushback |
In the meantime while you're on 6.0, anything we can help with? |
I just learned of this suggestion to use lld everywhere. Unfortunately there is currently no support for s390x in lld, so it would be good if this could be made conditional on the architecture as well. |
@uweigand thank you for pointing it out. I'll add exclusion for s390x. |
eng/native/configurecompiler.cmake
Outdated
if (NOT CLR_CMAKE_HOST_ARCH_S390X) | ||
add_linker_flag(-fuse-ld=lld) | ||
endif () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@janvorli, should we adjust this block accordingly to support all platforms:
runtime/eng/native/configuretools.cmake
Lines 66 to 80 in 57bfe47
if (NOT CLR_CMAKE_HOST_WIN32) | |
# detect linker | |
set(ldVersion ${CMAKE_C_COMPILER};-Wl,--version) | |
execute_process(COMMAND ${ldVersion} | |
ERROR_QUIET | |
OUTPUT_VARIABLE ldVersionOutput) | |
if("${ldVersionOutput}" MATCHES "GNU ld" OR "${ldVersionOutput}" MATCHES "GNU gold" OR "${ldVersionOutput}" MATCHES "GNU linkers") | |
set(LD_GNU 1) | |
elseif("${ldVersionOutput}" MATCHES "Solaris Link") | |
set(LD_SOLARIS 1) | |
else(CLR_CMAKE_HOST_OSX OR CLR_CMAKE_HOST_MACCATALYST) | |
set(LD_OSX 1) | |
endif() | |
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thank you, I've forgotten about this detection code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually gives a match for the LD_GNU:
LLD 10.0.0 (compatible with GNU linkers)
I wonder if we need to differentiate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it would be better if the -fuse-ld=lld is moved to the same block to keep all linker-related configs in one place. Also, we can explicitly exclude CMAKE_C_COMPILER_ID MATCHES "GNU"
from using lld (it would otherwise give errors in environment where we only have gcc-toolchain installed).
Aside, I think this line is unnecessary (and it's unique coreclr as everything is using shared or default configs):
runtime/src/coreclr/pgosupport.cmake
Line 9 in 57bfe47
set(CMAKE_REQUIRED_LIBRARIES -flto -fuse-ld=gold) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least the -flto is needed for PGO. The -fuse-ld=gold should not be needed when the lld linker is used, as it should also support LTO IIRC.
eng/native/configurecompiler.cmake
Outdated
@@ -105,6 +105,10 @@ elseif (CLR_CMAKE_HOST_UNIX) | |||
# Use uppercase CMAKE_BUILD_TYPE for the string comparisons below | |||
string(TOUPPER ${CMAKE_BUILD_TYPE} UPPERCASE_CMAKE_BUILD_TYPE) | |||
|
|||
if (NOT CLR_CMAKE_HOST_ARCH_S390X AND NOT CLR_CMAKE_HOST_OSX AND NOT CLR_CMAKE_HOST_FREEBSD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (NOT CLR_CMAKE_HOST_ARCH_S390X AND NOT CLR_CMAKE_HOST_OSX AND NOT CLR_CMAKE_HOST_FREEBSD) | |
if (NOT CLR_CMAKE_HOST_ARCH_S390X AND NOT CLR_CMAKE_HOST_OSX AND NOT CLR_CMAKE_HOST_FREEBSD AND NOT CMAKE_C_COMPILER_ID MATCHES "GNU") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and I still think
runtime/eng/native/configuretools.cmake
Line 67 in 57bfe47
# detect linker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you are saying that instead of detecting the linker there for Linux, we would set this option at that place? We add a plenty of linker flags in the eng/native/configurecompiler.cmake already, so it seemed to make sense to have it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually suggesting to move that condition after line 67 of configuretools.cmake (without deleting anything existing there), because that is the file which is included in all subsets including libraries. configurecompiler is only for coreclr and hosts. Unfortunately, it is not clear as we haven't finished the refactoring properly from my point of view but I paused it to avoid merge conflicts at the time.
I think you would agree that it is good idea to use same linker (whichever is selected) for all components of a given build rather than mixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree with using the same linker for everything. As for moving the setting, it seemed strange that we would set the lld using the option and then right away run the linker detection and find that it is lld.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was thinking I might have misunderstood what you've meant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could be more explicit, e.g.
if (NOT CLR_CMAKE_HOST_WIN32)
if (CMAKE_C_COMPILER_ID MATCHES "Clang" AND NOT CLR_CMAKE_HOST_ARCH_S390X AND NOT CLR_CMAKE_HOST_OSX AND NOT CLR_CMAKE_HOST_FREEBSD)
add_linker_flag(-fuse-ld=lld)
set(LD_GNU 1) # or maybe LD_LLVM and update the usages with `LD_GNU OR LD_LLVM` to be more explicit / future proof?
else()
# detect linker
... existing code ..
endif()
endif()
Clang on ARM64 places the .rodata section into the same segment as .text. On RHEL 8 ARM64, the kernel is configured for 64kB memory pages. When we flip the page protection of the page containing the GS cookie to RW and back to RO, we assume that the cookie lives in a non-executable memory. This assumption is broken on RHEL 8 and we end up setting protection of a part of the coreclr code to read only instead of back to RX. This change switches the linker we use to lld from the previously used gnu linker. That linker places .rodata into a different segment than .text by default. Moreover, I was planning to move to using lld anyways to use all build tools from LLVM.
The lld linker has revealed that we were using absolute addresses in some asm helpers and so load time relocation was necessary. This change fixes it by replacing all of those by PC relative ones.
Use new images that have lld linker
There is no support for lld on that platform yet
Also update PGO stuff to not to use gold linker when lld is present.
d60032b
to
9bfa3a0
Compare
5574fcb
to
2dabaa8
Compare
2dabaa8
to
f1abcd3
Compare
So it seems to be working except for a problem with our lab machines that build for Alpine. They use an old clang 5.0.1 (as they run Alpine 3.9). And the lld that's installed on those machines seems to have some bug that causes a build failure. |
I've just verified that Alpine build from 3.13 works fine on Alpine >=3.11. But for this PR, I am going to disable lld for MUSL so that we don't rush the decision on the Alpine version to use for building. |
Http3_MsQuic test failures are tracked in #56090. Should be fixed and merged shortly. Sorry for the inconvenience! |
I have decided to change the way we force the lld. The problem is that the way it is done now, the compiler detection code in cmake won't see the lld linker. While it is not a problem at the moment, I am preparing a change to cross build without the need to install gnu cross tools on the host. In that case, the compiler check fails, as the default linker it looks for is the cross tool one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a few questions.
@@ -23,7 +23,10 @@ function(add_pgo TargetName) | |||
else(CLR_CMAKE_HOST_WIN32) | |||
if(UPPERCASE_CMAKE_BUILD_TYPE STREQUAL RELEASE OR UPPERCASE_CMAKE_BUILD_TYPE STREQUAL RELWITHDEBINFO) | |||
target_compile_options(${TargetName} PRIVATE -flto -fprofile-instr-generate) | |||
set_property(TARGET ${TargetName} APPEND_STRING PROPERTY LINK_FLAGS " -flto -fuse-ld=gold -fprofile-instr-generate") | |||
set_property(TARGET ${TargetName} APPEND_STRING PROPERTY LINK_FLAGS " -flto -fprofile-instr-generate") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, LLD supports the same PGO linker flags as GOLD, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. I have verified that a release build with -pgoinstrument passes and that the compiler was really passed these options.
@@ -855,7 +855,6 @@ elseif(CLR_CMAKE_TARGET_ARCH_ARM64) | |||
set(VM_SOURCES_DAC_AND_WKS_ARCH | |||
${ARCH_SOURCES_DIR}/stubs.cpp | |||
exceptionhandling.cpp | |||
gcinfodecoder.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change intentionally part of this PR? It seems unrelated at a glance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there were some multiply defined symbols due to double inclusion of this file and LLD complained.
@@ -14,7 +14,7 @@ parameters: | |||
# This is the default platform provided by Arcade, intended for use by a managed-only repo. | |||
defaultManagedPlatform: | |||
name: 'Managed' | |||
container: 'mcr.microsoft.com/dotnet-buildtools/prereqs:centos-7-3e800f1-20190501005343' | |||
container: 'mcr.microsoft.com/dotnet-buildtools/prereqs:centos-7-20210714125435-9b5bbc2' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will likely cause a merge conflict when the next time arcade folder is mirrored https://github.com/dotnet/arcade/blob/dde80826e2fe464ef34182d2f786a6a51121d9a6/eng/common/templates/jobs/source-build.yml#L17.
There was a proposal to send a heads-up comment from bot account on PR that is modifying eng/common and the author is not dotnet-bot. I think it can save the author the hassle of undoing-and-redoing-the-right-way (which has happened a few times in runtime, aspnetcore and other repos).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you for noticing this, I have not noticed this was in the common folder. I'll make the same change in arcade, that should fix the future issue, I believe (I did something like that in the past).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, reading the comment in the file above the change, we can just revert the change in it as it is used for managed stuff only and lld is not needed in there. I've updated all references to the centos image I've found and haven't noticed the purpose.
@omajid I think 6.0.100-preview.7.21379.14 (and probably a couple of earlier versions) should have this fix in it. It doesn't look like there's an ARM64 RPM, is that something you can work with? |
Missing RPM for aarch64 shouldn't be a problem. But this still doesn't work for me end-to-end:
It looks like dotnet-sdk-6.0.100-preview.6.21355.2-linux-arm64.tar.gz is still broken? Am I doing something wrong? |
The preview.6 SDK doesn't have the fix. |
I changed it in
The binlog just says |
Exit code 139 maps to the signal handler code 11 (= 139 - 128). 11 is SIGSEGV. We have a segfault. Maybe running it under a native debugger will help? |
It looks like it's still using preview6 for some parts of the build:
I'll look into this a bit more. |
It appears that if the preview6 runtime is available the build prefers to use that instead of preview7. So in addition to my previous steps:
there needs to be afterwards
After that we get through the build but not tests:
|
@janvorli Any idea where to go from here? Is this the same stack trace you were seeing? |
@crummel I had a look, and I don't think these changes are on the |
The |
Thanks, I'll give it a try with that. |
I got the same stacktrace in 6.0.100-rc.1.21411.2, and anything after that (commit dotnet/installer@cfe0456 and after) segfaults immediately:
I don't have the symbols for the bootstrap version so I have nothing useful yet:
I'm cross-building an ARM64 SDK now so I'll have the symbols and hopefully a better idea of what's going on. |
@crummel can you check |
That looks like the issue:
|
This PR is meant to cause them to be in different sections. If we're sure the binaries were compiled with this change, we need to figure out why it didn't have an effect. |
It does look like this PR commit is in the RC1 I'm using:
@janvorli Any idea why this fix wouldn't be working? Anything else I should check? |
I think it would not work if for some reason the build machine didn't have the lld linker installed. Can we get a build log from the machine it was built on? Maybe I've missed an update in a part of the infra somewhere. |
Hmm, this is weird, I can repro it with a local cross build for arm64 too. So maybe it got broken after my change or I've made some mistake in a final cleanup before finalizing the PR. |
This is the runtime build in that SDK according to BAR: https://dev.azure.com/dnceng/internal/_build/results?view=results&buildId=1287635. Logs are only good for two more days. |
I can repro it locally. Even with checking out the state of the repo right after my fix. So I must have made some mistake. |
Clang on ARM64 places the .rodata section into the same segment
as .text. On RHEL 8 ARM64, the kernel is configured for 64kB
memory pages. When we flip the page protection of the page containing
the GS cookie to RW and back to RO, we assume that the cookie lives
in a non-executable memory. This assumption is broken on RHEL 8 and
we end up setting protection of a part of the coreclr code to read
only instead of back to RX.
This change switches the linker we use to lld from the previously
used gnu linker. That linker places .rodata into a different segment
than .text by default. Moreover, I was planning to move to using
lld anyways to use all build tools from LLVM.
Enabling linking using lld revealed a problem with ARM (32 bit) target.
We were using absolute addresses in some asm helpers and so load time
relocation was necessary. This change fixes it by replacing all of those by
PC relative ones.