-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AArch64] BOLT asserts when encoding out-of-range Pending Relocations #116817
Comments
S1 sounds good to me. We just have to make sure that patching of old code is enabled, e.g. via
If a function is not going to be emitted, then there will be no reference/relocation recorded. Cold functions are not emitted in "lite" mode which is default for x86, but not AArch64. llvm-project/bolt/lib/Core/BinaryFunction.cpp Lines 1565 to 1576 in 14a259f
|
I may be missing some connection between scanExternalRefs+flushPendingRelocations and PendingPatches of PatchEntries.cpp. Do you want to introduce another requirement for when scanExternalRefs/flushPendingRelocations should run? ie only when |
Let me add more context on why we need to patch the original function entries. Whenever we duplicate a function body for execution purposes, we also need to duplicate all associate metadata, e.g. .eh_frame info, exception ranges, DWARF debug info, etc. If the function uses jump tables, those need to be duplicated as well. Currently, BOLT cannot duplicate all of the accompanying metadata and instead we force the execution of the new/optimized version while prohibiting execution of the original code. This is achieved in
For the reasons described above, it makes sense to patch entries in all original functions in most of the cases. Since we care about the performance, we have an exception for "small" functions ( llvm-project/bolt/lib/Passes/PatchEntries.cpp Lines 66 to 69 in b357495
There are also cases where the code cannot be patched due to size limitations. Whenever we fail to patch the original entries, we have to bail out on optimizing the function ( llvm-project/bolt/lib/Passes/PatchEntries.cpp Line 116 in b357495
Lastly, if we can guarantee that the old code never gets executed, we can skip patching altogether. This guarantee is achieved via successful run of If we are going to bail out on relocation processing for I'm going to experiment with enabling unconditional execution of |
What is the problem
When a function is ignored from BOLT optimization, some relocations are emitted to patch the original fixups in the original text. Those are generated during external reference scanning. This is an optimization for adjusting the initial calls to land back to the optimized code. The relevant relocations are of type
R_AARCH64_CALL26
.In large binaries the encoding does not fit the supported range, leading to an assertion:
Ignoring functions can be common. Some reasons include:
Reproducer
How the reproducer triggers the crash
The error is triggered through instrumentation due to code size increases.
When instrumenting MongoDB, BOLT requires to ignore some functions in order to succeed. When doing so, some additional relocations are emitted. However, instrumentation ends up increasing code size by more than 3x, which goes beyond those relocations' range. This causes an assertion crash when trying to encode such relocations (
R_AARCH64_CALL26
).Why BOLT requires to ignore some functions
This requirement was introduced by PR #89681. Before that stricter check, those functions were not required to be ignored and the bug was not triggered (ie, on LLVM 18 RC). PR #101466 improves support for entry-points other than the main one to the function, but still BOLT bails out in our use case.
Regardless, this issue will not focus on such improvements as the error can be triggered without it
Reproducer Instructions
Use use this Dockerfile and the below script to compile the input binary and pull it on an AArch64 Ubuntu host.
Alternatively, one can get it pre-compiled from here.
Then, compile bolt version
dcd62070cf45
with assertions ON and use it to instrument the binary with:LLVM_DIR=path/to/llvm/bin/dir $LLVM_DIR/llvm-bolt mongod -instrument -o mongod.instrumented \ --instrumentation-file=prof.fdata \ --instrumentation-sleep-time=60 \ --instrumentation-no-counters-clear \ --instrumentation-wait-forks \ --skip-funcs=_ZL9InterpretP9JSContextRN2js8RunStateE/1,_ZN2v88internal12_GLOBAL__N_18RawMatchIhEENS0_19IrregexpInterpreter6ResultEPNS0_7IsolateENS0_9ByteArrayENS0_6StringENS0_6VectorIKT_EEPiiiijNS0_6RegExp10CallOriginEj/1,_ZN2v88internal12_GLOBAL__N_18RawMatchIDsEENS0_19IrregexpInterpreter6ResultEPNS0_7IsolateENS0_9ByteArrayENS0_6StringENS0_6VectorIKT_EEPiiiijNS0_6RegExp10CallOriginEj/1
Proposed Solutions
A draft PR around (S1) will follow.
(S1) Apply pending relocs only when in bounds
Keep the 'external reference scan' optimization on, and apply it on a per callsite basis.
That is, if a particular external reference is within range, then the encoding of that relocation happens, otherwise it is ignored.
This may require some checks that the PendingRelocations were added specifically for optimization, ie during external reference scanning and that are
R_AARCH64_CALL26
.(S2) Hint users to use
-no-scan
Instead of the current assertion, BOLT could escalate this to an exit with a recommendation to use
-no-scan
to overcome this crash. This is not very obvious unless someone is familiar with specific BOLT code regions. Also, by bailing out at all times, we ensure that BOLT does not patch the code in some wrong way?Q: What if an 'external reference' points to a cold function? Is that pending relocation still flushed?
I assume that even if it does, it will have no effect.
The text was updated successfully, but these errors were encountered: