-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
0.7.1 tests failing in ppc64le with g++14 -O2 #464
Comments
@musicinmybrain Thanks for reporting. From glancing at the logs I see this is also compiled with Two significant questions for reflection:
Having an answer to 2 would help in revising/improving the code. However, while the answer to 1 can be found with some effort (given the constraints with this architecture), I'm not sure that the answer to 2 can be found. But possibly the answer to 1 will significantly narrow the search space. Also, while we were hypothesizing in #440 that the cause for the problem is UB, it could also be that gcc has a bug in its frontend, and the bug manifests only with the flags from 1. Chasing this bug will be significantly harder than #440, as I don't have access to ppc64le hardware, and any attempt from my side at diagnosing and fixing the problem must involve a CI roundtrip. Do you have access to ppc64le? |
I don’t have general-purpose interactive shell access to real I can do as many experimental RPM package builds in COPR as I want using real I can reproduce the error in an emulated
Something like that is probably the most promising avenue if you want to try to play with this interactively yourself. I’m happy to run tests with the tools at my disposal. The flags corresponding to each of the optimization levels are documented at https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html. I wonder if this happens with |
For Fedora’s purposes, I could increase the optimization level to I just confirmed that tests pass with If there is a single flag that triggers test failures between |
This actually fails with It’s possible that some of the other default flags in Fedora (including enabling LTO) are affecting this. |
I am trying to create a Minimal Verifiable Example. Looking at the logs you originally posted, I see quickstart fails: That's convenient to have a MVE! So I edited the quickstart example to run only the function where this happens, Unfortunately none of these fail :-/. Can you try to get it to fail on your end? Relevant commands should be: cd samples/add_subdirectory
mkdir -p build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS_RELEASE:STRING="...." # fill with appropriate options here
cmake --build build --config Release --target run |
So I started by trying to imitate Fedora’s build flags as closely as I reasonably could, with the idea that if I could reproduce this with
I haven’t had a chance to try to minimize the compiler flags yet, but that looked interesting enough to go ahead and report it! |
The warning persists when I drop I’ve reduced the compiler flags to
while still reproducing the segfault. I assume I can reduce them further, but it takes a while to compile the example under emulation. |
Same segfault down to
|
If I restore the LTO flags (
|
Thanks for the investigation and the repro -- this is invaluable. That is exactly the same error as in your logs, so this means we have a reliable way to reproduce on a local machine. I did not know about mock, and I am only somewhat familiar with qemu, but now there's a very good reason to pick these up. I'll report back when I'm able to reproduce on my end. |
As for the ODR problem, let's track it in the linked issue. Thanks for reporting that one too! |
set -xe ; cd ~/proj/rapidyaml/samples/add_subdirectory/ ; sys=powerpc64le ; cxxflags="-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11 -DNDEBUG" ; bdir=build/$sys ; mkdir -p $bdir ; cmake -S . -B $bdir -D CMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags" --toolchain ../../.github/toolchains/linux-gnu-$sys.cmake ; cmake --build $bdir --target run -j --verbose
Cannot reproduce it in ubuntu using powerpc64le-linux-gnu-g++ together with qemu-ppc64le-static... Here's what I did, which succeeds:
Also, adding (BTW, I've tweaked the quickstart to make compilation faster; it now goes under 5s in my machine) |
I tried further now, using a different ubuntu distro. No luck, could not reproduce it with any of those flags. If the compiler flags work in one OS, but not in another OS, it is starting to look like something else in the environment, like compiler/linker version. I don't have easy access to any RedHat distro to use FWIW in the previous comment I used ubuntu22.04, which has root@bach:/src# powerpc64le-linux-gnu-g++ --version
powerpc64le-linux-gnu-g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
root@bach:/src# qemu-ppc64le-static --version
qemu-ppc64le version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.22) now I used ubuntu18.04 which has root@bach:/rapidyaml# powerpc64le-linux-gnu-g++ --version
powerpc64le-linux-gnu-g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
root@bach:/rapidyaml# qemu-ppc64le-static --version
qemu-ppc64le version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.42) This does not confirm any version related suspicion. FWIW, the prior setup was simply $ apt-get install -y \
gcc-powerpc64le-linux-gnu \
g++-powerpc64le-linux-gnu \
qemu \
qemu-system \
qemu-user-static and then running the command in the previous comment. |
I also tried running all the tests you listed as failing. None of them failed...
|
set -xe ; cd ~/proj/rapidyaml/samples/add_subdirectory/ ; sys=powerpc64le ; cxxflags="-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11 -DNDEBUG" ; bdir=build/$sys ; mkdir -p $bdir ; cmake -S . -B $bdir -D CMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags" --toolchain ../../.github/toolchains/linux-gnu-$sys.cmake ; cmake --build $bdir --target run -j --verbose to run the tests: ( set -xe ; \ cxxflags="-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11 -DNDEBUG" ; \ cd /rapidyaml/ ; \ sys=powerpc64le ; \ bdir=build/$sys ; \ mkdir -p $bdir ; \ cmake -S . -B $bdir \ -D CMAKE_BUILD_TYPE=Release \ -DRYML_BUILD_TESTS=ON \ -DGIT=/usr/bin/git \\ -DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags" \ --toolchain .github/toolchains/linux-gnu-$sys.cmake ; \ for t in quickstart parse_engine_6_qmrk parse_engine_7_seqimap tree emit json merge seq map scalar_null anchor number ; do \ cmake --build $bdir -j --verbose --target ryml-test-$t \ && qemu-ppc64le-static $bdir/test/ryml-test-$t \ || echo "\n\n\nFAILED\n\n" ; \ done )
Hmm, maybe this one could be toolchain-specific after all. I reduced the command line to There are official Fedora docker images at https://hub.docker.com/_/fedora/, but I do not have any insight into combining docker with You could also use a VM, either installing from an ISO (https://fedoraproject.org/workstation/download) or using one of the Cloud Base Images (https://fedoraproject.org/cloud/download). You could use a All of my tests for this particular issue have been in a Fedora Rawhide chroot, with GCC 14.1.1 and binutils 2.42.90. I’ll try the same process from #464 (comment) on some other releases and see if the issue can be reproduced there. Fedora 41 and 40 are both temporarily ahead of Rawhide with GCC 14.2.1; F41 has binutils 2.43 and F40 has binutils 2.41. Fedora 39 has GCC 13.3.1 and binutils 2.40. I can also try some enterprise distros, like CentOS Stream 10 with GCC 14.1.1 and binutils 2.41, Alma Linux 9 with GCC 11.4.1 and binutils 2.35.2, and Rocky Linux 8 with GCC 8.5.0 and binutils 2.30. (I could try clang, too, but I’ll look at the various GCC’s first.) |
Reproduced on:
Failed to reproduce on:
It’s looking very likely that this is a regression specific to GCC 14. Whether it’s a compiler bug / miscompilation – which is very possible – or perhaps a problem in rapidyaml revealed by a new optimization, I couldn’t say. |
Made some progress: was able to reproduce with a gcc14.2 cross-compiler, and unable to reproduce with gcc13.2. Using just I did so by getting a docker image with a gcc14(ubuntu24.10) and gcc13(ubuntu24.04) cross-compiler: #img=ubuntu:24.04 # uses gcc13.2
img=ubuntu:24.10 # uses gcc14.2
docker pull $img ; docker run --rm -it --network host -v ~/proj:/proj $img ... then inside the container: apt-get update ; apt-get install -y \
g{++,cc}-powerpc64le-linux-gnu \
build-essential \
cmake \
git \
qemu-user-static
# define a function to try with different compiler flags
function try464()
{
cxxflags="$1"
cmkflags="$2"
( set -xe ; \
cd /proj/rapidyaml/samples/add_subdirectory/ ; \
sys=powerpc64le ; \
bdir=build/$sys-docker ; \
mkdir -p $bdir ; \
cmake -S . -B $bdir \
--toolchain ../../.github/toolchains/linux-gnu-$sys.cmake \
-DCMAKE_BUILD_TYPE=Release \
-DGIT=git \
$cmkflags \
-DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags" \
; \
cmake --build $bdir --target run -j --verbose )
} ... and finally: $ try464 "-O2" # fails with gcc14, succeeds with gcc13
$ try464 "-O2" "-DRYML_DBG=ON" # succeeds
$ try464 "-O1" # succeeds
$ try464 "-O3" # succeeds FWIW, here's the error. Notice it's missing the keys:
EDIT: @musicinmybrain I looked again the error you posted above; notice how it's different:
|
Nice! At least the association with GCC 14 seems to be consistent, and we have a cross-distro reproduction. |
A further finding: this seems to be related with the compiler frontend. I tried with both gcc14.1 and 14.2 for x64, and was not able to reproduce. |
I narrowed down the quickstart error to this function: template<class EventHandler>
csubstr ParseEngine<EventHandler>::_maybe_filter_key_scalar_dquot(ScannedScalar const& C4_RESTRICT sc)
{
csubstr maybe_filtered = sc.scalar;
if(sc.needs_filter)
{
if(m_options.scalar_filtering())
{
maybe_filtered = _filter_scalar_dquot(sc.scalar);
}
else
{
_c4dbgp("dquo scalar left unfiltered");
m_evt_handler->mark_key_scalar_unfiltered();
}
}
else
{
_c4dbgp("dquo scalar doesn't need filtering");
}
return maybe_filtered;
} When this function is called with As with #440, enabling RYML_DBG (or adding any manual prints inside the function) made the problem go away. There were also other weird behaviours I observed during debugging, and I can tell more about them if you wish. But eventually I tried the following: template<class EventHandler>
csubstr ParseEngine<EventHandler>::_maybe_filter_key_scalar_dquot(ScannedScalar const& C4_RESTRICT sc)
{
- csubstr maybe_filtered = sc.scalar;
if(sc.needs_filter)
{
if(m_options.scalar_filtering())
{
- maybe_filtered = _filter_scalar_dquot(sc.scalar);
+ return _filter_scalar_dquot(sc.scalar);
}
else
{
_c4dbgp("dquo scalar left unfiltered");
m_evt_handler->mark_key_scalar_unfiltered();
}
}
else
{
_c4dbgp("dquo scalar doesn't need filtering");
}
- return maybe_filtered;
+ return sc.scalar;
} Which enabled the quickstart test to succeed. After that, I did the same to every other I'm pushing the changeset now; please try on your end, and let me know what happens. |
I repeated the exact steps in #464 (comment), including the “unminimized”
Next I’ll try applying master...fix/464_ub_ppc64le as a patch to 0.7.1, and double-check that the full test suite passes on all architectures. |
That worked. I also managed a successful test-build of jsonnet, which is the only thing in Fedora that currently depends on rapidyaml. It would be nice to really understand the root cause, but at least everything seems to be working. Thanks for taking the time to investigate all of these reports. Are you planning a release with the fixes, or should I patch 0.7.1? |
Don't patch, I'll cleanup this branch and make a new release. I'm somewhat bothered by this situation (eg, this issue, and #440, and the previous problem referred therein). The functions I now had to change look innocuous, and their call sites also look innocuous. It is remarkable that the optimizer is choking on this code, and it is even more remarkable that Even if UB is the ultimate reason, I find it hard to surmise how that is so. So do the many static analysis tools, and none of them point at anything suspicious. Given that this issue is so specific (ie only gcc14+ppc64le+-O2), having a test in the CI does not make much sense (or otherwise the already gigantic size of this project's CI would balloon). So I'm just going to do the changes and merge them in. |
If this turns out to be a bug/regression in the |
I remember that you said |
Oh, right, the segfault. Well, I did try with I guess I'll try again with the full flags. |
Did you try |
I just tried again on the tip of the branch with gcc 14.2. All of these succeed: try464 "-O2"
try464 "-O3"
try464 "-O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong"
try464 "-O3 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong" |
Since #467 is a proven fix to at least some of the problems here, I'm merging it now; doing so will close the issue. Feel free to reopen if the segfault persists. |
I just now tried #464 (comment) again, with the following differences:
All the tests passed. |
@musicinmybrain v0.7.2 is now available. Thanks for your help in chasing this issue. |
Thank you! This seems to be working well for me, so I’m going to update the |
TL;DR full symptoms, gathered below:
Original report
v0.7.1 is showing quite a few regressions on
ppc64le
only:builder-live.log
(I am not sure if I tested 0.7.0 thoroughly on non-
x86_64
architectures or not.)Originally posted by @musicinmybrain in #440 (comment)
The text was updated successfully, but these errors were encountered: