Issue with LTO build of ei #5609

Closed
saleyn opened this issue Jan 15, 2022 · 14 comments

@saleyn
Contributor

saleyn commented Jan 15, 2022

Describe the bug
ei_* functions are reported as undefined by the linker when linking with libei.a.

This issue appears in some builds of the 24.2 release (presently reproduced on Arch Linux and macOS), where libei.a is built with LTO, which leaves the ei_* symbols undefined when linking against that library.

An illustration of this error can be found here.

Notably, this doesn't happen when OTP 24.2 is compiled locally from source. But when the pre-built package is installed from the Arch/macOS distribution and a project containing a port program that depends on libei.a is built, undefined symbols are reported.

I believe it boils down to libei.a getting compiled with LTO, which is not something that happened in the releases prior to 24.2.

$ nm /usr/lib/erlang/lib/erl_interface-5.1/lib/libei.a | grep -B2 -A10 ei_connect
nm: ei_connect.o: plugin needed to handle lto object

ei_connect.o:
0000000000000001 C __gnu_lto_slim

ei_resolve.o:
0000000000000001 C __gnu_lto_slim

eirecv.o:
0000000000000001 C __gnu_lto_slim

send.o:
0000000000000001 C __gnu_lto_slim

Whereas a locally compiled 24.2 has a different output:

$ nm /opt/sw/erlang/24.2/lib/erlang/lib/erl_interface-5.1/lib/libei.a | grep ei_connect
ei_connect.o:
0000000000003050 T ei_connect
                 U ei_connect_ctx_t__
0000000000000a40 t ei_connect_helper
0000000000003110 T ei_connect_host_port
0000000000003080 T ei_connect_host_port_tmo
0000000000002e00 T ei_connect_init
0000000000000018 b ei_connect_initialized
0000000000002a50 T ei_connect_init_ussi
0000000000002ef0 T ei_connect_tmo
0000000000002a30 T ei_connect_xinit
0000000000002480 T ei_connect_xinit_ussi
                 U ei_connect_t__
0000000000000910 T ei_connect_ctx_t__
0000000000001360 T ei_connect_t__
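
The two nm listings above can be told apart mechanically by looking for the __gnu_lto_slim marker. What follows is only a minimal sketch: the function name classify_nm_output and its three labels are made up for illustration and are not part of OTP or binutils.

```shell
# Hypothetical helper: reads `nm` output on stdin and prints one label.
#   slim-lto  the __gnu_lto_slim marker is present (as in the packaged libei.a)
#   has-code  no marker, and at least one defined symbol (T, t, D, d, B, b, R, r)
#   empty     neither of the above
classify_nm_output() {
    awk '
        /__gnu_lto_slim/  { slim++ }
        / [TtDdBbRr] /    { defined++ }
        END {
            if (slim > 0)         print "slim-lto"
            else if (defined > 0) print "has-code"
            else                  print "empty"
        }'
}

# Possible usage, with the path taken from the report above:
#   nm /usr/lib/erlang/lib/erl_interface-5.1/lib/libei.a 2>/dev/null | classify_nm_output
```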
@saleyn saleyn added the bug Issue is reported as a bug label Jan 15, 2022
@rickard-green rickard-green added the team:VM Assigned to OTP team VM label Jan 17, 2022
@garazdawi
Contributor

Hello! What versions have you tested? I see in the linked issue that you have tried 24.1 and that worked? From what git tells me, we have done zero changes in erl_interface between 24.1 and 24.2, so it seems odd that this is something that we have caused.

Do you know what CFLAGS and LDFLAGS are being passed to configure?

@saleyn
Contributor Author

saleyn commented Jan 17, 2022

In the Arch package repository used to build the erlang distribution package we can find:

The last link shows that Erlang was configured with:

./configure \
    --enable-builtin-zlib \
    --enable-smp-support \
    --prefix=/usr \
    --with-odbc \
    --with-wx-config=/usr/bin/wx-config-gtk3

but I don't see any special assignments to LDFLAGS and CXXFLAGS there aside from:

sed -i 's/^LDFLAGS = /LDFLAGS += /g' otp/lib/megaco/src/flex/Makefile.in
sed -i 's/^LDFLAGS =  /LDFLAGS += /g' otp/lib/odbc/c_src/Makefile.in

I cannot reproduce this when building locally in my environment from sources, and would think that it's a build issue that needs to be addressed by the maintainer of the Arch package. However, the user of my erlexec library reported the same issue with a macOS distribution, which makes me think that it could be something related to the version of the gcc compiler/linker which triggers the LTO issue.

I tried downgrading the erlang package to 24.1.7 using:

sudo pacman -U https://archive.archlinux.org/packages/e/erlang/erlang-24.1.7-1-x86_64.pkg.tar.zst

and this doesn't exhibit the libei.a linking issue.
However, upgrading back to the latest 24.2 still produces the linking error:

$ sudo pacman -U https://archive.archlinux.org/packages/e/erlang/erlang-24.2-1-x86_64.pkg.tar.zst
$ pwd
/home/serge/projects/erl-libs/erlexec
$ DEBUG=1 make
...
===> Linking /home/serge/projects/erl-libs/erlexec/priv/x86_64-pc-linux-gnu/exec-port
===> sh(g++ c_src/ei++.o c_src/exec.o c_src/exec_impl.o  -lcap  -L"/usr/lib/erlang/lib/erl_interface-5.1/lib" -lei -o /home/serge/projects/erl-libs/erlexec/priv/x86_64-pc-linux-gnu/exec-port)
failed with return code 1 and the following output:
/usr/bin/ld: c_src/ei++.o: in function `ei::Serializer::print(std::ostream&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
/home/serge/projects/erl-libs/erlexec/c_src/ei++.cpp:62: undefined reference to `ei_s_print_term'
/usr/bin/ld: c_src/ei++.o: in function `ei::Serializer::read()':
/home/serge/projects/erl-libs/erlexec/c_src/ei++.cpp:145: undefined reference to `ei_decode_version'
/usr/bin/ld: c_src/exec.o: in function `ei::Serializer::decodeTupleSize()':
...
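
When comparing link failures like the one above across package versions, it can help to reduce the linker output to the bare list of missing symbols. This is a small sketch under the assumption that the messages follow the GNU ld wording shown in the log; the function name undefined_refs is hypothetical.

```shell
# Hypothetical helper: reads GNU ld error output on stdin and prints each
# symbol named in an "undefined reference to `sym'" message, once, sorted.
undefined_refs() {
    sed -n "s/.*undefined reference to \`\([^']*\)'.*/\1/p" | LC_ALL=C sort -u
}

# Possible usage:
#   make 2>&1 | undefined_refs
```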

I also left a comment for the maintainer of Arch's erlang package: https://bugs.archlinux.org/task/73240, which highlights a strange change in the size of libei.a:

Package version 24.2:

$ du -sb /usr/lib/erlang/usr/lib/libei*
74254 /usr/lib/erlang/usr/lib/libei.a
74254 /usr/lib/erlang/usr/lib/libei_st.a

Package version 24.1.7:

$ du -sb /usr/lib/erlang/usr/lib/libei*
282530 /usr/lib/erlang/usr/lib/libei.a
278736 /usr/lib/erlang/usr/lib/libei_st.a

@garazdawi
Contributor

I did some digging and found this announcement, and line 52 in it looks very much like what is happening to the nm output. So I tried running

strip -R .gnu.lto_* -R .gnu.debuglto_* -N __gnu_lto_v1 "/build/lib/erl_interface/obj/x86_64-pc-linux-gnu/libei.a" -o libei.a

and then I get an archive that looks just like the libei.a distributed by arch linux. I don't think that this is really a problem, as the strip should only remove some unneeded lto information. Can you try to build erlexec with a libei.a that is stripped using the strip above and see if you get the same problem?

@saleyn
Contributor Author

saleyn commented Jan 17, 2022

It doesn't look like strip affects the size of libei.a in the case of my locally built version 24.2:

$ du -b lib/erl_interface/obj/x86_64-pc-linux-gnu/libei.a
814802  lib/erl_interface/obj/x86_64-pc-linux-gnu/libei.a
$ strip -R .gnu.lto_* -R .gnu.debuglto_* -N __gnu_lto_v1 "lib/erl_interface/obj/x86_64-pc-linux-gnu/libei.a" -o libei.a
$ echo $?
0
$ du -b libei.a
814802  libei.a

If I try to run strip on the version of libei.a installed by the Arch package manager, it complains about the missing plugin:

$ strip -R .gnu.lto_* -R .gnu.debuglto_* -N __gnu_lto_v1 "/usr/lib/erlang/lib/erl_interface-5.1/lib/libei.a" -o libei.a
strip: stbapByx/ei_connect.o: plugin needed to handle lto object

@garazdawi
Contributor

I forgot to add that you probably need to pass the LTO flags to configure as well. From what I could understand they are:

./configure CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -flto" LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now" AR=gcc-ar RANLIB=gcc-ranlib

@saleyn
Contributor Author

saleyn commented Jan 17, 2022

When building with these flags and running strip, I am able to reproduce this problem with libei.a. So, is this an Arch build issue where they use the wrong set of flags when building erlang, or does the configure script need to be modified to remove -flto?

@garazdawi
Contributor

Can you reproduce the problem if you do not strip? Either way this to me is a problem in arch Linux and not with our build system.

@saleyn
Contributor Author

saleyn commented Jan 17, 2022

If I don't strip, the problem is not reproducible, but it is when libei.a is stripped using the command you provided.
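
This observation (unstripped works, stripped fails) is consistent with how GCC lays out LTO objects: on targets with linker plugin support, -flto by default produces slim objects whose payload is the intermediate representation in the .gnu.lto_* sections, while -ffat-lto-objects also emits regular object code. The sketch below tries to show the difference on a single throwaway function. It assumes GCC and GNU binutils are installed; hypothetical_fn, lto_strip_demo and the temporary file names are invented for the example, and the result for the slim case may differ with other binutils versions or when nm can load the LTO plugin.

```shell
# Hypothetical demo: compile one function as a slim and as a fat LTO object,
# strip the LTO sections the same way as the strip command in this thread,
# and report how many times nm still lists the function as defined code.
lto_strip_demo() {
    workdir=$(mktemp -d) || return 1
    printf 'int hypothetical_fn(void) { return 42; }\n' > "$workdir/f.c"
    for mode in -fno-fat-lto-objects -ffat-lto-objects; do
        gcc -O2 -flto "$mode" -c "$workdir/f.c" -o "$workdir/f.o" || return 1
        # Patterns are quoted so the shell does not expand them as globs.
        strip -R '.gnu.lto_*' -R '.gnu.debuglto_*' -N __gnu_lto_v1 \
            "$workdir/f.o" -o "$workdir/stripped.o" 2>/dev/null || true
        count=$(nm "$workdir/stripped.o" 2>/dev/null | grep -c ' T hypothetical_fn' || true)
        printf '%s: %s\n' "$mode" "$count"
    done
    rm -rf "$workdir"
}

# Possible usage:
#   lto_strip_demo
```

The fat object should still report the function after stripping, since its machine code lives in ordinary sections that the strip command does not touch.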

@saleyn
Contributor Author

saleyn commented Jan 17, 2022

> Either way this to me is a problem in arch Linux and not with our build system.

Since it appears to be an issue in both Arch and macOS repositories, maybe configure needs to strip -flto, and introduce an explicit switch to enable LTO in erl_interface, so that the package maintainers don't need to make changes to the way the package is built (or maybe you can suggest a better solution)?

Now that the cause of the problem is better understood, is there a way to link with the stripped libei.a compiled with LTO that would resolve all symbols?

@garazdawi
Contributor

> Since it appears to be an issue in both Arch and macOS repositories, maybe configure needs to strip -flto, and introduce an explicit switch to enable LTO in erl_interface, so that the package maintainers don't need to make changes to the way the package is built (or maybe you can suggest a better solution)?

The package maintainers have asked for lto to be enabled, not sure that it is a good thing to disable it automatically. That the archive works when not stripped seems to indicate that, at least on arch, lto works for archives. But maybe it should be disabled anyways as the intermediate format gcc uses for lto is very version-dependent from what I gather. (Coincidentally we would not have this problem if the dynamic library in #5601 was used).

For reference, this is the patch that applies the lto strip to pacman, https://lists.archlinux.org/pipermail/pacman-dev/2021-March/024911.html.

Do you know what configure flags were passed to the macOS build? Would be interesting to see how it manifests there and if the same strip is needed.

> Now that the cause of the problem is better understood, is there a way to link with the stripped libei.a compiled with LTO that would resolve all symbols?

I don't know.

@saleyn
Contributor Author

saleyn commented Jan 18, 2022

> Do you know what configure flags were passed to the macOS build? Would be interesting to see how it manifests there and if the same strip is needed.

This bug on macOS was reported by @Skoda091, who used macOS Monterey 12.1. I don't have a mac environment to test, and the 24.2 macOS build github action workflows pass successfully.

@saleyn
Contributor Author

saleyn commented Jan 18, 2022

FWIW, added a bug report to Arch's makepkg.

Also, adding -ffat-lto-objects to CFLAGS when configuring the erlang package does solve this problem when libei.a is later stripped:

$ ./configure CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -flto -ffat-lto-objects" LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now" AR=gcc-ar RANLIB=gcc-ranlib
$ make
...
$ strip -R .gnu.lto_* -R .gnu.debuglto_* -N __gnu_lto_v1 "lib/erl_interface/obj/x86_64-pc-linux-gnu/libei.a" -o libei.a && mv libei.a lib/erl_interface/obj/x86_64-pc-linux-gnu/libei.a
$ sudo make install
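
A packaging script could guard against this by checking the flags before running configure. The following is a rough sketch, not something OTP or makepkg provides: the function name check_lto_cflags and its three labels are hypothetical. It only does plain string matching, assumes GCC's default of slim objects when -flto is given without -ffat-lto-objects, and does not handle a later -fno-fat-lto-objects overriding an earlier flag.

```shell
# Hypothetical helper: classify a CFLAGS string.
#   no-lto    -flto is not requested
#   fat-lto   -flto together with -ffat-lto-objects (archive survives the strip)
#   slim-lto  -flto alone (stripping the LTO sections would leave no code)
check_lto_cflags() {
    flags=" $1 "
    case "$flags" in
        *" -flto "*|*" -flto="*) ;;            # LTO requested, keep checking
        *) echo "no-lto"; return 0 ;;
    esac
    case "$flags" in
        *" -ffat-lto-objects "*) echo "fat-lto" ;;
        *) echo "slim-lto" ;;
    esac
}

# Possible usage:
#   [ "$(check_lto_cflags "$CFLAGS")" = "slim-lto" ] && CFLAGS="$CFLAGS -ffat-lto-objects"
```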

@saleyn saleyn closed this as completed Jan 18, 2022
@garazdawi
Contributor

> Also adding CFLAGS="-ffat-lto-objects" to configure the erlang package does solve this problem when libei.a is later stripped.

Aha, that makes sense. I didn't know about that option.

@saleyn
Contributor Author

saleyn commented Jan 26, 2022

The maintainer of the erlang Arch Linux package made this change, so all should be good with libei.a going forward.
