Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glibc: add enableCopyLibGccHack #210112

Closed
wants to merge 1 commit into from
Closed

glibc: add enableCopyLibGccHack #210112

wants to merge 1 commit into from

Conversation

ghost
Copy link

@ghost ghost commented Jan 10, 2023

Description of changes

There has been much wailing and gnashing of teeth over the "libgcc copying hack":

To facilitate the excision of this foulness, let's provide an option (on by default) for it. Turning this option off will be a first step towards removing the hack.

Things done

@ghost ghost mentioned this pull request Jan 10, 2023
4 tasks
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Jan 10, 2023
@ghost ghost marked this pull request as ready for review January 10, 2023 23:10
@tpwrules
Copy link
Contributor

What is the plan to turn this flag off? I am pretty sure it cannot be turned off conditionally based on architecture, as right now that copy is the one loaded dynamically by glibc at runtime for certain pthread functions. It will abort if it can't be found. The fact that it's from the wrong GCC and causes linker errors is sort of a separate issue.

But once that issue is fixed then we can just remove the preInstall. I don't see a situation where this flag is on in some cases and off in others.

@ghost
Copy link
Author

ghost commented Jan 11, 2023

What is the plan to turn this flag off?

This PR turns off enableCopyLibGccHack when you turn on gcc.enableExternalBootstrap:

@tpwrules
Copy link
Contributor

I think the hack is no longer necessary for bootstrap, but it is still necessary (or at least currently supports) for programs which uses pthreads. Please try to build pkgs.dejagnu, I wonder if it aborts during the testing phase.

@ghost
Copy link
Author

ghost commented Jan 11, 2023

Please try to build pkgs.dejagnu, I wonder if it aborts during the testing phase.

Indeed, I have observed the failure you describe. I will investigate.

@ghost
Copy link
Author

ghost commented Jan 11, 2023

This seems to be a libtclpthread_cancel() problem. I'm going to see if we can put a hack into libtcl in order to remove a hack from basically everything else.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 11, 2023

Yes, I provided additional context in #208412, particularly in option 2. glibc seems to need to dynamically load it in support of pthreads.

To me the crime is not that there is a libgcc_s.so in glibc (as keeping a copy in glibc is a convenient way to make sure glibc always has it and avoid increasing everybody's closure size by referencing GCC's libraries), but that it is currently the wrong version. I'm excited by all the work around cleaning up the bootstrap sequence and maybe the hack can be removed long-term, but I don't want to focus on it unless it helps upgrade GCC. The failure mode is particularly egregious too, with random programs randomly aborting under certain circumstances.

@ghost
Copy link
Author

ghost commented Jan 11, 2023

Root cause:

  • glibc commit 9d79e0377b08773ec4f7ec38479b1563606f7ef7
    • moved the stack unwinder out of glibc
    • replaced the compiler-agnostic implementation of pthread_cancel() with one that uses the stack unwinder
    • added a dlopen("libgcc_s") to access the unwinder

If killing threads needs compiler-specific information (which it probably does), what's supposed to happen when you compile two .o files, one with clang and one with gcc, link them together, and call pthread_kill()? Which compiler's unwinder should it use?

If libgcc_s's unwinder works on any compiler's binaries (using dwarf2 data?) then it ought to be its own library, separate from gcc.

@ghost
Copy link
Author

ghost commented Jan 11, 2023

To me the crime is not that there is a libgcc_s.so

The crime is that pthread_cancel() exists. Killing threads, if it is even allowed at all, is a language feature not a quasi-system-call.

(as keeping a copy in glibc is a convenient way to make sure glibc always has it and avoid increasing everybody's closure size by referencing GCC's libraries),

The solution to the closure size concern is to move libgcc into its own output of gcc and reference it. Copying like this defeats a lot of Nix's sanity checks.

@ghost ghost mentioned this pull request Jan 11, 2023
8 tasks
@ghost
Copy link
Author

ghost commented Jan 11, 2023

Added an explanation of the real reason why this hack was put in place.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 11, 2023

This is not quite the correct explanation, glibc itself does the abort and the application cannot stop it from happening in any way. You can see expect printing the message in strace to a presumably-already-closed fd. This also all applies to pthread_exit() which is a more reasonable call, and some stack unwinding support logic.

If you simply Google libgcc_s.so.1 must be installed for pthread_exit to work then you see this error has popped up in quite a wide variety of programs and scenarios, so I don't believe it's true that these pthread calls are used "very infrequently". I just provided that example as an easy trigger within nixpkgs. But the fact that it can show up anywhere is scary and I don't have a good solution. Nor to what happens if you mix compilers.

I don't think adding the extra output is a bad solution, but it will then require a hack to GCC to link it into every program which uses pthreads or even glibc to be safe. It is also probably smart to update the comment in nixpkgs to the current state of the problem.

@Mic92
Copy link
Member

Mic92 commented Jan 11, 2023

guix have their gcc patched so it will add libgcc to all libraries they build btw: https://github.com/guix-mirror/guix/blob/5e4ec8218142eee8e6e148e787381a5ef891c5b1/gnu/packages/gcc.scm#L243

@ghost
Copy link
Author

ghost commented Jan 11, 2023

This is not quite the correct explanation, glibc itself does the abort and the application cannot stop it from happening in any way.

Thanks, fixed

I just provided that example as an easy trigger within nixpkgs.

You provided the same example I did -- libtcl (via expect). Is there another? Asynchronous thread cancellation is generally considered to be a language misfeature -- Java deprecated it 20 years ago and Rust won't let you do it without unsafe.

I don't think adding the extra output is a bad solution,

I have an implementation of that and am testing it now.

but it will then require a hack to GCC

I don't believe so. The gcc.libgcc output becomes a dependency of glibc, which makes it a dependency of everything that could need it for pthread_cancel().

@tpwrules
Copy link
Contributor

Asynchronous thread cancellation is generally considered to be a language misfeature -- Java deprecated it 20 years ago and Rust won't let you do it without unsafe.

As explained in my previous comments

  • pthread_exit() is not asynchronous and is completely safe
  • pthread_exit() will also abort if libgcc_s.so cannot be found
  • Plenty of real-world code uses both functions, as evidenced by the reports of glibc's error message. Therefore we cannot break it.

The gcc.libgcc output becomes a dependency of glibc

How does this happen? It is not currently as libgcc_s.so is loaded dynamically. And making it so would be an essentially identical situation to this hack where glibc pulls the wrong libgcc_s.so into other processes, it would just be stored in a slightly different place.

@ghost
Copy link
Author

ghost commented Jan 12, 2023

As explained in my previous comments and commits

  • pthread_**cancel**() is asynchronous if PTHREAD_CANCEL_ASYNCHRONOUS is used.
  • if PTHREAD_CANCEL_ASYNCHRONOUS is not used, there is no need for stack unwinding, and no need for libgcc_s.
  • Real-world code rarely, if ever, uses PTHREAD_CANCEL_ASYNCHRONOUS. No examples have been presented in this thread.

We seem to be talking past each other.

@tpwrules
Copy link
Contributor

My point is I don't understand why it matters what glibc could actually do, the problem today is that it does need libgcc_s in common situations encountered by popular programs. For example, CPython (though they are trying to reduce its usage as this keeps biting them).

Do you propose changing (or at least make a note that it could be changed in the future) glibc to need libgcc_s in fewer circumstances, i.e. those which are unlikely to be encountered in real-world code? What about for older versions?

@ghost
Copy link
Author

ghost commented Jan 12, 2023

Do you propose changing (or at least make a note that it could be changed in the future) glibc to need libgcc_s in fewer circumstances, i.e. those which are unlikely to be encountered in real-world code? What about for older versions?

That would be nice. But the politicking required to get it merged is not something I have time for. It would also be nice if they had not deleted the old implementation that did not need libgcc_s.

@ghost
Copy link
Author

ghost commented Jan 12, 2023

guix have their gcc patched so it will add libgcc to all libraries they build btw: https://github.com/guix-mirror/guix/blob/5e4ec8218142eee8e6e148e787381a5ef891c5b1/gnu/packages/gcc.scm#L243

@Mic92, thank you so much for this link.

The most important part of that link is that it links to this mailing list posting. That was incredibly helpful.

I bet that explains why all previous attempts to pull libgcc_s from anywhere other than the exact same directory that contains glibc had failed. It also explains how to make them not fail.

@tpwrules
Copy link
Contributor

Proof of concept of removing the hack completely here. Can you point me to the commit which puts libgcc_s.so in a separate output?

@ofborg ofborg bot added 2.status: merge conflict This PR has merge conflicts with the target branch 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux and removed 2.status: merge conflict This PR has merge conflicts with the target branch 10.rebuild-linux: 1-10 labels Feb 13, 2023
@Artturin
Copy link
Member

Can this be closed?

@ghost
Copy link
Author

ghost commented May 1, 2023

Yep.

@ghost ghost closed this May 1, 2023
@ghost ghost deleted the pr/glibc/enableCopyLibGccHack branch May 1, 2023 10:08
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants