Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lxc/attach: Revert "- LXC attach should exit on SIGCHLD" #4517

Merged
merged 1 commit into from
Jan 24, 2025

Conversation

mihalicyn
Copy link
Member

@mihalicyn mihalicyn commented Jan 24, 2025

This reverts commit f021584 (from #4509)

Let's revert this change as it introduces 2 regressions:

  1. it's not correct to do exit(2) from a signal handler in this case, as we skip a proper cleaning procedures like restoring PTY configuration state (see lxc_terminal_delete()) which leads to a problem with a PTY after lxc-attach exits.

[ hint: just try to use lxc-attach on a main branch with this change and you will see it. After lxc-attach exits you won't be able to type anything in your current terminal session as it's messed up. ]

  1. this introduces race-condition in the code which leads to a regression on LXD/(and I believe Incus too) which can be seen as random "Failed to retrieve PID of executing child process" errors on "lxc exec"/"incus exec" commands. It's extremely hard to reproduce, but my guess is that we are getting a race condition here, because by the time when we set a new signal handler for SIGCHLD, transient process is still alive and when it exists it generates SIGCHLD which may lead to exit().

  2. This changes a behavior of lxc-attach which was there for years and it's quite scary to be honest. I'm not against having this change, but in a different form, for example we can add a new command line parameter for lxc-attach command which will enable this behavior.

  3. This changes behavior for ->attach() API call which is even worse.

My first attempt was to fix that change to prevent race, but then I've noticed that we also have a more serious problem described in (1), this requires more work to do.

This reverts commit f021584.

Let's revert this change as it introduces 2 regressions:
1. it's not correct to do exit(2) from a signal handler in this case,
as we skip a proper cleaning procedures like restoring PTY configuration
state (see lxc_terminal_delete()) which leads to a problem with a PTY after lxc-attach exits.

[ hint: just try to use lxc-attach on a main branch with this change and you will
see it. After lxc-attach exits you won't be able to type anything in your
current terminal session as it's messed up. ]

2. this introduces race-condition in the code which leads to a
regression on LXD/(and I believe Incus too) which can be seen as
random "Failed to retrieve PID of executing child process" errors
on "lxc exec"/"incus exec" commands. It's extremely hard to reproduce,
but my guess is that we are getting a race condition here, because
by the time when we set a new signal handler for SIGCHLD, transient process
is still alive and when it exists it generates SIGCHLD which may lead to
exit().

3. This changes a behavior of lxc-attach which was there for *years*
and it's quite scary to be honest. I'm not against having this change, but
in a different form, for example we can add a new command line parameter for
lxc-attach command which will enable this behavior.

My first attempt was to fix that change to prevent race, but then
I've noticed that we also have a more serious problem described in (1),
this requires more work to do.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
@mihalicyn
Copy link
Member Author

cc @asainkujovic

Probably, we should reopen #4507 after merging this.

@stgraber stgraber merged commit 9e95451 into lxc:main Jan 24, 2025
13 checks passed
@asainkujovic
Copy link
Contributor

asainkujovic commented Jan 26, 2025

@mihalicyn
about (1) "it's not correct to do exit(2)", you are right, raise(SIGTERM) should be used, and it lead to lxc_term_delete well.

about (2) "... transient process could still be alive and when it exists it generates SIGCHLD" ... you are right, attaching handler line signal(SIGCHLD,...) should have been just few lines bellow, right after ret = wait_for_pid(pid) where code ensures that transient process is finished. Then only child left behind is attached process (usually bash shell), and it should be the only process that will send SIGCHLD on own exit, subtree (granchildren) does not generate it ('initial process does not receive SIGCHLD from them).

i am not aware of all usecase-s of lxc_attach function, and if in API ->attach() is sometimes preferable that attach should wait for all subtree (all possible grand-children), then new flag should be introduced for this like you said.

with those 2 modifications and new cmd-line-param it looks like this:
main...asainkujovic:lxc:sigchld_with_flag

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants