Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let exit_code be better aligned with C/POSIX #240

Open
EdSchouten opened this issue Feb 21, 2023 · 0 comments · May be fixed by #244
Open

Let exit_code be better aligned with C/POSIX #240

EdSchouten opened this issue Feb 21, 2023 · 0 comments · May be fixed by #244

Comments

@EdSchouten
Copy link
Collaborator

I am glad that REv2 already uses int32 as its data type for exit_code. Even though most UNIX systems historically only had 8-bit exit codes, modern versions of POSIX provide interfaces for obtaining the full integer value. Using functions like waitid(), you can obtain a siginfo_t whose si_status field contains the full value.

Where we do deviate from C/POSIX is that there are no facilities for indicating that a process did not terminate by calling _Exit(), but through a signal. This leads to confusing situations where people wonder why their build actions terminate with exit code 139, while in reality it terminated through SIGSEGV (signal 11, giving exit code 128+11=139). Even worse are modern versions of Go's os/exec package, which will always return an exit code of -1 in case of signal delivery. This means that users can't even figure out why their actions terminated.

One solution to this would be to change the existing field:

  // The exit code of the command.
  int32 exit_code = 4;

To something like this:

  oneof termination_reason {
    // The command terminated by calling the operating system's equivalent
    // of `_Exit()` with the code provided.
    int32 exit_code = 4;

    // The command terminated by the delivery of a signal.
    int32 signal_number = 13;
 }

Even though virtually all operating systems use 9 for SIGKILL and 15 for SIGTERM, C/POSIX make no such requirement. The only place where that does happen, is in the definition of the kill utility, where as part of the X/Open System Interfaces (XSI) they provide a hardcoded list of numerical command line options.

My recommendation would thus be to use this instead:

  oneof termination_reason {
    // The command terminated by calling the operating system's equivalent
    // of `_Exit()` with the code provided.
    int32 exit_code = 4;

    // The command terminated by the delivery of a signal. This field contains the name of
    // the signal, with the "SIG" prefix removed (e.g., "SEGV" for segmentation violations).
    string signal_name = 13;
 }

Unfortunately it's not safe for us to add a oneof to the protocol retroactively, as it would cause exit_code to be zero when signal_name is set. Maybe we can already add this to REv2 in the form of a separate field, with the remark that exit_code is set to a non-zero value if signal_name is set...

EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Mar 8, 2023
POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states
the following:

> There are two kinds of process termination:
>
> 1. Normal termination occurs by a return from main(), when requested
>    with the exit(), _exit(), or _Exit() functions; or when the last
>    thread in the process terminates by returning from its start
>    function, by calling the pthread_exit() function, or through
>    cancellation.
>
> 2. Abnormal termination occurs when requested by the abort() function
>    or when some signals are received.

Both these cases can be detected by parent processes separately.
Unfortunately, REv2 currently doesn't allow this, as we only provide an
exit code. Implementations such as Buildbarn tend to set the exit code
for abnormal termination to 128+signum, meaning that if a segmentation
fault occurs, an action terminates with exit code 139. This tends to be
fairly confusing for non-expert users.

The goal of this change is to make all of this less confusing. It adds a
new message type named ProcessTerminationResult, which contains the
termination result in a more structured way. It can now either be an
exit code or a signal number. If it turns out that other operating
systems provide even more ways, we can add additional `oneof` cases.

For the exit code in case of normal termination I decided to use an
int64 instead of int32. The reason being that POSIX allows you to get
the full exit code back, using APIs like waitid(). On ILP64 systems, the
exit code may thus be 64 bits.

For the signal in case of abnormal termination I decided to use a
string. The reason being that signal numbers are not covered by POSIX.
Everybody always assumes that 9 is SIGKILL, but POSIX only documents
this as a convention for the command line flags of the `kill` utility.
Not the the actual `SIG*` constants.

Fixes: bazelbuild#240
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Mar 10, 2023
POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states
the following:

> There are two kinds of process termination:
>
> 1. Normal termination occurs by a return from main(), when requested
>    with the exit(), _exit(), or _Exit() functions; or when the last
>    thread in the process terminates by returning from its start
>    function, by calling the pthread_exit() function, or through
>    cancellation.
>
> 2. Abnormal termination occurs when requested by the abort() function
>    or when some signals are received.

Both these cases can be detected by parent processes separately.
Unfortunately, REv2 currently doesn't allow this, as we only provide an
exit code. Implementations such as Buildbarn tend to set the exit code
for abnormal termination to 128+signum, meaning that if a segmentation
fault occurs, an action terminates with exit code 139. This tends to be
fairly confusing for non-expert users.

The goal of this change is to make all of this less confusing. It adds a
new message type named ProcessTerminationResult, which contains the
termination result in a more structured way. It can now either be an
exit code or a signal number. If it turns out that other operating
systems provide even more ways, we can add additional `oneof` cases.

For the exit code in case of normal termination I decided to use an
int64 instead of int32. The reason being that POSIX allows you to get
the full exit code back, using APIs like waitid(). On ILP64 systems, the
exit code may thus be 64 bits.

For the signal in case of abnormal termination I decided to use a
string. The reason being that signal numbers are not covered by POSIX.
Everybody always assumes that 9 is SIGKILL, but POSIX only documents
this as a convention for the command line flags of the `kill` utility.
Not the the actual `SIG*` constants. Adding an enumeration would also be
unwise, as operating systems are free to add their own signals that are
not covered by POSIX (e.g., SIGINFO on BSD).

Fixes: bazelbuild#240
@EdSchouten EdSchouten changed the title REv3: Let exit_code be better aligned with C/POSIX Let exit_code be better aligned with C/POSIX Mar 14, 2023
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Mar 14, 2023
POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states
the following:

> There are two kinds of process termination:
>
> 1. Normal termination occurs by a return from main(), when requested
>    with the exit(), _exit(), or _Exit() functions; or when the last
>    thread in the process terminates by returning from its start
>    function, by calling the pthread_exit() function, or through
>    cancellation.
>
> 2. Abnormal termination occurs when requested by the abort() function
>    or when some signals are received.

Both these cases can be detected by parent processes separately.
Unfortunately, REv2 currently doesn't allow this, as we only provide an
exit code. Implementations such as Buildbarn tend to set the exit code
for abnormal termination to 128+signum, meaning that if a segmentation
fault occurs, an action terminates with exit code 139. This tends to be
fairly confusing for non-expert users.

The goal of this change is to make all of this less confusing. It adds a
new message type named ProcessTerminationResult, which contains the
termination result in a more structured way. It can now either be an
exit code or a signal number. If it turns out that other operating
systems provide even more ways, we can add additional `oneof` cases.

For the exit code in case of normal termination I decided to use an
int64 instead of int32. The reason being that POSIX allows you to get
the full exit code back, using APIs like waitid(). On ILP64 systems, the
exit code may thus be 64 bits.

For the signal in case of abnormal termination I decided to use a
string. The reason being that signal numbers are not covered by POSIX.
Everybody always assumes that 9 is SIGKILL, but POSIX only documents
this as a convention for the command line flags of the `kill` utility.
Not the the actual `SIG*` constants. Adding an enumeration would also be
unwise, as operating systems are free to add their own signals that are
not covered by POSIX (e.g., SIGINFO on BSD).

Fixes: bazelbuild#240
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Mar 15, 2023
POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states
the following:

> There are two kinds of process termination:
>
> 1. Normal termination occurs by a return from main(), when requested
>    with the exit(), _exit(), or _Exit() functions; or when the last
>    thread in the process terminates by returning from its start
>    function, by calling the pthread_exit() function, or through
>    cancellation.
>
> 2. Abnormal termination occurs when requested by the abort() function
>    or when some signals are received.

Both these cases can be detected by parent processes separately.
Unfortunately, REv2 currently doesn't allow this, as we only provide an
exit code. Implementations such as Buildbarn tend to set the exit code
for abnormal termination to 128+signum, meaning that if a segmentation
fault occurs, an action terminates with exit code 139. This tends to be
fairly confusing for non-expert users.

The goal of this change is to make all of this less confusing. It adds a
new message type named ProcessTerminationResult, which contains the
termination result in a more structured way. It can now either be an
exit code or a signal number. If it turns out that other operating
systems provide even more ways, we can add additional `oneof` cases.

For the exit code in case of normal termination I decided to use an
int64 instead of int32. The reason being that POSIX allows you to get
the full exit code back, using APIs like waitid(). On ILP64 systems, the
exit code may thus be 64 bits.

For the signal in case of abnormal termination I decided to use a
string. The reason being that signal numbers are not covered by POSIX.
Everybody always assumes that 9 is SIGKILL, but POSIX only documents
this as a convention for the command line flags of the `kill` utility.
Not the the actual `SIG*` constants. Adding an enumeration would also be
unwise, as operating systems are free to add their own signals that are
not covered by POSIX (e.g., SIGINFO on BSD).

Fixes: bazelbuild#240
EdSchouten added a commit to EdSchouten/remote-apis that referenced this issue Mar 15, 2023
POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states
the following:

> There are two kinds of process termination:
>
> 1. Normal termination occurs by a return from main(), when requested
>    with the exit(), _exit(), or _Exit() functions; or when the last
>    thread in the process terminates by returning from its start
>    function, by calling the pthread_exit() function, or through
>    cancellation.
>
> 2. Abnormal termination occurs when requested by the abort() function
>    or when some signals are received.

Both these cases can be detected by parent processes separately.
Unfortunately, REv2 currently doesn't allow this, as we only provide an
exit code. Implementations such as Buildbarn tend to set the exit code
for abnormal termination to 128+signum, meaning that if a segmentation
fault occurs, an action terminates with exit code 139. This tends to be
fairly confusing for non-expert users.

The goal of this change is to make all of this less confusing. It adds a
new message type named ProcessTerminationResult, which contains the
termination result in a more structured way. It can now either be an
exit code or a signal number. If it turns out that other operating
systems provide even more ways, we can add additional `oneof` cases.

For the exit code in case of normal termination I decided to use an
int64 instead of int32. The reason being that POSIX allows you to get
the full exit code back, using APIs like waitid(). On ILP64 systems, the
exit code may thus be 64 bits.

For the signal in case of abnormal termination I decided to use a
string. The reason being that signal numbers are not covered by POSIX.
Everybody always assumes that 9 is SIGKILL, but POSIX only documents
this as a convention for the command line flags of the `kill` utility.
Not the the actual `SIG*` constants. Adding an enumeration would also be
unwise, as operating systems are free to add their own signals that are
not covered by POSIX (e.g., SIGINFO on BSD).

Fixes: bazelbuild#240
EdSchouten added a commit to EdSchouten/bazel that referenced this issue Jul 1, 2023
On POSIX-like systems, processes may either terminate normally with an
integer exit code. The exit code may span the full range of int, even
though all but the waitid() function retur the bottom eight bits. In
addition to that, processes may terminate abnormally due to a signal
(SIGABRT, SIGSEGV, etc.)

POSIX shells (sh, bash, etc.) are more restrictive, in that they can
only return exit codes between 0 and 126. 127 is used to denote that the
executable cannot be found. Exit codes above 128 indicate that the
process terminated due to a signal.

Right now we let test-setup.sh terminate using the exit code obtained
using $?. This means that if a program terminates due to SIGABRT,
test-setup.sh terminates with exit code 128+6=134. This causes us to
lose some information, as the (remote) execution environment now only
sees plain exit codes.

This change extends test-setup.sh to check for exit codes above 128. In
that case it will send a signal to itself, so that the original signal
condition is raised once again.

See also: bazelbuild/remote-apis#240
copybara-service bot pushed a commit to bazelbuild/bazel that referenced this issue Jul 13, 2023
On POSIX-like systems, processes may either terminate normally with an integer exit code. The exit code may span the full range of int, even though all but the waitid() function retur the bottom eight bits. In addition to that, processes may terminate abnormally due to a signal (SIGABRT, SIGSEGV, etc.)

POSIX shells (sh, bash, etc.) are more restrictive, in that they can only return exit codes between 0 and 126. 127 is used to denote that the executable cannot be found. Exit codes above 128 indicate that the process terminated due to a signal.

Right now we let test-setup.sh terminate using the exit code obtained using $?. This means that if a program terminates due to SIGABRT, test-setup.sh terminates with exit code 128+6=134. This causes us to lose some information, as the (remote) execution environment now only sees plain exit codes.

This change extends test-setup.sh to check for exit codes above 128. In that case it will send a signal to itself, so that the original signal condition is raised once again.

See also: bazelbuild/remote-apis#240

Closes #18827.

PiperOrigin-RevId: 547773406
Change-Id: Ia29a6ea1eefdb8caa5624a799755b420a61478a0
iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Jul 13, 2023
On POSIX-like systems, processes may either terminate normally with an integer exit code. The exit code may span the full range of int, even though all but the waitid() function retur the bottom eight bits. In addition to that, processes may terminate abnormally due to a signal (SIGABRT, SIGSEGV, etc.)

POSIX shells (sh, bash, etc.) are more restrictive, in that they can only return exit codes between 0 and 126. 127 is used to denote that the executable cannot be found. Exit codes above 128 indicate that the process terminated due to a signal.

Right now we let test-setup.sh terminate using the exit code obtained using $?. This means that if a program terminates due to SIGABRT, test-setup.sh terminates with exit code 128+6=134. This causes us to lose some information, as the (remote) execution environment now only sees plain exit codes.

This change extends test-setup.sh to check for exit codes above 128. In that case it will send a signal to itself, so that the original signal condition is raised once again.

See also: bazelbuild/remote-apis#240

Closes bazelbuild#18827.

PiperOrigin-RevId: 547773406
Change-Id: Ia29a6ea1eefdb8caa5624a799755b420a61478a0
iancha1992 added a commit to bazelbuild/bazel that referenced this issue Jul 13, 2023
On POSIX-like systems, processes may either terminate normally with an integer exit code. The exit code may span the full range of int, even though all but the waitid() function retur the bottom eight bits. In addition to that, processes may terminate abnormally due to a signal (SIGABRT, SIGSEGV, etc.)

POSIX shells (sh, bash, etc.) are more restrictive, in that they can only return exit codes between 0 and 126. 127 is used to denote that the executable cannot be found. Exit codes above 128 indicate that the process terminated due to a signal.

Right now we let test-setup.sh terminate using the exit code obtained using $?. This means that if a program terminates due to SIGABRT, test-setup.sh terminates with exit code 128+6=134. This causes us to lose some information, as the (remote) execution environment now only sees plain exit codes.

This change extends test-setup.sh to check for exit codes above 128. In that case it will send a signal to itself, so that the original signal condition is raised once again.

See also: bazelbuild/remote-apis#240

Closes #18827.

PiperOrigin-RevId: 547773406
Change-Id: Ia29a6ea1eefdb8caa5624a799755b420a61478a0

Co-authored-by: Ed Schouten <eschouten@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant