ark: Weird crash when restarting R 4.1 repeatedly #720

lionel- · 2023-06-09T13:44:21Z

Restarting R 4.1 repeatedly causes a segfault (tested locally on arm64 with 4.1.2 and on Davis' intel computer with 4.1.3 - we couldn't reproduce with 4.2 and 4.3).

Screen.Recording.2023-06-09.at.15.32.16.mov

Using posit-dev/ark#22 and #708, I get this backtrace (after a round of c++filt):

2023-06-07T16:01:07.576771000Z [ark-unknown] INFO crates/ark/src/main.rs:146: Loaded connection information from front-end in /var/folders/cr/whn38wbj72zd8h7wwwtnch500000gn/T/kernel-g9DuBp/connection.json
2023-06-07T16:01:07.577814000Z [ark-unknown] DEBUG crates/ark/src/main.rs:150: Connection data: ConnectionFile { control_port: 54973, shell_port: 42280, stdin_port: 60834, iopub_port: 52645, hb_port: 51043, transport: "tcp", signature_scheme: "hmac-sha256", ip: "127.0.0.1", key: "69292e51da84173afb44144559766844" }
2023-06-07T16:01:07.578049000Z [ark-unknown] TRACE crates/ark/src/interface.rs:493: Waiting for R's initial input prompt...
2023-06-07T16:01:07.578123000Z [ark-unknown] TRACE crates/amalthea/src/socket/socket.rs:56: Binding to ZeroMQ 'Shell' socket at tcp://127.0.0.1:42280
2023-06-07T16:01:07.578441000Z [ark-unknown] TRACE crates/ark/src/interface.rs:394: R_HOME: "/Library/Frameworks/R.framework/Resources"
2023-06-07T16:01:07.849303000Z [ark-unknown] INFO crates/stdext/src/spawn.rs:51: >>> Backtrace for signal 11
[R] >>> Backtrace for signal 11
[R] >>> In thread main
[R]    0: backtrace::backtrace::libunwind::trace
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/backtrace/libunwind.rs:93:5
[R]       backtrace::backtrace::trace_unsynchronized
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/backtrace/mod.rs:66:5
[R]    1: backtrace::backtrace::trace
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/backtrace/mod.rs:53:14
[R]    2: backtrace::capture::Backtrace::create
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/capture.rs:176:9
[R]    3: backtrace::capture::Backtrace::new
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.67/src/capture.rs:140:22
[R]    4: stdext::spawn::backtrace_handler
[R]              at crates/stdext/src/spawn.rs:50:14
[R]    5: _OSAtomicTestAndClearBarrier
[R]    6: <unknown>
[R]    7: <unknown>
[R]    8: <unknown>
[R]    9: <unknown>
[R]   10: <unknown>
[R]   11: <unknown>
[R]   12: <unknown>
[R]   13: zmq::tcp_address_t::to_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) const
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_address.cpp:146:20
[R]   14: std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > zmq::get_socket_name<zmq::tcp_address_t>(int, zmq::socket_end_t)
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/address.hpp:157:10
[R]   15: zmq::tcp_listener_t::get_socket_name(int, zmq::socket_end_t) const
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_listener.cpp:101:12
[R]   16: zmq::tcp_listener_t::set_local_address(char const*)
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_listener.cpp:183:17
[R]   17: zmq::socket_base_t::bind(char const*)
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/socket_base.cpp:653:24
[R]   18: zmq_bind
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zeromq-src-0.2.5+4.3.4/vendor/src/zmq.cpp:326:15
[R]   19: zmq::Socket::bind
[R]              at /Users/lionel/.cargo/registry/src/github.com-1ecc6299db9ec823/zmq-0.10.0/src/lib.rs:646:27
[R]   20: amalthea::socket::socket::Socket::new
[R]              at crates/amalthea/src/socket/socket.rs:57:35
[R]   21: amalthea::kernel::Kernel::connect
[R]              at crates/amalthea/src/kernel.rs:118:28
[R]   22: ark::start_kernel
[R]              at crates/ark/src/main.rs:84:11
[R]   23: ark::parse_file
[R]              at crates/ark/src/main.rs:151:13
[R]   24: ark::main
[R]              at crates/ark/src/main.rs:268:9
[R]   25: core::ops::function::FnOnce::call_once
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ops/function.rs:250:5
[R]   26: std::sys_common::backtrace::__rust_begin_short_backtrace
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:134:18
[R]   27: std::rt::lang_start::{{closure}}
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:166:18
[R]   28: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ops/function.rs:287:13
[R]       std::panicking::try::do_call
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:487:40
[R]       std::panicking::try
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:451:19
[R]       std::panic::catch_unwind
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panic.rs:140:14
[R]       std::rt::lang_start_internal::{{closure}}
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:148:48
[R]       std::panicking::try::do_call
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:487:40
[R]       std::panicking::try
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:451:19
[R]       std::panic::catch_unwind
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panic.rs:140:14
[R]       std::rt::lang_start_internal
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:148:20
[R]   29: std::rt::lang_start
[R]              at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/rt.rs:165:17
[R]   30: _main

The _OSAtomicTestAndClearBarrier frame is part of the backtrace capture, so the crash happens in the 0MQ stack, here precisely https://github.com/zeromq/libzmq/blob/5bf04ee2ff207f0eaf34298658fe354ea61e1839/src/tcp_address.cpp#L121 (line number differs because our version of 0MQ is 3 years old, but the function hasn't changed).

const int rc = getnameinfo (addr (), addrlen (), hbuf, sizeof (hbuf), NULL,
                            0, NI_NUMERICHOST);

I thought perhaps there was some ordering issue that caused us to pass uninitialised memory to getnameinfo(). I used a local 0MQ with some printf to make sure, and this doesn't seem to be the case:

addrlen: 16

// struct sockaddr_in {
//   __uint8_t       sin_len;
//   sa_family_t     sin_family;
//   in_port_t       sin_port;
//   struct  in_addr sin_addr;
//   char            sin_zero[8];
// };

addr: {
  sin_len: 16
  sin_family: 2
  sin_port: 61348
  sin_addr: { 16777343 }  // 127.0.0.1
  sin_zero: 0
}

I couldn't reproduce the crash when I was attaching Ark on startup so I thought there might be a race somewhere. It seems that it is the case as sleeping a bit before starting up solves the crash:

modified   Cargo.lock
@@ -51,6 +51,7 @@ dependencies = [
  "stdext",
  "strum",
  "strum_macros",
+ "time 0.3.21",
  "uuid",
  "zmq",
 ]
@@ -303,7 +304,7 @@ dependencies = [
  "js-sys",
  "num-integer",
  "num-traits",
- "time",
+ "time 0.1.45",
  "wasm-bindgen",
  "winapi",
 ]
@@ -2066,6 +2067,22 @@ dependencies = [
  "winapi",
 ]
 
+[[package]]
+name = "time"
+version = "0.3.21"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8f3403384eaacbca9923fa06940178ac13e4edb725486d70e8e15881d0c836cc"
+dependencies = [
+ "serde",
+ "time-core",
+]
+
+[[package]]
+name = "time-core"
+version = "0.1.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7300fbefb4dadc1af235a9cef3737cea692a9d97e1b9cbcd4ebdae6f8868e6fb"
+
 [[package]]
 name = "tinyvec"
 version = "1.6.0"
modified   crates/amalthea/Cargo.toml
@@ -25,6 +25,7 @@ zmq = "0.10.0"
 strum = "0.24"
 strum_macros = "0.24"
 crossbeam = { version = "0.8.2", features = ["crossbeam-channel"] }
+time = "0.3.21"
 
 [dev-dependencies]
 rand = "0.8.5"
modified   crates/amalthea/src/socket/socket.rs
@@ -25,6 +25,8 @@ pub struct Socket {
     socket: zmq::Socket,
 }
 
+use std::time;
+
 impl Socket {
     /// Create a new Socket instance from a kernel session and a ZeroMQ context.
     pub fn new(
@@ -53,6 +55,7 @@ impl Socket {
         // will be the server, and the `connect()` side will be the client.
         match kind {
             zmq::SocketType::ROUTER | zmq::SocketType::PUB | zmq::SocketType::REP => {
+                std::thread::sleep(time::Duration::from_millis(50));
                 trace!("Binding to ZeroMQ '{}' socket at {}", name, endpoint);
                 if let Err(err) = socket.bind(&endpoint) {
                     return Err(Error::SocketBindError(name, endpoint, err));

Since there is nothing wrong with the data passed to getnameinfo(), and since sleeping a bit fixes the crash, I'm tempted to say this is a weird race condition in the network stack of macOS? It would be interesting to see if this reproduces on Linux or Windows.

I also wondered if properly closing all the 0mq sockets and context on shutdown would help, but didn't go through with this because it seems like a lot of work to propagate the shutdown to all the threads and coordinate the closing of everything. Also since we're closing the process anyway, it shouldn't matter if we're leaving the sockets open?

So I have two questions:

What else can I try to figure out this crash?
Is the sleep workaround a reasonable stopgap that is worth a PR?

The text was updated successfully, but these errors were encountered:

kevinushey · 2023-06-09T16:19:28Z

We bumped into some very similar bugs on the RStudio side that effectively came down to various networking function calls using getenv(), and those calls being unsafe in a multithreaded context.

https://github.com/rstudio/rstudio-pro/pull/4648

kevinushey · 2023-06-09T16:21:15Z

https://man7.org/linux/man-pages/man3/getnameinfo.3.html suggests (at least with glibc) that the function is:

MT-Safe env locale

which effectively means the function makes (unsynchronized) calls to access facets of the environment / locale; so things could definitely go awry if we were modifying those in a separate thread.

lionel- · 2023-06-14T18:38:30Z

Using #742 (well a local version that actually works, updates incoming) I managed to get a backtrace for the R thread:

__ulock_wait (@__ulock_wait:5)
_os_unfair_lock_lock_slow (@_os_unfair_lock_lock_slow:54)
localeconv_l (@localeconv_l:19)
__vfprintf (@__vfprintf:45)
_vsnprintf (@_vsnprintf:59)
snprintf (@snprintf:20)
R_reInitTempDir (@R_reInitTempDir:50)
Rf_InitTempDir (@Rf_InitTempDir:4)
setup_Rmainloop (@setup_Rmainloop:121)
ark::interface::start_r (/Users/lionel/Sync/Projects/Positron/positron/extensions/positron-r/amalthea/crates/ark/src

I believe this is from this line:

snprintf(buf, len, "R_SESSION_TMPDIR=%s", tmp);

I can't find much information about thread-safety of localeconv_l(), macOS seems to have a different implementation anyway. Interesting article: https://aras-p.info/blog/2022/02/25/Curious-lack-of-sprintf-scaling/

To fix this I'll just delay initialisation of R after that of 0MQ as discussed yesterday.

Edit: On the main thread side, the debugger shows a more complete backtrace:

localeconv_l (@localeconv_l:23)
__vfprintf (@__vfprintf:45)
_vsnprintf (@_vsnprintf:59)
snprintf (@snprintf:20)
inet_ntop4 (@inet_ntop4:43)
si_nameinfo (@si_nameinfo:197)
getnameinfo (@getnameinfo:72)
zmq::tcp_address_t::to_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>&) const (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_address.cpp:146)

So it looks like printf() is not thread-safe on macOS...

lionel- · 2023-06-15T15:02:50Z

@DavisVaughan ran into this crash with R 4.3 as well.

I used the debugger to pause on invocations of snprintf() outside of the R thread. Many of them happen during 0MQ initialization, e.g. the call stack of the above report, and things like:

snprintf (@snprintf:3)
inet_ntop4 (@inet_ntop4:30)
si_nameinfo (@si_nameinfo:197)
getnameinfo (@getnameinfo:72)
zmq::tcp_address_t::to_string(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>&) const (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_address.cpp:146)
std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> zmq::get_socket_name<zmq::tcp_address_t>(int, zmq::socket_end_t) (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/address.hpp:157)
zmq::tcp_listener_t::get_socket_name(int, zmq::socket_end_t) const (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_listener.cpp:101)
zmq::stream_listener_base_t::create_engine(int) (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/stream_listener_base.cpp:103)
zmq::tcp_listener_t::in_event() (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/tcp_listener.cpp:94)
zmq::kqueue_t::loop() (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/kqueue.cpp:218)
zmq::worker_poller_base_t::worker_routine(void*) (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/poller_base.cpp:146)
thread_routine(void*) (/Users/lionel/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zeromq-src-0.2.5+4.3.4/vendor/src/thread.cpp:257)
_pthread_start (@_pthread_start:40)

Once we're past the startup, most snprintf calls in other threads happen within the R lock so they are safe. I did see these while creating plots:

snprintf (@snprintf:3)
hexNumber(unsigned long long) (@hexNumber(unsigned long long):19)
asString(void const*) (@asString(void const*):1006)
asString(void const*) (@asString(void const*):246)
copyApplicationInformationDictionaryGivenASNUsingLocalCache(__LSSharedMemoryPage const*, __LSASN const*) (@copyApplicationInformationDictionaryGivenASNUsingLocalCache(__LSSharedMemoryPage const*, __LSASN const*):348)
_LSCopyApplicationInformation (@_LSCopyApplicationInformation:207)
_LSCopyApplicationInformationItem (@_LSCopyApplicationInformationItem:40)
__105-[NSApplication(NSPersistentUISupport) _disableRelaunchOnLoginIfNotLaunchedByLaunchServicesOrBlacklisted]_block_invoke (@__105-[NSApplication(NSPersistentUISupport) _disableRelaunchOnLoginIfNotLaunchedByLaunchServicesOrBlacklisted]_block_invoke:14)
___NSPersistentUIDispatchQueueAsync_block_invoke (@___NSPersistentUIDispatchQueueAsync_block_invoke:20)
_dispatch_call_block_and_release (@_dispatch_call_block_and_release:11)
_dispatch_client_callout (@_dispatch_client_callout:8)
_dispatch_root_queue_drain (@_dispatch_root_queue_drain:174)
_dispatch_worker_thread2 (@_dispatch_worker_thread2:44)
_pthread_wqthread (@_pthread_wqthread:60)

---

snprintf (@snprintf:3)
_xpc_connection_create (@_xpc_connection_create:54)
_xpc_connection_mach_event (@_xpc_connection_mach_event:506)
_dispatch_client_callout4 (@_dispatch_client_callout4:8)
_dispatch_mach_msg_invoke (@_dispatch_mach_msg_invoke:120)
_dispatch_lane_serial_drain (@_dispatch_lane_serial_drain:97)
_dispatch_mach_invoke (@_dispatch_mach_invoke:115)
_dispatch_lane_serial_drain (@_dispatch_lane_serial_drain:97)
_dispatch_lane_invoke (@_dispatch_lane_invoke:99)
_dispatch_workloop_worker_thread (@_dispatch_workloop_worker_thread:166)
_pthread_wqthread (@_pthread_wqthread:75)

But these seem to be transient threads spawned by the OS, presumably when R is waiting. So these should be safe. If these weren't safe, we'd have seen problems outside Ark.

So overall, delaying initialisation of R until 0MQ init is complete should get us out of most trouble.

That said this investigation only concerns snprintf() and there are other unsafe functions that could cause problems. I worry that Ark users will encounter rare crashes and lose trust in their application. So if we end up seeing random crashes I wonder if we should consider an alternative design in the longer term, to make Ark as robust as it can be. I was thinking about something like:

Ark would be split up in two processes communicating with some RPC protocol. The first process would look much like the current one and would be multithreaded. The second process would run R in a fully single threaded way.
Execution and introspection requests would be pulled (rather than pushed) at yield time (read-console and interrupt) from the main process via IPC and run sequentially. This seems mostly equivalent to what we have today with the R lock since there is only one thread that can query R at any time.

DavisVaughan · 2023-06-20T17:12:39Z

I restarted R 4.1.3 roughly 70 times with this fix installed and didn't see this anymore. I think we are pretty confident about the source of this, and that posit-dev/ark#43 fixes it, so I'll close. Thanks @lionel-!

kevinushey · 2023-06-20T17:47:41Z

roughly 70 times

I appreciate the dedication 😄

lionel- self-assigned this Jun 13, 2023

lionel- mentioned this issue Jun 16, 2023

Wait until 0MQ sockets are created before starting R posit-dev/ark#43

Merged

DavisVaughan closed this as completed Jun 20, 2023

DavisVaughan mentioned this issue Jun 20, 2023

Sync amalthea #770

Merged

4 tasks

lionel- mentioned this issue Jun 29, 2023

ark: Should Ark be split in two processes? #802

Closed

petetronic added this to the Private Alpha milestone Aug 26, 2023

wesm added the lang: r label Feb 29, 2024

github-actions bot locked as resolved and limited conversation to collaborators Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ark: Weird crash when restarting R 4.1 repeatedly #720

ark: Weird crash when restarting R 4.1 repeatedly #720

lionel- commented Jun 9, 2023

kevinushey commented Jun 9, 2023

kevinushey commented Jun 9, 2023

lionel- commented Jun 14, 2023 •

edited

Loading

lionel- commented Jun 15, 2023 •

edited

Loading

DavisVaughan commented Jun 20, 2023

kevinushey commented Jun 20, 2023

ark: Weird crash when restarting R 4.1 repeatedly #720

ark: Weird crash when restarting R 4.1 repeatedly #720

Comments

lionel- commented Jun 9, 2023

kevinushey commented Jun 9, 2023

kevinushey commented Jun 9, 2023

lionel- commented Jun 14, 2023 • edited Loading

lionel- commented Jun 15, 2023 • edited Loading

DavisVaughan commented Jun 20, 2023

kevinushey commented Jun 20, 2023

lionel- commented Jun 14, 2023 •

edited

Loading

lionel- commented Jun 15, 2023 •

edited

Loading