ark: Weird crash when restarting R 4.1 repeatedly #720
We bumped into some very similar bugs on the RStudio side that effectively came down to various networking function calls using
https://man7.org/linux/man-pages/man3/getnameinfo.3.html suggests (at least with glibc) that the function is:
which effectively means the function makes (unsynchronized) calls to access facets of the environment / locale, so things could definitely go awry if we were modifying those in a separate thread.
Using #742 (well, a local version that actually works; updates incoming) I managed to get a backtrace for the R thread:
I believe this is from this line: snprintf(buf, len, "R_SESSION_TMPDIR=%s", tmp);
I can't find much information about thread-safety of
To fix this I'll just delay initialisation of R after that of 0MQ as discussed yesterday.
Edit: On the main thread side, the debugger shows a more complete backtrace:
So it looks like
@DavisVaughan ran into this crash with R 4.3 as well. I used the debugger to pause on invocations of
Once we're past the startup, most snprintf calls in other threads happen within the R lock so they are safe. I did see these while creating plots:
But these seem to be transient threads spawned by the OS, presumably when R is waiting, so they should be safe. If they weren't, we'd have seen problems outside Ark. So overall, delaying initialisation of R until 0MQ init is complete should get us out of most trouble. That said, this investigation only concerns
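The ordering fix described above (start R only once the 0MQ side is done initialising) can be sketched as a simple startup gate. This is not Ark's actual code, which is written in Rust. The names `sockets_ready`, `network_thread` and `start_r_after_sockets` are hypothetical, and `setenv()` stands in for the `snprintf()` call on `R_SESSION_TMPDIR` that R performs.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical startup gate (illustration only, not Ark's implementation). */
static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gate_cond = PTHREAD_COND_INITIALIZER;
static bool sockets_ready = false;

/* Stand-in for the thread that sets up the 0MQ sockets, i.e. the place
 * where getnameinfo() would run. */
static void *network_thread(void *arg) {
    (void)arg;
    /* ... socket setup would happen here ... */
    pthread_mutex_lock(&gate_lock);
    sockets_ready = true;              /* open the gate */
    pthread_cond_signal(&gate_cond);
    pthread_mutex_unlock(&gate_lock);
    return NULL;
}

/* Stand-in for R startup: wait for the gate, then touch the environment. */
int start_r_after_sockets(const char *tmpdir) {
    pthread_t tid;
    if (pthread_create(&tid, NULL, network_thread, NULL) != 0)
        return -1;

    pthread_mutex_lock(&gate_lock);
    while (!sockets_ready)             /* loop guards against spurious wakeups */
        pthread_cond_wait(&gate_cond, &gate_lock);
    pthread_mutex_unlock(&gate_lock);

    /* Socket setup is complete, so this write cannot overlap with it. */
    int rc = setenv("R_SESSION_TMPDIR", tmpdir, 1);
    pthread_join(tid, NULL);
    return rc;
}
```

Unlike sleeping for a fixed time before startup, waiting on an explicit signal does not depend on how long socket setup happens to take.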
I restarted R 4.1.3 roughly 70 times with this fix installed and didn't see this anymore. I think we are pretty confident about the source of this, and that posit-dev/ark#43 fixes it, so I'll close. Thanks @lionel-!
I appreciate the dedication 😄
Restarting R 4.1 repeatedly causes a segfault (tested locally on arm64 with 4.1.2 and on Davis' Intel computer with 4.1.3; we couldn't reproduce with 4.2 and 4.3).
Screen.Recording.2023-06-09.at.15.32.16.mov
Using posit-dev/ark#22 and #708, I get this backtrace (after a round of c++filt):
The _OSAtomicTestAndClearBarrier frame is part of the backtrace capture, so the crash happens in the 0MQ stack, precisely here: https://github.com/zeromq/libzmq/blob/5bf04ee2ff207f0eaf34298658fe354ea61e1839/src/tcp_address.cpp#L121 (the line number differs because our version of 0MQ is 3 years old, but the function hasn't changed).
I thought perhaps there was some ordering issue that caused us to pass uninitialised memory to getnameinfo(). I used a local 0MQ with some printf to make sure, and this doesn't seem to be the case:
I couldn't reproduce the crash when I was attaching Ark on startup so I thought there might be a race somewhere. It seems that it is the case as sleeping a bit before starting up solves the crash:
Since there is nothing wrong with the data passed to getnameinfo(), and since sleeping a bit fixes the crash, I'm tempted to say this is a weird race condition in the network stack of macOS? It would be interesting to see if this reproduces on Linux or Windows.
I also wondered if properly closing all the 0mq sockets and context on shutdown would help, but didn't go through with this because it seems like a lot of work to propagate the shutdown to all the threads and coordinate the closing of everything. Also since we're closing the process anyway, it shouldn't matter if we're leaving the sockets open?
So I have two questions: