Collecting from unknown thread in relation to libuv & getaddrinfo on GNU libc #561

Open
plajjan opened this issue Jun 27, 2023 · 4 comments


plajjan commented Jun 27, 2023

I'm getting a "Collecting from unknown thread" abort from bdwgc. The application is written in the Acton programming language, where bdwgc is used for garbage collection in conjunction with libuv for async I/O, and I'm primarily linking with GNU libc.

I think getaddrinfo is notoriously difficult to make async, so libuv starts a threadpool to run DNS queries. getaddrinfo in turn is implemented in libc. I've noticed that if I link with musl libc I don't seem to get this problem. I'm not sure whether that's conclusive or whether I'm just lucky / unlucky.

I'm fumbling in the dark a bit here, and I'm opening this ticket in case someone (well, most likely you, Ivan :)) has some experience with this and has perhaps run into something similar.

My current reproduction is about repeatedly running DNS queries, pretty much as fast as possible. The queries are for a non-existent name, so they quickly error (local DNS resolver). After 20-100 queries, the application dies with a "Collecting from unknown thread".

.... lots more appl output...
DNS resolution error...
ERror: DNS lookup error: unknown node or service
query: 
backoff: 0
DNS resolution error...
ERror: DNS lookup error: unknown node or service
query: 
backoff: 0
DNS resolution error...
ERror: DNS lookup error: unknown node or service
query: 
backoff: 0
Collecting from unknown thread
bash: line 1:  1306 Aborted                 (core dumped) /xcon --p2p rtr1/1--rtr2/1 --rts-wthreads=2
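
As a rough illustration of the lookup pattern described above, here is a minimal C sketch that resolves a name repeatedly from several threads. It is not the Acton or libuv code, and it does not link bdwgc, so on its own it will not produce the abort. `count_failed_lookups`, `lookup_worker` and `MAX_THREADS` are names made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/socket.h>

#define MAX_THREADS 16 /* arbitrary limit for this sketch */

struct lookup_job {
    const char *name;     /* host name to resolve */
    int queries;          /* number of lookups each thread performs */
    atomic_int *failures; /* shared counter of failed lookups */
};

/* Thread body: call getaddrinfo in a tight loop and count the failures. */
static void *lookup_worker(void *arg) {
    struct lookup_job *job = arg;
    for (int i = 0; i < job->queries; i++) {
        struct addrinfo hints;
        struct addrinfo *res = NULL;
        memset(&hints, 0, sizeof hints);
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(job->name, NULL, &hints, &res) != 0) {
            atomic_fetch_add(job->failures, 1);
        } else {
            freeaddrinfo(res);
        }
    }
    return NULL;
}

/* Runs `threads` workers doing `queries` lookups each.
   Returns the number of failed lookups, or -1 if the workers
   could not all be started. */
int count_failed_lookups(const char *name, int threads, int queries) {
    pthread_t tids[MAX_THREADS];
    atomic_int failures = 0;
    struct lookup_job job = { name, queries, &failures };
    int started = 0;

    if (threads < 1 || threads > MAX_THREADS) return -1;
    for (; started < threads; started++) {
        if (pthread_create(&tids[started], NULL, lookup_worker, &job) != 0) break;
    }
    for (int i = 0; i < started; i++) pthread_join(tids[i], NULL);
    return started == threads ? atomic_load(&failures) : -1;
}
```

Calling `count_failed_lookups("nonexistent.invalid", 4, 100)` would approximate the reproduction. The host name is a placeholder; names under `.invalid` are reserved and should never resolve.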

The backtrace looks like this:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f848327cd2f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f848322def2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f8483218472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00000000006ba17d in GC_push_all_stacks () at pthread_stop_world.c:913
#5  0x00000000006b8c19 in GC_default_push_other_roots () at os_dep.c:2834
#6  0x00000000006b2b53 in GC_push_roots (all=1, cold_gc_frame=0x7f847a8aefa0 "0\360\212z\204\177") at mark_rts.c:979
#7  0x00000000006afb95 in push_roots_and_advance (push_all=1, cold_gc_frame=0x7f847a8aefa0 "0\360\212z\204\177") at mark.c:275
#8  0x00000000006aed02 in GC_mark_some (cold_gc_frame=0x7f847a8aefa0 "0\360\212z\204\177") at mark.c:357
#9  0x00000000006a2dfe in GC_stopped_mark (stop_func=0x6a1ee0 <GC_never_stop_func>) at alloc.c:890
#10 0x00000000006a2a63 in GC_try_to_collect_inner (stop_func=0x6a1ee0 <GC_never_stop_func>) at alloc.c:605
#11 0x00000000006a4efe in GC_collect_or_expand (needed_blocks=2, flags=0, retry=0) at alloc.c:1685
#12 0x00000000006a0635 in GC_alloc_large (lb=4112, k=1, flags=0, align_m1=0) at malloc.c:66
#13 0x00000000006a0dcb in GC_generic_malloc_aligned (lb=4096, k=1, flags=0, align_m1=0) at malloc.c:271
#14 0x00000000006a10f9 in GC_malloc_kind_global (lb=4096, k=1) at malloc.c:342
#15 0x00000000006a112b in GC_malloc_kind (lb=4096, k=1) at malloc.c:348
#16 0x00000000006a117a in GC_malloc (lb=4096) at malloc.c:361
#17 0x00000000006a14d5 in malloc (lb=4096) at malloc.c:476
#18 0x00007f848326776c in __GI__IO_file_doallocate (fp=0x7f847889e200) at ./libio/filedoalloc.c:101
#19 0x00007f8483274f50 in __GI__IO_doallocbuf (fp=0x7f847889e200) at ./libio/libioP.h:947
#20 __GI__IO_doallocbuf (fp=fp@entry=0x7f847889e200) at ./libio/genops.c:342
#21 0x00007f8483272b8c in _IO_new_file_seekoff (fp=0x7f847889e200, offset=0, dir=0, mode=<optimized out>) at ./libio/fileops.c:937
#22 0x00007f8483270ad3 in __fseeko (fp=0x7f847889e200, offset=0, whence=0) at ./libio/fseeko.c:40
#23 0x00007f84833252f7 in __GI___nss_hash (len=6, keyarg=<optimized out>) at ./nss/nss_hash.c:60
#24 __GI___nss_hash (keyarg=<optimized out>, len=6) at ./nss/nss_hash.c:30
#25 0x00007f84788888b0 in ?? ()
#26 0x00007f84833291c6 in __GI__nss_files_gethostbyaddr_r (addr=0x7f84788888b0, len=2055927536, af=2055927216, result=0x2, buffer=0x400 <error: Cannot access memory at address 0x400>, buflen=140206968338480, errnop=0x0, herrnop=0x7f847a8b069c) at nss_files/files-hosts.c:104
#27 0x0000000000000000 in ?? ()
(gdb)  

I don't recognize what thread this is. I suspect it is something that GNU libc sets up: the files towards the bottom of the stack belong to nss_files, nss and libio, and the symbols seem to be part of GNU libc.

I am guessing that musl libc implements getaddrinfo differently, without a thread like that, which is why I'm not running into the same problem. It could also be that the linking with musl is done statically. I do link-time redirection of malloc & friends, so does this mean I redirect musl's malloc too? I think not, based on the order in which we link them, plus malloc is part of libc itself, right?

There are many threads in total: most are worker threads of the Acton Run Time System, plus some auxiliary threads, the GC threads (I'm running the parallel collector) and some for the libuv threadpool. I believe that collecting is OK from all the normal threads, i.e. I correctly hook thread creation to get the GC wrapping in place, but it's quite possible that we miss a thread created by GNU libc, if that's what we are seeing here.
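
To make the failure mode concrete, here is a toy model in C of a collector that only knows about threads that were registered with it. It is a sketch for illustration, not bdwgc's implementation: every `toy_*` name, `collect_in_new_thread` and `MAX_KNOWN` are invented here, and the real collector aborts where this code returns -1.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_KNOWN 64 /* arbitrary table size for this sketch */

static pthread_mutex_t known_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t known[MAX_KNOWN];
static int known_count;

/* Index of the calling thread in the table, or -1. Caller holds known_lock. */
static int find_self_locked(void) {
    pthread_t self = pthread_self();
    for (int i = 0; i < known_count; i++) {
        if (pthread_equal(known[i], self)) return i;
    }
    return -1;
}

/* Record the calling thread; returns false if the table is full. */
bool toy_register_current_thread(void) {
    bool ok = true;
    pthread_mutex_lock(&known_lock);
    if (find_self_locked() < 0) {
        if (known_count < MAX_KNOWN) known[known_count++] = pthread_self();
        else ok = false;
    }
    pthread_mutex_unlock(&known_lock);
    return ok;
}

/* Forget the calling thread, so a recycled thread ID is not mistaken for it. */
void toy_unregister_current_thread(void) {
    pthread_mutex_lock(&known_lock);
    int i = find_self_locked();
    if (i >= 0) known[i] = known[--known_count];
    pthread_mutex_unlock(&known_lock);
}

/* Stand-in for a collection started by the calling thread:
   0 if the thread is known, -1 where a real collector would abort. */
int toy_collect(void) {
    pthread_mutex_lock(&known_lock);
    int i = find_self_locked();
    pthread_mutex_unlock(&known_lock);
    return i >= 0 ? 0 : -1;
}

/* Thread body: a non-NULL argument means "register before collecting". */
static void *worker(void *arg) {
    bool do_register = (arg != NULL);
    if (do_register) toy_register_current_thread();
    int rc = toy_collect();
    if (do_register) toy_unregister_current_thread();
    return (void *)(intptr_t)rc;
}

/* Start one thread, optionally registering it, and return its
   toy_collect() result (-2 if the thread could not be created). */
int collect_in_new_thread(bool do_register) {
    static int flag;
    pthread_t tid;
    void *ret = NULL;
    if (pthread_create(&tid, NULL, worker, do_register ? &flag : NULL) != 0) return -2;
    pthread_join(tid, &ret);
    return (int)(intptr_t)ret;
}
```

A thread started through a hooked creation path corresponds to `collect_in_new_thread(true)`; a thread created behind the hook's back corresponds to `collect_in_new_thread(false)`, which is the case that fails.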

I'm puzzled - I can't be the only one using bdwgc in combination with DNS requests on GNU libc, right? Why is no one else running into this?

I wonder, would it be possible to simply cancel a GC collection run if we notice that the thread is an unknown thread? It feels wrong: my intuition tells me all threads need to be paused to avoid freeing something that is actively being used by some other thread. But perhaps GNU libc is written carefully enough, with a clever enough interface, and any internal threads are isolated enough that it could indeed work?

Any pointers? :)


ivmai commented Jun 27, 2023

The reason is that you redirect malloc but don't redirect (intercept) pthread_create everywhere.

> but it's quite possible that we miss a GNU libc created thread

Yes.
GC_USE_LD_WRAP or GC_USE_DLOPEN_WRAP might help you; please take a look at the bdwgc documentation.

> would it be possible to simply cancel a GC collection run in case we notice the thread is an unknown thread.

No! An unregistered thread should not call malloc or even just manipulate (load/store) pointers returned by malloc. Otherwise bdwgc may reclaim an object that is still in use.
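
For readers unfamiliar with intercepting pthread_create, the following C sketch shows one generic way to do it on a glibc-style system: define `pthread_create` in the program, look up the next definition with `dlsym(RTLD_NEXT, ...)`, and run the user's start routine through a trampoline. This is an assumption-laden illustration, not bdwgc's code and not necessarily how GC_USE_DLOPEN_WRAP works internally. `trampoline`, `wrapped_start_count` and `echo_arg` are hypothetical names, and the counter stands in for the place where a collector would register the new thread.

```c
#define _GNU_SOURCE /* needed for RTLD_NEXT on glibc */
#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

/* Number of threads whose start routine ran through the trampoline. */
static atomic_int wrapped_starts;

struct start_info {
    void *(*fn)(void *); /* the caller's start routine */
    void *arg;           /* the caller's argument */
};

/* Runs in the new thread before the caller's start routine. */
static void *trampoline(void *p) {
    struct start_info info = *(struct start_info *)p;
    free(p);
    /* A collector would register the new thread at this point. */
    atomic_fetch_add(&wrapped_starts, 1);
    return info.fn(info.arg);
}

/* Replacement for pthread_create, used by every caller that binds to it. */
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*fn)(void *), void *arg) {
    /* Find the next pthread_create in library search order (the libc one). */
    create_fn real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");
    if (real_create == NULL) return EAGAIN;

    struct start_info *info = malloc(sizeof *info);
    if (info == NULL) return EAGAIN;
    info->fn = fn;
    info->arg = arg;

    int rc = real_create(thread, attr, trampoline, info);
    if (rc != 0) free(info); /* the trampoline never ran, so free here */
    return rc;
}

/* How many threads have been started through the wrapper so far. */
int wrapped_start_count(void) {
    return atomic_load(&wrapped_starts);
}

/* Trivial start routine for trying the wrapper out. */
void *echo_arg(void *arg) {
    return arg;
}
```

Only calls that the linker binds to this definition pass through `trampoline`; whether calls made inside a given shared library are affected depends on how that library and the program were linked, which is the "everywhere" part of the problem discussed above.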


plajjan commented Jun 27, 2023

Thanks @ivmai. That is indeed right. I added back the dlopen wrapping and it works properly now.

I wonder about macOS though. Dlopen wrapping isn't supported there, IIRC. I hope its libc doesn't use threads for DNS, or I think I'm screwed.


plajjan commented Jun 27, 2023

Funny how it took me this long to run into this problem. I guess libc mostly doesn't use threads, so this isn't an issue most of the time, and I hadn't done enough DNS requests. Anyhow, it's fixed for Acton on Linux and I hope this isn't an issue on macOS.

That comment you wrote when you removed the error related to this is quite spot on! 6b73b6e

Still curious what the situation is on macOS.

Do you think it would be possible to adapt the dlopen wrapping code to run on macOS too, or is there some more fundamental limitation?


ivmai commented Jun 28, 2023

> Still curious what's the case on macOS.
> Do you think it would be possible to adapt the dlopen wrapping code to run on macOS too, or is it some more fundamental limitation?

I don't know (or at least I don't remember anyone reporting a related issue).
Darwin differs significantly from other platforms in many low-level aspects (thread suspension, mprotect-based virtual dirty bits, registration of dynamic libraries) and is subject to change from time to time.
Please check and report it if possible.
