Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure ruby_xfree won't segfault if called after ruby_vm_destruct #7663

Conversation

flavorjones
Copy link
Contributor

@flavorjones flavorjones commented Apr 5, 2023

This commit modifies ruby_sized_xfree to handle when ruby_current_vm_ptr (a.k.a. GET_VM()) is NULL by calling ruby_mimfree to free the memory without using rb_objspace.

[Bug #19580]

Rationale

The POSIX Threads API provides the ability to define a destructor to clean up thread-local storage at thread exit by using pthread_key_create to bind a storage key and an associated destructor function, and pthread_setspecific to associate a value with that key. If a C extension is using ruby_xmalloc and ruby_xfree to manage memory, then the destructor function will likely call ruby_xfree.

There is a small window of time as a process is exiting -- after ruby_vm_destruct but before the process exits -- in which a thread may exit and the destructor function called, leading to a segfault.

Real-world use case

The real-world scenario motivating this change is libxml2's pthread code which uses pthread_key_create to set up a destructor that is called at thread exit to clean up thread-local storage associated with the thread by pthread_setspecific.

Nokogiri has, since 2009, configured libxml2 to use ruby_xmalloc and ruby_xfree so that Ruby's memory management is aware of the allocated memory. As a result, ruby_xfree is being called by the pthread destructor, which may lead to a segfault that looks like:

[BUG] Segmentation fault at 0x0000000000000420
ruby 3.3.0dev (2023-03-19T06:13:25Z master ca9355e173) [x86_64-linux]

-- Machine register context ------------------------------------------------
 RIP: 0x00007f9b6311124e RBP: 0x00007f9b58001530 RSP: 0x00007f9b5d8eed60
 RAX: 0x0000000000000000 RBX: 0x00007f9b58001888 RCX: 0x00000000000003ff
 RDX: 0x00007f9b5dad62b0 RDI: 0x00007f9b58007090 RSI: 0x0000000000000000
  R8: 0x00007f9b5d8eed64  R9: 0x00000000000000ca R10: 0x0000000000000000
 R11: 0x0000000000000246 R12: 0x00007f9b58007090 R13: 0x00007f9b62e1bae8
 R14: 0x0000000000000004 R15: 0x00007f9b5d8efb58 EFL: 0x0000000000010206

-- C level backtrace information -------------------------------------------
.../3.3.0-dev/lib/libruby.so.3.3(rb_print_backtrace+0xd) [0x7f9b632e6f1f] /home/flavorjones/code/oss/ruby/vm_dump.c:785
.../3.3.0-dev/lib/libruby.so.3.3(rb_vm_bugreport) /home/flavorjones/code/oss/ruby/vm_dump.c:1101
.../3.3.0-dev/lib/libruby.so.3.3(rb_bug_for_fatal_signal+0xf4) [0x7f9b630eead4] /home/flavorjones/code/oss/ruby/error.c:813
.../3.3.0-dev/lib/libruby.so.3.3(sigsegv+0x4d) [0x7f9b63233bbd] /home/flavorjones/code/oss/ruby/signal.c:964
/lib/x86_64-linux-gnu/libc.so.6(0x7f9b62c42520) [0x7f9b62c42520]
.../3.3.0-dev/lib/libruby.so.3.3(ruby_sized_xfree+0x3) [0x7f9b6311124e] /home/flavorjones/code/oss/ruby/gc.c:12666
.../3.3.0-dev/lib/libruby.so.3.3(ruby_sized_xfree) /home/flavorjones/code/oss/ruby/gc.c:12663
.../lib/nokogiri/nokogiri.so(xmlResetError+0x0) [0x7f9b5da6ab4a] .../tmp/x86_64-linux/nokogiri/3.0.5/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.10.3/libxml2-2.10.3/error.c:880
.../lib/nokogiri/nokogiri.so(xmlResetError) .../tmp/x86_64-linux/nokogiri/3.0.5/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.10.3/libxml2-2.10.3/error.c:873
.../lib/nokogiri/nokogiri.so(xmlResetError) (null):0
.../lib/nokogiri/nokogiri.so(xmlFreeGlobalState+0x14) [0x7f9b5dad62c4] .../tmp/x86_64-linux/nokogiri/3.0.5/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.10.3/libxml2-2.10.3/threads.c:548
/lib/x86_64-linux-gnu/libc.so.6(__nptl_deallocate_tsd+0x77) [0x7f9b62c91711] ./nptl/nptl_deallocate_tsd.c:73
/lib/x86_64-linux-gnu/libc.so.6(__nptl_deallocate_tsd) ./nptl/nptl_deallocate_tsd.c:22
/lib/x86_64-linux-gnu/libc.so.6(start_thread+0x17a) [0x7f9b62c949ca] ./nptl/pthread_create.c:453
[0x7f9b62d26a00]

I've reproduced this in Ruby 3.0, 3.1, 3.2, and master by using the following ruby code:

# must have an error in it to cause pthread_setspecific to be called
html = "<div foo='asdf>asdf</div>"

Thread.new { Nokogiri::HTML4::Document.parse(html) }
sleep 3 # THREAD_CACHE_TIME

exit 0

The timing is admittedly hard to reproduce, but Gitlab has seen this in their CI pipelines a few dozen times and it's made easier if the process lifetime is extended by an atexit-registered function.

gc.c Show resolved Hide resolved
@flavorjones flavorjones force-pushed the flavorjones-handle-free-after-objspace-free branch from 03ac3f3 to 05aa728 Compare April 5, 2023 14:01
[Bug #19580]

The real-world scenario motivating this change is libxml2's pthread
code which uses `pthread_key_create` to set up a destructor that is
called at thread exit to free thread-local storage.

There is a small window of time -- after ruby_vm_destruct but before
the process exits -- in which a pthread may exit and the destructor is
called, leading to a segfault.

Please note that this window of time may be relatively large if
`atexit` is being used.
@flavorjones flavorjones force-pushed the flavorjones-handle-free-after-objspace-free branch from 05aa728 to e38a82f Compare April 5, 2023 14:04
Copy link
Member

@peterzhu2118 peterzhu2118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!

@peterzhu2118 peterzhu2118 merged commit 52e571f into ruby:master Apr 5, 2023
@flavorjones flavorjones deleted the flavorjones-handle-free-after-objspace-free branch April 5, 2023 18:21
maxlazio pushed a commit to gitlabhq/omnibus-gitlab that referenced this pull request Jun 16, 2023
As sparklemotion/nokogiri#2785 (comment)
explains, there is a bug in the Ruby interpreter
(https://bugs.ruby-lang.org/issues/19580) that has been fixed upstream
(ruby/ruby#7663) that causes a seg fault
during shutdown with libxml2/Nokogiri.

We patched the Ruby interpreter in CI to work around the problem
(https://gitlab.com/gitlab-org/gitlab-build-images/-/merge_requests/672)
in https://gitlab.com/gitlab-org/gitlab/-/issues/390313, but it
appears the seg faults have appeared in production now. On GitLab.com,
this week we have seen more than 20 cases with the error:

```
[BUG] Segmentation fault at 0x0000000000000440
```

We could also work around this problem by setting
`NOKOGIRI_LIBXML_MEMORY_MANAGEMENT=default`, but this may cause
unexpected memory growth since Ruby no longer manages the memory (see
https://github.com/sparklemotion/nokogiri/pull/2843/files).

Let's just fix the interpreter since we'd also need to make sure that
environment variable is set in any environment that uses Nokogiri
(including Rake tasks).

Changelog: fixed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants