-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ruby test suite failures with libffi-3.4.2 #102
Comments
I have reduced the reproducer:
I think this might be similar to ffi/ffi/issues/621 I have also reported this issue in Fedora: |
I tested this on Ubuntu (with clang and gcc) and macOS with clang using libffi version 3.4.2 but I'm not able to reproduce this. Since the problem started happening after upgrading libffi, could it be a problem with libffi rather than a problem with Fiddle? |
We looked at this issue closer, and it seems to be composed of SELinux, FFI Closure, and Forking the Ruby process. From the linked bug: "In modern libffi 3.4.2 we use memfd_create quite eagerly for SELinux protected systems and such closures are shared between the parent and child process." and "Closures may be inherited by the child, and so the parent and child must coordinate the usage.". Which would be consistent with our findings. See the following minimal reproducer that triggers the bug with Fedora 36 and later: require 'fiddle/closure'
require 'fiddle/struct'
Fiddle::Closure.new(Fiddle::TYPE_VOID, [])
fork { }
GC.start This describes all required to trigger the bug on the Ruby side. Note that SELinux plays a role in this as well since libffi takes a different approach to allocate the closure on SELinux systems. The reproducer demonstrates the following: The failure we observe in the Ruby test suite follows a very similar pattern to the reproducer when it fails for us in the build. First, some Fiddle tests that allocate Closures run, then some other test forks, and then the GC needs to be triggered. Some tests trigger them manually (e.g. TestHash#test_replace_bug15358), and some are very likely to trigger the GC simply by their memory usage (e.g. TestTrick2015#test_kinaba). We can see the problem is that this memory usage is not coordinated properly between parent and child process. The possible mitigations for this are:
To the last point, we also found that if the Fiddle::Closure is allocated in the child, it is not a problem for overall status of the test suite, therefore this approach could be used for tests using closure to ensure that the tests can run even with SELinux and whatnot. Additionally, with this approach, one more test would be created with the mentioned reproducer to test the integration of libffi on the current system. Using the previous reproducer: require 'fiddle/closure'
require 'fiddle/struct'
fork do
Fiddle::Closure.new(Fiddle::TYPE_VOID, [])
end
fork { }
GC.start This does not crash the process, unlike the previous reproducer. I noticed that Ruby tests include |
Thx for the patch. I leave the testing to @jackorp thx 😉 Nevertheless, this is fixing the specific use case in the test suite, while this is generic issue. Therefore I wonder if there is a way to e.g.:
I am not ffi expert, so apologies if I am completely off. |
Thanks for the patch! It improved the situation somewhat but not complemetely. After uncomentting and re-enabling the tests for Ruby and applying the patch I observe new issue in the build logs:
The Ruby test suite failed. Although without a segfault, that is an improvement. Interestingly there seems to be some other issue with this, as I have another build log:
Here, only a single test failed. Some builds do pass, but there are more of those that do not pass. I think sometimes there is a (uncollectable maybe?) reference to Fiddle::Closure introduced in the tests. I'll try some local testing and poking around to find what that reference is. Maybe asserting that there is either 0 or 1 reference in the ObjectSpace would be the way to go, but it requires lengthier testing on my side to confirm that this approach would not lead to a segfault due to the "random" nature of GC. |
Thanks for confirming the patch. closure = Fiddle::Closure.new(...)
closure.free # Free the closure immediately.
Fiddle::Closure.allocate(...) do |closure|
# ...
end # Free the allocated closure when the block is exited. But I don't have a good name for the feature for now... I consider the name in RubyKaigi 2022... (I give a talk today. I consider the name after my talk is finished.) |
Hmm, not better. I also applied the previous commit with However I have to note, in order to run the whole Ruby test suite together with the patches, I patch the fiddle shipped with Ruby 3.1.2. I may be missing a commit that makes all this works together. If my patch-only approach fails, I'll try importing fiddle from the newest master branch commit as a gem and see if that improves the situation. The work is happening on my downstream Fedora Ruby fork branch: https://src.fedoraproject.org/fork/jackorp/rpms/ruby/commits/fiddle_closure_test Failed build: https://koji.fedoraproject.org/koji/taskinfo?taskID=91998265
|
GitHub: GH-102 This also improves freed closures assertions.
GitHub: GH-102 This also improves freed closures assertions.
Sorry. One more commit: 2530496 |
Thanks! The state is Much better now. However this error from the logs is puzzling:
I don't understand how it does not know about assert_false... I assume Ruby only has the test-unit available in the tool/lib/test/unit during the tests phase. That one does not have Alternatively, consider something like This seems like the single error so far. I'll try out more builds in parallel later, just to be more sure. |
For now I am using the following patch for From e1221297fb0177d98c8670e5490c8131227621d5 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei <kou@clear-code.com>
Date: Thu, 15 Sep 2022 06:40:31 +0900
Subject: [PATCH 2/6] test: don't use power-assert
It seems that we can't use it in ruby/ruby.
---
test/fiddle/test_closure.rb | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/test/fiddle/test_closure.rb b/test/fiddle/test_closure.rb
index 1726db7..13dfa5b 100644
--- a/test/fiddle/test_closure.rb
+++ b/test/fiddle/test_closure.rb
@@ -83,13 +83,9 @@ module Fiddle
def test_free
Closure.create(:int, [:void]) do |closure|
- assert do
- not closure.freed?
- end
+ assert_equal(closure.freed?, false)
closure.free
- assert do
- closure.freed?
- end
+ assert_equal(closure.freed?, true)
closure.free
end
end
--
2.37.3
Note I changed the So far 4 arch specific builds passed with the changes: https://koji.fedoraproject.org/koji/taskinfo?taskID=92035826 I'll run more builds to confirm the results, but this seems successful so far. |
Ran 5 builds with 4 different arches each, only problem was unrelated failure on aarch64 on a few of those. Otherwise, nothing fiddle related! |
GitHub: GH-102 They aren't available in ruby/ruby.
Thanks! |
GitHub: fix GH-102 We can't use Fiddle::Closure before we fork the process. If we do it, the process may be crashed with SELinux. See ruby/fiddle#102 (comment) for details. Reported by Vít Ondruch. Thanks!!! ruby/fiddle@1343ac7a95
GitHub: fix GH-102 It's for freeing a closure explicitly. We can't use Fiddle::Closure before we fork the process. If we do it, the process may be crashed with SELinux. See ruby/fiddle#102 (comment) for details. Reported by Vít Ondruch. Thanks!!! ruby/fiddle@a0ccc6bb1b
Fiddle::Closure object is making use of FFI closure from libffi. When such object is created (instantiated) in Ruby, and then the process forks on an SELinux-enabled system, the memory will become corrupted. That is usually not a problem until the The garbage collector sweeps the object and tries to free it, in which case the Ruby process will fail with signal SIGABRT. Tests in test/fiddle/test_closure.rb, test/fiddle/test_func.rb, and test/fiddle/test_function.rb use the `Fiddle::Closure` class directly and fiddle/test_import.rb use the class indirectly through `bind_function` method, therefore they are disabled to prevent introducing the problematic object into the Ruby GC during test suite execution instead of relying on that fork and subsequent garbage collection will not happen. If an FFI closure object is allocated in Ruby and the `fork` function is used afterward, the memory pointing to the closure gets corrupted, and if Ruby GC tries to collect the object in that state, a SIGABRT error occurs. The minimal Ruby reproducer for the issue is the following: ~~~ $ cat fiddle_fork.rb require 'fiddle/closure' require 'fiddle/struct' Fiddle::Closure.new(Fiddle::TYPE_VOID, []) fork { } GC.start ~~~ We allocate an unused Closure object, so it is free for the GC to pick up. Before we call `GC.start` we fork the process as that corrupts the memory. Running this with ruby-3.1.2-167.fc37.x86_64 on SELinux enabled system: ~~~ $ ruby fiddle_fork.rb Aborted (core dumped) ~~~ Such issues may appear at random (depending on the use of forking and GC) in larger applications that use Fiddle::Closure but can be spotted by the following functions appearing in the coredump backtrace: ~~~ 0x00007f6284d3e5b3 in dlfree (mem=<optimized out>) at ../src/dlmalloc.c:4350 0x00007f6284d6d0b1 in dealloc () from /usr/lib64/ruby/fiddle.so 0x00007f6295e432ec in finalize_list () from /lib64/libruby.so.3.1 0x00007f6295e43420 in finalize_deferred.lto_priv () from /lib64/libruby.so.3.1 0x00007f6295e4ff1c in gc_start_internal.lto_priv () from /lib64/libruby.so.3.1 ~~~ Possible solutions to prevent Ruby from crashing: * Do not use Fiddle::Closure. * Use the Fiddle::Closure object only in isolated subprocess that will not fork further. * Enable static trampolines in libffi as noted in bugzilla comment: <https://bugzilla.redhat.com/show_bug.cgi?id=2040380#c9> See related discussion on <https://bugzilla.redhat.com/show_bug.cgi?id=2040380> Ruby upstream ticket: <https://bugs.ruby-lang.org/issues/18914> Ruby Fiddle ticket: <ruby/fiddle#102>
Fiddle::Closure object is making use of FFI closure from libffi. When such object is created (instantiated) in Ruby, and then the process forks on an SELinux-enabled system, the memory will become corrupted. That is usually not a problem until the The garbage collector sweeps the object and tries to free it, in which case the Ruby process will fail with signal SIGABRT. Tests in test/fiddle/test_closure.rb, test/fiddle/test_func.rb, and test/fiddle/test_function.rb use the `Fiddle::Closure` class directly and fiddle/test_import.rb use the class indirectly through `bind_function` method, therefore they are disabled to prevent introducing the problematic object into the Ruby GC during test suite execution instead of relying on that fork and subsequent garbage collection will not happen. If an FFI closure object is allocated in Ruby and the `fork` function is used afterward, the memory pointing to the closure gets corrupted, and if Ruby GC tries to collect the object in that state, a SIGABRT error occurs. The minimal Ruby reproducer for the issue is the following: ~~~ $ cat fiddle_fork.rb require 'fiddle/closure' require 'fiddle/struct' Fiddle::Closure.new(Fiddle::TYPE_VOID, []) fork { } GC.start ~~~ We allocate an unused Closure object, so it is free for the GC to pick up. Before we call `GC.start` we fork the process as that corrupts the memory. Running this with ruby-3.1.2-167.fc37.x86_64 on SELinux enabled system: ~~~ $ ruby fiddle_fork.rb Aborted (core dumped) ~~~ Such issues may appear at random (depending on the use of forking and GC) in larger applications that use Fiddle::Closure but can be spotted by the following functions appearing in the coredump backtrace: ~~~ 0x00007f6284d3e5b3 in dlfree (mem=<optimized out>) at ../src/dlmalloc.c:4350 0x00007f6284d6d0b1 in dealloc () from /usr/lib64/ruby/fiddle.so 0x00007f6295e432ec in finalize_list () from /lib64/libruby.so.3.1 0x00007f6295e43420 in finalize_deferred.lto_priv () from /lib64/libruby.so.3.1 0x00007f6295e4ff1c in gc_start_internal.lto_priv () from /lib64/libruby.so.3.1 ~~~ Possible solutions to prevent Ruby from crashing: * Do not use Fiddle::Closure. * Use the Fiddle::Closure object only in isolated subprocess that will not fork further. * Enable static trampolines in libffi as noted in bugzilla comment: <https://bugzilla.redhat.com/show_bug.cgi?id=2040380#c9> See related discussion on <https://bugzilla.redhat.com/show_bug.cgi?id=2040380> Ruby upstream ticket: <https://bugs.ruby-lang.org/issues/18914> Ruby Fiddle ticket: <ruby/fiddle#102>
Hi, libffi-3.4.2 recently landed in Fedora and since that time, I observe strange failures in Ruby test suite:
Executing the test on itself, they passes just fine, but testing e.g.
test/ruby/test_autoload.rb -v -n '/TestAutoload#test_autoload_fork/'
together withtest/fiddle/test_import.rb
makes the reproducer smaller.Trying to reduce the issue even further, I have reduced the test/ruby/test_import.rb into the following shape:
And use just miniruby:
The text was updated successfully, but these errors were encountered: