
Segfault with rayon & moka #495

Open
tatsuya6502 opened this issue Feb 4, 2025 · 3 comments
@tatsuya6502 (Member)

crossbeam-rs/crossbeam#1175

I've observed segfaults while using moka with rayon. Code to reproduce: https://github.com/polachok/moka-crossbeam-bug. I see this happening on both macOS/arm64 and linux/amd64.

The backtrace looks like this:

#8  0x0000555555599ca4 in crossbeam_epoch::internal::Local::flush (self=0x7fffe8001300, guard=0x7ffff7969fc0) at src/internal.rs:376
#9  crossbeam_epoch::guard::Guard::flush (self=0x7ffff7969fc0) at src/guard.rs:294
#10 0x000055555556a29d in moka::sync_base::base_cache::BaseCache<u64, u64, std::hash::random::RandomState>::do_post_update_steps<u64, u64, std::hash::random::RandomState> (
    self=<optimized out>, ts=..., key=..., old_info=..., upd_op=...) at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/moka-0.12.10/src/sync_base/base_cache.rs:601

CC: @polachok

@tatsuya6502 (Member, Author) commented Feb 4, 2025

Hi. Thanks for reporting the issue. I tried your code briefly before going to work, but I'm afraid I could not reproduce it.

It crashes, but for a different reason from the one you reported:

$ cargo run --release
thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow
zsh: abort      cargo run --release

I gave names to the Rayon threads:

diff --git a/src/main.rs b/src/main.rs
index 704224a..9cd14fb 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -12,6 +12,7 @@ fn main() {
         .build();
     rayon::ThreadPoolBuilder::new()
         .num_threads(4)
+        .thread_name(|thread_id| format!("rayon-thread-{thread_id}"))
         .exit_handler(|thread_id| {
             println!("Thread '{}' exited", thread_id);
         })
Now the overflow message shows which Rayon thread hit it:

thread 'rayon-thread-3' has overflowed its stack
fatal runtime error: stack overflow

Then I tried increasing the stack size of the threads to 6 MB:

     rayon::ThreadPoolBuilder::new()
         .num_threads(4)
+        .thread_name(|thread_id| format!("rayon-thread-{thread_id}"))
+        .stack_size(6 * 1024 * 1024)
         .exit_handler(|thread_id| {
             println!("Thread '{}' exited", thread_id);
         })
thread 'rayon-thread-0' has overflowed its stack
fatal runtime error: stack overflow

I found that it runs fine when I increased the stack size to 7 MB:

     rayon::ThreadPoolBuilder::new()
         .num_threads(4)
+        .thread_name(|thread_id| format!("rayon-thread-{thread_id}"))
+        .stack_size(7 * 1024 * 1024)
         .exit_handler(|thread_id| {
             println!("Thread '{}' exited", thread_id);
         })
$ cargo run --release
...

$ echo $?
0

It seems that the spawned async tasks build up on the Rayon threads' stacks quickly(?).

Can you please check whether this works for you as well? You might also want to try increasing the stack size of the Rayon threads in your real program.
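
For reference, here is the Rayon pool configuration from the diffs above, consolidated into one sketch. The rest of the repro program is omitted, and the final .build() call is my assumption since the diffs cut off before it:

fn main() {
    // Consolidated from the diffs above. 7 MB avoided the overflow in my
    // runs on macOS/aarch64; the required size may differ per platform.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .thread_name(|thread_id| format!("rayon-thread-{thread_id}"))
        .stack_size(7 * 1024 * 1024)
        .exit_handler(|thread_id| {
            println!("Thread '{}' exited", thread_id);
        })
        // The terminal call is not shown in the diffs; .build() is an assumption.
        .build()
        .unwrap();

    // The repro's actual workload goes here.
    pool.install(|| {
        // ...
    });
}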


Environment:

  • Mac
    • macOS Sequoia 15.3
    • Apple M2 chip (4 high-performance cores, 4 high-efficiency cores)
  • Rust 1.84.0
    • The host and target: aarch64-apple-darwin

@tatsuya6502 (Member, Author)

I ran it on Linux x86_64 and got the same stack overflow. It was fixed after increasing the stack size to 6 MB.

I think the following Rayon issue explains the stack overflow errors.

From rayon-rs/rayon#854:

Rayon has implicit "recursion" due to work stealing. That is, whenever a rayon thread is blocked on the result from another rayon thread, it will look for other pending work to do in the meantime. That stolen work is executed directly from the same stack where it was blocked.

Your par_iter().for_each() becomes a bunch of nested joins, and each one of those may block if one half gets stolen to a new thread. Since stealing is somewhat random, the pool will have a mix of stolen joins and new spawns creating even more joins, and I can definitely see how that might get out of control. You're not doing anything wrong, but I'm not sure how to tame that.
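
To illustrate the shape described above (a hypothetical minimal sketch, not the repro code): a par_iter().for_each() whose items spawn further parallel work mixes stolen joins and new spawns on the same worker stacks.

use rayon::prelude::*;

fn main() {
    // Hypothetical shape, not the actual repro. for_each() is implemented as
    // nested joins; when one half of a join is stolen, the blocked worker
    // executes other pending work on the same stack, so frames accumulate.
    (0..1_000u64).into_par_iter().for_each(|i| {
        // Each item spawns yet more parallel work, adding to the mix of
        // stolen joins and new spawns described in the quote above.
        rayon::spawn(move || {
            (0..1_000u64).into_par_iter().for_each(|j| {
                std::hint::black_box(i.wrapping_mul(j));
            });
        });
    });
    // Note: spawned tasks are detached, so main may exit before they finish;
    // only the call shape matters for this illustration.
}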

@polachok commented Feb 8, 2025

Thanks, I think you're right and this can be closed.
I ran it under gdb, and it just shows Segmentation fault.
