Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce SchedulingStateMachine for unified scheduler #129

Merged
merged 54 commits into from
Apr 4, 2024

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Mar 7, 2024

Migrated solana-labs#35286.

(let's finish up the review at the original pr; then just r+ on this pr; sans unfortunate rebase due to ci; this branch will be sync with the original branch)

perf numbers

before:

[2024-04-05T03:35:52.628226692Z INFO  solana_ledger::blockstore_processor] ledger processed in 37 seconds, 775 ms, 302 µs and 825 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400

after:

[2024-04-05T03:24:34.316539714Z INFO  solana_ledger::blockstore_processor] ledger processed in 21 seconds, 643 ms, 40 µs and 195 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400

@ryoqun ryoqun changed the title Scheduling state machine for agave Introduce SchedulingStateMachine for unified scheduler Mar 7, 2024
@codecov-commenter
Copy link

codecov-commenter commented Mar 7, 2024

Codecov Report

Attention: Patch coverage is 96.96510% with 20 lines in your changes are missing coverage. Please review.

Project coverage is 81.9%. Comparing base (cc3afa5) to head (8c0c6bd).
Report is 56 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master     #129    +/-   ##
========================================
  Coverage    81.9%    81.9%            
========================================
  Files         837      837            
  Lines      226823   227455   +632     
========================================
+ Hits       185828   186415   +587     
- Misses      40995    41040    +45     

@alessandrod
Copy link

I've just left my review on solana-labs#35286 (review) which is what I was tagged on. Hopefully this PR matches the same code 😅

@ryoqun
Copy link
Member Author

ryoqun commented Mar 18, 2024

... Hopefully this PR matches the same code 😅

I'll keep them in-sync. :) fyi, you can confirm my diligence by this:

$ git range-diff ryoqun/scheduling-state-machine...ryoqun/scheduling-state-machine-for-agave

@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch 2 times, most recently from 86b27ce to 9e07dcd Compare March 18, 2024 13:45
@alessandrod alessandrod self-requested a review March 22, 2024 07:23
///
/// Note that lifetime of the acquired reference is still restricted to 'self, not
/// 'token, in order to avoid use-after-free undefined behaviors.
pub(super) fn with_borrow_mut<R>(
Copy link

@alessandrod alessandrod Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might be missing something, but what prevents this

            struct FakeQueue {
                v: Vec<u8>,
            }

            let queue = Arc::new(TokenCell::new(FakeQueue { v: Vec::new() }));

            let q1 = Arc::clone(&queue);
            std::thread::spawn(move || {
                let mut token = unsafe { Token::<FakeQueue>::assume_exclusive_mutating_thread() };
                q1.with_borrow_mut(&mut token, |queue| {
                    // this is UB
                    let v = &mut queue.v;
                });
            });
            std::thread::spawn(move || {
                let mut token = unsafe { Token::<FakeQueue>::assume_exclusive_mutating_thread() };
                queue.with_borrow_mut(&mut token, |queue| {
                    // this is UB
                    let v = &mut queue.v;
                });
            });

There's one token per thread, so I'm respecting the Token::assume_blah()
contract. I have a token, so I'm respecting the TokenCell::with_borrow_mut contract.

Copy link
Member Author

@ryoqun ryoqun Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, the above example is violating this #129 (comment):

    /// By extension, it's allowed to create _multiple_ tokens in a _single_ process as long as no
    /// instance of [`TokenCell`] is shared by multiple instances of [`Token`].

i.e. the Arc-ed TokenCell is shared by the tokens in the two threads.

as a quick recap, the objective of this fancy TokenCell is to offload creation of them to multiple threads. those threads doesn't need mutable references, so no tokens needed for them. on the other hand, the scheduler thread needs mutability. and it's the sole thread to mutate all of those objects created ever. in this way, i want to avoid Mutex-ing things.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-read that comment a few times and I still don't understand it to mean what you suggest it should mean. I can't create multiple Token's in the same thread because the constructor asserts that. And the Arc is shared among threads the same way UsageQueue is shared among threads. I understand what you're saying, but from the API and docs you're providing me, I don't think I'm doing anything wrong by sharing the TokenCell among threads, I thought that was the whole point.

as a quick recap, the objective of this fancy TokenCell is to offload creation of them to multiple threads. those threads doesn't need mutable references, so no tokens needed for them. on the other hand, the scheduler thread needs mutability. and it's the sole thread to mutate all of those objects created ever. in this way, i want to avoid Mutex-ing things.

Ok, but now you've created TokenCell, which claims to be Send and Sync, but it's in fact not Sync. The documentation claims that the associated Token API is what ensures that there's ever only one live mutable access to the inner value, when it doesn't. In fact as far as I can tell, the token just acts as an inconvenience/deterrent within one thread, so callers are somehow alerted that something dangerous is happening. But then the same could be achieved without a token at all, just by calling the accessors borrow_but_this_is_really_dangerous() borrow_mut_this_is_even_more_dangerous().

I don't know enough about how these UsageQueues are being instantiated, used etc (feel free to link to code if you want to), but my feeling is that we should be using a Mutex, wrapped in a wrapper that always uses try_lock and asserts that the Mutex is in fact never contended. That way you create a Sync API that is 100% safe, at the cost of a compare and swap, and - again without currently knowing much about usage patterns - I'd bet 1 BONK that it wouldn't have a significant performance impact.

Alternatively if you don't want to pay the price of CAS at all, I think you should look into moving these UsageQueues by value so that they don't need to be Sync at all. IMO creating an API which looks and sounds safe to be used by multiple threads, but in fact very easily introduces UB when used by multiple threads unless "held correctly", is not something we should do, especially in new code.

Copy link

@apfitzge apfitzge Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given hyper-sensitivity of scheduler it might be that the Mutex::try_lock CAS does have an impact - but would be good to measure that impact if we can avoid the unsafe here.

I think the ultimate usage of the unsafe-ness works as currently written, because we only allow token access through &mut self fns on the SchedulingStateMachine. But introducing something that other devs are confused about makes this more challenging to maintain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alessandrod and @apfitzge: really thanks for spending time on reviewing this.

I've re-read that comment a few times and I still don't understand it to mean what you suggest it should mean. I can't create multiple Token's in the same thread because the constructor asserts that. And the Arc is shared among threads the same way UsageQueue is shared among threads. I understand what you're saying, but from the API and docs you're providing me, I don't think I'm doing anything wrong by sharing the TokenCell among threads, I thought that was the whole point.

okay, in this case, i think doc is worded badly. I'll try to rearrange it.

as a quick recap, the objective of this fancy TokenCell is to offload creation of them to multiple threads. those threads doesn't need mutable references, so no tokens needed for them. on the other hand, the scheduler thread needs mutability. and it's the sole thread to mutate all of those objects created ever. in this way, i want to avoid Mutex-ing things.

Ok, but now you've created TokenCell, which claims to be Send and Sync, but it's in fact not Sync. The documentation claims that the associated Token API is what ensures that there's ever only one live mutable access to the inner value, when it doesn't.

as long as the (poorly-written) user contract is properly adhered, there should only be one.

In fact as far as I can tell, the token just acts as an inconvenience/deterrent within one thread, so callers are somehow alerted that something dangerous is happening. But then the same could be achieved without a token at all, just by calling the accessors borrow_but_this_is_really_dangerous() borrow_mut_this_is_even_more_dangerous().

yep. it's just deterrent. however, consider this:

this would pass compilation. yet this is ub because of rust's aliasing rule violation:

// assume arced_token_cell111 and arced_token_cell222 could point to same instance.
let a = arced_token_cell111.borrow_mut_this_is_even_more_dangerous();

for ... {
    let b = arced_token_cell222.borrow_mut_this_is_even_more_dangerous();
}

but, the inconvenient TokenCell can prevent this from happening accidentally. in fact, i did so while coding. that's why i introduced this because it has no runtime cost.

I don't know enough about how these UsageQueues are being instantiated, used etc (feel free to link to code if you want to), but my feeling is that we should be using a Mutex, wrapped in a wrapper that always uses try_lock and asserts that the Mutex is in fact never contended. That way you create a Sync API that is 100% safe, at the cost of a compare and swap, and - again without currently knowing much about usage patterns - I'd bet 1 BONK that it wouldn't have a significant performance impact.

sadly, even a single cas will have a significant performance impact: solana-labs#35286 (comment)

this piece of code must wade through 1 million transactions or more per sec for line-rate at the very least to find most-paying transactions, when it is sited at the front of the banking stage (in the future). And more throughput is better. and there's no enough. ;)

Alternatively if you don't want to pay the price of CAS at all, I think you should look into moving these UsageQueues by value so that they don't need to be Sync at all. IMO creating an API which looks and sounds safe to be used by multiple threads, but in fact very easily introduces UB when used by multiple threads unless "held correctly", is not something we should do, especially in new code.

UsasaQueue::blocked_usages must be some kind of a collection. Currently, vanilla VecDequeue. so, move by value isn't a viable option, either...

Given hyper-sensitivity of scheduler it might be that the Mutex::try_lock CAS does have an impact - but would be good to measure that impact if we can avoid the unsafe here.

the non-trivial overhead is measured here as a cas in Arc, (basically same): solana-labs#35286 (comment)

I think the ultimate usage of the unsafe-ness works as currently written, because we only allow token access through &mut self fns on the SchedulingStateMachine. But introducing something that other devs are confused about makes this more challenging to maintain.

i admit this is confusing as hell. but i want to ship it for the merits. I'll strive TokenCell to be documented clearly and it's not general purpose library. its use is very local to the unified scheduler code only.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VecDeque is actually pretty fast to move? In fact, doesn't this whole problem go away if instead of Arcs to share between other threads and the state machine you move Boxes back and forth?

Comment on lines +191 to +192
/// By extension, it's allowed to create _multiple_ tokens in a _single_ process as long as no
/// instance of [`TokenCell`] is shared by multiple instances of [`Token`].
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

@alessandrod
Copy link

What is the advantage of having UsageQueueLoader outside SchedulingStateMachine? In the code that exists right now UsageQueueLoader is a dashmap, which is a type with horrible performance characteristics. It seems to me that if you move it inside SchedulingStateMachine then it can be a regular hash map, tokencell isn't needed, and everything gets faster?

@ryoqun
Copy link
Member Author

ryoqun commented Mar 23, 2024

UsageQueueLoader is a dashmap, which is a type with horrible performance characteristics.

you mean complaints like this? solana-labs#34955

i haven't evaluated DashMap in depth. just picked it as it's popular in the codebase. so, my mind hasn't set in dashmap. I just need a fast & scalable multithreaded hash table.

It seems to me that if you move it inside SchedulingStateMachine then it can be a regular hash map, tokencell isn't needed

this is correct so far... yet...

..., and everything gets faster?

this isn't.

I don't think a single-threaded vanilla HashMap can compete with 20-threaded DashMap-ed loader. The multi-threaded loader only incurs overhead of 1 crossbeam::Channel::recv(), which is exactly non-contended 2 cas per a task for the scheduler thread under busy-looping or saturation for this MPSC channel arrangement, regardless of the number of accounts.

On the other hand, a tx with 100 accounts needs 100 HashMap::<Pubkey, _> lookups at the very least (excluding the insertion cost) for the single-threaded approach. Even Hash::hash()-ing a single Pubkey can be slower than 2 cas.

As for the need of this much of low latency, i documented here: https://github.com/solana-labs/solana/pull/34676/files#r1536585166

@alessandrod
Copy link

alessandrod commented Mar 23, 2024

I don't think a single-threaded vanilla HashMap can compete with 20-threaded DashMap-ed loader. The multi-threaded loader only incurs overhead of 1 crossbeam::Channel::recv(), which is exactly non-contended 2 cas per a task for the scheduler thread under busy-looping or saturation for this MPSC channel arrangement, regardless of the number of accounts.

This analysis is missing the dashmap overhead, which is pretty terrible. Dashmap partitions the key space into shards, then shards are protected with rwlocks. If you look at dashmap usage in the validator now, it's a futex sleeping fest.

Have you explored moving things (eg Box) instead of using Arc and inner mutability? Why didn't that work?

EDIT: Box doesn't work, because you do want to share outside the state machine.

I'll try to play around with the code and get a better feel for the API.

@ryoqun
Copy link
Member Author

ryoqun commented Mar 23, 2024

I don't think a single-threaded vanilla HashMap can compete with 20-threaded DashMap-ed loader. The multi-threaded loader only incurs overhead of 1 crossbeam::Channel::recv(), which is exactly non-contended 2 cas per a task for the scheduler thread under busy-looping or saturation for this MPSC channel arrangement, regardless of the number of accounts.

This analysis is missing the dashmap overhead, which is pretty terrible. Dashmap partitions the key space into shards, then shards are protected with rwlocks. If you look at dashmap usage in the validator now, it's a futex sleeping fest.

as you said dashmap looks not promising. I wonder why this crate is used in the validator in the first place. ;) I went to crate shopping. how about leapfrog? the bench number seems to be promising. https://github.com/robclu/conc-map-bench?tab=readme-ov-file#read-heavy-std-hasher.

I'll try to play around with the code and get a better feel for the API.

oh, really thanks for the effort.

@ryoqun
Copy link
Member Author

ryoqun commented Mar 24, 2024

as you said dashmap looks not promising. I wonder why this crate is used in the validator in the first place. ;) I went to crate shopping. how about leapfrog? the bench number seems to be promising. https://github.com/robclu/conc-map-bench?tab=readme-ov-file#read-heavy-std-hasher.

fyi, i've run the popular bench (conc-map-bench) with Pubkey as keys. My casual finding is that leapfrog is faster in deeed. but both it and dashmap are likewisely scalable and greatly faster than hashmap. notably, i didn't see any noticeable perf degradation in dashmap:

throughput

latency

also, note that ReadHeavy is appropriate bench scenario for UsageQueueLoader. as looking already existing entry up will by far the most common operation as our account access pattern is so skewed to certain hot ones.

@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from 8c0c6bd to 6157e84 Compare April 2, 2024 12:42
@ryoqun
Copy link
Member Author

ryoqun commented Apr 2, 2024

btw, I made solana-unified-scheduler-logic tests run under miri in ci (6157e84, BLAKE3-team/BLAKE3#387, #488, #534). As far as i fiddled with the code abit locally, miri can detect all aliasing violations around UnsafeCell, which didn't caught at compile-time. hope this fact can alleviate the tension of unsafe... In other words, ci now can detect UBs, should there be now due to bugs in TokenCell or misuse at the callsite.

@ryoqun
Copy link
Member Author

ryoqun commented Apr 3, 2024

#129 (comment)

I'll try to play around with the code and get a better feel for the API.

@alessandrod How's going with this? I wonder i can merge the pr at least for now to unblock my work... i know there's some disagreement about the general design and justification of unsafe. I'm appreciating you finding of the potential use-after-free UB. that said, i think this is ub-free now, right (it's covered now even by miri)? Fortunately, there's no api compatibility thing, so we can freely change impl later as we like. or, do i absolutely have to back off the unsafe parts? (cc: @apfitzge)

(sorry, i have to ping you here from now on, because i got inquiries from multiple people about the progress of this pr...)

@apfitzge
Copy link

apfitzge commented Apr 3, 2024

Given miri UB detection, I think I'm fine to merge this if @alessandrod is as well. I was happy with the other parts of implementation last I reviewed, and mainly had concerns about the unsafe aspects getting changes and becoming invalid. Now that miri will check that as part of ci, we can have some more confidence we won't break it.

@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from 7b2dd3a to 29066e8 Compare April 4, 2024 06:24
@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from 29066e8 to b851631 Compare April 4, 2024 06:36
@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from b851631 to f681fde Compare April 4, 2024 06:48
@alessandrod
Copy link

@alessandrod How's going with this? I wonder i can merge the pr at least for now to unblock my work

Sorry been looking into quic congestion issues this week. Yeah sure I have no problem to merge as is if it's blocking you.

@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from bf19014 to ffa1857 Compare April 4, 2024 13:11
@ryoqun
Copy link
Member Author

ryoqun commented Apr 4, 2024

@alessandrod How's going with this? I wonder i can merge the pr at least for now to unblock my work

... Yeah sure I have no problem to merge as is if it's blocking you.

thanks for the reply and jumping on code-review of unfamiliar code. your insightful review comments made this pr better in various ways.

I was happy with the other parts of implementation last I reviewed, and mainly had concerns about the unsafe aspects getting changes and becoming invalid. Now that miri will check that as part of ci, we can have some more confidence we won't break it.

@apfitzge thanks for positive feedback. it seems the time has finally come. :) I'm in the ship-it mood very much.. lol

could you review the new commits since you reviewed, starting from 9eed18b if you haven't? btw, I just added negative test cases for miri: f681fde, ffa1857

Copy link

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked at commits since my last review. Looks good, like the changes you and @alessandrod made!

@ryoqun ryoqun merged commit 0b9c637 into anza-xyz:master Apr 4, 2024
50 checks passed
@ryoqun ryoqun changed the title Introduce SchedulingStateMachine for unified scheduler Introduce SchedulingStateMachine for unified scheduler Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants