Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deterministic simulation for sui tests #4429

Merged
merged 32 commits into from
Sep 9, 2022
Merged

Deterministic simulation for sui tests #4429

merged 32 commits into from
Sep 9, 2022

Conversation

mystenmark
Copy link
Contributor

@mystenmark mystenmark commented Sep 1, 2022

This PR enables running tests in a deterministic simulator. To use:

   ./scripts/simtest/install.sh # install the cargo-simtest command
   cargo simtest # build and run tests in simulator mode

The simulator itself is at https://github.com/MystenLabs/mysten-simulator (in a separate repo so that it can be integrated into narwhal as well). The simulator provides a drop-in replacement for tokio which is patched in via cargo's patch mechanism - but only when running the tests. Normal builds use normal tokio.

How it works, in brief:

  • The simulator contains:
    • a network simulator + TcpSocket/TcpStream/TcpListener implementations to deliver all network traffic within the simulation with deterministic delays.
    • a deterministic, randomized executor.
  • The simulator also contains a replacement for getrandom() and getentropy() in order to transparently provide deterministic behavior from OsRng.
  • Every test is run in a new thread - because HashSet/HashMap use thread local state (seeded from getrandom()), we get deterministic hash container behavior for free.

Two new macros are provided:

  • #[sui_test] - Runs the test as a #[tokio::test] under cargo test and a simulator test under cargo simtest.
  • #[sim_test] - Marks the test as ignored for cargo test, but runs it under cargo simtest

Tests defined with #[tokio::test] are ignored under simtest.

Tests defined with #[test] are run in both. I would like to ignore them with simtest but I don't know how to do that.

Most of the code in this PR is boilerplate necessary to handle configuration differences for swarm.

Remaining work for subsequent PRs:

  • Fix non-determinism caused by EventStore (currently disabled in simulator)
  • Network configuration (e.g. latency, partitions, etc)
  • Move more tests to #[sui_test] and/or #[sim_test]
  • Docs
  • Github job that runs a subset of tests (e.g. the cluster tests) repeatedly, to search for rare bugs.

@bmwill
Copy link
Contributor

bmwill commented Sep 1, 2022

I haven't taken a deep look at this yet, i'll need to book some time but here are some initial questions given i know very little about what you've done so far.

The simulator provides a drop-in replacement for tokio which is patched in via cargo's patch mechanism - but only when running the tests.

I thought we had found a way that didn't require forking tokio?

a network simulator + TcpSocket/TcpStream/TcpListener implementations to deliver all network traffic within the simulation with deterministic delays.

What about Udp given i'm actively migrating things off of tcp atm.

How does this handle things that happen inside of 3rd party code? Does it also require that we fork those third-party libraries?

@mystenmark
Copy link
Contributor Author

I haven't taken a deep look at this yet, i'll need to book some time but here are some initial questions given i know very little about what you've done so far.

The simulator provides a drop-in replacement for tokio which is patched in via cargo's patch mechanism - but only when running the tests.

I thought we had found a way that didn't require forking tokio?

It mostly doesn't:

  • Production code continues using mainline tokio. One of my primary goals was to have the minimal impact on what code is actually compiled into our release builds.
  • The simulator isn't a fork of tokio, but a reimplementation of certain pieces of tokio (mainly the executor and networking), which delegates as much as possible back to "real" tokio.
  • I did have to make a very minimal fork of tokio (which is used only by the simulator) in order to expose certain internals to the simulator. The alternative would have been increasing the amount of re-implemented functionality in tokio. You can see the fork history here: https://github.com/mystenmark/tokio-madsim-fork/commits/madsim-1.20.1 - it's a very minimal set of edits that are intended to rebase as easily as possible.

a network simulator + TcpSocket/TcpStream/TcpListener implementations to deliver all network traffic within the simulation with deterministic delays.

What about Udp given i'm actively migrating things off of tcp atm.

UDP is even easier to support than TCP (unsurprisingly), and TCP was only a day or two of work (https://github.com/MystenLabs/mysten-simulator/blob/main/msim-tokio/src/sim/net.rs). As soon as you have a branch that is using UDP, let me know, and I will make sure it is supported.

How does this handle things that happen inside of 3rd party code? Does it also require that we fork those third-party libraries?

It uses [patch] so that all 3rd parties pick up the replacement tokio. This might sound awful but it only took a day or two to implement enough tokio APIs to get all of our deps to compile. See: https://github.com/MystenLabs/sui/pull/4429/files#diff-2e6e8f4204bfb1de9dbd59556fa2d3f093fc9915d4a64d9fb40c4f582fabe3b1R47

I did have to make a minimal hyper fork, because it uses tokio::TcpSocket::from_raw_fd(). Again, I was careful to minimize difficulty of tracking upstream changes: mystenmark/hyper-madsim-fork@ab1fe48. There is also a TODO in the hyper code to stop using from_raw_fd() as soon as tokio supports set_tcp_keepalive. We could probably take care of that ourselves if we cared about not having a fork that much.

@mystenmark mystenmark force-pushed the mlogan-sim branch 8 times, most recently from cc4f4d5 to e0f7462 Compare September 8, 2022 14:27
Copy link
Contributor

@andll andll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some minor non-blocking comments.

In general I am quite (pleasantly) surprised this worked and tests run deterministically, this is amazing :)

I am also curious to see how this interacts with quic networking stack and whether we still have determinism with that

config_directory: PathBuf,
randomize_ports: bool,
committee_size: NonZeroUsize,
committee: Option<CommitteeConfig>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am tryin to understand why is this an option - seems like there is no code path in which it can be None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, that was unnecessary, fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it was necessary - we need to be able to move out of self.committee_config in the build() method - the ValidatorGenesisInfo struct can't be made Clone() because it contains a keypair.

.expect("Response from lockservice was cancelled, should not happen!")
exec_client_future(async move {
let (os_sender, os_receiver) = oneshot::channel::<SuiResult>();
// NOTE: below is blocking, switch to Tokio channels which are async?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment seem outdated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed - will remove as long as i'm polluting the blame anyway.

// timing of the reply from the other thread.
pub(crate) async fn exec_client_future<F: Future>(fut: F) -> <F as Future>::Output {
if cfg!(msim) {
futures::executor::block_on(fut)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sort of surprised this worked - normally you can't call block_on from inside tokio thread

Copy link
Contributor Author

@mystenmark mystenmark Sep 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't call Runtime::block_on (I think that panics?) - but this is (almost) just a short-hand for calling poll() in a loop until it returns Ready. There's nothing to stop you from doing that.

Obviously this will deadlock if your future is waiting for something to happen on the current thread, but that's explicitly not the case here - we are sending a message to another thread and waiting for it to come back. So ultimately this boils down to just waiting on the condvar inside the channel, which is a perfectly reasonable thing to do.

// Used to exec futures that send data to/from other threads. In the simulator, this effectively
// becomes a blocking call, which removes the non-determinism that would otherwise be caused by the
// timing of the reply from the other thread.
pub(crate) async fn exec_client_future<F: Future>(fut: F) -> <F as Future>::Output {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand why you need this function, but I don't understand choice of the name to be honest :)

Maybe something like sim_blocking_call or sim_no_preempt would be better name? It would give intuition that
(a) this is done as a call specifically to make it work in simulator
(b) what it helps specifically with is communicating with threads that are outside of simulator scope

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion.

@mystenmark
Copy link
Contributor Author

I am also curious to see how this interacts with quic networking stack and whether we still have determinism with that

Anemo just uses quinn, which uses tokio UdpSocket asynchronously, so there shouldn't be any major challenges. Getting it to work at all is a minor schlep, but once it works, I bet it will be deterministic.

@mystenmark mystenmark merged commit af4ce7a into main Sep 9, 2022
@mystenmark mystenmark deleted the mlogan-sim branch September 9, 2022 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants