Deterministic simulation for sui tests #4429

mystenmark · 2022-09-01T06:13:59Z

This PR enables running tests in a deterministic simulator. To use:

   ./scripts/simtest/install.sh # install the cargo-simtest command
   cargo simtest # build and run tests in simulator mode

The simulator itself is at https://github.com/MystenLabs/mysten-simulator (in a separate repo so that it can be integrated into narwhal as well). The simulator provides a drop-in replacement for tokio which is patched in via cargo's patch mechanism - but only when running the tests. Normal builds use normal tokio.

How it works, in brief:

The simulator contains:
- a network simulator + TcpSocket/TcpStream/TcpListener implementations to deliver all network traffic within the simulation with deterministic delays.
- a deterministic, randomized executor.
The simulator also contains a replacement for getrandom() and getentropy() in order to transparently provide deterministic behavior from OsRng.
Every test is run in a new thread - because HashSet/HashMap use thread local state (seeded from getrandom()), we get deterministic hash container behavior for free.

Two new macros are provided:

#[sui_test] - Runs the test as a #[tokio::test] under cargo test and a simulator test under cargo simtest.
#[sim_test] - Marks the test as ignored for cargo test, but runs it under cargo simtest

Tests defined with #[tokio::test] are ignored under simtest.

Tests defined with #[test] are run in both. I would like to ignore them with simtest but I don't know how to do that.

Most of the code in this PR is boilerplate necessary to handle configuration differences for swarm.

Remaining work for subsequent PRs:

Fix non-determinism caused by EventStore (currently disabled in simulator)
Network configuration (e.g. latency, partitions, etc)
Move more tests to #[sui_test] and/or #[sim_test]
Docs
Github job that runs a subset of tests (e.g. the cluster tests) repeatedly, to search for rare bugs.

bmwill · 2022-09-01T15:48:10Z

I haven't taken a deep look at this yet, i'll need to book some time but here are some initial questions given i know very little about what you've done so far.

The simulator provides a drop-in replacement for tokio which is patched in via cargo's patch mechanism - but only when running the tests.

I thought we had found a way that didn't require forking tokio?

a network simulator + TcpSocket/TcpStream/TcpListener implementations to deliver all network traffic within the simulation with deterministic delays.

What about Udp given i'm actively migrating things off of tcp atm.

How does this handle things that happen inside of 3rd party code? Does it also require that we fork those third-party libraries?

mystenmark · 2022-09-01T17:56:05Z

I haven't taken a deep look at this yet, i'll need to book some time but here are some initial questions given i know very little about what you've done so far.

The simulator provides a drop-in replacement for tokio which is patched in via cargo's patch mechanism - but only when running the tests.

I thought we had found a way that didn't require forking tokio?

It mostly doesn't:

Production code continues using mainline tokio. One of my primary goals was to have the minimal impact on what code is actually compiled into our release builds.
The simulator isn't a fork of tokio, but a reimplementation of certain pieces of tokio (mainly the executor and networking), which delegates as much as possible back to "real" tokio.
I did have to make a very minimal fork of tokio (which is used only by the simulator) in order to expose certain internals to the simulator. The alternative would have been increasing the amount of re-implemented functionality in tokio. You can see the fork history here: https://github.com/mystenmark/tokio-madsim-fork/commits/madsim-1.20.1 - it's a very minimal set of edits that are intended to rebase as easily as possible.

a network simulator + TcpSocket/TcpStream/TcpListener implementations to deliver all network traffic within the simulation with deterministic delays.

What about Udp given i'm actively migrating things off of tcp atm.

UDP is even easier to support than TCP (unsurprisingly), and TCP was only a day or two of work (https://github.com/MystenLabs/mysten-simulator/blob/main/msim-tokio/src/sim/net.rs). As soon as you have a branch that is using UDP, let me know, and I will make sure it is supported.

How does this handle things that happen inside of 3rd party code? Does it also require that we fork those third-party libraries?

It uses [patch] so that all 3rd parties pick up the replacement tokio. This might sound awful but it only took a day or two to implement enough tokio APIs to get all of our deps to compile. See: https://github.com/MystenLabs/sui/pull/4429/files#diff-2e6e8f4204bfb1de9dbd59556fa2d3f093fc9915d4a64d9fb40c4f582fabe3b1R47

I did have to make a minimal hyper fork, because it uses tokio::TcpSocket::from_raw_fd(). Again, I was careful to minimize difficulty of tracking upstream changes: mystenmark/hyper-madsim-fork@ab1fe48. There is also a TODO in the hyper code to stop using from_raw_fd() as soon as tokio supports set_tcp_keepalive. We could probably take care of that ourselves if we cared about not having a fork that much.

andll

I left some minor non-blocking comments.

In general I am quite (pleasantly) surprised this worked and tests run deterministically, this is amazing :)

I am also curious to see how this interacts with quic networking stack and whether we still have determinism with that

andll · 2022-09-08T17:19:09Z

crates/sui-config/src/builder.rs

    config_directory: PathBuf,
    randomize_ports: bool,
-    committee_size: NonZeroUsize,
+    committee: Option<CommitteeConfig>,


I am tryin to understand why is this an option - seems like there is no code path in which it can be None?

You're right, that was unnecessary, fixed.

Actually, it was necessary - we need to be able to move out of self.committee_config in the build() method - the ValidatorGenesisInfo struct can't be made Clone() because it contains a keypair.

andll · 2022-09-08T17:29:01Z

crates/sui-storage/src/lock_service.rs

-            .expect("Response from lockservice was cancelled, should not happen!")
+        exec_client_future(async move {
+            let (os_sender, os_receiver) = oneshot::channel::<SuiResult>();
+            // NOTE: below is blocking, switch to Tokio channels which are async?


The comment seem outdated

agreed - will remove as long as i'm polluting the blame anyway.

andll · 2022-09-08T17:33:47Z

crates/sui-storage/src/lib.rs

+// timing of the reply from the other thread.
+pub(crate) async fn exec_client_future<F: Future>(fut: F) -> <F as Future>::Output {
+    if cfg!(msim) {
+        futures::executor::block_on(fut)


I am sort of surprised this worked - normally you can't call block_on from inside tokio thread

You can't call Runtime::block_on (I think that panics?) - but this is (almost) just a short-hand for calling poll() in a loop until it returns Ready. There's nothing to stop you from doing that.

Obviously this will deadlock if your future is waiting for something to happen on the current thread, but that's explicitly not the case here - we are sending a message to another thread and waiting for it to come back. So ultimately this boils down to just waiting on the condvar inside the channel, which is a perfectly reasonable thing to do.

andll · 2022-09-08T17:37:31Z

crates/sui-storage/src/lib.rs

+// Used to exec futures that send data to/from other threads. In the simulator, this effectively
+// becomes a blocking call, which removes the non-determinism that would otherwise be caused by the
+// timing of the reply from the other thread.
+pub(crate) async fn exec_client_future<F: Future>(fut: F) -> <F as Future>::Output {


I think I understand why you need this function, but I don't understand choice of the name to be honest :)

Maybe something like sim_blocking_call or sim_no_preempt would be better name? It would give intuition that
(a) this is done as a call specifically to make it work in simulator
(b) what it helps specifically with is communicating with threads that are outside of simulator scope

Good suggestion.

…g. fullnodes)

…trol

mystenmark · 2022-09-09T15:55:23Z

I am also curious to see how this interacts with quic networking stack and whether we still have determinism with that

Anemo just uses quinn, which uses tokio UdpSocket asynchronously, so there shouldn't be any major challenges. Getting it to work at all is a minor schlep, but once it works, I bet it will be deterministic.

mystenmark requested review from tharbert, huitseeker, Clay-Mysten and randall-Mysten as code owners September 1, 2022 06:14

mystenmark removed request for huitseeker, tharbert, Clay-Mysten and randall-Mysten September 1, 2022 06:14

mystenmark force-pushed the mlogan-sim branch from 5312763 to ab8fdd8 Compare September 1, 2022 06:24

mystenmark requested review from bmwill, andll, longbowlu and huitseeker September 1, 2022 06:48

mystenmark force-pushed the mlogan-sim branch from 5d29058 to 726d9b5 Compare September 1, 2022 17:35

mystenmark force-pushed the mlogan-sim branch 8 times, most recently from cc4f4d5 to e0f7462 Compare September 8, 2022 14:27

andll approved these changes Sep 8, 2022

View reviewed changes

mystenmark and others added 5 commits September 9, 2022 08:41

Changes for starting simulator swarm

27a3056

Don't start jsonrpsee or prometheus servers in simulator

926aac7

Make start_console work when Send is required

d22e5c4

simulator determinism tests

3e9f303

Add sui-macros crate for #[sui_test] and #[sim_test]

e03651b

mlogan added 24 commits September 9, 2022 08:44

Add madsim deps (NOTE: fix, uses local paths)

1103e57

Disable event processing on fullnode in simulator, for determinism

0cfa1ff

network_config_snapshot_matches cannot run in a simulator build

a6c69d5

Don't crash by trying to kill nodes if the runtime has shut down

d5f0551

Fix hash container non-determinism

8c709bc

Stop nodes when swarm is dropped

4eff852

Remove unneeded log init

38225d0

Make lock_service calls block in the simulator for determinism

1d18de8

Create deterministic local addresses for nodes not built by swarm (e.…

8080ae0

…g. fullnodes)

madsim -> msim

74f9b2c

Replace local paths with git pointers

e29fece

Fix ordinary #[test]s in simulator

05a044a

Add github workflow

2c4a09f

Update deps

2389e2a

Move sim-specific container stuff to separate module

b7d671e

Fix lints

298826d

Update mysten-sim dep

da7b1e9

Make the installed cargo-simtest script invoke the one in version con…

3d0f675

…trol

skip tests that don't run well in simulator

e860c6e

Fix license check

fd04f0a

Add workspace hack deps

ea06d8f

Simplify config builder

df314eb

Disable unneeded simulator tests

5fee9d1

Update Cargo.lock

ea62763

mystenmark force-pushed the mlogan-sim branch from e0f7462 to ea62763 Compare September 9, 2022 15:47

PR comments

ac5d588

mystenmark merged commit af4ce7a into main Sep 9, 2022

mystenmark deleted the mlogan-sim branch September 9, 2022 16:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deterministic simulation for sui tests #4429

Deterministic simulation for sui tests #4429

mystenmark commented Sep 1, 2022 •

edited

Loading

bmwill commented Sep 1, 2022 •

edited

Loading

mystenmark commented Sep 1, 2022

andll left a comment

andll Sep 8, 2022

mystenmark Sep 9, 2022

mystenmark Sep 9, 2022

andll Sep 8, 2022

mystenmark Sep 9, 2022

andll Sep 8, 2022

mystenmark Sep 9, 2022 •

edited

Loading

andll Sep 8, 2022

mystenmark Sep 9, 2022

mystenmark commented Sep 9, 2022

Deterministic simulation for sui tests #4429

Deterministic simulation for sui tests #4429

Conversation

mystenmark commented Sep 1, 2022 • edited Loading

bmwill commented Sep 1, 2022 • edited Loading

mystenmark commented Sep 1, 2022

andll left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mystenmark Sep 9, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mystenmark commented Sep 9, 2022

mystenmark commented Sep 1, 2022 •

edited

Loading

bmwill commented Sep 1, 2022 •

edited

Loading

mystenmark Sep 9, 2022 •

edited

Loading