
[dag] dag rebootstrap #9967

Merged
merged 6 commits into from
Sep 15, 2023

Conversation

ibalajiarun
Contributor

@ibalajiarun ibalajiarun commented Sep 7, 2023

Description

This PR introduces the rebootstrap logic for the DAG. Essentially, when there is a need to state sync the DAG, we abort the existing handlers, return to the bootstrapper, and let the bootstrapper state sync the DAG, recreate all the components, and start the handlers again. This unifies the logic so that the same bootstrapping path is used for both recovery and state sync.
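
For reference, below is a minimal, self-contained sketch of that control flow, with std::future::pending standing in for the real DAG handler and fetch service; the names (bootstrapper, sync_dag_to) and the channel wiring are illustrative, not this PR's exact API.

// Minimal sketch of the rebootstrap flow: run the handler and fetch service,
// and on a state sync request abort them, sync, then bootstrap fresh components.
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

async fn sync_dag_to(target_round: u64) {
    // Stand-in for the bootstrapper state syncing the DAG.
    println!("state syncing DAG to round {target_round}");
}

async fn bootstrapper(mut rebootstrap_rx: mpsc::Receiver<u64>, mut shutdown_rx: mpsc::Receiver<()>) {
    loop {
        // (Re)create the components for this instance of the DAG.
        let fetch_service = tokio::spawn(std::future::pending::<()>()); // fetch service stand-in
        let handler = std::future::pending::<()>(); // handler stand-in

        tokio::select! {
            // State sync needed: abort the running service, sync, then loop
            // around and bootstrap everything again.
            Some(target_round) = rebootstrap_rx.recv() => {
                fetch_service.abort();
                let _ = fetch_service.await;
                sync_dag_to(target_round).await;
            }
            _ = handler => {}
            _ = shutdown_rx.recv() => {
                fetch_service.abort();
                return;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (reboot_tx, reboot_rx) = mpsc::channel(1);
    let (shutdown_tx, shutdown_rx) = mpsc::channel(1);
    let driver = tokio::spawn(bootstrapper(reboot_rx, shutdown_rx));

    reboot_tx.send(7).await.unwrap(); // simulate a state sync request
    sleep(Duration::from_millis(100)).await;
    shutdown_tx.send(()).await.unwrap();
    driver.await.unwrap();
}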

Test Plan

Existing Tests

@ibalajiarun ibalajiarun force-pushed the balaji/dag-state-sync branch 2 times, most recently from fab0ad8 to 6dd6796 on September 7, 2023 23:41
Base automatically changed from balaji/dag-state-sync to main September 8, 2023 14:45
@ibalajiarun ibalajiarun changed the base branch from main to balaji/bcast-certified-node-msg September 8, 2023 16:35
error!(error = ?e, "unable to sync");
}
},
_ = handler.start(&mut dag_rpc_rx) => {}
Contributor

I don't think this works? Once it starts, it'll never exit the inner loop.

Contributor Author

I think it works. As long as we have an .await within the loop, the future stops polling and yields back to the outer select!. If we get a rebootstrap notification, that branch completes, and we simply never poll the handler future again; instead, we start a new handler.
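
For anyone following along, here is a small self-contained illustration of that point (the channel setup is assumed, not the PR's code): the branch containing the inner loop is just a future, so every .await inside it yields back to the select!, and once the notification branch wins, the loop future is dropped and never polled again.

use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

async fn handler_loop() {
    loop {
        // Every .await inside the loop is a yield point: the whole loop is just
        // a future that stops being polled once another select! branch wins.
        sleep(Duration::from_millis(30)).await;
        println!("handler tick");
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<()>(1);

    tokio::spawn(async move {
        sleep(Duration::from_millis(100)).await;
        let _ = tx.send(()).await; // simulate the rebootstrap notification
    });

    tokio::select! {
        Some(()) = rx.recv() => {
            // handler_loop()'s future is dropped here and never polled again,
            // mirroring "we simply never poll the handler future again".
            println!("got notification; handler future dropped");
        }
        _ = handler_loop() => {}
    }
}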

},
Some(node) = rebootstrap_notification_rx.recv() => {
df_handle.abort();
let _ = df_handle.await;
Contributor

probably should have a guard for these services

Contributor

@zekun000 zekun000 left a comment

Looking at it again, why do we need to separate the trigger and the sync manager? Since we block the handler anyway, why not just check and sync directly inside the handler?

AggregateSignature::empty(),
);

let mut shutdown_rx = shutdown_rx.into_stream();
Contributor

I don't think you need this, just &mut shutdown_rx is enough

Contributor Author

I need this, btw, because the oneshot::Receiver otherwise gets moved within the select! statement.
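
For context, a minimal sketch of the pattern being discussed (assumed setup, not the PR's exact code): a oneshot receiver awaited by value inside a loop would be moved on the first iteration, so it is converted into a one-item stream with futures::FutureExt::into_stream and polled by reference on every pass; polling &mut shutdown_rx directly, as suggested above, is the other way to avoid the move.

use futures::{FutureExt, StreamExt};
use tokio::sync::oneshot;
use tokio::time::{sleep, Duration};

async fn run_until_shutdown(shutdown_rx: oneshot::Receiver<()>) {
    // Converting the one-shot future into a one-item stream lets it be polled
    // by reference on every loop iteration instead of being moved into select!.
    let mut shutdown_rx = shutdown_rx.into_stream();
    loop {
        tokio::select! {
            _ = shutdown_rx.next() => {
                println!("shutdown requested");
                return;
            }
            _ = sleep(Duration::from_millis(100)) => {
                println!("periodic work");
            }
        }
    }
}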

@ibalajiarun
Contributor Author

Looking at it again, why do we need to separate the trigger and the sync manager? Since we block the handler anyway, why not just check and sync directly inside the handler?

I think I need to abort the fetch service before starting an ad-hoc fetch for state sync. Otherwise, there could be two fetches for the same nodes happening concurrently.

Base automatically changed from balaji/bcast-certified-node-msg to main September 11, 2023 20:22
@ibalajiarun ibalajiarun marked this pull request as ready for review September 11, 2023 22:38

let dag_fetcher = DagFetcher::new(self.epoch_state.clone(), self.dag_network_sender.clone(), self.time_service.clone());

if let Err(e) = sync_manager.sync_dag_to(&certified_node_msg, dag_fetcher, dag_store.clone()).await {
Contributor

I thought we're creating a new dag store instead of reusing the current one?

Contributor Author

Yes, we are. I pass the existing dag store to do some assertion checks on whether to actually state sync.

Contributor

Hmm, that sounds weird; shouldn't the check be done in the check function?

Contributor Author

Yes, I am being paranoid: I check in the check function and assert in the sync_to function.

let (handler, fetch_service) =
self.bootstrap_components(dag_store.clone(), order_rule, state_sync_trigger);

let df_handle = tokio::spawn(fetch_service.start());
Contributor

nit: we could just create a drop guard like this and avoid the abort/await lines in both branches?
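
For concreteness, a sketch of the drop-guard idea in this nit (an illustrative helper, not necessarily an existing type in this codebase): aborting the spawned task when the guard is dropped removes the explicit abort/await from every select! branch.

// Aborts the wrapped task when the guard goes out of scope.
struct AbortOnDrop<T>(tokio::task::JoinHandle<T>);

impl<T> Drop for AbortOnDrop<T> {
    fn drop(&mut self) {
        self.0.abort();
    }
}

// Usage (illustrative): let _df_guard = AbortOnDrop(tokio::spawn(fetch_service.start()));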

match certified_node_msg.verify(&self.epoch_state.verifier) {
Ok(_) => match self.state_sync_trigger.check(certified_node_msg).await {
ret @ (NeedsSync(_), None) => return Ok(ret.0),
(Synced, Some(certified_node_msg)) => self
Contributor

The message could be carried in StateSyncStatus::Synced to avoid the second Option?

Contributor Author

I need to send the Synced status from the process_rpc fn as well, so I either clone the certified_node_msg or use the second Option.

Alternatively, I could have two enums: the state sync check returns one enum and I convert it into another one for process_rpc. I thought that was too much.

Contributor

I don't think we need to define two enums; this function could just return a Result<(), CertifiedNodeMessage>?
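
For concreteness, a rough sketch of the variant-payload alternative floated in this thread (illustrative only, reusing the existing CertifiedNodeMessage type; not necessarily the shape the PR settles on):

enum StateSyncStatus {
    // Carry the message that triggered the sync so the caller doesn't need a
    // separate Option alongside the status.
    NeedsSync(CertifiedNodeMessage),
    // Option because, per the author, Synced is also produced from process_rpc,
    // where there is no message to hand back.
    Synced(Option<CertifiedNodeMessage>),
}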

Contributor

@sasha8 sasha8 left a comment

Nice!

Contributor

I think we're now ready to increase the dag_window.

Contributor Author

Yes, when we introduce the onchain config.

dag.clone(),
self.time_service.clone(),
);
let fetch_requester = Arc::new(fetch_requester);
Contributor

nit: Why can't you return an Arc from DagFetcherService::new?

@ibalajiarun ibalajiarun enabled auto-merge (squash) September 15, 2023 16:00

@github-actions
Contributor

✅ Forge suite compat success on aptos-node-v1.6.2 ==> 3318585eae993e20a272c2f33de24bec49f75c65

Compatibility test results for aptos-node-v1.6.2 ==> 3318585eae993e20a272c2f33de24bec49f75c65 (PR)
1. Check liveness of validators at old version: aptos-node-v1.6.2
compatibility::simple-validator-upgrade::liveness-check : committed: 4660 txn/s, latency: 6649 ms, (p50: 6900 ms, p90: 9200 ms, p99: 10000 ms), latency samples: 181760
2. Upgrading first Validator to new version: 3318585eae993e20a272c2f33de24bec49f75c65
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1786 txn/s, latency: 16391 ms, (p50: 18800 ms, p90: 22300 ms, p99: 22600 ms), latency samples: 92900
3. Upgrading rest of first batch to new version: 3318585eae993e20a272c2f33de24bec49f75c65
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1824 txn/s, latency: 15760 ms, (p50: 19300 ms, p90: 22000 ms, p99: 22600 ms), latency samples: 91220
4. upgrading second batch to new version: 3318585eae993e20a272c2f33de24bec49f75c65
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3372 txn/s, latency: 9450 ms, (p50: 9400 ms, p90: 13900 ms, p99: 25000 ms), latency samples: 138280
5. check swarm health
Compatibility test for aptos-node-v1.6.2 ==> 3318585eae993e20a272c2f33de24bec49f75c65 passed
Test Ok

@github-actions
Contributor

✅ Forge suite realistic_env_max_load success on 3318585eae993e20a272c2f33de24bec49f75c65

two traffics test: inner traffic : committed: 5841 txn/s, latency: 6617 ms, (p50: 6300 ms, p90: 8400 ms, p99: 14200 ms), latency samples: 2564200
two traffics test : committed: 100 txn/s, latency: 3078 ms, (p50: 3000 ms, p90: 3500 ms, p99: 7300 ms), latency samples: 1880
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.228, avg: 0.211", "QsPosToProposal: max: 0.299, avg: 0.183", "ConsensusProposalToOrdered: max: 0.687, avg: 0.630", "ConsensusOrderedToCommit: max: 0.569, avg: 0.531", "ConsensusProposalToCommit: max: 1.214, avg: 1.161"]
Max round gap was 1 [limit 4] at version 826366. Max no progress secs was 3.968346 [limit 10] at version 2618845.
Test Ok


@github-actions
Contributor

✅ Forge suite framework_upgrade success on aptos-node-v1.5.1 ==> 3318585eae993e20a272c2f33de24bec49f75c65

Compatibility test results for aptos-node-v1.5.1 ==> 3318585eae993e20a272c2f33de24bec49f75c65 (PR)
Upgrade the nodes to version: 3318585eae993e20a272c2f33de24bec49f75c65
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 5128 txn/s, latency: 6332 ms, (p50: 5400 ms, p90: 9500 ms, p99: 17500 ms), latency samples: 189740
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> 3318585eae993e20a272c2f33de24bec49f75c65 passed
Test Ok

@ibalajiarun ibalajiarun merged commit fcf58d0 into main Sep 15, 2023
71 of 74 checks passed
@ibalajiarun ibalajiarun deleted the balaji/dag-rebootstrap branch September 15, 2023 18:17