[dag] state sync refactor and notifier reinit #10106

Merged (3 commits) on Sep 22, 2023

@ibalajiarun ibalajiarun commented Sep 18, 2023

Description

This PR splits the notifier into OrderedNotifier and ProofNotifier.

  • Cleans up the code so that the ordered notifier adapter is re-instantiated after each state sync, which keeps the parent block info accurate. To do so, it moves the epoch change notification logic from the state sync manager to the state sync notifier.
  • Cleans up some StateSyncStatus logic and makes sure that the DAG doesn't reinit when an epoch change is required.
  • NetworkSender implements ProofNotifier. This gives a clean abstraction and keeps all epoch manager and buffer manager related components (including messages) outside of the DAG.
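
As a rough illustration of the split described above, here is a minimal sketch. Only the names OrderedNotifier, ProofNotifier and NetworkSender come from this PR; the method names, the payload types (`OrderedBlocks`, `EpochChangeProof` as a plain struct), and `OrderedNotifierAdapter`'s fields are assumptions made for the example and are not the actual signatures in the codebase.

```rust
use std::sync::Mutex;

// Hypothetical stand-ins for the real payload types.
#[derive(Debug, Clone, PartialEq)]
struct OrderedBlocks {
    round: u64,
}

#[derive(Debug, Clone, PartialEq)]
struct EpochChangeProof {
    epoch: u64,
}

/// Receives nodes once the DAG has ordered them (method name is assumed).
trait OrderedNotifier {
    fn send_ordered_nodes(&mut self, blocks: OrderedBlocks);
}

/// Forwards epoch change proofs out of the DAG (method name is assumed).
trait ProofNotifier {
    fn send_epoch_change(&self, proof: EpochChangeProof);
}

/// Toy sender that records proofs instead of sending them over a network.
struct NetworkSender {
    sent: Mutex<Vec<EpochChangeProof>>,
}

impl NetworkSender {
    fn new() -> Self {
        NetworkSender { sent: Mutex::new(Vec::new()) }
    }
}

impl ProofNotifier for NetworkSender {
    fn send_epoch_change(&self, proof: EpochChangeProof) {
        self.sent.lock().unwrap().push(proof);
    }
}

/// Adapter that remembers the parent round of the next block it will emit.
struct OrderedNotifierAdapter {
    parent_round: u64,
}

impl OrderedNotifierAdapter {
    fn new(parent_round: u64) -> Self {
        OrderedNotifierAdapter { parent_round }
    }
}

impl OrderedNotifier for OrderedNotifierAdapter {
    fn send_ordered_nodes(&mut self, blocks: OrderedBlocks) {
        // The block just sent becomes the parent of the next one.
        self.parent_round = blocks.round;
    }
}

fn main() {
    let mut adapter = OrderedNotifierAdapter::new(0);
    adapter.send_ordered_nodes(OrderedBlocks { round: 5 });

    // Suppose state sync jumps to round 20: build a fresh adapter so the
    // parent info reflects the synced state rather than round 5.
    let adapter = OrderedNotifierAdapter::new(20);
    println!("parent round after sync: {}", adapter.parent_round);

    let sender = NetworkSender::new();
    sender.send_epoch_change(EpochChangeProof { epoch: 3 });
    let sent = sender.sent.lock().unwrap();
    println!("proofs sent: {}, first epoch: {}", sent.len(), sent[0].epoch);
}
```

The point of the sketch is that the adapter carries state (the parent info) that goes stale after a state sync, so constructing a new one is simpler than patching the old one, while the proof path has no such state and can sit behind its own trait.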

Test Plan

ibalajiarun commented Sep 18, 2023

Comment on lines +232 to +240
    StateSyncStatus::EpochEnds => {
        // Wait for epoch manager to signal shutdown
        _ = shutdown_rx.await;
        return;
    },
Contributor Author

Not sure this is correct, because this peer could have received an old ledger info and it may already be in the new epoch.

Contributor

The invariant we should hold is that the DAG instance only receives messages from the same epoch. If it's already in the new epoch, it shouldn't receive messages from the previous epoch.

Contributor

Sending multiple or stale epoch change proofs is okay, I think; the epoch manager should handle that.
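
To make the "epoch manager should handle that" point concrete, here is a minimal sketch of idempotent handling of epoch change proofs. The `EpochManager` struct and `on_epoch_change_proof` method here are hypothetical and heavily simplified (a proof is reduced to the number of the epoch it ends); this is an assumption about the general shape of such handling, not the real epoch manager.

```rust
/// Hypothetical, heavily simplified epoch manager: it tracks only the epoch.
struct EpochManager {
    epoch: u64,
}

impl EpochManager {
    /// Handles a proof that `ended_epoch` has ended.
    /// Returns true if the manager advanced, false if the proof was
    /// stale or a duplicate and was ignored.
    fn on_epoch_change_proof(&mut self, ended_epoch: u64) -> bool {
        if ended_epoch < self.epoch {
            // We are already past that epoch: nothing to do.
            return false;
        }
        self.epoch = ended_epoch + 1;
        true
    }
}

fn main() {
    let mut manager = EpochManager { epoch: 7 };
    assert!(manager.on_epoch_change_proof(7)); // advances to epoch 8
    assert!(!manager.on_epoch_change_proof(7)); // duplicate, ignored
    assert!(!manager.on_epoch_change_proof(5)); // stale, ignored
    println!("current epoch: {}", manager.epoch);
}
```

With handling of this shape, a sender does not need to know whether its proof is fresh; the worst case for a repeated or old proof is a no-op.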

Contributor Author

We don't actually check that the ledger info is from the same epoch. I need to fix that.

Contributor Author

I added the check to make sure it's the same epoch.
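
For readers following the thread, a same-epoch check of the kind discussed here could look roughly like the sketch below. Everything in it is an assumption for illustration: the flattened `LedgerInfo` struct, the `SyncCheck` enum (deliberately not named StateSyncStatus, since the real enum's variants and payloads are only partly visible in this PR), and the order of the checks.

```rust
/// Hypothetical, flattened view of a ledger info.
#[derive(Debug, Clone, Copy)]
struct LedgerInfo {
    epoch: u64,
    round: u64,
    ends_epoch: bool,
}

/// Hypothetical outcome of inspecting an incoming ledger info.
#[derive(Debug, PartialEq)]
enum SyncCheck {
    /// The ledger info belongs to a different epoch; ignore it.
    OtherEpoch,
    /// The ledger info ends the current epoch.
    EpochEnds,
    /// The ledger info is ahead of local state; a sync is needed.
    NeedsSync,
    /// Local state already covers this ledger info.
    UpToDate,
}

fn check_ledger_info(current_epoch: u64, local_round: u64, li: &LedgerInfo) -> SyncCheck {
    // Same-epoch guard first: an old ledger info must not trigger
    // epoch-end handling in a peer that has already moved on.
    if li.epoch != current_epoch {
        return SyncCheck::OtherEpoch;
    }
    if li.ends_epoch {
        return SyncCheck::EpochEnds;
    }
    if li.round > local_round {
        SyncCheck::NeedsSync
    } else {
        SyncCheck::UpToDate
    }
}

fn main() {
    let old = LedgerInfo { epoch: 4, round: 90, ends_epoch: true };
    // The peer is already in epoch 5, so the old epoch-ending info is ignored.
    println!("{:?}", check_ledger_info(5, 10, &old));
}
```

Putting the epoch comparison before the `ends_epoch` test is what addresses the concern raised at the top of this thread: a late ledger info from the previous epoch never reaches the epoch-end branch.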

        return StateSyncStatus::Synced(Some(node));
    }

    if ledger_info.ledger_info().ends_epoch() {
Contributor

This logic is different from the sync manager's, but I think it's good.

@github-actions

✅ Forge suite compat success on aptos-node-v1.6.2 ==> 0a6989d6ce5ce87935463d6d2bceb52fd3e57157

Compatibility test results for aptos-node-v1.6.2 ==> 0a6989d6ce5ce87935463d6d2bceb52fd3e57157 (PR)
1. Check liveness of validators at old version: aptos-node-v1.6.2
compatibility::simple-validator-upgrade::liveness-check : committed: 4301 txn/s, latency: 6981 ms, (p50: 6900 ms, p90: 9200 ms, p99: 16500 ms), latency samples: 180680
2. Upgrading first Validator to new version: 0a6989d6ce5ce87935463d6d2bceb52fd3e57157
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1850 txn/s, latency: 15637 ms, (p50: 18700 ms, p90: 22000 ms, p99: 22300 ms), latency samples: 92500
3. Upgrading rest of first batch to new version: 0a6989d6ce5ce87935463d6d2bceb52fd3e57157
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1703 txn/s, latency: 16435 ms, (p50: 17500 ms, p90: 21700 ms, p99: 30300 ms), latency samples: 86900
4. upgrading second batch to new version: 0a6989d6ce5ce87935463d6d2bceb52fd3e57157
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3377 txn/s, latency: 9068 ms, (p50: 9900 ms, p90: 12000 ms, p99: 12600 ms), latency samples: 141840
5. check swarm health
Compatibility test for aptos-node-v1.6.2 ==> 0a6989d6ce5ce87935463d6d2bceb52fd3e57157 passed
Test Ok

@github-actions

✅ Forge suite realistic_env_max_load success on 0a6989d6ce5ce87935463d6d2bceb52fd3e57157

two traffics test: inner traffic : committed: 6133 txn/s, latency: 6379 ms, (p50: 6000 ms, p90: 8400 ms, p99: 12600 ms), latency samples: 2680340
two traffics test : committed: 100 txn/s, latency: 2980 ms, (p50: 2900 ms, p90: 3800 ms, p99: 7500 ms), latency samples: 1780
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.233, avg: 0.214", "QsPosToProposal: max: 0.172, avg: 0.160", "ConsensusProposalToOrdered: max: 0.672, avg: 0.624", "ConsensusOrderedToCommit: max: 0.550, avg: 0.526", "ConsensusProposalToCommit: max: 1.184, avg: 1.150"]
Max round gap was 1 [limit 4] at version 759407. Max no progress secs was 3.688538 [limit 10] at version 2695495.
Test Ok

@ibalajiarun ibalajiarun merged commit 0f6ff07 into main Sep 22, 2023
82 of 83 checks passed
@ibalajiarun ibalajiarun deleted the balaji/split-notifier branch September 22, 2023 20:04
@github-actions

❌ Forge suite framework_upgrade failure on aptos-node-v1.5.1 ==> 0a6989d6ce5ce87935463d6d2bceb52fd3e57157

Compatibility test results for aptos-node-v1.5.1 ==> 0a6989d6ce5ce87935463d6d2bceb52fd3e57157 (PR)
Upgrade the nodes to version: 0a6989d6ce5ce87935463d6d2bceb52fd3e57157
Test Failed: API error: Unknown error error sending request for url (http://aptos-node-3-validator.forge-framework-upgrade-pr-10106.svc:8080/v1/estimate_gas_price): error trying to connect: dns error: failed to lookup address information: Name or service not known

Stack backtrace:
   0: aptos_release_builder::validate::execute_release::{{closure}}
             at ./aptos-move/aptos-release-builder/src/validate.rs:399:22
      aptos_release_builder::validate::validate_config_and_generate_release::{{closure}}
             at ./aptos-move/aptos-release-builder/src/validate.rs:460:6
      aptos_release_builder::validate::validate_config::{{closure}}
             at ./aptos-move/aptos-release-builder/src/validate.rs:446:80
      tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/park.rs:283:63
      tokio::runtime::coop::with_budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/coop.rs:107:5
      tokio::runtime::coop::budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/coop.rs:73:5
      tokio::runtime::park::CachedParkThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/park.rs:283:31
   1: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/context/blocking.rs:66:9
      tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/multi_thread/mod.rs:87:13
      tokio::runtime::context::runtime::enter_runtime
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/context/runtime.rs:65:16
   2: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/multi_thread/mod.rs:86:9
      tokio::runtime::runtime::Runtime::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/runtime.rs:313:50
   3: <aptos_testcases::framework_upgrade::FrameworkUpgrade as aptos_forge::interface::network::NetworkTest>::run
             at ./testsuite/testcases/src/framework_upgrade.rs:97:9
   4: aptos_forge::runner::Forge<F>::run::{{closure}}
             at ./testsuite/forge/src/runner.rs:545:42
      aptos_forge::runner::run_test
             at ./testsuite/forge/src/runner.rs:613:11
      aptos_forge::runner::Forge<F>::run
             at ./testsuite/forge/src/runner.rs:545:30
   5: forge::run_forge
             at ./testsuite/forge-cli/src/main.rs:410:11
      forge::main
             at ./testsuite/forge-cli/src/main.rs:336:21
   6: core::ops::function::FnOnce::call_once
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ops/function.rs:250:5
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/sys_common/backtrace.rs:135:18
   7: std::rt::lang_start::{{closure}}
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/rt.rs:166:18
   8: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ops/function.rs:284:13
      std::panicking::try::do_call
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:500:40
      std::panicking::try
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:464:19
      std::panic::catch_unwind
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panic.rs:142:14
      std::rt::lang_start_internal::{{closure}}
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/rt.rs:148:48
      std::panicking::try::do_call
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:500:40
      std::panicking::try
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:464:19
      std::panic::catch_unwind
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panic.rs:142:14
      std::rt::lang_start_internal
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/rt.rs:148:20
   9: main
  10: __libc_start_main
  11: _start


Swarm logs can be found here: See fgi output for more information.
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ApiError: namespaces "forge-framework-upgrade-pr-10106" not found: NotFound (ErrorResponse { status: "Failure", message: "namespaces \"forge-framework-upgrade-pr-10106\" not found", reason: "NotFound", code: 404 })

Caused by:
    namespaces "forge-framework-upgrade-pr-10106" not found: NotFound

Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/result.rs:1961:27
      aptos_forge::backend::k8s::cluster_helper::delete_k8s_cluster::{{closure}}
             at ./testsuite/forge/src/backend/k8s/cluster_helper.rs:289:13
      aptos_forge::backend::k8s::cluster_helper::uninstall_testnet_resources::{{closure}}
             at ./testsuite/forge/src/backend/k8s/cluster_helper.rs:399:48
   1: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/park.rs:283:63
      tokio::runtime::coop::with_budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/coop.rs:107:5
      tokio::runtime::coop::budget
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/coop.rs:73:5
      tokio::runtime::park::CachedParkThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/park.rs:283:31
      tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/context/blocking.rs:66:9
   2: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/multi_thread/mod.rs:87:13
      tokio::runtime::context::runtime::enter_runtime
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/context/runtime.rs:65:16
   3: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/multi_thread/mod.rs:86:9
      tokio::runtime::runtime::Runtime::block_on
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/runtime.rs:313:50
   4: <aptos_forge::backend::k8s::swarm::K8sSwarm as core::ops::drop::Drop>::drop
             at ./testsuite/forge/src/backend/k8s/swarm.rs:674:13
   5: core::ptr::drop_in_place<aptos_forge::backend::k8s::swarm::K8sSwarm>
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ptr/mod.rs:497:1
   6: core::ptr::drop_in_place<alloc::boxed::Box<dyn aptos_forge::interface::swarm::Swarm>>
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ptr/mod.rs:497:1
   7: aptos_forge::runner::Forge<F>::run
             at ./testsuite/forge/src/runner.rs:558:9
   8: forge::run_forge
             at ./testsuite/forge-cli/src/main.rs:410:11
      forge::main
             at ./testsuite/forge-cli/src/main.rs:336:21
   9: core::ops::function::FnOnce::call_once
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ops/function.rs:250:5
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/sys_common/backtrace.rs:135:18
  10: std::rt::lang_start::{{closure}}
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/rt.rs:166:18
  11: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ops/function.rs:284:13
      std::panicking::try::do_call
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:500:40
      std::panicking::try
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:464:19
      std::panic::catch_unwind
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panic.rs:142:14
      std::rt::lang_start_internal::{{closure}}
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/rt.rs:148:48
      std::panicking::try::do_call
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:500:40
      std::panicking::try
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:464:19
      std::panic::catch_unwind
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panic.rs:142:14
      std::rt::lang_start_internal
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/rt.rs:148:20
  12: main
  13: __libc_start_main
  14: _start', testsuite/forge/src/backend/k8s/swarm.rs:676:18
stack backtrace:
   0: rust_begin_unwind
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/panicking.rs:67:14
   2: core::result::unwrap_failed
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/result.rs:1651:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/result.rs:1076:23
   4: <aptos_forge::backend::k8s::swarm::K8sSwarm as core::ops::drop::Drop>::drop
             at ./testsuite/forge/src/backend/k8s/swarm.rs:674:13
   5: core::ptr::drop_in_place<aptos_forge::backend::k8s::swarm::K8sSwarm>
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ptr/mod.rs:497:1
   6: core::ptr::drop_in_place<alloc::boxed::Box<dyn aptos_forge::interface::swarm::Swarm>>
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ptr/mod.rs:497:1
   7: aptos_forge::runner::Forge<F>::run
             at ./testsuite/forge/src/runner.rs:558:9
   8: forge::run_forge
             at ./testsuite/forge-cli/src/main.rs:410:11
   9: forge::main
             at ./testsuite/forge-cli/src/main.rs:336:21
  10: core::ops::function::FnOnce::call_once
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Debugging output:


3 participants