Introduce subsystem benchmarking tool #2528
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Looks good to me!
09215c5 Backport from `polkadot-sdk` + bump (#2725) 6327261 Bump serde from 1.0.192 to 1.0.193 fff9ddd Bump sysinfo from 0.29.10 to 0.29.11 4be99fe Monitoring and alerts for Rococo/Westend (#2710) 67a683a Bump ed25519-dalek from 2.0.0 to 2.1.0 8e0e794 quick and dirty fix for the `wait -p` and older distros (#2712) 3ab6562 Add withdraw reserve assets to zombienet tests (#2711) c2c409b increase init timeouts in zombienet tests (#2706) a8c60b4 fix lane id and bridged chain id (#2705) 9ac0f26 removed bp-asset-hub-kusama and bp-asset-hub-polkadot (#2703) 4916475 Some fixes for zombienet tests (polkadot-staging) (#2704) 6f9a147 zombienet from Wococo to Westend (#2699) 3ba7910 Porting changes from polkadot-sdk to polkadot-staging - before update subtree with removed wococo stuff (#2696) 653448f Remove Woococo related stuff (#2692) 03aaab2 Gitspiegel polkadot staging (#2695) 702a4c1 Drop Rialto <> Millau bridges (#2663) (#2694) 6a63b5f Start version guards for the ED loop (#2678) 896b9a9 typo (#2690) 671d27c Bump serde from 1.0.190 to 1.0.192 991b229 Bump clap from 4.4.7 to 4.4.8 ec267ec Bump env_logger from 0.10.0 to 0.10.1 592e407 Bump tokio from 1.33.0 to 1.34.0 c49ce3d Bump serde_json from 1.0.107 to 1.0.108 04b3319 Update subxt-codegen version (#2674) 03f9804 backport #2139 (#2673) 49245dd removed unused PARACHAINS_FINALITY_PALLET_NAME constant (#2670) 658a3f5 BHR/BHWE spec_version according to the `polkadot-sdk` (#2668) 7666b94 Nit from `polkadot-sdk` (#2665) b5c43bb Adjusted constant because for measuring we used mistakenly rococo constants (#2664) 062449d Add Rococo<>Westend bridge support/relay (#2647) 55eb44e Add basic zombienet test to be used in the future (#2649) (#2660) 93b6b3f Bump clap from 4.4.6 to 4.4.7 4c01ab0 Bump futures from 0.3.28 to 0.3.29 a31a6c0 Bump tempfile from 3.8.0 to 3.8.1 bcdfe83 Bump serde from 1.0.189 to 1.0.190 f7433b0 Port #2648 to polkadot-staging (#2651) 3896738 Bump scale-info from 2.9.0 to 2.10.0 12d62c5 Bump thiserror from 1.0.49 
to 1.0.50 1d78aa1 Backport from `polkadot-sdk` with actual master (#2633) ab4de94 Grandpa justifications: Avoid duplicate vote ancestries (#2634) (#2635) 465562a add missing crate descriptions (#2629) 28d3680 Bump fixed-hash 67528c4 Bump serde from 1.0.188 to 1.0.189 d450c47 Bump time from 0.3.29 to 0.3.30 6a19f83 Bump async-trait from 0.1.73 to 0.1.74 a92d213 Millau, Rialto: accept equivocation reports (#2614) (#2617) a61f777 Bump tokio from 1.32.0 to 1.33.0 0052f64 Bump subxt from 0.32.0 to 0.32.1 ccc849d Bump num-traits from 0.2.16 to 0.2.17 22f2752 apply late suggestions for #2600 (#2603) 0320172 actualize check_obsolete_call comment (#2601) 5cbbd25 Reject transactions if bridge pallets are halted (#2600) ca4dfe3 Bump subxt from 0.31.0 to 0.32.0 8bf7b58 Bump clap from 4.4.4 to 4.4.6 88b0b99 Bump thiserror from 1.0.48 to 1.0.49 263833b https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/3833103 (#2589) 4f44968 Backport changes from polkadot-sdk (#2588) 7200ed1 fiox overflow when computing priority boost (#2587) e02cbd3 Bump time from 0.3.28 to 0.3.29 a097dd2 Bump clap from 4.4.3 to 4.4.4 801ce88 Merge bulletin chain changes into polkadot staging (#2574) a3803ce Add unit tests for the equivocation detection loop (#2571) 26dfc31 Bump clap from 4.4.2 to 4.4.3 66a8beb Bump serde_json from 1.0.106 to 1.0.107 18c50da Bump trie-db from 0.27.1 to 0.28.0 4c4fa92 Equivocation detection loop: Reorganize block checking logic as state machine (#2555) (#2557) 6bd317a Bump serde_json from 1.0.105 to 1.0.106 a7e6bfd Backport for polkadot-sdk#1446 (#2546) d9f8050 Bump sysinfo from 0.29.9 to 0.29.10 901f44c Bump thiserror from 1.0.47 to 1.0.48 82eeb50 Bump sysinfo from 0.29.8 to 0.29.9 a0c934b Bump strum from 0.24.1 to 0.25.0 1064fbf Bump subxt from 0.28.0 to 0.31.0 e50398d bridges subtree fixes (#2528) 99af075 Markdown linter (#1309) (#2526) 733ff0f `polkadot-staging` branch: Use polkadot-sdk dependencies (#2524) e8a59f1 Fix benchmark with new XCM::V3 
`MAX_INSTRUCTIONS_TO_DECODE` (#2514) 62b185d Backport `polkadot-sdk` changes to `polkadot-staging` (#2518) d9658f4 Fix equivocation detection containers startup (#2516) (#2517) d65db28 Backport: building images from locally built binaries (#2513) 5fdbaf4 Start the equivocation detection loop from the complex relayer (#2507) (#2512) 7fbb67d Backport: Implement basic equivocations detection loop (#2375) cb7efe2 Manually update deps in polkadot staging (#2371) d17981f #2351 to polkadot-staging (#2359) git-subtree-dir: bridges git-subtree-split: 09215c5
The CI pipeline was cancelled due to the failure of one of the required jobs.
This tool makes it easy to run parachain consensus stress/performance testing on your development machine or in CI.

## Motivation

The parachain consensus node implementation spans many modules, which we call subsystems. Each subsystem is responsible for a small part of the logic of the parachain consensus pipeline, but in general most load and performance issues are localized in just a few core subsystems such as `availability-recovery`, `approval-voting` or `dispute-coordinator`. In the absence of such a tool, we would run large test nets to load/stress test these parts of the system. Setting up such a large test and making sense of the amount of data it produces is very expensive, hard to orchestrate, and a huge development time sink.

## PR contents

- CLI tool
- Data Availability Read test
- Reusable mockups and components needed so far
- Documentation on how to get started

### Data Availability Read test

An overseer is built using a real `availability-recovery` subsystem instance, while dependent subsystems like `av-store`, `network-bridge` and `runtime-api` are mocked. The network bridge emulates all the network peers and their responses to requests.

The test runs for a number of blocks. For each block it will generate and send a “RecoverAvailableData” request for an arbitrary number of candidates. We wait for the subsystem to respond to all requests before moving to the next block. At the same time we collect the usual subsystem metrics and task CPU metrics, and show progress reports while running.

### Here is what the CLI output looks like:

```
[2023-11-28T13:06:27Z INFO subsystem_bench::core::display] n_validators = 1000, n_cores = 20, pov_size = 5120 - 5120, error = 3, latency = Some(PeerLatency { min_latency: 1ms, max_latency: 100ms })
[2023-11-28T13:06:27Z INFO subsystem-bench::availability] Generating template candidate index=0 pov_size=5242880
[2023-11-28T13:06:27Z INFO subsystem-bench::availability] Created test environment.
[2023-11-28T13:06:27Z INFO subsystem-bench::availability] Pre-generating 60 candidates.
[2023-11-28T13:06:30Z INFO subsystem-bench::core] Initializing network emulation for 1000 peers.
[2023-11-28T13:06:30Z INFO subsystem-bench::availability] Current block 1/3
[2023-11-28T13:06:30Z INFO substrate_prometheus_endpoint] 〽️ Prometheus exporter started at 127.0.0.1:9999
[2023-11-28T13:06:30Z INFO subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:37Z INFO subsystem_bench::availability] Block time 6262ms
[2023-11-28T13:06:37Z INFO subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:37Z INFO subsystem-bench::availability] Current block 2/3
[2023-11-28T13:06:37Z INFO subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:43Z INFO subsystem_bench::availability] Block time 6369ms
[2023-11-28T13:06:43Z INFO subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:43Z INFO subsystem-bench::availability] Current block 3/3
[2023-11-28T13:06:43Z INFO subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] Block time 6194ms
[2023-11-28T13:06:49Z INFO subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] All blocks processed in 18829ms
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] Throughput: 102400 KiB/block
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] Block time: 6276 ms
[2023-11-28T13:06:49Z INFO subsystem_bench::availability]
Total received from network: 415 MiB
Total sent to network: 724 KiB
Total subsystem CPU usage 24.00s
CPU usage per block 8.00s
Total test environment CPU usage 0.15s
CPU usage per block 0.05s
```

### Prometheus/Grafana stack in action

<img width="1246" alt="Screenshot 2023-11-28 at 15 11 10" src="https://github.com/paritytech/polkadot-sdk/assets/54316454/eaa47422-4a5e-4a3a-aaef-14ca644c1574">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 01" src="https://github.com/paritytech/polkadot-sdk/assets/54316454/237329d6-1710-4c27-8f67-5fb11d7f66ea">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 38" src="https://github.com/paritytech/polkadot-sdk/assets/54316454/a07119e8-c9f1-4810-a1b3-f1b7b01cf357">

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
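The summary figures in the CLI output above can be cross-checked against the run parameters. The following is a minimal sketch of that arithmetic, not code from the tool: `RunSummary`, its fields and `sample_run` are hypothetical names, and the formulas (throughput as `n_cores` times PoV size, block time as total time divided by block count) are assumptions inferred from the numbers in the log rather than taken from the subsystem-bench source.

```rust
/// Hypothetical summary of one benchmark run. The struct and its field names
/// are illustrative only and are not part of subsystem-bench.
struct RunSummary {
    n_cores: u64,
    pov_size_bytes: u64,
    num_blocks: u64,
    total_time_ms: u64,
    subsystem_cpu_s: f64,
}

impl RunSummary {
    /// Candidates generated up front: one per core per block.
    fn candidates(&self) -> u64 {
        self.n_cores * self.num_blocks
    }

    /// Data recovered per block, in KiB (assumed: one full PoV per core).
    fn throughput_kib_per_block(&self) -> u64 {
        self.n_cores * self.pov_size_bytes / 1024
    }

    /// Average block time, in whole milliseconds (assumed: total / blocks).
    fn avg_block_time_ms(&self) -> u64 {
        self.total_time_ms / self.num_blocks
    }

    /// Subsystem CPU time spent per block, in seconds.
    fn cpu_per_block_s(&self) -> f64 {
        self.subsystem_cpu_s / self.num_blocks as f64
    }
}

/// Values copied from the sample log: 20 cores, pov_size=5242880 bytes,
/// 3 blocks, 18829 ms in total, 24.00 s of subsystem CPU time.
fn sample_run() -> RunSummary {
    RunSummary {
        n_cores: 20,
        pov_size_bytes: 5_242_880,
        num_blocks: 3,
        total_time_ms: 18_829,
        subsystem_cpu_s: 24.0,
    }
}

fn main() {
    let run = sample_run();
    println!("candidates: {}", run.candidates());
    println!("throughput: {} KiB/block", run.throughput_kib_per_block());
    println!("block time: {} ms", run.avg_block_time_ms());
    println!("CPU per block: {:.2}s", run.cpu_per_block_s());
}
```

Under these assumptions the computed values line up with the logged "Pre-generating 60 candidates", "Throughput: 102400 KiB/block", "Block time: 6276 ms" and "CPU usage per block 8.00s" lines.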
68d8650 Bump thiserror from 1.0.50 to 1.0.51 009c989 remove no longer valid check from the ensure_weights_are_correct (#2740) 94c44a7 Added Rococo BH <> Rococo Bulletin bridge (#2724) 5fe0f2f Bump tokio from 1.34.0 to 1.35.0 25f8251 Grafana update stuff (#2733) 06fbe8b Improved `ExportXcm::validate` implementation for BridgeHubs - step 1 (#2727) 390e836 Select header that will be fully refunded in on-demand batch finality relay (#2729) ce701dd separate constants for average and worst case relay headers (#2728) git-subtree-dir: bridges git-subtree-split: 68d8650
## Summary

Built on top of the tooling and ideas introduced in #2528, this PR introduces a synthetic benchmark for measuring and assessing the performance characteristics of the approval-voting and approval-distribution subsystems.

Currently this allows us to simulate the behaviour of these subsystems along the following dimensions:

```
TestConfiguration:
# Test 1
- objective: !ApprovalsTest
    last_considered_tranche: 89
    min_coalesce: 1
    max_coalesce: 6
    enable_assignments_v2: true
    send_till_tranche: 60
    stop_when_approved: false
    coalesce_tranche_diff: 12
    workdir_prefix: "/tmp"
    num_no_shows_per_candidate: 0
    approval_distribution_expected_tof: 6.0
    approval_distribution_cpu_ms: 3.0
    approval_voting_cpu_ms: 4.30
  n_validators: 500
  n_cores: 100
  n_included_candidates: 100
  min_pov_size: 1120
  max_pov_size: 5120
  peer_bandwidth: 524288000000
  bandwidth: 524288000000
  latency:
    min_latency:
      secs: 0
      nanos: 1000000
    max_latency:
      secs: 0
      nanos: 100000000
  error: 0
  num_blocks: 10
```

## The approach

1. We build a real overseer with the real implementations of the approval-voting and approval-distribution subsystems.
2. For a given network size, we pre-compute, for each validator, all potential assignments and approvals it would send. Because this is a computation-heavy operation, the result is cached in a file on disk and re-used if the generation parameters don't change.
3. The messages are sent according to the configured parameters, which are split into 3 main benchmarking scenarios.

## Benchmarking scenarios

### Best case scenario

*approvals_throughput_best_case.yaml*

It sends to approval-distribution only the minimum tranches required to gather the needed_approvals, so that a candidate is approved.

### Behaviour in the presence of no-shows

*approvals_no_shows.yaml*

It sends the tranches needed to approve a candidate when we have a maximum of *num_no_shows_per_candidate* tranches with no-shows for each candidate.

### Maximum throughput

*approvals_throughput.yaml*

It sends all the tranches for each block and measures the CPU used and the network bandwidth needed by the approval-voting and approval-distribution subsystems.

## How to run it

```
cargo run -p polkadot-subsystem-bench --release -- test-sequence --path polkadot/node/subsystem-bench/examples/approvals_throughput.yaml
```

## Evaluating performance

### Use the real subsystems metrics

If you follow the steps in https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-grafana for installing prometheus and grafana locally, all real metrics for `approval-distribution`, `approval-voting` and the overseer are available. E.g.:

<img width="2149" alt="Screenshot 2023-12-05 at 11 07 46" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/cb8ae2dd-178b-4922-bfa4-dc37e572ed38">
<img width="2551" alt="Screenshot 2023-12-05 at 11 09 42" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/8b4542ba-88b9-46f9-9b70-cc345366081b">
<img width="2154" alt="Screenshot 2023-12-05 at 11 10 15" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/b8874d8d-632e-443a-9840-14ad8e90c54f">
<img width="2535" alt="Screenshot 2023-12-05 at 11 10 52" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/779a439f-fd18-4985-bb80-85d5afad78e2">

### Profile with pyroscope

1. Set up pyroscope following the steps in https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-pyroscope, then run any of the benchmark scenarios with `--profile` as an argument.
2. Open the pyroscope dashboard in grafana, e.g.:

<img width="2544" alt="Screenshot 2024-01-09 at 17 09 58" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/58f50c99-a910-4d20-951a-8b16639303d9">

### Useful logs

1. Network bandwidth requirements:
```
Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
```
2. CPU usage by the approval-distribution/approval-voting subsystems:
```
approval-distribution CPU usage 84.061s
approval-distribution CPU usage per block 8.406s
approval-voting CPU usage 96.532s
approval-voting CPU usage per block 9.653s
```
3. Time passed until a given block is approved:
```
Chain selection approved after 3500 ms hash=0x0101010101010101010101010101010101010101010101010101010101010101
Chain selection approved after 4500 ms hash=0x0202020202020202020202020202020202020202020202020202020202020202
```

### Using benchmark to quantify improvements from #1178 + #1191

Using a versi-node, we compare the scenario where all new optimisations are disabled with a scenario where tranche0 assignments are sent in a single message and, as a conservative simulation, the coalescing of approvals gives us just a 50% reduction in the number of messages we send.

Overall, what we see is a speedup of around 30-40% in the time it takes to process the necessary messages and a 30-40% reduction in the necessary bandwidth.

#### Best case scenario comparison (minimum required tranches sent)

Unoptimised
```
Number of blocks: 10
Payload bytes received from peers: 53289 KiB total, 5328 KiB/block
Payload bytes sent to peers: 52489 KiB total, 5248 KiB/block
approval-distribution CPU usage 6.732s
approval-distribution CPU usage per block 0.673s
approval-voting CPU usage 9.523s
approval-voting CPU usage per block 0.952s
```
vs optimisation enabled
```
Number of blocks: 10
Payload bytes received from peers: 32141 KiB total, 3214 KiB/block
Payload bytes sent to peers: 37314 KiB total, 3731 KiB/block
approval-distribution CPU usage 4.658s
approval-distribution CPU usage per block 0.466s
approval-voting CPU usage 6.236s
approval-voting CPU usage per block 0.624s
```

#### Worst case: all tranches sent (very unlikely; happens when sharding breaks)

Unoptimised
```
Number of blocks: 10
Payload bytes received from peers: 746393 KiB total, 74639 KiB/block
Payload bytes sent to peers: 729151 KiB total, 72915 KiB/block
approval-distribution CPU usage 118.681s
approval-distribution CPU usage per block 11.868s
approval-voting CPU usage 124.118s
approval-voting CPU usage per block 12.412s
```
vs optimised
```
Number of blocks: 10
Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
approval-distribution CPU usage 84.061s
approval-distribution CPU usage per block 8.406s
approval-voting CPU usage 96.532s
approval-voting CPU usage per block 9.653s
```

## TODOs

- [x] Polish implementation.
- [x] Use what we have so far to evaluate #1191 before merging.
- [x] List of features and additional dimensions we want to use for benchmarking.
- [x] Run benchmark on hardware similar to versi and kusama nodes.
- [ ] Add benchmark to be run in CI for catching regressions in performance.
- [ ] Rebase on latest changes for network emulation.

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
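To compare an unoptimised run with an optimised one, the relative reduction can be computed directly from the totals printed in the logs. This is a minimal sketch of that calculation, not part of the benchmark: `reduction_pct` is a hypothetical helper, and the input values are copied from the unoptimised and optimised outputs quoted in this description.

```rust
/// Relative reduction, in percent, when a metric goes from `before` to `after`.
/// Hypothetical helper for reading the benchmark logs; not part of subsystem-bench.
fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

fn main() {
    // (metric, unoptimised, optimised), copied from the best case logs.
    let best_case = [
        ("payload received (KiB)", 53289.0, 32141.0),
        ("payload sent (KiB)", 52489.0, 37314.0),
        ("approval-distribution CPU (s)", 6.732, 4.658),
        ("approval-voting CPU (s)", 9.523, 6.236),
    ];
    // Same metrics for the worst case (all tranches sent).
    let worst_case = [
        ("payload received (KiB)", 746393.0, 503993.0),
        ("payload sent (KiB)", 729151.0, 629971.0),
        ("approval-distribution CPU (s)", 118.681, 84.061),
        ("approval-voting CPU (s)", 124.118, 96.532),
    ];

    println!("Best case:");
    for (metric, before, after) in best_case.iter() {
        println!("  {}: {:.1}% lower", metric, reduction_pct(*before, *after));
    }
    println!("Worst case:");
    for (metric, before, after) in worst_case.iter() {
        println!("  {}: {:.1}% lower", metric, reduction_pct(*before, *after));
    }
}
```

Printing every metric separately shows which ones drive the overall improvement. For example, the best-case received payload drops from 53289 KiB to 32141 KiB, a reduction of roughly 40%.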
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/what-are-subsystem-benchmarks/8212/1