Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce code cache and loader V2 #14184

Merged
merged 87 commits into from
Oct 29, 2024
Merged

Introduce code cache and loader V2 #14184

merged 87 commits into from
Oct 29, 2024

Conversation

georgemitenkov
Copy link
Contributor

@georgemitenkov georgemitenkov commented Aug 1, 2024

Description

This PR keeps track of all loader & code cache changes, and will be linked to an AIP.

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

Key Areas to Review

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

Copy link
Contributor Author

georgemitenkov commented Aug 1, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @georgemitenkov and the rest of your teammates on Graphite Graphite

@georgemitenkov georgemitenkov changed the title [loader] Add feature flag to be able to switch to V2 implementation [DO NOT MERGE] Introduce code cache and loader V2 Aug 1, 2024
@georgemitenkov georgemitenkov marked this pull request as ready for review August 1, 2024 07:31
@georgemitenkov georgemitenkov changed the base branch from main to george/loaded-func August 1, 2024 11:39
Copy link

codecov bot commented Aug 1, 2024

Codecov Report

Attention: Patch coverage is 33.26643% with 2660 lines in your changes missing coverage. Please review.

Project coverage is 59.2%. Comparing base (0e4f5df) to head (4a31c6e).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
aptos-move/block-executor/src/code_cache.rs 0.0% 488 Missing ⚠️
aptos-move/aptos-vm/src/aptos_vm.rs 0.0% 421 Missing ⚠️
third_party/move/move-vm/runtime/src/loader/mod.rs 61.7% 188 Missing ⚠️
...ptos-move/block-executor/src/cross_block_caches.rs 0.0% 173 Missing ⚠️
aptos-move/block-executor/src/executor.rs 0.0% 150 Missing ⚠️
aptos-move/aptos-vm-environment/src/environment.rs 0.0% 149 Missing ⚠️
aptos-move/block-executor/src/captured_reads.rs 0.0% 122 Missing ⚠️
aptos-move/aptos-vm-environment/src/gas.rs 0.0% 69 Missing ⚠️
...e_vm_ext/session/user_transaction_sessions/user.rs 0.0% 67 Missing ⚠️
third_party/move/move-vm/runtime/src/session.rs 22.9% 67 Missing ⚠️
... and 51 more
Additional details and impacted files
@@            Coverage Diff            @@
##             main   #14184     +/-   ##
=========================================
- Coverage    60.1%    59.2%   -0.9%     
=========================================
  Files         856      876     +20     
  Lines      211110   214751   +3641     
=========================================
+ Hits       126962   127230    +268     
- Misses      84148    87521   +3373     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

@ziaptos ziaptos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No major issues found.

I'd just make sure to:

  1. Document in a separate document all TODOs introduced by this PR (and there seems to be a lot of them). Then change the comments to refer to that document (or an issue no).
  2. I'd still do the renaming of some variable and types, like ModuleStorageAdapter, to LegacyModuleStorageAdapter, etc....

Self {
runtime: VMRuntime::new(natives, vm_config),
}
}

pub fn new_with_runtime_environment(runtime_environment: &RuntimeEnvironment) -> Self {
// Loader V2 does not store any natives, so we save the clone here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit confusing comment. Seems like runtime_envoronment.natives() should be empty here? Maybe debug_assert?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was confusing, fixed. Before, what I did is just avoided clone because V2 loader does not store natives (but they are stored in runtime environment). After addressing comments: removed MoveVM::new and MoveVM::new_with_config to ensure VM is always created based on the specified runtime environment (MoveVM::new_with_runtime_environment). Now there is less duplication, and the code is clear, and less clones in tests.

Ok(module.as_ref().deref().clone())
},
Loader::V2(_) => Err(PartialVMError::new(
StatusCode::UNKNOWN_INVARIANT_VIOLATION_ERROR,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be unreachable? right? Do we need to handle it gracefully?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gracefully as if ..? Here we use invariant violation so that we do not panic, in case something goes wrong

module_id.name(),
function_name,
)
// Note: Keeping this consistent with Loader V1 implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the note, but would add more context, as the V1 code above will be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, also changed to TODO, because not remapping the error breaks some tests, but replay might be ok

vm_config,
),
loader,
// Note(loader_v2): We still create this cache, but if V2 loader is used, it is not used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: even though V2 is not using it. Maybe add a removal plan in the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Changed into a TODO so we can track this

use std::sync::Arc;

/// Returns the hash (SHA-3-256) of the bytes. Used for both modules and scripts.
pub fn compute_code_hash(bytes: &[u8]) -> [u8; 32] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This can be a generic utility function as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the name to sha3_256, and moved to move-vm-types crate so it serves as a generic utility.

third_party/move/move-vm/runtime/src/storage/publishing.rs Outdated Show resolved Hide resolved
types/src/access_path.rs Outdated Show resolved Hide resolved
@@ -311,17 +312,12 @@ impl Module {
let module_handle = module.module_handle_at(struct_handle.module);
let module_id = module.module_id_for_handle(module_handle);

if module_handle != module.self_handle() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems right, We might be changing some behavior though by removing these. However, on the other side, if the reply is fine, we should be OK.

@ziaptos ziaptos requested a review from beer-1 October 29, 2024 02:37
Copy link

@beer-1 beer-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for amazing improvement.

Current implementation is already very good for us than loader v1 a lot, but if I have to say something, it would be cool if we can use custom type_cache impl in the RuntimeEnvironment like ModuleStorage. In our case, we are using code checksum based cache, so want to implement checksum based type_cache impl if possible 👍

In addition to the comments, also force eviction of published
modules from global cache. Enable V2 flag for some tests.
@georgemitenkov
Copy link
Contributor Author

@beer-1 good point about the type cache! Let's address this in a separate PR, I added a loader TODO for this to track it.

@georgemitenkov
Copy link
Contributor Author

@ziaptos thanks for review!

  1. For TODOs, I use TODO(loader_v2): ... to make things searchable and do not lose any. I think creating issues is an overkill because half of them I will address as a part of the roll out. But we can create issues for things we do not want to fix short-term. Doc is a good idea to organise those👍 I already have one in notion
  2. Renamed the types as well

  - Renamed all types/traits to `Legacy..`
  - Added TODOs for removal and type cache handling

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@georgemitenkov georgemitenkov removed the CICD:run-execution-performance-full-test Run execution performance test (full version) label Oct 29, 2024
@georgemitenkov georgemitenkov enabled auto-merge (squash) October 29, 2024 17:57

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4

two traffics test: inner traffic : committed: 14451.46 txn/s, latency: 2750.22 ms, (p50: 2700 ms, p70: 2700, p90: 2900 ms, p99: 3000 ms), latency samples: 5495020
two traffics test : committed: 100.08 txn/s, latency: 1519.07 ms, (p50: 1300 ms, p70: 1400, p90: 1500 ms, p99: 7400 ms), latency samples: 1840
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.976, avg: 1.575", "ConsensusProposalToOrdered: max: 0.313, avg: 0.292", "ConsensusOrderedToCommit: max: 0.364, avg: 0.353", "ConsensusProposalToCommit: max: 0.657, avg: 0.645"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.90s no progress at version 2886964 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.48s no progress at version 2886962 (avg 8.48s) [limit 15].
Test Ok

Copy link
Contributor

✅ Forge suite framework_upgrade success on 9c922ebe94f5ff4b58df4617f3ff003e2ce10ccd ==> bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4

Compatibility test results for 9c922ebe94f5ff4b58df4617f3ff003e2ce10ccd ==> bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4 (PR)
Upgrade the nodes to version: bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1075.75 txn/s, submitted: 1077.59 txn/s, failed submission: 1.84 txn/s, expired: 1.84 txn/s, latency: 3137.38 ms, (p50: 2700 ms, p70: 3300, p90: 5400 ms, p99: 11300 ms), latency samples: 93700
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1312.33 txn/s, submitted: 1315.03 txn/s, failed submission: 2.71 txn/s, expired: 2.71 txn/s, latency: 2343.42 ms, (p50: 2100 ms, p70: 2400, p90: 3900 ms, p99: 5000 ms), latency samples: 116400
5. check swarm health
Compatibility test for 9c922ebe94f5ff4b58df4617f3ff003e2ce10ccd ==> bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4 passed
Upgrade the remaining nodes to version: bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1171.07 txn/s, submitted: 1174.42 txn/s, failed submission: 3.35 txn/s, expired: 3.35 txn/s, latency: 2614.28 ms, (p50: 2700 ms, p70: 3000, p90: 3900 ms, p99: 5400 ms), latency samples: 104800
Test Ok

Copy link
Contributor

✅ Forge suite compat success on 9c922ebe94f5ff4b58df4617f3ff003e2ce10ccd ==> bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4

Compatibility test results for 9c922ebe94f5ff4b58df4617f3ff003e2ce10ccd ==> bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4 (PR)
1. Check liveness of validators at old version: 9c922ebe94f5ff4b58df4617f3ff003e2ce10ccd
compatibility::simple-validator-upgrade::liveness-check : committed: 13712.76 txn/s, latency: 2111.69 ms, (p50: 1600 ms, p70: 1800, p90: 2000 ms, p99: 23200 ms), latency samples: 545680
2. Upgrading first Validator to new version: bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 5864.22 txn/s, latency: 4720.99 ms, (p50: 4700 ms, p70: 5700, p90: 6500 ms, p99: 7000 ms), latency samples: 113360
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6002.51 txn/s, latency: 5361.79 ms, (p50: 5800 ms, p70: 6100, p90: 6500 ms, p99: 7400 ms), latency samples: 208280
3. Upgrading rest of first batch to new version: bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6190.31 txn/s, latency: 4313.16 ms, (p50: 4900 ms, p70: 5000, p90: 5200 ms, p99: 5300 ms), latency samples: 125060
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6628.49 txn/s, latency: 4862.90 ms, (p50: 5200 ms, p70: 5300, p90: 6300 ms, p99: 6700 ms), latency samples: 229820
4. upgrading second batch to new version: bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 8639.70 txn/s, latency: 3233.42 ms, (p50: 3600 ms, p70: 3800, p90: 4100 ms, p99: 4500 ms), latency samples: 157320
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 7230.43 txn/s, latency: 4359.66 ms, (p50: 3600 ms, p70: 3900, p90: 9800 ms, p99: 11100 ms), latency samples: 244320
5. check swarm health
Compatibility test for 9c922ebe94f5ff4b58df4617f3ff003e2ce10ccd ==> bf7b8fd8a36dd768f7fdeca5bcc8fa44a29f0fe4 passed
Test Ok

@georgemitenkov georgemitenkov merged commit 751c9a3 into main Oct 29, 2024
76 of 95 checks passed
@georgemitenkov georgemitenkov deleted the george/main branch October 29, 2024 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants