Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Fair reusing of wasm runtime instances #3011

Merged
merged 72 commits into from
Jul 25, 2019
Merged

Fair reusing of wasm runtime instances #3011

merged 72 commits into from
Jul 25, 2019

Conversation

pepyakin
Copy link
Contributor

@pepyakin pepyakin commented Jul 3, 2019

Closes #2051
Closes #2967
Supersedes #2938

Background

In #1345 an optimization for wasm runtime instantiation was implemented. The idea was that we would track the areas using some heuristic (used_size and lowest_used) that were used in the course of runtime execution and then zero them. However, it turned out that:

  1. It assumed that the area where rustc usually puts the data section (i.e. static data, globals, etc) doesn't change. Introduction of mutable static variables in the runtime violated this assumption and ultimately caused Fix wasm module reuse #2051.
  2. Upon a closer inspection of the heuristic, it cleared only the stack leaving the heap untouched. This can potentially (however, unlikely) cause undefined behavior because a compiler can assume and take advantage of the fact that WebAssembly guarantees the memory to be zeroed.

There were a couple attempts to fix it:

  1. Initially, we thought that full instantiation in the background will do and tried to implement this approach in Ensure clean wasm instances #2931, but it turned out that this would make us use Arc instead of Rc in wasmi which would considerably harm the execution performance.
  2. Then we assumed that maybe the initial assumption that @arkpar is not correct and instantiation doesn't take a lot of time. So we came up with Ensure clean wasm instances via synchronous clone. #2938. Initial bencmark results showed potential in this approach, however, it turned out that the algorithm was flawed since it copied only up to used_size (the highest accessed memory location within the execution) and @arkpar was indeed right: there is a lot of unnecessary work being done. The current node-runtime allocates 1MB of stack space, has data section and on top of that we preallocate a lot of memory upfront, and copying and/or zeroing it takes a lot of time. The execution time on generating 1500 blocks almost doubled.

It turned out that it was hard to fix this issue with the given constraints (e.g. preserving determinism, allowing of execution of general wasm modules, etc) without regressing the performance.

One option was to actually ban mutable static variables in the runtime. However, IMO, I think it is hard to ensure that we don't accidentally bring some sort of mutable static globals into the runtime and also it seems that they can be indeed useful.

Solution

Because of the above I decided to take the following approach along with the following tradeoffs:

  1. Implement the mmap backend for linear memory in wasmi Use mmap for allocation wasmi-labs/wasmi#190 and introduce a function for quickly zeroing the memory instance. This allowed us to allocate arbitrary amounts of memory which is zeroed lazily on the first access. erase function allows to basically reallocate. The downside is that mmap is inherently platform dependent and works only on unix systems. E.g. Windows will suffer noticable slowdowns. I also tried to leverage the GlobalAlloc::alloc_zeroed function, but it turns out that it falls back to bzero (analogue of memset specialized for 0) on macOS which gives noticeble slowdowns (approx. 40 secs vs 230 secs for transaction factory of master on MasterTo1 with 1500 blocks). However, we can do the similar trick on Windows since it has the similar APIs.
  2. Extract the data segments (chunks of memory that are copied to the linear memory at the instantiation time) from the wasm module and cache them. At the time when the new wasm runtime instance is requested, the linear memory is zeroed and then the data segments are copied.
  3. Scan the mutable global variables and restore their values when the new wasm runtime instance is requested. This solves the Wasm stack pointer is not restored to its initial value #2967.
  4. Ban the start function in wasm runtime modules. start function is run as the last step of instantiation and theoretically can do arbitrary changes to the wasm memory. This is OK for rustc produced binaries since they don't actually leverage the start function and IMO it is unlikely that it will use them. However, it might pose a problem for languages such as C++ which can have "pre-main" logic such as constructor initialization used in globals. This is not a fundamential problem since we can still run the start function and scan the memory for chunks of memory instead of relying on the static data segments. So this restriction can be lifted in the follow-up PRs.
  5. Require the __heap_base global variable. It turned out that the allocator also depended on the used_size to detect the starting position of the heap which can be problematic. Luckily for us, wasm-ld (LLD) has a convention to expose the __heap_base in the produced binaries which points to the one byte past from the data section which we can use to seed the allocator. This solves potential problems that the used_size heuristic can miss, like BSS. I am not sure how this restriction can be lifted in the future though.

The initial benchmarks shows that there is no noticeble regression introduced in this PR compared to the heuristic approach we were using. Although unexpectedly now substrate performs a little bit better on macOS than on Linux.

I hope that these tradeoffs are reasonable.

cc @arkpar @gavofyork @cmichi

TODO:

Copy link
Contributor

@Demi-Marie Demi-Marie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

.globals()
.iter()
.filter(|g| g.is_mutable())
.zip(self.global_mut_values.iter())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pepyakin I agree with you on this one. I do not remember knowing about try_for_each until @bkchr mentioned it.

/// reset to the initial memory. So, one runtime instance is reused for
/// every fetch request.
///
/// For now the cache grows indefinitely, but that should be fine for now since runtimes can only be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this, but there should probably be an issue for this. That said, I think that issue can be closed as WONTFIX unless it becomes an issue, and it almost certainly will not.

core/executor/src/wasm_runtimes_cache.rs Outdated Show resolved Hide resolved
@Demi-Marie Demi-Marie added A6-mustntgrumble and removed A0-please_review Pull request needs code review. labels Jul 13, 2019
Demi-Marie and others added 2 commits July 13, 2019 14:09
Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>
Copy link
Contributor

@jimpo jimpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good change.

I benchmarked this vs the master branch (the last commit merged into this branch) on my laptop (Linux with i7-7700HQ CPU). I'm consistently seeing a 20-25 us slowdown on this change vs master per Wasm module execution, both on a calls "Core_version" and "Core_execute_block" into the runtime. The benchmarking code is here. The benchmarks show that your copy-on-write branch has no noticeable performance improvement over this branch.

Also, this does not work using the latest Rust nightlies due to a change in LLVM that stops exporting __heap_base by default. Here is a change that fixes this: jimpo@1087316.

core/executor/src/wasm_runtimes_cache.rs Show resolved Hide resolved
core/executor/src/wasm_executor.rs Outdated Show resolved Hide resolved
core/executor/src/wasm_runtimes_cache.rs Outdated Show resolved Hide resolved
core/executor/src/wasm_runtimes_cache.rs Outdated Show resolved Hide resolved
/// A cache of runtime instances along with metadata, ready to be reused.
///
/// Instances are keyed by the hash of their code.
instances: HashMap<[u8; 32], Result<Rc<CachedRuntime>, CacheError>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why keep around multiple cached runtimes instead of just the last one?

Copy link
Member

@bkchr bkchr Jul 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you can have for example RPC calls into old runtimes.

pepyakin added 2 commits July 25, 2019 15:19
# Conflicts:
#	core/executor/src/allocator.rs
#	core/test-runtime/client/src/lib.rs
#	node/runtime/src/lib.rs
@pepyakin
Copy link
Contributor Author

Merging this since 20-25 us slowdown is tolerable.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wasm stack pointer is not restored to its initial value Fix wasm module reuse
9 participants