-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Fair reusing of wasm runtime instances #3011
Conversation
…-wasm-instances-v2
# Conflicts: # node/runtime/src/lib.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
.globals() | ||
.iter() | ||
.filter(|g| g.is_mutable()) | ||
.zip(self.global_mut_values.iter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// reset to the initial memory. So, one runtime instance is reused for | ||
/// every fetch request. | ||
/// | ||
/// For now the cache grows indefinitely, but that should be fine for now since runtimes can only be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this, but there should probably be an issue for this. That said, I think that issue can be closed as WONTFIX unless it becomes an issue, and it almost certainly will not.
Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good change.
I benchmarked this vs the master branch (the last commit merged into this branch) on my laptop (Linux with i7-7700HQ CPU). I'm consistently seeing a 20-25 us slowdown on this change vs master per Wasm module execution, both on a calls "Core_version" and "Core_execute_block" into the runtime. The benchmarking code is here. The benchmarks show that your copy-on-write branch has no noticeable performance improvement over this branch.
Also, this does not work using the latest Rust nightlies due to a change in LLVM that stops exporting __heap_base
by default. Here is a change that fixes this: jimpo@1087316.
See https://reviews.llvm.org/D62744 Co-authored-by: Jim Posen <jim.posen@gmail.com>
/// A cache of runtime instances along with metadata, ready to be reused. | ||
/// | ||
/// Instances are keyed by the hash of their code. | ||
instances: HashMap<[u8; 32], Result<Rc<CachedRuntime>, CacheError>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why keep around multiple cached runtimes instead of just the last one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you can have for example RPC calls into old runtimes.
# Conflicts: # core/executor/src/allocator.rs # core/test-runtime/client/src/lib.rs # node/runtime/src/lib.rs
Merging this since 20-25 us slowdown is tolerable. |
Closes #2051
Closes #2967
Supersedes #2938
Background
In #1345 an optimization for wasm runtime instantiation was implemented. The idea was that we would track the areas using some heuristic (
used_size
andlowest_used
) that were used in the course of runtime execution and then zero them. However, it turned out that:There were a couple attempts to fix it:
Arc
instead ofRc
in wasmi which would considerably harm the execution performance.used_size
(the highest accessed memory location within the execution) and @arkpar was indeed right: there is a lot of unnecessary work being done. The current node-runtime allocates 1MB of stack space, has data section and on top of that we preallocate a lot of memory upfront, and copying and/or zeroing it takes a lot of time. The execution time on generating 1500 blocks almost doubled.It turned out that it was hard to fix this issue with the given constraints (e.g. preserving determinism, allowing of execution of general wasm modules, etc) without regressing the performance.
One option was to actually ban mutable static variables in the runtime. However, IMO, I think it is hard to ensure that we don't accidentally bring some sort of mutable static globals into the runtime and also it seems that they can be indeed useful.
Solution
Because of the above I decided to take the following approach along with the following tradeoffs:
mmap
backend for linear memory in wasmi Use mmap for allocation wasmi-labs/wasmi#190 and introduce a function for quickly zeroing the memory instance. This allowed us to allocate arbitrary amounts of memory which is zeroed lazily on the first access.erase
function allows to basically reallocate. The downside is thatmmap
is inherently platform dependent and works only on unix systems. E.g. Windows will suffer noticable slowdowns. I also tried to leverage theGlobalAlloc::alloc_zeroed
function, but it turns out that it falls back tobzero
(analogue ofmemset
specialized for 0) on macOS which gives noticeble slowdowns (approx. 40 secs vs 230 secs for transaction factory of master on MasterTo1 with 1500 blocks). However, we can do the similar trick on Windows since it has the similar APIs.start
function in wasm runtime modules.start
function is run as the last step of instantiation and theoretically can do arbitrary changes to the wasm memory. This is OK for rustc produced binaries since they don't actually leverage thestart
function and IMO it is unlikely that it will use them. However, it might pose a problem for languages such as C++ which can have "pre-main" logic such as constructor initialization used in globals. This is not a fundamential problem since we can still run thestart
function and scan the memory for chunks of memory instead of relying on the static data segments. So this restriction can be lifted in the follow-up PRs.__heap_base
global variable. It turned out that the allocator also depended on theused_size
to detect the starting position of the heap which can be problematic. Luckily for us, wasm-ld (LLD) has a convention to expose the__heap_base
in the produced binaries which points to the one byte past from the data section which we can use to seed the allocator. This solves potential problems that theused_size
heuristic can miss, like BSS. I am not sure how this restriction can be lifted in the future though.The initial benchmarks shows that there is no noticeble regression introduced in this PR compared to the heuristic approach we were using. Although unexpectedly now substrate performs a little bit better on macOS than on Linux.
I hope that these tradeoffs are reasonable.
cc @arkpar @gavofyork @cmichi
TODO: