Interpreter performance and footprint #46

Open
ia0 opened this issue Mar 22, 2023 · 21 comments · Fixed by #563, #598, #605 or #608
Labels
crate:interpreter Modifies the interpreter for:footprint Reduces firmware footprint for:performance Improves firmware performance needs:design Needs design to make progress

Comments

@ia0
Member

ia0 commented Mar 22, 2023

The current interpreter is very simple:

  • It doesn't pre-compute anything.
  • It executes directly from flash.

It is thus an in-place (non-optimized) interpreter according to A fast in-place interpreter for WebAssembly by Ben L. Titzer.

We would like to give users control over the tradeoff between performance and footprint (both in flash and memory). The following improvements would go towards this direction:

  • Implement the optimizations discussed in the paper above, and feature-gate those that don't improve both performance and footprint (like the sidetable).
  • Implement optimizations similar to what wasm3 does, by translating the opcodes into their handler function addresses, and feature-gate them (see the sketch after this list).
  • Implement a feature to persist the translated code from the optimization above in flash. This optimization means that a platform update would invalidate the applet (unless the original bytecode is preserved). Such translated code would not need validation or translation at startup and could execute directly.
  • JIT and AOT are probably out of the question for now. If wasmtime or wasmer end up supporting no-std, the situation might simplify. The difficulty is that they would need to run in a different thread so as not to take control of the scheduler thread. See Explore Wasmtime as an alternative WebAssembly runtime #458
  • Use Pulley, whose principles and rationale seem to fit our use case very well.
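
For illustration, a minimal sketch of the wasm3-style translation mentioned above, using hypothetical types (this is not the wasefire-interpreter or wasm3 API): each opcode is pre-translated into the address of its handler, so the hot loop is a single indirect call per instruction.

```rust
// Hypothetical sketch of wasm3-style pre-translation: opcodes become handler
// addresses, so dispatch is an indirect call instead of a match on a byte.
struct Vm {
    stack: Vec<u64>,
}

/// Each handler executes one instruction and returns the next program counter.
type Handler = fn(&mut Vm, pc: usize, code: &[Translated]) -> usize;

/// A translated instruction: the handler address plus one immediate operand.
struct Translated {
    handler: Handler,
    operand: u64,
}

fn op_const(vm: &mut Vm, pc: usize, code: &[Translated]) -> usize {
    vm.stack.push(code[pc].operand);
    pc + 1
}

fn op_add(vm: &mut Vm, pc: usize, _code: &[Translated]) -> usize {
    let b = vm.stack.pop().unwrap();
    let a = vm.stack.pop().unwrap();
    vm.stack.push(a.wrapping_add(b));
    pc + 1
}

/// The hot loop after translation: one indirect call per instruction.
fn run(vm: &mut Vm, code: &[Translated]) {
    let mut pc = 0;
    while pc < code.len() {
        pc = (code[pc].handler)(vm, pc, code);
    }
}
```

Feature-gating this keeps the in-place interpreter as the footprint-friendly default, since the translated form trades flash/RAM for dispatch speed.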

Open questions:

  • Does it make sense to have a feature like toctou which would remove all dead panics (not just those with a corresponding dynamic check, but all panics the compiler cannot prove unreachable on its own)?
  • How to benchmark? Which suite to use?
  • If there is a way to detect hot and cold functions, and optimizations have a per-function cost, only hot functions could be optimized.

Related work:

The work is tracked in the dev/fast-interp branch.

@ia0 ia0 added needs:design Needs design to make progress crate:interpreter Modifies the interpreter for:performance Improves firmware performance for:footprint Reduces firmware footprint labels Mar 22, 2023
@zhouwfang zhouwfang self-assigned this Apr 23, 2024
@zhouwfang
Member

zhouwfang commented May 8, 2024

@ia0 What is the expected timeline for #458? I assume it might make this issue irrelevant?

@ia0
Member Author

ia0 commented May 8, 2024

It depends on when someone takes a look at it, but I don't expect it to work in the short term. And if it ever works, I don't expect it to be a replacement for the current interpreter, just an alternative, the same way a rewriting interpreter and a simple compiler would be alternatives to the current in-place interpreter. They all provide different trade-offs between performance, footprint, and portability. I'll update the issue with this alternative. EDIT: Actually, the issue was already mentioning Wasmtime as an option. I just linked the issue.

@zhouwfang
Member

Thanks for the clarification!

@zhouwfang
Member

@ia0 I was wondering if it is worth considering using wasmi.

In https://google.github.io/wasefire/faq.html#why-implement-a-new-interpreter, it says "wasmi consumes too much RAM for embedded". However, in the recent release wasmi has been migrated from stack-based IR to register-based IR, and the release doc writes, "The new register-based IR was carefully designed to enhance execution performance and to minimize memory usage. As the vast majority of a Wasm binary is comprised of encoded instructions, this substantially decreases memory usage and enhances cache efficiency when executing Wasm through Wasmi. [...] with a fantastic startup performance and low memory consumption especially suited for embedded environments."

(Of course, we would still need to implement streamed compilation to flash.)

@ia0
Member Author

ia0 commented Jul 1, 2024

I was wondering if it is worth considering using wasmi.

It makes sense to give it a try if they now claim to be suited for embedded environments. That would be one more comparison point along the following dimensions: interpreter code size, interpreter RAM usage, and interpreter performance.

For testing purposes, let's first modify wasefire-scheduler in place to use wasmi instead of wasefire-interpreter. This is just a quick and dirty solution to assess the viability of wasmi. If the results are good, we can create a feature to choose between both implementations.

Of course, we would still need to implement streamed compilation to flash

Yes, that's probably a necessary step, but it could be done in a second phase.

@zhouwfang
Member

First, just wanted to confirm my understanding about Scheduler::run(wasm):

  1. scheduler.load(wasm) may call host functions, and these calls are handled by scheduler.process_applet() within scheduler.load(wasm). This means we would need to "fit" wasmi's host function API into scheduler.process_applet().

  2. scheduler.flush_events() handles pending events from the board such as button presses.

For performance evaluation purposes, can we ignore host functions? In other words, can we compare only scheduler.load(wasm) and "wasmi.load(wasm)" without host functions in the wasm, and remove this infinite loop? Thanks.

@ia0
Member Author

ia0 commented Jul 3, 2024

It's true that the need for an in-place interpreter (in particular regarding RAM usage) was not the only reason to write a custom interpreter. There was also the multiplexing at the host-function level. That said, I believe this last part could be done otherwise, as long as interpreters have a way to give control back (e.g. using fuel in wasmi or epochs in wasmtime).
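
For illustration, a minimal sketch of the fuel-based approach, assuming wasmi's fuel-metering API (`Config::consume_fuel` and `Store::set_fuel`; the exact method names differ across wasmi versions):

```rust
// Minimal sketch (assumed wasmi API; method names vary across wasmi versions):
// with fuel metering enabled, a long-running applet traps with an out-of-fuel
// error, returning control to the scheduler instead of monopolizing its thread.
use wasmi::{Config, Engine, Store};

fn bounded_store(fuel: u64) -> Store<()> {
    let mut config = Config::default();
    config.consume_fuel(true); // every executed instruction consumes fuel
    let engine = Engine::new(&config);
    let mut store = Store::new(&engine, ());
    store.set_fuel(fuel).expect("fuel metering is enabled");
    store
}
```

Once the budget is exhausted, the call returns an out-of-fuel error and the scheduler regains control and can decide how to proceed.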

I guess the simplest way to check performance is to use my riscv branch and add wasmi as a runtime. This should be easy, since it should behave similarly to wasm3.

@zhouwfang
Member

zhouwfang commented Jul 3, 2024

CoreMark results based on your riscv branch, with a linux docker container on my personal laptop:

wasmi
CoreMark result: 692.2593 (in 17.595s)
CoreMark result: 663.20996 (in 18.242s)
CoreMark result: 620.90765 (in 19.381s)
CoreMark result: 657.69806 (in 18.73s)
CoreMark result: 689.52545 (in 17.623s)

wasm3
CoreMark result: 881.8342 (in 13.633s)
CoreMark result: 865.6646 (in 13.928s)
CoreMark result: 947.5407 (in 12.794s)
CoreMark result: 951.55707 (in 12.729s)
CoreMark result: 957.02106 (in 12.683s)

base
CoreMark result: 25.864857 (in 19.304s)
CoreMark result: 27.917364 (in 18.341s)
CoreMark result: 27.032507 (in 18.921s)
CoreMark result: 27.77392 (in 18.362s)
CoreMark result: 27.847397 (in 18.577s)

wasmi looks quite competitive. WDYT?

Edit: Another advantage of wasmi is that it supports streamed translation -- https://wasmi-labs.github.io/blog/posts/wasmi-v0.32/#non-streaming-translation

@ia0
Member Author

ia0 commented Jul 4, 2024

Thanks! Can you push your branch on your fork? I would like to test on embedded devices, which is what matters.

Another advantage of wasmi is that it supports streamed translation

That's already something, but it's the same for the current interpreter. What is important is not only streamed translation, but also persistent translation (not to RAM, but to flash).

@zhouwfang
Member

Thanks! Can you push your branch on your fork? I would like to test on embedded devices, which is what matters.

Here is the branch with the wasmi runtime. Looking forward to the result on an embedded device.

That's already something, but it's the same for the current interpreter.

What do you mean by "the same"? Thanks.

@ia0
Member Author

ia0 commented Jul 5, 2024

Thanks! So here are the results (linux is my machine and nordic is nRF52840):

| target | runtime | coremark | time | code | RAM |
|--------|---------|----------|------|------|-----|
| linux  | base  | 28.510336 | 17.991s | | |
| linux  | wasm3 | 2592.5205 | 19.479s | | |
| linux  | wasmi | 1297.4375 | 23.742s | | |
| nordic | base  | 0.088684715 | 225.861s | 136K | 5416 |
| nordic | wasm3 | | | | |
| nordic | wasmi | 3.394433 | 20.678s | 912K | 91960 |

We can see that the speedup of wasmi over base is ~45x on linux and ~38x on nordic, so quite comparable in order of magnitude. We also see that wasm3 is ~2x faster than wasmi, so I would expect something similar on nordic if wasm3 compiled there.

It is also important to note that wasmi is ~7x bigger than base in code size and ~17x bigger in RAM usage. That's quite a big issue and a no-go for using it as-is.

So I think we should instead implement the optimizations described by Ben Titzer in https://arxiv.org/abs/2205.01183 and redo this benchmark. I would expect between ~2x and ~10x improvement on coremark with little code and RAM increase.

What do you mean by "the same"? Thanks.

I mean for the validation step, which is done linearly. It's not pure streaming because it still takes a slice as input, but it processes that slice linearly without relying on random access, so it wouldn't be a big change to fix that.
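
For illustration, a hypothetical sketch of what removing the slice assumption could look like; none of these names exist in the current interpreter.

```rust
/// Hypothetical streaming source of module bytes: the validator would pull
/// bytes in order from it instead of indexing into a pre-loaded slice.
trait StreamingInput {
    type Error;
    /// Fills `buf` with the next module bytes, returning how many were written
    /// (0 means end of input).
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, Self::Error>;
}
```

Since validation already consumes its input linearly, switching to such a source would mostly mean replacing slice indexing with buffered reads.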

@ia0
Member Author

ia0 commented Jul 5, 2024

By the way, once #523 is merged, could you create a PR to the dev/wasm-bench branch with your commit adding wasmi? This will be useful for future comparisons.

@zhouwfang
Member

Thanks for testing on nordic! (I should think harder about how to do that by myself in my remote work set-up.)

I just added the optimization for wasmi in #524 according to its documentation. On linux, it improves the CoreMark from ~660 to ~1100. Could you give this optimized wasmi another try on nordic? Thanks!

I'll look into the in-place optimization paper.

@ia0
Member Author

ia0 commented Jul 8, 2024

Thanks! I think those optimizations make sense in general, so I enabled them for all in your PR. Here are the results (the linux machine is not the same as before, but nordic is the same):

| target | runtime | coremark | time | code | RAM |
|--------|---------|----------|------|------|-----|
| linux  | base  | 27.179453 | 18.802s | | |
| linux  | wasm3 | 2169.0405 | 18.947s | | |
| linux  | wasmi | 1112.2234 | 28.174s | | |
| nordic | base  | 0.09126336 | 219.468s | 144K | 5416 |
| nordic | wasm3 | | | | |
| nordic | wasmi | 4.488666 | 15.643s | 820K | 91960 |

We see some improvement, but the code size is still unacceptable.

By the way, wasmtime recently decided to support no-std (bytecodealliance/wasmtime#8341). Could you also add similar runtime support for wasmtime as you did for wasmi? The code should be rather similar. I'm curious to see if it already works on nordic.

@zhouwfang
Member

On my linux docker container, the wasmtime coremark is ~20 times that of wasm3. Is this expected?

IIUC, JIT runtimes such as wasmtime are more suitable for compute-intensive wasm workloads, while rewriting interpreters are more suitable for translation-intensive workloads. I was wondering whether we should prioritize optimizing execution time or translation time, or potentially both.

@ia0
Member Author

ia0 commented Jul 10, 2024

Yes, Wasmtime is much faster because it's compiled. But if we can get a compiler that is small in code size and doesn't use too much RAM, then we'll take it.

Regarding prioritization, there's not a single goal. In the end we want the user to be able to choose the trade-off between performance, security, footprint, and portability. So we want to provide as many alternatives (that are different enough from each other) as possible.

@zhouwfang
Member

zhouwfang commented Jul 15, 2024

  1. I found another wasm interpreter in Rust named stitch. According to its README, it has CoreMark results similar to wasm3 on Linux, and it relies on LLVM's sibling-call optimization on 64-bit platforms. But unfortunately, "Stitch currently does not run on 32-bit platforms. The reason for this is that I have not yet found a way to get LLVM to perform sibling call optimisation on these platforms (ideas welcome)." So it is probably not generally applicable to embedded for now. But there are some 64-bit embedded architectures. WDYT?

  2. I looked more into the in-place optimization paper, and "[i]t is implemented using a macro assembler that generates x86-64 machine code", as "[it] allows for near-perfect register allocation and unlocks all possible dispatch and organization techniques". So the implementation seems heavily dependent on the architecture, and I guess an implementation in a high-level language like Rust may have worse performance. Is there a primary embedded architecture we want to support, like RISC-V? Or do we ideally want architecture independence?

@ia0
Member Author

ia0 commented Jul 15, 2024

  1. Indeed, this is too specific to high-end machines and won't work on embedded.
  2. Yes, we won't be able to do all the tricks done in Wizard (the engine used in the paper). However, I think we should be able to do the key techniques described at the beginning of the 3rd chapter (side table and value stack); see the sketch below. We should also be able to do the dispatch table.
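
For illustration, a minimal sketch of a side-table entry and of taking a branch with it; the field names follow the ones used later in this thread (`delta_ip`, `delta_stp`, `val_cnt`, `pop_cnt`), but the actual layout in dev/fast-interp may differ.

```rust
/// One side-table entry per branch, precomputed during validation.
struct BranchTableEntry {
    /// Offset to add to the instruction pointer when the branch is taken.
    delta_ip: i32,
    /// Offset to add to the side-table pointer when the branch is taken.
    delta_stp: i32,
    /// Number of values the branch target expects on top of the value stack.
    val_cnt: u32,
    /// Number of values to discard below them when branching.
    pop_cnt: u32,
}

/// Taking a branch becomes O(1): adjust both pointers by the precomputed deltas.
fn take_branch(ip: &mut usize, stp: &mut usize, entry: &BranchTableEntry) {
    *ip = ip.wrapping_add_signed(entry.delta_ip as isize);
    *stp = stp.wrapping_add_signed(entry.delta_stp as isize);
    // The caller also keeps the top `val_cnt` values and drops `pop_cnt` below them.
}
```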

ia0 added a commit that referenced this issue Nov 18, 2024
There was an earlier review in
zhouwfang#2.

#46

---------

Co-authored-by: Julien Cretin <cretin@google.com>
Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 pushed a commit that referenced this issue Dec 5, 2024
Jumping from "if without else" should be 1 step less than jumping from
"if with else". For "if without else", we don't want to jump over the
"end" because `exit_label()` need to be
[called](https://github.com/zhouwfang/wasefire/blob/de6c7e889865736cb1f11877d5d4e1e9f50729a8/crates/interpreter/src/exec.rs#L788)
at the "end". But for "if with else", we want to jump over the "else"
because otherwise we would be forced to jump to the "end" at "else"
(normally, we would reach "else" after executing the true branch of
"if", and we should not execute the "else" branch).

This is equivalent to
https://github.com/google/wasefire/blob/main/crates/interpreter/src/parser.rs#L566-L569,
which will be deprecated with the side table applied in `exec.rs`.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 pushed a commit that referenced this issue Dec 6, 2024
When "if" has an "else", we should skip the entry for "else" in the side
table. It was addressed in #690 during execution and should be moved to
validation.

#46

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
@zhouwfang
Member

TODO: Improve the performance of such accessor functions from O(n) to O(1).

ia0 added a commit that referenced this issue Dec 11, 2024
Apply `delta_ip` and `delta_stp` from side table in `exec.rs`.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>
ia0 pushed a commit that referenced this issue Dec 13, 2024
On my Linux Docker container:

1. CoreMark results of this PR: 37.991516 (in 18.695s), 36.7287 (in 19.258s)

   CoreMark results of #690: 37.101162 (in 19.133s), 37.39716 (in 18.996s)

   They look very similar. Not sure about `nordic`.

2. CoreMark results of `main`: 28.791477 (in 18.09s), 28.814291 (in 17.734s)

   So there does seem to be some minor performance improvement with `delta_ip` and `delta_stp`.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 added a commit that referenced this issue Dec 23, 2024
#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>
ia0 pushed a commit that referenced this issue Dec 24, 2024
It improves the performance by removing the costly
`last_frame_values_cnt()`.

On linux, `CoreMark result: 39.795715 (in 17.853s)`. cf. CoreMark result
was about 37 in 19s in #705.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 pushed a commit that referenced this issue Dec 27, 2024
#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 added a commit that referenced this issue Dec 27, 2024
- Apply `val_cnt` and `pop_cnt`.
- Remove `Label` by introducing `Frame::labels_cnt`.

There is not much performance improvement on Linux due to this PR, which
replaces
```
let values_cnt: usize = frame.labels[i ..].iter().map(|label| label.values_cnt).sum();
```
in `pop_label()` by `pop_cnt + val_cnt`.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>
@zhouwfang
Member

zhouwfang commented Dec 27, 2024

We have implemented the value stack and the side table from Ben's paper (modulo some refactoring unimportant to performance), and it is a good time to summarize the benchmarks.

| target | runtime | coremark | time (s) | code (KB) | RAM |
|--------|---------|----------|----------|-----------|-----|
| linux  | base (main)     | 28.510336 | 17.991 | | |
| linux  | dev/fast-interp | 38.32886  | 18.504 | | |
| linux  | wasm3           | 2592.5205 | 19.479 | | |
| linux  | wasmi           | 1297.4375 | 23.742 | | |
| nordic | base (main)     | 0.088684715 | 225.861 | 136 | 5416 |
| nordic | dev/fast-interp | 0.1541212   | 130.039 | 141 | 6244 |
| nordic | wasm3           | | | | |
| nordic | wasmi           | 3.394433 | 20.678 | 912 | 91960 |

So on nordic, the performance so far is about 2x that of the base, with little increase in footprint.

ia0 pushed a commit that referenced this issue Jan 3, 2025
Based on experiments, these are the smallest masks that pass CI. In other words, `SideTableEntry` only requires `u35` at this point. I'll add the fields from `func_type()` and `func()` in `module.rs` to the side table in #711, at which point `SideTableEntry` as `u64` might not be sufficient.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
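
For illustration, a minimal sketch of packing such an entry into a `u64` with shifts and masks; the field widths and layout below are assumptions chosen for the example, as the commit above only states that about 35 bits suffice in total.

```rust
// Hypothetical layout: two 13-bit signed deltas and two 5-bit counts (36 bits).
const DELTA_BITS: u32 = 13;
const CNT_BITS: u32 = 5;

fn pack(delta_ip: i32, delta_stp: i32, val_cnt: u32, pop_cnt: u32) -> u64 {
    let mask = |bits: u32| (1u64 << bits) - 1;
    // Signed deltas are stored in two's complement truncated to DELTA_BITS.
    ((delta_ip as u64) & mask(DELTA_BITS))
        | (((delta_stp as u64) & mask(DELTA_BITS)) << DELTA_BITS)
        | (((val_cnt as u64) & mask(CNT_BITS)) << (2 * DELTA_BITS))
        | (((pop_cnt as u64) & mask(CNT_BITS)) << (2 * DELTA_BITS + CNT_BITS))
}

fn unpack_delta_ip(entry: u64) -> i32 {
    // Sign-extend the low DELTA_BITS bits back to i32.
    (((entry as i64) << (64 - DELTA_BITS)) >> (64 - DELTA_BITS)) as i32
}
```
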
ia0 pushed a commit that referenced this issue Jan 8, 2025
#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 added a commit that referenced this issue Jan 10, 2025
I'd like to get some feedback about the design: the "section table" is
used to improve the performance of [these
accessors](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L176-L209)
in `module.rs`, which are used in execution. For these three accessors
that return `Parser`, we can precompute the `ParseRange` during
validation, and create a parser based on the range during execution.

I'm not sure about [the other
accessors](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L119-L174)
that return types or even `Option<ExportDesc>`. IIUC, we want to store
the section table in flash, like the side table. Does that mean we would
have to design some encoding rules for these return types?

**Update 1**

We will only optimize
[`func()`](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L188)
and
[`func_type()`](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L119)
in `module.rs`, because the other accessors should be fast.

**Update 2**

- Rename the previous `SideTableEntry` as `BranchTableEntry`.
- Create per-function `SideTable` to include branch table and other
function metadata.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>
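
For illustration, a minimal sketch of the shape described in Update 2; apart from `SideTable` and `BranchTableEntry`, the field names here are assumptions, and the actual definitions live in the dev/fast-interp branch.

```rust
/// Packed branch entry, as introduced earlier in this thread (details omitted).
struct BranchTableEntry(u64);

/// Hypothetical byte range into the module, precomputed during validation.
struct ParseRange {
    start: usize,
    end: usize,
}

/// Per-function side table: branch targets plus the function metadata needed
/// to avoid re-scanning sections at execution time.
struct SideTable {
    /// One entry per branch instruction in the function body.
    branch_table: Vec<BranchTableEntry>,
    /// Where the function's type can be parsed from (precomputed `func_type()`).
    type_range: ParseRange,
    /// Where the function's body can be parsed from (precomputed `func()`).
    body_range: ParseRange,
}
```
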
ia0 added a commit that referenced this issue Jan 15, 2025
#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>