Interpreter performance and footprint #46

Open
ia0 opened this issue Mar 22, 2023 · 21 comments · Fixed by #563, #598, #605 or #608
Labels
crate:interpreter Modifies the interpreter for:footprint Reduces firmware footprint for:performance Improves firmware performance needs:design Needs design to make progress

Comments

@ia0
Member

ia0 commented Mar 22, 2023

The current interpreter is very simple:

  • It doesn't pre-compute anything.
  • It executes directly from flash.

It is thus an in-place (non-optimized) interpreter according to A fast in-place interpreter for WebAssembly by Ben L. Titzer.

We would like to give users control over the tradeoff between performance and footprint (both in flash and memory). The following improvements would go towards this direction:

  • Implement the optimizations discussed in the paper above, and feature-gate those that don't improve both performance and footprint (like the sidetable).
  • Implement optimizations similar to what wasm3 does, by translating the opcodes into their handler function addresses, and feature-gate them (see the sketch after this list).
  • Implement a feature to persist the translated code from the optimization above in flash. This optimization means that a platform update would invalidate the applet (unless the original bytecode is preserved). Such translated code would not need validation or translation at startup and could execute directly.
  • JIT and AOT are probably out of the question for now. If wasmtime or wasmer end up supporting no-std, the situation might simplify. The difficulty is that they would need to run in a different thread so as not to take control of the scheduler thread. See Explore Wasmtime as an alternative WebAssembly runtime #458
  • Use Pulley, whose principles and rationale seem to fit our use case very well.
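
For illustration, a minimal sketch of the wasm3-style translation mentioned above, using hypothetical types (this is not the wasefire-interpreter or wasm3 API): each opcode is pre-translated into the address of its handler, so the hot loop is a single indirect call per instruction.

```rust
// Hypothetical sketch of wasm3-style pre-translation: opcodes become handler
// addresses, so dispatch is an indirect call instead of a match on a byte.
struct Vm {
    stack: Vec<u64>,
}

/// Each handler executes one instruction and returns the next program counter.
type Handler = fn(&mut Vm, pc: usize, code: &[Translated]) -> usize;

/// A translated instruction: the handler address plus one immediate operand.
struct Translated {
    handler: Handler,
    operand: u64,
}

fn op_const(vm: &mut Vm, pc: usize, code: &[Translated]) -> usize {
    vm.stack.push(code[pc].operand);
    pc + 1
}

fn op_add(vm: &mut Vm, pc: usize, _code: &[Translated]) -> usize {
    let b = vm.stack.pop().unwrap();
    let a = vm.stack.pop().unwrap();
    vm.stack.push(a.wrapping_add(b));
    pc + 1
}

/// The hot loop after translation: one indirect call per instruction.
fn run(vm: &mut Vm, code: &[Translated]) {
    let mut pc = 0;
    while pc < code.len() {
        pc = (code[pc].handler)(vm, pc, code);
    }
}
```

Feature-gating this keeps the in-place interpreter as the footprint-friendly default, since the translated form trades flash/RAM for dispatch speed.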

Open questions:

  • Does it make sense to have a feature like toctou which would remove all dead panics (not just those with a corresponding dynamic check, but all panics the compiler cannot prove unreachable on its own)?
  • How to benchmark? Which suite to use?
  • If there is a way to detect hot and cold functions, and optimizations have a per-function cost, only hot functions could be optimized.

Related work:

The work is tracked in the dev/fast-interp branch.

@ia0 ia0 added needs:design Needs design to make progress crate:interpreter Modifies the interpreter for:performance Improves firmware performance for:footprint Reduces firmware footprint labels Mar 22, 2023
@zhouwfang zhouwfang self-assigned this Apr 23, 2024
@zhouwfang
Member

zhouwfang commented May 8, 2024

@ia0 What is the expected timeline for #458? I assume it might make this issue irrelevant?

@ia0
Member Author

ia0 commented May 8, 2024

It depends on when someone takes a look at it, but I don't expect it to work in the short term. And if it ever works, I don't expect it to be a replacement for the current interpreter, just an alternative, the same way a rewriting interpreter and a simple compiler would be alternatives to the current in-place interpreter. They all provide different trade-offs between performance, footprint, and portability. I'll update the issue with this alternative. EDIT: Actually, the issue was already mentioning Wasmtime as an option. I just linked the issue.

@zhouwfang
Member

Thanks for the clarification!

@zhouwfang
Member

@ia0 I was wondering if it is worth considering using wasmi.

In https://google.github.io/wasefire/faq.html#why-implement-a-new-interpreter, it says "wasmi consumes too much RAM for embedded". However, in the recent release wasmi has been migrated from stack-based IR to register-based IR, and the release doc writes, "The new register-based IR was carefully designed to enhance execution performance and to minimize memory usage. As the vast majority of a Wasm binary is comprised of encoded instructions, this substantially decreases memory usage and enhances cache efficiency when executing Wasm through Wasmi. [...] with a fantastic startup performance and low memory consumption especially suited for embedded environments."

(Of course, we would still need to implement streamed compilation to flash.)

@ia0
Member Author

ia0 commented Jul 1, 2024

I was wondering if it is worth considering using wasmi.

It makes sense to give it a try if they now claim to be suited for embedded environments. That would be one more comparison point along the following dimensions: interpreter code size, interpreter RAM usage, and interpreter performance.

For testing purposes, let's first modify wasefire-scheduler in place to use wasmi instead of wasefire-interpreter. This is just a quick and dirty solution to assess the viability of wasmi. If the results are good, we can create a feature to choose between both implementations.

Of course, we would still need to implement streamed compilation to flash

Yes, that's probably a necessary step, but it could be done in a second phase.

@zhouwfang
Member

First, just wanted to confirm my understanding about Scheduler::run(wasm):

  1. scheduler.load(wasm) may call host functions, and these calls are handled by scheduler.process_applet() within scheduler.load(wasm). This means we would need to "fit" wasmi's host function API into scheduler.process_applet().

  2. scheduler.flush_events() handles pending events from the board such as button presses.

For performance evaluation purposes, can we ignore host functions? In other words, can we compare only scheduler.load(wasm) and "wasmi.load(wasm)" without host functions in the wasm, and remove this infinite loop? Thanks.

@ia0
Member Author

ia0 commented Jul 3, 2024

It's true that the need for an in-place interpreter (in particular regarding RAM usage) was not the only reason to write a custom interpreter. There was also the multiplexing at the host-function level. That said, I believe this last part could be done otherwise, as long as interpreters have a way to give control back (e.g. using fuel in wasmi or epochs in wasmtime).
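
For illustration, a minimal sketch of the fuel-based approach, assuming wasmi's fuel-metering API (`Config::consume_fuel` and `Store::set_fuel`; the exact method names differ across wasmi versions):

```rust
// Minimal sketch (assumed wasmi API; method names vary across wasmi versions):
// with fuel metering enabled, a long-running applet traps with an out-of-fuel
// error, returning control to the scheduler instead of monopolizing its thread.
use wasmi::{Config, Engine, Store};

fn bounded_store(fuel: u64) -> Store<()> {
    let mut config = Config::default();
    config.consume_fuel(true); // every executed instruction consumes fuel
    let engine = Engine::new(&config);
    let mut store = Store::new(&engine, ());
    store.set_fuel(fuel).expect("fuel metering is enabled");
    store
}
```

Once the budget is exhausted, the call returns an out-of-fuel error and the scheduler regains control and can decide how to proceed.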

I guess the simplest way to check performance is to use my riscv branch and add wasmi as a runtime. This should be easy, since it should behave similarly to wasm3.

@zhouwfang
Member

zhouwfang commented Jul 3, 2024

CoreMark results based on your riscv branch, with a linux docker container on my personal laptop:

wasmi
CoreMark result: 692.2593 (in 17.595s)
CoreMark result: 663.20996 (in 18.242s)
CoreMark result: 620.90765 (in 19.381s)
CoreMark result: 657.69806 (in 18.73s)
CoreMark result: 689.52545 (in 17.623s)

wasm3
CoreMark result: 881.8342 (in 13.633s)
CoreMark result: 865.6646 (in 13.928s)
CoreMark result: 947.5407 (in 12.794s)
CoreMark result: 951.55707 (in 12.729s)
CoreMark result: 957.02106 (in 12.683s)

base
CoreMark result: 25.864857 (in 19.304s)
CoreMark result: 27.917364 (in 18.341s)
CoreMark result: 27.032507 (in 18.921s)
CoreMark result: 27.77392 (in 18.362s)
CoreMark result: 27.847397 (in 18.577s)

wasmi looks quite competitive. WDYT?

Edit: Another advantage of wasmi is that it supports streamed translation -- https://wasmi-labs.github.io/blog/posts/wasmi-v0.32/#non-streaming-translation

@ia0
Member Author

ia0 commented Jul 4, 2024

Thanks! Can you push your branch on your fork? I would like to test on embedded devices, which is what matters.

Another advantage of wasmi is that it supports streamed translation

That's already something, but it's the same for the current interpreter. What is important is not only streamed translation, but also persistent translation (not to RAM, but to flash).

@zhouwfang
Member

Thanks! Can you push your branch on your fork? I would like to test on embedded devices, which is what matters.

Here is the branch with the wasmi runtime. Looking forward to the result on an embedded device.

That's already something, but it's the same for the current interpreter.

What do you mean by "the same"? Thanks.

@ia0
Member Author

ia0 commented Jul 5, 2024

Thanks! So here are the results (linux is my machine and nordic is nRF52840):

| target | runtime | coremark | time | code | RAM |
|--------|---------|----------|------|------|-----|
| linux  | base  | 28.510336 | 17.991s | | |
| linux  | wasm3 | 2592.5205 | 19.479s | | |
| linux  | wasmi | 1297.4375 | 23.742s | | |
| nordic | base  | 0.088684715 | 225.861s | 136K | 5416 |
| nordic | wasm3 | | | | |
| nordic | wasmi | 3.394433 | 20.678s | 912K | 91960 |

We can see that the speedup of wasmi over base is ~45x on linux and ~38x on nordic, so quite comparable in order of magnitude. We also see that wasm3 is ~2x faster than wasmi, so I would expect something similar on nordic if wasm3 compiled there.

It is also important to note that wasmi is ~7x bigger than base in code size and ~17x bigger in RAM usage. That's quite a big issue and a no-go for using it as-is.

So I think we should instead implement the optimizations described by Ben Titzer in https://arxiv.org/abs/2205.01183 and redo this benchmark. I would expect between ~2x and ~10x improvement on coremark with little code and RAM increase.

What do you mean by "the same"? Thanks.

I mean for the validation step, which is done linearly. It's not pure streaming because it still takes a slice as input, but it processes that slice linearly without relying on random access, so it wouldn't be a big change to fix that.
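
For illustration, a hypothetical sketch of what removing the slice assumption could look like; none of these names exist in the current interpreter.

```rust
/// Hypothetical streaming source of module bytes: the validator would pull
/// bytes in order from it instead of indexing into a pre-loaded slice.
trait StreamingInput {
    type Error;
    /// Fills `buf` with the next module bytes, returning how many were written
    /// (0 means end of input).
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, Self::Error>;
}
```

Since validation already consumes its input linearly, switching to such a source would mostly mean replacing slice indexing with buffered reads.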

@ia0
Member Author

ia0 commented Jul 5, 2024

By the way, once #523 is merged, could you create a PR to the dev/wasm-bench branch with your commit adding wasmi? This will be useful for future comparisons.

@zhouwfang
Member

Thanks for testing on nordic! (I should think harder about how to do that by myself in my remote work set-up.)

I just added the optimization for wasmi in #524 according to its documentation. On linux, it improves the CoreMark from ~660 to ~1100. Could you give this optimized wasmi another try on nordic? Thanks!

I'll look into the in-place optimization paper.

@ia0
Member Author

ia0 commented Jul 8, 2024

Thanks! I think those optimizations make sense in general, so I enabled them for all in your PR. Here are the results (the linux machine is not the same as before, but nordic is the same):

| target | runtime | coremark | time | code | RAM |
|--------|---------|----------|------|------|-----|
| linux  | base  | 27.179453 | 18.802s | | |
| linux  | wasm3 | 2169.0405 | 18.947s | | |
| linux  | wasmi | 1112.2234 | 28.174s | | |
| nordic | base  | 0.09126336 | 219.468s | 144K | 5416 |
| nordic | wasm3 | | | | |
| nordic | wasmi | 4.488666 | 15.643s | 820K | 91960 |

We see some improvement, but the code size is still unacceptable.

By the way, wasmtime recently decided to support no-std (bytecodealliance/wasmtime#8341). Could you also add similar runtime support for wasmtime as you did for wasmi? The code should be rather similar. I'm curious to see if it already works on nordic.

@zhouwfang
Member

On my linux docker container, the wasmtime coremark is ~20 times that of wasm3. Is this expected?

IIUC, JIT runtimes such as wasmtime are more suitable for compute-intensive wasm workloads, while rewriting interpreters are more suitable for translation-intensive workloads. I was wondering whether we should prioritize optimizing execution time or translation time, or potentially both.

@ia0
Member Author

ia0 commented Jul 10, 2024

Yes, Wasmtime is much faster because it's compiled. But if we can get a compiler that is small in code size and doesn't use too much RAM, then we'll take it.

Regarding prioritization, there's not a single goal. In the end we want the user to be able to choose the trade-off between performance, security, footprint, and portability. So we want to provide as many alternatives (that are different enough from each other) as possible.

@zhouwfang
Member

zhouwfang commented Jul 15, 2024

  1. I found another wasm interpreter in Rust named stitch. According to its README, it has CoreMark results similar to wasm3 on Linux, and it relies on LLVM's sibling-call optimization on 64-bit platforms. But unfortunately, "Stitch currently does not run on 32-bit platforms. The reason for this is that I have not yet found a way to get LLVM to perform sibling call optimisation on these platforms (ideas welcome)." So it is probably not generally applicable to embedded for now. But there are some 64-bit embedded architectures. WDYT?

  2. I looked more into the in-place optimization paper, and "[i]t is implemented using a macro assembler that generates x86-64 machine code", as "[it] allows for near-perfect register allocation and unlocks all possible dispatch and organization techniques". So the implementation seems heavily dependent on the architecture, and I guess an implementation in a high-level language like Rust may have worse performance. Is there a primary embedded architecture we want to support, like RISC-V? Or do we ideally want architecture independence?

@ia0
Member Author

ia0 commented Jul 15, 2024

  1. Indeed, this is too specific to high-end machines and won't work on embedded.
  2. Yes, we won't be able to do all the tricks done in Wizard (the engine used in the paper). However, I think we should be able to do the key techniques described at the beginning of the 3rd chapter (side table and value stack); see the sketch below. We should also be able to do the dispatch table.
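
For illustration, a minimal sketch of a side-table entry and of taking a branch with it; the field names follow the ones used later in this thread (`delta_ip`, `delta_stp`, `val_cnt`, `pop_cnt`), but the actual layout in dev/fast-interp may differ.

```rust
/// One side-table entry per branch, precomputed during validation.
struct BranchTableEntry {
    /// Offset to add to the instruction pointer when the branch is taken.
    delta_ip: i32,
    /// Offset to add to the side-table pointer when the branch is taken.
    delta_stp: i32,
    /// Number of values the branch target expects on top of the value stack.
    val_cnt: u32,
    /// Number of values to discard below them when branching.
    pop_cnt: u32,
}

/// Taking a branch becomes O(1): adjust both pointers by the precomputed deltas.
fn take_branch(ip: &mut usize, stp: &mut usize, entry: &BranchTableEntry) {
    *ip = ip.wrapping_add_signed(entry.delta_ip as isize);
    *stp = stp.wrapping_add_signed(entry.delta_stp as isize);
    // The caller also keeps the top `val_cnt` values and drops `pop_cnt` below them.
}
```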

ia0 added a commit that referenced this issue Nov 18, 2024
There was an earlier review in
zhouwfang#2.

#46

---------

Co-authored-by: Julien Cretin <cretin@google.com>
Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 pushed a commit that referenced this issue Dec 5, 2024
Jumping from "if without else" should be 1 step less than jumping from
"if with else". For "if without else", we don't want to jump over the
"end" because `exit_label()` need to be
[called](https://github.com/zhouwfang/wasefire/blob/de6c7e889865736cb1f11877d5d4e1e9f50729a8/crates/interpreter/src/exec.rs#L788)
at the "end". But for "if with else", we want to jump over the "else"
because otherwise we would be forced to jump to the "end" at "else"
(normally, we would reach "else" after executing the true branch of
"if", and we should not execute the "else" branch).

This is equivalent to
https://github.com/google/wasefire/blob/main/crates/interpreter/src/parser.rs#L566-L569,
which will be deprecated with the side table applied in `exec.rs`.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 pushed a commit that referenced this issue Dec 6, 2024
When "if" has an "else", we should skip the entry for "else" in the side
table. It was addressed in #690 during execution and should be moved to
validation.

#46

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
@zhouwfang
Member

TODO: Improve the performance of such accessor functions from O(n) to O(1).

ia0 added a commit that referenced this issue Dec 11, 2024
Apply `delta_ip` and `delta_stp` from side table in `exec.rs`.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>
ia0 pushed a commit that referenced this issue Dec 13, 2024
On my Linux Docker container:

1. CoreMark results of this PR: 37.991516 (in 18.695s), 36.7287 (in 19.258s)

   CoreMark results of #690: 37.101162 (in 19.133s), 37.39716 (in 18.996s)

   They look very similar. Not sure about `nordic`.

2. CoreMark results of `main`: 28.791477 (in 18.09s), 28.814291 (in 17.734s)

   So there does seem to be some minor performance improvement with `delta_ip` and `delta_stp`.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 added a commit that referenced this issue Dec 23, 2024
#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>
ia0 pushed a commit that referenced this issue Dec 24, 2024
It improves the performance by removing the costly
`last_frame_values_cnt()`.

On linux, `CoreMark result: 39.795715 (in 17.853s)`. cf. CoreMark result
was about 37 in 19s in #705.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 pushed a commit that referenced this issue Dec 27, 2024
#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 added a commit that referenced this issue Dec 27, 2024
- Apply `val_cnt` and `pop_cnt`.
- Remove `Label` by introducing `Frame::labels_cnt`.

There is not much performance improvement on Linux due to this PR, which
replaces
```
let values_cnt: usize = frame.labels[i ..].iter().map(|label| label.values_cnt).sum();
```
in `pop_label()` by `pop_cnt + val_cnt`.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>
@zhouwfang
Member

zhouwfang commented Dec 27, 2024

We have implemented the value stack and the side table from Ben's paper (modulo some refactoring unimportant to performance), and it is a good time to summarize the benchmarks.

| target | runtime | coremark | time (s) | code (KB) | RAM |
|--------|---------|----------|----------|-----------|-----|
| linux  | base (main)     | 28.510336 | 17.991 | | |
| linux  | dev/fast-interp | 38.32886  | 18.504 | | |
| linux  | wasm3           | 2592.5205 | 19.479 | | |
| linux  | wasmi           | 1297.4375 | 23.742 | | |
| nordic | base (main)     | 0.088684715 | 225.861 | 136 | 5416 |
| nordic | dev/fast-interp | 0.1541212   | 130.039 | 141 | 6244 |
| nordic | wasm3           | | | | |
| nordic | wasmi           | 3.394433 | 20.678 | 912 | 91960 |

So on nordic, the performance so far is about 2x that of the base, with little increase in footprint.

ia0 pushed a commit that referenced this issue Jan 3, 2025
Based on experiments, these are the smallest masks that pass CI. In other words, `SideTableEntry` only requires `u35` at this point. I'll add the fields from `func_type()` and `func()` in `module.rs` to the side table in #711, at which point `SideTableEntry` as `u64` might not be sufficient.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
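
For illustration, a minimal sketch of packing such an entry into a `u64` with shifts and masks; the field widths and layout below are assumptions chosen for the example, as the commit above only states that about 35 bits suffice in total.

```rust
// Hypothetical layout: two 13-bit signed deltas and two 5-bit counts (36 bits).
const DELTA_BITS: u32 = 13;
const CNT_BITS: u32 = 5;

fn pack(delta_ip: i32, delta_stp: i32, val_cnt: u32, pop_cnt: u32) -> u64 {
    let mask = |bits: u32| (1u64 << bits) - 1;
    // Signed deltas are stored in two's complement truncated to DELTA_BITS.
    ((delta_ip as u64) & mask(DELTA_BITS))
        | (((delta_stp as u64) & mask(DELTA_BITS)) << DELTA_BITS)
        | (((val_cnt as u64) & mask(CNT_BITS)) << (2 * DELTA_BITS))
        | (((pop_cnt as u64) & mask(CNT_BITS)) << (2 * DELTA_BITS + CNT_BITS))
}

fn unpack_delta_ip(entry: u64) -> i32 {
    // Sign-extend the low DELTA_BITS bits back to i32.
    (((entry as i64) << (64 - DELTA_BITS)) >> (64 - DELTA_BITS)) as i32
}
```
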
ia0 pushed a commit that referenced this issue Jan 8, 2025
#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
ia0 added a commit that referenced this issue Jan 10, 2025
I'd like to get some feedback about the design: the "section table" is
used to improve the performance of [these
accessors](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L176-L209)
in `module.rs`, which are used in execution. For these three accessors
that return `Parser`, we can precompute the `ParseRange` during
validation, and create a parser based on the range during execution.

I'm not sure about [the other
accessors](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L119-L174)
that return types or even `Option<ExportDesc>`. IIUC, we want to store
the section table in flash, like the side table. Does that mean we would
have to design some encoding rules for these return types?

**Update 1**

We will only optimize
[`func()`](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L188)
and
[`func_type()`](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L119)
in `module.rs`, because the other accessors should be fast.

**Update 2**

- Rename the previous `SideTableEntry` as `BranchTableEntry`.
- Create per-function `SideTable` to include branch table and other
function metadata.

#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>
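
For illustration, a minimal sketch of the shape described in Update 2; apart from `SideTable` and `BranchTableEntry`, the field names here are assumptions, and the actual definitions live in the dev/fast-interp branch.

```rust
/// Packed branch entry, as introduced earlier in this thread (details omitted).
struct BranchTableEntry(u64);

/// Hypothetical byte range into the module, precomputed during validation.
struct ParseRange {
    start: usize,
    end: usize,
}

/// Per-function side table: branch targets plus the function metadata needed
/// to avoid re-scanning sections at execution time.
struct SideTable {
    /// One entry per branch instruction in the function body.
    branch_table: Vec<BranchTableEntry>,
    /// Where the function's type can be parsed from (precomputed `func_type()`).
    type_range: ParseRange,
    /// Where the function's body can be parsed from (precomputed `func()`).
    body_range: ParseRange,
}
```
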
ia0 added a commit that referenced this issue Jan 15, 2025
#46

---------

Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
Co-authored-by: Julien Cretin <cretin@google.com>