Interpreter performance and footprint #46
It depends on when someone takes a look, but I expect that it won't work in the short term. And if it ever works, I don't expect it to be a replacement for the current interpreter, but just an alternative, the same way a rewriting interpreter and a simple compiler would be alternatives to the current in-place interpreter. They all provide different trade-offs between performance, footprint, and portability. I'll update the issue with this alternative. EDIT: Actually, the issue already mentioned Wasmtime as an option. I just linked this issue.
Thanks for the clarification!
@ia0 I was wondering if it is worth considering using `wasmi`. https://google.github.io/wasefire/faq.html#why-implement-a-new-interpreter explains why a custom interpreter was written, but `wasmi` now claims to be suited for embedded environments. (Of course, we would still need to implement streamed compilation to flash.)
It makes sense to give it a try if they now claim to be suited for embedded environments. That would be one more comparison point along the following dimensions: interpreter code size, interpreter RAM usage, and interpreter performance. For testing purposes, let's first modify the runner to use `wasmi` for comparison.
Yes, that's probably a necessary step, but it could be done in a second phase.
First, just wanted to confirm my understanding about why the custom interpreter was needed.
For performance evaluation purposes, can we ignore host functions? In other words, can we only compare raw Wasm execution across interpreters?
It's true that the need for an in-place interpreter (in particular regarding RAM usage) was not the only reason to write a custom interpreter. There was also the multiplexing at the host-function level. That said, I believe this last part could be done otherwise, as long as interpreters have a way to give back control flow (e.g. using fuel in wasmi or epochs in wasmtime). I guess the simplest way to check performance is to use my CoreMark setup.
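For reference, here is a minimal sketch of the fuel mechanism in wasmi (API names as in recent wasmi versions, e.g. 0.32, where `Store::set_fuel` exists; older versions use `add_fuel` instead). The `wat` and `anyhow` crates, the `spin` export, and the fuel amount are only for the example:

```rust
use wasmi::{Config, Engine, Linker, Module, Store};

fn main() -> anyhow::Result<()> {
    // A guest with an infinite loop, to show that the host regains control.
    let wasm = wat::parse_str(r#"(module (func (export "spin") (loop br 0)))"#)?;

    let mut config = Config::default();
    config.consume_fuel(true); // every executed instruction consumes fuel
    let engine = Engine::new(&config);
    let module = Module::new(&engine, &wasm)?;

    let mut store = Store::new(&engine, ());
    store.set_fuel(10_000)?; // execution budget before trapping back to the host

    let instance = Linker::new(&engine)
        .instantiate(&mut store, &module)?
        .start(&mut store)?;
    let spin = instance.get_typed_func::<(), ()>(&store, "spin")?;

    // The infinite loop cannot run away: once the budget is spent, the call
    // returns an out-of-fuel error and the scheduler is back in control.
    // Refilling the fuel and re-entering the guest is one way to multiplex.
    assert!(spin.call(&mut store, ()).is_err());
    Ok(())
}
```

Wasmtime's epochs achieve the same preemption with lower per-instruction overhead, at the cost of a less deterministic budget.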
CoreMark results based on your setup:
Edit: Another advantage of `wasmi` is that it supports streamed translation.
Thanks! Can you push your branch on your fork? I would like to test on embedded devices, which is what matters.
That's already something, but it's the same for the current interpreter. What is important is not only streamed translation, but also persistent translation (not to RAM, but to flash).
Here is the branch with the `wasmi` integration.
What do you mean by "the same"? Thanks.
Thanks! So here are the results (linux is my machine and nordic is nRF52840):
We can see the speed-up between wasmi and base is ~45x on linux and ~38x on nordic, so quite comparable in terms of order of magnitude. We also see that wasm3 is ~2x faster than wasmi, so I would expect something similar on nordic if wasm3 compiled there. Also important to note: wasmi is ~7x bigger than base in terms of code size and ~17x bigger than base in terms of RAM usage. That's quite a big issue and a no-go to use as-is. So I think we should instead implement the optimizations described by Ben Titzer in https://arxiv.org/abs/2205.01183 and redo this benchmark. I would expect between ~2x and ~10x improvement on CoreMark with little code-size and RAM increase.
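For context, the key optimization in that paper is a side table, precomputed during validation, that tells each branch where to land so execution never scans the bytecode for a matching `end` or `else`. A minimal sketch of the idea (our names, anticipating the fields used later in this thread; not wasefire's actual code):

```rust
/// One entry per branch-like instruction, computed during validation.
struct SideTableEntry {
    delta_ip: i32,  // signed offset to the target instruction
    delta_stp: i32, // signed offset to the target side-table entry
    val_cnt: u32,   // values the branch carries over to the target
    pop_cnt: u32,   // values discarded below the carried ones
}

struct Interp {
    ip: usize,  // instruction pointer into the original bytecode
    stp: usize, // index into the side table
}

impl Interp {
    /// A taken branch becomes O(1): move both pointers by precomputed
    /// deltas and fix up the value stack, instead of scanning bytecode.
    fn take_branch(&mut self, e: &SideTableEntry, stack: &mut Vec<u64>) {
        let kept = stack.split_off(stack.len() - e.val_cnt as usize);
        stack.truncate(stack.len() - e.pop_cnt as usize);
        stack.extend(kept);
        self.ip = (self.ip as i64 + e.delta_ip as i64) as usize;
        self.stp = (self.stp as i64 + e.delta_stp as i64) as usize;
    }
}
```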
I mean for the validation step, which is done linearly. It's not pure streaming because it still takes a slice as input, but it processes that slice linearly without relying on random access. It wouldn't be a big change to fix that.
By the way, once #523 is merged, could you create a PR to the `dev/fast-interp` branch?
Thanks for testing on nordic! (I should think harder about how to do that myself in my remote work setup.) I just added the optimization for wasmi in #524 according to its documentation. On linux, it improves the CoreMark from ~660 to ~1100. Could you give this optimized build a try on nordic? I'll look into the in-place optimization paper.
Thanks! I think those optimizations make sense in general, so I enabled them for all builds in your PR. Here are the results (the linux machine is not the same as before, but nordic is the same):
We see some improvement, but the code size is still unacceptable. By the way, wasmtime recently decided to support no_std (bytecodealliance/wasmtime#8341). Could you also add similar runtime support for wasmtime as you did for wasmi? The code should be rather similar. I'm curious to see whether it already works on nordic.
On my linux docker container, the wasmtime CoreMark result is much higher than the interpreters'. IIUC, JIT-based runtimes such as wasmtime compile the module to native code, so they are in a different performance class than interpreters. How would you prioritize these alternatives?
Yes, Wasmtime is much faster because it's compiled. But if we can get a compiler that is small in code size and doesn't use too much RAM, then we take it. Regarding prioritization, there's not a single goal. In the end we want the user to be able to choose the trade-off between performance, security, footprint, and portability. So we want to provide as many alternatives (that are different enough from each other) as possible.
There was an earlier review in zhouwfang#2.
Jumping from "if without else" should be 1 step less than jumping from "if with else". For "if without else", we don't want to jump over the "end" because `exit_label()` need to be [called](https://github.com/zhouwfang/wasefire/blob/de6c7e889865736cb1f11877d5d4e1e9f50729a8/crates/interpreter/src/exec.rs#L788) at the "end". But for "if with else", we want to jump over the "else" because otherwise we would be forced to jump to the "end" at "else" (normally, we would reach "else" after executing the true branch of "if", and we should not execute the "else" branch). This is equivalent to https://github.com/google/wasefire/blob/main/crates/interpreter/src/parser.rs#L566-L569, which will be deprecated with the side table applied in `exec.rs`. #46 --------- Co-authored-by: Zhou Fang <33002388+zhou-w-fang@users.noreply.github.com>
TODO: Improve the performance of such accessor functions from `module.rs`.
Apply `delta_ip` and `delta_stp` from the side table in `exec.rs`.
On my Linux Docker container:

1. CoreMark results of this PR: 37.991516 (in 18.695s), 36.7287 (in 19.258s). CoreMark results of #690: 37.101162 (in 19.133s), 37.39716 (in 18.996s). They look very similar. Not sure about `nordic`.
2. CoreMark results of `main`: 28.791477 (in 18.09s), 28.814291 (in 17.734s).

So there does seem to be some performance improvement (~30%) with `delta_ip` and `delta_stp`.
There are several low-hanging performance improvement opportunities in `exec.rs`.
- Apply `val_cnt` and `pop_cnt`.
- Remove `Label` by introducing `Frame::labels_cnt`.

There is not much performance improvement on Linux from this PR, which replaces
```
let values_cnt: usize = frame.labels[i ..].iter().map(|label| label.values_cnt).sum();
```
in `pop_label()` with `pop_cnt + val_cnt`.
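Schematically, the change is the following (self-contained stand-ins for the types in the snippet above; `pop_cnt` and `val_cnt` come from the side table):

```rust
struct Label {
    values_cnt: usize,
}

struct Frame {
    labels: Vec<Label>,
}

// Before: every branch walks all the labels being exited, O(labels).
fn values_cnt_before(frame: &Frame, i: usize) -> usize {
    frame.labels[i..].iter().map(|label| label.values_cnt).sum()
}

// After: the side table already carries both counts, O(1).
fn values_cnt_after(pop_cnt: u32, val_cnt: u32) -> usize {
    (pop_cnt + val_cnt) as usize
}
```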
We have completed the value stack and the side table from Ben's paper (modulo some refactoring that is unimportant to performance), so it is a good time to summarize the benchmarks.
Based on experiments, these are the smallest masks that pass CI. In other words, `SideTableEntry` only requires 35 bits (`u35`) at this point. I'll add the fields from `func_type()` and `func()` in `module.rs` to the side table in #711, at which point a `u64` `SideTableEntry` might no longer be sufficient.
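For illustration, here is one way such packing can look (the field widths below are hypothetical and merely sum to the 35 bits mentioned above; the real masks live in the PR):

```rust
/// Hypothetical packing of a branch-table entry into a u64.
/// 14 + 11 + 5 + 5 = 35 bits used, leaving headroom in the u64.
#[derive(Copy, Clone)]
pub struct BranchTableEntry(u64);

const DELTA_IP_BITS: u32 = 14; // signed field
const DELTA_STP_BITS: u32 = 11; // signed field
const VAL_CNT_BITS: u32 = 5;
const POP_CNT_BITS: u32 = 5;

fn mask(bits: u32) -> u64 {
    (1u64 << bits) - 1
}

impl BranchTableEntry {
    pub fn new(delta_ip: i32, delta_stp: i32, val_cnt: u32, pop_cnt: u32) -> Self {
        // Callers must ensure each field fits its width.
        debug_assert!(u64::from(pop_cnt) <= mask(POP_CNT_BITS));
        let mut bits = (delta_ip as u64) & mask(DELTA_IP_BITS);
        bits |= ((delta_stp as u64) & mask(DELTA_STP_BITS)) << DELTA_IP_BITS;
        bits |= (val_cnt as u64) << (DELTA_IP_BITS + DELTA_STP_BITS);
        bits |= (pop_cnt as u64) << (DELTA_IP_BITS + DELTA_STP_BITS + VAL_CNT_BITS);
        BranchTableEntry(bits)
    }

    pub fn delta_ip(self) -> i32 {
        // Extract and sign-extend the low DELTA_IP_BITS bits.
        let raw = (self.0 & mask(DELTA_IP_BITS)) as i64;
        ((raw << (64 - DELTA_IP_BITS)) >> (64 - DELTA_IP_BITS)) as i32
    }
}
```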
I'd like to get some feedback about the design: the "section table" is used to improve the performance of [these accessors](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L176-L209) in `module.rs`, which are used during execution. For the three accessors that return `Parser`, we can precompute the `ParseRange` during validation and create a parser from the range during execution (a sketch follows below). I'm not sure about [the other accessors](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L119-L174) that return types or even `Option<ExportDesc>`. IIUC, we want to store the section table in flash, like the side table. Does that mean we would have to design some encoding rules for these return types?

**Update 1**: We will only optimize [`func()`](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L188) and [`func_type()`](https://github.com/zhouwfang/wasefire/blob/715bbaa6e2f79e40464e9b85ad9fc506c2d02d57/crates/interpreter/src/module.rs#L119) in `module.rs`, because the other accessors should be fast.

**Update 2**:
- Rename the previous `SideTableEntry` as `BranchTableEntry`.
- Create a per-function `SideTable` that includes the branch table and other function metadata.
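A minimal sketch of the precomputed-range idea (all names are hypothetical stand-ins for the `module.rs` accessors linked above):

```rust
/// Byte range of a section inside the module, recorded during validation.
/// Being two plain integers, it is trivial to encode for storage in flash.
#[derive(Copy, Clone)]
pub struct ParseRange {
    start: usize,
    end: usize,
}

/// Stand-in for the interpreter's byte parser.
pub struct Parser<'m>(&'m [u8]);

/// Hypothetical per-module table filled in during validation.
pub struct SectionTable {
    code: ParseRange, // e.g. the range backing func()
}

impl SectionTable {
    /// Execution no longer re-scans the module to locate a section: it
    /// slices the stored range and parses from there.
    pub fn code_parser<'m>(&self, wasm: &'m [u8]) -> Parser<'m> {
        Parser(&wasm[self.code.start..self.code.end])
    }
}
```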
The current interpreter is very simple: it decodes and executes the module bytes directly, without any translation step. It is thus an in-place (non-optimized) interpreter according to *A fast in-place interpreter for WebAssembly* by Ben L. Titzer.
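Schematically, an in-place interpreter re-decodes the original bytes on every step, which is what keeps it small but slow (this loop is a sketch, not wasefire's actual code; the opcodes are from the Wasm spec):

```rust
// Sketch of an in-place dispatch loop: decode the module bytes at `ip`
// on every iteration. Branches must scan for their target, which is
// exactly the cost the side-table work above removes.
fn run(code: &[u8]) {
    let mut ip = 0;
    let mut stack: Vec<u64> = Vec::new();
    while ip < code.len() {
        match code[ip] {
            0x41 => { /* i32.const: decode the LEB128 immediate, push it */ }
            0x6A => { /* i32.add: pop two operands, push their sum */ }
            0x0C => { /* br: scan forward to the matching `end` (the slow part) */ }
            _ => { /* remaining opcodes */ }
        }
        ip += 1; // plus the size of any immediates
    }
    let _ = stack;
}
```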
We would like to give users control over the trade-off between performance and footprint (both in flash and memory). The following improvements would go in this direction:
Open questions:
- TOCTOU.
- Which approach would remove all dead panics (not just those with a corresponding dynamic check, but all where the compiler doesn't prove it on its own)?

Related work:
The work is tracked in the `dev/fast-interp` branch.