How to reconstruct your benchmarks? #1
I think my results are from about 2 years ago. I will try to make some time this week to re-run the benchmarks. It is possible that wasmi has improved since then. My main goal with s1vm was to experiment with closure-based dispatch.
Looking at your closure based tests, they don't use the same nesting that s1vm uses.
A virtual stack is built during the compile phase to track the inputs/outputs for each opcode:

enum Input {
Local(u32), // Normally a function argument.
Const(StackValue), // Constant value from wasm bytecode.
Op(OpFunc), // Closure that produces a value (the opcode pushes a stack value).
}

This is code cut from the bytecode -> closure compiler here:

// For binops (add, sub, etc.), get the two `Input` values from the virtual stack (compile time)
let right = state.pop()?;
let left = state.pop()?;
match left {
/* .... CUT OTHER LEFT matches: local/constant. */
Input::Op(left_closure) => {
match right {
/* ... CUT OTHER RIGHT matches: local/constant. */
Input::Op(right_closure) => {
// Push the new closure onto the virtual stack (compile time).
state.push(Input::Op(Box::new(
///////////////// BINOP closure:
// Two parameters: `state` is the immutable WASM state (functions, code); `store` is the mutable WASM instance state (memory, runtime stack).
move |state: &vm::State, store: &mut Store| -> Trap<StackValue> {
let left = left_closure(state, store)?; // <---------------------- call the closure that produces the left value.
let right = right_closure(state, store)?; // <------------------- call the closure that produces the right value.
let res = (left.0 as $type).$op(right.0 as $type); // <------ the opcode's code.
Ok(StackValue(res as _)) // <------ return the results (native stack, not wasm runtime stack).
}
////////////////////
)));
return Ok(());
},
}
},
/* .... CUT OTHER LEFT matches */
}
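To make this concrete, here is a minimal self-contained sketch (with simplified types; the real OpFunc in s1vm also takes &vm::State and returns a Trap) of how a sequence like local.get 0; i32.const 1; i32.add collapses into a single nested closure with no wasm runtime stack traffic:

#[derive(Clone, Copy, PartialEq, Debug)]
struct StackValue(u64);

struct Store {
    locals: Vec<StackValue>,
}

type OpFunc = Box<dyn Fn(&mut Store) -> StackValue>;

// Compile `local.get 0; i32.const 1; i32.add` into one closure.
fn compile_example() -> OpFunc {
    // `local.get 0` becomes a closure that reads the local directly.
    let left: OpFunc = Box::new(|store: &mut Store| store.locals[0]);
    // `i32.const 1` stays a compile-time constant captured by the add closure.
    let right = StackValue(1);
    // `i32.add` fuses both inputs: no runtime push/pop remains.
    Box::new(move |store: &mut Store| {
        let l = left(store);
        StackValue((l.0 as i32).wrapping_add(right.0 as i32) as u64)
    })
}

fn main() {
    let mut store = Store { locals: vec![StackValue(41)] };
    let op = compile_example();
    assert_eq!(op(&mut store), StackValue(42));
}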
While Rust does not support guaranteed tail-calls, we can still use them when compiling in release mode. You are right that my closure based approaches for instruction dispatch are not entirely similar to yours. I will probably try out a fully nested closure based dispatch soon.
All the dispatching approaches in the research repo use a register machine as the backbone, so there are no push or pop operations; I wonder what you mean, maybe there is some confusion. The main question in the room to me is: are closures actually worth it for dispatching? You could go even further with this approach, because inputs and outputs can also be global variables or even linear memory accesses. So you could for example define an Input variant for globals or memory loads (see the sketch below). When you benchmark the new wasmi, make sure to use the v1 engine.
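For example (a hypothetical extension, reusing the StackValue and OpFunc names from the snippet above; none of these variants exist in s1vm today):

// Hypothetical: globals and linear memory reads become `Input`s too, so they
// can be fused into consuming opcodes just like locals and constants.
enum Input {
    Local(u32),
    Const(StackValue),
    Global(u32),                            // read a global at runtime
    Load { addr: Box<Input>, offset: u32 }, // a fused i32.load as an input
    Op(OpFunc),
}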
I am looking forward to Rust adding support for proper tail-calls. The main reason I don't try using tail-calls in Rust is because they only work in release mode, but also because they will not work with the closure based design. Another reason I had started s1vm was to experiment with different VM designs.
It also looks like the test code for the different designs is not using the same set of instructions: the closure designs only have a limited set of instructions.
I should have said read/write instead of push/pop. Even with a register machine, it looked like the instruction handlers are still doing a PC increment here:
It has been a few years, but I think that when I had created the first "closure" version it was either the same speed or slower than a loop+switch design. It wasn't until I started to add optimizations/specializations to eliminate the read/writes to the wasm runtime stack that it started to be faster. At one point I had both VM designs in the repo to make it easier to compare.
If you mean the register machine designs: I think the closure based design will allow for more optimizations.
I just tried out the master branch of wasmi. My original benchmarks were just using the invoke example. Here is a quickly modified version:

extern crate parity_wasm;
extern crate wasmi;
use std::env::args;
use wasmi::RuntimeValue;
use wasmi_v1 as v1;
fn main() {
let args: Vec<_> = args().collect();
// Need the wasm file, an exported func, and at least one argument.
if args.len() < 4 {
println!("Usage: {} <wasm file> <exported func> <arg>...", args[0]);
return;
}
let (_, program_args) = args.split_at(3);
let module = load_module(&args[1]);
let mut linker = <v1::Linker<()>>::default();
let mut store = v1::Store::new(module.engine(), ());
let instance = linker.instantiate(&mut store, &module)
.unwrap()
.start(&mut store)
.unwrap();
let fib_func = instance.get_export(&store, "fib")
.and_then(v1::Extern::into_func)
.unwrap();
let mut result = [RuntimeValue::I32(0)];
let args = [
RuntimeValue::I32(
program_args[0]
.parse::<i32>()
.unwrap_or_else(|_| panic!("Can't parse arg #{} as i32", program_args[0])),
),
];
fib_func.call(&mut store, &args, &mut result).expect("Failed to call fib");
println!("{:?}", result[0]);
}
fn load_module(file: &str) -> v1::Module {
let buf = std::fs::read(file).expect("Read file");
let engine = v1::Engine::default();
v1::Module::new(&engine, &buf[..]).expect("Failed to load")
}

Run on my machine:

wasmi v0:
wasmi v1:
s1vm:
Cachegrind results. Just the "I refs" (CPU instructions executed) and "D refs" (memory accesses).
wasmi v1:
s1vm:
Also, this is the fib.wasm I used.
Not sure why wasmi v1 was slower than v0 for me. I didn't have the time to fully port the benchmark code.
It seems you were not using the following Cargo profile settings:

[profile.release]
lto = "fat"
codegen-units = 1

In experiments we saw regressions of up to 400% for wasmi without these settings. For a fair comparison of the actual execution time (which is what we are mainly talking about) you should wrap the execution only:

let before = std::time::Instant::now();
execute(fib);
let elapsed = before.elapsed();
println!("time = {:?}", elapsed); |
Oh yes! I too wait for tail call support in Rust. However, it is stated by the devs to have low priority. So I guess if we ever want to see it soon(TM) we should probably contribute it ourselves.
This sounds interesting. Can you elaborate why your closure based approach is well suited for this? You mean without performance penalties in the non-resume execution? The old wasmi engine already supports resuming execution.
Hehe, you are comparing two different test cases there. ;) I just benchmarked the fib_recursive testcase.
This is interesting. I have heard a lot about "moving stuff to registers" from the Wasm3 authors, which is probably what you are describing there, but I have to admit that I was not able to fully grasp what they meant in their descriptions of how they achieve it. I will have another look at your code. Will definitely update my research repo once understood.
Yeah, sorry for the bad description of what I mean by "fused". With fused I mean to not only allow single instructions but to combine whole sequences such as

local.get 0
i32.load 0
i32.const 1
i32.add
local.get 0
i32.store 0

You'd be able to fuse all those instructions into a single one:
This fused instruction does the entire loading, adding and storing in a single step without calling any other function or dispatching into a closure for any of the sub procedures. This is especially possible with closures. We simply not only regard locals and constants as inputs, but also globals and linear memory.
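A hedged sketch of what such a fused closure could look like in an s1vm-style compiler (Memory, read_i32, write_i32 and TrapKind are illustrative names, not taken from either codebase):

// One closure for the whole `local.get 0; i32.load; i32.const 1; i32.add;
// local.get 0; i32.store` sequence: no dispatch for any sub-step.
let local_idx = 0usize;
let offset = 0u32;
let fused = move |store: &mut Store| -> Result<(), TrapKind> {
    let addr = store.locals[local_idx].0 as u32 + offset; // local.get 0 (address)
    let val = store.memory.read_i32(addr)?;               // i32.load
    store.memory.write_i32(addr, val.wrapping_add(1))     // i32.const 1; i32.add; i32.store
};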
Correct. That helped a lot for wasmi v1. I re-ran my simple benchmark with all 3 compiled with those options.

wasmi v0:
wasmi v1:
No improvement for s1vm:
For benchmarking VMs I like to include the setup time too, since some VMs (Java) can have high startup times. It could also be useful to split the setup & execution times. One thing that I haven't checked is the memory cost of using closures in s1vm.
Thank you for the updated benchmarks! Yet,
If this ever becomes a problem you can use
I generally agree, but having both separately (setup and execution) would be very useful information here. Also, as I have mentioned earlier, it is not entirely fair to compare a full-fledged implementation with a POC that lacks most of the functionality (of Wasm).
Is there an open issue or RFC about the tail-call support?
The closure design might make it a little bit more difficult. A loop+switch design is much easier to suspend and resume. To support async with the closure based compiler, I think it would be best to add "yield" points to instruction blocks. This would allow most of the closures to be normal non-async functions. Only opcodes that call other functions (internal or external) would need to support async and the compiler could do some static analysis to avoid adding overhead to simple functions that don't have loops or complex branching. There is a wasm module translator that can instrument a wasm module with yield calls (external calls). https://emscripten.org/docs/porting/asyncify.html
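A minimal sketch of such a yield point, assuming a fuel counter on the Store and a Yield trap variant (both hypothetical, nothing like this exists in s1vm yet):

// Closure the compiler could insert at loop back-edges and call sites.
let yield_point = |store: &mut Store| -> Result<(), TrapKind> {
    store.fuel = store.fuel.saturating_sub(1);
    if store.fuel == 0 {
        // Suspend here; the host can later resume from this point.
        return Err(TrapKind::Yield);
    }
    Ok(())
};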
It has been a few years since I read about the wasm3 VM. I think they used a fixed set of parameters for each opcode function, which allows them to keep those values in CPU registers across the tail-called opcode handlers.
It would be easy to add Memory/Global support to the Input enum used by the virtual stack.
The most recent activities are
I have not yet figured out a way that would not introduce additional overhead for cases where you do not need to resume execution.
These parameters are exactly what I could never wrap my head around: how they are used. Unfortunately the Wasm3 implementation is rather cryptic to me with all of its macros.
Yeah exactly, this is where the closure approach actually could shine. Just look at this research for an example of how NOT to do it. :D
My "simple" benchmarks are not high-quality. I used a high input value for I think using criterion while developing the VM would help a lot. 2 years ago I didn't have as much experience with Rust, so didn't know about all the tooling that was available.
I agree that keeping the closure data close together should improve things for larger wasm modules, but I would wait until there is a test case that shows high memory usage or high cache misses before trying to improve that.
I agree that benchmarking with criterion would help.
I am super thankful for you publishing s1vm. Some questions that arise off the top of my head about the POC:
Originally I wasn't even going to support WASM. I was just going to mock-up a simple bytecode based VM to try out different ideas for a Rust based VM, much like your research repo. But after reading the wasm specs, I decided it was a great starting point. I also decided against trying to rewrite the internal wasmi VM because, like you said, it can't be done over a weekend.
A simple way to support |
Thanks for the links.
I think the compiler could be made to only add async support when requested. I try to push as much work as possible onto the compiler, to keep execution/runtime as fast as possible.
Not sure if you have seen this document about the wasm3 interpreter. It does require some understanding of C ABI calling conventions and assembly to understand why it helps with performance.
I have seen and read this and understood the basic architecture. The only detail that I do not yet grasp is how the
Hmmm, I see a problem there, since it is not really known at compile time whether resumability of code is actually used after compilation. But overall I like the approach of putting as many things as possible into the compilation step. A drawback though (as you mentioned earlier) is that it is still a trade-off. For example, for Parity's use case we usually have to compile relatively small to medium sized Wasm binaries that tend to execute pretty fast. Compilation overhead therefore tends to play a bigger role, since we do not use the compiled code more than once. So generally for interpreters it is about finding the sweet spot between compilation and execution time.
I re-implemented your interpreter architecture in my research repo under the name
Loops and if/else are supported; only a few instructions are still missing. Each block (branch point, loop, if/else, etc.) has a Vec of closures.
Not sure I fully understand your question here. For a "register machine" it is important to know how the registers are stored. If they are just a Vec, then every register access is still a memory access.
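To illustrate the difference (layouts assumed for illustration only):

// "Registers" kept in a heap-allocated Vec: every access is still a bounds
// check plus a load/store through a pointer.
struct Frame {
    regs: Vec<u64>,
}

// wasm3-style "registers": plain parameters/locals of the handler function
// itself, which the native compiler can keep in real CPU registers.
fn add_handler(r0: u64, r1: u64) -> u64 {
    r0.wrapping_add(r1)
}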
The closures have two parts: static code (generated during Rust compile-time) and data (created during wasm compile-time). I don't think the generated machine code will be that much bigger than a VM with a lot of specialized opcodes. A loop+switch based VM with simple opcodes would have the smallest machine code size and could possibly run without many instruction L1 cache misses, but the number of instructions executed (loop overhead) would be higher.
For the current design of
That's exactly what I mean. In my research repo I tried out the
In my implementation in the research repo, loops only have a single body instruction.
It still seems to pass values in the registers. I haven't looked at the specs for this yet.
I was wrong. I just tried out
Yes, that's how it works.
I am sure there will be some upper limit on how much larger the native code for all the specialized closures can get before performance starts to drop. For any opcode that already has a high cost (load/store memory, get/set global), it might be best to keep those as simple unspecialized cases. It is the simpler opcodes (add, sub, mul, etc.) or control flow that would get the biggest improvement from specialization.
I had the most difficulty with nested blocks and control flow in the closure based compiler. It basically has to keep track of the PC while compiling wasm opcodes to handle jump targets. Also, you might want to make sure that the Rust compiler doesn't get too "smart" and do too much optimization. Using something like black_box around the input opcodes should help.
For this simple research VM, there is no need for registers. Just like how the
That is an interesting thought but you might be right with this.
I imagine this to work with the Wasm branching depth that is decreased by 1 every time you return from a branch. At a depth of 0 the outer control structure can take over again.
I don't think this would work well for Wasm execution since in Wasm you could have multiple break points out of a loop. It is probably best to either use a list of op codes directly (as you did) or to have a single block body instruction that might also be a block instruction (containing multiple instructions again).
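A minimal sketch of the two encodings (names illustrative):

// Either a flat list of opcodes per block, or a body instruction that nests.
enum Inst {
    I32Add,           // ordinary opcode
    Br(u32),          // break out of `n` enclosing blocks
    Block(Vec<Inst>), // a block is itself an instruction...
    Loop(Vec<Inst>),  // ...so bodies can nest arbitrarily
}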
This is interesting. Can you elaborate when/why I would want to prevent optimizations from happening in this interpreter architecture?
I think this is true until you try to support Wasm operands such as
So you'd say that
The mapping of those parameters is the standard C x86 ABI calling convention (for 32-bit code; 64-bit code passes more parameters in registers). Other architectures like ARM use different conventions for which parameters are passed in registers. Also, Windows has its own convention.
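For illustration, this is roughly why wasm3 keeps its handlers to a small fixed parameter list (a simplification, not wasm3's actual types): with the SysV x86-64 convention the first few arguments travel in registers (rdi, rsi, rdx, rcx, ...), so a chain of tail-called handlers can pass the whole VM state without touching the native stack.

// Hypothetical handler signature: pc, stack pointer, memory base and one
// value register all fit into argument registers on x86-64.
type Handler = extern "C" fn(pc: *const u64, sp: *mut u64, mem: *mut u8, r0: i64) -> i64;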
To handle multiple breakpoints or breaking out of a nested block, there is the
Normally the wasm instructions get loaded from an external source (file, buffer) and the Rust compiler can't see them. But with micro benchmarks where the input values (the instructions to execute) are available for the compiler to see, it can produce machine code optimized for those inputs, even calculating the results at compile time. The compiler can also optimize away loops.
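A sketch of that guard using std::hint::black_box (criterion::black_box on older toolchains); the surrounding function is illustrative:

use std::hint::black_box;

// Hide the instruction buffer from the optimizer so the interpreter cannot be
// constant-folded or specialized for this particular benchmark input.
fn run_benchmark(interpret: impl Fn(&[u8]) -> i64, program: &[u8]) -> i64 {
    interpret(black_box(program))
}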
If you look at how Line 381 in 4b80d88
It might require some detection of conditional loops from wasm opcodes.
Basically yes. The compiler tries to eliminate as much of the runtime stack usage as it can. While compiling the wasm instructions it uses its virtual stack to track which opcodes produce values (push a result) and which consume values (pop values). It resolves those dependencies at compile time.
I found an interesting optimization while thinking about how wasm3 has a "register" parameter in its opcodes. On branch local_opt1 I added a third parameter that holds local 0 (normally the first function argument). I am not sure how well this method would extend to handle more locals this way, but other locals (index > 0) can still be handled on the runtime stack like normal. The improvement wasn't as much as I was hoping, but it did lower the memory accesses and instruction counts.
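A hedged sketch of the local_opt1 signature change (simplified; names as in the snippets above, not the exact code on the branch):

// Opcode closures gain a third parameter that aliases local 0, so the hottest
// local can live outside the runtime stack.
type OpFunc =
    Box<dyn Fn(&vm::State, &mut Store, &mut StackValue) -> Trap<StackValue>>;

// `local.get 0` then just copies the passed-in value:
let get_local0: OpFunc = Box::new(|_state, _store, local0| Ok(*local0));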
Sounds interesting and probably explains how Wasm3 uses this register parameter. Having support for more of these registers could help even further.
The new parameter is a mutable reference, so it can be updated by any opcode in that function.
I was going to try implementing an enum based opcode design. Yesterday, while profiling, I was thinking about the memory layout of the closures. Something like this, which is similar to your research repo:

type LocalIdx = u32;
type FuncIdx = u32;
enum Input {
Local(LocalIdx),
Const(StackValue),
Op(Box<Opcode>), // Need to box `Opcode` to avoid recursive type.
// Might need a variant for opcodes that don't produce values: SetLocal, SetGlobal.
}
// the `Box<Opcode>` parameters would provide the nested execution of opcodes like what is done in s1vm
// with the current closures.
enum Opcode {
// Specialize some opcodes for the different `Input` variants, which is the same that the closure specialization does.
I32Add_L_L(LocalIdx, LocalIdx),
I32Add_L_C(LocalIdx, StackValue),
I32Add_L_O(LocalIdx, Box<Opcode>),
I32Add_O_C(Box<Opcode>, StackValue),
I32Add_O_O(Box<Opcode>, Box<Opcode>),
I32Sub_L_L(LocalIdx, LocalIdx),
I32Sub_L_C(LocalIdx, StackValue),
I32Sub_L_O(LocalIdx, Box<Opcode>),
I32Sub_O_C(Box<Opcode>, StackValue),
I32Sub_O_O(Box<Opcode>, Box<Opcode>),
I32Lts_L(LocalIdx),
I32Lts_C(StackValue),
I32Lts_O(Box<Opcode>),
Call0(FuncIdx), // Zero parameter funcs.
Call1(FuncIdx, Input), // One parameter funcs.
// Generic call op for funcs with > 1 parameters.
Call(FuncIdx, Vec<Input>),
// ..... many more opcodes.
}

Would need to do profiling to see if it has better memory access patterns.
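For comparison, the matching interpreter would be a recursive match instead of a closure call chain (store.local is an assumed helper returning the local's StackValue):

fn eval(op: &Opcode, store: &mut Store) -> StackValue {
    match op {
        Opcode::I32Add_L_L(a, b) => StackValue(
            (store.local(*a).0 as i32).wrapping_add(store.local(*b).0 as i32) as u64,
        ),
        Opcode::I32Add_L_O(a, rhs) => {
            let r = eval(rhs, store);
            StackValue((store.local(*a).0 as i32).wrapping_add(r.0 as i32) as u64)
        }
        // ... one arm per specialized variant ...
        _ => unimplemented!(),
    }
}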
Looking forward to your benchmark results with this approach.
I tried out a variant of your proposed dispatch technique (I did not put a lot of effort into it though) and it is super slow unfortunately:
The new edit: Added another On my system for both
Hi, have you had some time and energy to experiment on these approaches more?
Hi @Robbepop, I had some time this weekend, so I did some testing with your research repo. Also fixed loops/branching in my fork. One thing I have noticed is that the benchmark numbers are not stable if multiple threads are used; the testsuite can be forced to use one thread. See my fork here: https://github.com/Neopallium/interpreter-dispatch-research

I added two new implementations. Here are the numbers I get on my laptop (AMD Ryzen 9 5900HX) with Rust 1.60.0 (7737e0b5c 2022-04-04):
Hi @Neopallium, sorry for the long pause, I was on a break from coding. Thanks a lot for the interesting additions. It is great to see how much potential there still is! I especially looked into the two new implementations you added. In the meantime I kinda finalized the initial implementation of the register-machine approach for wasmi. Still, even as a register machine it does not reach Wasm3-level performance. Do you have any thoughts on this topic?
Have you tried running it? The "research" vms don't have logic for early break out of parent blocks like wasm has.
I was using
Gonna have to take another look at this functionality. With respect to the performance characteristics, I think the difference between Wasm3 and our tools is that Wasm3 was super tuned for speed. From what I see, Wasm3 does not support certain things such as multiple memories (or multiple instances), making memory interop obviously faster, since the lookup is simpler than what we can afford in wasmi.
Good luck. Let me know if you have any questions. I am subscribed to your wasmi PR, so I have seen the increase in commits. When I have some free time I may have a go at profiling it to see if anything pops out.
Thank you! I just did some performance comparisons between your s1vm and the register-machine based wasmi.
Tested on my machine, it seems your approach wins significantly for call-heavy workloads while the new register machine based wasmi wins on the iterative workloads. Note: I used this Wasm module for the tests:

(module
(func $fib_recursive (export "fib_recursive") (param $N i64) (result i64)
(if
(i64.le_s (local.get $N) (i64.const 1))
(then (return (local.get $N)))
)
(return
(i64.add
(call $fib_recursive
(i64.sub (local.get $N) (i64.const 1))
)
(call $fib_recursive
(i64.sub (local.get $N) (i64.const 2))
)
)
)
)
(func $fib_iterative (export "fib_iterative") (param $N i64) (result i64)
(local $n1 i64)
(local $n2 i64)
(local $tmp i64)
(local $i i64)
;; return $N for N <= 1
(if
(i64.le_s (local.get $N) (i64.const 1))
(then (return (local.get $N)))
)
(local.set $n1 (i64.const 1))
(local.set $n2 (i64.const 1))
(local.set $i (i64.const 2))
;; since we normally return n2, handle the n=1 case specially
(loop $again
(if
(i64.lt_s (local.get $i) (local.get $N))
(then
(local.set $tmp (i64.add (local.get $n1) (local.get $n2)))
(local.set $n1 (local.get $n2))
(local.set $n2 (local.get $tmp))
(local.set $i (i64.add (local.get $i) (i64.const 1)))
(br $again)
)
)
)
(local.get $n2)
)
)
edit: I forgot to enable the [profile.release] settings from above. The new results are as follows:
In conclusion, the results are even more devastating for the register machine based wasmi. What further interests me a lot is how the compilation times from Wasm to bytecode (or closures) compare between both implementations. I also thought about how to efficiently add the remaining Wasm features.
Yeah, it would be interesting to bench the "compile" overhead for both vms, i.e. just load, but don't execute. I have a feeling that the closure based compile step will be the more expensive one. I haven't tried optimizing loop code yet (i.e. hoisting the loop counter, which should be possible with the closure based design).
I have optimized wasmi further. The new results are as follows:
I have primarily added a bunch of new optimized instruction variants. Not all of those optimizations have been pushed to GitHub yet. The optimizations regarding recursive calls roughly boil down to reducing the per-call overhead that wasmi had so far.
Hi @Neopallium, long time no update.
So the old numbers are outdated now. edit: Updated the values several times after the implementation of new optimizations.
I know it isn't much but I just wanted to let you know that
Hi, I am one of the authors of wasmi and got curious about your s1vm, which claims to be significantly (approx 2.2 times) faster than wasmi for some fibonacci testcase. I got seriously interested in your approach because of this claim and tried to reproduce your benchmarks.

My results so far with fib(25):

s1vm: 9.668629ms
wasmi_v0: 21.343 ms (~120% slower)
wasmi_v1: 12.966 ms (~30% slower)

Note that wasmi_v0 uses an old engine whereas wasmi_v1 is the most recent engine. Both use stack machine based bytecode, which naturally is inferior with respect to performance to the register machine based bytecode used in both Wasm3 and s1vm. This paper indicates roughly 50%-150% performance improvement by switching over to a register machine based bytecode. So if wasmi_v1 were to use register machine based bytecode, it would probably not be ~30% slower but ~35% faster than s1vm. In fact, there is an issue open for wasmi to implement register machine based bytecode; however, we are not yet sure which instruction dispatching technique to use. For this we are currently researching the most effective dispatching techniques, and closure-based approaches so far do not seem to be very promising, unfortunately.

To reconstruct my own benchmarks you can check out wasmi using this branch, and a fork of your s1vm using this link.

For wasmi simply use cargo bench fib_recursive and for the s1vm fork use cargo run --release -- fib.wasm fib_recursive 30.