
unmergeable: benchmarking #21207

Closed
wants to merge 6 commits

Conversation

holiman
Contributor

@holiman holiman commented Jun 10, 2020

This is not actually a PR, but an analysis of the cost of calling precompiles, prompted by EIP-2046, which proposes lowering that cost.

It's based on benchmarking the following two loops, both of which run until OOG:

prog1

Code which calls the identity precompile with zero arguments:

		byte(vm.JUMPDEST), //  [ count ]
		// push args for the call
		byte(vm.PUSH1), 0, // out size
		byte(vm.DUP1),       // out offset
		byte(vm.DUP1),       // in size
		byte(vm.DUP1),       // in offset
		byte(vm.PUSH1), 0x4, // address of identity
		byte(vm.GAS), // gas
		byte(vm.STATICCALL),
		byte(vm.POP),      // pop return value
		byte(vm.PUSH1), 0, // jumpdestination
		byte(vm.JUMP),

prog2

And code which doesn't do the call, but instead just pops the arguments off the stack:

		byte(vm.JUMPDEST), //  [ count ]
		// push args for the call
		byte(vm.PUSH1), 0, // out size
		byte(vm.DUP1),       // out offset
		byte(vm.DUP1),       // in size
		byte(vm.DUP1),       // in offset
		byte(vm.PUSH1), 0x4, // address of identity
		byte(vm.GAS), // gas

		byte(vm.POP),byte(vm.POP),byte(vm.POP),byte(vm.POP),byte(vm.POP),byte(vm.POP),
		byte(vm.PUSH1), 0, // jumpdestination
		byte(vm.JUMP),
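
For intuition (this tally is not from the PR, it's an assumption built from the Istanbul constant costs: JUMPDEST=1, PUSH1=3, DUP1=3, GAS=2, POP=2, JUMP=8, plus the identity precompile's base cost of 15 for empty input), the static gas per iteration of each loop can be sketched as:

```go
package main

import "fmt"

// Approximate static gas per loop iteration, using Istanbul constant
// costs. This is a back-of-the-envelope model, not go-ethereum code.
const (
	prefix       = 1 + 3 + 3 + 3 + 3 + 3 + 2 // JUMPDEST, PUSH1, 3x DUP1, PUSH1, GAS
	tail         = 3 + 8                      // PUSH1 <jumpdest> + JUMP
	identityBase = 15                         // identity base cost, charged inside the call
)

// prog1Gas is gas/iteration for the staticcall loop at a given call cost.
func prog1Gas(callCost uint64) uint64 {
	return prefix + callCost + identityBase + 2 + tail // +2 for POPping the return value
}

// prog2Gas is gas/iteration for the pop-only loop.
func prog2Gas() uint64 { return prefix + 6*2 + tail } // six POPs instead of the call

func main() {
	for _, c := range []uint64{700, 100, 40} {
		fmt.Printf("call cost %3d: prog1 %4d gas/iter -> %8d iterations per 100M gas\n",
			c, prog1Gas(c), 100_000_000/prog1Gas(c))
	}
	fmt.Printf("prog2: %d gas/iter -> %d iterations per 100M gas\n",
		prog2Gas(), 100_000_000/prog2Gas())
}
```

Lowering the call cost multiplies the iteration count of prog1, and with it the per-iteration interpreter overhead, which is why the wall time per 100M gas grows.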

Current

With the current cost of 700, prog1 is obviously a lot faster, burning through
100M gas in 120ms as opposed to 730ms:

BenchmarkSimpleLoop/identity-precompile-100M-6         	       9	 120120165 ns/op	42914516 B/op	  670317 allocs/op
BenchmarkSimpleLoop/loop-100M-6                        	       2	 732444458 ns/op	   14044 B/op	      42 allocs/op

With the proposed 40 gas:

BenchmarkSimpleLoop/identity-precompile-100M-6         	       1	1404082804 ns/op	593559048 B/op	 5814130 allocs/op
BenchmarkSimpleLoop/loop-100M-6                        	       2	 657346380 ns/op	   14044 B/op	      42 allocs/op

It's actually a factor of two too slow, which means we lowered the gas too much!

Testing with 100 gives a pretty close match:

--- a/params/protocol_params.go
+++ b/params/protocol_params.go
@@ -81,7 +81,7 @@ const (

        // These have been changed during the course of the chain
        CallGasFrontier              uint64 = 40  // Once per CALL operation & message call transaction.
-       CallGasEIP150                uint64 = 700 // Static portion of gas for CALL-derivates after EIP 150 (Tangerine)
+       CallGasEIP150                uint64 = 100 // Static portion of gas for CALL-derivates after EIP 150 (Tangerine)
        BalanceGasFrontier           uint64 = 20  // The cost of a BALANCE operation
        BalanceGasEIP150             uint64 = 400 // The cost of a BALANCE operation after Tangerine
        BalanceGasEIP1884            uint64 = 700 // The cost of a BALANCE operation after EIP 1884 (part of Istanbul)

This gives a pretty good match:

BenchmarkSimpleLoop/identity-precompile-100M-6         	       2	 617293453 ns/op	219197308 B/op	 3424734 allocs/op
BenchmarkSimpleLoop/loop-100M-6                        	       2	 677941768 ns/op	   14044 B/op	      42 allocs/op

Note, though, that this includes a cost of 15 + 1 for the call to identity -- the cost of the input. So let's try
with that cost set to zero, and the precompile doing nothing.

--- a/core/vm/contracts.go
+++ b/core/vm/contracts.go
@@ -187,10 +187,12 @@ type dataCopy struct{}
 // This method does not require any overflow checking as the input size gas costs
 // required for anything significant is so high it's impossible to pay for.
 func (c *dataCopy) RequiredGas(input []byte) uint64 {
+       return 0
        return uint64(len(input)+31)/32*params.IdentityPerWordGas + params.IdentityBaseGas
 }
 func (c *dataCopy) Run(in []byte) ([]byte, error) {
-       return in, nil
+       return nil, nil
+       //return in, nil
 }

This makes the match even better:

BenchmarkSimpleLoop/identity-precompile-100M
BenchmarkSimpleLoop/identity-precompile-100M-6         	       2	 674733768 ns/op	244293824 B/op	 3816870 allocs/op
BenchmarkSimpleLoop/loop-100M
BenchmarkSimpleLoop/loop-100M-6                        	       2	 666473308 ns/op	   13508 B/op	      41 allocs/op

So for go-ethereum, the most 'correct' value with the current implementation seems to be a cost of 100 for calling a precompile.

@axic
Member

axic commented Jun 10, 2020

Nice analysis!

Do you want to create a PR against 2046 documenting this in an appendix and including the new proposed figure?

It would also be important to reproduce these results in at least one other client.

@shamatar

That seems to mean that precompile CALL cost itself is almost zero and everything that takes time is popping items from the stack, right?

@shamatar

shamatar commented Jun 10, 2020

I've briefly looked at the code of STATICCALL here and it does quite a few things indeed, but it should be ok to assume that the in-memory DB is fast enough to simulate the happy path and quickly jump to the precompile. Can you dump the code w/o STATICCALL to hex? I'll try to poke the Rust implementation. I wonder if EVMONE would be much faster, but this is beyond my expertise.

Also, for the modified identity precompile, 3816870 allocs/op looks strange compared to the test without STATICCALL at all.

@shamatar

Oh, it just hit me: your code is actually 6 pushes + staticcall (or 6 pushes + 6 pops), so the staticcall cost that you have proposed should be divided by two, no?

@holiman
Contributor Author

holiman commented Jun 10, 2020

Oh, it just hit me: your code is actually 6 pushes + staticcall (or 6 pushes + 6 pops), so the staticcall cost that you have proposed should be divided by two, no?

I don't understand your reasoning there... care to elaborate?
I mean, both loops run until OOG, not some set number of loops.

@holiman
Contributor Author

holiman commented Jun 10, 2020

Also, I've found what appears to be an error in go. I'll report it upstream and see what they say (about those allocs)

@holiman
Contributor Author

holiman commented Jun 10, 2020

That seems to mean that precompile CALL cost itself is almost zero and everything that takes time is popping items from the stack, right?

No, I think you misunderstand what the analysis shows. I can elaborate further some other time, I've been digging into golang/go#39514 the entire evening.

However, If we can somehow work around whatever causes that to happen, I'm pretty sure we'll see the number go down from 100 to something closer to the proposal.

Nice analysis!

Thanks!

Do you want to create a PR against 2046 documenting this in an appendix and including the new proposed figure?

A bit too early, IMO.

It would be also important to reproduce these results in at least another client.

Yes, that would be very interesting! That's why I cc:d so many on ACD. It hit me now that maybe I missed cc:ing the parity folks -- cc @vorot93 @sorpaas

@shamatar

Oh, it just hit me: your code is actually 6 pushes + staticcall (or 6 pushes + 6 pops), so the staticcall cost that you have proposed should be divided by two, no?

I don't understand your reasoning there... care to elaborate?

I mean, both loops run until OOG, not some set number of loops.

Oh, what I meant is that each simple piece of code crunches through the 100M gas over N loop iterations, but those iterations include not only the staticcall but also the extra code around it. The conclusion seems to be that a staticcall equals 6 POPs in Geth, which would be 12 gas.

@holiman
Contributor Author

holiman commented Jun 11, 2020 via email

@shamatar

If these benchmark results

BenchmarkSimpleLoop/identity-precompile-100M
BenchmarkSimpleLoop/identity-precompile-100M-6         	       2	 674733768 ns/op	244293824 B/op	 3816870 allocs/op
BenchmarkSimpleLoop/loop-100M
BenchmarkSimpleLoop/loop-100M-6                        	       2	 666473308 ns/op	   13508 B/op	      41 allocs/op

represent that a function with a STATICCALL to a precompile that does nothing (modified identity) and a function that does just 6 POPs instead take roughly the same time to execute, then STATICCALL = 6 POPs = 12 gas?

@holiman
Contributor Author

holiman commented Jun 11, 2020

If these benchmark results represent that a function with a STATICCALL to a precompile that does nothing (modified identity) and a function that does just 6 POPs instead take roughly the same time to execute, then STATICCALL = 6 POPs = 12 gas?

Not quite. The benchmark results represent:

  • A loop which iterates X number of times (X is irrelevant), consuming 100M gas, where the bulk of the operations are control-flow + push/pop.
  • A loop which iterates Y number of times (Y is irrelevant), consuming 100M gas, where each iteration does a staticcall-to-precompile, but remaining ops are largely the same.

Now, in the benchmark you quoted, the cost for staticcall-to-precompile had been set to 100. When it was set to 100, both loops burned through 100M gas in roughly the same amount of time. Therefore, 100 is a well-balanced value for geth right now.

What's important is not any absolute value of gas per second on particular hardware; that's totally irrelevant. The only thing that matters is that operations are balanced. If one operation is unbalanced, then a block filled with that particular operation would cause slow blocks, which could be a DoS vulnerability. That's exactly what happened during the Shanghai attacks.
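
The balancing argument above can be sketched as a toy model (this is a hypothetical reconstruction, not the PR's code): assume the wall time per loop iteration does not depend on the gas charged, and solve for the call cost that makes both loops burn gas at the same rate. The per-iteration gas figures (46 base for the call loop, 41 for the pop loop) are an assumed tally of the opcode costs.

```go
package main

import "fmt"

// balancedCallCost solves for the STATICCALL cost c at which both loops
// consume gas at the same rate. gasPerIterBase is the call loop's gas per
// iteration excluding the call cost; popLoopGas is the pop loop's gas per
// iteration; nsCallLoop/nsPopLoop are measured times to burn totalGas.
func balancedCallCost(measuredCost, gasPerIterBase, popLoopGas,
	nsCallLoop, nsPopLoop, totalGas float64) float64 {
	itersCall := totalGas / (gasPerIterBase + measuredCost)
	itersPop := totalGas / popLoopGas
	perIterCall := nsCallLoop / itersCall // ns per call-loop iteration
	perIterPop := nsPopLoop / itersPop    // ns per pop-loop iteration
	// Balance condition: (base + c)/perIterCall == popLoopGas/perIterPop.
	return popLoopGas*perIterCall/perIterPop - gasPerIterBase
}

func main() {
	// Sanity check: if both loops take equal wall time at the measured
	// cost, that cost is already balanced and comes back unchanged.
	fmt.Println(balancedCallCost(40, 46, 41, 1e9, 1e9, 100e6)) // prints 40
}
```

Plugging in the measured ns/op pairs from the benchmarks above would give the implied balanced cost for each run.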

holiman added 4 commits June 11, 2020 14:03
name                                  old time/op    new time/op    delta
SimpleLoop/identity-precompile-10M-6     117ms ±75%      43ms ± 1%  -63.09%  (p=0.008 n=5+5)
SimpleLoop/loop-10M-6                   79.6ms ± 4%    70.5ms ± 1%  -11.42%  (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
SimpleLoop/identity-precompile-10M-6    24.4MB ± 0%     4.9MB ± 0%  -79.94%  (p=0.008 n=5+5)
SimpleLoop/loop-10M-6                   13.2kB ± 0%    13.2kB ± 0%     ~     (p=0.357 n=5+5)

name                                  old allocs/op  new allocs/op  delta
SimpleLoop/identity-precompile-10M-6      382k ± 0%      153k ± 0%  -59.99%  (p=0.000 n=5+4)
SimpleLoop/loop-10M-6                     40.0 ± 0%      40.0 ± 0%     ~     (all equal)

name                                  old time/op    new time/op    delta
SimpleLoop/identity-precompile-10M-6    43.3ms ± 1%    42.4ms ± 7%     ~     (p=0.151 n=5+5)
SimpleLoop/loop-10M-6                   70.5ms ± 1%    76.7ms ± 1%   +8.67%  (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
SimpleLoop/identity-precompile-10M-6    4.90MB ± 0%    2.46MB ± 0%  -49.83%  (p=0.008 n=5+5)
SimpleLoop/loop-10M-6                   13.2kB ± 0%    13.2kB ± 1%     ~     (p=0.571 n=5+5)

name                                  old allocs/op  new allocs/op  delta
SimpleLoop/identity-precompile-10M-6      153k ± 0%       76k ± 0%  -49.98%  (p=0.029 n=4+4)
SimpleLoop/loop-10M-6                     40.0 ± 0%      40.0 ± 0%     ~     (all equal)

name                                  old time/op    new time/op    delta
SimpleLoop/identity-precompile-10M-6    42.4ms ± 7%    37.5ms ± 6%  -11.68%  (p=0.008 n=5+5)
SimpleLoop/loop-10M-6                   76.7ms ± 1%    69.1ms ± 1%   -9.82%  (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
SimpleLoop/identity-precompile-10M-6    2.46MB ± 0%    0.02MB ± 0%  -99.35%  (p=0.008 n=5+5)
SimpleLoop/loop-10M-6                   13.2kB ± 1%    13.2kB ± 0%     ~     (p=0.143 n=5+5)

name                                  old allocs/op  new allocs/op  delta
SimpleLoop/identity-precompile-10M-6     76.4k ± 0%      0.1k ± 0%     ~     (p=0.079 n=4+5)
SimpleLoop/loop-10M-6                     40.0 ± 0%      40.0 ± 0%     ~     (all equal)
@holiman
Contributor Author

holiman commented Jun 11, 2020

I have pushed some changes now, which work their way around the alloc problems. With those in place, it appears that 40 is a pretty solid value:

[user@work runtime]$ go test . -run - -bench BenchmarkSimpleLoop 
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/core/vm/runtime
BenchmarkSimpleLoop/identity-precompile-100M-6                 2         664909701 ns/op           16828 B/op         63 allocs/op
BenchmarkSimpleLoop/loop-100M-6                                2         676632829 ns/op           14044 B/op         42 allocs/op
BenchmarkSimpleLoop/identity-precompile-10M-6                 18          66192889 ns/op           16006 B/op         61 allocs/op
BenchmarkSimpleLoop/loop-10M-6                                16          68658524 ns/op           13235 B/op         40 allocs/op

@holiman
Contributor Author

holiman commented Jun 11, 2020

Note, though, that the pushed changes are not complete. They need to be applied to the other call variants too, so at this point they're not quite ready to merge to master.

to = AccountRef(addr)
snapshot = evm.StateDB.Snapshot()
)
snapshot := evm.StateDB.Snapshot()


As the snapshot is not used in RunPrecompiledContract below, does it affect the performance?

@holiman
Contributor Author

holiman commented Jun 11, 2020 via email

@shamatar

shamatar commented Jun 11, 2020

Didn't dig deeper, but for similar tests in OpenEthereum:

crunch through identity for 100M gas                                                                           
                        time:   [151.31 ms 153.17 ms 154.75 ms]
crunch through identity for 100M gas with 40 gas per staticcall                                                                          
                        time:   [1.5607 s 1.6060 s 1.6629 s]
crunch through dummy loop with pops instead of staticcall for 100M gas                        
                        time:   [1.8245 s 1.8524 s 1.8747 s]

Now I need to understand how to interpret them, but roughly it looks like it may be below 40 gas for OpenEthereum.

Important: identity precompile is not modified!

@holiman
Contributor Author

holiman commented Jun 12, 2020 via email

@holiman
Contributor Author

holiman commented Jun 12, 2020 via email

@holiman
Contributor Author

holiman commented Jun 12, 2020 via email

@shamatar

shamatar commented Jun 12, 2020

The code is here. I didn't test other values because you can optimistically get 40 gas per STATICCALL already implemented if you use the "berlin" schedule.

The way the benchmark is written, the VM is created before the measurement, and the benchmarking framework takes care of averaging and filtering outliers.

@tkstanczak

@holiman I think that @shamatar was right here:
your benchmark would give different results if you executed it for 1 billion gas, for example.

I think the better approach would be to set a counter to X and loop that many times, then compare the timings -- this would give you the actual gas cost of the static call.

You would need to set the gas cost of the static call to the number that makes these two benchmarks equal, and then subtract 12 (the cost of 6 POPs) from that value.

@tkstanczak

by the way - the number that I am getting is something like 350 gas for a static call now, but I need to dig a bit deeper

@tkstanczak

[image]

@tkstanczak

looking at the allocs on our side to bring them down - it seems that our code invokes the empty-account zero-transfer touch logic on precompile calls, which slows it down (and a few other things)

@holiman
Contributor Author

holiman commented Jun 12, 2020 via email

@shamatar

To avoid any doubt about what the benchmark measures, can we just place code that runs through the loop 1M times, and measure how long it takes to run, irrespective of the gas consumed? Then we can place POPs to make the running times equal and price STATICCALL that way.

@shamatar

shamatar commented Jun 12, 2020

The benchmark as it's written right now will stop after either N iterations with staticcall or M iterations with POPs.

If the prices are proper, then N * cost = M * cost_1 = time * GAS/second, so the invariants should be preserved, and you can calculate the costs if you can read N or M somehow.
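
That invariant can be sketched numerically (a toy model with assumed figures, not benchmark code): if every opcode's price reflects its execution time, i.e. ns per iteration is proportional to gas per iteration, both loops exhaust the gas budget G in the same wall time even though their iteration counts N and M differ.

```go
package main

import "fmt"

// loopTimeNs models one loop: it runs gasBudget/gasPerIter iterations
// (N or M), each taking nsPerIter nanoseconds.
func loopTimeNs(gasBudget, gasPerIter uint64, nsPerIter float64) float64 {
	iters := gasBudget / gasPerIter
	return float64(iters) * nsPerIter
}

func main() {
	const nsPerGas = 10.0 // assumed calibration: 10 ns of work per unit of gas
	// Assumed gas/iteration figures: 86 for the call loop, 41 for the pop
	// loop. With proportional pricing, both burn 100M gas in ~1 second.
	fmt.Println(loopTimeNs(100_000_000, 86, 86*nsPerGas))
	fmt.Println(loopTimeNs(100_000_000, 41, 41*nsPerGas))
}
```

If one loop's per-iteration time deviates from proportionality, its wall time for the same budget deviates by the same factor, which is exactly the imbalance the benchmarks measure.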

@holiman
Contributor Author

holiman commented Jun 12, 2020 via email

@shamatar

The dummy cycle can be made out of many opcodes, but if the most popular opcodes in the EVM (as it seems to me) are not well priced, then that's kind of a separate, large problem.

@holiman
Contributor Author

holiman commented Jun 12, 2020

Here's with 10 billion, 100 million and 10 million gas. I'd say it's pretty consistent:

BenchmarkSimpleLoop/staticcall-identity-10B
BenchmarkSimpleLoop/staticcall-identity-10B-6         	       1	89446314582 ns/op	   17752 B/op	      66 allocs/op
BenchmarkSimpleLoop/loop-10B
BenchmarkSimpleLoop/loop-10B-6                        	       1	72261810479 ns/op	   14968 B/op	      45 allocs/op
BenchmarkSimpleLoop/staticcall-identity-100M
BenchmarkSimpleLoop/staticcall-identity-100M-6       	       2	 704888916 ns/op	   16828 B/op	      63 allocs/op
BenchmarkSimpleLoop/loop-100M
BenchmarkSimpleLoop/loop-100M-6                      	       2	 966903886 ns/op	   13508 B/op	      41 allocs/op
BenchmarkSimpleLoop/staticcall-identity-10M
BenchmarkSimpleLoop/staticcall-identity-10M-6        	      16	  69748344 ns/op	   16019 B/op	      61 allocs/op
BenchmarkSimpleLoop/loop-10M
BenchmarkSimpleLoop/loop-10M-6                       	      15	  73673448 ns/op	   13314 B/op	      40 allocs/op

@shamatar

The inversion between 10B and 100M is interesting.

@tkstanczak

so in Nethermind we are at around 150 gas with the current code

we can go to 41 gas with some significant changes (choosing a different path for precompile calls)
this also assumes some logic that avoids the account-touch handling path

@holiman are you reporting the account touch event in your benchmark?

@tkstanczak

also, I think we picked a test where no memory operations are executed - because the loop with STATICCALL passes a zero-length memory access and a zero-length result write.
These operations don't come for free, so it may be worth also verifying a scenario where result copying happens (unless this is covered by the precompile base cost?)

@shamatar

How does EVM handle memory extensions and on what point?

@tkstanczak

How does EVM handle memory extensions and on what point?

memory extension is not that important (it is priced separately)
but if you are writing into already-available memory then you have some minor cost for copying the data
but I believe it is better to price that in the dynamic part of the precompile

@axic
Member

axic commented Jun 12, 2020

How does EVM handle memory extensions and on what point?

Every single opcode which handles memory needs to check whether the memory offset + length exceeds the current memory size; if it does, the expansion needs to be charged for.
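
The check axic describes can be sketched with the Yellow Paper memory cost formula C_mem(a) = 3a + a²/512, where a is the memory size in 32-byte words; an opcode touching offset+length beyond the current size pays the difference between the new and old total cost (a sketch of the rule, not any client's actual code):

```go
package main

import "fmt"

// memGas is the total cost of holding `words` 32-byte words of memory:
// 3 gas per word plus a quadratic term words^2/512.
func memGas(words uint64) uint64 {
	return 3*words + words*words/512
}

// expansionCost charges only the delta when memory grows.
func expansionCost(curWords, newWords uint64) uint64 {
	if newWords <= curWords {
		return 0 // access fits in current memory: no expansion charge
	}
	return memGas(newWords) - memGas(curWords)
}

func main() {
	fmt.Println(expansionCost(0, 1))   // first word: 3 gas
	fmt.Println(expansionCost(0, 32))  // 1 KiB: 3*32 + 1024/512 = 98 gas
	fmt.Println(expansionCost(32, 32)) // no growth: 0 gas
}
```

In the benchmarked loops all offsets and lengths are zero, so this check always short-circuits and no expansion is ever charged.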

@tkstanczak

Here's with 10 billion, 100 million and 10 million gas. I'd say it's pretty consistent:

BenchmarkSimpleLoop/staticcall-identity-10B
BenchmarkSimpleLoop/staticcall-identity-10B-6         	       1	89446314582 ns/op	   17752 B/op	      66 allocs/op
BenchmarkSimpleLoop/loop-10B
BenchmarkSimpleLoop/loop-10B-6                        	       1	72261810479 ns/op	   14968 B/op	      45 allocs/op
BenchmarkSimpleLoop/staticcall-identity-100M
BenchmarkSimpleLoop/staticcall-identity-100M-6       	       2	 704888916 ns/op	   16828 B/op	      63 allocs/op
BenchmarkSimpleLoop/loop-100M
BenchmarkSimpleLoop/loop-100M-6                      	       2	 966903886 ns/op	   13508 B/op	      41 allocs/op
BenchmarkSimpleLoop/staticcall-identity-10M
BenchmarkSimpleLoop/staticcall-identity-10M-6        	      16	  69748344 ns/op	   16019 B/op	      61 allocs/op
BenchmarkSimpleLoop/loop-10M
BenchmarkSimpleLoop/loop-10M-6                       	      15	  73673448 ns/op	   13314 B/op	      40 allocs/op

yeah, I think the inversion of results between 10B and 100M may be due to the problem that we are trying to explain:
simply, the changed number of iterations makes the comparison non-linear

@shamatar

In the benchmarks I've made in the repo, there is one extra memory copy at the end.

@tkstanczak

tkstanczak commented Jun 12, 2020

    byte[] bytecode2 = Prepare.EvmCode
        .Op(Instruction.JUMPDEST)
        .PushData(0)
        .Op(Instruction.DUP1)
        .Op(Instruction.DUP1)
        .Op(Instruction.DUP1)
        .PushData(4)
        .Op(Instruction.GAS)
        .Op(Instruction.POP)
        .Op(Instruction.POP)
        .Op(Instruction.POP)
        .Op(Instruction.POP)
        .Op(Instruction.POP)
        .Op(Instruction.POP)
        
        .PushData(0)
        .Op(Instruction.POP)
        .PushData(0)
        .Op(Instruction.POP)
        .PushData(0)
        .Op(Instruction.POP)
        .PushData(0)
        .Op(Instruction.POP)
        .PushData(0)
        .Op(Instruction.POP)
        .PushData(0)
        .Op(Instruction.POP)
        
        .PushData(0)
        .Op(Instruction.JUMP)
        .Done;

one can fill it with PUSH-POP pairs - 6 of them is 30 gas on top of the 12 from the 6 POPs
then you can compare the loops and it should be more consistent with 40 gas, whichever gas limit you choose
it is less important here that PUSH and POP are priced well - what matters is that the number of iterations will be the same
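
tkstanczak's padding idea can be tallied as follows (a sketch using an assumed opcode-cost breakdown from the Istanbul schedule, ignoring the precompile's own internal base cost): pad the no-call loop with PUSH1/POP pairs (3+2 gas each) until its gas per iteration matches the call loop's, so both loops run the same number of iterations for any gas limit.

```go
package main

import "fmt"

// Assumed per-iteration opcode cost tally (not a quoted figure from the
// thread): JUMPDEST=1, PUSH1=3, DUP1=3, GAS=2, POP=2, JUMP=8.
const (
	prefixGas = 1 + 3 + 3 + 3 + 3 + 3 + 2 // JUMPDEST, PUSH1, 3x DUP1, PUSH1, GAS
	tailGas   = 3 + 8                     // PUSH1 <jumpdest> + JUMP
)

// callLoopGas: STATICCALL at the given cost, plus the POP of its result.
func callLoopGas(staticcallCost uint64) uint64 {
	return prefixGas + staticcallCost + 2 + tailGas
}

// paddedPopLoopGas: six POPs for the call args, plus six PUSH1/POP pairs
// as padding (6*5 = 30 gas on top of the 12 gas of POPs).
func paddedPopLoopGas() uint64 {
	return prefixGas + 6*2 + 6*(3+2) + tailGas
}

func main() {
	// With STATICCALL at 40, both loops cost the same per iteration,
	// so they run equal iteration counts for any gas limit.
	fmt.Println(callLoopGas(40), paddedPopLoopGas()) // 71 71
}
```

The six pairs (30 gas) plus the six POPs (12 gas) exactly offset STATICCALL (40) plus the result POP (2), which is why the comparison becomes iteration-for-iteration fair.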

@fadhlan1986

fadhlan1986 commented Jun 13, 2020 via email

@shamatar

shamatar commented Jun 13, 2020

Well, we first have to decide if the current pricing of POPs and PUSHes is proper (whether it reflects the time spent executing them). If it is, then we should definitely set a loop size instead of a gas limit and look at the execution time. If not (POP and PUSH prices are off and no one cares), then Martin's approach to benchmarking is the only way, but then we have a bigger problem anyway :)

@tkstanczak

Well, we first have to decide if the current pricing of POPs and PUSHes is proper (whether it reflects the time spent executing them). If it is, then we should definitely set a loop size instead of a gas limit and look at the execution time. If not (POP and PUSH prices are off and no one cares), then Martin's approach to benchmarking is the only way, but then we have a bigger problem anyway :)

not really - even with mispriced PUSH and POP you get at least the same number of loops when you add them to balance the chosen cost of STATICCALL

@shamatar

If their price does not reflect the time, we cannot draw the conclusion "equal running time for N loop iterations => equal pricing".

@holiman holiman closed this Jul 20, 2020