
rlp, trie: faster trie node encoding #24126

Merged: 25 commits into ethereum:master on Mar 9, 2022

Conversation

@qianbin (Contributor) commented Dec 17, 2021

The rlp package is generally efficient, but encoding values of interface type is expensive, and trie nodes are nested interface types.
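
For context, the node types look roughly like this (a simplified sketch of trie/node.go; flags fields and methods are omitted). Every child sits behind the node interface, so the reflection-based rlp.Encode has to inspect each child's dynamic type while encoding:

```go
// Simplified trie node types. The nesting of interface values is what makes
// generic, reflection-driven RLP encoding comparatively expensive here.
type node interface{}

type (
	fullNode struct {
		Children [17]node // 16 nibble branches + value slot, each an interface value
	}
	shortNode struct {
		Key []byte
		Val node // child or value, again behind the interface
	}
	hashNode  []byte // collapsed child: its 32-byte hash
	valueNode []byte // leaf payload
)
```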

@qianbin (Contributor, Author) commented Dec 17, 2021

rlp.Encode

goos: darwin
goarch: arm64
pkg: github.com/ethereum/go-ethereum/trie
BenchmarkEncodeFullNode
BenchmarkEncodeFullNode-8   	 815104	      1461 ns/op	     288 B/op	      1 allocs/op
PASS
ok  	github.com/ethereum/go-ethereum/trie	2.144s

frlp.Encode (the fast encoder added in this PR)

goos: darwin
goarch: arm64
pkg: github.com/ethereum/go-ethereum/trie
BenchmarkFastEncodeFullNode
BenchmarkFastEncodeFullNode-8   4550047	     244.0 ns/op	     0 B/op	     0 allocs/op
PASS
ok  	github.com/ethereum/go-ethereum/trie	2.058s

@qianbin changed the title from "WIP: fast & zero alloc trie node encoding" to "WIP: fast & zero-alloc trie node encoding" on Dec 17, 2021
@qianbin changed the title from "WIP: fast & zero-alloc trie node encoding" to "fast & zero-alloc trie node encoding" on Dec 18, 2021
@karalabe (Member) commented Jan 6, 2022

This work may or may not overlap with the generated encoders from @fjl. Could you PTAL, Felix?

@fjl (Contributor) commented Jan 11, 2022

I think this PR is valuable and we should check how it impacts performance of types.DeriveSha. This requires integrating the fast encoder into StackTrie.
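
For reference, DeriveSha feeds a list (for example, a block's transactions) into a StackTrie, which is why StackTrie encoding speed matters for block processing. A minimal usage sketch (call sites in core/ look roughly like this; error handling omitted):

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/trie"
)

// txRoot computes a transactions root the same way block processing does:
// every list element is RLP-encoded and inserted into a StackTrie.
func txRoot(txs types.Transactions) common.Hash {
	return types.DeriveSha(txs, trie.NewStackTrie(nil))
}

func main() {
	fmt.Println(txRoot(nil)) // empty list -> empty trie root
}
```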

@qianbin (Contributor, Author) commented Jan 11, 2022

@fjl I just pushed the StackTrie changes.

@holiman (Contributor) commented Jan 12, 2022

There are some definite improvements to be had through this approach:

name                      old time/op    new time/op    delta
Prove-6                     1.00ms ±27%    0.68ms ±11%  -31.47%  (p=0.008 n=5+5)
VerifyProof-6               26.4µs ±28%    22.2µs ±10%     ~     (p=0.095 n=5+5)
VerifyRangeProof10-6         253µs ±31%     224µs ±23%     ~     (p=0.690 n=5+5)
VerifyRangeProof100-6       1.34ms ±46%    1.20ms ±13%     ~     (p=0.548 n=5+5)
VerifyRangeProof1000-6      12.3ms ±32%     9.5ms ±13%  -23.01%  (p=0.032 n=5+5)
VerifyRangeProof5000-6      44.1ms ±31%    41.1ms ±17%     ~     (p=0.690 n=5+5)
VerifyRangeNoProof10-6       601µs ± 3%     621µs ±28%     ~     (p=0.841 n=5+5)
VerifyRangeNoProof500-6     2.76ms ±26%    2.20ms ±22%     ~     (p=0.056 n=5+5)
VerifyRangeNoProof1000-6    5.38ms ± 9%    4.33ms ± 6%  -19.50%  (p=0.008 n=5+5)

name                      old alloc/op   new alloc/op   delta
Prove-6                      144kB ±16%     121kB ±10%  -15.86%  (p=0.016 n=5+5)
VerifyProof-6               5.02kB ± 3%    5.07kB ± 6%     ~     (p=1.000 n=5+5)
VerifyRangeProof10-6        41.2kB ± 8%    41.2kB ± 0%     ~     (p=0.730 n=5+4)
VerifyRangeProof100-6        267kB ±16%     276kB ±14%     ~     (p=1.000 n=5+5)
VerifyRangeProof1000-6      2.12MB ± 1%    1.99MB ± 2%   -6.40%  (p=0.008 n=5+5)
VerifyRangeProof5000-6      10.1MB ± 0%     9.5MB ± 1%   -5.78%  (p=0.008 n=5+5)
VerifyRangeNoProof10-6      56.3kB ± 3%    56.6kB ± 2%     ~     (p=0.841 n=5+5)
VerifyRangeNoProof500-6      201kB ± 1%     199kB ± 2%     ~     (p=0.310 n=5+5)
VerifyRangeNoProof1000-6     370kB ± 1%     368kB ± 1%     ~     (p=0.421 n=5+5)

name                      old allocs/op  new allocs/op  delta
Prove-6                      1.98k ±16%     1.84k ±10%     ~     (p=0.056 n=5+5)
VerifyProof-6                  103 ± 2%       104 ± 5%     ~     (p=0.952 n=5+5)
VerifyRangeProof10-6           428 ± 5%       436 ± 5%     ~     (p=0.952 n=5+5)
VerifyRangeProof100-6        1.96k ± 8%     1.99k ± 6%     ~     (p=0.841 n=5+5)
VerifyRangeProof1000-6       16.6k ± 0%     16.1k ± 1%   -3.33%  (p=0.008 n=5+5)
VerifyRangeProof5000-6       80.9k ± 0%     78.7k ± 1%   -2.74%  (p=0.008 n=5+5)
VerifyRangeNoProof10-6       1.19k ± 1%     1.19k ± 1%     ~     (p=0.952 n=5+5)
VerifyRangeNoProof500-6      3.73k ± 0%     3.71k ± 1%     ~     (p=0.095 n=5+5)
VerifyRangeNoProof1000-6     6.80k ± 1%     6.78k ± 0%     ~     (p=0.421 n=5+5)

It doesn't seem to affect DeriveSha, though:

name                       old time/op    new time/op    delta
DeriveSha200/std_trie-6      1.18ms ±22%    1.21ms ±52%    ~     (p=0.841 n=5+5)
DeriveSha200/stack_trie-6     681µs ±39%     675µs ±10%    ~     (p=0.690 n=5+5)

name                       old alloc/op   new alloc/op   delta
DeriveSha200/std_trie-6       271kB ± 0%     266kB ± 0%  -1.73%  (p=0.008 n=5+5)
DeriveSha200/stack_trie-6    55.1kB ± 0%    55.1kB ± 0%    ~     (p=0.968 n=5+5)

name                       old allocs/op  new allocs/op  delta
DeriveSha200/std_trie-6       2.69k ± 0%     2.68k ± 0%  -0.59%  (p=0.008 n=5+5)
DeriveSha200/stack_trie-6     1.25k ± 0%     1.25k ± 0%    ~     (all equal)

@qianbin (Contributor, Author) commented Jan 12, 2022

The fast encoder is especially efficient at encoding full nodes that are packed with children. The DeriveSha200 case produces 200 short nodes but only 16 full nodes, so it's reasonable that it has little effect on that benchmark.

@fjl (Contributor) commented Jan 19, 2022

Hey, this is just to let you know I haven't forgotten about this PR!

I have just published another PR that exposes the encoder buffer in a slightly different way: #24251. The LowEncoder you added here could be replaced with the EncoderBuffer from my PR once it is merged. I will take care of that.
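
To illustrate what that would look like, a node-level encode method built on rlp.EncoderBuffer could be sketched as below (assuming the EncoderBuffer API from #24251: List, ListEnd, Write; the encode method on the node interface is the design this PR introduces, but this is a sketch rather than the exact final code):

```go
// Assumed interface: every trie node writes itself into the caller-managed buffer.
type node interface {
	encode(w rlp.EncoderBuffer)
}

type fullNode struct{ Children [17]node }

// Sketch of a reflection-free full-node encoder: List/ListEnd bracket the
// 17-element RLP list and each present child encodes itself recursively.
func (n *fullNode) encode(w rlp.EncoderBuffer) {
	offset := w.List()
	for _, c := range n.Children {
		if c != nil {
			c.encode(w)
		} else {
			w.Write([]byte{0x80}) // RLP encoding of an empty string for a nil child
		}
	}
	w.ListEnd(offset)
}
```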

@fjl changed the title from "fast & zero-alloc trie node encoding" to "trie: faster trie node encoding" on Jan 19, 2022
@qianbin (Contributor, Author) commented Jan 19, 2022

Okay, I'll update my PR once that's merged.

@qianbin (Contributor, Author) commented Mar 1, 2022

@fjl I just pushed a new commit using rlp.EncoderBuffer.

@fjl changed the title from "trie: faster trie node encoding" to "rlp, trie: faster trie node encoding" on Mar 8, 2022
@fjl (Contributor) commented Mar 8, 2022

This is a pretty solid performance improvement now:

name                       old time/op    new time/op    delta
DeriveSha200/std_trie-8       252µs ± 0%     228µs ± 0%   -9.32%  (p=0.000 n=10+10)
DeriveSha200/stack_trie-8     248µs ± 0%     207µs ± 0%  -16.60%  (p=0.000 n=10+10)

name                       old alloc/op   new alloc/op   delta
DeriveSha200/std_trie-8       270kB ± 0%     265kB ± 0%   -1.72%  (p=0.000 n=10+9)
DeriveSha200/stack_trie-8    55.1kB ± 0%    41.0kB ± 0%  -25.70%  (p=0.000 n=10+10)

name                       old allocs/op  new allocs/op  delta
DeriveSha200/std_trie-8       2.66k ± 0%     2.65k ± 0%   -0.56%  (p=0.000 n=10+10)
DeriveSha200/stack_trie-8     1.25k ± 0%     1.04k ± 0%  -17.15%  (p=0.000 n=10+10)

name                      old time/op    new time/op    delta
Prove-8                      305µs ± 2%     155µs ±14%  -49.15%  (p=0.000 n=8+10)
VerifyProof-8               4.57µs ± 6%    4.52µs ± 7%     ~     (p=0.754 n=10+10)
VerifyRangeProof10-8        46.3µs ± 4%    34.0µs ± 2%  -26.49%  (p=0.000 n=8+10)
VerifyRangeProof100-8        207µs ±10%     160µs ± 1%  -22.82%  (p=0.000 n=10+10)
VerifyRangeProof1000-8      1.75ms ± 3%    1.32ms ± 6%  -24.15%  (p=0.000 n=10+9)
VerifyRangeProof5000-8      5.93ms ± 1%    5.01ms ± 1%  -15.55%  (p=0.000 n=10+10)
VerifyRangeNoProof10-8       206µs ± 4%     126µs ± 3%  -38.61%  (p=0.000 n=10+10)
VerifyRangeNoProof500-8      848µs ± 2%     517µs ± 2%  -39.05%  (p=0.000 n=10+10)
VerifyRangeNoProof1000-8    1.58ms ± 1%    0.99ms ± 1%  -37.25%  (p=0.000 n=10+9)

name                      old alloc/op   new alloc/op   delta
Prove-8                      132kB ± 3%     123kB ±19%   -6.88%  (p=0.043 n=8+10)
VerifyProof-8               5.02kB ± 3%    5.01kB ± 6%     ~     (p=0.912 n=10+10)
VerifyRangeProof10-8        42.5kB ± 0%    40.6kB ± 1%   -4.46%  (p=0.000 n=7+10)
VerifyRangeProof100-8        275kB ±15%     265kB ± 0%   -3.78%  (p=0.003 n=10+8)
VerifyRangeProof1000-8      2.07MB ± 1%    1.96MB ± 1%   -5.31%  (p=0.000 n=9+7)
VerifyRangeProof5000-8      10.0MB ± 0%     9.4MB ± 0%   -5.67%  (p=0.000 n=10+10)
VerifyRangeNoProof10-8      56.4kB ± 4%    34.3kB ± 4%  -39.15%  (p=0.000 n=10+10)
VerifyRangeNoProof500-8      202kB ± 3%     114kB ± 1%  -43.82%  (p=0.000 n=10+10)
VerifyRangeNoProof1000-8     368kB ± 2%     211kB ± 1%  -42.62%  (p=0.000 n=10+9)

name                      old allocs/op  new allocs/op  delta
Prove-8                      1.83k ± 1%     1.88k ±19%     ~     (p=0.395 n=8+10)
VerifyProof-8                  103 ± 3%       103 ± 5%     ~     (p=0.927 n=10+10)
VerifyRangeProof10-8           431 ± 1%       423 ± 2%   -2.05%  (p=0.001 n=7+10)
VerifyRangeProof100-8        1.92k ± 7%     1.89k ± 0%   -1.99%  (p=0.003 n=10+8)
VerifyRangeProof1000-8       16.0k ± 1%     15.7k ± 1%   -2.35%  (p=0.000 n=10+10)
VerifyRangeProof5000-8       78.8k ± 0%     76.9k ± 0%   -2.52%  (p=0.000 n=10+10)
VerifyRangeNoProof10-8       1.19k ± 2%     0.94k ± 1%  -21.47%  (p=0.000 n=10+9)
VerifyRangeNoProof500-8      3.74k ± 1%     2.91k ± 1%  -22.24%  (p=0.000 n=10+10)
VerifyRangeNoProof1000-8     6.77k ± 1%     5.31k ± 1%  -21.64%  (p=0.000 n=10+10)

@fjl requested a review from holiman on March 8, 2022 at 21:26
-   if err := n.EncodeRLP(&h.tmp); err != nil {
-       panic("encode error: " + err.Error())
-   }
+   n.encode(h.encbuf)
Contributor
Looks like shortNodeToHash and fullNodeToHash are identical now, so they can maybe be merged into one?

Contributor
It's a bit more efficient not to merge them because encode can be called directly instead of through the interface. But I agree that it's silly.
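
For illustration, after this change the short-node helper reads roughly like this (a sketch; encodedBytes and hashData stand in for the hasher's buffer-draining and keccak helpers and may not match the exact names in the PR):

```go
// Sketch of shortNodeToHash: the node writes itself into the hasher's
// EncoderBuffer via a direct method call on *shortNode (no interface
// dispatch), and small nodes stay embedded in their parent.
func (h *hasher) shortNodeToHash(n *shortNode, force bool) node {
	n.encode(h.encbuf)
	enc := h.encodedBytes() // hypothetical helper: take the encoded bytes and reset the buffer

	if len(enc) < 32 && !force {
		return n // nodes shorter than 32 bytes are inlined in their parent
	}
	return h.hashData(enc) // hypothetical helper: keccak256 the encoding into a hashNode
}
```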

Review thread on trie/stacktrie.go (resolved)
@holiman (Contributor) left a comment

LGTM

@fjl added this to the 1.10.17 milestone on Mar 9, 2022
@fjl merged commit 65ed1a6 into ethereum:master on Mar 9, 2022
sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Mar 9, 2022
This change speeds up trie hashing and all other activities that require
RLP encoding of trie nodes by approximately 20%. The speedup is achieved by
avoiding reflection overhead during node encoding.

The interface type trie.node now contains a method 'encode' that works with
rlp.EncoderBuffer. Management of EncoderBuffers is left to calling code.
trie.hasher, which is pooled to avoid allocations, now maintains an
EncoderBuffer. This means memory resources related to trie node encoding
are tied to the hasher pool.

Co-authored-by: Felix Lange <fjl@twurst.com>
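
A minimal sketch of the "tied to the hasher pool" part described above (field and function names are illustrative; the real trie/hasher.go carries additional hashing state):

```go
import (
	"sync"

	"github.com/ethereum/go-ethereum/rlp"
)

// Each pooled hasher owns one rlp.EncoderBuffer, so the encoder's scratch
// memory is recycled together with the hasher itself.
type hasher struct {
	encbuf rlp.EncoderBuffer
	// ... keccak state and other fields omitted
}

var hasherPool = sync.Pool{
	New: func() interface{} {
		return &hasher{encbuf: rlp.NewEncoderBuffer(nil)}
	},
}

func newHasher() *hasher     { return hasherPool.Get().(*hasher) }
func returnHasher(h *hasher) { hasherPool.Put(h) }
```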
JacekGlen pushed a commit to JacekGlen/go-ethereum that referenced this pull request May 26, 2022
maoueh pushed a commit to streamingfast/go-ethereum that referenced this pull request Sep 30, 2022
cp-wjhan pushed a commit to cp-yoonjin/go-wemix that referenced this pull request Nov 16, 2022