
rlp, trie: faster trie node encoding #24126

Merged: 25 commits into ethereum:master on Mar 9, 2022

Conversation

@qianbin (Contributor) commented Dec 17, 2021

The rlp package is generally efficient, but encoding values of interface type is expensive, and trie nodes are nested interface types.
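
For context, the node types look roughly like this (a simplified sketch of trie/node.go; flags fields and methods are omitted). Every child sits behind the node interface, so the reflection-based rlp.Encode has to inspect each child's dynamic type while encoding:

```go
// Simplified trie node types. The nesting of interface values is what makes
// generic, reflection-driven RLP encoding comparatively expensive here.
type node interface{}

type (
	fullNode struct {
		Children [17]node // 16 nibble branches + value slot, each an interface value
	}
	shortNode struct {
		Key []byte
		Val node // child or value, again behind the interface
	}
	hashNode  []byte // collapsed child: its 32-byte hash
	valueNode []byte // leaf payload
)
```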

@qianbin (Contributor, Author) commented Dec 17, 2021

rlp.Encode

goos: darwin
goarch: arm64
pkg: github.com/ethereum/go-ethereum/trie
BenchmarkEncodeFullNode
BenchmarkEncodeFullNode-8   	 815104	      1461 ns/op	     288 B/op	      1 allocs/op
PASS
ok  	github.com/ethereum/go-ethereum/trie	2.144s

frlp.Encode (the fast encoder added in this PR)

goos: darwin
goarch: arm64
pkg: github.com/ethereum/go-ethereum/trie
BenchmarkFastEncodeFullNode
BenchmarkFastEncodeFullNode-8   4550047	     244.0 ns/op	     0 B/op	     0 allocs/op
PASS
ok  	github.com/ethereum/go-ethereum/trie	2.058s

@qianbin changed the title from "WIP: fast & zero alloc trie node encoding" to "WIP: fast & zero-alloc trie node encoding" on Dec 17, 2021
@qianbin changed the title from "WIP: fast & zero-alloc trie node encoding" to "fast & zero-alloc trie node encoding" on Dec 18, 2021
@karalabe (Member) commented Jan 6, 2022

This work may or may not overlap with the generated encoders from @fjl. Could you PTAL, Felix?

@fjl (Contributor) commented Jan 11, 2022

I think this PR is valuable and we should check how it impacts performance of types.DeriveSha. This requires integrating the fast encoder into StackTrie.
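
For reference, DeriveSha feeds a list (for example, a block's transactions) into a StackTrie, which is why StackTrie encoding speed matters for block processing. A minimal usage sketch (call sites in core/ look roughly like this; error handling omitted):

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/trie"
)

// txRoot computes a transactions root the same way block processing does:
// every list element is RLP-encoded and inserted into a StackTrie.
func txRoot(txs types.Transactions) common.Hash {
	return types.DeriveSha(txs, trie.NewStackTrie(nil))
}

func main() {
	fmt.Println(txRoot(nil)) // empty list -> empty trie root
}
```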

@qianbin (Contributor, Author) commented Jan 11, 2022

@fjl I just pushed the StackTrie changes.

@holiman (Contributor) commented Jan 12, 2022

There are some definite improvements to be had through this approach:

name                      old time/op    new time/op    delta
Prove-6                     1.00ms ±27%    0.68ms ±11%  -31.47%  (p=0.008 n=5+5)
VerifyProof-6               26.4µs ±28%    22.2µs ±10%     ~     (p=0.095 n=5+5)
VerifyRangeProof10-6         253µs ±31%     224µs ±23%     ~     (p=0.690 n=5+5)
VerifyRangeProof100-6       1.34ms ±46%    1.20ms ±13%     ~     (p=0.548 n=5+5)
VerifyRangeProof1000-6      12.3ms ±32%     9.5ms ±13%  -23.01%  (p=0.032 n=5+5)
VerifyRangeProof5000-6      44.1ms ±31%    41.1ms ±17%     ~     (p=0.690 n=5+5)
VerifyRangeNoProof10-6       601µs ± 3%     621µs ±28%     ~     (p=0.841 n=5+5)
VerifyRangeNoProof500-6     2.76ms ±26%    2.20ms ±22%     ~     (p=0.056 n=5+5)
VerifyRangeNoProof1000-6    5.38ms ± 9%    4.33ms ± 6%  -19.50%  (p=0.008 n=5+5)

name                      old alloc/op   new alloc/op   delta
Prove-6                      144kB ±16%     121kB ±10%  -15.86%  (p=0.016 n=5+5)
VerifyProof-6               5.02kB ± 3%    5.07kB ± 6%     ~     (p=1.000 n=5+5)
VerifyRangeProof10-6        41.2kB ± 8%    41.2kB ± 0%     ~     (p=0.730 n=5+4)
VerifyRangeProof100-6        267kB ±16%     276kB ±14%     ~     (p=1.000 n=5+5)
VerifyRangeProof1000-6      2.12MB ± 1%    1.99MB ± 2%   -6.40%  (p=0.008 n=5+5)
VerifyRangeProof5000-6      10.1MB ± 0%     9.5MB ± 1%   -5.78%  (p=0.008 n=5+5)
VerifyRangeNoProof10-6      56.3kB ± 3%    56.6kB ± 2%     ~     (p=0.841 n=5+5)
VerifyRangeNoProof500-6      201kB ± 1%     199kB ± 2%     ~     (p=0.310 n=5+5)
VerifyRangeNoProof1000-6     370kB ± 1%     368kB ± 1%     ~     (p=0.421 n=5+5)

name                      old allocs/op  new allocs/op  delta
Prove-6                      1.98k ±16%     1.84k ±10%     ~     (p=0.056 n=5+5)
VerifyProof-6                  103 ± 2%       104 ± 5%     ~     (p=0.952 n=5+5)
VerifyRangeProof10-6           428 ± 5%       436 ± 5%     ~     (p=0.952 n=5+5)
VerifyRangeProof100-6        1.96k ± 8%     1.99k ± 6%     ~     (p=0.841 n=5+5)
VerifyRangeProof1000-6       16.6k ± 0%     16.1k ± 1%   -3.33%  (p=0.008 n=5+5)
VerifyRangeProof5000-6       80.9k ± 0%     78.7k ± 1%   -2.74%  (p=0.008 n=5+5)
VerifyRangeNoProof10-6       1.19k ± 1%     1.19k ± 1%     ~     (p=0.952 n=5+5)
VerifyRangeNoProof500-6      3.73k ± 0%     3.71k ± 1%     ~     (p=0.095 n=5+5)
VerifyRangeNoProof1000-6     6.80k ± 1%     6.78k ± 0%     ~     (p=0.421 n=5+5)

It doesn't seem to affect DeriveSha, though:

name                       old time/op    new time/op    delta
DeriveSha200/std_trie-6      1.18ms ±22%    1.21ms ±52%    ~     (p=0.841 n=5+5)
DeriveSha200/stack_trie-6     681µs ±39%     675µs ±10%    ~     (p=0.690 n=5+5)

name                       old alloc/op   new alloc/op   delta
DeriveSha200/std_trie-6       271kB ± 0%     266kB ± 0%  -1.73%  (p=0.008 n=5+5)
DeriveSha200/stack_trie-6    55.1kB ± 0%    55.1kB ± 0%    ~     (p=0.968 n=5+5)

name                       old allocs/op  new allocs/op  delta
DeriveSha200/std_trie-6       2.69k ± 0%     2.68k ± 0%  -0.59%  (p=0.008 n=5+5)
DeriveSha200/stack_trie-6     1.25k ± 0%     1.25k ± 0%    ~     (all equal)

@qianbin (Contributor, Author) commented Jan 12, 2022

The fast encoder is especially efficient at encoding full nodes that are packed with children. The DeriveSha200 case produces 200 short nodes but only 16 full nodes, so it's reasonable that it has little effect on that benchmark.

@fjl (Contributor) commented Jan 19, 2022

Hey, this is just to let you know I haven't forgotten about this PR!

I have just published another PR that exposes the encoder buffer in a slightly different way: #24251. The LowEncoder you added here could be replaced with the EncoderBuffer from my PR once it is merged. I will take care of that.
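
To illustrate what that would look like, a node-level encode method built on rlp.EncoderBuffer could be sketched as below (assuming the EncoderBuffer API from #24251: List, ListEnd, Write; the encode method on the node interface is the design this PR introduces, but this is a sketch rather than the exact final code):

```go
// Assumed interface: every trie node writes itself into the caller-managed buffer.
type node interface {
	encode(w rlp.EncoderBuffer)
}

type fullNode struct{ Children [17]node }

// Sketch of a reflection-free full-node encoder: List/ListEnd bracket the
// 17-element RLP list and each present child encodes itself recursively.
func (n *fullNode) encode(w rlp.EncoderBuffer) {
	offset := w.List()
	for _, c := range n.Children {
		if c != nil {
			c.encode(w)
		} else {
			w.Write([]byte{0x80}) // RLP encoding of an empty string for a nil child
		}
	}
	w.ListEnd(offset)
}
```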

@fjl changed the title from "fast & zero-alloc trie node encoding" to "trie: faster trie node encoding" on Jan 19, 2022
@qianbin (Contributor, Author) commented Jan 19, 2022

Okay, I'll update my PR once that's merged.

@qianbin (Contributor, Author) commented Mar 1, 2022

@fjl I just pushed a new commit using rlp.EncoderBuffer.

@fjl changed the title from "trie: faster trie node encoding" to "rlp, trie: faster trie node encoding" on Mar 8, 2022
@fjl (Contributor) commented Mar 8, 2022

This is a pretty solid performance improvement now:

name                       old time/op    new time/op    delta
DeriveSha200/std_trie-8       252µs ± 0%     228µs ± 0%   -9.32%  (p=0.000 n=10+10)
DeriveSha200/stack_trie-8     248µs ± 0%     207µs ± 0%  -16.60%  (p=0.000 n=10+10)

name                       old alloc/op   new alloc/op   delta
DeriveSha200/std_trie-8       270kB ± 0%     265kB ± 0%   -1.72%  (p=0.000 n=10+9)
DeriveSha200/stack_trie-8    55.1kB ± 0%    41.0kB ± 0%  -25.70%  (p=0.000 n=10+10)

name                       old allocs/op  new allocs/op  delta
DeriveSha200/std_trie-8       2.66k ± 0%     2.65k ± 0%   -0.56%  (p=0.000 n=10+10)
DeriveSha200/stack_trie-8     1.25k ± 0%     1.04k ± 0%  -17.15%  (p=0.000 n=10+10)

name                      old time/op    new time/op    delta
Prove-8                      305µs ± 2%     155µs ±14%  -49.15%  (p=0.000 n=8+10)
VerifyProof-8               4.57µs ± 6%    4.52µs ± 7%     ~     (p=0.754 n=10+10)
VerifyRangeProof10-8        46.3µs ± 4%    34.0µs ± 2%  -26.49%  (p=0.000 n=8+10)
VerifyRangeProof100-8        207µs ±10%     160µs ± 1%  -22.82%  (p=0.000 n=10+10)
VerifyRangeProof1000-8      1.75ms ± 3%    1.32ms ± 6%  -24.15%  (p=0.000 n=10+9)
VerifyRangeProof5000-8      5.93ms ± 1%    5.01ms ± 1%  -15.55%  (p=0.000 n=10+10)
VerifyRangeNoProof10-8       206µs ± 4%     126µs ± 3%  -38.61%  (p=0.000 n=10+10)
VerifyRangeNoProof500-8      848µs ± 2%     517µs ± 2%  -39.05%  (p=0.000 n=10+10)
VerifyRangeNoProof1000-8    1.58ms ± 1%    0.99ms ± 1%  -37.25%  (p=0.000 n=10+9)

name                      old alloc/op   new alloc/op   delta
Prove-8                      132kB ± 3%     123kB ±19%   -6.88%  (p=0.043 n=8+10)
VerifyProof-8               5.02kB ± 3%    5.01kB ± 6%     ~     (p=0.912 n=10+10)
VerifyRangeProof10-8        42.5kB ± 0%    40.6kB ± 1%   -4.46%  (p=0.000 n=7+10)
VerifyRangeProof100-8        275kB ±15%     265kB ± 0%   -3.78%  (p=0.003 n=10+8)
VerifyRangeProof1000-8      2.07MB ± 1%    1.96MB ± 1%   -5.31%  (p=0.000 n=9+7)
VerifyRangeProof5000-8      10.0MB ± 0%     9.4MB ± 0%   -5.67%  (p=0.000 n=10+10)
VerifyRangeNoProof10-8      56.4kB ± 4%    34.3kB ± 4%  -39.15%  (p=0.000 n=10+10)
VerifyRangeNoProof500-8      202kB ± 3%     114kB ± 1%  -43.82%  (p=0.000 n=10+10)
VerifyRangeNoProof1000-8     368kB ± 2%     211kB ± 1%  -42.62%  (p=0.000 n=10+9)

name                      old allocs/op  new allocs/op  delta
Prove-8                      1.83k ± 1%     1.88k ±19%     ~     (p=0.395 n=8+10)
VerifyProof-8                  103 ± 3%       103 ± 5%     ~     (p=0.927 n=10+10)
VerifyRangeProof10-8           431 ± 1%       423 ± 2%   -2.05%  (p=0.001 n=7+10)
VerifyRangeProof100-8        1.92k ± 7%     1.89k ± 0%   -1.99%  (p=0.003 n=10+8)
VerifyRangeProof1000-8       16.0k ± 1%     15.7k ± 1%   -2.35%  (p=0.000 n=10+10)
VerifyRangeProof5000-8       78.8k ± 0%     76.9k ± 0%   -2.52%  (p=0.000 n=10+10)
VerifyRangeNoProof10-8       1.19k ± 2%     0.94k ± 1%  -21.47%  (p=0.000 n=10+9)
VerifyRangeNoProof500-8      3.74k ± 1%     2.91k ± 1%  -22.24%  (p=0.000 n=10+10)
VerifyRangeNoProof1000-8     6.77k ± 1%     5.31k ± 1%  -21.64%  (p=0.000 n=10+10)

@fjl requested a review from holiman on March 8, 2022 at 21:26
-   if err := n.EncodeRLP(&h.tmp); err != nil {
-       panic("encode error: " + err.Error())
-   }
+   n.encode(h.encbuf)
Contributor
Looks like shortNodeToHash and fullNodeToHash are identical now, so they can maybe be merged into one?

Contributor
It's a bit more efficient not to merge them because encode can be called directly instead of through the interface. But I agree that it's silly.
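
For illustration, after this change the short-node helper reads roughly like this (a sketch; encodedBytes and hashData stand in for the hasher's buffer-draining and keccak helpers and may not match the exact names in the PR):

```go
// Sketch of shortNodeToHash: the node writes itself into the hasher's
// EncoderBuffer via a direct method call on *shortNode (no interface
// dispatch), and small nodes stay embedded in their parent.
func (h *hasher) shortNodeToHash(n *shortNode, force bool) node {
	n.encode(h.encbuf)
	enc := h.encodedBytes() // hypothetical helper: take the encoded bytes and reset the buffer

	if len(enc) < 32 && !force {
		return n // nodes shorter than 32 bytes are inlined in their parent
	}
	return h.hashData(enc) // hypothetical helper: keccak256 the encoding into a hashNode
}
```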

Review thread on trie/stacktrie.go (resolved)
@holiman (Contributor) left a comment

LGTM

@fjl added this to the 1.10.17 milestone on Mar 9, 2022
@fjl merged commit 65ed1a6 into ethereum:master on Mar 9, 2022
sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Mar 9, 2022
This change speeds up trie hashing and all other activities that require
RLP encoding of trie nodes by approximately 20%. The speedup is achieved by
avoiding reflection overhead during node encoding.

The interface type trie.node now contains a method 'encode' that works with
rlp.EncoderBuffer. Management of EncoderBuffers is left to calling code.
trie.hasher, which is pooled to avoid allocations, now maintains an
EncoderBuffer. This means memory resources related to trie node encoding
are tied to the hasher pool.

Co-authored-by: Felix Lange <fjl@twurst.com>
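
A minimal sketch of the "tied to the hasher pool" part described above (field and function names are illustrative; the real trie/hasher.go carries additional hashing state):

```go
import (
	"sync"

	"github.com/ethereum/go-ethereum/rlp"
)

// Each pooled hasher owns one rlp.EncoderBuffer, so the encoder's scratch
// memory is recycled together with the hasher itself.
type hasher struct {
	encbuf rlp.EncoderBuffer
	// ... keccak state and other fields omitted
}

var hasherPool = sync.Pool{
	New: func() interface{} {
		return &hasher{encbuf: rlp.NewEncoderBuffer(nil)}
	},
}

func newHasher() *hasher     { return hasherPool.Get().(*hasher) }
func returnHasher(h *hasher) { hasherPool.Put(h) }
```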
JacekGlen pushed a commit to JacekGlen/go-ethereum that referenced this pull request May 26, 2022
maoueh pushed a commit to streamingfast/go-ethereum that referenced this pull request Sep 30, 2022
cp-wjhan pushed a commit to cp-yoonjin/go-wemix that referenced this pull request Nov 16, 2022