Re-visit how well Erigon can restore values from 'p2p/enode/nodedb.go' after restart #3581

AskAlexSharov · 2022-02-23T01:34:07Z

Problem: after a restart Erigon loses its good peers and is very slow to gather them back.
I see the following message after node start (and sometimes on shutdown too):

Successfully update p2p node database    updated=0 deleted=8206

Need to re-check how well Erigon can restore values from 'p2p/enode/nodedb.go' after restart.

Need to check:

can it re-use most of the values?
if it can re-use them - how fast can such peers become "good peers"?
if Erigon has records in nodedb, can it start with an empty Bootnodes list?

battlmonstr · 2022-03-07T15:37:12Z

@AskAlexSharov

I think that primarily it is a regression after introducing a nodedb cache.

There are multiple issues that cause this behaviour.

The cache is flushed to DB in 2 cases:

When reaching ~3-4K new peers due to this check. In practice this rarely happens, because getting to thousands of discovered peers takes quite a while.
Upon a graceful shutdown. If the server is killed without a graceful shutdown, nothing will be persisted.

Answering your questions:

updated=0 deleted=8206

This log message is misleading due to a typo: the entriesUpdated/entriesDeleted variables are swapped in the logging call. The message normally appears upon shutdown and says that 8206 database cache entries were flushed to DB (corresponding to about 2K peers, because each peer has 4-5 related entries).

can it re-use most of the values?

Yes, but there are caveats:

The nodes are saved to DB only after staying in the routing table for seedMinTableTime=5 minutes. If you run erigon for less than 5 minutes, the seed nodes are empty. It's possible to check whether some nodes come from DB after a restart by setting a breakpoint here and watching seeds.
QuerySeeds will poke 150 random entries in the whole node DB and ignore any "field" entries it hits. In a long-running scenario each "node record" entry has 3-4 "field" entries (lastping, lastpong, findfail, seq), so 150 attempts should find about 30 "node record" entries. In a bootstrap scenario this fails, because initially the node produces hundreds of pings while the table has very few "node record" entries, so 150 random attempts are not enough. After running for 15 minutes I've got totalEntryCount=1508 nodeRecordCount=114 entries, meaning that there's a 1/16 chance of hitting a "node record" entry, and it means finding just about 10 nodes of 114 total on average from 150 attempts.
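
The arithmetic behind these estimates can be checked with a minimal sketch. It assumes each probe is independent and uniform over all entries, which is a simplification of whatever QuerySeeds really does; `expectedHits` is a hypothetical helper, not a function from nodedb.go. With the quoted counts the per-probe chance works out to roughly 1 in 13 and the expected yield to about 11 node records, the same ballpark as the "about 10" above.

```go
package main

import "fmt"

// expectedHits returns the average number of "node record" entries found by
// `attempts` random probes into a keyspace of `totalEntries` entries, of which
// `nodeRecords` are node records.
// Assumption: probes are independent and uniform (hypothetical model).
func expectedHits(attempts, nodeRecords, totalEntries int) float64 {
	if totalEntries == 0 {
		return 0
	}
	return float64(attempts) * float64(nodeRecords) / float64(totalEntries)
}

func main() {
	// Long-running scenario: roughly 1 node record per 5 entries.
	fmt.Printf("long-running: %.1f records per 150 probes\n", expectedHits(150, 1, 5))

	// Bootstrap scenario with the counts quoted above.
	fmt.Printf("bootstrap hit chance per probe: %.3f\n", float64(114)/float64(1508))
	fmt.Printf("bootstrap: %.1f records per 150 probes\n", expectedHits(150, 114, 1508))
}
```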

how fast can such peers become "good peers"?

There's no prioritization of good peers (the ones that were just connected with Eth66 before shutdown) versus peers that replied to a Ping during the last 5 days (seedMaxAge). QuerySeeds happily returns 30 random peers to put into the table, and then it will start a random lookup. The first phase of that lookup will return 16 of them back from the table. After restarting it is likely to try to connect to a totally different set of peers.

can it start with empty Bootnodes list?

Yes, it can. I tried - it worked.

Possible solutions:

Revert the nodedb cache optimization.
Implement a different solution for reducing the number of commits to the nodedb. I feel that most commits are caused by lastping/lastpong updates. If we keep a transaction object for those (using BeginRw instead of Update), we can delay the commits on it. The commits could be done based on a timer, e.g. every 5 sec. This needs a bit of testing to confirm that writing to the other entries is rare.
Split "node record" entries into a separate table so that QuerySeeds doesn't waste attempts on "field" entries. Or find a different way to fix QuerySeeds, e.g. iterate over all the entries.

These changes would solve the regression. When it comes to prioritizing good peers, this feels like a separate "feature" task that we could discuss and design separately.

CC @AlexeyAkhunov

battlmonstr · 2022-03-07T16:34:45Z

Yeah, so it is mostly updating lastping/lastpong/findfail:

  53 UpdateFindFails
  68 UpdateLastPingReceived
  77 UpdateLastPongReceived

UpdateFindFails/UpdateLastPingReceived/UpdateLastPongReceived events are causing bursty DB commits (100 per minute). This optimization throttles the commits to happen at most once in a few seconds, because this info doesn't need to be persisted immediately.
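
A minimal sketch of what such throttling could look like, under stated assumptions: updates are buffered in memory and handed to a commit callback at most once per interval. `throttledWriter`, `simulateBurst` and the 5-second interval are hypothetical names and values chosen for illustration; this is not the code from the actual change. The clock is passed in explicitly so the behaviour is deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// throttledWriter buffers updates in memory and passes them to commit at most
// once per interval. Hypothetical type, not Erigon's implementation.
type throttledWriter struct {
	mu         sync.Mutex
	pending    map[string]int64
	interval   time.Duration
	lastCommit time.Time
	commit     func(batch map[string]int64) // stands in for one DB transaction + commit
}

func newThrottledWriter(interval time.Duration, commit func(map[string]int64)) *throttledWriter {
	return &throttledWriter{pending: map[string]int64{}, interval: interval, commit: commit}
}

// Put records an update and commits the buffer if the interval has elapsed.
// A real version would also need a timer, so that a quiet period does not
// leave buffered data unflushed.
func (w *throttledWriter) Put(key string, val int64, now time.Time) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[key] = val
	if now.Sub(w.lastCommit) >= w.interval {
		w.flushLocked(now)
	}
}

// Close commits whatever is still buffered, e.g. on graceful shutdown.
func (w *throttledWriter) Close(now time.Time) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.flushLocked(now)
}

// flushLocked commits the buffer if it is non-empty. Caller must hold w.mu.
func (w *throttledWriter) flushLocked(now time.Time) {
	if len(w.pending) == 0 {
		return
	}
	w.commit(w.pending)
	w.pending = map[string]int64{}
	w.lastCommit = now
}

// simulateBurst feeds evenly spaced updates through the writer and returns
// how many commits were issued.
func simulateBurst(updates, spacingMillis, intervalMillis int) int {
	commits := 0
	w := newThrottledWriter(time.Duration(intervalMillis)*time.Millisecond,
		func(map[string]int64) { commits++ })
	start := time.Unix(0, 0)
	now := start
	for i := 0; i < updates; i++ {
		now = start.Add(time.Duration(i*spacingMillis) * time.Millisecond)
		w.Put(fmt.Sprintf("node%d:lastpong", i%10), int64(i), now)
	}
	w.Close(now)
	return commits
}

func main() {
	// 100 updates in one minute (one every 600 ms), commits at most every 5 s.
	fmt.Println("throttled commits:", simulateBurst(100, 600, 5000))
	fmt.Println("unthrottled commits:", simulateBurst(100, 600, 0))
}
```

The trade-off is the one mentioned in this thread: anything still buffered is lost if the process is killed without a graceful shutdown, which is acceptable only because this data does not need to be persisted immediately.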

AskAlexSharov · 2022-03-08T01:36:24Z

“Holding db tx object” - the main problem here is that an rwtx object can’t be moved between threads (meaning you can’t “just hide it behind a mutex and use it from multiple goroutines”). And the cache object was introduced mostly to handle concurrent writes. There is a known design pattern for handling many parallel writes per second with lmdb, see the .Batch method of bbolt https://github.com/etcd-io/bbolt#batch-read-write-transactions (if you like, you can implement something similar in the mdbx-go bindings or the kv_mdbx wrapper).
Also, a readTx will not see any updates until the rwtx commits (I don’t know if that’s important here).
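
The general idea behind that kind of batching can be sketched with the standard library only. This is not bbolt's implementation and not an mdbx-go API: `batcher`, `writeOp` and `runDemo` are hypothetical names, and an in-memory map stands in for the database. Many goroutines submit writes, while a single writer goroutine, pinned to one OS thread, applies whatever is queued as one "transaction" and commits once.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// writeOp is one queued write; done is closed once the write is committed.
type writeOp struct {
	key, val string
	done     chan struct{}
}

// batcher funnels writes from many goroutines into one writer goroutine.
// Hypothetical sketch; store stands in for the DB and is touched only by run.
type batcher struct {
	ops     chan writeOp
	wg      sync.WaitGroup
	store   map[string]string
	commits int
}

func newBatcher(maxBatch int) *batcher {
	b := &batcher{ops: make(chan writeOp), store: map[string]string{}}
	b.wg.Add(1)
	go b.run(maxBatch)
	return b
}

func (b *batcher) run(maxBatch int) {
	defer b.wg.Done()
	// Pin the writer to one OS thread, since a read-write transaction
	// must not move between threads.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for first := range b.ops {
		batch := []writeOp{first}
	drain:
		// Collect writes that are already waiting, up to maxBatch.
		for len(batch) < maxBatch {
			select {
			case op, ok := <-b.ops:
				if !ok {
					break drain
				}
				batch = append(batch, op)
			default:
				break drain
			}
		}
		// One "transaction": apply every write, then commit once.
		for _, op := range batch {
			b.store[op.key] = op.val
		}
		b.commits++
		for _, op := range batch {
			close(op.done)
		}
	}
}

// Put blocks until the write has been committed by the writer goroutine.
func (b *batcher) Put(key, val string) {
	op := writeOp{key: key, val: val, done: make(chan struct{})}
	b.ops <- op
	<-op.done
}

// Close stops the writer; call it only after all Put calls have returned.
func (b *batcher) Close() {
	close(b.ops)
	b.wg.Wait()
}

// runDemo issues one write per goroutine and reports stored keys and commits.
func runDemo(writers int) (stored, commits int) {
	b := newBatcher(64)
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			b.Put(fmt.Sprintf("node%d:lastping", i), "ts")
		}(i)
	}
	wg.Wait()
	b.Close()
	return len(b.store), b.commits
}

func main() {
	stored, commits := runDemo(50)
	fmt.Printf("stored=%d commits=%d\n", stored, commits)
}
```

The number of commits depends on scheduling, so it varies from run to run; it can never exceed the number of writes.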

“this feels like a separate "feature" task” - feel free to break it into smaller PRs as you like.

I think we don’t need “peers prioritization”; we just need to review that there are no obvious bugs that make us lose the list of peers or restore it too slowly.

UpdateFindFails/UpdateLastPingReceived/UpdateLastPongReceived events are causing bursty DB commits (100 per minute). This optimization throttles the disk writes to happen at most once in a few seconds, because this info doesn't need to be persisted immediately. This helps on HDD drives.

Problem: QuerySeeds will poke 150 random entries in the whole node DB and ignore hitting "field" entries. In a bootstrap scenario it might hit hundreds of :lastping :lastpong entries, and very few true "node record" entries. After running for 15 minutes I've got totalEntryCount=1508 nodeRecordCount=114 entries. There's a 1/16 chance of hitting a "node record" entry. It means finding just about 10 nodes of 114 total on average from 150 attempts. Solution: Split "node record" entries to a separate table such that QuerySeeds doesn't do idle cycle hits.
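
The effect of the split can be illustrated with a small simulation. It is a sketch under assumptions: the key layout (field entries carrying a ":" suffix), `buildTables`, `countRecordHits` and `demo` are all hypothetical and do not mirror the real nodedb schema or the real QuerySeeds loop. Only the entry counts come from the text above.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// buildTables returns a mixed keyspace (node records and field entries in one
// table) and a records-only table. Key names are hypothetical.
func buildTables(nodeRecords, fieldEntries int) (mixed, recordsOnly []string) {
	for n := 0; n < nodeRecords; n++ {
		id := fmt.Sprintf("node%04d", n)
		mixed = append(mixed, id)
		recordsOnly = append(recordsOnly, id)
	}
	for f := 0; f < fieldEntries; f++ {
		mixed = append(mixed, fmt.Sprintf("peer%04d:lastping", f))
	}
	return mixed, recordsOnly
}

// countRecordHits probes random keys and counts how many are node records.
// Field entries are recognised by the ":" in their key.
func countRecordHits(keys []string, attempts int, rng *rand.Rand) int {
	if len(keys) == 0 {
		return 0
	}
	hits := 0
	for i := 0; i < attempts; i++ {
		if !strings.Contains(keys[rng.Intn(len(keys))], ":") {
			hits++
		}
	}
	return hits
}

// demo runs 150 probes against both layouts using the counts quoted above
// (114 node records out of 1508 entries in total).
func demo(seed int64) (mixedHits, splitHits int) {
	rng := rand.New(rand.NewSource(seed))
	mixed, recordsOnly := buildTables(114, 1508-114)
	return countRecordHits(mixed, 150, rng), countRecordHits(recordsOnly, 150, rng)
}

func main() {
	mixedHits, splitHits := demo(1)
	fmt.Printf("mixed table: %d of 150 probes hit a node record\n", mixedHits)
	fmt.Printf("separate table: %d of 150 probes hit a node record\n", splitHits)
}
```

With a records-only table every probe lands on a node record, so no attempts are wasted; repeated probes can still return the same node, which a real implementation would have to deduplicate.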

Revert "Prevent frequent commits to the node DB in sentries (#2505)". This reverts commit 65a9a26.

battlmonstr · 2022-03-10T13:58:05Z

Mostly done for Erigon2 at 04f07a0.

I've made a tentative revert PR #3675 for stable. Do we want to backport all the fixes to stable?

AskAlexSharov · 2022-03-11T05:04:07Z

@battlmonstr "Do we want to backport all the fixes to stable?" - no, only if it fixes some problem. "slow discovery"/"too many commits of enodedb" is likely a problem.

* Update to erigon-lib stable
* Discovery: throttle node DB commits (#3581) (#3656)
  UpdateFindFails/UpdateLastPingReceived/UpdateLastPongReceived events are causing bursty DB commits (100 per minute). This optimization throttles the disk writes to happen at most once in a few seconds, because this info doesn't need to be persisted immediately. This helps on HDD drives.
* Update erigon-lib
* Discovery: split node records to a sepatate DB table (#3581) (#3667)
  Problem: QuerySeeds will poke 150 random entries in the whole node DB and ignore hitting "field" entries. In a bootstrap scenario it might hit hundreds of :lastping :lastpong entries, and very few true "node record" entries. After running for 15 minutes I've got totalEntryCount=1508 nodeRecordCount=114 entries. There's a 1/16 chance of hitting a "node record" entry. It means finding just about 10 nodes of 114 total on average from 150 attempts. Solution: Split "node record" entries to a separate table such that QuerySeeds doesn't do idle cycle hits.
* Discovery: add Context to Listen. (#3577)
  Add explicit Context to ListenV4 and ListenV5. This makes it possible to stop listening by an external signal.
* Discovery: refactor public key to node ID conversions. (#3634)
  Encode and hash logic was duplicated in multiple places.
  * Move encoding to p2p/discover/v4wire
  * Move hashing to p2p/enode/idscheme
  * Change newRandomLookup to create a proper random key on a curve.
* Discovery: speed up lookup tests (#3677)
* Update erigon-lib

Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local>
Co-authored-by: battlmonstr <battlmonstr@users.noreply.github.com>

battlmonstr self-assigned this Mar 7, 2022

battlmonstr added a commit that referenced this issue Mar 10, 2022

Revert "Prevent frequent commits to the node DB in sentries (#2505)" (#…

9458c8d

…3581) This reverts commit 65a9a26.

battlmonstr added a commit that referenced this issue Mar 10, 2022

Revert node DB cache (#3581) (#3674)

42d128e

Revert "Prevent frequent commits to the node DB in sentries (#2505)". This reverts commit 65a9a26.

battlmonstr added a commit that referenced this issue Mar 10, 2022

Revert node DB cache (#3581) (#3674)

b4a1a11

Revert "Prevent frequent commits to the node DB in sentries (#2505)". This reverts commit 65a9a26.

AlexeyAkhunov pushed a commit that referenced this issue Mar 10, 2022

Revert node DB cache (#3581) (#3674) (#3675)

a52e53a

Revert "Prevent frequent commits to the node DB in sentries (#2505)". This reverts commit 65a9a26.

battlmonstr closed this as completed Apr 25, 2022
