Re-visit how well Erigon can restore values from 'p2p/enode/nodedb.go' after restart #3581
Comments
I think that primarily it is a regression after introducing a nodedb cache. There are multiple issues that cause this behaviour. The cache is flushed to DB in 2 cases:
Answering your questions:
This log message is misleading due to a typo - entriesUpdated/entriesDeleted variables should be reversed there for logging. The message happens normally upon shutdown and says that 8206 database cache entries were flushed to DB (corresponding to about 2K peers, because each peer has 4-5 related entries).
Yes, but there are caveats:
There's no prioritization of good peers (the ones that were just connected with Eth66 before shutdown) versus peers that replied to a Ping during the last 5 days (seedMaxAge). QuerySeeds happily returns 30 random peers to put into the table, and then it will start a random lookup. The first phase of that lookup will return 16 of them back from the table. After restarting it is likely to try to connect to a totally different set of peers.
Yes, it can. I tried - it worked. Possible solutions:
This solves the regression. When it comes to prioritizing good peers, this feels like a separate "feature" task that we could discuss and design separately.
Yeah, so it is mostly updating lastping/lastpong/findfail:
UpdateFindFails/UpdateLastPingReceived/UpdateLastPongReceived events are causing bursty DB commits (100 per minute). This optimization throttles the commits to happen at most once in a few seconds, because this info doesn't need to be persisted immediately.
“Holding db tx object”: the main problem here is that an rwtx object can't be moved between threads (meaning you can't just hide it behind a mutex and use it from multiple goroutines), and the cache object was introduced mostly to handle concurrent writes. There is a known design pattern for handling many parallel writes per second with LMDB; see the .Batch method of bbolt: https://github.com/etcd-io/bbolt#batch-read-write-transactions (if you like, you can implement something similar in the mdbx-go bindings or in the kv_mdbx wrapper). “this feels like a separate "feature" task”: feel free to break it into smaller PRs as you like. I don't think we need “peers prioritization”; we just need to review that there are no obvious bugs that make us lose the list of peers or restore them too slowly.
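To make the batching idea above concrete, here is a minimal sketch using only the Go standard library. It is not bbolt's `Batch` implementation and not Erigon code: `batcher`, `op`, `newBatcher` and `runDemo` are hypothetical names, and the `commit` callback merely stands in for "open one rwtx, apply every queued write, commit". It assumes that the write transaction has to stay on a single OS thread, so one dedicated goroutine owns all writes and callers only enqueue.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// op is one queued write; done receives the result of the commit it joined.
type op struct {
	key, value string
	done       chan error
}

// batcher funnels writes from many goroutines into one writer goroutine.
// All names here are hypothetical and not part of Erigon, mdbx-go or bbolt.
type batcher struct {
	ops      chan op
	maxBatch int
	commit   func(batch []op) error // stands in for "open rwtx, put all, commit"
	wg       sync.WaitGroup
}

func newBatcher(maxBatch int, commit func([]op) error) *batcher {
	b := &batcher{ops: make(chan op, maxBatch), maxBatch: maxBatch, commit: commit}
	b.wg.Add(1)
	go b.run()
	return b
}

func (b *batcher) run() {
	defer b.wg.Done()
	// Pin the writer to one OS thread, since the write transaction
	// must not migrate between threads.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for first := range b.ops {
		batch := []op{first}
	drain:
		for len(batch) < b.maxBatch {
			select {
			case o, ok := <-b.ops:
				if !ok {
					break drain // queue closed; commit what we have
				}
				batch = append(batch, o)
			default:
				break drain // nothing else queued right now
			}
		}
		err := b.commit(batch) // one commit for the whole batch
		for _, o := range batch {
			o.done <- err
		}
	}
}

// Put blocks until the batch containing this write has been committed.
func (b *batcher) Put(key, value string) error {
	o := op{key: key, value: value, done: make(chan error, 1)}
	b.ops <- o
	return <-o.done
}

// Close stops the writer after it has drained the queue.
func (b *batcher) Close() {
	close(b.ops)
	b.wg.Wait()
}

// runDemo issues 50 concurrent writes and reports commits and entries written.
func runDemo() (commits, written int) {
	b := newBatcher(16, func(batch []op) error {
		commits++
		written += len(batch)
		return nil
	})
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = b.Put(fmt.Sprintf("node%d:lastpong", i), "ts")
		}(i)
	}
	wg.Wait()
	b.Close()
	return commits, written
}

func main() {
	commits, written := runDemo()
	fmt.Printf("%d entries written in %d commits\n", written, commits)
}
```

The number of commits depends on scheduling, so it varies between runs; the point is only that all 50 entries are written while the number of commits can be lower than the number of writes.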
UpdateFindFails/UpdateLastPingReceived/UpdateLastPongReceived events are causing bursty DB commits (100 per minute). This optimization throttles the disk writes to happen at most once in a few seconds, because this info doesn't need to be persisted immediately. This helps on HDD drives.
Problem: QuerySeeds will poke 150 random entries in the whole node DB and ignore hitting "field" entries. In a bootstrap scenario it might hit hundreds of :lastping :lastpong entries, and very few true "node record" entries. After running for 15 minutes I've got totalEntryCount=1508 nodeRecordCount=114 entries. There's a 1/16 chance of hitting a "node record" entry. It means finding just about 10 nodes of 114 total on average from 150 attempts. Solution: Split "node record" entries to a separate table such that QuerySeeds doesn't do idle cycle hits.
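The arithmetic behind the estimate above can be checked with a few lines of Go. This is a back-of-the-envelope sketch that assumes each of the 150 probes is an independent draw; `expectedHits` is a hypothetical helper, not a function from the code base. It evaluates both the quoted 1/16 hit rate and the ratio of the two measured counts, which is slightly higher but leads to the same conclusion.

```go
package main

import "fmt"

// expectedHits returns the expected number of probes that land on a node
// record, given the number of probes and the per-probe hit probability.
func expectedHits(attempts int, p float64) float64 {
	return float64(attempts) * p
}

func main() {
	const attempts = 150 // random entries probed by QuerySeeds
	const totalEntryCount = 1508
	const nodeRecordCount = 114

	// Hit rate quoted in the description.
	fmt.Printf("p = 1/16:       %.1f expected hits\n", expectedHits(attempts, 1.0/16))

	// Hit rate implied by the measured counts.
	ratio := float64(nodeRecordCount) / float64(totalEntryCount)
	fmt.Printf("p = 114/1508:   %.1f expected hits\n", expectedHits(attempts, ratio))

	// With node records in their own table, every probe lands on a record.
	fmt.Printf("separate table: %.0f probes hit a node record\n", expectedHits(attempts, 1.0))
}
```

Either way, only about 10 of the 150 probes find a node record when records and field entries share one table, whereas with a separate table no probe is wasted.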
@battlmonstr "Do we want to backport all the fixes to stable?" - no, only if a fix solves some problem. "Slow discovery" / "too many commits of enodedb" is likely such a problem.
* Update to erigon-lib stable
* Discovery: throttle node DB commits (#3581) (#3656)
  UpdateFindFails/UpdateLastPingReceived/UpdateLastPongReceived events are causing bursty DB commits (100 per minute). This optimization throttles the disk writes to happen at most once in a few seconds, because this info doesn't need to be persisted immediately. This helps on HDD drives.
* Update erigon-lib
* Discovery: split node records to a sepatate DB table (#3581) (#3667)
  Problem: QuerySeeds will poke 150 random entries in the whole node DB and ignore hitting "field" entries. In a bootstrap scenario it might hit hundreds of :lastping :lastpong entries, and very few true "node record" entries. After running for 15 minutes I've got totalEntryCount=1508 nodeRecordCount=114 entries. There's a 1/16 chance of hitting a "node record" entry. It means finding just about 10 nodes of 114 total on average from 150 attempts.
  Solution: Split "node record" entries to a separate table such that QuerySeeds doesn't do idle cycle hits.
* Discovery: add Context to Listen. (#3577)
  Add explicit Context to ListenV4 and ListenV5. This makes it possible to stop listening by an external signal.
* Discovery: refactor public key to node ID conversions. (#3634)
  Encode and hash logic was duplicated in multiple places.
  * Move encoding to p2p/discover/v4wire
  * Move hashing to p2p/enode/idscheme
  * Change newRandomLookup to create a proper random key on a curve.
* Discovery: speed up lookup tests (#3677)
* Update erigon-lib

Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local>
Co-authored-by: battlmonstr <battlmonstr@users.noreply.github.com>
* block by timestamp for stable (erigontech#3617) * Add timings of forward stages to logs (erigontech#3621) * save * save * deleted bor and starknet from doc (erigontech#3627) * add nosqlite tag (erigontech#3653) * add nosqlite tag * save * save (erigontech#3665) * save (erigontech#3663) * linter up (erigontech#3672) (erigontech#3673) * linter up (erigontech#3672) * save * save * Revert node DB cache (erigontech#3581) (erigontech#3674) (erigontech#3675) Revert "Prevent frequent commits to the node DB in sentries (erigontech#2505)". This reverts commit 65a9a26. * [stable] Fixes to discovery nodedb (erigontech#3691) * Update to erigon-lib stable * Discovery: throttle node DB commits (erigontech#3581) (erigontech#3656) UpdateFindFails/UpdateLastPingReceived/UpdateLastPongReceived events are causing bursty DB commits (100 per minute). This optimization throttles the disk writes to happen at most once in a few seconds, because this info doesn't need to be persisted immediately. This helps on HDD drives. * Update erigon-lib * Discovery: split node records to a sepatate DB table (erigontech#3581) (erigontech#3667) Problem: QuerySeeds will poke 150 random entries in the whole node DB and ignore hitting "field" entries. In a bootstrap scenario it might hit hundreds of :lastping :lastpong entries, and very few true "node record" entries. After running for 15 minutes I've got totalEntryCount=1508 nodeRecordCount=114 entries. There's a 1/16 chance of hitting a "node record" entry. It means finding just about 10 nodes of 114 total on average from 150 attempts. Solution: Split "node record" entries to a separate table such that QuerySeeds doesn't do idle cycle hits. * Discovery: add Context to Listen. (erigontech#3577) Add explicit Context to ListenV4 and ListenV5. This makes it possible to stop listening by an external signal. * Discovery: refactor public key to node ID conversions. (erigontech#3634) Encode and hash logic was duplicated in multiple places. 
* Move encoding to p2p/discover/v4wire * Move hashing to p2p/enode/idscheme * Change newRandomLookup to create a proper random key on a curve. * Discovery: speed up lookup tests (erigontech#3677) * Update erigon-lib Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local> Co-authored-by: battlmonstr <battlmonstr@users.noreply.github.com> * [stable] Fixes for state overrides in RPC (erigontech#3693) * State override support (erigontech#3628) * added stateOverride type * solved import cycle * refactoring * imported wrong package * fixed Call arguments * typo * override for traceCall * Fix eth call (erigontech#3618) * added isFake * using isFake instead of checkNonce * Revert "using isFake instead of checkNonce" This reverts commit 6a202bb. * Revert "added isFake" This reverts commit 2c48024. * only checking EOA if we are checking for Nonce Co-authored-by: Enrique Jose Avila Asapche <eavilaasapche@gmail.com> * new bootnodes (erigontech#3591) (erigontech#3695) Co-authored-by: Enrique Jose Avila Asapche <eavilaasapche@gmail.com> * Update skip analysis and preverified hashes (erigontech#3700) (erigontech#3704) Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local> Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local> * Update version.go (erigontech#3701) * simulate future blocks: timestamp and block number incremented by 1 for calls to trace_call(Many), debug_traceCall, eth_createAccessList * expose UsedGas through trace_call and trace_callMany * expose accessList via trace_call(Many) * plumb error into trace_call(many) outer json response Co-authored-by: Enrique Jose Avila Asapche <eavilaasapche@gmail.com> Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com> Co-authored-by: battlmonstr <battlmonstr@users.noreply.github.com> Co-authored-by: ledgerwatch <akhounov@gmail.com> Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local>
Problem: after a restart, Erigon loses its good peers and is very slow to gather them back.
I see the following message after node start (and sometimes on shutdown too):
Need to re-check how well Erigon can restore values from 'p2p/enode/nodedb.go' after restart.
Need to check: