Improve Receipt indexing and lookup performance in Lotus #19

Closed
i-norden opened this issue May 24, 2023 · 14 comments

@i-norden
Contributor

i-norden commented May 24, 2023

Background:

Lotus maintains sqlite indexes for fast transaction lookup: https://github.com/filecoin-project/lotus/blob/master/chain/index/msgindex.go

We can see this index being leveraged in the SearchForMessages method: https://github.com/filecoin-project/lotus/blob/master/chain/stmgr/searchwait.go#L140

In SearchForMessages we first load the message directly from the blockstore using the provided CID: https://github.com/filecoin-project/lotus/blob/master/chain/store/messages.go#L43.

Once we have the message, we attempt to find the corresponding receipt in the head-1 tipset ("head" being the tipset provided to the method) by calling tipsetExecutedMessage. This method loads the parent-of-head tipset, loads the messages for that tipset, and scans backwards through them (in order to account for replaceable message CIDs). Once it has identified the correct (potentially replaced) message through this process, it takes the index of that message and uses it to look up the corresponding receipt. The receipt lookup itself involves loading the receipts AMT and searching through it.

If we can't find the receipt in the head-1 tipset using the above process, we attempt to look up msgInfo in the sqlite message index by calling searchForIndexedMsg. This method searches the index for the message CID, and the index maps that CID to the appropriate tipset in which to look up the receipts. We then call tipsetExecutedMessage again, this time using the tipset identified from the index.

If msgInfo cannot be found in the index, we fall through to the most pessimistic search path, which involves calling searchBackForMsg. As the name implies, this method iterates backwards over tipsets from the provided head, searching through all the messages until the target message and corresponding receipt are found (or until the search limit is reached).
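
For reference, here is a rough Go sketch of this three-stage fallback, with the Lotus internals replaced by injected functions; all types, names, and signatures below are illustrative stand-ins, not the actual Lotus code:

```go
package lookupsketch

import "context"

// Minimal stand-ins for the Lotus types involved; illustrative only.
type Cid string
type TipSet struct{ Height int64 }
type Receipt struct{ ExitCode int64 }

// The three lookup strategies, injected as functions so the sketch stays
// self-contained. They correspond loosely to tipsetExecutedMessage,
// searchForIndexedMsg, and searchBackForMsg respectively.
type lookups struct {
	execInTipset func(ctx context.Context, ts *TipSet, m Cid) (*Receipt, bool, error)
	indexLookup  func(ctx context.Context, m Cid) (*TipSet, error)
	walkBack     func(ctx context.Context, head *TipSet, m Cid, limit int64) (*Receipt, error)
}

// searchForMessage mirrors the fallback order described above: parent-of-head
// first, then the sqlite message index, then the backwards walk over tipsets.
func (l lookups) searchForMessage(ctx context.Context, head *TipSet, m Cid, limit int64) (*Receipt, error) {
	// 1. Optimistic: the message executed in head's parent tipset.
	if r, found, err := l.execInTipset(ctx, head, m); err != nil {
		return nil, err
	} else if found {
		return r, nil
	}

	// 2. Ask the sqlite msgindex which tipset the message landed in, then
	//    re-run the tipset-scoped lookup against that tipset.
	if ts, err := l.indexLookup(ctx, m); err == nil && ts != nil {
		if r, found, err := l.execInTipset(ctx, ts, m); err != nil {
			return nil, err
		} else if found {
			return r, nil
		}
	}

	// 3. Pessimistic: iterate tipsets backwards from head until the message
	//    and its receipt are found or the search limit is reached.
	return l.walkBack(ctx, head, m, limit)
}
```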

Ideas (WIP):

  1. We need a secondary index for receipts. The sqlite message index maps message CIDs to tipset keys, but we still have to load the TipSet and then search through its referenced receipt AMT to retrieve the corresponding receipt. We could instead index the corresponding receipt CID directly in the sqlite table, or do something similar to geth, where we maintain a KVStore bucket that lets us prepend a simple prefix (e.g. []byte("r")) to a message CID to directly look up the corresponding receipt or receipt CID (see the sketch after this list).
  2. Better guarantees that the sqlite index will actually be hit, preventing us from having to go down the expensive searchBackForMsg path.
  3. The search path is built to accommodate allowReplaced. Even when allowReplaced is false, the access pattern proceeds down a non-optimal path in order to accommodate the possibility that it is true. If we know the message can't have been replaced, we should be able to avoid some of the work done in tipsetExecutedMessage. This needs further investigation.
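
To make the geth-style variant of idea 1 concrete, here is a minimal sketch assuming a generic key-value store; the KVStore interface and all function names are hypothetical, not an existing Lotus API:

```go
package receiptidx

// Sketch of idea 1: a geth-style secondary index in which a one-byte prefix
// plus the message CID keys the receipt (or receipt CID) directly.

type KVStore interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
}

// receiptPrefix distinguishes receipt entries from other buckets in the store.
var receiptPrefix = []byte("r")

// receiptKey builds the lookup key for a message CID's receipt.
func receiptKey(msgCID []byte) []byte {
	return append(append([]byte{}, receiptPrefix...), msgCID...)
}

// PutReceipt records the serialized receipt (or the receipt CID) under the
// prefixed message CID while a tipset is being indexed.
func PutReceipt(db KVStore, msgCID, receipt []byte) error {
	return db.Put(receiptKey(msgCID), receipt)
}

// GetReceipt answers a receipt lookup with a single point read instead of
// loading the tipset and walking its receipt AMT.
func GetReceipt(db KVStore, msgCID []byte) ([]byte, error) {
	return db.Get(receiptKey(msgCID))
}
```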
@i-norden
Contributor Author

This path is taken when servicing eth_getTransactionReceipt requests

eth_getTransactionReceipt is ultimately an alias/shim on top of this method: https://github.com/filecoin-project/lotus/blob/master/node/impl/full/eth.go#L427 which calls StateSearchMsg.

@i-norden
Contributor Author

i-norden commented May 24, 2023

I think there is a “quick” fix to slow eth_getTransactionReceipt queries that are looking for old receipts. The biggest issue when looking for a very early receipt is that, when the endpoint is called, the StateSearchMsg method described above is called with types.EmptyTSK.

This means StateSearchMsg defaults to using the heaviest tipset key, which is very far away from any early/earliest message/receipt. So if the message CID is not found in the message index, we fall down to the worst path and end up iterating over every single tipset back from head until we find the target receipt.

If we somehow maintained a mapping of tx_hashes/msg_CIDs to epoch/tipset key and were able to pass in the correct tipset (or one much closer to the target), I think that would speed things up dramatically without having to change any of the internals of StateSearchMsg.

We can test this hypothesis quickly by hardcoding the correct, or a much closer, tipset key to pass to StateSearchMsg when eth_getTransactionReceipt is called.
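
A sketch of what that test could look like, with stand-in types and a hypothetical hint lookup; the StateSearchMsg signature shown here is my assumption about the shape of the v1 full-node API, and hintFor is not an existing helper:

```go
package receipthint

import "context"

// Stand-ins for types.TipSetKey, cid.Cid, and api.MsgLookup; illustrative only.
type TipSetKey struct{ raw string }
type Cid string
type MsgLookup struct{ Height int64 }

// EmptyTSK mirrors types.EmptyTSK, the value currently passed by the eth API.
var EmptyTSK = TipSetKey{}

// StateAPI mirrors the shape of the StateSearchMsg call described above
// (assumed signature; check the actual Lotus v1 full-node API).
type StateAPI interface {
	StateSearchMsg(ctx context.Context, from TipSetKey, msg Cid, lookbackLimit int64, allowReplaced bool) (*MsgLookup, error)
}

// lookbackNoLimit stands in for Lotus's "no lookback limit" sentinel (assumption).
const lookbackNoLimit = -1

// searchWithHint passes a tipset key at (or near) the message's execution
// tipset instead of EmptyTSK, so StateSearchMsg does not have to walk back
// from the chain head. hintFor is a hypothetical helper backed by a
// tx_hash/msg_CID -> epoch/tipset-key mapping.
func searchWithHint(ctx context.Context, api StateAPI, msg Cid, hintFor func(Cid) (TipSetKey, bool)) (*MsgLookup, error) {
	from := EmptyTSK // current behaviour: resolves to the heaviest tipset
	if tsk, ok := hintFor(msg); ok {
		from = tsk
	}
	return api.StateSearchMsg(ctx, from, msg, lookbackNoLimit, true)
}
```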

@eshon

eshon commented May 24, 2023

@i-norden - FYI - filecoin-project/lotus#10807 - The Lotus/FVM team also identified this problem and plan to implement these additional indexes.

However their ETA for an MVP fix is 2-3 weeks. If you're able to propose any nearer-term improvements that could be released as a patch that'd be incredibly helpful, as this is currently blocking a number of users of archival nodes in our ecosystem.

@fridrik01

We definitely need better guarantees that the relevant SQL indices are properly maintained. For example, the logic for backfilling the sqlite index relevant to eth_getTransactionReceipt needs to be reconsidered IMO, as it simply skips backfilling if sqlite/txhash.db already exists (see relevant code). Could this be what is happening in your case?

@i-norden
Contributor Author

Thanks @eshon and @fridrik01!

I think guaranteeing that sqlite/msgindex.db is populated is also vital to this: once we have retrieved the corresponding message_cid for an eth tx_hash from the sqlite/txhash.db index, msgindex.db is what we use to find the correct tipset to search within.

Loose proposal for quick fixes for this and #20:

  1. Ensure sqlite/txhash.db is fully populated if the index is turned on. Add capabilities to trigger complete backfilling of this index on an already-synced node. (for #19)
  2. Ensure sqlite/msgindex.db is fully populated if the index is turned on. Add capabilities to trigger complete backfilling of this index on an already-synced node. (for #19 and #20, Improve eth_getBlockByNumber query performance)
  3. Add a B-tree index to the epoch column in the sqlite/msgindex.db messages table and add a method to search this index to quickly retrieve the tipset for a given epoch/block_height. (for #20; see the sketch after this list)
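
For item 3, the index itself is a small DDL change. A minimal sketch below assumes the msgindex lives at sqlite/msgindex.db with a messages table carrying epoch and tipset_cid columns; the path and the table/column names are assumptions, not the verified schema:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // sqlite driver; the msgindex is sqlite-backed
)

func main() {
	db, err := sql.Open("sqlite3", "sqlite/msgindex.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// SQLite indexes are B-trees, so an index on epoch gives fast
	// epoch -> tipset lookups for eth_getBlockByNumber-style queries.
	if _, err := db.Exec(`CREATE INDEX IF NOT EXISTS messages_epoch_idx ON messages (epoch)`); err != nil {
		log.Fatal(err)
	}

	// Example point lookup by epoch (column name assumed).
	var tipsetCid string
	const epoch = 2683348
	if err := db.QueryRow(`SELECT tipset_cid FROM messages WHERE epoch = ? LIMIT 1`, epoch).Scan(&tipsetCid); err != nil {
		log.Fatal(err)
	}
	log.Printf("epoch %d -> tipset %s", epoch, tipsetCid)
}
```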

@i-norden
Contributor Author

i-norden commented May 25, 2023

We cleared the sqlite/ directories before running the node on top of the fresh snapshot. This is probably why we don't have the sqlite/txhash.db index populated! Look further into why sqlite/msgindex.db also appears to be unpopulated.

We need two operating modes: an optimistic mode in which the node is allowed to fall back into the slower head-back searching path when the index is missed, and a strict mode which ensures the indexes are properly populated so that this pathway is never taken (on an index miss the node falls over into a "maintenance" mode that repairs the indexes; perhaps also some background process that monitors for this?).

The moment the cache is known, or hinted, to be inconsistent, the node moves into a mode that focuses on rebuilding the cache/index.

A flag that turns the above behavior on/off.

@i-norden
Contributor Author

i-norden commented May 25, 2023

A message can be in the local blockstore but not in the chain, so we can't simply revert to maintenance mode when we can find a message in the blockstore but not in the messages index.

Could we rely on federation with other nodes to check whether a message needs to be in the index?

DO THE BELOW:

Hash/snapshot/checkpoint the database periodically so that we can check whether it is correct/complete? Create a new empty DB, attach it to the current DB, copy the regions predefined for hashing over into the new DB, and hash the new DB; everyone should get the same hash. We need some way to store and expose these checkpoints, a canonical hash tree? Would this be a sidecar process or part of Lotus? Eventually add a Lotus API to support this?

Map epoch boundaries to the canonical hash. We need to define an attestation schema, plus an ETL SQL statement to take the msgindex and populate the new schema, then run the hash over these segments.

A service runs alongside Lotus and performs this algorithm, checking the hashes it generates against the ones exposed by other SPs. If a hash is not canonical: stop the node, fix the index, restart it, and re-run the test. The service needs to run this algorithm every "period".

"period": 2880 epochs (1 day)

See https://github.com/ribasushi/fil-fip36-vote-tally/blob/master/parsestate/model.go#L39-L68
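
A rough sketch of the per-period segment hashing described above, reusing the assumed msgindex schema from the earlier sketch (table/column names and the row encoding are assumptions):

```go
package main

import (
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

const period = 2880 // epochs per checkpoint segment (1 day)

// segmentHash hashes every msgindex row in [start, start+period), in a
// deterministic order, so two nodes with identical index contents produce
// the same digest for the same segment.
func segmentHash(db *sql.DB, start int64) (string, error) {
	rows, err := db.Query(
		`SELECT cid, tipset_cid, epoch FROM messages
		 WHERE epoch >= ? AND epoch < ?
		 ORDER BY epoch, cid`, start, start+period)
	if err != nil {
		return "", err
	}
	defer rows.Close()

	h := sha256.New()
	for rows.Next() {
		var mcid, tipsetCid string
		var epoch int64
		if err := rows.Scan(&mcid, &tipsetCid, &epoch); err != nil {
			return "", err
		}
		fmt.Fprintf(h, "%s|%s|%d\n", mcid, tipsetCid, epoch)
	}
	if err := rows.Err(); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	db, err := sql.Open("sqlite3", "sqlite/msgindex.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	digest, err := segmentHash(db, 2683348) // segment starting at an epoch boundary
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(digest)
}
```

The sidecar service could publish these digests and compare them against the digests other SPs expose for the same segments; a mismatch would trigger the stop/repair/restart flow described above.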

A trailing process that probes the index in the background and checks that the messages it contains are sufficient to satisfy state transitions?

@jennijuju

If you already have the state but not the index, lotus-shed msgindex backfill can help you backfill it.

@i-norden
Contributor Author

i-norden commented Jun 12, 2023

Thanks @jennijuju, we did, but we were running off of the latest release, which didn't include msgindex support 🙃. We have switched to building off of main in order to also pick up the new shed backfill commands. We have run the backfill on one of our nodes that was synced off of a filesystem transfer we received from a partner; unfortunately, this node appears to have been synced starting from a light snapshot above the FEVM activation height and so does not have all the messages necessary to completely rebuild the index. But this has allowed us to move forward with some testing.

We are simultaneously syncing new nodes from light snapshots right below the FEVM activation block height in order to capture the full FEVM message history. These syncs were (re)started about 2 weeks ago but have not yet completed (around epoch 2750000 now; at the current rate they will take 5-6 more weeks). If you are aware of anybody who could provide a full snapshot or transfer the data dir for a non-pruning node synced from a light snapshot below 2683348, we would be very interested in that.

The machines for the light snapshot syncs are NVMe-backed with 10Gb Ethernet.

Supermicro SYS-2029U-TN24R4T
CLX 4214R 2P 12C/24T 2.4G 16.5M 9.6GT 100W 3647 L1 x1
64GB DDR4-3200 2Rx4 (16Gb) ECC RDIMM, RoHS x12
WD/HGST SN640 6.4TB 7mm 2.5" PCIe 3.1x4 U.2 TLC ISE 2DWPD x12
Intel D3-S4610 240GB SATA 6Gb/s 3D TLC 2.5" 7mm 3DWPD Rev.3 x2
Mellanox ConnectX-5 EN network card, 50GbE dual-port QSFP28 x1

@i-norden
Contributor Author

i-norden commented Jun 12, 2023

While syncing from the light snapshots starting below 2683348, we have noticed at startup that the nodes struggle to find peers. They eventually do, but then continue to run into the following issues with high frequency:

ID": "f01146045"}
2023-06-12T17:35:20.871Z        WARN    sub     sub/incoming.go:544     cannot authenticate messsage    {"err": "failed to load miner actor: actor not found", "errVerbose": "failed to load miner actor:\n    github.com/filecoin-project/lotus/node/impl/full.(*StateModule).StateMinerInfo\n        /home/srw/lotus/node/impl/full/state.go:153\n  - actor not found", "peer": "12D3KooWBjsDHppZZqumacEmJqfuUSBd6x41aCeVCdo1gcM11iyA", "minerID": "f02182902"}
2023-06-12T17:35:21.251Z        WARN    sub     sub/incoming.go:544     cannot authenticate messsage    {"err": "failed to load miner actor: actor not found", "errVerbose": "failed to load miner actor:\n    github.com/filecoin-project/lotus/node/impl/full.(*StateModule).StateMinerInfo\n        /home/srw/lotus/node/impl/full/state.go:153\n  - actor not found", "peer": "12D3KooWEiPRcfjXJVehty8okJGJpBZP8zM5UBoCK5yw2MXfx98x", "minerID": "f02098126"}

and

2023-06-12T17:36:23.749Z        INFO    net/identify    identify/id.go:432      failed negotiate identify protocol with peer    {"peer": "12D3KooWFvcNjT29GtPwaT6VeMGMoAJZXJ4GRZ9JkUpQ73b2aodn", "error": "stream reset"}
2023-06-12T17:36:23.750Z        WARN    net/identify    identify/id.go:407      failed to identify 12D3KooWFvcNjT29GtPwaT6VeMGMoAJZXJ4GRZ9JkUpQ73b2aodn: stream reset
2023-06-12T17:36:24.731Z        INFO    consensus-common        consensus/compute_state.go:301  ApplyBlocks stats       {"early": 0.00000782, "earlyCronGas": 0, "vmMsg": 2.996465943, "msgGas": 26812048661, "vmCron": 10.457947094, "cronGas": 450624768184, "vmFlush": 1.945487151, "epoch": "2757064", "tsk": "{bafy2bzacedngohola2r3qrekkyitohqucrwaemu72lcavhzncmsj4atvpgbdu,bafy2bzacecslqgxoqyquarokju72zfozbhvlvjwxhazsewg3hc4yjukfkxzg6,bafy2bzaceb5ecnsbbfhki66mhqjprt2uz3xniqxythv6jfhnqwaeh5bjc73m2,bafy2bzacedhc5ysif3fhvyqf7fmrkbqdasf35cfnttui66bjwgvp5n3btjiz6,bafy2bzacebr4wuxolzmonyp377iluvnt24gy7i4re5emw6u5ht3mb7msesjei,bafy2bzacedodefp32o5mjhd35k5zeebzf2rwuzohnuy6tutlchohywua465ly,bafy2bzacec34bzudacjs2olicf6qwp3qoaysppi6mebqykaj2ztu5bngrg4c2}"}

and

2023-06-12T17:42:59.667Z        INFO    bs:sess session/session.go:460  No peers - broadcasting {"session": 93355, "want-count": 56}
2023-06-12T17:42:59.736Z        INFO    bs:sess session/session.go:460  No peers - broadcasting {"session": 93356, "want-count": 1}
2023-06-12T17:43:00.147Z        INFO    bs:sess session/session.go:460  No peers - broadcasting {"session": 93357, "want-count": 27}
2023-06-12T17:43:00.165Z        INFO    bs:sess session/sessionwantsender.go:456        peer 12D3KooWNj5A5yxQ8Do8gUX54ahZC9aVVKqMPBraFMnwchXMFFn9 sent too many dont haves, removing from session 93357
2023-06-12T17:43:00.188Z        INFO    bs:sess session/session.go:460  No peers - broadcasting {"session": 93358, "want-count": 21}
2023-06-12T17:43:00.193Z        INFO    bs:sess session/session.go:460  No peers - broadcasting {"session": 93359, "want-count": 1}
2023-06-12T17:43:00.197Z        INFO    bs:sess session/session.go:460  No peers - broadcasting {"session": 93360, "want-count": 17}
2023-06-12T17:43:00.201Z        INFO    bs:sess session/session.go:460  No peers - broadcasting {"session": 93361, "want-count": 1}

We have not dug into these messages in depth yet, but on the surface this behavior suggests to me that the nodes are struggling to find peers that can provide the historical data needed. This would sort of make sense to me at a high level, since the standard way of syncing a node is to sync from the latest official light snapshots: such peers won't have historical transaction or state data below the snapshot they started at. It seems this could have been exacerbated further by the introduction of the SplitStore, since the official docs recommend migrating to the SplitStore by throwing away your current datadir and starting an entirely new sync from a new light snapshot with SplitStore=true turned on (https://lotus.filecoin.io/lotus/configure/splitstore/).

Note: we are not running the node using systemd, as we were unable to get it to sync at all in that context. Related: filecoin-project/lotus#8001.

@jennijuju

If you are aware of anybody who could provide a full snapshot or transfer the data dir for a non-pruning node synced from a light snapshot below 2683348 we would be very interested in that.

There is one potential provider AFAIK & they are doing some testing. @eshon might have more updated info to share here.

@jennijuju

These syncs were (re)started about 2 weeks ago but have not yet completed (around epoch 2750000 now; at current rate will take 5-6 more weeks)

Small good news I hope: I expect the sync to be faster past epoch 2809800 due to a protocol improvement we shipped to the network!

@i-norden
Contributor Author

Thanks @jennijuju, we successfully transferred an archive node. Between what you shared here and the recent work that has been done on the skip list and otherwise by @fridrik01, I'm closing this. Thanks!!
