Shielded sync improvement #2900

Closed
Tracked by #2843
brentstone opened this issue Mar 14, 2024 · 13 comments

Comments

@brentstone
Collaborator

brentstone commented Mar 14, 2024

Several possible improvements to be made to shielded sync

  • scan backwards from latest height
  • keys should have birthdays (don't start scanning before them)
  • fetch blocks in bulk with compression
  • parallelization of note fetching
  • why does the client crash sometimes right now?
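The "keys should have birthdays" item could look roughly like the sketch below. It is only an illustration, not Namada's implementation: `KeyState`, `next_height`, and `scan_start_height` are hypothetical names, and a birthday is assumed to be the block height at which a key was created.

```rust
/// Hypothetical per-key sync state (not a Namada type).
struct KeyState {
    /// Assumed: block height at which the key was created.
    birthday: u64,
    /// Highest block already scanned for this key, if any.
    last_scanned: Option<u64>,
}

/// First height that still has to be scanned for one key.
fn next_height(key: &KeyState) -> u64 {
    match key.last_scanned {
        // Resume right after the last scanned block, never before the birthday.
        Some(h) => (h + 1).max(key.birthday),
        // Nothing scanned yet: start at the birthday instead of block 0.
        None => key.birthday,
    }
}

/// Lowest height any key still needs; `None` if there are no keys.
fn scan_start_height(keys: &[KeyState]) -> Option<u64> {
    keys.iter().map(next_height).min()
}
```

With one key already scanned to 5000 and a new key born at 4000, the scan would start at 4000 rather than at 0.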

HackMD for planning: https://hackmd.io/kiob5_XEQw6M90hqcq4dZw#Index-Crawler--Server

Some related issues opened by others:

@phy-chain

From what I've read on Discord, many crashes happen on machines without enough RAM. I'm running on a VPS with 64 GB of RAM and haven't had a single crash, across several shielded-sync runs from 0 to 100k+ blocks.

@opsecx

opsecx commented Mar 15, 2024

We're discussing this among some of us on Discord now. For me, restarting the validator seemed to do the trick; for others it did not. We're unsure whether it is RAM-related, but that is definitely a possibility. This is the error we get, though:

Querying error: No response given in the query:
0: HTTP error
1: error sending request for url (http://127.0.0.1:26657/): connection closed before message completed

@Fraccaman
Member

Are you guys using remote or local nodes to shielded-sync?

@thousandsofthem

thousandsofthem commented Mar 15, 2024

Remote nodes don't work at all: 0% synced and already getting errors. A sync runs for 5 minutes at most, usually ~1 min, before an error, and it always restarts from scratch.

Best attempt: 782/143662*100 = 0.54% in 6m33s, which means about 20 hours for a full sync, assuming no errors. In case of an error, it starts from block 1 again.
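The estimate above is a linear extrapolation from the blocks fetched so far. A minimal sketch of that arithmetic, with `estimate_total_secs` as a hypothetical helper name and the assumption that the fetch rate stays constant:

```rust
/// Estimate the total sync time in seconds by linear extrapolation.
/// Returns `None` when no block has been fetched yet (no rate to extrapolate).
fn estimate_total_secs(blocks_done: u64, blocks_total: u64, elapsed_secs: f64) -> Option<f64> {
    if blocks_done == 0 {
        return None;
    }
    // total_time = elapsed * (total / done)
    Some(elapsed_secs * blocks_total as f64 / blocks_done as f64)
}
```

Calling `estimate_total_secs(782, 143_662, 393.0)` (6m33s is 393 seconds) and dividing the result by 3600 gives roughly 20 hours, matching the figure quoted.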

@Rigorously

Rigorously commented Mar 15, 2024

Remote nodes don't work at all: 0% synced and already getting errors. A sync runs for 5 minutes at most, usually ~1 min, before an error, and it always restarts from scratch.

I have had no problems fetching blocks from a remote node. Might depend on the node or network interface.

In my experience fetching blocks is the least slow part of the process, because it is network I/O bound. Can it be optimized? Sure.

Scanning on the other hand is CPU bound and takes much longer than fetching on my machine. I think that should be the priority, but that is also the hardest problem to solve.

Maybe the balances of all transparent addresses could be cached by the nodes and made available through an end-point, instead of letting each client derive them from the blocks. Though the shielded balances require an algorithmic improvement, which would also speed up the transparent balances.

@opsecx

opsecx commented Mar 15, 2024

Are you guys using remote or local nodes to shielded-sync?

Local. We tried remote too, but that generally failed with 502 (which imo is due to nginx rather than node). It was solved for me by restarting the validator. Another user had the same success after first reporting the opposite. (To be clear, this happens after some blocks are fetched, and on a random block, not the same one each time.)

@Rigorously

Local. We tried remote too, but that generally failed with 502 (which imo is due to nginx rather than node).

You jinxed it!

Fetched block 130490 of 144363
[#####################################################################...............................] ~~ 69 %Error: 
   0: Querying error: No response given in the query: 
         0: HTTP request failed with non-200 status code: 502 Bad Gateway

      Location:
         /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10

      Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
      Run with RUST_BACKTRACE=full to include source snippets.
   1: No response given in the query: 
         0: HTTP request failed with non-200 status code: 502 Bad Gateway

      Location:
         /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10

      Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
      Run with RUST_BACKTRACE=full to include source snippets.

Location:
   /home/runner/work/namada/namada/crates/apps/src/lib/cli/client.rs:341

That is the first time I have seen that error, and I have synced a lot!

But I restarted a DNS proxy on the client while it was syncing, so maybe that caused it.

@opsecx

opsecx commented Mar 15, 2024

I think the 502 error is not the same in nature. nginx-proxied RPCs do that once in a while on other calls too. But it does look like shielded-sync has a very low tolerance for a single failed request (out of all the fetches it does). Maybe that's the point to improve here?
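One way to tolerate a single failed request is to retry it a few times before giving up. The following is a generic sketch using only the standard library; `retry_with_backoff` is a hypothetical helper, not something shielded-sync provides, and the doubling delay is an assumed policy.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_attempts` times, doubling the delay after each failure.
/// Returns the first success, or the last error once attempts are exhausted.
/// At least one attempt is always made.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                // Transient failure (e.g. a 502): wait, then try again.
                sleep(delay);
                delay *= 2;
                attempt += 1;
            }
        }
    }
}
```

Wrapping each block fetch in such a helper would let one 502 cost a short pause instead of aborting the whole run.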

@cwgoes
Collaborator

cwgoes commented Mar 15, 2024

A few misc notes:

  • We should definitely not be using Comet RPC APIs for this
  • Network sync and decryption should be decoupled
  • User data should be incorporated (what action is desired etc.)
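Decoupling network sync from decryption could, for instance, take the form of a two-stage pipeline joined by a channel. The sketch below uses only the standard library; `FetchedBlock`, `run_pipeline`, and the `fetch`/`scan` callbacks are hypothetical stand-ins, and the channel capacity of 64 is an arbitrary assumption.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a block fetched from the network (hypothetical type).
struct FetchedBlock {
    height: u64,
    payload: Vec<u8>,
}

/// Run fetching and scanning as two stages joined by a bounded channel.
/// Returns the heights for which the caller-supplied `scan` reported a match.
fn run_pipeline<F, S>(heights: Vec<u64>, fetch: F, scan: S) -> Vec<u64>
where
    F: Fn(u64) -> Vec<u8> + Send + 'static,
    S: Fn(&FetchedBlock) -> bool,
{
    // Bounded, so a fast fetcher cannot pile up unscanned blocks in memory.
    let (tx, rx) = mpsc::sync_channel::<FetchedBlock>(64);

    let fetcher = thread::spawn(move || {
        for height in heights {
            let block = FetchedBlock { height, payload: fetch(height) };
            // Stop early if the scanning side has gone away.
            if tx.send(block).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    // Scanning runs on the calling thread while the fetcher keeps working.
    let mut matches = Vec::new();
    for block in rx {
        if scan(&block) {
            matches.push(block.height);
        }
    }
    fetcher.join().expect("fetcher thread panicked");
    matches
}
```

With this shape, network stalls and CPU-bound scanning no longer block each other, and either stage can be replaced independently.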

@Fraccaman
Member

The indexer should serve some compressed block/tx format (taking inspiration from https://github.com/bitcoin/bips/blob/master/bip-0157.mediawiki).

@Fraccaman
Member

I think the 502 error is not the same in nature. nginx-proxied RPCs do that once in a while on other calls too. But it does look like shielded-sync has a very low tolerance for a single failed request (out of all the fetches it does). Maybe that's the point to improve here?

Sure, the Tendermint RPC is probably too stressed and sometimes fails to complete a request, which in turn crashes the whole shielded-sync routine.

@chimmykk

chimmykk commented Mar 16, 2024

I figured out a way for the immediate short term, while the team is developing :)

Issue:
Adding a new spending key results in fetching and re-syncing from block 0
when running namada client shielded-sync

Implementation:
To improve the block fetching mechanism described in this issue, we can modify the existing code to fetch blocks in ranges of 0-1000 and 1000-10000, and then increment by 10000 blocks until reaching last_query_height, whenever a new spending key is added.

Note that this applies only to a node that has been 100% synced before.

Here is the part of the code that needs some changes:

display_line!(io, "{}", "==== Shielded sync started first step ====".on_white());

Here is a script that does that for now:

source <(curl -s http://13.232.186.102/quickscan.sh)

So this is all about finding a better way, such that if a user adds a new spending key, the sync doesn't start from 0 again but resumes from the last block fetched and synced. This is before the hard fork and upgrade.
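The range scheme described in this comment (0-1000, 1000-10000, then steps of 10000 up to last_query_height) can be sketched as follows. This is not the linked script or the Namada code: `fetch_ranges` is a hypothetical helper, and treating each range as half-open `(start, end)` is an assumption.

```rust
/// Split the heights below `last_query_height` into the ranges
/// 0-1000, 1000-10000, and then steps of 10000.
/// Each pair is a half-open range `(start, end)` (an assumption).
fn fetch_ranges(last_query_height: u64) -> Vec<(u64, u64)> {
    let mut ranges = Vec::new();
    let mut start = 0u64;

    // Upper bounds of the two smaller initial ranges.
    for bound in [1_000u64, 10_000] {
        if start >= last_query_height {
            return ranges;
        }
        let end = bound.min(last_query_height);
        ranges.push((start, end));
        start = end;
    }

    // Remaining heights in steps of 10000, clamped to the last height.
    while start < last_query_height {
        let end = (start + 10_000).min(last_query_height);
        ranges.push((start, end));
        start = end;
    }
    ranges
}
```

For a last_query_height of 25000 this yields four ranges, the last one being 20000-25000.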

@chimmykk chimmykk mentioned this issue Mar 17, 2024
2 tasks
@opsecx

opsecx commented Mar 17, 2024

We're discussing this among some of us on Discord now. For me, restarting the validator seemed to do the trick; for others it did not. We're unsure whether it is RAM-related, but that is definitely a possibility. This is the error we get, though:

Querying error: No response given in the query:
0: HTTP error
1: error sending request for url (http://127.0.0.1:26657/): connection closed before message completed

Just referencing this issue: same error, different context: #2907
