
[bug]: LND frequently loses sync to chain with pruned backend #8250

Closed
Liongrass opened this issue Dec 4, 2023 · 52 comments · Fixed by #8428
Labels: bug Unintended code behaviour, P1 MUST be fixed or reviewed, pruning, sync

Comments

@Liongrass
Contributor

Background

I run LND compiled from source on an Ubuntu VPS. I use Bitcoin Core as my backend, pruned to 100GB (prune=100000). I've noticed on various occasions that LND loses its sync to chain ("synced_to_chain": false) and does not recover by itself. A restart always fixes the issue promptly, although in some cases LND cannot be gracefully shut down. Several of my peers have the same issue, and all are running a very similar setup. The issue seems to be related to pruning the node: the more aggressively you prune, the more often LND loses sync to the chain.

Your environment

  • 0.17.1-beta commit=lightning-terminal-v0.12.1-alpha-dirty
  • Linux 5.4.0-167-generic
  • Bitcoin Core Version: 240100
  • prune=100000

lnd.tor.active=true
lnd.tor.skip-proxy-for-clearnet-targets=1

Steps to reproduce

Tell us how to reproduce this issue. Please provide stacktraces and links to code in question.

Expected behaviour

I expect LND to stay synced to chain as long as the backend is synced to chain.

Actual behaviour

LND occasionally loses sync to chain and does not recover while Bitcoin Core is running and synced.

I have about 1.5h of logs and a go profile attached below.

04lnd.log
03lnd.log
02lnd.log
01lnd.log
goprofile.txt

@Liongrass Liongrass added bug Unintended code behaviour needs triage labels Dec 4, 2023
@saubyk saubyk added the P1 MUST be fixed or reviewed label Dec 5, 2023
@ziggie1984
Collaborator

I checked your pprof dump and it seems you did not use debug level 2 for the goroutine dump; there is not a lot of useful info in the default debug level 1 output to catch the problem.

Could you use debug level 2 instead of 1 next time?

Thank you for reporting this issue.

curl http://localhost:9733/debug/pprof/goroutine?debug=2

@Liongrass
Contributor Author

Thank you! I will get the more detailed goroutine the next time I run into the issue!

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

I also ran into the issue a few times and added a cronjob that captures the output of lncli getinfo (once a minute, if synced_to_chain is false). Based on this I see the following timeline:

Everything was fine up to 10:23:00 CET (09:23:00 UTC). The then-tip block 820238 was received at 10:05:02 on my bitcoind.
I first observed the unsynced state at 10:24:01 and it has remained in that state since.

  • 10:23:05 bitcoind UpdateTip to block 820239
  • 10:23:06.217 "New block" logged in LND
  • 10:24:01 getinfo returns "synced_to_chain": false, "best_header_timestamp": "1702027364", "block_height": 820239 (timestamp corresponds to 10:22:44)
  • 10:24:14 bitcoind UpdateTip to block 820240
  • 10:24:15.573 "New block" logged in LND
  • 10:25:02 getinfo returns "synced_to_chain": false, "best_header_timestamp": "1702027441", "block_height": 820240 (timestamp corresponds to 10:24:01)
  • same for getinfo from 10:26 to 10:46
  • 10:46:22 bitcoind UpdateTip to block 820241
  • 10:46:23.527 "New block" logged in LND
  • 10:47:01 getinfo returns "synced_to_chain": false, "best_header_timestamp": "1702028860", "block_height": 820241
    (timestamp corresponds to 10:47:40)

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

I checked the code for related log messages and noticed that I do NOT see the "Pruning channel graph using block" message for blocks 820239 and following. Something got stuck between block 820239 and block 820240.

The last log message of this kind:

2023-12-08 10:05:02.902 [INF] CRTR: Pruning channel graph using block 0000000000000000000372ff6b96fc003fbaddce39047352c621f15e3920416e (height=820238)

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

My lnd also refused to shut down (lncli stop). A simple killall lnd helped, though.

@guggero
Collaborator

guggero commented Dec 8, 2023

This could be related to some of the shutdown issues we fixed in 0.17.2 and 0.17.3. If certain goroutines are blocked because of a deadlock, the sync may not progress. Can you please try litd v0.12.2-alpha, which was just released and includes lnd v0.17.3-beta?

@ziggie1984
Collaborator

ziggie1984 commented Dec 8, 2023

I was able to trace the bug down a bit further because our tunnelsats node suffered from the same behaviour. Thanks for your input @C-Otto, it pointed me in the right direction.

goroutine 32469 [select, 559 minutes]: 1 times: [[32469]
github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).InitJobDependencies(0xc0103a5470, {0x1bf3320?, 0xc00faed220?})
        github.com/lightningnetwork/lnd/routing/validation_barrier.go:87 +0x9b
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler(0xc000a44280)
        github.com/lightningnetwork/lnd/routing/router.go:1126 +0xb3b
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).Start in goroutine 1
        github.com/lightningnetwork/lnd/routing/router.go:689 +0x787

So somehow all of the ValidationBarrier's slots are used up and no new blocks can be processed because lnd is waiting for free slots.
The new block is ready to be received by the ChannelRouter, but it is hanging because of the validation barrier above:

goroutine 31650 [select, 548 minutes]:
github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).queueCoordinator(0xc007813310)
        github.com/lightningnetwork/lnd/routing/chainview/queue.go:117 +0x1f8
created by github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).Start in goroutine 1
        github.com/lightningnetwork/lnd/routing/chainview/queue.go:66 +0x65

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

@guggero I think I'm already running 0.17.3 of sorts (I manually added all relevant commits to my messy code base).

@ziggie1984
Collaborator

ziggie1984 commented Dec 8, 2023

Analysing this a bit more, I think I found the problem:

The ChannelRouter initializes the ValidationBarrier with 4 * NumCPU slots; for the tunnelsats node (proc = 4) that is 16.

https://github.com/lightningnetwork/lnd/blob/master/routing/router.go#L1145

The pprof taken at the time the problem happens counts exactly 16 stuck goroutines using up the 16 slots, so we are somehow in a deadlock in terms of advancing the block height for the router.

2 of the stuck goroutines are here:

goroutine 5730594 [select, 1467 minutes]: 2 times: [[5730594, 5626273]
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc00017c790, 0xc00f8967e0)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_conn.go:479 +0x1d1
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc013eeaf50?, 0x0?)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_client.go:144 +0x17
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc006255970, 0xc00f8967e0, 0xc000bd3ac8)
        github.com/lightningnetwork/lnd/blockcache/blockcache.go:55 +0x22a
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0x100c007e22b78?, 0x7f87c4635958?)
        github.com/lightningnetwork/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1de
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc000a44280, 0xc00a4052f4)
        github.com/lightningnetwork/lnd/routing/router.go:1780 +0x5c
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).processUpdate(0xc000a44280, {0x1c8e960?, 0xc012c62f20?}, {0xc006381cb0, 0x1, 0x1})
        github.com/lightningnetwork/lnd/routing/router.go:1555 +0xea7
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc000a44280, 0xc0103a5470, 0xc00e7f0930)
        github.com/lightningnetwork/lnd/routing/router.go:1031 +0x10d
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 32469
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

and 14 are stuck in WaitForDependants:

goroutine 5900726 [select, 1377 minutes]: 14 times: [[5900726, 7110031, 5908580, 6245797, 6464543, 7261552, 5911043, 5789756, 5785513, 5910633, 6017184, 6013681, 7107010, 6857809]
github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).WaitForDependants(0xc0103a5470, {0x1c29420?, 0xc010d4cb60?})
        github.com/lightningnetwork/lnd/routing/validation_barrier.go:238 +0x605
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc000a44280, 0xc0103a5470, 0xc010826750)
        github.com/lightningnetwork/lnd/routing/router.go:1010 +0xb0
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 32469
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

So I'm not sure why the GetBlock call in the first cited goroutine hangs. This might be specific to pruned backends, but I think this behaviour can also happen without a pruned node in place?

Wdyt @Crypt-iQ ?
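
To make the slot exhaustion concrete, here is a minimal, self-contained sketch of the pattern (simplified and illustrative, not lnd's actual ValidationBarrier code): a bounded semaphore with 4 * NumCPU slots where every slot is held by a job that never finishes, so the goroutine delivering the next block can never acquire a slot.

// Simplified sketch of the slot-exhaustion deadlock described above; names
// and structure are illustrative, not lnd's actual ValidationBarrier code.
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    numSlots := 4 * runtime.NumCPU() // e.g. 16 on a 4-core machine
    slots := make(chan struct{}, numSlots)

    // Every slot is taken by a network-update job that blocks forever,
    // e.g. waiting on a GetBlock call that never returns.
    for i := 0; i < numSlots; i++ {
        slots <- struct{}{}
        go func() {
            select {} // stuck forever, never releases its slot
        }()
    }

    // The goroutine delivering a new block now needs a free slot as well.
    select {
    case slots <- struct{}{}:
        fmt.Println("block processed")
    case <-time.After(2 * time.Second):
        fmt.Println("deadlock: all validation slots are held by stuck jobs")
    }
}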

@ziggie1984
Collaborator

here is the pprof already deduplicated:
pprof-deduped.log

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

I'm also using a node with 4 (virtual) CPUs.

@Crypt-iQ
Collaborator

Crypt-iQ commented Dec 8, 2023

goroutine 5730594 [select, 1467 minutes]: 2 times: [[5730594, 5626273]
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc00017c790, 0xc00f8967e0)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_conn.go:479 +0x1d1
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc013eeaf50?, 0x0?)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_client.go:144 +0x17
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc006255970, 0xc00f8967e0, 0xc000bd3ac8)
        github.com/lightningnetwork/lnd/blockcache/blockcache.go:55 +0x22a
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0x100c007e22b78?, 0x7f87c4635958?)
        github.com/lightningnetwork/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1de
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc000a44280, 0xc00a4052f4)
        github.com/lightningnetwork/lnd/routing/router.go:1780 +0x5c
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).processUpdate(0xc000a44280, {0x1c8e960?, 0xc012c62f20?}, {0xc006381cb0, 0x1, 0x1})
        github.com/lightningnetwork/lnd/routing/router.go:1555 +0xea7
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc000a44280, 0xc0103a5470, 0xc00e7f0930)
        github.com/lightningnetwork/lnd/routing/router.go:1031 +0x10d
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 32469
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

So I'm not sure why the GetBlock call in the first cited goroutine hangs. This might be specific to pruned backends, but I think this behaviour can also happen without a pruned node in place?

Wdyt @Crypt-iQ ?

This dump shows that fetchFundingTx is being called directly rather than through the 0.17.3-beta fetchFundingTxWrapper, which fixes this issue for pruned nodes.
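
For readers following along: judging by the name and the later stack traces, the wrapper runs the blocking fetch in its own goroutine and selects on the result versus a quit signal. Roughly this shape, as a sketch under that assumption rather than the actual lnd code; note the caller still waits indefinitely if the fetch never returns and quit is never closed.

// Sketch of the wrapper pattern suggested by the stack traces (not the
// actual lnd implementation): run the blocking fetch in its own goroutine
// and select on the result versus a quit channel, so shutdown is not
// blocked by the fetch.
package main

import (
    "errors"
    "fmt"
    "time"
)

func fetchBlockWrapper(fetch func() ([]byte, error), quit <-chan struct{}) ([]byte, error) {
    type result struct {
        block []byte
        err   error
    }
    resChan := make(chan result, 1) // buffered so the goroutine can always send

    go func() {
        block, err := fetch()
        resChan <- result{block, err}
    }()

    select {
    case res := <-resChan:
        return res.block, res.err
    case <-quit:
        return nil, errors.New("router shutting down")
    }
}

func main() {
    quit := make(chan struct{})
    // Simulate a shutdown while the fetch hangs.
    go func() {
        time.Sleep(time.Second)
        close(quit)
    }()

    _, err := fetchBlockWrapper(func() ([]byte, error) {
        select {} // a GetBlock call that never returns
    }, quit)
    fmt.Println("wrapper returned:", err)
}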

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

I'm running code that includes fetchFundingTxWrapper and still see the issue (not with that stack trace, though).

@ziggie1984
Collaborator

ziggie1984 commented Dec 8, 2023

Could you provide a pprof goroutine dump the next time it happens for you, @C-Otto?

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

Will do

@ziggie1984
Collaborator

Another node runner running 0.17.3 with a pruned backend reached out to me.

I am still trying to track down the issue in the code (I think I am close to reconstructing the problem locally), but the main problem is that the pruned-block workers somehow do not report a failure or timeout back to the initial fetchFundingTxWrapper call. His system has 6 CPUs, meaning he has 24 slots for the validation barrier in the ChannelRouter.

The main question is: why is this call stuck for 1305 minutes? Something is messed up with the PrunedBlockDispatcher; I think something is wrong on the neutrino side and we miss some notifications.

goroutine 11645746 [select, 1305 minutes]: 2 times: [[11645746, 11645767]
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc000026210, 0xc01364fc80)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231129183218-5df09dd43358/chain/bitcoind_conn.go:479 +0x1d1
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc0109efb60?, 0x0?)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231129183218-5df09dd43358/chain/bitcoind_client.go:144 +0x17
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc00f1c2430, 0xc01364fc80, 0xc014767ec0)
        github.com/lightningnetwork/lnd/blockcache/blockcache.go:55 +0x22a
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0xc0186eca78?, 0x1?)
        github.com/lightningnetwork/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1de
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc00955c140, 0xc0157b02d0)
        github.com/lightningnetwork/lnd/routing/router.go:1810 +0x5c
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTxWrapper.func1()
        github.com/lightningnetwork/lnd/routing/router.go:1774 +0x2d
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTxWrapper in goroutine 11645745
        github.com/lightningnetwork/lnd/routing/router.go:1773 +0xd7

22 goroutines now blocked waiting for dependants:

goroutine 14401576 [select, 829 minutes]: 22 times: [[14401576, 14780452, 14636497, 13497107, 14186823, 15877870, 16643323, 14194898, 16694316, 14637199, 16673432, 14407395, 14790274, 13994802, 13985663, 16379626, 14421042, 16285128, 15787561, 16395763, 16730383, 15877867]
github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).WaitForDependants(0xc00c5eac60, {0x1c2b960?, 0xc00f79b930?})
        github.com/lightningnetwork/lnd/routing/validation_barrier.go:238 +0x605
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc00955c140, 0xc00c5eac60, 0xc01b68ec60)
        github.com/lightningnetwork/lnd/routing/router.go:1010 +0xb0
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 9572
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

So we are in a deadlock here. A new block is ready to advance the block height of the ChannelRouter, but the channel is not able to receive the data, which can be seen here:

goroutine 7942 [select, 451 minutes]: 1 times: [[7942]
github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).queueCoordinator(0xc0104ee7d0)
        github.com/lightningnetwork/lnd/routing/chainview/queue.go:117 +0x1f8
created by github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).Start in goroutine 1
        github.com/lightningnetwork/lnd/routing/chainview/queue.go:66 +0x65

Most likely the dependants build up over time, related to the two ChannelAnnouncements which are stuck in the GetBlock call (fetchFundingTx). So we need to figure out why the PrunedBlockDispatcher is stuck. I took a closer look at the code in the neutrino package and there are timeouts on the requests (2 seconds for a single query, and 30 seconds for the whole batch) - this is something I am going to test, because although fetchFundingTxWrapper is still waiting for a reply, the corresponding pruned-block fetcher seems to be in a happy state, so something got lost along the way.

4 workers waiting for work (idling):

goroutine 15025022 [select, 584 minutes]: 4 times: [[15025022, 7297693, 15833384, 16912463]
github.com/lightninglabs/neutrino/query.(*worker).Run(0xc0141566f0, 0xc000901680, 0xc0009017a0)
        github.com/lightninglabs/neutrino@v0.16.0/query/worker.go:96 +0x1fb
github.com/lightninglabs/neutrino/query.(*peerWorkManager).workDispatcher.func2()
        github.com/lightninglabs/neutrino@v0.16.0/query/workmanager.go:289 +0x8d
created by github.com/lightninglabs/neutrino/query.(*peerWorkManager).workDispatcher in goroutine 5964
        github.com/lightninglabs/neutrino@v0.16.0/query/workmanager.go:285 +0x133c

and the work manager is waiting for results, so it does not seem like anything is still being processed:

goroutine 5964 [select, 156 minutes]: 1 times: [[5964]
github.com/lightninglabs/neutrino/query.(*peerWorkManager).workDispatcher(0xc010601ce0)
        github.com/lightninglabs/neutrino@v0.16.0/query/workmanager.go:263 +0x454
created by github.com/lightninglabs/neutrino/query.(*peerWorkManager).Start in goroutine 1
        github.com/lightninglabs/neutrino@v0.16.0/query/workmanager.go:137 +0x65

We will investigate this behaviour further to finally track it down.

@ziggie1984
Collaborator

godump_notsynced_dedup.log
Full dump from the node.

@C-Otto
Contributor

C-Otto commented Dec 9, 2023

curl http://127.0.0.1:9736/debug/pprof/goroutine?debug=2 attached:

goroutine.txt

@ziggie1984
Collaborator

Thanks for the dump, it's the same in your case. You mentioned 4 virtual CPUs, so 16 slots.

2 slots hang here:

goroutine 3076998 [select, 863 minutes]: 2 times: [[3076998, 3112915]
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc00027ba20, 0xc014ea3a40)
        /home/cotto/go/pkg/mod/github.com/btcsuite/btcwallet@v0.16.10-0.20231129183218-5df09dd43358/chain/bitcoind_conn.go:479 +0x1f0
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc00dc0e7a0?, 0x0?)
        /home/cotto/go/pkg/mod/github.com/btcsuite/btcwallet@v0.16.10-0.20231129183218-5df09dd43358/chain/bitcoind_client.go:144 +0x1d
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc0062f4070, 0xc014ea3a40, 0xc005adeec0)
        /home/cotto/IdeaProjects/lnd/blockcache/blockcache.go:55 +0x246
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0x0?, 0x48403c?)
        /home/cotto/IdeaProjects/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1ff
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc0000d88c0, 0xc00dc0e6f4)
        /home/cotto/IdeaProjects/lnd/routing/router.go:1810 +0x66
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTxWrapper.func1()
        /home/cotto/IdeaProjects/lnd/routing/router.go:1774 +0x46
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTxWrapper
        /home/cotto/IdeaProjects/lnd/routing/router.go:1773 +0xf1

14 hang here:

goroutine 3374606 [select, 826 minutes]: 14 times: [[3374606, 3826502, 3868619, 3355939, 3366418, 4347201, 4342024, 4324413, 3380439, 3832171, 3365343, 3390681, 3384080, 4332250]
github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).WaitForDependants(0xc00e82cc60, {0x15292c0?, 0xc0109cd110?})
        /home/cotto/IdeaProjects/lnd/routing/validation_barrier.go:238 +0x6df
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc0000d88c0, 0xc00e82cc60, 0xc0150a47e0)
        /home/cotto/IdeaProjects/lnd/routing/router.go:1010 +0xbc
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler
        /home/cotto/IdeaProjects/lnd/routing/router.go:1129 +0xbcc

And it seems like the node was out of sync for more than 10 hours because the block could not be delivered (though I'm not sure whether the 593 minutes really mean the goroutine was stuck at this particular line, or just that the goroutine has been running for that long). Do you know when the node went out of sync?

goroutine 3262 [select, 593 minutes]: 1 times: [[3262]
github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).queueCoordinator(0xc006354370)
        /home/cotto/IdeaProjects/lnd/routing/chainview/queue.go:117 +0x207
created by github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).Start
        /home/cotto/IdeaProjects/lnd/routing/chainview/queue.go:66 +0x6a

@C-Otto
Contributor

C-Otto commented Dec 9, 2023

I can confirm the 10 hours. The mails started coming in at 03:44 and I did the dump at around 13:15, which is roughly 10 hours difference.

@C-Otto
Contributor

C-Otto commented Dec 9, 2023

I added logging for the blockcache stuff, maybe that helps once it starts going downhill again.

@C-Otto
Contributor

C-Otto commented Dec 9, 2023

I'm now also running curl -s http://127.0.0.1:9736/debug/pprof/goroutine?debug=2 | grep -e "GetBlock(\|WaitForDependants(" -B 1 once a minute, so far nothing turned up.

@ziggie1984
Collaborator

So I think we have some problems in the neutrino package. I increased the log level for the case mentioned above and found a block which is rescheduled over 10 times, although the maximum retry count should not be greater than 2.

https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L345

Here are the retries:

lnd.759_split_01.log:2023-12-09 16:54:45.057 [TRC] LNWL: Worker panxv3zq3bcp2dcbt2j4f2igbjiswbp7pdpdaofdk5mkjux63ptrssid.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:55:17.061 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:55:49.076 [TRC] LNWL: Worker panxv3zq3bcp2dcbt2j4f2igbjiswbp7pdpdaofdk5mkjux63ptrssid.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:56:21.082 [TRC] LNWL: Worker lxcsrowdge7vxo5k46fb22libqwumvkpgtkt332jtf6myaqs3d6keead.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:56:53.083 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:57:25.084 [TRC] LNWL: Worker h5qvrvkdq5wvuywuxb36toph3i7dxl4tk767w7f6hrbuvo3sqqj2xqqd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:57:57.088 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:58:29.088 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:59:01.089 [TRC] LNWL: Worker h5qvrvkdq5wvuywuxb36toph3i7dxl4tk767w7f6hrbuvo3sqqj2xqqd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:59:33.090 [TRC] LNWL: Worker panxv3zq3bcp2dcbt2j4f2igbjiswbp7pdpdaofdk5mkjux63ptrssid.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 17:00:05.090 [TRC] LNWL: Worker lxcsrowdge7vxo5k46fb22libqwumvkpgtkt332jtf6myaqs3d6keead.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 17:00:37.094 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 17:01:09.096 [TRC] LNWL: Worker h5qvrvkdq5wvuywuxb36toph3i7dxl4tk767w7f6hrbuvo3sqqj2xqqd.onion:8333 timeout for request *wire.MsgGetData with job index 10

So this explains the stuck GetBlock request, but I am not sure why we are not exiting after 2 retries; it seems like the only way we would keep retrying is if the batch reference is somehow nil:
https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L334-L385

I opened a PR which increases some of the logging; it might be helpful for tracking down the issue, so it would be very much appreciated if you could run it too.

#8267

@C-Otto
Contributor

C-Otto commented Dec 10, 2023

I just added the changes from #8267 and restarted my node (I'm not using neutrino, though?).

@ziggie1984
Collaborator

ziggie1984 commented Dec 10, 2023

Yeah, this is a bit confusing, but as soon as you run in pruned mode you use the neutrino codebase to fetch blocks from peers. (https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L192-L208)

You would need to set LNWL=debug to see the new logs.

@Crypt-iQ
Collaborator

Crypt-iQ commented Dec 13, 2023

So I think we should sort the peer connections by outbound/inbound first, then try all the outbound peers and only after that the inbound Tor peers. I am not sure whether we can fetch the listening port via a bitcoind RPC command; I searched but could not find anything.

I think we can use the getnodeaddresses command, which fetches from bitcoind's address manager. I think anything in the address manager should have the listening port rather than the randomly assigned TCP port.

EDIT: oops, I realize we already do this. But I think we can use this over the getpeerinfo call.

@C-Otto
Contributor

C-Otto commented Dec 14, 2023

For the record, I didn't allow incoming connections (and still don't), and I wasn't using tor.

To help debugging attempts, I enabled tor and temporarily disabled outgoing IPv4/IPv6 bitcoin peers on my node. So far, everything is fine...

@Liongrass
Contributor Author

The issue is not limited to Tor nodes. I used a spare node in an attempt to trigger this issue. I did the following:

  • 0.17.3-beta commit=lightning-terminal-v0.12.2-alpha
  • Clearnet only
  • I purged the graph with chantools dropchannelgraph
  • Started LND
  • Waited for the node to lose sync to chain while syncing to graph
  • Curled the goprofile
    goroutine.txt

@ziggie1984
Collaborator

ziggie1984 commented Dec 14, 2023

Ok I was able to track down the issue, will provide a fix shortly.

So the problem is here:

When a block is not on our local backend (we already pruned it) we will try to fetch it from other nodes here:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/bitcoind_conn.go#L474

Now the problem:

We create a new block request here:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L504-L519

Here comes the problem: first of all, the 2-second timeout (and then 4 seconds for the second try) is too short, which is the root cause of why we see a lot of timeouts while the block is delivered a couple of seconds later.

When we hit two timeouts we never delete the block from d.blocksQueried; that only happens if we receive the block in time:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L650

So the problem arises when the same block is requested a second time (this might take a while to occur): the dispatcher then thinks there is already a request in flight, see here:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L544-L553

So it never launches a new request and the GetBlock call hangs forever, eventually blocking the ValidationBarrier.

So I think there are several things which can be fixed immediately, will come up with a plan tomorrow.

EDIT: Huge thanks to @M1ch43lV for running the draft PR to extract the logs which led to finding the bug 🙏
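
To make the bookkeeping bug concrete, here is a stripped-down sketch of the pattern described above, with hypothetical names rather than the actual btcwallet PrunedBlockDispatcher code: the in-flight map is only cleaned up on success, so a later request for the same block subscribes to a stale entry and waits forever.

// Stripped-down sketch of the described bookkeeping bug (hypothetical names,
// not the actual btcwallet code): requests are tracked in an in-flight map
// that is only cleaned up on success. If the first attempt times out, the
// entry stays behind, and a later request for the same block just subscribes
// to the stale entry and never gets an answer.
package main

import (
    "fmt"
    "sync"
    "time"
)

type dispatcher struct {
    mu            sync.Mutex
    blocksQueried map[string][]chan string // block hash -> waiting callers
}

func (d *dispatcher) getBlock(hash string, fetchSucceeds bool) {
    d.mu.Lock()
    waiters, inFlight := d.blocksQueried[hash]
    resp := make(chan string, 1)
    d.blocksQueried[hash] = append(waiters, resp)
    d.mu.Unlock()

    if !inFlight {
        // Only the first caller launches a network request.
        go func() {
            if !fetchSucceeds {
                // BUG: on timeout/failure we return without notifying the
                // waiters or deleting the map entry.
                return
            }
            d.mu.Lock()
            for _, w := range d.blocksQueried[hash] {
                w <- "block data"
            }
            delete(d.blocksQueried, hash) // cleanup happens only on success
            d.mu.Unlock()
        }()
    }

    select {
    case <-resp:
        fmt.Println("got block", hash)
    case <-time.After(time.Second):
        fmt.Println("still waiting for block", hash, "(stuck)")
    }
}

func main() {
    d := &dispatcher{blocksQueried: make(map[string][]chan string)}
    d.getBlock("0000...abc", false) // first request times out, entry left behind
    d.getBlock("0000...abc", true)  // second request piggybacks on the stale entry and hangs
}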

@Roasbeef
Member

Here comes the problem: first of all, the 2-second timeout (and then 4 seconds for the second try) is too short, which is the root cause of why we see a lot of timeouts while the block is delivered a couple of seconds later.

I mentioned an approach to dynamically tune the timeout here, which shouldn't be too involved: lightninglabs/neutrino#292

As a starting point, we can use the reported ping latency from the getpeerinfo call. Then we can apply an RTT multiplier on top of a processing multiplier. Each time we get a response, we use that to tune what the actual timeout should be (some sort of moving weighted average).

Perhaps what we also need is a callback for when the request times out and we're unable to get a response? That would then allow us to remove it from the set of in-flight requests.

IIUC, upon an error we'll get a send on the errChan, but currently we just return early, whereas we could update the set of in-flight requests there: https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/bitcoind_conn.go#L483-L490

The error should come from here once the max number of retries is exceeded: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L341-L358
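
As a rough illustration of that moving-average idea (a sketch with made-up names and constants, not the neutrino implementation): seed the estimate from the peer's reported ping RTT (e.g. from getpeerinfo), update it with an exponentially weighted moving average of observed response times, and derive the per-request timeout from it with a multiplier and sane bounds.

// Minimal sketch of an adaptive timeout (illustrative only, not the
// neutrino implementation): keep an exponentially weighted moving average
// of observed response times, seeded from the peer's reported ping RTT,
// and derive the per-request timeout from it with a multiplier.
package main

import (
    "fmt"
    "time"
)

type timeoutEstimator struct {
    avgRTT     time.Duration // smoothed round-trip estimate
    alpha      float64       // EWMA smoothing factor, e.g. 0.2
    multiplier float64       // headroom for processing time, e.g. 3x
    minTimeout time.Duration
    maxTimeout time.Duration
}

// observe feeds a measured response time into the moving average.
func (e *timeoutEstimator) observe(sample time.Duration) {
    e.avgRTT = time.Duration(e.alpha*float64(sample) + (1-e.alpha)*float64(e.avgRTT))
}

// timeout returns the timeout to use for the next request, clamped to sane bounds.
func (e *timeoutEstimator) timeout() time.Duration {
    t := time.Duration(float64(e.avgRTT) * e.multiplier)
    if t < e.minTimeout {
        return e.minTimeout
    }
    if t > e.maxTimeout {
        return e.maxTimeout
    }
    return t
}

func main() {
    // Seed with the ping latency reported for the peer (e.g. via getpeerinfo).
    est := &timeoutEstimator{
        avgRTT:     800 * time.Millisecond,
        alpha:      0.2,
        multiplier: 3,
        minTimeout: 2 * time.Second,
        maxTimeout: 30 * time.Second,
    }

    for _, sample := range []time.Duration{3 * time.Second, 5 * time.Second, 4 * time.Second} {
        est.observe(sample)
        fmt.Printf("observed %v -> next timeout %v\n", sample, est.timeout())
    }
}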

@ziggie1984
Collaborator

I think your ideas fit perfectly, @Roasbeef. I will work on an implementation draft then.

@saubyk saubyk added this to the v0.18.0 milestone Dec 15, 2023
@warioishere

warioishere commented Dec 22, 2023

Hello dear people, I'm reading this thread with huge interest. We also suffer from this problem, and for a while now we haven't been able to get our new node properly online anymore. I was searching myself to death and thought I was unable to set up our deployment.
We use BTCPay Server with lnd 0.17.2-beta and Bitcoin Core 26 on a 100GB pruned node, if this is of any help.

Initially after a restart we are synced to chain, but after about 15 minutes we lose sync and it reports both
synced to chain: false
synced to graph: false

If you need help, please let me know, but if it's deep-dive logs, you'll need to tell me what to do.

Just a short question: I know it's out when it's out, but do you have an idea when a fix will be released? Our shop has been offline for a couple of weeks now because we started to have weird connection problems, and then I did the most stupid thing ever...

@C-Otto
Contributor

C-Otto commented Dec 22, 2023

In my experience the graph-sync isn't affected. Are you sure you're hitting this issue? Could you run the profile command (see above and here: https://docs.lightning.engineering/lightning-network-tools/lnd/debugging_lnd#capturing-pprof-data-with-lnd) to see if you're hitting the same limits?

@C-Otto
Contributor

C-Otto commented Dec 22, 2023

If you're affected by this issue, it might help to run a full (non-pruned) node and/or provide more CPU cores to lnd (which isn't a fix, but it might delay the need for a restart).

@warioishere

In my experience the graph-sync isn't affected. Are you sure you're hitting this issue? Could you run the profile command (see above and here: https://docs.lightning.engineering/lightning-network-tools/lnd/debugging_lnd#capturing-pprof-data-with-lnd) to see if you're hitting the same limits?

I am not sure that I'm hitting this issue, but we never had such problems before; for two weeks now we haven't had a synced LND, and the only thing we did was update the deployment to 0.17.2-beta.

Then, after 2 or 3 new blocks, it goes out of sync.

LND runs inside a Docker container; could you tell me how to start it with lnd --profile=9736?

Then I could provide you with some info. I'm sorry, I am not as technically experienced as you guys.

@C-Otto
Contributor

C-Otto commented Dec 22, 2023

You can edit your lnd.conf and add profile=9736 to the [Application Options] section.
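
For example, the relevant part of lnd.conf would look roughly like this (assuming a stock lnd.conf; in a Docker deployment the file lives wherever the container mounts lnd's data directory):

[Application Options]
profile=9736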

@warioishere

@C-Otto
I decided to go unpruned, and it solved the problem. LND states that it's synced, so I guess it was a similar or the same problem. I wonder why more problems don't show up for other node operators; maybe only a small share run pruned.

At least I'm good to go again.

Cheers

@warioishere

Additional info:
After I started having this problem I set up exactly the same deployment again, because I thought I had messed up my config. That deployment, on a different server of ours, had exactly the same problem. So I think I can verify that I had the issue you guys are discussing.

@rockstardev
Contributor

rockstardev commented Jan 19, 2024

Hey everyone, chiming in to say that this issue is critical for using LND in BTCPay Server because most of our instances are pruned. We're now getting inquiries every day about how to solve this, and it's impractical to tell people to restart LND every day.

@saubyk - it's really appreciated that you guys are already looking into this. Is there clarity on the timeline here, so we can forward it to our merchants and ambassadors? I see you mapped it to 0.18 - when is that planned for release? And if it's more than a month out, is it possible to go forward with a 0.17.4 fix, since this is such a big issue for a lot of deployments?

@saubyk
Collaborator

saubyk commented Jan 19, 2024

Hi @rockstardev, we're looking into the possibility of including a fix with 0.17.4. No promises yet.

@ziggie1984 is on the case

@saubyk saubyk modified the milestones: v0.18.0, v0.17.4 Jan 21, 2024
@rockstardev
Contributor

@saubyk & @ziggie1984 godspeed, love seeing this slated for 0.17.4

@ffrediani

Hello folks

Can anyone elaborate on why pruned nodes may cause this issue? I have been searching for more information on whether running Bitcoin nodes in pruned mode is actually a problem, or whether it is something that can be fixed inside LND to handle these scenarios better.

Why does LND need to be able to read an older block that may no longer be available locally on the Bitcoin node?

@C-Otto
Contributor

C-Otto commented May 30, 2024

lnd sometimes needs to access information from old blocks, which I think is related to validating channels that have been opened a long time ago.

It's rather trivial to request this from a connected full node, as this full node can just grab the block from disk and return it to lnd. However, if the connected bitcoind does not have the requested block, it needs to be downloaded from the internet. This process is rather slow and unreliable, which is why the corresponding code is a bit more complex.

One bug in this complex code has been described and fixed here, but there may be other bugs that haven't been reported, yet. I run my node with a pruned bitcoind and I don't think it's a large issue. It's not the perfect solution, though.

@ffrediani

ffrediani commented May 31, 2024

Thanks for the explanation @C-Otto

Some things that come to mind in order to dig into this further, and perhaps help clarify the possible causes and solutions:

  • Could this process be even slower when the node uses Tor only rather than clearnet? If that can be measured, it could strengthen the recommendation that those running pruned nodes prefer clearnet whenever available.
  • Are there ways to improve a pruned node's connectivity to more peers (via bitcoind settings) in order to reduce the chance of trouble fetching a block remotely?
  • Are there ways for bitcoind to find out in advance that a peer has the full blockchain, so it can preferably ask those peers instead of another possibly pruned node?
  • Regarding the LND code that handles such scenarios: do you think this is a timeout-adjustment question, or something more complex? At least if LND is aware that the fetch is under way and may take a little longer, it could take its time and wait for it to finish properly in those cases with extra delay.
  • For LND purposes specifically, is there any way that blocks related to existing channels could be prefetched, to improve the processes that may need older blocks?
  • Is there a minimum prune size that would be more reasonable to avoid issues, or is it clear that what matters are the blocks related to the opened channels? I understand that for opening new channels, broadcasting to the network with the latest blocks is enough, which a pruned node can do normally.

@C-Otto
Contributor

C-Otto commented May 31, 2024

I don't think this discussion belongs here. The issue is fixed.

@ffrediani

@C-Otto from your previous message you don't seem to suggest the problem is completely solved. You mentioned that one issue is fixed, and with the points above I am trying to anticipate other possible scenarios and solutions.
In any case, where would this discussion fit better, then?
