
[bug]: LND frequently loses sync to chain with pruned backend #8250

Closed
Liongrass opened this issue Dec 4, 2023 · 52 comments · Fixed by #8428
Labels: bug Unintended code behaviour, P1 MUST be fixed or reviewed, pruning, sync

Comments

@Liongrass
Contributor

Background

I run LND compiled from source on an Ubuntu VPS. I use Bitcoin Core as my backend, pruned to 100GB (prune=100000). I've noticed on various occasions that LND loses its sync to chain ("synced_to_chain": false) and does not recover by itself. A restart always fixes the issue promptly, although in some cases LND cannot be gracefully shut down. Several of my peers have the same issue, and all are running a very similar setup. The issue seems to be related to pruning the node: the more aggressively you prune, the more often LND loses sync to the chain.

Your environment

  • 0.17.1-beta commit=lightning-terminal-v0.12.1-alpha-dirty
  • Linux 5.4.0-167-generic
  • Bitcoin Core Version: 240100
  • prune=100000

lnd.tor.active=true
lnd.tor.skip-proxy-for-clearnet-targets=1

Steps to reproduce

Tell us how to reproduce this issue. Please provide stacktraces and links to code in question.

Expected behaviour

I expect LND to stay synced to chain as long as the backend is synced to chain.

Actual behaviour

LND occasionally loses sync to chain and does not recover while Bitcoin Core is running and synced.

I have about 1.5h of logs and a go profile attached below.

04lnd.log
03lnd.log
02lnd.log
01lnd.log
goprofile.txt

@Liongrass Liongrass added bug Unintended code behaviour needs triage labels Dec 4, 2023
@saubyk saubyk added the P1 MUST be fixed or reviewed label Dec 5, 2023
@ziggie1984
Collaborator

I checked your pprof dump and it seems you did not use debug level 2 for the goroutine dump; there is not a lot of useful info in the default debug level 1 output to catch the problem.

Could you use debug level 2 instead of 1 next time?

Thank you for reporting this issue.

curl http://localhost:9733/debug/pprof/goroutine?debug=2

@Liongrass
Contributor Author

Thank you! I will get the more detailed goroutine the next time I run into the issue!

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

I also ran into the issue a few times and added a cronjob that captures the output of lncli getinfo (once a minute, if synced_to_chain is false). Based on this I see the following timeline:

Everything was fine up to 10:23:00 CET (09:23:00 UTC). The then-tip block 820238 was received at 10:05:02 on my bitcoind.
I first observed the unsynced state at 10:24:01 and it has remained in that state since.

  • 10:23:05 bitcoind UpdateTip to block 820239
  • 10:23:06.217 "New block" logged in LND
  • 10:24:01 getinfo returns "synced_to_chain": false, "best_header_timestamp": "1702027364", "block_height": 820239 (timestamp corresponds to 10:22:44)
  • 10:24:14 bitcoind UpdateTip to block 820240
  • 10:24:15.573 "New block" logged in LND
  • 10:25:02 getinfo returns "synced_to_chain": false, "best_header_timestamp": "1702027441", "block_height": 820240 (timestamp corresponds to 10:24:01)
  • same for getinfo from 10:26 to 10:46
  • 10:46:22 bitcoind UpdateTip to block 820241
  • 10:46:23.527 "New block" logged in LND
  • 10:47:01 getinfo returns "synced_to_chain": false, "best_header_timestamp": "1702028860", "block_height": 820241
    (timestamp corresponds to 10:47:40)

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

I checked the code for related log messages and noticed that I do NOT see the "Pruning channel graph using block" message for blocks 820239 and following. Something got stuck between block 820239 and block 820240.

The last log message of this kind:

2023-12-08 10:05:02.902 [INF] CRTR: Pruning channel graph using block 0000000000000000000372ff6b96fc003fbaddce39047352c621f15e3920416e (height=820238)

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

My lnd also refused to shut down (lncli stop). A simple killall lnd helped, though.

@guggero
Collaborator

guggero commented Dec 8, 2023

This could be related to some of the shutdown issues we fixed in 0.17.2 and 0.17.3. If certain goroutines are blocked because of a deadlock, the sync may not progress. Can you please try litd v0.12.2-alpha, which was just released and includes lnd v0.17.3-beta?

@ziggie1984
Collaborator

ziggie1984 commented Dec 8, 2023

I was able to trace the bug down a bit further because our tunnelsats node suffered from the same behaviour. Thanks for your input @C-Otto, it pointed me in the right direction.

goroutine 32469 [select, 559 minutes]: 1 times: [[32469]
github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).InitJobDependencies(0xc0103a5470, {0x1bf3320?, 0xc00faed220?})
        github.com/lightningnetwork/lnd/routing/validation_barrier.go:87 +0x9b
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler(0xc000a44280)
        github.com/lightningnetwork/lnd/routing/router.go:1126 +0xb3b
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).Start in goroutine 1
        github.com/lightningnetwork/lnd/routing/router.go:689 +0x787

So somehow all of the ValidationBarrier's slots are used up and no new blocks can be processed because lnd is waiting for free slots.
The new block is ready to be received by the ChannelRouter, but it is hanging because of the validation barrier above:

goroutine 31650 [select, 548 minutes]:
github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).queueCoordinator(0xc007813310)
        github.com/lightningnetwork/lnd/routing/chainview/queue.go:117 +0x1f8
created by github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).Start in goroutine 1
        github.com/lightningnetwork/lnd/routing/chainview/queue.go:66 +0x65

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

@guggero I think I'm already running 0.17.3 of sorts (I manually added all relevant commits to my messy code base).

@ziggie1984
Collaborator

ziggie1984 commented Dec 8, 2023

Analysing this a bit more, I think I found the problem:

The ChannelRouter initializes the ValidationBarrier with 4 * NumCPU slots; for the tunnelsats node (proc = 4) that is 16.

https://github.com/lightningnetwork/lnd/blob/master/routing/router.go#L1145

The pprof taken at the time the problem happens counts exactly 16 stuck goroutines using up the 16 slots, so we are somehow in a deadlock in terms of advancing the block height for the router.

2 of the stuck goroutines are here:

goroutine 5730594 [select, 1467 minutes]: 2 times: [[5730594, 5626273]
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc00017c790, 0xc00f8967e0)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_conn.go:479 +0x1d1
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc013eeaf50?, 0x0?)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_client.go:144 +0x17
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc006255970, 0xc00f8967e0, 0xc000bd3ac8)
        github.com/lightningnetwork/lnd/blockcache/blockcache.go:55 +0x22a
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0x100c007e22b78?, 0x7f87c4635958?)
        github.com/lightningnetwork/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1de
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc000a44280, 0xc00a4052f4)
        github.com/lightningnetwork/lnd/routing/router.go:1780 +0x5c
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).processUpdate(0xc000a44280, {0x1c8e960?, 0xc012c62f20?}, {0xc006381cb0, 0x1, 0x1})
        github.com/lightningnetwork/lnd/routing/router.go:1555 +0xea7
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc000a44280, 0xc0103a5470, 0xc00e7f0930)
        github.com/lightningnetwork/lnd/routing/router.go:1031 +0x10d
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 32469
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

and 14 are stuck in WaitForDependants:

goroutine 5900726 [select, 1377 minutes]: 14 times: [[5900726, 7110031, 5908580, 6245797, 6464543, 7261552, 5911043, 5789756, 5785513, 5910633, 6017184, 6013681, 7107010, 6857809]
github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).WaitForDependants(0xc0103a5470, {0x1c29420?, 0xc010d4cb60?})
        github.com/lightningnetwork/lnd/routing/validation_barrier.go:238 +0x605
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc000a44280, 0xc0103a5470, 0xc010826750)
        github.com/lightningnetwork/lnd/routing/router.go:1010 +0xb0
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 32469
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

So I'm not sure why the GetBlock call in the first cited goroutine hangs. This might be specific to pruned backends, but I think this behaviour can also happen without a pruned node in place?

Wdyt @Crypt-iQ ?
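
To make the slot exhaustion concrete, here is a minimal, self-contained sketch of the pattern (simplified and illustrative, not lnd's actual ValidationBarrier code): a bounded semaphore with 4 * NumCPU slots where every slot is held by a job that never finishes, so the goroutine delivering the next block can never acquire a slot.

// Simplified sketch of the slot-exhaustion deadlock described above; names
// and structure are illustrative, not lnd's actual ValidationBarrier code.
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    numSlots := 4 * runtime.NumCPU() // e.g. 16 on a 4-core machine
    slots := make(chan struct{}, numSlots)

    // Every slot is taken by a network-update job that blocks forever,
    // e.g. waiting on a GetBlock call that never returns.
    for i := 0; i < numSlots; i++ {
        slots <- struct{}{}
        go func() {
            select {} // stuck forever, never releases its slot
        }()
    }

    // The goroutine delivering a new block now needs a free slot as well.
    select {
    case slots <- struct{}{}:
        fmt.Println("block processed")
    case <-time.After(2 * time.Second):
        fmt.Println("deadlock: all validation slots are held by stuck jobs")
    }
}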

@ziggie1984
Collaborator

here is the pprof already deduplicated:
pprof-deduped.log

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

I'm also using a node with 4 (virtual) CPUs.

@Crypt-iQ
Collaborator

Crypt-iQ commented Dec 8, 2023

goroutine 5730594 [select, 1467 minutes]: 2 times: [[5730594, 5626273]
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc00017c790, 0xc00f8967e0)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_conn.go:479 +0x1d1
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc013eeaf50?, 0x0?)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231017144732-e3ff37491e9c/chain/bitcoind_client.go:144 +0x17
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc006255970, 0xc00f8967e0, 0xc000bd3ac8)
        github.com/lightningnetwork/lnd/blockcache/blockcache.go:55 +0x22a
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0x100c007e22b78?, 0x7f87c4635958?)
        github.com/lightningnetwork/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1de
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc000a44280, 0xc00a4052f4)
        github.com/lightningnetwork/lnd/routing/router.go:1780 +0x5c
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).processUpdate(0xc000a44280, {0x1c8e960?, 0xc012c62f20?}, {0xc006381cb0, 0x1, 0x1})
        github.com/lightningnetwork/lnd/routing/router.go:1555 +0xea7
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc000a44280, 0xc0103a5470, 0xc00e7f0930)
        github.com/lightningnetwork/lnd/routing/router.go:1031 +0x10d
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 32469
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

So I'm not sure why the GetBlock call in the first cited goroutine hangs. This might be specific to pruned backends, but I think this behaviour can also happen without a pruned node in place?

Wdyt @Crypt-iQ ?

This dump shows that fetchFundingTx is being called directly rather than through the 0.17.3-beta fetchFundingTxWrapper, which fixes this issue for pruned nodes.
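
For readers following along: judging by the name and the later stack traces, the wrapper runs the blocking fetch in its own goroutine and selects on the result versus a quit signal. Roughly this shape, as a sketch under that assumption rather than the actual lnd code; note the caller still waits indefinitely if the fetch never returns and quit is never closed.

// Sketch of the wrapper pattern suggested by the stack traces (not the
// actual lnd implementation): run the blocking fetch in its own goroutine
// and select on the result versus a quit channel, so shutdown is not
// blocked by the fetch.
package main

import (
    "errors"
    "fmt"
    "time"
)

func fetchBlockWrapper(fetch func() ([]byte, error), quit <-chan struct{}) ([]byte, error) {
    type result struct {
        block []byte
        err   error
    }
    resChan := make(chan result, 1) // buffered so the goroutine can always send

    go func() {
        block, err := fetch()
        resChan <- result{block, err}
    }()

    select {
    case res := <-resChan:
        return res.block, res.err
    case <-quit:
        return nil, errors.New("router shutting down")
    }
}

func main() {
    quit := make(chan struct{})
    // Simulate a shutdown while the fetch hangs.
    go func() {
        time.Sleep(time.Second)
        close(quit)
    }()

    _, err := fetchBlockWrapper(func() ([]byte, error) {
        select {} // a GetBlock call that never returns
    }, quit)
    fmt.Println("wrapper returned:", err)
}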

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

I'm running code that includes fetchFundingTxWrapper and still see the issue (not with that stack trace, though).

@ziggie1984
Collaborator

ziggie1984 commented Dec 8, 2023

Could you provide a pprof goroutine dump the next time it happens for you, @C-Otto?

@C-Otto
Contributor

C-Otto commented Dec 8, 2023

Will do

@ziggie1984
Collaborator

Another node runner running 0.17.3 with a pruned backend reached out to me.

I am still trying to track down the issue in the code (I think I am close to reconstructing the problem locally), but the main problem is that the pruned-block workers somehow do not report a failure or timeout back to the initial fetchFundingTxWrapper call. His system has 6 CPUs, meaning he has 24 slots for the validation barrier in the ChannelRouter.

The main question is: why is this call stuck for 1305 minutes? Something is messed up with the PrunedBlockDispatcher; I think something is wrong on the neutrino side and we miss some notifications.

goroutine 11645746 [select, 1305 minutes]: 2 times: [[11645746, 11645767]
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc000026210, 0xc01364fc80)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231129183218-5df09dd43358/chain/bitcoind_conn.go:479 +0x1d1
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc0109efb60?, 0x0?)
        github.com/btcsuite/btcwallet@v0.16.10-0.20231129183218-5df09dd43358/chain/bitcoind_client.go:144 +0x17
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc00f1c2430, 0xc01364fc80, 0xc014767ec0)
        github.com/lightningnetwork/lnd/blockcache/blockcache.go:55 +0x22a
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0xc0186eca78?, 0x1?)
        github.com/lightningnetwork/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1de
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc00955c140, 0xc0157b02d0)
        github.com/lightningnetwork/lnd/routing/router.go:1810 +0x5c
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTxWrapper.func1()
        github.com/lightningnetwork/lnd/routing/router.go:1774 +0x2d
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTxWrapper in goroutine 11645745
        github.com/lightningnetwork/lnd/routing/router.go:1773 +0xd7

22 goroutines now blocked waiting for dependants:

goroutine 14401576 [select, 829 minutes]: 22 times: [[14401576, 14780452, 14636497, 13497107, 14186823, 15877870, 16643323, 14194898, 16694316, 14637199, 16673432, 14407395, 14790274, 13994802, 13985663, 16379626, 14421042, 16285128, 15787561, 16395763, 16730383, 15877867]
github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).WaitForDependants(0xc00c5eac60, {0x1c2b960?, 0xc00f79b930?})
        github.com/lightningnetwork/lnd/routing/validation_barrier.go:238 +0x605
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc00955c140, 0xc00c5eac60, 0xc01b68ec60)
        github.com/lightningnetwork/lnd/routing/router.go:1010 +0xb0
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler in goroutine 9572
        github.com/lightningnetwork/lnd/routing/router.go:1129 +0xbbf

So we are in a deadlock here. A new block is ready to advance the block height of the ChannelRouter, but the channel is not able to receive the data, which can be seen here:

goroutine 7942 [select, 451 minutes]: 1 times: [[7942]
github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).queueCoordinator(0xc0104ee7d0)
        github.com/lightningnetwork/lnd/routing/chainview/queue.go:117 +0x1f8
created by github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).Start in goroutine 1
        github.com/lightningnetwork/lnd/routing/chainview/queue.go:66 +0x65

Most likely the dependants build up over time, related to the two ChannelAnnouncements which are stuck in the GetBlock call (fetchFundingTx). So we need to figure out why the PrunedBlockDispatcher is stuck. I took a closer look at the code in the neutrino package and there are timeouts on the requests (2 seconds for a single query, and 30 seconds for the whole batch) - this is something I am going to test, because although fetchFundingTxWrapper is still waiting for a reply, the corresponding pruned-block fetcher seems to be in a happy state, so something got lost along the way.

4 workers waiting for work (idling):

goroutine 15025022 [select, 584 minutes]: 4 times: [[15025022, 7297693, 15833384, 16912463]
github.com/lightninglabs/neutrino/query.(*worker).Run(0xc0141566f0, 0xc000901680, 0xc0009017a0)
        github.com/lightninglabs/neutrino@v0.16.0/query/worker.go:96 +0x1fb
github.com/lightninglabs/neutrino/query.(*peerWorkManager).workDispatcher.func2()
        github.com/lightninglabs/neutrino@v0.16.0/query/workmanager.go:289 +0x8d
created by github.com/lightninglabs/neutrino/query.(*peerWorkManager).workDispatcher in goroutine 5964
        github.com/lightninglabs/neutrino@v0.16.0/query/workmanager.go:285 +0x133c

and the work manager is waiting for results, so it does not seem like anything is still being processed:

goroutine 5964 [select, 156 minutes]: 1 times: [[5964]
github.com/lightninglabs/neutrino/query.(*peerWorkManager).workDispatcher(0xc010601ce0)
        github.com/lightninglabs/neutrino@v0.16.0/query/workmanager.go:263 +0x454
created by github.com/lightninglabs/neutrino/query.(*peerWorkManager).Start in goroutine 1
        github.com/lightninglabs/neutrino@v0.16.0/query/workmanager.go:137 +0x65

We will investigate this behaviour further to finally track it down.

@ziggie1984
Collaborator

godump_notsynced_dedup.log
Full dump from the node.

@C-Otto
Contributor

C-Otto commented Dec 9, 2023

curl http://127.0.0.1:9736/debug/pprof/goroutine?debug=2 attached:

goroutine.txt

@ziggie1984
Collaborator

Thanks for the dump, it's the same in your case. You mentioned 4 virtual CPUs, so 16 slots.

2 slots hang here:

goroutine 3076998 [select, 863 minutes]: 2 times: [[3076998, 3112915]
github.com/btcsuite/btcwallet/chain.(*BitcoindConn).GetBlock(0xc00027ba20, 0xc014ea3a40)
        /home/cotto/go/pkg/mod/github.com/btcsuite/btcwallet@v0.16.10-0.20231129183218-5df09dd43358/chain/bitcoind_conn.go:479 +0x1f0
github.com/btcsuite/btcwallet/chain.(*BitcoindClient).GetBlock(0xc00dc0e7a0?, 0x0?)
        /home/cotto/go/pkg/mod/github.com/btcsuite/btcwallet@v0.16.10-0.20231129183218-5df09dd43358/chain/bitcoind_client.go:144 +0x1d
github.com/lightningnetwork/lnd/blockcache.(*BlockCache).GetBlock(0xc0062f4070, 0xc014ea3a40, 0xc005adeec0)
        /home/cotto/IdeaProjects/lnd/blockcache/blockcache.go:55 +0x246
github.com/lightningnetwork/lnd/lnwallet/btcwallet.(*BtcWallet).GetBlock(0x0?, 0x48403c?)
        /home/cotto/IdeaProjects/lnd/lnwallet/btcwallet/blockchain.go:139 +0x1ff
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTx(0xc0000d88c0, 0xc00dc0e6f4)
        /home/cotto/IdeaProjects/lnd/routing/router.go:1810 +0x66
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTxWrapper.func1()
        /home/cotto/IdeaProjects/lnd/routing/router.go:1774 +0x46
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).fetchFundingTxWrapper
        /home/cotto/IdeaProjects/lnd/routing/router.go:1773 +0xf1

14 hang here:

goroutine 3374606 [select, 826 minutes]: 14 times: [[3374606, 3826502, 3868619, 3355939, 3366418, 4347201, 4342024, 4324413, 3380439, 3832171, 3365343, 3390681, 3384080, 4332250]
github.com/lightningnetwork/lnd/routing.(*ValidationBarrier).WaitForDependants(0xc00e82cc60, {0x15292c0?, 0xc0109cd110?})
        /home/cotto/IdeaProjects/lnd/routing/validation_barrier.go:238 +0x6df
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).handleNetworkUpdate(0xc0000d88c0, 0xc00e82cc60, 0xc0150a47e0)
        /home/cotto/IdeaProjects/lnd/routing/router.go:1010 +0xbc
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler
        /home/cotto/IdeaProjects/lnd/routing/router.go:1129 +0xbcc

And it seems like the node was out of sync for more than 10 hours because the block could not be delivered (though I'm not sure whether the 593 minutes really mean the goroutine was stuck at this particular line, or just that the goroutine has been running for that long). Do you know when the node went out of sync?

goroutine 3262 [select, 593 minutes]: 1 times: [[3262]
github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).queueCoordinator(0xc006354370)
        /home/cotto/IdeaProjects/lnd/routing/chainview/queue.go:117 +0x207
created by github.com/lightningnetwork/lnd/routing/chainview.(*blockEventQueue).Start
        /home/cotto/IdeaProjects/lnd/routing/chainview/queue.go:66 +0x6a

@C-Otto
Contributor

C-Otto commented Dec 9, 2023

I can confirm the 10 hours. The mails started coming in at 03:44 and I did the dump at around 13:15, which is roughly 10 hours difference.

@C-Otto
Contributor

C-Otto commented Dec 9, 2023

I added logging for the blockcache stuff, maybe that helps once it starts going downhill again.

@C-Otto
Contributor

C-Otto commented Dec 9, 2023

I'm now also running curl -s http://127.0.0.1:9736/debug/pprof/goroutine?debug=2 | grep -e "GetBlock(\|WaitForDependants(" -B 1 once a minute, so far nothing turned up.

@ziggie1984
Collaborator

So I think we have some problems in the neutrino package. I increased the log level for the case mentioned above and found a block which is rescheduled over 10 times, although the maximum retry count should not be greater than 2.

https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L345

Here are the retries:

lnd.759_split_01.log:2023-12-09 16:54:45.057 [TRC] LNWL: Worker panxv3zq3bcp2dcbt2j4f2igbjiswbp7pdpdaofdk5mkjux63ptrssid.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:55:17.061 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:55:49.076 [TRC] LNWL: Worker panxv3zq3bcp2dcbt2j4f2igbjiswbp7pdpdaofdk5mkjux63ptrssid.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:56:21.082 [TRC] LNWL: Worker lxcsrowdge7vxo5k46fb22libqwumvkpgtkt332jtf6myaqs3d6keead.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:56:53.083 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:57:25.084 [TRC] LNWL: Worker h5qvrvkdq5wvuywuxb36toph3i7dxl4tk767w7f6hrbuvo3sqqj2xqqd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:57:57.088 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:58:29.088 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:59:01.089 [TRC] LNWL: Worker h5qvrvkdq5wvuywuxb36toph3i7dxl4tk767w7f6hrbuvo3sqqj2xqqd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 16:59:33.090 [TRC] LNWL: Worker panxv3zq3bcp2dcbt2j4f2igbjiswbp7pdpdaofdk5mkjux63ptrssid.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 17:00:05.090 [TRC] LNWL: Worker lxcsrowdge7vxo5k46fb22libqwumvkpgtkt332jtf6myaqs3d6keead.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 17:00:37.094 [TRC] LNWL: Worker nvu64vxpwzsi4p2shg2ppuxeodfgwhrudyskvt6sedal6vnlvix6rfyd.onion:8333 timeout for request *wire.MsgGetData with job index 10
lnd.759_split_01.log:2023-12-09 17:01:09.096 [TRC] LNWL: Worker h5qvrvkdq5wvuywuxb36toph3i7dxl4tk767w7f6hrbuvo3sqqj2xqqd.onion:8333 timeout for request *wire.MsgGetData with job index 10

So this explains the stuck GetBlock request, but I am not sure why we are not exiting after 2 retries; it seems like the only way we would keep retrying is if the batch reference is somehow nil:
https://github.com/lightninglabs/neutrino/blob/master/query/workmanager.go#L334-L385

I opened a PR which increases some of the logging; it might be helpful for tracking down the issue, so it would be very much appreciated if you could run it too.

#8267

@C-Otto
Contributor

C-Otto commented Dec 10, 2023

I just added the changes from #8267 and restarted my node (I'm not using neutrino, though?).

@ziggie1984
Collaborator

ziggie1984 commented Dec 10, 2023

Yeah, this is a bit confusing, but as soon as you run in pruned mode you use the neutrino codebase to fetch blocks from peers. (https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L192-L208)

You would need to set LNWL=debug to see the new logs.

@Crypt-iQ
Collaborator

Crypt-iQ commented Dec 13, 2023

So I think we should sort the peer connections by outbound/inbound first, then try all the outbound peers and only after that the inbound Tor peers. I am not sure whether we can fetch the listening port via a bitcoind RPC command; I searched but could not find anything.

I think we can use the getnodeaddresses command, which fetches from bitcoind's address manager. I think anything in the address manager should have the listening port rather than the randomly assigned TCP port.

EDIT: oops, I realize we already do this. But I think we can use this over the getpeerinfo call.

@C-Otto
Contributor

C-Otto commented Dec 14, 2023

For the record, I didn't allow incoming connections (and still don't), and I wasn't using tor.

To help debugging attempts, I enabled tor and temporarily disabled outgoing IPv4/IPv6 bitcoin peers on my node. So far, everything is fine...

@Liongrass
Contributor Author

The issue is not limited to Tor nodes. I used a spare node in an attempt to trigger this issue. I did the following:

  • 0.17.3-beta commit=lightning-terminal-v0.12.2-alpha
  • Clearnet only
  • I purged the graph with chantools dropchannelgraph
  • Started LND
  • Waited for the node to lose sync to chain while syncing to graph
  • Curled the goprofile
    goroutine.txt

@ziggie1984
Collaborator

ziggie1984 commented Dec 14, 2023

Ok I was able to track down the issue, will provide a fix shortly.

So the problem is here:

When a block is not on our local backend (we already pruned it) we will try to fetch it from other nodes here:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/bitcoind_conn.go#L474

Now the problem:

We create a new block request here:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L504-L519

Here comes the problem: first of all, the 2-second timeout (and then 4 seconds for the second try) is too short, which is the root cause of why we see a lot of timeouts while the block is delivered a couple of seconds later.

When we hit two timeouts we never delete the block from d.blocksQueried; that only happens if we receive the block in time:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L650

So the problem arises when the same block is requested a second time (this might take a while to occur): the dispatcher then thinks there is already a request in flight, see here:

https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/pruned_block_dispatcher.go#L544-L553

So it never launches a new request and the GetBlock call hangs forever, eventually blocking the ValidationBarrier.

So I think there are several things which can be fixed immediately, will come up with a plan tomorrow.

EDIT: Huge thanks to @M1ch43lV for running the draft PR to extract the logs which led to finding the bug 🙏
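
To make the bookkeeping bug concrete, here is a stripped-down sketch of the pattern described above, with hypothetical names rather than the actual btcwallet PrunedBlockDispatcher code: the in-flight map is only cleaned up on success, so a later request for the same block subscribes to a stale entry and waits forever.

// Stripped-down sketch of the described bookkeeping bug (hypothetical names,
// not the actual btcwallet code): requests are tracked in an in-flight map
// that is only cleaned up on success. If the first attempt times out, the
// entry stays behind, and a later request for the same block just subscribes
// to the stale entry and never gets an answer.
package main

import (
    "fmt"
    "sync"
    "time"
)

type dispatcher struct {
    mu            sync.Mutex
    blocksQueried map[string][]chan string // block hash -> waiting callers
}

func (d *dispatcher) getBlock(hash string, fetchSucceeds bool) {
    d.mu.Lock()
    waiters, inFlight := d.blocksQueried[hash]
    resp := make(chan string, 1)
    d.blocksQueried[hash] = append(waiters, resp)
    d.mu.Unlock()

    if !inFlight {
        // Only the first caller launches a network request.
        go func() {
            if !fetchSucceeds {
                // BUG: on timeout/failure we return without notifying the
                // waiters or deleting the map entry.
                return
            }
            d.mu.Lock()
            for _, w := range d.blocksQueried[hash] {
                w <- "block data"
            }
            delete(d.blocksQueried, hash) // cleanup happens only on success
            d.mu.Unlock()
        }()
    }

    select {
    case <-resp:
        fmt.Println("got block", hash)
    case <-time.After(time.Second):
        fmt.Println("still waiting for block", hash, "(stuck)")
    }
}

func main() {
    d := &dispatcher{blocksQueried: make(map[string][]chan string)}
    d.getBlock("0000...abc", false) // first request times out, entry left behind
    d.getBlock("0000...abc", true)  // second request piggybacks on the stale entry and hangs
}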

@Roasbeef
Member

Here comes the problem: first of all, the 2-second timeout (and then 4 seconds for the second try) is too short, which is the root cause of why we see a lot of timeouts while the block is delivered a couple of seconds later.

I mentioned an approach to dynamically tune the timeout here, which shouldn't be too involved: lightninglabs/neutrino#292

As a starting point, we can use the reported ping latency from the getpeerinfo call. Then we can apply an RTT multiplier on top of a processing multiplier. Each time we get a response, we use that to tune what the actual timeout should be (some sort of moving weighted average).

Perhaps what we also need is a callback for when the request times out and we're unable to get a response? That would then allow us to remove it from the set of in-flight requests.

IIUC, upon an error we'll get a send on the errChan, but currently we just return early, whereas we could update the set of in-flight requests there: https://github.com/btcsuite/btcwallet/blob/5df09dd4335865dde2a9a6d94a26a7f5779af825/chain/bitcoind_conn.go#L483-L490

The error should come from here once the max number of retries is exceeded: https://github.com/lightninglabs/neutrino/blob/42a196facefac36282e68ddf4a02c9ce4601a0b0/query/workmanager.go#L341-L358
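
As a rough illustration of that moving-average idea (a sketch with made-up names and constants, not the neutrino implementation): seed the estimate from the peer's reported ping RTT (e.g. from getpeerinfo), update it with an exponentially weighted moving average of observed response times, and derive the per-request timeout from it with a multiplier and sane bounds.

// Minimal sketch of an adaptive timeout (illustrative only, not the
// neutrino implementation): keep an exponentially weighted moving average
// of observed response times, seeded from the peer's reported ping RTT,
// and derive the per-request timeout from it with a multiplier.
package main

import (
    "fmt"
    "time"
)

type timeoutEstimator struct {
    avgRTT     time.Duration // smoothed round-trip estimate
    alpha      float64       // EWMA smoothing factor, e.g. 0.2
    multiplier float64       // headroom for processing time, e.g. 3x
    minTimeout time.Duration
    maxTimeout time.Duration
}

// observe feeds a measured response time into the moving average.
func (e *timeoutEstimator) observe(sample time.Duration) {
    e.avgRTT = time.Duration(e.alpha*float64(sample) + (1-e.alpha)*float64(e.avgRTT))
}

// timeout returns the timeout to use for the next request, clamped to sane bounds.
func (e *timeoutEstimator) timeout() time.Duration {
    t := time.Duration(float64(e.avgRTT) * e.multiplier)
    if t < e.minTimeout {
        return e.minTimeout
    }
    if t > e.maxTimeout {
        return e.maxTimeout
    }
    return t
}

func main() {
    // Seed with the ping latency reported for the peer (e.g. via getpeerinfo).
    est := &timeoutEstimator{
        avgRTT:     800 * time.Millisecond,
        alpha:      0.2,
        multiplier: 3,
        minTimeout: 2 * time.Second,
        maxTimeout: 30 * time.Second,
    }

    for _, sample := range []time.Duration{3 * time.Second, 5 * time.Second, 4 * time.Second} {
        est.observe(sample)
        fmt.Printf("observed %v -> next timeout %v\n", sample, est.timeout())
    }
}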

@ziggie1984
Collaborator

I think your ideas fit perfectly, @Roasbeef. I will work on an implementation draft then.

@saubyk saubyk added this to the v0.18.0 milestone Dec 15, 2023
@warioishere

warioishere commented Dec 22, 2023

Hello dear people, I'm reading this thread with huge interest. We also suffer from this problem, and for a while now we haven't been able to get our new node properly online anymore. I was searching myself to death and thought I was unable to set up our deployment.
We use BTCPay Server with lnd 0.17.2-beta and Bitcoin Core 26 on a 100GB pruned node, if this is of any help.

Initially after a restart we are synced to chain, but after about 15 minutes we lose sync and it reports both
synced to chain: false
synced to graph: false

If you need help, please let me know, but if it's deep-dive logs, you'll need to tell me what to do.

Just a short question: I know it's out when it's out, but do you have an idea when a fix will be released? Our shop has been offline for a couple of weeks now because we started to have weird connection problems, and then I did the most stupid thing ever...

@C-Otto
Contributor

C-Otto commented Dec 22, 2023

In my experience the graph-sync isn't affected. Are you sure you're hitting this issue? Could you run the profile command (see above and here: https://docs.lightning.engineering/lightning-network-tools/lnd/debugging_lnd#capturing-pprof-data-with-lnd) to see if you're hitting the same limits?

@C-Otto
Contributor

C-Otto commented Dec 22, 2023

If you're affected by this issue, it might help to run a full (non-pruned) node and/or provide more CPU cores to lnd (which isn't a fix, but it might delay the need for a restart).

@warioishere

In my experience the graph-sync isn't affected. Are you sure you're hitting this issue? Could you run the profile command (see above and here: https://docs.lightning.engineering/lightning-network-tools/lnd/debugging_lnd#capturing-pprof-data-with-lnd) to see if you're hitting the same limits?

I am not sure that I'm hitting this issue, but we never had such problems before; for two weeks now we haven't had a synced LND, and the only thing we did was update the deployment to 0.17.2-beta.

Then, after 2 or 3 new blocks, it goes out of sync.

LND runs inside a Docker container; could you tell me how to start it with lnd --profile=9736?

Then I could provide you with some info. I'm sorry, I am not as technically experienced as you guys.

@C-Otto
Contributor

C-Otto commented Dec 22, 2023

You can edit your lnd.conf and add profile=9736 to the [Application Options] section.
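
For example, the relevant part of lnd.conf would look roughly like this (assuming a stock lnd.conf; in a Docker deployment the file lives wherever the container mounts lnd's data directory):

[Application Options]
profile=9736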

@warioishere

@C-Otto
I decided to go unpruned, and it solved the problem. LND states that it's synced, so I guess it was a similar or the same problem. I wonder why more problems don't show up for other node operators; maybe only a small share run pruned.

At least I'm good to go again.

Cheers

@warioishere

Additional info:
After I started having this problem I set up exactly the same deployment again, because I thought I had messed up my config. That deployment, on a different server of ours, had exactly the same problem. So I think I can verify that I had the issue you guys are discussing.

@rockstardev
Contributor

rockstardev commented Jan 19, 2024

Hey everyone, chiming in to say that this issue is critical for using LND in BTCPay Server because most of our instances are pruned. We're now getting inquiries every day about how to solve this, and it's impractical to tell people to restart LND every day.

@saubyk - it's really appreciated that you guys are already looking into this. Is there clarity on the timeline here, so we can forward it to our merchants and ambassadors? I see you mapped it to 0.18 - when is that planned for release? And if it's more than a month out, is it possible to go forward with a 0.17.4 fix, since this is such a big issue for a lot of deployments?

@saubyk
Collaborator

saubyk commented Jan 19, 2024

Hi @rockstardev, we're looking into the possibility of including a fix with 0.17.4. No promises yet.

@ziggie1984 is on the case

@saubyk saubyk modified the milestones: v0.18.0, v0.17.4 Jan 21, 2024
@rockstardev
Contributor

@saubyk & @ziggie1984 godspeed, love seeing this slated for 0.17.4

@ffrediani

Hello folks

Can anyone elaborate on why pruned nodes may cause this issue? I have been searching for more information on whether running Bitcoin nodes in pruned mode is actually a problem, or whether it is something that can be fixed inside LND to handle these scenarios better.

Why does LND need to be able to read an older block that may no longer be available locally on the Bitcoin node?

@C-Otto
Contributor

C-Otto commented May 30, 2024

lnd sometimes needs to access information from old blocks, which I think is related to validating channels that have been opened a long time ago.

It's rather trivial to request this from a connected full node, as this full node can just grab the block from disk and return it to lnd. However, if the connected bitcoind does not have the requested block, it needs to be downloaded from the internet. This process is rather slow and unreliable, which is why the corresponding code is a bit more complex.

One bug in this complex code has been described and fixed here, but there may be other bugs that haven't been reported, yet. I run my node with a pruned bitcoind and I don't think it's a large issue. It's not the perfect solution, though.

@ffrediani

ffrediani commented May 31, 2024

Thanks for the explanation @C-Otto

Some things that come to mind in order to dig into this further, and perhaps help clarify the possible causes and solutions:

  • Could this process be even slower when the node uses Tor only rather than clearnet? If that can be measured, it could strengthen the recommendation that those running pruned nodes prefer clearnet whenever available.
  • Are there ways to improve a pruned node's connectivity to more peers (via bitcoind settings) in order to reduce the chance of trouble fetching a block remotely?
  • Are there ways for bitcoind to find out in advance that a peer has the full blockchain, so it can preferably ask those peers instead of another possibly pruned node?
  • Regarding the LND code that handles such scenarios: do you think this is a timeout-adjustment question, or something more complex? At least if LND is aware that the fetch is under way and may take a little longer, it could take its time and wait for it to finish properly in those cases with extra delay.
  • For LND purposes specifically, is there any way that blocks related to existing channels could be prefetched, to improve the processes that may need older blocks?
  • Is there a minimum prune size that would be more reasonable to avoid issues, or is it clear that what matters are the blocks related to the opened channels? I understand that for opening new channels, broadcasting to the network with the latest blocks is enough, which a pruned node can do normally.

@C-Otto
Contributor

C-Otto commented May 31, 2024

I don't think this discussion belongs here. The issue is fixed.

@ffrediani

@C-Otto from your previous message you don't seem to suggest the problem is completely solved. You mentioned that one issue is fixed, and with the points above I am trying to anticipate other possible scenarios and solutions.
In any case, where would this discussion fit better, then?
