[bug]: GossipSyncer should remove unreliable Peers from syncing the ChannelGraph #8889
Comments
Re sanity checks: today we do have a reject cache here to ensure that we won't continue validating channels that we already deemed to be invalid.
Happened to me twice this week. LND quietly becomes unresponsive and the node must be rebooted. Logs are completely full with these errors. How do I find out which peer is causing this, so I can close the channel? Ok, found the peer by looking for
Try to find this log line:
What you can do as a preliminary measure is the following: close the channel to this node (the node ID above), and also make sure you don't connect to this peer anymore by blocking its traffic, including inbound connections from it. Or use specific peers for the active peer sync so that you don't run the risk of peering with that particular node again: pin exactly the number of synced peers you have set up for active sync (default: numgraphsyncpeers=3).
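For reference, the relevant knob in lnd.conf looks roughly like this (numgraphsyncpeers is an existing option, but double-check sample-lnd.conf for your version):

```ini
[Application Options]
; Limit how many peers LND actively syncs the channel graph with
; (3 is the default; keeping it small limits exposure to bad peers).
numgraphsyncpeers=3
```

You can also drop the current connection with lncli disconnect <pubkey>, though the peer may reconnect on its own unless you block it at the network level.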
I identified a certain peer as the cause, so I limited my sync peers to Boltz, WoS and Bfx. All was fine for a few days, but today the problem returned. The LND log and all three archives are completely full of 'unable to fetch utxo', so I cannot find 'starting query for'... Edit: same when disconnecting Boltz or WoS. Does anyone know a reliable gossip peer that won't cause this mess?
Well, this did not work. I pinned the peers, but still see the same errors. My lnd.conf:
Log output:
Can you share the peer's NodeID? I want to connect to it and see what's happening.
The peer I identified while analysing the issue was this one: https://amboss.space/node/02dfe525d9c5b4bb52a55aa3d67115fa4a6326599c686dbd1083cffe0f45c114f8
Sure, my bad peers are:
Two of my bad peers are running 0.18.0; the other two did not respond.
@Impa10r what kind of hardware is your node running on?
Raspi4 8GB. It could be that this is not a new problem, but LND 0.18 is more CPU intensive.
I also saw this on a Raspi4 8GB. The pinned peers suggestion seems to have fixed it, running a few days now.
Happened again today. Needed a restart to recover.
I had 1101 from a different peer, but LND did not restart this time. |
lncli listchannels shows zero_conf as false on all channels |
Private channels are normally not zero_conf. https://github.com/blckbx/NodeTools/blob/main/private-trusted-zero-conf-channels.md
listchannels reports private as true and zero_conf for that channel as false |
No, that made no difference; the same problem occurred twice today. It seems to be caused by the same peer node.
For all people struggling with this problem: please set the following config setting until a proper fix is out.
This solely guarantees that only the pinned peers will be used for a historical scan of the channel graph, rather than some other random peers. I will keep you updated while working on a proper fix.
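A sketch of what this could look like in lnd.conf, assuming the setting meant here is gossip.pinned-syncers (verify the option name against your LND version); the pubkey below is a placeholder for one of your own channel peers:

```ini
[gossip]
; Pin a peer for gossip sync; use one (or more, by repeating the line)
; of your direct channel peers so connectivity is guaranteed.
gossip.pinned-syncers=<pubkey_of_a_channel_peer>
```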
By "pinned peers" you mean peers which a node has active channels with, correct? |
Good question. Yes, only peers you have channels with should be used as pinned peers. Pinned peers can basically be selected via the lnd.conf file (theoretically it could be any peer, but then we wouldn't be guaranteed connectivity with the peer, so a peer we have a channel with should be used): Lines 1620 to 1637 in 2f2efc7
What was the cause of the problem: To end up in the described situation, several factors had to play out. One problem is in how we would reply here: Lines 2287 to 2289 in c0420fe
rather it's 0xFFFFFFF1886E0900, or the golang (signed) representation -0xE7791F700. When converted into uint32 we get: Lines 1119 to 1125 in c0420fe
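For what it's worth, -0xE7791F700 is -62135596800, which is exactly the Unix timestamp of Go's zero time.Time; casting it to the uint32 used on the gossip wire wraps it into a date in 2042. A minimal standalone sketch (my own illustration, not LND code) of that conversion:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Go's zero time.Time (January 1, year 1 UTC) has a negative
	// Unix timestamp: -62135596800, i.e. -0xE7791F700.
	var zero time.Time
	ts := zero.Unix()

	fmt.Printf("int64 bits: 0x%X\n", uint64(ts)) // 0xFFFFFFF1886E0900

	// Gossip timestamps are carried as uint32 on the wire, so the
	// cast silently wraps the negative value around...
	wrapped := uint32(ts)

	// ...into a timestamp in 2042, which makes a never-updated
	// (zombie) channel look freshly updated.
	fmt.Println(wrapped, time.Unix(int64(wrapped), 0).UTC())
}
```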
So basically now all zombie channels will be reactivated. It only needs one node in the network to serve this kind of gossip msg (for example a neutrino node) which runs via the

Potential Solution / Fix

Problem with the RejectCache: The reject cache is a rotating memory with a default size of 50k entries. We saw in the log files that sometimes we receive way over 200k old channels we already deleted or forgot. This high amount of new chans makes the rejectCache basically useless because it rotates away; a toy sketch of this failure mode follows at the end of this comment. So what I suggest implementing:
Looking forward to your evaluation. Special thanks to @Crypt-iQ, who pointed me in the right direction when analysing this problem.
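To make the rotation problem concrete, here is a toy sketch of a fixed-size FIFO reject cache being flushed by a flood of stale channels — an illustration only; LND's actual reject cache differs in detail:

```go
package main

import "fmt"

// rejectCache is a toy fixed-size "rotating" cache: once full, every new
// entry evicts the oldest one.
type rejectCache struct {
	order []uint64            // insertion order, used as the eviction queue
	set   map[uint64]struct{} // membership check
	limit int
}

func newRejectCache(limit int) *rejectCache {
	return &rejectCache{set: make(map[uint64]struct{}), limit: limit}
}

func (c *rejectCache) add(chanID uint64) {
	if _, ok := c.set[chanID]; ok {
		return
	}
	if len(c.order) == c.limit {
		oldest := c.order[0]
		c.order = c.order[1:]
		delete(c.set, oldest)
	}
	c.order = append(c.order, chanID)
	c.set[chanID] = struct{}{}
}

func (c *rejectCache) contains(chanID uint64) bool {
	_, ok := c.set[chanID]
	return ok
}

func main() {
	cache := newRejectCache(50_000)

	// A peer floods us with 200k stale channels; we reject and cache each.
	for id := uint64(0); id < 200_000; id++ {
		cache.add(id)
	}

	// On a replay of the same flood, the first 150k channels have already
	// rotated out of the cache, so we would validate them all over again.
	fmt.Println(cache.contains(0))       // false: evicted
	fmt.Println(cache.contains(199_999)) // true: still cached
}
```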
Great catch! I agree that
Because LND deletes the edges and doesn't totally resurrect them by updating
This points to the spam-serving nodes being neutrino nodes as they won't validate the |
Very good analysis. I think running for a short period of time in neutrino mode and then switching back could also infect us with all those channels which are already closed. I think we would never get rid of them without dropping the chan-graph.
I think as long as we introduce the |
It seems that these nodes are not neutrino nodes, so I'm not sure how they are storing old announcements... |
One other problem we have to think about is that those nodes are not pruning those old announcements. I think the culprit is that they also have no ChanUpdate for those announcements, hence no entry here: Lines 4404 to 4416 in 0aced5c
We probably need to insert a dummy timestamp into the
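If I understand the idea correctly, a rough sketch could look like this — all names here are hypothetical, not LND's actual schema:

```go
package gossipsketch

import "time"

// markAnnouncementSeen records a synthetic "last update" timestamp for a
// channel announcement that arrived without any ChannelUpdate, so the
// zombie/staleness pruning pass can still age it out later.
// lastUpdateIndex and markAnnouncementSeen are made-up names.
func markAnnouncementSeen(lastUpdateIndex map[uint64]time.Time, chanID uint64) {
	if _, ok := lastUpdateIndex[chanID]; !ok {
		lastUpdateIndex[chanID] = time.Now()
	}
}
```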
In case more examples are useful:
118268 new chans 🙄 The requesting node is a Voltage.cloud lite node of mine that has only the one channel and is running 0.18.2-beta |
There are peers in the network which have a very old view of the channel graph, and when connected to them, LND has to do a lot of unnecessary work to check the validity of their out-of-date graph data. Moreover, for pruned nodes this comes with a bandwidth burden: for channels whose block data the node no longer has, the blocks need to be fetched from peers, which is very bandwidth intensive, especially if you use VPN services for your node which restrict bandwidth at certain points.
I think we should first introduce some sanity checks: if our channel graph and the channel graph of a peer are too far apart, we should trust ourselves more and remove the peer from syncing the data. Probably we should also ban that peer at some point.
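To illustrate the kind of sanity check meant here, a rough sketch — the names and the threshold are made up for illustration, not an existing LND API:

```go
package gossipsketch

// maxUnknownRatio is a hypothetical threshold: if more than half of a
// peer's advertised channels are unknown to us, treat the peer as stale.
const maxUnknownRatio = 0.5

// shouldDropSyncer returns true if too large a fraction of the channel IDs
// a peer advertises (e.g. in its ReplyChannelRange) are unknown to, or
// already pruned from, our own graph.
func shouldDropSyncer(peerChanIDs []uint64, knownChan func(uint64) bool) bool {
	if len(peerChanIDs) == 0 {
		return false
	}
	var unknown int
	for _, cid := range peerChanIDs {
		if !knownChan(cid) {
			unknown++
		}
	}
	return float64(unknown)/float64(len(peerChanIDs)) > maxUnknownRatio
}
```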
This is how an out-of-date peer sync could look (from a mainnet node):