
[bug]: GossipSyncer should remove unreliable Peers from syncing the ChannelGraph #8889

Closed
ziggie1984 opened this issue Jul 3, 2024 · 29 comments
Labels
bug Unintended code behaviour gossip

@ziggie1984
Collaborator

ziggie1984 commented Jul 3, 2024

There are peers in the network with a very old ChannelGraph view, and when connected to them LND has to do a lot of unnecessary work to check the validity of their out-of-date graph data. Moreover, for pruned nodes this comes with a bandwidth burden: for channels whose block data the node no longer has locally, the blocks need to be fetched from peers, which is very bandwidth intensive, especially if you run your node over a VPN service that throttles bandwidth at certain points.

I think we should first introduce some sanity checks: if our channel graph and the channel graph of a peer are too far apart, we should trust our own view more and remove the peer from syncing the data. Probably we should also ban that peer at some point.
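As a purely illustrative sketch (none of these names exist in lnd today and the threshold is made up), such a check could gate whether we even start the bulk query:

```go
// maxNewChanRatio is a hypothetical threshold: if a peer claims to know
// about more than this fraction of "new" channels relative to the size of
// our own graph, its view is probably badly out of date.
const maxNewChanRatio = 0.5

// shouldSyncFromPeer is a hypothetical sanity check run after diffing a
// peer's reply_channel_range against our graph. It returns false when the
// peer advertises implausibly many channels we have never heard of.
func shouldSyncFromPeer(numNewChans, knownChanCount int) bool {
	if knownChanCount == 0 {
		// Fresh node with an empty graph: sync from anyone.
		return true
	}

	return float64(numNewChans)/float64(knownChanCount) <= maxNewChanRatio
}
```

A peer advertising 164,060 "new" channels, as in the log below, would fail such a check immediately on any node with a reasonably complete graph.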

This is how an out-of-date peer sync could look (from a mainnet node):

 DISC: GossipSyncer(02dfe525d9c5b4bb52a55aa3d67115fa4a6326599c686dbd1083cffe0f45c114f8): starting query for 164060 new chans
2024-06-29 11:20:06.456 [INF] DISC: GossipSyncer(02dfe525d9c5b4bb52a55aa3d67115fa4a6326599c686dbd1083cffe0f45c114f8): querying for 500 new channels
2024-06-29 11:20:06.717 [INF] DISC: GossipSyncer(02dfe525d9c5b4bb52a55aa3d67115fa4a6326599c686dbd1083cffe0f45c114f8): querying for 500 new channels
2024-06-29 11:20:06.894 [ERR] CRTR: process network updates got: unable to fetch utxo for chan_id=556369376388317185, chan_point=8450839336340be66e5e1bf02d6ab046d4e83b44d97f034f0b8da609e4f74df1:1: target output has been spent
2024-06-29 11:20:06.988 [ERR] CRTR: process network updates got: unable to fetch utxo for chan_id=556449640836104192, chan_point=6950a0ad8d4d3751f26afd8ab09c32e28b017e721503adefdf6d1f06ae2b19d4:0: target output has been spent
2024-06-29 11:20:07.210 [ERR] CRTR: process network updates got: unable to fetch utxo for chan_id=556484825143181312, chan_point=8ca07553aff9287badc5a792f85e6785aaccd8c8596b3d7664eb7499e9de67fb:0: target output has been spent
2024-06-29 11:20:07.567 [ERR] CRTR: process network updates got: unable to fetch utxo for chan_id=556816877593165824, chan_point=1451d36ff82bb6faa74eae09486b520eef1702a41273ea3c5e6785daaff2fd40:0: target output has been spent

....
@ziggie1984 ziggie1984 added bug Unintended code behaviour gossip labels Jul 3, 2024
@ziggie1984 ziggie1984 changed the title [bug]: GossipSyncer should throttle or Remove unreliable Peers from syncing the ChannelGraph [bug]: GossipSyncer should remove unreliable Peers from syncing the ChannelGraph Jul 3, 2024
@saubyk saubyk added this to the v0.18.2 milestone Jul 3, 2024
@Roasbeef
Member

Roasbeef commented Jul 8, 2024

Re sanity checks, today we do have a reject cache here to ensure that we won't continue validating channels that we already deemed to be invalid.

@Impa10r

Impa10r commented Jul 15, 2024

Happened to me twice this week. LND quietly becomes unresponsive and the node must be rebooted. Logs are completely full with these errors. How do I find out which peer is causing this, so I can close the channel?

Ok, found the peer by filtering the log for the "... new chans" lines with a large number (> 100k).

@ziggie1984
Collaborator Author

ziggie1984 commented Jul 16, 2024

Try to find this log line:

GossipSyncer(XXX): starting query for ???? new chans

What you can do as a preliminary measure is the following:

Close the channel to this node (node ID above), and also make sure you don't connect to this peer anymore by blocking its traffic, including inbound connections from this peer.

or

Use specific peers for the active peer sync so that you don't run the risk of gossip-syncing with that particular node again.

Specify exactly the number of peers you have set up for active sync (default: numgraphsyncpeers=3)

[Application Options]

# Always sync from these peers:
gossip.pinned-syncers=<pubkey>

@saubyk saubyk modified the milestones: v0.18.3, v0.19.0 Jul 18, 2024
@Impa10r

Impa10r commented Jul 26, 2024

I identified a certain peer as the cause, so I limited my sync peers to Boltz, WoS and Bfx. All was fine for a few days, but today the problem returned. LND log and all three archives are completely full with 'unable to fetch utxo', so I cannot find 'starting query for'...

Edit: Same when disconnecting Boltz or WoS. Does anyone know a reliable gossip peer that won't cause this mess?

@Impa10r

Impa10r commented Jul 29, 2024

Use specific peers for the active peer sync

Well, this did not work. I pinned the peers, but still see "starting query for <large number>" from other peers, some of which are causing the above problem. LND 0.18.0...

My lnd.conf:

# The number of peers that we should receive new graph updates from. This option
# can be tuned to save bandwidth for light clients or routing nodes. 
numgraphsyncpeers=3

# Always sync from these peers:
# Boltz
gossip.pinned-syncers=026165850492521f4ac8abd9bd8088123446d126f648ca35e60f88177dc149ceb2
# WoS
gossip.pinned-syncers=035e4ff418fc8b5554c5d9eea66396c227bd429a3251c8cbc711002ba215bfc226
# Bfx
gossip.pinned-syncers=03cde60a6323f7122d5178255766e38114b4722ede08f7c9e0c5df9b912cc201d6

Log output:

$ docker logs lightning_lnd_1 --follow --tail 100000 | grep "starting query for"
2024-07-29 20:39:59.557 [INF] DISC: GossipSyncer(03d2d9b865127074632a6b8f8e67c977e9618b55ba0e7d9c005011549bd6ecc302): starting query for 38479 new chans
2024-07-29 20:40:02.081 [INF] DISC: GossipSyncer(026165850492521f4ac8abd9bd8088123446d126f648ca35e60f88177dc149ceb2): starting query for 2 new chans
2024-07-29 23:39:55.568 [INF] DISC: GossipSyncer(023369b071005d376bfa94b76d2ec52d01d6709f57d6b67bf26250f2ef57a66aea): starting query for 2 new chans
2024-07-30 01:39:55.935 [INF] DISC: GossipSyncer(0293bc8988a035126d3d343cd918c9f336567faf8cbe41d9df7574c8644895aa14): starting query for 78 new chans
2024-07-30 03:39:54.865 [INF] DISC: GossipSyncer(026d0169e8c220d8e789de1e7543f84b9041bbb3e819ab14b9824d37caa94f1eb2): starting query for 4 new chans
2024-07-30 05:39:56.089 [INF] DISC: GossipSyncer(0293bc8988a035126d3d343cd918c9f336567faf8cbe41d9df7574c8644895aa14): starting query for 78 new chans
2024-07-30 06:39:55.333 [INF] DISC: GossipSyncer(0201bfc5d9c14d50ebaf4dfd83d70df314ed2a9f0119f28619f3f000a1079ef580): starting query for 5 new chans
2024-07-30 07:39:53.836 [INF] DISC: GossipSyncer(03e86afe389d298f8f53a2f09fcc4d50cdd34e2fbd8f32cbd55583c596413705c2): starting query for 1 new chans
2024-07-30 09:39:54.518 [INF] DISC: GossipSyncer(03ac432467085ff175b84e39def012f616cdbf9164a6923d1bbfffe710f9c8ad3f): starting query for 1 new chans

@Crypt-iQ
Collaborator

I identified a certain peer as the cause, so I limited my sync peers to Boltz, WoS and Bfx

Can you share the peer's NodeID? I want to connect to it and see what's happening.

@ziggie1984
Collaborator Author

The peer I identified while analysing the issue was this one: https://amboss.space/node/02dfe525d9c5b4bb52a55aa3d67115fa4a6326599c686dbd1083cffe0f45c114f8

@Impa10r

Impa10r commented Jul 30, 2024

Can you share the peer's NodeID? I want to connect to it and see what's happening.

Sure, my bad peers are:
023e09c43b215bd3dbf483bcb409da3322ea5ea3b046f74698b89ee9ea785dd30a
0309bf5fe006bd2b14d61126723481c6c685bda44efe093800908527dfe9cd8db7
02ce7abb5e625990f7898520bca721b7ae316165eef81133dcfe4068fcfbe37ac6
03d2d9b865127074632a6b8f8e67c977e9618b55ba0e7d9c005011549bd6ecc302

@Impa10r

Impa10r commented Jul 31, 2024

Two of my bad peers are running 0.18.0; the other two did not respond.

@Crypt-iQ
Collaborator

@Impa10r what kind of hardware is your node running on?

@Impa10r

Impa10r commented Jul 31, 2024

@Impa10r what kind of hardware is your node running on?

Raspi4 8gb. It may be that this is not a new problem, but LND 0.18 is more CPU intensive.

@dunkfan

dunkfan commented Jul 31, 2024

I also saw this on a Raspi4 8gb. The pinned peers suggestion seems to have fixed it, running a few days now.
I'm running 0.18.2

@dunkfan

dunkfan commented Aug 1, 2024

Happened again today. Needed a restart to recover
2024-08-01 08:07:57.089 [INF] DISC: GossipSyncer(023e09c43b215bd3dbf483bcb409da3322ea5ea3b046f74698b89ee9ea785dd30a): querying for 500 new channels
The query appeared in a seemingly endless loop, 500 channels at a time.

@Impa10r

Impa10r commented Aug 1, 2024

I had 1101 from a different peer, but LND did not restart this time.

@Crypt-iQ
Collaborator

Crypt-iQ commented Aug 2, 2024

Do any of you have zero-conf channels? @Impa10r @dunkfan

@dunkfan

dunkfan commented Aug 2, 2024

lncli listchannels shows zero_conf as false on all channels
I do have one private channel operational

@Impa10r

Impa10r commented Aug 3, 2024

lncli listchannels shows zero_conf as false on all channels I do have one private channel operational

Private channels are normally not zero_conf. https://github.com/blckbx/NodeTools/blob/main/private-trusted-zero-conf-channels.md

@dunkfan

dunkfan commented Aug 6, 2024

listchannels reports private as true and zero_conf for that channel as false
I have since moved bitcoin core from 24.0.1 to 25.2
So far 1 day no sign of the issue. Will stay vigilant

@ziggie1984 ziggie1984 self-assigned this Aug 6, 2024
@dunkfan

dunkfan commented Aug 10, 2024

No, that made no difference; same problem twice today. It seems to be caused by the same peer node.

@ziggie1984
Collaborator Author

ziggie1984 commented Aug 12, 2024

For everyone struggling with this problem: can you set the following config option until a proper fix is out?

numgraphsyncpeers=0

This guarantees that only the pinned peers will be used for a historical scan of the channel graph, rather than some other random peers.

Will keep you updated while working on a proper fix.

@saubyk
Collaborator

saubyk commented Aug 12, 2024

This guarantees that only the pinned peers will be used for a historical scan of the channel graph, rather than some other random peers.

By "pinned peers" you mean peers which a node has active channels with, correct?

@ziggie1984
Collaborator Author

ziggie1984 commented Aug 12, 2024

By "pinned peers" you mean peers which a node has active channels with, correct?

Good question. Yes, only peers you have channels with should be used as pinned peers. Pinned peers can be selected via the lnd.conf file (theoretically it could be any peer, but then we wouldn't guarantee connectivity with the peer, so a peer we have a channel with should be used):

lnd/sample-lnd.conf

Lines 1620 to 1637 in 2f2efc7

; Specify a set of pinned gossip syncers, which will always be actively syncing
; whenever the corresponding peer is online. A pinned syncer does not count
; towards the configured `numgraphsyncpeers` since pinned syncers are not
; rotated. Configuring a pinned syncer does not ensure a persistent connection
; to the target peer, they will only be pinned if the connection remains active
; via some other mechanism, e.g. having an open channel.
;
; This feature is useful when trying to ensure that a node keeps its
; routing table tightly synchronized with a set of remote peers, e.g. multiple
; lightning nodes operated by the same service.
;
; Each value should be a hex-encoded pubkey of the pinned peer. Multiple pinned
; peers can be specified by setting multiple flags/fields in the config.
; Default:
; gossip.pinned-syncers=
; Example:
; gossip.pinned-syncers=pubkey1
; gossip.pinned-syncers=pubkey2

@ziggie1984
Collaborator Author

ziggie1984 commented Aug 14, 2024

What was the cause of the problem:

To end up in the described situation, several factors had to play out.

One problem is that we would reply in the reply_channel_range message with a future timestamp for channels we don't have a real ChanUpdate for, which triggers the querying node to resurrect all its zombie channels and query them during the sync.
This is because an uninitialized time.Time does not yield a Unix timestamp of 0 here:

lnd/channeldb/graph.go

Lines 2287 to 2289 in c0420fe

chanInfo := ChannelUpdateInfo{
	ShortChannelID: cid,
}

Rather, it is 0xFFFFFFF1886E0900, or in Go's signed representation -0xe7791f700.

When converted to uint32 we get 0x886E0900. See here:

lnd/discovery/syncer.go

Lines 1119 to 1125 in c0420fe

timestamps[i].Timestamp1 = uint32(
	info.Node1UpdateTimestamp.Unix(),
)
timestamps[i].Timestamp2 = uint32(
	info.Node2UpdateTimestamp.Unix(),
)

So basically all zombie channels will now be reactivated. It only takes one node in the network serving this kind of gossip message (for example a neutrino node, which runs with the assumeValid flag) to cause this.

Potential Solution/ Fix

Problem with the RejectCache:

The reject cache is a rotating memory with a default size of 50k entries. We saw in the log files that sometimes we receive way over 200k old channels we have already deleted or forgotten. This high number of "new" channels makes the rejectCache basically useless, because entries rotate away (see the toy illustration below).
We could increase its size, but I don't think we should go in this direction.
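
As a toy illustration of why a bounded cache is defeated by this volume (this is not lnd's actual rejectCache, just a FIFO stand-in for it):

```go
package main

import "fmt"

// toyRejectCache is a toy FIFO-evicting cache used only to illustrate the
// effect of any bounded reject cache; lnd's real rejectCache has its own
// eviction policy.
type toyRejectCache struct {
	limit int
	order []uint64
	set   map[uint64]struct{}
}

func newToyRejectCache(limit int) *toyRejectCache {
	return &toyRejectCache{limit: limit, set: make(map[uint64]struct{}, limit)}
}

func (c *toyRejectCache) add(chanID uint64) {
	if _, ok := c.set[chanID]; ok {
		return
	}
	if len(c.order) == c.limit {
		// Evict the oldest entry to stay within the size limit.
		oldest := c.order[0]
		c.order = c.order[1:]
		delete(c.set, oldest)
	}
	c.order = append(c.order, chanID)
	c.set[chanID] = struct{}{}
}

func (c *toyRejectCache) contains(chanID uint64) bool {
	_, ok := c.set[chanID]
	return ok
}

func main() {
	cache := newToyRejectCache(50_000)

	// A spammy peer serves 200k stale channels: by the time the sync
	// finishes, the earliest rejects have long been evicted, so a
	// re-sync validates them all over again.
	for id := uint64(0); id < 200_000; id++ {
		cache.add(id)
	}
	fmt.Println(cache.contains(0))       // false: rotated away
	fmt.Println(cache.contains(199_999)) // true: still cached
}
```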

So what I suggest implementing:

  1. Fix the timestamp bug.
  2. Make sure we don't use the timestamps in the reply_channel_range msg to resurrect zombie channels, because this can easily be circumvented.
  3. Make sure we also reject updates which are too far in the future, maybe also 2 weeks?
  4. Introduce a rate limiter when fetching the data from the peer.
  5. Introduce a ban list, which the user can extend manually (maybe via an RPC call) when a peer's gossip sync gets out of control, and in addition maybe ban peers via some heuristic. For example, remember the number of new channels we got from a peer when syncing for the first time; if this number is still the same when fetching historical data again, we can be sure something is off and ban the peer (see the sketch after this list).
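
For point 5, a very rough sketch of what such a ban heuristic could look like (all names, thresholds, and the API surface here are invented for illustration, not an lnd implementation):

```go
package discovery

import (
	"sync"
	"time"
)

// banThreshold and banDuration are illustrative values only.
const (
	banThreshold = 10_000
	banDuration  = 48 * time.Hour
)

// syncBanManager counts how many "new" channels a peer advertises during a
// historical sync and bans the peer once that count looks implausible.
type syncBanManager struct {
	mu        sync.Mutex
	newChans  map[string]int       // peer pubkey -> new chans advertised
	bannedTil map[string]time.Time // peer pubkey -> ban expiry
}

func newSyncBanManager() *syncBanManager {
	return &syncBanManager{
		newChans:  make(map[string]int),
		bannedTil: make(map[string]time.Time),
	}
}

// recordNewChans is called after a reply_channel_range batch is diffed
// against our graph. It returns true if the peer should now be banned.
func (b *syncBanManager) recordNewChans(peer string, numNew int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.newChans[peer] += numNew
	if b.newChans[peer] > banThreshold {
		b.bannedTil[peer] = time.Now().Add(banDuration)
		return true
	}
	return false
}

// isBanned reports whether we should skip gossip syncing with this peer.
func (b *syncBanManager) isBanned(peer string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	expiry, ok := b.bannedTil[peer]
	if !ok {
		return false
	}
	if time.Now().After(expiry) {
		delete(b.bannedTil, peer)
		delete(b.newChans, peer)
		return false
	}
	return true
}
```

The exact threshold and expiry would need tuning, and a persistent ban list plus the manual RPC hook would sit on top of something like this.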

Looking forward to your evaluation.

Special thanks to @Crypt-iQ, who pointed me in the right direction when analysing this problem.

@Crypt-iQ
Collaborator

Crypt-iQ commented Aug 15, 2024

One problem is that we would reply in the reply_channel_range message with a future timestamp for channels we don't have a real ChanUpdate for, which triggers the querying node to resurrect all its zombie channels and query them during the sync.

Great catch! I agree that reply_channel_range messages with bogus timestamps are part of the problem; that explains why I was seeing future timestamps. I think something else is happening that is causing these nodes to broadcast announcements for outdated channels. LND has the following behavior related to spent outputs and zombies:

  • prunes (deletes) graph edges that have had their funding outpoints spent.
  • deletes edges when channels are marked as zombies.
  • cleans up the zombies after a certain interval. Since LND deletes these zombie edges, these spamming nodes can't force LND to resurrect deleted zombie channels.
  • resurrects zombie channels upon receiving a ChannelUpdate with more recent timestamps via MarkEdgeLive. Note that this only means the zombie edge is removed from the zombie bucket if it exists. It does not mean that we insert an edge into the edgeIndexBucket. I think this means we can ignore point 2 in ziggie's comment.

Because LND deletes the edges and doesn't totally resurrect them by updating edgeIndexBucket, it seems that full nodes can't be serving these outdated announcements. Also, given the behavior above, non-neutrino LND nodes will do the following when receiving a reply_channel_range with bogus timestamp entries:

  • look in the graph and find the entry in either the zombie bucket or not find it at all
  • request the channel's ChannelAnnouncement + ChannelUpdate via QueryShortChanIDs
  • receive the associated ChannelAnnouncement and fail to validate it, marking the channel as a zombie

This points to the spam-serving nodes being neutrino nodes as they won't validate the ChannelAnnouncement due to AssumeChannelValid being true. If the spammy nodes are running neutrino, I'm not sure how they were first "infected" with old channels. I'll try fingerprinting the spammy nodes and see if they are running neutrino.

@ziggie1984
Collaborator Author

This points to the spam-serving nodes being neutrino nodes as they won't validate the ChannelAnnouncement due to AssumeChannelValid being true. If the spammy nodes are running neutrino, I'm not sure how they were first "infected" with old channels. I'll try fingerprinting the spammy nodes and see if they are running neutrino.

Very good analysis. I think running for a short period of time in neutrino mode and then switching back could also infect us with all those channels which are already closed. I think we would never get rid of them without dropping the channel graph.

resurrects zombie channels upon receiving a ChannelUpdate with more recent timestamps via MarkEdgeLive ...

I think as long as we introduce the banman we can ignore the timestamp check in the reply_channel_range msg, as you mentioned. Otherwise it might serve at least as some kind of protection: for example, if we had not marked edges live again based on the timestamp, all of the channels that can't be validated would have ended up in the zombieIndex and hence never been requested again. So I think that without the ban protection, letting a peer resurrect all our zombie channels and send us all the ChannelAnnouncements over and over again might be a vulnerability (in case the rejectCache remains the same size as it is today).

@Crypt-iQ
Collaborator

It seems that these nodes are not neutrino nodes, so I'm not sure how they are storing old announcements...

@ziggie1984
Collaborator Author

One other problem we have to think about is that those nodes are not pruning those old announcements. I think the culprit is that they also have no ChanUpdate for those announcements, hence no entry in the edgeUpdateIndexBucket, which means the channels will not be picked up by the pruneZombieChans job. I think we are missing an entry in this bucket when receiving a channel announcement in general here:

lnd/channeldb/graph.go

Lines 4404 to 4416 in 0aced5c

func putChanEdgePolicyUnknown(edges kvdb.RwBucket, channelID uint64,
	from []byte) error {

	var edgeKey [33 + 8]byte
	copy(edgeKey[:], from)
	byteOrder.PutUint64(edgeKey[33:], channelID)

	if edges.Get(edgeKey[:]) != nil {
		return fmt.Errorf("cannot write unknown policy for channel %v "+
			" when there is already a policy present", channelID)
	}

	return edges.Put(edgeKey[:], unknownPolicy)
}

We probably need to insert a dummy timestamp into the edgeUpdateIndexBucket (maybe the current timestamp) so that channels which never receive a ChanUpdate are also removed by the pruning job after 2 weeks. A rough sketch of what I mean follows below.
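
This is a sketch only: it would live next to putChanEdgePolicyUnknown in channeldb/graph.go, reuses that file's byteOrder and edgeUpdateIndexBucket, and assumes an index key layout of an 8-byte big-endian unix timestamp followed by the 8-byte channel ID. It is not the final fix.

```go
// putUnknownPolicyUpdateIndex is a hypothetical helper: when storing an
// unknown policy for a freshly announced channel, also write a placeholder
// entry into edgeUpdateIndexBucket keyed by the current time, so the
// zombie-pruning job eventually revisits channels that never receive a
// real channel_update.
func putUnknownPolicyUpdateIndex(edges kvdb.RwBucket, channelID uint64,
	now time.Time) error {

	updateIndex, err := edges.CreateBucketIfNotExists(
		edgeUpdateIndexBucket,
	)
	if err != nil {
		return err
	}

	// Assumed key layout: 8-byte big-endian unix time || 8-byte chan ID.
	var indexKey [8 + 8]byte
	byteOrder.PutUint64(indexKey[:8], uint64(now.Unix()))
	byteOrder.PutUint64(indexKey[8:], channelID)

	return updateIndex.Put(indexKey[:], nil)
}
```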

@HamishMacEwan

HamishMacEwan commented Aug 24, 2024

In case more examples are useful:

admin@myNode:~ $ sudo journalctl --no-page -u lnd.service | grep "starting query for"
Aug 24 18:48:31 myNode lnd[10853]: 2024-08-24 18:48:31.474 [INF] DISC: GossipSyncer(026ec3e3438308519a75ca4496822a6c1e229174fbcaadeeb174704c377112c331): starting query for 423 new chans
Aug 24 20:48:32 myNode lnd[10853]: 2024-08-24 20:48:32.949 [INF] DISC: GossipSyncer(03e691f81f08c56fa876cc4ef5c9e8b727bd682cf35605be25d48607a802526053): starting query for 1 new chans
Aug 24 23:48:37 myNode lnd[10853]: 2024-08-24 23:48:37.677 [INF] DISC: GossipSyncer(02f95c6f3498caf4f6811c5edc4ac9806ba345534a18391e94ae9f7af41104df18): starting query for 5 new chans
Aug 25 01:48:29 myNode lnd[10853]: 2024-08-25 01:48:29.672 [INF] DISC: GossipSyncer(03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f): starting query for 2 new chans
Aug 25 02:48:29 myNode lnd[10853]: 2024-08-25 02:48:29.779 [INF] DISC: GossipSyncer(03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f): starting query for 1 new chans
Aug 25 04:48:50 myNode lnd[10853]: 2024-08-25 04:48:50.125 [INF] DISC: GossipSyncer(0236e577964da670c730e7f362fa0c561b75dc9af02c0d822179f155e1455b20bd): starting query for 118268 new chans
Aug 25 05:48:33 myNode lnd[10853]: 2024-08-25 05:48:33.962 [INF] DISC: GossipSyncer(034ea80f8b148c750463546bd999bf7321a0e6dfc60aaf84bd0400a2e8d376c0d5): starting query for 53 new chans
Aug 25 06:48:31 myNode lnd[10853]: 2024-08-25 06:48:31.282 [INF] DISC: GossipSyncer(03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f): starting query for 2 new chans
admin@myNode:~ $

118268 new chans 🙄

The requesting node is a Voltage.cloud lite node of mine that has only the one channel and is running 0.18.2-beta

@saubyk
Collaborator

saubyk commented Aug 29, 2024

Closed with #9009 and #9011

@saubyk saubyk closed this as completed Aug 29, 2024