
connectd: avoid use-after-free upon multiple reconnections by a peer #5300

Merged

Conversation

whitslack (Collaborator)

`peer_reconnected` was freeing a `struct peer_reconnected` instance while a pointer to that instance was registered to be passed as an argument to the `retry_peer_connected` callback function. This caused a use-after-free crash when `retry_peer_connected` attempted to reparent the instance to the temporary context.

Instead, never have `peer_reconnected` free a `struct peer_reconnected` instance, and only ever allow such an instance to be freed after the `retry_peer_connected` callback has finished with it. To ensure that the instance is freed even if the connection is closed before the callback can be invoked, parent the instance to the connection rather than to the daemon.

Absent the need to free `struct peer_reconnected` instances outside of the `retry_peer_connected` callback, there is no use for the `reconnected` hashtable, so remove it as well.

See: #5282 (comment)
Fixes: #5282
Fixes: #5284
Changelog-Fixed: connectd no longer crashes when peers reconnect.

This may also fix #5299.
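Roughly, the shape of the change is the following sketch (simplified; apart from `peer_reconnected` and `retry_peer_connected`, names and parameters are illustrative rather than the exact connectd code). Parenting the retry state to the `struct io_conn` means tal frees it automatically when the connection is destroyed, so no manual free and no hashtable bookkeeping are needed:

```c
#include <ccan/io/io.h>
#include <ccan/tal/tal.h>
#include <common/node_id.h>	/* struct node_id */

struct daemon;			/* connectd's top-level state (opaque here) */

/* Illustrative only: the real struct carries more state. */
struct peer_reconnected {
	struct daemon *daemon;
	struct node_id id;
};

static struct io_plan *retry_peer_connected(struct io_conn *conn,
					    struct peer_reconnected *pr)
{
	/* In the real daemon this re-runs the peer_connected logic;
	 * here it is a stub.  pr is still valid: it lives until conn
	 * is freed. */
	return io_close(conn);
}

static struct io_plan *peer_reconnected(struct io_conn *conn,
					struct daemon *daemon,
					const struct node_id *id,
					const void *wait_on)
{
	/* Parent the retry state to the connection, not the daemon:
	 * if the connection dies before retry_peer_connected runs,
	 * tal frees pr along with conn, so nothing leaks and no
	 * dangling pointer survives. */
	struct peer_reconnected *pr = tal(conn, struct peer_reconnected);
	pr->daemon = daemon;
	pr->id = *id;

	/* Sleep until the existing connection for this peer goes away,
	 * then retry with the saved state. */
	return io_wait(conn, wait_on, retry_peer_connected, pr);
}
```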

@whitslack (Collaborator, Author)

This looks to be obviated by #5261.

@whitslack (Collaborator, Author)

Despite my effort to avoid leaking memory, my lightning_connectd process, aside from pegging the CPU at 100% continuously, is also leaking memory like a sieve. After only a day and a half of uptime, the process is using 246 MiB of memory. (lightningd is using 1.7 GiB of memory, more than double what v0.10.2 had been using after running for weeks.) I really don't understand why these processes are holding so much state in RAM.

@whitslack (Collaborator, Author) commented Jun 6, 2022

lightning_connectd is now using 2.3 GiB of RAM. 😠

Does anyone know if I can undo the database schema changes (manually) and successfully go back to running v0.10.2? Aside from the resource leaks, v0.11.1 is costing me several channels a day due to "internal error" closures. 😢

@rustyrussell (Contributor)

Re: memleak. I looked at my node, and the immediate workaround is `common/gossip_rcvd_filter.c` line 78, change 10000 to 100.

The question of why we're not cleaning this cache correctly will take some diag.

@rustyrussell (Contributor)

(For future reference: attaching a debugger to the process and running `p tal_dump()` will have it dump all the memory allocations to stderr, which is how I found this.)
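In practice that can look something like the following (a minimal sketch, assuming a build with debug symbols; `tal_dump()` writes the whole tal allocation tree to connectd's stderr):

```
$ gdb -p "$(pidof lightning_connectd)"
(gdb) p tal_dump()
(gdb) detach
(gdb) quit
```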

@whitslack (Collaborator, Author) commented Jun 7, 2022

> `common/gossip_rcvd_filter.c` line 78, change 10000 to 100.

@rustyrussell: Okay, I'll give that a shot, but I don't see how even 10000 gossip entries in a cache would be consuming 2.5 GiB of memory. That would be an average of 262 KiB per gossip entry. (Obviously that ignores all other users of memory, but presumably the leaked memory comprises the vast majority of the used memory.)

@rustyrussell (Contributor)

It might not be the only source, but it's definitely the memory hog on my node (it's per-peer, with two caches per peer).

@whitslack (Collaborator, Author)

> It might not be the only source, but it's definitely the memory hog on my node (it's per-peer, with two caches per peer).

@rustyrussell: #5312 completely solves it for me. lightning_connectd has been using about 13 MiB the entire time it's been running (which is close to a day now).

@rustyrussell (Contributor)

I believe this reintroduces the "multiple connections" bug, where we can have multiple reconnections, and a stale one eventually gets restored (which is what the patch which introduced this htable was avoiding).

One alternative is to have us only try reconnecting once: if we reconnect and it doesn't work, simply free it (by closing the connection).

(I was hoping to get the more significant rewrite in, but it's stalled on a heap of timing changes which broke CI)

@whitslack (Collaborator, Author) commented Jun 16, 2022

@rustyrussell: Let me see if I understand the logic. There can be multiple connections with a given peer but only one "active" connection with any given peer at any given instant. The first duplicate connection from a peer to arrive will cause the current active connection with that peer to be terminated. While that is occurring, a second duplicate connection from the same peer can arrive. Both duplicate connections will be stalled, waiting for the active connection to be dropped. Once the active connection is dropped, then the first duplicate connection will become the active connection. I think the intention at this point is that this new active connection will then immediately be dropped, and subsequently the second duplicate connection will become the active connection. This effectively forms a FIFO queue, where only the last (i.e., the most recently arrived) duplicate connection is allowed to survive as the active connection.

I am not confident about the timing of the message that is sent to the master daemon to tell it to drop the active connection for the peer. I have a spidey sense that something like the following could occur:

  1. First duplicate connection arrives. connectd sends message to lightningd to disconnect the active connection for this peer.
  2. Second duplicate connection arrives. connectd sends message to lightningd to disconnect the active connection for this peer.
  3. lightningd receives the message sent in step 1, arranges for the active connection with the peer to be killed, and informs connectd that this has been done.
  4. lightningd receives the message sent in step 2, but there is no active connection for the peer. Whoops.
  5. connectd receives the confirmation sent in step 3 and turns the first duplicate connection into the active connection.
  6. Uhh?? Is the second duplicate connection simply left hanging at this point?

It seems like the intent of the htable was to discard the first duplicate connection (which has not yet become the active connection) when the second duplicate connection arrives, but the code wasn't actually doing that, as it was not actually closing that connection but was merely deallocating the structure that the I/O callback function would later reference.
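To make that failure mode concrete, the old flow followed roughly this pattern (heavily simplified; the htable accessor names here are illustrative, not verbatim connectd code):

```c
/* A newer duplicate connection arrives for the same peer... */
struct peer_reconnected *old, *pr;

old = reconnected_htable_get(&daemon->reconnected, id);
if (old) {
	reconnected_htable_del(&daemon->reconnected, old);
	/* BUG: this frees the struct, but the earlier duplicate
	 * connection is still alive, and ccan/io still holds `old`
	 * as the argument for its pending retry_peer_connected
	 * callback.  When that callback eventually fires, it
	 * dereferences freed memory: use-after-free. */
	tal_free(old);
}

/* Old design: parented to the daemon and tracked in the htable. */
pr = tal(daemon, struct peer_reconnected);
reconnected_htable_add(&daemon->reconnected, pr);
```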

@rustyrussell (Contributor)

> @rustyrussell: Let me see if I understand the logic. There can be multiple connections with a given peer but only one "active" connection with any given peer at any given instant. The first duplicate connection from a peer to arrive will cause the current active connection with that peer to be terminated. While that is occurring, a second duplicate connection from the same peer can arrive. Both duplicate connections will be stalled, waiting for the active connection to be dropped.

So far, so good.

> Once the active connection is dropped, then the first duplicate connection will become the active connection. I think the intention at this point is that this new active connection will then immediately be dropped, and subsequently the second duplicate connection will become the active connection. This effectively forms a FIFO queue, where only the last (i.e., the most recently arrived) duplicate connection is allowed to survive as the active connection.

Well, it's not clear which will actually get in next, and which will go back to waiting. The intent of the code is that there can only ever be one waiting: if there's one already waiting, take its place.

> I am not confident about the timing of the message that is sent to the master daemon to tell it to drop the active connection for the peer. I have a spidey sense that something like the following could occur:
>
> 1. First duplicate connection arrives. `connectd` sends message to `lightningd` to disconnect the active connection for this peer.
> 2. Second duplicate connection arrives. `connectd` sends message to `lightningd` to disconnect the active connection for this peer.
> 3. `lightningd` receives the message sent in step 1, arranges for the active connection with the peer to be killed, and informs `connectd` that this has been done.
> 4. `lightningd` receives the message sent in step 2, but there is no active connection for the peer. Whoops.
> 5. `connectd` receives the confirmation sent in step 3 and turns the first duplicate connection into the active connection.
> 6. Uhh?? Is the second duplicate connection simply left hanging at this point?

Possibly. The rewrite eschews all this for a smoother flow of "kill current connection, tell lightningd it's dead, then tell it about new connection like normal".

> It seems like the intent of the htable was to discard the first duplicate connection (which has not yet become the active connection) when the second duplicate connection arrives, but the code wasn't actually doing that, as it was not actually closing that connection but was merely deallocating the structure that the I/O callback function would later reference.

Indeed! A quick fix would have been to keep the conn pointer in the struct, and free that...
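In code, that "quick fix" would look roughly like this (a sketch with illustrative names, not a real patch): keep the waiting connection in the struct, and tear down the connection itself before discarding the state, so the pending callback on that connection can never fire against freed memory.

```c
struct peer_reconnected {
	struct io_conn *conn;	/* the duplicate connection waiting its turn */
	struct daemon *daemon;
	struct node_id id;
};

/* On a newer duplicate, evict the one already waiting: */
struct peer_reconnected *old = reconnected_htable_get(&daemon->reconnected, id);
if (old) {
	reconnected_htable_del(&daemon->reconnected, old);
	/* Freeing the io_conn closes the socket and discards its
	 * pending retry_peer_connected plan, so freeing the struct
	 * afterwards cannot leave a dangling callback argument. */
	tal_free(old->conn);
	tal_free(old);
}
```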

Happens occasionally when running
`tests/test_connection.py::test_mutual_reconnect_race` (which is too
flaky to add without more fixes):


```
lightningd: lightningd/peer_control.c:1252: peer_active: Assertion `!channel->owner' failed.
lightningd: FATAL SIGNAL 6 (version v0.11.0.1-38-g4f167da)
0x5594a41f8f45 send_backtrace
	common/daemon.c:33
0x5594a41f8fef crashdump
	common/daemon.c:46
0x7f7cb585c08f ???
	/build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x7f7cb585c00b __GI_raise
	../sysdeps/unix/sysv/linux/raise.c:51
0x7f7cb583b858 __GI_abort
	/build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79
0x7f7cb583b728 __assert_fail_base
	/build/glibc-SzIz7B/glibc-2.31/assert/assert.c:92
0x7f7cb584cfd5 __GI___assert_fail
	/build/glibc-SzIz7B/glibc-2.31/assert/assert.c:101
0x5594a41b45ca peer_active
	lightningd/peer_control.c:1252
0x5594a418794c connectd_msg
	lightningd/connect_control.c:457
0x5594a41cd457 sd_msg_read
	lightningd/subd.c:556
0x5594a41ccbe5 read_fds
	lightningd/subd.c:357
0x5594a4269fc2 next_plan
	ccan/ccan/io/io.c:59
0x5594a426abca do_plan
	ccan/ccan/io/io.c:407
0x5594a426ac0c io_ready
	ccan/ccan/io/io.c:417
0x5594a426ceff io_loop
	ccan/ccan/io/poll.c:453
0x5594a41930d9 io_loop_with_timers
	lightningd/io_loop_with_timers.c:22
0x5594a4199293 main
	lightningd/lightningd.c:1181
0x7f7cb583d082 __libc_start_main
	../csu/libc-start.c:308
0x5594a416e15d ???
	???:0
0xffffffffffffffff ???
	???:0
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This was fixed in 1c495ca ("connectd:
fix accidental handling of old reconnections.") and then reverted by
the rework in "connectd: avoid use-after-free upon multiple
reconnections by a peer".

The latter made the race much less likely, since we cleaned up the
reconnecting struct once the connection was hung up by the remote
node, but it's still theoretically possible.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell (Contributor)

I added two minimal fixes to this, which resulted from stress-testing (the test unfortunately hangs frequently, since without other fixes the connect command can race).

This is because I want to do an 11.2 release with this fix and the memleak fix... basically, the "whitslack" point release.

rustyrussell added this to the v0.11.2 milestone on Jun 19, 2022
rustyrussell merged commit fd90e57 into ElementsProject:master on Jun 28, 2022
whitslack deleted the connectd_use_after_free branch on August 30, 2022