-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: force-closing local discovery TCP server could always time out #498
Conversation
We have a way to force-close the local discovery TCP server. This was susceptible to the following race condition bug: 1. A connection opens and is established. 2. A duplicate connection opens. [We decide to keep the new one. We start to close the old connection and enqueue a "promotion" of the new one.][2] 3. We close the TCP server. This does not immediately halt the server; it ["stops the server from accepting new connections and keeps existing connections"][docs]. [We iterate through all existing connections and destroy them][3], but the duplicate connection is not yet in this list so it is never destroyed and the server will never close. 4. The old connection is closed and ["promotion" of the new connection completes][4]. It is never destroyed. This caused [intermittent CI failures][ci] and should be fixed by this change. [docs]: https://nodejs.org/docs/latest-v18.x/api/net.html#serverclosecallback [ci]: https://github.com/digidem/mapeo-core-next/actions/runs/7996074898/job/21837815901 [2]: https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L229-L232 [3]: https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L278-L280 [4]: https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L173
// Bail if the server has already stopped by this point. This should be | ||
// rare but can happen during connection swaps if the new connection is | ||
// "promoted" after the server's doors have already been closed. | ||
if (!this.#server.listening) { | ||
this.#log('server stopped, destroying connection %h', remotePublicKey) | ||
conn.destroy() | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been unable to get my head deep enough into this code to fully understand the problem/solution. I think we are still missing a piece around gracefully closing connections to peers. Currently I think peers stay connected until the connection errors or they are force-terminated. Ideally what would happen is:
- Peer starts intention to close connections (e.g. when going into the background)
- Peer immediately stops accepting new connections
- For each connection to a peer, if all data is "in sync", close the connection
- Else, wait peers to be "in sync", then close the connection
- If after a timeout, peers are still not "in sync", then force close the connection
This comment probably belongs somewhere else, just thought I'd initially drop it here since I think it's relevant to the logic here.
Moving to draft because we might delete a lot of this code. |
We have a way to force-close the local discovery TCP server. This was susceptible to the following race condition bug:
A connection opens and is established.
A duplicate connection opens. We decide to keep the newer one. We start to close the old connection and enqueue a "promotion" of the new one.
https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L229-L232
We close the TCP server. This does not immediately halt the server; it "stops the server from accepting new connections and keeps existing connections". In other words, it closes the doors but doesn't kick people out.
We iterate through all existing connections and destroy them...
https://github.com/digidem/mapeo-core-next/blob/5e05e35200297435f4ebd33df2c5b1444ed4ef1b/src/discovery/local-discovery.js#L278-L280
...but the duplicate connection is not yet in this list so it is never destroyed and the server will never close.
The old connection is closed and "promotion" of the new connection completes. It is never destroyed.
This caused intermittent CI failures and should be fixed by this change.