
feat: add max number of elements to non-prio queue #1077

Merged: 9 commits from non-prio-queue-max-capacity into unstable on Mar 25, 2024

Conversation

diegomrsantos (Collaborator) commented Mar 21, 2024:

This PR adds a default limit to the number of elements in the non-priority queue. When this limit has been reached, the peer will be disconnected. The default value is 1024 but can also be configured through a new GossipSub param maxNumElementsInNonPriorityQueue.
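For illustration only, here is a minimal, self-contained sketch of the behaviour described above; the type and names are stand-ins (not the actual pubsubpeer code), and only the default of 1024 is taken from the PR:

```nim
import std/deques

const DefaultMaxNumElementsInNonPriorityQueue = 1024  # default added by this PR

type FakePeer = object
  nonPriorityQueue: Deque[string]
  disconnectRequested: bool

proc enqueueNonPriority(p: var FakePeer, msg: string) =
  ## Once the queue reaches the cap, stop queueing and instead mark the
  ## peer so that it gets disconnected.
  if p.nonPriorityQueue.len >= DefaultMaxNumElementsInNonPriorityQueue:
    p.disconnectRequested = true
  else:
    p.nonPriorityQueue.addLast(msg)
```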


```diff
@@ -181,6 +182,24 @@ proc handle*(p: PubSubPeer, conn: Connection) {.async.} =
   debug "exiting pubsub read loop",
     conn, peer = p, closed = conn.closed

 proc disconnect(p: PubSubPeer) {.async.} =
```
Contributor (arnetheduck):
this doesn't disconnect the peer - it just closes the send connection

Collaborator Author (diegomrsantos), Mar 21, 2024:

I can't disconnect the peer from PubSubPeer; that's available only in GossipSub, and I don't have a reference to it here because of the circular dependency limitation.

Collaborator Author (diegomrsantos):

It seems we emit a Disconnected event that doesn't mean a disconnection.

Collaborator Author (diegomrsantos), Mar 21, 2024:

Let me know if I got it right.

  1. We reset the stream and close the underlying connection here (`await s.conn.close()`), but don't raise any error, so at this point no error happens.
  2. The next time we try to write to the stream, this error happens (`debug "Exception occurred in PubSubPeer.handle",`).
  3. Then this is executed.

So the only difference now is that the underlying connection hasn't been closed?

Contributor (arnetheduck):

> So the only difference now is that the underlying connection hasn't been closed?

Yes. Closing the underlying connection ensures that all ongoing concurrent reads, writes, etc. are aborted and that we get rid of the peer. This is important because we don't know what state gossipsub is in if we just abort here; the peer might or might not have received control messages, so closing the connection entirely is one of the safest and least bad things you can do.

Collaborator Author (diegomrsantos):

I could rename PubSubPeerEventKind.Disconnected to PubSubPeerEventKind.StreamClosed and create a PubSubPeerEventKind.Disconnect; GossipSub would then disconnect the peer when receiving the latter (roughly as sketched below). This would be the closest to the current behavior, I guess. Another option would be to disconnect the peer here (`for topic, peers in p.mesh.mpairs():`), but that would change the current behavior.
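A rough sketch of that proposed event split (these names follow the suggestion above, not the merged code):

```nim
type
  PubSubPeerEventKind = enum
    StreamOpened
    StreamClosed # renamed from Disconnected: only the send stream was closed
    Disconnect   # new: asks GossipSub to actually disconnect the peer

  PubSubPeerEvent = object
    kind: PubSubPeerEventKind
```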

Contributor (arnetheduck):

> But how can I close the underlying connection here?

See the suggested send-all-non-prio-messages trick; it should work because, if there's at least one in-flight future, it will cause mplex to do the disconnection.

Collaborator Author (diegomrsantos):

Wouldn't emitting an event that tells GossipSub to disconnect the peer be simpler and easier to understand?

Contributor (arnetheduck):

I don't know; that might introduce new async ordering problems. I'm not necessarily against it, just that it's a larger and more invasive change.

Collaborator Author (diegomrsantos):

But I think just writing the whole queue might not trigger the disconnection if the futures complete fast enough.

arnetheduck (Contributor) commented Mar 21, 2024:

> @arnetheduck how about replacing https://github.com/vacp2p/nim-libp2p/pull/1077/files#diff-abbf987f19a24c9940cd01bde79a1b1e0819f1c3b6d7ba2efa04ad680d022c78R182-R186 by unsubscribePeer?

Let's start with peer disconnection (to maintain the current behavior); when we reach a stable point with that working, we can consider other strategies (so that we can get a release out).

codecov-commenter commented Mar 21, 2024:

Codecov Report

Attention: Patch coverage is 71.42857%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 84.79%. Comparing base (458b088) to head (067b0f5).


```
@@             Coverage Diff              @@
##           unstable    #1077      +/-   ##
============================================
- Coverage     84.83%   84.79%   -0.04%
============================================
  Files            91       91
  Lines         15405    15420      +15
============================================
+ Hits          13069    13076       +7
- Misses         2336     2344       +8
```
| Files | Coverage | Δ |
|---|---|---|
| libp2p/protocols/pubsub/pubsub.nim | 84.29% <100.00%> | (ø) |
| libp2p/protocols/pubsub/gossipsub.nim | 86.54% <66.66%> | (-0.18%) ⬇️ |
| libp2p/protocols/pubsub/pubsubpeer.nim | 86.14% <70.96%> | (-2.10%) ⬇️ |

... and 1 file with indirect coverage changes

diegomrsantos force-pushed the non-prio-queue-max-capacity branch 2 times, most recently from b5db38e to 484c25d on March 21, 2024 18:19
```nim
  libp2p_gossipsub_non_priority_queue_size.inc(labelValues = [$p.peerId])
  f
if len(p.rpcmessagequeue.nonPriorityQueue) == p.maxNumElementInNonPriorityQueue:
  if not p.disconnected:
```
Contributor (arnetheduck):

If this flag is set, it seems prudent to check in connectOnce that we don't re-attempt connection, i.e. DisconnectionRequested should lead to the send conn never being opened again.
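A minimal sketch of that guard, with a stand-in type rather than the real PubSubPeer:

```nim
type FakePubSubPeer = object
  disconnectRequested: bool  # set once the queue limit was hit

proc connectOnce(p: FakePubSubPeer): bool =
  ## Never re-open the send connection once disconnection was requested.
  if p.disconnectRequested:
    return false
  # ... open the send connection here ...
  return true
```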

Collaborator Author (diegomrsantos):

Thanks, I saw that peers kept being disconnected multiple times, and that was probably the reason. But I had to add the check in connectImpl.

Contributor (arnetheduck):

Well, the non-prio queue should also be cleared (i.e. all sends in it should be failed) at this point, since it will no longer be sent; and if there are any new sends after the disconnection, they should fail.
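As a rough illustration of that point (stand-in type again, not the library code): clearing drops everything still queued, and any send attempted after the disconnection was requested fails outright.

```nim
import std/deques

type FakePeer = object
  nonPriorityQueue: Deque[string]
  disconnectRequested: bool

proc clearNonPriorityQueue(p: var FakePeer) =
  ## On disconnection, drop everything still queued; it will never be sent.
  p.nonPriorityQueue.clear()

proc trySend(p: var FakePeer, msg: string): bool =
  ## New sends after the disconnection was requested fail immediately.
  if p.disconnectRequested:
    return false
  p.nonPriorityQueue.addLast(msg)
  return true
```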

Contributor (arnetheduck):

Also, it maybe makes sense to balance the streamOpened event with a streamClosed in this case; such events typically benefit from being neutralised this way.

Collaborator Author (diegomrsantos), Mar 22, 2024:

Canceling the futures will cause this problem (#1075) again, won't it?

Collaborator Author (diegomrsantos), Mar 22, 2024:

> maybe makes sense to balance the streamOpened event with a streamclosed in this case

Could you please elaborate on what you mean by that? Should I also emit a StreamClosed?

Collaborator Author (diegomrsantos):

> Canceling the futures will cause this problem again #1075, won't it?

Not sure if you meant canceling the futures. stopSendNonPriorityTask clears the queue, and it's called in unsubscribePeer, which should be called when the peer is disconnected.

Contributor (arnetheduck):

> Not sure if you meant canceling the futures.

I meant failing them, not cancelling, but now I realise we always return a completed future instead of storing pending futures in the queue, so there's nothing to fail:

https://github.com/vacp2p/nim-libp2p/blob/unstable/libp2p/protocols/pubsub/pubsubpeer.nim#L341

addLast always returns a completed future when the async queue is unbounded (as it is in this case).
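A small sketch of that behaviour, assuming chronos' AsyncQueue API (newAsyncQueue, addLast, and Future.finished):

```nim
import chronos

proc demo() {.async.} =
  # With the default unbounded queue, addLast never has to wait,
  # so the future it returns is already completed when we get it.
  let q = newAsyncQueue[int]()
  let fut = q.addLast(42)
  doAssert fut.finished()

waitFor demo()
```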

Contributor (arnetheduck):

> Should I also emit a Streamclosed?

This is an idea for a future improvement; it's maybe not needed right now, however.

Contributor (arnetheduck):

There's basically a lifecycle here:

`created -> [stream opened -> stream closed]* -> disconnected`

where the middle two states repeat, so when designing an event API around it, these are the relevant events.
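As a sketch, that lifecycle could be written down like this (state names are illustrative only):

```nim
type PeerLifecycle = enum
  Created
  StreamOpened
  StreamClosed
  Disconnected

proc validTransition(a, b: PeerLifecycle): bool =
  ## The [stream opened -> stream closed] pair may repeat any number of times.
  case a
  of Created: result = b in {StreamOpened, Disconnected}
  of StreamOpened: result = b == StreamClosed
  of StreamClosed: result = b in {StreamOpened, Disconnected}
  of Disconnected: result = false
```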

diegomrsantos self-assigned this Mar 25, 2024
diegomrsantos marked this pull request as ready for review March 25, 2024 14:37
diegomrsantos merged commit 1a707e1 into unstable Mar 25, 2024
9 checks passed
diegomrsantos deleted the non-prio-queue-max-capacity branch March 25, 2024 21:00