fix: avoid deadlocks in query and broadcast behaviours #63

iand · 2023-10-12T14:37:48Z

The query and broadcast behaviours notify query initiators of the ongoing progress of a query or broadcast. They also notify when the query or broadcast has finished.

This set of changes fixes two types of deadlock:

slow consumer - originally the notification was sent on a channel with no defined behaviour for slow consumers of the channel. This could cause the query behaviour to block, preventing any other query from progressing. This was particularly evident when a query completed, since the last successful response was notified and then immediately followed by a finished notification on the same channel. This has been fixed by introducing a QueryMonitor type that buffers progress events that cannot be notified. The QueryMonitor also uses a separate channel for notifying the completion of a query or broadcast and has better defined semantics for when notifications will be sent and when they will stop being sent.
reentrancy - originally the Notify and Perform methods of the behaviours were guarded by a single mutex since both could advance the state of the embedded state machine. However, notifying a query initiator could cause the initiator to call the Notify method to stop the query. Since the notification was made while the mutex was held, this would deadlock on the call to Notify. This has been fixed by separating the locking behaviour between Notify and Perform and refactoring the logic to ensure that the state machines are advanced by Perform only. Notify now only queues the inbound event.
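
To illustrate the reentrancy fix, here is a minimal sketch of the locking split described above. It is not the code from this PR: the `behaviour` type, the `onProgress` callback, and the string events are hypothetical stand-ins chosen to keep the example small. The point it shows is that Notify takes only the queue lock, so an initiator can call it from inside a notification while Perform is running.

```go
package main

import (
	"fmt"
	"sync"
)

// behaviour is a hypothetical stand-in for a query behaviour.
type behaviour struct {
	inboundMu sync.Mutex // guards only the inbound queue
	inbound   []string

	performMu  sync.Mutex         // guards the state machine; never taken by Notify
	state      []string           // events applied to the toy state machine
	onProgress func(b *behaviour) // initiator callback; may call Notify
}

// Notify only queues the event. It never advances the state machine,
// so it is safe to call from inside a notification callback.
func (b *behaviour) Notify(ev string) {
	b.inboundMu.Lock()
	defer b.inboundMu.Unlock()
	b.inbound = append(b.inbound, ev)
}

// Perform drains the inbound queue and is the only place where the
// state machine is advanced.
func (b *behaviour) Perform() {
	b.performMu.Lock()
	defer b.performMu.Unlock()

	for {
		b.inboundMu.Lock()
		if len(b.inbound) == 0 {
			b.inboundMu.Unlock()
			return
		}
		ev := b.inbound[0]
		b.inbound = b.inbound[1:]
		b.inboundMu.Unlock()

		b.state = append(b.state, ev)
		if ev == "progress" && b.onProgress != nil {
			// inboundMu is not held here, so a re-entrant Notify cannot deadlock.
			b.onProgress(b)
		}
	}
}

// runReentrant simulates an initiator that stops the query as soon as
// it is told about progress.
func runReentrant() []string {
	b := &behaviour{}
	b.onProgress = func(b *behaviour) { b.Notify("stop") }
	b.Notify("progress")
	b.Perform()
	return b.state
}

func main() {
	fmt.Println(runReentrant())
}
```

With a single mutex shared by Notify and Perform, the callback in this sketch would block forever on its call to Notify.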

dennis-tra · 2023-10-12T15:11:06Z

internal/coord/behaviour.go

+	// The sender may attempt to drain any pending notifications before closing the other channels.
+	// The NotifyFinished channel will be closed once the sender has attempted to send the Finished notification.
+	NotifyFinished() chan<- CtxEvent[E]
+}


Given the current usage of the QueryMonitor I think it would be a nicer API if these were just regular methods that accepted CtxEvent[*EventQueryProgressed] and CtxEvent[E] events. I can only see it used down below as:

func (w *queryNotifier[E]) TryNotifyProgressed(ctx context.Context, ev *EventQueryProgressed) bool {
	if w.stopping {
		return false
	}
	ce := CtxEvent[*EventQueryProgressed]{Ctx: ctx, Event: ev}
	select {
	case w.monitor.NotifyProgressed() <- ce:
		return true
	default:
		w.pending = append(w.pending, ce)
		return false
	}
}

func (w *queryNotifier[E]) NotifyFinished(ctx context.Context, ev E) {
	w.stopping = true
	w.DrainPending()
	close(w.monitor.NotifyProgressed())

	select {
	case w.monitor.NotifyFinished() <- CtxEvent[E]{Ctx: ctx, Event: ev}:
	default:
	}
	close(w.monitor.NotifyFinished())
}

This requires users of the types that implement this interface to deal with quite some internal details.

Alternative suggestion:

// A QueryMonitor receives event notifications on the progress of a query
type QueryMonitor[E TerminalQueryEvent] interface {
	NotifyProgressed(e CtxEvent[*EventQueryProgressed]) bool // indicating successful notification
	NotifyFinished(e CtxEvent[E])
}

Closing the specific channels could happen inside the type that implements that interface.

This approach doesn't give the caller of QueryMonitor, which is a behaviour, any control over the blocking behaviour. A channel allows the behaviour to detect and avoid blocking. A method call could do anything and moves the slow consumer problem into the monitor implementation.
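
As a rough sketch of why a channel lets the sender detect a slow consumer: a `select` with a `default` case tells the sender immediately whether the send would block, and the sender can then buffer the event itself. The `notifier` type and `drainPending` below are hypothetical simplifications for illustration (the PR's own DrainPending is not shown in this thread, so its behaviour here is an assumption).

```go
package main

import "fmt"

// notifier is a hypothetical, simplified sender that never blocks.
type notifier struct {
	ch      chan string
	pending []string // events that could not be delivered yet
}

// tryNotify attempts a non-blocking send and buffers the event on failure.
func (n *notifier) tryNotify(ev string) bool {
	select {
	case n.ch <- ev:
		return true
	default:
		n.pending = append(n.pending, ev)
		return false
	}
}

// drainPending retries buffered events in order and stops at the first
// one that would block.
func (n *notifier) drainPending() {
	for len(n.pending) > 0 {
		select {
		case n.ch <- n.pending[0]:
			n.pending = n.pending[1:]
		default:
			return
		}
	}
}

// demo sends two events to a consumer with capacity for one, then lets
// the consumer catch up.
func demo() (first, second bool, stillPending int) {
	n := &notifier{ch: make(chan string, 1)}
	first = n.tryNotify("a")  // channel has room
	second = n.tryNotify("b") // channel is full, event is buffered
	<-n.ch                    // consumer reads "a"
	n.drainPending()          // "b" can now be delivered
	return first, second, len(n.pending)
}

func main() {
	fmt.Println(demo())
}
```

With a plain method call on the monitor, the sender has no equivalent of the `default` branch: it cannot tell whether the call is about to block.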

dennis-tra · 2023-10-12T15:15:22Z

internal/coord/query.go

+	close(w.monitor.NotifyProgressed())
+
+	select {
+	case w.monitor.NotifyFinished() <- CtxEvent[E]{Ctx: ctx, Event: ev}:


I don't have a good overview, but could it be a problem if we enter the default case here? Other parts might rely on receiving the finished event itself (not just observing that the channel was closed).

Maybe, but the channel is closed so the consumer will always know the query has finished. The alternative is to block, possibly forever if the consumer has gone away or is blocked themselves. And we would also not have a good place to clean up monitors for finished queries. The QueryMonitor documentation states it's the responsibility of the implementation to have capacity to accept one single finished notification.

This is similar to how https://pkg.go.dev/os/signal#Notify handles notification. The owner of the channel must ensure sufficient capacity, in our case a capacity of 1.
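
A small sketch of the consumer-side consequence, assuming a plain `chan string` in place of the real CtxEvent channel (the function names here are hypothetical). A consumer that provides capacity 1 receives the finished event; one that provides no capacity and is not receiving at the time misses the event but still observes the close.

```go
package main

import "fmt"

// notifyFinished mirrors the shape of the snippet under review:
// one non-blocking send, then close.
func notifyFinished(ch chan string, ev string) {
	select {
	case ch <- ev:
	default: // consumer has no capacity; the event is dropped
	}
	close(ch)
}

// receiveAll reads until the channel is closed and returns what it saw.
func receiveAll(ch chan string) []string {
	var events []string
	for ev := range ch {
		events = append(events, ev)
	}
	return events
}

func main() {
	buffered := make(chan string, 1)
	notifyFinished(buffered, "finished")
	fmt.Println(len(receiveAll(buffered))) // event delivered, then close observed

	unbuffered := make(chan string)
	notifyFinished(unbuffered, "finished")
	fmt.Println(len(receiveAll(unbuffered))) // only the close is observed
}
```

In both cases the `range` loop terminates, which is the "consumer will always know the query has finished" property mentioned above.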

iand added 3 commits October 11, 2023 12:41

fix: avoid deadlocks in query behaviour

e42172a

Add inbound event queue to query behaviour

bf35bb2

Refactor perform logic in query and broadcast behaviours

bdbee9c

iand requested review from guillaumemichel and dennis-tra as code owners October 12, 2023 14:37

iand linked an issue Oct 12, 2023 that may be closed by this pull request

Deadlock in QueryBehaviour #57

Closed

iand mentioned this pull request Oct 12, 2023

Add deadlock regression test #61

Open

dennis-tra approved these changes Oct 12, 2023

iand merged commit b049f28 into main Oct 12, 2023
7 checks passed

iand deleted the query-deadlock branch October 12, 2023 15:35

iand mentioned this pull request Oct 16, 2023

Flaking test: TestDHT_SearchValue_quorum_test_suite/TestQuorumReachedPrematurely #48

Closed
