Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: avoid deadlocks in query and broadcast behaviours #63

Merged
merged 3 commits into from
Oct 12, 2023
Merged

Conversation

iand
Copy link
Contributor

@iand iand commented Oct 12, 2023

The query and broadcast behaviours notify query initiators of the ongoing progress of a query or broadcast. They also notify when the query or broadcast has finished.

This set of changes fixes two types of deadlock:

  1. slow consumer - originally the notification was sent on a channel with no defined behaviour for slow consumers of the channel. This could cause the query behaviour to block preventing any other query from progressing. This was particularly evident when a query completes since the last successful response was notified followed immediately by a finished notification to the same channel. This has been fixed by intruducing a QueryMonitor type that buffers progress events that cannot be notified. The QueryMonitor also uses a separate channel for notifying the completion of a query or broadcast and has better defined semantics for when notifications will be sent and when they will stop being sent.
  2. reentrancy - originally the Notify and Perform methods of the behaviours were guarded by a single mutex since both could advance the state of the embedded state machine. However, notifying a query initiator could cause the intiator to call the Notify method to stop the query. Since the notification was made while the mutex was held this would deadlock on the call to Notify. This has been fixed by separating the locking behaviour between Notify and Perform and refactoring the logic to ensure that the state machines are advanced by Perform only. Notify now only queues the inbound event.

@iand iand linked an issue Oct 12, 2023 that may be closed by this pull request
internal/coord/behaviour.go Show resolved Hide resolved
// The sender may attempt to drain any pending notifications before closing the other channels.
// The NotifyFinished channel will be closed once the sender has attempted to send the Finished notification.
NotifyFinished() chan<- CtxEvent[E]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the current usage of the QueryMonitor I think it would be a nicer API if these were just regular methods that accepted CtxEvent[*EventQueryProgressed] and CtxEvent[E] events. I can only see it used down below as:

func (w *queryNotifier[E]) TryNotifyProgressed(ctx context.Context, ev *EventQueryProgressed) bool {
	if w.stopping {
		return false
	}
	ce := CtxEvent[*EventQueryProgressed]{Ctx: ctx, Event: ev}
	select {
	case w.monitor.NotifyProgressed() <- ce:
		return true
	default:
		w.pending = append(w.pending, ce)
		return false
	}
}

func (w *queryNotifier[E]) NotifyFinished(ctx context.Context, ev E) {
	w.stopping = true
	w.DrainPending()
	close(w.monitor.NotifyProgressed())

	select {
	case w.monitor.NotifyFinished() <- CtxEvent[E]{Ctx: ctx, Event: ev}:
	default:
	}
	close(w.monitor.NotifyFinished())
}

This requires users of the types that implement this interface to deal with quite some internal details.

Alternative suggestion:

// A QueryMonitor receives event notifications on the progress of a query
type QueryMonitor[E TerminalQueryEvent] interface {
	NotifyProgressed(e CtxEvent[*EventQueryProgressed]) bool // indicating successful notification
	NotifyFinished(e CtxEvent[E])
}

closing the specific channels could happen inside the type that implements that interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach doesn't give the caller of QueryMonitor, which is a behaviiour, any control over the blocking behaviour. A channel allows the behaviour to detect and avoid blocking. A method call could do anything and moves the slow consumer problem into the monitor implementation.

close(w.monitor.NotifyProgressed())

select {
case w.monitor.NotifyFinished() <- CtxEvent[E]{Ctx: ctx, Event: ev}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a good overview but could it be a problem if we enter the default case here? Other parts might rely on receiving the finished event (not an event due to closing the channel)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but the channel is closed so the consumer will always know the query has finished. The alternative is to block, possibly forever if the consumer has gone away or is blocked themselves. And we would also not have a good place to clean up monitors for finished queries. The QueryMonitor documentation states it's the responsibility of the implementation to have capacity to accept one single finished notification.

This is similar to how https://pkg.go.dev/os/signal#Notify handles notification. The owner of the channel must ensure sufficient capacity, in our case a capacity of 1.

@iand iand merged commit b049f28 into main Oct 12, 2023
7 checks passed
@iand iand deleted the query-deadlock branch October 12, 2023 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Deadlock in QueryBehaviour
2 participants