invoiceregistry: re-design invoice registry notification internals, remove layers of mutexes #8023
Labels
bug
Unintended code behaviour
concurrency
epic
Issues created to track large feature development
needs triage
P1
MUST be fixed or reviewed
refactoring
Milestone
For the past month or so, we've been hunting down some deadlocks in the daemon. After receiving a blocking profile, we realize that the invoice registry, and in particular the hodl invoice registration and notification code was the culprit. In the past, we've had some incremental fixes to attempt to resolve the issue. However, it's clear now that we should bite the bullet and refactor the internals to eliminate the layers of mutexes, in favor of:
subscribe
package where relevant in favor of singly buffered channels. The package uses an unbounded concurrent queue under the hood. Within the invoice registry today, there're assumptions that singly buffered channels are enough as channels are never sent on twice, in practice we've seen this to be false.Along the way, we may want to adopt a more type safe subscription/notification variant from one of our related codebases:
https://github.com/lightninglabs/taproot-assets/blob/603078bd3569886da20a36eb1ae1a89098a62aa5/fn/events.go#L62-L78C2
Short Term Fixes
After digging through the code a bit, I found some candidates for shorter term fixes to alleviate the issue, while we go to the drawing board to re-work the internal architecture:
hodlChan
this would ensure that the mainnotifyHodlSubscribes
method won't block on the send to thesubscribers
channel. AFAICT, this is part of the root issue as the main mutex is held here, and if the channel send blocks, then nothing else can proceed.subscriber
type into a struct. This struct would then hold an internal cancel chan that's closed when the subscriber wants to exit. The main send loop would then listen on this, if sent, then it can remove those entries early. With this, then theHodlUnsubscribeAll
method no longer needs to try grab that main mutex. Instead, it can just cancel the internal channel and exit.HodlUnsubscribeAll
drain thesubcriber
channel before trying to grab the mutex. This would ensure that thenotifyHodlSubscribes
loop won't block and cause circular waiting.notifyHodlSubscribes
in a goroutine. Go routines go brrr, so we eliminate blocking in that area.hodlChan
to just be a concurrent queue using thatsubscribe
package.I lean towards either
#5
, or#2
for short term fixes.I also think we should re-examine the practice of using a channel as the key to a map. Can't find anything concrete re pitfalls, but it sets my spidey senses off....
The text was updated successfully, but these errors were encountered: