Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NRG: Wait for goroutines to shutdown when recreating group #5832

Merged
merged 3 commits into from
Nov 5, 2024

Conversation

neilalexander
Copy link
Member

@neilalexander neilalexander commented Aug 26, 2024

This should fix some logical races where multiple sets of goroutines can fight over the same store directory, i.e. when shutting down and recreating a group rapidly.

Signed-off-by: Neil Twigg neil@nats.io

@neilalexander neilalexander marked this pull request as ready for review August 26, 2024 22:22
@neilalexander neilalexander requested a review from a team as a code owner August 26, 2024 22:22
@neilalexander neilalexander changed the title NRG: Wait for goroutines on shutdown NRG (2.11): Wait for goroutines on shutdown Aug 27, 2024
@neilalexander neilalexander changed the title NRG (2.11): Wait for goroutines on shutdown NRG (2.11): Wait for goroutines on shutdown, hold JS lock on group create Sep 2, 2024
@neilalexander neilalexander marked this pull request as draft September 2, 2024 18:41
@neilalexander neilalexander changed the title NRG (2.11): Wait for goroutines on shutdown, hold JS lock on group create NRG (2.11): Wait for goroutines to shutdown when recreating group Sep 18, 2024
@neilalexander neilalexander force-pushed the neil/nrgwg branch 6 times, most recently from 61b9a2e to 3cc6288 Compare September 18, 2024 10:48
@neilalexander neilalexander marked this pull request as ready for review September 18, 2024 12:10
@neilalexander neilalexander changed the title NRG (2.11): Wait for goroutines to shutdown when recreating group NRG: Wait for goroutines to shutdown when recreating group Sep 20, 2024
@neilalexander
Copy link
Member Author

Have rebased, this one is ready for review.

server/jetstream_cluster.go Show resolved Hide resolved
server/raft.go Outdated
// Wait for goroutines to shut down before trying to clean up
// the log below, otherwise we might end up deleting the store
// from underneath running goroutines.
n.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have these patterns for sure elsewhere but I try hard to avoid them if at all possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can avoid it here, it's the only way we can potentially avoid a deadlock if the runAs goroutine is finishing up processing some operations.

@neilalexander neilalexander force-pushed the neil/nrgwg branch 3 times, most recently from 03b9f1d to 4a02a47 Compare October 4, 2024 10:48
@neilalexander neilalexander force-pushed the neil/nrgwg branch 2 times, most recently from 566c9cc to dcd8167 Compare October 18, 2024 16:02
server/jetstream_cluster.go Show resolved Hide resolved
server/raft.go Outdated
@@ -1728,12 +1731,17 @@ func (n *raft) Stop() {
n.shutdown(false)
}

func (n *raft) WaitForStop() {
n.wg.Wait()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check that the state is set to stopped to avoid calling in here and never returning?

server/raft.go Outdated
@@ -1744,12 +1752,12 @@ func (n *raft) shutdown(shouldDelete bool) {
if n.state.Swap(int32(Closed)) == int32(Closed) {
// If we get called again with shouldDelete, in case we were called first with Stop() cleanup
if shouldDelete {
n.wg.Wait()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we hold the lock here this may never return.

This should fix some logical races where multiple sets of goroutines
can fight over the same store directory, i.e. when shutting down and
recreating a group rapidly.

Signed-off-by: Neil Twigg <neil@nats.io>
Signed-off-by: Neil Twigg <neil@nats.io>
@neilalexander neilalexander marked this pull request as draft November 5, 2024 13:03
@neilalexander neilalexander force-pushed the neil/nrgwg branch 3 times, most recently from 3624123 to dc65149 Compare November 5, 2024 14:07
Signed-off-by: Neil Twigg <neil@nats.io>
@neilalexander
Copy link
Member Author

I have shuffled things around a bit:

  • The run goroutine is now responsible for cleaning up once the loop ends, which is a server-monitored goroutine;
  • The shutdown function now just signals the shutdown and doesn't block the caller;
  • The WaitForStop function only blocks now if the Raft group is really in a closed state;
  • We otherwise only block if we're looking up a Raft node and find that one already exists.

@neilalexander neilalexander marked this pull request as ready for review November 5, 2024 16:06
@derekcollison derekcollison self-requested a review November 5, 2024 18:33
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit ec877ee into main Nov 5, 2024
5 checks passed
@derekcollison derekcollison deleted the neil/nrgwg branch November 5, 2024 18:38
neilalexander added a commit that referenced this pull request Nov 25, 2024
Includes the following:

- #5661
- #5666
- #5671
- #5344
- #5684
- #5689
- #5691
- #5714
- #5717
- #5707
- #5792
- #5912
- #5957
- #5700
- #5975
- #5991
- #5987
- #6027
- #6038
- #6053
- #5848
- #6055
- #6056
- #6060
- #6061
- #6072
- #5832
- #6073
- #6107

Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants