Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: wait for comet node shutdown #444

Merged
merged 2 commits into from
Jan 8, 2024
Merged

server: wait for comet node shutdown #444

merged 2 commits into from
Jan 8, 2024

Conversation

jchappelow
Copy link
Member

cometbft's service Start() methods do not block while running, so we the errgroup.Go for this was returning almost right away. This meant that errgroup.Wait() did not wait for shutdown to complete. The effect was frequent errors from cometbft arising from one of our databases / key stores being shutdown too soon.

This is now resolved and the dependencies closers aren't closed until the node is actually stopped.

However, there is still an internal cometbft bug during block sync (before the nodes switches to consensus mode) because of an unsupervised goroutine, blocksync.(*Reactor).poolRoutine. I fixed this bug in blocksync/reactor.go (their code) and the errors go away. We may need to PR cometbft to get clean shutdown during block sync.

@jchappelow jchappelow marked this pull request as draft December 22, 2023 14:53
@jchappelow
Copy link
Member Author

This is r4r, but the shutdown corruption revealed in cometbft is fixed in cometbft/cometbft#1879, which we'd definitely want paired with this if and before it ends up on a release branch.

@jchappelow jchappelow force-pushed the comet-shutdown-wait branch from 4bfbecf to 450e9e3 Compare January 2, 2024 15:17
@jchappelow jchappelow marked this pull request as ready for review January 8, 2024 15:24
@jchappelow
Copy link
Member Author

cometbft/cometbft#1879 was merged and backported to v0.38.x.
This PR now updates the cometbft require to use the fix. This is ok to since it is still on a stable release branch.

go.mod Show resolved Hide resolved
@jchappelow
Copy link
Member Author

jchappelow commented Jan 8, 2024

To test:

  • start kwild
  • observe "comet node is now started" logged, indicating that it's Start has indeed returned and the error group is waiting for the shutdown signal
  • wait for everything to be running, get a few blocks, etc.
  • CTRL+C, and observe cometbft clean shutdown with all of it's "module" stopping and then we see "comet server is stopped", followed by our databases closing, in that order
  • should be no errors about DBs already closed

cometbft's service Start() methods do not block while running,
so we the errgroup.Go for this was returning almost right away.
This meant that errgroup.Wait() did not wait for shutdown to
complete. The effect was frequent errors from cometbft arising
from one of our databases / key stores being shutdown too soon.

This is now resolved and the dependencies `closers` aren't closed
until the node is actually stopped.

However, there is still an internal cometbft bug during block sync
(before the nodes switches to consensus mode) because of an
unsupervised goroutine, `blocksync.(*Reactor).poolRoutine`. I fixed
this bug in blocksync/reactor.go and the errors go away. We may
need to PR cometbft to get clean shutdown during block sync.
@jchappelow jchappelow force-pushed the comet-shutdown-wait branch from 688cac7 to 077b80e Compare January 8, 2024 16:49
@brennanjl brennanjl merged commit 78b2aca into main Jan 8, 2024
2 checks passed
@brennanjl brennanjl deleted the comet-shutdown-wait branch January 8, 2024 17:22
@brennanjl brennanjl added this to the v0.6.4 milestone Jan 8, 2024
@jchappelow jchappelow removed this from the v0.6.4 milestone Jan 9, 2024
brennanjl pushed a commit that referenced this pull request Feb 26, 2024
* server: wait for comet node shutdown

cometbft's service Start() methods do not block while running,
so we the errgroup.Go for this was returning almost right away.
This meant that errgroup.Wait() did not wait for shutdown to
complete. The effect was frequent errors from cometbft arising
from one of our databases / key stores being shutdown too soon.

This is now resolved and the dependencies `closers` aren't closed
until the node is actually stopped.

However, there is still an internal cometbft bug during block sync
(before the nodes switches to consensus mode) because of an
unsupervised goroutine, `blocksync.(*Reactor).poolRoutine`. I fixed
this bug in blocksync/reactor.go and the errors go away. We may
need to PR cometbft to get clean shutdown during block sync.

* deps: update cometbft to v0.38.x HEAD
brennanjl pushed a commit that referenced this pull request Feb 26, 2024
* server: wait for comet node shutdown

cometbft's service Start() methods do not block while running,
so we the errgroup.Go for this was returning almost right away.
This meant that errgroup.Wait() did not wait for shutdown to
complete. The effect was frequent errors from cometbft arising
from one of our databases / key stores being shutdown too soon.

This is now resolved and the dependencies `closers` aren't closed
until the node is actually stopped.

However, there is still an internal cometbft bug during block sync
(before the nodes switches to consensus mode) because of an
unsupervised goroutine, `blocksync.(*Reactor).poolRoutine`. I fixed
this bug in blocksync/reactor.go and the errors go away. We may
need to PR cometbft to get clean shutdown during block sync.

* deps: update cometbft to v0.38.x HEAD
jchappelow added a commit that referenced this pull request Feb 26, 2024
* server: wait for comet node shutdown

cometbft's service Start() methods do not block while running,
so we the errgroup.Go for this was returning almost right away.
This meant that errgroup.Wait() did not wait for shutdown to
complete. The effect was frequent errors from cometbft arising
from one of our databases / key stores being shutdown too soon.

This is now resolved and the dependencies `closers` aren't closed
until the node is actually stopped.

However, there is still an internal cometbft bug during block sync
(before the nodes switches to consensus mode) because of an
unsupervised goroutine, `blocksync.(*Reactor).poolRoutine`. I fixed
this bug in blocksync/reactor.go and the errors go away. We may
need to PR cometbft to get clean shutdown during block sync.

* deps: update cometbft to v0.38.x HEAD
brennanjl pushed a commit that referenced this pull request Feb 26, 2024
* server: wait for comet node shutdown

cometbft's service Start() methods do not block while running,
so we the errgroup.Go for this was returning almost right away.
This meant that errgroup.Wait() did not wait for shutdown to
complete. The effect was frequent errors from cometbft arising
from one of our databases / key stores being shutdown too soon.

This is now resolved and the dependencies `closers` aren't closed
until the node is actually stopped.

However, there is still an internal cometbft bug during block sync
(before the nodes switches to consensus mode) because of an
unsupervised goroutine, `blocksync.(*Reactor).poolRoutine`. I fixed
this bug in blocksync/reactor.go and the errors go away. We may
need to PR cometbft to get clean shutdown during block sync.

* deps: update cometbft to v0.38.x HEAD
brennanjl pushed a commit that referenced this pull request Feb 26, 2024
* server: wait for comet node shutdown

cometbft's service Start() methods do not block while running,
so we the errgroup.Go for this was returning almost right away.
This meant that errgroup.Wait() did not wait for shutdown to
complete. The effect was frequent errors from cometbft arising
from one of our databases / key stores being shutdown too soon.

This is now resolved and the dependencies `closers` aren't closed
until the node is actually stopped.

However, there is still an internal cometbft bug during block sync
(before the nodes switches to consensus mode) because of an
unsupervised goroutine, `blocksync.(*Reactor).poolRoutine`. I fixed
this bug in blocksync/reactor.go and the errors go away. We may
need to PR cometbft to get clean shutdown during block sync.

* deps: update cometbft to v0.38.x HEAD
brennanjl pushed a commit that referenced this pull request Feb 26, 2024
* server: wait for comet node shutdown

cometbft's service Start() methods do not block while running,
so we the errgroup.Go for this was returning almost right away.
This meant that errgroup.Wait() did not wait for shutdown to
complete. The effect was frequent errors from cometbft arising
from one of our databases / key stores being shutdown too soon.

This is now resolved and the dependencies `closers` aren't closed
until the node is actually stopped.

However, there is still an internal cometbft bug during block sync
(before the nodes switches to consensus mode) because of an
unsupervised goroutine, `blocksync.(*Reactor).poolRoutine`. I fixed
this bug in blocksync/reactor.go and the errors go away. We may
need to PR cometbft to get clean shutdown during block sync.

* deps: update cometbft to v0.38.x HEAD
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants