Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop worker when ctx is canceled during recovered round execution #401

Merged
merged 1 commit into from
Sep 20, 2023

Conversation

poszu
Copy link
Contributor

@poszu poszu commented Sep 19, 2023

Closes #400

Root cause

When the poet was shut down during the recovered round execution, it would not immediately quit but proceed to select from the closed rounds channel and ctx.Done(). It would have a 50% chance to pick up the same round ID again and start it from the start because the registration also recovers a round in-progress and sends it over the closed channel. In a normal use case (when the recovered round finishes) the round recovered from registration is skipped.

The fix

The fix is to quit the Service::loop() immediately if the recovered round either fails or is canceled.

@poszu poszu requested a review from fasmat September 19, 2023 15:22
@codecov
Copy link

codecov bot commented Sep 19, 2023

Codecov Report

Merging #401 (5c143be) into develop (91c73ef) will decrease coverage by 0.2%.
Report is 3 commits behind head on develop.
The diff coverage is 58.3%.

@@            Coverage Diff            @@
##           develop    #401     +/-   ##
=========================================
- Coverage     74.8%   74.6%   -0.2%     
=========================================
  Files           23      23             
  Lines         1898    1904      +6     
=========================================
+ Hits          1421    1422      +1     
- Misses         359     363      +4     
- Partials       118     119      +1     
Files Changed Coverage Δ
service/service.go 79.8% <58.3%> (-0.5%) ⬇️

... and 1 file with indirect coverage changes

@fasmat
Copy link
Member

fasmat commented Sep 19, 2023

I don't understand what the issue is and how this PR fixes it. Could you elaborate?

@poszu
Copy link
Contributor Author

poszu commented Sep 20, 2023

I don't understand what the issue is and how this PR fixes it. Could you elaborate?

Sorry for this. Updated the PR description.

Comment on lines +120 to +125
eg.Go(func() error {
if err := round.Teardown(ctx, err == nil); err != nil {
logger.Warn("round teardown failed", zap.Error(err), zap.Uint("epoch", round.epoch))
}
return nil
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to make this a method? Especially since further below the same code is called again (without the cleanupRound shell?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, the method would need to return:

func() error {
				if err := round.Teardown(ctx, err == nil); err != nil {
					logger.Warn("round teardown failed", zap.Error(err), zap.Uint("epoch", round.epoch))
				}
				return nil
			}
}

to be scheduled on the local eg.

@poszu poszu added this pull request to the merge queue Sep 20, 2023
Merged via the queue into develop with commit 6392954 Sep 20, 2023
10 of 11 checks passed
@fasmat fasmat deleted the 400-stop-worker-when-recovered-round-is-cancelled branch September 20, 2023 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Weird behaviour of poet on BM - no metrics at all & reset state
2 participants