
Recover Runners on Event Stream Restart #482

Merged
merged 5 commits into main from fix/#470-recover-after-event-stream
Oct 31, 2023

Conversation

mpass99
Collaborator

@mpass99 mpass99 commented Oct 23, 2023

Related to #470

I tried to keep the changes small, but the codebase had only been adjusted for a single recovery of the environments and runners. This led to multiple issues where environments and runners were replaced "dirty", i.e. without destroying the existing environment and runner references, leading to leaked goroutines.
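The "dirty replacement" problem can be sketched in a few lines. This is a minimal, hypothetical illustration; the environment type and its destroy method are stand-ins, not Poseidon's actual API:

```go
package main

import (
	"context"
	"fmt"
)

// environment is a stand-in for Poseidon's environment/runner references:
// it owns a background goroutine (think: an event listener) bound to a context.
type environment struct {
	cancel  context.CancelFunc
	stopped chan struct{}
}

func newEnvironment() *environment {
	ctx, cancel := context.WithCancel(context.Background())
	stopped := make(chan struct{})
	go func() {
		defer close(stopped)
		<-ctx.Done() // stand-in for a long-running watch loop
	}()
	return &environment{cancel: cancel, stopped: stopped}
}

// destroy stops the background goroutine and waits for it to exit.
func (e *environment) destroy() {
	e.cancel()
	<-e.stopped
}

func main() {
	env := newEnvironment()
	// A "dirty" replacement would be: env = newEnvironment()
	// The old goroutine would then block on <-ctx.Done() forever.
	env.destroy() // clean replacement: tear down the old reference first
	env = newEnvironment()
	defer env.destroy()
	fmt.Println("replaced without leaking the old goroutine")
}
```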

@mpass99 mpass99 self-assigned this Oct 23, 2023
@MrSerth MrSerth linked an issue Oct 23, 2023 that may be closed by this pull request
@mpass99 mpass99 force-pushed the fix/#470-recover-after-event-stream branch from 0eb0b4c to b6153e5 Compare October 29, 2023 14:31
@codecov

codecov bot commented Oct 29, 2023

Codecov Report

Merging #482 (e54d3cd) into main (809ca03) will decrease coverage by 0.16%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
- Coverage   74.76%   74.60%   -0.16%     
==========================================
  Files          39       39              
  Lines        3653     3737      +84     
==========================================
+ Hits         2731     2788      +57     
- Misses        748      766      +18     
- Partials      174      183       +9     
Files Coverage Δ
cmd/poseidon/main.go 54.95% <100.00%> (+0.22%) ⬆️
internal/api/ws/codeocean_writer.go 78.88% <100.00%> (ø)
internal/environment/aws_manager.go 83.78% <100.00%> (-0.84%) ⬇️
internal/nomad/nomad.go 81.67% <ø> (ø)
internal/runner/abstract_manager.go 57.57% <ø> (+0.85%) ⬆️
internal/runner/execution_environment.go 100.00% <ø> (ø)
internal/runner/inactivity_timer.go 87.03% <ø> (ø)
internal/runner/nomad_runner.go 80.19% <100.00%> (+0.06%) ⬆️
pkg/dto/dto.go 52.38% <ø> (ø)
internal/environment/abstract_manager.go 43.90% <0.00%> (ø)
... and 5 more


@mpass99 mpass99 force-pushed the fix/#470-recover-after-event-stream branch from b6153e5 to c9e9959 Compare October 29, 2023 15:46
@mpass99 mpass99 marked this pull request as ready for review October 29, 2023 16:00
@mpass99 mpass99 requested a review from MrSerth October 29, 2023 16:00
@MrSerth
Member

MrSerth commented Oct 29, 2023

Nicely done, thanks! I think the code looks fine, so I am focusing on the functionality in my review.

As I noticed, you've deployed the code to staging, which allowed me to test it there. I did so using a Ruby exercise containing the following program:

30.times do |index|
  puts index
  sleep 1
end

Then, I went ahead and restarted all Nomad Servers through OpenStack Horizon, definitely causing the Event Stream to break:

[Screenshot from 2023-10-29, 17:57: Nomad server restart in OpenStack Horizon]

This restart triggered two reactions:

  1. CODEOCEAN-JP was triggered
  2. All runner data was deleted locally in Poseidon (besides some more log entries)

However, even after the Nomad servers were back online and the cluster was working again, none of the runners were recovered (all execution environments were reported as having 0 idle runners). Shouldn't this PR resolve exactly that, so that the test I performed is handled correctly by Poseidon? Further, I was wondering whether we should also add some test cases for this behaviour to the CI?

@mpass99
Collaborator Author

mpass99 commented Oct 30, 2023

First of all, thank you for your testing! As you noticed, I had deployed the code to staging and also tested this scenario by restarting the Nomad service, and it worked fine then. But identifying that an error case remains when rebooting the servers is very important!

CODEOCEAN-JP was triggered

I would argue it's valid that CodeOcean receives an error when the requested operation (list_filesystem) fails (in this case, Nomad was unreachable). For me, the question arose whether Poseidon should disclose the whole internal error message. In this case, it was helpful for debugging. We may discuss this in the context of #265.

All runner data was deleted locally in Poseidon (besides some more log entries)

Till this point, the logs look valid.

Even after the Nomad servers were back online and the cluster working again, none of the runners were recovered

The difference in your scenario is that the request timed out. The i/o timeout error is a context.DeadlineExceeded error, the same error we check to decide whether the retry mechanism should continue. We fix this behavior in commit 111cc0f.

Side story: Poseidon does not always behave the same for the same error.
Oct 30 16:24:09 poseidon-terraform poseidon[39546]: time="2023-10-30T16:24:09.906289Z" level=warning msg="Loading Environments failed! Retrying..." error="couldn't load template jobs: couldn't load jobs: error listing Nomad jobs: Get \"https://nomad.internal-codemoon.xopic.de:4646/v1/jobs?namespace=poseidon&prefix=template\": dial tcp 10.224.6.99:4646: i/o timeout" package=environment
Oct 30 16:24:10 poseidon-terraform poseidon[39546]: time="2023-10-30T16:24:10.907828Z" level=debug msg="retrying after error" count=1 error="couldn't load template jobs: couldn't load jobs: error listing Nomad jobs: Get \"https://nomad.internal-codemoon.xopic.de:4646/v1/jobs?namespace=poseidon&prefix=template\": dial tcp 10.224.6.99:4646: i/o timeout" package=util
Oct 30 16:27:57 poseidon-terraform poseidon[39546]: time="2023-10-30T16:27:57.356690Z" level=warning msg="Loading Environments failed! Retrying..." error="couldn't load template jobs: couldn't load jobs: error listing Nomad jobs: Get \"https://nomad.internal-codemoon.xopic.de:4646/v1/jobs?namespace=poseidon&prefix=template\": dial tcp 10.224.6.20:4646: i/o timeout" package=environment
Oct 30 16:27:58 poseidon-terraform poseidon[39546]: time="2023-10-30T16:27:58.359378Z" level=info msg="Stopped KeepEnvironmentsSynced" error="couldn't load template jobs: couldn't load jobs: error listing Nomad jobs: Get \"https://nomad.internal-codemoon.xopic.de:4646/v1/jobs?namespace=poseidon&prefix=template\": dial tcp 10.224.6.20:4646: i/o timeout" package=environment

In the first case, Poseidon retries after the timeout error. In the second, Poseidon stops retrying altogether. Yet the error message looks the same.

The difference lies in the Go standard-library packages involved. The poll package defines DeadlineExceededError, whose message is i/o timeout; it is used e.g. by the net package. The net package defines a timeoutError that also reads i/o timeout, but it additionally declares itself equal to context.DeadlineExceeded (so errors.Is matches). This case triggers the stop of our retry mechanism.

add some test cases to the CI for some behaviour, too?

Good point! In 464b496 I've added four test cases. Can you think of more?

@mpass99 mpass99 force-pushed the fix/#470-recover-after-event-stream branch from 464b496 to 341ceea Compare October 31, 2023 10:57
Commit notes:

- Before, the List function dropped all idleRunners of all environments when fetch was set. Additionally, the replaced environment was not destroyed properly, so a goroutine for it and for all of its idle runners kept running.
- The recovery moved from an approach that loaded the runners only once at startup to a method that can be repeated, e.g. when the Nomad Event Stream connection is interrupted.
- context.DeadlineExceeded should not be used for internal decisions, as this error is widely wrapped by other packages; checking for such wrapped errors can influence internal decisions accidentally. In this case, the retry mechanism checked whether the error was context.DeadlineExceeded and assumed it had been created by the internal context. This assumption was wrong.
@mpass99 mpass99 force-pushed the fix/#470-recover-after-event-stream branch from 341ceea to e54d3cd Compare October 31, 2023 10:57
Member

@MrSerth MrSerth left a comment


Awesome, the improved functionality and the added tests look really nice! I've given the new changes another test run on staging and can confirm that even longer testing periods without any Nomad server are handled properly (besides reporting many events to Sentry, one for each retry...).

Further, the tests are plausible and, in my opinion, cover the most relevant parts of the code! 👏

@MrSerth MrSerth merged commit f259d65 into main Oct 31, 2023
11 of 12 checks passed
@MrSerth MrSerth deleted the fix/#470-recover-after-event-stream branch October 31, 2023 14:49
Successfully merging this pull request may close these issues.

Lost Runners, again