
Recover Runners on Event Stream Restart #482

Merged
merged 5 commits into main from fix/#470-recover-after-event-stream
Oct 31, 2023

Conversation

mpass99
Collaborator

@mpass99 mpass99 commented Oct 23, 2023

Related to #470

I tried to keep the changes small, but the codebase had only been adjusted for a single recovery of the environments and runners. This led to multiple issues where environments and runners were replaced "dirty", i.e. without destroying the existing environment and runner references, leading to leaked goroutines.
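The "dirty replacement" problem can be sketched in a few lines. This is a minimal, hypothetical illustration; the environment type and its destroy method are stand-ins, not Poseidon's actual API:

```go
package main

import (
	"context"
	"fmt"
)

// environment is a stand-in for Poseidon's environment/runner references:
// it owns a background goroutine (think: an event listener) bound to a context.
type environment struct {
	cancel  context.CancelFunc
	stopped chan struct{}
}

func newEnvironment() *environment {
	ctx, cancel := context.WithCancel(context.Background())
	stopped := make(chan struct{})
	go func() {
		defer close(stopped)
		<-ctx.Done() // stand-in for a long-running watch loop
	}()
	return &environment{cancel: cancel, stopped: stopped}
}

// destroy stops the background goroutine and waits for it to exit.
func (e *environment) destroy() {
	e.cancel()
	<-e.stopped
}

func main() {
	env := newEnvironment()
	// A "dirty" replacement would be: env = newEnvironment()
	// The old goroutine would then block on <-ctx.Done() forever.
	env.destroy() // clean replacement: tear down the old reference first
	env = newEnvironment()
	defer env.destroy()
	fmt.Println("replaced without leaking the old goroutine")
}
```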

@mpass99 mpass99 self-assigned this Oct 23, 2023
@MrSerth MrSerth linked an issue Oct 23, 2023 that may be closed by this pull request
@mpass99 mpass99 force-pushed the fix/#470-recover-after-event-stream branch from 0eb0b4c to b6153e5 Compare October 29, 2023 14:31
@codecov

codecov bot commented Oct 29, 2023

Codecov Report

Merging #482 (e54d3cd) into main (809ca03) will decrease coverage by 0.16%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
- Coverage   74.76%   74.60%   -0.16%     
==========================================
  Files          39       39              
  Lines        3653     3737      +84     
==========================================
+ Hits         2731     2788      +57     
- Misses        748      766      +18     
- Partials      174      183       +9     
Files Coverage Δ
cmd/poseidon/main.go 54.95% <100.00%> (+0.22%) ⬆️
internal/api/ws/codeocean_writer.go 78.88% <100.00%> (ø)
internal/environment/aws_manager.go 83.78% <100.00%> (-0.84%) ⬇️
internal/nomad/nomad.go 81.67% <ø> (ø)
internal/runner/abstract_manager.go 57.57% <ø> (+0.85%) ⬆️
internal/runner/execution_environment.go 100.00% <ø> (ø)
internal/runner/inactivity_timer.go 87.03% <ø> (ø)
internal/runner/nomad_runner.go 80.19% <100.00%> (+0.06%) ⬆️
pkg/dto/dto.go 52.38% <ø> (ø)
internal/environment/abstract_manager.go 43.90% <0.00%> (ø)
... and 5 more


@mpass99 mpass99 force-pushed the fix/#470-recover-after-event-stream branch from b6153e5 to c9e9959 Compare October 29, 2023 15:46
@mpass99 mpass99 marked this pull request as ready for review October 29, 2023 16:00
@mpass99 mpass99 requested a review from MrSerth October 29, 2023 16:00
@MrSerth
Member

MrSerth commented Oct 29, 2023

Nicely done, thanks! I think the code looks fine, so I am focusing on the functionality in my review.

As I noticed, you've deployed the code to staging, which allowed me to test it there. I did so using a Ruby exercise containing the following program:

30.times do |index|
  puts index
  sleep 1
end

Then, I went ahead and restarted all Nomad Servers through OpenStack Horizon, definitely causing the Event Stream to break:

[Screenshot from 2023-10-29, 17:57: Nomad server restart in OpenStack Horizon]

This restart triggered two reactions:

  1. CODEOCEAN-JP was triggered
  2. All runner data was deleted locally in Poseidon (besides some more log entries)

However, even after the Nomad servers were back online and the cluster was working again, none of the runners were recovered (all execution environments were reported as having 0 idle runners). Shouldn't this PR resolve exactly that, so that the test I performed is handled correctly by Poseidon? Further, I was wondering whether we should also add some test cases for this behaviour to the CI?

@mpass99
Collaborator Author

mpass99 commented Oct 30, 2023

First of all, thank you for your testing! As you noticed, I had deployed the code to staging and also tested this scenario by restarting the Nomad service, and it worked fine then. But identifying that an error case remains when rebooting the servers is very important!

CODEOCEAN-JP was triggered

I would argue it's valid that CodeOcean receives an error when the requested operation (list_filesystem) fails (in this case, Nomad was unreachable). For me, the question arose whether Poseidon should disclose the whole internal error message. In this case, it was helpful for debugging. We may discuss this in the context of #265.

All runner data was deleted locally in Poseidon (besides some more log entries)

Till this point, the logs look valid.

Even after the Nomad servers were back online and the cluster working again, none of the runners were recovered

The difference in your scenario is that the request timed out. The i/o timeout error is a context.DeadlineExceeded error, the same error we check to decide whether the retry mechanism should continue. We fix this behavior in commit 111cc0f.

Side story: Poseidon does not always behave the same for the same error.
Oct 30 16:24:09 poseidon-terraform poseidon[39546]: time="2023-10-30T16:24:09.906289Z" level=warning msg="Loading Environments failed! Retrying..." error="couldn't load template jobs: couldn't load jobs: error listing Nomad jobs: Get \"https://nomad.internal-codemoon.xopic.de:4646/v1/jobs?namespace=poseidon&prefix=template\": dial tcp 10.224.6.99:4646: i/o timeout" package=environment
Oct 30 16:24:10 poseidon-terraform poseidon[39546]: time="2023-10-30T16:24:10.907828Z" level=debug msg="retrying after error" count=1 error="couldn't load template jobs: couldn't load jobs: error listing Nomad jobs: Get \"https://nomad.internal-codemoon.xopic.de:4646/v1/jobs?namespace=poseidon&prefix=template\": dial tcp 10.224.6.99:4646: i/o timeout" package=util
Oct 30 16:27:57 poseidon-terraform poseidon[39546]: time="2023-10-30T16:27:57.356690Z" level=warning msg="Loading Environments failed! Retrying..." error="couldn't load template jobs: couldn't load jobs: error listing Nomad jobs: Get \"https://nomad.internal-codemoon.xopic.de:4646/v1/jobs?namespace=poseidon&prefix=template\": dial tcp 10.224.6.20:4646: i/o timeout" package=environment
Oct 30 16:27:58 poseidon-terraform poseidon[39546]: time="2023-10-30T16:27:58.359378Z" level=info msg="Stopped KeepEnvironmentsSynced" error="couldn't load template jobs: couldn't load jobs: error listing Nomad jobs: Get \"https://nomad.internal-codemoon.xopic.de:4646/v1/jobs?namespace=poseidon&prefix=template\": dial tcp 10.224.6.20:4646: i/o timeout" package=environment

In the first case, Poseidon retries after the timeout error. In the second, Poseidon stops retrying altogether. Yet the error message looks the same.

The difference lies in the Go standard-library packages involved. The poll package defines DeadlineExceededError, whose message is i/o timeout; it is used e.g. by the net package. The net package defines a timeoutError that also reads i/o timeout, but it additionally declares itself equal to context.DeadlineExceeded (so errors.Is matches). This case triggers the stop of our retry mechanism.

add some test cases to the CI for some behaviour, too?

Good point! In 464b496 I've added four test cases. Can you think of more?

@mpass99 mpass99 force-pushed the fix/#470-recover-after-event-stream branch from 464b496 to 341ceea Compare October 31, 2023 10:57
Commit notes:

- Before, the List function dropped all idleRunners of all environments when fetch was set. Additionally, the replaced environment was not destroyed properly, so a goroutine for it and for all of its idle runners kept running.
- The recovery moved from an approach that loaded the runners only once at startup to a method that can be repeated, e.g. when the Nomad Event Stream connection is interrupted.
- context.DeadlineExceeded should not be used for internal decisions, as this error is widely wrapped by other packages; checking for such wrapped errors can influence internal decisions accidentally. In this case, the retry mechanism checked whether the error was context.DeadlineExceeded and assumed it had been created by the internal context. This assumption was wrong.
@mpass99 mpass99 force-pushed the fix/#470-recover-after-event-stream branch from 341ceea to e54d3cd Compare October 31, 2023 10:57
Member

@MrSerth MrSerth left a comment


Awesome, the improved functionality and the added tests look really nice! I've given the new changes another test run on staging and can confirm that even longer testing periods without any Nomad server are handled properly (besides reporting many events to Sentry, one for each retry...).

Further, the tests are plausible and, in my opinion, cover the most relevant parts of the code! 👏

@MrSerth MrSerth merged commit f259d65 into main Oct 31, 2023
11 of 12 checks passed
@MrSerth MrSerth deleted the fix/#470-recover-after-event-stream branch October 31, 2023 14:49
Successfully merging this pull request may close these issues.

Lost Runners, again