Recover Runners on Event Stream Restart #482
force-pushed from 0eb0b4c to b6153e5
Codecov Report
```diff
@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
- Coverage   74.76%   74.60%   -0.16%
==========================================
  Files          39       39
  Lines        3653     3737      +84
==========================================
+ Hits         2731     2788      +57
- Misses        748      766      +18
- Partials      174      183       +9
```
force-pushed from b6153e5 to c9e9959
Nicely done, thanks! I think the code looks fine, so I am focusing on the functionality in my review. As I've noticed, you've deployed the code to staging, which allowed me to test it there. I did so using a Ruby exercise containing the following program:

```ruby
30.times do |index|
  puts index
  sleep 1
end
```

Then, I went ahead and restarted all Nomad servers through OpenStack Horizon, definitely causing the Event Stream to break. This action triggered two actions. However, even after the Nomad servers were back online and the cluster was working again, none of the runners were recovered (all execution environments were reported as having 0 idle runners). Shouldn't this be resolved with this PR, so that the test I performed is handled correctly by Poseidon? Further, I was wondering whether we should add some test cases to the CI for this behaviour, too?
First of all, thank you for your testing! As you have noticed, I've deployed the code to staging and also tested this scenario by restarting the Nomad service; it worked fine then. But identifying that an error case remains when rebooting the servers is very important!
I would argue it's valid that CodeOcean receives an error when the requested operation (…) fails.

Up to this point, the logs look valid.
The difference in your scenario is that the request timed out. The i/o timeout error is a wrapped context.DeadlineExceeded.

Side story: Poseidon does not always behave the same for the same error. In the first version, Poseidon retries after the timeout error; in the second version, Poseidon stops retrying altogether. But the error looks the same. The difference lies in the Go packages that produce it.
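To illustrate the ambiguity, here is a minimal, self-contained sketch (not Poseidon's actual code; the blackhole address and timeout are made up): the error value is constructed by net/http, yet it wraps context.DeadlineExceeded, so a retry mechanism using errors.Is to detect its *own* internal deadline can be fooled by a timeout that some other layer derived from a context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Deadline enforced by the HTTP layer while dialing an unreachable
	// address (10.255.255.1 is a non-routable address that hangs).
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://10.255.255.1/", nil)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err == nil {
		resp.Body.Close()
	}

	// net/http wraps the context error into its error chain, so this
	// check succeeds although our code never inspected the context
	// itself. A retry loop using this check to detect its own internal
	// deadline would misread such a timeout.
	fmt.Println(err)
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}
```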
Good point! In 464b496 I've added four test cases. Can you think of more?
force-pushed from 464b496 to 341ceea
Previously, the List function dropped all idleRunners of all environments when fetch was set. Additionally, the replaced environment was not destroyed properly, so a goroutine for it and for each of its idle runners kept running.
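To make such a "dirty" replacement concrete, here is a hedged sketch with made-up types and names (environment, monitor, replace; this is not Poseidon's actual API): overwriting the map entry without cancelling the old value leaves its goroutine running with no reference left to stop it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// environment is a hypothetical stand-in for an execution environment:
// it owns a context whose cancellation stops the goroutine monitoring
// the environment and its idle runners.
type environment struct {
	id     string
	ctx    context.Context
	cancel context.CancelFunc
}

func newEnvironment(id string) *environment {
	ctx, cancel := context.WithCancel(context.Background())
	e := &environment{id: id, ctx: ctx, cancel: cancel}
	go e.monitor()
	return e
}

func (e *environment) monitor() {
	<-e.ctx.Done() // placeholder for the real monitoring loop
	fmt.Println("monitor stopped:", e.id)
}

// replace destroys the old environment before storing the new one.
// Skipping the cancel() call is exactly the dirty replacement: the old
// monitor goroutine would keep running forever.
func replace(envs map[string]*environment, next *environment) {
	if old, ok := envs[next.id]; ok {
		old.cancel()
	}
	envs[next.id] = next
}

func main() {
	envs := map[string]*environment{}
	replace(envs, newEnvironment("java"))
	replace(envs, newEnvironment("java")) // old one is cancelled, not leaked
	time.Sleep(100 * time.Millisecond)
}
```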
(…) from an approach that loaded the runners only once at startup to a method that is repeated, e.g. when the Nomad Event Stream connection is interrupted.
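A rough sketch of such a repeatable recovery loop (recoverRunners, watchEventStream, and the backoff are hypothetical stand-ins, not the actual implementation):

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

const retryBackoff = time.Second

// Hypothetical stand-ins for the real Nomad calls.
func recoverRunners(ctx context.Context) error { return nil }

func watchEventStream(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(2 * time.Second): // pretend the stream breaks
		return errors.New("event stream closed")
	}
}

// keepRunnersInSync repeats the recovery whenever the event stream ends,
// instead of loading the runners only once at startup.
func keepRunnersInSync(ctx context.Context) {
	for ctx.Err() == nil {
		if err := recoverRunners(ctx); err != nil {
			log.Printf("recovery failed: %v", err)
		}
		// Blocks until the Event Stream breaks; returning re-enters
		// the loop and thereby repeats the recovery.
		if err := watchEventStream(ctx); err != nil {
			log.Printf("event stream interrupted: %v", err)
		}
		time.Sleep(retryBackoff) // avoid a hot loop while Nomad is down
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	keepRunnersInSync(ctx)
}
```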
(…) for internal decisions, as this error is widely used by other packages. By checking such wrapped errors, the internal decision can be influenced accidentally. In this case, the retry mechanism checked whether the error is context.DeadlineExceeded and assumed it would have been created by the internal context. This assumption was wrong.
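One common way out, shown here as a sketch rather than the commit's actual change, is to translate the context error into a package-own sentinel at the single place where the internal context is known to be the cause, and let callers match that sentinel instead:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrRetryDeadline is our own sentinel: only this package can produce
// it, so matching it cannot be triggered by a context.DeadlineExceeded
// wrapped by net/http or any other package.
var ErrRetryDeadline = errors.New("retry deadline reached")

func retryUntil(ctx context.Context, attempt func() error) error {
	for {
		if err := attempt(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Here, and only here, we know the internal context is
			// the cause, so we wrap our unambiguous sentinel.
			return fmt.Errorf("%w: %v", ErrRetryDeadline, ctx.Err())
		case <-time.After(100 * time.Millisecond):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := retryUntil(ctx, func() error { return errors.New("nomad unreachable") })
	// Callers check the sentinel instead of context.DeadlineExceeded.
	fmt.Println(errors.Is(err, ErrRetryDeadline)) // true
}
```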
force-pushed from 341ceea to e54d3cd
Awesome, the improved functionality and the added tests look really nice! I've given the new changes another test run on staging and can confirm that even longer periods without any Nomad server are handled properly (besides reporting many events to Sentry, one for each retry...).
Further, the tests are plausible and, in my opinion, cover the most relevant parts of the code! 👏
Related to #470
I tried to keep the changes small, but the codebase was designed for only a single recovery of the environments and runners. This led to multiple issues where environments and runners were replaced "dirty": without destroying the environment and runner references, leading to leaked goroutines.
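Regarding CI coverage for such leaks: a test could verify that no goroutine survives a replacement. A sketch using go.uber.org/goleak (assuming that library fits the project; the package and test names here are made up):

```go
package runner_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain lets goleak check, after all tests of the package have run,
// that no unexpected goroutine is still alive.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// Alternatively per test: fail if the replacement leaks a goroutine.
func TestReplaceEnvironmentDoesNotLeak(t *testing.T) {
	defer goleak.VerifyNone(t)
	// ... replace an environment and destroy all references to it ...
}
```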