Lost Runners, again #470
I just had to resync the runners again, since we lost some once more.
The runners were lost on Oct 4 from 7:44 to 7:53. Context:
All in all, I would argue that Nomad lost some requests while its health pings to Nomad server 3 were timing out (maybe due to the difference between requests timing out and requests not getting accepted 🤷). As we have too little information about this and no issue within Poseidon has been revealed, I would accept (/ignore) this case and investigate the next occurrences. Possible ways to address this case would be:
Thanks for digging deep on the first occurrence.
Yes. In order to enlarge the monitoring volume, I used our Ansible script. Besides performing the desired file system operation, it also updated Poseidon and thus performed a regular deployment. Hence, the script also touched the Nomad agents before, even though I am not aware of an explicit change.
Regarding the lost runners on October 4th: Are you actually sure about the given time span? My intervention to fix the monitoring instance happened only after I saw on the CodeOcean dashboard that we were missing runners. This was sometime around 6 am UTC. Then, I triggered a resync of the execution environments to Poseidon. Only later, at around 7:30 am (until maybe 8 am) UTC, did I take care of the monitoring instance. Hence, I wonder whether there is another occurrence earlier that day, matching the timestamps of my mails (those times are in CEST, btw.).
Actually, I was investigating the second occurrence you reported on Oct 5 (though its cause could date back to Oct 4).
Yes, there likely is another occurrence. But since the monitoring data simplifies identifying the exact timestamp of each lost runner, I have not yet investigated the first occurrence (for which we have no monitoring data).
Yeah. I hope I was able to clarify the situation?
Indeed, thank you!
Probably. Let's briefly check why the event stream stopped at 1:38 am UTC. I know it's not that trivial, since we are missing the monitoring data. Potentially, we could also consider recovering all runner data from Nomad if the event stream was stopped for longer than X minutes.
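To illustrate that idea, here is a minimal sketch in Go (Poseidon's language) of such a recovery guard. All identifiers (`maxStreamDowntime`, `consumeEventStream`, `recoverRunnersFromNomad`) are hypothetical placeholders rather than Poseidon's actual API, and the threshold is just an assumed value for the "X minutes":

```go
package runner

import (
	"context"
	"errors"
	"log"
	"time"
)

// maxStreamDowntime is the hypothetical "X minutes": if the event stream was
// down longer than this, the local runner state can no longer be trusted.
const maxStreamDowntime = 5 * time.Minute

// consumeEventStream stands in for the actual event-stream consumer; it
// blocks while the stream is healthy and returns the error that broke it.
func consumeEventStream(ctx context.Context) error {
	<-ctx.Done()
	return errors.New("event stream closed")
}

// recoverRunnersFromNomad stands in for a full resync of all runner and
// allocation data from the Nomad API.
func recoverRunnersFromNomad(ctx context.Context) error {
	return nil
}

// watchEventStream supervises the event stream and triggers a full recovery
// only when the stream outage exceeded the threshold.
func watchEventStream(ctx context.Context) {
	for ctx.Err() == nil {
		err := consumeEventStream(ctx)
		stoppedAt := time.Now()
		log.Printf("event stream stopped: %v", err)

		// Placeholder for the reconnect/backoff logic that re-establishes
		// the stream; only the total downtime matters for the check below.
		time.Sleep(time.Second)

		if time.Since(stoppedAt) > maxStreamDowntime {
			// Allocation events (e.g. `running`) may have been missed,
			// so resync from Nomad instead of trusting the local state.
			if err := recoverRunnersFromNomad(ctx); err != nil {
				log.Printf("runner recovery failed: %v", err)
			}
		}
	}
}
```

The point of the sketch is to measure the downtime of the stream itself, not the number of reconnect attempts, and to fall back to a full resync only once that downtime exceeds the threshold.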
Regarding the first occurrence of Oct 4th
Poseidon has not received the `running` event
This is one randomly sampled case. We see that we do not receive the `running` event from Nomad. Only when Poseidon reboots does it recover the allocation. But at `01:57:35` the allocation is rescheduled again, and again we do not receive the `running` event.
The cause of this behavior was the reboot of all hosts triggered by the unattended upgrade.
Also, we have seen in the example case that events fail to appear in the context of the Nomad event stream reinitiation. Therefore, your suggestion of recovering after reinitiating the event stream should improve the situation. Another idea apart from that: since we are moving away from Grafana alerts to Icinga, we could also move our Grafana alert about the 50% empty prewarming pool into Poseidon. This could mean that we check for each execution environment whether its prewarming pool is more than 50% empty.
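A minimal sketch of what such an in-process check could look like, assuming a periodic loop inside Poseidon; the `Environment` interface, `monitorPrewarmingPools`, and the method names are made-up placeholders, not Poseidon's real types:

```go
package health

import (
	"log"
	"time"
)

// Environment abstracts an execution environment for this sketch.
type Environment interface {
	ID() string
	PrewarmingPoolSize() uint
	IdleRunnerCount() uint
}

// monitorPrewarmingPools periodically compares the number of idle runners of
// each environment against its configured prewarming pool size and reports
// pools that are more than half empty.
func monitorPrewarmingPools(listEnvironments func() []Environment, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		for _, environment := range listEnvironments() {
			want := environment.PrewarmingPoolSize()
			have := environment.IdleRunnerCount()
			if want > 0 && have*2 < want {
				log.Printf("environment %s: prewarming pool below 50%% (%d/%d idle runners)",
					environment.ID(), have, want)
			}
		}
	}
}
```

Instead of only logging, a real implementation could expose this condition to Icinga or propagate it to the CodeOcean health check, as mentioned below.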
Nice findings and good presentation of the reboot times 🙂.
I am fine with further increasing the random delay if this could improve the situation. Alternatively, we could also try to define the times ourselves.
👍
Sure, we can check that in Poseidon, too. I like the idea, since this would also propagate further to the CodeOcean health checking 👍.
On October 4th, we lost some runners again, leading to an empty pre-warming pool and errors being displayed to our learners.
Timestamps shown in CEST: