
Lost Runners, again #470

Closed

MrSerth opened this issue Oct 4, 2023 · 8 comments · Fixed by #482 or #486
Labels
bug Something isn't working

Comments

@MrSerth
Member

MrSerth commented Oct 4, 2023

On October 4th, we lost some runners again, leading to an empty pre-warming pool and errors being displayed to our learners.

Timestamps shown in CEST:
[Screenshot: Bildschirmfoto 2023-10-04 um 08 38 48]

MrSerth added the bug label Oct 4, 2023
@MrSerth
Member Author

MrSerth commented Oct 5, 2023

I just had to resync the runners again, since we lost some once more.

@mpass99
Collaborator

mpass99 commented Oct 8, 2023

The runners were lost on Oct 4 from 7:44 to 7:53.

Context:

  • Poseidon was restarted at 7:43 UTC
    • Did you do a deployment then?
  • The recovery of runners and allocations lasted until 7:50
  • During minute 7:44, several file copy requests to Nomad failed while the Nomad event stream continued. After that, the requests succeeded.
  • At 7:50 the Nomad event stream timed out and was restarted.
  • At 7:53 the first runner was created again.
  • Between the Poseidon restart and 7:53
    • every requested runner left the pool with one less idle runner that was not replaced.
    • The Nomad server logs show issues reaching Nomad server 3: nomad.stats_fetcher: failed retrieving server health: server=nomad-server-terraform-3.global error="context deadline exceeded"
    • The Nomad agents did not report issues; their logs show running executions.
    • Poseidon received events from Nomad notifying that some allocations were running, but no events about pending allocations.

All in all, I would argue that Nomad lost some requests while its health pings to Nomad server 3 were timing out (maybe due to the difference between requests timing out and requests not being accepted 🤷). Since we have too little information about this and no issue within Poseidon has been revealed, I would accept (/ignore) this case and investigate the next occurrences.

Possible ways to address this case would be (a rough sketch of option 2 follows the list):

  1. Log every Evaluation (update) via InfluxDB. Evaluations are Nomad's mechanism for tracking requests: each longer-running request (such as creating a job) returns immediately with an evaluation ID, and further status updates for the evaluation are provided via the event stream.
  2. Introduce alerting if an evaluation is not closed within a timespan (of, say, 10 minutes).
  3. Introduce alerting if an evaluation does not lead to (a pending and then) running job/allocation within a timespan.
  4. Extend the retry mechanism to also recreate the runner if it is not created within the timespan.
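
To make option 2 a bit more concrete, here is a minimal sketch of an in-memory tracker, assuming hypothetical names (`evaluationTracker`, `Register`, `Close`) that would be fed the evaluation IDs returned by Nomad requests and the corresponding completion events from the event stream. This is not existing Poseidon code, just an illustration of the idea:

```go
package nomad

import (
	"log"
	"sync"
	"time"
)

// evaluationTracker remembers every evaluation ID returned by a Nomad request
// and warns if an evaluation is not closed within the configured timeout.
type evaluationTracker struct {
	mu      sync.Mutex
	pending map[string]time.Time // evaluation ID -> registration time
	timeout time.Duration
}

func newEvaluationTracker(timeout time.Duration) *evaluationTracker {
	t := &evaluationTracker{pending: map[string]time.Time{}, timeout: timeout}
	go t.watch()
	return t
}

// Register is called with the evaluation ID returned by, e.g., a job registration.
func (t *evaluationTracker) Register(evalID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[evalID] = time.Now()
}

// Close is called when the event stream reports the evaluation as complete.
func (t *evaluationTracker) Close(evalID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, evalID)
}

// watch periodically looks for evaluations exceeding the timeout and raises an
// alert (here just a log statement) for each of them.
func (t *evaluationTracker) watch() {
	for range time.Tick(time.Minute) {
		t.mu.Lock()
		for evalID, registeredAt := range t.pending {
			if time.Since(registeredAt) > t.timeout {
				log.Printf("evaluation %s not closed within %s", evalID, t.timeout)
			}
		}
		t.mu.Unlock()
	}
}
```

Option 1 would then amount to writing the same register/close events to InfluxDB instead of (or in addition to) keeping them in memory.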

@MrSerth
Member Author

MrSerth commented Oct 8, 2023

Thanks for digging deep on the first occurrence.

Poseidon was restarted at 7:43 UTC

  • Have you done a deployment then?

Yes. In order to enlarge the monitoring volume, I used our Ansible script. Besides performing the desired file system operation, it also updated Poseidon and thus performed a regular deployment. The script also touched the Nomad agents beforehand, even though I am not aware of an explicit change.

@MrSerth
Member Author

MrSerth commented Oct 8, 2023

Regarding the lost runners on October 4th: Are you actually sure about the given time span? I mean, my intervention to fix the monitoring instance actually happened after I saw on the CodeOcean dashboard that we were missing runners. This was sometime around 6 am UTC. Then, I triggered a resync of the execution environments to Poseidon. Only later, at around 7:30 am (until maybe 8 am) UTC, did I take care of the monitoring instance. Hence, I wonder whether there is another occurrence earlier that day, matching the timestamps of my mails (those times are in CEST, btw.).

@mpass99
Collaborator

mpass99 commented Oct 9, 2023

Thanks for digging deep on the first occurrence.

Actually, I was investigating the second occurrence you reported on Oct 5 (although the cause dates back to Oct 4).
I have not investigated the first occurrence so far due to the missing monitoring data.

Hence, I wonder whether there is another occurrence earlier that day, matching the timestamps of my mails

Yes, there likely is another occurrence. But since the monitoring data simplifies identifying the exact timestamp of each lost runner, I have not investigated the first occurrence (which lacks monitoring data) so far.

Are you actually sure about the given time span?

Yeah. I hope I was able to clarify the situation?

@MrSerth
Member Author

MrSerth commented Oct 9, 2023

Yeah. I hope I was able to clarify the situation?

Indeed, thank you!

Yes, there likely is another occurrence.

Probably. Let's briefly check why the event stream stopped at 1:38 am UTC. I know it's not that trivial, since we are missing the monitoring data.

Potentially, we could also consider recovering all runner data from Nomad if the event stream was stopped for longer than X minutes.

@mpass99
Collaborator

mpass99 commented Oct 12, 2023

Regarding the first occurrence of Oct 4th

  • The Nomad event stream had to be reestablished 8 times
    • 1:58:34 AM UTC, 1:56:40, 1:55:34, 1:39:30, 1:39:02, 1:38:23, 1:38:09, 1:36:21
  • 1:36: Reboot of Nomad Server 1
  • 1:37: Reboot of Nomad Agent 3
  • 1:38: Reboot of Nomad Agent 2
  • 1:39: Reboot of Nomad Server 2, and Server 3
  • 1:48: Reboot of Poseidon
  • 1:55: Reboot of Nomad Agent 4
  • 1:57: Reboot of Nomad Agent 1
Poseidon has not received the `running` event

This is one randomly sampled case. We see that we do not receive the `running` event from Nomad. Only when Poseidon reboots does it recover the allocation. But at `01:57:35` the allocation is rescheduled again, and again we do not receive the `running` event.
time="2023-10-04T00:08:17.322172Z" level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T00:08:17.796986Z" level=debug msg="Ignoring duplicate event" allocID=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T00:08:18.098020Z" level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T00:08:18.098088Z" level=debug msg="Runner started" package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23 startupDuration=775.858824ms
time="2023-10-04T01:37:05.085521Z" level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation=1c2d98bd-8160-14e7-1974-0cf0960976f3 alloc_id=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 package=nomad
time="2023-10-04T01:37:05.085601Z" level=debug msg="Runner stopped" package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23
time="2023-10-04T01:37:05.086574Z" level=debug msg="Ignoring unknown allocation" allocID=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T01:38:06.645665Z" level=debug msg="Ignoring unknown allocation" allocID=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T01:48:42.474372Z" level=debug msg="Recovered Runner" isUsed=false package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23
time="2023-10-04T01:57:35.558902Z" level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 alloc_id=7ea30198-0e26-cc2d-2314-2341e2cd8afd package=nomad
time="2023-10-04T01:57:35.558965Z" level=debug msg="Runner stopped" package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23
time="2023-10-04T01:57:35.560901Z" level=debug msg="Ignoring unknown allocation" allocID=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 package=nomad
time="2023-10-04T01:58:33.719590Z" level=debug msg="Ignoring unknown allocation" allocID=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 package=nomad
time="2023-10-04T06:11:32.377782Z" level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=stop NextAllocation= PrevAllocation=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 alloc_id=7ea30198-0e26-cc2d-2314-2341e2cd8afd package=nomad
time="2023-10-04T06:11:32.758506Z" level=debug msg="Handle Allocation Event" ClientStatus=complete DesiredStatus=stop NextAllocation= PrevAllocation=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 alloc_id=7ea30198-0e26-cc2d-2314-2341e2cd8afd package=nomad
time="2023-10-04T06:11:32.758593Z" level=debug msg="Runner stopped" package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23
time="2023-10-04T06:43:13.505543Z" level=debug msg="Ignoring unknown allocation" allocID=7ea30198-0e26-cc2d-2314-2341e2cd8afd package=nomad

The cause of this behavior was the reboot of all hosts triggered by the unattended upgrades.
I have not dug deep enough to tell why Nomad fails the event stream just because its hosts get restarted.
However, it may be related to the short timespan between the reboots, as Nomad might not be able to recover from one host going down before the next one is already down 🤷 If so, increasing the random delay of the reboots might improve the situation.

Potentially, we could also consider recovering all runner data from Nomad if the event stream was stopped for longer than X minutes.

Also, we have seen in the example case that events fail to appear around the reinitiation of the Nomad event stream. Therefore, your suggestion of recovering after reinitiating the event stream should improve the situation.
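
A rough sketch of how that recovery could be hooked into the event-stream loop, assuming hypothetical `openStream` and `recoverRunners` callbacks (illustrative only, not the actual Poseidon implementation):

```go
package nomad

import (
	"context"
	"time"
)

// KeepEventStreamAlive reestablishes the Nomad event stream whenever it ends.
// If the stream was down for longer than the given threshold, all runner data
// is recovered from the Nomad API before events are consumed again.
// openStream opens the stream and returns a consume function that blocks until
// the stream is lost; recoverRunners reloads the allocations from Nomad.
func KeepEventStreamAlive(
	ctx context.Context,
	threshold time.Duration,
	openStream func(context.Context) (func() error, error),
	recoverRunners func(context.Context) error,
) {
	var lostAt time.Time
	for ctx.Err() == nil {
		consume, err := openStream(ctx)
		if err != nil {
			time.Sleep(time.Second) // a real implementation would back off
			continue
		}
		// The stream is established again: if it was lost for longer than the
		// threshold, events may have been missed, so recover the runners first.
		if !lostAt.IsZero() && time.Since(lostAt) > threshold {
			_ = recoverRunners(ctx)
		}
		_ = consume()       // blocks until the stream ends or errors
		lostAt = time.Now() // remember when the stream was lost
	}
}
```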

Another idea apart from that: Since we are moving away from Grafana Alerts to Icinga, we could also move our Grafana alert about the 50% empty prewarming pool into Poseidon. This could mean that whenever a runner is sampled, we check whether the prewarming pool still contains at least 50% of the configured runners. If not, we propagate the status to the health route, start a timeout of 5 minutes, and then check again. If we still have that few runners, we send an alert/warning and resync the environment. What do you think?
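
A minimal sketch of that check, assuming a hypothetical Environment interface and a callback to the health route (the names are illustrative, not the existing Poseidon interfaces):

```go
package health

import (
	"log"
	"time"
)

// Environment is a minimal stand-in for a Poseidon execution environment
// (illustrative only).
type Environment interface {
	ID() int
	PrewarmingPoolSize() uint // configured number of idle runners
	IdleRunnerCount() uint    // idle runners currently available
	Resync() error            // recreate missing runners via Nomad
}

// checkPrewarmingPool sketches the proposed check: if the pool holds less than
// 50% of the configured idle runners, mark the environment unhealthy, wait five
// minutes, and if the pool is still depleted, warn and resync the environment.
func checkPrewarmingPool(environment Environment, setUnhealthy func(environmentID int)) {
	if !poolDepleted(environment) {
		return
	}
	setUnhealthy(environment.ID()) // propagate the status to the health route

	time.Sleep(5 * time.Minute) // a real implementation would use a timer/context

	if poolDepleted(environment) {
		log.Printf("prewarming pool of environment %d still below 50%%, resyncing", environment.ID())
		if err := environment.Resync(); err != nil {
			log.Printf("resync of environment %d failed: %v", environment.ID(), err)
		}
	}
}

// poolDepleted reports whether the pool holds less than half of its configured size.
func poolDepleted(environment Environment) bool {
	return environment.IdleRunnerCount() < environment.PrewarmingPoolSize()/2
}
```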

@MrSerth
Member Author

MrSerth commented Oct 12, 2023

Nice findings and good presentation of the reboot times 🙂.

If so, increasing the random delay of the reboots might improve the situation.

I am fine with further increasing the random delay if this could improve the situation. Alternatively, we could also try to define the reboot times ourselves.

Therefore, your suggestion of recovering after reinitiating the event stream should improve the situation.

👍

What do you think?

Sure, we can check that in Poseidon, too. I like the idea, since this would also propagate further to CodeOcean's health checks 👍.
