
Lost Runners, again #470

Closed

MrSerth opened this issue Oct 4, 2023 · 8 comments · Fixed by #482 or #486
Labels
bug Something isn't working

Comments

@MrSerth
Member

MrSerth commented Oct 4, 2023

On October 4th, we lost some runners again, leading to an empty pre-warming pool and errors being displayed to our learners.

Timestamps shown in CEST:
[Screenshot: Bildschirmfoto 2023-10-04 um 08 38 48]

MrSerth added the bug label Oct 4, 2023
@MrSerth
Member Author

MrSerth commented Oct 5, 2023

I just had to resync the runners again, since we lost some once more.

@mpass99
Collaborator

mpass99 commented Oct 8, 2023

The runners were lost on Oct 4 from 7:44 to 7:53.

Context:

  • Poseidon was restarted at 7:43 UTC
    • Did you do a deployment then?
  • The recovery of runners and allocations lasted until 7:50
  • During minute 7:44, several file copy requests to Nomad failed while the Nomad event stream continued. After that, the requests succeeded.
  • At 7:50 the Nomad event stream timed out and was restarted.
  • At 7:53 the first runner was created again.
  • Between the Poseidon restart and 7:53
    • every requested runner left the pool with one less idle runner that was not replaced.
    • The Nomad server logs show issues reaching Nomad server 3: nomad.stats_fetcher: failed retrieving server health: server=nomad-server-terraform-3.global error="context deadline exceeded"
    • The Nomad agents did not report issues; their logs show running executions.
    • Poseidon received events from Nomad notifying that some allocations were running, but no events about pending allocations.

All in all, I would argue that Nomad lost some requests while its health pings to Nomad server 3 were timing out (maybe due to the difference between requests timing out and requests not being accepted 🤷). Since we have too little information about this and no issue within Poseidon has been revealed, I would accept (/ignore) this case and investigate the next occurrences.

Possible ways to address this case would be (a rough sketch of option 2 follows the list):

  1. Log every Evaluation (update) via InfluxDB. Evaluations are Nomad's mechanism for tracking requests: each longer-running request (such as creating a job) returns immediately with an evaluation ID, and further status updates for the evaluation are provided via the event stream.
  2. Introduce alerting if an evaluation is not closed within a timespan (of, say, 10 minutes).
  3. Introduce alerting if an evaluation does not lead to (a pending and then) running job/allocation within a timespan.
  4. Extend the retry mechanism to also recreate the runner if it is not created within the timespan.
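
To make option 2 a bit more concrete, here is a minimal sketch of an in-memory tracker, assuming hypothetical names (`evaluationTracker`, `Register`, `Close`) that would be fed the evaluation IDs returned by Nomad requests and the corresponding completion events from the event stream. This is not existing Poseidon code, just an illustration of the idea:

```go
package nomad

import (
	"log"
	"sync"
	"time"
)

// evaluationTracker remembers every evaluation ID returned by a Nomad request
// and warns if an evaluation is not closed within the configured timeout.
type evaluationTracker struct {
	mu      sync.Mutex
	pending map[string]time.Time // evaluation ID -> registration time
	timeout time.Duration
}

func newEvaluationTracker(timeout time.Duration) *evaluationTracker {
	t := &evaluationTracker{pending: map[string]time.Time{}, timeout: timeout}
	go t.watch()
	return t
}

// Register is called with the evaluation ID returned by, e.g., a job registration.
func (t *evaluationTracker) Register(evalID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[evalID] = time.Now()
}

// Close is called when the event stream reports the evaluation as complete.
func (t *evaluationTracker) Close(evalID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, evalID)
}

// watch periodically looks for evaluations exceeding the timeout and raises an
// alert (here just a log statement) for each of them.
func (t *evaluationTracker) watch() {
	for range time.Tick(time.Minute) {
		t.mu.Lock()
		for evalID, registeredAt := range t.pending {
			if time.Since(registeredAt) > t.timeout {
				log.Printf("evaluation %s not closed within %s", evalID, t.timeout)
			}
		}
		t.mu.Unlock()
	}
}
```

Option 1 would then amount to writing the same register/close events to InfluxDB instead of (or in addition to) keeping them in memory.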

@MrSerth
Member Author

MrSerth commented Oct 8, 2023

Thanks for digging deep on the first occurrence.

Poseidon was restarted at 7:43 UTC

  • Have you done a deployment then?

Yes. In order to enlarge the monitoring volume, I used our Ansible script. Besides performing the desired file system operation, it also updated Poseidon and thus performed a regular deployment. The script also touched the Nomad agents beforehand, even though I am not aware of an explicit change.

@MrSerth
Member Author

MrSerth commented Oct 8, 2023

Regarding the lost runners on October 4th: Are you actually sure about the given time span? I mean, my intervention to fix the monitoring instance actually happened after I saw on the CodeOcean dashboard that we were missing runners. This was sometime around 6 am UTC. Then, I triggered a resync of the execution environments to Poseidon. Only later, at around 7:30 am (until maybe 8 am) UTC, did I take care of the monitoring instance. Hence, I wonder whether there is another occurrence earlier that day, matching the timestamps of my mails (those times are in CEST, btw.).

@mpass99
Collaborator

mpass99 commented Oct 9, 2023

Thanks for digging deep on the first occurrence.

Actually, I was investigating the second occurrence you reported on Oct 5 (although the cause dates back to Oct 4).
I have not investigated the first occurrence so far due to the missing monitoring data.

Hence, I wonder whether there is another occurrence earlier that day, matching the timestamps of my mails

Yes, there likely is another occurrence. But since the monitoring data simplifies identifying the exact timestamp of each lost runner, I have not investigated the first occurrence (which lacks monitoring data) so far.

Are you actually sure about the given time span?

Yeah. I hope I was able to clarify the situation?

@MrSerth
Member Author

MrSerth commented Oct 9, 2023

Yeah. I hope I was able to clarify the situation?

Indeed, thank you!

Yes, there likely is another occurrence.

Probably. Let's briefly check why the event stream stopped at 1:38 am UTC. I know it's not that trivial, since we are missing the monitoring data.

Potentially, we could also consider recovering all runner data from Nomad if the event stream was stopped for longer than X minutes.

@mpass99
Collaborator

mpass99 commented Oct 12, 2023

Regarding the first occurrence of Oct 4th

  • The Nomad event stream had to be reestablished 8 times
    • 1:58:34 AM UTC, 1:56:40, 1:55:34, 1:39:30, 1:39:02, 1:38:23, 1:38:09, 1:36:21
  • 1:36: Reboot of Nomad Server 1
  • 1:37: Reboot of Nomad Agent 3
  • 1:38: Reboot of Nomad Agent 2
  • 1:39: Reboot of Nomad Server 2, and Server 3
  • 1:48: Reboot of Poseidon
  • 1:55: Reboot of Nomad Agent 4
  • 1:57: Reboot of Nomad Agent 1
Poseidon has not received the `running` event

This is one randomly sampled case. We see that we do not receive the `running` event from Nomad. Only when Poseidon reboots does it recover the allocation. But at `01:57:35` the allocation is rescheduled again, and again we do not receive the `running` event.
time="2023-10-04T00:08:17.322172Z" level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T00:08:17.796986Z" level=debug msg="Ignoring duplicate event" allocID=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T00:08:18.098020Z" level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T00:08:18.098088Z" level=debug msg="Runner started" package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23 startupDuration=775.858824ms
time="2023-10-04T01:37:05.085521Z" level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation=1c2d98bd-8160-14e7-1974-0cf0960976f3 alloc_id=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 package=nomad
time="2023-10-04T01:37:05.085601Z" level=debug msg="Runner stopped" package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23
time="2023-10-04T01:37:05.086574Z" level=debug msg="Ignoring unknown allocation" allocID=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T01:38:06.645665Z" level=debug msg="Ignoring unknown allocation" allocID=1c2d98bd-8160-14e7-1974-0cf0960976f3 package=nomad
time="2023-10-04T01:48:42.474372Z" level=debug msg="Recovered Runner" isUsed=false package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23
time="2023-10-04T01:57:35.558902Z" level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 alloc_id=7ea30198-0e26-cc2d-2314-2341e2cd8afd package=nomad
time="2023-10-04T01:57:35.558965Z" level=debug msg="Runner stopped" package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23
time="2023-10-04T01:57:35.560901Z" level=debug msg="Ignoring unknown allocation" allocID=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 package=nomad
time="2023-10-04T01:58:33.719590Z" level=debug msg="Ignoring unknown allocation" allocID=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 package=nomad
time="2023-10-04T06:11:32.377782Z" level=debug msg="Handle Allocation Event" ClientStatus=running DesiredStatus=stop NextAllocation= PrevAllocation=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 alloc_id=7ea30198-0e26-cc2d-2314-2341e2cd8afd package=nomad
time="2023-10-04T06:11:32.758506Z" level=debug msg="Handle Allocation Event" ClientStatus=complete DesiredStatus=stop NextAllocation= PrevAllocation=2bc8a0a7-4666-b89c-e626-0b3102af2ac3 alloc_id=7ea30198-0e26-cc2d-2314-2341e2cd8afd package=nomad
time="2023-10-04T06:11:32.758593Z" level=debug msg="Runner stopped" package=runner runner_id=29-1e67d71a-624a-11ee-96fd-fa163e6e1d23
time="2023-10-04T06:43:13.505543Z" level=debug msg="Ignoring unknown allocation" allocID=7ea30198-0e26-cc2d-2314-2341e2cd8afd package=nomad

The cause of this behavior was the reboot of all hosts triggered by the unattended upgrades.
I have not dug deep enough to tell why Nomad fails the event stream just because its hosts get restarted.
However, it may be related to the short timespan between the reboots, as Nomad might not be able to recover from one host going down before the next one is already down 🤷 If so, increasing the random delay of the reboots might improve the situation.

Potentially, we could also consider recovering all runner data from Nomad if the event stream was stopped for longer than X minutes.

Also, we have seen in the example case that events fail to appear around the reinitiation of the Nomad event stream. Therefore, your suggestion of recovering after reinitiating the event stream should improve the situation.
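
A rough sketch of how that recovery could be hooked into the event-stream loop, assuming hypothetical `openStream` and `recoverRunners` callbacks (illustrative only, not the actual Poseidon implementation):

```go
package nomad

import (
	"context"
	"time"
)

// KeepEventStreamAlive reestablishes the Nomad event stream whenever it ends.
// If the stream was down for longer than the given threshold, all runner data
// is recovered from the Nomad API before events are consumed again.
// openStream opens the stream and returns a consume function that blocks until
// the stream is lost; recoverRunners reloads the allocations from Nomad.
func KeepEventStreamAlive(
	ctx context.Context,
	threshold time.Duration,
	openStream func(context.Context) (func() error, error),
	recoverRunners func(context.Context) error,
) {
	var lostAt time.Time
	for ctx.Err() == nil {
		consume, err := openStream(ctx)
		if err != nil {
			time.Sleep(time.Second) // a real implementation would back off
			continue
		}
		// The stream is established again: if it was lost for longer than the
		// threshold, events may have been missed, so recover the runners first.
		if !lostAt.IsZero() && time.Since(lostAt) > threshold {
			_ = recoverRunners(ctx)
		}
		_ = consume()       // blocks until the stream ends or errors
		lostAt = time.Now() // remember when the stream was lost
	}
}
```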

Another idea apart from that: Since we are moving away from Grafana Alerts to Icinga, we could also move our Grafana alert about the 50% empty prewarming pool into Poseidon. This could mean that whenever a runner is sampled, we check whether the prewarming pool still contains at least 50% of the configured runners. If not, we propagate the status to the health route, start a timeout of 5 minutes, and then check again. If we still have that few runners, we send an alert/warning and resync the environment. What do you think?
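
A minimal sketch of that check, assuming a hypothetical Environment interface and a callback to the health route (the names are illustrative, not the existing Poseidon interfaces):

```go
package health

import (
	"log"
	"time"
)

// Environment is a minimal stand-in for a Poseidon execution environment
// (illustrative only).
type Environment interface {
	ID() int
	PrewarmingPoolSize() uint // configured number of idle runners
	IdleRunnerCount() uint    // idle runners currently available
	Resync() error            // recreate missing runners via Nomad
}

// checkPrewarmingPool sketches the proposed check: if the pool holds less than
// 50% of the configured idle runners, mark the environment unhealthy, wait five
// minutes, and if the pool is still depleted, warn and resync the environment.
func checkPrewarmingPool(environment Environment, setUnhealthy func(environmentID int)) {
	if !poolDepleted(environment) {
		return
	}
	setUnhealthy(environment.ID()) // propagate the status to the health route

	time.Sleep(5 * time.Minute) // a real implementation would use a timer/context

	if poolDepleted(environment) {
		log.Printf("prewarming pool of environment %d still below 50%%, resyncing", environment.ID())
		if err := environment.Resync(); err != nil {
			log.Printf("resync of environment %d failed: %v", environment.ID(), err)
		}
	}
}

// poolDepleted reports whether the pool holds less than half of its configured size.
func poolDepleted(environment Environment) bool {
	return environment.IdleRunnerCount() < environment.PrewarmingPoolSize()/2
}
```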

@MrSerth
Member Author

MrSerth commented Oct 12, 2023

Nice findings and good presentation of the reboot times 🙂.

If so, increasing the random delay of the reboots might improve the situation.

I am fine with further increasing the random delay if this could improve the situation. Alternatively, we could also try to define the reboot times ourselves.

Therefore, your suggestion of recovering after reinitiating the event stream should improve the situation.

👍

What do you think?

Sure, we can check that in Poseidon, too. I like the idea, since this would also propagate further to CodeOcean's health checks 👍.
