From 7d528a01f95ed9a70ff4a07612779a41fb25c2d3 Mon Sep 17 00:00:00 2001
From: Maximilian Pass <22845248+mpass99@users.noreply.github.com>
Date: Tue, 20 Aug 2024 12:58:50 +0200
Subject: [PATCH] Improve Nomad Docs to describe the behavior of running executions during Nomad restarts.

---
 docs/nomad_usage.md | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/docs/nomad_usage.md b/docs/nomad_usage.md
index 8f1c68f4..6b4614c6 100644
--- a/docs/nomad_usage.md
+++ b/docs/nomad_usage.md
@@ -27,6 +27,41 @@ If a user requests a new runner, Poseidon duplicates the template Job of the cor
 
 When a user then executes their code, Poseidon copies the code into the container and executes it.
 
+### Nomad Restarts
+
+When the Nomad Servers or Agents restart, running executions can be terminated.
+For agents, it depends on whether the runner allocation is placed on the restarted agent.
+For servers, it depends both on the role (whether the restarted server is the cluster leader) and on whether Poseidon is connected to the restarted server, e.g., due to DNS resolution.
+Poseidon can be connected to the server either for individual executions or for receiving the Nomad Event Stream.
+The following table lists the behavior for restarts of a Nomad Server depending on its role and on whether Poseidon is connected to it (via DNS resolution).
+
+| Role     | DNS Resolves | | WebSocket Problem? | Event Stream Problem? |
+|----------|--------------|-|--------------------|-----------------------|
+| Leader   | Yes          | | problematic        | problematic           |
+| Leader   | No           | | problematic        | fine                  |
+| Follower | Yes          | | problematic        | problematic           |
+| Follower | No           | | fine               | fine                  |
+
+Such restarts lead to problems with individual WebSocket connections of executions, the Nomad Event Stream, or both.
+When the Nomad Event Stream is aborted, Poseidon tries to reestablish it. Once it succeeds in doing so, all environments and runners are recovered from Nomad.
+
+In the case of a Nomad Agent restart, the WebSocket connection of a running execution aborts.
+Furthermore, when Docker on the Nomad Agent is also restarted, the containers are recreated.
+Poseidon captures such occurrences and treats the recreated runner as clean and idle.
+The Nomad and Docker systemd services are connected via a systemd `PartOf` relationship.
+This results in Nomad being restarted whenever Docker restarts, but not vice versa.
+
+### Nomad Event Stream
+
+We use the [Nomad Event Stream](https://developer.hashicorp.com/nomad/api-docs/events) to subscribe to Nomad Events.
+We handle `Evaluation` and `Allocation` events.
+`Evaluation` events are used to wait for an environment to be created when requested.
+`Allocation` events are used to track the starts and stops of runners.
+Because of the mapping of runners to Nomad Jobs, we could also monitor Nomad's `Job` events (registered, deregistered); however, for historical reasons, we listen to `Allocation` events.
+These events behave similarly because each Job has at least one allocation to be executed in.
+However, the `Allocation` events also contain information about restarts and reschedulings.
+This increases the complexity of the event stream handling but allows us to identify OOM-killed runners and to move a used runner back to the idle runners once it has been restarted or rescheduled.
+
 ## Prewarming
 
 To reduce the response time in the process of claiming a runner, Poseidon creates a pool of runners that have been started in advance.
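Note on the "Nomad Restarts" table added by this patch: its four rows appear to reduce to two boolean rules. The sketch below only restates the table as written and is not taken from Poseidon's code; the names `Impact` and `restartImpact` are hypothetical, and "connected" is assumed to mean that DNS resolves to the restarted server.

```go
package main

import "fmt"

// Impact is a hypothetical type (not part of Poseidon) recording which
// connections the table marks as "problematic" for a given server restart.
type Impact struct {
	WebSocket   bool // individual execution WebSocket connections
	EventStream bool // the Nomad Event Stream subscription
}

// restartImpact restates the table: the WebSocket column is problematic when
// the restarted server is the leader or Poseidon is connected to it; the
// Event Stream column is problematic only when Poseidon is connected to it.
func restartImpact(isLeader, poseidonConnected bool) Impact {
	return Impact{
		WebSocket:   isLeader || poseidonConnected,
		EventStream: poseidonConnected,
	}
}

func main() {
	// Enumerate all four rows of the table.
	for _, isLeader := range []bool{true, false} {
		for _, connected := range []bool{true, false} {
			fmt.Printf("leader=%t connected=%t -> %+v\n",
				isLeader, connected, restartImpact(isLeader, connected))
		}
	}
}
```

Read this way, only the restart of a follower that Poseidon is not connected to leaves both connection types untouched, which matches the last table row.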