Improve Nomad Docs
to describe the behavior of running executions during Nomad restarts.
mpass99 committed Sep 3, 2024
1 parent e89ce3b commit 7d528a0
Showing 1 changed file with 35 additions and 0 deletions.
35 changes: 35 additions & 0 deletions docs/nomad_usage.md
@@ -27,6 +27,41 @@ If a user requests a new runner, Poseidon duplicates the template Job of the cor

When a user then executes their code, Poseidon copies the code into the container and executes it.

### Nomad Restarts

When Nomad Servers or Agents restart, running executions can be terminated.
For Agents, this depends on whether the runner's allocation is placed on the restarted Agent.
For Servers, it depends both on the Server's role (whether the restarted Server is the cluster leader) and on whether Poseidon is connected to the restarted Server, e.g., as a result of DNS resolution.
Poseidon can be connected to a Server either for an individual execution or for receiving the Nomad Event Stream.
The following table lists the behavior for restarts of a Nomad Server depending on its role and on whether Poseidon is connected to it (via DNS resolution).

| Role     | DNS Resolves | WebSocket Problem? | Event Stream Problem? |
|----------|--------------|--------------------|-----------------------|
| Leader   | Yes          | problematic        | problematic           |
| Leader   | No           | problematic        | fine                  |
| Follower | Yes          | problematic        | problematic           |
| Follower | No           | fine               | fine                  |
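
The four rows of the table reduce to two rules: WebSocket connections are affected if the restarted Server is the leader or Poseidon is connected to it, and the Event Stream is affected only if Poseidon is connected to it. The following Go sketch encodes exactly that reading of the table. All type and function names are hypothetical and are not taken from the Poseidon code base.

```go
package main

// ServerRole is the role of the restarted Nomad Server (hypothetical type).
type ServerRole int

const (
	Follower ServerRole = iota
	Leader
)

// RestartImpact mirrors the two result columns of the table.
// true corresponds to "problematic", false to "fine".
type RestartImpact struct {
	WebSocketProblem   bool
	EventStreamProblem bool
}

// ImpactOfServerRestart returns the expected impact of restarting a Nomad
// Server with the given role. poseidonConnected states whether Poseidon is
// currently connected to that Server (the "DNS Resolves" column).
func ImpactOfServerRestart(role ServerRole, poseidonConnected bool) RestartImpact {
	return RestartImpact{
		// Leader restarts affect executions even without a direct connection.
		WebSocketProblem: role == Leader || poseidonConnected,
		// The Event Stream only breaks if it is served by the restarted Server.
		EventStreamProblem: poseidonConnected,
	}
}
```

For example, `ImpactOfServerRestart(Leader, false)` reports a WebSocket problem but no Event Stream problem, matching the second row of the table.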

Such restarts lead to problems with either individual WebSocket connections of executions or the Nomad Event Stream.
When the Nomad Event Stream is aborted, Poseidon tries to reestablish it. Once it succeeds in doing so, all environments and runners are recovered from Nomad.

In the case of Nomad Agent restarts, the WebSocket connection of a running execution aborts.
Furthermore, when Docker on the Nomad Agent is also restarted, the containers are recreated.
Poseidon detects such occurrences and treats the runner as clean and idle.
The Nomad and Docker systemd services are connected via a systemd `PartOf` relationship.
This results in Nomad being restarted whenever Docker restarts, but not vice versa.

### Nomad Event Stream

We use the [Nomad Event Stream](https://developer.hashicorp.com/nomad/api-docs/events) to subscribe to Nomad Events.
We handle `Evaluation` and `Allocation` events.
`Evaluation` events are used to wait for an environment to be created when requested.
`Allocation` events are used to track the starts and stops of runners.
Because runners are mapped to Nomad Jobs, we could also monitor Nomad's `Job` events (registered, deregistered). However, for historical reasons, we listen to `Allocation` events.
These events behave similarly because each Job has at least one allocation in which it is executed.
However, the `Allocation` events also contain information about restarts and reschedulings.
This increases the complexity of the event stream handling, but it allows us to identify OOM-killed runners and to move a used runner to the idle runners once it has been restarted or rescheduled.
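
To illustrate the topic-based handling, here is a minimal Go sketch that decodes one JSON frame of the event stream and dispatches `Evaluation` and `Allocation` events while ignoring all other topics. The field names (`Index`, `Events`, `Topic`, `Type`, `Key`, `Payload`) follow the Nomad Event Stream API. The `Handlers` struct and `HandleFrame` function are hypothetical and do not reflect Poseidon's actual code, which may use a Nomad client library instead of raw JSON.

```go
package main

import "encoding/json"

// Event is a single entry of a Nomad Event Stream frame. The payload is
// kept raw so that each handler can decode only what it needs.
type Event struct {
	Topic   string          `json:"Topic"`
	Type    string          `json:"Type"`
	Key     string          `json:"Key"`
	Payload json.RawMessage `json:"Payload"`
}

// Frame is one JSON object of the stream. Heartbeats are empty objects
// and therefore decode to a Frame without events.
type Frame struct {
	Index  uint64  `json:"Index"`
	Events []Event `json:"Events"`
}

// Handlers bundles the callbacks for the two handled topics (hypothetical).
type Handlers struct {
	OnEvaluation func(e Event) // e.g. wait for an environment to be created
	OnAllocation func(e Event) // e.g. track starts and stops of runners
}

// HandleFrame decodes a single frame and dispatches its events by topic.
// Events of other topics, such as Job, are skipped.
func HandleFrame(raw []byte, h Handlers) error {
	var frame Frame
	if err := json.Unmarshal(raw, &frame); err != nil {
		return err
	}
	for _, e := range frame.Events {
		switch e.Topic {
		case "Evaluation":
			if h.OnEvaluation != nil {
				h.OnEvaluation(e)
			}
		case "Allocation":
			if h.OnAllocation != nil {
				h.OnAllocation(e)
			}
		}
	}
	return nil
}
```

In this sketch, the restart and rescheduling information mentioned above would be extracted from `Payload` inside the `OnAllocation` callback.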

## Prewarming

To reduce the response time in the process of claiming a runner, Poseidon creates a pool of runners that have been started in advance.