
Started Runner is already in use #597

Open
sentry-io bot opened this issue May 21, 2024 · 3 comments
Labels: bug (Something isn't working), pending

Comments


sentry-io bot commented May 21, 2024

Sentry Issue: POSEIDON-5R

Started Runner is already in use
sentry-io bot added the bug label on May 21, 2024

mpass99 commented Aug 7, 2024

The event timing during a runner migration causes (at least the most recent occurrence of) this error.

Nomad events

11:20:47.674869723  Allocation AllocationUpdated: ClientStatus:running, DesiredStatus:stop, ClientDescription:"Tasks are running", DesiredDescription:"alloc is being migrated", NodeName:nomad-agent-terraform-3
11:20:47.702883207  Allocation PlanResult: ClientStatus:pending, DesiredStatus:run, NodeName:nomad-agent-terraform-4
11:20:48.380804288  Allocation AllocationUpdated: ClientStatus:running, DesiredStatus:run, ClientDescription:"Tasks are running", NodeName:nomad-agent-terraform-4
11:20:48.410681202  Allocation AllocationUpdated: ClientStatus:complete, DesiredStatus:stop, ClientDescription:"All tasks have completed", DesiredDescription:"alloc is being migrated", NodeName:nomad-agent-terraform-3

The allocation was rescheduled to another host and started there before the old allocation had completely stopped.

Path 1: Let's fix this handling

We might:

  • Also take the NodeName into account in our tracking,
  • Handle the StopRunner already when Nomad announces DesiredStatus:stop rather than only when the ClientStatus becomes complete (see cfae6c9; a sketch follows this list), or
  • Fix our assumption about the PreviousAllocation field.
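
A minimal sketch of the second option, assuming a hypothetical handleAllocationEvent helper with onNew/onStop callbacks (none of these names exist in Poseidon); it reacts to DesiredStatus:stop directly and keys the tracking on the allocation ID plus NodeName:

```go
package main

import (
	nomadApi "github.com/hashicorp/nomad/api"
)

// allocKey includes the NodeName so that a rescheduled allocation on another
// node is not confused with its predecessor.
type allocKey struct {
	AllocID  string
	NodeName string
}

// handleAllocationEvent is a hypothetical handler sketch, not Poseidon code.
func handleAllocationEvent(event *nomadApi.Event, tracked map[allocKey]bool,
	onNew, onStop func(alloc *nomadApi.Allocation)) error {
	alloc, err := event.Allocation()
	if err != nil || alloc == nil {
		return err
	}
	key := allocKey{AllocID: alloc.ID, NodeName: alloc.NodeName}
	switch {
	case alloc.DesiredStatus == nomadApi.AllocDesiredStatusStop:
		// React to the stop intent immediately instead of waiting for the
		// old allocation to report ClientStatus "complete".
		if tracked[key] {
			delete(tracked, key)
			onStop(alloc)
		}
	case alloc.DesiredStatus == nomadApi.AllocDesiredStatusRun &&
		alloc.ClientStatus == nomadApi.AllocClientStatusRunning:
		if !tracked[key] {
			tracked[key] = true
			onNew(alloc)
		}
	}
	return nil
}
```

Reacting to the stop intent would close the window in which the new allocation is already running on the other node while the old one has not yet reported complete.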

Path 2: Let's reconsider

In this case, the PreviousAllocation field (which we already consider) is not set in any of the events. If it were, this error would be caught. We should investigate when this field is actually set.
However, it seems that even our currently tracked fields are not working (or we hold wrong assumptions about them). Tracking additional fields would further complicate the event handling.
Also, the Nomad event tracking has caused unexpected errors in the past.

We might reconsider our Nomad event handling architecture. If I remember correctly, it was created when we still mapped runners to allocations; nowadays, we map runners to Jobs.
The Job events are much simpler and less frequent than the Allocation events. In future work, we might evaluate our requirements and design how they could be handled in a simpler and more reliable manner (a rough sketch of a Job-based subscription follows below). #602 and #612 might also benefit from this.
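
As a rough illustration of how a Job-based subscription could look with the Nomad Go API (the watchJobEvents function and the onJobEvent callback are hypothetical; the integration into Poseidon's architecture is not shown):

```go
package main

import (
	"context"
	"log"

	nomadApi "github.com/hashicorp/nomad/api"
)

// watchJobEvents subscribes only to the Job topic of the Nomad event stream,
// which is far less frequent than the Allocation topic, and forwards every
// decoded Job event to the given callback.
func watchJobEvents(ctx context.Context, onJobEvent func(eventType string, job *nomadApi.Job)) error {
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		return err
	}
	topics := map[nomadApi.Topic][]string{nomadApi.TopicJob: {"*"}}
	eventCh, err := client.EventStream().Stream(ctx, topics, 0, &nomadApi.QueryOptions{})
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case events, ok := <-eventCh:
			if !ok {
				return nil
			}
			if events.Err != nil {
				log.Println("event stream error:", events.Err)
				continue
			}
			for _, event := range events.Events {
				// Event.Job decodes the Job from the event payload, if present.
				job, err := event.Job()
				if err != nil || job == nil {
					continue
				}
				onJobEvent(event.Type, job)
			}
		}
	}
}
```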

Draft (an interface sketch follows the list):

  1. We need to know when a runner is ready for a payload
    1. Job Event
    2. We want to know the startup duration.
  2. We need to know when a runner is being restarted or rescheduled, because then the user's file system is reset.
    1. Allocation Event?
    2. We want to identify whether the reason was an OOM Kill in order to output a specific error message.
  3. We want to know when a runner stopped permanently.
    1. Job Event
  4. We still want to monitor the Nomad events (InfluxDB), but
  5. We do not want any of the previous events to be triggered more than once.
  • Do we want to capture all reasons for failing allocations or do we just want to monitor the count of restarts?
  • ...
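
One possible way to pin down these requirements, sketched as a hypothetical listener interface (all names are made up for illustration and do not exist in Poseidon):

```go
package main

import "time"

// RunnerEventListener describes the de-duplicated, high-level notifications
// the draft above asks for. The underlying handler would decide whether a
// Job or an Allocation event triggers each callback.
type RunnerEventListener interface {
	// OnRunnerReady fires once a runner can accept a payload (Job event);
	// startupDuration covers the startup-duration requirement.
	OnRunnerReady(runnerID string, startupDuration time.Duration)
	// OnRunnerRestarted fires when a runner is restarted or rescheduled and
	// its file system is reset (Allocation event); oomKilled selects the
	// specific user-facing error message.
	OnRunnerRestarted(runnerID string, oomKilled bool)
	// OnRunnerStopped fires exactly once when a runner stops permanently
	// (Job event).
	OnRunnerStopped(runnerID string)
}
```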

MrSerth added the pending label on Aug 7, 2024

MrSerth commented Aug 7, 2024

Thanks for digging into this issue and drafting potential solution paths. As discussed in our meeting, we have decided to postpone further work on this issue for now. This decision is based on the assumed complexity of both solutions (especially the second path, of course) and the relatively low number of events we see for this error. Given the current time frame, we therefore want to focus on other issues first.


mpass99 commented Sep 9, 2024

From hashicorp/nomad#23937 we have learned that the JobDeregistered Job event "is a user-driven action when you run nomad job stop". Therefore, the Job events alone are likely not sufficient for capturing all job state changes (as suggested here in Path 2). Further investigation would be required.
