
Started Runner is already in use #597

Open
sentry-io bot opened this issue May 21, 2024 · 3 comments
Labels: bug (Something isn't working), pending

Comments


sentry-io bot commented May 21, 2024

Sentry Issue: POSEIDON-5R

Started Runner is already in use
sentry-io bot added the bug label on May 21, 2024

mpass99 commented Aug 7, 2024

The event timing during a runner migration causes (at least the most recent occurrence of) this error.

Nomad events

11:20:47.674869723  Allocation AllocationUpdated: ClientStatus:running, DesiredStatus:stop, ClientDescription:"Tasks are running", DesiredDescription:"alloc is being migrated", NodeName:nomad-agent-terraform-3
11:20:47.702883207  Allocation PlanResult: ClientStatus:pending, DesiredStatus:run, NodeName:nomad-agent-terraform-4
11:20:48.380804288  Allocation AllocationUpdated: ClientStatus:running, DesiredStatus:run, ClientDescription:"Tasks are running", NodeName:nomad-agent-terraform-4
11:20:48.410681202  Allocation AllocationUpdated: ClientStatus:complete, DesiredStatus:stop, ClientDescription:"All tasks have completed", DesiredDescription:"alloc is being migrated", NodeName:nomad-agent-terraform-3

The allocation was rescheduled to another host and started there before the old allocation had completely stopped.

Path 1: Let's fix this handling

We might:

  • Also take the NodeName into account in our tracking,
  • Handle the StopRunner already when Nomad announces DesiredStatus:stop rather than only when the ClientStatus becomes complete (see cfae6c9; a sketch follows this list), or
  • Fix our assumption about the PreviousAllocation field.
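
A minimal sketch of the second option, assuming a hypothetical handleAllocationEvent helper with onNew/onStop callbacks (none of these names exist in Poseidon); it reacts to DesiredStatus:stop directly and keys the tracking on the allocation ID plus NodeName:

```go
package main

import (
	nomadApi "github.com/hashicorp/nomad/api"
)

// allocKey includes the NodeName so that a rescheduled allocation on another
// node is not confused with its predecessor.
type allocKey struct {
	AllocID  string
	NodeName string
}

// handleAllocationEvent is a hypothetical handler sketch, not Poseidon code.
func handleAllocationEvent(event *nomadApi.Event, tracked map[allocKey]bool,
	onNew, onStop func(alloc *nomadApi.Allocation)) error {
	alloc, err := event.Allocation()
	if err != nil || alloc == nil {
		return err
	}
	key := allocKey{AllocID: alloc.ID, NodeName: alloc.NodeName}
	switch {
	case alloc.DesiredStatus == nomadApi.AllocDesiredStatusStop:
		// React to the stop intent immediately instead of waiting for the
		// old allocation to report ClientStatus "complete".
		if tracked[key] {
			delete(tracked, key)
			onStop(alloc)
		}
	case alloc.DesiredStatus == nomadApi.AllocDesiredStatusRun &&
		alloc.ClientStatus == nomadApi.AllocClientStatusRunning:
		if !tracked[key] {
			tracked[key] = true
			onNew(alloc)
		}
	}
	return nil
}
```

Reacting to the stop intent would close the window in which the new allocation is already running on the other node while the old one has not yet reported complete.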

Path 2: Let's reconsider

In this case, the PreviousAllocation field (which we already consider) is not set in any of the events. If it were, this error would be caught. We should investigate when this field is actually set.
However, it seems that even our currently tracked fields are not working (or we hold wrong assumptions about them). Tracking additional fields would further complicate the event handling.
Also, the Nomad event tracking has caused unexpected errors in the past.

We might reconsider our Nomad event handling architecture. If I remember correctly, it was created when we still mapped runners to allocations; nowadays, we map runners to Jobs.
The Job events are much simpler and less frequent than the Allocation events. In future work, we might evaluate our requirements and design how they could be handled in a simpler and more reliable manner (a rough sketch of a Job-based subscription follows below). #602 and #612 might also benefit from this.
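
As a rough illustration of how a Job-based subscription could look with the Nomad Go API (the watchJobEvents function and the onJobEvent callback are hypothetical; the integration into Poseidon's architecture is not shown):

```go
package main

import (
	"context"
	"log"

	nomadApi "github.com/hashicorp/nomad/api"
)

// watchJobEvents subscribes only to the Job topic of the Nomad event stream,
// which is far less frequent than the Allocation topic, and forwards every
// decoded Job event to the given callback.
func watchJobEvents(ctx context.Context, onJobEvent func(eventType string, job *nomadApi.Job)) error {
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		return err
	}
	topics := map[nomadApi.Topic][]string{nomadApi.TopicJob: {"*"}}
	eventCh, err := client.EventStream().Stream(ctx, topics, 0, &nomadApi.QueryOptions{})
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case events, ok := <-eventCh:
			if !ok {
				return nil
			}
			if events.Err != nil {
				log.Println("event stream error:", events.Err)
				continue
			}
			for _, event := range events.Events {
				// Event.Job decodes the Job from the event payload, if present.
				job, err := event.Job()
				if err != nil || job == nil {
					continue
				}
				onJobEvent(event.Type, job)
			}
		}
	}
}
```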

Draft (an interface sketch follows the list):

  1. We need to know when a runner is ready for a payload
    1. Job Event
    2. We want to know the startup duration.
  2. We need to know when a runner is being restarted or rescheduled, because then the user's file system is reset.
    1. Allocation Event?
    2. We want to identify whether the reason was an OOM Kill in order to output a specific error message.
  3. We want to know when a runner stopped permanently.
    1. Job Event
  4. We still want to monitor the Nomad events (InfluxDB), but
  5. We do not want any of the previous events to be triggered more than once.
  • Do we want to capture all reasons for failing allocations or do we just want to monitor the count of restarts?
  • ...
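
One possible way to pin down these requirements, sketched as a hypothetical listener interface (all names are made up for illustration and do not exist in Poseidon):

```go
package main

import "time"

// RunnerEventListener describes the de-duplicated, high-level notifications
// the draft above asks for. The underlying handler would decide whether a
// Job or an Allocation event triggers each callback.
type RunnerEventListener interface {
	// OnRunnerReady fires once a runner can accept a payload (Job event);
	// startupDuration covers the startup-duration requirement.
	OnRunnerReady(runnerID string, startupDuration time.Duration)
	// OnRunnerRestarted fires when a runner is restarted or rescheduled and
	// its file system is reset (Allocation event); oomKilled selects the
	// specific user-facing error message.
	OnRunnerRestarted(runnerID string, oomKilled bool)
	// OnRunnerStopped fires exactly once when a runner stops permanently
	// (Job event).
	OnRunnerStopped(runnerID string)
}
```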

MrSerth added the pending label on Aug 7, 2024

MrSerth commented Aug 7, 2024

Thanks for digging into this issue and drafting potential solution paths. As discussed in our meeting, we have decided to postpone further work on this issue for now. This decision is based on the assumed complexity of both solutions (especially the second path, of course) and the relatively low number of events we see for this error. Given the current time frame, we therefore want to focus on other issues first.


mpass99 commented Sep 9, 2024

From hashicorp/nomad#23937 we have learned that the JobDeregistered Job event "is a user-driven action when you run nomad job stop". Therefore, the Job events alone are likely not sufficient for capturing all job state changes (as suggested here in Path 2). Further investigation would be required.
