Missing template task for some environments #522
The Sentry issue CODEOCEAN-11J no longer exists. @MrSerth, have you experienced this issue lately, or is there a follow-up Sentry issue? From your images, we can derive that Poseidon restarted at [...]. Indeed, when Poseidon is notified about a stopped environment job, it does absolutely nothing (see poseidon/internal/runner/nomad_manager.go, lines 307 to 309 at 8390b90).
However, the environment is only removed from Poseidon's memory when requested via the API. If that is not the case, I have to assume that the Grafana image uses a different timezone and the environment stopped existing once Poseidon got restarted. How should we deal with this issue?
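For illustration, a minimal self-contained sketch of the behavior described above: when the event stream reports a stopped job, template (environment) jobs are deliberately skipped. The type, the helper name, and the "template-" job-ID prefix are assumptions for this sketch, not the actual code from nomad_manager.go:

```go
package main

import (
	"fmt"
	"strings"
)

// runnerManager stands in for Poseidon's Nomad runner manager (assumed name).
type runnerManager struct{}

// isEnvironmentTemplateJob assumes template jobs use a "template-" ID prefix.
func isEnvironmentTemplateJob(jobID string) bool {
	return strings.HasPrefix(jobID, "template-")
}

// onJobStopped would be called when the Nomad event stream reports a stopped job.
func (m *runnerManager) onJobStopped(jobID string) {
	if isEnvironmentTemplateJob(jobID) {
		// Stopped environment jobs are ignored here: the environment only leaves
		// Poseidon's memory when a deletion is requested via the API.
		return
	}
	// Regular runner jobs are cleaned up as usual.
	fmt.Printf("cleaning up runner job %s\n", jobID)
}

func main() {
	m := &runnerManager{}
	m.onJobStopped("template-22") // ignored
	m.onJobStopped("runner-abc")  // cleaned up
}
```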
Sentry Issue: CODEOCEAN-12Y
I linked the follow-up Sentry issue CODEOCEAN-12Y, which I triggered manually on our staging system (so no real issue). But still, it would be the one where further events are grouped. Besides that, I don't have another recent occurrence, but I saw the potentially lost environment on CodeOcean's dashboard during the past student courses (so in spring 2024).
I am not sure about the time zone (sorry!), but I would assume that the Grafana dashboard is shown in my browser's time zone (should be CET, +01:00) -- that's at least the case when I visit Grafana today (and I cannot remember it being UTC). Otherwise, I agree with your assumption: probably, Poseidon got restarted after the environment job got lost, hence the recovery did not restore the environment at all.
That's true, of course. Still, if we can prevent this error, it would be even better.
Good question. I like that we already improved the situation through the better restart and rescheduling mechanism. Still, jobs might get lost. Hence, if an environment job finally fails (and is no longer retried by Nomad), we could (potentially) restart it, couldn't we? Do we have all the required information to do so? And would you assume restarting is fine, or could this result in duplicated jobs, ...? As an additional precaution, I would tend to add support in Poseidon for this case, too (depending on your answers to the previous questions).
We could, and we have all the required information to do so. However, it would add new complexity to the Nomad event stream handling and new workflows within Poseidon (roughly sketched below).
There are three cases in which environment jobs finally fail:
All in all, it feels like we would re-implement too much of Nomad's functionality and responsibility. If we want to change the behavior, we might rather adjust the job policy. I would go with adding visibility to this issue, as proposed in #668.
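For context, a rough sketch of the kind of event-stream handling discussed above (and ultimately not pursued): it watches Nomad's Job events and re-registers environment template jobs that Nomad gave up on. It assumes the hashicorp/nomad/api client and a "template-" job-ID prefix; the bare re-registration is illustrative only, not the actual Poseidon implementation:

```go
package main

import (
	"context"
	"log"
	"strings"

	nomadApi "github.com/hashicorp/nomad/api"
)

func main() {
	client, err := nomadApi.NewClient(nomadApi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Subscribe to all Job events on the Nomad event stream.
	ctx := context.Background()
	topics := map[nomadApi.Topic][]string{nomadApi.TopicJob: {"*"}}
	eventCh, err := client.EventStream().Stream(ctx, topics, 0, nil)
	if err != nil {
		log.Fatal(err)
	}

	for events := range eventCh {
		if events.Err != nil {
			log.Printf("event stream error: %v", events.Err)
			continue
		}
		for _, event := range events.Events {
			job, err := event.Job()
			if err != nil || job == nil || job.ID == nil || job.Status == nil {
				continue
			}
			// Only react to environment template jobs that Nomad gave up on.
			if strings.HasPrefix(*job.ID, "template-") && *job.Status == "dead" {
				log.Printf("re-registering finally failed environment job %s", *job.ID)
				if _, _, err := client.Jobs().Register(job, nil); err != nil {
					log.Printf("re-registration failed: %v", err)
				}
			}
		}
	}
}
```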
"We could, and we have all the required information to do so. However, it would add new complexity to the Nomad event stream handling and new workflows within Poseidon."
Okay, I see. Especially given that we could try to advise Nomad to restart the job again, let's try this Nomad-based approach first.
That's fine, and I am not aware of any false deletion requests by Poseidon right now.
For the occurrence I created the issue for, this (a misconfigured environment) is highly unlikely. The environment (command, image, ...) doesn't change very frequently, and I cannot remember any real issues with that recently. For sure, the [...] I get that a Poseidon-based solution might not make sense here, but can we get creative and nevertheless adjust the task policy for environments (not for regular runner jobs)? For example, I would assume that a wrong image leads to immediate and countless restarts. Other availability issues, however, would only fail for a limited time (like for the duration of one or two restarts) and work before or after that. With a comparatively small [...]
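To make the policy idea concrete, a sketch of how the task policies of environment template jobs (only) could be tightened via the hashicorp/nomad/api structs; the concrete values and names are placeholders for discussion, not tested defaults or the actual Poseidon configuration:

```go
package main

import (
	"fmt"
	"time"

	nomadApi "github.com/hashicorp/nomad/api"
)

// ptr is a small helper to take the address of a literal value.
func ptr[T any](v T) *T { return &v }

// applyTemplateJobPolicy overrides the restart and reschedule policies of all
// task groups in an environment template job (illustrative values).
func applyTemplateJobPolicy(job *nomadApi.Job) {
	for _, group := range job.TaskGroups {
		group.RestartPolicy = &nomadApi.RestartPolicy{
			Attempts: ptr(2), // a few local restarts on the same node
			Interval: ptr(5 * time.Minute),
			Delay:    ptr(15 * time.Second),
			Mode:     ptr("fail"), // give up instead of restarting forever
		}
		group.ReschedulePolicy = &nomadApi.ReschedulePolicy{
			Attempts:      ptr(3), // then try a few placements elsewhere
			Interval:      ptr(1 * time.Hour),
			Delay:         ptr(1 * time.Minute),
			DelayFunction: ptr("exponential"),
			MaxDelay:      ptr(10 * time.Minute),
			Unlimited:     ptr(false),
		}
	}
}

func main() {
	// Hypothetical template job for environment 22.
	job := nomadApi.NewServiceJob("template-22", "template-22", "global", 50)
	job.AddTaskGroup(nomadApi.NewTaskGroup("default-group", 1))
	applyTemplateJobPolicy(job)
	fmt.Printf("restart attempts: %d\n", *job.TaskGroups[0].RestartPolicy.Attempts)
}
```

With the restart mode set to "fail", a permanently wrong image would exhaust the few local restart attempts quickly and surface as a failure, while a short availability blip would still be covered by the delayed reschedule attempts.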
Yes, we should monitor these, and I like the enhanced logging with #668.
#668 also monitors these cases. We might want to wait for an occurrence to verify that this is still a real issue.
That's right, I agree.
Yeah, we might overwrite some policies within the Poseidon code (e.g., just for template jobs). All in all, it feels like we would spend too much time and code complexity on something without a big impact and without known recent cases. Let's continue investigating this issue once a real case occurs again.
I would still like to see some improvement here and think adjusting some values for environments could be useful. However, given the other pending issues (and especially #612), we decided to close and postpone this issue for now.
I've just noticed an issue with Poseidon that needs further investigation:
According to CodeOcean, some environments were not found when executing code and were required to be synced (CODEOCEAN-11J). Indeed, CodeOcean still shows empty pools for some environments, including the IDs 11, 18, 22, 33:
Nomad, however, has enough jobs scheduled:
During my investigation, I already identified that Poseidon is not aware of the environments:
This matches Nomad as well:
Hence, Poseidon thinks that everything is okay and is not issuing a Prewarming Pool Alert:
Poseidon, however, was not restarted for quite a while:
Probably, the affected environments were lost during the night: