Prewarming Pool Alert #587
The increase of the prewarming pool size for environment 10 was performed at 2024-05-02T08:13:18.776145Z (from 5 to 15).
Looking at `poseidon_nomad_idle_runners`, 3 runners get lost.
We see that one runner is lost due to bug #602. The two other runners are lost during the Nomad restart/deployment (see `poseidon_nomad_events`).
We see that the Job is started once correctly. After the Nomad restart it tries two more times, but fails for an unknown reason. Then, our configured limit of 3 attempts is reached.
We see that only one Allocation is created (and stopped). Therefore, the issue must lie at a higher level.
The evaluations provide no further insights.
We have two options:
Thanks! I would be glad to learn more about the "unknown" errors you've identified. Also, I was wondering whether we should increase the maximum number of reattempts from 3 to something higher? It would be a pity if the problem could have been resolved by retrying more often (maybe with an increasing interval).
Thanks for shifting the focus to the reattempts! Our current configuration is in poseidon/internal/environment/template-environment-job.hcl, lines 25 to 31 (at 342b937).
The following blocks result when adding all relevant parameters:

```hcl
restart {
  attempts = 3
  delay    = "0s"
  interval = "24h"
  mode     = "fail"
}

reschedule {
  unlimited      = true
  attempts       = 0
  interval       = "24h"
  delay          = "5s"
  delay_function = "constant"
}
```

Restart and Reschedule explained
The Nomad Documentation contains detailed explanations of the `restart` and `reschedule` configuration.
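For orientation, here is a minimal sketch of where the two blocks live in a Nomad job specification (the job, group, and task names as well as the image are made up for illustration and not taken from the Poseidon template): `restart` controls in-place restart attempts of a failed task on the same node, while `reschedule` controls whether and when a replacement allocation is placed, potentially on another node.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "runner" {
    # reschedule: governs placing a replacement allocation after the
    # allocation itself has failed (possibly on a different node).
    reschedule {
      unlimited      = true
      delay          = "5s"
      delay_function = "constant"
    }

    # restart: governs local restart attempts of the failed task on the
    # same node; with mode = "fail" the allocation fails once the
    # attempts within the interval are exhausted.
    restart {
      attempts = 3
      delay    = "0s"
      interval = "24h"
      mode     = "fail"
    }

    task "server" {
      driver = "docker"
      config {
        image = "example/image:latest" # placeholder image, not the real runner image
      }
    }
  }
}
```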
Nomad Agent 4 was shutting down at that time.
Working theory

Should we verify any statement of this theory?

Improvement

For the `restart` block:

```hcl
restart {
  attempts = 3
  delay    = "1s"
  interval = "1h"
  mode     = "fail"
}
```

For the `reschedule` block:

```hcl
reschedule {
  unlimited      = false
  attempts       = 3
  interval       = "24h"
  delay          = "1m"
  delay_function = "exponential"
}
```

How does this evaluation sound to you?
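As a rough reading of these values (assuming Nomad's documented `exponential` delay function, which doubles the delay with each subsequent attempt up to `max_delay`): the three reschedule attempts would follow roughly 1m, 2m, and 4m after the respective failures, and once the 3 attempts within the 24h interval are used up, the job would not be rescheduled again until the interval window passes.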
Working theory
Yes, I'd say let's verify the impact and interplay of the `restart` and `reschedule` policies.

Improvement
Why is this
That's true, we can do that. However, I would say it won't make a big difference, since we are usually failing all three attempts. ✅
Okay, for example with two deployments within 24 hours? Yes, then let's go with
👍
Okay, fine for me.
This ratio still looks somewhat "conservative" to me, but maybe I need to rethink it.
👍
👍 Does this knowledge influence the ✅ (as resolved with the previous comment)
When testing with an invalid image specifier, we see that, per node, the Job is restarted three times and is then rescheduled infinitely, rotating through the nodes.
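As a reference for this test, a minimal sketch of such a job is shown below; the restart and reschedule values mirror the current configuration quoted earlier, while the job, group, and task names and the image specifier are made up for illustration.

```hcl
# Minimal Nomad job to observe restart/reschedule behavior when the image
# cannot be pulled: every allocation fails, so the policies act in isolation.
job "invalid-image-test" {
  datacenters = ["dc1"]

  group "runner" {
    restart {
      attempts = 3       # three local restarts per node, as observed
      delay    = "0s"
      interval = "24h"
      mode     = "fail"
    }

    reschedule {
      unlimited      = true # rescheduled indefinitely, rotating through the nodes
      attempts       = 0
      interval       = "24h"
      delay          = "5s"
      delay_function = "constant"
    }

    task "broken" {
      driver = "docker"
      config {
        image = "registry.invalid/does-not-exist:latest" # intentionally invalid image specifier
      }
    }
  }
}
```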
Updated Working Theory

Validation: The issue with this situation is that the job is always lost, independently of the
Thanks for testing and updating the working theory.
👍
Yeah, persistence is a point here, since Poseidon itself is also restarted during the deployment. As far as I can tell, the Nomad Job is completely deleted in the erroneous case, which would also delete the metadata about the number of restarts.
Okay, fine. Your reasoning makes sense. 👍
The (Sentry) issue is still occurring. Let's first wait for #612 and then reevaluate whether further actions are necessary.
Sentry Issue: POSEIDON-4V
Sentry Issue: POSEIDON-5H
In this event, environment 10 (java-8) was reloaded. The event happened in the context of a deployment.
Assumed issues: