React faster if VM instance is gone (i.e. don’t wait until full machineHealthTimeout/machineCreationTimeout lapses) #755
Comments
Hello Vedran Lerenc, this is a kind reminder. Are there any updates here? Kind regards,
Grooming Results
One possible solution is to rely on the CCM, as it deletes the node object if the VM is gone. So we won't have to check the VM on the infrastructure, at least in
The problem with that is that infrastructures such as Azure are notoriously inconsistent for a very long time window (much more than the 10 minutes we try to shave off here). They show the machine, don't show it, show it again; it takes a while until all API endpoints return the same result. Then again, if the machine is older than 2h (the time window in which we see this issue, e.g. with Azure) and one API endpoint returns that it's gone, it's probably gone (unless that API endpoint is broken in some other way). We could safeguard that somewhat more and wait to see this result 3x (ignoring other results where the machine is reported present). Still, risky. We simply cannot trust the cloud provider API to return the correct result due to eventual consistency.
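To make the safeguard idea above concrete, here is a minimal sketch in Go. All names (`GoneTracker`, `Observation`, `Observe`) are hypothetical and not part of machine-controller-manager; the 2h and 3x thresholds are simply the numbers mentioned in the comment, and how the API result is obtained is left out entirely.

```go
package main

import "time"

// Assumed thresholds taken from the comment above: 2h minimum machine age
// and 3 "gone" reports. Neither is an existing configuration option.
const (
	DefaultMinAge       = 2 * time.Hour
	DefaultRequiredGone = 3
)

// Observation is a hypothetical result of one cloud provider API query.
type Observation int

const (
	Present Observation = iota // the API listed the VM
	Gone                       // the API reported the VM as missing
)

// GoneTracker (hypothetical) accumulates "gone" reports for a single machine.
type GoneTracker struct {
	MinAge       time.Duration
	RequiredGone int
	goneCount    int
}

// Observe records one API result and reports whether the VM may be treated
// as gone. "Present" results are ignored rather than resetting the counter,
// and results for machines younger than MinAge are not counted at all.
func (g *GoneTracker) Observe(obs Observation, machineAge time.Duration) bool {
	if machineAge < g.MinAge {
		return false
	}
	if obs == Gone {
		g.goneCount++
	}
	return g.goneCount >= g.RequiredGone
}
```

A caller would keep one tracker per machine and feed it each poll result. Not resetting the counter on a "present" result is a deliberate reading of "ignoring other results"; it also means a flapping endpoint could still reach the threshold, which is the risk pointed out above.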
Agreed. We don't need to immediately take a decision to move the
While implementing this check, we should not take early action if automatic recovery (https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-ec2-default-automatic-recovery/) is already in progress for a failed instance, as it is enabled by default in our usage.
What happened:
While investigating issues with pending workload during a rolling update that looked as if drain started on nodes without sufficient new capacity (see also issue #754), we found out that VM instances that will never make it are not considered as such and the full machineHealthTimeout/machineCreationTimeout needs to lapse, which slows down auto-scaling/self-healing/rolling-update and may lead to prolonged pending pod times.
What was expected:
When a VM instance backing a machine is in fatal error or even terminated, we should probably act faster. While it was observed that hyperscaler API endpoints may temporarily serve stale or broken or no data at all (these systems are highly distributed and only eventually consistent) we should still act, if e.g.:
kubelet stopped reporting at around or before the VM instance last timestamp then... machineHealthTimeout/machineCreationTimeout) lapsed.
This should help to recover much faster in the future. WDYT?
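As a rough illustration of the kind of check proposed here, a minimal Go sketch follows. It is an assumption-laden reading of the conditions, not existing machine-controller-manager code: the `Signals` type, the `ShouldActEarly` function, and the one-minute `DefaultTolerance` used for "at around" are all hypothetical, and the `RecoveryInProgress` flag only reflects the caveat raised in the comments about provider-side automatic recovery.

```go
package main

import "time"

// DefaultTolerance is an assumed slack for the "at around" comparison.
const DefaultTolerance = 1 * time.Minute

// Signals bundles hypothetical inputs for the early-failure decision.
type Signals struct {
	VMTerminal           bool      // provider reports the VM as terminated or in fatal error
	VMLastTimestamp      time.Time // last timestamp the provider reports for the VM
	KubeletLastHeartbeat time.Time // last time the kubelet reported in
	RecoveryInProgress   bool      // provider-side automatic recovery is running
}

// ShouldActEarly reports whether the machine could be acted on before
// machineHealthTimeout/machineCreationTimeout lapses.
func ShouldActEarly(s Signals, tolerance time.Duration) bool {
	// Never act early on a VM that is not terminal or that the provider
	// is already trying to recover.
	if !s.VMTerminal || s.RecoveryInProgress {
		return false
	}
	// "At around or before": the last heartbeat must not be later than
	// the VM's last timestamp plus the tolerance.
	return !s.KubeletLastHeartbeat.After(s.VMLastTimestamp.Add(tolerance))
}
```

Given the eventual-consistency concerns around provider APIs, a real implementation would likely want the terminal-state signal confirmed more than once before trusting it.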
How to reproduce it:
Terminate a running VM instance backing a machine resource or create a new machine resource (e.g. scale-out), but block (e.g. immediately terminate) the new VM instance from coming up. You will see that the machineHealthTimeout/machineCreationTimeout have to lapse before something happens even though the machines are in a terminal state from which they will never recover and the kubelets no longer report in.