Long running Test: VsphereMachine is gone, but Machine is marked as "Running". #1660
Comments
This seems to be associated with a netsplit between some of the nodes on my mgmt cluster (the workers that run capi-controller-manager) losing connectivity permanently to the API server (kube-vip) on the workload clusters. It seems like resource reconciliation should still work even if such netsplits happen, as long as capv-controller-manager can still talk to the vSphere instances. I drew a diagram of the netsplit after doing more investigation.
I'm assuming that draining any management worker VMs that are unable to connect to the workload-cluster API server is a workaround; will test. (A sketch of that step is below.)
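A minimal sketch of that drain step, reusing the management-cluster context named elsewhere in this issue; the node name is a placeholder, not taken from the cluster:

```
# Drain a management-cluster worker that can no longer reach the workload
# cluster's API server, so controller pods reschedule onto a healthy node.
# <mgmt-worker-node> is a placeholder.
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc drain <mgmt-worker-node> \
  --ignore-daemonsets --delete-emptydir-data
```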
ah, "Deleting a kubernetes node associated with a machine is not allowed" after i bounced the capi-controller to a healthy, connected node.... (just a note to self)... will see how to resolve that later.... |
After trying to [...] ultimately, I was able to fix everything by carefully deleting finalizers and using the "bounce pod to connected node" trick described above (see the sketch below). I'll file a follow-on issue to have capi-controller-manager proactively fail in the event of netsplits.
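A sketch of what "carefully deleting finalizers" can look like; the namespace is a placeholder, and clearing finalizers bypasses normal cleanup, so it should only be done once the underlying VM is confirmed gone:

```
# Clear the finalizers on the stuck Machine so deletion can complete.
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc patch machine \
  tkg-vc-antrea-md-0-b79b98c6d-8sfjs -n <namespace> \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```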
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind bug
Note: This might, therefore, be a cluster-api bug instead of a CAPV bug, but I figure the fact that the VSphereMachine disappeared means that CAPV could possibly do something to prevent this bug from occurring. I'm happy to close this issue and file it upstream in CAPI if folks think it would be better there.
What steps did you take and what happened:
I ran a test for 50 days, where I periodically used an iptables rule to cut off connectivity from a workload cluster to the management cluster, i.e. something along the lines of the rule sketched below.
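A minimal sketch of the kind of rule that produces such a cut-off; the API-server address and port are assumptions, not the exact rule used in the test:

```
# Run on a workload-cluster node: drop outbound traffic to the management
# cluster's API server to simulate a netsplit.
iptables -A OUTPUT -d <mgmt-apiserver-ip> -p tcp --dport 6443 -j DROP
```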
Workarounds?
I tried to manually delete the machine, but that seemed not to work.
Details
This can be seen in the logs below:
(Note that there is only one node, the CP node, in the tkg-vc-antrea workload cluster.) Now, looking at the "machines" below, we'll see there's a ghost machine marked as running which is non-existent (tkg-vc-antrea-md-0-b79b98c6d-8sfjs).
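Something along these lines is how the ghost shows up from the management cluster (the exact output is not reproduced here):

```
# The ghost Machine still reports phase "Running" even though no
# corresponding VSphereMachine exists.
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc get machines,vspheremachines -A
```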
Notes
I see no evidence in the logs that capv-controller-manager is attempting to do anything with the above machine in this machine deployment, i.e.
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc logs -f capv-controller-manager-bcd4d7496-zl4gx -n capv-system | grep tkg-vc-antrea-md-0-
turns up empty.
Version
1.3.1
Workaround?
I tried to delete the ghost machine, tkg-vc-antrea-md-0-b79b98c6d-8sfjs; however, it didn't go away. It has a finalizer on it: machine.cluster.x-k8s.io. So I figured maybe CAPI is the one that owns that finalizer, and I looked at capi-controller-manager. What I found was that it seemed unhappy...
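For completeness, a sketch of how to confirm which finalizer is blocking deletion; the namespace is a placeholder:

```
# Show the finalizers on the stuck Machine; machine.cluster.x-k8s.io is the
# finalizer managed by the CAPI core machine controller.
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc get machine \
  tkg-vc-antrea-md-0-b79b98c6d-8sfjs -n <namespace> \
  -o jsonpath='{.metadata.finalizers}'
```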
So, my hypothesis here is that you can't delete machines whose IPs are unreachable, because the finalizer logic in CAPI becomes unhappy. When there is a netsplit between CAPI and the nodes it is monitoring, this prevents healthy cleanup of those nodes after the fact.