
failure during uninstall #238

Open
cbf123 opened this issue May 15, 2019 · 5 comments · May be fixed by #239

Comments


cbf123 commented May 15, 2019

When running "kubectl apply -f cmk-uninstall-all-daemonset.yaml", the uninstall worked on one of my two nodes but failed on the other. I'm now left with pod/cmk-uninstall-all-nodes-bm8lp in CrashLoopBackOff, daemonset.apps/cmk-uninstall-all-nodes still running, and /etc/cmk still present.

The final logs for the failed pod were as follows:

WARNING:root:"cmk-nodereport" for node "controller-0" does not exist.
INFO:root:"cmk-nodereport" for node "controller-0" removed.
INFO:root:Removing "cmk-reconcilereport" from Kubernetes API server for node "controller-0".
INFO:root:Converted "controller-0" to "controller-0" for TPR/CRD name
WARNING:root:"cmk-reconcilereport" for node "controller-0" does not exist.
INFO:root:"cmk-reconcilereport" for node "controller-0" removed.
INFO:root:Removing node taint.
INFO:root:Patching node controller-0:
[
{
"op": "replace",
"path": "/spec/taints",
"value": []
}
]
INFO:root:Removed node taint with key"cmk".
INFO:root:Removing node ERs
INFO:root:Patching node status controller-0:
[
{
"op": "remove",
"path": "/status/capacity/cmk.intel.com~1exclusive-cores"
}
]
ERROR:root:Aborting uninstall: Exception when removing ER: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Content-Length': '187', 'Date': 'Wed, 15 May 2019 17:44:02 GMT'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"the server rejected our request due to an error in our request","reason":"Invalid","details":{},"code":422}

cbf123 changed the title from "failure during uninstallon" to "failure during uninstall" on May 15, 2019

cbf123 commented May 21, 2019

I did some more digging, and it looks like the "cmk uninstall" command will only work on one node: it tries to delete the webhook, which can only succeed on one node, because on the other nodes the webhook has already been deleted. When that step fails on the other nodes, it aborts the uninstall and leaves the node in a partially installed state.
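(A minimal sketch of the abort-on-error pattern described above, assuming a simplified flow; the function and client names are placeholders, not the actual cmk source.)

import logging
import sys


def delete_webhook(client, name):
    # Placeholder client call; in this pattern any exception, including a
    # "not found" for a webhook that another node already removed, exits the
    # process, so the remaining cleanup (taints, ERs, /etc/cmk) never runs.
    try:
        client.delete_webhook(name)
    except Exception as err:
        logging.error("Aborting uninstall: %s", err)
        sys.exit(1)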


lmdaly commented May 22, 2019

@cbf123 thanks for raising this issue. Are you using the cmk-uninstall pod or the cmk-uninstall daemonset?

@przemeklal

In the uninstall module, all of the "delete something" functions call sys.exit(1) whenever any exception is raised. I think it would be better to log warnings and keep trying to remove all remaining resources in a best-effort mode, instead of exiting early.
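(A minimal sketch of the best-effort behaviour proposed here, assuming the cleanup steps can be expressed as a simple list of callables; the names are placeholders, not the real cmk uninstall code.)

import logging


def try_remove(step_name, remove_fn):
    # Attempt one cleanup step; warn and carry on instead of exiting.
    try:
        remove_fn()
        return True
    except Exception as err:
        logging.warning("Failed to remove %s, continuing: %s", step_name, err)
        return False


def uninstall(steps):
    # steps: list of (name, callable) pairs; every step is attempted even
    # if earlier ones fail, and failures are summarised at the end.
    results = [try_remove(name, fn) for name, fn in steps]
    if not all(results):
        logging.warning("Uninstall finished with errors; some resources may remain.")

With this pattern, a webhook that was already deleted by another node would only produce a warning, and the remaining steps (node taint, extended resources, /etc/cmk) would still be attempted.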

przemeklal linked a pull request on May 22, 2019 that will close this issue
@przemeklal

I created a draft pull request (linked above) that changes the uninstall module behaviour to the one described. Feel free to give it a try and report your feedback.

I marked it as a "Draft" because most of the uninstall-related tests now fail; I'll be happy to fix them once we agree that this is the right way to perform the uninstall process.


cbf123 commented May 22, 2019

lmdaly: I tried both options. With the cmk-uninstall pod you need to run it on each node manually, and it only succeeds on the first node. With the daemonset, the first pod to run succeeds, and the other pods fail and keep restarting.

przemeklal: That's one way to do it, and I think it makes sense to clean up as much as we can on uninstall. I took an alternate approach (attached) that is a bit narrower in scope, but I think I like yours better.
diff.txt
