
Safe deletion of Kyma Clusters [EPIC] #126

Open
pbochynski opened this issue Jan 19, 2024 · 5 comments
Labels: area/control-plane, Epic

Comments


pbochynski commented Jan 19, 2024

Description

Instead of deleting a cluster immediately, we can hibernate it and delete it a few days later, so an accidental deletion can be recovered. Such a cluster is not reconciled (the Kyma resource can be deleted). Deleting the Kyma resource should not cause module deletion; it is just an opt-out from lifecycle management.

Note: customer data is still present in the hibernated cluster, so we should not keep it for too long, and we need to make sure we do not violate data privacy policies.
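
For reference, hibernating a cluster in Gardener amounts to enabling hibernation on the Shoot resource. A minimal sketch, run against the garden cluster (shoot and project names are placeholders):

# Hibernate the shoot (waking it up is the same patch with "enabled": false):
$> kubectl patch shoot my-shoot -n garden-my-project --type=merge -p '{"spec":{"hibernation":{"enabled":true}}}'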

Implementation idea:

  • The cluster resource could remain in the deleting state for a longer period, and the operator would remove the shoot and the finalizer after a defined timeout.
  • If recovery is required for a hibernated cluster, the cluster is flagged and deletion of the hibernated cluster is no longer allowed.
  • Recovery would be a manual process: copy the resource, remove the finalizer, and recreate the resource from the copy (see the sketch below).
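
A hedged sketch of the manual recovery flow, assuming the cluster is represented by a Runtime CR in the kcp-system namespace (resource kind, names, and namespace are illustrative, not confirmed by this issue):

# 1. Save a copy of the resource before the operator finishes the deletion:
$> kubectl get runtime my-runtime -n kcp-system -o yaml > runtime-backup.yaml

# 2. Remove the finalizer so the pending deletion completes:
$> kubectl patch runtime my-runtime -n kcp-system --type=merge -p '{"metadata":{"finalizers":null}}'

# 3. Strip server-managed fields (status, resourceVersion, uid, deletionTimestamp)
#    from the backup, then recreate the resource:
$> kubectl apply -f runtime-backup.yaml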

Reasons
We should protect our customers as much as possible against unintentional or malicious actions that cause data loss.

Related to

  • Cross service consumption
pbochynski added the kind/feature, area/control-plane, and Epic labels Jan 19, 2024
tobiscr removed the kind/feature label Jun 26, 2024
tobiscr changed the title from "Safe deletion of Kyma Clusters" to "Safe deletion of Kyma Clusters [EPIC]" Jun 26, 2024

tobiscr commented Jul 10, 2024

TODO: We have to clarify the retention time for hibernated clusters, as they are also charged by Gardener (base fee).


tobiscr commented Aug 19, 2024

Hibernation was quite unstable in Gardener in the past. We have to create a POC which verifies how reliably the hibernation feature works.


tobiscr commented Aug 27, 2024

Points to clarify:

  • How to react if the hibernation fails on the Gardener side. The following process will be used for hibernating/deleting clusters (a kubectl sketch of this fallback flow follows the list):
    1. If the hibernation is blocked by webhooks, KIM deletes the webhooks and retries the hibernation.
    2. Delete the worker pools (this step reduces costs to the minimum).
    3. If this does not resolve the hanging hibernation, KIM forces the deletion of the cluster after a timeout of 2 hours.
  • We have to add a new cluster state in the Runtime CR (e.g. "disposed"; alternatively, we could reuse the new fields described in Transition from KEB API to KIM Runtime CR kyma-metrics-collector#89).
  • A housekeeping job is required to delete disposed clusters once the retention period is reached.
    • The retention time has been agreed on.
  • The subscription-cleanup job has to consider hibernated clusters: a subscription cannot be reused until the cluster is finally deleted.
  • Verify the metering in the KMC component: hibernated clusters must no longer be charged (Transition from KEB API to KIM Runtime CR kyma-metrics-collector#89).
  • We have to ensure that, in case of a cluster recovery, the deletion of the hibernated cluster does not happen (e.g. the status has to be set to "Ready" or similar; documentation is required).
  • Monitoring and alerting have to be extended to cover hibernated clusters, clusters per subaccount, and Shoot CRs without a Runtime CR (#414):
    1. Number of all hibernated clusters
    2. Number of clusters per subaccount
    3. Shoot CRs without Runtime CRs
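
A rough kubectl sketch of the fallback flow from the first point; shoot, project, and webhook names are placeholders, and only the confirmation annotation is standard Gardener behavior for shoot deletion:

# 1. Hibernation blocked by webhooks: find and delete them (inside the shoot cluster):
$> kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
$> kubectl delete validatingwebhookconfiguration my-blocking-webhook

# 2. Retry the hibernation (against the garden cluster):
$> kubectl patch shoot my-shoot -n garden-my-project --type=merge -p '{"spec":{"hibernation":{"enabled":true}}}'

# 3. Still hanging after the 2-hour timeout: force the deletion
#    (Gardener requires an explicit confirmation annotation first):
$> kubectl annotate shoot my-shoot -n garden-my-project confirmation.gardener.cloud/deletion=true
$> kubectl delete shoot my-shoot -n garden-my-project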


tobiscr commented Oct 11, 2024

During a POC, we also verified whether a hibernated cluster gets restarted before the deletion happens.

This isn't the case. The POC covered the following steps:

  1. Create a cluster via Gardener UI
  2. Run an endless loop which retrieves the pods from the cluster via kubectl with a 1-second delay, and check whether the connection is consistently rejected.
  3. Hibernate the cluster
  4. Delete the cluster

The POC confirmed that a hibernated cluster isn't started between hibernation and deletion. After the cluster was hibernated, the loop issuing the kubectl command consistently returned a host-not-found error:

$> for i in $(seq 1 10000); do kubectl --kubeconfig kubeconfig-gardenlogin--kyma-dev--i539990-poc.yaml get po -A; sleep 1; done
# During hibernation, the following output was received:

NAMESPACE     NAME                                    READY   STATUS    RESTARTS   AGE
kube-system   apiserver-proxy-df2p4                   2/2     Running   0          18m
kube-system   calico-node-zqmfk                       2/2     Running   0          18m
kube-system   csi-driver-node-t4nmj                   3/3     Running   0          13m
kube-system   egress-filter-applier-lt54x             1/1     Running   0          18m
kube-system   kube-proxy-worker-l2xim-v1.30.5-9879c   2/2     Running   0          10m
kube-system   network-problem-detector-host-p2nhz     1/1     Running   0          18m
kube-system   network-problem-detector-pod-gq248      1/1     Running   0          18m
kube-system   node-exporter-b778k                     1/1     Running   0          9m49s
kube-system   node-local-dns-x6dvg                    1/1     Running   0          10m
kube-system   node-problem-detector-56f6f             1/1     Running   0          8m49s
No resources found
No resources found
No resources found
No resources found
Get "https://api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com/api/v1/pods?limit=500": read tcp 10.65.141.193:51162->52.208.206.155:443: read: connection reset by peer - error from a previous attempt: read tcp 10.65.141.193:51161->52.208.206.155:443: read: connection reset by peer
Get "https://api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com/api/v1/pods?limit=500": read tcp 10.65.141.193:51176->52.208.206.155:443: read: connection reset by peer - error from a previous attempt: read tcp 10.65.141.193:51174->52.208.206.155:443: read: connection reset by peer
Get "https://api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com/api/v1/pods?limit=500": read tcp 10.65.141.193:51189->52.208.206.155:443: read: connection reset by peer - error from a previous attempt: read tcp 10.65.141.193:51188->52.208.206.155:443: read: connection reset by peer
Get "https://api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com/api/v1/pods?limit=500": read tcp 10.65.141.193:51202->52.208.206.155:443: read: connection reset by peer - error from a previous attempt: read tcp 10.65.141.193:51201->52.208.206.155:443: read: connection reset by peer
Get "https://api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com/api/v1/pods?limit=500": read tcp 10.65.141.193:51215->52.208.206.155:443: read: connection reset by peer - error from a previous attempt: read tcp 10.65.141.193:51214->52.208.206.155:443: read: connection reset by peer
Get "https://api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com/api/v1/pods?limit=500": read tcp 10.65.141.193:51231->52.208.206.155:443: read: connection reset by peer - error from a previous attempt: read tcp 10.65.141.193:51227->52.208.206.155:443: read: connection reset by peer
Get "https://api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com/api/v1/pods?limit=500": read tcp 10.65.141.193:51247->52.208.206.155:443: read: connection reset by peer - error from a previous attempt: read tcp 10.65.141.193:51246->52.208.206.155:443: read: connection reset by peer
Get "https://api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com/api/v1/pods?limit=500": read tcp 10.65.141.193:51260->52.208.206.155:443: read: connection reset by peer - error from a previous attempt: read tcp 10.65.141.193:51259->52.208.206.155:443: read: connection reset by peer
Get "https://api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com/api/v1/pods?limit=500": dial tcp: lookup api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com: no such host - error from a previous attempt: read tcp 10.65.141.193:51268->52.208.206.155:443: read: connection reset by peer


# After a while, the DNS entry was deleted and the hostname was no longer resolvable; this did not change until the cluster was fully deleted:

Unable to connect to the server: dial tcp: lookup api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com: no such host
Unable to connect to the server: dial tcp: lookup api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com: no such host
Unable to connect to the server: dial tcp: lookup api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com: no such host
Unable to connect to the server: dial tcp: lookup api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com: no such host
Unable to connect to the server: dial tcp: lookup api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com: no such host
Unable to connect to the server: dial tcp: lookup api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com: no such host
Unable to connect to the server: dial tcp: lookup api.i539990-poc.kyma-dev.shoot.canary.k8s-hana.ondemand.com: no such host
...<response stayed the same until the cluster was fully deleted>...

pbochynski commented
Before the team picks up the story, please align with @PK85 and @ngrkajac on how it can work with KEB and Cloud Manager. In the initial proposal the Kyma resource would be deleted immediately, but then other cloud resources would also be deleted immediately. We should be consistent here and offer a similar strategy not only for the Gardener cluster but also for the other cloud resources related to the Kyma instance.
