This repository has been archived by the owner on Apr 25, 2023. It is now read-only.

A single federated cluster can stop propagation of a type for all clusters if it does not have the specified resource version. #1241

Closed
dangorst1066 opened this issue Jun 30, 2020 · 19 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dangorst1066

A single federated cluster can stop propagation of a type for all clusters if it does not have a particular resource version.

And a question: are there any good strategies for handling cluster estates that could have multiple versions of a resource in circulation (e.g. v1beta1 and v1 CRDs)?

Editing the target type version in the federated type config to v1beta1 (the lowest common denominator) appears to work around this OK (to be confirmed), but it's still worrying that a single cluster could stop all federation from working. This doesn't seem like it should be the expected behaviour.
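
To illustrate the "lowest common denominator" idea, here is a minimal sketch. The helper `commonVersion` and the cluster names are hypothetical and not part of KubeFed; the sketch assumes you already know which API versions each member cluster serves (Kubernetes 1.16 serves both apiextensions.k8s.io/v1 and v1beta1, 1.15 serves only v1beta1) and picks the most preferred version that all of them share.

```go
package main

import "fmt"

// commonVersion returns the first version, in preference order, that every
// cluster serves. It is a hypothetical helper, not a KubeFed API.
func commonVersion(served map[string][]string, preference []string) (string, bool) {
	if len(served) == 0 {
		return "", false
	}
	for _, candidate := range preference {
		everywhere := true
		for _, versions := range served {
			if !contains(versions, candidate) {
				everywhere = false
				break
			}
		}
		if everywhere {
			return candidate, true
		}
	}
	return "", false
}

// contains reports whether want is present in list.
func contains(list []string, want string) bool {
	for _, item := range list {
		if item == want {
			return true
		}
	}
	return false
}

func main() {
	// Assumed inventory of CRD API versions served per cluster.
	served := map[string][]string{
		"cluster-1.16-a": {"v1", "v1beta1"},
		"cluster-1.16-b": {"v1", "v1beta1"},
		"cluster-1.15":   {"v1beta1"},
	}
	version, ok := commonVersion(served, []string{"v1", "v1beta1"})
	fmt.Println(version, ok) // prints "v1beta1 true"
}
```

With a mixed 1.15/1.16 estate this selects v1beta1, which matches the workaround of setting the target type version to v1beta1.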

What happened:

  • Ran a federation control plane at Kubernetes version 1.16.
  • Enabled federation of CRDs (v1).
  • Joined another 1.16 cluster and confirmed that CRDs and CRs of that type were being propagated OK.
  • Joined a 1.15 cluster. CRDs and CRs were not propagated to the 1.15 cluster (its CRDs are at version v1beta1). All propagation of CRDs and CRs of the same type stopped working for the 1.16 cluster as well.

Logs for the controller manager show messages like:

E0630 07:13:47.048845       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.3/tools/cache/reflector.go:105: Failed to list apiextensions.k8s.io/v1, Kind=CustomResourceDefinition: the server could not find the requested resource

What you expected to happen:

I expected v1 CRDs not to propagate to the 1.15 cluster, however I did not expect the propagation of all CRDs to all clusters to stop working.

How to reproduce it (as minimally and precisely as possible):

  • Run a federation control plane at Kubernetes version 1.16+.
  • Enable federation of v1 CRDs.
  • Create a federated CRD, and a CR of that type with placement that will match all clusters.
  • Join another 1.16 cluster and confirm that the CRD and CR are being propagated OK.
  • Join a 1.15 cluster. Expect the CRD and CR not to be propagated to it.
  • Create a new federated CRD, or a CR of the original type. These should still be propagated to the 1.16 cluster, but I have observed that they are not.

Anything else we need to know?:

Environment:

  • Kubernetes version: 1.16 for the federation control plane, 1.15 or earlier for one or more federated clusters
  • KubeFed version: 0.3.0
  • Scope of installation: Cluster
  • AWS/EKS

/kind bug

@k8s-ci-robot added the kind/bug label on Jun 30, 2020
@RainbowMango
Contributor

@dgorst Thanks for your feedback.
Let me reproduce it locally and then get back to you.

@RainbowMango
Contributor

RainbowMango commented Jul 1, 2020

@dgorst
Could you please help confirm whether these are the minimum steps to reproduce?

Prepare clusters:

[root@ecs-d8b6 kubefed]# kubectl -n kube-federation-system get kubefedclusters
NAME       AGE     READY
cluster1   9d      True // v1.17.4  (apiextensions.k8s.io/v1)  `this is the host cluster`
cluster2   9d      True // v1.17.4 (apiextensions.k8s.io/v1)
cluster3   3h10m   True // v1.15.0 (apiextensions.k8s.io/`v1beta1`)

Operation Steps:

  • Create a CRD, such as crontabs.stable.example.com, whose apiVersion is apiextensions.k8s.io/v1.
  • Enable CRD federation with the command: kubefedctl enable customresourcedefinitions
  • Federate the CRD with the command: kubefedctl federate crd crontabs.stable.example.com

Result:

[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster1
NAME                          CREATED AT
crontabs.stable.example.com   2020-07-01T12:50:31Z
[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster2
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "crontabs.stable.example.com" not found
[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster3
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "crontabs.stable.example.com" not found

You expected the CRD to be propagated to cluster2 while cluster3 is ignored, right?

@dangorst1066
Author

Yes exactly @RainbowMango 👍

It feels like the blast radius from a single (to be fair, misconfigured) cluster should not extend to propagation to the good clusters. So in your example, yes, I don't expect a v1 CRD in cluster1 to be propagated to cluster3, but I would expect it to continue to be propagated to cluster2.

I mention a CR of the type of the CRD as that would also stop propagating at the point the 1.15 cluster is joined. But it's the same issue I guess (the CRD doesn't get propagated because it can't list v1/crds, so it also can't list that type either)

@RainbowMango
Contributor

@dgorst
I did some investigation and found that the FederatedCustomResourceDefinition sync controller is completely blocked because one of the informers can't finish its sync process.

The following check keeps failing.
https://github.com/kubernetes-sigs/kubefed/blob/bf67d02369e9b2d93281f8224747b94afab3170e/pkg/controller/sync/controller.go#L235-L238

I agree with you that the propagation process should ignore bad clusters.
Let's see how to solve this.
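
The blocking behaviour described here can be sketched in a few lines. This is only an illustration of the idea, not the code at the linked lines: `clusterInformer`, `allSynced` and `syncedClusters` are hypothetical names. It contrasts an all-or-nothing readiness gate, where one unsynced cluster blocks everything, with a per-cluster gate that skips the bad cluster and lets the rest proceed.

```go
package main

import "fmt"

// clusterInformer is a hypothetical stand-in for a per-cluster informer
// and whether its initial list/watch has completed.
type clusterInformer struct {
	name   string
	synced bool
}

// allSynced is an all-or-nothing gate: a single unsynced cluster
// makes the whole controller report "not ready".
func allSynced(informers []clusterInformer) bool {
	for _, inf := range informers {
		if !inf.synced {
			return false
		}
	}
	return true
}

// syncedClusters is a per-cluster gate: it returns the clusters that are
// ready to reconcile and, separately, the ones that should be skipped.
func syncedClusters(informers []clusterInformer) (ready, skipped []string) {
	for _, inf := range informers {
		if inf.synced {
			ready = append(ready, inf.name)
		} else {
			skipped = append(skipped, inf.name)
		}
	}
	return ready, skipped
}

func main() {
	// cluster3 stands for the 1.15 cluster whose informer cannot list v1 CRDs.
	informers := []clusterInformer{
		{"cluster1", true},
		{"cluster2", true},
		{"cluster3", false},
	}
	fmt.Println(allSynced(informers)) // prints "false"

	ready, skipped := syncedClusters(informers)
	fmt.Println(ready, skipped) // prints "[cluster1 cluster2] [cluster3]"
}
```

With the per-cluster gate, cluster2 would still receive the CRD while cluster3 is reported as skipped, which is the behaviour requested in this issue.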

@dangorst1066
Author

Thanks @RainbowMango for recreating and confirming 👍

Happy to have a stab at resolving this if that'll help? (Caveat: I'm new to the kubefed codebase, so I may need to reach out on Slack with some questions!)

@RainbowMango
Contributor

I've tried a workaround locally, but the community has discussed a better solution.

@hectorj2f @jimmidyson @irfanurrehman
Could you please take a look? Is the solution that changes FederatedTypeConfigStatus OK with you?

@irfanurrehman
Contributor

irfanurrehman commented Jul 15, 2020

@RainbowMango thanks for tracking this. IMO the solution proposed by pmorie as per the link you mentioned is completely legit and can be implemented. As far as I understand @font might not be available to complete it.
@dgorst are you up for taking this task up?

@RainbowMango
Contributor

Given that the implementation is somewhat complicated (API change, controller adoption, testing, etc.), I'd like to set up an umbrella issue, split this into several tasks, and work through them iteratively. @dgorst you are welcome to pick up any of the tasks you are interested in.

What do you think, @irfanurrehman? If it's OK with you, could you help review the PRs that follow?

@irfanurrehman
Contributor

Awesome suggestion @RainbowMango. I can certainly review them.
If time permits, I will take up some tasks too.

@hectorj2f
Contributor

Thanks for taking care of this @RainbowMango. It sounds good to me too. Share the action items to see if we can help somehow.

@RainbowMango
Contributor

Just sent a draft issue #1252. I have started some work locally, so I'll take the first task.
Thanks for your support @irfanurrehman @hectorj2f .

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 14, 2020
@jimmidyson
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Oct 14, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 12, 2021
@hectorj2f
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 12, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 12, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 12, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

7 participants