
[WIP] Add API fields for AWS OIDC Provider #9271

Closed (wants to merge 1 commit)

Conversation

@rifelpet (Member) commented Jun 4, 2020:

This PR allows Kops to create the IAM OIDC Provider resource, and the discovery.json and keys.json files necessary to set up IAM Roles for Service Accounts. As discussed in the linked issue, the initial implementation will create the json files and require the user to copy them to their chosen url (the specifics don't matter: it could be a public s3 bucket, a static site, etc.)

The rough workflow will be:

  1. User chooses a url to host the json files and determines the CA thumbprint for that url (see the sketch after this list)
  2. Add thumbprint and url to cluster spec
  3. Run kops update cluster --yes
  4. Copy resulting json files from state store to their url
  5. Update any IAM roles' assume role policies to allow assumption via the new OIDC Provider
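
To make step 1 concrete, here is a minimal Go sketch of one way the CA thumbprint could be determined. The host is a placeholder, and fingerprinting the last certificate in the presented chain is an assumption; AWS expects the SHA-1 thumbprint of the certificate authority serving the issuer url, so this is illustrative rather than something the PR implements:

```go
package main

import (
	"crypto/sha1"
	"crypto/tls"
	"encoding/hex"
	"fmt"
	"log"
)

func main() {
	// Hypothetical issuer host; replace with whatever host serves discovery.json.
	host := "s3.amazonaws.com"

	conn, err := tls.Dial("tcp", host+":443", &tls.Config{ServerName: host})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Assumption: fingerprint the last (top-most) certificate in the presented chain.
	// AWS IAM wants the SHA-1 thumbprint of the CA that signs the issuer's certificate.
	chain := conn.ConnectionState().PeerCertificates
	ca := chain[len(chain)-1]
	sum := sha1.Sum(ca.Raw)
	fmt.Println(hex.EncodeToString(sum[:]))
}
```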

TODO for this PR:

  • Create keys.json
  • Pull the service-account-signer keypair down on masters and pass via flags to kube-controller-manager
  • Fix integration test (it runs both terraform and cloudformation, but this feature isn't supported by cloudformation)
  • Add docs
  • Figure out how we can have the Lister and Deleter work in pkg/resources/aws - these are used during cluster teardown. We're only given a cluster name (not the full cluster spec), and oidc providers don't support naming or tagging, so there's no way for us to attribute an oidc provider back to a cluster without knowing an issuer url or thumbprint.

Roadmap for future PRs:

  • Add the admission webhook to bootstrapchannelbuilder.go - Currently there is no public docker image of the webhook. EDIT: images are now available here. There may also be hurdles with the webhook's CA bundle - this would be the first occurrence of a webhook in bootstrapchannelbuilder.go, and we have to figure out how to generate the CA bundle while also making it available to bootstrapchannelbuilder.go, which I think means we can't use fitasks.Keypair. Alternatively, we could use a CSR and have kops-controller approve it.
  • Determine a reasonable way to publicly host the json files. This would simplify the workflow by not having the user copy them to a url themselves and by removing the need to determine the CA thumbprint. (A sketch of what that manual copy looks like today follows below.)
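
Until that lands, the manual copy in step 4 might look roughly like the following AWS SDK sketch. The bucket name, object keys, and public-read ACL are illustrative assumptions, and mapping discovery.json to /.well-known/openid-configuration is also an assumption, not something this PR implements:

```go
package main

import (
	"bytes"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// Contents of the documents copied out of the kops state store.
var discoveryJSON, keysJSON []byte

// uploadPublicly puts one document into a publicly readable S3 object so that
// it is reachable at the issuer url.
func uploadPublicly(svc *s3.S3, bucket, key string, doc []byte) error {
	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket:      aws.String(bucket),
		Key:         aws.String(key),
		Body:        bytes.NewReader(doc),
		ACL:         aws.String("public-read"),
		ContentType: aws.String("application/json"),
	})
	return err
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	// Hypothetical issuer url: https://s3.amazonaws.com/example-oidc-bucket/cluster-foobar
	bucket, prefix := "example-oidc-bucket", "cluster-foobar"
	if err := uploadPublicly(svc, bucket, prefix+"/.well-known/openid-configuration", discoveryJSON); err != nil {
		log.Fatal(err)
	}
	if err := uploadPublicly(svc, bucket, prefix+"/keys.json", keysJSON); err != nil {
		log.Fatal(err)
	}
}
```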

ref: #8264

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 4, 2020
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/api area/provider/aws Issues or PRs related to aws provider approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 4, 2020
@rifelpet rifelpet force-pushed the irsa branch 2 times, most recently from f865161 to e99d89b on June 4, 2020 at 13:35
}
c.AddTask(discoveryFile)

// TODO create keys.json from https://github.com/aws/amazon-eks-pod-identity-webhook/blob/master/hack/self-hosted/main.go
A Member commented:
The other thing we can do here is to get them from the apiserver. The big advantage of this is that I understand we're (eventually) planning on rotating the keys in apiserver, and it would be good to coordinate with that!
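
For context, keys.json is essentially the service-account signing public key published as a JWKS document. A minimal stdlib-only sketch of generating it, where the file path, RSA key type, and kid derivation are assumptions and the hack script linked in the TODO above is the reference implementation:

```go
package main

import (
	"crypto/rsa"
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/json"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
)

// jwk and jwks mirror the subset of RFC 7517 that keys.json needs.
type jwk struct {
	Kty string `json:"kty"`
	Alg string `json:"alg"`
	Use string `json:"use"`
	Kid string `json:"kid"`
	N   string `json:"n"`
	E   string `json:"e"`
}

type jwks struct {
	Keys []jwk `json:"keys"`
}

func main() {
	// Assumption: the service-account signing public key is available as a PEM file.
	pemBytes, err := os.ReadFile("service-account.pub")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		panic("no PEM block found")
	}
	pubAny, err := x509.ParsePKIXPublicKey(block.Bytes)
	if err != nil {
		panic(err)
	}
	pub := pubAny.(*rsa.PublicKey) // assumption: RSA signing key

	// kid derivation is an assumption here (SHA-256 of the DER-encoded key);
	// whatever kops ships must match what kube-apiserver puts in the token header.
	der, _ := x509.MarshalPKIXPublicKey(pub)
	kidSum := sha256.Sum256(der)

	doc := jwks{Keys: []jwk{{
		Kty: "RSA",
		Alg: "RS256",
		Use: "sig",
		Kid: base64.RawURLEncoding.EncodeToString(kidSum[:]),
		N:   base64.RawURLEncoding.EncodeToString(pub.N.Bytes()),
		E:   base64.RawURLEncoding.EncodeToString(big.NewInt(int64(pub.E)).Bytes()),
	}}}
	out, _ := json.MarshalIndent(doc, "", "  ")
	fmt.Println(string(out))
}
```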

if err != nil {
return nil, fmt.Errorf("error getting IAM OIDC Provider: %v", err)
}
// TODO: only delete oidc providers if they're owned by the cluster.
A Member commented:
If the issuer url was 1:1 with the cluster though, I think we could be reasonably sure?

The PR author (@rifelpet) replied:

The issuer url is 1:1 with the cluster, but we only have access to the cluster name here, and the initial plan was to require that the user host the json files themselves at the url, so the url may be totally unrelated to the cluster name even though they are 1:1.

I was imagining a common pattern would be to use an externally-managed public s3 bucket, making the issuer url something like https://s3.amazonaws.com/peters-oidc-provider-bucket/cluster-foobar/. It's unique to the cluster, but if we only have the cluster name I don't think we can guarantee that we've identified a provider to delete unless we document additional requirements like "the url's path must contain the full cluster name".

I proposed some alternative implementations we could explore down the road in #8264. I'm aware of kubernetes/enhancements#1393 and I think it would be good to integrate with; perhaps in this case kops could set up an oidc DNS record within the cluster's hosted zone that points to a dedicated load balancer targeting the new apiserver endpoints. If we can use that as the issuer url, it not only lets us identify the issuer purely from the cluster name but also lets us compute both the issuer url and the CA thumbprint ourselves, so the user doesn't need to provide either. This setup would require an internet-accessible load balancer and the cluster using a public hosted zone, but I think that's a reasonable requirement. Is this what you had in mind?

return nil, fmt.Errorf("error getting IAM OIDC Provider: %v", err)
}
// TODO: only delete oidc providers if they're owned by the cluster.
// We need to figure out how we can determine that given only a cluster name.
A Member commented:
If the issuer url was 1:1 with the cluster though, I think we could be reasonably sure?

A Member replied:
I think you have to match on the URL, possibly in combination with the ClientIDs.
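
For illustration, URL-based matching in pkg/resources/aws could look roughly like the sketch below with the AWS SDK. The suffix convention tying the issuer url to the cluster name is exactly the extra requirement discussed above, so this is an assumption rather than something the PR implements:

```go
package resources

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/iam"
)

// findClusterOIDCProviders returns the ARNs of OIDC providers whose issuer URL
// ends with the cluster name. This only works if we document a convention like
// "the url's path must contain the full cluster name".
func findClusterOIDCProviders(svc *iam.IAM, clusterName string) ([]string, error) {
	list, err := svc.ListOpenIDConnectProviders(&iam.ListOpenIDConnectProvidersInput{})
	if err != nil {
		return nil, fmt.Errorf("error listing IAM OIDC Providers: %v", err)
	}
	var arns []string
	for _, p := range list.OpenIDConnectProviderList {
		detail, err := svc.GetOpenIDConnectProvider(&iam.GetOpenIDConnectProviderInput{
			OpenIDConnectProviderArn: p.Arn,
		})
		if err != nil {
			return nil, fmt.Errorf("error getting IAM OIDC Provider: %v", err)
		}
		// detail.ClientIDList could additionally be checked here, as suggested above.
		url := strings.TrimSuffix(aws.StringValue(detail.Url), "/")
		if strings.HasSuffix(url, clusterName) {
			arns = append(arns, aws.StringValue(p.Arn))
		}
	}
	return arns, nil
}
```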

@johngmyers (Member) left a review comment:
A kops get subcommand to get the documents out of the state store would be helpful.

@@ -191,6 +191,8 @@ type ClusterSpec struct {
SysctlParameters []string `json:"sysctlParameters,omitempty"`
// RollingUpdate defines the default rolling-update settings for instance groups
RollingUpdate *RollingUpdate `json:"rollingUpdate,omitempty"`
// ServiceOIDCProvider defines the OIDC provider setup for the cluster (AWS only)
ServiceOIDCProvider *ServiceOIDCProviderSpec `json:"serviceOIDCProvider,omitempty"`
A Member commented:
Perhaps put this inside CloudConfiguration? I'm wary of adding new options that are specific to a cloud provider into the base Cluster struct.

I'm thinking for the v1beta1 API each cloud provider would have a different struct, like the CNI providers do. The flip side is that a second cloud provider could have the concept of a service OIDC provider, but one would hope it wouldn't have such an annoyingly difficult configuration procedure as AWS does.

A Member replied:
Yeah - naming for this is going to be tough.

  • It's not technically specific to AWS; if we publish the JWKS doc from e.g. an on-metal cluster we should be able to authenticate to AWS.
  • Similarly GCE has some support for this now also, in the form of GKE workload identity, so it's not just to AWS.
  • It's actually even broader than this; we can also authenticate service accounts across clusters etc.

The word that keeps sticking in my mind is "discovery", and this is specifically "identity discovery". I think it's also possible to do cross-cluster discovery using that same bucket (and those files wouldn't necessarily have to be public, either)

Maybe something like

discovery:
  store: s3://some-other-bucket/clusters
  identity: true

I think technically we could even reuse the same S3 bucket, I can't imagine it's worth it from an isolation point of view though :-)

A Member replied:
I don't know how things other than AWS consume OIDC, but I would hope the IssuerCAThumbprints field wouldn't be copied by other cloud providers. It would be a lot more sane if one could just configure AWS with the identity document directly, rather than faffing about with serving it on the public Internet. If AWS ever changes the CA they use for their S3 buckets, everyone's IRSA will likely break immediately.

A Member replied:
It's a good point on the rotations - I presume this means that S3 effectively can't change CAs quickly, though I don't know what happens in 2025 when the current cert expires, presumably some pre-publication. (I think the cert fingerprint includes the expiration date?)

@rifelpet and I chatted on Friday, and we had an idea about an approach that might be a nice way to get this merged: we define a feature flag (or two) that effectively does all this automagically in a (very) limited set of circumstances - currently you must use a public ELB with a real DNS name, and likely other constraints we haven't discovered yet. The feature flag enables the apiserver with anonymous access turned on (which normally we don't like to do in kops, particularly for public ELBs), and we also grant the anonymous-access user permission to read the JWKS documents directly.

We do need to find a way to avoid encouraging this as a real scenario (e.g. I don't want to preclude private ELBs). But it means we don't need an API surface in this PR, and in the PR I added we don't need kops-controller to publish to an S3 bucket. That PR is #9352.

We can then discuss, in a more leisurely way, e.g. whether to have a public bucket, how to name the clusters, and what the API should be for enabling AWS integration. But we don't block working on e.g. per-pod roles or figuring out different ways to publish the JWKS doc (e.g. publish from the kops CLI vs publish from kops-controller).

This also means that upstream (k/k) can start to see some real test coverage of the feature, and get it to GA.
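
To make the "grant the anonymous-access user permission to read the JWKS documents" part concrete, here is a rough sketch of the kind of RBAC involved. The object names, paths, and the system:anonymous subject follow the upstream ServiceAccountIssuerDiscovery defaults and are assumptions here, not necessarily what #9352 does:

```go
package model

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical RBAC allowing unauthenticated reads of the discovery endpoints.
var issuerDiscoveryRole = rbacv1.ClusterRole{
	ObjectMeta: metav1.ObjectMeta{Name: "kops:service-account-issuer-discovery"},
	Rules: []rbacv1.PolicyRule{{
		// Paths kube-apiserver serves for service account issuer discovery.
		NonResourceURLs: []string{"/.well-known/openid-configuration", "/openid/v1/jwks"},
		Verbs:           []string{"get"},
	}},
}

var issuerDiscoveryBinding = rbacv1.ClusterRoleBinding{
	ObjectMeta: metav1.ObjectMeta{Name: "kops:anonymous-service-account-issuer-discovery"},
	RoleRef: rbacv1.RoleRef{
		APIGroup: "rbac.authorization.k8s.io",
		Kind:     "ClusterRole",
		Name:     "kops:service-account-issuer-discovery",
	},
	Subjects: []rbacv1.Subject{{
		APIGroup: "rbac.authorization.k8s.io",
		Kind:     "User",
		Name:     "system:anonymous",
	}},
}
```

The two non-resource paths are the ones upstream kube-apiserver exposes once service account issuer discovery is enabled; the package and object names are arbitrary.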

@k8s-ci-robot k8s-ci-robot added area/documentation area/nodeup size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 7, 2020
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 7, 2020
@justinsb (Member) commented:

FYI I just sent #9341 which extends this PR, adding the idea of automatically publishing the JWKS document to a bucket. (I expect it to be a different bucket, because the JWKS documents themselves need to be publicly readable)

This has clarified things a bit in my mind; there really are two things going on here:

  1. Publishing the JWKS document publicly, which enables consumers to trust serviceaccounts
  2. Registering the cluster/jwks with consumers, like AWS, so that they recognize k8s service accounts.

I do think we want to do both, but it can make sense to do the second even when not running on AWS, or to register more sources (GCP, Azure, maybe other services in future like github?)

In #9341 I defaulted the AWS registration, BTW (when a feature flag UsePodIAM is passed). We might therefore be able to get away without any API at all in this PR (though we do want it in the end, we might be able to separate the discussions).

@johngmyers (Member) commented:

I think creating the JWKS document and publishing it publicly are separable. While publishing to an S3 bucket is convenient (and, incidentally, is what we do) there are other possible mechanisms.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2020
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 17, 2020
@rifelpet rifelpet changed the title [WIP] Add initial support for IAM Roles for Service Accounts [WIP] Add API fields for AWS OIDC Provider Jun 17, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 28, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 24, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 14, 2020
@k8s-ci-robot (Contributor) commented:

@rifelpet: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor) commented:

@rifelpet: The following tests failed, say /retest to rerun all failed tests:

Test name | Commit | Rerun command
pull-kops-bazel-test | c754e40 | /test pull-kops-bazel-test
pull-kops-verify-hashes | c754e40 | /test pull-kops-verify-hashes
pull-kops-e2e-k8s-containerd | c754e40 | /test pull-kops-e2e-k8s-containerd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@rifelpet (author) commented Feb 7, 2021:

to be replaced by #10756

@rifelpet rifelpet closed this Feb 7, 2021
@rifelpet rifelpet deleted the irsa branch May 5, 2021 13:44