
The Knative Serving controller can not access the same set of private registries as the K8s cluster #1996

Closed
symfrog opened this issue Sep 11, 2018 · 26 comments · Fixed by #4084


symfrog commented Sep 11, 2018

Expected Behavior

The Knative Serving controller is able to access the same set of private registries as the K8s cluster.

Actual Behavior

The Knative Serving Controller fails to authenticate itself to the same private registries as the K8s cluster.

Steps to Reproduce the Problem

An example would be a Kops cluster that is able to access a private AWS ECR registry, while the Knative Serving controller can not.

A high-level set of steps to reproduce the issue is provided below:

  1. Create a Kops cluster that is able to access a private ECR registry
  2. Observe that creating a Pod directly in the cluster that references an image in the private ECR registry successfully pulls the image (a minimal manifest is shown after this list)
  3. Observe that the Knative Serving controller fails to access the image at this point with the error "unsupported status code 401; body: Not Authorized"
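For step 2, a minimal pod manifest of the kind that pulls successfully on such a cluster; the ECR address is a placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: ecr-pull-check
spec:
  restartPolicy: Never
  containers:
  - name: ecr-pull-check
    # Placeholder address; any image in the private ECR registry will do.
    image: <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
    command: ["/bin/true"]

If the node's IAM role grants ECR access, the image pulls and the pod completes, while the Knative Serving controller's tag-to-digest resolution of the same image fails with the 401 above.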

The difference in behavior is due to Kops setting the --cloud-provider flag, which is not enabled for the Knative Serving controller when it executes: https://github.com/kubernetes/kubernetes/blob/36877dafe40495cb43994464e2427355f99042c7/pkg/credentialprovider/aws/aws_credentials.go#L158

Additional Info

Possible solutions:

  1. The Knative Serving controller could create a pod in the target namespace and then query the pod status (e.g. .status.containerStatuses[0].imageID) to extract the digest. This would effectively delegate the image pull to the K8s cluster. An example pod spec is provided below, and a client-go sketch of this approach follows the list:
apiVersion: v1
kind: Pod
metadata:
  name: knative-resolver-{target_image_address}
spec:
  containers:
  - name: knative-resolver-{target_image_address}
    image: {target_image_address}
    imagePullPolicy: Always
    command: []
    args: []
  2. The Knative Serving controller image could provide a mechanism to supply Docker credential helpers at a known path. This would make it possible to use, for example, the ECR credential helper.
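A minimal client-go sketch of the first possible solution; the helper name resolveDigestViaPod, the polling interval, and the pod shape are illustrative assumptions, not actual Knative controller code:

package resolver

import (
    "context"
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// resolveDigestViaPod delegates the image pull to the cluster: it creates a
// short-lived pod for the image and reads the digest the kubelet reports in
// .status.containerStatuses[0].imageID.
func resolveDigestViaPod(ctx context.Context, kc kubernetes.Interface, ns, image string) (string, error) {
    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{GenerateName: "knative-resolver-"},
        Spec: corev1.PodSpec{
            RestartPolicy: corev1.RestartPolicyNever,
            Containers: []corev1.Container{{
                Name:            "resolver",
                Image:           image,
                ImagePullPolicy: corev1.PullAlways,
                Command:         []string{"/bin/true"}, // only the pull matters
            }},
        },
    }
    created, err := kc.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
    if err != nil {
        return "", err
    }
    defer kc.CoreV1().Pods(ns).Delete(ctx, created.Name, metav1.DeleteOptions{})

    // Poll until the kubelet reports an imageID, which carries the digest,
    // e.g. "docker-pullable://repo@sha256:...".
    for i := 0; i < 60; i++ {
        p, err := kc.CoreV1().Pods(ns).Get(ctx, created.Name, metav1.GetOptions{})
        if err != nil {
            return "", err
        }
        if cs := p.Status.ContainerStatuses; len(cs) > 0 && cs[0].ImageID != "" {
            return cs[0].ImageID, nil
        }
        time.Sleep(2 * time.Second)
    }
    return "", fmt.Errorf("timed out waiting for image %q to be pulled", image)
}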

Workarounds

  1. Specify the digest in the image address, which prevents the controller from attempting to resolve a tag to a digest:

    if _, err := name.NewDigest(pod.Containers[i].Image, name.WeakValidation); err == nil {

  2. Configure the registry in registriesToSkip for the controller (a hedged configuration example follows below)

Both of the above workarounds lose the tag-to-digest resolution functionality.
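For the second workaround, a hedged example of the skip-list configuration: in later Serving releases the setting is exposed through the config-deployment ConfigMap under the key registriesSkippingTagResolving; the exact key name and location may differ in the version discussed here.

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # Comma-separated registries for which tag-to-digest resolution is skipped.
  registriesSkippingTagResolving: "ko.local,dev.local,<account-id>.dkr.ecr.<region>.amazonaws.com"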


tcnghia commented Sep 11, 2018

/cc @jonjohnsonjr @mattmoor

@cppforlife

I imagine that once Knative Build can "output" image digests, this functionality would go away from Knative Serving's controller. Is that correct?

@mattmoor

Yes, but the same problem is shared by the support recently added to kaniko for authenticating via cluster identity.


solsson commented Sep 19, 2018

Using a local (i.e. self-hosted) registry, as opposed to a private (authenticated) one, I'd say this issue isn't only about authentication. Possible solution 2 under "Additional Info" above won't suffice. Solution 1 will probably work, since it uses the nodes' docker daemon, but as explained below we must use registry URLs that differ between build and serving.

Builds can push to the registry over a regular Service using the service.namespace.svc.cluster.local DNS name without a port. I've tested Kaniko, Riff and Buildpack. For all of them to work I had to expose the registry with both non-SSL access on port 80 and TLS on 443.

However, TLS using a cluster-generated certificate isn't trusted. Kaniko must run with --skip-tls-verify (while using curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt works). No flag is needed for Riff and Buildpack, probably because they hit port 80 first and are OK with that. Because docker by convention uses SSL for non-localhost registries, and I found neither a way to generate valid .local certs nor a way to disable validation for pulls, service access is effectively blocked, for example at the digest extraction step, with errors such as "x509: certificate signed by unknown authority".

With regular k8s apps we can access images through a proxy DaemonSet (auth directives removed) with hostPort: 5000, using image: 127.0.0.1:5000/... (a hedged sketch of such a proxy follows at the end of this comment).

Serving, or any other container without a mounted docker.sock, cannot use 127.0.0.1 unless we always set a digest and thus bypass the extraction step. Using the service for push and 127.0.0.1:5000 + digest for pull is the only way I found to support the complete Knative workflow.

I have yet to test a locally pushed image as FROM with Kaniko.
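For context, a hedged sketch of the hostPort proxy approach mentioned above; the registry namespace, the Service DNS name registry.registry.svc.cluster.local, and the nginx configuration are assumptions about a typical self-hosted setup, not details from this thread.

apiVersion: v1
kind: ConfigMap
metadata:
  name: registry-proxy-conf
  namespace: registry
data:
  default.conf: |
    server {
      listen 5000;
      location / {
        # Forward to the in-cluster registry Service over plain HTTP (port 80).
        proxy_pass http://registry.registry.svc.cluster.local;
        proxy_set_header Host $http_host;
      }
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: registry-proxy
  namespace: registry
spec:
  selector:
    matchLabels:
      app: registry-proxy
  template:
    metadata:
      labels:
        app: registry-proxy
    spec:
      containers:
      - name: nginx
        image: nginx:1.17-alpine
        ports:
        - containerPort: 5000
          hostPort: 5000   # nodes (and their docker daemons) reach it at 127.0.0.1:5000
        volumeMounts:
        - name: conf
          mountPath: /etc/nginx/conf.d
      volumes:
      - name: conf
        configMap:
          name: registry-proxy-conf

Each node can then pull through 127.0.0.1:5000, which docker treats as an insecure (non-TLS) registry by default.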


sebgoa commented Sep 24, 2018

I am curious to see who else faces this issue.

knative/build#3

It seems that we can push to an in-cluster repo, but Serving cannot pull from it.

@jonjohnsonjr

This issue touches on a few problems, one of which is that ECR doesn't automagically work with knative. I've written up a description of this problem here: google/go-containerregistry#355

Ideally, the fix for this would land upstream in kubernetes/kubernetes. I don't use ECR, so it's hard for me to test, but if anyone on AWS is willing to pick this up, I'm happy to provide any missing pointers.

@Justin2997

I think this PR will fix it in the new version: kubernetes/kubernetes#75587


yu2003w commented Jun 12, 2019

@bbrowning I set up a private registry and DNS resolution failed with the error message below:

"Get https://icpdev.icp:8500/v2/: dial tcp: lookup icpdev.icp on 10.0.0.10:53: no such host"

10.0.0.10 is the clusterIP of the kube-dns service in my cluster.
Any advice on how to work around this?

@greghaynes

@yu2003w This looks like a k8s configuration issue on your end: .icp is not a valid TLD, so I assume it's something local you use for development (and it is failing to resolve).


yu2003w commented Jun 13, 2019

@greghaynes Yes, you're correct. icpdev.icp is an entry in /etc/hosts.
I resolved the issue by adding hostAliases to the controller deployment (see the sketch below).
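A minimal sketch of that change, applied to the controller Deployment's pod template; the IP below is a placeholder for whatever the /etc/hosts entry pointed at:

spec:
  template:
    spec:
      hostAliases:
      - ip: "<registry-host-ip>"
        hostnames:
        - "icpdev.icp"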

@mattmoor

Talked to @sebgoa about helping test @jonjohnsonjr's patch for ECR support. If we don't hear back (positively!) by EoD Thursday, this will probably not land until 0.8.

@ryanbrainard

If it helps anyone else work around this issue until a better solution lands, take a look at the ecr-helper and also this post on running it periodically. This is more for the build side, but it helped me figure out how to make it work for the serving side too. The missing piece was 1) creating a docker-registry secret and then 2) adding the imagePullSecrets to the service account used by the ksvc (a sketch of step 2 follows). Since imagePullSecrets is namespace-local, this apparently has to be done for every namespace where a ksvc is running (and for every service account if not using the default).
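A hedged sketch of step 2, assuming the docker-registry secret from step 1 is named ecr-creds and the ksvc runs in a namespace called my-app (both names are placeholders):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: my-app      # repeat for every namespace that runs a ksvc
imagePullSecrets:
- name: ecr-creds        # the docker-registry secret created in step 1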

@mattmoor

@ryanbrainard if you are up for testing out some bits with ECR, @jonjohnsonjr has a patch that we hope will make things work.

@ryanbrainard

@mattmoor @jonjohnsonjr For sure! We were just about to dive into coding up this workaround in a somewhat automated fashion (so far we have only tested it in manual bash scripts), but it would be great to avoid that. Can you point me at the code and any instructions? I'm also in the Knative Slack if that helps.

cc: @jttyeung

@mattmoor

The PoC is here; it basically carries a patch to upstream K8s that has landed at head.

@ryanbrainard

@mattmoor I gave it a whirl, but it didn't work. Details are here: #4084 (comment)


mcpomm commented Jun 22, 2019

(Quoting @ryanbrainard's workaround above: creating a docker-registry secret and adding imagePullSecrets to the service account used by the ksvc.)

@ryanbrainard ksvc seems to use the default service account. I patched it accordingly (as described here: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account). Then I added the serviceAccountName property to my service yaml:
apiVersion: serving.knative.dev/v1alpha1 # Current version of Knative
kind: Service
metadata:
  name: hello-event-test # The name of the app
  namespace: default # The namespace the app will use
spec:
  template:
    spec:
      serviceAccountName: default
      containers:
      - image: xxxxx.dkr.ecr.eu-central-1.amazonaws.com/hello-event-test:latest

The result is unfortunately the same as before:
Status:
  Conditions:
    Last Transition Time:  2019-06-22T19:50:13Z
    Message:               Revision "hello-event-test-pmd7k" failed with message: Unable to fetch image "05....dkr.ecr.eu-central-1.amazonaws.com/hello-event-test:latest": unsupported status code 401; body: Not Authorized

As described above, when I start a Pod directly in my Kops cluster that pulls the image from ECR, everything works fine. Only with ksvc does it fail.
Did you make it work?
I'm just wondering if the same problem occurs with AWS EKS.
Thank you very much in advance.

@ryanbrainard

@mattmoor I dug a little deeper on trying the patch and got it to work! It turned out to just be an IAM issue with the controller pod (+ needing the patch applied). Details: #4084 (comment)
