Error in logs with Flux for docker images repositories from AWS Public ECR endpoint #3492
Comments
@allamand we also have the same problem. |
FYI ts=2021-07-20T17:16:48.670094734Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-config>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-07-20T17:17:34.176691936Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-config>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-07-20T17:17:34.540198093Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-config>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-07-20T17:18:10.36478641Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-config>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\"" |
I don't have access to an ECR registry, but this might be related to #3015 or #3124. The issue appears to be that your registry does not allow listing of tags. I am almost certain I have seen this issue before... AWS provides this documentation about listing tags from ECR: That's an AWS API endpoint, not a Registry v2 API endpoint, for reference. I do not know if ECR supports publicly listing tags, but if it did, it seems like that would be a special feature that might have to be separately enabled and permitted. This leads me to believe that the normal Docker Registry v2 API method for listing tags cannot be used by Flux. This seems to agree with the 404 response; if AWS does not publish a tags/index endpoint in their ECR Registry API, then this will not work with Flux. The docs do mention support for ECR in some places though, so I am not sure if this was ever supported. Is this a new problem (something that worked before, but stopped working), or is it related to some changes in your cluster, behavior that wasn't tested before and might never have worked? If it's not supported, the best you can do might be to add an exclusion to avoid scanning the public ECR endpoints. Please let me know if this information helps at all. There is documentation for Flux v2 to support ECR, but I don't know if it covers public ECR or if it can cover that (as you may be aware, Flux v2 also requires listing images in order to promote image updates via ImagePolicy and Automation). |
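For reference, the tag listing discussed above corresponds to the Registry v2 endpoint `GET /v2/<name>/tags/list`. The sketch below is illustrative Python, not Flux's actual Go code, and `tags_list_url` and `parse_tags_response` are hypothetical helper names. It only shows how a plain-text `404 page not found` body fails JSON decoding, which is roughly what the `invalid character 'p' after top-level value` message in the logs is reporting.

```python
import json


def tags_list_url(registry: str, name: str) -> str:
    """Build the Registry v2 tag-listing URL for an image repository."""
    return f"https://{registry}/v2/{name}/tags/list"


def parse_tags_response(status: int, body: str) -> list:
    """Return the tag list from a Registry v2 tags/list response.

    Raises ValueError when the body is not the expected JSON document,
    for example when the registry answers with a plain-text 404 page.
    """
    try:
        doc = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError(f"error parsing HTTP {status} response body: {exc}") from exc
    if status != 200 or not isinstance(doc, dict):
        raise ValueError(f"unexpected HTTP {status} response: {doc!r}")
    return doc.get("tags") or []
```

A conforming registry answers with a JSON object containing a `tags` array, which this helper returns as a list; the body `404 page not found` starts with a valid JSON number followed by stray text, so decoding fails instead.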
@kingdonb in our specific case it does not refer to a private registry but the problem occurs with the AWS public registry |
@pierluigilenoci I understand that, the issue is that ECR does not appear to respond to the Docker Registry API's method for listing existing tags as an index. AWS provides their own API instead. This will perhaps still be true for public registries. Docker registry clients that you can use as a stand-alone tool are a bit of a dark art, but I may have one handy that I can use to confirm this expected Registry v2 endpoint is or is not supported. It must provide access to list tags in the standard way, else it will likely not be usable with Image Automation in any (current or historical) version of Flux. |
OK, for example, here is a conversation with the Docker Hub registry that implements the full Registry API v2:
A list of tags comes back from the tag index endpoint, this is the expected reply from a conforming Docker Registry v2 (or, possibly this is a feature that was not added until version 2.1, I may have seen when looking for more info about this?) I'm using a Ruby Docker Registry client, https://github.com/deitch/docker_registry2 – the Docker Hub registry API endpoint v2 is a well-known URL that is usually baked into docker clients somehow, which can be used for debugging, in this case just to show what a normal conversation in Registry v2 looks like according to Flux. I can pull the manifest as an alternative way of determining if I've connected correctly to the public/unauthenticated registry endpoint (this occasionally fails with 429 Too Many Requests, but after a few moments patience and trying again, I get an affirmative reply back like this, with the manifest of the image belonging to the requested tag):
Here's what that conversation looks like when I try the same thing with the eks-d public registry that hosts for example The tag list function returns the same 404 error that you were reporting though, @pierluigilenoci
This seems to indicate the |
Thank you @kingdonb, |
@kingdonb AWS support confirmed the problem. |
Thanks for making the connection to AWS, aws/containers-roadmap#1262 (comment) (yikes!) |
Not sure if this was attempted, but I was able to get Flux to work with public ECR images by making sure the IAM role associated to the flux deployment had the appropriate IAM access policy permissions - (policy ARN |
That's interesting, so if your cluster is not on AWS and you are using images published via public.ecr.aws I guess that sort of implies you'll need to be using an AWS account and providing token credentials for the authenticator, (not exactly public!) @hspencer77 do you happen to know if this method uses the AWS API registry index method, or if it works through the docker registry API? My assumption is since it uses an IAM role, it must be the AWS API. |
@kingdonb , based upon the documentation, it looks like it relies on having an authentication token. For example, when I didn't pass an authentication token, I received the following error:
When I passed an authentication token, I was able to list the image tag manifests:
Another thing I noticed in the documentation is that there wasn't any example of |
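The token requirement described above resembles the usual Docker Registry token flow: an unauthenticated request gets a `401` with a `WWW-Authenticate: Bearer ...` challenge, the client requests a token from the advertised realm, then retries with `Authorization: Bearer <token>`. A minimal Python sketch of the challenge-handling step follows. The function names are hypothetical, and whether public.ecr.aws issues exactly this kind of challenge is an assumption here, not something confirmed in this thread.

```python
import re
from urllib.parse import urlencode


def parse_bearer_challenge(header: str) -> dict:
    """Parse a 'Bearer k="v",...' WWW-Authenticate header value into a dict."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError(f"unsupported auth scheme: {scheme!r}")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def token_url(challenge: dict, repository: str) -> str:
    """Build the token request URL for pull access to one repository."""
    query = {"scope": f"repository:{repository}:pull"}
    if "service" in challenge:
        query["service"] = challenge["service"]
    return challenge["realm"] + "?" + urlencode(query)
```

For example, a made-up challenge such as `Bearer realm="https://auth.example.test/token",service="registry.example.test"` parses into a dict with `realm` and `service` keys, and `token_url` turns that into the URL a client would call before retrying the tag listing.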
@kingdonb as a follow-up, even though I was able to pull the image successfully and have it running on my EKS cluster, I still see this error any time I do a
The image I am using is this:
I have also confirmed that even though the error shows up in the logs, the image is successfully pulled down (in another deployment where specific version
Here is the error in the logs:
|
A little update on the situation on my side (EKS cluster). I tried, as suggested, to add the ts=2021-08-13T08:11:25.986658906Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@github.com/[REDACTED] branch=[REDACTED] HEAD=[REDACTED]
ts=2021-08-13T08:12:39.053016064Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:13:18.695749073Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:13:19.058279705Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:13:28.145340636Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:14:17.151279983Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:15:16.102895506Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:15:22.972276987Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:15:23.344056067Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:15:54.544792709Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\"" |
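To see at a glance which images the warmer keeps failing on, log lines like the ones above can be summarized with a few lines of Python. This is only a sketch: it assumes the `key=value` layout seen in these logs (values either bare or double-quoted with backslash escapes), and `parse_logfmt` and `failing_images` are hypothetical helper names, not part of Flux.

```python
import re
from collections import Counter

# key=value pairs where the value is either a double-quoted string
# (with backslash escapes) or a run of non-space characters.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line: str) -> dict:
    """Split one logfmt-style line into a dict of raw field values."""
    return {key: value for key, value in PAIR.findall(line)}


def failing_images(lines) -> Counter:
    """Count warmer errors per canonical image name."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("component") == "warmer" and "err" in fields:
            counts[fields.get("canonical_name", "<unknown>")] += 1
    return counts
```

Feeding it the output of `kubectl logs` for the Flux pod, one line at a time, would give a per-image error count; lines from other components, such as the sync loop, are ignored.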
I have kingdonb/flux at 0294896 which contains all of the changes I planned to release in 1.24.0. I'm pushing that image out for testing as If you'd like to test this, there are a number of fixes and updates there which we haven't discussed individually, but I actually don't think this is going to solve your issue by itself. I think there is a configuration change needed on your end, and I'm not prepared to help with that from here. You might also need to add the AWS region to the Flux config through helm chart values, so that it bypasses the AWS metadata API for those things and just uses the role. (This is the issue from #3015) I am a little bit lost when it comes to AWS stuff as I don't have an AWS account that I am testing against. I am sorry I cannot be more helpful with this. If this is very important to you, we have paid support options that we can follow, and I'll be happy to contact you in private about them – but this is the limit of what I can do within the bounds of community support. There are people on my team with much more AWS knowledge than I have, but their availability is usually subject to a paid support contract. Please see: https://fluxcd.io/support/#i-am-stuck If you are already engaging our paid support then I can certainly try to help escalate this. |
Hello all Flux 1.24.0 has been released. Please re-check this issue, if you have been waiting for the release. I am not sure if all responders on this issue have the same issue, or if at this point the original poster has resolved their issue and we are spamming them. (If so, we can close it out and let anyone who has a separate issue report it again. You're welcome to refer back to this issue, but I need a complete and original report to follow up, so we can be sure we are not creating spam clouds around issues that don't pertain to the original reporter's issue.) |
Hi @kingdonb,
kubectl get pods flux-b6c85484f-pbwb9 -o json | jq ".spec.containers | .[0].image"
"docker.io/fluxcd/flux:1.24.0"
I get the same result:
|
@pierluigilenoci from my previous message, highlight:
Please make sure your configuration matches the config described here: #3492 (comment) Flux can speak the AWS API, as long as an IAM role is assigned and Lines 191 to 196 in c1267f5
If you are on the latest version, and your IAM role is permitted to pull from public registries with the This impassable condition was addressed by #3124, so until that was merged into the 1.23 release series, it would in many cases not have been possible for this to work. It is a little bit counter-intuitive that you must set the AWS region to use ECR but I think it is unavoidable. (I am not sure how to document this well or to prevent this from coming up again, except to recommend that people should move on to Flux v2 as soon as possible, as any documentation call-out in the Flux v1 docs seems likely to be overlooked/unlikely to be noticed, but I will happily accept a PR for the docs in fluxcd/website if someone can show how the docs could be improved to better support this workflow.) Please keep us in the loop if this still isn't working, or if that information resolves it. There should be no technical reason for blocking this from working, but there are certainly a few things which you might trip over that may not be possible to resolve. |
All EKS nodes in my cluster have an AWS::IAM::Role with these permissions:
arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
arn:aws:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly
arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
and Flux 1.24 is configured in this way:
Logs: ts=2021-08-31T10:35:07.256293914Z caller=aws.go:125 component=aws info="using regions from local config"
ts=2021-08-31T10:35:07.256649046Z caller=aws.go:117 component=aws info="restricting ECR registry scans" regions=[eu-west-1] include-ids=[] exclude-ids="[602401143452 918309763551]"
ts=2021-08-31T10:35:07.499084486Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=1.24.0
ts=2021-08-31T10:33:54.29279459Z caller=sync.go:542 method=Sync cmd=apply args= count=15
ts=2021-08-31T10:33:55.117563567Z caller=sync.go:608 method=Sync cmd="kubectl apply -f -" took=824.690172ms err=null output="helmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/flux unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged"
ts=2021-08-31T10:33:55.227492868Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@github.com/[REDACTED]/[REDACTED] branch=[REDACTED] HEAD=[REDACTED]
ts=2021-08-31T10:33:55.22753494Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2021-08-31T10:34:33.064357264Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-31T10:34:35.102430814Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\"" So? |
@pierluigilenoci I'm sorry I didn't read this thread more carefully, @hspencer77 had those same errors in their logs as well. It might take some time to get around to reproducing this on my own. I don't have an AWS account or sponsored resources of any kind on AWS cloud. If someone from AWS was to step in and work on this, it would be more likely to get solved sooner. Is there full consensus that the image warmer does not successfully reach public.ecr.aws? I thought someone said that it worked, but now it isn't clear that it does. If it was something wrong in an existing feature then surely we should fix it, but I honestly don't know why AWS does this differently or if they have plans to remediate this and make it work more like any other standard Docker public registry. If so, then we would be better off waiting, rather than investing more into an EOL integration which is questionable whether or not it should be in the scope of Flux's maintenance mode. (Did this ever work before?) If it requires special support from Flux to use with their image registry, it will need to have existing feature support inside of the codebase, else it would be a new feature and therefore definitely would fall outside of the scope of maintenance mode. I know little about Flux's AWS integration. It is clear that it is meant to work with ECR, but it is not clear that the public.ecr.aws endpoint is actually treated by Flux as an ECR endpoint that requires IAM authentication. Perhaps there is someone from the EKS-D team or elsewhere in AWS support who can pop in and give an opinion on how we should proceed. (Do you know anyone @pierluigilenoci ? I think you mentioned that you have an AWS support contract, so if you have someone on the team who already has context around this problem, it will be easier to get the conversation started.) |
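One way to picture the open question of whether `public.ecr.aws` is treated as an ECR endpoint at all: private ECR registries use hostnames of the form `<account-id>.dkr.ecr.<region>.amazonaws.com`, and the public gallery host does not fit that shape. The Python sketch below classifies hostnames on that basis. It is an illustration only; whether Flux's own host check works this way is an assumption, and `classify_registry` is a hypothetical name.

```python
import re

# Private ECR hosts look like <account-id>.dkr.ecr.<region>.amazonaws.com
# (with a .cn suffix in the China partition). FIPS endpoints are ignored here.
PRIVATE_ECR_HOST = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(\.cn)?$"
)


def classify_registry(host: str) -> str:
    """Classify a registry hostname as private ECR, public ECR, or other."""
    if PRIVATE_ECR_HOST.match(host):
        return "private-ecr"
    if host == "public.ecr.aws":
        return "public-ecr"
    return "other"
```

If an integration only recognizes the private pattern, a host such as `public.ecr.aws` would fall through to the generic Docker Registry code path, which would be consistent with the `auth=` field in the logs listing only `index.docker.io` credentials.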
It's also not clear to me that simply giving the role to the nodes can solve any problem. You need the Flux daemon pod to have the role. EKS isolates the pods and prevents them from reaching the AWS Metadata API to prevent them from grabbing arbitrary roles that have been assigned to the nodes for some higher function of administration that ought not be granted to every pod on the cluster. We discussed assigning the role to the node group, but did you place whatever annotation on the Flux daemon pod/service account that gives the role assignment to the actual Flux pod? Some basic research indicates knowledge of this strategy (IRSA) might be important: I guess it is the Flux pod's service account that must be annotated to give it access to the desired/specified IAM role. Again I am not skilled or trained for AWS but this is just what I found, I've heard of IRSA before and understood this information about the blocked metadata API from working on #3124. Sorry again for the trouble around this. Hope some of this information helps, please do keep us informed so we can close this issue with a positive resolution if it is possible! |
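For anyone trying the IRSA route mentioned above, the service account annotation is `eks.amazonaws.com/role-arn`. The Python sketch below just renders such a ServiceAccount manifest as JSON (which `kubectl apply -f` accepts). The namespace, account name, and role ARN are placeholders, `flux_service_account` is a hypothetical helper, and the cluster also needs an IAM OIDC provider configured, which is not shown.

```python
import json


def flux_service_account(namespace: str, name: str, role_arn: str) -> dict:
    """Return a ServiceAccount manifest annotated for IRSA."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # IRSA reads the role to assume from this annotation.
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own namespace, name, and role ARN.
    manifest = flux_service_account(
        "flux", "flux", "arn:aws:iam::111122223333:role/flux-ecr-readonly"
    )
    print(json.dumps(manifest, indent=2))
```

The Flux pod then has to run under that service account for the role to apply to it rather than to the node.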
@kingdonb , not sure if you saw this, but I think you could get an AWS account for free (given the purpose of the |
Thanks @hspencer77, that's helpful, and I will consider it for the future! I think AWS has been forthcoming with cloud resource credits in the past for us, I haven't been with the Flux project for that long, but what we really need is someone who is already in a position to reproduce this issue to spend the time on solving it, and add any documentation for Flux support of this feature if it is necessary. I can build an EKS cluster some time and try it out, but I will be learning a lot of AWS stuff practically from scratch; there may be time for that at some point, but unfortunately with the level of investment required to get myself started solving this issue it would be pretty counter-intuitive to me in my position and role. This especially, even doubly so, given that we don't know the state of this capability in Flux v2, and any work we do here to solve it on the Flux v1 side would potentially need to be repeated over there, else Flux v1 will potentially have features that Flux v2 doesn't. I would absolutely like nothing more than to guarantee that AWS public ECR registries are usable with all versions of Flux! The easiest way for me to do that, would be to convince AWS that they should abide by the standard and not require AWS role assignments granting permission and special AWS API clients for pulling image index from what are billed as public (standards-compliant) Docker image repos. We are very friendly with AWS as it should be clear from our collaboration on https://aws.amazon.com/ecr/faqs/ To be perfectly clear, AWS has several statements on ^this FAQ page about ECR's compatibility and support. It is said to be compatible with the Docker Registry v2 and OCI formats. This appears to be a clear divergence from the Registry v2 standard, at least with respect to image indexes, and so it seems to be the case that rather than writing AWS-specific things into Flux... 
it will benefit many more projects and teams, if instead ECR team can be convinced to support the unabridged standard. It would be great to have a statement from AWS on whether they will always respond to Docker Registry tag index for any public ECR repo with |
@kingdonb I am sorry but I am unable to provide further information and currently, I do not have time to investigate this in a profitable way. 😞 In my opinion, the heart of the matter is how the ECR responds to requests and how they implemented the protocol. I believe that for ECR public images it is not necessary to be authenticated. |
@pierluigilenoci we are agreed. I was able to reach someone from AWS who is looking into this, and we should have a more helpful answer soon. In the meantime, I have been reminded that we offer free Flux 2 migration workshops for those who are struggling with parts of the migration from legacy Flux v1. I would like to pass on this questionnaire to anyone who is interested: |
@yebyen thank you very much, I filled out the questionnaire. The problem is that we use helm to install Flux and Flux 2 doesn't have a chart available. |
For those interested there has been the Flux2 chart for some time |
This project is in Migration and security support only, so unfortunately this issue won't be fixed. We recommend users to migrate to Flux 2 at their earliest convenience. More information about the Flux 2 transition timetable can be found at: https://fluxcd.io/docs/migration/timetable/. |
Describe the bug
Errors are popping up in the Flux controller logs, complaining about images in the public ECR registry:
To Reproduce
Steps to reproduce the behaviour:
Expected behavior
No errors in the Flux logs
Logs
Note: I also tried to provide Flux with the ECR registry policy, but I still have the problem
Additional context