-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get credentials from AWS ECR when needed #174
Conversation
I would love to see tests for this, but mocking AWS services is no small task. Perhaps something like https://github.com/localstack/localstack could help. |
@squaremo does this work with IRSA or with node IAM bindings or with both? |
From the explanation in #139 (comment) I would expect it works with the node IAM bindings that EKS does automatically, as well as IRSA (IAM roles for service accounts) and possibly other mechanisms, since "one can grant additional IAM permissions via a multitude of mechanics (IAM node role, IAM service account, or even configuring access keys manually via environment variables), and all official AWS SDKs will automatically try each until it gets a valid identity". @MartinEmrich Are you able to confirm that this is the correct expectation? |
It works with the IAM instance profile that each EKS node has to have. Container image pulling should have no relationship to IAM service accounts/IRSA (after all, service accounts are assigned to pods, which run an already pulled image). IRSA (or, putting access keys somewhere in your workloads, which is always bad and unneccesary), operates within your workloads (Pods), so are on a completely different level. (Disclaimer: I a just a dude with a google, I have no hidden insights in how AWS does stuff internally.) |
Regarding the name of the flag: I don't really have a strong opinion on |
As for the flag, I would like to urge you just one final time to make it enabled by the default, to follow the Principle Of Least Surprise. Again, this feature has no security implications, drawbacks or unexpected behaviour. (IMHO, even a flag to disable it makes no sense, as it does not protect anything from anything) |
This is my understanding:
This PR unconditionally asks ECR for credentials (given the enable flag, or if it were always enabled). That has the following effect:
So I think there are two reasons to guard this feature with a flag:
|
I think this is the way to go, especially since many orgs don't use the same account for ECR and EKS. We can't assume that credentials present on nodes are for the registry that hosts the app images. |
879ce9f
to
c90d8c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add an example to the API docs and explain how people can enable auto login of ECR?
I was wondering about that. It's not part of the API, but there's no "running the controller" docs kept in the individual repo. I think it would go somewhere in the docs website. |
Having said that, it can be mentioned in the API guide somewhere around the |
As discussed in #139 If the flag `--user-aws-ecr` is set, the controller will log in to AWS Elastic Container Registry by fetching a temporary token from AWS (when a given image is hosted in ECR). Signed-off-by: Martin Emrich <martin@martinemrich.me>
- rename to parseAwsImage, since it's not a URL - make return value include a bool rather than an error, since it's not an error for the image to not be ECR - return values explicitly rather than assigning them, since that can be prone to shadowing mistakes Signed-off-by: Michael Bridgen <michael@weave.works>
Checking for ECR whether or not there's a secret referenced means the reconciliation can fail despite there being a valid credential. If you are not running in EKS (or otherwise with AWS IAM permissions), but using ECR, you will still want to be able to supply a secret. Signed-off-by: Michael Bridgen <michael@weave.works>
This is my attempt at a descriptive flag for this feature. It mentions AWS and ECR, and suggests automatically getting credentials. It starts with `aws` so it will sort alongside any other AWS-specific flags that come along later. Signed-off-by: Michael Bridgen <michael@weave.works>
The TODO items for scanning ECR were based on Flux v1's ECR-specific code for image scanning; but it turns out they are not necessary for a viable implementation. I have removed the TODOs, and given an explanation why it's fit for purpose as it is. Signed-off-by: Michael Bridgen <michael@weave.works>
Signed-off-by: Michael Bridgen <michael@weave.works>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @squaremo ☕
PS. Maybe we should create issues for Azure and GCP auto login and point to this PR as an example for future contributors.
Good thinking: There's also #11 which is something like an umbrella issue for these. |
Hi @squaremo, I just tried this in Production today and it works. Thanks! Just one thing to note, I believe that you only have to log into a registry (account + region combination) once. This implementation appears to be doing if for every repository within the same registry, which will bulk out the logs and hit the API quite a lot when we start scanning our 100s of repos every minute or so. Perhaps some naive approach of working out all registries in the loop that will need logging into then just doing them once (in our use case that's 1 login vs. n logins for n repos)? |
🎉 Thank you for guinea-pigging this feature!
Yes, it's the simplest possible implementation to start with. I did a quick calculation to check that it would be well within the quota for requests for most users (you can check my assumption here). But, I do understand being nervous about quotas, and there is certainly room to reduce the number of requests required. Flux v1 caches the tokens it gets, and since this is a flag that applies to all image scans, the implementation here could do the same. |
…d): elaborate on how to enable autologin for ECR fluxcd/flux#1619 fluxcd/flux2#107 fluxcd#11 fluxcd#139 fluxcd#147 fluxcd#174 fluxcd#179 fluxcd#180 fluxcd/website#601 fluxcd#193 fluxcd/website#611 https://fluxcd.io/docs/guides/image-update/#aws-elastic-container-registry Signed-off-by: lloydchang <lloydchang@gmail.com>
…d): elaborate on how to enable autologin for ECR fluxcd/flux#1619 fluxcd/flux2#107 fluxcd#11 fluxcd#139 fluxcd#147 fluxcd#174 fluxcd#179 fluxcd#180 fluxcd/website#601 fluxcd#193 fluxcd/website#611 https://fluxcd.io/docs/guides/image-update/#aws-elastic-container-registry Signed-off-by: lloydchang <lloydchang@gmail.com>
…d): elaborate on how to enable autologin for ECR fluxcd/flux#1619 fluxcd/flux2#107 fluxcd#11 fluxcd#139 fluxcd#147 fluxcd#174 fluxcd#179 fluxcd#180 fluxcd/website#601 fluxcd#193 fluxcd/website#611 https://fluxcd.io/docs/guides/image-update/#aws-elastic-container-registry Signed-off-by: lloydchang <lloydchang@gmail.com> Signed-off-by: Kingdon Barrett <kingdon@weave.works>
…d): elaborate on how to enable autologin for ECR fluxcd/flux#1619 fluxcd/flux2#107 fluxcd#11 fluxcd#139 fluxcd#147 fluxcd#174 fluxcd#179 fluxcd#180 fluxcd/website#601 fluxcd#193 fluxcd/website#611 https://fluxcd.io/docs/guides/image-update/#aws-elastic-container-registry Signed-off-by: lloydchang <lloydchang@gmail.com> Signed-off-by: Kingdon Barrett <kingdon@weave.works>
A rebase of #147 (to avoid force pushing over a main branch).
TODOs carried over (in a comment, and based on Flux v1's implementation):
The quota for getting an auth token is 500/second/region: https://docs.aws.amazon.com/general/latest/gr/ecr.html. Since image scanning tends to be done on periods of O(minutes), you would need more than 500 * 60 images to scan (ImageRepository objects) before it was not viable to get an auth token for each scan. You might have that many if you run automation in lots of clusters in the same region, under the same account. But in my estimation, not bothering with caching is the 80/20 (80% of the value with 20% of the complexity, in this case) solution.
Expiry only needs to be accounted for if tokens are cached.
Yes, retrying after failures is an easy way to burn through your API quota (see above). As it happens though, if an image is detected as being from ECR, and the flag for ECR auto-login was set, then failing to get a token will error out of
.Reconcile()
, and the controller-runtime machinery will back off.I'm not aware of any special treatment needed for these -- perhaps if you wanted to scan an image in ECR for some non-global partition, from another partition. But let's cross that bridge when we get to it.