-
-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Not replacing pruned images #843
Comments
@ppapp92 The reason for this behaviour is k8s-image-swapper/pkg/registry/ecr.go Lines 234 to 239 in e817e04
Restarting ECR doesn't signal the execution of its lifecycle policy runs, so essentially only pulling for that info is possible as far as I know. Possible option to resolve your issue:
|
@estahn Hello! Today we spent a couple hours debugging an issue where one of the images in our google artifact registry was removed per our cleanup policy (which we have now increased). We had GKE nodes that were unable to start because of this exact same issue (the coincidence on this being the most recent issue when I looked was a good laugh for us). Having the ability to either disable the in-memory cache or tune the timing of the cache (maybe 1-2 hours?) to avoid this would be awesome, add us to the list of people wanting some more options around this! edit: I've gave this a shot at #847, let me know what you think! |
Just a perspective from someone that just started using this project: It seems like the underlying challenge is that there's no bridge between the external purge management (ECR policies and whatnot) and the internal cache that is kept in memory. I had a conversation with one of my teammates about this and having built a few Kubernetes Operators in my lifetime, it seems that what's missing are two things to make this issue completely goes away:
With these two processes in place, there's very little need for any type of TTL that attempts to keep the two sources (cache and external OCI registry) in sync. In the other hand, this would make this project surprisingly close to a fully fledged operator as it would have to use libraries such as https://github.com/kubernetes/client-go/tree/master/informers and possibly https://github.com/kubernetes-sigs/controller-runtime. This is a very interesting project, so I gotta say congrats for the initiative 😃 . |
@ahatzz11 I do appreciate the initiative and time you've taken to contribute a PR. The PR itself looks great (code, docs, tests, etc) and something we could possibly merge. However I think this is going in the wrong direction. Whatever time the cache will be set to will eventually raise aforementioned failure as the "cleaning" process runs at a non-predictable time (at last for ECR). The diagram below hopefully shows deleting an image from the registry will always raise a failure due to the internal cache regardless of the TTL. @migueleliasweb Awesome! I really like the idea of using sequenceDiagram
Kubernetes API->>+k8s-image-swapper: 1. Webhook request
alt Image exists according to cache?
else Image exists according to image registry?
k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists
else
k8s-image-swapper->>Image Registry: Push image to registry
k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists
end
k8s-image-swapper->k8s-image-swapper: Replace image in webhook response
k8s-image-swapper->>-Kubernetes API: 1. Webhook response
Image Registry "Cleaner"->>Image Registry: Delete images
Kubernetes API->>+k8s-image-swapper: 2. Webhook request
alt Image exists according to cache?
else Image exists according to image registry?
k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists
else
k8s-image-swapper->>Image Registry: Push image to registry
k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists
end
k8s-image-swapper->k8s-image-swapper: Replace image in webhook response
k8s-image-swapper->>-Kubernetes API: 2. Webhook response
|
@estahn I love those flowcharts from MermaidJS 🤩
That's precisely my point. Mitigations can be made, but there will always be a bit of delay/disruption if we go this direction. Ps: Do you think by listening to pod events and periodically performing a full-sync from the ECR registry this problem would vanish? I personally think so. Ps2: Is there a reason why not to cache the whole ECR repo tags (maybe with an option for a filter)? Even with many thousands of images/tags as strings, something like a |
I think listening to the pod event as you described and either purging the entire cache or a specific cache entry solves the issue. I would argue its unlikely you need all images of the registry. The usual scenario is moving from one image version to the next. Therefore I don't see much of a benefit to hold this all in cache. If there is a good argument to be made I'm not against it.
There wasn't a need to as far as I'm concerned. Also keeping things lean and efficient :) |
I agree 😄 Having the cache at all introduces the opportunity for it to be wrong. I figured being able to set it to something like 30 minutes or something much smaller than 24 hours would at least make that time that is wrong a lot shorter/maybe unnoticeable, at least in our situation where it took us a couple hours to figure out what was even wrong. Listening to pod events would be spectacular, but much more of a lift than I have time to put into it. If this is something you do we would love to test it out! |
TTL is probably the easiest fix but a more comprehensive solution makes sense long term. For our use case right now even if it had to check ECR every time a new pod was being deployed the amount of API calls to check ECR would be pretty low. |
@estahn What are your thoughts about having this caching configuration as a short-medium term option for people to improve this, with the long term plan being a bit more tied to k8s events and truly be dynamic? It would definitely help out our organization. |
We're currently using k8s-image-swapper to proxy all the images in our cluster to ECR. I have setup a lifecycle policy for the images to delete after 30 days after push.
However once the image gets pruned by ECR k8s-image-swapper won't replace it if the image is requested again causing an error pulling the image because it no longer exists in ECR. Only fix I've found is to restart k8s-image-swapper pods and then the image gets pulled and pushed to ECR correctly.
The text was updated successfully, but these errors were encountered: