Clean up the existing images in the registry #79

Closed
blairdrummond opened this issue May 25, 2020 · 13 comments

Comments

@blairdrummond
Contributor

CC @sylus

I noticed that we're about halfway through the storage in the container registry. A good chunk of that is probably my fault haha, from experimental images that were very large.

It might be worthwhile to do a cleanup before too long. Some of those images probably operate as root, too.

@blairdrummond
Contributor Author

I opened this issue to think about a better solution for going forward

#80

If @sylus and @zachomedia like the idea, we could keep core platform images in the core registry, and then user-images in the semi-disposable registry.

@sylus
Member

sylus commented Jul 29, 2020

First can we audit our primary ACR and remove any more developmental containers?

We can discuss here which ones we need to drop before we actually do it ^_^

@blairdrummond
Contributor Author

Yep absolutely will :-) I'll sort them into a list and send off to you and Zach prior to removal.

Building on my earlier comment, do you know if it's possible to set permissions on users so that they can only push to a prefix of the container registry? For the future I was thinking maybe people other than daaas admins could push to $OUR_REGISTRY/users/${their_image_name}. Might make this cleanup a bit easier in the future

@blairdrummond
Contributor Author

So I actually don't have deletion permissions anyway! So I'm gonna leave this info with you @sylus (CC @zachomedia). My code is attached.

  • I looked through the acr and found some of my development containers, which I marked in fully-remove.txt
  • I also included the list that @brendangadd gave me with the containers that are actively running (just for reference)
  • I provided some scripts and their output json files, the last of which gets all images more than 5 versions behind (measured by timestamp) and older than a month. I figured we could very safely delete those.
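
For readers without the attachment, the rule in the last bullet can be sketched roughly as follows. This is not the attached script, just a minimal Python illustration. The record fields `repo`, `tag` and `timestamp`, the function name `stale_tags`, and the reading of "more than 5 versions behind" as "not among the 5 newest tags of its repo" are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def stale_tags(images, now, keep_latest=5, max_age_days=30):
    """Return (repo, tag) pairs that are BOTH outside the `keep_latest`
    newest tags of their repo AND older than `max_age_days`.

    `images` is a list of dicts with hypothetical keys:
    "repo" (str), "tag" (str), "timestamp" (datetime).
    """
    cutoff = now - timedelta(days=max_age_days)

    # Group the tags by repository.
    by_repo = defaultdict(list)
    for img in images:
        by_repo[img["repo"]].append(img)

    stale = []
    for repo, tags in by_repo.items():
        # Newest first, so everything past index `keep_latest` is "behind".
        newest_first = sorted(tags, key=lambda t: t["timestamp"], reverse=True)
        for img in newest_first[keep_latest:]:
            if img["timestamp"] < cutoff:
                stale.append((repo, img["tag"]))
    return stale
```

A repo with five or fewer tags never yields candidates under this rule, however old its tags are.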

acr-cleanup.zip

There is also an az purge command, which I figure is how we'll delete the images? But I don't have permissions to use it. I figure we can just pipe the jq output into it using the --filter options that this az command is using.

FYI we only have 40 GB of space left 😨

@sylus
Member

sylus commented Aug 3, 2020

I removed most of the ones from the to-remove list. Think I might eventually want to move the foundational ones to their own ACR but at least some things are cleaned up now. We also need to clean older images from the CI builds.

@blairdrummond
Contributor Author

Yeah separating them makes sense to me I think.

Did you remove the ones in the third JSON file too? I saw some crazy stuff like 100+ versions of Orchard, as well as all our CI builds.

@frazs
Contributor

frazs commented Aug 12, 2020

Deleted, including dangling manifests:

  • statcan-orchardcore up to last tag in use (188 from 23/06/2020)
  • shiny up to 5th latest tag (e22706bc7e24ff1ec7fa866d4fff8a3ca7bfbb12 from 27/5/2020)
  • r-studio-cpu up to last tag in use (937eb8bf35fae5cd303b8d02c5bd7fd71b270fd7 from 29/5/2020)
  • mlflow-operator up to 5th latest tag (81e2aac59118a83de151147063347ac88dc2902e from 28/6/2020)
  • mlflow up to last tag in use (d2adcddb0a3d3cba0ac151bb0c4ac9f59be488a9 from 28/6/2020)
  • minimal-notebook-gpu up to 5th latest tag (6fa16e8f0d1534d2b39e0fece7f2147442fcd0db from 6/8/2020)

To be continued.

@frazs
Contributor

frazs commented Aug 12, 2020

Er, some of those minimal-notebook-gpu ones removed were recent (August 6th is not June 8th!), but there are no minimal-notebook-gpu images in use right now.

How should we handle the case of more than 5 versions behind, and older than a month, but still in use? Particularly when there are many later but still outdated versions that are not in use?

  1. Delete anyway,
  2. Keep the one in use but delete later unused + outdated versions, or
  3. Only delete up to last in use, even if it's very old?

There are 24 instances of minimal-notebook-cpu:e2981d65b5ceecaea3d161704632f4a881f18de2 out there, for instance, and that's a tag from 14/05/2020. Plus another 24 of minimal-notebook-cpu:562fa4a2899eeb9ae345c51c2491447ec31a87d7 from 30/05/2020.

While deleting the image wouldn't in and of itself delete a container, I do not know if there's any point (e.g. pod rescheduling?) where the image is expected to still exist or if that's cached somewhere.

@frazs
Contributor

frazs commented Aug 12, 2020

Deleted:

  • base-notebook cpu older than a month
  • base-notebook gpu older than a month
  • covid up to 5th latest tag (ca7917e0986b6d0c23e143bd207281d5b9aec8d7 from 8/4/2020)
  • daaas-get-covid-data up to 5th latest tag (5ba6720c84e560b858be440b0d9d623fbcf147d0 from 30/5/2020)
  • daaas-kenchu-wifr-aggregate up to 5th latest tag (5ba6720c84e560b858be440b0d9d623fbcf147d0 from 30/5/2020)
  • dremio up to 5th latest tag (79e71bbe68908b083d0f5551f9274c2a961d4f2c from 23/4/2020)

Skipping daaas-constrstarts-geo for now, there's something weird going on with sequential SHAs.

TBC

@frazs
Contributor

frazs commented Aug 13, 2020

Per today's standup, I'll keep images that are in use but delete later unused + outdated versions. I'm placing a delete lock on the images in use, and will purge images older than the 5th most recent version, or a month, whichever is later.
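
As a rough sketch of that plan (not the commands actually run), the Python below only builds Azure CLI argument lists and executes nothing. It assumes the delete lock means the `--delete-enabled false` attribute set via `az acr repository update`, and that deletion goes through `az acr repository delete`. The function name `plan_commands`, the registry name and the image names are hypothetical, and the flags are worth checking against `az acr repository --help` before anything is run.

```python
def plan_commands(registry, candidates, in_use):
    """Build (lock_commands, delete_commands) as argument lists.

    `candidates` is a list of "repo:tag" strings picked by the cleanup rule;
    `in_use` is a set of "repo:tag" strings that are currently deployed.
    Nothing is executed here; the caller decides what to do with the lists.
    """
    # Protect every image that is in use from deletion.
    locks = [
        ["az", "acr", "repository", "update", "--name", registry,
         "--image", image, "--delete-enabled", "false"]
        for image in sorted(in_use)
    ]
    # Delete only candidates that are not in use.
    deletes = [
        ["az", "acr", "repository", "delete", "--name", registry,
         "--image", image, "--yes"]
        for image in candidates
        if image not in in_use
    ]
    return locks, deletes


# Hypothetical inputs, for illustration only.
locks, deletes = plan_commands(
    "myregistry",
    candidates=["demo:old-1", "demo:old-2", "demo:old-3"],
    in_use={"demo:old-2"},
)
```

Applying the locks first means an in-use image that slips into the delete list by mistake should be refused by the registry rather than removed.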

@frazs
Contributor

frazs commented Aug 13, 2020

A good 100 GB was restored just from purging machine-learning-notebook-gpu > 30d (except for 1 still in use). Best keep an eye on that one. I wonder if there's any way to track/visualize/get info about image size in our repos? Layer sharing makes that a bit complicated, but some idea would still be good.

Edit: machine-learning-notebook-cpu quite hefty too.
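
To illustrate why layer sharing complicates the size question, here is a minimal Python sketch that compares a naive per-image sum with a sum that counts each layer digest once. The input shape (image name mapped to a list of `(digest, size)` pairs), the function name `repo_sizes` and the sample numbers are made up for the example. How the manifests would be fetched from the registry is left out.

```python
def repo_sizes(manifests):
    """Return (naive_total, deduplicated_total) in bytes.

    `manifests` maps an image reference to its layers, each layer being a
    (digest, size_in_bytes) tuple. Layers with the same digest are stored
    once, so the deduplicated total counts each digest a single time.
    """
    naive_total = 0
    unique_layers = {}
    for layers in manifests.values():
        for digest, size in layers:
            naive_total += size
            unique_layers[digest] = size
    return naive_total, sum(unique_layers.values())


# Made-up data: two tags sharing one large base layer.
example = {
    "notebook:v1": [("sha256:base", 900), ("sha256:v1", 50)],
    "notebook:v2": [("sha256:base", 900), ("sha256:v2", 70)],
}
naive, deduplicated = repo_sizes(example)
```

The gap between the two totals gives a rough idea of how much space deleting a single tag would really free, since shared layers stay as long as another image references them.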

@frazs
Contributor

frazs commented Aug 13, 2020

We're down to 203GB 😀

This takes care of clean-up based on Blair's scripts. There are still quite a lot of images that aren't tracked because they don't have more than five versions to begin with. I could go through those, but that would mean deleting entire container repos, so I'm not sure what the best way to sort them out would be.

@brendangadd
Contributor

Great stuff @frazs! I think what you've done so far is sufficient; we can leave the edge cases for another time. We'll need to look into how we automate this, and answer some of your questions around dealing with old images that are still in use.
