Discuss 'types' of images present in the cluster and how to clean them up. #499
Comments
Curated Images
Possible Appropriate Actions
A(ndrew): The 'do not auto-migrate' path.
Per Andrew: This loses the convenience of notebooks being migrated for users, but the value add of us automatically updating user workloads is minimal. A migration kills in-memory state and messes with conda-installed packages - that is marginally better than simply killing the workload and telling users to restart it themselves. Migrating is also much harder to communicate to users (think of describing to a user what a migration will/will not disrupt: If you're running something it will be stopped, but you might be able to rerun your notebook to get everything back, but if you conda install'd packages then probably not, but ...). That feels like added complexity/poor UX with marginal added value.
B(Justin): The predefined update cycle path
|
I've been doing some thinking on and off about this one and doing some exploratory research. Taking heavy inspiration from both Justin + Andrew, here are some thoughts I have.
User Created Images
As for user-created images, we could treat them similarly to our provisioned jupyterlab-xyz images. We would enforce the labels and have them go through the same "send email, delete after x period of time" process. Something else to consider is the previous point of "no suitable image to update to". We could keep their tag for enviro separate from our tag for jupyterlab-xyz so we can filter on that instead, while still sending users an email saying "please resolve your image's critical vulnerabilities and update or else the image will be deleted on X date".
Platform Images
On platform images, I don't think this task has to do anything special, considering updates seem to be out of scope. They can still fall under "if image is unused in cluster, and over x period of time old, then delete".
Other notes
As an aside, I haven't given it too much thought yet, but I'm not sure how to implement a 'hold-off-on-delete' system. Putting a property tag is simple enough, but the 'how' of determining which images have the 'hold' might be rough (as in, how does a user let the system know they need this image for a few more days?). I'd like to think we can give users enough time to be aware and not do anything huge, but I'm not an end-user, and a critical vulnerability could be way too big to put off.
So an example flow is:
Get list of vulnerable images from Artifactory
--> Compare vulnerable with images in use
--> Send emails to users w/ vulnerable images
--> Set some date on Artifactory for "delete on x date"
--> X Date comes (this could be the first step of the process, checking Artifactory for any images to be deleted)
--> Check for any images where X date has passed and does not have a "hold" tag or something like that
--> Query Artifactory for suitable update images (say via labels + date created)
--> get a list of in-use images
--> match them using the
This could be one cronjob where the first task is just checking for any images overdue for deletion (because of vulnerabilities) and then executing everything else. Again, the only thing I'm not too sure about just yet is the "place image on hold" idea, unless users could just add some property to their own pod or something.
|
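To make the flow above a bit more concrete, here is a minimal sketch of the decision logic for a single cron run, written as a pure function so it can be tested without touching Artifactory or the cluster. Everything named here is an assumption for illustration: the `Image` fields, the `hold` flag, the `delete_on` property, and the 14-day and 90-day periods are hypothetical and not something decided in this thread. It does not cover the "find a suitable update image" step.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Assumed values; the thread only says "x period of time".
GRACE_PERIOD = timedelta(days=14)    # notice given before a vulnerable image is deleted
MAX_UNUSED_AGE = timedelta(days=90)  # age after which an unused image is deleted


@dataclass
class Image:
    name: str
    owner_email: str
    created: date
    vulnerable: bool = False          # flagged by the vulnerability scan
    in_use: bool = False              # referenced by a workload in the cluster
    delete_on: Optional[date] = None  # hypothetical "delete on x date" property
    hold: bool = False                # hypothetical "hold" tag set for the user


def plan_cleanup(images: List[Image], today: date) -> Tuple[List[str], List[str]]:
    """Return (emails to notify, image names to delete) for one run.

    Newly flagged images get their delete_on date set as a side effect.
    """
    to_notify: List[str] = []
    to_delete: List[str] = []
    for img in images:
        # First task: anything past its delete date goes, unless it is on hold.
        if img.delete_on is not None and today >= img.delete_on:
            if not img.hold:
                to_delete.append(img.name)
            continue
        # Vulnerable and in use, not yet scheduled: email the owner, set the date.
        if img.vulnerable and img.in_use and img.delete_on is None:
            img.delete_on = today + GRACE_PERIOD
            to_notify.append(img.owner_email)
            continue
        # Unused and older than the cutoff: safe to delete.
        if not img.in_use and not img.hold and today - img.created > MAX_UNUSED_AGE:
            to_delete.append(img.name)
    return to_notify, to_delete
```

Keeping the decision separate from the side effects (sending email, writing properties, deleting) means the "hold" question only changes how the `hold` flag gets populated, not the rest of the job.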
Notes from discussion yesterday re: what to do after some image is found vulnerable. Have two "Actions".
The second action will take place only if there is no suitable image to update to. Note that this could leave users without an option to work IF there is no good replacement image that they can turn to. If worst comes to worst, we could deploy a new image for them to use that has the vulnerabilities removed, or go with the Ignore rules option to, well, 'ignore' the critical severity... We need to be careful not to step on the toes of platform images. |
Here's a revised flow. Each major step is defined by a letter.
A) 'Safe to be deleted' images
B) Images in which a vulnerability is found
Possible Next Steps ramblings |
Closing. This issue is linked on epic #461 for reference and further discussion can happen there. @Jose-Matsuda Just reopen this if you want to keep any active discussion here. |
Tackles first bullet in #461
Groundwork by Justin in #461 (comment)
2 Types of Workloads
A) Pods with limited lifetime: pipelines and rote work
B) Pods with "infinite" lifetimes: mostly notebooks, currently.
Three types of images
Curated Images
Description: Built by AAW and fully under the control of AAW. Most fall under workload B.
Platform Images
Description: Images for system components that are part of the platform.
Justin notes that 'these usually have well-defined upgrade paths but are out of scope of this conversation since they already have a maintenance process'
User-workload Images
Description: iffy! Images created / used by end-users to accomplish specific tasks. Would mostly be used for pipelines?
TODO in following comments: