
Discuss 'types' of images present in the cluster and how to clean them up. #499

Closed
Jose-Matsuda opened this issue May 4, 2021 · 5 comments

Comments

@Jose-Matsuda
Contributor

Jose-Matsuda commented May 4, 2021

Tackles first bullet in #461
Groundwork by Justin in #461 (comment)


2 Types of Workloads
A) Pods with limited lifetime: pipelines and rote work
B) Pods with "infinite" lifetimes: mostly notebooks, currently.


Three types of images

Curated Images

Description: Built by AAW and fully under the control of AAW. Most fall under workload B.

Platform Images

Description: Images for systems that are part of the platform.
Justin notes that 'these usually have well-defined upgrade paths but are out of scope of this conversation since they already have a maintenance process'

User-workload Images

Description: iffy! Images created / used by end-users to accomplish specific tasks. Would mostly be used for pipelines?


TODO in following comments:

  • Address the iffy description for user-workload images.
  • Discuss appropriate actions for the Curated Images and User-workload images in terms of notification and deletes, as well as the possibility of updates or blocking downloads.
@Jose-Matsuda
Contributor Author

Jose-Matsuda commented May 4, 2021

Curated Images

Possible Appropriate Actions


A(ndrew): The 'do not auto-migrate' path.

  1. Vulnerability detected; admins are notified.
  2. Set an EOL date. (To set Artifactory properties, the image may need to be stored in a local repository.) A property-setting sketch follows Andrew's note below.
  3. If the image is not in use, just delete it.
  4. If the image is in use as a notebook image (as all our Curated images currently are), notify the user of the offending notebook of the EOL date and tell them they must take action.
  5. (Delete): when the EOL is reached, kill all offending workloads and delete the image.

Per Andrew

This loses the convenience of notebooks being migrated for users, but the value add of us automatically updating user workloads is minimal. A migration kills in-memory state and messes with conda installed packages - that is marginally better than simply killing the workload and telling users to restart it themselves. Migrating is also much harder to communicate to users (think of describing to a user what a migration will/will not disrupt: If you're running something it will be stopped, but you might be able to rerun your notebook to get everything back, but if you conda install'd packages then probably not, but ...). That feels like added complexity/poor UX with marginal added value.
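Going back to step 2 of Andrew's path, here is a minimal sketch of what stamping an EOL date onto an image could look like, assuming Artifactory's "Set Item Properties" REST call; the URL, repo, path, and property name are placeholders rather than agreed conventions.

```python
# Hedged sketch: attach an 'eol' property to an image in Artifactory so later
# jobs can act on it. All names below (URL, repo, path, property) are
# placeholders, not the team's real configuration.
import datetime
import requests

ARTIFACTORY_URL = "https://artifactory.example.ca/artifactory"  # placeholder
API_KEY = "..."  # placeholder credential

def set_eol_property(repo: str, image_path: str, days_until_eol: int = 14) -> None:
    """Stamp an EOL date on an image folder via the Set Item Properties API."""
    eol = (datetime.date.today() + datetime.timedelta(days=days_until_eol)).isoformat()
    resp = requests.put(
        f"{ARTIFACTORY_URL}/api/storage/{repo}/{image_path}",
        params={"properties": f"eol={eol}", "recursive": "1"},
        headers={"X-JFrog-Art-Api": API_KEY},
    )
    resp.raise_for_status()

# e.g. set_eol_property("docker-local", "jupyterlab-cpu/oldtag")
```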


B(Justin): The predefined update cycle path

  1. Notification process to ensure users know about the predefined update cycle. Note: Having a way to reschedule / delay the update would be key.
  2. Images have a predefined update cycle (weekly?). Users then get used to the routine. A side effect is that it could "help reduce the financial burden of the cluster: by rescheduling workloads, we may be able to ensure that workloads are packed tighter onto nodes and reduce the number required."
  3. Critical updates: this depends on the frequency of the update cycle. If the curated images handle the regular update process well, there is minimal extra disruption beyond the update cycle itself.

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented May 11, 2021

I've been doing some thinking on and off about this one and doing some exploratory research.
As a TLDR of this issue, it boils down to what we as a team want to do when an image (platform, kubeflow-container variant, or user image) is found to be vulnerable.

Taking heavy inspiration from both Justin + Andrew, here are some thoughts I have.

  1. While I really like Andrew's simpler approach, I understand that what Justin has said would be super nice to have.
  2. I don't know if I like the 'predefined update cycle'. Say we have it running weekly: I don't think (right now anyway, maybe because of our current priorities) we push enough changes to our images to make full use of this feature (a restart would often just reuse the same image). That, and the possible messing with installed packages, has me wary.
  • Instead of this, our process could run weekly. Finding offending artifacts is easy since Xray scans artifacts as they are brought in, and when a new vulnerability is added to its database it automatically re-runs to update the scans / information. If any vulnerabilities are found, we go ahead with emails and scheduling updates + restarts.

  • Another option: when the Xray policy finds a vulnerability it could also trigger a webhook to kick off the process (I haven't used webhooks beyond sending messages to a Slack channel; I suppose instead of a Slack channel it would send to this service).

  3. Regarding finding a suitable image to update to, I think there could be a few options to go with here:
  • Semantic versioning, as suggested by Justin. I was originally thinking of using Docker labels (I'm open to other suggestions), but it seems a bit annoying to get a container's LABEL metadata out of a pod. Two issues about this: A and B. If I could get this easily then we can specify which label on the image we want, with a query something like items.find({"name":{"$eq":"manifest.json"},"@docker.label.version":{"$gt":"1.5.1"},"path":{"$match":"my-docker-image/*"}}) ($gt works fine in testing). See the AQL sketch after this list.
  • Another option is going with LABEL enviro="prod", or some label that indicates whether the image should be used by an auto-update feature. Instead of comparing on version, we would look for "@docker.label.enviro":{"$eq":"prod"} and add a condition that the replacement image's created date must be after the creation date of the current vulnerable image.
    Now, in the case where there is no suitable image to update to (because of upstream dependencies, etc.), maybe we should not send an email alerting users, and of course should not schedule any 'deletes'.
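To make the label idea above concrete, here is a hedged sketch of running that AQL query through Artifactory's AQL search endpoint; the base URL and credential are placeholders, and the label name simply mirrors the example in this comment.

```python
# Hedged sketch: query Artifactory for images whose docker.label.version
# property is newer than the vulnerable one. URL and credential are
# placeholders; the label name mirrors the example above.
import requests

ARTIFACTORY_URL = "https://artifactory.example.ca/artifactory"  # placeholder
API_KEY = "..."  # placeholder credential

def find_newer_images(image_path_glob: str, min_version: str) -> list:
    """Return manifest.json entries for tags with a version label > min_version."""
    aql = (
        'items.find({"name":{"$eq":"manifest.json"},'
        f'"@docker.label.version":{{"$gt":"{min_version}"}},'
        f'"path":{{"$match":"{image_path_glob}"}}'
        '}).include("repo","path","name","created")'
    )
    resp = requests.post(
        f"{ARTIFACTORY_URL}/api/search/aql",
        data=aql,
        headers={"Content-Type": "text/plain", "X-JFrog-Art-Api": API_KEY},
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

# e.g. find_newer_images("my-docker-image/*", "1.5.1")
```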

User Created Images

As for user-created images, we could treat them similarly to our provisioned jupyterlab-xyz images: enforce the labels and have them go through the same "send email, delete after x period of time" process. Something else to consider is the previous point about there being "no suitable image to update to". We could have their enviro tag be separate from our tag for jupyterlab-xyz so we can filter on that instead, still sending users an email saying "please resolve your image's critical vulnerabilities and update, or else the image will be deleted on X date".

Platform Images

On Platform images, I don't think this task needs to do anything special, considering updates seem to be out of scope. They can still fall under the "if an image is unused in the cluster and older than x, then delete" rule.

Other notes


Aside: I haven't given it too much thought yet, but I'm not sure how to implement a 'hold-off-on-delete' system. Putting a property tag is simple enough, but the 'how' of determining which images have the 'hold' might be rough (as in, how does a user let the system know they need this image for a few more days?). I'd like to think we can give users enough time to be aware and not do anything drastic, but I'm not an end-user, and a critical vulnerability could be way too big to put off. A rough sketch of the property check is below.
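As a rough illustration of the property-tag side of this (the "how does the user set it" question stays open), a hold check against Artifactory's item-properties endpoint might look like the following; the 'hold' property name and URL are assumptions, not an agreed convention.

```python
# Hedged sketch: skip deletion if an image carries a 'hold' property. The
# property name and base URL are assumptions, not an agreed convention.
import requests

ARTIFACTORY_URL = "https://artifactory.example.ca/artifactory"  # placeholder
API_KEY = "..."  # placeholder credential

def has_hold(repo: str, image_path: str) -> bool:
    """Return True if the image has a 'hold' property set in Artifactory."""
    resp = requests.get(
        f"{ARTIFACTORY_URL}/api/storage/{repo}/{image_path}",
        params={"properties": "hold"},
        headers={"X-JFrog-Art-Api": API_KEY},
    )
    if resp.status_code == 404:
        # Artifactory is expected to 404 when the requested property is absent
        return False
    resp.raise_for_status()
    return bool(resp.json().get("properties", {}).get("hold"))
```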

So an example flow is:
Process starts --> Get list of old images from Artifactory --> Get list of in-use images --> Compare the old and in-use images; those not in use can be deleted. -->

Get list of vulnerable images from Artifactory --> Compare vulnerable images with images in use. --> Send emails to users w/ vulnerable images --> Set some date on Artifactory for "delete on X date" -->

X date comes (this could be the first step of the process, checking Artifactory for any images to be deleted) --> Check for any images where X date has passed and that do not have a "hold" tag or something like that --> Query Artifactory for suitable update images (say via labels + date created) --> Get a list of in-use images --> Match them using the PATH up to the tag (which is like jupyterlab-cpu/sometag) --> Patch the vulnerable images --> Delete the images from Artifactory.

This could be one cronjob where the first task is just checking for anything overdue for deletion (because of vulnerabilities), and then it executes everything else. Again, the only thing I'm not too sure about yet is the "place image on hold" idea, unless users could just add some property to their own pod or something. A rough skeleton of the decision logic is sketched below.
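A rough, hypothetical skeleton of that decision logic (none of these names exist yet; the Artifactory and cluster queries sketched elsewhere in this thread would feed it):

```python
# Hypothetical skeleton of the cronjob's decision logic; inputs would come
# from the Artifactory / cluster queries sketched elsewhere in this thread.
import datetime

def plan_actions(old_images, in_use_images, vulnerable_images, delete_dates, holds):
    """Decide which images to delete now and which to notify about.

    delete_dates maps image -> scheduled delete date; holds is the set of
    images a user has placed 'on hold'."""
    today = datetime.date.today()
    delete_now = set()

    # First task: anything whose delete date has passed and has no hold.
    for image, due in delete_dates.items():
        if due <= today and image not in holds:
            delete_now.add(image)

    # A) Old and unused images can be deleted outright.
    delete_now |= set(old_images) - set(in_use_images)

    # B) Vulnerable images still in use get an email + a scheduled delete date.
    notify_and_schedule = set(vulnerable_images) & set(in_use_images)

    return delete_now, notify_and_schedule
```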

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented May 14, 2021

Notes from discussion yesterday re: what to do after some image is found vulnerable.

Have two "Actions".

  1. Patch the running notebook with a suitable "more up to date" / safe from critical vulnerabilities image.
  2. Kill the notebook.

The second action will take place only if there is no suitable image to update to. Note that this could leave users without an option to work IF there is no good replacement image that they can turn to. Worst comes to worst, we could deploy a new image for them that has the vulnerabilities removed, or we could go with the Ignore Rules option to, well, 'ignore' the critical severity...

Need to be careful to not step on toes of platform images.
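For the "patch the running notebook" action, a hedged sketch with the Kubernetes Python client could look like the following, assuming the Kubeflow Notebook custom resource (group kubeflow.org, version v1, plural notebooks); the namespace/name/image values are placeholders and this is not agreed tooling.

```python
# Hedged sketch: point an existing Kubeflow Notebook at a replacement image.
# Assumes the Notebook CRD lives at kubeflow.org/v1 with plural 'notebooks'.
from kubernetes import client, config

def patch_notebook_image(namespace: str, notebook_name: str, new_image: str) -> None:
    """Swap only the image field of the notebook's first container."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    api = client.CustomObjectsApi()
    nb = api.get_namespaced_custom_object(
        "kubeflow.org", "v1", namespace, "notebooks", notebook_name
    )
    containers = nb["spec"]["template"]["spec"]["containers"]
    containers[0]["image"] = new_image  # keep resources, mounts, etc. intact
    api.patch_namespaced_custom_object(
        "kubeflow.org", "v1", namespace, "notebooks", notebook_name,
        body={"spec": {"template": {"spec": {"containers": containers}}}},
    )

# e.g. patch_notebook_image("some-namespace", "my-notebook", "jupyterlab-cpu:patched-tag")
```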

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented May 14, 2021

Here's a revised flow.
I imagine this takes place sometime early in the morning (e.g. 3 AM EDT) and perhaps only on weekdays, to give developers time to push out a fix or take other action (whitelist the vulnerability?).

Each major step is defined by a letter

A) 'Safe to be deleted' images

  1. Get a list of old images from Artifactory.
  2. Get a full list of images used in the cluster (see the sketch after this list).
  3. Compare the two and get a list of those that are old and unused.
  4. Delete them.
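A minimal sketch of step A-2 with the Kubernetes Python client (step A-1 could be an AQL query on the created date, and A-3 is then a set difference):

```python
# Hedged sketch of A-2: collect every container image currently running in the
# cluster. Comparing against Artifactory's 'old images' list is a set difference.
from kubernetes import client, config

def images_in_cluster() -> set:
    """Return the set of image references used by all containers in all pods."""
    config.load_incluster_config()  # or config.load_kube_config() when run locally
    v1 = client.CoreV1Api()
    images = set()
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        for container in pod.spec.containers:
            images.add(container.image)
    return images
```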

B) Images where a vulnerability is found

  1. Obtain a list of Critical vulnerabilities using the Xray REST API. We can filter by watch name as well, so this is flexible: say we have a watch that only checks for CVSS score > 8, we can limit it to that (see the sketch after this list).
  2. Using the list from B-1 and A-2, get the intersection. These are images in the cluster that are vulnerable.
  3. Get a list of replacement images for the vulnerable images found in B-2. If there are suitable replacements, go to step 4; else go to step 5.
  4. If there is a suitable image, patch the offenders. Do not do step 5.
  5. If there is not, kill the running notebook (be careful with this one, don't kill any platform images)…
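A hedged sketch of step B-1, assuming Xray's "Get Violations" REST endpoint; the base URL and the watch name are placeholders, not our real configuration.

```python
# Hedged sketch of B-1, assuming Xray's "Get Violations" endpoint
# (POST .../api/v1/violations). URL, credential, and watch name are placeholders.
import requests

XRAY_URL = "https://xray.example.ca/api/v1"  # placeholder
API_KEY = "..."  # placeholder credential

def critical_violations(watch_name: str = "critical-images") -> list:
    """Return Critical security violations reported by the given Xray watch."""
    body = {
        "filters": {
            "watch_name": watch_name,
            "violation_type": "Security",
            "min_severity": "Critical",
        },
        "pagination": {"order_by": "created", "limit": 100, "offset": 1},
    }
    resp = requests.post(
        f"{XRAY_URL}/violations",
        json=body,
        headers={"X-JFrog-Art-Api": API_KEY},
    )
    resp.raise_for_status()
    return resp.json().get("violations", [])
```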

Possible Next Steps ramblings
Step B-4 and B-5 thoughts: perhaps we could later put a date property on the images in Artifactory. Instead of patching / deleting them in the same job call (i.e. called on Monday, everything happens on Monday), we could have the job do the patch / delete on the next business day. This could give developers a chance to push out a new image that doesn't have the critical vulnerability, or give users a chance to save their work (albeit with only one day's notice). A tiny helper for the 'next business day' idea is sketched below.
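Purely as an illustration of the "next business day" timing (holidays ignored, nothing agreed yet):

```python
# Illustrative helper only: the actual delay convention has not been decided.
import datetime
from typing import Optional

def next_business_day(today: Optional[datetime.date] = None) -> datetime.date:
    """Return the next weekday after 'today' (holidays not considered)."""
    day = (today or datetime.date.today()) + datetime.timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += datetime.timedelta(days=1)
    return day
```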

@brendangadd
Contributor

Closing. This issue is linked on epic #461 for reference and further discussion can happen there. @Jose-Matsuda Just reopen this if you want to keep any active discussion here.
