
Discuss 'types' of images present in the cluster and how to clean them up. #499

Closed
Jose-Matsuda opened this issue May 4, 2021 · 5 comments

Comments

@Jose-Matsuda
Contributor

Jose-Matsuda commented May 4, 2021

Tackles first bullet in #461
Groundwork by Justin in #461 (comment)


2 Types of Workloads
A) Pods with limited lifetime: pipelines and rote work
B) Pods with "infinite" lifetimes: mostly notebooks, currently.


Three types of images

Curated Images

Description: Built by AAW and fully under the control of AAW. Most fall under workload B.

Platform Images

Description: Images for systems that are part of the platform.
Justin notes that 'these usually have well-defined upgrade paths but are out of scope of this conversation since they already have a maintenance process'

User-workload Images

Description: iffy! Images created / used by end-users to accomplish specific tasks. Would mostly be used for pipelines?


TODO in following comments:

  • Address the iffy description for user-workload images.
  • Discuss appropriate actions for the Curated Images and User-workload images in terms of notification and deletes, as well as the possibility of updates or blocking downloads.
@Jose-Matsuda
Contributor Author

Jose-Matsuda commented May 4, 2021

Curated Images

Possible Appropriate Actions


A(ndrew): The 'do not auto-migrate' path.

  1. Vulnerability detected; admins are notified.
  2. Set an EOL date. (To set Artifactory properties, the image may need to be stored in a local repository.) A property-setting sketch follows Andrew's note below.
  3. If the image is not in use, just delete it.
  4. If the image is in use as a notebook image (as all our Curated images currently are), notify the user of the offending notebook of the EOL date and tell them they must take action.
  5. (Delete): when the EOL is reached, kill all offending workloads and delete the image.

Per Andrew

This loses the convenience of notebooks being migrated for users, but the value add of us automatically updating user workloads is minimal. A migration kills in-memory state and messes with conda installed packages - that is marginally better than simply killing the workload and telling users to restart it themselves. Migrating is also much harder to communicate to users (think of describing to a user what a migration will/will not disrupt: If you're running something it will be stopped, but you might be able to rerun your notebook to get everything back, but if you conda install'd packages then probably not, but ...). That feels like added complexity/poor UX with marginal added value.
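Going back to step 2 of Andrew's path, here is a minimal sketch of what stamping an EOL date onto an image could look like, assuming Artifactory's "Set Item Properties" REST call; the URL, repo, path, and property name are placeholders rather than agreed conventions.

```python
# Hedged sketch: attach an 'eol' property to an image in Artifactory so later
# jobs can act on it. All names below (URL, repo, path, property) are
# placeholders, not the team's real configuration.
import datetime
import requests

ARTIFACTORY_URL = "https://artifactory.example.ca/artifactory"  # placeholder
API_KEY = "..."  # placeholder credential

def set_eol_property(repo: str, image_path: str, days_until_eol: int = 14) -> None:
    """Stamp an EOL date on an image folder via the Set Item Properties API."""
    eol = (datetime.date.today() + datetime.timedelta(days=days_until_eol)).isoformat()
    resp = requests.put(
        f"{ARTIFACTORY_URL}/api/storage/{repo}/{image_path}",
        params={"properties": f"eol={eol}", "recursive": "1"},
        headers={"X-JFrog-Art-Api": API_KEY},
    )
    resp.raise_for_status()

# e.g. set_eol_property("docker-local", "jupyterlab-cpu/oldtag")
```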


B(Justin): The predefined update cycle path

  1. Notification process to ensure users know about the predefined update cycle. Note: Having a way to reschedule / delay the update would be key.
  2. Images have a predefined update cycle (weekly?). Users then get used to the routine. A side effect is that it could "help reduce the financial burden of the cluster: by rescheduling workloads, we may be able to ensure that workloads are packed tighter onto nodes and reduce the number required."
  3. Critical updates: this depends on the frequency of the update cycle. If the curated images handle the regular update process well, there is minimal extra disruption beyond the update cycle itself.

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented May 11, 2021

I've been doing some thinking on and off about this one and doing some exploratory research.
As a TLDR of this issue, it boils down to what we as a team want to do when an image (platform, kubeflow-container variant, or user image) is found to be vulnerable.

Taking heavy inspiration from both Justin + Andrew, here are some thoughts I have.

  1. While I really like Andrew's simpler approach, I understand that what Justin has said would be super nice to have.
  2. I don't know if I like the 'predefined update cycle'. Say we have it running weekly: I don't think (right now anyway, maybe because of our current priorities) we push enough changes to our images to make full use of this feature (a restart would often just reuse the same image). That, and the possible messing with installed packages, has me wary.
  • Instead of this, our process could run weekly. Finding offending artifacts is easy since Xray scans artifacts as they are brought in, and when a new vulnerability is added to its database it automatically re-runs to update the scans / information. If any vulnerabilities are found, we go ahead with emails and scheduling updates + restarts.

  • Another option: when the Xray policy finds a vulnerability it could also trigger a webhook to kick off the process (I haven't used webhooks beyond sending messages to a Slack channel; I suppose instead of a Slack channel it would send to this service).

  3. Regarding finding a suitable image to update to, I think there could be a few options to go with here:
  • Semantic versioning, as suggested by Justin. I was originally thinking of using Docker labels (I'm open to other suggestions), but it seems a bit annoying to get a container's LABEL metadata out of a pod. Two issues about this: A and B. If I could get this easily then we can specify which label on the image we want, with a query something like items.find({"name":{"$eq":"manifest.json"},"@docker.label.version":{"$gt":"1.5.1"},"path":{"$match":"my-docker-image/*"}}) ($gt works fine in testing). See the AQL sketch after this list.
  • Another option is going with LABEL enviro="prod", or some label that indicates whether the image should be used by an auto-update feature. Instead of comparing on version, we would look for "@docker.label.enviro":{"$eq":"prod"} and add a condition that the replacement image's created date must be after the creation date of the current vulnerable image.
    Now, in the case where there is no suitable image to update to (because of upstream dependencies, etc.), maybe we should not send an email alerting users, and of course should not schedule any 'deletes'.
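To make the label idea above concrete, here is a hedged sketch of running that AQL query through Artifactory's AQL search endpoint; the base URL and credential are placeholders, and the label name simply mirrors the example in this comment.

```python
# Hedged sketch: query Artifactory for images whose docker.label.version
# property is newer than the vulnerable one. URL and credential are
# placeholders; the label name mirrors the example above.
import requests

ARTIFACTORY_URL = "https://artifactory.example.ca/artifactory"  # placeholder
API_KEY = "..."  # placeholder credential

def find_newer_images(image_path_glob: str, min_version: str) -> list:
    """Return manifest.json entries for tags with a version label > min_version."""
    aql = (
        'items.find({"name":{"$eq":"manifest.json"},'
        f'"@docker.label.version":{{"$gt":"{min_version}"}},'
        f'"path":{{"$match":"{image_path_glob}"}}'
        '}).include("repo","path","name","created")'
    )
    resp = requests.post(
        f"{ARTIFACTORY_URL}/api/search/aql",
        data=aql,
        headers={"Content-Type": "text/plain", "X-JFrog-Art-Api": API_KEY},
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

# e.g. find_newer_images("my-docker-image/*", "1.5.1")
```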

User Created Images

As for user-created images, we could treat them similarly to our provisioned jupyterlab-xyz images: enforce the labels and have them go through the same "send email, delete after x period of time" process. Something else to consider is the previous point about there being "no suitable image to update to". We could have their enviro tag be separate from our tag for jupyterlab-xyz so we can filter on that instead, still sending users an email saying "please resolve your image's critical vulnerabilities and update, or else the image will be deleted on X date".

Platform Images

On Platform images, I don't think this task needs to do anything special, considering updates seem to be out of scope. They can still fall under the "if an image is unused in the cluster and older than x, then delete" rule.

Other notes


Aside: I haven't given it too much thought yet, but I'm not sure how to implement a 'hold-off-on-delete' system. Putting a property tag is simple enough, but the 'how' of determining which images have the 'hold' might be rough (as in, how does a user let the system know they need this image for a few more days?). I'd like to think we can give users enough time to be aware and not do anything drastic, but I'm not an end-user, and a critical vulnerability could be way too big to put off. A rough sketch of the property check is below.
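As a rough illustration of the property-tag side of this (the "how does the user set it" question stays open), a hold check against Artifactory's item-properties endpoint might look like the following; the 'hold' property name and URL are assumptions, not an agreed convention.

```python
# Hedged sketch: skip deletion if an image carries a 'hold' property. The
# property name and base URL are assumptions, not an agreed convention.
import requests

ARTIFACTORY_URL = "https://artifactory.example.ca/artifactory"  # placeholder
API_KEY = "..."  # placeholder credential

def has_hold(repo: str, image_path: str) -> bool:
    """Return True if the image has a 'hold' property set in Artifactory."""
    resp = requests.get(
        f"{ARTIFACTORY_URL}/api/storage/{repo}/{image_path}",
        params={"properties": "hold"},
        headers={"X-JFrog-Art-Api": API_KEY},
    )
    if resp.status_code == 404:
        # Artifactory is expected to 404 when the requested property is absent
        return False
    resp.raise_for_status()
    return bool(resp.json().get("properties", {}).get("hold"))
```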

So an example flow is:
Process starts --> Get list of old images from Artifactory --> Get list of in-use images --> Compare the old and in-use images; those not in use can be deleted. -->

Get list of vulnerable images from Artifactory --> Compare vulnerable images with images in use. --> Send emails to users w/ vulnerable images --> Set some date on Artifactory for "delete on X date" -->

X date comes (this could be the first step of the process, checking Artifactory for any images to be deleted) --> Check for any images where X date has passed and that do not have a "hold" tag or something like that --> Query Artifactory for suitable update images (say via labels + date created) --> Get a list of in-use images --> Match them using the PATH up to the tag (which is like jupyterlab-cpu/sometag) --> Patch the vulnerable images --> Delete the images from Artifactory.

This could be one cronjob where the first task is just checking for anything overdue for deletion (because of vulnerabilities), and then it executes everything else. Again, the only thing I'm not too sure about yet is the "place image on hold" idea, unless users could just add some property to their own pod or something. A rough skeleton of the decision logic is sketched below.
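A rough, hypothetical skeleton of that decision logic (none of these names exist yet; the Artifactory and cluster queries sketched elsewhere in this thread would feed it):

```python
# Hypothetical skeleton of the cronjob's decision logic; inputs would come
# from the Artifactory / cluster queries sketched elsewhere in this thread.
import datetime

def plan_actions(old_images, in_use_images, vulnerable_images, delete_dates, holds):
    """Decide which images to delete now and which to notify about.

    delete_dates maps image -> scheduled delete date; holds is the set of
    images a user has placed 'on hold'."""
    today = datetime.date.today()
    delete_now = set()

    # First task: anything whose delete date has passed and has no hold.
    for image, due in delete_dates.items():
        if due <= today and image not in holds:
            delete_now.add(image)

    # A) Old and unused images can be deleted outright.
    delete_now |= set(old_images) - set(in_use_images)

    # B) Vulnerable images still in use get an email + a scheduled delete date.
    notify_and_schedule = set(vulnerable_images) & set(in_use_images)

    return delete_now, notify_and_schedule
```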

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented May 14, 2021

Notes from discussion yesterday re: what to do after some image is found vulnerable.

Have two "Actions".

  1. Patch the running notebook with a suitable "more up to date" / safe from critical vulnerabilities image.
  2. Kill the notebook.

The second action will take place only if there is no suitable image to update to. Note that this could leave users without an option to work IF there is no good replacement image that they can turn to. Worst comes to worst, we could deploy a new image for them that has the vulnerabilities removed, or we could go with the Ignore Rules option to, well, 'ignore' the critical severity...

Need to be careful to not step on toes of platform images.
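For the "patch the running notebook" action, a hedged sketch with the Kubernetes Python client could look like the following, assuming the Kubeflow Notebook custom resource (group kubeflow.org, version v1, plural notebooks); the namespace/name/image values are placeholders and this is not agreed tooling.

```python
# Hedged sketch: point an existing Kubeflow Notebook at a replacement image.
# Assumes the Notebook CRD lives at kubeflow.org/v1 with plural 'notebooks'.
from kubernetes import client, config

def patch_notebook_image(namespace: str, notebook_name: str, new_image: str) -> None:
    """Swap only the image field of the notebook's first container."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    api = client.CustomObjectsApi()
    nb = api.get_namespaced_custom_object(
        "kubeflow.org", "v1", namespace, "notebooks", notebook_name
    )
    containers = nb["spec"]["template"]["spec"]["containers"]
    containers[0]["image"] = new_image  # keep resources, mounts, etc. intact
    api.patch_namespaced_custom_object(
        "kubeflow.org", "v1", namespace, "notebooks", notebook_name,
        body={"spec": {"template": {"spec": {"containers": containers}}}},
    )

# e.g. patch_notebook_image("some-namespace", "my-notebook", "jupyterlab-cpu:patched-tag")
```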

@Jose-Matsuda
Contributor Author

Jose-Matsuda commented May 14, 2021

Here's a revised flow.
I imagine this takes place sometime early in the morning (e.g. 3 AM EDT) and perhaps only on weekdays, to give developers time to push out a fix or take other action (whitelist the vulnerability?).

Each major step is defined by a letter

A) 'Safe to be deleted' images

  1. Get a list of old images from Artifactory.
  2. Get a full list of images used in the cluster (see the sketch after this list).
  3. Compare the two and get a list of those that are old and unused.
  4. Delete them.
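A minimal sketch of step A-2 with the Kubernetes Python client (step A-1 could be an AQL query on the created date, and A-3 is then a set difference):

```python
# Hedged sketch of A-2: collect every container image currently running in the
# cluster. Comparing against Artifactory's 'old images' list is a set difference.
from kubernetes import client, config

def images_in_cluster() -> set:
    """Return the set of image references used by all containers in all pods."""
    config.load_incluster_config()  # or config.load_kube_config() when run locally
    v1 = client.CoreV1Api()
    images = set()
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        for container in pod.spec.containers:
            images.add(container.image)
    return images
```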

B) Images where a vulnerability is found

  1. Obtain a list of Critical vulnerabilities using the Xray REST API. We can filter by watch name as well, so this is flexible: say we have a watch that only checks for CVSS score > 8, we can limit it to that (see the sketch after this list).
  2. Using the list from B-1 and A-2, get the intersection. These are images in the cluster that are vulnerable.
  3. Get a list of replacement images for the vulnerable images found in B-2. If there are suitable replacements, go to step 4; else go to step 5.
  4. If there is a suitable image, patch the offenders. Do not do step 5.
  5. If there is not, kill the running notebook (be careful with this one, don't kill any platform images)…
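A hedged sketch of step B-1, assuming Xray's "Get Violations" REST endpoint; the base URL and the watch name are placeholders, not our real configuration.

```python
# Hedged sketch of B-1, assuming Xray's "Get Violations" endpoint
# (POST .../api/v1/violations). URL, credential, and watch name are placeholders.
import requests

XRAY_URL = "https://xray.example.ca/api/v1"  # placeholder
API_KEY = "..."  # placeholder credential

def critical_violations(watch_name: str = "critical-images") -> list:
    """Return Critical security violations reported by the given Xray watch."""
    body = {
        "filters": {
            "watch_name": watch_name,
            "violation_type": "Security",
            "min_severity": "Critical",
        },
        "pagination": {"order_by": "created", "limit": 100, "offset": 1},
    }
    resp = requests.post(
        f"{XRAY_URL}/violations",
        json=body,
        headers={"X-JFrog-Art-Api": API_KEY},
    )
    resp.raise_for_status()
    return resp.json().get("violations", [])
```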

Possible Next Steps ramblings
Step B-4 and B-5 thoughts: perhaps we could later put a date property on the images in Artifactory. Instead of patching / deleting them in the same job call (i.e. called on Monday, everything happens on Monday), we could have the job do the patch / delete on the next business day. This could give developers a chance to push out a new image that doesn't have the critical vulnerability, or give users a chance to save their work (albeit with only one day's notice). A tiny helper for the 'next business day' idea is sketched below.
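Purely as an illustration of the "next business day" timing (holidays ignored, nothing agreed yet):

```python
# Illustrative helper only: the actual delay convention has not been decided.
import datetime
from typing import Optional

def next_business_day(today: Optional[datetime.date] = None) -> datetime.date:
    """Return the next weekday after 'today' (holidays not considered)."""
    day = (today or datetime.date.today()) + datetime.timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += datetime.timedelta(days=1)
    return day
```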

@brendangadd
Contributor

Closing. This issue is linked on epic #461 for reference and further discussion can happen there. @Jose-Matsuda Just reopen this if you want to keep any active discussion here.
