[Epic] Notebook Security Scanning #461
@brendangadd and @zachomedia have previously discussed automatically updating users' running notebook servers with updated images. This would cause some interruption to the users (it basically dumps anything running in memory, and might also destroy their package installs), so we don't want to start doing it right away, but documenting how to do it (and maybe starting with dev users so we know how the pain will feel first) would be good. As first steps on this task, let's build utilities for finding offending images, identifying which are in use, dumping email addresses for the users that are using those images, and making a form email to notify them.
Had a few minutes today and was thinking about this. Here are a few snippets that might help:

Getting lists of in-use images
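One way to get the in-use list is from the pod specs themselves. A rough sketch in Python that walks the JSON shape returned by `kubectl get pods -A -o json` (the registry and image names below are made up for illustration):

```python
import json

def images_in_use(pod_list: dict) -> set:
    """Collect every container image referenced by the given pod list
    (the JSON shape returned by `kubectl get pods -A -o json`)."""
    images = set()
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        # initContainers count too -- they pull images just like containers
        for section in ("containers", "initContainers"):
            for container in spec.get(section, []):
                images.add(container["image"])
    return images

# Tiny example in the kubectl JSON shape (hypothetical registry/images):
pods = {"items": [
    {"spec": {"containers": [{"image": "myregistry.azurecr.io/jupyterlab-cpu:v1"}]}},
    {"spec": {"containers": [{"image": "myregistry.azurecr.io/jupyterlab-cpu:v1"}],
              "initContainers": [{"image": "busybox:1.35"}]}},
]}
print(sorted(images_in_use(pods)))
```

In practice the pod list would come from `kubectl` or the Kubernetes API client rather than a literal dict.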
acr purge
Azure has a purge command for use in an ACR Task. Might be able to use it here? We would probably want it to just do a dry run so we could filter out some things, although it does have filtering capabilities too.
@zachomedia mentioned that getting all in-use images from pods would be preferable to getting them from nodes, because we may start using virtual node pools. Sprint discussion raised that we'll be moving to proxying our repos through Artifactory, so rather than using ACR functions we should instead be getting "images older than X" from either Artifactory or the base Docker API.
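The "images older than X" filter is easy to express once we have the metadata. A hedged sketch: the field names (`path`, `lastDownloaded`) are an assumption about what an Artifactory storage-info response would give us, not a verified API contract:

```python
from datetime import datetime, timedelta, timezone

def images_older_than(artifacts, max_age_days, now=None):
    """Return artifact paths whose last-download timestamp is older than
    `max_age_days`. `artifacts` is a list of dicts with ISO-8601
    'lastDownloaded' fields (field names are assumed, for illustration)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for a in artifacts:
        ts = datetime.fromisoformat(a["lastDownloaded"].replace("Z", "+00:00"))
        if ts < cutoff:
            stale.append(a["path"])
    return stale

# Hypothetical repo layout:
sample = [
    {"path": "docker-local/jupyterlab-cpu/oldtag", "lastDownloaded": "2021-01-01T00:00:00Z"},
    {"path": "docker-local/jupyterlab-cpu/newtag", "lastDownloaded": "2022-03-01T00:00:00Z"},
]
print(images_older_than(sample, 28, now=datetime(2022, 3, 9, tzinfo=timezone.utc)))
```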
So the way I was thinking of doing this is documented in this repo. The TL;DR: I was unsure about step 4 onwards, since that covers scanning images as well as updating pods with newer images.

--- original task description ---

Some discussion about cleaning up existing images in the registry took place below --> #79

Criteria for Deletion, the easy stuff (in order of importance?)
Criteria for Deletion, to be ironed out, (concerning older images still being used)
An idea explored by Blair in the above issue is keeping the 5 most recent tags for an image; that might be good for those concerns (assuming we do not push willy-nilly).
Updating with information found on scanning images. From this: "once an artifact enters, Xray is triggered to run a scan", and in here: fortunately, Xray will identify artifacts (layers) that have new vulnerabilities (once the vulnerability is added to their managed database). The topic I want to bring up here is the download-blocking-of-resources feature, configured via a policy with the security requirements we want. Having said this, I need to see: if we have a policy set up, can I get a list of offending artifacts and then add them to the 'to delete / to migrate to newer image' list? That would be instead of sending a scanning policy request, then creating a watch to send it to an XYZ repo and then getting the info from there. I will probably update this comment as I obtain more info / with any comments.
The most basic of basic run-throughs here, using my personal repo (with passwords removed) here. Note that since I just created the repo in Artifactory, the 1-aql-get-images step is generalized to only get the manifest.jsons; down the road that would change to say stat.downloaded before 4w, indicating an image hadn't been downloaded for 4 weeks. I decided I needed to at least get this section working for sure (my pseudo-of-pseudo code was not cutting it) before re-tackling scanning (which, as long as I can get the path to the folder, should be easy to plug and play) and then rescheduling workloads that are currently using vulnerable images so they update.
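For reference, the stat.downloaded-before-4w variant of that AQL query could be built like this. A sketch only: the relative-date form should be double-checked against the Artifactory AQL docs before use:

```python
def aql_stale_manifests(repo: str, weeks: int = 4) -> str:
    """Build an AQL query that finds manifest.json files in `repo` whose
    last download is older than `weeks` weeks, using AQL's "$before"
    relative-date operator (syntax to be verified against the AQL docs)."""
    return (
        'items.find({'
        f'"repo": "{repo}", '
        '"name": "manifest.json", '
        f'"stat.downloaded": {{"$before": "{weeks}w"}}'
        '})'
    )

# Hypothetical repo name:
print(aql_stale_manifests("docker-local"))
```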
So getting a list of impacted artifacts is not too difficult, and after some nice parsing into something readable/useful it comes out looking like this.
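The parsing step amounts to flattening the report into rows. A hedged sketch: the input shape here is an assumption for illustration, and real Xray responses should be checked against the JFrog Xray REST docs (the CVE id below is a placeholder):

```python
def summarize_impacted(report: dict) -> list:
    """Flatten an Xray-style vulnerability report into
    (image, severity, cve) rows. Field names are assumed."""
    rows = []
    for artifact in report.get("artifacts", []):
        name = artifact["name"]
        for issue in artifact.get("issues", []):
            # Some issues may have no CVE attached; mark those as N/A
            for cve in issue.get("cves", []) or [{"cve": "N/A"}]:
                rows.append((name, issue.get("severity", "Unknown"),
                             cve.get("cve", "N/A")))
    return rows

report = {"artifacts": [
    {"name": "docker-local/jupyterlab-cpu:v1",
     "issues": [{"severity": "High", "cves": [{"cve": "CVE-2021-0000"}]}]},
]}
print(summarize_impacted(report))
```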
What bugs me here is I don't know where the default comes from; it could just be a setting that I never 'set' or explored and it gave that (in the example output in the first link it is ...). What I would have liked, ideally, is something like in the UI, where it gives me the following under ...
IMPORTANT: As for the actual update, two options noted by Brendan are:
Something that was brought up at today's standup: how should we determine what image to update to (whether to tell the user what they need to update to, or for the automatic update)? This problem comes from development images, so a "most recent image with the same name" approach would not work well (it may grab a development image). Some possible solutions raised were:
Personally I like the idea of 1, with the metadata on the images. Something I could see going wrong is if someone does not change the label and in the dev branch the label is still set as 'prod' (though perhaps this label could be set through CI, i.e. not set in the dockerfile / dockerbits).
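The label approach could look something like this. A sketch under stated assumptions: the `lifecycle` label name and the tag list shape are hypothetical, and the real tag metadata would come from the registry API:

```python
from typing import Optional

def latest_prod_tag(tags: list) -> Optional[str]:
    """Pick the newest tag whose (hypothetical) 'lifecycle' label is 'prod',
    so dev builds of the same image name are never chosen as the update
    target. `tags` is a list of {"tag", "created", "labels"} dicts."""
    prod = [t for t in tags if t.get("labels", {}).get("lifecycle") == "prod"]
    if not prod:
        return None
    # ISO dates sort lexicographically, so max() on the string is fine
    return max(prod, key=lambda t: t["created"])["tag"]

tags = [
    {"tag": "v1.2.0", "created": "2022-01-10", "labels": {"lifecycle": "prod"}},
    {"tag": "v1.3.0-dev", "created": "2022-03-01", "labels": {"lifecycle": "dev"}},
    {"tag": "v1.2.1", "created": "2022-02-15", "labels": {"lifecycle": "prod"}},
]
print(latest_prod_tag(tags))  # -> v1.2.1 (skips the newer dev build)
```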
I've tried for 30 minutes to write this comment and can't articulate it very well... :| I'm having trouble picturing a workflow that:
I keep getting caught in traps like
If we want automatic migration for notebooks, maybe we should at least scope it to the simple cases of our primary images. Like Jose suggests, we could add metadata that defines what "path" they're on (e.g. jupyterlab-cpu), and then the image updater could have a map that says how a vulnerable jupyterlab-cpu should be migrated (I think this should use our spawner list, to be consistent and reduce effort). As an alternative, though, I propose we do not automatically migrate anything. Instead, I say we simply notify or terminate depending on how long something has been an offender. The logic is much simpler for this workflow:
For notifications, users can be prompted to just start a new notebook the typical way (as those notebooks are fresh). We don't need to tell them which notebook to use specifically - they will know. This loses the convenience of notebooks being migrated for users, but the value add of us automatically updating user workloads is minimal. A migration kills in-memory state and messes with conda-installed packages - that is only marginally better than simply killing the workload and telling users to restart it themselves. Migrating is also much harder to communicate to users (think of describing to a user what a migration will/will not disrupt).
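The notify-or-terminate workflow really is just a date comparison. A minimal sketch, assuming a 14-day grace period (an illustrative number, not a decided policy):

```python
from datetime import date

def action_for(first_flagged: date, today: date,
               terminate_after_days: int = 14) -> str:
    """Decide what to do with a notebook running a vulnerable image:
    notify the owner until the grace period expires, then terminate.
    The 14-day grace period is an assumption for illustration."""
    days_offending = (today - first_flagged).days
    if days_offending >= terminate_after_days:
        return "terminate"
    return "notify"

print(action_for(date(2022, 3, 1), date(2022, 3, 9)))   # -> notify
print(action_for(date(2022, 2, 1), date(2022, 3, 9)))   # -> terminate
```

The appeal is that the controller needs no knowledge of update paths at all; it only needs the first-flagged date persisted per workload.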
NOTE: It may be good to start moving this into an epic, since its scoping has definitely gotten larger than the ticket description. This conversation is expanding quite a bit and there are some really interesting possibilities. I agree that keeping Notebooks updated is going to be critical, and automating that will ensure the survivability and scalability of AAW in the long run. There seem to be perhaps 3 types of images running now or in the future (I believe I've heard the rumblings), from an end-user perspective:
There are also two types of workloads in the AAW:
**Curated Images**
Most curated images will fall within the "infinite" lifetime category and as such require more finesse when updating. For the curated images, an update-controller (as mentioned conceptually in comments above) working beside the notebook-controller could be a good solution. This controller could have predefined update paths for known notebook images, achieved through explicit pathing or via a version scheme such as semver. This is possible due to our control of the upstream source of the image.

**Notifications**
Some type of automated notification process could be defined to ensure user communication, and, potentially, some type of delaying or rescheduling of updates may become important over time.

**Predefined Update Cycle**
A suggestion would be to have a predefined update cycle, perhaps weekly. This could ensure that users learn a predictable routine when it comes to downtimes, which makes for happier end-users. As a side-effect, a predefined update cycle could also help reduce the fiduciary burden of the cluster: by rescheduling workloads, we may be able to pack them tighter onto nodes and reduce the number required.

**Critical Updates**
These have a somewhat different (accelerated) schedule and require a more robust notification process depending on the frequency of the update cycle. However, if the curated images are well developed, tested, and maintained for regular update processes, there should be minimal disruption to user workflows.

**User-workload Images**
These are images that are created or used by end-users to accomplish specific tasks. From my understanding, these will mostly be used for pipelines. If that assumption is correct, in-place upgrades aren't necessarily needed for them. Due to the lack of definition around such images currently, it's difficult to define an effective remediation strategy. Images sourced via Artifactory can use its download-blocking feature as described by @Ito-Matsuda.
Since these types of images are not infinitely running, this should cover most cases. To a certain extent, the governance around such images and how they are used in the platform can be defined up-front with usage agreements and well-defined notification strategies, to prevent loss of confidence from end-users while striking a balance for the safe use of the platform. In the long run, investigating the use of the following will help align strategies for these types of workloads (as well as all others):
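The semver-based "predefined update path" for curated images could be sketched like this. A rough sketch only: the image names and tags are illustrative, and real tags may not follow strict semver:

```python
from typing import Optional

def next_image(current: str, available: list) -> Optional[str]:
    """Given 'name:MAJOR.MINOR.PATCH' tags, return the newest available tag
    with the same image name and same major version -- one simple way an
    update-controller could use semver as a predefined update path."""
    name, _, tag = current.rpartition(":")
    cur = tuple(int(x) for x in tag.split("."))
    candidates = []
    for img in available:
        n, _, t = img.rpartition(":")
        if n != name:
            continue
        v = tuple(int(x) for x in t.split("."))
        # Stay within the same major version to avoid breaking changes
        if v[0] == cur[0] and v > cur:
            candidates.append((v, img))
    return max(candidates)[1] if candidates else None

avail = ["jupyterlab-cpu:1.2.0", "jupyterlab-cpu:1.4.1", "jupyterlab-cpu:2.0.0"]
print(next_image("jupyterlab-cpu:1.2.0", avail))  # -> jupyterlab-cpu:1.4.1
```

Explicit pathing (a literal map of old image to new image) would be the fallback for images that don't version cleanly.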
**Disclaimer**
So... this was long and I think I started to ramble. I had a few meetings in between, so I'm sorry if it's not all relevant!
I think it's very relevant! You're right that we need to categorize the types of images under consideration (in terms of how they're used/built), and tailor the security/notification systems around them to the way they're utilized. A good starting ticket in the epic might be
I'm just going to chime in and say that I agree with what @justbert wrote, and I think this echoes much of what I said during yesterday's standup.
2022-03-09 reviving this task
Three main points (that may be elaborated on / broken off into smaller tasks)
Runtime Scanning
Notebook Security Scanning: Runtime Scanning 2: Electric Boogaloo #923
Container Patching
User Notification
Put it all into definition files / dockerfiles
ACTUAL CLEAN UP OF UNUSED IMAGES
This issue has grown out of its initial scope and moved past a simple "get old images, get vulnerable images, delete", so it has been updated to reflect that. My initial task description has been moved to one of my early comments.
A starting point courtesy of Blair / Justin
Define the types of images we are looking at and determine an appropriate course of action. Issue tracking: Discuss 'types' of images present in the cluster and how to clean them up. #499
Get a list of old and unused images. The format of this is just a text file, with each line being a path to an image folder. These are images that can be safely deleted from Artifactory. Tracking: Artifactory: Obtain list of old, unused images. #517
Create a collection of vulnerable images, compare it against images in the cluster, and make a list of the intersection with a date attached to each entry. This information should be persisted somewhere such that the next item can consume it. Issue tracking: Artifactory: Prototype getting a list of vulnerable images in the cluster #503
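That intersection-with-dates step could be sketched as below. A hedged sketch: the persistence format (a dict of image to first-flagged date) is an assumption; carrying the previous run's result forward is what preserves the first-flagged date between runs:

```python
from datetime import date

def flag_vulnerable_in_cluster(vulnerable, in_cluster, seen_on, previously=None):
    """Intersect the vulnerable-image list with images running in the
    cluster, attaching the date each was first flagged so later steps
    (notify / terminate) can act on it. `previously` is the persisted
    result of the last run."""
    previously = previously or {}
    flagged = {}
    for image in set(vulnerable) & set(in_cluster):
        # Keep the original first-flagged date if we've seen this image before
        flagged[image] = previously.get(image, seen_on)
    return flagged

# Two consecutive runs (hypothetical image tags):
run1 = flag_vulnerable_in_cluster(["a:1", "b:1"], ["b:1", "c:1"], date(2022, 3, 1))
run2 = flag_vulnerable_in_cluster(["a:1", "b:1"], ["b:1"], date(2022, 3, 9),
                                  previously=run1)
print(run2)  # b:1 keeps its original first-flagged date of 2022-03-01
```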
Send notifications about vulnerable images. (This will require a way to get an email address associated with a container's namespace.) Tracking: Artifactory: send emails / notifications to owners of vulnerable images #512 --> Not relevant for the proof of concept.
--
'ACT' steps tracking in #526
--
Add more checkboxes as they come up
Final step
See Also