[Epic] Notebook Security Scanning #461

Closed · 12 of 19 tasks
Jose-Matsuda opened this issue Feb 24, 2021 · 13 comments

@Jose-Matsuda (Contributor) commented Feb 24, 2021

2022-03-09 reviving this task

Three main points (that may be elaborated on / broken off into smaller tasks)

Runtime Scanning

  • Determine if we want to use Artifactory to proxy the ACR. (This might be OK as a remote repository, but for Xray to do its thing it might need to pull the image; scanning doesn't just work on a remote repository the way it does for regular local repositories.)
  • Either use that, or install and run trivy in a pod.
    Notebook Security Scanning: Runtime Scanning 2: Electric Boogaloo #923

Container Patching

User Notification

  • Get user emails that are attached to their profiles (sketch below)
  • Notify users of workload restarts
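
A minimal sketch of pulling those emails, assuming (as in our Kubeflow setup) that each user's Profile resource carries their email in `spec.owner.name`:

```bash
# Sketch: list each profile and its owner's email.
# Assumes Kubeflow Profile CRDs where spec.owner.name is the user's email.
kubectl get profiles -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.owner.name}{"\n"}{end}'
```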

Put it all into definition files / dockerfiles

ACTUAL CLEAN UP OF UNUSED IMAGES

  • Clean up the ACR (will need to use az cli for this ofc; rough sketch below)
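
A rough sketch of that delete step; the registry name and `images-to-delete.txt` are placeholders:

```bash
# Sketch: delete images listed one "repo:tag" per line.
# Assumes a prior "az acr login --name ourregistry"; names are placeholders.
while read -r image; do
  az acr repository delete --name ourregistry --image "$image" --yes
done < images-to-delete.txt
```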

This issue has grown beyond its initial scope and moved past a simple "get old images, get vulnerable images, delete", so it has been updated to reflect that. My initial task description has been moved to one of my early comments.

A starting point courtesy of Blair / Justin

--
'ACT' steps tracking in #526

  • Find a list of suitable replacement images.
  • IF a replacement image is found, update the offending notebooks.
  • IF a replacement image is not found, kill the notebook

--

  • Compile a list of images that need to be deleted. This step would check if their EOL date has passed and, if we decide to do so, if the image is "on-hold" temporarily. The result of this step would be like #517: just a list of paths, to keep the delete as pure as possible.
  • Using a file containing paths, delete the images. (this assumes the update, checking if EOL / on hold, and other steps have already happened). Issue tracking in Artifactory: Simple delete of images given list of images #509
  • TBD

Add more checkboxes as they come up
Final step

  • Put it all together in the environment


@ca-scribner (Contributor)

@brendangadd and @zachomedia have previously discussed automatically updating users' running notebook servers with updated images. This would cause some interruption to the users (it basically dumps anything running in memory, and might also destroy their package installs) and we don't want to start doing it right away, but documenting how to do it (and maybe starting with dev users so we know how the pain will feel first) would be good.

As first steps on this task, let's build utilities for finding offending images, identifying which are in use, dumping email addresses for the users that are using those images, and making a form email to notify them.

@ca-scribner (Contributor)

Had a few minutes today and was thinking about this. Here are a few snippets that might help:

Getting lists of in-use images

  • get all images used across all nodes: `kubectl get nodes -o jsonpath="{.items[*].status.images[*].names[*]}"`
  • the same(?) but from pods: `kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}"` <-- not sure if that returns differently than above? (deduplicated one-liner below)
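
For a deduplicated list from the pod-based query (assuming it covers everything we care about):

```bash
# Sketch: unique images referenced by running pods across all namespaces.
kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}" \
  | tr ' ' '\n' | sort -u
```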

acr purge

Azure has a purge command for use in an ACR Task. Might be able to use it here? Probably would want it to just do a dry-run so we could filter out some things, although it does have filtering capabilities too.

  • Also available via acr-cli, but not sure what to do for login credentials (different from az acr login)
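
A dry-run sketch along the lines of Azure's documented purge task (registry name and filter values are illustrative):

```bash
# Sketch: dry-run purge of tags older than 30 days plus untagged manifests.
# Registry and filter are placeholders; drop --dry-run to actually delete.
PURGE_CMD="acr purge --filter 'jupyterlab-cpu:.*' --ago 30d --untagged --dry-run"
az acr run --registry ourregistry --cmd "$PURGE_CMD" /dev/null
```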

@ca-scribner (Contributor)

@zachomedia mentioned getting all images in use from pods would be preferable to nodes because we may start using virtual node pools

Sprint discussion raised that we'll be moving to proxying our repos through artifactory, so rather than using ACR functions we should instead be getting "images older than X" from either artifactory or the base docker API

@Jose-Matsuda (Contributor, Author) commented Apr 6, 2021

So the way I was thinking of doing this is documented in this repo

The TL;DR:

1. Get all the old manifests via the Artifactory REST API into a JSON file. Done through manifests because the layers for a docker image X are stored in a folder at a path hopefully like org/jfrog/test/multi2/3.0.0-SNAPSHOT, but with the image name instead.
2. Using kubectl with enhanced permissions, get a list of images still in use.
3. Subtract step 2 from step 1 (here I'm hoping that step 2's kubectl image list matches up nicely with some part of the path from step 1) to get a resulting JSON file with the list of paths to delete.

Step 4 onwards I was unsure about, since that covers scanning images as well as updating pods with newer images.
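
A sketch of step 1 as an AQL query (host, repo name, and the 4-week cutoff are illustrative):

```bash
# Sketch: find manifest.json files not downloaded in the last 4 weeks.
# Host, repo name, and credentials are placeholders.
curl -u "$ARTI_USER:$ARTI_PASS" \
  -H "Content-Type: text/plain" \
  -X POST "https://artifactory.example.com/artifactory/api/search/aql" \
  -d 'items.find({"repo":"docker-local","name":"manifest.json","stat.downloaded":{"$before":"4w"}})'
```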

--- original task description ---
Why: We have images that are either unused, vulnerable, outdated, or a combination of these
and should be deleted.
What: Do the above, but automatically without need of developer interference if possible.

Some discussion about cleaning up existing images in the registry took place in #79.
There, images older than a month were deleted / up to the 5th-latest were kept.


Criteria for Deletion, the easy stuff (in order of importance?)

  1. Vulnerable containers
  2. Unused Images that are older than X months

Criteria for Deletion, to be ironed out, (concerning older images still being used)

  1. How will we tackle old images that are still in use? May be answered by 2 below.
  2. The idea of "here's a demo image for you to try out, this older one will deprecate in X months"
    has been floating around. I don't know how often these bigger(?) releases would be,
    but we should avoid deleting them too early.

An idea explored by Blair in the above issue is keeping the 5 most recent tags for an image and that might be good for those concerns (assuming we do not push willy nilly)

@Jose-Matsuda (Contributor, Author)

Updating with information found on scanning images. From this: "once an artifact enters, Xray is triggered to run a scan", and in here, fortunately, Xray will identify artifacts (layers) that have new vulnerabilities (once the vulnerability is added to their managed database).

The topic I want to bring up here is the download blocking of resources feature. By configuring a policy with the security requirements we want, as well as a watch that looks at requested repositories, we can stop the artifact from being downloaded. I suppose this could be configured for any 'critical' vulnerabilities? Andrew was telling me a while back about workloads getting rescheduled; if the image cannot be re-pulled in any form, then the reschedule would fail.

Having said this, I need to see whether, given a policy that is already set up, I can get a list of offending artifacts and from there add them to the 'to delete / to migrate to newer image' list, all instead of sending a scanning policy request, then creating a watch to send it to XYZ repo, and then getting the info from there.
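
If a policy and watch are already in place, something like Xray's violations endpoint might give that list directly (hedged: watch name, host, and filters are placeholders, and the exact API shape depends on our Xray version):

```bash
# Sketch: ask Xray for critical violations raised by an existing watch.
# All names are placeholders; verify the endpoint against our Xray version.
curl -u "$ARTI_USER:$ARTI_PASS" \
  -H "Content-Type: application/json" \
  -X POST "https://artifactory.example.com/xray/api/v1/violations" \
  -d '{"filters": {"watch_name": "security-watch", "min_severity": "Critical"},
       "pagination": {"limit": 100}}'
```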

will probably update comment as I obtain more info / with any comments.

@Jose-Matsuda (Contributor, Author)

The most basic of basic run-throughs here, using my personal repo (with passwords removed) here. (Note that since I just created the repo in Artifactory, the 1-aql-get-images is generalized to only get the manifest.jsons; down the road that would change to say stat.downloaded before 4w, indicating the image hadn't been downloaded for 4 weeks.)

Decided I needed to at least get this section working for sure (my pseudo-pseudocode was not cutting it) before re-tackling scanning (which, as long as I can get the path to the folder, should be easy to plug and play), and then scheduling workloads that are currently using vulnerable images to update.

@Jose-Matsuda (Contributor, Author)

So getting a list of impacted artifacts is not too difficult, and after some nice parsing into something readable / useful, it comes out looking like this:

```
default/docker-quickstart-local/hello-world/latest/
default/docker-quickstart-local/hello-world/vulnerablehope/
default/docker-quickstart-local/my-docker-image/1reusedlayer/
default/docker-quickstart-local/my-docker-image/5eefshacode/
```

What bugs me here is I don't know where the default comes from; it could just be a setting that I never 'set' or explored, and it gave that (in the example output in the first link it is arti1 instead of default).

What I would have liked ideally is something like in the UI, where it gives me the following under Impacted Artifact: default is not included and it is just the "repo + path", which would make the delete easy, as that's exactly the format it would want.

[screenshot: the "Impacted Artifact" paths as shown in the Xray UI]

[screenshot: the image information as it appears in Artifactory]
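
In the meantime, stripping the leading component would get to that "repo + path" form (a sketch; the file name is a placeholder):

```bash
# Sketch: drop the leading "default/" so each line becomes "repo/path".
cut -d/ -f2- impacted-artifacts.txt
```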

@Jose-Matsuda (Contributor, Author)

IMPORTANT
As Andrew brought up, we need to keep "updating currently in-use images" at the forefront of our minds so as not to interrupt users' running workflows.
We need some sort of notification system so that users will know ahead of time that their image will need to be updated due to a 'critical' vulnerability and that, as such, their processes will be interrupted.

As for the actual update, two options noted by Brendan are:

  1. Scheduled CI Pipeline
  2. Kubernetes CronJob

@Jose-Matsuda Jose-Matsuda added size/XL 6+ days and removed size/L 4-5 days labels Apr 28, 2021
@Jose-Matsuda (Contributor, Author)

Something that was brought up at today's standup: how should we determine what image to update to (whether to let the user know what they need to update to, or for the automatic update)? This problem comes from development images, so a "most recent image with the same name" approach would not work well (it may grab a development image).

Some possible solutions raised were:

  1. Have image labels. When a person is developing on a branch and has 'auto deploy' on, their label should be set to something other than master. Then, when searching in Artifactory for suitable images to upgrade to, we can search only for images that have "deploymenttype=prod" or something like that.
  2. Change the way we deploy development images. Right now I believe it deploys to the same path. This could be a change in the path so that when we search for images we can exclude that path.

Personally I like the idea of 1, with the metadata on the images. Something I could see going wrong is if someone does not change the label and in the dev branch the label is still set as 'prod' (though perhaps this setting of the label can be done through CI, i.e. not set in the dockerfile / dockerbits).
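
If we go with option 1, Artifactory's property search API could serve as the lookup (a sketch; the host, repo, and property name are assumptions):

```bash
# Sketch: find artifacts carrying a CI-set "deploymenttype=prod" property.
# Host and repo name are placeholders.
curl -u "$ARTI_USER:$ARTI_PASS" \
  "https://artifactory.example.com/artifactory/api/search/prop?deploymenttype=prod&repos=docker-local"
```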

@ca-scribner (Contributor)

I've tried for 30 minutes to write this comment and can't articulate it very well... :| I'm having trouble picturing a workflow that:

  • Automatically migrates any user's vulnerable notebook image to a good image
  • Handles non-notebook images (notify admins of any offenders, etc)
  • Avoids a lot of manual admin time to manage

I keep getting caught in traps like

  • if the offending image is a notebook image, how do we define what image we migrate them to? What happens if that target image gets a vulnerability, or if the recommended target changes between notification and migration?
  • what do we do for an image that started out custom (eg: one of Christian's images)?
  • how do we handle odd cases (temporarily suspending an image from being deleted / its running notebooks from being killed)?

If we want automatic migration for notebooks, maybe we should at least scope it to the simple cases of our primary images. As Jose suggests, we could add metadata that defines what "path" they're on (e.g. jupyterlab-cpu); then the image updater could have a map that says how a vulnerable jupyterlab-cpu should be migrated (I think this should use our spawner list, to be consistent and reduce effort).

As an alternative though, I propose we do not automatically migrate anything. Instead, I say we simply notify or terminate depending on how long something has been an offender. The logic is much simpler for this workflow (rough sketch after the list):

  1. Vulnerability detected (+admins notified?)
  2. Set image end-of-life (EOL) date (maybe X days from vulnerability detection) <--we could put this into artifactory metadata or store it elsewhere
  3. (optional) if image is not in use, skip to delete step even if end-of-life date is not reached
  4. If image is in use as a notebook image, notify that user of the offending notebook/image and EOL date and tell them to take action (or that their workload will be killed at EOL) <--this can be done with minimal state. Can just be an automated daily task of "for each job running with image on vulnerable list, email to say they will get killed on X date"
  5. (delete): When EOL is reached, kill all offending workloads and delete the image
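
A rough sketch of the daily task behind steps 4 and 5, assuming a scan step produces `eol-list.txt` with one "image<TAB>eol-date" line per offender (all names hypothetical):

```bash
# Sketch: daily sweep over the vulnerable-image list.
# eol-list.txt format (hypothetical): "<image-path>\t<YYYY-MM-DD EOL date>"
today=$(date +%Y-%m-%d)
while IFS=$'\t' read -r image eol; do
  if [[ "$today" < "$eol" ]]; then
    echo "notify: users running $image will have workloads killed on $eol"
  else
    echo "EOL reached: kill workloads running $image, then delete the image"
  fi
done < eol-list.txt
```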

For notifications, users can be prompted to just start a new notebook using the typical way (as those notebooks are fresh). We don't need to tell them which notebook to use specifically - they will know.

This loses the convenience of notebooks being migrated for users, but the value add of us automatically updating user workloads is minimal. A migration kills in-memory state and messes with conda installed packages - that is marginally better than simply killing the workload and telling users to restart it themselves. Migrating is also much harder to communicate to users (think of describing to a user what a migration will/will not disrupt: If you're running something it will be stopped, but you might be able to rerun your notebook to get everything back, but if you conda install'd packages then probably not, but ...). That feels like added complexity/poor UX with marginal added value.

@justbert commented May 4, 2021

NOTE: It may be good to start moving this into an epic, since its scoping has definitely gotten larger than the ticket description.

This conversation is expanding quite a bit and there are some really interesting possibilities. I agree that keeping Notebooks updated is going to be critical and automating that will ensure the survivability and scalability of AAW in the long run.

There seem to be perhaps 3 types of images running now or in the future (I believe I've heard the rumblings) from an end-user perspective:

  • Curated images: Notebooks built by the AAW that are fully under the control of AAW
  • Platform images: Images for systems which are part of the platform. These usually have well-defined upgrade paths but are out of scope of this conversation since they already have a maintenance process.
  • User-workload images: These may not be fully featured yet; however, they have been talked about a lot. These have iffy upgrade paths or may have none.

There are also two types of workloads in the AAW:

  • Pods with limited lifetimes: used for pipelines and rote work.
  • Pods with "infinite" lifetimes: mostly notebooks, currently

Curated Images

Most curated images will fall within "infinite" lifetime and as such require more finesse when updating.

For the curated images, an update-controller (as mentioned conceptually in comments above) working beside the notebook-controller could be a good solution. This controller could have predefined update paths for known notebook images, which could be achieved through explicit pathing or via a version scheme such as semver. This is achievable due to our control of the upstream source of the image.

Notifications

Some type of notification process (automated) could be defined to ensure user communication and, potentially, some type of delaying or rescheduling of updates may be important over time.

Predefined Update Cycle

A suggestion would be to have a predefined update cycle, perhaps weekly. This could ensure that users learn a predictable routine when it comes to downtimes which allows for happier end-users.

As a side-effect, a Predefined Update Cycle could also help reduce the financial burden of the Cluster. By rescheduling workloads, we may be able to ensure that workloads are packed tighter onto nodes and reduce the number required.

Critical Updates

These have a somewhat different (accelerated) schedule and require a more robust notification process, depending on the frequency of the Update Cycle; however, if the curated images are well developed, tested, and maintained for regular update processes, there should be minimal disruptions to user workflows.

User-workload Images

These are images that are created or used by end-users to accomplish specific tasks. These, from my understanding, will mostly be used for pipelines. If this assumption is correct, in-place upgrades aren't necessarily needed for these.

Due to the lack of definition around such images currently, it's difficult to define an effective remediation strategy.

Images sourced via Artifactory can use its download blocking feature as described by @Jose-Matsuda. Since these types of images are not infinitely running, this should cover most cases.

To a certain extent, the governance around such images and how they are used in the platform can be defined up-front with Usage Agreements and well-defined Notification Strategies, to prevent loss of confidence from end-users while striking a balance for the safe use of the platform.

In the long run, investigating the use of the following will help align strategies for these types of workloads (as well as all others):

  • Runtime analysis and actions
  • Hardened runtimes
  • Using data generated by Starboard

Disclaimer

So... This was long and I think I started to ramble. I had a few meetings between so I'm sorry if it's not all relevant!

@blairdrummond (Contributor)

I think it's very relevant! You're right that we need to categorize the types of images under consideration (in terms of how they're used/built), and tailor the security/notification systems around them to the way they're utilized. A good starting ticket in the epic might be

  • Define in rough terms the types of images/workloads (long-lived, user-workloads, system images, etc and probably a cross-product of this)
  • Define appropriate actions for each category?
  • ???
  • Profit.

@zachomedia

I'm just going to chime in and say that I agree with what @justbert wrote, and I think this echoes much of what I said during yesterday's standup.
