Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notebook Security Scanning: Patch affected aaw-kubeflow-containers #957

Closed
6 tasks done
Jose-Matsuda opened this issue Mar 22, 2022 · 11 comments
Closed
6 tasks done
Assignees
Labels
size/L 4-5 days

Comments

@Jose-Matsuda
Copy link
Contributor

Jose-Matsuda commented Mar 22, 2022

15/06/2022 UPDATE:

This should not be bad.

Cronjob that does a bunch of kubectl commands to get pod spec and then get say the imageid and then compare that with what we get from the acr api.
Then we can run a kubectl rollout restart for specific statefulsets where imageid sha256 is different.

What we will need

  • Dockerfile with kubectl and curl
  • buncha kubectl commands
  • ~~need to find out how to authenticate with whatever cli we use,~~authenticated with azure
  • Need K8s rbac to allow cronjob to restart statefulsets
  • Need netpols to allow cronjob to interact with artifactory or acr (depending on option chosen)
  • Could just put it in the notebook-scanning namespace or create a new one. <-- will deploy this in the same namespace do not need to create a new one

Doing some initial work in private repo https://github.com/Jose-Matsuda/patch-notebook-sts


EPIC: #461

As per sprint planning

BREAK DOWN

There are quite a few things to tackle here.

  1. Need to research imago sort of image updates, semver
  2. Attach maybe 3 annotations to notebooks such as; EmailStatus (sent or unsent), Action (evict or update), Date Of Action (maybe next day for updates, and a week for eviction or something)
  3. Update the images (this task)

The goal of this task is to have the script patch any notebook images that have a vulnerability.
This will be limited to images that are covered by our configmap.

Make use of labels to determine when to patch or if to patch.

Concerns

  1. Will need to be careful as to not update IF the image on the configmap is "already" vulnerable, do not want to restart user workloads for no reason if it's not an actual fix to any CVEs.
  2. Upon checking dev we have multiple configmaps with the jupyter-web-app-config sitting around (on prod there's only 1 configmap though). Have to determine a way to find out which one to use (just the youngest?)
  3. Other images not maintained by our team

Definition of Completed

Running in dev, but maybe on a soft run and not actually initiating any updates (as in updates are logged to stdout).

@Jose-Matsuda
Copy link
Contributor Author

Please look at #961, there was discussion there that may make this a bit easier. We can have the imago approach to updating images and have them be on the latest possible one, and with it we do not need to patch affected containers as imago would do it for us. Having said that need to look and make sure that imago can do it

@Jose-Matsuda Jose-Matsuda added size/L 4-5 days and removed size/M 2-3 days labels Jun 15, 2022
@Jose-Matsuda
Copy link
Contributor Author

Jose-Matsuda commented Jun 17, 2022

Using Artifactory and Crane to get the digest

image

(this matches of course what we get with az acr interacting with the acr directly)
image

But trying this with something that definitely does not exist yet returns;
image
Grants a forbidden

Note that you can ls it
image

huh, I can't do a crane pull on it either. something seems to have happened.
image

@Jose-Matsuda
Copy link
Contributor Author

Jose-Matsuda commented Jun 24, 2022

Have image building here and will push it hopefully

https://github.com/StatCan/aaw-contrib-containers/tree/feat-notebook-restart

Once this is tested and all right we can move to a different repository via copy-pasta. Doing this for quick iteration. Can modify this dockerfile at my will and its easy to also modify argocd

@Jose-Matsuda
Copy link
Contributor Author

Jose-Matsuda commented Jun 27, 2022

Couple things (don't forget to separate the args in argocd, as in args: ["/home/test.sh", "asd"]
Here's the result of the first run
image

Ok something is up with the kubectl command, the image is causing jq to fail and quit pre-epmtively
image

The azure authentication itself is also not being kind and itself also wants 'docker' but might be able to use without as said in the error message, this is proving to be annoying as heck. azure is being dumb with asking for interaction with the commandline.

It does seem to want to run a restart of Bhavika's workspace which is nice, though of course it should have wanted to do the same for my own as I do have a workspace with v1 on it,
image

@Jose-Matsuda
Copy link
Contributor Author

Jose-Matsuda commented Jun 27, 2022

It seems that these two 'notebooks' do not contain 'containerStatuses'. ctrl-f'ing all the other ones have them. I will ask about them
image

I had not considered the pods that just do not get scheduled.

FIX

Update to be
.items[] | select(.status.containerStatuses != null) | {stsName: (.metadata.ownerReferences[].name), namespace: (.metadata.namespace), image: (.status.containerStatuses[])}

@Jose-Matsuda
Copy link
Contributor Author

Ok since Azure is annoying I did the following
image

Where the test.sh looks like

jovyan@f642475077c3:~$ cat test.sh
#!/usr/bin/bash

az acr repository show -n acr --repository remote-desktop

and the expect I had to install, and then use autoexpect to make the file for me, it looks something like;

cat script.exp 
#!/usr/bin/expect -f
set force_conservative 0  ;# set to 1 to force conservative mode even if
                          ;# script wasn't run conservatively originally
if {$force_conservative} {
        set send_slow {1 .1}
        proc send {ignore arg} {
                sleep .1
                exp_send -s -- $arg
        }
}

set timeout -1
spawn ./test.sh
match_max 100000
expect -exact "[93mUnable to get AAD authorization tokens with message: Please run 'az login' to setup account.[0m\r
[93mUnable to get admin user credentials with message: Please run 'az login' to setup account.[0m\r
Username: "
send -- "read-metadata"
expect -exact "read-metadata"
send -- "\r"
expect -exact "\r
Password: "
send -- "TOKEN PASSWORD\r"
expect eof

@Jose-Matsuda
Copy link
Contributor Author

Ok authentication is dumb, doing this inside my locally built image gave me

jovyan@d06b206db2e4:~$ az acr repository show -n $registry --image $image --username $uname --password $pword                                 
The login server endpoint suffix '.azurecr.io' is automatically omitted.
{
  "changeableAttributes": {
    "deleteEnabled": true,
    "listEnabled": true,
    "readEnabled": true,
    "writeEnabled": true
  },
  "createdTime": "2022-04-28T18:21:06.6776525Z",
  "digest": "sha256:c1dac2f250609abbc845e96fc24fb0fe61a187e751c387d2feb7291221f2618b",
  "lastUpdateTime": "2022-06-21T17:53:13.6810436Z",
  "name": "v1",
  "quarantineState": "Passed",
  "signed": false
}

@Jose-Matsuda
Copy link
Contributor Author

Jose-Matsuda commented Jun 28, 2022

OK it seems like we have the dry run working as intended, the following is the output of one of the jobs (after forcing it to go)

image

Where the current most recent "v1" is
image

and the sha on my notebook is
image

If we want it to move to actually doing the restart just change the argument passed to execute instead of dry-run.

The next step we will want is to migrate user images to the long-lived v1 tag, and ensure that the imagePullPolicy on those images is also updated to Always (it's the default now but it will still be IfNotPresent for the older ones)

@Jose-Matsuda
Copy link
Contributor Author

Update after fixing up a bit of rbac and others (changing to execute)
image

and then running a -o yaml the sha is indeed updated and on a newer image
image

@Jose-Matsuda
Copy link
Contributor Author

Note that you also needed this authentication to be created in the container registry on azure, and put in the secret in the namespace via gitlab
image

@Jose-Matsuda
Copy link
Contributor Author

Actual PR that contains the PR / script stuffs
StatCan/aaw-contrib-containers#96

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L 4-5 days
Projects
None yet
Development

No branches or pull requests

1 participant