
Improve auto_deploy to support changing zone and testing changes. #323

Merged
merged 5 commits into kubeflow:master on Mar 8, 2019

Conversation

@jlewi (Contributor) commented Mar 2, 2019

  • Explicitly delete the storage deployment; the delete script won't delete it
    by design because we don't want to destroy data.

  • Instead of depending on a dummy kubeconfig file, call generate/apply for
    platform and then for k8s.

  • For repos, take them in the form ${ORG}/${NAME}@Branch; see the parsing
    sketch below. This matches what the other test script does, and it also
    allows us to check out repos from people's forks, which makes testing
    changes easier.

  • Move the logic in checkout-snapshot.sh into repo_clone_snapshot.py.
    This is cleaner than having the Python script shell out to a shell script.

  • Move the logic in deployment-workflow.sh into create_kf_instance.py

    • Add an option in create_kf_instance.py to read and parse the snapshot.json
      file rather than doing it in bash.
  • Change the arg name to datadir instead of nfs_mount, because NFS is just
    an assumption about how it is run on K8s.

  • Check out the source into NFS to make it easier to debug.

  • Add a bash script to set the images using yq; see the sketch below.

  • Add a wait operation for deletes and use it to wait for deletion of
    storage; see the sketch below.

  • Rename init.sh to auto_deploy.sh to make it more descriptive.

  • Also modify auto_deploy.sh so we can run it locally.

  • Use the existing worker CI image as a base image to deduplicate the
    Dockerfile.

  • Attach labels to the deployment, not the cluster.

    • We want to use deployments, not clusters, to decide what to recycle.
    • Deployments are global, but clusters are zonal, and we want to be able
      to move to different zones to deal with issues like stockouts.
  • The GCP API does return labels on deployments.

  • We can figure out which deployment to recycle just by looking at the
    insertTime; we don't need to depend on deployment labels (see the sketch
    below).

  • Add retries to deal with kubeflow/kubeflow#1933 (fresh install renders
    'no matches for kind "Application"'); see the sketch below.

Related to #95
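
As an illustration of the ${ORG}/${NAME}@Branch form described above, here is a
minimal parsing sketch; the helper name and the "master" default are
hypothetical, not the actual repo_clone_snapshot.py code:

```python
def parse_repo_spec(spec):
  """Parse a repo spec such as "kubeflow/testing@some-branch".

  Returns (org, name, branch); branch defaults to "master" when the
  "@branch" suffix is omitted.
  """
  repo, _, branch = spec.partition("@")
  org, name = repo.split("/", 1)
  return org, name, branch or "master"

# Example: checking out a branch from a fork.
assert parse_repo_spec("jlewi/testing@auto_deploy") == (
    "jlewi", "testing", "auto_deploy")
```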



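The image-setting script can be pictured as follows. This sketch is Python
(for consistency with the other sketches) rather than the PR's bash, and it
assumes yq v2's write syntax (`yq w -i <file> <path> <value>`); the YAML path
and image are made up:

```python
import subprocess

def set_image(manifest, yaml_path, image):
  """Rewrite an image field in a manifest in place using the yq CLI.

  Assumes yq v2's write syntax; the actual script is bash, so this is
  only an illustration of the idea.
  """
  subprocess.check_call(["yq", "w", "-i", manifest, yaml_path, image])

# Hypothetical usage; the path and image below are illustrative only.
# set_image("deploy-cron-master.yaml",
#           "spec.jobTemplate.spec.template.spec.containers[0].image",
#           "gcr.io/kubeflow-ci/auto-deploy:latest")
```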
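
The wait operation for deletes amounts to polling the Deployment Manager
operation until its status is DONE. A minimal sketch, assuming the
google-api-python-client Deployment Manager v2 API; the function name,
timeout, and poll interval are illustrative:

```python
import time

from googleapiclient import discovery

def wait_for_operation(dm, project, op_name, timeout_s=900, poll_s=10):
  """Poll a Deployment Manager operation until it completes."""
  deadline = time.time() + timeout_s
  while time.time() < deadline:
    op = dm.operations().get(project=project, operation=op_name).execute()
    if op.get("status") == "DONE":
      if "error" in op:
        raise RuntimeError("Operation %s failed: %s" % (op_name, op["error"]))
      return op
    time.sleep(poll_s)
  raise RuntimeError("Timed out waiting for operation %s" % op_name)

# Usage sketch:
# dm = discovery.build("deploymentmanager", "v2")
# delete_op = dm.deployments().delete(project=..., deployment=...).execute()
# wait_for_operation(dm, project, delete_op["name"])
```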

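
Recycling by insertTime could look like the sketch below; it assumes the
Deployment Manager v2 deployments.list API and python-dateutil, and the
name-prefix filter is hypothetical:

```python
from dateutil import parser as date_parser

def oldest_deployment(dm, project, name_prefix):
  """Return the matching deployment with the earliest insertTime, or None.

  Deployments are global (unlike zonal clusters), so listing them and
  sorting by insertTime avoids any dependence on labels.
  """
  resp = dm.deployments().list(project=project).execute()
  candidates = [d for d in resp.get("deployments", [])
                if d["name"].startswith(name_prefix)]
  if not candidates:
    return None
  return min(candidates, key=lambda d: date_parser.parse(d["insertTime"]))
```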

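
The retries for kubeflow/kubeflow#1933 can be sketched as re-running
kubectl apply until the Application CRD is registered; the command, retry
count, and backoff are illustrative, not the script's actual code:

```python
import subprocess
import time

def apply_with_retries(manifest, retries=5, backoff_s=30):
  """Retry kubectl apply to ride out the window on a fresh install where
  the Application CRD is not yet registered."""
  for attempt in range(1, retries + 1):
    if subprocess.call(["kubectl", "apply", "-f", manifest]) == 0:
      return
    time.sleep(backoff_s)
  raise RuntimeError("kubectl apply failed after %d attempts" % retries)
```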
@jlewi (Contributor, Author) commented Mar 5, 2019

/assign @gabrielwen

@jlewi (Contributor, Author) commented Mar 6, 2019

/assign @zhenghuiwang would you mind taking a look at this while @gabrielwen is out?

@zhenghuiwang (Contributor):

/lgtm
Thanks for making integration testing easier!

@zhenghuiwang (Contributor):

/lgtm
/approve

@k8s-ci-robot removed the lgtm label Mar 7, 2019
@gabrielwen (Contributor) left a comment

Reviewed 7 of 17 files at r1.
Reviewable status: 7 of 18 files reviewed, 4 unresolved discussions (waiting on @jlewi, @ScorpioCPH, and @zjj2wry)


py/kubeflow/testing/create_kf_instance.py, line 31 at r3 (raw file):

  try:
    op = deployments_client.delete(project=project, deployment=name,

this looks like gcloud deployment-manager deployments delete <name>, but I think it doesn't deal with role bindings, right?


test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):

spec:
  concurrencyPolicy: Forbid
  schedule: 0 */8 * * *

do you want to update cron spec?


test-infra/auto-deploy/Dockerfile, line 2 at r3 (raw file):

# Docker image for nightly deployment cronjob.
#

nit: remove hashtag?


test-infra/auto-deploy/checkout_lib/snapshot_kf_deployment.py, line 117 at r3 (raw file):

def lock_and_write(folder, payload):
  dirname = os.path.dirname(folder)
  dir_lock = filelock.FileLock(os.path.join(dirname, "dir.lock"))

this is to prevent write conflicts, though it's highly unlikely. Do you think it's OK that we don't deal with it?
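
For context, the quoted helper presumably finishes by taking the lock and
writing; a minimal sketch assuming the filelock package's context-manager
API (the body shown here is a guess, not the file's actual code):

```python
import os

import filelock

def lock_and_write(folder, payload):
  # Guard concurrent writers with a directory-level lock file.
  dirname = os.path.dirname(folder)
  dir_lock = filelock.FileLock(os.path.join(dirname, "dir.lock"))
  with dir_lock:
    with open(folder, "w") as hf:
      hf.write(payload)
```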

@jlewi (Contributor, Author) left a comment

Reviewable status: 7 of 18 files reviewed, 4 unresolved discussions (waiting on @gabrielwen, @jlewi, @ScorpioCPH, and @zjj2wry)


py/kubeflow/testing/create_kf_instance.py, line 31 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

this looks like gcloud deployment-manager deployments delete <name>, but I think it doesn't deal with role bindings, right?

Correct. There are now two GCP deployments. There is a new deployment, created by the pipelines team, for the storage (e.g. PDs and NFS) used by pipelines.

That deployment doesn't have any role bindings attached to it.

It isn't deleted by kfctl.sh delete because we don't want to delete storage unless told to do so explicitly. So we have to manually delete that deployment here.
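
In other words, something along these lines; a sketch assuming the
google-api-python-client Deployment Manager v2 API, with an illustrative
naming convention for the storage deployment:

```python
from googleapiclient import discovery

def delete_storage_deployment(project, kf_name):
  """Explicitly delete the storage deployment (PDs, NFS) that
  kfctl.sh delete intentionally leaves behind."""
  dm = discovery.build("deploymentmanager", "v2")
  # The "-storage" suffix is an assumption for illustration.
  name = kf_name + "-storage"
  return dm.deployments().delete(project=project, deployment=name).execute()
```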

@jlewi (Contributor, Author) left a comment

Reviewable status: 7 of 18 files reviewed, 4 unresolved discussions (waiting on @gabrielwen, @ScorpioCPH, and @zjj2wry)


test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

do you want to update cron spec?

I thought I did; did I miss something?


test-infra/auto-deploy/Dockerfile, line 2 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

nit: remove hashtag?

I think it's a good idea to pin to a particular image so you don't get surprised by changes.


test-infra/auto-deploy/checkout_lib/snapshot_kf_deployment.py, line 117 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

this is to prevent write conflicts, though it's highly unlikely. Do you think it's OK that we don't deal with it?

Yeah, I think it's OK.

@jlewi (Contributor, Author) commented Mar 7, 2019

/test all

@gabrielwen (Contributor) left a comment

Reviewable status: 7 of 18 files reviewed, 1 unresolved discussion (waiting on @jlewi, @ScorpioCPH, and @zjj2wry)


test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I thought I did; did I miss something?

oh, I mean that this cronjob on GCP currently runs every 12 hours; do you want to match that?


test-infra/auto-deploy/Dockerfile, line 2 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I think its a good idea to pin to a particular image so you don't get surprised by changes.

nvm, I thought this empty line was not intentional :)

@gabrielwen (Contributor):

/lgtm

@k8s-ci-robot removed the lgtm label Mar 8, 2019
@jlewi (Contributor, Author) left a comment

Reviewable status: 6 of 18 files reviewed, 1 unresolved discussion (waiting on @gabrielwen, @ScorpioCPH, and @zjj2wry)


test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

oh, I mean that this cronjob on GCP currently runs every 12 hours; do you want to match that?

It looks like it's still using 8 hours.

Maybe you didn't change it when you checked in your change? I'll update it.

  * This should be the current schedule, but it looks like it was never
    checked in.
  * We want to leave clusters up long enough to facilitate debugging.
@zhenghuiwang (Contributor):

/lgtm

@jlewi (Contributor, Author) commented Mar 8, 2019

/approve

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi, zhenghuiwang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 238ffc2 into kubeflow:master Mar 8, 2019