
Improve auto_deploy to support changing zone and testing changes. #323

Merged
merged 5 commits into kubeflow:master on Mar 8, 2019

Conversation

@jlewi (Contributor) commented Mar 2, 2019

  • Explicitly delete the storage deployment; the delete script won't delete it
    by design because we don't want to destroy data.

  • Instead of depending on a dummy kubeconfig file, call generate/apply for
    platform and then for k8s.

  • For repos, take them in the form ${ORG}/${NAME}@Branch; see the parsing
    sketch below. This matches what the other test script does, and it also
    allows us to check out repos from people's forks, which makes testing
    changes easier.

  • Move the logic in checkout-snapshot.sh into repo_clone_snapshot.py.
    This is cleaner than having the Python script shell out to a shell script.

  • Move the logic in deployment-workflow.sh into create_kf_instance.py

    • Add an option in create_kf_instance.py to read and parse the snapshot.json
      file rather than doing it in bash.
  • Change the arg name to datadir instead of nfs_mount, because NFS is just
    an assumption about how it is run on K8s.

  • Check out the source into NFS to make it easier to debug.

  • Add a bash script to set the images using yq; see the sketch below.

  • Add a wait operation for deletes and use it to wait for deletion of
    storage; see the sketch below.

  • Rename init.sh to auto_deploy.sh to make it more descriptive.

  • Also modify auto_deploy.sh so we can run it locally.

  • Use the existing worker CI image as a base image to deduplicate the
    Dockerfile.

  • Attach labels to the deployment, not the cluster.

    • We want to use deployments, not clusters, to decide what to recycle.
    • Deployments are global, but clusters are zonal, and we want to be able
      to move to different zones to deal with issues like stockouts.
  • The GCP API does return labels on deployments.

  • We can figure out which deployment to recycle just by looking at the
    insertTime; we don't need to depend on deployment labels (see the sketch
    below).

  • Add retries to deal with kubeflow/kubeflow#1933 (fresh install renders
    'no matches for kind "Application"'); see the sketch below.

Related to #95
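
As an illustration of the ${ORG}/${NAME}@Branch form described above, here is a
minimal parsing sketch; the helper name and the "master" default are
hypothetical, not the actual repo_clone_snapshot.py code:

```python
def parse_repo_spec(spec):
  """Parse a repo spec such as "kubeflow/testing@some-branch".

  Returns (org, name, branch); branch defaults to "master" when the
  "@branch" suffix is omitted.
  """
  repo, _, branch = spec.partition("@")
  org, name = repo.split("/", 1)
  return org, name, branch or "master"

# Example: checking out a branch from a fork.
assert parse_repo_spec("jlewi/testing@auto_deploy") == (
    "jlewi", "testing", "auto_deploy")
```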



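The image-setting script can be pictured as follows. This sketch is Python
(for consistency with the other sketches) rather than the PR's bash, and it
assumes yq v2's write syntax (`yq w -i <file> <path> <value>`); the YAML path
and image are made up:

```python
import subprocess

def set_image(manifest, yaml_path, image):
  """Rewrite an image field in a manifest in place using the yq CLI.

  Assumes yq v2's write syntax; the actual script is bash, so this is
  only an illustration of the idea.
  """
  subprocess.check_call(["yq", "w", "-i", manifest, yaml_path, image])

# Hypothetical usage; the path and image below are illustrative only.
# set_image("deploy-cron-master.yaml",
#           "spec.jobTemplate.spec.template.spec.containers[0].image",
#           "gcr.io/kubeflow-ci/auto-deploy:latest")
```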
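
The wait operation for deletes amounts to polling the Deployment Manager
operation until its status is DONE. A minimal sketch, assuming the
google-api-python-client Deployment Manager v2 API; the function name,
timeout, and poll interval are illustrative:

```python
import time

from googleapiclient import discovery

def wait_for_operation(dm, project, op_name, timeout_s=900, poll_s=10):
  """Poll a Deployment Manager operation until it completes."""
  deadline = time.time() + timeout_s
  while time.time() < deadline:
    op = dm.operations().get(project=project, operation=op_name).execute()
    if op.get("status") == "DONE":
      if "error" in op:
        raise RuntimeError("Operation %s failed: %s" % (op_name, op["error"]))
      return op
    time.sleep(poll_s)
  raise RuntimeError("Timed out waiting for operation %s" % op_name)

# Usage sketch:
# dm = discovery.build("deploymentmanager", "v2")
# delete_op = dm.deployments().delete(project=..., deployment=...).execute()
# wait_for_operation(dm, project, delete_op["name"])
```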

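
Recycling by insertTime could look like the sketch below; it assumes the
Deployment Manager v2 deployments.list API and python-dateutil, and the
name-prefix filter is hypothetical:

```python
from dateutil import parser as date_parser

def oldest_deployment(dm, project, name_prefix):
  """Return the matching deployment with the earliest insertTime, or None.

  Deployments are global (unlike zonal clusters), so listing them and
  sorting by insertTime avoids any dependence on labels.
  """
  resp = dm.deployments().list(project=project).execute()
  candidates = [d for d in resp.get("deployments", [])
                if d["name"].startswith(name_prefix)]
  if not candidates:
    return None
  return min(candidates, key=lambda d: date_parser.parse(d["insertTime"]))
```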

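
The retries for kubeflow/kubeflow#1933 can be sketched as re-running
kubectl apply until the Application CRD is registered; the command, retry
count, and backoff are illustrative, not the script's actual code:

```python
import subprocess
import time

def apply_with_retries(manifest, retries=5, backoff_s=30):
  """Retry kubectl apply to ride out the window on a fresh install where
  the Application CRD is not yet registered."""
  for attempt in range(1, retries + 1):
    if subprocess.call(["kubectl", "apply", "-f", manifest]) == 0:
      return
    time.sleep(backoff_s)
  raise RuntimeError("kubectl apply failed after %d attempts" % retries)
```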
@jlewi (Contributor, Author) commented Mar 5, 2019

/assign @gabrielwen

@jlewi (Contributor, Author) commented Mar 6, 2019

/assign @zhenghuiwang would you mind taking a look at this while @gabrielwen is out?

@zhenghuiwang (Contributor):

/lgtm
Thanks for making integration testing easier!

@zhenghuiwang (Contributor):

/lgtm
/approve

@k8s-ci-robot removed the lgtm label Mar 7, 2019
@gabrielwen (Contributor) left a comment

Reviewed 7 of 17 files at r1.
Reviewable status: 7 of 18 files reviewed, 4 unresolved discussions (waiting on @jlewi, @ScorpioCPH, and @zjj2wry)


py/kubeflow/testing/create_kf_instance.py, line 31 at r3 (raw file):

  try:
    op = deployments_client.delete(project=project, deployment=name,

this looks like gcloud deployment-manager deployments delete <name>, but I think it doesn't deal with role bindings, right?


test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):

spec:
  concurrencyPolicy: Forbid
  schedule: 0 */8 * * *

do you want to update cron spec?


test-infra/auto-deploy/Dockerfile, line 2 at r3 (raw file):

# Docker image for nightly deployment cronjob.
#

nit: remove hashtag?


test-infra/auto-deploy/checkout_lib/snapshot_kf_deployment.py, line 117 at r3 (raw file):

def lock_and_write(folder, payload):
  dirname = os.path.dirname(folder)
  dir_lock = filelock.FileLock(os.path.join(dirname, "dir.lock"))

this is to prevent write conflicts, though it's highly unlikely. Do you think it's OK that we don't deal with it?
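
For context, the quoted helper presumably finishes by taking the lock and
writing; a minimal sketch assuming the filelock package's context-manager
API (the body shown here is a guess, not the file's actual code):

```python
import os

import filelock

def lock_and_write(folder, payload):
  # Guard concurrent writers with a directory-level lock file.
  dirname = os.path.dirname(folder)
  dir_lock = filelock.FileLock(os.path.join(dirname, "dir.lock"))
  with dir_lock:
    with open(folder, "w") as hf:
      hf.write(payload)
```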

@jlewi (Contributor, Author) left a comment

Reviewable status: 7 of 18 files reviewed, 4 unresolved discussions (waiting on @gabrielwen, @jlewi, @ScorpioCPH, and @zjj2wry)


py/kubeflow/testing/create_kf_instance.py, line 31 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

this looks like gcloud deployment-manager deployments delete <name>, but I think it doesn't deal with role bindings, right?

Correct. There are now two GCP deployments. There is a new deployment, created by the pipelines team, for the storage (e.g. PDs and NFS) used by pipelines.

That deployment doesn't have any role bindings attached to it.

It isn't deleted by kfctl.sh delete because we don't want to delete storage unless told to do so explicitly. So we have to manually delete that deployment here.
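
In other words, something along these lines; a sketch assuming the
google-api-python-client Deployment Manager v2 API, with an illustrative
naming convention for the storage deployment:

```python
from googleapiclient import discovery

def delete_storage_deployment(project, kf_name):
  """Explicitly delete the storage deployment (PDs, NFS) that
  kfctl.sh delete intentionally leaves behind."""
  dm = discovery.build("deploymentmanager", "v2")
  # The "-storage" suffix is an assumption for illustration.
  name = kf_name + "-storage"
  return dm.deployments().delete(project=project, deployment=name).execute()
```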

@jlewi (Contributor, Author) left a comment

Reviewable status: 7 of 18 files reviewed, 4 unresolved discussions (waiting on @gabrielwen, @ScorpioCPH, and @zjj2wry)


test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

do you want to update cron spec?

I thought I did; did I miss something?


test-infra/auto-deploy/Dockerfile, line 2 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

nit: remove hashtag?

I think it's a good idea to pin to a particular image so you don't get surprised by changes.


test-infra/auto-deploy/checkout_lib/snapshot_kf_deployment.py, line 117 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

this is to prevent write conflicts, though it's highly unlikely. Do you think it's OK that we don't deal with it?

Yeah, I think it's OK.

@jlewi (Contributor, Author) commented Mar 7, 2019

/test all

@gabrielwen (Contributor) left a comment

Reviewable status: 7 of 18 files reviewed, 1 unresolved discussion (waiting on @jlewi, @ScorpioCPH, and @zjj2wry)


test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I thought I did; did I miss something?

oh, I mean that this cronjob on GCP currently runs every 12 hours; do you want to match that?


test-infra/auto-deploy/Dockerfile, line 2 at r3 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I think its a good idea to pin to a particular image so you don't get surprised by changes.

nvm, I thought this empty line was not intentional :)

@gabrielwen (Contributor):

/lgtm

@k8s-ci-robot removed the lgtm label Mar 8, 2019
@jlewi (Contributor, Author) left a comment

Reviewable status: 6 of 18 files reviewed, 1 unresolved discussion (waiting on @gabrielwen, @ScorpioCPH, and @zjj2wry)


test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):

Previously, gabrielwen (Hung-Ting Wen) wrote…

oh, I mean that this cronjob on GCP currently runs every 12 hours; do you want to match that?

It looks like it's still using 8 hours.

Maybe you didn't change it when you checked in your change? I'll update it.

  * This should be the current schedule, but it looks like it was never
    checked in.
  * We want to leave clusters up long enough to facilitate debugging.
@zhenghuiwang (Contributor):

/lgtm

@jlewi (Contributor, Author) commented Mar 8, 2019

/approve

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi, zhenghuiwang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 238ffc2 into kubeflow:master Mar 8, 2019