Improve auto_deploy to support changing zone and testing changes. #323
Conversation
* Explicitly delete the storage deployment; the delete script won't delete it by design because we don't want to destroy data.
* Instead of depending on a dummy kubeconfig file, call generate/apply for platform and then for k8s.
* For repos, take them in the form ${ORG}/${NAME}@Branch. This matches what the other test script does. It also allows us to check out the repos from people's forks, which makes testing changes easier.
* Move the logic in checkout-snapshot.sh into repo_clone_snapshot.py. This is cleaner than having the Python script shell out to a shell script.
* Move the logic in deployment-workflow.sh into create_kf_instance.py.
* Add an option in create_kf_instance.py to read and parse the snapshot.json file rather than doing it in bash.
* Change the arg name to datadir instead of nfs_mount because NFS is an assumption of how it is run on K8s.
* Check out the source into NFS to make it easier to debug.
* Add a bash script to set the images using yq.
* Add a wait operation for deletes and use it to wait for deletion of storage.
* Rename init.sh to auto_deploy.sh to make it more descriptive; also modify auto_deploy.sh so we can run it locally.
* Use the existing worker CI image as a base image to deduplicate the Dockerfile.
* Attach labels to the deployment, not the cluster:
  * We want to use deployments, not clusters, to decide what to recycle.
  * Deployments are global but clusters are zonal, and we want to be able to move to different zones to deal with issues like stockouts.
  * The GCP API does return labels on deployments.
  * We can figure out which deployment to recycle just by looking at the insertTime; we don't need to depend on deployment labels (see the sketch below).
* Add retries to deal with kubeflow/kubeflow#1933 (Fresh install renders 'no matches for kind "Application"').

Related to #95
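A minimal sketch of the insertTime-based recycling choice, assuming the deployment list comes from the Deployment Manager deployments().list() API; the name prefix and the helper pick_deployment_to_recycle are illustrative, not the PR's actual code:

```python
# Sketch only: pick which auto-deployed Kubeflow deployment to recycle based on
# the insertTime reported by the Deployment Manager API, instead of labels.
# The "kf-vmaster-" prefix and the helper name are assumptions for illustration.
import dateutil.parser


def pick_deployment_to_recycle(deployments, prefix="kf-vmaster-"):
  """Return the oldest matching deployment; deployments are dicts from the API."""
  candidates = [d for d in deployments if d["name"].startswith(prefix)]
  if not candidates:
    return None
  # insertTime is an RFC 3339 timestamp, e.g. "2019-03-01T12:34:56.000-07:00".
  candidates.sort(key=lambda d: dateutil.parser.parse(d["insertTime"]))
  return candidates[0]
```

Because deployments are global, this choice works the same regardless of which zone the recycled cluster gets recreated in.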
/assign @gabrielwen
/assign @zhenghuiwang would you mind taking a look at this while @gabrielwen is out?
/lgtm Thanks for making integration test easier!
/lgtm
…ass not found. Related to kubeflow/kubeflow#2475
Reviewed 7 of 17 files at r1.
Reviewable status: 7 of 18 files reviewed, 4 unresolved discussions (waiting on @jlewi, @ScorpioCPH, and @zjj2wry)
py/kubeflow/testing/create_kf_instance.py, line 31 at r3 (raw file):
try:
  op = deployments_client.delete(project=project, deployment=name,

This looks like gcloud deployment-manager deployments delete <name>, but I think it doesn't deal with role bindings, right?
test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):
spec:
  concurrencyPolicy: Forbid
  schedule: 0 */8 * * *
do you want to update cron spec?
test-infra/auto-deploy/Dockerfile, line 2 at r3 (raw file):
# Docker image for nightly deployment cronjob.
#
nit: remove hashtag?
test-infra/auto-deploy/checkout_lib/snapshot_kf_deployment.py, line 117 at r3 (raw file):
def lock_and_write(folder, payload):
  dirname = os.path.dirname(folder)
  dir_lock = filelock.FileLock(os.path.join(dirname, "dir.lock"))

This is to prevent write conflicts, though that's highly unlikely. Do you think it's ok that we don't deal with it?
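For readers following along, a minimal sketch, assuming the filelock package and a payload that is already a serialized string, of how a lock_and_write like the one quoted above can serialize concurrent writers; the body below is an illustrative completion, not the file's actual code:

```python
# Illustrative completion of lock_and_write: only the signature and the
# FileLock setup mirror the quoted snippet; everything else is assumed.
import os

import filelock


def lock_and_write(folder, payload):
  """Write payload (an already-serialized string) while holding a directory lock."""
  dirname = os.path.dirname(folder)
  dir_lock = filelock.FileLock(os.path.join(dirname, "dir.lock"))
  # Concurrent writers touching the same parent directory block here, so two
  # auto-deploy jobs can't interleave their writes to the snapshot file.
  with dir_lock:
    with open(folder, "w") as f:
      f.write(payload)
```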
Reviewable status: 7 of 18 files reviewed, 4 unresolved discussions (waiting on @gabrielwen, @jlewi, @ScorpioCPH, and @zjj2wry)
py/kubeflow/testing/create_kf_instance.py, line 31 at r3 (raw file):
Previously, gabrielwen (Hung-Ting Wen) wrote…
This looks like gcloud deployment-manager deployments delete <name>, but I think it doesn't deal with role bindings, right?
Correct. There are now two GCP deployments. There is a new deployment that was created by the pipelines team for storage (e.g. PDs and NFS) used by pipelines.
That deployment doesn't have any role bindings attached to it.
It isn't deleted by kfctl.sh delete because we don't want to delete storage unless told to do so explicitly. So we have to manually delete that deployment here.
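A rough sketch, using the Deployment Manager v2 API from google-api-python-client, of what the explicit delete plus the new wait-for-delete step can look like; the helper names and timeouts are illustrative, not the PR's exact code:

```python
# Sketch: explicitly delete the storage deployment and wait for the delete
# operation to finish. Uses the Deployment Manager v2 API via
# google-api-python-client; wait_for_operation and the timeouts are assumed.
import time

from googleapiclient import discovery


def wait_for_operation(dm, project, op, timeout_seconds=600, poll_seconds=10):
  """Poll a Deployment Manager operation until it reaches DONE."""
  deadline = time.time() + timeout_seconds
  while time.time() < deadline:
    result = dm.operations().get(project=project, operation=op["name"]).execute()
    if result["status"] == "DONE":
      if result.get("error"):
        raise RuntimeError("Operation failed: %s" % result["error"])
      return result
    time.sleep(poll_seconds)
  raise TimeoutError("Timed out waiting for operation %s" % op["name"])


def delete_storage_deployment(project, name):
  dm = discovery.build("deploymentmanager", "v2")
  # kfctl.sh delete intentionally leaves this deployment alone so data
  # (PDs/NFS) isn't destroyed, so the auto-deploy job deletes it itself.
  op = dm.deployments().delete(project=project, deployment=name).execute()
  return wait_for_operation(dm, project, op)
```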
Reviewable status: 7 of 18 files reviewed, 4 unresolved discussions (waiting on @gabrielwen, @ScorpioCPH, and @zjj2wry)
test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):
Previously, gabrielwen (Hung-Ting Wen) wrote…
do you want to update cron spec?
I thought I did; did I miss something?
test-infra/auto-deploy/Dockerfile, line 2 at r3 (raw file):
Previously, gabrielwen (Hung-Ting Wen) wrote…
nit: remove hashtag?
I think it's a good idea to pin to a particular image so you don't get surprised by changes.
test-infra/auto-deploy/checkout_lib/snapshot_kf_deployment.py, line 117 at r3 (raw file):
Previously, gabrielwen (Hung-Ting Wen) wrote…
This is to prevent write conflicts, though that's highly unlikely. Do you think it's ok that we don't deal with it?
Yeah, I think it's ok.
/test all
Reviewable status: 7 of 18 files reviewed, 1 unresolved discussion (waiting on @jlewi, @ScorpioCPH, and @zjj2wry)
test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
I thought I did; did I miss something?
Oh, I mean the cronjob currently running on GCP runs every 12 hours; do you want to match that?
test-infra/auto-deploy/Dockerfile, line 2 at r3 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
I think it's a good idea to pin to a particular image so you don't get surprised by changes.
nvm, I thought this empty line was not intentional :)
/lgtm
Reviewable status: 6 of 18 files reviewed, 1 unresolved discussion (waiting on @gabrielwen, @ScorpioCPH, and @zjj2wry)
test-infra/auto-deploy/deploy-cron-master.yaml, line 9 at r3 (raw file):
Previously, gabrielwen (Hung-Ting Wen) wrote…
oh, I mean currently this cronjob on GCP runs every 12 hours, do you want to match that?
It looks like it's still using 8 hours:
schedule: "0 */8 * * *"
Maybe you didn't change it when you checked in your change? I'll update it.
* This should be the current schedule, but it looks like it was never checked in.
* We want to leave clusters up long enough to facilitate debugging.
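As a side note, a quick way to check what the schedule quoted above actually fires (assuming the croniter package; the start date is arbitrary):

```python
# Expand the quoted schedule to its next few firing times; "0 */8 * * *"
# fires at 00:00, 08:00 and 16:00, i.e. every 8 hours rather than every 12.
import datetime

from croniter import croniter

schedule = croniter("0 */8 * * *", datetime.datetime(2019, 3, 1, 0, 0))
for _ in range(4):
  print(schedule.get_next(datetime.datetime))
# 2019-03-01 08:00:00
# 2019-03-01 16:00:00
# 2019-03-02 00:00:00
# 2019-03-02 08:00:00
```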
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi, zhenghuiwang

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.