
Continuous build of docker images and updating kustomize manifests #450

Closed
jlewi opened this issue Aug 29, 2019 · 16 comments

@jlewi
Contributor

jlewi commented Aug 29, 2019

We need a good way to continuously build our docker images and then update our kustomize manifests to use the updated images.

This is critical for maintaining velocity. One of the big problems we are seeing with releases is that changes are piling up and not getting exercised until we start cutting releases because we haven't updated our kustomize manifests.

Also, as the number of applications scales, the toil around building docker images and then updating manifests becomes significant. This is especially true during releases as we try to rapidly push out fixes.

There is a POC based on the Jupyter web app here:
https://github.com/kubeflow/kubeflow/tree/master/releasing/auto-update

We'd like to make it super easy for people to define new workflows to auto-build their application. In an ideal world they would just check in a YAML file with a couple of configurations, e.g.

  • Location of their Dockerfile
  • Location of their kustomization.yaml file
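
As a sketch, such a checked-in file might look like the following. The schema and field names here are hypothetical (there is no existing format for this yet), and the paths are illustrative:

```yaml
# auto-build.yaml -- hypothetical per-application config (illustrative schema)
name: jupyter-web-app
# Where to find the Dockerfile, relative to the repo root
dockerfile: components/jupyter-web-app/Dockerfile
# Registry path the built image should be pushed to
image: gcr.io/kubeflow-images-public/jupyter-web-app
# kustomization.yaml to update with the new image digest after each build
kustomization: jupyter/jupyter-web-app/base/kustomization.yaml
```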

A few things we are missing:

  1. A good solution for triggering workflows on pre/postsubmits/cron jobs
  2. A good solution for monitoring/alerting
  3. A good story for reusability around common tasks (e.g. building images, creating a PR, etc...)

/cc @scottilee @animeshsingh @kkasravi @jinchihe

@jlewi
Contributor Author

jlewi commented Sep 5, 2019

GCB now has direct integration via GitHub App Triggers
https://cloud.google.com/cloud-build/docs/create-github-app-triggers

So if we install that GitHub App in our project then we can trigger GCB builds in response to PRs. The GCB build could then create K8s resources.

This is very similar to how our Prow infra works today. We use Prow to trigger Prow jobs which run run_e2e_workflow.py, which in turn submits a bunch of Argo workflows based on prow_config.yaml.

We could do something similar but use GCB to invoke run_e2e_workflow.py.
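
As an illustration, the cloudbuild.yaml run by such a trigger might look roughly like this. The script path and arguments are assumptions based on the existing Prow setup; a GitHub App trigger checks the repo out into /workspace before the steps run:

```yaml
# cloudbuild.yaml -- sketch only; the GitHub App trigger supplies the source
steps:
# Run the same driver script Prow uses today (path and args are assumptions)
- name: 'python:3.7'
  entrypoint: 'python'
  args: ['py/kubeflow/testing/run_e2e_workflow.py',
         '--config=prow_config.yaml']
```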

@jlewi
Contributor Author

jlewi commented Sep 5, 2019

With kubeflow/kubeflow#4029 we have a pretty good POC for CD of the jupyter web app image.
The next step is probably to generalize this to a 2nd image which should probably be the central dashboard; kubeflow/kubeflow#3781.

  1. Parameterizing update_jupyter_web_app.py into a reusable script should be pretty easy

    • It looks like there are basically a few arguments:
      1. The image location
      2. Kustomization location
      3. Location of source and the command to build the image
        • We could make certain assumptions; i.e. that there is a Makefile and the name of the build rule and variables parameterizing where it gets pushed
  2. Make it easy for people to add new jobs to periodically build and push their image.
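
On the kustomization side, kustomize's built-in images transformer gives the update step a single place to write the new digest. A minimal example, with a placeholder image name and digest:

```yaml
# kustomization.yaml -- the images transformer rewrites image references in
# all listed resources, so the CD job only has to edit this stanza
# (e.g. via `kustomize edit set image <name>@<digest>`)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
images:
- name: gcr.io/kubeflow-images-public/jupyter-web-app
  digest: sha256:<digest-written-by-the-cd-job>
```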

@jlewi
Contributor Author

jlewi commented Nov 27, 2019

I have created the CI/CD for Kubeflow Applications Card in the Engprod Project to track this.

@kkasravi
Contributor

@jlewi thanks. I should be able to do some work on this in the next few days.

@jlewi
Contributor Author

jlewi commented Nov 27, 2019

Design doc is here: bit.ly/kfcd

It looks like it's a bit outdated. It would be good to update it and then socialize our thinking at the community meeting.

@scottilee
Contributor

@jlewi I don't know if it's just me but that design doc link redirects to http://www.thelaptop-computers.info/2009/11/watauga-county-sheriff%E2%80%99s-office-arrests-two-suspected-burglars-go-blue-ridge/ which is not relevant.

@kkasravi
Contributor

It looks like it's a bit outdated. It would be good to update it and then socialize our thinking at the community meeting.

@jlewi I'll update the doc as you've suggested to just focus on #1 (A doc focused on continuous delivery of our applications)

jlewi pushed a commit to jlewi/kubeflow that referenced this issue Dec 13, 2019
…cluster

* Get rid of the PVC used to pass the image digest file between the build
  and update manifests step

  * Creating a PVC just creates operational complexity

* We combine the build and update manifests step into one task. We can
  then use /workspace (a pod volume) to pass data like the image digest
  file between the steps

* Update pipelineRun to work with version 0.9 of Tekton
  * Field serviceAccount has been renamed serviceAccountName

  * TaskRun no longer supports outputImageDir so we remove it; we will
    have to use Tekton to pass the image digest file

* Remove the namespace.yaml and secrets.yaml from the kustomize package

  * The secrets should be created out of band and not checked in
  * So the behavior should be to deploy the kustomize package in a namespace
    that already exists with the appropriate secrets

  * Checking in secrets is confusing

    * If we check in dummy secrets then users will get confused about
      whether the secrets are valid or not

    * Furthermore, the file secrets.yaml is an invitation to end up checking
      the secrets into source control.

* Configure some values to use gcr.io/kubeflow-images-public

* Disable ISTIO sidecar in the pipelines

* For kaniko we don't need the secret to be named a certain way; we just
  need to set GOOGLE_APPLICATION_CREDENTIALS to point to the correct value

* We change kaniko to use the user-gcp-sa secret that Kubeflow creates

* We shouldn't need an image pull secret since kubeflow-images-public is public
  * GOOGLE_APPLICATION_CREDENTIALS should be used for pushing images

* Change the name of the secret containing ssh credentials for kubeflow-bot
  to kubeflow-bot-github-ssh

* rebuild-manifests.sh should use /workspace to get the image digest
  rather than the PVC.

* Simplify rebuild-manifests.sh

  * Tekton will mount the .ssh information in /tekton/home/.ssh
    so we just need to create a symbolic link to /root/.ssh

  * The image digest file should be fetched from /workspace and not some PVC.

  * Set the GITHUB_TOKEN environment variable using secrets so that we don't
    need to use kubectl get to fetch it

  * We need to make the clone of kubeflow/manifests a non-shallow clone
    before we can push changes to the remote repo

Next steps:

* This PR only updated the profile controller

* We need to refactor how the PipelineRuns are laid out

  * I think we may want the PipelineRuns to be separate from the reused
    resources like Task

* rebuild-manifests.sh should only regenerate tests for changed files

* The created PRs don't satisfy the Kubeflow CLA check.

Related to: kubeflow/testing#450
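
The /workspace handoff described in the commit message can be sketched as a single Tekton Task whose steps all run in one pod and therefore share /workspace. The Task name, images, and paths below are assumptions, not the actual manifests:

```yaml
apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  name: build-and-update-manifests   # hypothetical name
spec:
  steps:
  # Step 1: kaniko builds and pushes the image, writing the digest to /workspace
  - name: build-push
    image: gcr.io/kaniko-project/executor
    args: ['--dockerfile=/workspace/src/Dockerfile',
           '--destination=gcr.io/kubeflow-images-public/profile-controller',
           '--digest-file=/workspace/image-digest']
  # Step 2: reads the digest back from /workspace -- no PVC needed, because
  # all steps of a Task execute in the same pod
  - name: update-manifests
    image: gcr.io/kubeflow-images-public/test-worker   # assumed builder image
    command: ['bash', '-c', 'rebuild-manifests.sh "$(cat /workspace/image-digest)"']
```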
jlewi pushed a commit to jlewi/kubeflow that referenced this issue Dec 13, 2019
…cluster

(Same commit message as the Dec 13 commit above, with one addition:)

* I was able to successfully run the profile controller workflow and create a
  PR

  kubeflow/manifests#669

Related to: kubeflow/testing#450
@jlewi
Contributor Author

jlewi commented Dec 13, 2019

Status Update:

Next steps

  • Refactor the kustomize layout of the Tekton scripts

  • I think we want to make it easier to fire off PipelineRuns for all the different applications we need to update

    • I think as part of that we want to move all the scripts/pipeline resources into kubeflow/testing since
      that's our main engprod repo
  • We need "Fix computation of changed files in generate-changed-only rule" (manifests#665) to be submitted so that we can regenerate manifest tests for only changed files

@jlewi
Contributor Author

jlewi commented Dec 17, 2019

@kkasravi I wrote up my current thinking in this doc:
http://bit.ly/kfappscd-201912

PTAL

@kkasravi
Contributor

@jlewi

I had commented on restructuring the PipelineRun to embed a pipelineSpec and resourceSpec rather than a pipelineRef and resourceRefs here: #544 (comment)

I'll comment on the doc as well
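
A rough sketch of that restructuring, with the Pipeline embedded inline rather than referenced (resource names here are hypothetical):

```yaml
apiVersion: tekton.dev/v1alpha1
kind: PipelineRun
metadata:
  name: profile-controller-run   # hypothetical
spec:
  serviceAccountName: default
  # pipelineSpec embeds the Pipeline inline instead of pointing at a shared
  # one via pipelineRef, so each application's run is self-contained;
  # resources can likewise be declared inline via resourceSpec
  pipelineSpec:
    tasks:
    - name: build-and-update
      taskRef:
        name: build-and-update-manifests
```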

k8s-ci-robot pushed a commit to kubeflow/kubeflow that referenced this issue Dec 20, 2019
…cluster (#4568)

(Same commit message as the Dec 13 commits above.)

Related to: kubeflow/testing#450
jlewi pushed a commit to jlewi/testing that referenced this issue Jan 22, 2020
* Update applications.yaml with a v0.8 release.

* The purpose of this PR is to check that just by defining the appropriate
  release we can begin building images from release branches and updating
  the release branch of kubeflow/manifests

* Related to kubeflow#450 - Continuous delivery of Kubeflow applications
jlewi pushed a commit to jlewi/testing that referenced this issue Jan 23, 2020
* Update applications.yaml with a v0.8 release.

* The purpose of this PR is to check that just by defining the appropriate
  release we can begin building images from release branches and updating
  the release branch of kubeflow/manifests

* Related to kubeflow#450 - Continuous delivery of Kubeflow applications

* Create a python script for opening up the PR; this script replaces
  the bash script rebuild-manifests.sh that was used previously

* The new script doesn't assume that the base branch for PRs is master.
  We need this to support updating release branches.

* Create a profile in skaffold.yaml for running on the release cluster.
* Create an image_util package to parse image URLs.

* Use the Docker image for apps-cd to run create_manifests_pr.py
  * Add kustomize, go, and some other tools we need

  * In the docker image create a symbolic link for .ssh so we can pick
    up ssh credentials created by Tekton.
jlewi pushed a commit to jlewi/kubeflow that referenced this issue Jan 23, 2020
* The infrastructure for continuously rebuilding our docker images
  and updating our kustomize manifests has now been generalized.

  see kubeflow/testing#450
  and https://github.com/kubeflow/testing/tree/master/apps-cd

* This is the old code for updating the jupyter web app and is no longer
  needed.
jlewi pushed a commit to jlewi/kubeflow that referenced this issue Jan 23, 2020
k8s-ci-robot pushed a commit to kubeflow/kubeflow that referenced this issue Jan 23, 2020
k8s-ci-robot pushed a commit that referenced this issue Jan 25, 2020
* Define a v0.8 release

* Update applications.yaml with a v0.8 release.

* The purpose of this PR is to check that just by defining the appropriate
  release we can begin building images from release branches and updating
  the release branch of kubeflow/manifests

* Related to #450 - Continuous delivery of Kubeflow applications

* Create a python script for opening up the PR; this script replaces
  the bash script rebuild-manifests.sh that was used previously

* The new script doesn't assume that the base branch for PRs is master.
  We need this to support updating release branches.

* Create a profile in skaffold.yaml for running on the release cluster.
* Create an image_util package to parse image URLs.

* Use the Docker image for apps-cd to run create_manifests_pr.py
  * Add kustomize, go, and some other tools we need

  * In the docker image create a symbolic link for .ssh so we can pick
    up ssh credentials created by Tekton.

* Define a 1.0 release now that the branches have been cut
  * Related to kubeflow/kubeflow#4685
@jlewi
Contributor Author

jlewi commented Jan 27, 2020

Update

k8s-ci-robot pushed a commit to kubeflow/kubeflow that referenced this issue Jan 28, 2020
(Same commit message as the Jan 23 commit above.)
@jlewi
Contributor Author

jlewi commented Jan 28, 2020

This is working. Here's a list showing several PRs updating 1.0 applications which were successfully merged:
https://github.com/kubeflow/manifests/pulls?utf8=%E2%9C%93&q=+is%3Aclosed+author%3Akubeflow-bot+

The only remaining thing to do before closing this issue is updating the instance of the release infrastructure in the prod namespace.

@jlewi
Contributor Author

jlewi commented Jan 30, 2020

Closing this issue.
Filed #593 to set up a prod instance.

@jlewi jlewi closed this as completed Jan 30, 2020
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
…cluster (kubeflow#4568)

(Same commit message as the Dec 13 commits above.)

Related to: kubeflow/testing#450
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
(Same commit message as the Jan 23 commit above.)
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 12, 2021
…cluster (kubeflow#4568)

(Same commit message as the Dec 13 commits above.)

Related to: kubeflow/testing#450
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 12, 2021
(Same commit message as the Jan 23 commit above.)