
Continuous build of docker images and updating kustomize manifests #450

Closed
jlewi opened this issue Aug 29, 2019 · 16 comments

@jlewi
Contributor

jlewi commented Aug 29, 2019

We need a good way to continuously build our docker images and then update our kustomize manifests to use the updated images.

This is critical for maintaining velocity. One of the big problems we are seeing with releases is that changes are piling up and not getting exercised until we start cutting releases because we haven't updated our kustomize manifests.

Also, as the number of applications scales, the toil around building docker images and then updating manifests becomes significant. This is especially true during releases as we try to rapidly push out fixes.

There is a POC based on the Jupyter web app here:
https://github.com/kubeflow/kubeflow/tree/master/releasing/auto-update

We'd like to make it super easy for people to define new workflows to auto-build their application. In an ideal world they would just check in a YAML file with a couple of configurations, e.g.

  • Location of their Dockerfile
  • Location of their kustomization.yaml file
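
As a sketch, such a checked-in file might look like the following. The schema and field names here are hypothetical (there is no existing format for this yet), and the paths are illustrative:

```yaml
# auto-build.yaml -- hypothetical per-application config (illustrative schema)
name: jupyter-web-app
# Where to find the Dockerfile, relative to the repo root
dockerfile: components/jupyter-web-app/Dockerfile
# Registry path the built image should be pushed to
image: gcr.io/kubeflow-images-public/jupyter-web-app
# kustomization.yaml to update with the new image digest after each build
kustomization: jupyter/jupyter-web-app/base/kustomization.yaml
```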

A few things we are missing:

  1. A good solution for triggering workflows on pre/postsubmits/cron jobs
  2. A good solution for monitoring/alerting
  3. A good story for reusability around common tasks (e.g. building images, creating a PR, etc...)

/cc @scottilee @animeshsingh @kkasravi @jinchihe

@jlewi
Contributor Author

jlewi commented Sep 5, 2019

GCB now has direct integration via GitHub App Triggers
https://cloud.google.com/cloud-build/docs/create-github-app-triggers

So if we install that GitHub App in our project then we can trigger GCB builds in response to PRs. The GCB build could then create K8s resources.

This is very similar to how our Prow infra works today. We use Prow to trigger Prow jobs which run run_e2e_workflow.py, which in turn submits a bunch of Argo workflows based on prow_config.yaml.

We could do something similar but use GCB to invoke run_e2e_workflow.py.
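
As an illustration, the cloudbuild.yaml run by such a trigger might look roughly like this. The script path and arguments are assumptions based on the existing Prow setup; a GitHub App trigger checks the repo out into /workspace before the steps run:

```yaml
# cloudbuild.yaml -- sketch only; the GitHub App trigger supplies the source
steps:
# Run the same driver script Prow uses today (path and args are assumptions)
- name: 'python:3.7'
  entrypoint: 'python'
  args: ['py/kubeflow/testing/run_e2e_workflow.py',
         '--config=prow_config.yaml']
```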

@jlewi
Contributor Author

jlewi commented Sep 5, 2019

With kubeflow/kubeflow#4029 we have a pretty good POC for CD of the jupyter web app image.
The next step is probably to generalize this to a 2nd image which should probably be the central dashboard; kubeflow/kubeflow#3781.

  1. Parameterizing update_jupyter_web_app.py into a reusable script should be pretty easy

    • It looks like there are basically a few arguments:
      1. The image location
      2. Kustomization location
      3. Location of source and the command to build the image
        • We could make certain assumptions; i.e. that there is a Makefile and the name of the build rule and variables parameterizing where it gets pushed
  2. Make it easy for people to add new jobs to periodically build and push their image.
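
On the kustomization side, kustomize's built-in images transformer gives the update step a single place to write the new digest. A minimal example, with a placeholder image name and digest:

```yaml
# kustomization.yaml -- the images transformer rewrites image references in
# all listed resources, so the CD job only has to edit this stanza
# (e.g. via `kustomize edit set image <name>@<digest>`)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
images:
- name: gcr.io/kubeflow-images-public/jupyter-web-app
  digest: sha256:<digest-written-by-the-cd-job>
```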

@jlewi
Contributor Author

jlewi commented Nov 27, 2019

I have created the CI/CD for Kubeflow Applications Card in the Engprod Project to track this.

@kkasravi
Contributor

@jlewi thanks. I should be able to do some work on this in the next few days.

@jlewi
Contributor Author

jlewi commented Nov 27, 2019

Design doc is here: bit.ly/kfcd

It looks like it's a bit outdated. It would be good to update it and then socialize our thinking at the community meeting.

@scottilee
Contributor

@jlewi I don't know if it's just me but that design doc link redirects to http://www.thelaptop-computers.info/2009/11/watauga-county-sheriff%E2%80%99s-office-arrests-two-suspected-burglars-go-blue-ridge/ which is not relevant.

@kkasravi
Contributor

It looks like it's a bit outdated. It would be good to update it and then socialize our thinking at the community meeting.

@jlewi I'll update the doc as you've suggested to just focus on #1 (A doc focused on continuous delivery of our applications)

jlewi pushed a commit to jlewi/kubeflow that referenced this issue Dec 13, 2019
…cluster

* Get rid of the PVC used to pass the image digest file between the build
  and update manifests step

  * Creating a PVC just creates operational complexity

* We combine the build and update manifests step into one task. We can
  then use /workspace (a pod volume) to pass data like the image digest
  file between the steps

* Update pipelineRun to work with version 0.9 of Tekton
  * Field serviceAccount has been renamed serviceAccountName

  * TaskRun no longer supports outputImageDir so we remove it; we will
    have to use Tekton to pass the image digest file

* Remove the namespace.yaml and secrets.yaml from the kustomize package

  * The secrets should be created out of band and not checked in
  * So the behavior should be to deploy the kustomize package in a namespace
    that already exists with the appropriate secrets

  * Checking in secrets is confusing

    * If we check in dummy secrets then users will get confused about
      whether the secrets are valid or not

    * Furthermore, the file secrets.yaml is an invitation to end up checking
      the secrets into source control.

* Configure some values to use gcr.io/kubeflow-images-public

* Disable ISTIO sidecar in the pipelines

* For kaniko we don't need the secret to be named a certain way; we just
  need to set GOOGLE_APPLICATION_CREDENTIALS to point to the correct value

* We change kaniko to use the user-gcp-sa secret that Kubeflow creates

* We shouldn't need an image pull secret since kubeflow-images-public is public
  * GOOGLE_APPLICATION_CREDENTIALS should be used for pushing images

* Change the name of the secret containing ssh credentials for kubeflow-bot
  to kubeflow-bot-github-ssh

* rebuild-manifests.sh should use /workspace to get the image digest
  rather than the PVC.

* Simplify rebuild-manifests.sh

  * Tekton will mount the .ssh information in /tekton/home/.ssh
    so we just need to create a symbolic link to /root/.ssh

  * The image digest file should be fetched from /workspace and not some PVC.

  * Set the GITHUB_TOKEN environment variable using secrets so that we don't
    need to use kubectl get to fetch it

  * We need to make the clone of kubeflow/manifests a non-shallow clone
    before we can push changes to the remote repo

Next steps:

* This PR only updated the profile controller

* We need to refactor how the PipelineRuns are laid out

  * I think we may want the PipelineRuns to be separate from the reused
    resources like Task

* rebuild-manifests.sh should only regenerate tests for changed files

* The created PRs don't satisfy the Kubeflow CLA check.

Related to: kubeflow/testing#450
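
The /workspace handoff described in the commit message can be sketched as a single Tekton Task whose steps all run in one pod and therefore share /workspace. The Task name, images, and paths below are assumptions, not the actual manifests:

```yaml
apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  name: build-and-update-manifests   # hypothetical name
spec:
  steps:
  # Step 1: kaniko builds and pushes the image, writing the digest to /workspace
  - name: build-push
    image: gcr.io/kaniko-project/executor
    args: ['--dockerfile=/workspace/src/Dockerfile',
           '--destination=gcr.io/kubeflow-images-public/profile-controller',
           '--digest-file=/workspace/image-digest']
  # Step 2: reads the digest back from /workspace -- no PVC needed, because
  # all steps of a Task execute in the same pod
  - name: update-manifests
    image: gcr.io/kubeflow-images-public/test-worker   # assumed builder image
    command: ['bash', '-c', 'rebuild-manifests.sh "$(cat /workspace/image-digest)"']
```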
jlewi pushed a commit to jlewi/kubeflow that referenced this issue Dec 13, 2019
…cluster

(Same commit message as the Dec 13 commit above, with one addition:)

* I was able to successfully run the profile controller workflow and create a
  PR

  kubeflow/manifests#669

Related to: kubeflow/testing#450
@jlewi
Contributor Author

jlewi commented Dec 13, 2019

Status Update:

Next steps

  • Refactor the kustomize layout of the Tekton scripts

  • I think we want to make it easier to fire off PipelineRuns for all the different applications we need to update

    • I think as part of that we want to move all the scripts/pipeline resources into kubeflow/testing since
      that's our main engprod repo
  • We need "Fix computation of changed files in generate-changed-only rule" (manifests#665) to be submitted so that we can regenerate manifest tests for only changed files

@jlewi
Contributor Author

jlewi commented Dec 17, 2019

@kkasravi I wrote up my current thinking in this doc:
http://bit.ly/kfappscd-201912

PTAL

@kkasravi
Contributor

@jlewi

I had commented on restructuring the PipelineRun to embed a pipelineSpec and resourceSpec rather than a pipelineRef and resourceRefs here: #544 (comment)

I'll comment on the doc as well
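
A rough sketch of that restructuring, with the Pipeline embedded inline rather than referenced (resource names here are hypothetical):

```yaml
apiVersion: tekton.dev/v1alpha1
kind: PipelineRun
metadata:
  name: profile-controller-run   # hypothetical
spec:
  serviceAccountName: default
  # pipelineSpec embeds the Pipeline inline instead of pointing at a shared
  # one via pipelineRef, so each application's run is self-contained;
  # resources can likewise be declared inline via resourceSpec
  pipelineSpec:
    tasks:
    - name: build-and-update
      taskRef:
        name: build-and-update-manifests
```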

k8s-ci-robot pushed a commit to kubeflow/kubeflow that referenced this issue Dec 20, 2019
…cluster (#4568)

(Same commit message as the Dec 13 commits above.)

Related to: kubeflow/testing#450
jlewi pushed a commit to jlewi/testing that referenced this issue Jan 22, 2020
* Update applications.yaml with a v0.8 release.

* The purpose of this PR is to check that just by defining the appropriate
  release we can begin building images from release branches and updating
  the release branch of kubeflow/manifests

* Related to kubeflow#450 - Continuous delivery of Kubeflow applications
jlewi pushed a commit to jlewi/testing that referenced this issue Jan 23, 2020
* Update applications.yaml with a v0.8 release.

* The purpose of this PR is to check that just by defining the appropriate
  release we can begin building images from release branches and updating
  the release branch of kubeflow/manifests

* Related to kubeflow#450 - Continuous delivery of Kubeflow applications

* Create a python script for opening up the PR; this script replaces
  the bash script rebuild-manifests.sh that was used previously

* The new script doesn't assume that the base branch for PRs is master.
  We need this to support updating release branches.

* Create a profile in skaffold.yaml for running on the release cluster.
* Create an image_util package to parse image URLs.

* Use the Docker image for apps-cd to run create_manifests_pr.py
  * Add kustomize, go, and some other tools we need

  * In the docker image create a symbolic link for .ssh so we can pick
    up ssh credentials created by Tekton.
jlewi pushed a commit to jlewi/kubeflow that referenced this issue Jan 23, 2020
* The infrastructure for continuously rebuilding our docker images
  and updating our kustomize manifests has now been generalized.

  see kubeflow/testing#450
  and https://github.com/kubeflow/testing/tree/master/apps-cd

* This is the old code for updating the jupyter web app and is no longer
  needed.
jlewi pushed a commit to jlewi/kubeflow that referenced this issue Jan 23, 2020
k8s-ci-robot pushed a commit to kubeflow/kubeflow that referenced this issue Jan 23, 2020
k8s-ci-robot pushed a commit that referenced this issue Jan 25, 2020
* Define a v0.8 release

* Update applications.yaml with a v0.8 release.

* The purpose of this PR is to check that just by defining the appropriate
  release we can begin building images from release branches and updating
  the release branch of kubeflow/manifests

* Related to #450 - Continuous delivery of Kubeflow applications

* Create a python script for opening up the PR; this script replaces
  the bash script rebuild-manifests.sh that was used previously

* The new script doesn't assume that the base branch for PRs is master.
  We need this to support updating release branches.

* Create a profile in skaffold.yaml for running on the release cluster.
* Create an image_util package to parse image URLs.

* Use the Docker image for apps-cd to run create_manifests_pr.py
  * Add kustomize, go, and some other tools we need

  * In the docker image create a symbolic link for .ssh so we can pick
    up ssh credentials created by Tekton.

* Define a 1.0 release now that the branches have been cut
  * Related to kubeflow/kubeflow#4685
@jlewi
Contributor Author

jlewi commented Jan 27, 2020

Update

k8s-ci-robot pushed a commit to kubeflow/kubeflow that referenced this issue Jan 28, 2020
(Same commit message as the Jan 23 commit above.)
@jlewi
Contributor Author

jlewi commented Jan 28, 2020

This is working. Here's a list showing several PRs updating 1.0 applications which were successfully merged:
https://github.com/kubeflow/manifests/pulls?utf8=%E2%9C%93&q=+is%3Aclosed+author%3Akubeflow-bot+

The only remaining thing to do before closing this issue is updating the instance of the release infrastructure in the prod namespace.

@jlewi
Contributor Author

jlewi commented Jan 30, 2020

Closing this issue.
Filed #593 to set up a prod instance.

@jlewi jlewi closed this as completed Jan 30, 2020
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
…cluster (kubeflow#4568)

(Same commit message as the Dec 13 commits above.)

Related to: kubeflow/testing#450
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
(Same commit message as the Jan 23 commit above.)
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 12, 2021
…cluster (kubeflow#4568)

(Same commit message as the Dec 13 commits above.)

Related to: kubeflow/testing#450
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 12, 2021
(Same commit message as the Jan 23 commit above.)