Table of Contents generated with DocToc
- Test Infrastructure
- Anatomy of our Tests
- Accessing The Argo UI
- Working with the test infrastructure
- Logs
- Debugging Failed Tests
- Adding an E2E test for a new repository
- Testing Changes to the ProwJobs
- Cleaning up leaked resources
- Integration with K8s Prow Infrastructure.
- Setting up Kubeflow Test Infrastructure
- Setting up a Kubeflow Repository to Use Prow
This directory contains the Kubeflow test Infrastructure.
We use Prow, K8s' continuous integration tool.
- Prow is a set of binaries that run on Kubernetes and respond to GitHub events.
We use Prow to run:
- Presubmit jobs
- Postsubmit jobs
- Periodic tests
Here's how it works:
- Prow is used to trigger E2E tests
- The E2E test will launch an Argo workflow that describes the tests to run
- Each step in the Argo workflow will be a binary invoked inside a container
- The Argo workflow will use an NFS volume to attach a shared POSIX-compliant filesystem to each step in the workflow.
- Each step in the pipeline can write outputs and junit.xml files to a test directory in the volume
- A final step in the Argo pipeline will upload the outputs to GCS so they are available in Gubernator
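As a rough illustration of what a single workflow step might do, the sketch below writes a minimal JUnit XML file into a test directory. It is not taken from the Kubeflow test code: the function name `write_junit`, the `junit_` file-name prefix and the temporary directory standing in for the NFS volume are all assumptions made for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_junit(test_dir, suite_name, results):
    """Write a minimal JUnit XML file.

    results is a list of (test name, duration in seconds, failure message
    or None) tuples. Hypothetical helper, not part of kubeflow/testing.
    """
    failures = sum(1 for _, _, failure in results if failure)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for name, seconds, failure in results:
        case = ET.SubElement(suite, "testcase", name=name, time=str(seconds))
        if failure:
            ET.SubElement(case, "failure").text = failure
    os.makedirs(test_dir, exist_ok=True)
    # Assumption: result files are named junit_<suite>.xml.
    path = os.path.join(test_dir, "junit_" + suite_name + ".xml")
    ET.ElementTree(suite).write(path, encoding="utf-8", xml_declaration=True)
    return path


# A temporary directory stands in for the test directory on the shared volume.
demo_dir = tempfile.mkdtemp()
junit_path = write_junit(
    demo_dir,
    "example",
    [("deploy", 12.5, None), ("smoke", 3.0, "failed with exit code 1")],
)
print(junit_path)
```

In a real step the directory would be a path on the shared NFS volume, so that the final upload step can find the file.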
Quick Links
- Our prow jobs are defined here
- Each prow job defines a K8s PodSpec indicating a command to run
- Our prow jobs trigger an Argo workflow that checks out our code and runs our tests.
- Our tests are structured as Argo workflows so that we can easily perform steps in parallel.
- The Argo workflow is defined in the repository being tested
- We always use the workflow at the commit being tested
- is used to checkout the code being tested
- This also checks out kubeflow/testing so that all repositories can rely on it for shared tools.
The UI is publicly available at
The tests store their results in a shared NFS filesystem. To inspect the results you can mount the NFS volume.
To make this easy, we run a stateful set in our test cluster that mounts the same volumes as our Argo workers. Furthermore, this stateful set uses an environment (GCP credentials, docker image, etc.) that mimics our Argo workers. You can open a shell in this stateful set's pod to get access to the NFS volume.
kubectl exec -it debug-worker-0 -- /bin/bash
This can be very useful for reproducing test failures.
Logs from the E2E tests are available in a number of places and can be used to troubleshoot test failures.
These should be publicly accessible.
The logs from each step are copied to GCS and made available through Gubernator. The K8s-ci robot should post a link to the Gubernator UI in the PR. You can also find them as follows:
- Open the prow jobs dashboard, e.g. for kubeflow/kubeflow
- Find your job
- Click on the link under job; this goes to the Gubernator dashboard
- Click on artifacts
- Navigate to artifacts/logs
If these logs aren't available it could indicate a problem running the step that uploads the artifacts to GCS for gubernator. In this case you can use one of the alternative methods listed below.
The argo UI is publicly accessible at
- Find and click on the workflow corresponding to your pre/post/periodic job
- Select the workflow tab
- From here you can select a specific step and then see the logs for that step
Since we run our E2E tests on GKE, all logs are persisted in Stackdriver logging.
Viewer access to Stackdriver logs is available by joining one of the following groups
If you know the pod id corresponding to the step of interest then you can use the following Stackdriver filter
resource.labels.container_name = "main"
The ${POD_ID} is of the form
If no results show up in Gubernator this means the prow job didn't get far enough to upload any results/logs to GCS.
To debug this you need the pod logs. You can access the pod logs via the build log link for your job in the prow jobs UI
- Pod logs are ephemeral, so you need to check shortly after your job runs.
The pod logs are available in Stackdriver, but only the Google Kubeflow team has access
- Prow runs on a cluster owned by the K8s team, not Kubeflow
- This policy is determined by K8s, not Kubeflow
- This could potentially be fixed by using our own prow build cluster (issue #32)
To access the stackdriver logs
- Open stackdriver for project k8s-prow-builds
- Get the pod ID by clicking on the build log in the prow jobs UI
- Filter the logs using
- For example, if the TF serving workflow failed, filter the logs using
Suppose an Argo workflow fails, you click on the failed step in the Argo UI to get the logs, and you see the following error:
failed to get container status {"docker" "b84b751b0102b5658080a520c9a5c2655855960c4695cf557c0c1e45999f7429"}: rpc error: code = Unknown desc = Error: No such container: b84b751b0102b56580
This error is a red herring; it means the pod is probably gone so Argo couldn't get the logs.
The logs should be in Stackdriver, but to get them we need to identify the pod.
Get the workflow using kubectl
kubectl get wf -o yaml ${WF_NAME} > /tmp/${WF_NAME}.yaml
This requires appropriate K8s RBAC permissions
You'll need to be added to the Google group
- Create a PR adding yourself to ci-team
Search the YAML spec for the pod information for the failed step
kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-2243590372:
  boundaryID: kubeflow-presubmit-kfctl-1810-70210d5-3900-218a
  displayName: kfctl-apply-gcp
  finishedAt: 2018-10-17T05:07:58Z
  id: kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-2243590372
  message: failed with exit code 1
  name: kubeflow-presubmit-kfctl-1810-70210d5-3900-218a.kfctl-apply-gcp
  phase: Failed
  startedAt: 2018-10-17T05:04:20Z
  templateName: kfctl-apply-gcp
  type: Pod
- You can use displayName to match the text shown in the UI
- id will be the name of the pod.
Follow the instructions to get the stackdriver logs for the pod.
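Searching the workflow spec by hand can be tedious, so here is a minimal sketch of how the failed pods could be listed programmatically. It assumes the workflow was saved as JSON (for example with `kubectl get wf -o json ${WF_NAME}`) and that the nodes live under `status.nodes`, as in the node shown above. The function name `failed_pods` and the extra `checkout` node in the sample data are hypothetical.

```python
import json


def failed_pods(workflow):
    """Return pod name, step name and message for every failed Pod node."""
    nodes = workflow.get("status", {}).get("nodes", {})
    return [
        {
            "pod": node["id"],
            "step": node.get("displayName"),
            "message": node.get("message"),
        }
        for node in nodes.values()
        if node.get("type") == "Pod" and node.get("phase") == "Failed"
    ]


# Sample data: the failed node is trimmed from the example above; the
# succeeded "checkout" node is made up for illustration.
sample_workflow = json.loads("""
{
  "status": {
    "nodes": {
      "kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-2243590372": {
        "displayName": "kfctl-apply-gcp",
        "id": "kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-2243590372",
        "message": "failed with exit code 1",
        "phase": "Failed",
        "type": "Pod"
      },
      "kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-1": {
        "displayName": "checkout",
        "id": "kubeflow-presubmit-kfctl-1810-70210d5-3900-218a-1",
        "phase": "Succeeded",
        "type": "Pod"
      }
    }
  }
}
""")

failed = failed_pods(sample_workflow)
for entry in failed:
    print(entry["step"], entry["pod"], entry["message"])
```

Each returned `pod` value is the pod name to use when looking up the Stackdriver logs.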
If an E2E test fails because a pod doesn't start (e.g. JupyterHub), we can debug this by looking at the events associated with the pod.
If you have access to the pod, you can run kubectl describe pods
Events are also persisted to Stackdriver and can be fetched with a query like the following.
jsonPayload.involvedObject.namespace = "kubeflow-presubmit-tf-serving-image-299-439a983-360-fa0d"
- Change the namespace to be the actual namespace used for the test
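If you build these queries often, a small helper can fill in the namespace for you. The sketch below is only an illustration: the function name `events_filter` is hypothetical, and the optional `jsonPayload.involvedObject.name` clause assumes the exported events keep the standard Kubernetes `involvedObject.name` field.

```python
def events_filter(namespace, pod_name=None):
    """Build a Stackdriver filter for events in a test namespace."""
    clauses = ['jsonPayload.involvedObject.namespace = "{}"'.format(namespace)]
    if pod_name is not None:
        # Assumption: events carry the involved object's name in this field.
        clauses.append('jsonPayload.involvedObject.name = "{}"'.format(pod_name))
    return " AND ".join(clauses)


query = events_filter("kubeflow-presubmit-tf-serving-image-299-439a983-360-fa0d")
print(query)
```

The resulting string can be pasted into the Stackdriver logs viewer.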
We use prow to launch Argo workflows. Here are the steps to create a new E2E test for a repository. This assumes prow is already configured for the repository (see these instructions for info on setting up prow).
Create a ksonnet App in that repository and define an Argo workflow
- The first step in the workflow should checkout the code using
- Code should be checked out to a shared NFS volume to make it accessible to subsequent steps
Create a container to use with the Prow job
- For an example look at the kubeflow/testing Dockerfile
- Image should be based on
- Create an entrypoint that does two things
- Run to download the source
- Use kubeflow.testing.run_e2e_workflow to run the Argo workflow.
- Add a file that will be passed into run_e2e_workflow to determine which ksonnet app to use for testing. An example can be seen here.
Workflows can optionally be scoped by job type (presubmit/postsubmit) or modified directories. For example:
workflows:
  - app_dir: kubeflow/testing/workflows
    component: workflows
    name: unittests
    job_types:
      - presubmit
    include_dirs:
      - foo/*
      - bar/*
This configures the unittests workflow to only run during presubmit jobs, and only if there are changes under the directories foo or bar.
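To make the scoping rules concrete, here is a minimal sketch of how such a decision could be made. It is not the actual run_e2e_workflow logic: the function name `should_run` is hypothetical, and glob-style matching of changed file paths against `include_dirs` via `fnmatch` is an assumption.

```python
import fnmatch


def should_run(workflow, job_type, changed_files):
    """Decide whether a workflow entry applies to this job.

    A workflow with no job_types or include_dirs is treated as unscoped.
    """
    job_types = workflow.get("job_types")
    if job_types and job_type not in job_types:
        return False
    include_dirs = workflow.get("include_dirs")
    if not include_dirs:
        return True
    # Assumption: patterns are shell-style globs matched against file paths.
    return any(
        fnmatch.fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in include_dirs
    )


unittests = {
    "name": "unittests",
    "job_types": ["presubmit"],
    "include_dirs": ["foo/*", "bar/*"],
}

print(should_run(unittests, "presubmit", ["foo/a/b.py"]))
print(should_run(unittests, "postsubmit", ["foo/a/b.py"]))
print(should_run(unittests, "presubmit", ["docs/README.md"]))
```

Under these assumptions, a presubmit job touching a file below foo would run the workflow, while a postsubmit job or a change outside foo and bar would skip it.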
Create a prow job for that repository
- The command for the prow job should be set via the entrypoint baked into the Docker image
- This way we can change the Prow job just by pushing a docker image and we don't need to update the prow config.
Changes to our ProwJob configs in config.yaml should be relatively infrequent since most of the code invoked as part of our tests lives in the repository.
However, in the event we need to make changes here are some instructions for testing them.
Follow Prow's getting started guide to create your own prow cluster.
- Note: the only part you really need is the ProwJob CRD and controller.
Check out kubernetes/test-infra.
git clone https://github.com/kubernetes/test-infra.git git_k8s-test-infra
Build the mkpj binary
bazel build prow/cmd/mkpj
Generate the ProwJob Config
./bazel-bin/prow/cmd/mkpj/mkpj --job=$JOB_NAME --config-path=$CONFIG_PATH
- This binary will prompt for needed information like the sha #
- The output will be a ProwJob spec which can be instantiated using kubectl
Create the ProwJob
kubectl create -f ${PROW_JOB_YAML_FILE}
- To rerun the job bump and status.startTime
To monitor the job open Prow's UI by navigating to the external IP associated with the ingress for your Prow cluster or using kubectl proxy.
Test failures sometimes leave resources (GCP deployments, VMs, GKE clusters) still running. The following scripts can be used to garbage collect leaked resources.
py/testing/kubeflow/testing/ --delete_script=${DELETE_SCRIPT}
- DELETE_SCRIPT should be the path to a copy of
There's a second script, cleanup_kubeflow_ci, in the kubeflow repository to clean up resources left by ingresses.
We rely on the K8s instance of Prow to actually run our jobs.
Here's a dashboard with the results.
Our jobs should be added to the K8s config
Our tests require:
- a K8s cluster
- Argo installed on the cluster
- A shared NFS filesystem
Our prow jobs execute Argo workflows in projects/clusters owned by Kubeflow. We don't use the shared Kubernetes test clusters for this.
- This gives us more control of the resources we want to use, e.g. GPUs
This section provides the instructions for setting this up.
Create a GKE cluster
gcloud --project=${PROJECT} container clusters create \
--zone=${ZONE} \
--machine-type=n1-standard-8 \
${CLUSTER}
gcloud compute --project=${PROJECT} addresses create argo-ui --global
gcloud services --project=${PROJECT} enable
gcloud services --project=${PROJECT} enable
gcloud services --project=${PROJECT} enable
- The tests need a GCP service account to upload data to GCS for Gubernator
gcloud iam service-accounts --project=${PROJECT} create ${SERVICE_ACCOUNT} --display-name "Kubeflow testing account"
# add-iam-policy-binding takes a single --role, so bind the roles one at a time
for ROLE in roles/container.admin roles/viewer roles/cloudbuild.builds.editor roles/logging.viewer roles/storage.admin; do
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com --role ${ROLE}
done
- Our tests create K8s resources (e.g. namespaces) which is why we grant it developer permissions.
- Project Viewer (because GCB requires this with gcloud)
- Kubernetes Engine Admin (some tests create GKE clusters)
- Logs viewer (for GCB)
- Compute Instance Admin to create VMs for minikube
- Storage Admin (For GCR)
gcloud --project=${PROJECT} iam service-accounts add-iam-policy-binding \
${GCE_DEFAULT} --member="serviceAccount:${FULL_SERVICE}" \
--role=roles/iam.serviceAccountUser
- Service Account User of the Compute Engine Default Service account (to avoid this error)
Create a secret key containing a GCP private key for the service account
KEY_FILE=<path to key>
gcloud iam service-accounts keys create ${KEY_FILE} \
--iam-account ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
kubectl create secret generic ${SECRET_NAME} \
--namespace=${NAMESPACE} --from-file=key.json=${KEY_FILE}
Make the service account a cluster admin
kubectl create clusterrolebinding ${SERVICE_ACCOUNT}-admin --clusterrole=cluster-admin \
--user=${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
- The service account is used to deploy Kubeflow, which entails creating various roles, so it needs sufficient RBAC permissions to do so.
Add a clusterrolebinding that uses the numeric id of the service account as a work around for ksonnet/ksonnet#396
NUMERIC_ID=`gcloud --project=${PROJECT} iam service-accounts describe ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com --format="value(oauth2ClientId)"`
kubectl create clusterrolebinding ${SERVICE_ACCOUNT}-numeric-id-admin --clusterrole=cluster-admin \
--user=${NUMERIC_ID}
You need to use a GitHub token with ksonnet otherwise the test quickly runs into GitHub API limits.
TODO(jlewi): We should create a GitHub bot account to use with our tests and then create API tokens for that bot.
You can use the GitHub API to create a token
- The token doesn't need any scopes because it's only accessing public data and is needed only for API metering.
To create the secret run
kubectl create secret generic github-token --namespace=${NAMESPACE} --from-literal=github_token=${GITHUB_TOKEN}
We use GCP Cloud Launcher to create a single node NFS share; current settings
- 8 VCPU
- 1 TB disk
The ksonnet app test-infra contains ksonnet configs to deploy the test infrastructure.
First, install the kubeflow package
ks pkg install kubeflow/core
Then change the server IP in test-infra/environments/prow/spec.json to point to your cluster.
You can deploy argo as follows (you don't need to use argo's CLI)
Set up the environment
NFS_SERVER=<Internal GCE IP address of the NFS Server>
ks env add ${ENV}
ks param set --env=${ENV} argo namespace ${NAMESPACE}
ks param set --env=${ENV} debug-worker namespace ${NAMESPACE}
ks param set --env=${ENV} nfs-external namespace ${NAMESPACE}
ks param set --env=${ENV} nfs-external nfsServer ${NFS_SERVER}
In the testing environment (but not release) we also expose the UI
ks param set --env=${ENV} argo exposeUi true
ks apply ${ENV} -c argo
Create the PVs corresponding to external NFS
ks apply ${ENV} -c nfs-external
The user or service account deploying the test infrastructure needs sufficient permissions to create the roles that are created as part of deploying the test infrastructure. You may therefore need to run the following command before using ksonnet to deploy the test infrastructure.
kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin
Define ProwJobs see pull/4951
- Add prow jobs to prow/config.yaml
- Add trigger plugin to prow/plugins.yaml
- Add test dashboards to testgrid/config.yaml
- Modify testgrid/cmd/configurator/config_test.go to allow presubmits for the new repo.
Add the team to the repository with write access
- Write access will allow bots in the team to update status
Follow instructions for adding a repository to the PR dashboard.
Add an OWNERS file to your Kubeflow repository. The OWNERS file, like this one, specifies who can review and approve changes in this repo.
Webhooks for prow should already be configured according to these instructions for the org, so you shouldn't need to set hooks per repository.
- Use as the target
- Get HMAC token from k8s test team