Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate standalone deployment to workload identity on GCP #2619

Merged
merged 39 commits into from
Dec 17, 2019

Conversation

Bobgy
Copy link
Contributor

@Bobgy Bobgy commented Nov 18, 2019

Part of #1691

Changes:

  • Added a bash script for conveniently binding workload identity
  • Changed test infra to use workload identity
  • Changed samples to not use use_gcp_secret
  • Changed container builder to not use gcp secret (Future work of supporting both is tracked in [SDK] Container builder should allow secret configuration #2702)
  • Changed tensorboard template to not use gcp secret

Note, I didn't completely remove test infra for other than workload identity, because I think it still makes sense to also test KFP in for example full scope cluster. So that we can easily add another test configuration in prow with different ENV to use a different one.

/assign @IronPan
/cc @gaoning777
/cc @rmgogogo


This change is Reviewable

Copy link
Contributor Author

@Bobgy Bobgy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IronPan Can you help me briefly review?
Focus on https://github.com/kubeflow/pipelines/pull/2619/files#diff-e5921efca5c992d5c5ab4e91927c9ff3.
It is the script I will let users run.

manifests/kustomize/gcp-workload-identity-setup.sh Outdated Show resolved Hide resolved
@Bobgy Bobgy changed the title Migrate standalone deployment to workload identity on GCP [WIP] Migrate standalone deployment to workload identity on GCP Nov 18, 2019
@rmgogogo
Copy link
Contributor

/lgtm

@rmgogogo
Copy link
Contributor

/test

@k8s-ci-robot k8s-ci-robot removed the lgtm label Nov 19, 2019
@Bobgy
Copy link
Contributor Author

Bobgy commented Nov 19, 2019

/retest

@gaoning777
Copy link
Contributor

Is this PR ready to be reviewed?

@Bobgy
Copy link
Contributor Author

Bobgy commented Nov 20, 2019

Is this PR ready to be reviewed?

Yes, but only review the script used to set up workload identity.

I ran samples successfully with it. However, still struggling to migrate test infra on workload identity.

@Bobgy
Copy link
Contributor Author

Bobgy commented Nov 22, 2019

It turns out that tests failed before because our image builder docker image was built a year ago with google/cloud-sdk. The old version of cloud sdk doesn't work well with workload identity.

@Bobgy
Copy link
Contributor Author

Bobgy commented Nov 27, 2019

UPDATE, I removed all usages of use_gcp_secret to pass sample tests.
However, most of the samples need documentation update about how to authorize on GCP

@Bobgy
Copy link
Contributor Author

Bobgy commented Dec 10, 2019

Verify testing again for flakiness
/test kubeflow-pipeline-e2e-test
/test kubeflow-pipeline-sample-test

@rmgogogo
Copy link
Contributor

/lgtm

@Bobgy
Copy link
Contributor Author

Bobgy commented Dec 16, 2019

/test kubeflow-pipeline-sample-test
sidecar sample is flaky

@Bobgy
Copy link
Contributor Author

Bobgy commented Dec 17, 2019

container builder seems to be flaky at fetching some image: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/2619/kubeflow-pipeline-sample-test/1206752832519147523#1:build-log.txt%3A4165
secret sample has the following error, I haven't yet figured out if that's a workload identity transient error or not
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/2619/kubeflow-pipeline-sample-test/1206752832519147523#1:build-log.txt%3A4000

Test again to verify
/test kubeflow-pipeline-sample-test

@Bobgy
Copy link
Contributor Author

Bobgy commented Dec 17, 2019

Verify again for flakiness
/test kubeflow-pipeline-sample-test

@rmgogogo
Copy link
Contributor

/lgtm

/retest

local gsa_full="$gsa@$project.iam.gserviceaccount.com"
local namespace=${4:-$NAMESPACE}

gcloud iam service-accounts add-iam-policy-binding $gsa_full \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add retries here? This command is a source of some test flakiness:

++ gcloud iam service-accounts add-iam-policy-binding test-argo@ml-pipeline-test.iam.gserviceaccount.com '--member=serviceAccount:ml-pipeline-test.svc.id.goog[kubeflow/test-runner]' --role=roles/iam.workloadIdentityUser
ERROR: (gcloud.iam.service-accounts.add-iam-policy-binding) ABORTED: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me put up a PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name=yaml_spec['spec']['containers'][0]['env'][0]['name'],
value=yaml_spec['spec']['containers'][0]['env'][0]['value'],
)])
args = yaml_spec['spec']['containers'][0]['args'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about this change to the container-building functions for non-GKE deployments.
For container-building functions there might be no way for the user to supply secrets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also think it's important to fix. However, I don't currently have bandwidth. Let's get this scheduled soon then.

@@ -57,11 +56,8 @@ def use_gcp_api_op():
def secret_op_pipeline(url='gs://ml-pipeline-playground/shakespeare1.txt'):
"""A pipeline that uses secret to access cloud hosted resouces."""

gcs_read_task = gcs_read_op(url).apply(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of secrets seems to be the core feature demonstrated by this sample. Perhaps this should not be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. Current situation is that we are not running tests that use a service account token. Shall we remove this sample from presubmit/postsubmit tests?

Jeffwan pushed a commit to Jeffwan/pipelines that referenced this pull request Dec 9, 2020
)

* Script to set up workload identity for standalone deployment

* Migrate tests to run on standalone + workload identity

* Fix test script

* Switch to static GSAs for testing, because they have name length limit

* Add workload identity binding for argo

* Fix argo workload identity bindings

* Remove user-gcp-sa from tests

* Remove use_gcp_secret from xgboost sample

* Allow debugging tests locally

* Wait for policies to take effect

* Update deploy-pipeline-lite.sh

* Update deploy-pipeline-lite.sh

* [WIP] test gcloud auth list with test-runner sa

* Add namespace

* test again

* Use new image builder

* test again

* Remove debug code

* Remove usages of use_gcp_secret

* Fix unit test and tensorboard pod template

* Add debug code again to test

* Try waiting until workload identity bindings are ready

* Fix some other samples

* Fix parameterized tfx oss sample

* Add retry to image building

* Try fixing tfx oss sample

* Fix compiled tfx oss sample

* Update all google/cloud-sdk to latest

* Try fixing parameterized tfx oss sample again

* Also verify pipeline-runner ksa is working

* Fix parameterized_tfx_oss sample

* Update gcp-workload-identity-setup.sh

* Revert unneeded change

* Pin to new google/cloud-sdk

* Remove wrongly commited binaries
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this pull request Oct 22, 2023
Signed-off-by: ddelange <14880945+ddelange@users.noreply.github.com>

Signed-off-by: ddelange <14880945+ddelange@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants