example TFX taxi support on-prem cluster #721

Closed · gyliu513 opened this issue Jan 22, 2019 · 11 comments · Fixed by #749
Comments

@gyliu513
Member

gyliu513 commented Jan 22, 2019

Currently, all of the pipeline examples at https://github.com/kubeflow/pipelines/tree/master/samples use GCP, so we cannot run them on on-prem clusters. It would be better to add some examples that let end users test with local storage.

FYI @jinchihe

@jinchihe
Member

@gaoning777 The current Pipelines samples rely heavily on GCP, such as the E2E TFX example. I think we should enhance an existing sample or create a new one to show how to run Kubeflow Pipelines on on-prem clusters, using a PV/PVC to store data. I have almost got the case passing on an on-prem cluster by updating the TFX Python class files manually. What do you think: update the old sample or create a new one? Any plan? Thanks a lot!

@jinchihe
Member

@IronPan I saw you agreed in #708 to use a PersistentVolume to decouple the actual storage. In fact, I have already run a TFX taxi sample successfully using a PersistentVolume. What do you think about creating a new TFX taxi sample for on-prem clusters that uses a PersistentVolume? If you agree, I will create it (including a README file). Thanks.

@gyliu513
Member Author

@jinchihe @gaoning777 @IronPan I think updating the existing documentation is good enough; perhaps we can update the Pipelines samples one by one to describe how to run them on an on-prem cluster rather than only on GKE.

The document could be structured as follows:

GKE

xxxx

On-prem

xxxx

@jinchihe
Member

@gyliu513 Thanks a lot. I agree with you, but I think at the very least we need to create a new taxi-cab-classification-pipeline.py; it may be better to use a new name, such as taxi-cab-classification-pipeline_on_prem.py. @gaoning777, WDYT? Thanks

@jlewi
Contributor

jlewi commented Jan 23, 2019

It might make sense to create a new example if the code is sufficiently different from the current example. However, can we try to write the example in a way that works both on-prem and on clouds? The fact that the current example is GCP-only is a bug and something that should be fixed.

Here's a related issue, #677, about creating better patterns for working with TFJob and making it cloud-agnostic.

Related to this: I'm wondering if there's a better K8s pattern for writing this example
https://github.com/kubeflow/pipelines/blob/master/samples/tfx/taxi-cab-classification-pipeline.py

Currently the pattern is:

  • Write Python code for your logic, e.g. predict.py
  • Create a Dockerfile
  • Create some Python code to generate a dsl.ContainerOp (see the sketch after this list)
    • This is basically a wrapper around a pod but doesn't set up things like volumes
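
As a rough sketch only (assuming the kfp v1 SDK; the image, command, and arguments below are hypothetical placeholders, not the actual sample's values), the pattern looks roughly like this:

import kfp.dsl as dsl

@dsl.pipeline(name='taxi-sketch', description='Illustration of the ContainerOp pattern')
def taxi_pipeline(data_path: str = '/mnt/workflow/data.csv'):
    # Each step is a thin wrapper around a pod that runs a prebuilt image;
    # the image's Dockerfile bakes in the step's Python logic (e.g. predict.py).
    predict = dsl.ContainerOp(
        name='predict',
        image='example.io/taxi/predict:latest',  # hypothetical image
        command=['python', 'predict.py'],
        arguments=['--data', data_path],
    )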

Would it be better to instead orchestrate the steps more directly using K8s resources? For example, we could define a K8s Job for each of these steps and then just use Pipelines to run some Python code that creates the K8s Job and waits for it.
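
A minimal sketch of that idea, using the official kubernetes Python client (the helper function and its defaults are hypothetical, not an existing Pipelines API):

import time
from kubernetes import client, config

def run_job_and_wait(name, image, command, namespace='kubeflow'):
    # Create a K8s Job for one pipeline step and poll until it completes.
    config.load_incluster_config()  # use load_kube_config() when running outside the cluster
    batch = client.BatchV1Api()
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy='Never',
                    containers=[client.V1Container(name=name, image=image, command=command)],
                ))))
    batch.create_namespaced_job(namespace=namespace, body=job)
    while True:
        status = batch.read_namespaced_job_status(name, namespace).status
        if status.succeeded or status.failed:
            return status
        time.sleep(10)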

There's some discussion here about how we could make this lightweight.

@ddutta
Member

ddutta commented Jan 23, 2019

@swiftdiaries @johnugeorge @ramdootp for the on-prem CUJ work. We have an on-prem CUJ in the works (coming very shortly) where we had to tackle the same problem, so we should be able to solve this issue.

@swiftdiaries
Member

We have a set of on-prem / local-specific components in the works as part of the on-prem CUJ.
We found issues integrating the TensorBoard and confusion-matrix components into the Pipelines UI, so we made workarounds to enable those.

As for the TFX example, I think having separate example code is better. The on-prem version wouldn't use GCP secrets, so having another tfx_taxi_cab_onprem.py makes sense.
And we'd need volumes, so the code would look very different.

To attach volumes to ContainerOps, we did:

train_op.add_volume(
    k8s_client.V1Volume(
        name='workflow',
        persistent_volume_claim=k8s_client.V1PersistentVolumeClaimVolumeSource(
            claim_name='claim-name'))
).add_volume_mount(
    k8s_client.V1VolumeMount(mount_path='/mnt/workflow/', name='workflow'))

Agreed that there should be a better way to handle K8s resources; if we could wrap this inside the DSL it'd be really cool. It becomes messy when you're attaching two or more volumes to one op.
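
One possible direction, just as a sketch (the mount_pvc helper below is hypothetical, not part of the SDK): wrap the volume boilerplate in a small function that can be reused across ops.

from kubernetes import client as k8s_client

def mount_pvc(op, claim_name, volume_name, mount_path):
    # Attach an existing PersistentVolumeClaim to a ContainerOp and mount it.
    return op.add_volume(
        k8s_client.V1Volume(
            name=volume_name,
            persistent_volume_claim=k8s_client.V1PersistentVolumeClaimVolumeSource(
                claim_name=claim_name))
    ).add_volume_mount(
        k8s_client.V1VolumeMount(mount_path=mount_path, name=volume_name))

# e.g. mount_pvc(train_op, claim_name='claim-name', volume_name='workflow', mount_path='/mnt/workflow/')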

We're addressing issues across all components in the on-prem CUJ. I'll try to upload the slides today. It's very much a WIP and we'd love to have some feedback based on your experiences.

@gaoning777
Contributor

Thanks @jlewi for keeping up with the issue.
I echo the point that the current example being GCP-only is a bug and something that should be fixed. We should produce samples that work in both GCP and on-prem environments. In fact, the DSL should be improved to support Kubernetes resources directly.
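
For illustration only, a DSL-level primitive for raw Kubernetes resources could look something like the following sketch (kfp's v1 DSL later added dsl.ResourceOp roughly along these lines; the PVC name and size here are hypothetical):

import kfp.dsl as dsl

@dsl.pipeline(name='pvc-resource-sketch')
def pvc_pipeline():
    # Create a PersistentVolumeClaim directly from the pipeline definition.
    create_pvc = dsl.ResourceOp(
        name='create-pvc',
        k8s_resource={
            'apiVersion': 'v1',
            'kind': 'PersistentVolumeClaim',
            'metadata': {'name': 'taxi-workflow'},
            'spec': {
                'accessModes': ['ReadWriteOnce'],
                'resources': {'requests': {'storage': '1Gi'}},
            },
        },
        action='create',
    )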

@gaoning777
Contributor

I think the team has been focused on building new features, and we should allocate some engineering effort to making this a cross-platform product.

@vicaire
Contributor

vicaire commented Mar 26, 2019

@gaoning777, I am assigning the DSL sample issues that illustrate features to you. Please reassign them to the most appropriate person as needed.

@gyliu513 gyliu513 changed the title Add examples for using local storage example TFX taxi support on-prem cluster May 10, 2019
@jinchihe
Member

/assign @jinchihe

I've already sent a PR, #749, for this ticket; please review if you're interested. Thanks.

HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024