example TFX taxi support on-prem cluster #721

Closed · gyliu513 opened this issue Jan 22, 2019 · 11 comments · Fixed by #749
Comments

@gyliu513
Member

gyliu513 commented Jan 22, 2019

Currently, all of the pipeline examples at https://github.com/kubeflow/pipelines/tree/master/samples use GCP, so we cannot run them on on-prem clusters. It would be better to add some examples that let end users test with local storage.

FYI @jinchihe

@jinchihe
Member

@gaoning777 The current Pipelines samples rely heavily on GCP, such as the E2E TFX example. I think we should enhance an existing sample or create a new one to show how to run Kubeflow Pipelines on on-prem clusters, using a PV/PVC to store data. I have almost got the case passing on an on-prem cluster by updating the TFX Python class files manually. What do you think: update the old sample or create a new one? Any plan? Thanks a lot!

@jinchihe
Member

@IronPan I saw you agreed in #708 to use a PersistentVolume to decouple the actual storage. In fact, I have already run a TFX taxi sample successfully using a PersistentVolume. What do you think about creating a new TFX taxi sample for on-prem clusters that uses a PersistentVolume? If you agree, I will create it (including a README file). Thanks.

@gyliu513
Member Author

@jinchihe @gaoning777 @IronPan I think updating the existing documentation is good enough; perhaps we can update the Pipelines samples one by one to describe how to run them on an on-prem cluster rather than only on GKE.

The document could be structured as follows:

GKE

xxxx

On-prem

xxxx

@jinchihe
Member

@gyliu513 Thanks a lot. I agree with you, but I think at the very least we need to create a new taxi-cab-classification-pipeline.py; it may be better to use a new name, such as taxi-cab-classification-pipeline_on_prem.py. @gaoning777, WDYT? Thanks

@jlewi
Contributor

jlewi commented Jan 23, 2019

It might make sense to create a new example if the code is sufficiently different from the current example. However, can we try to write the example in a way that works both on-prem and on clouds? The fact that the current example is GCP-only is a bug and something that should be fixed.

Here's a related issue, #677, about creating better patterns for working with TFJob and making it cloud-agnostic.

Related to this: I'm wondering if there's a better K8s pattern for writing this example
https://github.com/kubeflow/pipelines/blob/master/samples/tfx/taxi-cab-classification-pipeline.py

Currently the pattern is:

  • Write Python code for your logic, e.g. predict.py
  • Create a Dockerfile
  • Create some Python code to generate a dsl.ContainerOp (see the sketch after this list)
    • This is basically a wrapper around a pod but doesn't set up things like volumes
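
As a rough sketch only (assuming the kfp v1 SDK; the image, command, and arguments below are hypothetical placeholders, not the actual sample's values), the pattern looks roughly like this:

import kfp.dsl as dsl

@dsl.pipeline(name='taxi-sketch', description='Illustration of the ContainerOp pattern')
def taxi_pipeline(data_path: str = '/mnt/workflow/data.csv'):
    # Each step is a thin wrapper around a pod that runs a prebuilt image;
    # the image's Dockerfile bakes in the step's Python logic (e.g. predict.py).
    predict = dsl.ContainerOp(
        name='predict',
        image='example.io/taxi/predict:latest',  # hypothetical image
        command=['python', 'predict.py'],
        arguments=['--data', data_path],
    )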

Would it be better to instead orchestrate the steps more directly using K8s resources? For example, we could define a K8s Job for each of these steps and then just use Pipelines to run some Python code that creates the K8s Job and waits for it.
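
A minimal sketch of that idea, using the official kubernetes Python client (the helper function and its defaults are hypothetical, not an existing Pipelines API):

import time
from kubernetes import client, config

def run_job_and_wait(name, image, command, namespace='kubeflow'):
    # Create a K8s Job for one pipeline step and poll until it completes.
    config.load_incluster_config()  # use load_kube_config() when running outside the cluster
    batch = client.BatchV1Api()
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy='Never',
                    containers=[client.V1Container(name=name, image=image, command=command)],
                ))))
    batch.create_namespaced_job(namespace=namespace, body=job)
    while True:
        status = batch.read_namespaced_job_status(name, namespace).status
        if status.succeeded or status.failed:
            return status
        time.sleep(10)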

There's some discussion here about how we could make this lightweight.

@ddutta
Member

ddutta commented Jan 23, 2019

@swiftdiaries @johnugeorge @ramdootp for the on-prem CUJ work. We have an on-prem CUJ in the works (coming very shortly) where we had to tackle the same problem, so we should be able to solve this issue.

@swiftdiaries
Member

We have a set of on-prem / local-specific components in the works as part of the on-prem CUJ.
We found issues integrating the TensorBoard and confusion-matrix components into the Pipelines UI, so we made workarounds to enable those.

As for the TFX example, I think having separate example code is better. The on-prem version wouldn't use GCP secrets, so having another tfx_taxi_cab_onprem.py makes sense.
And we'd need volumes, so the code would look very different.

To attach volumes to ContainerOps, we did:

train_op.add_volume(
    k8s_client.V1Volume(
        name='workflow',
        persistent_volume_claim=k8s_client.V1PersistentVolumeClaimVolumeSource(
            claim_name='claim-name'))
).add_volume_mount(
    k8s_client.V1VolumeMount(mount_path='/mnt/workflow/', name='workflow'))

Agreed that there should be a better way to handle K8s resources; if we could wrap this inside the DSL it'd be really cool. It becomes messy when you're attaching two or more volumes to one op.
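
One possible direction, just as a sketch (the mount_pvc helper below is hypothetical, not part of the SDK): wrap the volume boilerplate in a small function that can be reused across ops.

from kubernetes import client as k8s_client

def mount_pvc(op, claim_name, volume_name, mount_path):
    # Attach an existing PersistentVolumeClaim to a ContainerOp and mount it.
    return op.add_volume(
        k8s_client.V1Volume(
            name=volume_name,
            persistent_volume_claim=k8s_client.V1PersistentVolumeClaimVolumeSource(
                claim_name=claim_name))
    ).add_volume_mount(
        k8s_client.V1VolumeMount(mount_path=mount_path, name=volume_name))

# e.g. mount_pvc(train_op, claim_name='claim-name', volume_name='workflow', mount_path='/mnt/workflow/')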

We're addressing issues across all components in the on-prem CUJ. I'll try to upload the slides today. It's very much a WIP and we'd love to have some feedback based on your experiences.

@gaoning777
Contributor

Thanks @jlewi for keeping up with the issue.
I echo the point that the current example being GCP-only is a bug and something that should be fixed. We should produce samples that work in both GCP and on-prem environments. In fact, the DSL should be improved to support Kubernetes resources directly.
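
For illustration only, a DSL-level primitive for raw Kubernetes resources could look something like the following sketch (kfp's v1 DSL later added dsl.ResourceOp roughly along these lines; the PVC name and size here are hypothetical):

import kfp.dsl as dsl

@dsl.pipeline(name='pvc-resource-sketch')
def pvc_pipeline():
    # Create a PersistentVolumeClaim directly from the pipeline definition.
    create_pvc = dsl.ResourceOp(
        name='create-pvc',
        k8s_resource={
            'apiVersion': 'v1',
            'kind': 'PersistentVolumeClaim',
            'metadata': {'name': 'taxi-workflow'},
            'spec': {
                'accessModes': ['ReadWriteOnce'],
                'resources': {'requests': {'storage': '1Gi'}},
            },
        },
        action='create',
    )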

@gaoning777
Contributor

I think the team has been focused on building new features, and we should allocate some engineering effort to making this a cross-platform product.

@vicaire
Contributor

vicaire commented Mar 26, 2019

@gaoning777, I am assigning the DSL sample issues that illustrate features to you. Please reassign them to the most appropriate person as needed.

@gyliu513 gyliu513 changed the title Add examples for using local storage example TFX taxi support on-prem cluster May 10, 2019
@jinchihe
Member

/assign @jinchihe

I've already sent a PR, #749, for this ticket; please review if you're interested. Thanks.

HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024