Add information how to run TFjob and Pytorch examples in Katib #321

Merged


andreyvelich
Member

@andreyvelich andreyvelich commented Jan 12, 2019

Fixes: #313.


This change is Reviewable

@andreyvelich
Member Author

/retest

- [Web UI](#web-ui)
- [API Documentation](#api-documentation)
- [Quickstart to run tfjob and pytorch operator jobs in Katib](#quickstart-to-run-tfjob-and-pytorch-operator-jobs-in-katib)
- [TFjob operator](#tfjob-operator)
Member

@johnugeorge johnugeorge Jan 12, 2019


Since you already updated the Katib documentation in the Kubeflow repo (https://github.com/kubeflow/kubeflow/tree/master/kubeflow/katib), shall we just point to the Kubeflow page instead of duplicating it in this section? That would make the pages easier to manage in future updates.

I think we can merge this entire section into the "Getting Started" section.
"For running TFJobs and PyTorchJobs in Katib, install the job operators given in "

Member Author

But in #313 (comment) it was decided to copy the information into the Katib README as well. And what should we do with the information about running the TFJob and PyTorch examples?

Member

I see. @richardsliu WDYT?

Member

@hougangliu hougangliu left a comment


Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @andreyvelich and @hougangliu)


README.md, line 95 at r1 (raw file):

## Quickstart to run tfjob and pytorch operator jobs in Katib

For running tfjob and pytorch operator jobs in Katib you have to install their packages.

For running tfjob and pytorch operator jobs in Katib, you have to install their packages.


README.md, line 116 at r1 (raw file):

After this you have to install volume for tfjob operator.

After this, you have to install volume for tfjob operator.
BTW, the tfjob operator doesn't depend on a PV/PVC; what you create here is used by the TFJob in the Katib example. So I suggest moving this part to #running-examples.


README.md, line 135 at r1 (raw file):

    requests:
      storage: 10Gi

I suggest using kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml in place of the inline YAML content.


README.md, line 142 at r1 (raw file):

kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml

kubectl create -f https://raw.githubusercontent.com/andreyvelich/katib/example-doc-pytorch-tfjob-313/examples/tfevent-volume/tfevent-pv.yaml

use kubeflow/katib/master instead of andreyvelich/katib/example-doc-pytorch-tfjob-313
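
As a rough sketch of the corrected command, the manifest would be fetched from the upstream `kubeflow/katib` `master` branch rather than from the contributor's fork. The variable and function names below are illustrative only, and the sketch assumes `tfevent-pv.yaml` is available at that upstream path once this PR is merged.

```shell
#!/bin/sh
# Base URL for raw files on the upstream master branch, as the review requests.
KATIB_RAW="https://raw.githubusercontent.com/kubeflow/katib/master"

# Path of the PersistentVolume manifest, unchanged from the quoted README line.
TFEVENT_PV_URL="$KATIB_RAW/examples/tfevent-volume/tfevent-pv.yaml"

# Hypothetical helper: create the PV from the upstream manifest.
# Requires kubectl and access to a cluster when called.
create_tfevent_pv() {
  kubectl create -f "$TFEVENT_PV_URL"
}
```

Calling `create_tfevent_pv` against a cluster would then run the same `kubectl create -f` command as the README, with only the repository and branch swapped.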


README.md, line 189 at r1 (raw file):

kubectl create -f katib-mysql-pv.yaml

please use https://raw.githubusercontent.com/kubeflow/katib/master/manifests/pv/pv.yaml instead


README.md, line 193 at r1 (raw file):

### Running examples

After deploy everything you can run examples.

After deploy everything, you can run examples.


README.md, line 217 at r1 (raw file):

If you create pv for Katib delete it

If you create pv for Katib, delete it


README.md, line 220 at r1 (raw file):

kubectl delete -f katib-mysql-pv.yaml

please use https://raw.githubusercontent.com/kubeflow/katib/master/manifests/pv/pv.yaml instead

@andreyvelich
Member Author

@hougangliu
Thank you for your review!
I fixed it, but I can't use this YAML file https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml in this part: https://github.com/kubeflow/katib/pull/321/files#diff-04c6e90faac2675aa89e2176d2eec7d8R170.
The default StorageClass in GKE supports only `accessModes: ReadWriteOnce`, but the example uses `accessModes: ReadWriteMany`.
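
To illustrate the access-mode difference, here is a minimal sketch of a claim that requests only `ReadWriteOnce`. The file name, claim name, and namespace are assumptions made for illustration; only the 10Gi request and the access modes come from this thread, and this is not necessarily the manifest that ended up in the README.

```shell
#!/bin/sh
# Hypothetical file name for the GKE-friendly claim.
PVC_FILE="tfevent-pvc-rwo.yaml"

# Write a PVC manifest that asks for ReadWriteOnce only.
cat > "$PVC_FILE" <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume    # assumed claim name
  namespace: kubeflow     # namespace used elsewhere in this thread
spec:
  accessModes:
    - ReadWriteOnce       # the mode the thread says GKE's default StorageClass supports
  resources:
    requests:
      storage: 10Gi       # size quoted earlier in the review
EOF

# To apply it on a cluster: kubectl create -f "$PVC_FILE"
```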

@hougangliu
Member

/lgtm

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/pytorchjob-example.yaml
```

Contributor

Please describe how to check the status of the jobs with kubectl get studyjob and how to look at them in the UI.

Member Author

@YujiOshima Done.
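
For reference, the status check requested above might look like the following sketch. The `kubeflow` namespace and the `pytorchjob-example` name are the ones quoted elsewhere in this review; the function name is hypothetical and the exact README wording may differ.

```shell
#!/bin/sh
NAMESPACE="kubeflow"            # namespace used in the review comments
STUDYJOB="pytorchjob-example"   # StudyJob name used in the review comments

# Hypothetical helper: list all StudyJobs, then show details for one of them.
# Requires kubectl and access to a cluster when called.
studyjob_status() {
  kubectl get studyjob -n "$NAMESPACE"
  kubectl describe studyjob "$STUDYJOB" -n "$NAMESPACE"
}
```

Running `studyjob_status` prints the list of StudyJobs followed by the detailed description, where the condition of the job can be inspected.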

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 15, 2019
Member

@hougangliu hougangliu left a comment


Reviewed 1 of 1 files at r1.
Reviewable status: 0 of 1 files reviewed, 14 unresolved discussions (waiting on @hougangliu, @andreyvelich, and @YujiOshima)


README.md, line 210 at r3 (raw file):

```yaml
kubectl describe studyjob pytorchjob-example -n kubeflow
```

$ kubectl describe studyjob pytorchjob-example -n kubeflow


README.md, line 330 at r3 (raw file):

When the spec.Status.Condition becomes ```Completed```, the StudyJob is finished.

You can monitor your results in Katib UI. For accessing to Katib UI you have to install Ambassador.

For accessing to Katib UI, you have to install Ambassador.


README.md, line 339 at r3 (raw file):

After deploy Ambassador, you can access to Katib UI using /katib/ path.

Here we should also show the full URL of the Ambassador service, including the /katib/ path, in case a user has no idea what Ambassador is.


README.md, line 357 at r3 (raw file):

If you deploy Ambassador delete it

If you deploy Ambassador, delete it

@andreyvelich
Member Author

@hougangliu
With Ambassador you can access the Katib UI in two ways:

  1. Port-forward one of the Ambassador pods (e.g. to port 8080) and open Katib at the localhost:8080/katib/ URL.
  2. Change the Ambassador service to NodePort and open Katib at the
    :/katib/ URL.

Which approach should I describe?

@hougangliu
Member

I prefer the first one. It would be best to show the kubectl port-forward command for forwarding the Ambassador service here.
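
A minimal sketch of the port-forward approach follows. The Service name `ambassador` and Service port 80 are assumptions (the thread does not state them), as are the function names; the `kubeflow` namespace, local port 8080, and the /katib/ path are taken from the comments above.

```shell
#!/bin/sh
NAMESPACE="kubeflow"          # namespace used elsewhere in this thread
AMBASSADOR_SVC="ambassador"   # assumed Service name
LOCAL_PORT=8080               # local port from the example in the thread
SERVICE_PORT=80               # assumed Service port

# Build the local URL at which the Katib UI would be reachable.
katib_ui_url() {
  printf 'http://localhost:%s/katib/\n' "$LOCAL_PORT"
}

# Hypothetical helper: forward the Ambassador Service to localhost.
# Blocks until interrupted; requires kubectl and cluster access when called.
forward_katib_ui() {
  echo "Katib UI: $(katib_ui_url)"
  kubectl port-forward "svc/$AMBASSADOR_SVC" -n "$NAMESPACE" "$LOCAL_PORT:$SERVICE_PORT"
}
```

While `forward_katib_ui` is running, the UI should be available in a browser at the URL printed by `katib_ui_url`.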

@hougangliu
Member

/lgtm

@YujiOshima
Contributor

@andreyvelich Great, thank you!
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, YujiOshima


@k8s-ci-robot k8s-ci-robot merged commit d41f8e8 into kubeflow:master Jan 16, 2019
@andreyvelich andreyvelich deleted the example-doc-pytorch-tfjob-313 branch October 6, 2021 00:23