Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add katib studyjob launcher #754

Merged
merged 5 commits into from
Mar 4, 2019
Merged

Conversation

hougangliu
Copy link
Member

@hougangliu hougangliu commented Jan 30, 2019

This change is Reviewable

@hougangliu
Copy link
Member Author

/retest kubeflow-pipeline-e2e-test

@hougangliu
Copy link
Member Author

/assign @richardsliu

@IronPan
Copy link
Member

IronPan commented Feb 5, 2019

/assign @Ark-kun @gaoning777

@vicaire
Copy link
Contributor

vicaire commented Feb 13, 2019

Hello @hougangliu,

Do you think it would make sense to implement something more generic that could launch and wait for any CRD?

On top of this generic functionality, some convenience higher level methods could be provided (as the Katib one that you are proposing here). WDYT?

@hougangliu
Copy link
Member Author

Hello @hougangliu,

Do you think it would make sense to implement something more generic that could launch and wait for any CRD?

On top of this generic functionality, some convenience higher level methods could be provided (as the Katib one that you are proposing here). WDYT?

It makes sense

@vicaire
Copy link
Contributor

vicaire commented Feb 14, 2019

/cc @hongye-sun

Adding Hongye to this thread as he is looking into some related problem.

@hougangliu
Copy link
Member Author

/retest continuous-integration

@hougangliu
Copy link
Member Author

@gaoning777 @Ark-kun @hongye-sun any comment?

@hougangliu
Copy link
Member Author

/retest

@hougangliu
Copy link
Member Author

/assign @IronPan

if wait_response.get("status", {}).get("condition") == "Completed":
succ = True
trial = get_best_trial(wait_response["status"]["bestTrialId"])
with open('/output.txt', 'w') as f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please get this path from the command line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done! thanks!

metricsnames, parameterconfigs, nasConfig, workertemplatepath, mcollectortemplatepath, suggestionspec):
"""_generate_studyjob_yaml generates studyjob yaml file based on hp.template.yaml"""
with open(src_filename, 'r') as f:
content = yaml.load(f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can just fully build the spec dict here in code instead of loading it from file and replacing some values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just follow how tfjob-launcher fills spec dict. I will submit another PR to fix the both two components. Thanks!

.cloudbuild.yaml Outdated
@@ -97,8 +97,12 @@ steps:
id: 'buildDeployer'
- name: 'gcr.io/cloud-builders/docker'
entrypoint: '/bin/bash'
args: ['-c', 'cd /workspace/components/kubeflow/launcher && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not build the container automatically for now? It requires a complicated review process on our side.

Copy link
Member Author

@hougangliu hougangliu Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior here will keep unchanged. I just rename launcher (in fact, it only launches tfjob) to tf-launcher since we can add more launchers (saying pytorch job launcher, studyjob launcher and so on)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @hougangliu. Could we do the renaming in a separate PR? There are several dependencies on this tfjob component so I would like to separate the renaming into another PR in case we need to rollback.

.cloudbuild.yaml Outdated
@@ -97,8 +97,12 @@ steps:
id: 'buildDeployer'
- name: 'gcr.io/cloud-builders/docker'
entrypoint: '/bin/bash'
args: ['-c', 'cd /workspace/components/kubeflow/launcher && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
args: ['-c', 'cd /workspace/components/kubeflow/tf-launcher && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the change from launcher to tf-launcher necessary? It seems unrelated to this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just rename launcher (in fact, it only launches tfjob) to tf-launcher since we can add more launchers (saying pytorch job launcher, studyjob launcher and so on)

@vicaire
Copy link
Contributor

vicaire commented Feb 28, 2019

@hongye-sun, could you please have a look at this one? I think you could have some good feedback as you are working on many components.

.cloudbuild.yaml Outdated
@@ -97,8 +97,12 @@ steps:
id: 'buildDeployer'
- name: 'gcr.io/cloud-builders/docker'
entrypoint: '/bin/bash'
args: ['-c', 'cd /workspace/components/kubeflow/launcher && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @hougangliu. Could we do the renaming in a separate PR? There are several dependencies on this tfjob component so I would like to separate the renaming into another PR in case we need to rollback.

.cloudbuild.yaml Outdated
id: 'buildLauncher'
- name: 'gcr.io/cloud-builders/docker'
entrypoint: '/bin/bash'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hougangliu,

This entry is building the new container automatically. Could you please remove this entry? It requires a complicated review process on our side before we can release new containers automatically.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, Done

@hougangliu hougangliu force-pushed the launch-study branch 2 times, most recently from 6bfd44b to 3eb4b1a Compare March 3, 2019 23:31
@vicaire
Copy link
Contributor

vicaire commented Mar 4, 2019

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vicaire

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 86c4d3b into kubeflow:master Mar 4, 2019
cheyang pushed a commit to alibaba/pipelines that referenced this pull request Mar 28, 2019
* add katib studyjob launcher

* delete tmp file

* fix link error to tf-laucher

* import studyjob client from katib project

* specify output file with a parameter
undo tf-launcher
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this pull request Mar 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants