
Support restarting training job #901

Merged: 1 commit into kubeflow:master on Oct 30, 2019
Conversation

hougangliu (Member) commented Oct 30, 2019:

Fixes: #896



hougangliu (Member, Author) commented:
/cc @gaocegege @johnugeorge

@@ -95,6 +99,16 @@ func WaitPIDS(pids []int, opts ...WaitPidsOpts) error {
    _, err := os.Stat(path)
    if err != nil {
        if os.IsNotExist(err) {
            if opts[0].CompletedMarkedDirPath != "" {
Member commented:

I can't figure out how this helps solve the problem. Could you explain it in more detail?

hougangliu (Member, Author) commented Oct 30, 2019:

This PR appends echo completed > $mountPath/$$$$.pid to the training command, as shown below; if the training container succeeds, it writes a file named $mountPath/$processID.pid containing "completed".
The metrics collector container watches the training process. Once it sees the process exit, it checks $mountPath/$processID.pid to judge whether the training process succeeded: if so, it parses the metrics file; otherwise it raises an exception and exits.
Before this PR, the metrics collector started parsing the metrics file as soon as it saw the training process exit, ignoring the process's exit status.

For now it is hard to check another process's exit code from the collector (I tried to implement it with strace, but that needs extra Linux capabilities). We could also call the Kubernetes API to read pod.status.containerStatus, but that would require granting an extra role to the worker pod, or adding another service to proxy the call.

  - args:
    - python /mxnet/example/image-classification/train_mnist.py --batch-size=64 --lr=0.02273874688380991
      --num-layers=3 --optimizer=sgd 1>/var/log/katib/metrics.log 2>&1 && echo completed
      > /var/log/katib/$$$$.pid
    command:
    - sh
    - -c
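For reference, a minimal self-contained sketch of the collector-side check that the WaitPIDS diff above gestures at, assuming the marker file is named <pid>.pid under CompletedMarkedDirPath and contains the word "completed"; the helper name, messages, and exact file layout are illustrative assumptions, not the Katib code itself:

package main

import (
    "fmt"
    "io/ioutil"
    "path/filepath"
    "strings"
)

// checkProcessStatus decides, after /proc/<pid> has disappeared, whether the
// training process exited successfully by looking for the marker file that
// the "&& echo completed" wrapper writes on success. Illustrative sketch only.
func checkProcessStatus(completedMarkedDirPath string, pid int) error {
    markFile := filepath.Join(completedMarkedDirPath, fmt.Sprintf("%d.pid", pid))
    contents, err := ioutil.ReadFile(markFile)
    if err != nil {
        // No marker file: "&& echo completed" never ran, so the training
        // process must have exited with a non-zero status.
        return fmt.Errorf("process %d exited abnormally: %v", pid, err)
    }
    if strings.TrimSpace(string(contents)) != "completed" {
        return fmt.Errorf("process %d did not complete successfully", pid)
    }
    return nil // safe to parse the metrics file
}

func main() {
    if err := checkProcessStatus("/var/log/katib", 1234); err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("training succeeded; parse the metrics file")
}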

Member commented:

Will this cause a pipe exit-code problem similar to the one tee has?

hougangliu (Member, Author) commented Oct 30, 2019:

No. In fact, with tee the container exit code is always 0, even when the training process fails.

With this PR:

  1. If the training process fails, the container returns the training process's exit code (&& echo is not executed).
  2. If the training process succeeds (exit code 0), && echo returns 0 too, so the container exit code is also 0.
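To make the two cases concrete, here is a small self-contained demo of the wrapper's exit-code behavior; runWrapped and the marker path are made up for illustration and are not Katib code:

package main

import (
    "fmt"
    "os"
    "os/exec"
)

// runWrapped runs a command wrapped the way this PR wraps the training
// command: "cmd && echo completed > marker". Illustrative helper only.
func runWrapped(trainCmd, marker string) int {
    cmd := exec.Command("sh", "-c", trainCmd+" && echo completed > "+marker)
    if err := cmd.Run(); err != nil {
        if exitErr, ok := err.(*exec.ExitError); ok {
            return exitErr.ExitCode()
        }
    }
    return 0
}

func main() {
    marker := "/tmp/demo.pid"
    os.Remove(marker) // start from a clean state; error ignored if absent

    // Failure case: && short-circuits, no marker file is written, and the
    // shell (hence the container) reports the training command's exit code.
    fmt.Println(runWrapped("exit 3", marker)) // 3
    _, err := os.Stat(marker)
    fmt.Println(os.IsNotExist(err)) // true: marker was not written

    // Success case: echo writes the marker and the exit code stays 0.
    fmt.Println(runWrapped("exit 0", marker)) // 0
}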

Member commented:

Oh yeah, I misunderstood the logic here. SGTM.

gaocegege (Member) left a comment:

/lgtm

hougangliu (Member, Author) commented:

/approve

k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hougangliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot merged commit 1608d28 into kubeflow:master on Oct 30, 2019.
Successfully merging this pull request may close: cannot support training job restart (#896)