
Remove unnecessary services for worker #191

Merged

k8s-ci-robot merged 1 commit into kubeflow:master from hougangliu:remove-dup-service on Jul 30, 2019

Conversation

@hougangliu
Member

For a PyTorchJob, MASTER_PORT and MASTER_ADDR are necessary, and MASTER_PORT is occupied only in the master Pod.

The master Pod will:

  1. Create sockets for all workers.
  2. Wait for all workers to connect.
  3. Send them information about the location of the other processes.

Each worker Pod will:

  1. Create a socket to MASTER_ADDR (the master service name):MASTER_PORT.
  2. Send its own location information (ip:randomPort).
  3. Receive information about the other workers.
  4. Open a socket and handshake with all other workers.

That is, MASTER_PORT on a worker is never used, so the per-worker Services are unnecessary (see the sketch below).
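
For illustration, a minimal sketch (assumed code, not taken from this repo or the operator) of the env:// rendezvous each replica runs; it shows why only the master ever needs its MASTER_PORT reachable:

```python
# Sketch only: how each PyTorchJob replica typically joins the process group.
# The operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
import os
import torch.distributed as dist

def init_distributed():
    rank = int(os.environ["RANK"])              # 0 on the master replica
    world_size = int(os.environ["WORLD_SIZE"])  # master + all workers
    # With init_method="env://", only rank 0 binds MASTER_ADDR:MASTER_PORT;
    # every other rank dials out to it and reports its own ip:randomPort,
    # so a worker never needs an inbound Service of its own.
    dist.init_process_group(backend="gloo",
                            init_method="env://",
                            rank=rank,
                            world_size=world_size)
    return rank, world_size

if __name__ == "__main__":
    rank, world_size = init_distributed()
    print("rank %d/%d joined the process group" % (rank, world_size))
```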

@hougangliu
Member Author

/cc @johnugeorge

@hougangliu
Member Author

/test kubeflow-pytorch-operator-presubmit

1 similar comment
@hougangliu
Member Author

/test kubeflow-pytorch-operator-presubmit

@coveralls

Coverage Status

Coverage increased (+14.2%) to 85.281% when pulling ee7ba2c on hougangliu:remove-dup-service into 6c75b0c on kubeflow:master.

@hougangliu
Member Author

/test kubeflow-pytorch-operator-presubmit

@gaocegege
Member

If the worker does not expose a port, how can it "open a socket and handshake with all other workers"?

@hougangliu
Member Author

hougangliu commented Jul 30, 2019

If the worker does not expose a port, how can it "open a socket and handshake with all other workers"?

In fact, even if a container does not expose a port explicitly, it can still be reached at "podIP:port". The workers and the master use these sockets via "podIP:port".

@gaocegege
Member

gaocegege commented Jul 30, 2019

@hougangliu

Yeah, I know. I am wondering what will happen if a worker pod fails; its IP will change.

@hougangliu
Member Author

Yeah, I know. I am wondering what will happen if a worker pod fails; its IP will change.

When a worker Pod fails, the PyTorchJob restarts it.

@gaocegege
Member

Yeah, the operator will restart the pod, but the pod IP may change. Can we handle the IP change in the PR?

@hougangliu
Member Author

Yeah, the operator will restart the pod, but the pod IP may change. Can we handle the IP change in the PR?

We don't need to handle it. When the new Pod is created, it registers its newPodIP:newPort with MASTER_ADDR (the master service name):MASTER_PORT by calling the initialization method in the PyTorch source code.
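
To illustrate (a hedged sketch, not pytorch-operator code): a restarted worker just re-runs the same env:// initialization from its entrypoint, so the master learns the worker's new pod IP when it connects; nothing ever has to dial the worker at a pre-known address.

```python
# Sketch: what a restarted worker effectively does on start-up. Re-running
# init_process_group with env:// makes it dial MASTER_ADDR:MASTER_PORT again
# from its new pod IP; the master side records the new ip:port on connect.
import os
import torch.distributed as dist

dist.init_process_group(backend="gloo",
                        init_method="env://",
                        rank=int(os.environ["RANK"]),
                        world_size=int(os.environ["WORLD_SIZE"]))
```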

@gaocegege
Member

OK, then LGTM.

/lgtm

I am not familiar with distributed training in PyTorch.

So, /assign @johnugeorge

@johnugeorge
Member

Just wondering if you have faced some issue without this PR? Or is it just a cleanup?

@gaocegege How is this different from distributed TF?

@hougangliu
Member Author

hougangliu commented Jul 30, 2019

Just wondering if you have faced some issue without this PR? Or is it just a cleanup?

@gaocegege How is this different from distributed TF?

This is just a cleanup; all the worker services are in fact unused, so we can drop them.
TensorFlow builds its distributed cluster from an explicit hostname+port list, but PyTorch only needs one service (MASTER_ADDR:MASTER_PORT) for the workers; every worker registers itself with that service to form the cluster. They build the cluster in different ways.
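
To make the contrast concrete, a hedged illustration (hostnames and ports are made-up examples, not from this PR) of the two configuration styles:

```python
# Distributed TensorFlow: TF_CONFIG enumerates an explicit host:port for every
# task, so each replica must be resolvable -- hence a Service per replica.
tf_config_example = {
    "cluster": {
        "worker": ["myjob-worker-0:2222", "myjob-worker-1:2222"],
        "ps": ["myjob-ps-0:2222"],
    },
    "task": {"type": "worker", "index": 0},
}

# Distributed PyTorch with env:// init: only the master's address is known up
# front; workers announce themselves when they connect, so only the master
# needs a Service.
pytorch_env_example = {
    "MASTER_ADDR": "myjob-master-0",
    "MASTER_PORT": "23456",
    "RANK": "1",
    "WORLD_SIZE": "3",
}

print(tf_config_example)
print(pytorch_env_example)
```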

@gaocegege
Member

TensorFlow builds its distributed cluster from an explicit hostname+port list, but PyTorch only needs one service (MASTER_ADDR:MASTER_PORT) for the workers; every worker registers itself with that service to form the cluster. They build the cluster in different ways.

I think that is the difference. @johnugeorge

@johnugeorge
Member

Got it.

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 4028276 into kubeflow:master on Jul 30, 2019