
Remove unnecessary services for worker #191

Merged

k8s-ci-robot merged 1 commit into kubeflow:master from hougangliu:remove-dup-service on Jul 30, 2019

Conversation

@hougangliu
Member

For a PyTorchJob, MASTER_PORT and MASTER_ADDR are necessary, and MASTER_PORT is occupied only in the master Pod.

The master Pod will:

  1. Create sockets for all workers.
  2. Wait for all workers to connect.
  3. Send them information about the location of the other processes.

Each worker Pod will:

  1. Create a socket to MASTER_ADDR (the master service name):MASTER_PORT.
  2. Send its own location information (ip:randomPort).
  3. Receive information about the other workers.
  4. Open a socket and handshake with all other workers.

That is, MASTER_PORT on a worker is never used, so the per-worker Services are unnecessary (see the sketch below).
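
For illustration, a minimal sketch (assumed code, not taken from this repo or the operator) of the env:// rendezvous each replica runs; it shows why only the master ever needs its MASTER_PORT reachable:

```python
# Sketch only: how each PyTorchJob replica typically joins the process group.
# The operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
import os
import torch.distributed as dist

def init_distributed():
    rank = int(os.environ["RANK"])              # 0 on the master replica
    world_size = int(os.environ["WORLD_SIZE"])  # master + all workers
    # With init_method="env://", only rank 0 binds MASTER_ADDR:MASTER_PORT;
    # every other rank dials out to it and reports its own ip:randomPort,
    # so a worker never needs an inbound Service of its own.
    dist.init_process_group(backend="gloo",
                            init_method="env://",
                            rank=rank,
                            world_size=world_size)
    return rank, world_size

if __name__ == "__main__":
    rank, world_size = init_distributed()
    print("rank %d/%d joined the process group" % (rank, world_size))
```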

@hougangliu
Member Author

/cc @johnugeorge

@hougangliu
Member Author

/test kubeflow-pytorch-operator-presubmit

1 similar comment
@hougangliu
Member Author

/test kubeflow-pytorch-operator-presubmit

@coveralls

Coverage Status

Coverage increased (+14.2%) to 85.281% when pulling ee7ba2c on hougangliu:remove-dup-service into 6c75b0c on kubeflow:master.

@hougangliu
Member Author

/test kubeflow-pytorch-operator-presubmit

@gaocegege
Member

If the worker does not expose a port, how can it "open a socket and handshake with all other workers"?

@hougangliu
Member Author

hougangliu commented Jul 30, 2019

If the worker does not expose a port, how can it "open a socket and handshake with all other workers"?

In fact, even if a container does not expose a port explicitly, it can still be reached at "podIP:port". The workers and the master use these sockets via "podIP:port".

@gaocegege
Member

gaocegege commented Jul 30, 2019

@hougangliu

Yeah, I know. I am wondering what will happen if a worker pod fails; its IP will change.

@hougangliu
Member Author

Yeah, I know. I am wondering what will happen if a worker pod fails; its IP will change.

When a worker Pod fails, the PyTorchJob restarts it.

@gaocegege
Member

Yeah, the operator will restart the pod, but the pod IP may change. Can we handle the IP change in the PR?

@hougangliu
Member Author

Yeah, the operator will restart the pod, but the pod IP may change. Can we handle the IP change in the PR?

We don't need to handle it. When the new Pod is created, it registers its newPodIP:newPort with MASTER_ADDR (the master service name):MASTER_PORT by calling the initialization method in the PyTorch source code.
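
To illustrate (a hedged sketch, not pytorch-operator code): a restarted worker just re-runs the same env:// initialization from its entrypoint, so the master learns the worker's new pod IP when it connects; nothing ever has to dial the worker at a pre-known address.

```python
# Sketch: what a restarted worker effectively does on start-up. Re-running
# init_process_group with env:// makes it dial MASTER_ADDR:MASTER_PORT again
# from its new pod IP; the master side records the new ip:port on connect.
import os
import torch.distributed as dist

dist.init_process_group(backend="gloo",
                        init_method="env://",
                        rank=int(os.environ["RANK"]),
                        world_size=int(os.environ["WORLD_SIZE"]))
```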

@gaocegege
Member

OK, then LGTM.

/lgtm

I am not familiar with distributed training in PyTorch.

So, /assign @johnugeorge

@johnugeorge
Member

Just wondering if you have faced some issue without this PR? Or is it just a cleanup?

@gaocegege How is this different from distributed TF?

@hougangliu
Member Author

hougangliu commented Jul 30, 2019

Just wondering if you have faced some issue without this PR? Or is it just a cleanup?

@gaocegege How is this different from distributed TF?

This is just a cleanup; all the worker services are in fact unused, so we can drop them.
TensorFlow builds its distributed cluster from an explicit hostname+port list, but PyTorch only needs one service (MASTER_ADDR:MASTER_PORT) for the workers; every worker registers itself with that service to form the cluster. They build the cluster in different ways.
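
To make the contrast concrete, a hedged illustration (hostnames and ports are made-up examples, not from this PR) of the two configuration styles:

```python
# Distributed TensorFlow: TF_CONFIG enumerates an explicit host:port for every
# task, so each replica must be resolvable -- hence a Service per replica.
tf_config_example = {
    "cluster": {
        "worker": ["myjob-worker-0:2222", "myjob-worker-1:2222"],
        "ps": ["myjob-ps-0:2222"],
    },
    "task": {"type": "worker", "index": 0},
}

# Distributed PyTorch with env:// init: only the master's address is known up
# front; workers announce themselves when they connect, so only the master
# needs a Service.
pytorch_env_example = {
    "MASTER_ADDR": "myjob-master-0",
    "MASTER_PORT": "23456",
    "RANK": "1",
    "WORLD_SIZE": "3",
}

print(tf_config_example)
print(pytorch_env_example)
```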

@gaocegege
Member

TensorFlow builds its distributed cluster from an explicit hostname+port list, but PyTorch only needs one service (MASTER_ADDR:MASTER_PORT) for the workers; every worker registers itself with that service to form the cluster. They build the cluster in different ways.

I think that is the difference. @johnugeorge

@johnugeorge
Member

Got it.

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 4028276 into kubeflow:master on Jul 30, 2019