Remove unnecessary services for worker #191
Conversation
/cc @johnugeorge
Force-pushed 2708cf9 to ee7ba2c
/test kubeflow-pytorch-operator-presubmit
If the worker does not expose a port, how can the Pods communicate with it?
In fact, even if a container does not expose a port explicitly, we can still reach it via "podIP:port". The workers and the master open their sockets against "podIP:port" directly.
Yeah, I know. I am wondering what will happen if a worker pod fails: its IP will change.
Once a worker Pod fails, the PyTorchJob restarts.
Yeah, the operator will restart the pod, but the pod IP may change. Can we handle the IP change in the PR? |
We don't need to handle it. When the new Pod is created, it will register itself with the master again.
OK, then LGTM. /lgtm I am not familiar with distributed training in PyTorch, so /assign @johnugeorge
Just wondering: have you faced an issue without this PR, or is it just a cleanup? @gaocegege How is this different from distributed TF?
This is just a cleanup; all the worker services are in fact unused, so we can drop them.
I think it is the difference. @johnugeorge |
Got it. /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: johnugeorge The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
For a PyTorchJob, MASTER_PORT and MASTER_ADDR are necessary, but MASTER_PORT is actually occupied only in the master Pod: the master Pod listens on MASTER_PORT, while all worker Pods connect out to MASTER_ADDR:MASTER_PORT. That is, MASTER_PORT on a worker is never used, so Services for the workers are unnecessary.