[SDK] Use Elastic Policy and torchrun as a Default Entrypoint for PyTorchJob #1991

andreyvelich · 2024-01-16T20:29:30Z

Currently, we use python3 as the entrypoint when creating a Training Job from a function:

training-operator/sdk/python/kubeflow/training/utils/utils.py

Line 230 in 0b6a30c

python3 -u \"$program_path/ephemeral_script.py\""""

Since torchrun is the recommended entrypoint for running distributed PyTorch, we should discuss whether to change the entrypoint for a PyTorchJob created from a function.
Also, we need to set the ElasticPolicy c10d backend.
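To make the second point concrete, here is a minimal sketch of defaulting the rendezvous backend on a PyTorchJob spec. It uses a plain dict rather than the SDK models. The field names `elasticPolicy` and `rdzvBackend` are assumed from the PyTorchJob API and should be checked against the CRD, and `with_c10d_elastic_policy` is a hypothetical helper, not an SDK function.

```python
import copy


def with_c10d_elastic_policy(pytorchjob_spec: dict) -> dict:
    """Return a copy of the spec that has an elastic policy with a backend set.

    Hypothetical helper. Field names are assumed from the PyTorchJob API.
    A backend already chosen by the user is left untouched.
    """
    spec = copy.deepcopy(pytorchjob_spec)
    policy = spec.setdefault("elasticPolicy", {})
    policy.setdefault("rdzvBackend", "c10d")
    return spec


spec = with_c10d_elastic_policy({"pytorchReplicaSpecs": {}})
print(spec["elasticPolicy"])
```

Only missing values are filled in, so a user who sets a different backend explicitly keeps it.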

We need to make sure that torchrun also works with PyTorch code that does not use distributed capabilities.
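One way to reason about this: torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each worker process, so a script that never reads them should run unchanged, once per worker. The sketch below shows how a user function could read those variables defensively so the same code works under both python3 and torchrun. `distributed_context` is a hypothetical helper, and the sketch deliberately avoids importing torch so it runs anywhere.

```python
import os


def distributed_context(env=None):
    """Read the rank variables torchrun exports, defaulting to a single process."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": world_size,
        # Only runs with more than one process need a process group.
        "needs_process_group": world_size > 1,
    }


# Plain python3 launch: none of the variables are set.
print(distributed_context({}))
# Launch with four workers: this is how the second worker would see it.
print(distributed_context({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"}))
```

A training function could call `torch.distributed.init_process_group()` only when `needs_process_group` is true and skip it otherwise.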

cc @johnugeorge @tenzen-y @deepanker13 @kuizhiqing


tenzen-y · 2024-01-17T13:03:42Z

torchrun was introduced in PyTorch v1.10: https://github.com/pytorch/pytorch/releases/tag/v1.10.0
So if we switch to torchrun, we need to tell users that the new Python SDK doesn't support PyTorch<1.10.
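If we wanted to surface this constraint in code rather than only in release notes, a version guard along these lines could work. This is a sketch: `supports_torchrun` is a hypothetical helper, and it only compares the major and minor components of the version string.

```python
import re

# torchrun first shipped in PyTorch v1.10, per the release linked above.
MIN_TORCHRUN_VERSION = (1, 10)


def supports_torchrun(torch_version: str) -> bool:
    """Return True if the given PyTorch version string is at least 1.10."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognized version: {torch_version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return major_minor >= MIN_TORCHRUN_VERSION


for version in ("1.9.1", "1.10.0+cu113", "2.1.0"):
    print(version, supports_torchrun(version))
```

In real use the argument would be `torch.__version__`. Comparing integer tuples avoids the string-comparison trap where "1.9" sorts after "1.10".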

> Also, we need to set the ElasticPolicy c10d backend.

I'm not sure why we need ElasticPolicy as a default once we switch to torchrun.
Could you clarify?

deepanker13 · 2024-01-17T14:15:55Z

@tenzen-y I think environment variables like PET_RDZV_ENDPOINT, PET_RDZV_BACKEND, etc. are set for the containers only when we pass the elastic policy spec:

training-operator/pkg/controller.v1/pytorch/envvar.go

Line 109 in 0b6a30c

if pytorchjob.Spec.ElasticPolicy != nil {

These environment variables are necessary to start multi-node training, as mentioned in
https://github.com/pytorch/pytorch/blob/5bb2298da769121421711504da47955d3129b54f/torch/distributed/run.py#L164
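For context on why those variables matter: torchrun lets its options be supplied through environment variables named `PET_` plus the upper-cased option name, with an explicit command-line flag taking precedence. The sketch below imitates that lookup order. It is an illustration and not torchrun's actual code: `resolve_option` is a hypothetical function and the endpoint value is made up.

```python
import os


def resolve_option(name, cli_value=None, default=None, env=None):
    """Resolve a launcher option: CLI flag first, then PET_<NAME>, then default."""
    env = os.environ if env is None else env
    if cli_value is not None:
        return cli_value
    return env.get(f"PET_{name.upper()}", default)


# Variables the operator would inject when an elastic policy is set
# (the endpoint value here is a made-up example).
pod_env = {
    "PET_RDZV_BACKEND": "c10d",
    "PET_RDZV_ENDPOINT": "example-job-worker-0:29400",
}

print(resolve_option("rdzv_backend", default="static", env=pod_env))
print(resolve_option("rdzv_backend", default="static", env={}))
```

Without the injected variables the lookup falls back to the default, which is why a multi-node job launched with a bare torchrun command would have no rendezvous endpoint to use.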

tenzen-y · 2024-01-25T18:51:35Z

> @tenzen-y I think environment variables like PET_RDZV_ENDPOINT, PET_RDZV_BACKEND etc get set for the containers only when we pass the elastic policy spec (training-operator/pkg/controller.v1/pytorch/envvar.go, line 109 in 0b6a30c: if pytorchjob.Spec.ElasticPolicy != nil {).
> And the above mentioned environment variables are necessary to start multi node training as mentioned in https://github.com/pytorch/pytorch/blob/5bb2298da769121421711504da47955d3129b54f/torch/distributed/run.py#L164

It makes sense. Thank you for investigating this :)

ckcd · 2024-03-11T08:12:46Z

Hi @andreyvelich @tenzen-y, has this issue been fully discussed? If so, can I take it and work on the implementation?

andreyvelich · 2024-03-11T16:41:25Z

Hi @ckcd, we haven't had a chance to discuss this issue in detail yet.
We need to identify the pros and cons of using torchrun for all PyTorch-based tasks (e.g. single-node single-GPU, single-node multi-GPU, and multi-node multi-GPU runs).
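To help compare those cases, here is a rough sketch of how the torchrun command line might differ per topology. `torchrun_args` is a hypothetical helper and not part of the SDK. It assumes the single-node cases use `--standalone` and the multi-node case uses the c10d rendezvous backend with an endpoint supplied by the caller.

```python
def torchrun_args(nnodes: int, nproc_per_node: int, rdzv_endpoint: str = "") -> list:
    """Build a torchrun argument list for a given topology (illustrative only)."""
    if nnodes < 1 or nproc_per_node < 1:
        raise ValueError("nnodes and nproc_per_node must be at least 1")
    args = ["torchrun", f"--nproc_per_node={nproc_per_node}"]
    if nnodes == 1:
        # Single node: torchrun can manage its own local rendezvous.
        args.append("--standalone")
    else:
        if not rdzv_endpoint:
            raise ValueError("multi-node runs need a rendezvous endpoint")
        args += [
            f"--nnodes={nnodes}",
            "--rdzv_backend=c10d",
            f"--rdzv_endpoint={rdzv_endpoint}",
        ]
    return args


print(torchrun_args(1, 1))  # single-node single-GPU
print(torchrun_args(1, 4))  # single-node multi-GPU
print(torchrun_args(2, 4, "example-job-worker-0:29400"))  # multi-node multi-GPU
```

In a PyTorchJob the multi-node values would more likely arrive through the `PET_*` environment variables than through flags, so this mainly shows which settings each case depends on.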

github-actions · 2024-06-09T20:01:48Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich · 2024-06-10T12:41:30Z

/remove lifecycle/stale
/help

google-oss-prow · 2024-06-10T12:41:32Z

@andreyvelich:
This request has been marked as needing help from a contributor.


github-actions · 2024-09-08T20:01:40Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions · 2024-09-28T20:02:01Z

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

andreyvelich · 2024-09-30T15:52:28Z

/good-first-issue

google-oss-prow · 2024-09-30T15:52:31Z

@andreyvelich:
This request has been marked as suitable for new contributors.


andreyvelich · 2024-10-02T15:36:42Z

/assign @andreyvelich

github-actions bot added the lifecycle/stale label Jun 9, 2024

google-oss-prow bot added the help wanted label Jun 10, 2024

github-actions bot removed the lifecycle/stale label Jun 10, 2024

github-actions bot added the lifecycle/stale label Sep 8, 2024

github-actions bot closed this as completed Sep 28, 2024

andreyvelich reopened this Sep 30, 2024

google-oss-prow bot added the good first issue label Sep 30, 2024

github-actions bot removed the lifecycle/stale label Sep 30, 2024

google-oss-prow bot assigned andreyvelich Oct 2, 2024

andreyvelich mentioned this issue Oct 8, 2024

[SDK] Use torchrun to create PyTorchJob from function #2276

Merged

google-oss-prow bot closed this as completed in #2276 Oct 11, 2024
