[SDK] Use Elastic Policy and torchrun as a Default Entrypoint for PyTorchJob #1991

Closed
andreyvelich opened this issue Jan 16, 2024 · 13 comments · Fixed by #2276

Comments

@andreyvelich
Member

Currently, we use python3 as the entrypoint when creating a Training Job from a function:

python3 -u \"$program_path/ephemeral_script.py\""""

Since torchrun is the recommended entrypoint for running distributed PyTorch, we should discuss whether to change the entrypoint for a PyTorchJob created from a function.
Also, we need to set the ElasticPolicy c10d backend.

We need to make sure that we can use torchrun with PyTorch code that is not using distributed capabilities.

cc @johnugeorge @tenzen-y @deepanker13 @kuizhiqing
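
For illustration, one way a training function could work both under plain python3 and under torchrun is to read the variables torchrun exports (RANK, LOCAL_RANK, WORLD_SIZE) and fall back to single-process defaults when they are absent. This is only a sketch: `LaunchInfo`, `detect_launch`, and `train` are hypothetical names, not part of the SDK, and the real code would initialize a process group where the comment indicates.

```python
import os
from typing import Mapping, NamedTuple


class LaunchInfo(NamedTuple):
    distributed: bool
    rank: int
    local_rank: int
    world_size: int


def detect_launch(env: Mapping[str, str] = os.environ) -> LaunchInfo:
    """Read the variables torchrun exports; default to a single process."""
    world_size = int(env.get("WORLD_SIZE", "1"))
    return LaunchInfo(
        distributed=world_size > 1,
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=world_size,
    )


def train(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical training entrypoint that tolerates both launchers."""
    info = detect_launch(env)
    if info.distributed:
        # With more than one worker, real code would call
        # torch.distributed.init_process_group() at this point.
        return f"distributed: rank {info.rank} of {info.world_size}"
    # Plain python3, or torchrun with one worker: no process group needed.
    return "single process"
```

Code written this way does not care which launcher started it, which is the property we would need before making torchrun the default.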

@tenzen-y
Member

torchrun was introduced in PyTorch v1.10: https://github.com/pytorch/pytorch/releases/tag/v1.10.0
So if we switch to torchrun, we need to announce to users that the new Python SDK doesn't support PyTorch < 1.10.
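
As a rough sketch of how such a version floor could be checked, the helper below parses a version string and compares it numerically. `supports_torchrun` is a hypothetical name, not an SDK function; in practice the argument would be `torch.__version__`.

```python
import re


def supports_torchrun(torch_version: str) -> bool:
    """Return True if the version string is PyTorch 1.10 or newer."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Compare as integers: as strings, "1.9" would sort after "1.10".
    return (major, minor) >= (1, 10)
```

The numeric comparison matters here because a plain string comparison would treat 1.9 as newer than 1.10.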

Also, we need to set the ElasticPolicy c10d backend.

I'm not sure why we need to use ElasticPolicy as a default once we switch to torchrun.
Could you clarify?

@deepanker13
Contributor

deepanker13 commented Jan 17, 2024

@tenzen-y I think environment variables like PET_RDZV_ENDPOINT, PET_RDZV_BACKEND, etc. are set for the containers only when we pass the elastic policy spec:

if pytorchjob.Spec.ElasticPolicy != nil {

These environment variables are necessary to start multi-node training, as mentioned in
https://github.com/pytorch/pytorch/blob/5bb2298da769121421711504da47955d3129b54f/torch/distributed/run.py#L164
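
To make the relationship between those variables and torchrun's flags concrete, here is a small sketch of the naming convention as I understand it: a flag that is not given on the command line falls back to an environment variable named PET_ plus the upper-cased flag name. The helper names and the example endpoint value are hypothetical, and this mirrors the convention rather than calling torchrun's own parser.

```python
from typing import Dict, List, Mapping


def pet_env_name(flag: str) -> str:
    """Map a torchrun flag such as --rdzv_endpoint to its PET_* variable name."""
    return "PET_" + flag.lstrip("-").replace("-", "_").upper()


def flags_from_env(env: Mapping[str, str], flags: List[str]) -> Dict[str, str]:
    """Collect the values of the given flags that are present as PET_* variables."""
    return {
        flag: env[pet_env_name(flag)]
        for flag in flags
        if pet_env_name(flag) in env
    }


# Hypothetical container environment, as it might look with ElasticPolicy set.
example_env = {
    "PET_RDZV_BACKEND": "c10d",
    "PET_RDZV_ENDPOINT": "job-worker-0:29400",  # made-up host and port
}
resolved = flags_from_env(example_env, ["--rdzv_backend", "--rdzv_endpoint", "--nnodes"])
```

If the variables are missing, as they are when no elastic policy spec is passed, nothing is resolved and torchrun has no rendezvous information for a multi-node run.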

@tenzen-y
Member

It makes sense. Thank you for investigating this :)

@ckcd

ckcd commented Mar 11, 2024

Hi @andreyvelich @tenzen-y, has this issue been fully discussed? If so, can I take it and work on the implementation?

@andreyvelich
Member Author

Hi @ckcd, we haven't had a chance to discuss this issue in detail yet.
We need to identify the pros and cons of using torchrun for all PyTorch-based tasks (e.g. single-node single-GPU, single-node multi-GPU, and multi-node multi-GPU runs).
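
To ground that comparison, the sketch below shows roughly what the torchrun invocation looks like in each of the three scenarios. `torchrun_args` is a hypothetical helper, not how the SDK builds its command, and the script name and endpoint are placeholders. It uses the underscore flag spellings and `--standalone` for the single-node case.

```python
from typing import List, Optional


def torchrun_args(
    script: str,
    nnodes: int = 1,
    nproc_per_node: int = 1,
    rdzv_endpoint: Optional[str] = None,
) -> List[str]:
    """Build an illustrative torchrun argument list for a given topology."""
    args = ["torchrun", f"--nproc_per_node={nproc_per_node}"]
    if nnodes == 1:
        # Single node: torchrun can set up a local rendezvous by itself.
        args.append("--standalone")
    else:
        if rdzv_endpoint is None:
            raise ValueError("multi-node runs need a rendezvous endpoint")
        args += [
            f"--nnodes={nnodes}",
            "--rdzv_backend=c10d",
            f"--rdzv_endpoint={rdzv_endpoint}",
        ]
    args.append(script)
    return args


single_gpu = torchrun_args("train.py")
multi_gpu = torchrun_args("train.py", nproc_per_node=4)
multi_node = torchrun_args("train.py", nnodes=2, nproc_per_node=4,
                           rdzv_endpoint="job-worker-0:29400")
```

Only the multi-node case needs rendezvous settings, which is where the ElasticPolicy question from earlier in the thread comes in.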


github-actions bot commented Jun 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member Author

/remove lifecycle/stale
/help


@andreyvelich:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/remove lifecycle/stale
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


github-actions bot commented Sep 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@andreyvelich
Member Author

/good-first-issue


@andreyvelich:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich
Member Author

/assign @andreyvelich
