Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDK] Use torchrun to create PyTorchJob from function #2276

Merged
merged 4 commits into from
Oct 11, 2024

Conversation

andreyvelich
Copy link
Member

Fixes: #1991

I updated the SDK to use torchrun when distributed PyTorchJob is created from function.
Also, I updated the unit tests and example.

/assign @kubeflow/wg-training-leads @Electronic-Waste @helenxie-bit @YosiElias @droctothorpe @Bobbins228 @akshay @deepanker13 @akshaychitneni @shravan-achar @kuizhiqing

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@coveralls
Copy link

coveralls commented Oct 8, 2024

Pull Request Test Coverage Report for Build 11295522063

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11164961069: 0.0%
Covered Lines: 66
Relevant Lines: 66

💛 - Coveralls

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much cleaner. Thanks!

@@ -355,6 +356,9 @@ def create_job(
set, Base Image must support `bash` CLI to execute the training script.
parameters: Dict of input parameters that training function might receive.
num_workers: Number of Worker replicas for the Job.
num_procs_per_worker: Number of processes per PyTorchJob worker for `torchrun` CLI. You
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this option is only for PyTorchJob, but this API is not only for PyTorchJob.
Do you aim to support other frameworks as well?
If not, pytorch prefix or suffix would be better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I think, we support TF and PyTorch today when this parameter is used: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L445-L460.
I added validation to have correct setting for TFJob and PyTorchJob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.
Thanks.
/lgtm
/approve

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: droctothorpe, Electronic-Waste, tenzen-y, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [tenzen-y,terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 3541327 into kubeflow:master Oct 11, 2024
40 checks passed
@andreyvelich andreyvelich deleted the issue-1991-sdk-torchrun branch October 11, 2024 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SDK] Use Elastic Policy and torchrun as a Default Entrypoint for PyTorchJob
6 participants