-
Notifications
You must be signed in to change notification settings - Fork 696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SDK] Use torchrun to create PyTorchJob from function #2276
[SDK] Use torchrun to create PyTorchJob from function #2276
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Pull Request Test Coverage Report for Build 11295522063Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks much cleaner. Thanks!
@@ -355,6 +356,9 @@ def create_job( | |||
set, Base Image must support `bash` CLI to execute the training script. | |||
parameters: Dict of input parameters that training function might receive. | |||
num_workers: Number of Worker replicas for the Job. | |||
num_procs_per_worker: Number of processes per PyTorchJob worker for `torchrun` CLI. You |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this option is only for PyTorchJob, but this API is not only for PyTorchJob.
Do you aim to support other frameworks as well?
If not, pytorch prefix or suffix would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I think, we support TF and PyTorch today when this parameter is used: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/api/training_client.py#L445-L460.
I added validation to have correct setting for TFJob and PyTorchJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
Thanks.
/lgtm
/approve
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: droctothorpe, Electronic-Waste, tenzen-y, terrytangyuan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Fixes: #1991
I updated the SDK to use
torchrun
when distributed PyTorchJob is created from function.Also, I updated the unit tests and example.
/assign @kubeflow/wg-training-leads @Electronic-Waste @helenxie-bit @YosiElias @droctothorpe @Bobbins228 @akshay @deepanker13 @akshaychitneni @shravan-achar @kuizhiqing