-
Notifications
You must be signed in to change notification settings - Fork 696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade PyTorchJob examples to PyTorch v2 #2024
Upgrade PyTorchJob examples to PyTorch v2 #2024
Conversation
Signed-off-by: champon1020 <nagatelu1020@gmail.com>
b22d017
to
e6db772
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
Could you address in the following items as well?
- Add PyTorch image to CI:
training-operator/.github/workflows/publish-example-images.yaml
Lines 58 to 62 in 14eeaeb
# TODO (tenzen-y): Fix the below broken Dockerfiles # - component-name: pytorch-dist-mnist-mpi # dockerfile: examples/pytorch/mnist/Dockerfile-mpi # - component-name: pytorch-dist-mnist # dockerfile: examples/pytorch/mnist/Dockerfile - Upgrade PyTorch version in the following images as well:
FROM $BASE_IMAGE | ||
|
||
WORKDIR /workspace | ||
|
||
# download imagenet tiny for data | ||
RUN apt-get -q update && apt-get -q install -y wget unzip | ||
RUN apt-get -q update && apt-get -q install -y wget unzip g++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't install g++
, any error happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I got following error messages when I ran container built without g++, both CPU and GPU backend.
# On CPU
nvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')
# On GPU
Failed to find C compiler. Please specify via CC environment variable
Then, after adding the line, I solved this issue. (It seems to be due to the JIT-compile in torch.compile
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thank you for the clarifications.
Based on @andreyvelich's comment, if you remove torch.compile
, let's remove this installation. Otherwise, I'm ok with leaving this here.
examples/pytorch/mnist/Dockerfile
Outdated
RUN apt update \ | ||
&& apt install --no-install-recommends -y g++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the same question above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we could remove these examples since we have examples with CUDA in examples/pytorch/mnist/Dockerfile
.
@kubeflow/wg-training-leads WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Can you remove it ? @champon1020
Pull Request Test Coverage Report for Build 8493497882Details
💛 - Coveralls |
@tenzen-y Although it seems to be installed pytorch by git clone in these Dockerfiles, do I need to modify them? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution @champon1020 🎉
I left a few comments.
@@ -1,10 +1,10 @@ | |||
ARG BASE_IMAGE=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime | |||
ARG BASE_IMAGE=pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use the same image as for Katib in this PR: kubeflow/katib#2279.
nvcr.io/nvidia/pytorch:24.01-py3
To support ARM-based builds in the future?
@@ -252,6 +252,7 @@ def initialize_model( | |||
# should always set the single device scope, otherwise, | |||
# DistributedDataParallel will use all available devices. | |||
model.to(device) | |||
model = torch.compile(model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need torch.compile()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think torch.compile
is the main feature of PyTorch v2 and it should be guaranteed to work properly.
But If this is out of scope, I'll remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I can remember, the torch.compile
could help us improve training performance.
So, I'd suggest using this torch.compile
.
Probably, we should introduce the torch.compile
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one concern is about supported GPUs:
Caveats: On a desktop-class GPU such as a NVIDIA 3090, we’ve measured that speedups are lower than on server-class GPUs such as A100. As of today, our default backend TorchInductor supports CPUs and NVIDIA Volta and Ampere GPUs. It does not (yet) support other GPUs, xPUs or older NVIDIA GPUs.
https://pytorch.org/get-started/pytorch-2.0/#pytorch-2x-faster-more-pythonic-and-as-dynamic-as-ever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @champon1020 Can we introduce this change in the followup PR ? @champon1020 You can create tracking issue to discuss update of our examples with torch.compile()
where we are going to investigate pros and cons adding this to all PyTorch examples ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll create a new issue and remove these changes related to torch.compile
in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the installation of g++
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the installation of g++ as well?
Yes, I'll also remove it.
examples/pytorch/mnist/mnist.py
Outdated
@@ -185,8 +185,8 @@ def main(): | |||
print(f"World Size: {os.environ['WORLD_SIZE']}. Rank: {os.environ['RANK']}") | |||
|
|||
dist.init_process_group(backend=args.backend) | |||
Distributor = nn.parallel.DistributedDataParallel | |||
model = Distributor(model) | |||
model = torch.compile(model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question for compile.
For the For the |
@tenzen-y I'll remove Dockerfile.ppc64le on another PR. |
SGTM |
Signed-off-by: champon1020 <nagatelu1020@gmail.com>
3403015
to
06e08a4
Compare
Signed-off-by: champon1020 <nagatelu1020@gmail.com>
06e08a4
to
9e1d6c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @andreyvelich I have modified some points so please review again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that openmpi is already installed in nvcr.io/nvidia/pytorch:24.01-py3
, so removed the lines for installation.
(Since the difference between Dockerfile and Dockerfile-mpi is only ENTRYPOINT, I suppose it can be unified to one Dockerfile)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@champon1020 Thank you for the updates!
Basically lgtm
- component-name: pytorch-dist-mnist | ||
dockerfile: examples/pytorch/mnist/Dockerfile | ||
- component-name: pytorch-dist-mnist-mpi | ||
dockerfile: examples/pytorch/mnist/Dockerfile-mpi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- component-name: pytorch-dist-mnist | |
dockerfile: examples/pytorch/mnist/Dockerfile | |
- component-name: pytorch-dist-mnist-mpi | |
dockerfile: examples/pytorch/mnist/Dockerfile-mpi | |
- component-name: pytorch-dist-mnist | |
dockerfile: examples/pytorch/mnist/Dockerfile | |
context: examples/pytorch/mnist | |
- component-name: pytorch-dist-mnist-mpi | |
dockerfile: examples/pytorch/mnist/Dockerfile-mpi | |
context: examples/pytorch/mnist |
Specifying the context
should resolve CI errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Thanks!
@@ -1,4 +1,4 @@ | |||
ARG BASE_IMAGE=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime | |||
ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.01-py3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* add Dockerfile context to github workflow yaml * add commenets to the head of Dockerfile Signed-off-by: champon1020 <nagatelu1020@gmail.com>
fbcb75b
to
4feab3b
Compare
@tenzen-y Sorry for the delay, but I modified some points commented out :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@champon1020 Thank you for this great contribution!
I'm looking forward to your next contribution!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: champon1020, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* refactor: upgrade pytorch job examples to pytorch v2 Signed-off-by: champon1020 <nagatelu1020@gmail.com> * fix: remove torch.compile and update base image of Dockerfile Signed-off-by: champon1020 <nagatelu1020@gmail.com> * fix: comment out pytorch mnist Dockerfiles in the config of CI Signed-off-by: champon1020 <nagatelu1020@gmail.com> * fix: minor changes * add Dockerfile context to github workflow yaml * add commenets to the head of Dockerfile Signed-off-by: champon1020 <nagatelu1020@gmail.com> --------- Signed-off-by: champon1020 <nagatelu1020@gmail.com> Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>
Since the base image was changed, it seems now that 'etcd' library is missing. I can no longer run my elastic pytorch jobs that use 'rdzvBackend: etcd'. It works on the previous version, so just wondering if others are experiencing this problem too. |
@mathias9395 Oh, thank you for letting us know! I'll try to fix it. |
* refactor: upgrade pytorch job examples to pytorch v2 Signed-off-by: champon1020 <nagatelu1020@gmail.com> * fix: remove torch.compile and update base image of Dockerfile Signed-off-by: champon1020 <nagatelu1020@gmail.com> * fix: comment out pytorch mnist Dockerfiles in the config of CI Signed-off-by: champon1020 <nagatelu1020@gmail.com> * fix: minor changes * add Dockerfile context to github workflow yaml * add commenets to the head of Dockerfile Signed-off-by: champon1020 <nagatelu1020@gmail.com> --------- Signed-off-by: champon1020 <nagatelu1020@gmail.com>
* refactor: upgrade pytorch job examples to pytorch v2 Signed-off-by: champon1020 <nagatelu1020@gmail.com> * fix: remove torch.compile and update base image of Dockerfile Signed-off-by: champon1020 <nagatelu1020@gmail.com> * fix: comment out pytorch mnist Dockerfiles in the config of CI Signed-off-by: champon1020 <nagatelu1020@gmail.com> * fix: minor changes * add Dockerfile context to github workflow yaml * add commenets to the head of Dockerfile Signed-off-by: champon1020 <nagatelu1020@gmail.com> --------- Signed-off-by: champon1020 <nagatelu1020@gmail.com>
What this PR does / why we need it:
Upgrade PyTorchJob examples to PyTorch v2.
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Issue: #2016
Checklist: