Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade PyTorchJob examples to PyTorch v2 #2024

Merged
merged 4 commits into from
Mar 30, 2024

Conversation

champon1020
Copy link
Contributor

What this PR does / why we need it:
Upgrade PyTorchJob examples to PyTorch v2.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Issue: #2016

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: champon1020 <nagatelu1020@gmail.com>
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!
Could you address in the following items as well?

FROM $BASE_IMAGE

WORKDIR /workspace

# download imagenet tiny for data
RUN apt-get -q update && apt-get -q install -y wget unzip
RUN apt-get -q update && apt-get -q install -y wget unzip g++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't install g++, any error happen?

Copy link
Contributor Author

@champon1020 champon1020 Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I got following error messages when I ran container built without g++, both CPU and GPU backend.

# On CPU
nvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')

# On GPU
Failed to find C compiler. Please specify via CC environment variable

Then, after adding the line, I solved this issue. (It seems to be due to the JIT-compile in torch.compile)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thank you for the clarifications.
Based on @andreyvelich's comment, if you remove torch.compile, let's remove this installation. Otherwise, I'm ok with leaving this here.

Comment on lines 3 to 4
RUN apt update \
&& apt install --no-install-recommends -y g++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the same question above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could remove these examples since we have examples with CUDA in examples/pytorch/mnist/Dockerfile.

@kubeflow/wg-training-leads WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Can you remove it ? @champon1020

@coveralls
Copy link

coveralls commented Mar 12, 2024

Pull Request Test Coverage Report for Build 8493497882

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 6 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.07%) to 42.785%

Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mpi/mpijob_controller.go 6 80.48%
Totals Coverage Status
Change from base Build 8473635677: -0.07%
Covered Lines: 3745
Relevant Lines: 8753

💛 - Coveralls

@champon1020
Copy link
Contributor Author

champon1020 commented Mar 12, 2024

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @champon1020 🎉
I left a few comments.

@@ -1,10 +1,10 @@
ARG BASE_IMAGE=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
ARG BASE_IMAGE=pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use the same image as for Katib in this PR: kubeflow/katib#2279.

nvcr.io/nvidia/pytorch:24.01-py3

To support ARM-based builds in the future?

@@ -252,6 +252,7 @@ def initialize_model(
# should always set the single device scope, otherwise,
# DistributedDataParallel will use all available devices.
model.to(device)
model = torch.compile(model)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need torch.compile()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think torch.compile is the main feature of PyTorch v2 and it should be guaranteed to work properly.
But If this is out of scope, I'll remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I can remember, the torch.compile could help us improve training performance.
So, I'd suggest using this torch.compile.

Probably, we should introduce the torch.compile.

Copy link
Member

@tenzen-y tenzen-y Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one concern is about supported GPUs:

Caveats: On a desktop-class GPU such as a NVIDIA 3090, we’ve measured that speedups are lower than on server-class GPUs such as A100. As of today, our default backend TorchInductor supports CPUs and NVIDIA Volta and Ampere GPUs. It does not (yet) support other GPUs, xPUs or older NVIDIA GPUs.

https://pytorch.org/get-started/pytorch-2.0/#pytorch-2x-faster-more-pythonic-and-as-dynamic-as-ever

Copy link
Member

@andreyvelich andreyvelich Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y @champon1020 Can we introduce this change in the followup PR ? @champon1020 You can create tracking issue to discuss update of our examples with torch.compile() where we are going to investigate pros and cons adding this to all PyTorch examples ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll create a new issue and remove these changes related to torch.compile in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the installation of g++ as well?

Copy link
Contributor Author

@champon1020 champon1020 Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the installation of g++ as well?

Yes, I'll also remove it.

@@ -185,8 +185,8 @@ def main():
print(f"World Size: {os.environ['WORLD_SIZE']}. Rank: {os.environ['RANK']}")

dist.init_process_group(backend=args.backend)
Distributor = nn.parallel.DistributedDataParallel
model = Distributor(model)
model = torch.compile(model)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question for compile.

@tenzen-y
Copy link
Member

tenzen-y commented Mar 12, 2024

Upgrade PyTorch version in the following images as well:
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile-mpi
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile.ppc64le

@tenzen-y Although it seems to be installed pytorch by git clone in these Dockerfiles, is there a need to modify them?

For the Dockerfile-mpi, yes, we should do it, but we can change the base image. If we can change the base image built-in PyTorch, I guess that we can reduce some building scripts.

For the Dockerfile.ppc64le, I think that we should consolidate mnist/Dockerfile.ppc64le and mnist/Dockerfile into mnist/Dockerfile, and then we should remove the mnist/Dockerfile.ppc64le, I'm ok with working on this in another PR. Would you like to work on this in this PR? Or, Another PR?

@champon1020
Copy link
Contributor Author

Upgrade PyTorch version in the following images as well:
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile-mpi
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile.ppc64le

@tenzen-y Although it seems to be installed pytorch by git clone in these Dockerfiles, is there a need to modify them?

For the Dockerfile-mpi, yes, we should do it, but we can change the base image. If we can change the base image built-in PyTorch, I guess that we can reduce some building scripts.

For the Dockerfile.ppc64le, I think that we should consolidate mnist/Dockerfile.ppc64le and mnist/Dockerfile into mnist/Dockerfile, and then we should remove the mnist/Dockerfile.ppc64le, I'm ok with working on this in another PR. Would you like to work on this in this PR? Or, Another PR?

@tenzen-y I'll remove Dockerfile.ppc64le on another PR.
Now I'm working to modify Dockerfile-mpi so please wait for a little.

@tenzen-y
Copy link
Member

Upgrade PyTorch version in the following images as well:
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile-mpi
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile.ppc64le

@tenzen-y Although it seems to be installed pytorch by git clone in these Dockerfiles, is there a need to modify them?

For the Dockerfile-mpi, yes, we should do it, but we can change the base image. If we can change the base image built-in PyTorch, I guess that we can reduce some building scripts.
For the Dockerfile.ppc64le, I think that we should consolidate mnist/Dockerfile.ppc64le and mnist/Dockerfile into mnist/Dockerfile, and then we should remove the mnist/Dockerfile.ppc64le, I'm ok with working on this in another PR. Would you like to work on this in this PR? Or, Another PR?

@tenzen-y I'll remove Dockerfile.ppc64le on another PR. Now I'm working to modify Dockerfile-mpi so please wait for a little.

SGTM

@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Mar 17, 2024
Signed-off-by: champon1020 <nagatelu1020@gmail.com>
Signed-off-by: champon1020 <nagatelu1020@gmail.com>
Copy link
Contributor Author

@champon1020 champon1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y @andreyvelich I have modified some points so please review again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that openmpi is already installed in nvcr.io/nvidia/pytorch:24.01-py3, so removed the lines for installation.
(Since the difference between Dockerfile and Dockerfile-mpi is only ENTRYPOINT, I suppose it can be unified to one Dockerfile)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Thanks!

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@champon1020 Thank you for the updates!
Basically lgtm

Comment on lines 58 to 61
- component-name: pytorch-dist-mnist
dockerfile: examples/pytorch/mnist/Dockerfile
- component-name: pytorch-dist-mnist-mpi
dockerfile: examples/pytorch/mnist/Dockerfile-mpi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- component-name: pytorch-dist-mnist
dockerfile: examples/pytorch/mnist/Dockerfile
- component-name: pytorch-dist-mnist-mpi
dockerfile: examples/pytorch/mnist/Dockerfile-mpi
- component-name: pytorch-dist-mnist
dockerfile: examples/pytorch/mnist/Dockerfile
context: examples/pytorch/mnist
- component-name: pytorch-dist-mnist-mpi
dockerfile: examples/pytorch/mnist/Dockerfile-mpi
context: examples/pytorch/mnist

Specifying the context should resolve CI errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Thanks!

@@ -1,4 +1,4 @@
ARG BASE_IMAGE=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.01-py3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* add Dockerfile context to github workflow yaml

* add commenets to the head of Dockerfile

Signed-off-by: champon1020 <nagatelu1020@gmail.com>
@champon1020
Copy link
Contributor Author

@tenzen-y Sorry for the delay, but I modified some points commented out :)

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@champon1020 Thank you for this great contribution!
I'm looking forward to your next contribution!

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Mar 30, 2024
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: champon1020, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 21f25ce into kubeflow:master Mar 30, 2024
39 checks passed
deepanker13 pushed a commit to deepanker13/deepanker-training-operator that referenced this pull request Apr 8, 2024
* refactor: upgrade pytorch job examples to pytorch v2

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

* fix: remove torch.compile and update base image of Dockerfile

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

* fix: comment out pytorch mnist Dockerfiles in the config of CI

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

* fix: minor changes

* add Dockerfile context to github workflow yaml

* add commenets to the head of Dockerfile

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

---------

Signed-off-by: champon1020 <nagatelu1020@gmail.com>
Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>
@mathias9395
Copy link

mathias9395 commented Apr 9, 2024

Since the base image was changed, it seems now that 'etcd' library is missing. I can no longer run my elastic pytorch jobs that use 'rdzvBackend: etcd'. It works on the previous version, so just wondering if others are experiencing this problem too.

@tenzen-y
Copy link
Member

Since the base image was changed, it seems now that 'etcd' library is missing. I can no longer run my elastic pytorch jobs that use 'rdzvBackend: etcd'. It works on the previous version, so just wondering if others are experiencing this problem too.

@mathias9395 Oh, thank you for letting us know! I'll try to fix it.

johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
* refactor: upgrade pytorch job examples to pytorch v2

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

* fix: remove torch.compile and update base image of Dockerfile

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

* fix: comment out pytorch mnist Dockerfiles in the config of CI

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

* fix: minor changes

* add Dockerfile context to github workflow yaml

* add commenets to the head of Dockerfile

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

---------

Signed-off-by: champon1020 <nagatelu1020@gmail.com>
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
* refactor: upgrade pytorch job examples to pytorch v2

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

* fix: remove torch.compile and update base image of Dockerfile

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

* fix: comment out pytorch mnist Dockerfiles in the config of CI

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

* fix: minor changes

* add Dockerfile context to github workflow yaml

* add commenets to the head of Dockerfile

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

---------

Signed-off-by: champon1020 <nagatelu1020@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants