Upgrade PyTorchJob examples to PyTorch v2 #2024

champon1020 · 2024-03-11T15:32:48Z

What this PR does / why we need it:
Upgrade PyTorchJob examples to PyTorch v2.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Issue: #2016

Checklist:

Docs included if any changes are user facing

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

tenzen-y

Thanks for the contribution!
Could you address the following items as well?

Add PyTorch image to CI:

training-operator/.github/workflows/publish-example-images.yaml

Lines 58 to 62 in 14eeaeb

    
    # TODO (tenzen-y): Fix the below broken Dockerfiles
    #          - component-name: pytorch-dist-mnist-mpi
    #            dockerfile: examples/pytorch/mnist/Dockerfile-mpi
    #          - component-name: pytorch-dist-mnist
    #            dockerfile: examples/pytorch/mnist/Dockerfile

Upgrade PyTorch version in the following images as well:
- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile-mpi
- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile.ppc64le

tenzen-y · 2024-03-12T00:40:18Z

examples/pytorch/elastic/imagenet/Dockerfile

 FROM $BASE_IMAGE

 WORKDIR /workspace

 # download imagenet tiny for data
-RUN apt-get -q update && apt-get -q install -y wget unzip
+RUN apt-get -q update && apt-get -q install -y wget unzip g++


Does any error occur if you don't install g++?

Yes. I got the following error messages when I ran a container built without g++, on both the CPU and GPU backends.

    # On CPU
    InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')
    # On GPU
    Failed to find C compiler. Please specify via CC environment variable

Adding the line solved the issue. (It seems to be caused by the JIT compilation in torch.compile.)

I see. Thank you for the clarifications.
Based on @andreyvelich's comment, if you remove torch.compile, let's remove this installation. Otherwise, I'm ok with leaving this here.
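For illustration, the compiler dependency described in this thread could be checked up front rather than discovered at the first compiled call. This is only a minimal sketch, not code from the PR: the helper names `find_cxx_compiler` and `should_compile` are hypothetical, and it assumes that a plain PATH lookup for `g++` is a reasonable stand-in for whatever lookup torch.compile does internally.

```python
import shutil


def find_cxx_compiler(candidates=("g++",)):
    """Return the path of the first candidate compiler found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def should_compile():
    """Hypothetical guard: opt into torch.compile only if a compiler is present."""
    # Assumption: a missing g++ is what produced the errors quoted above.
    return find_cxx_compiler() is not None


if __name__ == "__main__":
    print("C++ compiler:", find_cxx_compiler())
    print("torch.compile looks usable:", should_compile())
```

An example script could call `should_compile()` before wrapping the model, and fall back to the uncompiled model when it returns False.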

tenzen-y · 2024-03-12T00:41:04Z

examples/pytorch/mnist/Dockerfile

+RUN apt update \
+    && apt install --no-install-recommends -y g++


Same question as above.

tenzen-y · 2024-03-12T00:44:22Z

examples/pytorch/pytorch_cuda_docker/Dockerfile

I'm wondering if we could remove these examples since we have examples with CUDA in examples/pytorch/mnist/Dockerfile.

@kubeflow/wg-training-leads WDYT?

I agree. Can you remove it, @champon1020?

coveralls · 2024-03-12T11:17:19Z

Pull Request Test Coverage Report for Build 8493497882

Details

0 of 0 changed or added relevant lines in 0 files are covered.
6 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.07%) to 42.785%

Files with Coverage Reduction	New Missed Lines	%
pkg/controller.v1/mpi/mpijob_controller.go	6	80.48%

Totals
Change from base Build 8473635677:	-0.07%
Covered Lines:	3745
Relevant Lines:	8753

💛 - Coveralls

champon1020 · 2024-03-12T12:42:51Z

Upgrade PyTorch version in the following images as well:
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile-mpi
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile.ppc64le

@tenzen-y These Dockerfiles seem to install PyTorch via git clone. Do I still need to modify them?

andreyvelich

Thank you for your contribution @champon1020 🎉
I left a few comments.

andreyvelich · 2024-03-12T15:07:58Z

examples/pytorch/elastic/imagenet/Dockerfile

@@ -1,10 +1,10 @@
-ARG BASE_IMAGE=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
+ARG BASE_IMAGE=pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime


Should we use the same image that Katib uses in kubeflow/katib#2279?

nvcr.io/nvidia/pytorch:24.01-py3

That would let us support ARM-based builds in the future.

andreyvelich · 2024-03-12T15:09:21Z

examples/pytorch/elastic/imagenet/imagenet.py

@@ -252,6 +252,7 @@ def initialize_model(
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    model.to(device)
+    model = torch.compile(model)


Why do we need torch.compile()?

I think torch.compile is the main feature of PyTorch v2, so the examples should be guaranteed to work with it.
But if this is out of scope, I'll remove it.

As far as I remember, torch.compile can help improve training performance.
So I'd suggest using torch.compile.

We should probably introduce torch.compile.

My only concern is about supported GPUs:

Caveats: On a desktop-class GPU such as a NVIDIA 3090, we’ve measured that speedups are lower than on server-class GPUs such as A100. As of today, our default backend TorchInductor supports CPUs and NVIDIA Volta and Ampere GPUs. It does not (yet) support other GPUs, xPUs or older NVIDIA GPUs.

https://pytorch.org/get-started/pytorch-2.0/#pytorch-2x-faster-more-pythonic-and-as-dynamic-as-ever
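As a rough illustration of the concern above, an example script could gate compilation on the GPU's compute capability. This is a sketch under stated assumptions, not anything from the PR: the function name `inductor_likely_supported` is hypothetical, and the threshold assumes that Volta corresponds to compute capability 7.0 and that every newer architecture is also acceptable, which is a simplification of the quoted caveat.

```python
def inductor_likely_supported(capability):
    """Guess whether a GPU is new enough for the default torch.compile backend.

    `capability` is a (major, minor) tuple, the shape returned by
    torch.cuda.get_device_capability().
    """
    major, _minor = capability
    # Assumption: Volta is compute capability 7.0, so anything with a
    # major version of 7 or higher is treated as supported.
    return major >= 7


if __name__ == "__main__":
    for cap in [(6, 1), (7, 0), (8, 0)]:
        print(cap, inductor_likely_supported(cap))
```

In a real script the tuple would come from `torch.cuda.get_device_capability()`, and the model would be left uncompiled when the check returns False.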

@tenzen-y @champon1020 Can we introduce this change in a follow-up PR? @champon1020, could you create a tracking issue to discuss updating our examples with torch.compile(), where we can investigate the pros and cons of adding it to all PyTorch examples?

Sure, I'll create a new issue and remove these changes related to torch.compile in this PR.

Can we remove the installation of g++ as well?

Yes, I'll also remove it.

andreyvelich · 2024-03-12T15:11:49Z

examples/pytorch/mnist/mnist.py

@@ -185,8 +185,8 @@ def main():
    print(f"World Size: {os.environ['WORLD_SIZE']}. Rank: {os.environ['RANK']}")

    dist.init_process_group(backend=args.backend)
-    Distributor = nn.parallel.DistributedDataParallel
-    model = Distributor(model)
+    model = torch.compile(model)


Same question for compile.

tenzen-y · 2024-03-12T15:19:52Z

Upgrade PyTorch version in the following images as well:
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile-mpi
https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/pytorch/mnist/Dockerfile.ppc64le

@tenzen-y These Dockerfiles seem to install PyTorch via git clone. Do I still need to modify them?

For Dockerfile-mpi, yes, we should, but we can change the base image instead. If we switch to a base image that has PyTorch built in, I guess we can drop some of the build scripts.

For Dockerfile.ppc64le, I think we should consolidate mnist/Dockerfile.ppc64le and mnist/Dockerfile into mnist/Dockerfile, and then remove mnist/Dockerfile.ppc64le. I'm OK with doing this in another PR. Would you like to work on it in this PR or in a separate one?

champon1020 · 2024-03-15T17:47:53Z

@tenzen-y I'll remove Dockerfile.ppc64le in another PR.
I'm now working on modifying Dockerfile-mpi, so please wait a little.

tenzen-y · 2024-03-15T23:09:24Z

SGTM

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

champon1020

@tenzen-y @andreyvelich I have made some changes, so please take another look.

champon1020 · 2024-03-17T15:54:10Z

examples/pytorch/mnist/Dockerfile-mpi

I found that openmpi is already installed in nvcr.io/nvidia/pytorch:24.01-py3, so I removed the installation lines.
(Since the only difference between Dockerfile and Dockerfile-mpi is the ENTRYPOINT, I suppose they could be unified into one Dockerfile.)

Good point! Thanks!

tenzen-y

@champon1020 Thank you for the updates!
Basically lgtm

tenzen-y · 2024-03-26T07:10:12Z

.github/workflows/publish-example-images.yaml

+          - component-name: pytorch-dist-mnist
+            dockerfile: examples/pytorch/mnist/Dockerfile
+          - component-name: pytorch-dist-mnist-mpi
+            dockerfile: examples/pytorch/mnist/Dockerfile-mpi


Suggested change

-          - component-name: pytorch-dist-mnist
-            dockerfile: examples/pytorch/mnist/Dockerfile
-          - component-name: pytorch-dist-mnist-mpi
-            dockerfile: examples/pytorch/mnist/Dockerfile-mpi
+          - component-name: pytorch-dist-mnist
+            dockerfile: examples/pytorch/mnist/Dockerfile
+            context: examples/pytorch/mnist
+          - component-name: pytorch-dist-mnist-mpi
+            dockerfile: examples/pytorch/mnist/Dockerfile-mpi
+            context: examples/pytorch/mnist

Specifying the context should resolve CI errors.

tenzen-y · 2024-03-26T07:10:46Z

examples/pytorch/mnist/Dockerfile-mpi

Good point! Thanks!

tenzen-y · 2024-03-26T07:12:35Z

examples/pytorch/elastic/imagenet/Dockerfile

@@ -1,4 +1,4 @@
-ARG BASE_IMAGE=pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
+ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.01-py3


Can we add a comment similar to the one in https://github.com/kubeflow/katib/blob/250e9d176f047f524d37bb417db46b2563d0a72d/examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.gpu#L1-L3?

champon1020 · 2024-03-30T22:01:47Z

@tenzen-y Sorry for the delay. I've addressed the points raised in the comments :)

tenzen-y

@champon1020 Thank you for this great contribution!
I'm looking forward to your next contribution!

/lgtm
/approve

google-oss-prow · 2024-03-30T22:32:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: champon1020, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* refactor: upgrade pytorch job examples to pytorch v2
  Signed-off-by: champon1020 <nagatelu1020@gmail.com>
* fix: remove torch.compile and update base image of Dockerfile
  Signed-off-by: champon1020 <nagatelu1020@gmail.com>
* fix: comment out pytorch mnist Dockerfiles in the config of CI
  Signed-off-by: champon1020 <nagatelu1020@gmail.com>
* fix: minor changes
  * add Dockerfile context to github workflow yaml
  * add commenets to the head of Dockerfile
  Signed-off-by: champon1020 <nagatelu1020@gmail.com>
---------
Signed-off-by: champon1020 <nagatelu1020@gmail.com>
Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

mathias9395 · 2024-04-09T17:31:10Z

Since the base image was changed, it seems now that 'etcd' library is missing. I can no longer run my elastic pytorch jobs that use 'rdzvBackend: etcd'. It works on the previous version, so just wondering if others are experiencing this problem too.

tenzen-y · 2024-04-10T06:11:37Z

@mathias9395 Oh, thank you for letting us know! I'll try to fix it.
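For anyone hitting the same problem, a quick way to confirm the report is to check inside the container whether the `etcd` Python module is importable before submitting a job. This is a minimal sketch, not a fix from this PR: the helper names `has_module` and `check_rdzv_backend` are hypothetical, and it assumes the etcd rendezvous backend relies on an importable `etcd` module (provided, as far as I know, by the python-etcd package).

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def check_rdzv_backend(backend):
    """Return a description of the problem, or None if nothing looks wrong."""
    # Assumption: only the etcd backend needs the extra `etcd` module.
    if backend == "etcd" and not has_module("etcd"):
        return "rdzvBackend 'etcd' requested, but the 'etcd' module is not importable"
    return None


if __name__ == "__main__":
    print(check_rdzv_backend("etcd") or "etcd rendezvous dependencies look fine")
```

Running this in the new base image should show whether the missing library is the cause, independently of the training script.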

google-oss-prow bot added the do-not-merge/work-in-progress label Mar 11, 2024

google-oss-prow bot requested review from jinchihe and kuizhiqing March 11, 2024 15:32

google-oss-prow bot added the size/M label Mar 11, 2024

champon1020 marked this pull request as ready for review March 11, 2024 15:33

google-oss-prow bot removed the do-not-merge/work-in-progress label Mar 11, 2024

refactor: upgrade pytorch job examples to pytorch v2

e6db772

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

champon1020 force-pushed the patch-issue-2016 branch from b22d017 to e6db772 Compare March 11, 2024 15:34

tenzen-y reviewed Mar 12, 2024

andreyvelich reviewed Mar 12, 2024

champon1020 mentioned this pull request Mar 12, 2024

Introduce torch.compile to all PyTorch examples #2027

Open

google-oss-prow bot added size/L and removed size/M labels Mar 17, 2024

fix: remove torch.compile and update base image of Dockerfile

73893a2

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

champon1020 force-pushed the patch-issue-2016 branch from 3403015 to 06e08a4 Compare March 17, 2024 15:49

fix: comment out pytorch mnist Dockerfiles in the config of CI

9e1d6c5

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

champon1020 force-pushed the patch-issue-2016 branch from 06e08a4 to 9e1d6c5 Compare March 17, 2024 15:51

champon1020 commented Mar 17, 2024

tenzen-y reviewed Mar 26, 2024

fix: minor changes

4feab3b

* add Dockerfile context to github workflow yaml
* add commenets to the head of Dockerfile

Signed-off-by: champon1020 <nagatelu1020@gmail.com>

champon1020 force-pushed the patch-issue-2016 branch from fbcb75b to 4feab3b Compare March 30, 2024 21:25

champon1020 requested a review from tenzen-y March 30, 2024 22:02

tenzen-y reviewed Mar 30, 2024

google-oss-prow bot assigned tenzen-y Mar 30, 2024

google-oss-prow bot added the lgtm label Mar 30, 2024

google-oss-prow bot added the approved label Mar 30, 2024

google-oss-prow bot merged commit 21f25ce into kubeflow:master Mar 30, 2024
39 checks passed

champon1020 mentioned this pull request Mar 30, 2024

Remove Dockerfile.ppc64le of pytorch example #2042

Merged

1 task

This was referenced Apr 10, 2024

Unable to start elastic PyTorchJob example #2050

Closed

fix: fix build mnist image err due to python in base pytorch image too old #2009

Closed
