
refactor accelerator teardown -> training type plugin teardown #7579

Merged (73 commits) on May 22, 2021

Conversation

@shuyingsunshine21 (Contributor) commented May 17, 2021

What does this PR do?

Currently, teardown is controlled by the accelerator based on the device type (GPU/CPU/TPU), but each training type plugin might need to control that logic differently.

Also motivated by #7324
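
For context, the shape of the refactor looks roughly like this (a simplified sketch with made-up class bodies, not the actual diff): the accelerator stops doing device-specific cleanup itself and delegates teardown to its training type plugin.

import torch


class TrainingTypePlugin:
    # simplified stand-in for pytorch_lightning's TrainingTypePlugin
    def teardown(self) -> None:
        """Default: nothing device-specific to clean up."""


class SingleGPUPlugin(TrainingTypePlugin):
    # sketch of a plugin that owns its own GPU-specific cleanup
    def __init__(self, model: torch.nn.Module) -> None:
        self.model = model

    def teardown(self) -> None:
        # move the model back to CPU and release cached GPU memory
        self.model.cpu()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()


class Accelerator:
    # simplified stand-in: after the refactor the accelerator just delegates
    def __init__(self, training_type_plugin: TrainingTypePlugin) -> None:
        self.training_type_plugin = training_type_plugin

    def teardown(self) -> None:
        self.training_type_plugin.teardown()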

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Shuying Sun and others added 30 commits March 23, 2021 12:06

…oint_consolidate

Update test_all_gather_grad.py

…1-checkpoint_consolidate"

This reverts commit c5053da, reversing changes made to 0d23d75.

This reverts commit 70fe5da.

This reverts commit a9aae99.
@ananthsub added the "design" (Includes a design discussion) and "feature" (Is an improvement or enhancement) labels on May 20, 2021
@shuyingsunshine21 (Contributor, Author) commented May 20, 2021

A special test failed for ddp_sharded_spawn:

tests/plugins/test_deepspeed_plugin.py::test_deepspeed_multigpu_stage_2_accumulated_grad_batches

E       Exception: 
E       
E       -- Process 0 terminated with the following error:
E       Traceback (most recent call last):
E         File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
E           fn(i, *args)
E         File "/__w/1/s/pytorch_lightning/plugins/training_type/sharded_spawn.py", line 94, in new_process
E           super().new_process(process_idx, trainer, mp_queue)
E         File "/__w/1/s/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 180, in new_process
E           self.init_ddp_connection(self.global_rank, self.world_size)
E         File "/__w/1/s/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 266, in init_ddp_connection
E           torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
E         File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
E           store, rank, world_size = next(rendezvous_iterator)
E         File "/usr/local/lib/python3.8/dist-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
E           store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
E       RuntimeError: Address already in use

I do not think it is relevant to this PR, though.

@awaelchli (Contributor) commented:
@shuyingsunshine21 Let's try to add set_random_master_port before seed_everything in that test.

I think what is happening is that the parametrization + seed_everything is causing both processes to get the same port, and there is probably not enough time between the two tests to completely release the port. I think that's why we get "Address already in use".
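
The idea, sketched below (this helper is illustrative; Lightning's test utilities have their own set_random_master_port, whose exact implementation may differ): choose the rendezvous port from a source that seed_everything does not control, so back-to-back parametrized spawn tests do not end up with the same MASTER_PORT.

import os
import random

# an OS-backed RNG is not affected by seed_everything, so repeated
# parametrized tests will not deterministically pick the same port
_unseeded_rng = random.SystemRandom()


def set_random_master_port() -> None:
    # illustrative version of the test helper: pick a fresh rendezvous port
    os.environ["MASTER_PORT"] = str(_unseeded_rng.randint(10000, 19999))

In the failing test, this would be called before seed_everything, so the port choice stays independent of the fixed seed.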

@mergify bot removed the "has conflicts" label on May 21, 2021
assert trainer.training_type_plugin.root_device == torch.device("cuda:0")


@RunIf(tpu=True)
A reviewer (Contributor) commented:

Need to add the functionality.

@carmocca (Contributor) left a comment:

Any way we can test the teardowns?

Outdated, resolved review threads on:

  • pytorch_lightning/plugins/training_type/single_tpu.py
  • pytorch_lightning/plugins/training_type/tpu_spawn.py
  • pytorch_lightning/plugins/training_type/parallel.py (2 threads)
  • pytorch_lightning/plugins/training_type/single_device.py (2 threads)
@ananthsub (Contributor) left a comment:

pytorch_lightning/utilities/teardown.py seems to have been added by mistake

@shuyingsunshine21 (Contributor, Author) commented:

weird issue trying to switch branch....

Comment on lines 46 to 47:

def on_fit_end(self) -> None:
    assert "PT_XLA_DEBUG" not in os.environ

A reviewer (Contributor) commented:

this is called before teardown: https://github.com/PyTorchLightning/pytorch-lightning/blob/a8d9b5f783528c2b18407d42298e10d2c7b18c61/pytorch_lightning/trainer/trainer.py#L776-L784

we could assert that PT_XLA_DEBUG is in the environ during training, and that it's not after trainer.fit() completes in L56

@shuyingsunshine21 (Contributor, Author) replied:

Oh, L784 is for model teardown; accelerator teardown actually happens here: https://github.com/PyTorchLightning/pytorch-lightning/blob/a8d9b5f783528c2b18407d42298e10d2c7b18c61/pytorch_lightning/trainer/trainer.py#L804

which is in _post_dispatch.

Your proposal works too! I added on_fit_end because I would like to test immediately after the accelerator teardown.
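
For illustration, the test pattern being discussed could look roughly like this (a sketch only: the subclass name, Trainer arguments, and import paths are assumptions, not the exact code from the diff):

import os

from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel
from tests.helpers.runif import RunIf


class BoringModelTPUAssert(BoringModel):  # illustrative subclass name
    def on_fit_end(self) -> None:
        # per the discussion above, on_fit_end runs after the accelerator
        # teardown (_post_dispatch), so the TPU plugin's debug flag should
        # already have been removed from the environment
        assert "PT_XLA_DEBUG" not in os.environ


@RunIf(tpu=True)
def test_xla_debug_env_is_cleaned_up(tmpdir):
    model = BoringModelTPUAssert()
    trainer = Trainer(tpu_cores=8, fast_dev_run=True, default_root_dir=tmpdir)
    trainer.fit(model)
    # alternatively, assert here after trainer.fit() completes, as suggested above
    assert "PT_XLA_DEBUG" not in os.environ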

"""Tests if teardown correctly for single GPU plugin."""
trainer = Trainer(gpus=1, fast_dev_run=True)
model = BoringModelGPUTearDown()
trainer.fit(model)
A reviewer (Contributor) commented:

similarly, you can assert that the model is on gpu during training, and that after training the model's device is cpu
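
A sketch of that suggestion (the BoringModelGPUTearDown name comes from the diff above, but the hooks, assertions, and imports shown here are illustrative assumptions):

import torch

from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel
from tests.helpers.runif import RunIf


class BoringModelGPUTearDown(BoringModel):  # name from the diff; body is a sketch
    def on_train_start(self) -> None:
        # while training runs, the model should live on the GPU
        assert self.device == torch.device("cuda:0")


@RunIf(min_gpus=1)
def test_single_gpu_teardown(tmpdir):
    """Tests that teardown works correctly for the single GPU plugin."""
    trainer = Trainer(gpus=1, fast_dev_run=True, default_root_dir=tmpdir)
    model = BoringModelGPUTearDown()
    trainer.fit(model)
    # after fit, the training type plugin's teardown should have moved the model back to CPU
    assert model.device == torch.device("cpu")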

@ananthsub merged commit 2242423 into Lightning-AI:master on May 22, 2021