
refactor accelerator teardown -> training type plugin teardown #7579

Merged (73 commits) on May 22, 2021

Conversation

@shuyingsunshine21 (Contributor) commented May 17, 2021

What does this PR do?

Currently, teardown is controlled by the accelerator based on the device type (GPU/CPU/TPU), but each training type plugin might need to control that logic differently.

Also motivated by #7324
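
For context, the shape of the refactor looks roughly like this (a simplified sketch with made-up class bodies, not the actual diff): the accelerator stops doing device-specific cleanup itself and delegates teardown to its training type plugin.

import torch


class TrainingTypePlugin:
    # simplified stand-in for pytorch_lightning's TrainingTypePlugin
    def teardown(self) -> None:
        """Default: nothing device-specific to clean up."""


class SingleGPUPlugin(TrainingTypePlugin):
    # sketch of a plugin that owns its own GPU-specific cleanup
    def __init__(self, model: torch.nn.Module) -> None:
        self.model = model

    def teardown(self) -> None:
        # move the model back to CPU and release cached GPU memory
        self.model.cpu()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()


class Accelerator:
    # simplified stand-in: after the refactor the accelerator just delegates
    def __init__(self, training_type_plugin: TrainingTypePlugin) -> None:
        self.training_type_plugin = training_type_plugin

    def teardown(self) -> None:
        self.training_type_plugin.teardown()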

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Shuying Sun and others added 30 commits March 23, 2021 12:06

…oint_consolidate

Update test_all_gather_grad.py

…1-checkpoint_consolidate"

This reverts commit c5053da, reversing changes made to 0d23d75.

This reverts commit 70fe5da.

This reverts commit a9aae99.
@ananthsub added the "design" (Includes a design discussion) and "feature" (Is an improvement or enhancement) labels on May 20, 2021
@shuyingsunshine21 (Contributor, Author) commented May 20, 2021

A special test failed for ddp_sharded_spawn:

tests/plugins/test_deepspeed_plugin.py::test_deepspeed_multigpu_stage_2_accumulated_grad_batches

E       Exception: 
E       
E       -- Process 0 terminated with the following error:
E       Traceback (most recent call last):
E         File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
E           fn(i, *args)
E         File "/__w/1/s/pytorch_lightning/plugins/training_type/sharded_spawn.py", line 94, in new_process
E           super().new_process(process_idx, trainer, mp_queue)
E         File "/__w/1/s/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 180, in new_process
E           self.init_ddp_connection(self.global_rank, self.world_size)
E         File "/__w/1/s/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 266, in init_ddp_connection
E           torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
E         File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
E           store, rank, world_size = next(rendezvous_iterator)
E         File "/usr/local/lib/python3.8/dist-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
E           store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
E       RuntimeError: Address already in use

I do not think it is relevant to this PR, though.

@awaelchli (Contributor) commented:
@shuyingsunshine21 Let's try to add set_random_master_port before seed_everything in that test.

I think what is happening is that the parametrization + seed_everything is causing both processes to get the same port, and there is probably not enough time between the two tests to completely release the port. I think that's why we get "Address already in use".
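
The idea, sketched below (this helper is illustrative; Lightning's test utilities have their own set_random_master_port, whose exact implementation may differ): choose the rendezvous port from a source that seed_everything does not control, so back-to-back parametrized spawn tests do not end up with the same MASTER_PORT.

import os
import random

# an OS-backed RNG is not affected by seed_everything, so repeated
# parametrized tests will not deterministically pick the same port
_unseeded_rng = random.SystemRandom()


def set_random_master_port() -> None:
    # illustrative version of the test helper: pick a fresh rendezvous port
    os.environ["MASTER_PORT"] = str(_unseeded_rng.randint(10000, 19999))

In the failing test, this would be called before seed_everything, so the port choice stays independent of the fixed seed.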

@mergify bot removed the "has conflicts" label on May 21, 2021
assert trainer.training_type_plugin.root_device == torch.device("cuda:0")


@RunIf(tpu=True)
A reviewer (Contributor) commented:

Need to add the functionality.

@carmocca (Contributor) left a comment:

Any way we can test the teardowns?

Outdated, resolved review threads on:

  • pytorch_lightning/plugins/training_type/single_tpu.py
  • pytorch_lightning/plugins/training_type/tpu_spawn.py
  • pytorch_lightning/plugins/training_type/parallel.py (2 threads)
  • pytorch_lightning/plugins/training_type/single_device.py (2 threads)
@ananthsub (Contributor) left a comment:

pytorch_lightning/utilities/teardown.py seems to have been added by mistake

@shuyingsunshine21 (Contributor, Author) commented:

weird issue trying to switch branch....

Comment on lines 46 to 47:

def on_fit_end(self) -> None:
    assert "PT_XLA_DEBUG" not in os.environ

A reviewer (Contributor) commented:

this is called before teardown: https://github.com/PyTorchLightning/pytorch-lightning/blob/a8d9b5f783528c2b18407d42298e10d2c7b18c61/pytorch_lightning/trainer/trainer.py#L776-L784

we could assert that PT_XLA_DEBUG is in the environ during training, and that it's not after trainer.fit() completes in L56

@shuyingsunshine21 (Contributor, Author) replied:

Oh, L784 is for model teardown; accelerator teardown actually happens here: https://github.com/PyTorchLightning/pytorch-lightning/blob/a8d9b5f783528c2b18407d42298e10d2c7b18c61/pytorch_lightning/trainer/trainer.py#L804

which is in _post_dispatch.

Your proposal works too! I added on_fit_end because I would like to test immediately after the accelerator teardown.
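
For illustration, the test pattern being discussed could look roughly like this (a sketch only: the subclass name, Trainer arguments, and import paths are assumptions, not the exact code from the diff):

import os

from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel
from tests.helpers.runif import RunIf


class BoringModelTPUAssert(BoringModel):  # illustrative subclass name
    def on_fit_end(self) -> None:
        # per the discussion above, on_fit_end runs after the accelerator
        # teardown (_post_dispatch), so the TPU plugin's debug flag should
        # already have been removed from the environment
        assert "PT_XLA_DEBUG" not in os.environ


@RunIf(tpu=True)
def test_xla_debug_env_is_cleaned_up(tmpdir):
    model = BoringModelTPUAssert()
    trainer = Trainer(tpu_cores=8, fast_dev_run=True, default_root_dir=tmpdir)
    trainer.fit(model)
    # alternatively, assert here after trainer.fit() completes, as suggested above
    assert "PT_XLA_DEBUG" not in os.environ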

"""Tests if teardown correctly for single GPU plugin."""
trainer = Trainer(gpus=1, fast_dev_run=True)
model = BoringModelGPUTearDown()
trainer.fit(model)
A reviewer (Contributor) commented:

similarly, you can assert that the model is on gpu during training, and that after training the model's device is cpu
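
A sketch of that suggestion (the BoringModelGPUTearDown name comes from the diff above, but the hooks, assertions, and imports shown here are illustrative assumptions):

import torch

from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel
from tests.helpers.runif import RunIf


class BoringModelGPUTearDown(BoringModel):  # name from the diff; body is a sketch
    def on_train_start(self) -> None:
        # while training runs, the model should live on the GPU
        assert self.device == torch.device("cuda:0")


@RunIf(min_gpus=1)
def test_single_gpu_teardown(tmpdir):
    """Tests that teardown works correctly for the single GPU plugin."""
    trainer = Trainer(gpus=1, fast_dev_run=True, default_root_dir=tmpdir)
    model = BoringModelGPUTearDown()
    trainer.fit(model)
    # after fit, the training type plugin's teardown should have moved the model back to CPU
    assert model.device == torch.device("cpu")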

@ananthsub merged commit 2242423 into Lightning-AI:master on May 22, 2021