refactor accelerator teardown -> training type plugin teardown #7579
Conversation
…oint_consolidate Update test_all_gather_grad.py
This reverts commit 9d4a2b8.
This reverts commit 0d23d75.
This reverts commit 70fe5da.
This reverts commit a9aae99.
This reverts commit ea74906.
This reverts commit bf70e43.
This reverts commit f172101.
This reverts commit 536c132.
This reverts commit 3a9fde9.
This reverts commit 7a369f4.
This reverts commit 8222dc9.
This reverts commit 6c095b2.
This reverts commit 250d0aa.
This reverts commit 8651d54.
This reverts commit dcdcd29.
A special test failed, but I do not think it is relevant to this PR.
@shuyingsunshine21 Let's try to add. I think what is happening is that the parameterization + seed_everything causes both processes to get the same port, and there is probably not enough time between the two tests to completely release the port. I think that's why we get "Address already in use".
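One common workaround for the port collision described above is to ask the OS for an unused port instead of deriving one from the seed. A minimal sketch (`find_free_port` is a hypothetical helper, not part of Lightning's API):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port instead of computing one from a seed.

    Seeding every process identically (e.g. via seed_everything) can make two
    parameterized tests compute the same port; if the first test has not fully
    released it, the second fails with "Address already in use".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))  # port 0 -> the kernel picks a free port
        return s.getsockname()[1]

port = find_free_port()
```

The socket is closed immediately, so there is still a small race window, but it avoids two tests deterministically computing the same port.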
assert trainer.training_type_plugin.root_device == torch.device("cuda:0")

@RunIf(tpu=True)
Need to add the functionality.
Any way we can test the teardowns?
pytorch_lightning/plugins/training_type/training_type_plugin.py
pytorch_lightning/utilities/teardown.py
seems to have been added by mistake
Weird issue trying to switch branches...
def on_fit_end(self) -> None:
    assert "PT_XLA_DEBUG" not in os.environ
This is called before teardown: https://github.com/PyTorchLightning/pytorch-lightning/blob/a8d9b5f783528c2b18407d42298e10d2c7b18c61/pytorch_lightning/trainer/trainer.py#L776-L784
We could assert that PT_XLA_DEBUG is in the environ during training, and that it is no longer set after trainer.fit() completes.
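The contract under discussion can be sketched with a minimal stand-in (FakeXLAPlugin is illustrative, not Lightning's real TPU plugin): the debug variable is present while training runs and gone once teardown has executed.

```python
import os

class FakeXLAPlugin:
    """Minimal stand-in, not Lightning's real plugin, showing the contract:
    PT_XLA_DEBUG is set while training runs and removed by teardown."""

    def setup(self) -> None:
        os.environ["PT_XLA_DEBUG"] = "1"

    def teardown(self) -> None:
        # pop with a default so teardown is safe to call twice
        os.environ.pop("PT_XLA_DEBUG", None)

plugin = FakeXLAPlugin()
plugin.setup()
during = "PT_XLA_DEBUG" in os.environ   # set while "training" runs
plugin.teardown()
after = "PT_XLA_DEBUG" in os.environ    # removed after teardown
```

A real test would make these two checks inside a model hook during fit() and after trainer.fit() returns.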
Oh, L784 is the model teardown; the accelerator teardown is actually here: https://github.com/PyTorchLightning/pytorch-lightning/blob/a8d9b5f783528c2b18407d42298e10d2c7b18c61/pytorch_lightning/trainer/trainer.py#L804
which is in _post_dispatch.
Your proposal is fine too! I added on_fit_end because I would like to test immediately after the accelerator teardown.
"""Tests that teardown runs correctly for the single GPU plugin."""
trainer = Trainer(gpus=1, fast_dev_run=True)
model = BoringModelGPUTearDown()
trainer.fit(model)
Similarly, you can assert that the model is on the GPU during training and that after training the model's device is CPU.
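That assertion can be sketched without actual GPU hardware using a pure-Python stand-in (FakeModel and plugin_teardown are hypothetical names; a real test would inspect next(model.parameters()).device on a torch module):

```python
class FakeModel:
    """Pure-Python stand-in for a torch.nn.Module; no GPU needed here."""

    def __init__(self) -> None:
        self.device = "cuda:0"  # pretend fit() moved the model to the GPU

    def cpu(self) -> "FakeModel":
        self.device = "cpu"
        return self

def plugin_teardown(model: FakeModel) -> None:
    # Sketch of what a single-GPU plugin's teardown is expected to do:
    # move the model back to CPU so GPU memory is released after fit().
    model.cpu()

model = FakeModel()
device_during = model.device   # device while training runs
plugin_teardown(model)
device_after = model.device    # device after teardown
```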
What does this PR do?
Currently, teardown is controlled by the accelerator based on the device type (GPU/CPU/TPU), but each training type plugin may need to control that logic differently.
Also motivated by #7324
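As a rough sketch of the direction (class and method names below are illustrative, not the exact Lightning API): the accelerator stops branching on device type and simply delegates teardown to its training type plugin, which owns the cleanup logic.

```python
class TrainingTypePlugin:
    """Base-class sketch: teardown lives on the plugin, so each strategy
    decides how to clean up instead of the accelerator branching on device."""

    def teardown(self) -> None:
        pass  # default: nothing to release

class SingleDeviceTeardownPlugin(TrainingTypePlugin):
    def __init__(self) -> None:
        self.cleaned_up = False

    def teardown(self) -> None:
        # e.g. move the model to CPU, clear caches, unset env vars
        self.cleaned_up = True

class Accelerator:
    def __init__(self, plugin: TrainingTypePlugin) -> None:
        self.training_type_plugin = plugin

    def teardown(self) -> None:
        # after the refactor the accelerator only delegates
        self.training_type_plugin.teardown()

acc = Accelerator(SingleDeviceTeardownPlugin())
acc.teardown()
```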
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃