Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cascade SIGTERM to subprocesses #16525

Merged
merged 14 commits into from
Jan 31, 2023
Merged

Conversation

carmocca
Copy link
Contributor

@carmocca carmocca commented Jan 27, 2023

What does this PR do?

Follow-up to #16501

Part of #16410

Does your PR introduce any breaking changes? If yes, please list them.

None

cc @Borda @carmocca @justusschock @awaelchli

@carmocca carmocca added feature Is an improvement or enhancement fault tolerance strategy: ddp DistributedDataParallel pl Generic label for PyTorch Lightning package labels Jan 27, 2023
@carmocca carmocca added this to the 2.0 milestone Jan 27, 2023
@carmocca carmocca self-assigned this Jan 27, 2023
@carmocca
Copy link
Contributor Author

carmocca commented Jan 27, 2023

@awaelchli I ran a manual test and this is necessary. The subprocesses do not get the signal automatically and the patch works as expected.

Do you prefer that I add a test with mocks, or a complete test to https://github.com/Lightning-AI/lightning/blob/c854e4d1e230133b80e92c5dd088f2289219772f/tests/tests_pytorch/run_standalone_tasks.sh?

@awaelchli
Copy link
Contributor

@carmocca This looks great. A simple test through unitest.mock would be great!

@carmocca carmocca force-pushed the feat/sigterm-handler-notify-subprocesses branch from ed1befb to f977a2b Compare January 31, 2023 00:53
@github-actions github-actions bot added app (removed) Generic label for Lightning App package fabric lightning.fabric.Fabric labels Jan 31, 2023
@github-actions github-actions bot removed the app (removed) Generic label for Lightning App package label Jan 31, 2023
@carmocca carmocca marked this pull request as ready for review January 31, 2023 01:57
@github-actions
Copy link
Contributor

github-actions bot commented Jan 31, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, pytorch, 3.8, 1.11) success
pl-cpu (macOS-11, pytorch, 3.9, 1.12) success
pl-cpu (macOS-11, pytorch, 3.10, 1.13) success
pl-cpu (macOS-11, pytorch, 3.8, 1.10, oldest) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.10) success
pl-cpu (ubuntu-20.04, pytorch, 3.9, 1.11) success
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.12) success
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.7, 1.10, oldest) success
pl-cpu (windows-2022, pytorch, 3.9, 1.11) success
pl-cpu (windows-2022, pytorch, 3.10, 1.12) success
pl-cpu (windows-2022, pytorch, 3.10, 1.13) success
pl-cpu (windows-2022, pytorch, 3.7, 1.10, oldest) success
pl-cpu (slow, macOS-11, pytorch, 3.7, 1.11) success
pl-cpu (slow, ubuntu-20.04, pytorch, 3.7, 1.11) success
pl-cpu (slow, windows-2022, pytorch, 3.7, 1.11) success
pl-cpu (macOS-11, lightning, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.13) success
pl-cpu (windows-2022, lightning, 3.8, 1.13) success

These checks are required after the changes to src/lightning_fabric/strategies/launchers/__init__.py, src/lightning_fabric/strategies/launchers/launcher.py, src/lightning_fabric/strategies/launchers/multiprocessing.py, src/lightning_fabric/strategies/launchers/subprocess_script.py, src/lightning_fabric/strategies/launchers/xla.py, src/lightning_fabric/strategies/strategy.py, src/pytorch_lightning/strategies/launchers/launcher.py, src/pytorch_lightning/strategies/launchers/multiprocessing.py, src/pytorch_lightning/strategies/launchers/subprocess_script.py, src/pytorch_lightning/strategies/launchers/xla.py, src/pytorch_lightning/strategies/strategy.py, src/pytorch_lightning/trainer/connectors/signal_connector.py, tests/tests_pytorch/strategies/launchers/test_multiprocessing.py, tests/tests_pytorch/strategies/launchers/test_subprocess_script.py, tests/tests_pytorch/trainer/connectors/test_signal_connector.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) success

These checks are required after the changes to src/pytorch_lightning/strategies/launchers/launcher.py, src/pytorch_lightning/strategies/launchers/multiprocessing.py, src/pytorch_lightning/strategies/launchers/subprocess_script.py, src/pytorch_lightning/strategies/launchers/xla.py, src/pytorch_lightning/strategies/strategy.py, src/pytorch_lightning/trainer/connectors/signal_connector.py, tests/tests_pytorch/strategies/launchers/test_multiprocessing.py, tests/tests_pytorch/strategies/launchers/test_subprocess_script.py, tests/tests_pytorch/trainer/connectors/test_signal_connector.py, src/lightning_fabric/strategies/launchers/__init__.py, src/lightning_fabric/strategies/launchers/launcher.py, src/lightning_fabric/strategies/launchers/multiprocessing.py, src/lightning_fabric/strategies/launchers/subprocess_script.py, src/lightning_fabric/strategies/launchers/xla.py, src/lightning_fabric/strategies/strategy.py.

🟢 pytorch_lightning: Azure HPU
Check ID Status
pytorch-lightning (HPUs) success

These checks are required after the changes to src/lightning_fabric/strategies/launchers/__init__.py, src/lightning_fabric/strategies/launchers/launcher.py, src/lightning_fabric/strategies/launchers/multiprocessing.py, src/lightning_fabric/strategies/launchers/subprocess_script.py, src/lightning_fabric/strategies/launchers/xla.py, src/lightning_fabric/strategies/strategy.py, src/pytorch_lightning/strategies/launchers/launcher.py, src/pytorch_lightning/strategies/launchers/multiprocessing.py, src/pytorch_lightning/strategies/launchers/subprocess_script.py, src/pytorch_lightning/strategies/launchers/xla.py, src/pytorch_lightning/strategies/strategy.py, src/pytorch_lightning/trainer/connectors/signal_connector.py, tests/tests_pytorch/strategies/launchers/test_multiprocessing.py, tests/tests_pytorch/strategies/launchers/test_subprocess_script.py, tests/tests_pytorch/trainer/connectors/test_signal_connector.py.

🟢 pytorch_lightning: Azure IPU
Check ID Status
pytorch-lightning (IPUs) success

These checks are required after the changes to src/lightning_fabric/strategies/launchers/__init__.py, src/lightning_fabric/strategies/launchers/launcher.py, src/lightning_fabric/strategies/launchers/multiprocessing.py, src/lightning_fabric/strategies/launchers/subprocess_script.py, src/lightning_fabric/strategies/launchers/xla.py, src/lightning_fabric/strategies/strategy.py, src/pytorch_lightning/strategies/launchers/launcher.py, src/pytorch_lightning/strategies/launchers/multiprocessing.py, src/pytorch_lightning/strategies/launchers/subprocess_script.py, src/pytorch_lightning/strategies/launchers/xla.py, src/pytorch_lightning/strategies/strategy.py, src/pytorch_lightning/trainer/connectors/signal_connector.py, tests/tests_pytorch/strategies/launchers/test_multiprocessing.py, tests/tests_pytorch/strategies/launchers/test_subprocess_script.py, tests/tests_pytorch/trainer/connectors/test_signal_connector.py.

🟢 pytorch_lightning: Docs
Check ID Status
make-doctest (pytorch) success
make-html (pytorch) success

These checks are required after the changes to src/pytorch_lightning/strategies/launchers/launcher.py, src/pytorch_lightning/strategies/launchers/multiprocessing.py, src/pytorch_lightning/strategies/launchers/subprocess_script.py, src/pytorch_lightning/strategies/launchers/xla.py, src/pytorch_lightning/strategies/strategy.py, src/pytorch_lightning/trainer/connectors/signal_connector.py.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, fabric, 3.8, 1.11) success
fabric-cpu (macOS-11, fabric, 3.9, 1.12) success
fabric-cpu (macOS-11, fabric, 3.10, 1.13) success
fabric-cpu (macOS-11, fabric, 3.7, 1.10, oldest) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.10) success
fabric-cpu (ubuntu-20.04, fabric, 3.9, 1.11) success
fabric-cpu (ubuntu-20.04, fabric, 3.10, 1.12) success
fabric-cpu (ubuntu-20.04, fabric, 3.10, 1.13) success
fabric-cpu (ubuntu-20.04, fabric, 3.7, 1.10, oldest) success
fabric-cpu (windows-2022, fabric, 3.9, 1.11) success
fabric-cpu (windows-2022, fabric, 3.10, 1.12) success
fabric-cpu (windows-2022, fabric, 3.10, 1.13) success
fabric-cpu (windows-2022, fabric, 3.7, 1.10, oldest) success
fabric-cpu (macOS-11, lightning, 3.8, 1.13) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.13) success
fabric-cpu (windows-2022, lightning, 3.8, 1.13) success

These checks are required after the changes to src/lightning_fabric/strategies/launchers/__init__.py, src/lightning_fabric/strategies/launchers/launcher.py, src/lightning_fabric/strategies/launchers/multiprocessing.py, src/lightning_fabric/strategies/launchers/subprocess_script.py, src/lightning_fabric/strategies/launchers/xla.py, src/lightning_fabric/strategies/strategy.py.

🟢 lightning_fabric: Azure GPU
Check ID Status
lightning-fabric (GPUs) success

These checks are required after the changes to src/lightning_fabric/strategies/launchers/__init__.py, src/lightning_fabric/strategies/launchers/launcher.py, src/lightning_fabric/strategies/launchers/multiprocessing.py, src/lightning_fabric/strategies/launchers/subprocess_script.py, src/lightning_fabric/strategies/launchers/xla.py, src/lightning_fabric/strategies/strategy.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning_fabric/strategies/launchers/__init__.py, src/lightning_fabric/strategies/launchers/launcher.py, src/lightning_fabric/strategies/launchers/multiprocessing.py, src/lightning_fabric/strategies/launchers/subprocess_script.py, src/lightning_fabric/strategies/launchers/xla.py, src/lightning_fabric/strategies/strategy.py, src/pytorch_lightning/strategies/launchers/launcher.py, src/pytorch_lightning/strategies/launchers/multiprocessing.py, src/pytorch_lightning/strategies/launchers/subprocess_script.py, src/pytorch_lightning/strategies/launchers/xla.py, src/pytorch_lightning/strategies/strategy.py, src/pytorch_lightning/trainer/connectors/signal_connector.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.7) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.7) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.7) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.7) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.7) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.7) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.7) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.7) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.7) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.7) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.7) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.7) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.7) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.7) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.7) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the changes to src/lightning_fabric/strategies/launchers/__init__.py, src/lightning_fabric/strategies/launchers/launcher.py, src/lightning_fabric/strategies/launchers/multiprocessing.py, src/lightning_fabric/strategies/launchers/subprocess_script.py, src/lightning_fabric/strategies/launchers/xla.py, src/lightning_fabric/strategies/strategy.py, src/pytorch_lightning/strategies/launchers/launcher.py, src/pytorch_lightning/strategies/launchers/multiprocessing.py, src/pytorch_lightning/strategies/launchers/subprocess_script.py, src/pytorch_lightning/strategies/launchers/xla.py, src/pytorch_lightning/strategies/strategy.py, src/pytorch_lightning/trainer/connectors/signal_connector.py.

🟢 link-check
Check ID Status
markdown-link-check success

These checks are required after the changes to src/pytorch_lightning/CHANGELOG.md.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really neat !

Comment on lines +20 to +23
class _Launcher(_FabricLauncher, ABC):
@abstractmethod
def kill(self, signum: _SIGNUM) -> None:
"""Kill existing alive processes."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carmocca You could also push this to fabric and raise NotImplementedError. To avoid having to define a new base class in PL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do you see us implementing signal handling for fabric in the future? Otherwise I wouldn't do it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at the core, fabric would expose these primitives yes. You could call fabric.strategy.launcher.kill() for example if that's useful to you. The launcher implementation is also very PL-independent. The only reason why they are subclassed in PL is because of some extra state management and hydra support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not do this as it would be unused code in Fabric. In the future, we can do that and add docs for it, or try to merge the classes

Copy link
Contributor Author

@carmocca carmocca Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving. We can do this as a follow-up if we decide

@mergify mergify bot added the ready PRs ready to be merged label Jan 31, 2023
@carmocca carmocca enabled auto-merge (squash) January 31, 2023 16:24
@carmocca carmocca merged commit 9f8043e into master Jan 31, 2023
@carmocca carmocca deleted the feat/sigterm-handler-notify-subprocesses branch January 31, 2023 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fabric lightning.fabric.Fabric fault tolerance feature Is an improvement or enhancement pl Generic label for PyTorch Lightning package ready PRs ready to be merged strategy: ddp DistributedDataParallel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants