
Process reconciliation as a plugin #16410

Closed
awaelchli opened this issue Jan 17, 2023 · 1 comment · Fixed by #18218
Labels
fabric lightning.fabric.Fabric refactor strategy: ddp DistributedDataParallel
awaelchli commented Jan 17, 2023

Outline & Motivation

The DDP strategy and its subclasses have a feature called "deadlock detection and process reconciliation". It ensures that all processes terminate properly when an error occurs on a subset of the ranks. Without this feature, the processes on which no error occurred would keep running and hang waiting at the next collective operation.
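The hang can be illustrated with plain threads standing in for ranks and a barrier standing in for a collective (a simplified analogy, not Lightning code; the timeout stands in for "waits forever"):

```python
import threading

# Two "ranks": rank 1 fails before reaching the collective, so rank 0 is left
# waiting at the barrier that will never complete.
barrier = threading.Barrier(parties=2)
result = {}

def rank0():
    try:
        barrier.wait(timeout=0.5)  # the "collective"
        result[0] = "done"
    except threading.BrokenBarrierError:
        result[0] = "hung"  # without the timeout, this rank would wait forever

def rank1():
    result[1] = "crashed"  # the error occurs before the collective is reached

t0 = threading.Thread(target=rank0)
t1 = threading.Thread(target=rank1)
t0.start(); t1.start()
t0.join(); t1.join()
```

Process reconciliation replaces the indefinite wait with a coordinated shutdown of the surviving ranks.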

Pro:

  • Can save you costs when running in the cloud.
  • No zombie processes that you have to kill manually.

Con:

  • The implementation is hardcoded into DDPStrategy and does not work well with inheritance.
  • It makes the Trainer's exception handling complex.
  • It assumes a shared filesystem.

Pitch

  1. Remove the feature from PL strategies (Remove deadlock detection / process reconciliation logic #16204)
  2. Introduce it as a plugin under fabric
  3. Introduce Strategy.on_exception that the exception handler can call in a standardized way
  4. Re-introduce it in PL strategies once flattening preparations for Fabric integration are done

A strategy can enable the plugin like so:

```python
if not torch_greater_equal_foo and has_shared_filesystem:
    enable_plugin()
```

and by implementing

```python
def on_exception(self, exception):
    self.plugin.reconciliate_processes(...)
```
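A minimal sketch of what such a plugin might look like, assuming a shared filesystem; the class name, constructor, and `detect_deadlock` are hypothetical and not part of the proposal:

```python
import os
import tempfile

class ProcessReconciliationPlugin:
    """Hypothetical sketch, not an actual Fabric API: failing ranks drop a
    marker file into a directory on the shared filesystem, and surviving
    ranks can poll it to decide to shut down instead of hanging."""

    def __init__(self, sync_dir: str, rank: int) -> None:
        self.sync_dir = sync_dir  # directory on the shared filesystem
        self.rank = rank

    def reconciliate_processes(self, trace: str) -> None:
        # announce this rank's failure so surviving ranks can see it
        os.makedirs(self.sync_dir, exist_ok=True)
        with open(os.path.join(self.sync_dir, f"rank_{self.rank}.txt"), "w") as f:
            f.write(trace)

    def detect_deadlock(self) -> bool:
        # a peer has failed if any marker file exists in the sync directory
        return os.path.isdir(self.sync_dir) and bool(os.listdir(self.sync_dir))

# hypothetical usage
sync_dir = os.path.join(tempfile.mkdtemp(), ".lightning_reconcile")
plugin = ProcessReconciliationPlugin(sync_dir, rank=0)
before = plugin.detect_deadlock()  # no failures announced yet
plugin.reconciliate_processes("Traceback (most recent call last): ...")
after = plugin.detect_deadlock()   # marker file now present
```

Keeping this logic in a plugin, rather than inside DDPStrategy, is what lets strategies opt in via `on_exception` without inheriting the filesystem assumption.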

Additional context

Credit for the ideas: @carmocca


cc @justusschock @awaelchli @carmocca

@awaelchli awaelchli added needs triage Waiting to be triaged by maintainers refactor fabric lightning.fabric.Fabric strategy: ddp DistributedDataParallel and removed needs triage Waiting to be triaged by maintainers labels Jan 17, 2023
@carmocca carmocca added this to the future milestone Jan 18, 2023
carmocca commented Jan 31, 2023

In #16525 I implemented a limited version of this proposal. It is intra-node only and triggers only on the SIGTERM signal, but it could be extended to support inter-node setups, more signals, or arbitrary exception types.

It's only for PL as Fabric does not have any signal management at the moment.

My code was loosely based on https://github.com/pytorch/pytorch/blob/v1.13.1/torch/distributed/elastic/multiprocessing/api.py
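An illustrative sketch of the intra-node SIGTERM-forwarding idea described above (not the actual #16525 implementation; `child_pids` is a stand-in for the launcher's bookkeeping of local worker processes):

```python
import os
import signal

# Assumption: the launcher tracks the pids of its locally spawned workers.
child_pids: list[int] = []
terminated = False

def _sigterm_handler(signum, frame):
    global terminated
    terminated = True
    for pid in child_pids:
        try:
            os.kill(pid, signal.SIGTERM)  # forward the signal to each local worker
        except ProcessLookupError:
            pass  # worker already exited

signal.signal(signal.SIGTERM, _sigterm_handler)  # POSIX only

os.kill(os.getpid(), signal.SIGTERM)  # simulate the scheduler sending SIGTERM
while not terminated:  # the handler runs between bytecode instructions
    pass
```

Because the handler only forwards the signal to pids it spawned itself, this approach naturally stays intra-node; extending it across nodes would require a side channel such as the shared-filesystem markers from the original proposal.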
