The DDP strategy and its subclasses have a feature called "deadlock detection and process reconciliation". It can ensure that all processes terminate properly when an error occurs on a subset of the ranks. Without this feature, the processes where no errors occur would continue to run and hang/wait at the collectives.
Pro:
Can save you costs when running in the cloud, because hung processes no longer keep the machines busy.
No zombie processes that you have to kill manually.
Con:
The implementation is hardcoded into DDPStrategy and does not work well with inheritance.
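To make the mechanism concrete, here is a minimal sketch of one way such detection could work, assuming a shared directory that all ranks on a node can read and write. The `FailureMarkerDir` class and its methods are hypothetical names chosen for illustration; this is not Lightning's actual implementation.

```python
import tempfile
from pathlib import Path
from typing import List


class FailureMarkerDir:
    """Hypothetical helper: ranks record failures as marker files in a shared directory."""

    def __init__(self, sync_dir: Path, rank: int) -> None:
        self.sync_dir = Path(sync_dir)
        self.rank = rank
        self.sync_dir.mkdir(parents=True, exist_ok=True)

    def report_failure(self, message: str) -> None:
        # Called by the failing rank, e.g. from its exception handler.
        (self.sync_dir / f"rank_{self.rank}.failed").write_text(message)

    def failed_ranks(self) -> List[int]:
        # Any rank can poll this to learn whether a peer has failed.
        markers = self.sync_dir.glob("rank_*.failed")
        return sorted(int(p.stem.split("_")[1]) for p in markers)

    def should_abort(self) -> bool:
        # A healthy rank should exit rather than wait at a collective
        # that a failed peer will never join.
        return any(r != self.rank for r in self.failed_ranks())


# Simulate two ranks on one node sharing a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    rank0 = FailureMarkerDir(Path(tmp), rank=0)
    rank1 = FailureMarkerDir(Path(tmp), rank=1)
    before = rank0.should_abort()
    rank1.report_failure("simulated error on rank 1")
    after = rank0.should_abort()
    print(before, after)
```

In this sketch the healthy rank only learns about the failure when it polls, so a real implementation would need to check at well-defined points (or from a signal handler) before entering a collective.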
In #16525 I implemented a limited version of this proposal. It's intra-node only and only runs under the SIGTERM signal, but it could be extended to support inter-node, more signals, or any exception type.
It's only for PL as Fabric does not have any signal management at the moment.
Outline & Motivation
Pitch
Strategy.on_exception
that the exception handler can call in a standardized way.
A strategy can enable the plugin like so:
and by implementing
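A minimal sketch of how the pieces of this pitch might fit together is given here. Only `Strategy.on_exception` is named in the proposal; `ReconciliationPlugin`, `MyDDPStrategy`, `run`, and the toy `Strategy` base class are assumptions made for illustration and are not Lightning's API.

```python
from typing import Callable, List, Optional


class ReconciliationPlugin:
    """Hypothetical plugin that owns the reconciliation logic."""

    def __init__(self) -> None:
        self.handled: List[BaseException] = []

    def reconcile(self, exception: BaseException) -> None:
        # A real implementation would notify or terminate the other ranks here.
        self.handled.append(exception)


class Strategy:
    """Toy stand-in for a strategy base class (not Lightning's Strategy)."""

    reconciliation_plugin: Optional[ReconciliationPlugin] = None

    def on_exception(self, exception: BaseException) -> None:
        # Default hook: delegate to the plugin only if the strategy enabled one.
        if self.reconciliation_plugin is not None:
            self.reconciliation_plugin.reconcile(exception)


class MyDDPStrategy(Strategy):
    def __init__(self) -> None:
        # Opt in by attaching the plugin instead of hardcoding the logic.
        self.reconciliation_plugin = ReconciliationPlugin()


def run(strategy: Strategy, fn: Callable[[], None]) -> None:
    """Toy exception handler that calls the hook in one standardized place."""
    try:
        fn()
    except Exception as exc:
        strategy.on_exception(exc)
        raise


def failing_step() -> None:
    raise RuntimeError("simulated failure on this rank")


strategy = MyDDPStrategy()
try:
    run(strategy, failing_step)
except RuntimeError:
    pass
handled = strategy.reconciliation_plugin.handled
```

Because the hook lives on the base class and the logic lives in the plugin, a subclass that does not want reconciliation simply leaves the plugin unset, which is the inheritance problem the current hardcoded implementation runs into.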
Additional context
Credit for the ideas goes to @carmocca.
cc @justusschock @awaelchli @carmocca