-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up environment access in plugins #6941
Conversation
Hello @awaelchli! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-04-13 17:39:06 UTC |
Codecov Report
@@ Coverage Diff @@
## master #6941 +/- ##
========================================
- Coverage 92% 83% -9%
========================================
Files 194 194
Lines 12366 13132 +766
========================================
- Hits 11350 10897 -453
- Misses 1016 2235 +1219 |
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
|
||
@abstractmethod | ||
def global_rank(self) -> int: | ||
""" The rank (index) of the currently running process across all nodes and devices. """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be pass
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't make a difference. pass is only required when we have nothing under the function. here the docstring is already enough :)
…bugfix/elastic-world-size
pytorch_lightning/plugins/environments/lightning_environment.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carmocca regarding getter/setter, what's your recommendation. Doing it in this PR would touch the full interface.
Let's do it after this one
@@ -78,11 +78,11 @@ def __init__( | |||
self._ddp_kwargs = kwargs | |||
self._has_spawned_children = False | |||
self.task_idx = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this?
@@ -34,6 +35,8 @@ class LightningEnvironment(ClusterEnvironment): | |||
def __init__(self): | |||
super().__init__() | |||
self._master_port = None | |||
self._global_rank: int = 0 | |||
self._world_size: Optional[int] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can a world size be None? Isn't it 1 then? Also what happens if this is None? Do we check this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we test it in tests/plugins/environments/test_lightning_environments
you are right, it could be 1 by default now.
we never see None because at trainer init we immediately let the plugin overwrite it to an int.
# no-op, we are not allowed to change rank in SLURM | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better raise an Error here to make that more explicit?
Because if I call a function like this and it doesn't error I expect the new rank to be set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, but then we would have to make the call to the setter conditional in the training plugins.
For example, in DDPSpawnPlugin we use the setter to update the global rank once we have spawned the new processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @ananthsub
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can make add log an error here to warn users but not raise an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a warning would be fine. Might be a good time to introduce several warn / log level depending on risks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went for log.debug for now. I don't want to risk users seeing error messages pop up that they don't understand. The team wants to include this PR in in the patch release today. I could not find a smarter way to avoid the setter in the limited time.
def __init__(self): | ||
super().__init__() | ||
@staticmethod | ||
def is_using_torchelastic() -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we have something similar for slurm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have, it's in the accelerator connector. and I will do a follow up. didn't want to make the PR larger
def global_rank(self) -> int: | ||
return hvd.rank() | ||
|
||
@property | ||
def local_rank(self) -> int: | ||
return hvd.local_rank() | ||
|
||
@property | ||
def world_size(self) -> int: | ||
return hvd.size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be cleaner to also have a horovod environment plugin here? since for this part it's similar to torchelastic and should also be handled like that.
"SLURM_PROCID": "1", | ||
"SLURM_LOCALID": "1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Do you need to change these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these were artificial values, and they don't match with what we expect.
SLURM_NTASKS=2 but LOCAL_RANK 10 is not valid, and would actually lead to Lightning ignoring the SlurmEnvironment and instead select the LightningEnvironment (there is a comment in the code, we should document that behavior better, and write proper tests instead of these misleading ones that were here)
@@ -302,8 +302,8 @@ def root_gpu(self) -> Optional[int]: | |||
|
|||
@property | |||
def is_using_torchelastic(self) -> bool: | |||
te_flags_passed = "WORLD_SIZE" in os.environ and ("GROUP_RANK" in os.environ or "NODE_RANK" in os.environ) | |||
return te_flags_passed | |||
required_env_vars = ("RANK", "GROUP_RANK", "LOCAL_RANK") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about WORLD_SIZE ?
def world_size(self) -> int: | ||
return self._world_size | ||
|
||
def set_world_size(self, size: int) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need set_world_size. Can you just use a setter and make everything properties ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but this interface is old and all getters are methods from the beginning.
It would be better to have proper setters and getters. I decided not to keep backward compatibility and make the interface consistent. I propose to do a deprecation in a follow up to keep this PR managable
# no-op, we are not allowed to change rank in SLURM | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a warning would be fine. Might be a good time to introduce several warn / log level depending on risks.
Great, one more headache solved. A few items from reviews that I could not address fully will be tracked here: #6303 |
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com> Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>
What does this PR do?
Fixes #6853
Fixes
Credit: Proposed by @ananthsub
With this PR:
TODO:
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃