Fix: Allow hashing of metrics with lists in their state #5939

Merged
merged 20 commits into from
Feb 18, 2021

peblair
Contributor

@peblair peblair commented Feb 12, 2021

What does this PR do?

Fixes #5926. This PR allows metrics to contain lists within their internal state, so long as the contents of those lists are hashable.

Before submitting

  • Was this discussed/approved via a GitHub issue? AveragePrecision Broken on Master #5926
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary) (N/A)
  • Did you write any new necessary tests? (not for typos and docs) (Feel free to push back on this 🙂 )
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings) (bug not in released version, so omitted from CHANGELOG)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone match!

Did you have fun?

Make sure you had fun coding 🙃

Always

@codecov

codecov bot commented Feb 12, 2021

Codecov Report

Merging #5939 (be8abda) into master (fcfa7fa) will increase coverage by 0%.
The diff coverage is 89%.

@@           Coverage Diff           @@
##           master   #5939    +/-   ##
=======================================
  Coverage      90%     91%            
=======================================
  Files         170     160    -10     
  Lines       11784   11400   -384     
=======================================
- Hits        10664   10345   -319     
+ Misses       1120    1055    -65     

@SkafteNicki
Member

@peblair could you add a test to this file:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/metrics/test_metric.py
that checks that metrics are hashable, preferably both for a metric whose state is a tensor and for one whose state is a list?

@SkafteNicki SkafteNicki added the bug Something isn't working label Feb 12, 2021
Contributor

@SeanNaren SeanNaren left a comment

Ran into this issue as well, and this was my fix

@peblair
Contributor Author

peblair commented Feb 12, 2021

@SkafteNicki I went to add the test, but I came across something weird in the process. This was my original test:

def test_hash():

    class A(Dummy):
        pass

    class B(DummyList):
        pass

    a1 = A()
    a2 = A()
    assert hash(a1) == hash(a2)

    b1 = B()
    b2 = B()
    assert hash(b1) == hash(b2)
    assert isinstance(b1.x, list) and len(b1.x) == 0
    b1.x.append(torch.tensor(5))
    assert isinstance(hash(b1), int) # <- check that nothing crashes
    assert isinstance(b1.x, list) and len(b1.x) == 1

This actually fails on the first assert, though, since the hash() values of the two A instances' x states (both Torch tensors) are different. A quick trip to the CLI confirms this:

>>> import torch
>>> t1 = torch.tensor(0.0)
>>> t2 = torch.tensor(0.0)
>>> hash(t1) == hash(t2)
False

This is a bit counterintuitive to me, since it appears that the __eq__ is doing a structural equality check (meaning that their hashes should be the same so long as their contents are the same). I am 90+% sure that this is an even deeper bug in __hash__ (distinct from the one which I am trying to fix in this PR). How should we proceed?

@SeanNaren
Contributor

@peblair
Contributor Author

peblair commented Feb 12, 2021

@SeanNaren Understood. The issue is that __hash__ and __eq__ are intended to have a specific relationship; specifically, a == b => hash(a) == hash(b). This relationship is important to make data structures such as sets and hash maps behave correctly. This is currently being violated on master, though, since a == b is doing a structural equality check (i.e. do all of the fields on these metrics have the same contents?), whereas hash(a) == hash(b) is doing an identity equality check (i.e. are all of the fields on these metrics the same objects?).

One of these two implementations needs to be changed; I am unsure which, though, since I am unsure what the original intention was when the __eq__ and __hash__ implementations were written. Take the snippet in my above post, for example. Is the intention for a1 == a2 to be True? If so, we need to fix __hash__. If not, we need to fix __eq__.
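To make the contract concrete, here is a minimal sketch in plain Python. `Point` and `FixedPoint` are hypothetical classes invented for illustration (they are not part of Lightning or PyTorch); `Point` pairs a structural `__eq__` with an identity-based `__hash__`, which is the mismatch described above.

```python
class Point:
    """Hypothetical value type: structural equality, identity-based hash."""

    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        # Structural check: equal when the contents match.
        return isinstance(other, Point) and self.x == other.x

    def __hash__(self):
        # Identity-based hash: breaks `a == b => hash(a) == hash(b)`.
        return id(self)


class FixedPoint(Point):
    def __hash__(self):
        # Structural hash, consistent with the inherited __eq__.
        return hash((type(self).__name__, self.x))


a, b = Point(1), Point(1)
equal = a == b                                  # the contents match
found = b in {a}                                # the set looks up by hash first
fixed_found = FixedPoint(1) in {FixedPoint(1)}  # hash now agrees with __eq__
```

With the mismatched pair, `a == b` holds but the set lookup misses, because the two objects hash differently. Making either method follow the other's semantics restores consistent behaviour.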

@peblair
Contributor Author

peblair commented Feb 12, 2021

(To clarify, when I said "This is a bit counterintuitive to me", I specifically mean "This implementation of __hash__ is a bit counterintuitive to me;" apologies if that led to any confusion)

@SkafteNicki
Member

We should probably get @justusschock's opinion.
IMO two metrics should be equal if they are the same class and their metric states are equal.

@peblair
Contributor Author

peblair commented Feb 12, 2021

@SkafteNicki that is my intuition too. Let's see what he has to say, but, if that is the intention, I can propose this alternative implementation of __hash__ which fixes this PR's bug and the equality issue:

    def __hash__(self):
        hash_vals = [self.__class__.__name__]

        for key in self._defaults.keys():
            val = getattr(self, key)
            # Special case: allow list values, so long as their elements are hashable
            if isinstance(val, list):
                hash_vals.extend(val)
            else:
                hash_vals.append(val)

        # Torch tensors are hashable, but based on their underlying pointer.
        # Since we do a structural equality check below, we circumvent this
        # by hashing the _sum_ of tensors' contents
        hash_vals = [(float(x.detach().cpu().sum()) if isinstance(x, torch.Tensor) else x) for x in hash_vals]

        return hash(tuple(hash_vals))

There are performance implications for this, of course, but I think the cost is about as sensible as I'd expect from any other structurally-based hash.

@Borda Borda added this to the 1.2 milestone Feb 12, 2021
hash_vals.append(getattr(self, key))
val = getattr(self, key)
# Special case: allow list values, so long as their elements are hashable
if isinstance(val, list):
Member

are lists the only issue here? should we do it with sequences instead?

Contributor Author

This is a good point. I was erring on the side of caution, but perhaps it was too cautious. What about this condition instead?

if hasattr(val, '__iter__') and not isinstance(val, torch.Tensor):
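A rough sketch of how such a condition behaves, using plain Python values only. `flatten_state` is a hypothetical helper, not code from this PR. One thing worth noting is that strings also define `__iter__`, so this sketch treats `str` and `bytes` as atoms; in the real method the excluded type would be `torch.Tensor`, as in the condition above.

```python
def flatten_state(values):
    """Collect hashable items, expanding any non-string iterable one level deep."""
    flat = []
    for val in values:
        # Assumption of this sketch: strings and bytes are iterable but should
        # be hashed whole, so they are excluded from the expansion.
        if hasattr(val, "__iter__") and not isinstance(val, (str, bytes)):
            flat.extend(val)
        else:
            flat.append(val)
    return tuple(flat)


# Lists and tuples are expanded; scalars and strings are kept as they are.
key = flatten_state([1.0, [2, 3], (4,), "name"])
state_hash = hash(key)
```

The expansion is only one level deep, so a list of lists would still leave unhashable elements in the key.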

@justusschock
Member

justusschock commented Feb 12, 2021

@peblair I think your suggestion is quite good, but instead of

hash_vals = [(float(x.detach().cpu().sum()) if isinstance(x, torch.Tensor) else x) for x in hash_vals]

i'd probably use

hash_vals = [(tuple(x.detach().cpu().tolist()) if isinstance(x, torch.Tensor) else x) for x in hash_vals]

since tuples can be hashed, right?

Also

> since a == b is doing a structural equality check (i.e. do all of the fields on these metrics have the same contents?)

should not be true. Instead, we actually create a compositional metric there that checks, for each output computed for a and b, whether they are identical. So basically this is kind of a lazy way to check equality.

@peblair
Contributor Author

peblair commented Feb 12, 2021

> @peblair I think your suggestion is quite good, but instead of
>
> hash_vals = [(float(x.detach().cpu().sum()) if isinstance(x, torch.Tensor) else x) for x in hash_vals]
>
> i'd probably use
>
> hash_vals = [(tuple(x.detach().cpu().tolist()) if isinstance(x, torch.Tensor) else x) for x in hash_vals]
>
> since tuples can be hashed, right?

That is true, but this has the benefit of working for tensors of an arbitrary dimensionality. Moreover, I'd imagine that, under the hood, it's effectively the same amount of work to do hash(float(sum(tensor))) as doing hash(tuple(tensor)) (since both are O(n)...in fact, the summation avoids an extra traversal of the tensor's contents, so it may even be faster).
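The trade-off between the two keys can be sketched without PyTorch. Below, plain nested lists stand in for what `tensor.tolist()` returns for a 2-D tensor, and `to_hashable` and `total` are hypothetical helpers written only for this illustration. The sketch shows that a flat `tuple(...)` of a nested list is not hashable, that a recursive conversion is, and that a sum-based key maps different contents to the same value.

```python
def to_hashable(nested):
    """Recursively convert nested lists into nested tuples."""
    if isinstance(nested, list):
        return tuple(to_hashable(item) for item in nested)
    return nested


def total(nested):
    """Sum every scalar in a nested list, mimicking a sum-based hash key."""
    if isinstance(nested, list):
        return sum(total(item) for item in nested)
    return nested


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[4.0, 3.0], [2.0, 1.0]]

# A one-level tuple still contains lists, so hashing it raises TypeError.
try:
    hash(tuple(a))
    flat_tuple_hashable = True
except TypeError:
    flat_tuple_hashable = False

# Different contents, same sum: the sum-based keys collide.
sum_keys_collide = hash(float(total(a))) == hash(float(total(b)))

# The recursive tuple keys keep the two states apart.
tuple_keys_equal = to_hashable(a) == to_hashable(b)
```

A collision is not a correctness problem for a hash on its own (equal objects still hash equally), but it does mean the sum is a coarser key than the full contents.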

> Also
>
> > since a == b is doing a structural equality check (i.e. do all of the fields on these metrics have the same contents?)
>
> should not be true. Instead, we actually create a compositional metric there that checks, for each output computed for a and b, whether they are identical. So basically this is kind of a lazy way to check equality.

This muddies things a little bit 🙂 . The defined semantics of these methods are less clear in situations like this, but I would argue that we should err on the side of a structural hash here, since that appears to be a sensible choice (all(a == b) => hash(a) == hash(b) doesn't have quite the same ring to it, but it makes sense IMO). So, should I move forward with a version of the proposed solution in my last post?

@pep8speaks

pep8speaks commented Feb 12, 2021

Hello @peblair! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-18 09:26:41 UTC

@peblair
Contributor Author

peblair commented Feb 12, 2021

I have gone ahead and pushed the discussed changes. I am happy to discuss any further feedback, but hopefully this addresses everything. Thanks for being so responsive, everyone!

@peblair peblair requested a review from justusschock February 12, 2021 17:15
@Borda Borda added the ready PRs ready to be merged label Feb 13, 2021
Member

@SkafteNicki SkafteNicki left a comment

single comment, else LGTM

pytorch_lightning/metrics/metric.py (outdated, resolved)
Contributor

@tchaton tchaton left a comment

Looks good, small nits.

pytorch_lightning/metrics/metric.py (outdated, resolved)
@@ -154,6 +154,27 @@ def compute(self):
assert a.compute() == 5


def test_hash():
Contributor

Could this test be done directly on Metric objects?

Contributor Author

@tchaton I apologize, could you elaborate? The hashing is being done on two subclasses of Metric, so I am unsure what precisely you mean.

@Borda
Member

Borda commented Feb 15, 2021

@peblair mind checking the last remaining comments above?

@Borda Borda enabled auto-merge (squash) February 15, 2021 20:12
@peblair
Contributor Author

peblair commented Feb 16, 2021

I unfortunately discovered a major issue with this change today, which I think means that we need to revisit the semantics issue. You can see a reproduction on the BoringModule here, but here is the explanation: PyTorch's nn.Module's children() method uses a set under the hood to memoize the parameters on modules. This means that, if you have two metrics which hash the same and are __eq__ to one another, then torch.nn.Module.named_children() (and, by extension, .children()) will only return one of them. This is the case when the metrics' contents are the same, such as during initialization. You can see in the Colab link that this causes issues; because torch.nn.Module.to() delegates out to children(), if you have, say, two MSE metrics on a LightningModule, only one gets moved to the GPU.
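The failure mode can be illustrated with a small pure-Python sketch. Everything here is hypothetical: `StructuralMetric` is an invented stand-in for a metric with structural `__eq__` and `__hash__`, and `unique_children` is a simplified imitation of a set-memoized traversal, not PyTorch's actual implementation.

```python
class StructuralMetric:
    """Hypothetical metric whose __eq__ and __hash__ depend only on its state."""

    def __init__(self, total=0.0):
        self.total = total
        self.device = "cpu"

    def __eq__(self, other):
        return isinstance(other, StructuralMetric) and self.total == other.total

    def __hash__(self):
        return hash((type(self).__name__, self.total))


def unique_children(children):
    """Yield each child once, deduplicating with a set of already seen objects."""
    memo = set()
    for name, child in children.items():
        if child is not None and child not in memo:
            memo.add(child)
            yield name, child


# Two freshly initialized metrics have identical state, so they compare equal.
children = {"train_mse": StructuralMetric(), "val_mse": StructuralMetric()}

for _, child in unique_children(children):
    child.device = "cuda"  # stand-in for moving the child to another device

devices = {name: child.device for name, child in children.items()}
```

Because the second metric is considered already seen, it is skipped by the traversal and stays on its original device, which mirrors the behaviour reported in the comment.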

So, what to do? Here are some possible options:

  1. Revert to the original hashing semantics (plus the list fix which originally inspired this PR). I still think this is counterintuitive, but it won't crash.
  2. Same as no. 1, but also modify the semantics of __eq__ to better reflect the __hash__ semantics (e.g. doing an identity equality check on underlying tensors).
  3. Something else that someone more clever than I comes up with 🙂

cc @SkafteNicki @justusschock

@SkafteNicki
Member

Then I am in favour of option 1, because otherwise __eq__ will behave differently from the other logical operators (I was originally in favour of structural equality, but then I think it should be the same for all logical operators).
That said, we should probably clarify this in the documentation, as right now we only have an example of adding two metrics (which makes sense) but not one of a logical operator.

@peblair
Contributor Author

peblair commented Feb 16, 2021

Reflecting on this a bit more, I am wondering if the semantics of __eq__ should be changed. The fact that two tensors hash to different values is mostly a coincidence; I can imagine that this will lead to some confusing errors for people who implement their own metrics (imagine that you had implemented the MSE metric yourself and randomly only one of your metrics was being placed on the GPU)...at a minimum, I think the semantics should be clearly defined and documented.

@Borda Borda removed the ready PRs ready to be merged label Feb 16, 2021
@Borda
Member

Borda commented Feb 16, 2021

@peblair mind checking the failing test on GPU?

@justusschock
Member

I am also in favor of number one (mostly for the same reasons as @SkafteNicki).

However, we definitely need to fix the moving issue in a separate PR (not yet sure how though)

@SkafteNicki
Member

I think the way forward is to keep the implementation as it is now. It fixes the current bug without introducing a new one (the moving issue is only a problem if we try to change the hashing function to use structural similarity).
The tests will have to be changed to

a1 = A()
a2 = A()
assert hash(a1) != hash(a2)

so the tests document that we are not expecting two metric objects to be the same even if their content is the same (basically following pytorch standards).

We can do a follow up PR where we document this behaviour better (that evaluating if two metrics are equal does not check if it is the same metric, but if the output from their compute is the same).

@peblair
Contributor Author

peblair commented Feb 18, 2021

The problematic test is now fixed to be consistent with the new semantics, but I think that the situation is still precarious (on master, before this change). Because __eq__ returns a non-bool, this line passes if I add it to test_hash:

assert (b1 == b2) and (b1 != b2)

Now, that's fine, I suppose, except for the fact that I am 99% sure that torch.nn.Module.named_children() is now only not broken because different metric instances return different hashes, so they happen to be able to both be added to the underlying set() that named_children is using (since they have different hashes, they end up in different buckets in the hash set implementation CPython uses). I feel like this is asking for nondeterministic failures to crop up, since a bad pair of pointers could both hash (mod python_dict_size) to the same bucket. I'm wondering if the solution here (which is beyond the scope of this PR at this point) is to escalate to PyTorch. Maybe their memoization in named_children() should contain the id() of each object so as to only check for identity equality...
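Both points in this comment can be sketched in plain Python. `LazyComparison`, `LazyMetric` and `unique_by_identity` are hypothetical names made up for illustration; the first two imitate comparison operators that return a lazily evaluated object instead of a bool, and the last shows what an `id()`-based memo could look like. None of this is Lightning's or PyTorch's actual code.

```python
class LazyComparison:
    """Hypothetical stand-in for a lazily evaluated comparison result."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right


class LazyMetric:
    def __eq__(self, other):
        # Returns an object, not a bool; plain objects are truthy by default.
        return LazyComparison("eq", self, other)

    def __ne__(self, other):
        return LazyComparison("ne", self, other)

    # Defining __eq__ sets __hash__ to None unless it is restored explicitly.
    __hash__ = object.__hash__


b1, b2 = LazyMetric(), LazyMetric()
both_truthy = bool(b1 == b2) and bool(b1 != b2)


def unique_by_identity(objects):
    """Deduplicate by id(), so neither __eq__ nor __hash__ is consulted."""
    seen_ids = set()
    unique = []
    for obj in objects:
        if id(obj) not in seen_ids:
            seen_ids.add(id(obj))
            unique.append(obj)
    return unique


kept = unique_by_identity([b1, b2, b1])
```

Keying the memo on `id()` makes the deduplication independent of how a subclass defines equality, at the cost of never merging two distinct objects that happen to be equal.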

@Borda Borda merged commit 77f6aa4 into Lightning-AI:master Feb 18, 2021
kamil-kaczmarek pushed a commit to neptune-ai/pytorch-lightning that referenced this pull request Mar 30, 2021
* Fix: Allow hashing of metrics with lists in their state (#5939)

* Fix: Allow hashing of metrics with lists in their state

* Add test case and modify semantics of Metric __hash__ in order to be compatible with structural equality checks

* Fix pep8 style issue

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* et al. (#6050)

* et al.

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: chaton <thomas@grid.ai>

* [ModelPruning] Add missing attribute with use_global_unstructured=False and verbose (#6045)

* fix/test quant (#6040)

* fix/test quant

* ...

* ---

* Add descriptions to accelerator broadcast function/clean up all_gather (#6044)

* Add descriptions to accelerator broadcast function/clean up all_gather

* Remove todo

* Add before_batch_transfer and after_batch_transfer hooks (#3671)

* add hooks

* comment

* docs

* add tests

* make it private

* fix tests

* docs

* chlog

* testcode

* codefactor

* fix doctest

* fix doctest

* suggestions

* is always overriden

* pep and BoringModel

* BoringModel

* docs

* docs

* docs

* fix

* rebase

* rebase

* suggestions

* docs

* suggestions

* try fix docs

* docs

* update name

* yapf

* docs

* rebase

* yapf

* Make parallel devices optional across all plugins (#6051)

* Make parallel devices optional across all plugins so that they can be instantiated

* Add any to types to capture vars passed in

* clarify gpu / process (#6049)

* Fix docs typo (#6055)

Put .test() in  code blocks

* Docs for Pruning, Quantization, and SWA (#6041)

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Replace .get_model() with explicit .lightning_module (#6035)

* rename get_model -> lightning_module

* update references to get_model

* pep8

* add proper deprecation

* remove outdated _get_reference_model

* fix cyclic import

* rename accelerator_backend -> accelerator (#6034)

* rename accelerator backend

* rename new additions from master

* add proper deprecation

* pep8

* warning match

* add missing warning type

* fix flake8 for new plugins (#5951)

* flake8

* fix cyclic import

* isort

* fix docs links (#6057)

* Add warnings to on_before/after_batch_transfer hooks (#6059)

* Add warnings to hooks

* Add default idx to prevent signature change in the future

* Nothing to see here

* Add default val to transfer_batch_to_device hook

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Revert "Add default val to transfer_batch_to_device hook"

This reverts commit 5c6a68f2

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* v1.2.0rc2 (#6063)

* v1.2.0rc2

* chlogs

* chlogs

* format

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Update auto-opt docs (#6037)

* fix docs

* update on comments

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* rm comment

* Update docs/source/common/lightning_module.rst

Co-authored-by: chaton <thomas@grid.ai>

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>

* Raise AttributeError in lightning_getattr and lightning_setattr when attribute not found (#6024)

* Empty commit

* Raise AttributeError instead of ValueError

* Make functions private

* Update tests

* Add match string

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* lightning to Lightning

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* default sched (#6062)

* v1.2.0 (#6065)

* v1.2.0

* docs

* add Azure tags trigger (#6066)

* add Azure tags trigger

* fix

* mnodes

* pypi azure badges - tags (#6068)

* pypi azure badges - tags

* pep8

* id

* continue towards 1.3 (#6069)

* Fix amp autocast  (#6080)

* precision fixes

* add amp test model

* fix test

* revert

* move assert to training step

* fix test

* fix test

* remove unrelated changes

* add changelog

* remove unused import

* add sanity check on nb available GPUs (#6092)

* consistent behavior for reduce method across all Plugins (#6011)

* reduction docs

* docs for abstract base method

* make mean the default

* add preliminary chlog

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* [Hot Fix] Give priority to plugins to set distributed mode, and then accelerator (#6089)

* Give priority to plugins to set distributed mode, and then accelerator

* Add CHANGELOG.md

* Update CHANGELOG.md

* Remove very scary line

* Ensure we set cluster environment after slurm configured if necessary

* Simplify the fix with a reset

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Enable ZeRO tests for CI, fix to/half function calls (#6070)

* Enable ZeRO optimization, and make sure that the lightning module hook is called when we move to half precision

* Added test, update to function

* Expose DeepSpeed FP16 parameters due to loss instability (#6115)

* Expose deepspeed config parameters to init function due to instability in parameters

* See if tests can run on normal CI, without special tests

* Add changelog

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Collapse 2 DeepSpeed tests (#6108)

* fix amp/apex misconfiguration error for cpu (#6107)

* fix weird test

* fix apex plugin test

* fix raise

* cpu test

* fix type

* add changelog

* Update Contributing Guide (#6118)

* Update Contributing Guide

* update docs

* Minor fixes/improvements in Metric docs (#6114)

* Fix wrong render

* Improve classification metrics docs

* Improve other domain metrics docs

* Change the structure level in the docs

* Avoid printing ModelCheckpoint log with monitor=None and verbose=True (#6109)

* Feature/5275 clean progress bar print (#5470)

* Trainer.test should return only test metrics (#5214)

* resolve bug

* merge tests

* Fix metric state reset (#5273)

* Fix metric state reset

* Fix test

* Improve formatting

Co-authored-by: Ananya Harsh Jha <ananya@pytorchlightning.ai>

* print() method added to ProgressBar

* printing alongside progress bar added to LightningModule.print()

* LightningModule.print() method documentation updated

* ProgressBarBase.print() stub added

* stub

* add progress bar tests

* fix isort

* Progress Callback fixes

* test_metric.py duplicate DummyList removed

* PEP and isort fixes

* CHANGELOG updated

* test_progress_bar_print win linesep fix

* test_progress_bar.py remove whitespaces

* Update CHANGELOG.md

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Tadej Svetina <tadej.svetina@gmail.com>
Co-authored-by: Ananya Harsh Jha <ananya@pytorchlightning.ai>
Co-authored-by: Alexander Snorkin <Alexander.Snorkin@acronis.com>
Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* mini refactor for _running_stage access (#5724)

* running stage

* circular import

* running stage cleanup

* fix unused import

* fix running stage access

* add return type

* Revert "add return type"

This reverts commit 65b0fe269c6547213e34b6a88b97bee31cdfe8c7.

* try fix typing

* Add specifics around DeepSpeed docs (#6142)

* Be more specific with DeepSpeed compatibility

* Better wording

* Ensure accelerator is valid if running interactively (#5970)

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

* fixing miss-leading tested acc values (#5876)

* fixing tested values

* .

* tests

* yapf

* softmax

* hvd

* rename

* lr

* duplicate

* drop

* classif

* rm EvalModel

* Revert "rm EvalModel"

This reverts commit 6c3fb39ebe0c4bfb52357bccfd050438f2c0f31c.

* update tests

* fix

* azure

* azure

* self

* cpu

* Apply suggestions from code review

Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>

* Update CHANGELOG (#6156)

* prune deprecated profiler as bool (#6164)

* prune profiler

* chlog

* prune deprecated Trainer arg `enable_pl_optimizer` (#6163)

* prune enable_pl_optimizer

* prune automatic_optimization

* Prune deprecated metrics for 1.3 (#6161)

* prune deprecated metrics for 1.3

* isort / yapf

* [Bugfix] Fixed epoch level schedulers not being called when val_check_interval < 1.0 (#6075)

* fix bug

* fix tests

* changelog

* fix pep8

* fix tests

* fix and add some tests

* add test for rlop

* chlog

* Update CHANGELOG.md

Co-authored-by: rohitgr7 <rohitgr1998@gmail.com>

* Prune deprecated checkpoint arguments (#6162)

* prune prefix

* prune mode=auto

* chlog

* Prune deprecated EarlyStopping(mode='auto') (#6167)

Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Fix typo (#6178)

* Update issue template to use discussions for questions (#6155)

* add issue config

* remove question template

* update URL

* Update README.md

* Update README.md

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Update .github/ISSUE_TEMPLATE/config.yml

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Update with GitHub Discussions (#6186)

* Update gpu warning (#6181)

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Kaushik Bokka <kaushikbokka@gmail.com>

* type accelerators (#6148)

* Fix for multiple callbacks (#6197)

* Fix for multiple callbacks

* Add CHANGELOG.md

* Remove old params

* Skip tests on windows using ddp

* Change name of the variable to not clash with should stop, which is separate

* Apply suggestions from code review

* Fix params

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Add checkpoint parameter to on_save_checkpoint (#6072)

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Document exceptions in loggers (#6171)

* Document exceptions in loggers

* minor formatting

* docstring changed in comet.py

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Prune deprecated Trainer(checkpoint_callback=ModelCheckpoint()) (#6166)

* fix parallel devices return type & add copyright (#6215)

* Add mypy typing to precision plugins. (#6149)

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* apply_func.py: from torchtext.legacy.data import Batch (#6211)

* Update apply_func.py

The name Batch is no longer located under torchtext.data
--Error message--
File "/home/daniel/py38/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 25, in <module>
    from torchtext.data import Batch
ImportError: cannot import name 'Batch' from 'torchtext.data' (/home/daniel/py38/lib/python3.8/site-packages/torchtext/data/__init__.py)
You can fix this by changing line 28 to:
    from torchtext.legacy.data import Batch

* Update apply_func.py

* Update apply_func.py

* Update apply_func.py

* Update apply_func.py

* Update apply_func.py
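The import change described in #6211 can be illustrated with a fallback import. This is only a minimal sketch, not the code that landed in `apply_func.py`: the helper name `import_batch` is hypothetical, and it assumes that `Batch` lives under `torchtext.legacy.data` in newer torchtext releases and under `torchtext.data` in older ones, as the error message above suggests.

```python
import importlib


def import_batch():
    """Return torchtext's Batch class, or None if it cannot be imported.

    Hypothetical helper: tries the newer location first, then the old one.
    """
    for module_name in ("torchtext.legacy.data", "torchtext.data"):
        try:
            module = importlib.import_module(module_name)
        except (ImportError, OSError):
            # torchtext is missing, too old/new for this path, or broken
            continue
        batch_cls = getattr(module, "Batch", None)
        if batch_cls is not None:
            return batch_cls
    return None


# Batch is a class when torchtext provides it, otherwise None
Batch = import_batch()
```

Callers would then need to check `Batch is not None` before using it in `isinstance` checks.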

* fix(wandb): prevent WandbLogger from dropping values (#5931)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Prune deprecated hparams setter (#6207)

* document exceptions for metrics/regression (#6202)

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Prajakta Phadke <pphadke@iu.edu>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* simplify skip-if tests >> 0/n (#5920)

* skipif + yapf + isort

* tests

* docs

* pp

* update (#6237)

* Document Exceptions in profilers (#6229)

* docstring changes in profilers

* minor changes in profilers.py

* Call `optimizer.zero_grad()` before backward inside closure in AutoOpt (#6147)

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

* Fix for incorrect usage of detach(), cpu(), to() (#6216)

* Fix for incorrect detach/cpu calls (#6214)

* Fix incorrect use of detach(), to(), and cpu(), #6214

* Fix incorrect use of detach() and cpu(), #6214

* update pr

* add typing

* chlog

* more...

* revert on module

* update on comments

* revert changes on model

Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

* add skipif wrapper (#6258)

* cleaning SWA (#6259)

* rename

* if

* test

* chlog

* Remove opt from manual_backward in docs (#6267)

* switch agents pool (#6270)

* docstring changes in tuner (#6264)

* docstring changes in tuner

* added full stop

* Disable CPU Offload as default for DeepSpeed (#6262)

* Change default for CPU offload to false for best throughput/memory efficiency

* Add changelog

* default

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* split profilers (#6261)

* Refactor: skipif for multi - gpus 1/n (#6266)

* ngpus

* gpu

* isort

* pt

* flake8

* Improved EarlyStopping.patience documentation (#6278)

* Improved early stopping documentation

* Changed to 120 column format

* doc

* doc

* doc

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

* Refactor: skipif for Windows 2/n (#6268)

* win

* isort

* flake8

* fix duplicate console logging bug v2 (#6275)

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Refactor: skipif for AMPs 3/n (#6293)

* args

* native

* apex

* isort

* [fix] Ensure we check deepspeed/sharded in multinode DDP (#6297)

* Ensure we check deepspeed/sharded in multinode

* Add CHANGELOG.md

* Add CHANGELOG.md

* Drop mock, use actual multi-gpu node

* unfreeze torchtext version (#6302)

* Add possibility for custom naming when using multiple dataloaders (#6274)

* try to fix imports for parsing (#6256)

* try to fix imports

* legacy 1.2.1

* Refactor: Runif for TPU and Horovod 5/n (#6301)

* TPU

* horovod

* extra

* fix

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* doc

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Refactor: runif for spec 6/6 (#6307)

* special

* rpc

* Add fairscale & deepspeed to skipif 4/n (#6281)

* add fairscale & windows to skipif

* add deepspeed to runif

* fairscale

* deepspeed

* flake8

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

* [bugfix] TPU test hangs to barrier on 1 process (#6272)

* update

* resolve flake8

* update

* update

* update changelog

* update

* resolve flake8

Co-authored-by: Your Name <you@example.com>

* prune duplicate test in optim (#6312)

* Simplify test for AMP plugins (#6311)

* AMP

* fuse

* yapf

* Fix ModelPruning(make_pruning_permanent=True) buffers getting removed when saved during training (#6073)

Co-authored-by: chaton <thomas@grid.ai>

* [bugfix] TPU + all_gather + SingleTPU shouldn't call xm.all_gather (#6296)

* resolve an issue with TPU

* update

* add changelog

* drop unused variable in API (#6308)

* drop unused pl model in ckpt

* irrelevant

* on_evaluation_batch_start

* evaluation_epoch_end

* attach_datamodule

* hotfix for PT1.6 and torchtext (#6323)

* ci: azure reinstall torchtext

* move

* todos

* 0.6.0

* skip examples

* formatter

* skip

* todo

* Apply suggestions from code review

* [fix] Use training type plugin hook when saving (FSDP 1/n) (#6321)

* Rely on training type plugin when saving

* Add better typing to training type plugin

* leaving lezwon (#6347)

* Add `tests/utilities/test_parsing.py` (#4460)

* Create branch tests/4400_parsing

* Rename test file for parsing.py

* Fix lightning_hasattr

* Fix lightning_hasattr

* Fix lightning_setattr

* Add empty lines and remove rubbish spaces

* Raise AttributeError not ValueError

* Use getattr in hasattr

* Remove rubbish spaces

* Fix getattr

* Fix by flake8

* Add tests for str_to_bool_or_str

* Fix by flake8

* Add tests for str_to_bool

* Add tests for is_picklable

* Add tests for clean_namespace

* Fix typo

* Fix lightning_getattr

* Add tests for AttributeDict

* Add tests for flatten_dict

* Fix by flake8

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Apply isort

* Revert "Apply suggestions from code review"

* Define unpicklable_function outside

* Add comment to test_clean_namespace

* Add tests for parse_class_init_keys

* Add tests for get_init_args and collect_init_args

* Share objects across the tests

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Ethan Harris <ewah1g13@soton.ac.uk>

* Add ignore param to save_hyperparameters (#6056)

* add ignore param to save_hyperparameters

* add docstring for ignore

* add type for frame object

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* fix whitespace

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Parametrize tests

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Update pytorch_lightning/core/lightning.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* seq

* fix docs

* Update lightning.py

* Update lightning.py

* fix docs errors

* add example keyword

* update docstring

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Fix when _stable_1d_sort to work when n >= N (#6177)

* Fix when _stable_1d_sort to work when n >= N

* Apply suggestions

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

* Update docs on arg train_dataloader in fit (#6076)

* add to docs

* update docs

* Apply suggestions from code review

* Update pytorch_lightning/core/hooks.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* nested loaders

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* shorten text length

* Update pytorch_lightning/core/hooks.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* missing tests default_root_dir=tmpdir (#6314)

* default_root_dir=tmpdir

* miss

* Document exception for metrics/classification (#6190)

* document exception for metrics/classification

* minor formatting fixes

* fix trailing whitespaces

* document exception for metrics

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* [Fix] Call clip gradients if clip val greater than 0 (#6330)

* Call clip gradients if clip val greater than 0

* format

* Format

* Move to top of file

* [bugfix] Check LightningOptimizer doesn't delete optimizer hooks (#6305)

* update

* resolve bug

* docstring changes in accelerators (#6327)

* docstring changes in accelerators

* docstrings moved

* whitespaces removed

* PEP8 correction[1]

* [bugfix] Perform reduction for dict in training_step and DP (#6324)

* fix

* update

* update

* add changelog

* Update CHANGELOG.md

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update tests/accelerators/test_dp.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* update changelog

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* introduce default cluster environment for lightning-specific ddp (#5915)

* handle distributed_sampler_kwargs

* move emptying cache to accelerator

* fix a few tests

* restoring the result from subprocess

* fix queue.get() order for results

* add missing "block_backward_sync" context manager

* add missing "block_backward_sync" context manager

* fix sync_batchnorm

* fix supported gpu-ids for tuple

* fix clip gradients and inf recursion

* accelerator selection: added cluster_environment plugin

* fix torchelastic test

* fix reduce early stopping decision for DDP

* fix tests: callbacks, conversion to lightning optimizer

* fix lightning optimizer does not pickle

* fix setting benchmark and deterministic option

* fix slurm amp test

* fix prepare_data test and determine node_rank

* fix retrieving last path when testing

* remove obsolete plugin argument

* fix test: test_trainer_config

* fix torchscript tests

* fix trainer.model access

* move properties

* fix test_transfer_batch_hook

* fix auto_select_gpus

* fix omegaconf test

* fix test that needs to simulate slurm ddp

* add horovod plugin

* fix test with named arguments

* clean up whitespace

* fix datamodules test

* remove old accelerators

* fix naming

* move old plugins

* move to plugins

* create precision subpackage

* create training_type subpackage

* fix all new import errors

* fix wrong arguments order passed to test

* fix LR finder

* Added sharded training type and amp plugin

* Move clip grad to precision plugin

* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically

* Fix import issue, attempting to fix tests

* Fix initial test

* Reflect hook logic from master, should wrap model after move to device

* Optional state consolidation, since master has optimizers not wrapped

* change attribute for instance test

* reset optimizers

optimizers are not used in main process, so state would be wrong.

* legacy

* imports in accel

* legacy2

* trainer imports

* fix import errors after rebase

* move hook to new setup location

* provide unwrapping logic

* fix trainer callback system

* added ddp2 implementation

* fix imports .legacy

* move plugins

* restore legacy

* drop test.py from root

* add tpu accelerator and plugins

* fixes

* fix lightning optimizer merge

* reset bugreportmodel

* unwrapping

* step routing forward

* model access

* unwrap

* opt

* integrate distrib_type

* sync changes

* sync

* fixes

* add forgotten generators

* add missing logic

* update

* import

* missed imports

* import fixes

* isort

* mv f

* changelog

* format

* move helper to parallel plugin

* d

* add world size

* clean up

* duplicate

* activate ddp_sharded and tpu

* set nvidia flags

* remove unused colab var

* use_tpu <-> on_tpu attrs

* make some ddp_cpu and clusterplugin tests pass

* Ref/accelerator connector (#5742)

* final cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* connector cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* trainer cleanup

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* accelerator cleanup + missing logic in accelerator connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add missing changes to callbacks

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* reflect accelerator changes to lightning module

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* clean cluster envs

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* cleanup plugins

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* add broadcasting

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* yapf

* remove plugin connector

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* plugins

* manual optimization

* update optimizer routing

* add rank to torchelastic

* fix memory mixed precision

* setstate on trainer for pickling in ddp spawn

* add predict method

* add back commented accelerator code

* adapt test for sync_batch_norm to new plugin

* fix deprecated tests

* fix ddp cpu choice when no num_processes are given

* yapf format

* skip a memory test that cannot pass anymore

* fix pickle error in spawn plugin

* x

* avoid

* x

* fix cyclic import in docs build

* add support for sharded

* update typing

* add sharded and sharded_spawn to distributed types

* make unwrap model default

* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel

* update sharded spawn to reflect changes

* update sharded to reflect changes

* Merge 1.1.5 changes

* fix merge

* fix merge

* yapf isort

* fix merge

* yapf isort

* fix indentation in test

* copy over reinit scheduler implementation from dev1.2

* fix apex tracking calls with dev_debugger

* reduce diff to dev1.2, clean up

* fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu

* sort plugin tests legacy/new

* fix error handling for amp on cpu

* fix merge

* [Feat] Resolve manual_backward (#5837)

* resolve manual_backward

* resolve flake8

* update

* resolve for ddp_spawn

* resolve flake8

* resolve flake8

* resolve flake8

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* fix tests/accelerator tests on cpu

* [BugFix] Resolve manual optimization (#5852)

* resolve manual_optimization

* update

* update

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (#5856)

* resolve a bug

* Accelerator refactor sharded rpc (#5854)

* rpc branch

* merge

* update handling of rpc

* make devices etc. Optional in RPC

* set devices etc. later if necessary

* remove devices from sequential

* make devices optional in rpc

* fix import

* uncomment everything

* fix cluster selection

Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>

* resolve bug

* fix assert in rpc test

* resolve a test

* fix docs compilation

* accelerator refactor - fix for sharded parity test (#5866)

* fix memory issue with ddp_spawn

* x

* x

* Remove DDP2 as this does not apply

* Add missing pre optimizer hook to ensure lambda closure is called

* fix apex docstring

* [accelerator][BugFix] Resolve some test for 1 gpu (#5863)

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* update

* resolve flake8

* update

* update

* update

* update

* update

* all_gather

* update

* make plugins work, add misconfig for RPC

* update

* update

* remove breaking test

* resolve some tests

* resolve flake8

* revert to ddp_spawn

Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>

* yapf isort

* resolve flake8

* fix apex doctests

* fix apex doctests 2

* resolve docs

* update drone

* clean env

* update

* update

* update

* update

* merge

* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (#5881)

* Fix RPC related tests, clean out old API, update for new accelerator API

* Move tests out of legacy folder, update paths and names

* Update test_remove_1-4.py

* Expose properties for tpu cores/gpus/num_gpus

* Add root GPU property

* Move properties to properties.py

* move tests that were previously in drone

* Fix root GPU property (#5908)

* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator

* Add missing tests back

* fix best model path transfer when no checkpoint callback available

* Fix setup hook order [wip] (#5858)

* Call trainer setup hook before accelerator setup

* Add test case

* add new test

* typo

* fix callback order in test

Co-authored-by: tchaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* rename ddp sequential -> rpc sequential for special test

* revert

* fix stupid merge problem

* abstract the cluster plugins

* default plugin

* integrate default environment

* fix property

* adapt tests

* adjust test

* fix world size access

* base cluster env

* revert rebase errors

* revert rebase errors

* missing import

* revert unrelated change

* remove unused cluster local rank

* remove unrelated changes

* fix unrelated changes

* fix pep8

* remove unused var

* reset permissions

* yapf

* test default environment

* test torchelastic environment

* world size as int

* tests for slurm environment

* changelog

* test comments

* remove unintended change

* keep master port fixed after it is generated

* test random master port

* yapf

* add missing default environment

* move helper function

* rename default environment

* rename

* rename

* yapf

* Update pytorch_lightning/plugins/environments/lightning_environment.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* spawn -> create

Co-authored-by: justusschock <justus.schock@posteo.de>
Co-authored-by: SeanNaren <sean@grid.ai>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Justus Schock <justus.schock@rwth-aachen.de>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-88-60.ec2.internal>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: root <root@ip-172-31-88-60.ec2.internal>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* [bugfix] Resolve memory leak for evaluation (#6326)

* resolve bug

* resolve flake8

* revert name

* Update changelog for v1.2.2 (#6325)

* update changelog for v1.2.2

* ckpr 1.2.2

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

* CI: fix examples - patch download MNIST (#6357)

* patch download

* CI

* isort

* extra

* [bug] Fix Pytorch profiler with emit_nvtx (#6260)

* resolve bug

* update changelog

* Update tests/trainer/test_trainer.py

* Update pytorch_lightning/profiler/profilers.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* resolve comments

* resolve flake8

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* fix importing torchtext batch (#6365)

* copy torchtext batch

* update

* rev

* rev

* give a more complete GAN example (#6294)

* Refactor RunningStage usage in advance of implementing Trainer.validate() (#4945)

* Update code

Co-authored-by: EliaCereda

* More property updates

* Move properties. Introduce trainer._fitting

* Use trainer.fitting

* Fix reset dataloaders

* Unused code

* RunningStage.SANITY_CHECKING

* Use setters

* Fix bugs

* Fix bugs

* TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}

* Fix bugs

* Fix bugs

* Fix tests

* Update CHANGELOG. Add deprecation warning. Fix tests

* Unused imports

* Optional trainer

* More deprecation. More refactoring

* Correct version

* Use properties

* Address comments

* flake8

* Missed renamings

* Typo

* is -> ==

It is recommended to use `is` for Enums since they are singletons. However, since LightningEnum subclasses str, that is not a good idea in case a user sets the state/stage with a str.
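The difference between identity and equality for a str-based enum can be shown with the standard library alone. This is a minimal sketch: `Stage` is a hypothetical stand-in for an enum that subclasses str, not the actual LightningEnum or RunningStage definition.

```python
from enum import Enum


class Stage(str, Enum):
    """Hypothetical str-based enum, standing in for a LightningEnum subclass."""

    FITTING = "fit"
    TESTING = "test"


# A user sets the stage with a plain str instead of the enum member
current = "fit"

# Identity fails: the str object is not the enum singleton
identity_match = current is Stage.FITTING

# Equality succeeds: the member compares equal to its str value
equality_match = current == Stage.FITTING

print(identity_match, equality_match)  # prints: False True
```

With `==`, both `Stage.FITTING` and the raw string `"fit"` are accepted, which is the behavior this commit describes.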

* Also for tests

* Typo

* Address @tchaton's comments

* PEP8

* Correct property

* Update CHANGELOG

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update pytorch_lightning/trainer/trainer.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Remove called sanity check

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* require: adjust versions (#6363)

* adjust versions

* release

* manifest

* pep8

* CI

* fix

* build

* Use f-"""-string in a Trainer comment (#6377)

* Use f-"""-string

* Add r

* Use Trainer.

* r -> noqa: W605

* Remove no return warning from val/test step (#6139)

* remove warning

* auto_opt

* chlog

* auto_opt

* no_warning_call

* rm old code

* add warning for predict

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix manual optimization in pl_example (#6373)

* Fix automatic_optimization

* Fix automatic_optimization

* Uncomment fairscale

* Update Sharded test with RunIf (#6384)

* Remove optimizer_idx arg in manual optimization (#6093)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>

* [doc] Improve Multiple Val/Test Dataloaders with simultaneous batches option (#6320)

* improve doc to describe how to combine batches of multiple test and val dataloaders simultaneously

* fix typo

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* use paramref

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* [doc] Fix closure in manual optimization (#6374)

* Fix manual optimization docs

* Fix typo. Thanks @import-antigravity

* Fix ModelCheckpoint(monitor=None, save_last=True) not saving checkpoints (#6136)

Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* Update TBLogger docs (#6315)

* Update tensorboard.py

* Update logging.rst

* pep8

* Update logging.rst

* Update logging.rst

* Apply suggestions from code review

* add code sample

* Update logging.rst

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Fix trainer not resetting lightning_optimizers (#6372)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* update python version (#6399)

* Fix AttributeError: 'NoneType' object has no attribute 'finalize'  on TPU (#6221)

* Fix bug

Fix AttributeError: 'NoneType' object has no attribute 'finalize'

* Update CHANGELOG.md

* deleted a period

* Update CHANGELOG.md

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* Update CHANGELOG.md

* Update pytorch_lightning/plugins/training_type/tpu_spawn.py

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Run CI (#6402)

* Pass {fit,validate,test,predict} to setup() and teardown() (#6386)

* fix dp reduction test (#6404)

* fix

* update

* fix

* move the class outside

* Add check for verbose attribute of ModelCheckpoint (#6419)

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* fixed bug where tuner would not tune lr if also tuning batch_size (#4688)

* fixed bug where tuner would not tune lr if also tuning batch_size

* added a '+1' to computing the smoothed loss. This maintains the behavior for the smoothed loss as before the bug fix

* pep8 fix

* add changelog

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update (#6403)

* fix logger creating directory structure too early in DDP (#6380)

* fix

* add simple test

* fix imports

* add changelog

* tighter test with on_fit_start hook closer to the dispatch call

* move class inside test function

* add a comment

* Typing for tests 1/n (#6313)

* typing

* yapf

* typing

* [changelog] Update Changelog on release v1.2.3 (#6444)

* update changelog

* legacy 1.2.3

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

* Improve DummyLogger (#6398)

* fix dummy logger

* docs

* update docs

* add changelog

* add none return annotation

* return empty string for name, version

* Raise an exception if check_val_every_n_epoch is not an integer (#6411)

* raise an exception if check_val_every_n_epoch is not an integer

* remove unused object

* add type hints

* add return type

* update exception message

* update exception message

* Set find unused parameters to True by default to fix breaking compatibility (#6438)

* Set find unused parameters to True by default to fix breaking models, add suggestion to re-enable

* Add changelog

* [bug] All_gather support tensor on cpu (#6416)

* add test

* update changelog

* update

* rename function

* [Fix] Ensure we set the default device before initializing deepspeed (#6460)

* Ensure we set the default device before initializing deepspeed

* Add CHANGELOG.md

* Update pytorch_lightning/plugins/training_type/deepspeed.py

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

Co-authored-by: Kaushik B <45285388+kaushikb11@users.noreply.github.com>

* Remove redundant test (#6466)

* Add Trainer.validate(…) method to run one validation epoch (#4948)

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Allow user to disable the automatic formatting of checkpoint file names. (#6277)

* cleaning SWA (#6259)

* rename

* if

* test

* chlog

* Remove opt from manual_backward in docs (#6267)

* switch agents pool (#6270)

* Allow user to disable the automatic formatting of checkpoint file names.

* Added changelog entry.

* Made flake8 happy.

* Applied review suggestion: quotes for special characters in docstring

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Fixed example in docstring.

* Fixed syntax error in docstring.

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Hotfix for torchvision (#6476)

* cover subproc coverage (#6477)

* argparse: Add use_argument_group=True (#6088)

* argparse: Add inplace option

Replicate in GAN model

* datamodule: Deduplicate logic w/ argparser utilities

* Update pl_examples/domain_templates/generative_adversarial_net.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* Keep docstrings

* Correct name

* Whitespace

* Consistency

* fix weird type stuff

* try alt - use_argument_group

* fix syntax + lint

* fix ci errs

* fix ci

* change examples... still failing w/ "unrecognized arguments: --batch_size"

* address review

* mnist_datamodule: add some docstrings

* argparse: check cls or cls.__init__ for param

didn't capture issue, but meh

* fix lint

* fix no-doc edge case

* address review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

* Disable batch transfer in DP mode (#6098)

* add exceptions and test

* hook

* fix

* clean up

* clean up

* regex

* regex

* docs

* rev

* comment and docs

* chlog

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Apply suggestions from code review

Co-authored-by: chaton <thomas@grid.ai>

* Monkey-patch device count

* docs

* pep

* api_change

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>

* remove obsolete todo in pl_examples (#6475)

* [feat] Support iteration-based checkpointing in model checkpoint callback (#6146)

* Update model_checkpoint.py

* add tests

* Update model_checkpoint.py

* Update test_model_checkpoint.py

* fix tests

* every_n_batches

* Update test_model_checkpoint.py

* defaults

* rm tests

* Update model_checkpoint.py

* Update test_model_checkpoint.py

* Prune deprecated metrics for 1.3 (#6161)

* prune deprecated metrics for 1.3

* isort / yapf

* Update model_checkpoint.py

* add tests

* defaults

* Update CHANGELOG.md

* pre-commit

* Update model_checkpoint.py

* update defaults

* Update test_remove_1-5.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* fix tests

* Update test_model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update test_model_checkpoint.py

* ckpt-callback

* Update test_model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* validation-end

* Update model_checkpoint.py

* Update test_model_checkpoint.py

* Update test_model_checkpoint.py

* Update test_model_checkpoint.py

* Update test_model_checkpoint.py

* clarify-names

- Make names explicit as to which hooks they apply to
- Use step instead of batch for consistency with global step

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* Update model_checkpoint.py

* mutual-exclusive

Make every_n_train_steps and every_n_val_epochs mutually exclusive

* fix-default-0

* Update CHANGELOG.md

* formatting

* make-private

make attributes private to the class

* rebase

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update xla version (#6464)

* Remove unused mixin attributes (#6487)

* Remove unused mixin attributes

* Missing import

* [doc] Update the order of zero_grad and backward (#6478)

* Fix zero_grad in docs

* Fix zero_grad in docs

* Fix tuner.scale_batch_size not finding batch size attribute when using datamodule (#5968)

* Update docs for limit_predict_batches (#6507)

* add docs and minor updates

* docs

* fraction

* [bug] Update broadcast + reduce decision [ModelCheckpoint] (#6410)

* resolve bug

* update

* update changelog

* update PR

* Update pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* add todo

* resolve issues

* resolve flake8

* update

* add coverage for reduce

* wip

* restore back to broadcast

* remove test.py

* resolve flake8

* update

* check world size

* resolve test

* update

* use pytorch version when defined

* update on comments

* update on comments

* flake8

* resolve bugs

* Update CHANGELOG.md

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* update

* update

* update

* update

* remove test

* update

* resolve flake8

* update

* update

* update

* proxy

* update

* update

* resolve typo

* prune

* update parallel

* update

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Handle torch.jit scripted modules in layer summary (#6511)

* CI: resume testing with py3.8 (#6516)

* testing on python 3.8

* req

* document exceptions for metrics/functional (#6273)

* document exceptions for metrics/functional

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>

* Mean Average Precision metric for Information Retrieval (1/5) (#5032)

* init information retrieval metrics

* changed retrieval metrics names, expanded arguments and fixed typo

* added 'Retrieval' prefix to metrics and fixed conflict with already-present 'average_precision' file

* improved code formatting

* pep8 code compatibility

* features/implemented new Mean Average Precision metrics for Information Retrieval + doc

* fixed pep8 compatibility

* removed threshold parameter and fixed typo on types in RetrievalMAP and improved doc

* improved doc, put first class-specific args in RetrievalMetric and transformed RetrievalMetric in abstract class

* implemented tests for functional and class metric. fixed typo when input tensors are empty or when all targets are False

* fixed typos in doc and changed torch.true_divide to torch.div

* fixed typos pep8 compatibility

* fixed types in long division in ir_average_precision and example in mean_average_precision

* RetrievalMetric states are not lists and _metric method accepts predictions and targets for easier extension

* updated CHANGELOG file

* added '# noqa: F401' flag to not used imports

* added double space before '# noqa: F401' flag

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* change get_mini_groups in get_group_indexes

* added checks on target inputs

* minor refactoring for code cleanness

* split tests over exception raising in separate function && refactored test code into multiple functions

* fixed pep8 compatibility

* implemented suggestions of @SkafteNicki

* fixed imports for isort and added type annotations to functions in test_map.py

* isort on test_map and fixed typing

* isort on retrieval and on __init__.py and utils.py in metrics package

* fixed typo in pytorch_lightning/metrics/__init__.py regarding code style

* fixed yapf compatibility

* fixed yapf compatibility

* fixed typo in doc

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* CI: Azure publish results (#6514)

* deprecate metrics pkg (#6505)

* deprecate metrics

* examples

* req

* docs

* Apply suggestions from code review

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* pep8

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* [test] lr_find with bs_scale (#6422)

* init test: test_lr_find_with_bs_scale

* Update test_lr_finder.py

* remove gpu req

* try boring model

* custom boring model

* pep8

* fix typo

* Update test_lr_finder.py

* typo

* typo

* Update DeepSpeed docs (#6528)

* Clean up docs and add some explicitness around stages

* Apply suggestions from code review

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* fix attribute access in LightningModule.toggle_optimizer (#6513)

* Update hook lifecycle (#6538)

* Update hook lifecycle

* Update docs/source/common/lightning_module.rst

* Prune metrics base classes 2/n (#6530)

* base class

* extensions

* chlog

* _stable_1d_sort

* _check_same_shape

* _input_format_classification_one_hot

* utils

* to_onehot

* select_topk

* to_categorical

* get_num_classes

* reduce

* class_reduce

* tests

* Custom Plugin is_distributed (#6537)

* return from plugin

* dont return for tpu

* refactor reading env defaults (#6510)

* change tests

* fix

* test

* _defaults_from_env_vars

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Prune metric: helpers and inputs 3/n (#6547)

* _basic_input_validation

* _check_shape_and_type_consistency

* _check_num_classes_binary

* _check_num_classes_mc

* _check_num_classes_ml

* _check_top_k

* _check_classification_inputs

* _input_format_classification

* _reduce_stat_scores

* DataType

* rest

* flake8

* chlog

* prune warning & deprecation wrapper (#6540)

* docs

* wrapper

* test

* count

* flake8

* Add outputs param for `on_val/test_epoch_end` hooks (#6120)

* add outputs param for on_val/test_epoch_end hooks

* update changelog

* fix warning message

* add custom call hook

* cache logged metrics

* add args to docstrings

* use warning cache

* add utility method for param in sig check

* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update docstring

* add test for eval epoch end hook

* add types and replace model ref

* add deprecation test

* fix test fx name

* add model hooks warning

* add old signature model to tests

* add clear warning cache

* support args param

* update tests

* add tests for model hooks

* code suggestions

* add signature utils

* fix pep8 issues

* fix pep8 issues

* fix outputs issue

* fix tests

* code fixes

* fix validate test

* test

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* [doc] Add Zero Grad `set_to_none=True` trick (#6548)

* add trick to doc

* update

* update path

* Update docs/source/benchmarking/performance.rst

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>

* fix deprecation wrapper & tests (#6553)

* fix deprecation wrapper & tests

* flake8

* prune metric: accuracy 4/n (#6515)

* prune accuracy

* chlog

* flake8

* Apply suggestions from code review

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* wrap

* test

* test

* fix

Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

* Prune metrics: AUC & AUROC (#6572)

* class: AUC AUROC

* func: auc auroc

* format

* tests

* [doc] Update Dict Train Loader doc. (#6579)

* update doc

* update example

* Prune metrics: precision & recall 6/n (#6573)

* avg precision

* precision

* recall

* curve

* tests

* chlog

* isort

* fix

* Update Changelog for v1.2.4 (#6581)

* Update changelog for v1.2.4

* legacy v1.2.4

* prune duplicates from changelog

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>

* [Fix] Move init dist connection into the setup function (#6506)

* Move connection setup into the setup function. Call setup hook after we set up the accelerator

* Added CHANGELOG.md

* fix setup order in callback test

* fix input arguments in test

* Mock distributed function, remove protection to turn into training type hook

* Remove import

* Add missing mock, ensure custom plugin does not create children process

* Skip test on windows

* Update deepspeed to init connection in setup

* Do not initialize distributed module

* Move DeepSpeed tests to special tests since dist communication is being set up

* Special the test to see if this fixes CI

* Delete accelerator connector test to see if its causing build to fail

* Delete deepspeed test

* Revert "Delete accelerator connector test to see if its causing build to fail"

This reverts commit edde60b8

* Revert "Delete deepspeed test"

This reverts commit 9d317429

* Reverse hook

* Reverse setup hooks to debug again

* Add todo so I know where I left off

* For single device move in pre_dispatch after setup function

* Add additional model to device hook if any additional parameters have been set

* See if we can enable deepspeed tests

* Revert "See if we can enable deepspeed tests"

This reverts commit b5450def

* See if this hook approach works

* Introduce new granular hooks

* Remove import, fix tpu spawn by moving the function to setup

* Added missing special test

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix all_gather for tpu_cores=8 (#6587)

* Update Gradient Clipping for TPU Accelerator (#6576)

* NGC container PoC (#6187)

* add NVIDIA flows

* push

* pull

* ...

* extras

* ci prune

* fix

* tag

* .

* list

* Automatically set sync_batchnorm for training_type_plugin (#6536)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
Co-authored-by: Kaushik Bokka <kaushikbokka@gmail.com>

* Prune metrics: other classification 7/n (#6584)

* confusion_matrix

* iou

* f_beta

* hamming_distance

* stat_scores

* tests

* flake8

* chlog

* fixing examples (#6600)

* try Azure

* -e

* path

* Add AMP for validation, prediction and testing (#6565)

* Add Tests for val and test-steps

* Add native AMP

* pep8 tests

* pep8 plugin

* changelog

* Add trainer.predict config validation (#6543)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Add DDP Spawn being default for Multi GPUs (#6292)

* Move profiler tests (#6619)

* drop mypy from .pre-commit-config.yaml (#6542)

* Clean utilities/argparse and add missing tests (#6607)

* Allow training type plugin to delay optimizer creation (FSDP 2/n) (#6331)

* Allow training_type_plugin to delay optimizer configure

* Add missing references to trainer, add a CPU accelerator based test

* Add teardown method to BaseProfiler. (#6370)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>

* refactoring setup (#6590)

* refactoring setup

* .

* docs

* flake8

* hotfix: mock examples (#6632)

* mock examples

* drop from GA

* [refactor] Add setup to profilers + _run_stage_setup to trainer 2/5 (#6633)

* add setup

* update

* updates on comment

* Minor changes

* Extra import

* Docs

Co-authored-by: Carlos Mocholi <carlossmocholi@gmail.com>

* fix comparing versions (#6434)

* fix comparing versions

* chlog

* .

* ...

* datasets

* Prune metrics: regression 8/n (#6636)

* explained_variance

* tests

* mean_absolute_error

* mean_squared_error

* mean_relative_error

* mean_squared_log_error

* chlog

* Prune metrics: regression 9/n (#6637)

* psnr

* r2score

* ssim

* chlog

* Refactor base profilers 3/5 (#6621)

Co-authored-by: tchaton <thomas@grid.ai>

* prune metrics: info retrieval (#6649)

* Flash predict step (#6577)

* add predict_step

* Update predict_loop.py

* Update trainer.py

* Update trainer.py

* resolve bugs

* update

* update

* update

* resolve bug

* resolve some failing tests

* update tests

* update

* resolve tests

* add a test

* remove typo

* add a test for attachment

* update

* changed to on_train_dataloader

* remove __flash_special_attr__

* resolve tests

* update

* update

* update

* update on comments

* Update pytorch_lightning/trainer/data_loading.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* fix back-compatibility for Accel (#6655)

* Refactor PyTorch profiler 4/5 (#6349)

Co-authored-by: thomas chaton <thomas@grid.ai>

* Add PyTorch 1.8 Profiler 5/5 (#6618)

* Refactor profilers

* Update PassThrough

* WIP - This is broken and will change

* Update pytorch_lightning/profiler/pytorch.py

Co-authored-by: thomas chaton <thomas@grid.ai>

* resolve tests

* resolve tests

* find output

* try something

* update

* add support for test and predict

* update

* update

* use getattr

* test

* test

* update

* tests

* update

* update

* update

* update

* update

* remove file

* update

* update

* update

* update

* update

* test

* update

* update

* update tests

* update

* add support for 1.8

* rename records

* add support for 1.8

* update

* resolve flake8

* resolve test

* Refactor basic profilers

* Fixes

* Unused import

* Introduce setup

* Profile on all ranks. Print to stdout on 0

* Introduce dirpath + filename

* CHANGELOG

* Add tests. Address comments

* add `on_run_stage_setup`

* add on_run_stage_setup function

* update

* add test for RegisterRecordFunction

* update lightning flow direction

* move variable to private

* remove trace

* Undo code that should be in 3/4

* Multi-stage multi-rank

* 2/5 changes

* Pass stage in __del__

* Remove TODOs

* Describe on_evaluation_end. Add tests

* Typo

* Address comments

* deepcopy tests

* Advanced teardown

* Fix teardown test

* Fix tests

* Minor change

* Update CHANGELOG.md

* Fix test

* Quick fixes

* Fix 6522

* resolve ddp tests

* resolve tests

* resolve some tests
…
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

AveragePrecision Broken on Master
7 participants