Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.12] getattr_static can result in reference leaks #118013

Closed
williamwen42 opened this issue Apr 17, 2024 · 13 comments
Closed

[3.12] getattr_static can result in reference leaks #118013

williamwen42 opened this issue Apr 17, 2024 · 13 comments
Assignees
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@williamwen42
Copy link
Contributor

williamwen42 commented Apr 17, 2024

Bug report

Bug description:

I discovered this bug while working on pytorch/pytorch#124302.

It looks like calling getattr_static is causing an object to be reference leaked:

import gc
from inspect import getattr_static
import weakref

class C1:
    pass

def main():
    obj = C1()
    weakref.finalize(obj, lambda: print("obj deleted!"))
    class C2:
        def __init__(self):
            self.obj = obj
    c2 = C2()
    getattr_static(c2, "bad", None)
    print("done main!")

main()
gc.collect()
print("done!")

Output:

done main!
done!
obj deleted!

If I comment out the getattr_static line, the output is as expected:

done main!
obj deleted!
done!

It looks like this PR #104267 indirectly cached calls to getattr_static, which is resulting in reference leaks. Perhaps this cache needs to use weak references?

Original PyTorch code for reference (torch.compile calls getattr_static on mod at some point):

import torch

import gc
import sys
import weakref

from inspect import getattr_static

def dbg(o):
    refs = gc.get_referrers(o)
    print(len(refs), sys.getrefcount(o))
    return refs

gm_list = []
def backend(gm, _):
    gm_list.append(weakref.ref(gm, lambda _: print("gm deleted")))
    # breakpoint()
    return gm

def main():
    param = torch.nn.Parameter(torch.randn(5, 5))

    class Mod(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.param = param

        def forward(self, x):
            return self.param * x

    mod = Mod()
    ref = weakref.ref(param, lambda _: print("obj deleted"))
    opt_mod = torch.compile(mod, backend=backend)
    print(opt_mod(torch.randn(5, 5)))
    return ref

ref = main()

gc.collect()
print("done!")
print(ref)

CPython versions tested on:

3.12

Operating systems tested on:

Linux

Linked PRs

@williamwen42 williamwen42 added the type-bug An unexpected behavior, bug, or error label Apr 17, 2024
@mdboom
Copy link
Contributor

mdboom commented Apr 18, 2024

I'm not sure I see a reference leak here, but certainly a longer lifetime than before. The object will still be deleted eventually. I do see how this behavior might be unexpected or undesirable, however, particularly for large objects.

Pinging @AlexWaygood, as the author of #104267.

@AlexWaygood
Copy link
Member

Pinging @AlexWaygood, as the author of #104267.

Thanks for the ping!

I'm not sure I see a reference leak here, but certainly a longer lifetime than before. The object will still be deleted eventually. I do see how this behavior might be unexpected or undesirable, however, particularly for large objects.

Agreed on all counts. I deliberately made sure to use a bounded cache in #104267, so that the size of the cache (and with it, memory consumption) wouldn't grow indefinitely. And even with a bounded cache, the only reason why I thought this would probably be safe was that classes are generally defined once in the global scope, and persist for the entire program, meaning that a strong reference to a class in a cache won't generally extend the lifetime of that class longer than it would have been otherwise.

What problem is this causing for your program exactly, @williamwen42? Is it causing a significant increase in memory consumption? Or is it just that a test somewhere is reporting that a class has more references to it at the end of a program than at the start of the program? I'd prefer to keep the caching if possible, as it significantly speeds up getattr_static() and users of getattr_static().

@AlexWaygood AlexWaygood added stdlib Python modules in the Lib dir 3.12 bugs and security fixes 3.13 bugs and security fixes labels Apr 18, 2024
@williamwen42
Copy link
Contributor Author

williamwen42 commented Apr 18, 2024

Thanks for the quick response!

The longer lifetime is the "reference leak" issue that I'm referring to. In my case, the leaked variable may be holding on to additional resources, e.g. GPU memory. So we could imagine a case where a user might do something like:

my_model1 = torch.compile(MyModel())
my_model1(input1)
del my_model1
# e.g. 2GB of GPU memory could still be allocated

my_model2 = torch.compile(MyModel2())
# now my_model2 cannot run due to CUDA OOM
# unclear to user how to release the remaining GPU memory
my_model2(input2)

I do see the point where classes defined in a function are not common. And I don't see this issue happening if the leaking variable and class definition are instead in the global scope, although it does occur if the variable is a class member. But I think it's still bad user experience for resources to be mysteriously held on (under an unclear set of circumstances), even though the user intends to release it.

In our case, we don't anticipate calling getattr_static on the same object many times and even when we do, we're calling it in the compilation phase of our program, so performance is not as critical.

Some ideas that were floating in my head:

  • use weakrefs (but not all objects are weakref'able)
  • use a different caching algorithm to cache hot objects (but this doesn't completely solve the problem)
  • don't cache on the corner cases (but which cases are these and how do we determine if an object satisfies a case?)
  • provide an uncached version of getattr_static

Global class member example for reference:

import gc
from inspect import getattr_static
import sys
import weakref

class C1:
    pass

obj = C1()
weakref.finalize(obj, lambda: print("obj deleted!"))
class C2:
    member = obj
    def __init__(self):
        pass
c2 = C2()
getattr_static(c2, "member", None)
print("done main!")

del obj
del c2
del C2

gc.collect()
print("done!")

Output:

done main!
done!
obj deleted!

@AlexWaygood
Copy link
Member

AlexWaygood commented Apr 19, 2024

my_model1 = torch.compile(MyModel())
my_model1(input1)
del my_model1
# e.g. 2GB of GPU memory could still be allocated

my_model2 = torch.compile(MyModel2())
# now my_model2 cannot run due to CUDA OOM
# unclear to user how to release the remaining GPU memory
my_model2(input2)

Just so I'm clear here: in this example, torch.compile() is creating a class somewhere, and then later in the torch.compile() call something somewhere is calling getattr_static() on that class. The getattr_static() call then means that a strong reference to the dynamically created class is retained in the cache somewhere, which means that 2GB of memory (that would otherwise be deallocated after the del my_model1 call) is not freed.

Is that correct? It surprises me that retaining a strong reference to a class could lead to a 2GB memory leak. But if that's the case, then I'm definitely happy to look at alternatives to the current caching strategy (the weakref option is probably best here, I think)

@williamwen42
Copy link
Contributor Author

Actually torch.compile calls getattr_static() on the user's object, which causes a strong ref to be held to the user's object. So if a user tries to torch.compile an object whose class is dynamically created or an object with a class member, there is a leak.

Technically, torch.compile does dynamically create a class, although it should not be relevant to this issue since its lifetime is designed to be attached to the original object's lifetime.

The 2GB+ leak from a single variable can happen since the single reference could be referring to a large tensor/nd-array or even an entire model.

I commented out the lru_cache in my local copy of inspect.py and the PyTorch test that was failing then passes, so the cache is indeed the cause of the test failure.

@AlexWaygood
Copy link
Member

AlexWaygood commented Apr 19, 2024

torch.compile calls getattr_static() on the user's object, which causes a strong ref to be held to the user's object.

But how can it? The inspect.py cache only stores a strong reference to the class of the object getattr_static() is called against, not the object itself. The instance f here does not have any strong references to it, so calling the weakref to f returns None after it has been deleted:

>>> import inspect, weakref
>>> class Foo: ...
... 
>>> f = Foo()
>>> weakref_to_class = weakref.ref(Foo)
>>> weakref_to_instance = weakref.ref(f)
>>> inspect.getattr_static(f, 'foo', 'bar')
'bar'
>>> del f
>>> weakref_to_instance()
>>> del Foo
>>> weakref_to_class()
<class '__main__.Foo'>

(Well, strictly speaking it stores a reference to the mro tuple of the class of the object that getattr_static() is being called on)

@williamwen42
Copy link
Contributor Author

Ah, I'm not being precise enough here. Yes, calling getattr_static() causes the class to be cached, but the class can hold on to strong references to other user objects through cell variables, class members, etc.

@AlexWaygood
Copy link
Member

Okay, got it — thanks! I'll work on a fix (unless someone else beats me to it :-)

@AlexWaygood
Copy link
Member

I'm assigning this to myself so I don't forget to look at it, but that doesn't mean somebody else shouldn't take a look if they fancy it

@AlexWaygood AlexWaygood self-assigned this Apr 20, 2024
@AlexWaygood
Copy link
Member

Hmm, well, I experimented with using weakrefs for the cache, i.e. a patch like this:

--- a/Lib/inspect.py
+++ b/Lib/inspect.py
@@ -157,6 +157,7 @@
 import types
 import functools
 import builtins
+import weakref
 from keyword import iskeyword
 from operator import attrgetter
 from collections import namedtuple, OrderedDict
@@ -1815,6 +1816,7 @@ def trace(context=1):
 _sentinel = object()
 _static_getmro = type.__dict__['__mro__'].__get__
 _get_dunder_dict_of_class = type.__dict__["__dict__"].__get__
+_SHADOWED_DICT_CACHE = {}
 
 
 def _check_instance(obj, attr):
@@ -1832,7 +1834,7 @@ def _check_class(klass, attr):
             return entry.__dict__[attr]
     return _sentinel
 
-@functools.lru_cache()
+
 def _shadowed_dict_from_mro_tuple(mro):
     for entry in mro:
         dunder_dict = _get_dunder_dict_of_class(entry)
@@ -1844,8 +1846,21 @@ def _shadowed_dict_from_mro_tuple(mro):
                 return class_dict
     return _sentinel
 
+
 def _shadowed_dict(klass):
-    return _shadowed_dict_from_mro_tuple(_static_getmro(klass))
+    mro = _static_getmro(klass)
+
+    try:
+        weakref_mro = tuple(map(weakref.ref, mro))
+    except TypeError:
+        # is it possible to have a class that can't be weakref'd?
+        # not sure -- but better to be safe than sorry
+        return _shadowed_dict_from_mro_tuple(mro)
+
+    if weakref_mro not in _SHADOWED_DICT_CACHE:
+        _SHADOWED_DICT_CACHE[weakref_mro] = _shadowed_dict_from_mro_tuple(mro)
+    return _SHADOWED_DICT_CACHE[weakref_mro]
+
 
 def getattr_static(obj, attr, default=_sentinel):
     """Retrieve attributes without triggering dynamic lookup via the
diff --git a/Lib/test/libregrtest/utils.py b/Lib/test/libregrtest/utils.py
index 791f996127..e0bcd18100 100644
--- a/Lib/test/libregrtest/utils.py
+++ b/Lib/test/libregrtest/utils.py
@@ -275,7 +275,6 @@ def clear_caches():
     except KeyError:
         pass
     else:
-        inspect._shadowed_dict_from_mro_tuple.cache_clear()
         inspect._filesbymodname.clear()
         inspect.modulesbyfile.clear()

Unfortunately, it seems like the cost of constructing the weakrefs on every call wipes out any performance gain from having the cache (the function actually becomes slower than it would be if we simply reverted 1b19bd1).

@carljm, I don't suppose you see any way of avoiding doing a plain revert, do you? It leads to a pretty big slowdown :(

@carljm
Copy link
Member

carljm commented Apr 22, 2024

A few ideas, but all have major problems:

  1. Could do the caching manually, keyed only by hash of the MRO tuple instead of by using lru_cache; this should avoid keeping all the classes alive. But it does risk hash collisions and wrong results, so probably not an option.

  2. Could cache the results directly on the class using a dunder name. But this has problems with slotified classes and classes with a metaclass that overrides __getattribute__ -- not really suited for getattr_static.

  3. Could try to slim down the contents of the cache key by pre-processing the MRO into a more limited form that includes only what we need (the __dict__ of each class, not the full class). I think this could solve the OP leak case, but it could still leak in other cases (heavy class attributes.) Also it probably gives up a lot of the benefit of caching, because it means we still have to walk the MRO uncached.

Listing these in case any of them prompt any other brilliant ideas, but I don't think any of these as described are a workable option.

@AlexWaygood
Copy link
Member

After a bit of experimentation, I've managed to come up with #118202, which is still a fair bit slower than main, but seems significantly faster than a simple revert of #104267. If anybody has any ideas for how to micro-optimise that PR further, please feel free to comment on the PR. I'll try to add news/tests/benchmarks tomorrow.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Apr 24, 2024
…_dict` (pythonGH-118202)

(cherry picked from commit 8227883)

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
AlexWaygood added a commit that referenced this issue Apr 24, 2024
…d_dict` (GH-118202) (#118232)

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
@AlexWaygood
Copy link
Member

AlexWaygood commented Apr 24, 2024

Thanks @williamwen42 for reporting the issue and for patiently explaining it to me! This should be fixed on main now, and the fix has been backported to 3.12, so should be available in the next patch release for Python 3.12.

williamwen42 added a commit to pytorch/pytorch that referenced this issue Apr 26, 2024
…_static"


For tracking #124302 so that we can re-enable the test once 3.12 updates with the bug fix for python/cpython#118013.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
williamwen42 added a commit to pytorch/pytorch that referenced this issue Apr 29, 2024
…to buggy getattr_static"


For tracking #124302 so that we can re-enable the test once 3.12 updates with the bug fix for python/cpython#118013.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
williamwen42 added a commit to pytorch/pytorch that referenced this issue Apr 29, 2024
…_static"


For tracking #124302 so that we can re-enable the test once 3.12 updates with the bug fix for python/cpython#118013.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
williamwen42 added a commit to pytorch/pytorch that referenced this issue Apr 30, 2024
…to buggy getattr_static"


For tracking #124302 so that we can re-enable the test once 3.12 updates with the bug fix for python/cpython#118013.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
williamwen42 added a commit to pytorch/pytorch that referenced this issue Apr 30, 2024
…_static"


For tracking #124302 so that we can re-enable the test once 3.12 updates with the bug fix for python/cpython#118013.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang

[ghstack-poisoned]
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Apr 30, 2024
…25062)

For tracking #124302 so that we can re-enable the test once 3.12 updates with the bug fix for python/cpython#118013.

Pull Request resolved: #125062
Approved by: https://github.com/anijain2305, https://github.com/jansel
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this issue May 1, 2024
Summary:
For tracking pytorch/pytorch#124302 so that we can re-enable the test once 3.12 updates with the bug fix for python/cpython#118013.

X-link: pytorch/pytorch#125062
Approved by: https://github.com/anijain2305, https://github.com/jansel

Reviewed By: kit1980

Differential Revision: D56801860

Pulled By: williamwen42

fbshipit-source-id: 16b38ce7f1b8cabe2fcdd660fc8a2fabdda04aca
pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this issue May 3, 2024
…25062)

For tracking #124302 so that we can re-enable the test once 3.12 updates with the bug fix for python/cpython#118013.

Pull Request resolved: #125062
Approved by: https://github.com/anijain2305, https://github.com/jansel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

4 participants