[3.12] getattr_static can result in reference leaks #118013
Comments
I'm not sure I see a reference leak here, but certainly a longer lifetime than before. The object will still be deleted eventually. I do see how this behavior might be unexpected or undesirable, however, particularly for large objects. Pinging @AlexWaygood, as the author of #104267.
Thanks for the ping!
Agreed on all counts. I deliberately made sure to use a bounded cache in #104267, so that the size of the cache (and with it, memory consumption) wouldn't grow indefinitely. And even with a bounded cache, the only reason why I thought this would probably be safe was that classes are generally defined once in the global scope, and persist for the entire program, meaning that a strong reference to a class in a cache won't generally extend the lifetime of that class longer than it would have been otherwise. What problem is this causing for your program exactly, @williamwen42? Is it causing a significant increase in memory consumption? Or is it just that a test somewhere is reporting that a class has more references to it at the end of a program than at the start of the program? I'd prefer to keep the caching if possible, as it significantly speeds up |
Thanks for the quick response! The longer lifetime is the "reference leak" issue that I'm referring to. In my case, the leaked variable may be holding on to additional resources, e.g. GPU memory. So we could imagine a case where a user might do something like:

```python
my_model1 = torch.compile(MyModel())
my_model1(input1)
del my_model1
# e.g. 2GB of GPU memory could still be allocated
my_model2 = torch.compile(MyModel2())
# now my_model2 cannot run due to CUDA OOM
# unclear to user how to release the remaining GPU memory
my_model2(input2)
```

I do see the point that classes defined inside a function are not common. And I don't see this issue happening if the leaking variable and class definition are instead in the global scope, although it does occur if the variable is a class member. But I think it's still a bad user experience for resources to be mysteriously held on to (under an unclear set of circumstances), even though the user intends to release them. In our case, we don't anticipate calling …

Some ideas that were floating in my head:
Global class member example for reference:

```python
import gc
from inspect import getattr_static
import sys
import weakref

class C1:
    pass

obj = C1()
weakref.finalize(obj, lambda: print("obj deleted!"))

class C2:
    member = obj

    def __init__(self):
        pass

c2 = C2()
getattr_static(c2, "member", None)
print("done main!")

del obj
del c2
del C2
gc.collect()
print("done!")
```

Output:
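To make the mechanism concrete, here is a standalone sketch (a hypothetical `cached` function, not inspect's actual `_shadowed_dict_from_mro_tuple`) showing that an `lru_cache` entry keeps a strong reference to the class passed to it:

```python
import functools
import gc
import weakref

# Hypothetical cache illustrating the mechanism: an lru_cache keeps
# strong references to its argument tuples (and to its results).
@functools.lru_cache()
def cached(mro_tuple):
    return mro_tuple

class C:
    pass

class_ref = weakref.ref(C)
cached((C, object))  # the cache now holds a strong reference to C
del C
gc.collect()
print(class_ref() is None)  # False: the cache entry keeps C alive

cached.cache_clear()
gc.collect()
print(class_ref() is None)  # True: C is collected once the cache is cleared
```

Anything reachable from the cached class (such as a `member` class attribute) stays alive for exactly as long as the cache entry does.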
Just so I'm clear here: in this example, … Is that correct? It surprises me that retaining a strong reference to a class could lead to a 2GB memory leak. But if that's the case, then I'm definitely happy to look at alternatives to the current caching strategy (the weakref option is probably best here, I think).
Actually … Technically, the 2GB+ leak from a single variable can happen since the single reference could be referring to a large tensor/nd-array or even an entire model. I commented out the `getattr_static` line.
But how can it? The cache only stores a reference to the class, not the instance:

```python
>>> import inspect, weakref
>>> class Foo: ...
...
>>> f = Foo()
>>> weakref_to_class = weakref.ref(Foo)
>>> weakref_to_instance = weakref.ref(f)
>>> inspect.getattr_static(f, 'foo', 'bar')
'bar'
>>> del f
>>> weakref_to_instance()
>>> del Foo
>>> weakref_to_class()
<class '__main__.Foo'>
```

(Well, strictly speaking it stores a reference to the mro tuple of the class of the object that `getattr_static` was called on.)
Ah, I'm not being precise enough here. Yes, calling …
Okay, got it — thanks! I'll work on a fix (unless someone else beats me to it :-)
I'm assigning this to myself so I don't forget to look at it, but that doesn't mean somebody else shouldn't take a look if they fancy it.
Hmm, well, I experimented with using weakrefs for the cache, i.e. a patch like this:

```diff
--- a/Lib/inspect.py
+++ b/Lib/inspect.py
@@ -157,6 +157,7 @@
 import types
 import functools
 import builtins
+import weakref
 from keyword import iskeyword
 from operator import attrgetter
 from collections import namedtuple, OrderedDict
@@ -1815,6 +1816,7 @@ def trace(context=1):
 _sentinel = object()
 _static_getmro = type.__dict__['__mro__'].__get__
 _get_dunder_dict_of_class = type.__dict__["__dict__"].__get__
+_SHADOWED_DICT_CACHE = {}


 def _check_instance(obj, attr):
@@ -1832,7 +1834,7 @@ def _check_class(klass, attr):
             return entry.__dict__[attr]
     return _sentinel

-@functools.lru_cache()
+
 def _shadowed_dict_from_mro_tuple(mro):
     for entry in mro:
         dunder_dict = _get_dunder_dict_of_class(entry)
@@ -1844,8 +1846,21 @@ def _shadowed_dict_from_mro_tuple(mro):
                 return class_dict
     return _sentinel

+
 def _shadowed_dict(klass):
-    return _shadowed_dict_from_mro_tuple(_static_getmro(klass))
+    mro = _static_getmro(klass)
+
+    try:
+        weakref_mro = tuple(map(weakref.ref, mro))
+    except TypeError:
+        # is it possible to have a class that can't be weakref'd?
+        # not sure -- but better to be safe than sorry
+        return _shadowed_dict_from_mro_tuple(mro)
+
+    if weakref_mro not in _SHADOWED_DICT_CACHE:
+        _SHADOWED_DICT_CACHE[weakref_mro] = _shadowed_dict_from_mro_tuple(mro)
+    return _SHADOWED_DICT_CACHE[weakref_mro]
+

 def getattr_static(obj, attr, default=_sentinel):
     """Retrieve attributes without triggering dynamic lookup via the
diff --git a/Lib/test/libregrtest/utils.py b/Lib/test/libregrtest/utils.py
index 791f996127..e0bcd18100 100644
--- a/Lib/test/libregrtest/utils.py
+++ b/Lib/test/libregrtest/utils.py
@@ -275,7 +275,6 @@ def clear_caches():
     except KeyError:
         pass
     else:
-        inspect._shadowed_dict_from_mro_tuple.cache_clear()
         inspect._filesbymodname.clear()
         inspect.modulesbyfile.clear()
```

Unfortunately, it seems like the cost of constructing the weakrefs on every call wipes out any performance gain from having the cache (the function actually becomes slower than it would be if we simply reverted 1b19bd1). @carljm, I don't suppose you see any way of avoiding doing a plain revert, do you? It leads to a pretty big slowdown :(
A few ideas, but all have major problems: …

Listing these in case any of them prompt any other brilliant ideas, but I don't think any of these as described are a workable option.
After a bit of experimentation, I've managed to come up with #118202, which is still a fair bit slower than …
Thanks @williamwen42 for reporting the issue and for patiently explaining it to me! This should be fixed on …
Bug report

Bug description:

I discovered this bug while working on pytorch/pytorch#124302.

It looks like calling `getattr_static` is causing an object to be reference leaked. Output:

If I comment out the `getattr_static` line, the output is as expected.

It looks like this PR #104267 indirectly cached calls to `getattr_static`, which is resulting in reference leaks. Perhaps this cache needs to use weak references?

Original PyTorch code for reference (`torch.compile` calls `getattr_static` on `mod` at some point):

CPython versions tested on: 3.12

Operating systems tested on: Linux

Linked PRs

- `inspect._shadowed_dict` #118202
- `inspect._shadowed_dict` (GH-118202) #118232