Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential issue with *.pyc files #589

Closed
AdrianSosic opened this issue Oct 8, 2024 · 12 comments
Closed

Potential issue with *.pyc files #589

AdrianSosic opened this issue Oct 8, 2024 · 12 comments

Comments

@AdrianSosic
Copy link
Contributor

  • cattrs version: 24.1.0
  • Python version: 3.10
  • Operating System: macOS

Description

Hi, I'm not yet 100% certain, but I might have discovered a cattrs bug related to the *.pyc created when executing code. Not sure if this is anyhow related to the caching/compilation mechanism of Converter, and I also haven't been able to create a minimum example yet because my setting is quite a bit involved. But I thought perhaps we can already shed some light on it when I report what I see.

What I Did

In the past, we've already had several CI fails caused by cattrs showing some rather surprising/inconsistent behavior that we were not able to properly reproduce. The two most common types of errors were of the following kind:

  • Objects are no longer consistent after roundtrip serialization:
    image
  • The hooks suddenly fail because of inconsistent settings:
    image

The interesting part here is not the errors themselves but rather that they arise more or less sporadically, and mostly in CI and not locally. For instance, as the first screenshot shows, the error happened to occur on Python 3.10 but not on 3.12 in this particular case. And I haven't figured out a clear pattern for this. My only conjecture is that it might have to do with cached/compiled files that are sometimes present and sometimes not.

Now the good news, which seem to confirm this conjecture: I was just able to consistently reproduce the second error on my local machine. The interesting piece: as soon as I delete my *.pyc files, the code runs without problems once. After the execution, the *.pyc files are recreated and the test fails again. Same behavior of course if I use PYTHONDONTWRITEBYTECODE. So this seems to indicate that there is some unexpected interaction between the created byte code and cattrs.

Any idea what could be the root cause or how we could nail down the problem furhter?

@Tinche
Copy link
Member

Tinche commented Oct 8, 2024

Huh, interesting. I don't think code generation that cattrs does goes into a pyc file, my impression is that it's purely in memory. Would be amazing if you could put together a simple repro case so we can get to the bottom of the issue.

attrs and dataclasses do similar codegeneration at runtime (as do other libs like jinja2), and I've never had this issue myself. Super curious.

@Tinche
Copy link
Member

Tinche commented Oct 8, 2024

You say you get problems mostly in CI, but folks don't commonly commit pyc files, right? So CI runs should be "clean"?

@AdrianSosic
Copy link
Contributor Author

You say you get problems mostly in CI, but folks don't commonly commit pyc files, right? So CI runs should be "clean"?

I know, I'm also surprised. We definitely have no pyc files in our git repo, but I also have to say I don't fully understand how exactly the Github runners do their thing. That is, no clue what parts of environments / containers are reused when and when not. (Because of these issues, we've also deactivated environment caching a while ago which first seemed to have solved the problem but then they came back 🙈 )

Will try and see if I can somehow get a reproducing minimal example

@AdrianSosic
Copy link
Contributor Author

Hey @Tinche, the weirdness doesn't stop. I was trying to create a minimal example for you by deleting all unnecessary parts from our codebase (had to go top-down because couldn't get it running bottom-up). And now I'm at the weirdest Python behavior I've ever observed, which blocks me from reducing the example any further:

I have deleted ~99% from our code base and have a reproducible setting that creates the problem – a colleague of mine (@Scienfitz) confirmed on his machine. Surprisingly, whatever I do from here onwards, even deleting "dead" code that is nowhere used, changes the effect. Perhaps the problem appears only if the byte code exceeds a certain size 🤔 !? In any case, I simply can't get the example any smaller. However, if you are willing to have a look, you can reproduce the issue on our repo using a few lines of simple instructions. I'd be super curious to hear your thoughts and see if you are also as surprised as we are.

Instructions

Installation

  1. Clone repo and checkout branch
    git clone https://github.com/emdgroup/baybe.git
    cd baybe
    git checkout cattrs_bug
  2. Create a virtual Python 3.10 environment. We've tested both pyenv and mamba for environment creation under 3.10 and 3.12. With 3.10, you get the behavior described below. With 3.12, the tests always fail!
  3. Install the code
    pip install uv
    uv pip install -e ".[dev]"

Error reproduction

  1. Run the test and see that it passes
    pytest
  2. Run it again and be surprised that it fails, also for successive runs
  3. Delete the compiled files
    find . | grep -E "(/__pycache__$|\.pyc$|\.pyo$)" | xargs rm -rf
  4. Run it once again and be even more surprised that it passes again

@Tinche
Copy link
Member

Tinche commented Oct 9, 2024

Interesting. I'll be sure to take a look since I'm very curious myself.

Does the problem go away if pytest isn't used? I know pytest does some black magic rewriting bytecode for the assert statement to work.

@AdrianSosic
Copy link
Contributor Author

As far as I can tell right now, without pytest the behavior is consistent, regardless of whether the bytecode files are present or not

@Tinche
Copy link
Member

Tinche commented Oct 9, 2024

Alright, I think I got a simple reproducer on 3.12 with this:

from baybe.surrogates.base import Surrogate
from baybe.surrogates.random_forest import RandomForestSurrogate


surrogate = RandomForestSurrogate()
string = surrogate.to_json()
Surrogate.from_json(string)

But the issue here is - a ton of fields in Surrogate are init=False. Cattrs will skip these fields by default. That's an easy fix - just pass _cattrs_include_init_false to make_dict_structure_fn.

This didn't fix the problem. The true fix is to change find_subclass to:

from gc import collect

def find_subclass(base: type, name_or_abbr: str, /):
    """Retrieve a specific subclass of a base class via its name or abbreviation."""
    collect()
    try:
        return next(cl for cl in get_subclasses(base) if refers_to(cl, name_or_abbr))
    except StopIteration:
        raise UnidentifiedSubclassError(
            f"The class name or abbreviation '{name_or_abbr}' does not refer to any "
            f"of the subclasses of '{base.__name__}'."
        )

When you use @define, attrs will add slots to your class. Slots can't be added to classes that already exist, so attrs will actually create a whole new class as a clone of the original, see https://www.attrs.org/en/stable/glossary.html#term-slotted-classes for more details.

We try very hard to make this work transparently, but sometimes it doesn't. One place is cls.__subclasses__ - unfortunately, the old class remains in this list until a garbage collection run. cattrs was generating a structuring function for the old class, not the new one, which was causing the errors.

This is also why the problem was sporadic - if a gc run happens before, the problem solves itself.

@AdrianSosic
Copy link
Contributor Author

Oh wow, that was REALLY fast 😳 That also perfectly explains the behavior I was just about to report to you:

  • When I just run the code from the test body without pytest (i.e. in a normal script), I get the error.
  • When I run it with the debugger, I also get an error.
  • However, when I place a breakpoint somewhere and then just continue without doing anything else, things work!

I think you might just have saved us days of work, honestly!!! Let my have a proper look if I can confirm the behavior. Will let you know!

@AdrianSosic
Copy link
Contributor Author

Hey @Tinche, so it indeed seems to solve all problems when running stuff locally. I've also started the first tests in CI but I think in the end only time will show if everything's fixed. However, so far looking really good 🤩 Hence, let us consider the core of the problem solved for now 👏🏼 Many thanks again, your help is much appreciated!!

One other related thing: while the gc.collect inside the function works, it makes our CI runs roughly 3x longer =/ The alternative where I do the garbage collection right after my class definitions (to avoid repeated calls at runtime) is much faster, but comes with the immense drawback that I've to add more than 40 such calls to our codebase + there is always the risk that we'll forget it in new files. Do you have any suggestion how to handle it more elegantly? Also, was there ever a discussion about triggering the garbage collection from within attrs itself?

@Tinche
Copy link
Member

Tinche commented Oct 9, 2024

There was a little bit of discussion here: python-attrs/attrs#407

Feel free to open a new issue there and maybe we can figure out a better way. I wonder what's actually keeping the original class alive, and can we just break those references.

AdrianSosic added a commit to emdgroup/baybe that referenced this issue Oct 9, 2024
@Tinche
Copy link
Member

Tinche commented Oct 9, 2024

There are probably some other solutions, like if you can figure out a way to call gc.collect only once after all models are imported, or we can look into making get_subclasses smarter (if all your classes are slotted, you could probably filter out the non-slotted ones here).

@AdrianSosic
Copy link
Contributor Author

I wonder what's actually keeping the original class alive, and can we just break those references.

Very good question, indeed. I can only say that this is not the first time I stumbled over it but I understand it's a very complicated matter.

There are probably some other solutions, like if you can figure out a way to call gc.collect only once after all models are imported, or we can look into making get_subclasses smarter (if all your classes are slotted, you could probably filter out the non-slotted ones here).

Unfortunately not possible. Since we are building a package and not an application, there is no obvious "entry point" for the user and thus it's not clear what parts of the code will be loaded and when.

But at least we have a workaround for now, which already solves 99% of my problems 🙃 So let me thank you again and close the issue for now 👍🏼

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants