Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DO NOT MERGE - WIP: Add ability to serialize/deserialze astroid.Module (and friends) #1194

Closed
wants to merge 24 commits into from

Conversation

thejcannon
Copy link
Contributor

@thejcannon thejcannon commented Oct 1, 2021

Steps

  • For new features or bug fixes, add a ChangeLog entry describing what your PR does.
  • Write a good description on what the PR does.

Description

This PR adds the ability to serialize and deserialize astroid.Module instances (and by association, every other kind of instance it may reference) to-and-from JSON.

Why persistence?

Persisting processed modules is useful primarily when parsing a large number of files, or files with a lot of interdependencies. So long as desrializing form the cache is quicker than parsing the files, it's a net win.

(I'll do some profiling with pylint on our codebase of about 1k files when this is more polished)

Why JSON, why not YAML, XML, or pickle?

JSON was chosen as it is compact (as opposed to yml or xml or similar) and we can choose what and how to store information. Although pickling might be simpler in terms of complexity, it is inherently insecure as a bad actor only needs to infect the astroid pickled data on disk to infect the next parse. Using JSON, since we control the data, the actor would have to infect the cache, and the Python environment the next parse runs in.

And since we aren't inventing anything new here, it should also be stated mypy's cache is also JSON 😉

Changes

The bulk of the relevant changes have been localized to the _persistence module, which has been specially implemented in order to reduce the amount of changes to all of the relevant astroid classes. Additional changes have been made in order to either serialize/deserialize in a special way or to enable the default way.

A future change will have the changes to the module cache to leverage this new feature.

Testing

TODO

Type of Changes

Type
🐛 Bug fix
✨ New feature
🔨 Refactoring
📜 Docs

Related Issue

Related #1145

Copy link
Contributor Author

@thejcannon thejcannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all tests pass right now. Latest errors are from inference not working (and since inference is quite the beasty, I thought I'd publish this PR as-is for initial feedback)


def pytest_addoption(parser):
parser.addoption(
"--test-roundtrip-persistence", type=bool, default=True,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, I think we'll add a tox section to re-run all tests with this enabled. For now it's just enabled so I can test easily.

@thejcannon
Copy link
Contributor Author

I'll use this PR to update progress and dump thoughts as I work through the remaining problems.

The current test is failing AFAICT due to _explicit_inference not persisting after serialization. I assume this should come from the
inference_tip cache? 🤔

Copy link
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for opening this. Sorry I don't have time for an in depth review right now. I agree with the need to improve the caching though. Could you be more specific about the advantages of JSON compared to pickle and about what make the persistance necessary ? Is it for very big code base or low ram machine where the cache would be bigger than ram ? Also could you add typing on newly added functions, please ?

astroid/nodes/scoped_nodes.py Show resolved Hide resolved
@@ -3045,3 +3103,12 @@ def _get_assign_nodes(self):
child_node._get_assign_nodes() for child_node in self.body
)
return list(itertools.chain.from_iterable(children_assign_nodes))

def __dump__(self, dumper):
data = super().__dump__(dumper)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a cultural reference to dumb and dumber ;) ?

Copy link
Contributor Author

@thejcannon thejcannon Oct 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha no, but we can pretend it is.
I originally was passing the refmap around but thought that was kind of dirty, so wanted to pass a function that did the dumping and remembered the remap, so "dumper" was the first word that came to mind 😂

@thejcannon thejcannon changed the title Persistence DO NOT MERGE - WIP: Persistence Oct 3, 2021
@cdce8p cdce8p marked this pull request as draft October 3, 2021 12:14
@thejcannon thejcannon changed the title DO NOT MERGE - WIP: Persistence DO NOT MERGE - WIP: Add ability to serialize/deserialze astroid.Module (and friends) Oct 3, 2021
@thejcannon
Copy link
Contributor Author

@Pierre-Sassoulas I got tests to pass. I think the next step is I'll break out this monolothic change into smaller ones.
Most of the PRs will be cleanup/refactoring, then we can make this one the "persistence" PR

@thejcannon
Copy link
Contributor Author

thejcannon commented Oct 20, 2021

#1215
#1216

Copy link
Member

@cdce8p cdce8p left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced we'll see any noticeable improvement with this approach. As far as I'm aware most time is spend during the actual inference. Parsing the ast tree by contrast is fairly quick. So much so that I might bet we won't see measurable changes when constructing it from cache instead.

A more useful approach IMO would be to try and cache the inference results for each file although it still needs to be investigated to what extend that is possible.

@thejcannon thejcannon closed this Nov 4, 2021
@thejcannon thejcannon deleted the persistence branch November 4, 2021 21:37
@thejcannon thejcannon restored the persistence branch November 4, 2021 21:37
@thejcannon
Copy link
Contributor Author

I agree the AST parsing caching itself wouldn't be sufficient (hence why there isn't any code that actually persists the AST to disk yet). I assumed this would be the first step that would eventually lead to inference caching (since the inference results would need to be persistable).

Closing, though as I won't have much free time anymore to walk this along. It can be used as a starting point for that future effort though 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants